Error Approximation and Bias Correction in Dynamic Problems using a Recurrent Neural Network/Finite Element Hybrid Model

This work proposes a hybrid modeling framework based on recurrent neural networks (RNNs) and the finite element (FE) method to approximate model discrepancies in time-dependent, multi-fidelity problems, and to use the trained hybrid models to perform bias correction of the low-fidelity models. The hybrid model uses FE basis functions as a spatial basis and RNNs for the approximation of the time dependencies of the FE basis' degrees of freedom. The training data sets consist of sparse, non-uniformly sampled snapshots of the discrepancy function, pre-computed from trajectory data of low- and high-fidelity dynamic FE models. To account for data sparsity and prevent overfitting, data upsampling and local weighting factors are employed, instigating a trade-off between physically conforming model behavior and neural network regression. The proposed hybrid modeling methodology is showcased in three highly non-trivial engineering test cases, all featuring transient FE models, namely, heat diffusion out of a heat sink, eddy currents in a quadrupole magnet, and sound wave propagation in a cavity. The results show that the proposed hybrid model is capable of approximating model discrepancies to a high degree of accuracy and of correcting the low-fidelity models accordingly.


Introduction
Over the last decade, machine learning (ML) has established itself as one of the prime research subjects in the contemporary sciences, finding applications in biology [1], engineering [28,44,62], physics [9,27], and medicine [67], to name only a few areas. Even though this development can to some extent be attributed to hype [59], there are domains in which ML methods have outperformed state-of-the-art techniques. Explicit examples include the development of convolutional neural networks (CNNs) and graph neural networks for image classification [11,42,66], large language models for natural language processing [47], and deep reinforcement learning for applications in autonomous driving [56]. However, ML models are subject to the curse of dimensionality, and training often involves solving nonlinear optimization problems, which causes training times to be a significant concern. Nevertheless, with processing units such as CPUs and GPUs becoming more performant each year [7], the impact of ML will most likely increase even further.
On the basis of these recent successes, ML quickly penetrated traditional scientific computing methods [30]. Recent advances in this area lie in the emergence of physics-informed neural networks (PINNs) [8,22,29], hybrid modeling [34,41], but also in the field of data-driven computing [18,33]. PINNs incorporate physics-based loss functions to train neural networks (NNs) to exhibit physically conforming behavior, hybrid modeling approaches seek to combine "the best of both worlds", and data-driven computing methods extend traditional numerical approximation methods for partial differential equations (PDEs) to directly incorporate measurement data. Worthy of note, however, is that ML does not always improve upon established methods and its success or lack thereof significantly depends on the problem setting to which it is applied [43]. For instance, deep learning inherently requires nonlinear optimization and is at a natural disadvantage when competing with traditional solvers in low dimensional, linear problem settings, consequently requiring nonlinear and/or high-dimensional problem settings to achieve a computational benefit [16].
In the authors' opinion, data-based bias correction and model validation, a framework discussed in detail in the recent work of Levine and Stuart [37], provides exactly such a setting. In particular, let us consider the case of multifidelity modeling, where computationally inexpensive, low-fidelity models are employed to approximate time- and resource-demanding high-fidelity models, resulting in a model hierarchy according to accuracy and computational cost [1,48,50]. Bias correction and model validation occur naturally in multifidelity modeling. In bias correction, a corrective term is computed, which accounts for the discrepancy between models of different fidelity [50]. In model validation, the accuracy of the low-fidelity model is validated against a reference [46,54,55]. In both cases, approximating the systematic error, also referred to as model bias or discrepancy, is necessary. Primary approaches for learning a discrepancy function include Bayesian inference [12,13,38,64] and Gaussian processes [3,32,19,38,49]. In more recent years, bias correction by means of neural networks has also been attempted [36,45,63]. Specifically, Um et al. [61] successfully calculate a correction term for a fluid flow simulation by integrating the solver directly into the training loop of a CNN. Nevertheless, this approach calculates a correction operator and not the discrepancy function itself.
In this work, we approximate the discrepancy functions between low- and high-fidelity finite element (FE) simulations of dynamical systems by employing a recurrent neural network (RNN)/FE-based hybrid model, which is trained on sparse, non-uniformly sampled trajectory data of the high-fidelity solution. The high- and low-fidelity simulations differ by choice of computational mesh and modeling assumptions, consequently inducing both discretization and modeling errors. Correspondingly, the RNN/FE-based hybrid model is constructed by splitting spatial and temporal dependencies. To account for the sparsity of the training data and prevent overfitting, the training data is upsampled artificially using linear interpolation operators and localized Gaussian priors at missing time steps to provide noise stabilization, an approach commonly used in game design and image processing [26,31,51]. Local weighting factors on the training data are also employed to control the interpolation behavior of the model. The hybrid model is applied to three engineering-relevant test cases, namely, heat diffusion out of a heat sink, eddy currents in a quadrupole magnet, and sound wave propagation in a cavity. The results show that the proposed hybrid model and training regimen are capable of approximating model discrepancies to a high degree of accuracy.
The main contribution of this work lies in the application of RNNs for discrepancy function approximation in the setting of multi-fidelity modeling and simulation, in combination with the FE method. Even though surrogate modeling has been used extensively to approximate high-fidelity simulations [14,10,21], using RNNs to learn model errors was first proposed in [37], albeit not in combination with FE-based models and only with limited numerical experiments that do not include highly non-trivial engineering applications such as the ones presented in this work. Additional contributions concern the specific architecture of the hybrid model, as well as the treatment of hybrid model training with sparse data, such that physically correct interpolation behavior is ensured.
The remainder of this paper provides, first, a theoretical overview of data-driven discrepancy approximation for model validation and bias correction (Section 2), followed by a presentation of the suggested hybrid modeling approach (Section 3), which highlights the hybrid model architecture and discusses the treatment of sparse training data. Numerical experiments showcasing the benefits of the proposed hybrid modeling approach are presented in Section 4. Last, concluding remarks and considerations for further methodological developments are given in Section 5.

Dynamical system simulation
The simulation of dynamical systems commonly requires the spatial and temporal discretization of an underlying boundary value problem (BVP), defined on Ω × [0, T], where Ω ⊂ R^d denotes the spatial domain of dimension d ≤ 3 and T ∈ R_≥0 the time range. Spatial discretization refers to the approximation of the system states by a finite dimensional basis representation and temporal discretization refers to state propagation along a discrete time axis. Therein, implicit time stepping schemes are frequently used, due to their advantages in numerical stability. The discrete dynamical system can be described by

x_{t_k} = Ψ(x_{t_{k−1}}),   (1)

where x_{t_k}(r), r ∈ Ω, are the system states at time t = t_k and k is the time step index. In the case where the spatial discretization does not change over time, Ψ operates only on the degrees of freedom (dof) of the states' basis coefficients. Accordingly, we can express (1) in terms of the coefficient vectors x_{t_k} ∈ R^{N_dof}, where N_dof denotes the number of spatial dof. Given an initial state x_{t_0}, we can solve (1) for N_T ∈ N time steps of size ∆t ∈ R. The solution is an approximation of the states' trajectories, the accuracy of which is dictated by the values of N_dof and N_T and the time stepping scheme. In this work, we focus on the FE method for spatial discretization and an implicit Euler scheme for time-stepping.
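As an illustrative sketch (our own minimal example, not the paper's implementation), an implicit Euler propagator Ψ for a linear semi-discrete system M dx/dt + A x = 0 can be written as:

```python
import numpy as np

def implicit_euler(M, A, x0, dt, n_steps):
    """Propagate the semi-discrete system M dx/dt + A x = 0 with the
    implicit Euler scheme: (M + dt*A) x_{k+1} = M x_k."""
    lhs = M + dt * A
    states = [np.asarray(x0, dtype=float)]
    for _ in range(n_steps):
        states.append(np.linalg.solve(lhs, M @ states[-1]))
    return np.stack(states)  # shape: (n_steps + 1, N_dof)
```

For a dissipative system, each step damps the state, mirroring the unconditional stability of the implicit scheme.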

Model fidelity and discrepancy
Real-world scientific and engineering applications governed by dynamical problems are inherently complex, often requiring a large number of dof and time steps to be resolved to sufficient accuracy. Quite often, solution accuracy must be balanced against limitations in computational resources. A predominant factor in this trade-off is model fidelity, that is, the model's capability to precisely represent the physical system it approximates [17]. In the following, high- and low-fidelity models are denoted with Ψ^hifi and Ψ^lofi, respectively, referring to dynamical system models as introduced in Section 2.1. Accordingly, low- and high-fidelity system states at time t_k are respectively denoted with x^lofi_{t_k} and x^hifi_{t_k}. The difference in the results between a low- and a high-fidelity model can be predominantly attributed to a systematic error, which is commonly called model discrepancy or prediction bias. Model discrepancy can be quantified using a so-called discrepancy function δ_t ∶ R^d → R^d, which maps the system state onto its corresponding systematic error [3,40]. Non-systematic errors such as noise can similarly be expressed using an error function ε_t ∶ R^d → R^d, where a common assumption is ε_t ∼ N(0, Σ), Σ ∈ R^{d×d}. Under standard assumptions regarding the nature of the error and bias terms, for example, additivity or multiplicativity, the relationship between Ψ^hifi and Ψ^lofi can be expressed explicitly. For instance, under the additive bias and noise assumption, it holds that

x^hifi_{t_k} = x^lofi_{t_k} + δ_{t_k}(x^lofi_{t_k}) + ε_{t_k},   (2)

for k = 1, ..., N_T. Multiplicative or more complex relations are also possible. For the remainder of this work, we neglect the noise term in (2) and focus solely on the discrepancy function.

Data-driven discrepancy function approximation for model validation and correction
In the field of model validation, one is interested in whether the discrepancy function, δ_t, fulfills certain conditions, as to determine whether the use of Ψ^lofi is justifiable [35]. In this context, a very general validation condition reads

‖δ_t‖_2 ≤ C,   (3)

where C > 0 and ‖⋅‖_2 denotes the L²(Ω) norm. If δ_t fulfills (3), the use of Ψ^lofi is tenable [35]. To that end, the functional form of δ_{t_k}, which essentially quantifies the accuracy of Ψ^lofi with respect to Ψ^hifi, must be inferred. While this could be accomplished by sufficiently sampling the high- and the low-fidelity models, the evaluation of the high-fidelity model must in many cases be avoided, typically due to its prohibitive computational cost. Instead, snapshots X ∶= {x^o_{t_k} ∶ t_k ∈ T^o} of its state trajectory may be available [37], for instance, originating from measurement setups or auxiliary simulations. To that end, the superscript "o" is chosen to denote an observation.
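As a small sketch of how (3) can be checked in practice, assuming discrepancy snapshots are available as FE coefficient vectors together with a mass matrix `M` (both placeholders here):

```python
import numpy as np

def validate_lofi(delta_snapshots, M, C):
    """Check the validation condition ||delta_{t_k}||_{L2(Omega)} <= C
    for every available snapshot. The L2 norm of a FE coefficient vector
    d is evaluated with the mass matrix M as sqrt(d^T M d)."""
    norms = [float(np.sqrt(d @ M @ d)) for d in delta_snapshots]
    return all(n <= C for n in norms), norms
```

If the check fails for some snapshot, the low-fidelity model would be deemed invalid for that tolerance C.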
Then, the validation of the low-fidelity model requires a parametric model δ_θ, which approximates the discrepancy based on the snapshots X. This approximation can be accomplished with supervised learning algorithms. However, using data as a reference substantially increases the complexity of the validation problem, since X might contain sparse, non-uniformly sampled, and noisy data. Possible remedies to this issue can be sought in novel physics-informed ML approaches, which integrate physics-inspired objective functions into the approximation process to induce physics conforming behavior in sparse data regimes [29]. Irrespective of the choice of the parametric model δ_θ and the multi-fidelity models, the minimization problem to be solved in order to determine an optimal parameter set θ* ∈ R^{N_θ}, given the training data set D, reads

θ* = arg min_θ Σ_{t_k ∈ T^o} ‖δ_θ(x^lofi_{t_k}) − δ_{t_k}‖²,   (4)

where N_θ denotes the number of parametric dof. For instance, δ_θ can be chosen to map the low-fidelity states x^lofi_{t_k} ↦ δ_θ(x^lofi_{t_k}) onto the respective systematic error for all time instances t_k ∈ T^o and points in the spatial domain r ∈ Ω [61]. We note, however, that this is one example and that other mappings are also possible.
Assuming a successful solution of the minimization problem (4), there are numerous uses for the function δ_θ*. On the one hand, δ_θ* can be used to approximate the validation condition (3), ultimately verifying the low-fidelity model. On the other hand, it can be used for bias correction, where we define a corrected system state

x^corr_{t_k} ∶= x^lofi_{t_k} + δ_θ*(x^lofi_{t_k}),   (5)

where δ_θ* models the discrepancy between the low-fidelity model and the reference data. The state trajectory resulting from (5) is the bias corrected trajectory of the low-fidelity model. In any case, model validation or correction depend significantly on the accuracy of the parametric model.
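A minimal sketch of the correction step (5); `delta_model` is a placeholder standing in for the trained parametric model δ_θ*:

```python
def bias_corrected_trajectory(x_lofi_traj, delta_model):
    """Bias correction per Eq. (5): x^corr_{t_k} = x^lofi_{t_k}
    + delta_theta*(x^lofi_{t_k}), applied at every time step.
    `delta_model` is a hypothetical callable for the trained model."""
    return [x + delta_model(x) for x in x_lofi_traj]
```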

Modeling and discretization errors in transient finite element models
Multi-fidelity modeling and simulation introduces different types of systematic errors. In the particular case of FE analysis, these errors can be categorized into three different classes, namely, discretization errors, numerical errors, and modeling errors [5,60]. Discretization errors occur when a function of a continuous variable is approximated by a finite dimensional basis representation, and primarily depend on the mesh resolution and the choice of the finite dimensional ansatz space. Reducing the discretization error requires a finer mesh resolution, thus increasing the number of dof. Numerical errors occur due to the finite precision of computational hardware, e.g., in the representation of real numbers as floating point data types and the finite precision of iterative solvers. Truncation errors fall in the same category. Modeling errors occur due to assumptions and simplifications with respect to the problem itself. Common examples include misspecification of boundary conditions, incorrect definition of loading terms, linear approximations of otherwise nonlinear material responses, and 2D approximations of 3D geometries.
For the approximation of the discrepancy function between FE models of varying fidelity, a suitable representation basis must be chosen. Assuming solely the discretization error δ^d, a natural choice would be to use the FE basis functions. In the context of this work, we use the FE basis of the low-fidelity model Ψ^lofi, as we focus on bias correction. Consequently, a low-fidelity representation of the high-fidelity system states must be made available. Thus, it is necessary to define a projection operator T ∶ R^{N^hifi_dof} → R^{N^lofi_dof}, where N^hifi_dof and N^lofi_dof denote the number of dof of the high- and the low-fidelity model, respectively.¹ Assuming Ψ^hifi and Ψ^lofi to be defined on the same time grid, the discretization error at time step t_k is given as

δ^d_{t_k} = T x^hifi_{t_k} − x^lofi_{t_k},   (6)

where x^hifi_{t_k} and x^lofi_{t_k} are the high-fidelity and low-fidelity states at time t_k.
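For nodal (first-order) FE bases, one way to realize the projection T is to interpolate the high-fidelity solution at the low-fidelity mesh nodes; the SciPy-based sketch below is a simplification under that assumption, not the paper's exact operator:

```python
import numpy as np
from scipy.interpolate import griddata

def project_to_lofi(x_hifi, nodes_hifi, nodes_lofi):
    """Nodal realization of the projection T: interpolate the
    high-fidelity coefficient vector (nodal values) at the low-fidelity
    mesh nodes. Assumes first-order nodal bases on both meshes."""
    return griddata(nodes_hifi, x_hifi, nodes_lofi, method='linear')
```

An L² projection via mass matrices would be the more rigorous alternative; nodal interpolation suffices to convey the idea.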
On the other hand, modeling errors δ^m occur in the assumptions that define the BVP, ultimately affecting the definition of the time propagators. This fact is illustrated most clearly by assuming that Ψ^hifi and Ψ^lofi are parametrized by parameter sets λ ∶= (λ_1, ..., λ_l), such that x_{t_k} = Ψ(x_{t_{k−1}}; λ), and defined on the same computational mesh. Consequently, if λ^hifi ≠ λ^lofi, the high-fidelity and the low-fidelity model propagate the same state differently. At each time step, this error can be quantified as

δ^m_{t_k} = Ψ^hifi(x_{t_{k−1}}; λ^hifi) − Ψ^lofi(x_{t_{k−1}}; λ^lofi).   (7)

In most multi-fidelity modeling settings, discretization and modeling errors are present simultaneously. Thus, a combined error term must be considered, which captures both. Considering (6) and (7), the combined error term can be quantified as

δ_{t_k} = T x^hifi_{t_k} − x^lofi_{t_k},   (8)

where the models now differ both in their discretization and in their parameter sets. In the following sections, (8) provides the basis for preprocessing the training data.

Hybrid model elements and architecture
In this section, we discuss the nature of the discrepancy function δ_t and our subsequent choice for the parametric model δ_θ ≈ δ_t. Discrepancy functions often display complex dynamics, exhibiting piecewise smooth and non-smooth behavior in the problem domain. These dynamics occur because discrepancy functions are required to capture harmonic propagating behavior, but also phase transitions, material interfaces, and interpolation errors. This behavior can be attributed to a variety of reasons, arising partially from the underlying physics, but also from the spatial and temporal discretization schemes, e.g., because of mesh inconsistencies or time stepping errors.
Considering these challenges, we split the approximation of time-dependent and spatially-dependent effects, such that an RNN is used to approximate the sequential, temporal dynamics, and the low-fidelity FE basis functions are used to account for the spatial effects. As we seek to calculate a bias correction term for the low-fidelity model, we assume that the discrepancy function has the form

δ_{t_k}(r) ≈ Σ_{i=1}^{N^lofi_dof} δ̃_{i,t_k} φ_i(r),   (9)

where δ̃_{i,t_k} are the time-dependent dof of the finite dimensional basis at time t_k and the i-th spatial dof, φ_i are the basis functions of the low-fidelity model, and r ∈ Ω. We denote with δ̃_{t_k} ∶= (δ̃_{1,t_k}, ..., δ̃_{N^lofi_dof,t_k}) the vector containing the coefficients of (9).
The suggested hybrid modeling approach then consists of using the RNN to learn the time-dependent dof, δ̃_{t_k}, at each time step. Due to their universal approximation property, as well as their use of recurrent connections to account for time-dependencies in the data, RNNs offer sufficient flexibility to approximate the dynamics in the data. In particular, a subcategory of RNNs called long short-term memory (LSTM) cells is used to circumvent the vanishing gradient problem, by sharing information between the individual units and using hidden state variables to control the gating mechanism and shape information flow [25].
In our case, the RNN architecture itself comprises two building blocks, namely a concatenation of LSTM cells and a linear output layer. A visualization is provided in Figure 1. In each training iteration, the RNN considers N_p consecutive time steps simultaneously, where the coefficients of the low-fidelity system states, x_{t_k}, serve as inputs and the outputs approximate the discrepancy function's coefficients, δ̃_{t_k}. We denote the mapping of the RNN by

(x_{t_k}, ..., x_{t_{k+N_p−1}}) ↦ (δ̃^θ_{t_k}, ..., δ̃^θ_{t_{k+N_p−1}}),

where δ̃^θ_{t_k} denotes the component-wise output of the RNN. As numerous time steps are mapped simultaneously, the in- and output dimension of the RNN is N_p ⋅ N^lofi_dof, and sliding this mapping along the trajectory in steps of N_p − 1 yields an approximation of δ̃_{i,t_k} on the whole trajectory. Choosing a similar basis representation as in (9), we can define the RNN/FE hybrid model as

δ_θ(r, t_k) ∶= Σ_{i=1}^{N^lofi_dof} δ̃^θ_{i,t_k} φ_i(r).

As the FE basis functions already provide a basis for the discretized spatial domain, the RNN needs only to handle the variation of the coefficients in time, thus significantly simplifying the optimization problem and reducing training times. In Figure 1, the vectors c_t and h_t, t = t_k, ..., t_{k+N_p−1}, denote the cell state and hidden state of the LSTM cell. The cell state of the LSTM consists of a parameter vector controlling the gating mechanism of the LSTM cell and, ultimately, information flow, while the hidden state of the LSTM consists of the cell output, which is simultaneously fed to the consecutive cell and to the output layer. We choose rectified linear unit (ReLU) activation functions, a necessary choice for the approximation of the piecewise and non-differentiable behavior occurring in phase transitions.
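A sketch of such an architecture in PyTorch; the layer count, hidden size, and placement of the ReLU are illustrative assumptions, not the paper's configuration:

```python
import torch
import torch.nn as nn

class DiscrepancyRNN(nn.Module):
    """Temporal component of the hybrid model: stacked LSTM cells
    followed by a linear output layer, mapping N_p consecutive
    low-fidelity coefficient vectors to discrepancy coefficients."""
    def __init__(self, n_dof, hidden_size=64, num_layers=2):
        super().__init__()
        self.lstm = nn.LSTM(n_dof, hidden_size, num_layers,
                            batch_first=True)
        self.act = nn.ReLU()   # ReLU for piecewise behavior
        self.out = nn.Linear(hidden_size, n_dof)

    def forward(self, x_lofi):           # (batch, N_p, n_dof)
        h, _ = self.lstm(x_lofi)
        return self.out(self.act(h))     # (batch, N_p, n_dof)
```

The spatial FE basis is applied outside the network, so the output per time step is simply the coefficient vector δ̃^θ_{t_k}.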

Normalization by non-dimensionalization
A significant drawback of NNs is their inability to adequately represent data with small values, a scenario which comes up when considering error functions. Thus, normalization procedures are in order, which scale the input data and significantly improve the performance of NNs. Especially in physical systems, the quantities of interest can become very small, as they are expressed relative to very small physical constants. Examples of such constants are the magnetic permeability µ ∼ 10⁻⁷ Vs/(Am) or the thermal conductivity κ ∼ 10⁻³ W/(mK). Small physical geometries induce similar problems. The resulting error functions of such systems consist of small, fluctuating values, thus requiring a normalization procedure which scales the dynamical system appropriately, while remaining physically conforming at the same time.
An example of such a scaling procedure is the non-dimensionalization of the physical system. In essence, non-dimensionalization removes the physical dimensions from underlying differential equations by suitable variable substitutions. The resulting differential equations have their physical dimensions partially or even completely removed. As an illustrative example, we present how the temporal and spatial differential operators transform under a change of variables. Let r = (r_x, r_y), and let r̃_x ∶= r_{x,c} r_x, r̃_y ∶= r_{y,c} r_y, and t̃ ∶= t_c t, with scaling constants r_{x,c}, r_{y,c}, t_c ∈ R, be a coordinate transformation for which physical dimensions have been removed. The non-dimensionalized n-th time derivative as well as the Laplace operator then transform as

∂ⁿ/∂tⁿ = t_cⁿ ∂ⁿ/∂t̃ⁿ  and  ∆ = ∆_g,

where g is the metric tensor induced by the spatial scaling and ∆_g the Laplace-Beltrami operator. The transformed wave equation then reads

∂²ũ/∂t̃² = ∆_g ũ,

where the solution to the original system can be recovered by the back-transformations r_x = r⁻¹_{x,c} r̃_x and r_y = r⁻¹_{y,c} r̃_y. The resulting system can then be scaled such that NNs can optimally fit the data, as well as conform to the physical equations.

Hybrid model training
In this section, we discuss the training of the RNN, including details on the choice of training data and how we deal with data sparsity. The training data set consists of snapshot data of the discrepancy function, which are calculated from the down-projected high-fidelity solution and the low-fidelity solution. Assuming N^hifi_T trajectory samples of the high-fidelity solution, we denote with T^hifi ∶= {t_0, ..., t_{N^hifi_T}} the respective time instances on which they are defined. Then, we can partition the time axis of the BVP into separate intervals I_k ∶= [t_k, t_{k+1}] such that [0, T] = ⋃_{k=0}^{N^hifi_T} I_k holds for t_0 = 0 and t_{N^hifi_T} = T. As we can evaluate the low-fidelity model at more time instances than those included in the high-fidelity data, low-fidelity states x^lofi_{t_k} can be evaluated for t_k ∈ T^hifi, but also on intermediate time instances t_k < t_{k_j} < t_{k+1} with j = 1, ..., N_{I_k}, where N_{I_k} denotes the number of intermediate time steps in the interval I_k. Consequently, the low-fidelity states x^lofi_{t_k} are defined on more time steps, namely T^lofi ∶= T^hifi ∪ ⋃_{k=1}^{N^hifi_T} ⋃_{j=1}^{N_{I_k}} t_{k_j}. We construct T^lofi by choice of the low-fidelity model, such that |T^hifi| ≤ |T^lofi| holds and the time instances are uniform, i.e., t_{k+1} − t_k = ∆t, ∀ t_k, t_{k+1} ∈ T^lofi. Especially the latter point is important for the inputs of RNNs. For a visualization of the difference between the sets T^hifi and T^lofi, see Figure 3.3. Given these considerations, the training data set consisting of discrepancy function snapshots is given as

D_d ∶= {δ_{t_k} = T x^hifi_{t_k} − x^lofi_{t_k} ∶ t_k ∈ T^hifi},

where T is a linear projection operator. Note that the sampled instances of the high-fidelity data are not chosen randomly. Instead, they depend on the dynamics of the underlying physical system. In areas where the high-fidelity system x^hifi is very dynamic, for instance, when it is heavily excited, the trajectory is sampled more frequently than in areas where the system approaches steady state.
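Such a non-uniform sampling heuristic could, for instance, look as follows; the exponential spacing rule is an illustrative assumption, not the paper's exact heuristic:

```python
import numpy as np

def sparse_sample_indices(n_steps, n_samples, decay=3.0):
    """Heuristic non-uniform sampling of high-fidelity snapshots:
    time step indices cluster early in the trajectory (strong
    excitation) and thin out toward steady state."""
    u = np.linspace(0.0, 1.0, n_samples)
    frac = np.expm1(decay * u) / np.expm1(decay)  # dense near u = 0
    return np.unique(np.round(frac * (n_steps - 1)).astype(int))
```

Because the curve's slope grows with u, the gaps between selected indices widen along the trajectory.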
In this work, we sample heuristically and note that there are more sophisticated methods to perform the data sampling, for instance, using active learning or optimal experimental design techniques. Whilst sparse data sets are the norm in most practical scenarios, for example in measurement setups, NNs typically show bad interpolation behavior when trained solely with sparse data sets. This is partially due to the fact that no knowledge of the objective function is provided in sparsely sampled areas, wherein the RNN's interpolation accuracy decreases dramatically. In these scenarios, one has to "inform" the NN of the correct mode of behavior. One way to achieve this would be to add physically motivated constraints to the loss function, an approach frequently pursued in so-called physics-informed machine learning [29]. Another approach is to upsample the available training data so as to reflect correct physical behavior, an approach commonly used in game design and image processing [26,31,51]. The main difference between these two approaches lies in how the information regarding the correct interpolation behavior is encoded. In the former case, it is encoded in the formulation of the optimization problem. In the latter case, it is encoded in the artificial (upsampled) data points the model is trained with. The latter approach is explained in more detail in the next section.

Localized data upsampling and noise stabilization
In this work, we resort to the approach of data upsampling, such that model training is performed within a standard supervised learning context. Introducing physics-inspired loss functions, whilst also a promising avenue, has the drawback of significantly complicating the optimization problem, often resulting in prolonged training times.
We propose an upsampling scheme based on a combination of localized, linear interpolation to calculate intermediate artificial system states for t_{k_j} ∈ T^up ∶= T^lofi ∖ T^hifi, and a Gaussian prior for noise stabilization to prevent overfitting. Noise stabilization is a common approach in Gaussian processes to improve numerical stability [4,65] and can be used to prevent NNs from overfitting. The linear interpolation aspect controls the NN's behavior in the sparse data regime, whereas the prior distribution prevents the NN from overfitting. In each training epoch, the prior distribution is sampled to generate new artificial states that are locally bounded by the variance of the Gaussian prior. We choose a linear interpolation approach due to its simplicity and ease of use, as well as its applicability to a large category of problem types. We denote the locally linear interpolation on the interval I_k = [t_k, t_{k+1}] with

δ_{I_k}(t) ∶= δ_{t_k} + ((t − t_k)/(t_{k+1} − t_k)) (δ_{t_{k+1}} − δ_{t_k}),  t ∈ I_k.   (15)

Essentially, (15) interpolates linearly in intervals of sparse data, based on the boundary values found in the training data. We apply noise stabilization by assuming a Gaussian prior on the artificial intermediate states. Let δ̄_{i,t_{k_j}} ∶= δ_{I_k}(t_{k_j})_i be the interpolant evaluated at t_{k_j} ∈ I_k for j = 1, ..., N_{I_k} and restricted to the i-th spatial dof. Then, we define the Gaussian prior

δ*_{i,t_{k_j}} ∼ N(δ̄_{i,t_{k_j}}, α δ̄²_{i,t_{k_j}}),   (16)

for i = 1, ..., N^lofi_dof and t_{k_j} ∈ T^up, where α ∈ R is a weighting factor controlling the variance. The distribution (16) is defined locally for the individual dof, where a lower variance is assumed for dof close to zero, thus preserving known boundary conditions, and a larger variance is allowed in domains where the discrepancy is non-zero. An upsampled data set D_up for t_{k_j} ∈ T^up based on (16) is then given as

D_up ∶= {δ*_{t_{k_j}} ∶ t_{k_j} ∈ T^up}.

In each training epoch of the RNN, D_up is generated by sampling (16) for all spatial dof. Thus, the sparse time series is upsampled to a complete trajectory using the varying artificial system states in D_up and the elements of D_d, which remain fixed as part of the ground truth.
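A sketch of this upsampling step, combining (15)-style linear interpolation with a Gaussian prior in the spirit of (16); the exact variance model and array layout are assumptions:

```python
import numpy as np

def upsample_with_prior(t_sparse, delta_sparse, t_dense, alpha, rng):
    """Linearly interpolate discrepancy coefficients onto the dense
    time grid, then sample a local Gaussian prior whose standard
    deviation scales with the interpolated magnitude, so dof near zero
    stay near zero. Resampled once per training epoch.
    delta_sparse: array of shape (n_sparse, n_dof)."""
    interp = np.stack([np.interp(t_dense, t_sparse, delta_sparse[:, i])
                       for i in range(delta_sparse.shape[1])], axis=1)
    noise = rng.normal(0.0, alpha * np.abs(interp) + 1e-12)
    return interp + noise
```

Calling this inside the training loop with a fresh `rng` draw reproduces the per-epoch resampling described above.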
The training of the RNN is based on a locally weighted loss function, given by

L(θ) = Σ_{t_k ∈ T^hifi} β_{t_k} ‖δ_{t_k} − δ_θ(t_k)‖² + Σ_{t_{k_j} ∈ T^up} ‖δ*_{t_{k_j}} − δ_θ(t_{k_j})‖²,

where δ_{t_k} ∈ D_d are part of the ground truth states, δ*_{t_{k_j}} ∈ D_up are the sampled artificial states, β_{t_k} ∈ R are local, time-dependent weighting factors, and ‖⋅‖ is the Euclidean norm. In the initial training stages, we choose β_{t_k} = 1, ∀t_k ∈ T^hifi, until the RNN has a coarse fit on the data. In the later training stages, we change the local weighting factors on D_d to β_{t_k} = 1 + ‖δ_{t_k} − δ_θ‖_2, which are calculated by numerical quadrature. Note that β_{t_k} ≥ 1. In that way, we ensure that the data set D_d is always weighted more heavily than D_up during RNN training. For the optimization, we use the adaptive moment estimation (ADAM) algorithm with learning rate decay.
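A sketch of the locally weighted loss in PyTorch; the exact aggregation over time steps and dof is an assumption based on the description above:

```python
import torch

def weighted_loss(pred, target, is_ground_truth, beta):
    """Locally weighted squared-error loss: ground-truth snapshots
    (from D_d) are weighted by beta_{t_k} >= 1, upsampled artificial
    states by 1. Shapes: pred/target (n_steps, n_dof);
    is_ground_truth, beta (n_steps,)."""
    weights = torch.where(is_ground_truth, beta, torch.ones_like(beta))
    per_step = ((pred - target) ** 2).sum(dim=1)
    return (weights * per_step).sum()
```

The two-stage weighting strategy then amounts to passing β = 1 early on and β = 1 + ‖δ − δ_θ‖₂ in later epochs.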

Numerical Examples
In the following numerical investigations, we employ the proposed hybrid model to approximate discrepancy functions in three engineering test cases governed by transient PDEs, namely, heat diffusion on a heat sink, eddy currents in a quadrupole magnet, and sound wave propagation inside a cavity. For each test case, we consider a high-fidelity and a low-fidelity FE representation of the BVP. The difference between the two lies in the FE mesh refinement, as well as in modeling errors in the material laws, excitation, and domain geometry, depending on the test case. To highlight the necessity of the upsampling approach proposed in Section 3.4, we consider hybrid models with and without data upsampling and observe the impact of overfitting in the latter case. Both simulations are then compared to reference data of the discrepancy function, which is calculated from densely sampled high-fidelity data. The trained hybrid models are then employed for bias correction of the low-fidelity models using (5), leading to significantly more accurate results. To assess the accuracy of the discrepancy function approximation and the bias corrected model, we consider the relative L²(Ω) errors

∆_{L²}(δ_θ) ∶= ( ∫_{[0,T]} ‖δ_θ − δ_t‖²_2 dt / ∫_{[0,T]} ‖δ_t‖²_2 dt )^{1/2},

and analogously for the bias corrected states, where ‖⋅‖_2 is approximated via numerical integration over the computational mesh and ∫_{[0,T]} ⋅ dt via Riemann sums. Last, considering the FE simulations, we restrict ourselves to the 2D case without loss of generality, and accordingly restrict our space of test functions to its two-dimensional counterpart.
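The relative error measure can be sketched as follows, with the spatial L²(Ω) norm evaluated through the FE mass matrix M and the time integral as a Riemann sum:

```python
import numpy as np

def relative_l2_error(delta_approx, delta_ref, M, dt):
    """Relative space-time L2 error: the squared spatial L2(Omega) norm
    of a coefficient vector d is d^T M d (mass matrix M); time
    integration is a Riemann sum with uniform step dt."""
    num = sum(float(d @ M @ d) for d in (delta_approx - delta_ref)) * dt
    den = sum(float(d @ M @ d) for d in delta_ref) * dt
    return np.sqrt(num / den)
```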

Heat diffusion on a heat sink
As a first test case, we consider the heat diffusion problem on a 2D cross-section of a heat sink, see Figure 4.1 (left). The heat sink geometry is defined on the domain Ω = [−l, l]², l ∈ R, which consists of a thermally conductive region Ω_con and a less thermally conductive air region Ω_air. The boundary of the geometry consists of ∂Ω_d = ∂Ω ∩ ∂Ω_air, to which we apply homogeneous Dirichlet boundary conditions, and ∂Ω_nd = ∂Ω ∩ ∂Ω_con, to which we apply non-homogeneous Dirichlet boundary conditions. The BVP reads

ρc_v ∂u/∂t − ∇ ⋅ (κ ∇u) = 0  in Ω × [0, T],  u = c on ∂Ω_nd,  u = 0 on ∂Ω_d,   (21)

where u is the temperature, ρc_v the heat capacity, κ the thermal conductivity, and c a fixed temperature. The non-homogeneous Dirichlet boundary condition signifies a heat source underneath the heat sink, while the homogeneous Dirichlet boundary conditions assume a constant temperature on the boundary. On the fins of the heat sink, we assume a thermal conductivity of κ = 50 W(mK)⁻¹, with a linear deterioration of the thermal conductivity by 25% in the proximity of the tips. This deterioration can be attributed to material aging or other defects. Figure 4.1 (right) shows the thermal conductivity including material defects, κ^hifi, plotted against the thermal conductivity without material defects, κ^lofi, along the middle fin of the heat sink.
To solve the heat diffusion problem using the FE method, we introduce the corresponding variational form by multiplying (21) with a test function w_i ∈ H_0(grad; Ω) and integrating over the domain. The variational form reads

∫_Ω ρc_v (∂u/∂t) w_i dΩ − ∫_Ω ∇ ⋅ (κ ∇u) w_i dΩ = 0,

for i = 1, ..., N^lofi_dof. Applying integration by parts and the first-order difference quotient ∂u/∂t ≈ (u_{t_{k+1}} − u_{t_k})/∆t results in the implicit Euler time stepping scheme. We discretize the temperature as u = Σ_{j=1}^{N^lofi_dof + N^lofi_bdry} û_j w_j, where the coefficients û_j, j = N^lofi_dof + 1, ..., N^lofi_dof + N^lofi_bdry, discretize the non-homogeneous Dirichlet data. The resulting system reads

(M + ∆t A) û_{t_{k+1}} = M û_{t_k},   (27)

where A is the stiffness matrix and M the mass matrix, with the individual entries of the respective matrices given as

A_{ij} = ∫_Ω κ ∇w_j ⋅ ∇w_i dΩ,  M_{ij} = ∫_Ω ρc_v w_j w_i dΩ.

The columns of A and M for j = N^lofi_dof + 1, ..., N^lofi_dof + N^lofi_bdry are shifted from the left to the right-hand side of (27) by applying a Dirichlet lift [52].
For the simulation with either model, i.e., low- or high-fidelity, we assume ρc_v = 1 JK⁻¹m⁻³, c = 10 K, ∆t = 2 ⋅ 10⁻² s, and N_T = 100. The low-fidelity model Ψ^lofi(û^lofi_{t_k}; κ^lofi) has thermal conductivity κ^lofi with N^lofi_dof = 278, while the high-fidelity model Ψ^hifi(û^hifi_{t_k}; κ^hifi) has thermal conductivity κ^hifi and N^hifi_dof = 1408. The different material choices and mesh discretizations induce modeling and discretization errors between the low-fidelity and the high-fidelity model. A visualization of the heat distribution of the high-fidelity and low-fidelity model is depicted in Figure 5 (rows 1 and 3). In both cases, we can observe that the heat sink rapidly transports heat from the excited base to its surroundings, eventually reaching a steady-state.
To approximate the discrepancy function, we employ 19 out of 100 trajectory samples as training data, with more samples chosen in the initial stages of excitation and fewer once the system reaches steady-state. The training data instances are depicted as crosses at the respective time steps in Figure 4. We train the RNN according to the parameters in Table 2, from which we observe that 1000 training epochs are required to reduce ∆_{L²}δ_θ to 0.27% for the upsampled model and to 8.187% for the non-upsampled model.
In Figure 4 we display the spatially integrated discrepancy function ‖δ_t‖₂ at each time step, once for the hybrid model, with and without upsampling, and once for the reference data. If the hybrid model is trained solely on the sparse data set, we observe good agreement on the training data; however, large errors appear for previously unseen data, due to overfitting. This phenomenon is especially pronounced where there are larger gaps in the training data, for example between the time steps t_30 = 0.6 s and t_100 = 2 s. However, we also note that upsampling is not necessary in all regions of the trajectory for the hybrid model to exhibit correct interpolation behavior, as can be observed between time steps t_5 = 0.1 s and t_15 = 0.3 s for the non-upsampled model.
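The upsampling step that counters this overfitting can be sketched as a per-dof linear interpolation of the sparse snapshots onto the full time grid. The variable names, the sample indices, and the purely linear scheme below are illustrative; the paper's actual upsampling scheme is chosen per problem.

```python
import numpy as np

# Linear upsampling sketch: augment sparse discrepancy snapshots with
# interpolated pseudo-samples on the full time grid.
t_full = np.linspace(0.0, 2.0, 101)            # all time steps t_0 ... t_100
idx_train = np.array([0, 5, 10, 15, 30, 100])  # hypothetical sparse sample indices
t_train = t_full[idx_train]

# delta_train: snapshots of the spatially discretized discrepancy (n_dof per step);
# random placeholder data stands in for the pre-computed snapshots
n_dof = 278
rng = np.random.default_rng(0)
delta_train = rng.normal(size=(len(idx_train), n_dof))

# Interpolate each dof trajectory linearly onto the full grid; the original
# samples are reproduced exactly at the training time steps
delta_up = np.stack(
    [np.interp(t_full, t_train, delta_train[:, j]) for j in range(n_dof)], axis=1
)
```

Between training samples the RNN is then fitted against these pseudo-samples, which is what enforces the interpolation behavior in sparsely sampled regions.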

| Description | Symbol | Epochs 0–500 | Epochs 500–1000 |
|---|---|---|---|
| Learning rate | η | 1 ⋅ 10⁻³ | 1 ⋅ 10⁻⁴ |
| Local weighting factors | β | × | ✓ |
| Variance weighting factor | α | 1/25 | 1/25 |
| Error with upsampling | ∆_{L²}δ_θ^up | 2.502% | 0.270% |
| Error without upsampling | ∆_{L²}δ_θ | 10.014% | 8.187% |

In Figure 5 (row 2), we observe the absolute value of the discrepancy function between the low- and high-fidelity model. As expected, the maximum error is observed at the tips of the heat sink fins, that is, in the area where the material defect occurs. Figure 5 (row 4) shows the bias corrected model, for which we have ∆_{L²}u_corr = 7.59 ⋅ 10⁻³%. Compared to ∆_{L²}u_lofi = 2.73%, this is a significant correction to the low-fidelity model.
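The relative L² error and the bias correction itself reduce to a few lines. The sketch below assumes the discrepancy has already been projected onto the low-fidelity basis, so that the arrays share a shape; the array contents are placeholder data.

```python
import numpy as np

# Sketch of the relative L2 error metric (in percent) and the bias
# correction u_corr = u_lofi + delta_theta over a whole trajectory.
def rel_l2(u_approx, u_ref):
    """Relative L2 error in percent, over all time steps and dof."""
    return 100.0 * np.linalg.norm(u_approx - u_ref) / np.linalg.norm(u_ref)

rng = np.random.default_rng(1)
u_hifi = rng.normal(size=(100, 278))                    # reference trajectory
u_lofi = u_hifi + 0.05 * rng.normal(size=u_hifi.shape)  # biased low-fidelity model
delta_theta = u_hifi - u_lofi                           # idealized discrepancy
u_corr = u_lofi + delta_theta                           # bias corrected trajectory
```

In practice δ_θ comes from the trained hybrid model rather than the exact difference, so the corrected error is small but nonzero, as in the percentages reported above.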

Transient eddy-current problem in a quadrupole magnet
As a second test case, we examine the transient eddy-current problem on a 2D cross-section of a quadrupole magnet. The geometry is depicted in Figure 6 (left) [15]. The quadrupole is defined on a circular domain Ω, which consists of an iron yoke Ω_Fe, coils for current excitation Ω_s, and an aperture Ω_p. The domain boundary ∂Ω consists of the outer boundary of the iron domain ∂Ω_Fe. The BVP describing the dynamical system is given by the magneto-quasistatic Maxwell equations. Choosing the vector potential ansatz b = ∇ × a, where a is the magnetic vector potential and b the magnetic flux density, the eddy-current problem can be given as the time-dependent BVP

σ ∂a/∂t + ∇ × (ν ∇ × a) = j_s in Ω × (0, T],
a = 0 on ∂Ω,

where j_s is the source current density. For the current excitation, we assume the current I^hifi(t) depicted in Figure 6 (right). This choice of current excitation is motivated qualitatively by the ramping procedure that provides a linear field increase during acceleration, with current plateaus for beam insertion and extraction and a current decay after switch-off. An approximate current I^lofi(t) is additionally considered, also depicted in Figure 6, which consists of a linear ramp and de-ramp. In both cases, the currents are distributed in the excitation domain, yielding a current density j_s(r, t) = |Ω_s|⁻¹ I(t) e_z, where e_z is the unit vector in z-direction. To confine the magnetic quadrupole field inside the magnet, we choose homogeneous boundary conditions on ∂Ω. Similar to the previous section, we express (26) in its variational form and spatially discretize with vectorial first-order shape and test functions over a triangulation of Ω. Due to geometrical considerations, we can safely neglect transverse effects of the vector potential, which results in a single vectorial component w_i = w_i e_z, with w_i ∈ H_0(grad; Ω).
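The ramp-and-plateau excitation described above, and its linear low-fidelity approximation, can be sketched as piecewise-linear profiles. The breakpoints, amplitudes, and coil area below are illustrative assumptions, not the paper's values.

```python
import numpy as np

# Piecewise-linear sketch of a ramp/plateau current excitation I_hifi(t)
# and the uniform current density magnitude j_s = I(t)/|Omega_s|.
# All breakpoints and amplitudes are illustrative placeholders.
t_pts = [0.0, 1.0, 1.5, 2.5, 3.0, 3.27]  # s: ramp, plateau, ramp, plateau, decay
I_pts = [0.0, 1.0, 1.0, 2.0, 2.0, 0.0]   # normalized current

t = np.linspace(0.0, 3.27, 328)          # N_T = 327 time steps of dt = 1e-2 s
I_hifi = np.interp(t, t_pts, I_pts)

# Low-fidelity approximation: a single linear ramp and de-ramp
I_lofi = np.interp(t, [0.0, 1.635, 3.27], [0.0, 2.0, 0.0])

area_coil = 1e-2                         # |Omega_s| in m^2 (illustrative)
j_s = I_hifi / area_coil                 # z-directed current density magnitude
```

The difference between I_hifi and I_lofi is one of the two sources of model discrepancy in this test case, alongside the mesh resolution.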
The magnetic vector potential is approximated via the ansatz a = ∑_{j=1}^{N^lofi_dof} â_j w_j, where the dof {â_j}_{j ≤ N^lofi_dof} lie on the mesh nodes. In matrix-vector notation, the FE formulation reads

(M + ∆t A) â_{t_{k+1}} = M â_{t_k} + ∆t b_{t_{k+1}},

where A and M are the stiffness and mass matrix, respectively, and b is the loading vector. The entries of A and M are given as

A_ij = ∫_Ω ν ∇w_j ⋅ ∇w_i dΩ,  M_ij = ∫_Ω σ w_j w_i dΩ,

while the entries of the right-hand side loading vector are

b_i = ∫_Ω j_s ⋅ w_i dΩ.

For the simulation, we assume σ_Fe = 1.04 ⋅ 10⁷ Sm⁻¹ and ν_Fe = 2 ⋅ 10⁻³ ν_0 in Ω_Fe, and σ = 1 Sm⁻¹ and ν = ν_0 in Ω_p and Ω_s. Furthermore, we assume a constant time step of ∆t = 1 ⋅ 10⁻² s and N_T = 327 time steps.
The low-fidelity model Ψ^lofi(â^lofi_{t_{k+1}}, â^lofi_{t_k}, ∆t; I^lofi(t_{k+1})) is parametrized with the current I^lofi and mesh resolution N^lofi_dof = 895, while the high-fidelity model Ψ^hifi(â^hifi_{t_{k+1}}, â^hifi_{t_k}, ∆t; I^hifi(t_{k+1})) uses the current I^hifi and N^hifi_dof = 277 594. Both multi-fidelity models are non-dimensionalized as described in Section 3.2, such that the magnetic vector potential is normalized to the unit square.

| Description | Symbol | Epochs 0–500 | Epochs 500–1000 |
|---|---|---|---|
| Learning rate | η | 1 ⋅ 10⁻³ | 1 ⋅ 10⁻⁴ |
| Local weighting factors | β | × | ✓ |
| Variance weighting factor | α | 1/25 | 1/25 |
| Error with upsampling | ∆_{L²}δ_θ^up | 2.806% | 1.538% |
| Error without upsampling | ∆_{L²}δ_θ | 13.094% | 8.187% |

A visualization of the potential distribution of the high-fidelity and low-fidelity model is depicted in Figure 8 (rows 1 and 3). In both cases, we can observe the eddy current phenomenon occurring in the iron yoke Ω_Fe. As training data, we use 79 out of 327 trajectory samples at the respective time steps, depicted as crosses in Figure 7. In this test case, we increase the samples in areas where the current excitation changes from an increasing to a decreasing flank. We train the model according to the parameters in Table 2, from which we observe that 1000 training epochs are required to reduce ∆_{L²}δ_θ to 1.538% for the upsampled model, and to 8.187% for the non-upsampled model.
In Figure 7 we display the spatially integrated discrepancy function ‖δ_t‖₂ at each time step, for the hybrid model with and without upsampling and for the reference data. If the hybrid model is trained using solely the sparse data set, we observe good agreement on the data points. However, the model overfits and performs poorly for previously unseen data, albeit not as strongly as in the heat sink test case. This phenomenon is again prevalent in domains with sparse training data. Similar to the previous section, the hybrid model with artificial upsampling generally yields good agreement on the complete data set. However, there are certain areas in which the approximation is suboptimal, in particular where the current excitation changes from a rising to a falling flank.
A visualization of the discrepancy is given in Figure 8 (row 2), where we observe that the discrepancy primarily occurs at the interface between the iron and the air domain. Figure 8 (row 4) shows the bias corrected model, for which we have ∆_{L²}a_corr = 0.613%. Compared to ∆_{L²}a_lofi = 39.847%, this is a significant correction to the low-fidelity model, obvious both visually and by comparing the approximation errors of the low-fidelity and the bias corrected models.

Wave propagation in a cavity
For the third test case, we consider wave propagation in a closed cavity, where one commonly aims at detecting resonance frequencies and eigenmodes. After excitation, the wave propagates through the cavity until it eventually dissipates and the dominant frequencies are identified. As to the geometry of the cavity, we assume a "true" geometry Ω_hifi, with rounded-off corners between the connectors and the cavity itself, as well as an approximate design, Ω_lofi, with sharp corners at the connectors. Both geometries are depicted in Figure 9 (left). The equation describing the BVP reads

∂²u/∂t² − v² ∆u = f in Ω × (0, T],

where u is the acoustic pressure field, v the velocity of the propagating wave, f ∈ C^∞(Ω) the excitation function, and Ω ∈ {Ω_lofi, Ω_hifi} the low- or high-fidelity domain. For boundary conditions, we apply homogeneous Dirichlet boundary conditions on ∂Ω, such that the wave is reflected off the cavity walls and contained therein. The wave excitation f is a Gaussian impulse, as shown in Figure 9 (right), applied as a point source at r_s ∈ Ω as indicated in Figure 9 (left). Similar to the previous sections, we express (30) in its variational form and spatially discretize with first-order shape and test functions w_i ∈ H_0(grad; Ω). The wave equation solution is approximated via u = ∑_{j=1}^{N^lofi_dof} û_j w_j, where the dof {û_j}_{j ≤ N^lofi_dof} lie on the mesh nodes. As (30) is an equation of second order, discretization in time requires a central difference scheme ∂²u/∂t² ≈ (u_{t_{k+1}} − 2u_{t_k} + u_{t_{k−1}})/∆t². The resulting discretized FE system reads

M û_{t_{k+1}} = (2M − ∆t² A) û_{t_k} − M û_{t_{k−1}} + ∆t² b_{t_k},

where û is a vector containing the dof of the FE basis and the mass and stiffness matrices are given as

M_ij = ∫_Ω w_j w_i dΩ,  A_ij = ∫_Ω v² ∇w_j ⋅ ∇w_i dΩ,

and the right-hand side loading vector b is given at each time instance t_k as

b_i = ∫_Ω f(t_k) w_i dΩ.

For the simulation, we assume v = 343 ms⁻¹, ∆t = 4 ⋅ 10⁻⁵ s, and N_T = 100 time steps. The low-fidelity model Ψ^lofi(û^lofi_{t_{k+1}}, û^lofi_{t_k}, û^lofi_{t_{k−1}}, ∆t; Ω_lofi) uses the approximate cavity geometry Ω_lofi with N^lofi_dof = 169.
The high-fidelity model Ψ^hifi(û^hifi_{t_{k+1}}, û^hifi_{t_k}, û^hifi_{t_{k−1}}, ∆t; Ω_hifi) uses the "true" cavity geometry Ω_hifi with N^hifi_dof = 3997. The modeling and discretization errors are here induced by the geometry variation and the difference in mesh discretization. A visualization of the wave propagation of the high-fidelity and low-fidelity model is depicted in Figure 11 (rows 1 and 3). We observe that both waves propagate simultaneously; however, the low-fidelity solution is much coarser and its magnitude is significantly off.
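The central-difference update above can be sketched with a 1D toy problem. The following uses a lumped mass matrix, a Gaussian initial pulse instead of the time-dependent point source, and illustrative mesh and time-step sizes; it is not the paper's 2D cavity.

```python
import numpy as np

# Sketch of the central-difference (leapfrog) update
#   M u_{k+1} = (2M - dt^2 A) u_k - M u_{k-1} + dt^2 b_{t_k}
# for a 1D string with homogeneous Dirichlet ends; parameters illustrative.
n, L, v = 101, 1.0, 343.0
h = L / (n - 1)
dt, n_steps = 4e-7, 100        # dt chosen well below the CFL-like bound h/v

x = np.linspace(0.0, L, n)
A = (v**2 / h) * (2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1))  # stiffness (with v^2)
M = h * np.eye(n)              # lumped mass matrix

u_prev = np.exp(-((x - 0.5) / 0.05) ** 2)  # Gaussian initial pressure pulse
u_prev[0] = u_prev[-1] = 0.0
u = u_prev.copy()              # zero initial velocity: u_0 = u_{-1}
Minv = 1.0 / np.diag(M)        # diagonal mass inverse as a vector
for _ in range(n_steps):
    # no source after t = 0, so b_{t_k} = 0 here
    u_next = Minv * ((2 * M - dt**2 * A) @ u - M @ u_prev)
    u_next[0] = u_next[-1] = 0.0  # homogeneous Dirichlet walls
    u_prev, u = u, u_next
```

The scheme is explicit, so the time step must satisfy the stability bound; for the lumped-mass 1D sketch this is roughly ∆t ≤ h/v.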
For the discrepancy function approximation, we use 51 out of 100 trajectory instances as training data, depicted as crosses at the respective time steps in Figure 10. The hybrid model is trained according to the parameters in Table 3, from which we observe that 2500 training epochs are required to reduce ∆_{L²}δ_θ to 0.200% for the upsampled model and to 14.797% for the non-upsampled model. In Figure 10, the spatially integrated discrepancy function ‖δ_t‖₂ is displayed at each time step, for the hybrid model with and without upsampling and for the reference data. Similar to the previous sections, the hybrid model with artificial upsampling generally yields good agreement on the complete data set, while the hybrid model without upsampling tends to overfit in regions with sparse data. Nevertheless, in this case, we require a larger training data set to achieve a good approximation, that is, slightly more than half of the available samples. This can be attributed to the fact that the underlying dynamical system is of second order and that the underlying discrepancy function exhibits locally non-smooth behavior. Consequently, higher demands are placed on the choice of linear upsampling scheme, requiring the training data to be sampled more densely. A visualization of the discrepancy function is given in Figure 11 (row 2), where we observe at t_50 = 2 ⋅ 10⁻³ s and t_60 = 2.4 ⋅ 10⁻³ s that a significant portion of the error occurs near the variation in the geometry. In Figure 11 (row 4), the solution of the bias corrected model is shown, for which we have ∆_{L²}u_corr = 9.99 ⋅ 10⁻²%. In comparison to ∆_{L²}u_lofi = 27.53%, we again verify that a significant correction to the low-fidelity model has been achieved.

Conclusion and Outlook
This work presented a hybrid modeling framework for the approximation and correction of model bias in the setting of multi-fidelity modeling and simulation of dynamic/transient systems. The proposed hybrid modeling approach combines standard FE approximation with an RNN, where the former accounts for the spatial effects of the approximation and the latter for the temporal dynamics. The hybrid model first approximates the discrepancy between models of varying fidelity and is subsequently used to correct the low-fidelity models and increase their accuracy.
The presented numerical results show that the proposed hybrid modeling method is capable of yielding accurate approximations of the discrepancy function and, accordingly, significantly improved bias corrected models, for a variety of dynamical FE engineering simulations. In all considered test cases, the discrepancy function is approximated with an error below 2%, even if it displays a locally non-smooth behavior. Furthermore, the hybrid model remains accurate even if sparse training data sets are employed, given that a suitable upsampling scheme is provided, which controls the interpolation behavior in regions with only sparse data. Local weighting factors further ensure that the RNN provides a good fit of the training data and avoids over-smoothing.
In all considered test cases, the RNNs, and consequently the hybrid models, are trained to sufficient accuracy after 1000 to 2500 epochs, thus resulting in quite modest training times, given the dimensionality of input and output. This is accomplished due to the nature of the proposed hybrid model, which splits spatial and temporal dependencies. The model already provides a spatial basis in the form of the FE basis functions; hence, the RNN focuses on approximating the temporal dependencies only, a task to which RNNs are well suited in general.
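The spatial/temporal split above can be illustrated with a minimal forward pass: the RNN emits one vector of FE dof coefficients per time step, and the FE basis turns each vector into a spatial field. The Elman-style cell, the layer sizes, and the random weights below are illustrative stand-ins, not the paper's architecture.

```python
import numpy as np

# Minimal Elman-RNN forward pass over the time axis, emitting one vector of
# FE dof coefficients per step -- a sketch of the split between the FE
# spatial basis and the RNN temporal model. Weights are untrained/random.
rng = np.random.default_rng(2)
n_dof, n_hidden, n_steps = 278, 64, 100

W_x = rng.normal(scale=0.1, size=(n_hidden, 1))       # input: scalar time value
W_h = rng.normal(scale=0.1, size=(n_hidden, n_hidden))
W_y = rng.normal(scale=0.1, size=(n_dof, n_hidden))   # readout to dof coefficients
b_h = np.zeros(n_hidden)

h = np.zeros(n_hidden)
delta_pred = np.zeros((n_steps, n_dof))
for k, t in enumerate(np.linspace(0.0, 2.0, n_steps)):
    h = np.tanh(W_x @ np.array([t]) + W_h @ h + b_h)  # hidden-state recursion
    delta_pred[k] = W_y @ h                           # predicted dof at step k

# The discrepancy field at t_k is then delta(r, t_k) = sum_j delta_pred[k, j] w_j(r)
```

Because the output dimension equals the number of FE dof, the network never has to learn a spatial representation, which is what keeps the training effort modest.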
Despite the promising results shown in this work, there are limitations to the proposed hybrid modeling method worth mentioning. First, the accuracy of the hybrid model depends significantly on the upsampling scheme, especially within domains with sparse data. The choice of upsampling scheme is thus problem dependent and requires assumptions on the underlying dynamical systems and discrepancy functions. In addition, the training data cannot be chosen completely at random, as samples within specific regions of the trajectory contain more information about the underlying dynamics than others. Introducing an optimal experimental design approach, i.e., knowing a priori which time steps to sample so as to achieve a good approximation, would be advisable. Finally, the projection operator is problem dependent as well, depending primarily on the choice of basis functions used to represent the discrepancy function. In the context of this work, this choice comes naturally due to the use of the FE method; however, the generalization to other basis representations should be explored.
To provide an outlook, there are numerous possible extensions to the presented methodology. Apart from considering more sophisticated RNN architectures, a possible avenue would be to explore localized Gaussian processes with physics-inspired priors for improved upsampling of the physical data [23,65]. Another possibility would be to include physics-inspired loss functions, similar to the concept of PINNs [8,29,53], as a substitute or complement to data upsampling. Last, the performance of hybrid models trained with real-world observations, for example, collected from measurements or experiments, could be investigated. In such cases, one would face the additional task of noise treatment, e.g., separating noise from the underlying trajectory data.