Low-rank statistical finite elements for scalable model-data synthesis

Statistical learning additions to physically derived mathematical models are gaining traction in the literature. A recent approach has been to augment the underlying physics of the governing equations with data-driven Bayesian statistical methodology. Coined statFEM, the method acknowledges a priori model misspecification by embedding stochastic forcing within the governing equations. Upon receipt of additional data, the posterior distribution of the discretised finite element solution is updated using classical Bayesian filtering techniques. The resultant posterior jointly quantifies uncertainty associated with the ubiquitous problem of model misspecification and the data intended to represent the true process of interest. Despite this appeal, computational scalability is a challenge to statFEM's application to high-dimensional problems typically experienced in physical and industrial contexts. This article overcomes this hurdle by embedding a low-rank approximation of the underlying dense covariance matrix, obtained from the leading-order modes of the full-rank alternative. Demonstrated on a series of reaction-diffusion problems of increasing dimension, using experimental and simulated data, the method reconstructs the sparsely observed data-generating processes with minimal loss of information, in both the posterior mean and variance, paving the way for further integration of physical and probabilistic approaches to complex systems.


Introduction
The statistical finite element method (statFEM) [1] provides a novel statistical construction of the finite element method (FEM) to reconcile imperfect models with observational data. Potential model misspecification, as represented by a Gaussian process (GP), is addressed through sequentially updating the FEM coefficients upon the arrival of observed data, within a filtering context, to compute a physics-informed posterior distribution. In [2], it is seen that for 1D nonlinear partial differential equations (PDEs), statFEM is able to give an accurate approximation to the underlying data generating process, with uncertainty quantification (UQ), synthesising physics and data to give an interpretable posterior distribution. Utilising such an approach allows for the application of simpler computational models, correcting for their possible deficiencies with data. This approach is also shown to be beneficial in small data regimes.
Recent advances in both the computational physics and machine learning communities have seen additional approaches proposed for combining physics and data. In [3], a sparse regression approach is taken, to learn the underlying PDE from spatio-temporal measurements inside the problem domain, building on the similar work of [4], to address system identification from a library of candidate functions. Neural networks can also be used, for both data-driven solution and system identification, and also for estimating optimal discretisations from data [5,6]. Perhaps the most similar to statFEM is [7], where a GP prior is placed over the PDE variable, so that the covariance kernel encodes all a priori physical assumptions. This allows for system identification through transforming PDE coefficients into GP kernel hyperparameters, the estimation of which is well-studied [8].
StatFEM, instead, places a GP prior over the forcing of the PDE and focuses on updating solutions, combining methodology from Bayesian inversion [9] and data assimilation [10]. As in Bayesian inversion, statFEM uses a Gaussian prior placed over a function inside the governing PDE; rather than estimating the posterior distribution of this unknown function, however, statFEM proceeds to sequentially update the discretised PDE solution, as in data assimilation.

The method is demonstrated on three examples. The first is a 1D example with experimental data, and we show that the LR-ExKF is able to accurately reproduce the full-rank ExKF, with relative errors of O(10⁻⁶) on the mean and O(10⁻⁵) on the variance. The next two examples deal with synthetic data, using the Oregonator [20,21], a 2D system of two coupled PDEs, and we demonstrate scalability, taking the state dimension up to 132,098 degrees-of-freedom (DOFs). Examples show that the LR-ExKF accurately reproduces the data generating process under misspecified initial conditions, with relative errors of order O(10⁻²) in the posterior mean. We also use the effective rank [22,23] to verify the effective dimensionality of the covariance matrix, ensuring that the number of modes chosen is adequate. Section 4 concludes the paper, and the Appendix includes two additional examples of interest to practitioners, dealing with parameter estimation (Appendix C) and filter divergence (Appendix D). We also provide code to reproduce all results and figures in this paper, available at https://github.com/connor-duffin/low-rank-statfem.
∂u/∂t − κ∆u = r(u; Λ), (x, t) ∈ Ω × (0, T].  (1)

Equation (1) describes the evolution of a system state (e.g. the concentration of a chemical species), which diffuses throughout the medium and has nonlinear interactions. All parameters of Equation (1) above are assumed known and are denoted by Λ. In this section, without loss of generality, we deal with single-state equations with u ∈ ℝ. Extensions to systems of equations with states u ∈ ℝˢ are covered in Section 3.
A GP, ξ := ξ(x, t), is introduced inside of the PDE to represent prior uncertainty in the model specification:

ξ(x, t) ∼ GP(0, δ(t − t′) · k_θ(x, x′)).  (2)

The spatial covariance kernel k_θ(·, ·) encodes information on model error; we use the squared exponential kernel [8] under the a priori assumption that model errors are spatially smooth. Hyperparameters θ = (ρ, ℓ) respectively parameterise the variance and correlation length-scales of the model error, and are either set to fixed values a priori or estimated using maximum-a-posteriori methods [24]. Hyperparameter estimation is discussed after the algorithm presentation and is illustrated in Appendix C. Delta correlations in time have the implication that ξ is the weak derivative of a function-valued Wiener process such that ξ(x, t + ∆_t) − ξ(x, t) ∼ GP(0, ∆_t k_θ(x, x′)). We assume that ξ(x, 0) ≡ 0, and thus ξ(x, 1) ∼ GP(0, k_θ(x, x′)). Spatially, the process inherits the regularity implied by k_θ(·, ·), which in this work is assumed to give ξ(·, t) ∈ L²(Ω). For further details we refer to [25].
Spatial discretisation proceeds through multiplying by testing functions ϕ ∈ H¹(Ω) and integrating over the problem domain. This gives the semidiscrete weak form

⟨∂u/∂t, ϕ⟩ + κ A(u, ϕ) = ⟨r(u), ϕ⟩ + ⟨ξ, ϕ⟩,

where A(u, ϕ) = ∫_Ω ∇u · ∇ϕ dx and ⟨f, g⟩ = ∫_Ω f g dx: these are respectively the bilinear form induced by the Laplacian operator and the L²(Ω) inner product. Neumann boundary conditions result in the boundary terms dropping out of the weak form.
The finite element mesh is given by subdividing the domain Ω into the triangulation Ω_h ⊆ Ω with vertices {x_j}_{j=1}^{n_u}, where the maximal length of the sides of the triangulation is given by h. The polynomial basis functions {φ_j}_{j=1}^{n_u} are defined on the mesh, having the property that φ_j(x_i) = δ_ij. Expanding the solution in this basis, u_h(x, t) = Σ_{j=1}^{n_u} u_j(t) φ_j(x), and testing against each φ_j corresponds to the n_u-dimensional SDE for the FEM coefficients u(t) = (u_1(t), ..., u_{n_u}(t))ᵀ,

M du(t) = (−κA u(t) + r(u(t))) dt + dβ(t),  (3)

where M_ij = ⟨φ_i, φ_j⟩ is the mass matrix, A_ij = A(φ_i, φ_j) the stiffness matrix, and G_θ the projection of the kernel k_θ onto the FEM basis. The additive noise process β(t) := (β_1(t), ..., β_{n_u}(t))ᵀ has the increments β(t + ∆_t) − β(t) ∼ N(0, ∆_t G_θ). In the absence of additive noise (i.e., for the ODE case), under bounded derivative conditions on r, solutions will be unique, by the Picard-Lindelöf theorem [18]. We use Crank-Nicolson for time integration; writing u_n = (u_1(n∆_t), ..., u_{n_u}(n∆_t))ᵀ, for timestep size ∆_t > 0, this yields a fully discrete system accurate to order O(∆_t²), in which u_{n−1/2} = (u_n + u_{n−1})/2. In practice any appropriate time discretisation can be implemented and the choice will be problem-dependent. In this paper Crank-Nicolson discretisations ensure stability, motivated by the example given in Appendix D. In this example, it is seen that for sufficiently nonlinear dynamics the resultant filtering algorithm may diverge when the time discretisation is of implicit-explicit type [26]. The divergence is avoided when Crank-Nicolson is used.
Given a fixed set of hyperparameters θ, the Euler-Maruyama discretisation [27] of Equation (3) draws sample paths from the measure p(u n |θ, Λ) for each n = 1, . . . , n t . These samples are drawn from a physics-motivated prior distribution, which can be updated, via Bayes theorem, to give a posterior distribution over the FEM coefficients. Model misspecification can then be directly corrected for through computing the posterior distribution.
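As a concrete sketch of this sampling step, prior sample paths of a semidiscrete SDE can be drawn with Euler-Maruyama. The toy system below (an identity mass matrix, a 1D discrete Laplacian drift, and a squared-exponential forcing covariance) is purely illustrative and not the paper's FEM discretisation; all names and parameter values are assumptions for the example.

```python
import numpy as np

def euler_maruyama_paths(A, G_chol, u0, dt, n_steps, n_paths, rng):
    """Draw sample paths of du = -A u dt + dbeta, with cov(dbeta) = dt * G.

    A: (n, n) drift matrix; G_chol: Cholesky factor of the forcing covariance G.
    """
    n = u0.shape[0]
    paths = np.empty((n_paths, n_steps + 1, n))
    paths[:, 0] = u0
    for p in range(n_paths):
        u = u0.copy()
        for i in range(n_steps):
            # Gaussian increment with covariance dt * G
            noise = G_chol @ rng.standard_normal(n)
            u = u - dt * (A @ u) + np.sqrt(dt) * noise
            paths[p, i + 1] = u
    return paths

# toy example: diffusion on a coarse 1D grid with SE forcing covariance
n = 32
A = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)   # discrete Laplacian
x = np.linspace(0, 1, n)
G = 0.1 * np.exp(-(x[:, None] - x[None, :])**2 / (2 * 0.2**2))
G_chol = np.linalg.cholesky(G + 1e-8 * np.eye(n))      # jitter for stability
rng = np.random.default_rng(0)
samples = euler_maruyama_paths(A, G_chol, np.zeros(n), 1e-3, 100, 5, rng)
```

Each row of `samples` is one draw from the physics-motivated prior; in statFEM the drift would instead come from the assembled FEM operators.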
The data generating process is assumed to be y_n = H u_n + η_n. The data y_n ∈ ℝ^{n_y} are assumed to be corrupted by an independent measurement noise process, η_n ∼ N(0, σ² I_{n_y}), and are observed via the linear observation operator H : ℝ^{n_u} → ℝ^{n_y}. For each n, this defines the likelihood p(y_n | u_n, σ) = N(H u_n, σ² I_{n_y}). The posterior distribution p(u_n | y_{1:n}, θ, σ, Λ), with y_{1:n} = (y_1, ..., y_n), can be computed through nonlinear filtering algorithms. In previous work the ExKF and EnKF were used [2]; however, as discussed in the introduction, these algorithms can perform poorly in high-dimensional settings. To circumvent this computational bottleneck we use the LR-ExKF algorithm, which computes the Gaussian approximation p(u_n | y_{1:n}, θ, σ, Λ) ≈ N(m_n, C_n) from a low-rank approximation of the state covariance matrix C_n = L_n L_nᵀ. This is constructed with the leading eigenvalues and eigenvectors of the prior covariance matrix G_θ and the previous timestep covariance C_{n−1}. For similar approaches see [11,12,13,14].
Assume that the distribution of the previous state is given by p(u_{n−1} | y_{1:n−1}, θ, σ, Λ) = N(m_{n−1}, C_{n−1}), with C_{n−1} = L_{n−1} L_{n−1}ᵀ, L_{n−1} ∈ ℝ^{n_u × k}. Furthermore, assume that a low-rank square root of G_θ is available, G_θ ≈ L_G L_Gᵀ, with L_G ∈ ℝ^{n_u × k′}. Also note that D_nr := ∂r(u_{n−1/2})/∂u_n, the Jacobian matrix. For all timesteps n = 1, ..., n_t, the LR-ExKF proceeds as:

1. (Prediction step) Solve the fully discrete system for the prediction mean m̂_n, and compute the prediction covariance square root L̂_n by propagating L_{n−1} through the tangent linear dynamics and appending the scaled forcing modes √∆_t L_G, noting that L̂_n L̂_nᵀ = Ĉ_n (the prediction covariance), and that L̂_n ∈ ℝ^{n_u × (k + k′)}. Each column of L̂_n can be formed in parallel; see the following discussion for further details.

2. (Truncation step) Take the eigendecomposition L̂_nᵀ L̂_n = V_n Σ_n V_nᵀ, where Σ_n = diag(ς_1, ..., ς_{k+k′}). Approximate L̂_n ← L̂_n [V_n]_{:,1:k} for the highest magnitude k modes, so the prediction covariance is Ĉ_n = L̂_n L̂_nᵀ.

3. (Update step) Update the mean: m_n = m̂_n + (H Ĉ_n)ᵀ (H Ĉ_n Hᵀ + σ² I_{n_y})⁻¹ (y_n − H m̂_n), and the covariance square root L_n = L̂_n R_n, where R_n R_nᵀ = I_k − L̂_nᵀ Hᵀ (H Ĉ_n Hᵀ + σ² I_{n_y})⁻¹ H L̂_n, computed using a Cholesky decomposition or otherwise.
If k = k′ = n_u then the LR-ExKF recovers the full ExKF exactly. If the datum y_n is missing then only the prediction and truncation steps are completed, to produce the posterior p(u_n | y_{1:n}, θ, σ, Λ) ≡ p(u_n | y_{1:n−1}, θ, σ, Λ).
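The three steps above can be sketched in dense NumPy. This is a minimal illustration, not the paper's implementation: the tangent-linear propagator is passed as an explicit matrix J (in the paper each column instead requires an implicit Crank-Nicolson solve), and the mean propagation is linearised for brevity.

```python
import numpy as np
from scipy.linalg import cholesky, eigh, cho_factor, cho_solve

def lr_exkf_step(m_prev, L_prev, J, L_G, dt, H, y, sigma, k):
    """One LR-ExKF step (sketch): prediction, truncation, square-root update.

    J: tangent-linear propagator (explicit here); L_G: rank-k' sqrt of G_theta.
    """
    # -- prediction: mean through the model, square root by concatenation
    m_hat = J @ m_prev                                  # stand-in for the solve
    L_hat = np.hstack([J @ L_prev, np.sqrt(dt) * L_G])  # n_u x (k + k')
    # -- truncation: keep the k dominant modes of L_hat L_hat^T via the
    #    small (k + k') x (k + k') Gram matrix
    vals, vecs = eigh(L_hat.T @ L_hat)
    order = np.argsort(vals)[::-1][:k]
    L_hat = L_hat @ vecs[:, order]                      # n_u x k
    # -- update: innovation covariance S = H C H^T + sigma^2 I
    HL = H @ L_hat
    S = HL @ HL.T + sigma**2 * np.eye(H.shape[0])
    cf = cho_factor(S)
    m = m_hat + L_hat @ (HL.T @ cho_solve(cf, y - H @ m_hat))
    # covariance square root: L_n = L_hat R_n, R_n R_n^T = I - L^T H^T S^-1 H L
    R_RT = np.eye(k) - HL.T @ cho_solve(cf, HL)
    R = cholesky(R_RT, lower=True)
    return m, L_hat @ R
```

Note that only (k + k′)-dimensional and n_y-dimensional factorisations appear; the state dimension n_u enters only through matrix-vector products.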

Discussion
First we compare the ExKF and LR-ExKF in terms of memory and operation counts. For a general reference to these matrix computations we refer to [28]. The standard ExKF requires that C_n and G_θ are stored in memory, which is O(n_u²) in space. For large-DOF problems this is infeasible, and provides the main motivation for the low-rank approach. If one employs the standard ExKF, then the prediction step for the covariance matrix requires solving sparse linear systems with the n_u × n_u matrix M + ∆_t(κA + D_nr), 2n_u times per timestep. For large n_u this becomes prohibitively expensive. If using a direct solver is feasible, then this can be slightly mitigated by computing the LU factorisation of M + ∆_t(κA + D_nr), and reusing the factors when solving for each column of the prediction square root, though this still requires running forward- and back-substitution 2n_u times per timestep.
Furthermore, the update step requires solving the system H Ĉ_n Hᵀ + σ² I_{n_y}, n_u times per timestep, which requires O(n_y³/3) operations for the Cholesky decomposition and O(n_y² n_u) to solve. In comparison, the LR-ExKF requires solving M + ∆_t(κA + D_nr) only k + k′ times per timestep, and is O((k + k′) n_u) in space. The truncation step incurs a cost of O((k + k′)³) to compute the eigendecomposition. The update step has a cost of O(n_y³/3) operations for the Cholesky factorisation of H Ĉ_n Hᵀ + σ² I_{n_y}, but requires only k solves, giving a cost of O(n_y² k). Finally, the cost of the Cholesky factor R_n is O(k³/3). However, note that k ≪ n_u, and in our experience the cost of these decompositions is dwarfed by solving the system in the prediction step.
Note also that each column of the prediction covariance square root L̂_n can be computed in parallel. As each column requires solving M + ∆_t(κA + D_nr), this can result in lower runtimes, especially when combined with, for example, algebraic multigrid solvers [29]. This parallelisation is likely to be necessary when scaling up the LR-ExKF to larger systems than those considered here, such as high-dimensional 3D models. This is also similar to the parallelisation potential of the EnKF prediction step, which has enabled widespread adoption of that algorithm [30].
It is also assumed that the covariance spectrum of G_θ is rapidly decaying, so that the majority of the variance can be explained by k ≪ n_u dominant modes. In this work we use the squared-exponential covariance function, which is known to have a rapid spectral decay [31,32]. For efficient methods to decompose GP covariance matrices into their leading eigenvalues we refer to [33,34,35,36]. As the filter is initialised with the ℓ²-optimal low-rank square root and the truncation step preserves the dominant modes of variation, it is thought that this scheme is able to provide accurate UQ; we plan to verify this in future work.
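As an illustration of this rapid spectral decay, a rank-k square root of a squared-exponential covariance matrix can be built from its leading eigenpairs. The grid, amplitude, and length-scale below are hypothetical; `scipy.linalg.eigh` with `subset_by_index` computes only the k largest eigenpairs.

```python
import numpy as np
from scipy.linalg import eigh

def se_kernel(x, rho, ell):
    """Squared-exponential kernel matrix on the 1D points x."""
    d2 = (x[:, None] - x[None, :])**2
    return rho**2 * np.exp(-d2 / (2 * ell**2))

def lowrank_sqrt(K, k):
    """Rank-k square root L_G with L_G L_G^T ~ K, from the k leading modes."""
    n = K.shape[0]
    vals, vecs = eigh(K, subset_by_index=[n - k, n - 1])  # k largest eigenpairs
    vals = np.clip(vals, 0.0, None)                       # guard tiny negatives
    return vecs * np.sqrt(vals)

x = np.linspace(0, 1, 200)
K = se_kernel(x, rho=1.0, ell=0.2)
L_G = lowrank_sqrt(K, k=16)
# rapid spectral decay: 16 modes already capture nearly all of the variance
frac = np.trace(L_G @ L_G.T) / np.trace(K)
```

For this length-scale, `frac` exceeds 0.99 with only 16 of 200 modes, which is the regime in which the truncation step loses little information.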
The low-rank approximation of the covariance matrix will lead to underestimation of the variance. If we write the prediction covariance as Ĉ_n + Ĉ_{n,err}, then the norm of the discarded component is given by ‖Ĉ_{n,err}‖₂ = ς_{k+1}. This is noted in [12], where it is also commented that this underestimation could lead to similar problems as encountered in ensemble Kalman filtering, such as catastrophic filter divergence [22]. We have observed this when using unstable time-integration schemes, in the spiral wave regime; we refer to Appendix D for full details. There is an additional similarity to the EnKF: both algorithms propagate a low-rank approximation to the covariance square root (L_n in the LR-ExKF; the ensemble in the EnKF), and make the Gaussian assumption in the update step. It is thought that the LR-ExKF is similar to an EnKF in which the ensemble members are chosen to optimally represent the variance (in the ℓ² sense), and the propagation is done through the tangent linear model.
In Appendix C, we use the weakly informative priors ρ_n ∼ N₊(1, 1²) and σ_n ∼ N₊(0, 1²), reflecting the a priori uncertainty in the optimal choices of these hyperparameters. For the optimisation routine, L-BFGS-B [38] with positivity constraints, as implemented in SciPy [39], has worked well in our experience.
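A minimal sketch of such a MAP estimation step is given below, assuming a single assimilation time, a unit-variance model covariance scaled by ρ², and the half-normal priors above. The problem setup (grid, observation operator, data) is entirely synthetic and illustrative; the optimiser call mirrors the L-BFGS-B-with-bounds pattern described in the text.

```python
import numpy as np
from scipy.optimize import minimize

def negative_log_posterior(params, y, H, C_unit, m_hat):
    """-log p(y | rho, sigma) - log prior, for one assimilation step.

    C_unit is the unit-variance model covariance, scaled by rho**2.
    """
    rho, sigma = params
    S = rho**2 * (H @ C_unit @ H.T) + sigma**2 * np.eye(len(y))
    sign, logdet = np.linalg.slogdet(S)
    r = y - H @ m_hat
    nll = 0.5 * (logdet + r @ np.linalg.solve(S, r))
    # half-normal priors N_+(1, 1^2) on rho and N_+(0, 1^2) on sigma
    nlp = 0.5 * (rho - 1.0)**2 + 0.5 * sigma**2
    return nll + nlp

# hypothetical single-step setup: 50 DOFs, 20 observed
rng = np.random.default_rng(1)
n, ny = 50, 20
H = np.eye(n)[rng.choice(n, ny, replace=False)]
x = np.linspace(0, 1, n)
C_unit = np.exp(-(x[:, None] - x[None, :])**2 / (2 * 0.1**2))
m_hat = np.zeros(n)
Lc = np.linalg.cholesky(C_unit + 1e-8 * np.eye(n))
y = H @ (0.5 * (Lc @ rng.standard_normal(n))) + 0.01 * rng.standard_normal(ny)

res = minimize(negative_log_posterior, x0=[1.0, 0.1],
               args=(y, H, C_unit, m_hat),
               method="L-BFGS-B", bounds=[(1e-6, None), (1e-6, None)])
rho_map, sigma_map = res.x
```

The bounds enforce positivity directly, avoiding an explicit log-reparameterisation of the hyperparameters.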

Case studies
The methodology is demonstrated on three RD examples. The first is a verification of the method on a 1D example of cell RD, using the experimental data of [40], and shows a coherent synthesis of data with a prior physical model. It also confirms that the low-rank filter accurately reproduces the full-rank filter, with relative errors of O(10⁻⁶) for the mean, and O(10⁻⁴) for the variance.
The next two examples are in 2D, using the Oregonator RD model [20,21], with misspecified initial conditions. The first 2D example demonstrates that observations of a single component of the system can correct for misspecification on the unobserved component. The second 2D example studies the effect of increasing the mismatch variance ρ on filter performance and also shows scalability, using a state dimension of n_u = 132,098. Conditioning on data corrects for misspecification, and in both cases statFEM recovers the underlying data generating process to relative errors of O(10⁻²). In these examples, running the full-rank filter is prohibitively expensive, so we also use the effective rank of L̂_n as a measure for filter performance.
The supplement contains two additional studies, which discuss the parameter estimation methodology and a case of catastrophic filter divergence, respectively. Code to run all examples is available on a public GitHub repository, with all finite element discretisations done in FEniCS [41].

Experimental data: verification
We consider a coupled system of two nonlinear RD equations with stochastic forcing, which models the densities of two different cell populations as cells react with one another and diffuse throughout the domain, as discussed in [40].
The combined system state is given by w = (u, v) ∈ ℝ². Coefficients are set to fixed, known values. In contrast to [40], who linearly interpolate the data to give their initial conditions, we assume fixed piecewise initial conditions; instead of interpolating the data at time t = 0, we condition on it. Equation (4) is discretised using the standard linear polynomial "hat" basis functions, for each component, on a regular mesh with 200 cells on the interval [0, 1300]. Crank-Nicolson is used for the time discretisation, with timestep size ∆_t = 0.1.
Hyperparameters of ξ_u and ξ_v are set to the same values, which are held constant across all times; this avoids propagating poor estimates through the simulation. The covariance structure of Equation (2) is used, and cross-correlations are assumed to be zero: E(ξ_u ξ_v) ≡ 0. Hyperparameters are set to ρ = 2 × 10⁻³ and ℓ = 100.
Laboratory data for this system are contained in the supplementary information of [40], freely available online. These data consist of observations of the two species at 4 times, t_obs = 0, 16, 32, 48 hours. The concatenated state, w_n = (u_n, v_n)ᵀ, gives the data generating process y_n = H w_n + η_n at the observation times, with noise η_n ∼ N(0, σ² I_{n_y}), σ = 0.01.
The posterior p(w_n | y_{1:n}, θ, σ, Λ) = N(m_n, L_n L_nᵀ) is computed with the LR-ExKF, using k = k′ = 32 modes for both the state covariance, and for the covariances of the GPs, ξ_u and ξ_v. More than 99% of the variance is retained in the variance truncation at each timestep. The resulting posterior means m_n^u, m_n^v, and 95% posterior credible intervals are shown for both the observation times and five hours after the observation times, in Figure 1. For the v component, there is little discrepancy between the data and the prior assumed model; for the u component, however, there is some degree of model mismatch, which conditioning on data can partially account for. The posterior means for each component across the entire space-time grid are also plotted in Figure 2, and demonstrate the immediate effects of conditioning on data at the times at which these data are observed.
The low- and full-rank ExKFs are now compared through the posterior means and variances, in terms of the relative error, with the norm ‖·‖ taken as the standard Euclidean norm (the ℓ²-norm). Shown in Figure 3, for a fixed number of modes (k = k′ = 32), the relative errors are small, at approximately 10⁻⁷ for the posterior mean, and 10⁻⁵ for the posterior variance. Sharp increases are observed at the times at which data are observed, in both the posterior mean and the posterior variance. The relative error at the end time of the simulation is also shown in Figure 4, as both k and k′ are increased. In either case, the errors are computed over {4, 8, 16, 32, 48, 64} modes, with either k or k′ fixed to 32. In each case, as the variable number of modes increases the error decreases, with minor gains in the final error observed when going from 32 to 64 modes.
As k is increased (Figure 4a), past k = k′ = 32 there is a small increase in the accuracy of the filter, with a minor improvement from k = 32 to k = 48. This is thought to be due to the inclusion of information available from the data, not present in the prior alone. As k′ is increased (Figure 4b), past k′ = 32 there is visually no gain from adding additional modes in the prior, as these end up being truncated to k = 32. In both cases, increasing from 48 to 64 modes shows little decrease in the posterior mean and variance errors. These results suggest that there is an effective number of modes that captures the dominant variation; beyond this effective dimension, taking more modes does not yield significant gains in accuracy.
The low-rank approximation of the prior covariance matrix has a large effect on the accuracy of the estimated posterior covariance matrix. Recall that in the problem specification the uncertainty is induced via the additive GP, ξ, and no other sources of uncertainty are considered; if the covariance of ξ is approximated to a sufficient degree then the low-rank approximation of the filter is accurate.
This can be seen from the ExKF approximation of the prior measure, which for exposition is considered in the case in which G_θ is rank-deficient but is stored fully in memory. This gives the iterative approximation for the covariance matrix [2, Supplementary Information]

Ĉ_n = J_n C_{n−1} J_nᵀ + ∆_t G_θ,

where J_n is the Jacobian matrix of the FEM model with respect to u_n, evaluated at (m_n, m_{n−1}). Given the recursive nature of the computation and the fact that C_0 = 0 (due to the assumed initial conditions being exact), the prior covariance is a sum of matrix products with G_θ and the Jacobian matrices, at each timestep. The prior covariance will be accurate if the low-rank approximation of G_θ is accurate. Empirically, this is also seen for the posterior covariance. This phenomenon is analysed further in Appendix B, in which we investigate the errors on the mean and the variance as both k and σ are varied.

Mismatch via initial conditions: spiral regime
We now consider a two-dimensional example of misspecified initial conditions with the Oregonator [20,21], a coupled PDE system with state w = (u, v) ∈ ℝ². Adding stochastic forcing on the observed v-component gives the two-component stochastic system. The Oregonator has been well-studied since being derived as a simplified model for the chemical reaction kinetics of the Belousov-Zhabotinskii (BZ) reaction [20,21,42,43]. It is a classical example of an activator-inhibitor system, sharing similar behaviour with the FitzHugh-Nagumo model in certain parameter regimes [44]. We study the Oregonator in the excitable regime, setting f = 2, q = 0.002, and ε = 0.02. Diffusion constants are set to D_u = 1, D_v = 0.6. The GP ξ_v has the covariance kernel of Equation (2), setting θ = (ρ, ℓ) = (0.001, 5).
To set the initial conditions, we use the procedure of [43]. To induce some degree of mismatch, the initial conditions of the statFEM filtering posterior are set by pushing the initial condition of the truth through a heat equation with diffusion coefficient D_u = 1 (i.e. the Oregonator sans reaction terms), for t = 0.1, resulting in the amplitude of the initial conditions being dampened and blurred (see Figure 5).
The spatial discretisation uses the standard linear polynomial hat functions, over a regular mesh with 128 × 128 cells (16,641 nodes), so that the total state dimension is 33,282 DOFs. Due to the nonlinear terms dominating the dynamics, Crank-Nicolson is used for time discretisation, with timesteps ∆_t = 0.001. Data are observed on the slow v component, at 1041 locations inside the discretised domain Ω_h. The time between observations is 0.005 (5 timesteps). Data are sparse in space, with approximately 3% of the state dimension observed at observation times, and the excitable u component is only updated through the observed v values. As previously, we concatenate the discretised state vector into w_n = (u_n, v_n)ᵀ, so that the data generating process is y_n = H w_n + η_n, η_n ∼ N(0, σ² I_{n_y}). The measurement fidelity is σ = 0.01.
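For a concatenated state of this form, the observation operator H is simply a sparse selection matrix picking out the observed v-entries. The sketch below assumes a u-then-v block ordering of the state vector (actual FEM DOF orderings, e.g. in FEniCS, may differ) and random observation locations for illustration.

```python
import numpy as np
from scipy import sparse

def observation_operator(obs_idx, n_u):
    """Sparse selection matrix H mapping a state in R^{n_u} to observations."""
    n_y = len(obs_idx)
    return sparse.csr_matrix(
        (np.ones(n_y), (np.arange(n_y), obs_idx)), shape=(n_y, n_u))

# observe the v-component (second block of the concatenated state w = (u, v))
n_nodes = 16641
rng = np.random.default_rng(2)
v_obs = rng.choice(n_nodes, size=1041, replace=False)   # 1041 observed nodes
H = observation_operator(n_nodes + v_obs, 2 * n_nodes)  # offset into v block
w = np.arange(2 * n_nodes, dtype=float)                 # dummy state vector
obs = H @ w                                             # the observed entries
```

Because H is a selection matrix, products such as H L̂_n in the update step cost only O(n_y k) and never densify the state.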
We run the LR-ExKF to obtain the posterior p(w_n | y_{1:n}, θ, σ, Λ) = N(m_n, L_n L_nᵀ) for all n, using k = 250 (≈ 0.75% of the state dimension) modes to represent the state covariance L_n L_nᵀ, and k′ = 150 modes to approximate G_θ, the covariance matrix of ξ_v. When running the filter, more than 99% of the variance is retained at each truncation step.
To give a snapshot of results, the data, posterior mean, posterior variance, and leading-order modes are each shown in Figure 6, at time t = 5. The dominant region of variance appears to be at the spiral tip, with additional variation observed about the boundary of the spiral on the u component. This hierarchy is seen in both the colour intensities of the variance plots in Figure 6a and in the columns of L_n, in Figure 6b; for UQ, there is a hierarchy of variance regions of decreasing importance.
Performance of the filter is verified through computing the relative errors of the means m_n^u, m_n^v against the truth; these are shown in Figure 7a. After an initial period of disparity, the statFEM mean m_n closely tracks the truth, reaching a stable configuration after this warm-up period. Observations can thus correct for misspecification on the unobserved component, with small (O(10⁻²)) relative errors in the mean of the statFEM posterior. We also compute the effective rank D_eff [22,23] of the prediction covariance square root L̂_n, using the eigenvalues from the truncation step. We define the effective rank from the eigenvalues (the diagonal of Σ_n) ς_1 ≥ ς_2 ≥ ··· ≥ ς_k of the k × k matrix L̂_nᵀ L̂_n as D_eff = (Σ_i ς_i)² / Σ_i ς_i², which takes values D_eff ∈ [1, min{k, n_u}]. This measures the alignment of the columns of L̂_n [22] and can be used to diagnose problems (for example, filter collapse) or verify performance (for example, checking that k and k′ are not over- or under-specified). For this example, D_eff is plotted in Figure 7b, and appears stable after an initial drop, further suggesting that the filter has reached a stable configuration. The initial drop in the effective rank appears almost immediately after the filter is started, whereas for the relative errors this occurs at time t = 2. This suggests that reaching a stable configuration in the covariance is perhaps necessary before the same occurs in the mean. The estimated D_eff has a mean value of approximately D_eff ≈ 45, indicating that the choice of k = 250 is perhaps excessive for this example.
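The effective rank is cheap to compute from the small Gram matrix of the square root. The sketch below uses the (Σ ς)²/Σ ς² convention stated above; the two toy factors illustrate the extreme cases of fully aligned and fully orthogonal columns.

```python
import numpy as np

def effective_rank(L_hat):
    """Effective rank of C = L L^T from the eigenvalues of the small
    Gram matrix L^T L, using D_eff = (sum s)^2 / sum s^2."""
    s = np.linalg.eigvalsh(L_hat.T @ L_hat)
    s = np.clip(s, 0.0, None)        # guard against tiny negative eigenvalues
    return (s.sum()**2) / (s**2).sum()

# identical columns -> rank-1 covariance -> D_eff = 1
L1 = np.ones((100, 5))
# orthonormal columns -> D_eff equals the number of columns, here 5
L2 = np.zeros((100, 5))
L2[:5, :5] = np.eye(5)
```

A collapsing D_eff over time would indicate the columns of L̂_n aligning, the low-rank analogue of ensemble collapse.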

Mismatch via initial conditions: oscillatory regime
In the oscillatory regime, the Oregonator has f = 0.95, ε = 0.75, q = 0.002, and diffusion coefficients D_u = D_v = 0.001 [44]. Stochastic forcing is now included on the u component. The amplitude κ is set to 0.02 in our case studies, and the true initial conditions (u_{0,true}, v_{0,true}) are set from running a pilot simulation for 100,000 timesteps, where the pilot simulation has randomly generated initial conditions u, v ∼ Unif(0, 0.15). The upper and lower bounds of the uniform distribution are determined from the attractor of the corresponding Oregonator ODE. Both the perturbed and true initial conditions are shown in Figure 8. This initial condition is included in the accompanying GitHub repository.

As previously, the spatial discretisation uses linear polynomial basis functions, on a refined mesh with 256 × 256 cells. Crank-Nicolson is used for the time discretisation, with a timestep ∆_t = 10⁻², observing data y_n = H w_n + η_n at every timestep up to time t = 10. As previously, η_n ∼ N(0, σ² I_{n_y}), with σ = 10⁻², and the u-component is observed at each timestep, at 512 observation locations. The posterior p(w_n | y_{1:n}, θ, σ, Λ) = N(m_n, L_n L_nᵀ) is computed using the LR-ExKF with k = 128 and k′ = 64. The initial leading-order eigendecomposition of the GP covariance matrix G_θ is done using Lanczos iterations [29], as implemented in SciPy [39], with the matrix-vector products done on the GPU, implemented with KeOps [36].
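A matrix-free version of this Lanczos step can be sketched with SciPy's `eigsh` applied to a `LinearOperator`: the kernel matrix is never formed in full, and only matrix-vector products are supplied. The blockwise NumPy products below stand in for GPU kernel reductions (e.g. KeOps); the grid, length-scale, and mode count are illustrative only.

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, eigsh

# squared-exponential kernel on a 1D grid; parameters are illustrative
x = np.linspace(0, 1, 1000)
rho, ell = 1.0, 0.2

def matvec(v):
    # blockwise products: the full kernel matrix is never held in memory
    # (a stand-in for GPU kernel reductions, e.g. KeOps)
    out = np.empty(len(x))
    for i in range(0, len(x), 250):
        d2 = (x[i:i + 250, None] - x[None, :])**2
        out[i:i + 250] = (rho**2 * np.exp(-d2 / (2 * ell**2))) @ np.ravel(v)
    return out

G_op = LinearOperator((len(x), len(x)), matvec=matvec, dtype=np.float64)
vals, vecs = eigsh(G_op, k=32, which="LM")      # leading modes via Lanczos
L_G = vecs * np.sqrt(np.clip(vals, 0.0, None))  # rank-32 square root of G
```

Since only matvecs are needed, the same pattern scales to the 2D meshes used here by swapping in a GPU-backed kernel product.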
We compare three filters, each with the GP covariance kernel of Equation (2), which set ρ ∈ {10⁻², 10⁻³, 2 × 10⁻⁴}, with ℓ = 10 and σ = 10⁻² for each. In each filter more than 99% of the variance is retained at each truncation step. For the remainder of this discussion these are referred to as the large-ρ filter, the moderate-ρ filter, and the small-ρ filter, respectively.
Relative errors for the statFEM posterior means are shown alongside the statFEM prior mean in Figure 9a. The small-ρ filter shows the slowest misspecification correction compared to the others; the filter is more certain of the model predictions and takes longer than the others to reach a stable filtering configuration. It also results in the largest relative error by the end of the simulation. The moderate-ρ filter, however, despite slower initial misspecification correction, results in the lowest relative error by t = 10. The large-ρ filter displays the most rapid convergence, yet results in relative errors increasing after the initial correction. This is thought to be due to ρ being equal to the noise, ρ = σ = 10⁻², which results in assimilation of spurious noise perturbations. These results confirm the role of the ρ hyperparameter in specifying our UQ of the physical model: higher variance implies less certainty and more rapid corrections for model misspecification.

This trend of slower corrections, for more certain models, is also seen in the effective rank D_eff of the prediction covariance matrix L̂_n L̂_nᵀ, plotted for each of the filters in Figure 9b. In terms of D_eff, the large-ρ filter has the most rapid convergence to a stable filtering regime, with the small-ρ filter the slowest. Once in this stable filtering regime the small-ρ filter has a larger effective rank compared to the large-ρ filter, and it is posited that a higher effective rank implies a more complex covariance structure. To check this, the leading modes of the covariance matrix are plotted at the end time t = 10, in Figure 9c.
The results suggest this is the case: the modes for the small-ρ filter display more localised structures pertaining to the dynamics of the model, when compared with those of the large-ρ filter, which appear more similar to the eigenfunctions of the GP covariance kernel k_θ(·, ·).
This also suggests another interpretation of the variance hyperparameter ρ: that of controlling the weight of the a priori misspecification covariance matrix G_θ in comparison to the (tangent linear) dynamical evolution of the previous timestep covariance C_{n−1} = L_{n−1} L_{n−1}ᵀ. In this case, due to the complex spatial oscillations seen in the dynamical model, decreasing the weight of the simpler GP covariance allows the more complex dynamical interactions to be present in the posterior covariance, resulting in an increased D_eff.

Conclusions
We present a low-rank extended Kalman filter algorithm for the statistical finite element method that is highly scalable, representing the covariance matrix through its leading k modes. Owing to the rapid spectral decay of this posterior covariance matrix, the LR-ExKF is able to provide sensible and interpretable UQ, retaining, in the given examples, at least 99% of the variance at each timestep. Through the use of the Jacobian matrix (assembled from the Fréchet derivative of the weak form), the statFEM LR-ExKF is straightforward to implement with standard finite element libraries and has the potential for massive parallelisation in the bottleneck prediction step.
We demonstrate, using experimental and synthetic data, that statFEM is able to combine models and data in a coherent statistical framework, correcting for model misspecification in various forms. Results in 1D, with experimental data, show that the LR-ExKF can approximate the full-rank alternative, with small relative errors in both the posterior mean and variance. These results also confirm the efficacy of statFEM in the face of temporally sparse observation regimes. Results with the 2D Oregonator system demonstrate scalability, with the statFEM LR-ExKF correcting for model misspecification in the initial conditions, through observing a single component of the underlying data-generating process. These results are robust under different model parameter regimes, and under increasing the state dimension to n_u = 132,098. We also show that the effective rank of the covariance matrix can verify the number of modes taken in the low-rank approximation, providing an additional check of filter convergence.
This work provides a scalable statFEM that will enable the application of the methodology to domains across the physical sciences. Our method is similar to the EnKF in representing the covariance matrix through a low-rank approximation (cf. Appendix D), though by using the leading eigenvalues and eigenvectors we ensure that the UQ is empirically justified. An additional source of uncertainty not considered in this work is that of the hyperparameters, θ, which would ideally be marginalised over [45]. Investigating this estimation approach would be of interest.
Additional extensions to this work include verifying that the LR-ExKF produces accurate estimates of the true filtering posterior, and theoretical work to establish the accuracy and stability of the filter. Furthermore, when applying the method to arbitrary time-dependent PDEs, one may require bespoke timestepping schemes, going beyond the Crank-Nicolson scheme used in this paper. One may also need to constrain the timesteps to ensure that the tangent linear approximation to the covariance matrix does not induce too much error, or to control for possible filter divergence. An iterative extension may also be required for sufficiently nonlinear PDEs [46].

Relative errors in the posterior mean appear to increase more rapidly with decreasing σ than those in the variance. It is posited that this discrepancy arises because the posterior mean update uses the full nonlinear dynamics in the prediction step, whereas the covariance propagation uses the tangent linear dynamics. Two filters are run, with timesteps Δ_t = 10^{-2} and Δ_t = 10^{-4}, estimating the hyperparameters for each (each filter is run for 1000 timesteps). Note that this does not yield the same integration window, one being two orders of magnitude smaller than the other.
Data on the u component are observed at 512 locations at each timestep (locations shown in Figure C.11). The assumed data-generating process is y_n = H w_n + η_n, with η_n ∼ N(0, σ^2 I_{n_y}). Data are generated with σ = 0.01 and θ = (ρ, ℓ) = (10^{-3}, 10). Filtering is done using the LR-ExKF, with k = k' = 128 modes, to compute the posterior p(w_n | y_{1:n}, θ_{1:n}, σ_{1:n}, Λ) = N(m_n, L_n L_n^T). Each of the filters in this section retains at least 99% of the variance at each timestep. The leading eigenvalues of K_θ determine the accuracy of the low-rank approximation to G_θ. To avoid recomputing G_θ at each iteration, the length-scale is fixed at ℓ = 10 for all n, and we estimate ρ_n and σ_n at each n, assuming that ρ_n and σ_n are independent across timesteps. Incorporating more complex temporal structure on these hyperparameters is of interest and a possible avenue for future research. Priors are set to the weakly informative truncated Gaussian priors ρ_n ∼ N_+(1, 1^2) and σ_n ∼ N_+(0, 1^2).
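The assumed observation model y_n = H w_n + η_n can be sketched directly. Here H is assumed to be a simple node-selection matrix, whereas in practice it may interpolate between mesh nodes; the state dimension and stand-in state are illustrative, while n_y = 512 and σ = 0.01 follow the text:

```python
import numpy as np

# Illustrative observation model y_n = H w_n + eta_n, eta_n ~ N(0, sigma^2 I).
rng = np.random.default_rng(1)
n_u, n_y, sigma = 4096, 512, 0.01

# H selects n_y distinct mesh nodes from the state vector (an assumption;
# a general observation operator would interpolate between nodes).
obs_idx = rng.choice(n_u, size=n_y, replace=False)
H = np.zeros((n_y, n_u))
H[np.arange(n_y), obs_idx] = 1.0

w_n = np.sin(np.linspace(0.0, 2.0 * np.pi, n_u))    # stand-in state vector
y_n = H @ w_n + sigma * rng.standard_normal(n_y)    # noisy observations
```

Because H is a selection matrix, products such as H @ L_n in the filter update touch only the observed rows of the low-rank factor, which keeps the update step cheap.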
The hyperparameter estimates are shown in Figures C.13a and C.13b, and demonstrate that the noise is identified at each timestep. For the larger timestep Δ_t = 10^{-2} the hyperparameter ρ_n is poorly identified. Estimates appear to be contained within two point clouds, the topmost cloud being identified with the noise σ_n. This inaccuracy is thought to be due to the combination of the linearised dynamics for the covariance update and the low-rank approximation to the square root of G_θ. Note that the truncated Gaussian prior ρ_n ∼ N_+(1, 1^2) will also have an effect. This is confirmed by running the same filter with the smaller timestep Δ_t = 10^{-4}, thought to increase the accuracy of the linearised covariance prediction step. The filter then better identifies the scale hyperparameter ρ_n (see Figure C.13b). Some variation remains, however, which is thought to be due to the low-rank approximation to G_θ and the truncated Gaussian prior. For completeness we also plot the relative ℓ_2 errors in Figure C.13c and the effective rank of the covariance matrix in Figure C.13d. The variation of the effective rank appears to be due to the variation in the estimates of the hyperparameters (cf. the smooth effective-rank results seen in the main text for fixed hyperparameter values).

Appendix D. Catastrophic filter divergence in the spiral wave regime
In this case study we show that the LR-ExKF can suffer catastrophic filter divergence, as has also been observed with the EnKF [22,47]. Catastrophic filter divergence occurs when the posterior mean estimate m_n diverges to machine infinity in finite time. Previous studies are contextually similar, with EnKF divergence occurring in sparsely observed, dissipative, nonlinear systems with small noise. The mechanism of the divergence is the use of an unstable time-integration scheme [22], which we verify.
We simulate data according to a deterministic Oregonator in the spiral wave regime (recall f = 2, q = 0.002, and ε = 0.02). These data are observed at every timestep, on the slow v component, across 512 observation locations. The initial conditions are the same as those in the second case study in the main text. Additive Gaussian noise is included, so that y_n = H w_n + η_n, with η_n ∼ N(0, 10^{-4} I_{n_y}). There is no model mismatch between the statFEM model and the underlying data-generating process for y_n; the only difference is that the statFEM model includes the stochastic process ξ_v on the component v, which has the covariance kernel of Equation (2). GP hyperparameters are fixed, with θ = (ρ, ℓ) = (10^{-3}, 10). For the numerics, both models use an FEM mesh with 128 × 128 cells, and the timestep size is Δ_t = 10^{-3}.
Filtering is done using the LR-ExKF, with k = 512 and k' = 128 modes, to compute the posterior p(w_n | y_{1:n}, θ, σ, Λ) = N(m_n, L_n L_n^T). Each filter retains more than 99% of the variance at each timestep. We say that the filter has diverged if any element u_{n,i} ≥ 10^4 for i = 1, ..., n_u. Two separate filters are run: one with the IMEX scheme of Appendix C, and the other with the Crank-Nicolson (CN) scheme.
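The divergence criterion above amounts to a thresholded check on the state vector. A minimal sketch (the 10^4 threshold follows the text; taking absolute values, so that divergence towards −∞ is also caught, is our assumption):

```python
import numpy as np

def has_diverged(u_n, threshold=1e4):
    """Flag catastrophic filter divergence: any state entry exceeding threshold.

    The absolute value is an assumption, covering divergence in either sign;
    the text states the criterion as u_{n,i} >= 1e4.
    """
    return bool(np.any(np.abs(u_n) >= threshold))

print(has_diverged(np.array([0.3, -2.0e5, 1.1])))  # -> True
print(has_diverged(np.zeros(4)))                    # -> False
```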
The IMEX filter diverges to machine infinity after 23 timesteps (see Figure D.14, top). The proposed mechanism of this divergence is that the posterior estimates of the mean do not accord with the underlying attractor, resulting in stiffness in the underlying dynamical model [22]. The IMEX scheme is unable to resolve this stiffness, and the filter diverges. In [22] this is accompanied by the effective rank of the covariance matrix reducing to one, which gives spurious correlations and thus poor posterior estimates in the update step. This is observed in our scenario too, with the effective rank dropping to near unity by the divergent timestep n = 23. Note also the resemblance to particle filter degeneracy, which results in the collapse of the particle weights to a Dirac measure.
As in the examples in the main text, the behaviour of the relative norm "lags" the behaviour of the effective rank: the effective rank drops sharply whilst the relative norms of the IMEX and CN filters remain visually indistinguishable. Only after the effective rank drops to near unity is the divergence seen, a reminder of the influence of the posterior covariance matrix C_n on posterior estimates of the mean m_n. Changing the time integrator to CN avoids catastrophic filter divergence, and the relative ℓ_2 norm and D_eff appear to asymptotically approach a limiting value after an initial increase (see Figure D.14). Note also the smoothness of the effective rank D_eff, resulting from the fixed choice of hyperparameter ρ.
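The stability difference between the two integrators can be illustrated on the stiff scalar test equation dw/dt = λw with λ ≪ 0 (a standard toy problem, not the Oregonator itself): an explicit Euler step, the unstable ingredient of an IMEX split, blows up once |1 + λΔ_t| > 1, while the A-stable Crank-Nicolson update remains bounded for any Δ_t:

```python
# Stiff test equation dw/dt = lam * w, lam << 0, solved with a timestep
# far beyond the explicit stability limit dt < 2/|lam|.
lam, dt, steps = -1.0e3, 1.0e-2, 50

w_euler = 1.0
w_cn = 1.0
for _ in range(steps):
    w_euler = (1.0 + lam * dt) * w_euler  # forward Euler: factor -9 per step
    # Crank-Nicolson: amplification factor (1 + lam*dt/2)/(1 - lam*dt/2),
    # which has modulus < 1 for all dt when Re(lam) < 0 (A-stability).
    w_cn = (1.0 + 0.5 * lam * dt) / (1.0 - 0.5 * lam * dt) * w_cn

print(abs(w_euler) > 1e4)  # -> True: the explicit update has blown up
print(abs(w_cn) <= 1.0)    # -> True: CN decays towards zero, as the ODE does
```

This mirrors the observed behaviour: when the posterior mean leaves the attractor and the dynamics stiffen, the partially explicit IMEX update can amplify the state past the divergence threshold, whereas the fully implicit CN update stays bounded.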