Framework for the comparison of a priori and a posteriori error variance estimation and tuning schemes

The performance of an assimilation system is strongly dependent on the quality of the error statistics used. A number of error statistics estimation and tuning methods have previously been developed to better assess and determine these statistics. Many of these are a posteriori methods which make use of quantities calculated during the assimilation procedure, while other a priori methods do not require information from the assimilation. In this study, we develop a conceptual framework that relates these methods when applied to error variance determination, where each method is associated with the minimization of a particular cost function. The minimization of these cost functions describes a fitting procedure that fits parts of the prescribed modelled innovation covariance to its observed values. Each method must in some way separate the innovation covariance into its contributions from the background and the observations, which are then used in the fitting procedure. It is shown that the examined a posteriori methods use the analysis filter to make this separation and that the minimization of their associated cost functions is done implicitly within the tuning procedure. Analytical expressions for the expectation value and variance of estimates for error variance scaling parameters are determined for each method. The expressions for the expectation values of these estimates show that the accuracy of each method is dependent on its ability to separate the background from the observation contributions to the innovation covariance. This separability is quantified by use of the Frobenius inner product between the background- and observation-error covariances, which additionally allows for geometric interpretations of the covariances to be made. Comparisons between variance parameter estimates from different methods are made for


INTRODUCTION
Data assimilation is the process of statistically blending information from observations with a model and is widely used in atmospheric sciences, particularly for numerical weather forecasting. The degree of influence the observations have on the blended model state, known as the analysis, is dictated by the error covariances assigned to the observations and prior model state in the assimilation process. The analysis increment (the quantity added to the background state to form the analysis) is strongly influenced by the background-error covariance that describes the prior model uncertainty; it spatially spreads out information contained in the observations and, for multivariate assimilation, can impose balance between different variables (Bannister 2008 gives a review). As the analysis increment lies in the subspace spanned by the background-error covariance matrix (Bannister, 2008), accurately specifying the background-error covariance is of particular importance in data assimilation. Background-error covariances are influenced by, for example, the growth of errors in the forecasting model and, in an assimilation cycle, are also influenced by past observation accuracies and their spatial and temporal distributions. In addition to instrument uncertainties, the observation-error covariances include representativeness errors and errors from imperfect observation operators (Janjić et al., 2018). The performance of an assimilation system depends on how accurately these error covariances are specified (Daley, 1992). In practice, obtaining covariances that accurately describe these uncertainties can be difficult and many simplifying assumptions are often made (Rutherford, 1972; Parrish and Derber, 1992; Evensen, 1994; Polavarapu et al., 2005). As such, methods to diagnose and tune error statistics are of significant interest.
Methods that use the statistics of assimilation residuals fall into two basic types: a priori diagnostic methods that use observation-minus-background residuals (known as innovations when the observations are assimilated) and a posteriori diagnostics that require analyses.
The method of Hollingsworth and Lönnberg (1986) (HL; actually first introduced in Rutherford 1972) is a well-known a priori method that consists of calculating the autocorrelation statistics of observation-minus-background residuals as a function of the separation distance between a pair of observation sites. In this method, assuming that the observation errors are spatially uncorrelated (or have a small length-scale in comparison to the spatial correlations of the background errors) allows one to distinguish and estimate the statistics of the observation and background errors. The HL method has been successfully extended to estimate multichannel observation-error covariances (Garand et al., 2007), temporal correlations of observation errors (Macpherson and Laroche, 2019), and to incorporate spatial observation-error correlations (Eresmaa and Järvinen, 2005). In the context of atmospheric chemistry, chemical error statistics have been obtained by using the HL method with a single polar-orbiting satellite by using the distance between consecutive measurements along the satellite track (Ménard et al., 2019).
The maximum likelihood method is another a priori method that involves the minimization of a cost function in observation space. This method requires knowledge of the probability distribution of the innovations (usually assumed to be normally distributed). While the maximum likelihood method can easily become computationally expensive, it has been used in meteorology (Tangborn et al., 2002) and in air quality. The maximum likelihood method has been used to estimate error variances, background and observation correlation length-scales, and cross-covariances between observation and background errors (Tangborn et al., 2002). A priori error statistics are also used in consistency diagnostics, such as the χ² diagnostic (Bennett, 1992; Ménard and Chang, 2000) and the variance and covariance consistency diagnostics (Ménard, 2016).
The development of a posteriori methods began with the so-called J min consistency diagnostic introduced in Tarantola (1987) and Talagrand (1999). Desroziers and Ivanov (2001) (DI01) adapted this method to iteratively estimate covariance scaling parameters for both the background and observation covariances in a variational assimilation system. Chapnik et al. (2004) extended the DI01 method to diagnose scaling parameters for subsets of the error covariances. Chapnik et al. (2004) also showed that the DI01 method is equivalent to maximum likelihood estimation for normally distributed innovations. The DI01 method was tested with an operational system in Chapnik et al. (2006), but the computational cost of carrying out the estimations remained high.
Another a posteriori method was developed by Desroziers et al. (2005) (D05) based on statistical properties of analysis residuals. As the D05 method is often easier to apply for high-dimensional problems, this method gained popularity in numerical weather prediction, especially for the estimation of satellite observation-error covariances (Stewart et al., 2014; Weston et al., 2014; Todling, 2015; Gauthier et al., 2018). The D05 method has also been used to estimate model-error variances in ensemble Kalman filtering (Li et al., 2009), Doppler radar radial wind error correlations (Waller et al., 2016b), spatial and temporal observation-error correlations for ground-based Global Navigation Satellite System (GNSS) Zenith Total Delay observations (Bennitt et al., 2017; Macpherson and Laroche, 2019), and observation errors in multi-species atmospheric chemical constituent assimilations (Skachko et al., 2016). Although devised as an iterative algorithm, most operational applications of the D05 algorithm apply only one iteration due to the added computational cost of computing multiple analyses.
Properties of the D05 algorithm were first derived in the context of a simplified observation network by Desroziers et al. (2005). Waller et al. (2016a) examined how mis-specifications of the initial error variances and correlation lengths influence the D05 estimates of observation errors in a simplified observation network. Ménard (2016) considered the iterative estimation of both the observation- and background-error covariances in a simplified observing network and showed that convergence to the true error covariances in the D05 scheme is possible if the error correlations are correctly specified. Ménard (2016) also established that estimating the observation-error covariances alone requires that the background-error covariance be correctly specified, which was later shown by Bathmann (2018) to be a general conclusion and not restricted to the simplified observation network. Similarly, Gauthier et al. (2018) concluded that the iterative estimation of the observation-error covariance using the D05 method requires a good estimation of the background-error covariance.
While a number of the studies referenced above have examined the behaviour of the D05 algorithm, there have been relatively few comparisons between the D05, DI01, and HL methods. In one of the few studies that compare the D05 and HL methods, Macpherson and Laroche (2019) found that the two methods produced significantly different results for the temporal correlations of GNSS Zenith Total Delay observation errors. The lack of a common mathematical formalism contributes to the difficulty of comparing these methods. One of the objectives of this study is to develop such a formalism. This unifying framework will allow for more straightforward comparisons between algorithms. Furthermore, we will develop a geometrical interpretation of the error covariances which will allow us to interpret these algorithms geometrically as well as provide a simple set of descriptive quantities that summarize the error statistics used in assimilation. We will also extend these interpretations to consistency diagnostics such as the χ² diagnostic. Beyond the new framework developed to compare methods, the statistical properties of the algorithms will be presented and compared to one another. In this work, we restrict our analysis to the estimation of error variances for simplicity. Accordingly, this assumes that the error correlations have already been adequately specified.
We note in passing that methods other than those discussed above have been developed to estimate forecast-error statistics in model space, such as Peña and Toth (2014) and Feng et al. (2020), and to estimate forecast sensitivities to error statistics parameters such as Langland and Baker (2004), Daescu and Todling (2010), Daescu and Langland (2013), and Hotta et al. (2017).
This paper is organized as follows. A brief review of assimilation is given in Section 2. Section 3 discusses the role of filtering in assimilation and Section 4 introduces the Frobenius inner product, which will be used frequently in this paper. Sections 5, 6, and 7 examine the DI01, D05, and HL methods, respectively, and cast each method into a form that allows for a direct comparison between methods. The geometric interpretation of covariances will be detailed in Section 8, while some of the statistical properties of each algorithm are examined in Section 9. Results of the DI01, D05, and HL methods for the case of homogeneous covariances defined on a 1D periodic domain are discussed in Section 10. Section 11 further discusses the theoretical differences between the DI01 and D05 algorithms and conclusions are given in Section 12.

PRELIMINARIES
Before examining assimilation error statistics more closely, we give a brief review of the analysis procedure. The impact the observations have on the analysis is set by specifying a background-error covariance B̃ and an observation-error covariance R̃, which in general differ from their true values B^t and R^t, respectively. Here and throughout this paper, prescribed (or modelled) quantities used in assimilation are denoted with a tilde (∼) while their true values are denoted with the superscript ( )^t. Observation-minus-background residuals (innovations) are often key quantities when assessing the error covariances used in assimilation. The innovation vector d is given by d = y − H(x^b), where y is the vector of observations, x^b is the background (or trial) model field, and H is the (in general) nonlinear observation operator. The innovation vector can be approximated to linear order as d ≈ ε^o − Hε^b, where ε^o and ε^b are the observation and background errors, respectively, and H is the linearized observation operator. In this approximation, the innovation involves both the background and observation errors without reference to the true state of either variable, but only provides a sum of these errors and not each error individually. If there is no correlation between the background and observation errors (Rutherford, 1972; Daley, 1991), the modelled innovation covariance is given by D̃ = HB̃H^T + R̃. Note that, for the case of a highly nonlinear observation operator, where the background-error covariance in observation space is not well approximated by the linear form HB̃H^T, it can instead be approximated using higher-order Taylor expansions, Monte Carlo methods, or other means (Julier and Uhlmann, 2004), though for the rest of the paper we assume that the linear approximation is adequate.
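As a quick numerical illustration of the additive innovation model: with uncorrelated background and observation errors, the sample covariance of d approaches HBH^T + R. A minimal numpy sketch with made-up H, B, and R (illustrative values, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical small setup: 3 model grid points, 2 observations.
n, p = 3, 2
H = np.array([[1.0, 0.0, 0.0],
              [0.0, 0.5, 0.5]])          # linear(ized) observation operator
B = np.array([[1.0, 0.5, 0.2],
              [0.5, 1.0, 0.5],
              [0.2, 0.5, 1.0]])          # "true" background-error covariance
R = 0.25 * np.eye(p)                     # "true" observation-error covariance

# Sample uncorrelated background and observation errors and form
# innovations d = eps_o - H eps_b; their sample covariance should
# approach the innovation covariance D = H B H^T + R.
N = 200_000
eps_b = rng.multivariate_normal(np.zeros(n), B, size=N)
eps_o = rng.multivariate_normal(np.zeros(p), R, size=N)
d = eps_o - eps_b @ H.T

D_sample = d.T @ d / N
D_model = H @ B @ H.T + R
print(np.max(np.abs(D_sample - D_model)))  # small for large N
```

The agreement improves as N grows, which is the statistical basis for all of the innovation-based diagnostics discussed below.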
The analysis x^a, as given by the best linear unbiased estimate (BLUE), can be expressed as

x^a = x^b + δx^a,   δx^a = K̃d,   (1)

where δx^a is the analysis increment and K̃ = B̃H^T(HB̃H^T + R̃)^{-1} is the Kalman gain. If the observation operator is linear and the background and observation errors are normally distributed, the BLUE analysis also minimizes the variational cost function J = J_b + J_o (Lorenc, 1986), where the cost functions J_b and J_o measure the mismatches between a field and the background and observations, respectively. Evaluated at the BLUE solution given in Equation (1), these cost functions can be expressed as

J_b(x^a) = (1/2) tr(D̃^{-1} HB̃H^T D̃^{-1} D),
J_o(x^a) = (1/2) tr(D̃^{-1} R̃ D̃^{-1} D),

where we have used the cyclic permutation property of the trace operator and have defined D = dd^T. In this paper, we assume that the observed innovation vector d is drawn from a (not necessarily normal) distribution with zero mean and so will refer to D as the observed innovation covariance.
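The BLUE update and the trace forms of the cost functions can be checked numerically. A small numpy sketch, assuming the conventional factor of 1/2 in J_b and J_o and using illustrative matrices:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical small setup (all quantities "prescribed"/modelled here).
n, p = 3, 2
H = np.array([[1.0, 0.0, 0.0],
              [0.0, 0.5, 0.5]])
B = np.array([[1.0, 0.5, 0.2],
              [0.5, 1.0, 0.5],
              [0.2, 0.5, 1.0]])
R = 0.25 * np.eye(p)

D_mod = H @ B @ H.T + R                   # modelled innovation covariance
K = B @ H.T @ np.linalg.inv(D_mod)        # Kalman gain
x_b = rng.standard_normal(n)
y = rng.standard_normal(p)
d = y - H @ x_b                           # innovation
x_a = x_b + K @ d                         # BLUE analysis

# Background and observation cost functions evaluated at the analysis
# (with the conventional factor of 1/2).
Jb = 0.5 * (x_a - x_b) @ np.linalg.solve(B, x_a - x_b)
Jo = 0.5 * (y - H @ x_a) @ np.linalg.solve(R, y - H @ x_a)

# Equivalent trace forms with D = d d^T, using the cyclic trace property:
Dinv = np.linalg.inv(D_mod)
D_obs = np.outer(d, d)
Jb_trace = 0.5 * np.trace(Dinv @ H @ B @ H.T @ Dinv @ D_obs)
Jo_trace = 0.5 * np.trace(Dinv @ R @ Dinv @ D_obs)
print(np.isclose(Jb, Jb_trace), np.isclose(Jo, Jo_trace))
```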

SPECTRA AND FILTERING OF ERROR COVARIANCES
At many points in this paper, it will prove useful to transform to a basis that simultaneously diagonalizes HB̃H^T and R̃, which can be done for any pair of covariance matrices. The standard method for doing so constructs an operator S that transforms HB̃H^T into the identity matrix and transforms R̃ into a diagonal matrix, which we denote by Λ (see section 12.4 of Noble 1969, or Hollingsworth 1986 in the context of data assimilation), so that

S^T (HB̃H^T) S = I,   (3a)
S^T R̃ S = Λ.   (3b)

The diagonal elements of Λ then represent the ratios of the observation- to background-error variances in the simultaneously diagonalized basis. The full construction of the operator S is described in Appendix A. Looked at in another light, this transformation solves the generalized eigenvalue problem R̃S = (HB̃H^T)SΛ, so that S and Λ are the matrices of the generalized eigenvectors and eigenvalues, respectively. Note that this decomposition is general: it only assumes that HB̃H^T and R̃ are positive-definite matrices and does not assume that they represent homogeneous or isotropic covariances, or that the observations are uniformly spaced. In the special case where HB̃H^T and R̃ are defined on a uniformly spaced periodic 1D domain, the simultaneous eigenvectors are the Fourier modes. In the general case, the simultaneous eigenvalues of HB̃H^T and R̃ are referred to as the "spectra" of these matrices, even when this decomposition does not coincide with the Fourier or spherical harmonic decompositions.
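One standard route to S is to whiten with a Cholesky factor of HB̃H^T and then diagonalize R̃ in the whitened space; the paper's Appendix A may proceed differently, so the following is only a sketch of one common construction, on illustrative matrices:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical SPD matrices standing in for H B~ H^T and R~.
p = 4
A = rng.standard_normal((p, p)); HBH = A @ A.T + p * np.eye(p)
A = rng.standard_normal((p, p)); R = A @ A.T + p * np.eye(p)

# Whiten HBH^T with its Cholesky factor L, then diagonalize R in the
# whitened space; S then satisfies S^T HBH S = I and S^T R S = Lam.
L = np.linalg.cholesky(HBH)
Linv = np.linalg.inv(L)
lam, V = np.linalg.eigh(Linv @ R @ Linv.T)
S = Linv.T @ V
Lam = np.diag(lam)

print(np.allclose(S.T @ HBH @ S, np.eye(p)),
      np.allclose(S.T @ R @ S, Lam))

# Equivalently, S and Lam solve the generalized eigenvalue problem
# R S = (HBH^T) S Lam:
print(np.allclose(R @ S, HBH @ S @ Lam))
```

The diagonal of `Lam` gives the observation-to-background variance ratios in the simultaneously diagonalized basis.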
A well-known property of the analysis process is its ability to filter noise from observations. Defining the filter operator F̃ ≡ HK̃, for the case of a linear observation operator we can write the analysis in observation space as

Hx^a = F̃y + (I − F̃)Hx^b.   (4)

Equation (4) shows that, in the linear case, the analysis in observation space is obtained by operating with F̃ on the observation vector and with I − F̃ on the background state (in observation space). In the context of remote sensing, the operator F̃ is related to the averaging kernel A = K̃H by HA = F̃H. The operators F̃ and I − F̃ can be expressed in terms of the simultaneous eigendecomposition of HB̃H^T and R̃ via Equations (3a) and (3b) as

F̃ = S^{-T}(I + Λ)^{-1}S^T,   (5a)
I − F̃ = S^{-T}Λ(I + Λ)^{-1}S^T.   (5b)

As noted in Hollingsworth (1986), it is apparent that F̃ is a filter that dampens modes prominent in R̃ as compared to HB̃H^T, while I − F̃ dampens modes prominent in HB̃H^T as compared to R̃; modes where the simultaneous eigenvalues of HB̃H^T are significantly larger than those of R̃ correspond to elements in Λ that are much smaller than unity, so the corresponding elements in (I + Λ)^{-1} and Λ(I + Λ)^{-1} will be close to unity and zero, respectively (the bottom plots of Figure 1 give an illustration). Error statistics tuning algorithms that make use of innovation statistics (either directly or indirectly via assimilation) must in some way differentiate which parts of the innovation statistics can be attributed to the background and which to the observations. When explicitly demonstrating how the DI01 and D05 algorithms make this separation, it will prove useful to introduce the following notation related to the filter F̃ used to generate the analysis. For a matrix A, we define the matrices A_B and A_R as the matrix A filtered by F̃ and I − F̃, respectively, so that

A_B = F̃AF̃^T,   (6a)
A_R = (I − F̃)A(I − F̃)^T.   (6b)

The subscripts "B" and "R" on these quantities signify that A_B only retains the modes in A that are prominent in HB̃H^T and A_R only retains the modes in A that are prominent in R̃ (tildes on the subscripts are implied but not explicitly written).
With this, we define the filtered observed innovation covariances D_B and D_R as

D_B = F̃DF̃^T,   (7a)
D_R = (I − F̃)D(I − F̃)^T,   (7b)

and their modelled analogues D̃_B and D̃_R as

D̃_B = F̃D̃F̃^T,   (8a)
D̃_R = (I − F̃)D̃(I − F̃)^T.   (8b)

Filtering the innovation covariance using F̃ and I − F̃ offers a means to isolate the contribution of the background from that of the observations (Figure 6 below shows an example of the background-filtered innovation covariance). For the BLUE analysis, the analysis-minus-background difference in observation space, which approximately equals the analysis increment in observation space, is given by H(x^a) − H(x^b) ≈ Hδx^a = F̃d. Similarly, the observation-minus-analysis difference, which we can also refer to as the a posteriori observation residual, is given by y − H(x^a) ≈ d − Hδx^a = (I − F̃)d. The filtered observed innovation covariances D_B and D_R can then be expressed in terms of these differences as

D_B = (F̃d)(F̃d)^T ≈ (H(x^a) − H(x^b))(H(x^a) − H(x^b))^T,
D_R = ((I − F̃)d)((I − F̃)d)^T ≈ (y − H(x^a))(y − H(x^a))^T.

As such, we may refer to D_B and D_R as the analysis-increment covariance and the a posteriori observation-residual covariance, respectively.
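A short numpy sketch of the filter and the filtered innovation covariances, reusing illustrative matrices (not the paper's setup):

```python
import numpy as np

rng = np.random.default_rng(3)

# Small prescribed setup (illustrative values only).
n, p = 3, 2
H = np.array([[1.0, 0.0, 0.0],
              [0.0, 0.5, 0.5]])
B = np.array([[1.0, 0.5, 0.2],
              [0.5, 1.0, 0.5],
              [0.2, 0.5, 1.0]])
R = 0.25 * np.eye(p)

D_mod = H @ B @ H.T + R
K = B @ H.T @ np.linalg.inv(D_mod)
F = H @ K                                  # analysis filter in obs space

d = rng.standard_normal(p)                 # an innovation vector
dx_a = K @ d                               # analysis increment

# The filter splits the innovation into the analysis increment (in
# observation space) and the a posteriori observation residual.
print(np.allclose(F @ d, H @ dx_a),
      np.allclose((np.eye(p) - F) @ d, d - H @ dx_a))

# Filtered observed innovation covariances (with D = d d^T):
D_obs = np.outer(d, d)
D_B = F @ D_obs @ F.T                                # increment covariance
D_R = (np.eye(p) - F) @ D_obs @ (np.eye(p) - F).T    # residual covariance
print(np.allclose(D_B, np.outer(F @ d, F @ d)))
```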

THE FROBENIUS INNER PRODUCT
Another important concept, which we will use extensively in this paper, is the Frobenius inner product. An inner product space associates a scalar quantity to a pair of vectors defined in a vector space; the most common inner product is the ubiquitous dot product. This concept can be extended to matrices by use of the vectorization operator vec(⋅), which converts a matrix into a column vector by stacking the columns of the matrix on top of one another. The Frobenius inner product between the real-valued matrices A and B, denoted by ⟨A, B⟩, is defined as the dot product between the vectorizations of the matrices, so that

⟨A, B⟩ = vec(A)^T vec(B) = tr(A^T B).

In this way, the Frobenius inner product uses the trace operator to map two matrices into a scalar. The Frobenius product of a matrix with itself defines the square of the Frobenius norm, which for a matrix A is denoted by ||A|| = ⟨A, A⟩^{1/2}. As a norm, the Frobenius norm is always greater than or equal to zero. As with other inner products, we can extend the definition of the Frobenius inner product to include a weighting. Although we may weight the inner product by any positive definite matrix, for this work we restrict ourselves to weightings that can be written in the form W_1 ⊗ W_2, where W_1 and W_2 are (symmetric) positive definite matrices and ⊗ denotes the Kronecker product. The weighted Frobenius inner product is then given by

⟨A, B⟩_{W_1 ⊗ W_2} = vec(A)^T (W_1 ⊗ W_2) vec(B) = tr(A^T W_2 B W_1) = ⟨W_2^{1/2} A W_1^{1/2}, W_2^{1/2} B W_1^{1/2}⟩,   (11)

where in the second equality we have made use of Equation (B9) from Appendix B. We direct the reader to Appendix B for a review of the Kronecker product, the vectorization operator, and some related identities. In the last equality of Equation (11), we see that, by choosing a weighting of the form W_1 ⊗ W_2, the weighted Frobenius inner product can be written in terms of the unweighted product by weighting each factor in the product separately.
The weighted Frobenius norm of a matrix A weighted by W_1 ⊗ W_2 is given by

||A||_{W_1 ⊗ W_2} = ⟨A, A⟩_{W_1 ⊗ W_2}^{1/2}.

As with the Euclidean inner product, we can decompose the (weighted) Frobenius inner product into norms and an angle defined with respect to the Frobenius inner product. The angle θ_{A,B} between matrices A and B with respect to the weighted Frobenius inner product is defined by

cos θ_{A,B} = ⟨A, B⟩_{W_1 ⊗ W_2} / ( ||A||_{W_1 ⊗ W_2} ||B||_{W_1 ⊗ W_2} ).

In this paper both inner products and expectation values will be discussed. As some references use the angled brackets ⟨⋅⟩ to denote the expectation value, we use the notation E[⋅] instead to denote the expectation value and caution the reader not to conflate these two operations.
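The trace form, the Kronecker-weighted form, and the induced angle can be checked directly with numpy (vec is column-stacking, i.e. `flatten(order="F")`; all matrices are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)

A = rng.standard_normal((3, 3))
Bm = rng.standard_normal((3, 3))

# Frobenius inner product: dot product of the vectorizations, equal to
# a trace.  vec stacks columns, hence flatten(order="F").
ip_vec = A.flatten(order="F") @ Bm.flatten(order="F")
ip_tr = np.trace(A.T @ Bm)
print(np.isclose(ip_vec, ip_tr))

# Weighted version with weighting W1 (x) W2 (Kronecker product), using
# the identity (W1 (x) W2) vec(B) = vec(W2 B W1^T):
W = rng.standard_normal((3, 3)); W1 = W @ W.T + 3 * np.eye(3)
W = rng.standard_normal((3, 3)); W2 = W @ W.T + 3 * np.eye(3)
lhs = A.flatten(order="F") @ np.kron(W1, W2) @ Bm.flatten(order="F")
rhs = np.trace(A.T @ W2 @ Bm @ W1)
print(np.isclose(lhs, rhs))

# Angle between A and B with respect to the (unweighted) inner product;
# np.linalg.norm defaults to the Frobenius norm for matrices.
cos_theta = ip_tr / (np.linalg.norm(A) * np.linalg.norm(Bm))
```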
We now proceed to discuss three error estimation methods in Sections 5, 6, and 7, and reformulate them in terms of minimization of cost functions.

THE DESROZIERS AND IVANOV (2001) METHOD
The first error covariance tuning method we will examine is that of Desroziers and Ivanov (2001) (DI01). While the DI01 method can be difficult to implement in practice due to the need to evaluate the trace of a large non-diagonal matrix (see Chapnik et al., 2006, for possible operational implementations), the insights gained by examining the DI01 method will carry over to the other methods examined in this paper.

Description of the algorithm
In this section, we give a brief review of the DI01 algorithm. The DI01 algorithm is rooted in a comparison of the expectation values of the background and observation cost functions evaluated at the analysis with their actual values as calculated within an assimilation. The expectation values of J_b and J_o evaluated at the BLUE analysis can be found via Equation (B11) of Appendix B as

E[J_b(x^a)] = (1/2) tr(D̃^{-1} HB̃H^T D̃^{-1} E[D]) → (1/2) tr(HK̃),   (13a)
E[J_o(x^a)] = (1/2) tr(D̃^{-1} R̃ D̃^{-1} E[D]) → (1/2) (N_obs − tr(HK̃)),   (13b)

where N_obs is the number of observations. As noted in Talagrand (1999, 2010), these expectation values simplify considerably if the modelled innovation covariance D̃ equals the expectation value of D, a case known as innovation covariance consistency (Ménard, 2016); the simplified forms are shown in Equations (13a) and (13b) to the right of the arrows. Note that the rightmost expressions in Equations (13a) and (13b) depend on the modelled quantity HK̃ as opposed to its true value. In practice, diagnostic relations are found by replacing the expectation values E[J_b(x^a)] and E[J_o(x^a)] in Equations (13a) and (13b) with their actually observed values. These diagnostic relations provide only a consistency check on the error statistics, not yet an algorithm by which the error statistics can be adjusted. The DI01 method establishes an iterative tuning method by assigning an iteration number to each modelled value (i.e., each term with a tilde), where the subscript i denotes values for the ith iteration of the algorithm. The error covariances are then parametrized through two scaling parameters s_b and s_o, in which the error covariances B̃_{i+1} and R̃_{i+1} at iteration i + 1 are computed from their values at the previous iteration as

B̃_{i+1} = (s_b^{DI01})_i B̃_i,   (14a)
R̃_{i+1} = (s_o^{DI01})_i R̃_i.   (14b)

The scaling parameters can then be solved for as

(s_b^{DI01})_i = 2 J_b(x^a) / tr(HK̃_i) = tr(D̃_i^{-1} HB̃_iH^T D̃_i^{-1} D) / tr(D̃_i^{-1} HB̃_iH^T),   (15a)
(s_o^{DI01})_i = 2 J_o(x^a) / (N_obs − tr(HK̃_i)) = tr(D̃_i^{-1} R̃_i D̃_i^{-1} D) / tr(D̃_i^{-1} R̃_i),   (15b)

and the initial scaling values of (s_b^{DI01})_{i=0} = 1 and (s_o^{DI01})_{i=0} = 1 are chosen.
The DI01 algorithm was designed for use within a variational assimilation system, where the numerators of the scaling parameters in Equations (15a) and (15b) would have already been calculated in the cost function computation required for a variational system. Additional computations must be done for the denominators which, for example, can be calculated by means of a Monte Carlo estimate of the trace (Chapnik et al., 2006). For the rest of the paper, we drop the iteration number subscript to simplify notation.
The DI01 algorithm was formulated only to compute variance scaling factors, either global scalings or scalings for various data subsets defined by projection operators (Chapnik et al., 2004). Although Equations (14a) and (14b) can be adapted to accommodate other types of parametrization (for instance, correlation lengths), finding an explicit analytical solution for these parametrizations may be difficult.
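A sketch of one DI01 variance-scaling update, written with the trace forms of the scaling parameters and with the expected innovation covariance standing in for sample statistics; the setup (both prescribed covariances underscaled by the same factor) is illustrative, not from the paper:

```python
import numpy as np

rng = np.random.default_rng(5)

p = 5
# "True" error covariances in observation space (illustrative values):
A = rng.standard_normal((p, p))
HBH_t = A @ A.T + p * np.eye(p)
R_t = np.diag(rng.uniform(0.5, 1.5, p))
D = HBH_t + R_t                       # expected innovation covariance

# Prescribed covariances: correct structure, both variances
# underestimated by the same factor c.
c = 0.25
HBH, R = c * HBH_t, c * R_t

def di01_scalings(HBH, R, D):
    """One DI01 update: trace-form scalings of the filtered parts."""
    Dinv = np.linalg.inv(HBH + R)
    s_b = np.trace(Dinv @ HBH @ Dinv @ D) / np.trace(Dinv @ HBH)
    s_o = np.trace(Dinv @ R @ Dinv @ D) / np.trace(Dinv @ R)
    return s_b, s_o

s_b, s_o = di01_scalings(HBH, R, D)
HBH, R = s_b * HBH, s_o * R
print(s_b, s_o)                        # both equal 1/c = 4 here
print(np.allclose(HBH + R, D))         # innovation consistency restored
```

With a common misscaling, one iteration restores innovation covariance consistency exactly; when the background and observations are misscaled differently, separating them relies on their differing structures, as quantified later in the paper.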

Associated cost functions
We now aim to associate the variance scaling parameters of the DI01 algorithm with cost functions that, when minimized, yield the iterative relations in Equations (15a) and (15b).
To this end, using the filtered innovation covariances introduced in the previous section, we rewrite the scaling parameters of Equations (15a) and (15b) as the minimizers of the cost functions

J_B^{DI01}(s_b) = ||s_b D̃_B − D_B||²_{W_B},   (17a)
J_R^{DI01}(s_o) = ||s_o D̃_R − D_R||²_{W_R},   (17b)

for appropriate weightings W_B and W_R, where we have used Equation (11) to write the cost functions in terms of the trace operator and the Frobenius inner product in addition to the vectorization operator. Appendix C gives details on the minimization of these cost functions. As a notational note, the superscript DI01 distinguishes these cost functions and parameters from those of the other methods. The parameter s_b^{DI01} is the linear least-squares coefficient for fitting the prescribed background-filtered innovation covariance D̃_B to its observed value D_B, and similarly s_o^{DI01} is the linear least-squares coefficient for fitting the observation-filtered quantities. It is now clear that the DI01 method implicitly uses the filter F̃ that produces the analysis in observation space (through Equation (4)) to separate the background from the observational contributions to the innovation covariance.
If an analysis has already been produced using a variational assimilation system, then by using the same filter that produces the analysis, the DI01 algorithm does not need to explicitly compute the contributions of D_B and D_R to the tuning parameters, as these would already have been computed during the evaluation of the cost functions in the variational system. Although one could envision using different filter operators in J_B^{DI01} and J_R^{DI01}, using the same filter as used to produce the analysis significantly decreases the computational cost of the algorithm. As the DI01 algorithm relies on the prescribed filter operator being able to adequately separate the background from the observational contributions to the innovation covariance, the algorithm will only be successful if this filtering is done to a sufficiently accurate degree, which will be quantified in Section 8.
In this section, we have seen that the DI01 method is equivalent to the minimization of the cost functions J DI01 B and J DI01 R , defined in Equations (17a) and (17b). By casting the DI01 method in terms of the minimization of particular cost functions, we can directly compare the DI01 method to other algorithms by comparing cost functions, provided we know the cost functions associated with the other method. Accordingly, we will derive the cost functions associated with the D05 and HL methods in the following two sections in order to make these comparisons.

THE DESROZIERS ET AL. (2005) METHOD
Although the DI01 algorithm may be used to tune error variances, the evaluation of the required traces is often computationally demanding in operational settings. In addition, the DI01 method as originally devised only allows for the tuning of error variances, while other covariance parameters, such as correlation lengths, are not updated. An alternative method for tuning error statistics was devised by Desroziers et al. (2005), which is typically less computationally demanding and can be used to tune correlation parameters. In this section, we examine the D05 method and associate quadratic cost functions with it, which allows for a direct comparison between the DI01 and D05 algorithms. As we wish to compare the DI01 and D05 methods, we concentrate on scaling parameters for the error variances, but note that the methods presented in this section can be generalized to other tuning parameters.

Description of the algorithm
D05 notes that one can form various consistency checks for the error covariances by taking the expectation value of various outer products of differences in observation space, which include the relations

E[(Hδx^a) d^T] = F̃ E[dd^T] = HB̃H^T D̃^{-1} E[D],   (18a)
E[(y − H(x^a)) d^T] = (I − F̃) E[dd^T] = R̃ D̃^{-1} E[D].   (18b)

If the modelled innovation covariance D̃ is equal to the expectation value of the observed innovation covariance D, then the expectation values in Equations (18a) and (18b) simply become equal to HB̃H^T and R̃. Based on these relations, D05 proposed that the iterative relations

HB̃_{i+1}H^T = F̃_i D,   (19a)
R̃_{i+1} = (I − F̃_i) D,   (19b)

be used as a tuning algorithm. Ménard (2016) showed that, when the full matrices of both HB̃H^T and R̃ are updated with this algorithm, innovation consistency is reached after a single iteration. The D05 algorithm as given in Equations (19a) and (19b) describes the tuning of the full covariance matrices HB̃H^T and R̃. If the covariance matrices are parametrized using a set of variables and those parameters are tuned instead of the full matrices, the system of equations represented by Equations (19a) and (19b) will in general be overdetermined. As previously mentioned, we will focus on scaling the error variances only, which for the D05 algorithm is set by the scaling parameters

(s_b^{D05})_i = tr(F̃_i D) / tr(HB̃_iH^T),   (20a)
(s_o^{D05})_i = tr((I − F̃_i) D) / tr(R̃_i).   (20b)

Comparing Equations (20a) and (20b) to Equations (15a) and (15b) for the DI01 algorithm, we see that, for the case of variance scalings, the two methods differ only by an extra factor of D̃_i^{-1} present in all traces in the expressions for the DI01 scaling parameters. The absence of this extra factor of D̃_i^{-1} in Equations (20a) and (20b) makes the traces more easily computed than those required for the DI01 scaling parameters.
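The D05 variance scalings admit a similarly compact sketch; as in the DI01 example above, the expected innovation covariance stands in for sample statistics and the misspecification (a common underscaling) is illustrative:

```python
import numpy as np

rng = np.random.default_rng(6)

p = 5
A = rng.standard_normal((p, p))
HBH_t = A @ A.T + p * np.eye(p)      # "true" background cov in obs space
R_t = np.diag(rng.uniform(0.5, 1.5, p))
D = HBH_t + R_t                      # expected innovation covariance

c = 0.25
HBH, R = c * HBH_t, c * R_t          # misspecified prescribed covariances

def d05_scalings(HBH, R, D):
    """One D05 variance-scaling update via the trace relations."""
    F = HBH @ np.linalg.inv(HBH + R)          # analysis filter HK
    s_b = np.trace(F @ D) / np.trace(HBH)
    s_o = np.trace((np.eye(len(D)) - F) @ D) / np.trace(R)
    return s_b, s_o

s_b, s_o = d05_scalings(HBH, R, D)
print(s_b, s_o)                       # 1/c = 4 for this common scaling
print(np.allclose(s_b * HBH + s_o * R, D))
```

Note that, relative to the DI01 sketch, the only change is the removal of the inverse modelled innovation covariance from inside the traces.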

Associated cost functions
As previously mentioned, the D05 variance scaling parameters as originally devised are obtained from the non-parametric relations in Equations (19a) and (19b) by taking the trace of each side of the equations. It will be shown that this is a specific case of a more general method for solving the overdetermined system in Equations (19a) and (19b).
To generalize the method for solving the overdetermined set in Equations (19a) and (19b) for parametrized covariances, we set the tuning parameters as the values that minimize the square of the Frobenius norm of the mismatch between the left- and right-hand sides of Equations (19a) and (19b). In other words, for an arbitrary parametrization of the covariances, the tuning parameters should minimize the cost functions J_B and J_R defined by

J_B = ||HB̃_{i+1}H^T − F̃_i D||²_{W_b1 ⊗ W_b2},   (21a)
J_R = ||R̃_{i+1} − (I − F̃_i) D||²_{W_o1 ⊗ W_o2},   (21b)

where we have allowed for the (as of yet unspecified) weightings W_b1 ⊗ W_b2 and W_o1 ⊗ W_o2. By selecting the covariance parameters in this manner, the parameters are the best-fit parameters corresponding to the cost functions J_B and J_R. We note that, contrary to the DI01 cost functions in Equations (17a) and (17b), as the cost functions in Equations (21a) and (21b) do not specify a particular parametrization, they can be used to estimate error correlations as well. Now restricting ourselves to scalings of the variances, these cost functions simplify to

J_B(s_b) = ||s_b HB̃H^T − F̃ D||²_{W_b1 ⊗ W_b2},   (22a)
J_R(s_o) = ||s_o R̃ − (I − F̃) D||²_{W_o1 ⊗ W_o2},   (22b)

where Equations (11), (B4) and (B6) can be used to move the filter operators from inside the norm to the weighting. At this point, as with the DI01 method, we have dropped the iteration index to simplify notation. These cost functions are minimized by the scaling parameters

ŝ_b = ⟨HB̃H^T, F̃D⟩_{W_b1 ⊗ W_b2} / ||HB̃H^T||²_{W_b1 ⊗ W_b2},
ŝ_o = ⟨R̃, (I − F̃)D⟩_{W_o1 ⊗ W_o2} / ||R̃||²_{W_o1 ⊗ W_o2}.

Note that these expressions have the scaling parameter estimates written with the accent ∧ instead of the superscript D05, as this superscript will be reserved for the particular choice of weightings that reproduces Equations (20a) and (20b). Here we have defined B̃_B ≡ F̃HB̃H^TF̃^T and R̃_R ≡ (I − F̃)R̃(I − F̃)^T via Equations (6a) and (6b), respectively (B̃_B will be written without H or H^T for ease of notation). Comparing the resulting cost functions to Equations (17a) and (17b) for the DI01 method, we see that, for the scaling of the variances, the two methods differ only in their weighting of the cost functions.
Therefore, although the DI01 and D05 algorithms have important differences in their numerics and ease of implementation, conceptually both methods are solutions to essentially the same problem (minimization of the cost functions associated with the filtered innovation covariances) with only minor differences (the weighting factors in the cost functions). In other words, both the DI01 and D05 methods implicitly fit a modelled covariance to an observed covariance, but give a different weighting to each term in the covariance matrix. The differences in assigned weightings between the two methods are further explored in Section 11. As we have parametrized the modelled covariances with scaling factors, the implicit fitting procedure is simply a linear least-squares fit. The weightings that yield the D05 scheme are only one possible choice and, by choosing different weightings, one can produce different iterative schemes for the variance scaling coefficients. Each choice of weightings will impact the performance of the iterative tuning scheme. For instance, with the choice of W_b1 = W_b2 = (HB̃H^T)^{-1} and W_o1 = W_o2 = R̃^{-1}, the scheme gives

s_b = s_o = tr(D̃^{-1}D) / N_obs,

where N_obs is the number of observations. In this case, the ratio between the background- and observation-error variances does not change between iterations, thereby not affecting the analyses made using the updated error covariances. With this choice of weightings, the method simplifies to an algorithm that satisfies the χ² diagnostic without changing the relative scaling between the background- and observation-error covariances. If instead the weightings are chosen as W_b1 = W_o1 = D̃^{-1}, W_b2 = (HB̃H^T)^{-1}, and W_o2 = R̃^{-1}, then the DI01 method is recovered. As such, for the case of variance scaling, both the DI01 and D05 methods belong to a larger class of statistics tuning methods defined by Equations (22a) and (22b).
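The claim about the weighting choice W_b1 = W_b2 = (HB̃H^T)^{-1}, W_o1 = W_o2 = R̃^{-1} can be checked numerically. The sketch below assumes the fitted pairs are (s_b HB̃H^T, F̃D) and (s_o R̃, (I − F̃)D), and uses a generic least-squares minimizer under a W1 ⊗ W2 weighting (all matrices illustrative):

```python
import numpy as np

rng = np.random.default_rng(7)

p = 4
A = rng.standard_normal((p, p)); HBH = A @ A.T + p * np.eye(p)
R = np.diag(rng.uniform(0.5, 1.5, p))
Dmod = HBH + R                            # modelled innovation covariance
F = HBH @ np.linalg.inv(Dmod)             # analysis filter
A = rng.standard_normal((p, p)); D = A @ A.T + p * np.eye(p)  # "observed" D

def ls_scaling(A, B, W1, W2):
    """Minimizer of ||s*A - B||^2 under the W1 (x) W2 weighting
    (A, B, W1, W2 symmetric)."""
    return np.trace(A @ W2 @ B @ W1) / np.trace(A @ W2 @ A @ W1)

# Weightings W_b1 = W_b2 = (HBH^T)^{-1} and W_o1 = W_o2 = R^{-1}:
Wb, Wo = np.linalg.inv(HBH), np.linalg.inv(R)
s_b = ls_scaling(HBH, F @ D, Wb, Wb)
s_o = ls_scaling(R, (np.eye(p) - F) @ D, Wo, Wo)

# Both reduce to the chi-squared consistency ratio tr(Dmod^{-1} D)/N:
chi2 = np.trace(np.linalg.inv(Dmod) @ D) / p
print(np.isclose(s_b, chi2), np.isclose(s_o, chi2))
```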
As with the DI01 method, we have seen that the tuning of variances using the D05 method is equivalent to the minimization of quadratic cost functions. Additionally, both the DI01 and D05 methods use the analysis filter to attribute portions of the innovation covariance to the background and observations. A comparison of their associated cost functions shows that, fundamentally, the two algorithms differ only by the weightings used in each cost function. The DI01 and D05 methods both belong to the class of methods defined by Equations (22a) and (22b), in which cost functions measuring the mismatch between parts of the modelled and observed innovation covariances are minimized, using the analysis filter for attribution to either the background or observations.
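The shared fitting problem can be made concrete with a small numerical sketch (this is illustrative code, not from the paper, and unit weightings are assumed for simplicity): for a single scaling parameter, minimizing a quadratic Frobenius-norm cost J(s) = ||D_f − s·M||²_F reduces to a linear least-squares ratio of Frobenius inner products.

```python
import numpy as np

# Sketch (unit weightings assumed): for J(s) = || D_f - s * M ||_F^2,
# the minimizer is the least-squares ratio s = <M, D_f>_F / <M, M>_F.

def scaling_fit(D_f, M):
    """Best-fit scaling of the modelled matrix M to the (filtered) observed matrix D_f."""
    return np.sum(M * D_f) / np.sum(M * M)

rng = np.random.default_rng(0)
A = rng.standard_normal((20, 20))
M = A @ A.T                      # a symmetric "modelled" covariance (illustrative)
D_f = 2.5 * M                    # observed part that is an exact scaling of M
s = scaling_fit(D_f, M)
print(round(s, 6))               # recovers 2.5 up to round-off
```

With non-trivial weightings the same structure holds; only the inner product changes, which is exactly how the DI01 and D05 schemes differ.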

THE HOLLINGSWORTH-LÖNNBERG METHOD
The method of Hollingsworth and Lönnberg (1986) diagnoses error covariances through the use of innovation statistics. Unlike the DI01 and D05 algorithms, however, the HL method does not use the filter that produces the analysis to differentiate between the observation and background contributions to the innovation covariance. Instead, the observation errors are assumed to be spatially uncorrelated, so that all spatial correlations in the innovation covariance are attributed to the background. As the analysis filter is not used, the HL method is an a priori scheme.

Description of the algorithm
Unlike the DI01 and D05 methods, the HL method determines the background and observation parameters sequentially. First, the background-error covariance parameters are determined by fitting the innovation covariance matrix without its diagonal terms, so as to use only the parts of the innovation covariance affected solely by the background. The best-fit parameters are then used to extrapolate the modelled background-error covariance to zero spatial separation, so that the background contribution to the innovation variance can be removed and an estimate of the observation-error variance can be made. The HL method typically parametrizes both the variances and correlations of the background covariance. However, to compare with the DI01 and D05 methods, we restrict ourselves to a single global background-error variance scaling parameter s_b^HL. The first step in the HL method fits the off-diagonal elements of HBH^T, which are the same as the off-diagonal elements of D̃ since the observation errors are assumed to be uncorrelated. As in Ménard et al. (2016), this fitting procedure can be performed by minimizing the cost function J_B^HL, in which i, j are matrix indices, Σ̃_b is a diagonal weighting matrix, and C̃ is the modelled background-error correlation matrix. We take Σ̃_b as the diagonal matrix of square roots of the modelled background variances, so that the modelled background covariance can be expressed as HBH^T = Σ̃_b C̃ Σ̃_b. An estimate of the background-error variance can then be made from this fit and subtracted from the observed innovation variance to estimate the observation-error variance, where s_o^HL is the observation-error variance scaling parameter for the HL method and s_b^HL is the previously determined best-fit parameter for s_b.
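The two sequential HL fits can be sketched numerically as follows (a minimal sketch with unit weightings and our own variable names, not the paper's implementation; it assumes the idealized case where the truth differs from the model only by scaling factors).

```python
import numpy as np

# Minimal HL sketch: observation errors are taken as spatially uncorrelated,
# so all off-diagonal innovation covariance is attributed to the background.

def hl_variance_fit(D, B_mod, R_mod):
    off = ~np.eye(D.shape[0], dtype=bool)
    # Step 1: least-squares fit of s_b to the off-diagonal innovation covariances.
    s_b = np.sum(D[off] * B_mod[off]) / np.sum(B_mod[off] ** 2)
    # Step 2: subtract the fitted background variance from the innovation
    # variance and attribute the remainder to the observations.
    s_o = np.sum(np.diag(D) - s_b * np.diag(B_mod)) / np.sum(np.diag(R_mod))
    return s_b, s_o

# Idealized check: the truth differs from the model only by scalings.
n = 30
x = np.arange(n)
dist = np.minimum(np.abs(x[:, None] - x[None, :]), n - np.abs(x[:, None] - x[None, :]))
B_mod = np.exp(-(dist / 5.0) ** 2)     # Gaussian background correlations (illustrative)
R_mod = np.eye(n)                      # uncorrelated observation errors
D = 2.0 * B_mod + 0.5 * R_mod          # "observed" innovation covariance
print(hl_variance_fit(D, B_mod, R_mod))  # recovers (2.0, 0.5) up to round-off
```

Because the off-diagonal elements of D here contain no observation contribution, step 1 recovers s_b exactly, after which step 2 recovers s_o; with spatially correlated observation errors this attribution would fail.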

Associated cost functions
Although the previous section introduced the HL background cost function J_B^HL, its form in Equation (25) is difficult to compare with the cost functions for the DI01 and D05 methods, and we have yet to introduce a cost function for the observation-error covariance. In this section, we complete the formulation of the HL method in terms of the minimization of cost functions that measure the mismatch between the modelled and observed background- and observation-error covariances.
By introducing the constant N_obs × N_obs matrix V that has all elements equal to one, except the diagonal elements, which are equal to zero, the cost function J_B^HL in Equation (25) can be rewritten using the Hadamard (element-wise) product • (Equation (B10)) together with Equation (11). Introducing the matrix V with the Hadamard product allows us to sum over all matrix indices, and consequently J_B^HL can be expressed in terms of a trace, allowing for a direct comparison with the DI01 and D05 methods. The most notable difference between J_B^HL and its analogues for the DI01 and D05 methods, found in Equations (17a) and (24a), respectively, is that the terms involving the filtered innovation covariances, D_B and D̃_B, are replaced by V•D and V•D̃ in J_B^HL, respectively. As such, the element-wise multiplication between the matrix V and the innovation covariance is analogous to the filtering that is done using the operator F̂ in the DI01 and D05 methods. The weighting used here for J_B^HL is kept simple, but we note that J_B^HL could be defined with a more complicated weighting. The value of s_b that minimizes J_B^HL follows directly from this quadratic form. After the background-error covariance parameters have been determined, a fit for the observation-error variance parameters can be made using the diagonal elements of the innovation covariance matrix, via the cost function J_R^HL, in which Σ̃_o is a diagonal weighting matrix. As the fits for the background- and observation-error covariance parameters are done sequentially, the best-fit background parameter s_b^HL appears in the cost function J_R^HL. In Equation (29), the Hadamard product is again used to extend the sum over all matrix indices, this time by an element-wise multiplication with the identity matrix (of dimension N_obs × N_obs).
As with J_B^HL, by comparing J_R^HL with its analogues for the DI01 and D05 methods, found in Equations (17b) and (24b), respectively, we see that the Hadamard product with the identity matrix in J_R^HL is analogous to the filtering done with the operator I − F̂ in the DI01 and D05 methods. The cost function J_R^HL is minimized by the scaling coefficient given in Equation (30). If the weighting Σ̃_o is chosen to be the diagonal matrix of the square roots of the modelled observation-error variances, as we do for the remainder of the paper, then Equation (30) simplifies to Equation (26). By using the Hadamard product with the matrix V or I, we have seen that we can construct cost functions for the HL method that can be directly compared with the cost functions associated with the DI01 method (Equations (17a) and (17b)) or with the D05 method (Equations (24a) and (24b)). Comparing these associated cost functions shows that all three methods minimize cost functions quadratic in the variance scaling parameters, each with a different weighting. While the DI01 and D05 methods use the analysis filter to attribute portions of the innovation covariance to either the background or observations, the HL method attributes all correlations to the background.

GEOMETRIC INTERPRETATIONS OF COVARIANCES
We have seen that both the DI01 and D05 algorithms rely on the filter used to produce the analysis to separate the contributions of the background and observations to the innovation covariance; it is precisely the use of this filter that makes them a posteriori algorithms. The ability of the filter to perform this separation accurately is crucial for successful applications of the DI01 and D05 methods and depends on how different the background- and observation-error covariance spectra are from one another. This section introduces measures, based on the Frobenius inner product, that quantify the degree of difference between the background- and observation-error covariances. The use of the Frobenius inner product will also allow us to interpret the covariance matrices geometrically.

Inner products between covariances
We begin this section by returning to Sections 5.1 and 6.1 and rewriting the expressions for the variance scaling parameters in terms of the Frobenius inner product: Equations (15a) and (15b) for the DI01 method, and Equations (20a) and (20b) for the D05 method, can both be written as ratios of inner products. These scaling parameters are given by the ratio of the inner products of HBH^T or R̃ with the observed and modelled innovation covariances. The only difference between the two methods is their choice of weights for the inner product. The cost functions J_b and J_o evaluated at the BLUE analysis, as given in Equations (2a) and (2b), can also be expressed in a similar manner: 2J_b(x_a) and 2J_o(x_a) are given by the Frobenius inner products of the observed innovation covariance D with HBH^T and R̃, respectively, weighted by (D̃ ⊗ D̃)^{-1}. The χ² statistic for the innovations can be expressed in the same form. All of these expressions involve inner products between covariances, weighted either by (D̃ ⊗ D̃)^{-1} or by (D̃ ⊗ I)^{-1}. As such, inner products between covariances play a central role in these methods.
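Such weighted inner products are straightforward to evaluate numerically. The sketch below assumes that, for symmetric matrices, the weighting (D̃ ⊗ D̃)^{-1} acts as ⟨A, B⟩ = Tr[A D̃^{-1} B D̃^{-1}] (via the standard vec/Kronecker identity); H = I and all matrices here are illustrative.

```python
import numpy as np

# Sketch of the weighted Frobenius inner product:
# <A, B> weighted by (W ⊗ W)^{-1} evaluated as Tr[A W^{-1} B W^{-1}]
# for symmetric A and B (an assumption consistent with
# (W^{-1} ⊗ W^{-1}) vec(B) = vec(W^{-1} B W^{-1})).

def winner(A, B, W_inv):
    return np.trace(A @ W_inv @ B @ W_inv)

n = 15
x = np.arange(n)
d = np.minimum(np.abs(x[:, None] - x[None, :]), n - np.abs(x[:, None] - x[None, :]))
HBHT = np.exp(-(d / 3.0) ** 2)        # modelled background covariance in obs space
R = 0.8 * np.eye(n)                   # modelled observation-error covariance
D_mod = HBHT + R                      # modelled innovation covariance
W_inv = np.linalg.inv(D_mod)

# If the observed innovation covariance equals the modelled one, the scaling
# estimate s_b = <HBHT, D_obs> / <HBHT, D_mod> is exactly one.
D_obs = D_mod.copy()
s_b = winner(HBHT, D_obs, W_inv) / winner(HBHT, D_mod, W_inv)
print(round(s_b, 6))  # 1.0
```

The DI01 and D05 estimates differ only in replacing W_inv by the appropriate weighting in the two inner products.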
In this section, we develop an intuitive understanding of the inner product between covariances. With a matrix norm and angle defined via the Frobenius inner product, we can interpret the covariances geometrically. This geometric interpretation is not merely an analogy, as each covariance matrix can be converted into a vector by use of the vectorization operator. As weighting these inner products by (D̃ ⊗ D̃)^{-1} yields unitless products, we illustrate these products using this weighting, but note that similar results are obtained when weighting by (D̃ ⊗ I)^{-1} instead.

Spectral distinctiveness of analysis filters
As a first step, we consider inner products where both of the operands are modelled quantities. The inner products ⟨HBH^T, D̃⟩_{(D̃⊗D̃)^{-1}} and ⟨R̃, D̃⟩_{(D̃⊗D̃)^{-1}}, which appear in Equations (31a) and (31b), can be decomposed into terms involving the norms of HBH^T and R̃ and the inner product between them. Using the relations in Equations (5a) and (5b), these factors can be written in terms of λ_i, the ith diagonal element of the matrix Λ. From Equations (36b) and (36c), we see that ||HBH^T||²_{(D̃⊗D̃)^{-1}} and ||R̃||²_{(D̃⊗D̃)^{-1}} are equal to the sums of the squares of the spectra of the filters F̂ and I − F̂, that is, of (1 + λ)^{-1} and λ(1 + λ)^{-1} in spectral space, respectively. The term ⟨HBH^T, R̃⟩_{(D̃⊗D̃)^{-1}} measures the overlap between F̂ and I − F̂: it is the sum of the products of (1 + λ)^{-1} and λ(1 + λ)^{-1}, and it quantifies how distinguishable HBH^T is from R̃ (or F̂ is from I − F̂) based on their spectra.
We can further quantify how spectrally distinct the filters F̂ and I − F̂ are from one another by using Equation (12) to define the angle θ_B,R between HBH^T and R̃, weighted by (D̃ ⊗ D̃)^{-1}. In this nomenclature, factors of H in the subscript of θ_B,R are assumed but not written explicitly in order to simplify notation. Additionally, a weighting of (D̃ ⊗ D̃)^{-1} is assumed for θ_B,R unless stated otherwise. Accordingly, cos(θ_B,R) is the ratio of the filter-overlap term to the product of the root-sum-squares of each filter.
As an illustration, consider the case where the spectra of HBH^T and R̃ are such that, at the wavenumbers where the spectrum of HBH^T is significant, the spectrum of R̃ is negligible, and at the wavenumbers where the spectrum of R̃ is significant, the spectrum of HBH^T is negligible. We can say that the spectral overlap between HBH^T and R̃ is negligible in this situation (see the lower plot of Figure 1d for a case where the spectral overlap is relatively small). In this case, at most wavenumbers, either the spectrum of the filter F̂ is close to unity and the spectrum of the filter I − F̂ is close to zero (i.e., (1 + λ_i)^{-1} ≈ 1 and λ_i(1 + λ_i)^{-1} ≈ 0), or vice versa. This implies that the filters F̂ and I − F̂ can relatively unambiguously attribute a particular wavenumber to either the background or the observations. Consequently, at almost all wavenumbers, the product of (1 + λ_i)^{-1} and λ_i(1 + λ_i)^{-1} would be negligible, and ⟨HBH^T, R̃⟩_{(D̃⊗D̃)^{-1}}, being the sum of this product over all wavenumbers, would be small. In the limit of no overlap between the spectra of HBH^T and R̃, since the filter spectra (1 + λ_i)^{-1} and λ_i(1 + λ_i)^{-1} would each be equal to either zero or one, ||HBH^T||²_{(D̃⊗D̃)^{-1}} and ||R̃||²_{(D̃⊗D̃)^{-1}} would equal the number of non-zero modes in HBH^T and R̃, respectively. In this case, Equation (37) implies that θ_B,R would be close to 90°, and so we can consider HBH^T and R̃ to be (nearly) orthogonal to one another. Notice from Equation (36a) that, if F̂ is a projection operator (i.e., F̂² = F̂), then HBH^T and R̃ will be orthogonal to one another.
At the other extreme, if the spectral structures of HBH^T and R̃ are similar to one another, the spectra of HBH^T and R̃ will be significant (or insignificant) at the same wavenumbers. In other words, the two spectra will overlap considerably. In this case, the filters F̂ and I − F̂ will have similar spectra and consequently there will be few wavenumbers that could be used to distinguish between the two error covariances. At most wavenumbers, the product of (1 + λ_i)^{-1} and λ_i(1 + λ_i)^{-1} would then be significant, leading to a non-negligible value of ⟨HBH^T, R̃⟩_{(D̃⊗D̃)^{-1}}.
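These spectral identities can be checked numerically in the simple commuting case H = I, R̃ = σ_o² I, where λ_i is the ratio of the observation to the background spectrum (a sketch under these assumptions with illustrative parameters, not a general implementation).

```python
import numpy as np

# Numerical check of the spectral identities, assuming H = I and
# R = sigma_o^2 * I so that B and R share eigenvectors and
# lambda_i = (obs spectrum) / (background spectrum).

n = 40
x = np.arange(n)
d = np.minimum(np.abs(x[:, None] - x[None, :]), n - np.abs(x[:, None] - x[None, :]))
B = np.exp(-(d / 1.5) ** 2)            # periodic Gaussian background covariance
sigma_o2 = 0.5
R = sigma_o2 * np.eye(n)
D = B + R
D_inv = np.linalg.inv(D)

b_spec = np.linalg.eigvalsh(B)          # background spectrum
lam = sigma_o2 / b_spec                 # lambda_i

norm_B2 = np.trace(B @ D_inv @ B @ D_inv)   # ||HBH^T||^2, weighted by (D ⊗ D)^{-1}
norm_R2 = np.trace(R @ D_inv @ R @ D_inv)   # ||R||^2, weighted
cross = np.trace(B @ D_inv @ R @ D_inv)     # <HBH^T, R>, weighted

# Spectral forms: sums over the filter spectra (1+lam)^-1 and lam(1+lam)^-1.
print(np.allclose(norm_B2, np.sum((1 + lam) ** -2)))        # True
print(np.allclose(norm_R2, np.sum((lam / (1 + lam)) ** 2))) # True
print(np.allclose(cross, np.sum(lam / (1 + lam) ** 2)))     # True
```

The three traces reproduce the sums of (1 + λ)^{-2}, λ²(1 + λ)^{-2}, and λ(1 + λ)^{-2}, matching the filter-spectrum interpretation above.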
We have seen in previous sections that both the DI01 and D05 algorithms can be viewed as fitting procedures for the filtered innovation covariances D̃_B and D̃_R. As will be demonstrated in later sections, the behaviour of these algorithms depends on how well the filters attribute a particular wavenumber of the innovation covariance to either the background or the observations. The angle θ_B,R is one possible measure of how well the filters F̂ and I − F̂ isolate the background and observational contributions to the innovation covariance. Another measure examines the differences HBH^T − D̃_B and R̃ − D̃_R, that is, the differences between the full error covariance matrix (either background or observation) and the appropriately filtered innovation covariance. It can easily be shown that HBH^T − D̃_B = R̃ − D̃_R = F̂D̃(I − F̂)^T = HBH^T D̃^{-1} R̃, which has a norm whose square, by comparison with Equation (36a), is similar to ⟨HBH^T, R̃⟩_{(D̃⊗D̃)^{-1}} but contains two factors of F̂(I − F̂) within the trace instead of one. This norm gives a slightly different measure of the degree of spectral overlap between the modes of HBH^T and R̃. If HBH^T and R̃ have no non-zero modes in common (so that θ_B,R = 90°), then the filters F̂ and I − F̂ can perfectly divide the innovation covariance into parts associated with either HBH^T or R̃, in which case we would have D̃_B = HBH^T and D̃_R = R̃. The larger the overlap between the modes of HBH^T and R̃, the further D̃_B is from HBH^T and D̃_R is from R̃. Lastly, we note that, as these inner products have been weighted by (D̃ ⊗ D̃)^{-1}, ||D̃||²_{(D̃⊗D̃)^{-1}} is simply equal to the total number of observations N_obs. The law of cosines applied to the triangle formed by HBH^T, R̃, and D̃ yields N_obs = ||HBH^T||²_{(D̃⊗D̃)^{-1}} + ||R̃||²_{(D̃⊗D̃)^{-1}} + 2||HBH^T||_{(D̃⊗D̃)^{-1}} ||R̃||_{(D̃⊗D̃)^{-1}} cos(θ_B,R). Therefore, for a constant number of observations, ||HBH^T||_{(D̃⊗D̃)^{-1}}, ||R̃||_{(D̃⊗D̃)^{-1}}, and θ_B,R are not all independent of one another.

Example: 1D periodic domain
We now illustrate the inner products introduced in the last section with the example of homogeneous covariances defined on a 1D periodic domain, though we emphasize that neither of these properties has been assumed in the above derivations. We also assume that the observations measure the same quantity as the model field at the equally spaced model grid locations, so that H = I. In this example, the simultaneous spectral decomposition of HBH^T and R̃ coincides with the Fourier decomposition. Observation errors are assumed to be spatially uncorrelated, while the background correlations are modelled using a Gaussian function with correlation length L_b. The modelled homogeneous background- and observation-error variances are denoted by σ̃²_b and σ̃²_o, respectively, and the domain length is denoted by L. The modelled background- and observation-error covariances for the 1D periodic example are shown in spectral space in Figure 1 for four different choices of parameters. In each panel, the top plot shows the background- and observation-error covariances, while the bottom plot shows the filter spectra (1 + λ)^{-1} and λ(1 + λ)^{-1} (corresponding to the filters F̂ and I − F̂, respectively) as well as the product of these spectra. As the observation errors were assumed to be uncorrelated in space, their spectrum is flat. It is clear that (1 + λ)^{-1} and λ(1 + λ)^{-1} act as a low-pass and a high-pass filter, respectively. As seen in the different panels of Figure 1, the overlapping region of the filters is larger for smaller values of the background correlation length.
The values of the norms of HBH^T and R̃, as well as the angle θ_B,R, are shown in each panel. From Equations (36b) and (36c), ||HBH^T||²_{(D̃⊗D̃)^{-1}} and ||R̃||²_{(D̃⊗D̃)^{-1}} are given by the sums of the squares under the (1 + λ)^{-1} and λ(1 + λ)^{-1} curves, respectively. From Equation (36a), ⟨HBH^T, R̃⟩_{(D̃⊗D̃)^{-1}} is the sum under the λ(1 + λ)^{-2} curve. Comparing Figure 1a and b, the filter F̂ in (a) filters out more wavenumbers than in (b), which is reflected in the norm of HBH^T having a smaller value in (a) than in (b). The filters F̂ and I − F̂ overlap significantly over more wavenumbers in (b) than in (a), and accordingly the value of θ_B,R is smaller in (b) than in (a). In (b), the value of σ̃²_b/σ̃²_o was changed from its value in (a) so that the values of the lowest mode in HBH^T were the same in both panels. To examine changes in length-scales and variances separately, (c) shows the case where the background correlation length is shorter than in (a) but the variances are unaltered, while (d) shows the case where the background correlation length has the same value as in (a) but with different variance values. From these comparisons, we see that, in general, changing either the correlation length or the variances changes both the norms of HBH^T and R̃ and the angle between them. Figure 2 shows θ_B,R, ||HBH^T||²_{(D̃⊗D̃)^{-1}}, and ||R̃||²_{(D̃⊗D̃)^{-1}} as functions of the background correlation length L_b for a variety of observation spacings Δx. In this example, we set the background- and observation-error variances equal to one another and set the domain length to L = 40,000 km, which roughly coincides with the Earth's circumference. As L_b increases, the background correlations become more spectrally distinct from the uncorrelated observation uncertainties, and accordingly θ_B,R moves closer to 90°.
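This qualitative behaviour can be reproduced with a small sketch of the 1D periodic example (grid units rather than kilometres, H = I, and illustrative parameter values; the function name is ours):

```python
import numpy as np

# Sketch: homogeneous Gaussian background correlations, uncorrelated
# observation errors, H = I. The angle theta_{B,R} (weighted by
# (D ⊗ D)^{-1}) should approach 90 degrees as L_b grows.

def theta_BR(n, L_b, var_b=1.0, var_o=1.0):
    x = np.arange(n, dtype=float)
    d = np.minimum(np.abs(x[:, None] - x[None, :]), n - np.abs(x[:, None] - x[None, :]))
    B = var_b * np.exp(-(d / L_b) ** 2)
    R = var_o * np.eye(n)
    W = np.linalg.inv(B + R)            # weighting built from the modelled D
    inner = lambda A, C: np.trace(A @ W @ C @ W)
    c = inner(B, R) / np.sqrt(inner(B, B) * inner(R, R))
    return np.degrees(np.arccos(np.clip(c, -1.0, 1.0)))

angles = [theta_BR(60, L) for L in (1.0, 3.0, 9.0)]
print(angles)  # increasing toward 90 degrees as L_b grows
```

As in Figure 2, widening the background correlations makes HBH^T more spectrally distinct from the flat observation-error spectrum, pushing θ_B,R toward 90°.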
Increasing L_b also decreases the spectral extent of the filter F̂ (compare the bottom plots of Figures 1a and c), and so decreases the value of ||HBH^T||²_{(D̃⊗D̃)^{-1}}. Most practical data assimilation applications involve making observations at discrete locations, in which case we are able to sample the covariances only at these locations. Figure 2 demonstrates that the observation spacing has a significant effect on the spectral overlap between HBH^T and R̃. If a correlation function is sampled at discrete locations and the spacing between samples is greatly increased, then the sampled correlation function will begin to look more like a delta function, assuming the continuous correlation function being sampled approaches zero at infinite distance. Conversely, if the spacing between samples is greatly decreased, the correlation values between adjacent locations do not change as much, so the sampled correlation function starts to look flatter. Thus, if the observation errors are uncorrelated in space, or have a much smaller correlation length than the background errors, increasing the observation spacing increases the spectral overlap between HBH^T and R̃, as the two sampled correlations become less distinct from one another. This is reflected in Figure 2, where we see that, for a fixed value of L_b, θ_B,R decreases as Δx increases. But as the observations are equally spaced in this example, for a set value of L, changing Δx also changes the number of observation locations. Figure 3 examines this issue by plotting θ_B,R as a function of observation spacing Δx while holding the number of observation locations constant, so that L varies with Δx. Here too we see that θ_B,R decreases as the observation spacing increases. Returning to Figure 2b, we see that increasing the density of observations increases both ||HBH^T||_{(D̃⊗D̃)^{-1}} and ||R̃||_{(D̃⊗D̃)^{-1}} for a fixed value of L_b.
Both ||HBH^T||²_{(D̃⊗D̃)^{-1}} and ||R̃||²_{(D̃⊗D̃)^{-1}} can be interpreted as integrals over the squares of their respective filter spectra (the squared values under the blue and red curves in the bottom plots of Figure 1); as increasing the observation density increases the number of modes in observation space, both norms increase accordingly. Figure 4 shows plots of the covariances and filters with the same model parameters as those in Figure 1a, but with observation errors correlated in space, modelled by a Gaussian with correlation length L_o. The observation-error correlation length in Figure 4b is five times larger than its value in Figure 4a. In contrast to the flat observation-error spectra in Figure 1, those in Figure 4 decrease with wavenumber. Although the observation-error covariance spectra in Figure 4 decrease significantly at high wavenumbers, in these examples the observation-error spectra are still significantly larger than the background-error covariance spectrum at these wavenumbers. Consequently, the spectra of the filters F̂ and I − F̂ are nearly the same as in Figure 1a. Comparing with Figure 1a, we can see that the observation-error covariance spectrum is closer to the background-error covariance spectrum at low wavenumbers for the spatially correlated cases than for the uncorrelated case. As such, it is more difficult to distinguish between the background- and observation-error covariances at these wavenumbers, as seen by the decrease of the spectrum of F̂ and the increase of the spectrum of I − F̂ with increasing observation-error correlation length at low wavenumbers. Comparing Figure 1a with Figure 4, we see that the value of θ_B,R decreases with increasing observation-error correlation length, which in these examples is primarily due to the differences in spectra at low wavenumbers. This is further illustrated in Figure 5, which shows the angle θ_B,R as a function of observation-error correlation length for three different values of the background-error correlation length.
We saw in Sections 5.2 and 6.2 that the filtered innovation covariances play a central role in both the DI01 and D05 methods, as they enable the algorithms to attribute parts of the innovation covariance to either the background- or observation-error covariance. Figure 6 illustrates the filtering of the innovation covariance by showing HBH^T and D̃_B for four different choices of parameters as a function of separation distance Δx. Comparing (a, b) with (c, d), respectively, as σ̃²_o decreases, the observational uncertainty contributes less to the innovation covariance, and so the observational covariances misattributed to the background become comparatively smaller as well. This is evident in (c, d) having larger values of θ_B,R and smaller values of ||HBH^T − D̃_B||_{(D̃⊗D̃)^{-1}} than (a, b), respectively. These examples illustrate that the more spectrally distinct HBH^T and R̃ are from one another, the closer D̃_B is to HBH^T (and D̃_R is to R̃).

Limiting cases revisited
One of the fundamental limitations of error covariance estimation is the lack of identifiability between background and observational covariances when their correlation length-scales are identical, as pointed out by several authors, including Chapnik et al. (2004). With this in mind, Chapnik et al. (2004) considered the performance of the DI01 algorithm in two limiting cases: one where the background-error correlation length is taken to be infinite, and a contrasting case where the background- and observation-error correlation lengths are equal. We now revisit these cases to gain further intuitive insight into the inner products between covariances. In the first case, an infinite background-error correlation length implies that all background-error correlations are equal to unity, which is the limiting case for situations where the background-error correlation length is much larger than the length of the domain. In contrast, the observation errors are taken as uncorrelated. In this case, HBH^T would be a rank-1 matrix with a single non-zero spectral value, while R̃ would have a uniform spectrum. Consequently, as λ_i is equal to the observation-error covariance spectrum divided by the background-error covariance spectrum at wavenumber i, the value of λ_i would be infinite at every wavenumber other than the lowest. Examining the sums in Equations (36a) and (36b), only the first term in each sum would be non-zero: the spectral overlap between the filters F̂ and I − F̂ is confined solely to the lowest wavenumber. However, the sum in Equation (36c) taken over all terms except the first would be equal to N_obs − 1. Therefore, if N_obs is very large, ||R̃||²_{(D̃⊗D̃)^{-1}} would be much larger than both ⟨HBH^T, R̃⟩_{(D̃⊗D̃)^{-1}} and ||HBH^T||²_{(D̃⊗D̃)^{-1}}, in which case, from Equation (37), cos(θ_B,R) would be small and θ_B,R would be close to 90°.
As cos(θ_B,R) continues to decrease with increasing values of N_obs, HBH^T and R̃ are asymptotically orthogonal to one another (with respect to the weighting (D̃ ⊗ D̃)^{-1}) in this case. Since HBH^T and R̃ are nearly orthogonal, the analysis filter can attribute particular modes to either the background or observations with little error.
The second case is where the background- and observation-error covariances have exactly the same correlation and variance structure, so that HBH^T and R̃ are equal up to an overall scaling factor. It can easily be verified that all values of λ_i are equal to one another in this case, which implies that the filters F̂ and I − F̂ both have flat spectra. As the relative importance of HBH^T compared with R̃ is the same at each wavenumber, there is no way to use the spectral features of the innovation covariance to attribute a particular part to either the background or the observations. Using Equations (36a) to (36c) for this example gives cos(θ_B,R) = 1, so that θ_B,R = 0° and HBH^T and R̃ can be considered parallel to one another (with respect to the weighting (D̃ ⊗ D̃)^{-1}). Consequently, in this case, the analysis filter is ineffective at attributing modes to either the background or observations. Note that, in this second case, the background- and observation-error covariances have the same covariance structure, not just the same correlation structure. As the rightmost expressions in Equations (36a) to (36c) involve the spectra of the full covariances, θ_B,R may differ from 0° in cases where the correlations of HBH^T and R̃ are the same but the structures of their variances differ.
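Both limiting cases can be verified numerically (a sketch assuming H = I; the rank-one matrix of ones stands in for an infinite background correlation length, and the second case uses a well-conditioned Gaussian correlation matrix so that D̃ remains invertible).

```python
import numpy as np

# Limiting cases: (i) infinite background correlation length (rank-one B,
# R = I) gives theta_{B,R} near 90 degrees for large N_obs;
# (ii) R proportional to B gives cos(theta_{B,R}) = 1, i.e. theta near 0.

def theta_deg(B, R):
    W = np.linalg.inv(B + R)
    inner = lambda A, C: np.trace(A @ W @ C @ W)
    c = inner(B, R) / np.sqrt(inner(B, B) * inner(R, R))
    return np.degrees(np.arccos(np.clip(c, -1.0, 1.0)))

# Case (i): all background correlations equal to one (rank-one B).
n = 200
t1 = theta_deg(np.ones((n, n)), np.eye(n))
print(t1)   # close to 90

# Case (ii): identical covariance structure up to a scaling.
m = 50
x = np.arange(m, dtype=float)
d = np.minimum(np.abs(x[:, None] - x[None, :]), m - np.abs(x[:, None] - x[None, :]))
B2 = np.exp(-(d / 1.5) ** 2)
t2 = theta_deg(B2, 0.5 * B2)
print(t2)   # close to 0
```

The near-orthogonal case (i) is the regime where the analysis filter separates background and observation contributions cleanly; the parallel case (ii) is the regime where it cannot.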

Geometric interpretation of the DI01 and D05 algorithms
As observed by Desroziers (2012) and Ménard (2016), convergence of the D05 algorithm is slow when the background- and observation-error correlations have similar length-scales. As the difference in correlation structure between the two error covariances can be quantified using inner products of the form ⟨HBH^T, R̃⟩ (and the corresponding angle θ_B,R), these quantities can be used to quantify the rate of convergence of the DI01 and D05 algorithms. Additionally, since this has been quantified using an inner product, we can make geometric interpretations of the DI01 and D05 algorithms.
The variance scaling parameters for the DI01 method, as given in Equations (31a) and (31b), can be expressed in terms of the angle θ_B,D between HBH^T and D and the angle θ_B,D̃ between HBH^T and D̃. Written in this manner, it is evident that s_b^DI01 is the ratio of the projection of D onto HBH^T to the projection of D̃ onto HBH^T. This scaling is depicted in Figure 7, which represents the inner product space defined by the Frobenius inner product, so that matrices are represented by vectors with lengths given by their Frobenius norms and with angles defined by Equation (12). The scaling coefficient s_o^DI01 can be interpreted analogously with projections onto R̃. The variance scaling coefficients for the D05 method can be interpreted in the same way, but with norms and angles defined in terms of weightings by (D̃ ⊗ I)^{-1}.

FIGURE 7 Graphical representation of the determination of the background-error variance scaling by use of the DI01 or D05 algorithms. The space displayed has lengths and angles defined by the Frobenius inner product. For the DI01 algorithm, norms and angles are defined using weightings of (D̃ ⊗ D̃)^{-1}, and weightings of (D̃ ⊗ I)^{-1} for the D05 algorithm.

Geometric interpretation of the χ² diagnostic
We saw in Equation (34) that the χ² diagnostic can be expressed as the inner product between the observed (D) and modelled (D̃) innovation covariances weighted by (D̃ ⊗ D̃)^{-1}. As the expectation value of χ² is equal to the number of observations N_obs when the modelled and observed innovation covariances are consistent with one another, the χ² diagnostic relation compares the observed χ² value with N_obs, and it can be written in terms of the Frobenius inner product. Complete innovation consistency implies that D = D̃ for the full covariance matrices, while the χ² diagnostic looks solely at the quantity ⟨D, D̃⟩_{(D̃⊗D̃)^{-1}} = d^T D̃^{-1} d. As such, when the χ² diagnostic is satisfied, the modelled and observed innovation covariances will still, in general, differ from one another. The degree of difference between the observed and prescribed innovation covariances can be quantified by the norm of D − D̃. If the χ² diagnostic is satisfied, then Equation (42), when rearranged, simplifies to Equation (43), which implies that, when the χ² diagnostic is satisfied, the norm of D − D̃ can be found via the Pythagorean theorem. This situation is represented in Figure 8. Therefore, the difference between the squared norms of D and D̃ quantifies the level of difference between the observed and modelled innovation covariances that remains after the χ² diagnostic has been satisfied.

FIGURE 8 Graphical representation of the observed and modelled innovation covariances when the χ² diagnostic is satisfied. The space displayed has lengths and angles defined by the Frobenius inner product weighted by (D̃ ⊗ D̃)^{-1}.
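The consistency statement E[χ²] = N_obs can be illustrated with a quick Monte Carlo sketch (synthetic covariance and our own variable names; innovations are drawn with true covariance equal to the modelled one).

```python
import numpy as np

# Sketch of the chi^2 diagnostic: when the modelled innovation covariance
# D_mod matches the true innovation statistics, the expectation of
# chi^2 = d^T D_mod^{-1} d equals the number of observations N_obs.

rng = np.random.default_rng(1)
n_obs, n_samples = 25, 100_000

A = rng.standard_normal((n_obs, n_obs))
D_mod = A @ A.T + n_obs * np.eye(n_obs)   # a well-conditioned "modelled" covariance
L = np.linalg.cholesky(D_mod)

# Draw innovation samples d ~ N(0, D_mod), one per column.
d = L @ rng.standard_normal((n_obs, n_samples))
chi2 = np.einsum('ij,ij->j', d, np.linalg.solve(D_mod, d))
print(chi2.mean())  # close to n_obs = 25
```

The sample mean of χ² converges to N_obs; a systematic departure from N_obs is exactly the inconsistency that the tuning schemes above are designed to remove.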

STATISTICAL PROPERTIES OF A PRIORI AND A POSTERIORI METHODS
Up to this point, the comparison between the DI01, D05, and HL methods has been conceptual. In this section, we begin to examine the statistical properties of these methods by comparing the expectation values and variances of the scaling parameters derived by each method. This is done in the idealized case where the structures of B̃ and R̃ are correctly specified and differ from the truth only by global scaling factors. In other words, the true covariance matrices are given by B^t = s_b^t B̃ and R^t = s_o^t R̃, where s_b^t and s_o^t are the true variance scalings required for our modelled covariances. The expressions in this section may be extended to include the effects of mis-specified error correlations, but this is not done here for brevity.
We begin by examining the expectation values of the DI01 scaling parameters. These can be found by using the relation in Equation (B11) with Equations (15a) and (15b), which after some simple algebraic manipulation yields Equations (44a) and (44b). From the first lines of Equations (44a) and (44b), we can see that the expectation values of the scaling parameters are equal to their true values if ⟨HBH^T, R̃⟩_{(D̃⊗D̃)^{-1}} is equal to zero or, in other words, when the spectral overlap between HBH^T and R̃ is negligible. When B̃ and R̃ differ from their true values by the same scaling factor, so that s_b^t = s_o^t, the expectation values of the scaling parameters will also equal their true values, in which case the updated error statistics do not affect the analysis. From the second line of Equation (44a), if one scaling parameter is overestimated, then the other will be underestimated. Turning now to the D05 method, using Equation (B11) with Equations (20a) and (20b) yields the expectation values for the D05 variance scaling parameters. As with the DI01 variance scaling parameters, we see that the divergence of the expectation values of the D05 variance scaling parameters from their true values is proportional to terms that measure the spectral overlap between HBH^T and R̃. For the weightings chosen in the D05 method, the spectral overlap term is given by Tr[HBH^T − F̂HBH^T]. As for the DI01 estimates, if one scaling parameter is overestimated, then the other will be underestimated.
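The structure of these expectation values can be illustrated with a deterministic sketch (H = I and an illustrative Gaussian correlation model; this evaluates E[ŝ_b] = ⟨HBH^T, E[D]⟩/⟨HBH^T, D̃⟩ directly, which is our interpretation of the first line of Equation (44a) under these assumptions, exploiting the fact that only the numerator is random).

```python
import numpy as np

# Deterministic sketch of the bias in a DI01-type scaling estimate:
# since the denominator <HBH^T, D_mod> is deterministic, E[s_b] follows by
# replacing the observed innovation covariance with its expectation
# s_b^t * B + s_o^t * R. The estimate is unbiased when <HBH^T, R> vanishes
# or when s_b^t = s_o^t.

def weighted(A, C, W):
    return np.trace(A @ W @ C @ W)

n = 40
x = np.arange(n, dtype=float)
dist = np.minimum(np.abs(x[:, None] - x[None, :]), n - np.abs(x[:, None] - x[None, :]))
B = np.exp(-(dist / 3.0) ** 2)     # modelled background covariance (illustrative)
R = np.eye(n)                      # modelled observation-error covariance
W = np.linalg.inv(B + R)           # weighting built from the modelled D

def expected_s_b(s_b_true, s_o_true):
    D_true = s_b_true * B + s_o_true * R
    return weighted(B, D_true, W) / weighted(B, B + R, W)

print(expected_s_b(2.0, 2.0))  # equal true scalings: unbiased, gives 2.0
print(expected_s_b(2.0, 0.5))  # unequal scalings: biased away from 2.0
```

The second call is pulled below 2.0 because the positive overlap term ⟨B, R⟩ mixes part of the (smaller) observation scaling into the background estimate, mirroring the compensation between the two parameters noted above.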
In this example, we have assumed that the modelled error covariances differ from their true values only by a scaling factor. As the HL method assumes spatially uncorrelated observations, this example implies that this assumption is satisfied. In this case, the expectation values of the HL variance scaling parameters s_b^HL and s_o^HL are equal to their true values, although we emphasize that this only applies to cases where the true observation-error covariances are uncorrelated. If the true observation uncertainties are spatially correlated, then the relations in Equations (46a) and (46b) would not hold. For the DI01 and D05 methods, any spectral overlap between the background- and observation-error covariances causes some amount of mis-attribution of the source of the innovation covariance. In contrast, if the observation-error covariance is truly uncorrelated in space, the HL method correctly attributes all spatial correlations in the innovation covariance to the background. Another way to express this is by noting that the inner product ⟨V•D, I•D⟩ vanishes for any diagonal weighting matrix. This inner product serves a role in the HL method analogous to that of the inner product ⟨HB̃H^T, R̃⟩ in the DI01 and D05 methods (weighted by the appropriate factor). While the HL method does not make the types of attribution errors made by the DI01 and D05 methods, it is limited by the assumption of spatially uncorrelated observation errors.
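The role of the Frobenius inner product as a separability measure can be sketched numerically. In the snippet below, `gaussian_corr` and `frobenius_angle_deg` are our own illustrative helpers; it assumes unit error variances, a Gaussian background correlation, uncorrelated observation errors, and uses the unweighted Frobenius inner product rather than the weighted versions discussed in the text.

```python
import numpy as np

def gaussian_corr(n, dx, length):
    """Homogeneous Gaussian correlation matrix on a 1D periodic domain."""
    i = np.arange(n)
    sep = np.abs(i[:, None] - i[None, :])
    d = np.minimum(sep, n - sep) * dx          # grid distance on the circle
    return np.exp(-0.5 * (d / length) ** 2)

def frobenius_angle_deg(A, B):
    """Angle between two matrices under the (unweighted) Frobenius inner product."""
    cos = np.trace(A @ B) / (np.linalg.norm(A) * np.linalg.norm(B))
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

n, dx = 200, 40.0                              # 200 observations spaced 40 km apart
R = np.eye(n)                                  # uncorrelated obs errors (unit variance)
angles = {Lb: frobenius_angle_deg(gaussian_corr(n, dx, Lb), R)
          for Lb in (40.0, 200.0, 1000.0)}
for Lb, ang in angles.items():
    print(f"L_b = {Lb:6.0f} km  ->  angle = {ang:5.1f} deg")
```

Longer background correlation lengths push the angle toward 90 degrees, i.e. the two covariance contributions become easier to tell apart in the innovation covariance.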
Expressions for the variances of the scaling factors for these methods in this example can be found in Appendix D.

COMPARISON BETWEEN METHODS
In this section, we compare the scaling parameter estimates made using the DI01, D05, and HL methods for the example of homogeneous covariances defined on a 1D periodic domain, which was introduced in Section 8.3. As before, we restrict our analysis to global scalings of the error variances. We assume that both the modelled and true variances are homogeneous and that the correlations are modelled by a Gaussian. As in Section 9, we assume that the true and modelled covariances differ only by a scaling factor (Waller et al. (2016a) give a discussion of results for the D05 method when the true and assumed correlation models differ). In this case, the true scaling parameters are given by s_b^t = σ_b^2∕σ̃_b^2 and s_o^t = σ_o^2∕σ̃_o^2, where σ_b^2 and σ_o^2 are the true background- and observation-error variances, respectively, and σ̃_b^2 and σ̃_o^2 are their modelled counterparts. Therefore, s_o^t∕s_b^t is simply given by (σ_o^2 σ̃_b^2)∕(σ_b^2 σ̃_o^2). For the DI01 and D05 methods, results are displayed for a single iteration of the algorithm, as only one iteration is performed in most practical implementations. All plots in this section use a domain length of L = 40,000 km. This section presents results for both the expectation values and the variances of the scaling parameters for each method, both of which were calculated analytically using the relations in Section 9 and Appendix D. While the expressions for the expectation values do not require the innovations to be normally distributed, the analytic formulae for the variances only hold in general if the innovations are normally distributed. As such, we assume in this section that the innovations are normally distributed.
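To make the fitting interpretation concrete, the following sketch fits the modelled innovation covariance s_b HB̃H^T + s_o R̃ to a sample innovation covariance using a plain (unweighted) Frobenius-norm least squares. This is a simplified stand-in for the weighted fits performed by the DI01 and D05 methods, with all parameter values chosen purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def gaussian_corr(n, dx, length):
    """Homogeneous Gaussian correlation matrix on a 1D periodic domain."""
    i = np.arange(n)
    sep = np.abs(i[:, None] - i[None, :])
    d = np.minimum(sep, n - sep) * dx
    return np.exp(-0.5 * (d / length) ** 2)

n, dx = 100, 40.0
G = gaussian_corr(n, dx, 400.0)        # modelled HBH^T (unit background variance)
R = np.eye(n)                          # modelled R (unit observation variance)
sb_true, so_true = 0.8, 1.6            # true scalings: D_true = sb*G + so*R

# Sample innovation covariance from innovations d ~ N(0, D_true)
D_true = sb_true * G + so_true * R
d = rng.multivariate_normal(np.zeros(n), D_true, size=5000)
D_sample = d.T @ d / d.shape[0]

# Normal equations for min_{sb,so} ||D_sample - sb*G - so*R||_F^2
M = np.array([[np.sum(G * G), np.sum(G * R)],
              [np.sum(G * R), np.sum(R * R)]])
rhs = np.array([np.sum(G * D_sample), np.sum(R * D_sample)])
sb_hat, so_hat = np.linalg.solve(M, rhs)
print(sb_hat, so_hat)
```

Note that the off-diagonal entry of the normal-equation matrix is exactly the inner product ⟨G, R⟩: the larger the overlap between the background and observation covariances, the worse conditioned the fit becomes.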
We begin by considering the case where both the modelled and true observation errors are spatially uncorrelated. Figure 9 shows the expectation values and square roots of the variances of the scaling parameters for each method as a function of background correlation length L_b for two different choices of modelled and true error variances. The left panels show a case where s_o^t∕s_b^t = 0.5, while the right panels have s_o^t∕s_b^t = 2. The observation spacing is set at Δx = 40 km for all panels. For the DI01 and D05 algorithms, the expectation values of the scaling parameters approach their true values as L_b increases, since this decreases the spectral overlap between HB̃H^T and R̃. As the true observation errors are uncorrelated in this case, the expectation values of the HL variance scaling parameters equal their true values. As noted in Section 9, for the DI01 and D05 methods, if one scaling parameter is overestimated then the other is underestimated; in this example, which parameter is over- and which is underestimated switches from the left to the right panels as the values of s_o^t∕s_b^t − 1 and s_b^t∕s_o^t − 1 switch sign. Interestingly, increasing L_b increases the variances of many of the scaling parameters over this range of background correlation lengths. While the expectation values of the HL scaling parameters equal their true values in this example, their variances are larger than those estimated by both the DI01 and D05 methods. Although the expectation value of the D05 background scaling parameter was closer to the truth than that estimated with the DI01 method in this example, the opposite was true for the observation scaling parameters. The variances of the estimates of the background scaling parameter were larger for the D05 method than for the DI01 method, while the variances of the observation scaling estimates were similar for these two methods.

FIGURE 9 Statistics for scaling parameters as a function of background-error correlation length L_b with spatially uncorrelated modelled and true observation errors. In each panel, the top plot shows the expectation value and the bottom plot shows the square root of the variance of each parameter estimate, both expressed as a percentage. The solid and dashed curves denote the background and observation scaling parameters, respectively, and the colours of the curves denote the estimation algorithm. The cases shown use Δx = 40 km and L = 40,000 km.
The expectation values for the HL parameters diverge from their true values if the true observation errors are spatially correlated, a case illustrated by Figure 10. In this figure, as in the previous example, the estimates using the DI01 and D05 methods use modelled observation covariances that differ from their true values by only a scaling coefficient and so include these spatial correlations. However, since the HL method cannot handle spatially correlated observation errors, the estimates using the HL method do not include a modelled observation-error correlation. The parameter values used for Figure 10 are the same as those used in the left panel of Figure 9, but with observation-error correlations modelled by a Gaussian with correlation length L_o = 40 km. As seen in the figure, the divergences of the expectation values of the HL scaling parameters from their true values are of the same order of magnitude as those of the estimates from the other methods, but the HL method overestimates where the other methods underestimate, and vice versa.
In Section 8.3, it was shown that the spectral overlap of the background- and observation-error covariances is strongly dependent on the observation spacing. Figure 11 shows the dependence of the estimates of the scaling parameters on the observation spacing Δx. In this figure, both the modelled and true observation errors are spatially uncorrelated. As the observation spacing increases, the expectation values of the scaling parameters from both the DI01 and D05 methods diverge from their true values. In this range of observation spacings, the variances of the observation-error scalings from all three methods increase with Δx, while the variances of the background scaling parameters increase with Δx for the HL method but decrease for the DI01 and D05 methods.
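This dependence on observation spacing can be illustrated with a rough numerical sketch (our own, assuming unit variances and a Gaussian background correlation); the normalized overlap below is the cosine of the Frobenius angle between HB̃H^T and R̃, so larger values mean a worse-conditioned separation.

```python
import numpy as np

def gaussian_corr(n, dx, length):
    """Homogeneous Gaussian correlation matrix on a 1D periodic domain."""
    i = np.arange(n)
    sep = np.abs(i[:, None] - i[None, :])
    d = np.minimum(sep, n - sep) * dx
    return np.exp(-0.5 * (d / length) ** 2)

L_dom, Lb = 40000.0, 400.0            # domain length and background length-scale (km)
cos_overlap = {}
for dx in (40.0, 100.0, 400.0):
    n = int(L_dom / dx)               # observations uniformly spaced by dx
    G = gaussian_corr(n, dx, Lb)      # background covariance at the obs locations
    R = np.eye(n)                     # uncorrelated obs errors
    cos_overlap[dx] = np.trace(G) / (np.linalg.norm(G) * np.linalg.norm(R))
for dx, c in cos_overlap.items():
    print(f"dx = {dx:5.0f} km  ->  cos(angle) = {c:.3f}")
```

As the spacing grows toward the background length-scale, the sampled background covariance looks increasingly diagonal and the overlap with the white observation-error covariance increases, consistent with the divergence of the DI01 and D05 estimates.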

WEIGHTING THE COST FUNCTIONS
We have seen that, although the numerics of the DI01 and D05 methods differ, as does their ease of implementation, both fundamentally solve the same least-squares problem but with different weightings. The choices of weightings and their corresponding variance scaling coefficients for some of the methods examined in this paper are summarized in Table 1. The third and fourth lines of Table 1, which both correspond to weightings for the χ² diagnostic, are another example of non-unique weightings for an algorithm. (As a diagnostic, the χ² diagnostic technically does not have scaling parameters, but is labelled as such in Table 1 because the diagnostic equation is recovered from Equations (23a) and (23b) for those choices of weightings.) While more choices for the weightings are possible, many yield expressions for the scaling parameters that are difficult to evaluate in an operational setting, particularly in the evaluation of the trace in the denominators of Equations (23a) and (23b). As such, the choices of weightings in the DI01 and D05 algorithms are appropriate for practical reasons, not necessarily because they are optimal in some theoretical sense.
As the variance scaling parameters are linear least-squares coefficients, ideally we would use a weighting that corresponds to the covariance of the data being fitted. In our case, this would be the covariance of an estimate of the filtered innovation covariance. In the scalar case, the variance of the sample variance of samples drawn from a normal distribution is proportional to the square of the true variance. Symbolically, if S² is the unbiased sample variance for a set of independent and identically distributed random scalars drawn from a normal distribution with zero mean and variance σ², then E[S²] = σ² and V[S²] ∝ σ⁴ = E[S²]², where V[⋅] is the variance operator. In the multidimensional case, the sample covariance of vectors independently drawn from a multivariate normal distribution follows a Wishart distribution, the variances of which are proportional to a sum of products of two covariance elements. In both the scalar and multivariate cases, the variances of the sample (co)variances are proportional to products of the (co)variances themselves. Therefore, ideally the cost functions that fit modelled to observed filtered innovation covariances would use weightings proportional to the inverse of products of the same filtered innovation covariances being fitted. The weightings in Table 1 are inverses of Kronecker products of either the filtered covariances themselves, covariances with a similar structure (such as the unfiltered covariances), or the identity. Although in most cases the weightings do not exactly match what would be used in the ideal case (the covariance of the Wishart distribution for D_B or D_R), when the spectral overlap between the background- and observation-error covariances is small, the weightings in Table 1 have a similar structure to the ideal weighting.
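The scalar statement above is easy to check by Monte Carlo (a sketch; the sample size, number of trials, and seed are arbitrary choices of ours):

```python
import numpy as np

rng = np.random.default_rng(1)
n, trials, sigma2 = 20, 50000, 3.0

# Draw many independent samples of size n and compute each sample variance
x = rng.normal(0.0, np.sqrt(sigma2), size=(trials, n))
s2 = x.var(axis=1, ddof=1)                 # unbiased sample variance per trial

print(s2.mean())                           # close to sigma2
print(s2.var())                            # close to 2*sigma2**2/(n - 1)
```

The empirical variance of S² matches 2σ⁴/(n − 1), confirming the proportionality to the square of the true variance that motivates the ideal weighting.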
We briefly note that the inverses of the weightings in Table 1 have the same form as the covariance of a matrix normal distribution for matrix-valued variables (Gupta and Nagar (2018) give a review of matrix variate distributions).
To illustrate the weightings used for the DI01 and D05 algorithms, we return to the example of a 1D periodic domain introduced in Section 8.3. In this example, the background-error correlations are modelled with a second-order autoregressive (SOAR) model instead of the Gaussian model, while the observation errors are spatially uncorrelated. The top plot of Figure 12 shows the relative weightings in spectral space used in the cost functions associated with the DI01 and D05 algorithms for this example, where the weightings have been scaled so that the weighting of the lowest wavenumber is equal to unity (the absolute scaling of the weights is inconsequential). The bottom plot shows the filter functions for comparison. In this example, the filter F̃ (represented by the (1 + λ)^{-1} curve in spectral space) dampens high wavenumbers. Since the square roots of the s_b weightings are roughly inversely proportional to the innovation covariance filtered by F̃, modes at high wavenumbers are given much higher weights than those at lower wavenumbers when determining s_b. The same is true for the s_o weightings at lower wavenumbers, where the filter I − F̃ has low amplitude and the weightings are large compared with those at higher wavenumbers. This is evident for both the DI01 and D05 algorithms.

FIGURE 12 Weightings for the DI01 and D05 cost functions, including J_R^D05. The observation errors are spatially uncorrelated and a SOAR model is used for the background-error correlations. The model parameters are L = 40,000 km, L_b = 1,000 km, Δx = 40 km, and σ̃_b²∕σ̃_o² = 2.
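The spectral quantities in this example can be sketched as follows, assuming homogeneous covariances so that the circulant eigenvalues are obtained by an FFT of the first matrix row. The helper name and the exact filter convention are ours: we take λ as the ratio of observation to background spectral densities, so that the (1 + λ)^{-1} response dampens high wavenumbers as described above.

```python
import numpy as np

def soar_row(n, dx, length):
    """First row of a circulant SOAR correlation matrix on a periodic domain."""
    i = np.arange(n)
    d = np.minimum(i, n - i) * dx
    return (1.0 + d / length) * np.exp(-d / length)

L_dom, Lb, dx = 40000.0, 1000.0, 40.0
n = int(L_dom / dx)
sb2, so2 = 2.0, 1.0                    # modelled variances with ratio sb2/so2 = 2

b_k = sb2 * np.real(np.fft.fft(soar_row(n, dx, Lb)))  # background spectrum
r_k = so2 * np.ones(n)                 # flat spectrum of uncorrelated obs errors
lam = r_k / b_k
F = 1.0 / (1.0 + lam)                  # filter response b/(b + r) = (1 + lam)^(-1)

print(F[0], F[n // 2])                 # F dampens high wavenumbers; I - F, low ones
```

Here F is close to unity at the lowest wavenumber, where the background spectrum dominates, and close to zero at the highest resolved wavenumber, where the flat observation-error spectrum dominates.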

CONCLUSIONS
We have shown that for the case of estimating error variances, the DI01, D05, and HL methods can all be associated with a pair of cost functions that measure the mismatch between parts of the modelled and observed innovation covariances. In order for these methods to be successful, these parts must in some way isolate the contributions of the background and observations to the innovation covariance. The DI01 and D05 algorithms both use the filter that produces the analysis to make this separation and as such are categorized as a posteriori methods, which provides some degree of computational savings when analyses are already available. In contrast, the a priori HL method separates the background and observational contributions to the innovation covariance by attributing all spatial correlations to the background. Estimates for variance scaling factors from each method are obtained by minimizing their respective cost functions. Since each cost function measures the mismatch between observed and modelled covariances, this minimization is a fitting procedure and the variance scaling factors estimated are best-fit values. While this fitting procedure is done explicitly in the HL method, it is implicit in the DI01 and D05 methods. From a conceptual point of view, the DI01 and D05 algorithms fundamentally only differ by their choice of weightings in the implicit fitting procedure. As such, the DI01 and D05 algorithms belong to a larger class of algorithms which are differentiated by their weightings. However, the choice of weightings has important implications for the algorithm's numerics and its ability to be implemented in an operational setting.
The degree of success of each method depends on how well it performs the separation between the background and observational contributions to the innovation covariance. For the DI01 and D05 methods, this depends on the amount of spectral overlap between the background- and observation-error covariances, which can be quantified by the Frobenius inner product between the two covariances. The norms of the error covariances and the angle between them with respect to the Frobenius inner product effectively summarize important features of the error statistics and can quantify how effective the DI01 and D05 methods will be. Our results suggest that the angle between the background- and observation-error covariances with respect to the Frobenius inner product can be used as a scalar measure of identifiability between the two covariances. Using the Frobenius inner product in this manner also allows the covariances to be interpreted geometrically. The success of these methods is also dependent on the distribution of observations: for all methods, the innovation covariance must be sampled at an adequate resolution to resolve the background and observation correlations for the algorithm to be successful.
The analytic expressions for the expectation values and variances of the scaling parameter estimates from each method were determined for the ideal case in which the prescribed covariances differ from their true values solely by a scaling factor. For the DI01 and D05 methods, these expressions indicate that the divergence of the expectation values of the scaling parameters from their true values is proportional to quantities that measure the amount of spectral overlap between the background- and observation-error covariances. In this case, when the observation errors are uncorrelated, the expectation values of the HL scaling parameters equal their true values, but they diverge from their true values when observation-error correlations are present. An examination of the expectation values for this case also shows that, if one scaling parameter is overestimated (either the background- or the observation-error variance scaling), then the other scaling parameter will be underestimated.
The statistical properties of the variance scaling parameter estimates were illustrated using the example of homogeneous covariances defined on a 1D periodic domain for the case when the correlations are properly specified. For this example, results showed that the expectation value of the background scaling parameter estimated using the D05 method was closer to its true value than that estimated using the DI01 method, although the D05 estimate had a larger variance than the DI01 estimate. On the other hand, the expectation value of the observation-error variance scaling parameter for the DI01 method was closer to its true value than that for the D05 method, with the variances of both estimates being similar. For all cases examined, the HL scaling parameter estimates had greater variance than those from both the DI01 and D05 methods.
Although the examples examined in this work were idealized cases, they can provide guidance for real-world applications. By using the actual spatial distribution in two or three dimensions for an observational dataset and estimates of the correlation length-scales of the background and observation errors, the derived analytical expressions for the expectation values and variances of the error variance scaling parameters can be evaluated. These estimates can be used to compare different properties of each method for that particular observational dataset. For instance, one could determine which method yields the best estimates for the expectation values or the lowest variances of the scaling estimates. One could also use these expressions to check whether the preferred method depends on which error variance is being tuned; for example, in the examples used in Section 10, it was found that the expectation value of the D05 background-error scaling parameter was closer to its true value than the DI01 estimate, but the opposite was true for the observation-error scaling parameter. Therefore, in applications where only one parameter (either the background or observation scaling) is of interest, the expressions in Section 10 and Appendix D can be used to suggest which method will be more beneficial for that particular situation.
The conceptual framework developed in this work (the results of Sections 5 to 7) does not assume any particular form (homogeneous, isotropic, or otherwise) for the true or modelled error covariances, other than for the HL method, which assumes uncorrelated observation errors. A particular form for the true covariances was only introduced in Section 9 so that the performance of each method could be evaluated through a comparison of the tuned error variances with their true values. This was accomplished by assuming that the modelled error correlations were equal to their true values, except for the HL estimates in the example illustrated in Figure 10, which had observation errors with non-negligible spatial correlations. We note that, in cases where the error correlations are not properly specified, there is some ambiguity in defining the true error variance scaling factors that were defined in Section 9 and used to evaluate the performance of each method. For instance, one may define the true error variance scalings as the parameters that best fit the modelled error variances to the true error variances. Alternatively, one could define them as the parameters that minimize the difference between the full modelled and true error covariances; in other words, parameters s_b^t and s_o^t that minimize ||B^t − s_b^t B̃||² and ||R^t − s_o^t R̃||² with some appropriate weighting. As such, the decomposition of the error covariance mis-specification into variance and correlation length-scaling parameters is not unique, and care must be taken when interpreting results where both true variance and correlation parameters are defined.
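This non-uniqueness is easy to demonstrate directly (an illustrative sketch with parameter values of our own choosing): when the modelled correlation length-scale is wrong, the scaling that matches the variances differs from the scaling that best fits the full covariance in the Frobenius sense.

```python
import numpy as np

def gaussian_corr(n, dx, length):
    """Homogeneous Gaussian correlation matrix on a 1D periodic domain."""
    i = np.arange(n)
    sep = np.abs(i[:, None] - i[None, :])
    d = np.minimum(sep, n - sep) * dx
    return np.exp(-0.5 * (d / length) ** 2)

n, dx = 100, 40.0
B_true = 2.0 * gaussian_corr(n, dx, 400.0)   # true variance 2, length-scale 400 km
B_mod  = 1.0 * gaussian_corr(n, dx, 200.0)   # modelled: unit variance, wrong length

s_var = 2.0 / 1.0                            # scaling that matches the variances alone
# Scaling minimizing ||B_true - s*B_mod||_F^2 (unweighted Frobenius fit)
s_cov = np.sum(B_mod * B_true) / np.sum(B_mod * B_mod)
print(s_var, s_cov)                          # the two definitions disagree
```

With a correctly specified correlation model the two definitions coincide; here the full-covariance fit inflates the scaling to compensate for the too-short modelled length-scale.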
As previously noted, although this work concentrates on refining estimates of the error variances, certain aspects of this work can be generalized for parametrizations of the error correlations. For instance, the generalized DI01 relations as expressed in Equations (14a) and (14b) and the D05 cost functions as written in Equations (21a) and (21b) are independent of parametrization and so could be used for correlation estimations. However, unlike for scalings of the error variances, in these cases the cost functions in Equations (21a) and (21b) will in general not be quadratic in the parameter variables, and the best-fit parameters obtained from their minimization may not have simple analytical expressions. While the absence of simple analytical expressions may make the fitting processes more difficult to interpret, best-fit correlation parameters may nonetheless be obtained by numerically minimizing the cost functions in Equations (21a) and (21b). As with the error variance scalings, the particular weights chosen for the cost functions will affect both the best-fit correlation parameter values obtained and the ease of implementing the algorithm in practical settings.
How to cite this article: Sitwell M, Ménard R. Framework for the comparison of a priori and a posteriori error variance estimation and tuning schemes. Q J R Meteorol Soc. 2020;146:2547–2575. https://doi.org/10.1002/qj.3805

APPENDIX A: SIMULTANEOUS DIAGONALIZATION OF COVARIANCE MATRICES
In this appendix, we describe the process for constructing the operator that simultaneously diagonalizes both HB̃H^T and R̃. This standard mathematical procedure is described in mathematics textbooks such as Noble (1969) and, in the context of data assimilation, can be found in Hollingsworth (1986).
• For a random normal vector x with mean μ and covariance matrix Σ, the expectation value of the quadratic form x^T A x with a symmetric matrix A is

E[x^T A x] = Tr[AΣ] + μ^T A μ. (B11)

In the case that A is not symmetric, then x^T A x = x^T A_s x, where A_s = (A + A^T)∕2 is the symmetric part of A, so for this case Equation (B11) holds if on the right-hand side A is replaced by the symmetric part of A.
• For a random normal vector x with mean μ and covariance matrix Σ, the variance of the quadratic form x^T A x with a symmetric matrix A is

V[x^T A x] = 2 Tr[AΣAΣ] + 4 μ^T AΣA μ. (B12)

As with Equation (B11), if A is not symmetric, then Equation (B12) holds if on the right-hand side A is replaced by the symmetric part of A.
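Both identities are straightforward to verify by Monte Carlo (a sketch with arbitrarily chosen A, Σ, and μ):

```python
import numpy as np

rng = np.random.default_rng(2)

A = np.array([[2.0, 1.0], [1.0, 3.0]])        # symmetric A
Sigma = np.array([[1.0, 0.5], [0.5, 2.0]])    # covariance matrix
mu = np.array([1.0, -1.0])                    # mean vector

# Closed-form moments of q = x^T A x for x ~ N(mu, Sigma)
mean_theory = np.trace(A @ Sigma) + mu @ A @ mu                              # (B11)
var_theory = 2 * np.trace(A @ Sigma @ A @ Sigma) + 4 * mu @ A @ Sigma @ A @ mu  # (B12)

x = rng.multivariate_normal(mu, Sigma, size=200000)
q = np.einsum('ij,jk,ik->i', x, A, x)         # quadratic form for each sample
print(mean_theory, q.mean())                  # theory gives 12.0
print(var_theory, q.var())                    # theory gives 155.0
```

For these matrices the closed forms give E[q] = Tr[AΣ] + μᵀAμ = 9 + 3 = 12 and V[q] = 2·63.5 + 4·7 = 155, which the sample moments reproduce.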

APPENDIX C: MINIMIZATION OF COST FUNCTIONS
This section contains details relating to the minimization of quadratic cost functions of the forms encountered in the main text. Consider a cost function J defined by

J(s) = (1∕2)||sA − B||^2_{W_1⊗W_2}
     = (1∕2)Tr[(sA − B)W_1(sA − B)W_2]
     = (1∕2)s^2 Tr[AW_1AW_2] − s Tr[AW_1BW_2] + (1∕2)Tr[BW_1BW_2], (C1)

where A and B are symmetric matrices, s is a scalar, and W_1 and W_2 are positive-definite weightings. The second line in Equation (C1) follows from the definition of the weighted Frobenius inner product given in Equation (11), and in the third line we have used the cyclic property of the trace operator. The first derivative of J with respect to s is

∂J∕∂s = s Tr[AW_1AW_2] − Tr[AW_1BW_2], (C2)

and setting this derivative to zero yields the best-fit value

ŝ = Tr[AW_1BW_2]∕Tr[AW_1AW_2] = ⟨A, B⟩_{W_1⊗W_2}∕||A||^2_{W_1⊗W_2}. (C3)
The second derivative of the cost function J with respect to s is

∂^2J∕∂s^2 = Tr[AW_1AW_2] = ||A||^2_{W_1⊗W_2}. (C4)

As this second derivative is positive for all values of s, s = ŝ is the minimum of J.
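A small numerical check of this minimization (the helper names are ours; J is defined here with a factor of 1/2 so that its second derivative is Tr[AW₁AW₂], matching Equation (C4)):

```python
import numpy as np

rng = np.random.default_rng(3)

def weighted_inner(A, B, W1, W2):
    """Weighted Frobenius inner product <A, B>_{W1 (x) W2} = Tr[A W1 B W2]."""
    return np.trace(A @ W1 @ B @ W2)

def J(s, A, B, W1, W2):
    """Cost J(s) = 0.5 * ||s*A - B||^2 in the weighted Frobenius norm."""
    E = s * A - B
    return 0.5 * np.trace(E @ W1 @ E @ W2)

def s_hat(A, B, W1, W2):
    """Closed-form minimizer <A, B> / ||A||^2 from setting dJ/ds = 0."""
    return weighted_inner(A, B, W1, W2) / weighted_inner(A, A, W1, W2)

n = 4
M = rng.normal(size=(n, n))
A = M + M.T + 2 * n * np.eye(n)               # symmetric test matrix
B = 1.7 * A                                   # B exactly proportional to A
W = rng.normal(size=(n, n))
W1 = W @ W.T + np.eye(n)                      # positive-definite weighting
W2 = np.eye(n)

s = s_hat(A, B, W1, W2)
print(s)                                      # recovers 1.7 for any valid weighting
```

When B is exactly proportional to A, the minimizer recovers the proportionality constant regardless of the weighting chosen, which is why the different weightings in the text all yield the true scaling in the perfectly specified case.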

APPENDIX D: VARIANCES OF SCALING PARAMETER ESTIMATES
This appendix contains expressions for the variances of the scaling parameter estimates derived by the DI01, D05, and HL methods, introduced in Sections 5, 6, and 7, respectively. These variances are calculated under the same assumptions as used for calculating the expectation values in Section 9, namely that the modelled covariances differ from their true values by only an overall scaling factor. The variances are calculated by use of Equation (B12) in Appendix B, which, unlike the calculation of the expectation values, assumes that the innovations d follow a normal distribution.
For the DI01 algorithm, the variances follow from Equation (B12), where in the last lines we have used Equations (3a) and (3b) to transform into the simultaneously diagonalized basis. The variances for the D05 algorithm and for the HL method are obtained in the same manner from Equation (B12) applied to their respective estimators.