A Novel Localized Fast Multipole Method for Computations With Spatially Correlated Observation Error Statistics in Data Assimilation

Several observation types (e.g., geostationary satellite and Doppler radar observations) have recently been found to exhibit strong spatial error correlations. Including these error statistics in data assimilation for numerical weather prediction can improve analysis quality and forecast skill. Moreover, it allows for increases in the spatial density of observations assimilated, which is needed for the provision of information on appropriate scales for high‐resolution forecasting. However, introducing correlated error statistics may increase the computational complexity and parallel communication costs of matrix‐vector products involving observation precision matrices (inverse observation error covariance matrices). Without new approaches, we cannot take full advantage of new observation uncertainty estimates. We develop a new numerical approximation method based on a particular type of fast multipole method and a domain localization approach. The basic idea is to divide the observation domain into boxes and then separate calculations of matrix‐vector products according to the partition. These calculations can be done in parallel with very low communication overheads. The new method is easy to implement and parallelize, and it is applicable to a wide variety of observation precision matrices. We applied the new method to a simple variational data assimilation problem and found that the computational cost of the variational minimization was dramatically reduced while preserving analysis accuracy across a range of scales. The new method has the potential to be used as an efficient technique for practical applications where a large number of observations with mutual error correlations need to be assimilated quickly.


Introduction
Data assimilation is essential for numerical weather prediction (NWP) as it provides the initial conditions for computer models that simulate the evolution of the atmosphere (e.g., Lorenc et al., 2000;Nichols, 2010).In addition, data assimilation is used to create climate reanalyzes (e.g., Bollmeyer et al., 2015;Hersbach et al., 2020).Data assimilation blends observations and model forecasts by taking account of their uncertainties.Observation uncertainties are described by an observation error covariance matrix, R ∈ R m×m , where m is the number of observations (Hodyss & Satterfield, 2017;Janjić et al., 2018).In practical data assimilation applications, the matrix R is often treated as block diagonal with each block containing error statistics of one type of observation (for a given time).This is because observation errors between observations of different types are usually assumed to be uncorrelated.Moreover, observation errors between observations of the same type may be mutually uncorrelated, resulting in diagonal blocks in the matrix R, or mutually correlated, leading to non-diagonal blocks.A widely adopted approach to estimate the matrix R is to use the diagnosis approach of Desroziers et al. (2005).This approach is an indirect sampling approach that uses a large sample of assimilation residuals.For a given sample size, the sampling error of this approach is affected by the accuracy of the observation and background error covariance matrices specified in the data assimilation algorithm (Hu & Dance, 2024).
Previous studies have shown that observations such as Doppler radar radial winds (DRWs), geostationary satellite data and atmospheric motion vectors (AMVs) can exhibit strong spatial error correlations, leading to nondiagonal blocks of the matrix R. For example, errors in the DRWs used by the Met Office and Deutscher Wetterdienst (DWD) NWP systems were found to have a horizontal correlation length of 12-50 km (Waller, Simonin, et al., 2016;Waller et al., 2019;Zeng et al., 2021).The spatial error correlation lengthscale estimated for observations from the Spinning Enhanced Visible and Infrared Imager (SEVIRI) used in the Met Office and Météo-France varies between 30 and 250 km (Michel, 2018;Waller, Ballard, et al., 2016).The horizontal error correlation lengthscale for the AMVs assimilated within the Met Office UKV model 3D-Var FGAT system is found to be around 140-210 km (Cordoba et al., 2017).Moreover, Bormann et al. (2003) found that some AMVs can have a correlation lengthscale up to 800 km or even broader for tropical regions in the ECMWF global data assimilation system at that time.These findings have important implications for better use of the observations with strong spatial error correlations.By directly including correlated error statistics in the assimilation scheme, we are able to use a shorter observation-thinning distance or make full use of the information from high-resolution observations (Bell et al., 2020;Fowler et al., 2018;Rainwater et al., 2015;Waller, Simonin, et al., 2016).This has been shown to improve both analysis accuracy and forecast skill in idealized data assimilation systems (e.g., Healy & White, 2005;Stewart et al., 2008Stewart et al., , 2013) ) as well as operational systems (e.g., Fujita et al., 2020;Simonin et al., 2019;Yeh et al., 2022).In addition, assimilation of dense observations (e.g., Doppler radar observations) without thinning is important for convection-permitting NWPs as they require information on small scales (Waller et al., 2021).Convection-permitting NWP is key for forecasting extreme weather events such as heavy rainfall (e.g., Clark et al., 2016;Dance et al., 2019).The timing of these extreme events is often difficult to capture in a model that does not resolve small-scale weather processes (e.g., Hu et al., 2019;Hu & Franzke, 2020).
A potential problem with including spatially correlated observation error statistics in data assimilation procedures is that matrix-vector products involving the inverse of the matrix R can be computationally expensive.These matrix-vector products arise in the observation penalty term of the variational data assimilation cost function and its gradient with the form (see more details in Section 2) where A ∈ R m×m denotes the observation precision matrix (the inverse of the matrix R ∈ R m×m ) and d ∈ R m denotes the observation-minus-model departure vector, where x ∈ R n is the model state vector, y ∈ R m is the observation vector and H is the observation operator that maps the model state vector to the observation vector (i.e., H : R n → R m ).The vector d is called the innovation vector if x = x b .In this study, we do not address the problem of inverting the matrix R. Instead, we assume that the matrix A is known and focus on fast computation of the matrix-vector products involving the matrix A. If the matrix R is diagonal, then the matrix A is also diagonal.In this case the computation of Equation 1 requires only m floating point operations, and it can be perfectly parallelized without incurring communication costs.However, if the matrix R is non-diagonal, then the matrix A is usually a full matrix.In this case, the work required to calculate Equation 1 increases to O m 2 ) floating point operations.Moreover, the bigger problem is that the parallelization of matrix-vector multiplications with full matrices requires a large number of communications if matrix and vector elements are distributed across multiple processors (Deng, 2012;Grama et al., 2003).
An efficient way to calculate Equation 1 for a non-diagonal matrix is to decompose the matrix R into a lower triangular matrix and an upper triangular matrix (e.g., using the Cholesky decomposition; Golub & Van Loan, 1996).This approach avoids the need to explicitly invert the matrix R and allows the linear system to be solved by forward and backward substitution.For correlated inter-channel error covariance matrices, the Cholesky decomposition approach is compatible with the parallelization strategy used for diagonal R matrices because the observations with mutually correlated errors are typically distributed in one processing element (PE; Simonin et al., 2019;Weston et al., 2014).However, for spatially correlated error covariance matrices, the Cholesky decomposition approach may be incompatible with this parallelization scheme because observations may be allocated to different PEs according to their geophysical locations, and hence excessive communications between PEs are required.To deal with this, Simonin et al. (2019) proposed a pragmatic strategy for operational applications where observations with mutually correlated errors were assigned to a single PE.With this new allocation of observations, the Cholesky decomposition approach can be easily applied locally in each PE without the need for communication.However, the strategy is only applicable to moderate numbers of mutually correlated observations.
Several other studies also addressed correlated observation errors in data assimilation.Guillet et al. (2019) modeled spatially correlated observation errors using diffusion operators, which facilitates a convenient operator for the inverse observation error covariance matrix.However, it is not clear how this approach can be combined with a treatment of satellite inter-channel error correlations.In another approach, eigendecomposition has been used to approximate the matrix R (e.g., Anderson, 2003;Fisher, 2005;Fowler, 2019;Michel, 2018;Stewart et al., 2013).However, using too few eigenpairs can cause spurious error correlations, while too many eigenpairs can lead to great computational expense (Fisher, 2005;Stewart, 2010).There have also been a few studies that account for correlated observation errors by transforming the observations into spectral space (Chabot et al., 2015(Chabot et al., , 2020;;Ying, 2020).In addition, Bédard and Buehner (2020) proposed the use of the difference between adjacent observations to indirectly extract the information from observations having spatial error correlations.However, it is unclear how to use this approach to deal with non-uniformly distributed data.
The overall aim of this work is to make progress toward development of a new, approximate approach for fast computations of matrix-vector products that can be used for the treatment of spatially correlated errors in data assimilation.As this is a complicated problem, we choose to make progress in small steps.Thus, the goal of this paper is to examine a novel, hierarchical approximation approach and establish some initial evidence with an idealized model that the method has the potential to provide the key features that it will need.Furthermore, we make a pragmatic assumption that the precision matrix, A, is known, rather than working directly with the error covariance matrix, R.These assumptions allow us to establish that the approximation approach preserves (rather than discards) small scale information in the analysis as well as being fast and accurate, before we invest significant further time in mathematical derivation and numerical implementation of the approach for operational applications.
In our previous work (Hu & Dance, 2021a), we explored the use of a singular value decomposition approach to the fast multipole method (SVD-FMM; Gimbutas & Rokhlin, 2003) for fast calculation of Equation 1 with a full matrix A. The SVD-FMM is a hierarchical decomposition where truncated SVDs are used to compress and decompress portions of the matrix-vector product.The approach is general in that it can be applied to any covariance matrix.In Hu and Dance (2021a), we proposed a possible parallelization scheme for the SVD-FMM in data assimilation applications.We compared its computational cost (including the communication cost) with three distinct standard parallel formulations for computing large matrix-vector products (see Grama et al., 2003, Section 8.1).We carried out a series of numerical experiments to test the performance of the SVD-FMM when using different observation error correlation functions and lengthscales, and different matrix reconditioning techniques.We also tested its performance in the presence of missing observations.We showed that the SVD-FMM can accurately and rapidly compute Equation 1 under various circumstances.However, the complexity of the algorithm of the SVD-FMM may be a disadvantage for its operational use.In this work, we develop a new method based on the SVD-FMM and a domain localization approach.We call this new method the local SVD-FMM.The local SVD-FMM simplifies the multi-level algorithm of the SVD-FMM into a single-level one.The single-level algorithm is easier to implement and parallelize.The domain localization we use is similar to, but not identical to, that used for ensemble Kalman filter methods (e.g., Greybush et al., 2011;Janjić et al., 2011;Nino-Ruiz et al., 2019;Periáñez et al., 2014).With the ensemble filter, localization is mainly used to get rid of spurious correlations (Hamill et al., 2001;Houtekamer & Mitchell, 1998), but it also brings some computational side benefits, including the increase in effective ensemble size (Oke et al., 2007) and the reduction of computational costs (Lorenc, 2003;Petrie & Dance, 2010).With the local SVD-FMM, localization is used to exclude interactions between observations that are beyond a certain distance (given by a non-dimensional localization length parameter; see Section 3.1) in the calculation of the matrix-vector products involving the matrix A. This means only subsets of the matrix A and the vector d are used to calculate the vector q (Equation 1).Mathematically, this is equivalent to setting some of the elements of A to be zero.Since the spatial correlation between two observation errors is expected to decrease as the distance between the observations increases, we anticipate that the numerical error caused by excluding the distant observations in the computation will be small (Eijkhout & Polman, 1988).
A further new contribution of this work is the application of the new method to a variational data assimilation experiment.We apply the local SVD-FMM to an idealized data assimilation problem and demonstrate its accuracy and efficiency across a range of scales in the analysis.
We organize the rest of this paper as follows: In Section 2, we introduce the three-dimensional variational data assimilation algorithm and explain where matrix-vector products involving the matrix A arise in the solution of the variational data assimilation problem.In Section 3, we present the algorithm for the novel local SVD-FMM along with a discussion of accuracy, efficiency and a potential parallelization scheme.In Section 4, we show the accuracy of the local SVD-FMM in computing Equation 1 for different observation error correlation lengthscales, different localization lengths and different numbers of singular vectors.In Section 5, we demonstrate the accuracy and efficiency of the local SVD-FMM in calculating the analysis in a simple variational data assimilation system.
In Section 6, we discuss our results and conclude that the local SVD-FMM has potential as a numerical approximation method that can accelerate matrix-vector multiplication in data assimilation applications where a large volume of observational data with spatially correlated errors needs to be assimilated.

Variational Data Assimilation Algorithm
In this section, we briefly introduce three-dimensional variational data assimilation (3D-Var).We use it as an example to show how the local SVD-FMM can be applied in variational data assimilation.The objective of 3D-Var is to find the value of x that minimizes the cost function where x b ∈ R n denotes the background state vector, B ∈ R n×n denotes the background error covariance matrix and other variables were introduced in Section 1.The minimizer of the cost function is the optimal estimate of the system state that we seek, which is called the analysis and denoted by x a .
In operational data assimilation, the incremental formulation of the cost function is widely used, as it reduces the computational cost of variational data assimilation (Courtier et al., 1994).The increment is defined as δx = x x b .By substituting x = x b + δx into Equation 3, we obtain the incremental cost function where H ∈ R m×n is the observation operator linearized at x = x b .Once we find the value of δx that minimizes Equation 4, we obtain the value of x that minimizes Equation 3 by where δx a is the minimizer of the incremental cost function.
To further reduce the computational cost, the control variable transform (CVT; e.g., Courtier et al., 1998) is applied in operational variational data assimilation, where a new variable is defined as z = L 1 x.The transformation matrix L satisfies B = LL T but is not necessarily the symmetric square root of the matrix B, and it can be a rectangular matrix.Using L = B 1/2 (the symmetric square root of B) Equation 4 reads where δz = B 1/2 δx.It can be seen from Equation 6that the use of CVT avoids the need to invert the matrix B.

Journal of Advances in Modeling Earth Systems
10.1029/2023MS003871 HU AND DANCE The cost function is minimized when its gradient vanishes.Differentiating Equation 6 with respect to δz gives its gradient Solving ∇J = 0 is equivalent to solving the linear system where and The matrix S is called the (preconditioned) Hessian, which must be positive definite to ensure a unique solution to the minimization of the cost function (Golub & Van Loan, 1996, Chapter 5).Since the matrix S is a low-rank update to the identity matrix, it has a theoretical minimum eigenvalue of one (e.g., Tabeart et al., 2022).
To solve Equation 8for δz, we may use the conjugate gradient (CG) method (e.g., Barrett et al., 1994;Tabeart, 2019;Trefethen & Bau, 1997).This iterative method has been widely used in NWP to minimize the cost function.In Section 5, we will use the local SVD-FMM in solving Equation 8, along with the CG method.

The Local SVD-FMM
In this section, we describe the new local SVD-FMM algorithm and propose a simple parallelization scheme for the method.We first introduce a partition of the observation domain and the domain localization approach (Section 3.1).We then separate the matrix-vector product given by Equation 1 into three terms, making use of the localized observation partition (Section 3.2).We finally describe the algorithm for calculating each term and discuss possible parallelization formulations and algorithmic complexity (Sections 3.3-3.5).The reader can refer to Gimbutas and Rokhlin (2003) and Hu and Dance (2021a) for further theoretical explanations and interpretation of the SVD-FMM, which is the basis for the local SVD-FMM.

Partition of the Observation Domain
The local SVD-FMM starts by dividing the observation domain into (approximately) equally-sized boxes.The observations could be regularly distributed or irregularly distributed.However, if observation locations are not evenly distributed, we may consider using boxes of different sizes.This ensures that the number of observations in each box is similar.If the observation domain is three-dimensional, then it can be divided into smaller cubes.
In practical applications, the number of boxes can be determined such that the average number of observations in each box is smaller than a prescribed value.The optimal configuration of boxes may depend on the individual application and computer architecture.Figure 1 shows an example of the observation domain partition, where the grid spacing of observations is 6/14°in the East-West direction (equates to about 30 km at 50.5°N) and 6/11°in the North-South direction (equates to about 30 km), and the observation domain is divided into 8 × 8 boxes.
After observation partitioning, we define four fields for each box.The near field of a box b, denoted by N b , is made of itself and all the boxes that connect to it (see the left panel of

Separation of the Matrix-Vector Product
We first separate the vector q into N box sub-vectors, where N box is the number of boxes used to divide the observation domain.The sub-vector of the vector q corresponding to box b is denoted by q(I b ) = {q(i)|i ∈ I b }, which contains elements of the vector q that correspond to the observations in box b.The vector I b ∈ R m b contains the indices of observations in box b.We then separate the calculation of the vector q(I b ) into three matrix-vector products, where vectors denote the sets of observation indices in N b , L b , and D b , respectively, the matrices denote sub-matrices of the matrix A that are comprised of specific rows and columns of the matrix A given by the corresponding observation indices, and the vectors d( The first matrix-vector product on the right-hand side of Equation 11is the near-field matrix-vector product, which is calculated exactly (see Section 3.3).The second matrix-vector product is the interaction-field matrixvector product, which is approximated using the SVD (see Section 3.4).Thus, the accuracy depends on the number of singular values/vectors used in the approximation (p).The last matrix-vector product is the noninteraction-field matrix-vector product, which is completely discarded (see Section 3.5).Figure 3 shows which elements of the matrix A are involved in the near-field, interaction-field and non-interaction-field matrix-vector products in the calculation of q(I b ) for all boxes.For example, the first panel of Figure 3 shows which elements of the matrix A are used to form A(I b , To provide the reader with further intuition about the approximation, we also plot a row of the matrix A and some example approximated forms in physical space (since matrix elements correspond to observation locations).To do this, we first create the matrix R using the second-order auto-regressive (SOAR) correlation function (e.g., Daley, 1994;Tabeart et al., 2018),  where σ is the error standard deviation, l is the error correlation lengthscale and Δ i,j denotes the great-circle distance between two points.We then use the MATLAB function inv.m (MATLAB INV, 2022) to invert the matrix R. The SOAR covariance matrix is symmetric positive definite and thus invertible.After obtaining the matrix A, we select the row corresponding to the observation located in the center of the domain and plot this row in physical space in the left panel of Figure 4. We find that the magnitude of the elements in this row decreases rapidly from the center and the sign of the elements oscillates.Similar results have been observed in other precision matrices (e.g., Morrison et al., 2022).The middle and right panels of Figure 4 show the same row of the matrix A that is approximated by the local SVD-FMM.The middle panel shows a good approximation of the matrix A where more elements are included in the interaction field and these elements are well approximated by the retained singular values/vectors.In contrast, the right panel shows a bad approximation where more elements are discarded and the interaction-field elements are not well approximated.
Although only the near-field elements of the matrix A are used without approximation, we still need the inverse of the full matrix R.This is because, for a full matrix, the inverse of a matrix sub-block typically depends on all of the other sub-blocks of the matrix (Bernstein, 2009, Chapter 2).In practice, if correlations between observation errors for different observation types are excluded, the matrix R can be considered block diagonal with zero blocks on the off-diagonals and each diagonal block corresponding to a different observation type (e.g., Stewart et al., 2013).In this case, the inverse of each block can be calculated independently, and the local SVD-FMM can be applied to each block separately.

The Near-Field Matrix-Vector Product
The near-field matrix-vector product is expected to contribute most to the results of Equation 11as the elements of the matrix A are expected to decay with distance from the diagonal (Eijkhout & Polman, 1988).In addition, from a physical point of view, since spatial observation error correlation is expected to decrease with distance, the observations in the near field provide more information than those in the interaction and non-interaction fields.Therefore, the near-field matrix-vector product should be computed without any approximation, Equation 13 should only have a very small numerical roundoff error, which is negligible in determining the accuracy of the local SVD-FMM.Since the average size of I b is m/N box and the average size of I N b is 9m/N box , the amount of work for computing Equation 13 for a box is approximately 9(m/ N box ) 2 additions and multiplications.In total, 9m 2 /N box additions and multiplications are required for the computations for all boxes.
Near-field matrix-vector products can be calculated in parallel using a simple strategy.Let PE-b denote a PE that has been assigned to carry out the computation task for box b and let it store the vector d(I b ), and the matrices A(I b , I N b ) and A(I b , I L b ) .To calculate Equation 13, each PE needs to receive elements of d from the PEs holding the data for the neighboring boxes.This means that each PE needs to communicate with at most 8 PEs, as each box can have up to 8 neighbors.We note that the observation domain must be partitioned to allow each PE to have enough memory to store its portion of the vector d and the matrix A. In addition, the load balance between PEs and the balance of communication overheads per PE need to be taken into account.

The Interaction-Field Matrix-Vector Product
The interaction-field matrix-vector product is approximated by a computationally efficient algorithm.This algorithm relies on the SVD for data compression.The SVD is mathematically defined by (e.g., Golub & Van Loan, 1996)  12).The approximated matrices are obtained by approximating the true matrix using the local singular value decomposition approach to the fast multipole method with different numbers of singular values ( p) and localization lengths (h).The corresponding physical locations of near-field elements are given by circles inside the solid line square or at its edge.The corresponding physical locations of interactionfield elements are given by circles outside the solid line square, but inside or at the edge of the dashed line square.The corresponding physical locations of the non-interaction-field elements are given by circles outside the dashed line square.
where M is an n 1 × n 2 matrix, u k ∈ R n 1 is the kth left singular vector, s k is the kth singular value, and v k ∈ R n 2 is the right singular vector.For a square, symmetric positive-definite matrix, the SVD is equivalent to eigenvalue decomposition, and we have u k = v k .
There are four main steps to calculate the interaction-field matrix-vector product: The mathematical detail, algorithmic complexity, parallelization scheme, and communication costs for each step of the interaction-field calculation are presented below.We assume that the data has already been allocated to each PE as described in Section 3.3.
1. Compute p-term truncated SVDs of the interaction-field sub-matrix for each box, 3. Transform the projection on the basis given by left singular vectors into the basis given by right singular vectors and obtain the coefficient of projection, where b ′ ∈ L b is a box in the interaction field of box b and T b, b′ ∈ R p×p is the translation operator between b and b′, which is given by where k, k′ = 1, …, p.It should be noted that v b,k has a larger size than u b′,k′ .Thus, we need to multiply each element of u b′,k′ by the element of v b,k corresponding to the same observation as the element of u b′,k′ .Equation 17requires less than 2(N box p) 2 operations because a box's interaction field always contains less than N box entries.The parallelization of this step is the most expensive compared to the parallelization of the other steps because it requires each PE-b to communicate with all the PEs holding data for the boxes in b's interaction field.The computational cost depends on the number of entries in the interaction fields.
4. Obtain the final result for the interaction-field computation, This step requires 2mp operations and is perfectly parallel.
The computational costs for the truncated SVDs (Equation 15) and the translation operator (Equation 18) have not been discussed above because they can be pre-computed and used for any vector d if the matrix A is unchanged.
The interaction-field computation could become expensive if the truncated SVDs and the translation operator need to be computed frequently.In practice, since the observations vary in space and time, the SVDs need to be recomputed for each assimilation cycle.However, since the observation precision matrix typically remains constant between iterations in a given cycle, the computed singular values and singular vectors can be reused at each iteration step.
We refer to the error arising from the interaction-field computation as the approximation error, which consists of the truncation error of the SVDs and the translation error of the translation operator.The truncation error and the translation error using p singular vectors are both dependent on the (p + 1)th singular values of the submatrices of A (Bernstein, 2009;Gimbutas & Rokhlin, 2003, Fact 9.14.28).We showed in our previous work that the error of the SVD-FMM (without localization) using p singular vectors is smaller if the mean (p + 1)th singular value of the submatrices of the matrix A is smaller (Hu & Dance, 2021a; Figure 9).Thus, the optimal value of p for an application should depend on how fast the singular value spectra decrease.In Section 4, we show numerically how the average of the (p + 1)th singular values of the submatrices of A, as well as the accuracy of the local SVD-FMM, vary with increasing p.

The Non-Interaction-Field Matrix-Vector Product
The non-interaction-field matrix-vector product in Equation 11 is completely discarded, as we expect that this term should have the smallest contribution to the result of q compared to the other two terms.This is because for spatially correlated observation errors (e.g., errors from geostationary satellite and radar observations), the observation error covariance matrix is expected to be banded, with the bandwidth depending on the observation error correlation lengthscale, and the inverses of positive definite band matrices are known to exhibit an exponential decay of their elements away from the main diagonal (Eijkhout & Polman, 1988).A band matrix is a matrix that contains non-zero elements near the diagonal and zero elements far away from the diagonal (Golub & Van Loan, 1996, Chapter 4).We refer to the error caused by discarding the non-interaction-field matrix-vector product as the localization error, which should depend on the observation error correlation lengthscale and the localization length.We give some example numerical results on how the localization error varies with observation error correlation lengthscale.We measure the localization error by the relative difference, where the matrix A is created in the same way as described in Section 3.2, the matrix Â is given by setting the noninteraction-field elements of the matrix A to be zero and ‖⋅‖ F denotes the Frobenius norm. Figure 5 shows that the localization error (RE) increases as the observation error correlation lengthscale increases.Nevertheless, for all the correlation lengthscales considered, the localization error is smaller than 0.24%.
In ensemble filtering, domain localization may cause abrupt transitions between neighboring sub-domains, as the analyses are calculated independently for each sub-domain using different subsets of observations (e.g., Janjić et al., 2011).However, this should not be an issue for the local SVD-FMM because the analysis is calculated for the entire domain.

The Non-Dimensional Localization Length Parameter
In Figure 1, boxes are approximately rectangular with longer sides along the lines of latitudes.As a consequence, they may encompass more observations in the East-West direction than in the North-South direction.If we use the same localization length for both directions, we may include more observations in the East-West direction.To include a similar number of observations in the two directions, we can use different localization lengths for the two directions, as shown in Figure 6.
Moreover, we may also use non-uniform localization lengths to suit the geometry of spatial error correlations.In practical applications, observation error correlations may not be isotropic.For instance, observation error correlation lengthscales for the DRWs vary with the height of the observation and with the distance of the observation away from the radar (Waller, Simonin, et al., 2016).For such spatial error correlations, we may need to set the localization lengths for each box differently.
In general, a longer localization length means fewer matrix elements are discarded, thus improving the accuracy of the local SVD-FMM.However, at the same time, longer localization lengths increase the size of the interactionfield submatrices, thereby increasing the computational cost of the SVD (see Sections 4 and 5).

Computational Costs
The local SVD-FMM reduces the algorithmic complexity of calculating Equation 1.The standard matrix-vector multiplication requires 2m 2 operations.The local SVD-FMM requires less than 9m 2 / N box + 4mp + 2(N box p) 2 operations.The operations for computing Equations 15 and 18 have not been counted, as they only need to be calculated once for the same matrix A.
We further compare the communication costs of the parallelization scheme of the local SVD-FMM with two distinct standard parallel formulations of matrix-vector multiplication.The two formulations start with two different partitions of the matrix.The row-wise (column-wise) partitioning means an equal division of rows (columns) of the matrix between PEs.Elements of the vector are equally allocated among PEs for both cases.With row-wise partitioning, each PE needs to collect the portion of the vector stored in every other PE, which requires an all-to-all broadcast operation.The size of the message to be transferred is m/N PE , where N PE is the number of PEs.With column-wise partitioning, an all-to-one reduction operation is first used to add up the partial results calculated locally by each PE, and a scatter operation is then used to spread the final result to all PEs.The size of the message transmitted in the all-to-one reduction and scatter operations is m and m/N PE respectively.More details on the communication operations can be found in Grama et al. (2003).In the parallelization scheme of the local SVD-FMM, communication is required in the near-field computation and the third step of the interaction-field computation.In the near-field computation, each PE should communicate with up to 8 PEs to collect the elements of the vector d.In the third step of the interaction-field computation, each PE should communicate with a number of PEs to collect the coefficient of projection, Φ b k .The exact number of PEs involved depends on the choice of localization length.These communication tasks can be completed together by an all-to-all broadcast operation, where the size of the message transmitted between a box and its near field is m/N PE , and the size of the message transmitted between a box and its interaction field is p.To compare the communication costs, we can compare the communication operations required and the size of messages to be transferred.An all-to-all broadcast is required in both the parallelization scheme of the local SVD-FMM and the row-wise partitioning formulation.The column-wise partitioning formulation requires two communication operations, in which the scatter operation requires a similar communication time as the all-to-all broadcast (Grama et al., 2003).Due to the smaller size of messages transferred, the  parallelization scheme of the local SVD-FMM requires fewer communication costs than the two standard parallel formulations of matrix-vector multiplication.

The Local SVD-FMM Accuracy Experiments
The accuracy of the local SVD-FMM is determined by the number of singular vectors (p), the localization length (h) and the structure of the matrix A. The structure of the matrix A is affected by the observation error correlation lengthscale (l).In this section, we carry out numerical experiments to investigate the accuracy of the local SVD-FMM in calculating Equation 1 under different values of p, h, and l.We focus on the accuracy of the matrix-vector products, and the effect of the local SVD-FMM on variational data assimilation will be examined in Section 5.

Design of Accuracy Experiments
Our observations are distributed over a region from 54°N to 60°N and 6°W to 6°E with a grid length of approximately 12 km, which results in m = 3,456 observations.We divide the observation domain into 8 × 8 boxes as demonstrated in Figure 1.The side length of the boxes is about 83-98 km.The distribution of our observations is similar to that of the geostationary satellite data over the UK (Waller, Ballard, et al., 2016).A moderate observation grid spacing is chosen to allow our experiments to be completed in a relatively short time.The matrix R is given by the SOAR covariance matrix (Equation 12).The observation error standard deviation is assumed to be 1, and the correlation lengthscale varies between 80 and 160 km with an interval of 10 km.These values are selected based on horizontal error correlations estimated for real observational data (e.g., Cordoba et al., 2017;Waller, Ballard, et al., 2016).The matrix A is obtained by inverting the matrix R using the INV function in MATLAB (MATLAB INV, 2022).The INV function uses an LDL decomposition for Hermitian matrices (Golub & Van Loan, 1996).The vector d is formed using random values that follow a multivariate normal distribution with zero mean and the covariance given by R + HBH T .The matrix H is considered an identity matrix, and the matrix B is also given by the SOAR covariance matrix.The background error standard deviation and correlation lengthscale are 0.6 and 20 km, respectively.These values are chosen based on that estimated for wind observations in km-scale NWP (Ballard et al., 2016).
We measure the accuracy of the local SVD-FMM in calculating Equation 1 using the root mean square error (RMSE), where q std (i) and q fmm (i) denote the ith element of the vector q calculated using the standard matrix-vector multiplication and the local SVD-FMM, respectively.

Experiments With Fixed Localization Length
In these experiments, we investigate the accuracy of the local SVD-FMM in calculating the matrix-vector product when the localization length is fixed to h = 3, while the number of singular vectors and the observation error correlation lengthscale vary.As expected, we found that the RMSE of the local SVD-FMM decreases as more singular vectors are used (see Figure 7).We also note that the RMSE decreases quickly for the first several values of p while decreasing more slowly as its value increases further.This is related to the spectra of the singular values of the sub-matrices of the matrix A. We explained in the last paragraph of Section 3.4 that the approximation error of the local SVD-FMM depends on the (p + 1)th singular values of the submatrices of the matrix A, and in our experiments, the singular values of the submatrices of the matrix A drop rapidly (Figure 8, discussed further below).Since the approximation error approaches a minimum with the use of more singular vectors, the RMSE at p = 10 gives us a rough idea of the size of the localization error in the local SVD-FMM.
Figure 7 also shows that the RMSE of the local SVD-FMM increases with the observation error correlation lengthscale.This is attributed to an increase in the approximation error and the localization error.We have shown in Figure 5 that a larger observation error correlation lengthscale will lead to a larger localization error.The observation error correlation lengthscale also affects the approximation error because it changes the singular values of the sub-matrices of the matrix A. We find in our experiments that if the correlation lengthscale is longer, then the mean singular value of the submatrices of the matrix A is larger for a given k (Figure 8).For a given truncation, p, a larger mean (p + 1)th singular value means a larger approximation error.Therefore, if the correlation lengthscale is longer, the approximation error will be larger.Our previous work has shown that the error of the SVD-FMM (without localization) also increases with the observation error correlation lengthscale (Hu & Dance, 2021a; Figure 10).A similar result should be expected here because the source of the error is the same.
While the singular values of the submatrices of A can be shown mathematically to control the approximation error, we have found that the singular values (or equivalently the eigenvalues) of the full matrix can also provide a rough guide (Hu & Dance, 2021a).Various studies have investigated the behavior of the eigenspectra of correlation matrices with varying correlation lengthscale (e.g., Fowler et al., 2018;Haben et al., 2011).

Experiments With a Fixed Number of Singular Vectors
We now investigate the accuracy of the local SVD-FMM when using 10 singular vectors and varying the localization length and the observation error correlation lengthscale.The localization length is given by a non-dimensional parameter (h) that describes the number of boxes.In these experiments, an approximate physical localization length can be obtained by multiplying h by 90 km (the length of the box).Figure 9 shows that there is a relatively big drop in the RMSE as h increases from 2 to 3, or from 180 to 270 km in metric units.However, when increasing h further, changes in the RMSE are too small to be observed visually.The optimal value of h should depend on the configuration of boxes and the structure of the matrix A.
As with Figures 7 and 9 shows that the RMSE of the local SVD-FMM increases with the observation error correlation lengthscale.We discussed previously that an increase in the observation error correlation lengthscale increases the approximation error and the localization error.An additional piece of information provided by Figure 9 is that the RMSE at h = 7 gives the size of the approximation error because the localization error is zero when h = 7.

Experiments With a Fixed Correlation Lengthscale
We finally carry out experiments of the local SVD-FMM using an observation error correlation lengthscale of 80 km, and different localization lengths  and numbers of singular vectors.Figure 9 shows that increasing the number of singular vectors has a much greater effect on the RMSE than increasing the localization length.This indicates that the approximation error dominates the RMSE of the local SVD-FMM.While Figure 9 shows results using a particular observation error correlation lengthscale, we expect similar results when the observation error correlation lengthscale is increased to a certain value (see Figure 9).

Data Assimilation Experiments
In this section, we conduct variational data assimilation experiments using the local SVD-FMM.We investigate (a) whether the use of the local SVD-FMM affects the number of iterations to solve the cost function, (b) how much wall-clock time is saved by using the local SVD-FMM, and (c) how accurate the analyses obtained using the local SVD-FMM are.Analysis accuracy at different spatial scales is examined because we expect that including correlated observation error statistics will modulate the scales that observation information projects onto in the analysis and thus improve the analysis accuracy at these scales (Fowler et al., 2018;Stewart et al., 2008Stewart et al., , 2013)).We wish to check whether this benefit is maintained with the local SVD-FMM approximation.We do not use Figure 9.As Figure 7, but under different localization lengths and observation error correlation lengthscales.Ten singular vectors are used.Note that the scale used here is different to the one used in Figure 7, but the size and colors of the squares nevertheless increase linearly with the root mean square error.
Figure 10.As Figure 7, but under different localization lengths and different numbers of singular vectors.The observation error correlation lengthscale is 80 km.Note that the scale used here is different to those used in Figures 7 and 9, but the size and colors of the diamonds nevertheless increase linearly with the root mean square error.
parallel computing because the size of the problem in this experiment is moderate and sequential computing is sufficient to provide the results we need for this initial study.

Design of Data Assimilation Experiments
In our variational data assimilation experiment, we solve Equation 8 for δz using the CG method.This is equivalent to solving the two-side preconditioned system (Haben, 2011, Chapter 4), where and is the Hessian of the cost function given by Equation 4.
To solve Equation 8, we need a background error covariance matrix (B), an observation error covariance matrix (R), an observation operator (H) and an innovation vector (d).We assume that our model grid points are distributed over the region shown in Figure 1 with a grid length of approximately 3 km.This leads to a model state vector of size n = 15, 904.The matrix B is then created using the SOAR correlation function given by Equation 12with an error standard deviation of 1 and an error correlation lengthscale of 80 km.We further assume our observations to be distributed with a grid length of about 6 km, which is similar to the footprint of SEVIRI observations in the mid-latitudes.This leads to m = 3,976 observations.The matrix R is created using the first-order auto-regressive (FOAR) correlation function (e.g., Stewart et al., 2013), which is also known as the Markov correlation function and is used for Doppler radar wind observations in operational NWP (Simonin et al., 2019).
The FOAR covariance matrix is where σ, l and Δ i,j are defined in Equation 12.We set the observation error standard deviation to be the same as the background error standard deviation and the observation error correlation lengthscale to 160 km.Since the observation error correlation lengthscale is larger than the background error correlation lengthscale, the observation information will project onto small scales in the analysis (Fowler et al., 2018).The relative position between observations and model grid points is shown in Figure 11.For each observation, our observation operator averages the values at the four nearest model grid points to that observation.This linear observation operator leads to where ϵ o ∈ R m contains observation errors drawn randomly from a Gaussian distribution with zero mean and covariance given by R and ϵ b ∈ R n contains background errors drawn randomly from a Gaussian distribution with zero mean and covariance given by B. Gaussian random numbers are produced using the MATLAB routine mvnrnd.m(MATLAB MVNRND, 2022).Figure 12 shows a sample of background and observation errors in physical space.We note that in this figure the amplitude of the observation error is larger than that of the background error.This is due to the luck of the draw.When using different random number seeds, we observe the opposite relative amplitudes.After obtaining the matrices R, B, and H, we invert the matrix R using the MATLAB routine inv.m (MATLAB INV, 2022), calculate the square root of the matrix B using the MATLAB routine sqrtm.m(MATLAB SQRTM, 2022) and compute the product of the matrices H and B 1/2 .The matrices R 1 and HB 1/2 are then used in the CG method described in the first column of Table 1.The second column of Table 1 identifies where the local SVD-FMM is used in the CG method.We choose a relative tolerance of τ = 10 6 for the CG method, which determines when the iteration terminates.We implement the CG method using the MATLAB routine pcg.m (MATLAB PCG, 2022).In the control experiment, we use the full matrix R and standard matrix-vector multiplication.The standard matrix-vector multiplication is implemented in a naive way (e.g., no function handles are used, and calculations are performed from left to right).To use the local SVD-FMM, we divide the observation domain into 8 × 8 boxes (see Figure 1).We choose the localization length as h = 2 and h = 3 and the number of singular vectors as p = 4 and p = 10.We have written our own MATLAB code to implement the local SVD-FMM, which can be used together with pcg.m.The code can also be used to calculate matrix-vector products in other applications, provided the matrix is symmetric and the magnitude of its elements decays away from the diagonal.The MATLAB code for the local SVD-FMM is available in Hu and Dance (2021b).Table 1 An Algorithm (the Conjugate Gradient Method; e.g., Barrett et al., 1994;Tabeart, 2019;Trefethen & Bau, 1997) for Solving Equation 8 In addition to the local SVD-FMM experiments, we perform experiments where the off-diagonal elements of the matrix R are completely ignored.To account for the effect of neglecting observation error correlations, the diagonal elements are inflated by a factor of 1.7.Observation error variance inflation avoids overfitting the largest scale (e.g., Stewart et al., 2008).The chosen inflation factor has a similar magnitude to inflation factors used in operational systems (e.g., Campbell et al., 2017).We also conduct experiments to combine the local SVD-FMM (p = 4 and h = 2) with observation error variance inflation.
After obtaining the vector δz a using the algorithm described in Table 1, we calculate the analysis increment vector as The analysis error vector can then be calculated as To investigate analysis accuracy at different spatial scales, we use the discrete Fourier transform (DFT), where k = 0, …, n x 1, l = 0, …, n y 1, i denotes the imaginary unit, n x = 112 and n y = 142 are numbers of grid points in North-South and East-West directions, F ∈ R n x ×n y denotes a complex matrix containing Fourier coefficients, and ϵ a 2d ∈ R n x ×n y is a matrix containing the analysis error in two-dimensional space.The matrix ϵ a 2d can be obtained by converting the vector ϵ a into a matrix.Equation 29 can also be applied to the background error.We carry out the DFT using the MATLAB routine fft2.m (MATLAB FFT2, 2022).The magnitude of Fourier coefficients is where Re(⋅) and Im(⋅) denote the real and imaginary parts, respectively.The magnitude γ(k, l) measures the size of the error in the scale represented by the wavenumber of k and l.A larger wavenumber in spectral space corresponds to a smaller scale in physical space.

Experimental Results
We perform the CG method (described in Table 1) 100 times using different samples of random background and observation errors.We first look at whether the use of the local SVD-FMM affects the number of iterations needed by the CG method.We find that how much the number of iterations varies depends on how many singular vectors are used and which localization length is chosen (Figure 13).When using too few singular vectors and a too short localization length, the matrix-vector product will be poorly approximated by the local SVD-FMM (see Section 4), and we would expect the number of iterations to vary considerably (see green triangles in Figure 13).The convergence rate of the CG method is known to be affected by the eigenstructure of the preconditioned Hessian S (e.g., Tabeart et al., 2022;Trefethen & Bau, 1997, Theorem 38.5).When using the local SVD-FMM, we ignore some elements in the matrix A and approximate its sub-matrices as low-rank matrices.This changes the eigenstructure of the matrix A and that of the Hessian S. In our experiments, the matrix S is positive definite for all the parameters we considered (number of singular vectors, p and localization length, h).This ensures that there is a unique solution to the minimization of the cost function.However, when the approximation of the matrix A is poor (e.g., when h = 2 and p = 4), the minimum eigenvalue of S is found to be less than its theoretical value of one.The abscissa is the number of iterations in experiments that use the local singular value decomposition approach to the fast multipole method with different numbers of singular vectors ( p) and localization lengths (h).The ordinate is the number of iterations in the control run that uses the standard matrix-vector product.Each marker represents the result obtained from one realization of background and observation errors.
A smaller minimum eigenvalue can cause a larger condition number, and a larger condition number typically means a slower convergence (e.g., Tabeart et al., 2020).Figure 13 also shows that, for a given choice of the number of singular vectors and localization length, the number of iterations also varies with different innovation vectors (variation of points marked with the same symbol).This is because the innovation vector and the local SVD-FMM are used to calculate the vector w (right-hand side of Equation 8), which also affects the convergence rate (Trefethen & Bau, 1997, p. 300).
We now look at the computational time for implementing the CG method (see Table .2).As expected, using the diagonal matrix R gives the shortest computation time and the standard approach the longest.Using the local SVD-FMM can substantially reduce the computational time relative to the standard method.The computational cost of the local SVD-FMM can be separated into two parts.The first part includes the time taken to partition the observation domain (Section 3.1) and to calculate SVDs (Equation 15) and translation operators (Equation 18).We refer to this part as the initialization time of the local SVD-FMM.The second part is the time taken to perform the CG method, which is referred to as the iteration time and is determined by the number of iterations and the computation cost for each iteration.When using a larger number of singular vectors or a longer localization length, the initialization time increases due to the increased time used to calculate the SVD.The longer iteration time for p = 4 and h = 2 is the result of a slower convergence rate (see Figure 13).In addition, we observe a slight increase in the mean iteration time from p = 4 and h = 3 to p = 10 and h = 3.This is because the number of the iterations is comparable in both cases (Figure 13), while the computation cost per iteration increases if p is larger.To conclude, our numerical experiments show that the use of the local SVD-FMM greatly improves the speed of data assimilation, even considering the additional computational cost of the SVDs.
We finally investigate analysis accuracy at different spatial scales.The magnitude of Fourier coefficients (γ; Equation 30) is used as a measure of analysis and background errors.Figure 14 shows γ as a function of the wavenumber given by ̅̅̅̅̅̅̅̅̅̅̅̅̅ k 2 + l 2 √ .Since we only have n x = 112 grid points in the North-South direction, we cut off the plot at the wavenumber n x /2 1 = 55.The upper panel of Figure 14 compares the mean analysis error and the mean background error in the control run, where the analysis is calculated using the standard matrix-vector product.The analysis and background errors are averaged over 100 realizations of the background and observation error vectors.We find that the mean analysis error is smaller than the mean background error at all scales, with the largest errors at large scales.This is consistent with the findings of Fowler et al. (2018).The lower panel of Figure 14 shows the absolute differences between the mean analysis error in the control run and the mean analysis errors in the experiments.The smallest difference is found for the local SVD-FMM experiment with p = 10 and h = 3.In particular, we find that the differences at the smaller scales are on the order of 10 8 .This indicates that the use of the local SVD-FMM can preserve the benefit of including correlated observation error statistics in data assimilation (see the beginning of Section 5).In comparison, a larger difference is found for the experiments with diagonal R at all scales.These results comparing analysis accuracy with diagonal and nondiagonal observation error variance matrices are consistent with the findings of previous studies (e.g., Rainwater et al., 2015;Simonin et al., 2019;Stewart et al., 2008Stewart et al., , 2013)).When smaller h and p are used for the local SVD-FMM, we observe an increase in the difference compared to the case of p = 10 and h = 3.At the largest scale, the differences for the local SVD-FMM experiments with p = 4 and h = 2 or 3 are even larger than that for the diagonal R experiment.It is worth noting that increasing h from 2 to 3 brings little benefits when p = 4.This is consistent with the results in Section 4.3, where we showed that for a fixed p, increasing h may have little effect on the accuracy of the local SVD-FMM once h reaches a certain value.In this case, it is reasonable to retain more singular values/vectors due to the increased matrix size.The lower panel of Figure 14 also shows that after inflating the observation error variance, the difference for the local SVD-FMM experiment with p = 4 and h = 2 at the largest scale is reduced.However, at the same time, the difference at the smaller scales increases.
In addition to the comparison of mean analysis and background errors, Figure 15 compares the analysis and background errors at the largest scale in each realization.We find that in some realizations the analysis error is larger, while in other realizations the background error is larger.When using the standard matrix-vector product, or when using the local SVD-FMM with a large number of singular vectors and a long localization length, the markers are mostly on the upper left of the diagonal in the scatterplot (the upper panels of Figure 15).This means that the analysis error is usually smaller than the background error at the largest scale.However, when using fewer singular vectors or a shorter localization length, more markers are found on the lower right of the diagonal (the middle panels of Figure 15), meaning that the analysis is usually less accurate than the background at the largest scale.In this case, we find that inflating the observation error variance improves the analysis accuracy at the largest scale (the bottom-left panel of Figure 15).This is similar to what happens when the observation error variance is inflated using a diagonal matrix, to compensate for ignoring observation error correlations (not shown).
In summary, as long as the number of singular vectors and the localization length are chosen so that the local SVD-FMM is a good approximation of the full matrix-vector product, it preserves the information in the analysis at all scales.

Conclusion
Accounting for spatially correlated observation error statistics in data assimilation systems has been shown to improve both the accuracy of data assimilation and forecasts (e.g., Fujita et al., 2020;Healy & White, 2005; Figure 15.Background errors against analysis errors at the largest scale.Analyses are calculated differently for each panel (see Section 5.1 for details).The markers and colors used for each analysis are consistent with Figure 14.Each marker in a panel represents the result obtained from one realization of background and observation errors.Simonin et al., 2019;Stewart et al., 2008Stewart et al., , 2013)).The inclusion of correlated error statistics allows for the assimilation of dense observational data sets, which can provide information on appropriate scales for highresolution forecasting.Such data sets are currently often thinned to get rid of the spatial error correlations.A potential downside of including correlated error statistics is that it may increase the computational cost of the variational minimization problem.This is because including these error statistics makes the observation precision matrix a full matrix.Computing a matrix-vector product involving a full matrix requires more floating-point operations and higher parallel computing communication costs than computing a matrix-vector product involving a diagonal matrix.
To reduce the computational cost, we proposed a new numerical approximation method, called the local SVD-FMM, which is easy to implement in both sequential and parallel computations and is suitable for matrices whose elements decay rapidly away from the main diagonal.The local SVD-FMM was developed based on the SVD-FMM (without localization), which was originally proposed by Gimbutas and Rokhlin (2003) and explored in Hu and Dance (2021a) for its feasibility in data assimilation.There are two new aspects in the local SVD-FMM.First, it is a single-level algorithm that requires a single-level partitioning of the observation domain.In contrast, the SVD-FMM is a multi-level algorithm that uses a hierarchical structure of boxes, which leads to a more complicated parallelization scheme.Second, a new non-dimensional localization length parameter is introduced in the local SVD-FMM.The use of localization compensates for the increased cost by replacing the multi-level algorithm with a single-level algorithm.Although the localization introduces errors by ignoring error correlations between distant observations, we showed that these errors can be very small and do not affect the accuracy of the analysis too much.The main drawback of the local SVD-FMM for data assimilation is the need to precompute the observation precision matrix before the method can be applied.In practice we believe that there are three main ways we could deal with this issue, which may each be appropriate for different observation types and situations, but would require further research to evaluate: (a) Starting from a statistical estimate of the precision matrix rather than the observation error covariance matrix.For example, Le and Zhong (2022) and Yuan (2010) each provide numerically efficient approaches to estimate a precision matrix starting from the sample covariance (that could be obtained in data assimilation from a Desroziers et al. (2005)type method).(b) Rather than inverting a new matrix each cycle (in response to quality control changes in observation distribution) compute low-rank downdates and updates, with only a small number of operations relative to the dimension of the matrix (e.g., Brand, 2006;Golub & Van Loan, 1996).(c) Combine this type of hierarchical approximation approach with a matrix factorization such as the Cholesky decomposition (Hackbusch, 2016).
The accuracy of the local SVD-FMM depends on many factors, including the localization length, the number of singular vectors used, and the observation error correlation lengthscale.Generally, the accuracy increases with the localization length and the number of singular vectors but decreases with the correlation lengthscale.Moreover, we find that after the localization length and the number of singular vectors reach certain values, further increasing them brings little benefit.The optimal value for them varies from application to application and depends on, for example, the configuration of boxes and the observation error correlation lengthscale.In a simple variational data assimilation experiment, the local SVD-FMM is found to dramatically reduce the computational cost of the variational minimization problem while maintaining the accuracy of the analysis.
In particular, the use of the local SVD-FMM can preserve the improvement in the analysis accuracy at all scales.
The local SVD-FMM has a wide range of applicability.It is applicable to both regularly and irregularly distributed observations and the observation domain can be partitioned into a regular or irregular mesh of boxes.We may only need to have each box contain a similar number of observations in order to achieve good efficiency and load balancing.Given the efficiency, accuracy and broad applicability, the local SVD-FMM has the potential to be used in practical data assimilation applications, such as convection-permitting NWP (Dance et al., 2019;Hu et al., 2023), where it may be important to assimilate a large volume of dense observations with strong mutual spatial error correlations.
Figure2).The far field of a box b, denoted by F b , consists of all the other boxes.The interaction field of box b, denoted by L b , consists of a subset of boxes in F b , which are within a certain distance of box b.The distance is given by a non-dimensional localization length parameter (h), denoting the number of boxes.For example, the right panel of Figure2shows the interaction field of box b with h = 3, indicating that the interaction field contains boxes within a distance of three boxes from b, excluding the near-field boxes.The non-dimensional localization length parameter can also be expressed in units of length by simply multiplying it by the size of the box.The non-interaction boxes of b, denoted by D b , are given by the boxes that are in F b but not in L b .Let m b , m N b , m L b and m D b denote the number of observations in box b, N b , L b and D b , respectively, then we have m = ∑ N box b=1 m b , where N box is the total number of boxes, and m

Figure 1 .
Figure 1.An example of the observation domain partition.Observations (gray dots) are approximately regularly distributed over a region from 49°N to 52°N and 3°W to 3°E with a grid spacing of about 30 km, resulting in 154 observations.The observation domain is divided into 8 × 8 boxes.

Figure 2 .
Figure 2. Illustration of the near, interaction and non-interaction fields of the box b. (left panel) The shaded boxes are in the near field of the box b. (middle panel) The shaded boxes are in the interaction field of the box b with a non-dimensional localization length parameter h = 3. (right panel) The shaded boxes are in the non-interaction field of the box b.

Figure 3 .
Figure 3. Plots illustrating the position (marked by black squares) of the elements of the matrix A ∈ R 154×154 used for the near-field, interaction-field, and non-interaction-field matrix-vector products in Equation 11.Observations are distributed and partitioned as in Figure 1.The localization length is h = 3.

Figure 4 .
Figure 4.One row of the true and approximated observation error precision matrix, A, in physical space.Observations are distributed and partitioned as in Figure1.The observation corresponding to the chosen row is located at the darkest red circle in the center of the domain.The true matrix is the inverse of the second-order auto-regressive covariance matrix with a standard deviation of 1 and a correlation lengthscale of 160 km (Equation12).The approximated matrices are obtained by approximating the true matrix using the local singular value decomposition approach to the fast multipole method with different numbers of singular values ( p) and localization lengths (h).The corresponding physical locations of near-field elements are given by circles inside the solid line square or at its edge.The corresponding physical locations of interactionfield elements are given by circles outside the solid line square, but inside or at the edge of the dashed line square.The corresponding physical locations of the non-interaction-field elements are given by circles outside the dashed line square.
(a) Obtain the numerical operators, which are used to compress, translate and decompress the information in the vector d; (b) Compress the part of d that is allocated to each box; (c) For each box b, translate the compressed information from boxes in b's interaction field (L b ) to b; and (d) Decompress the translated information for each box.
m b denotes the kth left singular vector for box b, v b,k ∈ R m L b denotes the kth right singular vector for box b and s b,k denotes the kth singular value for box b, and we should have p ≪ m b .The computation of the truncated SVDs is perfectly parallel.Each PE can perform a matrix decomposition independently.After the SVDs, the memory allocated in PE-b for the sub-matrix A(I b , I L b ) can be released.The singular vectors and singular values obtained can be used to compute Equation 1 with any vector d as long as the matrix A remains the same.In practical data assimilation applications, the matrix A may vary each assimilation cycle due to factors such as quality control but typically does not change between iterations of the minimization scheme in a given cycle.2. Project the sub-vector d(I b ) for each box b onto the basis given by p left singular vectors and obtain the coefficient of projection, Φ b,k = u b,k ⋅ d(I b ),(16)where k = 1, …, p and the symbol ⋅ denotes a dot product.For a fixed b and a fixed k, Equation 16 requires 2m b operations.Summing over each value of b and each value of k, the total number of operations is 2mp.The computation of Equation 16 is perfectly parallel.

Figure 5 .
Figure 5.The relative localization error with the localization length of h = 3 as a function of observation error correlation lengthscale.The observations are distributed and partitioned as in Figure 1.

Figure 6 .
Figure 6.Illustration of using different localization lengths for the latitudinal and longitudinal directions.

Figure 7 .
Figure7.The root mean square error (RMSE) in using the local singular value decomposition approach to the fast multipole method to calculate the matrix-vector product given in Equation 1 under different numbers of singular vectors and correlation lengthscales.The localization length parameter is 3.The RMSE is averaged over 100 realizations of the vector.The size and the color of the circles linearly increase with the RMSE.

Figure 8 .
Figure 8.Average of the kth singular values (s b,k ) of N box submatrices of the matrix A. Only the results for the first 10 singular values are plotted.The s b,k are calculated as shown in Equation 15.Different symbols represent results for different observation error correlation lengthscales (l).A logarithmic scale is used on the axis of ordinates to allow a clear comparison between smaller singular values.

Figure 11 .
Figure 11.Location of observations (diamonds) with respect to model grid points (circles).The numbers give the order of elements when converting 2-D data into a column vector.

Figure 12 .
Figure 12.A realization of (left panel) the background error and (right panel) the observation error in physical space.The grid lengths of the background and observations are 3 and 6 km respectively.Background and observation error correlation lengthscales are 80 and 160 km respectively.
for δz Algorithm Comment The conjugate gradient (CG) method Matrix-vector products involving the matrix R 1 1: Initialization k = 0, (δz) 0 = 0 2: r 0 = w S(δz) 0 = w 2: The calculation of w given by Equation 10 requires the calculation of R calculation of the product of S (Equation 9) and p k requires the calculation of R 1 HB 1/ 2 p k ) 6: (δz) k+1 = (δz) k + α k+1 p k 7: r k+1 = r k α k+1 Sp k 7: The matrix-vector product, Sp k , has already been calculated in line k+1 = β k+1 p k + r k+1 10: k = k + 1 11: end while Note.The initial guess of δz is a zero vector.The symbols r k = w S(δz) k and p k are the residual and search direction at iteration step k, respectively.The scalars α k+1 and β k+1 are used to calculate the residual and search direction for iteration step k + 1, respectively.The symbol ‖⋅‖ 2 denotes Euclidean norm and τ is the relative tolerance (compared to the size of ‖w‖ 2 ) that decides when the iteration stops.

Figure 13 .
Figure13.Comparison of the number of iterations required by the conjugate gradient method to minimize the variational data assimilation cost function.The abscissa is the number of iterations in experiments that use the local singular value decomposition approach to the fast multipole method with different numbers of singular vectors ( p) and localization lengths (h).The ordinate is the number of iterations in the control run that uses the standard matrix-vector product.Each marker represents the result obtained from one realization of background and observation errors.

Figure 14 .
Figure14.(upper panel) Background and analysis errors at different physical scales (represented by wavenumbers in spectral space) in the control run.Solid lines are the mean error size from 100 runs using different samples of background and observation errors.Shaded areas are bounded by the minimum and maximum errors from these runs.(lower panel) Absolute differences in the mean analysis error at different physical scales between the control run and experiments.The control run and experiments are described in Section 5.1.A logarithmic scale is used on the axis of ordinates to allow a clear comparison between results at all wavenumbers.

Table 2
CPU Time in Seconds Used byMATLAB R2023b on MacBook Air (M1, 2020)to Minimize the Variational Data Assimilation Cost Function The minimum, mean, and maximum values are taken from 100 runs with different samples of background and observation errors.For experiments where the local singular value decomposition approach to the fast multipole method (SVD-FMM) is used, the computational time is given as the initialization time plus the iteration time (see the main text for more details).