Divergence of the Local Ensemble Transform Kalman Filter (LETKF) by Nonlocal Observations

Ensemble Kalman filters are powerful tools to merge model dynamics and observation data. For large system models, they are known to diverge due to subsampling errors at small ensemble size and the resulting possible spurious correlations in forecast error covariances. The Local Ensemble Transform Kalman Filter (LETKF) remedies these disadvantages by localization in observation space. However, its application to nonlocal observations is still under debate since it is not yet clear how to optimally localize nonlocal observations. The present work studies intermittent divergence of filter innovations and shows that it increases forecast errors. Nonlocal observations enhance such innovation divergence under certain conditions, whereas similar localization radius and sensitivity function width of nonlocal observations minimize the divergence rate. The analysis of the LETKF reveals inconsistencies in the assimilation of observed and unobserved model grid points which may yield detrimental effects. These inconsistencies inter alia indicate that the localization radius should be larger than the sensitivity function width if spatially synchronized system activity is expected. Moreover, the shift of observation power from observed to unobserved grid points, hypothesized in the context of catastrophic filter divergence, is supported for intermittent innovation divergence. Further possible mechanisms yielding such innovation divergence are ensemble member alignment and a novel covariation between background perturbations in location and observation space.


INTRODUCTION
Data assimilation (DA) merges models and observations to gain optimal model state estimates. It is well-established in meteorology [1] and geophysics [2], and attracts attention in the life sciences [3]. Typical applications of DA serve to estimate model parameters [4] or provide initial conditions for forecasts [5]. A prominent technique is the ensemble Kalman filter [6], which makes it possible to assimilate observations in nonlinear models. When the underlying models are high-dimensional, such as in geophysics or meteorology, spurious correlations in forecast errors are detrimental to state estimates. A prominent approach to avoid this effect is localization of error covariances. The Local Ensemble Transform Kalman Filter (LETKF) [7] utilizes a localization scheme in observation space that is computationally effective and applicable to high-dimensional model systems. The LETKF applies to local observations [8] measured in the physical system under study, e.g., by radiosondes, and to nonlocal observations measured over a large area of the system by, e.g., weather radar or satellites [9][10][11]. Since nonlocal observations represent spatial integrals of activity, while the localization scheme of the LETKF requires a single spatial location for each observation, it is conceptually difficult to apply the LETKF to nonlocal observations. In fact, present localization definitions [10,12] of nonlocal observations attempt to estimate the best single spatial location while neglecting the spatial distribution of possible activity sources. A recent study [13] on satellite data assimilation proposes to choose the localization radius equal to the spatial distribution width of radiation sources. This spatial source distribution is the sensitivity function of the nonlocal observation and is part of the model system. The present work considers the hypothesis that the relation between localization radius and sensitivity function width plays an important role in the filter performance.
Merging the model forecast state and observations, the ensemble Kalman filter pulls the analysis, i.e., the newly estimated state, toward the model forecast state and thus underestimates the forecast error covariance matrix due to a limited ensemble size [14]. This effect is reinforced by model errors [15,16] and leads to filter divergence. Moreover, if the forecast error covariances are too large, the forecasts have too little weight in the assimilation step and the filter pulls the analysis toward the observations. This also results in filter divergence. In general terms, filter divergence occurs when an incorrect background state cannot be adjusted to a better estimate of the true state by assimilating observations. Ensemble inflation and localization improve the filter performance. The present work considers a perfect model and thus neglects model errors. By this construction, all observed divergence effects result from undersampling and localization. The present work chooses a small ensemble size compared to the model dimension, fixes the ensemble inflation to a flow-independent additive inflation and investigates the effect of localization.
In addition to the filter divergence described above, ensemble Kalman filters may exhibit catastrophic filter divergence, which drives the filter forecasts to numerical machine infinity [17][18][19][20][21]. This divergence is supposed to result from the alignment of ensemble members and from unconserved observable energy dissipation [20]. The latter criterion states that the filter diverges in a catastrophic manner if the observable energy of the system dissipates in unobserved directions, i.e., if energy moves from observed to unobserved locations. The present work raises the question whether such features of catastrophic divergence play a role in non-catastrophic filter divergence as well. Subsequent sections indicate that this is the case in the assimilation of nonlocal observations.
The underlying motivation of this work is the experience from meteorological data assimilation that satellite data are detrimental to forecasts if the assimilation procedure is not well-tuned [12,13,22]. This effect is supposed to result from deficits in the underlying model. The present work assumes a perfect model and investigates the question whether assimilating nonlocal observations is still detrimental. A preliminary experiment indicates increased first guess errors at local observations when a nonlocal observation is assimilated in addition. This preliminary result, that additional observations increase the first guess error, is counter-intuitive at first glance but consistent with practical experience in weather forecasting. This finding indicates that nonlocal observations may render the LETKF unstable, so that it diverges depending on properties of the observation's sensitivity function. What is the role of localization in this context? Is there a fundamental optimal relation between localization and sensitivity function as found in [13]? The present work addresses these questions in the following sections. Section 2 introduces the essential elements of the LETKF and recalls its analytical description for a single observation in section 2.5. Section 2.8 provides conventional and new markers of filter divergence that help to elucidate possible underlying divergence mechanisms. Section 3 briefly presents the findings, which are put into context in section 4.

The Model
The storm-like Lorenz96 model [23] is a well-established meteorological model and the present work considers an extension by a space-dependent linear damping [24]. It is a circle network with N nodes, whose node activity x_k(t) at node k and time t obeys

dx_k/dt = (x_{k+1} − x_{k−2}) x_{k−1} − α_k x_k + I ,  (1)

with k = 1, . . . , N, x_k = x_{k+N}, and the damping coefficient α_k = k/N. We choose I = 8.0 and N = 80, and the initial condition is random with x_k(0) = 8.0 + ξ_k for k ≠ N/2 and x_{N/2}(0) = 8.01 + ξ_{N/2}, with the normally distributed random variable ξ ∼ N(0, 0.01). Figure 2A shows the model field dependent on time.
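The nature run can be sketched in a few lines; the tendency below assumes the damped Lorenz96 form dx_k/dt = (x_{k+1} − x_{k−2}) x_{k−1} − α_k x_k + I with α_k = k/N, which is a reconstruction from the description above, together with a classical 4th-order Runge-Kutta step:

```python
import random

N = 80          # number of grid points on the circle network
I_FORCE = 8.0   # constant forcing I

def tendency(x):
    # assumed dynamics: dx_k/dt = (x_{k+1} - x_{k-2}) x_{k-1} - alpha_k x_k + I,
    # with space-dependent damping alpha_k = k / N and cyclic indices
    n = len(x)
    return [(x[(k + 1) % n] - x[(k - 2) % n]) * x[(k - 1) % n]
            - ((k + 1) / n) * x[k] + I_FORCE
            for k in range(n)]

def rk4_step(x, dt):
    # classical 4th-order Runge-Kutta step
    k1 = tendency(x)
    k2 = tendency([xi + 0.5 * dt * ki for xi, ki in zip(x, k1)])
    k3 = tendency([xi + 0.5 * dt * ki for xi, ki in zip(x, k2)])
    k4 = tendency([xi + dt * ki for xi, ki in zip(x, k3)])
    return [xi + dt / 6.0 * (a + 2.0 * b + 2.0 * c + d)
            for xi, a, b, c, d in zip(x, k1, k2, k3, k4)]

# initial condition: x_k(0) = 8 + xi_k, with a slightly larger value at k = N/2
random.seed(0)
x = [8.0 + random.gauss(0.0, 0.01) for _ in range(N)]
x[N // 2 - 1] = 8.01 + random.gauss(0.0, 0.01)

for _ in range(100):          # one forecast step = 100 integration steps
    x = rk4_step(x, 1e-3 / 12.0)
```

With the small time step of the numerical experiments, one forecast step moves the state only slightly away from the plateau value of 8.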
Typically, data assimilation techniques are applied to merge observations and solutions of imperfect models, while the true dynamics of the underlying system is not known. To illustrate the impact of nonlocal observations, we assume (unrealistically in practice) that the model under consideration (1) is perfect and hence emerging differences between observations and model equivalents do not originate in model error.

The Local Ensemble Transform Kalman Filter (LETKF)
The aim of data assimilation is to estimate a state that describes optimally both a model (or background) state x^b ∈ R^N and corresponding observations y ∈ R^S of number S. This analysis x^a ∈ R^N minimizes the cost function

J(x) = (x − x̄^b)^t B^{−1} (x − x̄^b) + (y − Ĥ(x))^t R^{−1} (y − Ĥ(x)) ,  (2)

with x ∈ R^N, the background error covariance B ∈ R^{N×N} and the observation error covariance R ∈ R^{S×S}. The observation operator Ĥ : R^N → R^S is linear in the present work; it projects a model state into observation space and thus links model states and model equivalents in observation space. The LETKF estimates the background error covariance B by background ensemble perturbations of number L,

B = (L − 1)^{−1} X^b (X^b)^t ,  (3)

with X^b ∈ R^{N×L}. In the following, we will call an ensemble with L = N a full ensemble, as opposed to a reduced ensemble with the typical case L < N. The columns of X^b are the background ensemble member perturbations x^{b,l} − x̄^b, l = 1, . . . , L, where {x^{b,l}} is the set of background ensemble members and x̄^b is the mean over the ensemble. Then the coordinate transformation from physical space to ensemble space,

x = x̄^b + X^b w ,  (4)

describes a state x in the ensemble space with new coordinates w [7]. Inserting Equation (4) into (2) yields

J̃(w) = (L − 1) w^t w + (y − ȳ^b − Y^b w)^t R^{−1} (y − ȳ^b − Y^b w)  (5)

in the new coordinates w. Here ȳ^b = Ĥ(x̄^b) ∈ R^S is the model equivalent of the background ensemble mean in observation space and Y^b = Ĥ(X^b) is the corresponding model equivalent of X^b. This implies [7]

Ĥ(x̄^b + X^b w) ≈ ȳ^b + Y^b w ,  (6)

which is valid for linear observation operators.
The minimization of the cost function (5) yields

w̄^a = A (Y^b)^t R^{−1} (y − ȳ^b) ,  A = [(L − 1) I_L + (Y^b)^t R^{−1} Y^b]^{−1} ,  (7)

and Equation (4) provides the analysis ensemble mean

x̄^a = x̄^b + X^b w̄^a .  (8)

Then the square root filter ansatz [7] yields the analysis ensemble members

w^{a,l} = w̄^a + W^{a,l} ,  (9)

where W^{a,l} is the l-th column of the matrix W^a = [(L − 1)A]^{1/2}.
The square root of A may be computed using the singular value decomposition A = U D V^t with the diagonal matrix D and the eigenvector matrices U, V. This yields A^{1/2} = U D^{1/2} V^t. Finally, the analysis ensemble members in physical space read

x^{a,l} = x̄^b + X^b w̄^a + X^b W^{a,l} ,  l = 1, . . . , L ,

see [7,8] for more details. Specifically, we have chosen L = 10 ensemble members and the number of observations S = 1 or S = 2.

Observation Data
In principle there are two types of observations. Local observations are measured at a single spatial location in the system, whereas nonlocal observations are integrals over a set of spatial locations. Examples for local observations are radiosondes measuring humidity and temperature in the atmosphere at a certain vertical altitude and horizontal position. Typical nonlocal observations are satellite measurements capturing the radiation in a vertical atmospheric column.
The present work considers observations

y = Ĥ(x) + η ,  (10)

where η ∈ R^S is Gaussian white noise with the true covariance R^t and Ĥ(x) = Hx, H ∈ R^{S×N}, is a linear observation operator. In the following, the linear operator H is called the sensitivity function; we adopt this name from meteorological data assimilation of nonlocal satellite data. The present work considers either a nonlocal observation only (S = 1) with the observation matrix elements

H_{1n} = 1 for |n − N/2| ≤ r_H ,  H_{1n} = 0 otherwise,  (11)

with sensitivity function width r_H, or both observation types (S = 2), for which the observation matrix elements read

H_{1n} as in Equation (11) ,  H_{2n} = δ_{ni} ,  (12)

where the local observation is captured at spatial location i, cf. Figure 2 for illustration. The sensitivity function of the nonlocal observation has, in addition, a small peak in its center, which permits localizing the observation in the center of the sensitivity function, cf. section 2.4. In the subsequent sections, i = N/2 and r_H varies in the range 1 ≤ r_H ≤ 10. Please note that r_H = 1 approximates the model description of a local observation. Moreover, in the following, a grid point whose activity contributes to a model equivalent in observation space is called an observed grid point and all others are called unobserved grid points. Mathematically, observed (unobserved) grid points exhibit H_nk ≠ 0 (H_nk = 0).
In this work, a single partial study considers a smooth sensitivity function instead of the boxcar function described above. Then the sensitivity function is the Gaspari-Cohn function GC(n, r_H/2) [25] in the interval −r_H ≤ n ≤ r_H, which approximates a Gaussian function by a smooth function with finite support 2 r_H. The observations y(t_n), t_n = 1, . . . , T at T time instances (cf. Equation 10) obey the model (1) and Equation (10) with the observation operator (11) or (12). In a large part of the work, we have assumed zero observation error, R^t = 0, i.e., observations are perfect in the sense that they reflect the underlying perfect model, cf. section 2.1. We take the point of view that we do not know that the model and observations are perfect and hence we guess R, as is done in cases where models and observations are not perfect. This approach has been taken in most parts of the work. Since, however, this implicit filter error may already contribute to a filter instability or even induce it, a short partial study has assumed perfect knowledge of the observation error.
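As an illustration of Equations (10) and (11), a boxcar-type nonlocal observation may be generated as follows; the unit height of the sensitivity function and the omission of the small central peak are simplifying assumptions of this sketch:

```python
N, r_H = 80, 6       # grid size and sensitivity function width (example values)
CENTER = N // 2

# boxcar sensitivity row H_{1n}: 1 on |n - N/2| <= r_H, 0 otherwise (1-based n);
# the unit height is an assumption of this sketch
H_row = [1.0 if abs(n - CENTER) <= r_H else 0.0 for n in range(1, N + 1)]

def nonlocal_observation(x_true, eta=0.0):
    # y = H x + eta (Equation 10); eta = 0 mimics the perfect observations R^t = 0
    return sum(h * v for h, v in zip(H_row, x_true)) + eta

x_true = [8.0] * N                  # toy state near the attractor plateau
y = nonlocal_observation(x_true)    # integrates activity over 2 r_H + 1 points
```

The nonlocal observation thus sums the activity of 2 r_H + 1 observed grid points into a single scalar.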
To this end, in this short partial study we have assumed (R t ) jj = 0.1, j = 1, . . . , S and perfect knowledge of this error, i.e., R = R t . Although techniques have been developed to estimate R adaptively [26], we do not employ such a scheme for simplicity.

Localization
In the LETKF, the background covariance matrix B is expressed by L ensemble members, cf. Equation (3), and it is rank-deficient for L ≪ N. This leads to spurious correlations in B. Spatial localization in ensemble Kalman filters has been found to be beneficial in this context [16, 27-29]. The LETKF as defined by Hunt et al. [7] performs the localization in observation space.
In detail, Hunt et al. [7] proposed to localize by increasing the observation error in the matrix R depending on the distance between the analysis grid point and the observations. The present implementation follows this approach. The observation error matrix R is diagonal, i.e., observation errors of single observations are uncorrelated. Then at each grid point i the localization scheme considers an observation y_n at location j only if the distance between locations i and j does not exceed the localization radius r_l. The error of observation n is R_nn = R^0_nn/ρ_ij, where ρ_ij = GC(d_ij, r) + ε for d_ij ≤ r_l is the weighting function with the Gaspari-Cohn function GC(d, r) [25], ε > 0 is a small constant ensuring a finite observation error and d_ij is the spatial distance between i and j. The Gaspari-Cohn function approximates a Gaussian function with standard deviation r√(3/10) by a polynomial with finite support. The parameter r with 2r = r_l sets the radius of the localization function, and 0 ≤ GC(z, r) ≤ 1 for 0 ≤ z ≤ r_l. Consequently, the observation error takes its minimum R^0_nn at distance d_ij = 0 and increases monotonically with distance to its maximum R^0_nn/ε at d_ij = r_l. In the present implementation, we use ε = 10^−7 and observation errors R^0_11 = 0.1 for a single nonlocal observation (S = 1) and R^0_nm = 0.1 δ_nm, n, m = 1, 2, for local and nonlocal observations (S = 2).
The observation error close to the border of the localization area about a grid point i is large by definition, R_nn = R^0_nn/(GC(d → r_l, r_l/2) + ε). In numerical practice, the assimilation effect of large values R_nn > R^0_nn/GC_low is equivalent, in a reasonable approximation, for all such distances from the grid point i if GC_low is small enough. By virtue of the monotonic decrease of GC(d, r_l/2) with respect to distance d ≥ 0, this yields the condition GC(d, r_l/2) < GC_low for r_c ≤ d ≤ r_l. In other words, for distances d larger than a corrected localization radius r_c, the observation errors R_nn are so large that observations at such distances contribute only marginally to the analysis. For instance, if GC_low = 0.01, then r_l = 5 → r_c = 3, r_l = 10 → r_c = 7 and r_l = 15 → r_c = 11. It is important to note that this corrected localization radius depends on the width of the Gaspari-Cohn function and thus on the original localization radius r_l, i.e., r_c = r_c(r_l). In most of the following study cases, results are given for original localization radii r_l, while the usage of the corrected localization radius is stated explicitly. The existence of a corrected localization radius r_c illustrates the insight that there is not a single optimal localization radius for smooth localization functions but a certain range of equivalent localization radii. For non-smooth localization functions with sharp edges, e.g., a boxcar function, this variability would not exist.
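The localization weighting can be made concrete with a short sketch; the Gaspari-Cohn function is implemented here in its standard fifth-order piecewise polynomial form [25], and the corrected radius r_c is taken as the largest integer distance at which GC(d, r_l/2) ≥ GC_low:

```python
def gaspari_cohn(d, c):
    # standard Gaspari-Cohn fifth-order piecewise polynomial with support 2c,
    # approximating a Gaussian; here c = r_l / 2
    z = abs(d) / c
    if z <= 1.0:
        return -0.25 * z**5 + 0.5 * z**4 + 0.625 * z**3 - (5.0 / 3.0) * z**2 + 1.0
    if z <= 2.0:
        return ((1.0 / 12.0) * z**5 - 0.5 * z**4 + 0.625 * z**3
                + (5.0 / 3.0) * z**2 - 5.0 * z + 4.0 - (2.0 / 3.0) / z)
    return 0.0

def localized_obs_error(R0, d, r_l, eps=1e-7):
    # R_nn = R0 / (GC(d, r_l/2) + eps): minimal at the observation location,
    # very large (R0 / eps) at the border of the localization area
    return R0 / (gaspari_cohn(d, r_l / 2.0) + eps)

def corrected_radius(r_l, gc_low=0.01):
    # largest integer distance d with GC(d, r_l/2) >= GC_low; beyond this
    # distance observations contribute only marginally to the analysis
    d = 0
    while d + 1 <= r_l and gaspari_cohn(d + 1, r_l / 2.0) >= gc_low:
        d += 1
    return d
```

With GC_low = 0.01 this reproduces the values quoted above, e.g., r_l = 5 → r_c = 3, r_l = 10 → r_c = 7 and r_l = 15 → r_c = 11.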
The present work considers primarily nonlocal observations. Since these are not located at a single spatial site, it is nontrivial to include them in the LETKF, which assumes a single observation location. To this end, several previous studies have suggested corresponding approaches [13,24,30-36]. A reasonable approximation for the spatial location of a nonlocal observation is the location of maximum sensitivity [10,37], i.e., argmax_n H_kn of nonlocal observation k. Although this approximation has been shown to yield good results, it introduces a considerable error for broad sensitivity functions, i.e., when r_H is large. In fact, this localization scheme introduces an additional contribution to the observation error. The present implementation adopts this definition, which results in the localization of the nonlocal observation at grid point i = N/2.

LETKF for a Single Observation
In a large part of this work, we consider a single observation with S = 1. The subsequent paragraphs show an analytical derivation of the analysis ensemble mean and the analysis members, whose terms are interpreted in section 3. Considering the localization scheme described above, at the model grid point i the analysis ensemble mean (8) reads

x̄^a_i = x̄^b_i + X_i A Y^t R_i^{−1} (y_1 − ȳ^b) ,  A = [(L − 1) I_L + Y^t R_i^{−1} Y]^{−1} .  (15)

The term R_i = R^0_{11}/ρ_{i(N/2)} denotes the weighted observation error used at grid point i when the observation is located at j = N/2, and R^0_{11} is the error of observation y_1; X_i ∈ R^{1×L} is the i-th row of X^b and Y = Ĥ(X^b) ∈ R^{1×L}. Now utilizing the Woodbury matrix identity [38],

A = (L − 1)^{−1} [ I_L − Y^t Y / ((L − 1) R_i + y²) ] ,  (17)

where y = √(Y Y^t) ∈ R is a scalar. Inserting (17) into (15), the analysis ensemble mean is

x̄^a_i = x̄^b_i + α_i X_i Y^t (y_1 − ȳ^b)  (18)

with

α_i = 1 / ((L − 1) R_i + y²) .  (19)

Since R_{N/2} = R^0_{11} and R_{N/2±r_l} = R^0_{11}/ε = 10^7 R^0_{11}, α_i takes its maximum at the observation location and is very small when the observation is at the localization border. This means that x̄^a_i ≈ x̄^b_i at the border of the localization area.

Now let us focus on the ensemble members. Equations (18) and (9) give the analysis ensemble members at grid point i,

x^{a,l}_i = x̄^a_i + [X_i Q_i^{1/2}]_l ,  Q_i = (L − 1)A = I_L − Y^t Y / ((L − 1) R_i + y²) .  (20)

The singular value decomposition serves as a tool to compute Q_i^{1/2} = U D^{1/2} U^t, where D ∈ R^{L×L} is diagonal with the eigenvalues of Q_i as matrix elements and the columns of the matrix U ∈ R^{L×L} are the normalized eigenvectors of Q_i. Then Equation (7) yields

Q_i Y^t = λ_i Y^t ,  λ_i = (L − 1) R_i / ((L − 1) R_i + y²) ,  (21)

i.e., Y^t is an eigenvector of Q_i with eigenvalue 0 < λ_i < 1. By virtue of the properties of R_i, λ_i takes its minimum at the observation location i = N/2 and its maximum at the localization border.

The remaining L − 1 eigenvectors are v_n ⊥ Y^t, n = 1, . . . , L − 1, with unity eigenvalue since Q_i v_n = v_n − Y^t (Y v_n)/((L − 1) R_i + y²) = v_n. Hence Q_i^{1/2} = I_L − (1 − √λ_i) Y^t Y / y² and, after inserting into Equation (20) and lengthy calculations, this leads to

x^{a,l}_i = x̄^a_i + X_{i,l} − (1 − √λ_i) (X_i Y^t) Y_l / y² .  (22)
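The eigenstructure of Q_i derived in this section can be verified numerically; the perturbation values Y below are chosen purely for illustration:

```python
import math

L = 4                      # small ensemble for illustration
Y = [0.3, -1.1, 0.7, 0.4]  # model equivalents of the perturbations (illustrative)
R_i = 0.1                  # weighted observation error at grid point i
y2 = sum(v * v for v in Y)                    # y^2 = Y Y^t
lam = (L - 1) * R_i / ((L - 1) * R_i + y2)    # eigenvalue lambda_i

# Q_i = I - Y^t Y / ((L-1) R_i + y^2); check Q_i Y^t = lambda_i Y^t
Q = [[(1.0 if m == n else 0.0) - Y[m] * Y[n] / ((L - 1) * R_i + y2)
      for n in range(L)] for m in range(L)]
QY = [sum(Q[m][n] * Y[n] for n in range(L)) for m in range(L)]
assert all(abs(qy - lam * v) < 1e-9 for qy, v in zip(QY, Y))

# Q_i^{1/2} = I - (1 - sqrt(lambda_i)) Y^t Y / y^2; check (Q_i^{1/2})^2 = Q_i
Qs = [[(1.0 if m == n else 0.0) - (1.0 - math.sqrt(lam)) * Y[m] * Y[n] / y2
       for n in range(L)] for m in range(L)]
Qs2 = [[sum(Qs[m][k] * Qs[k][n] for k in range(L)) for n in range(L)]
       for m in range(L)]
assert all(abs(Qs2[m][n] - Q[m][n]) < 1e-9 for m in range(L) for n in range(L))
```

The check works because Y^t Y / y² is a projection matrix, so (1 − √λ)(1 + √λ) = 1 − λ recovers Q_i when the proposed square root is squared.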

Additive Covariance Inflation
The ensemble Kalman filter underestimates the forecast error covariance matrix due to the limited ensemble size [39]. This problem is often addressed by covariance inflation [27,40,41]. The present work implements additive covariance inflation [42],

X^b → X^b + Ŵ ,

with matrix elements Ŵ_ij ∼ N(0, f_add²) and the inflation factor f_add = 0.1.
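A minimal sketch of this inflation step, assuming the ensemble is stored as a list of member state vectors:

```python
import random

def additive_inflation(members, f_add=0.1, rng=random):
    # add independent N(0, f_add^2) noise to every component of every
    # background ensemble member (flow-independent additive inflation)
    return [[v + rng.gauss(0.0, f_add) for v in member] for member in members]

random.seed(1)
ens = [[8.0] * 5 for _ in range(3)]      # toy ensemble: 3 members, 5 grid points
ens_infl = additive_inflation(ens)
```

Since the noise is drawn independently per trial, this inflation is also the only source of variability between trials in the numerical experiments below.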

Numerical Experiments
The present study investigates solutions x(t) of model (1), and Equation (10) provides the observations y(t). This is called the nature run. In the filter cycle, the initial analysis values are identical to the initial values of the nature run and the underlying filter model is the true model (1). In the forecast step, the model is advanced with time step Δt = 10^−3/12 for 100 time steps applying a 4th-order Runge-Kutta integration scheme. According to [23], the duration of one forecast step corresponds to 1 h, which is also the time between two successive observations. The analysis update is instantaneous. In an initial phase, the model evolves freely for 50 forecast steps, i.e., 50 h, to avoid possible initial transients. Then, the LETKF estimates the analysis ensemble according to section 2.2 during 200 cycles if not stated otherwise. Each such numerical simulation is called a trial in the following and it comprises 100 · 200 time steps. In all figures, the time given is the number of forecast steps, or equivalently analysis steps, in hours. Each trial assumes identical initial ensemble members and the only difference between trials results from the additive noise of the additive covariance inflation, cf. section 2.6.
By virtue of the primarily numerical nature of the present work, it is mandatory to vary certain parameters, such as perturbations to the observations or the factor of additive inflation. For instance, one set of trials assimilates nonlocal observations with r_H = 10 centered at n_1 = 1, n_2 = 27, and n_3 = 54. The localization radius is identical to the sensitivity function width, r_l = r_H, and data assimilation is performed during 250 filter cycles with an initial phase of 50 forecast steps. For stabilization reasons, we have increased the model integration time step to Δt = 10^−2/12 but reduced the number of model integrations to 10 steps, cf. [19], thus essentially retaining the time interval between observations. Other parameters are identical to the standard setting described in the previous sections. As mentioned above, typically the measurement process is not known in all details. For instance, the observation error is assumed to be R = 0.1 for the nonlocal observations, whereas the true model exhibits noise-free observations with R^t = 0. This is the setting for all simulations but the few sets of trials that assume perfect knowledge of the observation error, cf. section 2.3.

Divergence Criteria and Verification
The Kalman filter may diverge for several reasons [6,27,43], such as model error, insufficient sampling of error covariances, or a high condition number of the observation operators [17,44]. Especially the latter has been shown to be able to trigger catastrophic filter divergence of the ensemble Kalman filter, exhibiting diverging forecasts in model state space [19][20][21]. This divergence type exhibits a magnitude increase of model variables to machine infinity in finite time. The present implementation detects catastrophic filter divergence and stops the numerical simulation when the maximum absolute value of any single ensemble member exceeds a certain threshold, |x^{b,l}_k| > 10^{10} for any k.

The present work focuses primarily on a non-catastrophic filter divergence type showing a strong increase of the innovation magnitude to values much larger than the model equivalent in observation space on the attractor. This divergence may be temporally intermittent with finite duration. Since this intermittent innovation divergence results in increased first guess departures and hence worsens forecasts, it is important to detect these divergences and control them. By definition, the innovation process diverges if max_{l,k} |[y_n − Hx^{b,l}]_k| > σ_th for any observation n, with σ_th = 1,000 R^0_nn. Then the numerical simulation is stopped. The time of filter divergence is called T_b in the following. This criterion for innovation divergence is hard: if the innovation reaches the threshold σ_th, then innovation divergence occurs. The corresponding divergence rate γ is the ratio between the number of divergent trials and the total number of trials. For instance, for γ = 1 all numerical trials diverge, whereas γ = 0 reflects stability in all numerical trials.
Moreover, it is possible that |[y_n − Hx^{b,l}]_k| grows intermittently but does not reach the divergence threshold. The first guess departure bias and the corresponding root mean-square error (RMSE) quantify the forecast error in such trials; for a single observation, y → y_o. Larger values of bias and RMSE indicate larger innovation values.
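The innovation divergence criterion and the first guess statistics can be sketched as follows, assuming a single scalar observation and the standard setting R^0 = 0.1:

```python
import math

R0 = 0.1                 # guessed observation error of the nonlocal observation
SIGMA_TH = 1000.0 * R0   # hard innovation divergence threshold sigma_th

def diverged(y_obs, member_equivalents):
    # innovation divergence: any member innovation |y - H x^{b,l}| > sigma_th
    return max(abs(y_obs - ye) for ye in member_equivalents) > SIGMA_TH

def bias_rmse(departures):
    # first guess departure bias and root mean-square error over a trial
    n = len(departures)
    bias = sum(departures) / n
    rmse = math.sqrt(sum(d * d for d in departures) / n)
    return bias, rmse
```

In a trial, `diverged` would be evaluated after every forecast step; if it never fires, `bias_rmse` summarizes the sub-threshold innovation behavior.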
To quantify filter divergence, Tong et al. [18] have proposed two statistical measures at each time t_n: the ensemble spread in observation space and the covariation of observed and unobserved ensemble perturbations, where the underlying matrix norm is defined by ||Z|| = √(Σ_{n,m} |Z_nm|²) for any matrix Z with matrix elements Z_nm. The latter quantity assumes local observations. Large values of these measures indicate catastrophic filter divergence as pointed out in [18,20]. This definition may also apply to nonlocal observations, cf. section 2.5, although its original motivation assumes local observations.
An interesting feature to estimate the degree of divergence is the time of maximum ensemble spread and the time of maximum covariation of observed and unobserved ensemble perturbations. Moreover, previous studies have pointed out that catastrophic filter divergence in ensemble Kalman filters implies alignment of ensemble members. This may also represent an important mechanism in non-catastrophic filter divergence. The new quantities

p_a = n_a / n_p ,  p_u = n_u / n_p ,

are the probabilities of alignment and anti-alignment, where n_a is the number of aligned ensemble member perturbation pairs (l, k) with cos β_lk > 0.5, n_u is the number of anti-aligned member pairs with cos β_lk < −0.5, n_p is the number of member pairs and β_lk denotes the angle between the background member perturbations x^{b,l} − x̄^b and x^{b,k} − x̄^b. The alignment (anti-alignment) condition cos β_lk > 0.5 (cos β_lk < −0.5) implies −60° ≤ β_lk ≤ 60° (120° ≤ β_lk ≤ 240°). Please note that 0 ≤ p_a,u ≤ 1 and the larger p_a (p_u), the more ensemble members are aligned (anti-aligned) to each other. Considering the importance of member alignment for catastrophic divergence, it may be interesting to estimate the alignment degree of the background member perturbations with the analysis increments x^{a,l} − x^{b,l} by

cos α_l = ⟨x^{a,l} − x^{b,l}, x^{b,l} − x̄^b⟩ / (‖x^{a,l} − x^{b,l}‖ ‖x^{b,l} − x̄^b‖) .

The term x^{a,l} − x^{b,l} is the deviation of the analysis ensemble member from its background member and x^{b,l} − x̄^b is the direction of the background member perturbation. If cos α_l → 1 (cos α_l → −1), the analysis increments point into the same (opposite) direction as the background member perturbations. In addition,

q_a = n′_a / L ,  q_u = n′_u / L ,

are the fractions of aligned and anti-aligned ensemble members for which cos α_l > 0.5 (of number n′_a) and cos α_l < −0.5 (of number n′_u), respectively.
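The pair-alignment probabilities can be computed directly; normalizing by the number of member pairs L(L − 1)/2 is an assumption of this sketch:

```python
import math

def cosine(u, v):
    # cosine of the angle between two perturbation vectors
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def alignment_probabilities(perts, thr=0.5):
    # p_a, p_u: fraction of member perturbation pairs with cos beta > thr
    # (aligned) or cos beta < -thr (anti-aligned); normalization by the
    # number of pairs L(L-1)/2 is an assumption of this sketch
    L = len(perts)
    n_pairs = L * (L - 1) // 2
    n_a = n_u = 0
    for l in range(L):
        for k in range(l + 1, L):
            c = cosine(perts[l], perts[k])
            if c > thr:
                n_a += 1
            elif c < -thr:
                n_u += 1
    return n_a / n_pairs, n_u / n_pairs

# two nearly parallel members and one opposite member (toy perturbations)
perts = [[1.0, 0.0], [0.9, 0.1], [-1.0, 0.0]]
p_a, p_u = alignment_probabilities(perts)
```

For the toy perturbations, one of the three pairs is aligned and two are anti-aligned, so p_a = 1/3 and p_u = 2/3.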

RESULTS
The stability of the ensemble Kalman filter depends heavily on the model and the nature of observations. To gain some insight into the effect of nonlocal observations, the present work considers primarily nonlocal observations only (section 3.1). Then the last section 3.2 shows briefly the divergence rates in the presence of both local and nonlocal observations.

Nonlocal Observations
The subsequent sections consider nonlocal observations only and show how they affect the filter stability. To this end, the first studies are purely numerical and are complemented by an additional analytical study.

Numerical Results
In order to find out how the choice of localization radius r_l affects the stability of the LETKF, a large number of numerical experiments serves to investigate statistically under which conditions the filter diverges. Figure 3 shows the temporal evolution of the background x^b and the model equivalents in observation space y^b for two different localization radii. In Figure 3A, observations (black line) are very close to their model equivalents (blue lines) for identical localization and sensitivity function width, i.e., r_l = r_H. Conversely, observations and their model equivalents diverge after some time for r_l ≠ r_H. This is visible in the ensemble mean (Figure 3A, top row) and the single ensemble members (Figure 3A, bottom row). The different filter behavior can be observed in model space as well, but there it is less obvious, cf. Figure 3B. The ensemble members at spatial location n = 40 are located in the center of the observation area. They exhibit a rather small spread around the ensemble mean for r_l = r_H, whereas the ensemble spread is larger for r_l ≠ r_H. The ensemble at n = 20 is outside the observation area and thus is not updated. There, the ensembles for r_l = r_H and r_l ≠ r_H are close to each other.
This result can be generalized to a larger number of localization and sensitivity function widths, cf. Figure 4. For the smallest sensitivity function width and thus the smallest observation area, r_H = 1, no filter process diverges for a large range of localization radii r_l, i.e., the LETKF is stable (dashed black line in Figure 4). This case r_H = 1 corresponds to local observations. When the observation area is increased to r_H > 1, the filter may diverge and its divergence rate γ depends on the localization radius. We observe that the filter diverges least when the localization radius is close to the sensitivity function width. These findings hold true for both the original localization radius and the corrected radius r_c, cf. section 2.4 and Figures 4A,B. Moreover, the filter does not exhibit catastrophic divergence before the background reaches its divergence threshold.
These results also hold true if observations are subjected to additive noise and the observation error is chosen equal to the true value. As in Figure 4, the divergence rate is minimum if the sensitivity function width is close to the original (Figure 5A) or corrected (Figure 5B) localization radius r_l. The situation is different if the sensitivity function is not the non-smooth boxcar function used in the majority of the studies but a smooth Gaspari-Cohn function. Then the divergence rate still exhibits a minimum, but the corresponding localization radius of this minimum is much smaller than r_H, cf. dotted-dashed line in Figure 5.
All these results consider the realistic case of a small number of ensemble members, L ≪ N. Nevertheless, it is interesting to ask how these results depend on the ensemble size. Figure 5 (bold dotted-dashed line) indicates that an ensemble with L = N = 80 may yield maximum divergence for r_l < r_H and stability with zero divergence rate for r_l > r_H. This means that the full ensemble does not show a local minimum of the divergence rate as observed for L < N.
The divergence criterion is conservative with a hard threshold, and trials with large but sub-threshold innovations, i.e., with innovations that do not exceed the threshold, are not detected as divergent. Nevertheless, to quantify intermittent large innovations in the filter, Figure 6 shows the bias and RMSE of trials whose innovation process does not reach the divergence threshold. We observe minimum bias and RMSE for original localization radii r_l that are similar to the sensitivity function width r_H (Figure 6A). For corrected localization radii, r_c and r_H agree well at minimum bias and RMSE, cf. Figure 6B.
Now understanding that localization radii r_l ≠ r_H may destabilize the filter, the question arises where this instability comes from and which mechanisms may be responsible for the innovation divergence. Figure 7 illustrates various statistical quantities for three exemplary trials. These quantities have been proposed to reflect or explain divergence. The covariation of observed and unobserved ensemble perturbations diverges (Figure 7B) when the filter diverges (Figure 7A) for r_l < r_H and r_l ≫ r_H, whereas it remains finite for r_l ≈ r_H. Interestingly, for r_l < r_H a certain number of ensemble members align and anti-align intermittently but do not align at the instance of divergence (Figure 7C). In the case of similar localization radius and sensitivity function width, a similar number of ensemble members align and anti-align but the filter does not diverge. Conversely, for r_l ≫ r_H ensemble members both align and anti-align while the filter diverges. These results already indicate different divergence mechanisms for r_l ≤ r_H and r_l > r_H. Accordingly, for r_l < r_H and r_l ≈ r_H background member perturbations align with the analysis member perturbations with cos α_l → 1 (Figure 7D), whereas cos α_l fluctuates between 1 and −1 for r_l ≫ r_H while the filter diverges. Figure 8A shows the distribution of the time instances at which the ensemble spread in observation space and the covariation of observed and unobserved perturbations are maximum. These time instances agree well with the divergence times T_b. This confirms the single-trial finding in Figures 7A,B that both quantities are good markers for filter innovation divergence. Moreover, only few background members align and anti-align for r_l ≤ r_H (small values of p_a,u), whereas many more background members align and anti-align for r_l ≫ r_H (Figure 8B). Conversely, each analysis member aligns with its corresponding background member for r_l ≤ r_H (q_a = 1, q_u = 0) and most analysis members still align with their background members for r_l ≫ r_H (Figure 8C).
This means that nonlocal observations barely affect the direction of ensemble members in these cases.
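The alignment diagnostics above reduce to cosines between corresponding perturbation vectors. A minimal sketch, with our own function name and toy dimensions (the study's exact definitions of cos α_l are in its methods section):

```python
import numpy as np

def member_alignment(Xb, Xa):
    """Cosine of the angle between each background perturbation column of Xb
    and the corresponding analysis perturbation column of Xa:
    cos(alpha) -> 1 means alignment, -> -1 means anti-alignment."""
    num = np.sum(Xb * Xa, axis=0)
    den = np.linalg.norm(Xb, axis=0) * np.linalg.norm(Xa, axis=0)
    return num / den

rng = np.random.default_rng(1)
Xb = rng.normal(size=(40, 10))                # 40 grid points, 10 members
cos_aligned = member_alignment(Xb, 2.0 * Xb)  # rescaled members stay aligned
cos_anti = member_alignment(Xb, -0.5 * Xb)    # sign flip: anti-aligned
```

In the experiments, values of this diagnostic near ±1 across many members signal the (anti-)alignment events discussed for r_l ≫ r_H.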

Analytical Description
According to Figure 9, there are different possible configurations of the sensitivity function with respect to the localization area. The localization radius r_l may be smaller (cases 1) or larger (cases 2) than the sensitivity width r_H, or both may be equal (cases 3). In addition, it is insightful to distinguish observed and unobserved grid points, as already proposed in [18]. Now let us take a closer look at each case, cf. Figure 9:
• case 1.1, r_l ≤ r_H, |i − N/2| ≤ r_H, and |i − N/2| ≤ r_l: the localization radius is smaller than the sensitivity function width and the observation at spatial location N/2 lies within the localization radius about grid point i. Then the analysis ensemble (22) and its mean (18) follow with the corresponding ensemble means at observed grid points x̄^b_{o,i} and x̄^a_{o,i}, the first guess perturbations X_{o,i}, and the analysis ensemble members x^{a,l}_{o,i}.
• case 1.2, r_l ≤ r_H, |i − N/2| ≤ r_H, and |i − N/2| > r_l: compared to case 1.1, grid point i is observed as well but the observation lies outside the localization area; hence the analysis is identical to the first guess.
• case 1.3, r_l ≤ r_H, |i − N/2| > r_H, and |i − N/2| > r_l: grid point i is not observed and the observation lies outside the localization area, leading to an analysis with the corresponding unobserved ensemble means x̄^b_{u,i} and x̄^a_{u,i}, the unobserved ensemble perturbations X_{u,i}, and the analysis ensemble members x^{a,l}_{u,i}.
• case 2.1, r_l > r_H, |i − N/2| ≤ r_H, and |i − N/2| ≤ r_l: the localization radius is larger than the sensitivity function width, the observation lies within the localization radius about grid point i, and all such grid points are observed. This case is equivalent to case 1.1 and the expressions for the analysis ensemble and mean hold as well.
• case 2.2, r_l > r_H, |i − N/2| > r_H, and |i − N/2| ≤ r_l: compared to case 2.1, the observation lies within the localization radius but the grid points are unobserved; the corresponding analysis ensemble and mean follow accordingly.
• case 2.3, r_l > r_H, |i − N/2| > r_H, and |i − N/2| > r_l: in this case, the grid points are unobserved and the observation lies outside the localization area. Then the analysis is identical to the first guess and the case is equivalent to case 1.3.
In cases 1.1, 2.1, and 3.1 the observation is assimilated, whereas in cases 1.3, 2.3, and 3.2 no update is applied. These cases appear to be consistent since grid points that contribute to the observation are updated by the observation and grid points that do not contribute to the observation are not updated. Conversely, observed grid points in case 1.2 do not consider the observation and are not updated although they contribute to the first guess in observation space. This missing update contributes to the filter error and the filter divergence, as stated in previous work [12]. Moreover, the unobserved grid points in case 2.2 do consider the observation and are updated by the Kalman filter. At first glance, this inconsistency may be as detrimental as that of case 1.2. However, it is arguable whether this inconsistency contributes to the filter error. On the one hand, the background error covariance propagates information from observed to unobserved grid points in each cycle step. This may hold true for system phenomena with a large characteristic spatial scale, such as wind advection or long-range moisture transport in meteorology or, more generally, emerging long-range spatial synchronization events. On the other hand, if the background error covariance represents a bad estimate, e.g., due to sampling errors or short-range synchronization, the false (or inconsistent) update may enhance erroneously propagated information and hence contribute to the filter divergence. This agrees with the vanishing divergence in the case of a full ensemble [cf. Figure 5 (bold dotted-dashed line)].
Moreover, updates at unobserved grid points may be erroneous due to model errors or the approximation error made by replacing a nonlocal observation by an artificial observation at a single location. The larger the localization radius, the more distant grid points are from the observation location and the less representative the localized observation is for distant grid points.
Hence these two latter cases may cause detrimental effects. Consequently, cases 1 and 2, i.e., r_l ≠ r_H, yield bad estimates of analysis updates that make the Kalman filter diverge. Conversely, case 3, i.e., r_l = r_H, involves consistent updates only, and the detrimental effects described for the other cases are not present. These effects may explain the enhanced filter divergence for r_l ≠ r_H and the minimum filter divergence for r_l = r_H seen in Figure 3, and the minimum divergence rate at r_l ≈ r_H shown in Figure 4.
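The case analysis above can be sketched as a simple classification over grid points, assuming a boxcar sensitivity function centered at N/2 (the function name and this compact form are ours; the case labels follow Figure 9):

```python
def localization_case(i, N, r_l, r_H):
    """Classify grid point i by the case analysis of Figure 9: a grid point
    is observed if |i - N/2| <= r_H (inside the boxcar sensitivity function),
    and the observation localized at N/2 updates i only if |i - N/2| <= r_l."""
    d = abs(i - N / 2)
    observed = d <= r_H
    in_localization = d <= r_l
    if r_l < r_H:
        group = "1"
    elif r_l > r_H:
        group = "2"
    else:
        group = "3"
    if observed and in_localization:
        sub = "1"                                  # consistent update
    elif group == "1":
        sub = "2" if observed else "3"             # 1.2: observed, no update
    elif group == "2":
        sub = "2" if in_localization else "3"      # 2.2: unobserved, updated
    else:
        sub = "2"                                  # 3.2: unobserved, no update
    return group + "." + sub

# r_l > r_H: a grid point outside the sensitivity function but inside the
# localization area is updated although unobserved (inconsistent case 2.2)
case = localization_case(i=25, N=40, r_l=10, r_H=3)
```

This makes the two inconsistent configurations explicit: "1.2" (observed but not updated) and "2.2" (unobserved but updated).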
The important terms in case 2.2, i.e., Equations (32) and (33), are X_{u,i} Y^t, α_i, √λ_i Y^l / y², and Σ_n X_{u,i} v^n (v^n)^l; equivalently, these are the terms missing in case 1.2. The corresponding model-perturbation terms and α_i appear in both cases 2.1 and 2.2. The terms c_{o,u} represent the covariances between model and model-equivalent perturbations over ensemble members, and they may contribute differently to the intermittent divergence with increasing |r_l − r_H|.
For a closer investigation of these terms, let us consider the weighted ensemble covariances between model and model-equivalent perturbations in observation space, c_o in case 2.1 and c_u in case 2.2. To quantify their difference, let A_o and A_u denote their maxima and T_o and T_u the time instances at which these maxima occur. Since A_o is strongly correlated with the filter innovation divergence, the question arises whether the difference between weighted observed and unobserved model-observation covariances is related to the innovation divergence. Figure 11 shows the distributions of ΔA = A_o − A_u and ΔT = T_o − T_u for divergent and non-divergent experimental trials. Most trials exhibit stronger model-observation covariances at unobserved grid points than at observed grid points (ΔA < 0), cf. Figure 11A, and the distribution variances of divergent and non-divergent trials are significantly different (Fligner-Killeen test, p < 0.001). Moreover, the distribution of ΔT in divergent trials is asymmetric since ΔT > 0 for almost all divergent trials (see Figure 11B). Hence weighted model-observation covariances at unobserved grid points reach their maximum significantly earlier than at observed grid points. Conversely, ΔT of non-divergent trials scatters about zero within a much wider range from negative to positive values (Fligner-Killeen test, p < 0.0001).
In this context, recall that A_u > A_o but T_u < T_o in divergent trials, i.e., unobserved grid points reach their larger maximum faster than observed grid points. This indicates that the model-observation covariance c_u reflects the instability of the filter.
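Extracting the maxima A_o, A_u and their times T_o, T_u from two covariance time series is straightforward; the following sketch reproduces the divergent-trial pattern just described on synthetic data (the Gaussian-pulse series and all names are illustrative assumptions):

```python
import numpy as np

def covariance_maxima(c_o, c_u):
    """Given time series of weighted model-observation covariance magnitudes
    at observed (c_o) and unobserved (c_u) grid points, return the maxima
    A_o, A_u and the cycle indices T_o, T_u at which they occur."""
    A_o, A_u = float(np.max(c_o)), float(np.max(c_u))
    T_o, T_u = int(np.argmax(c_o)), int(np.argmax(c_u))
    return A_o, A_u, T_o, T_u

# synthetic divergent-trial pattern: the unobserved covariance reaches a
# larger maximum earlier than the observed one (A_u > A_o and T_u < T_o)
t = np.arange(200)
c_u = 3.0 * np.exp(-0.5 * ((t - 80) / 10.0) ** 2)
c_o = 2.0 * np.exp(-0.5 * ((t - 120) / 10.0) ** 2)
A_o, A_u, T_o, T_u = covariance_maxima(c_o, c_u)
```

With these per-trial statistics, the distributions of ΔA = A_o − A_u and ΔT = T_o − T_u over many trials can then be compared between divergent and non-divergent groups (e.g., with `scipy.stats.fligner` for the Fligner-Killeen test).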

Local and Nonlocal Observations
Several international weather services apply ensemble Kalman filters and assimilate both nonlocal and local observations. Performing assimilation experiments similar to those for nonlocal observations, but now with a single additional local observation at grid point i = N/2 (cf. section 2.7), the filter divergence rate γ indicates the filter stability. Figure 12 illustrates how local observations affect the filter stability in addition to nonlocal observations. For r_H = 1, the filter rarely diverges due to large innovations (with the fewest trials at r_l ≈ 10) but more often than in the absence of local observations, cf. Figure 4. Moreover, increasing the localization radius yields a higher number of trials with catastrophic filter divergence, with a maximum catastrophic divergence rate at r_l ≈ 10. In sum, the fewest divergent trials occur at r_l = r_H = 1 (blue curve in Figure 12). A similar stability behavior occurs for r_H = 5, with a minimum innovation divergence rate at r_l ≈ r_H and a maximum catastrophic divergence rate at r_l ≈ 10. Again, the fewest trials diverge at r_l = r_H. Figure 1 motivates the present work by demonstrating that nonlocal observations yield larger first guess departures than local observations only. Here, it is interesting to note that the numerical trial in Figure 1 with nonlocal observations exceeds the innovation divergence threshold, cf. section 2.8, but was run over all filter cycles for illustration purposes. Moreover, several trials with the same parameters exhibit catastrophic filter divergence and the shown trial is a rare case. This divergence could have been avoided by implementing stabilizing features, such as ensemble enlargement [19], adaptive localization [29], adaptive inflation [18], or a first guess check [13,45]. However, these methods would have introduced additional assimilation effects and the results would not have been comparable to the findings and insights in the remainder of this work.

DISCUSSION
Ensemble Kalman filtering of nonlocal observations may increase the innovation in the filter process, leading to a larger observation-background departure bias and RMSE, cf. Figure 1. It is demanding to detect this innovation divergence since it is finite and transient, i.e., of finite duration. At first glance, this negative impact is surprising since observations are thought to introduce additional knowledge to the system and thus should improve forecasts or at least retain their quality. To better understand why nonlocal observations may be detrimental, the present work performs numerical studies to identify markers of innovation divergence and to understand their origin.

Nonlocal Observations Facilitate Filter Divergence
The majority of previous stability studies of Kalman filtering involving nonlocal observations consider catastrophic filter divergence. Kelly et al. [20] show analytically, for a specific simple but non-trivial model, how catastrophic filter divergence of an ensemble Kalman filter is affected by nonlocal observations. The work of Marx and Potthast [44] is an analytical discussion of the linear Kalman filter, and the authors derive corresponding stability conditions. Conversely, the present work considers intermittent innovation divergence and, to the best of our knowledge, is one of the first to demonstrate this important effect numerically. Intermittent innovation divergence is detrimental to forecasts and is visible, e.g., in first guess departure statistics (Figure 1). It occurs for a nonlocal observation only (Figure 4) or for a nonlocal and an additional local observation (Figure 12), and this holds true for almost all localization radii. Figures 4, 5, 6, 12 show that innovation divergence depends on the relation between the sensitivity function width r_H and the localization radius r_l. The LETKF diverges least when r_l ≈ r_H, and hence this choice of localization radius is called optimal. This insight agrees with the finding in an operational weather prediction framework involving the LETKF [13]. The authors consider an adaptive localization for (nonlocal) satellite observations and set the corresponding radius equal to the sensitivity function width. In two different weather situations, this tight relation improves short- and medium-range weather forecasts compared to the case of independent sensitivity width and localization radius. Figure 9 illustrates the possible reason for the detrimental effect of differing sensitivity function width and localization radius: the LETKF is inconsistent if it updates the state at unobserved spatial locations or does not update the state at observed spatial locations.
Only if the sensitivity function width and the localization radius are similar is this detrimental effect small. Such an inconsistency is in line with other inconsistencies in ensemble Kalman filters caused by localization, cf. [46]. For instance, a full ensemble reduces inconsistencies for localization radii larger than the sensitivity function width and yields filter stability (Figure 5).

Optimal Localization Radius
It is important to point out that, under certain conditions, it may be beneficial to further enlarge the localization area compared to the sensitivity function. If the system's activity synchronizes on a larger spatial scale, then information is shared between observed and unobserved grid points and a larger localization radius would be beneficial. Examples of such synchronization phenomena are deep clouds or large-scale winds in meteorology, or locally self-organized spots in complex physical systems. In other words, to decide how to choose the localization radius one should take a closer look at the system's dynamics: if larger spatially synchronized phenomena are expected, then r_l ≫ r_H is preferable, otherwise r_l ≈ r_H.
Several previous studies have derived optimal localization radii for local observations in ensemble Kalman filters [47][48][49] and the specific LETKF [28,50]. A variant of the LETKF localizes not in observation space, as in the present work, but in the spatial domain [24,31,34,51], where the localization of nonlocal observations has been studied as well [52]. There is general agreement for local and nonlocal observations that the optimal localization radius depends on the ensemble size and the observation error but seems to be independent of the model [50].
Essentially, it is important to point out that there may not be a single optimal localization radius but rather a range of more or less equivalent localization radii. This holds true for smooth localization functions, whereas non-smooth (i.e., unrealistic) localization functions do not show this uncertainty, cf. section 2.4.
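To illustrate the contrast between smooth and non-smooth localization, the following sketch compares the widely used Gaspari-Cohn compactly supported correlation function with a boxcar weight (our own illustration; the paper's exact localization function is defined in its section 2.4):

```python
import numpy as np

def gaspari_cohn(d, c):
    """Smooth, compactly supported Gaspari-Cohn localization weight for
    distance d and half-width c; the weight vanishes for d >= 2c."""
    r = np.abs(d) / c
    w = np.zeros_like(r)
    inner = r <= 1.0
    outer = (r > 1.0) & (r < 2.0)
    w[inner] = (-0.25 * r[inner]**5 + 0.5 * r[inner]**4 + 0.625 * r[inner]**3
                - (5.0 / 3.0) * r[inner]**2 + 1.0)
    w[outer] = ((1.0 / 12.0) * r[outer]**5 - 0.5 * r[outer]**4
                + 0.625 * r[outer]**3 + (5.0 / 3.0) * r[outer]**2
                - 5.0 * r[outer] + 4.0 - (2.0 / 3.0) / r[outer])
    return w

def boxcar(d, r_l):
    """Non-smooth boxcar localization: full weight inside r_l, zero outside."""
    return (np.abs(d) <= r_l).astype(float)

d = np.linspace(0.0, 20.0, 41)          # distances in grid-point units
w_smooth = gaspari_cohn(d, c=5.0)        # tapers gradually to zero at d = 10
w_box = boxcar(d, r_l=10.0)              # jumps from 1 to 0 at d = 10
```

The gradual taper of the smooth function is what blurs the notion of a single optimal radius: nearby radii produce nearly identical weights, whereas the boxcar changes abruptly at its cut-off.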

Origin of Divergence
It is important to understand why some numerical trials diverge and some do not. Direct and indirect markers indicate which dynamical features play an important role in divergence. The most obvious direct markers are the absolute value of the innovation and the ensemble member perturbation spread, and both increase sharply during filter innovation divergence, cf. Figures 4, 6, 7B, 8, 12. Similarly, the covariation of observed and unobserved background errors also increases during divergence. Interestingly, these markers remain finite and take their maxima just before the instance of divergence, cf. Figure 8. The covariation increases if both observed and unobserved errors increase. Kelly et al. [20] and Tong et al. [18] argue that this indicates a shift of power from observed to unobserved errors and that this shift is responsible for catastrophic divergence. The present findings support this line of argumentation and extend it to intermittent innovation divergence. This can be seen in Figure 11A: it shows larger mean weighted model-observation error covariances (i.e., ensemble error covariances) at unobserved grid points than at observed grid points (ΔA < 0), and these weighted model-observation covariances increase faster at unobserved grid points than at observed grid points. In addition, the larger the localization radius r_l > r_H, the larger the ensemble error at unobserved grid points compared to observed grid points. Hence the model-observation covariance reflects a degree of instability (and thus of divergence) in the filter, and this is stronger at unobserved grid points than at observed grid points. Figures 4, 5, 6, 12 provide further evidence on possible error sources that yield filter divergence. The asymmetry of the divergence rates with respect to r_l ≈ r_H hints at different underlying filter divergence contributions. If r_l < r_H, too few grid points are updated by the nonlocal observation (Figure 9) although they are observed.
Consequently, model equivalents in observation space include contributions from non-updated grid points, which may be large if the model magnitudes are large; hence this error mechanism is rather strong. Fertig et al. [12] have identified this case as a possible source of divergence and proposed to adapt the localization radius to the sensitivity function width. In fact, this removes case 1 in Figure 9 and stabilizes the filter for r_l < r_H.
For r_l ≫ r_H, a large number of grid points are updated which, however, consider an observation with a large intrinsic error resulting from, e.g., a too small number of ensemble members. The corresponding assimilation error is more subtle than for r_l < r_H and increases only for larger localization radii. The localized nonlocal observation comprises a representation error due to the reduction of the broad sensitivity function to a single location. For small ensembles, this implicit observation error contributes to the analysis update error and, finally, to filter divergence. In sum, the two inconsistencies illustrated in Figure 9 and derived in section 3.1 represent two possible contributions to the filter divergence for a low number of ensemble members. Conversely, for a full ensemble, intrinsic error contributions are well reduced, rendering the filter more stable (Figure 5).
Moreover, there is some evidence that ensemble member alignment may cause catastrophic filter divergence [19][20][21]. Figure 8 shows such indirect markers, indicating weak member anti-alignment for r_l ≤ r_H but enhanced alignment and anti-alignment for r_l > r_H. The authors in [19] argue that finite ensemble sizes cause the ensemble to align in case of divergence, and Ng et al. [53] show that the ensemble members may align with the most unstable phase space direction. However, our results reveal that member alignment does not represent the major mechanism for innovation divergence. Conversely, Figure 8 provides evidence for alignment of analysis increments and background perturbations when the filter diverges. This alignment indicates that the analysis members point in the same direction as the background members. For instance, if background member perturbations point to less stable locations in phase space, then the LETKF does not correct this direction and the new analysis state is less stable, cf. the model example in [20]. This accords with the reasoning in [53].
In addition to the alignment mechanism, Equation (34) represents the covariation of ensemble perturbations in spatial and observation space at observed and unobserved spatial locations. For observed spatial locations, it is maximum just before the innovation divergence time. Moreover, it reaches its maximum at unobserved locations almost always before the maximum at observed locations is reached (Figure 11). It seems that this new feature represents an important contribution to the innovation divergence, and future work will analyse this covariation in more detail.

Limits and Outlook
The present work considers the specific case of a finite, low ensemble size and the application of the localization scheme. To better understand the origin of the filter divergence, it is insightful to study in detail the limiting case of large ensemble sizes, i.e., close to the model dimension, and a neglect of localization. Although this limit is far from practice in seismology and meteorology, where the model systems are too large to study it, it is of theoretical interest and future work will consider it in some detail.
There is some evidence that the optimal localization radius is flow-dependent [54,55], whereas we assume a constant radius. In addition, the constrained choice of parameters and the missing standard techniques to prevent divergence, such as adaptive inflation and adaptive observation error, limit the present work in generality and interpretation and thus make it hard to derive decisive conclusions. Future work will implement adaptive schemes [56,57] in a more realistic model framework.
In the majority of studies, the present work considers a non-smooth boxcar sensitivity function in order to distinguish observed and unobserved grid points. Although this simplification allows us to gain a deeper understanding of possible contributions to the filter divergence, the sensitivity function is unrealistic. A more realistic sensitivity function is smooth and unimodal or bimodal. Figure 5 shows that such a sensitivity function yields a minimum divergence rate, but the localization radius at the minimum rate is much smaller than the sensitivity function width. Consequently, the line of argumentation about Figure 9 does not apply here since there is no longer a clear distinction between observed and unobserved grid points. Future work will attempt to consider smooth unimodal sensitivity functions.
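The difference between the two sensitivity shapes can be made concrete by writing the nonlocal observation as a weighted spatial integral over the model grid. A minimal sketch with our own illustrative names and grid size (the paper's observation operator is given in its methods section):

```python
import numpy as np

def nonlocal_obs(x, w):
    """Model equivalent of a nonlocal observation: a sensitivity-weighted
    spatial integral y = sum_i w_i x_i over the model grid."""
    return float(np.dot(w, x))

N = 40
i = np.arange(N)
r_H = 5

# boxcar sensitivity of half-width r_H about N/2: sharp split between
# observed (weight > 0) and unobserved (weight = 0) grid points
w_box = (np.abs(i - N // 2) <= r_H).astype(float)
w_box /= w_box.sum()

# smooth unimodal (Gaussian) alternative: every grid point contributes a
# little, so there is no clear observed/unobserved distinction
w_gauss = np.exp(-0.5 * ((i - N // 2) / r_H) ** 2)
w_gauss /= w_gauss.sum()

x = np.ones(N)                       # constant model state for illustration
y_box = nonlocal_obs(x, w_box)       # normalized weights average the state
y_gauss = nonlocal_obs(x, w_gauss)
```

With the boxcar weight, exactly 2 r_H + 1 grid points are observed, which is what enables the case analysis of Figure 9; the Gaussian weight removes that sharp boundary.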
Moreover, the localization scheme for nonlocal observations applied in the present work is very basic due to its choice of the maximum sensitivity as the observation's location. Future work will investigate cut-off criteria, such as in [12], that choose the location of nonlocal observations within the range of the sensitivity function. Fertig et al. [12] have also shown that such a cut-off criterion improves first guess departure statistics and well reduces the divergence for localization radii smaller than the sensitivity weighting function width.
Nevertheless, the present work introduces the problem of intermittent innovation divergence, extends lines of reasoning on the origin of filter divergence to nonlocal observations, and proposes new markers of innovation divergence.

DATA AVAILABILITY STATEMENT
The datasets generated for this study are available on request to the corresponding author.

AUTHOR CONTRIBUTIONS
AH conceived the work, performed all simulations, and wrote the manuscript.