A Comparison Study on Performance of an Adaptive Filter with Other Estimation Methods for State Estimation in High-Dimensional Systems

In this chapter, a performance comparison between the adaptive filter (AF) and other estimation methods, especially the variational method (VM), is given in the context of the data assimilation problem in dynamical systems of (very) high dimension. The emphasis is put on the importance of the innovation approach, which is the basis for construction of the AF, as well as on the choice of the set of tuning parameters in the filter gain. It will be shown that the innovation representation of the initial dynamical system plays an essential role in ensuring stability of the assimilation algorithms for both stable and unstable system dynamics. Numerical experiments illustrate the performance of the AF.


Introduction
Consider the following data assimilation problem: Given the dynamical system

x_{k+1} = Φ_k x_k + w_k,  (1.1)

and the observations

z_k = H_k x_k + v_k,  (1.2)

where x_k ∈ R^n is the system state at instant k, Φ_k is the fundamental (transition) matrix, w_k is a white model noise, z_k ∈ R^p is the observation vector, H_k is the observation operator, and v_k is a white observation noise; the problem is to estimate the system state x_k from the observations z_1, …, z_k.
In this chapter, the emphasis is put principally on the comparison of the AF with the VM. For a review of data assimilation methods in meteorology and oceanography, see Ref. [5]. To further demonstrate the advantages of the AF, we implement the extended KF (EKF) [6] in Section 6 and compare its performance with that of the AF (in the experiment with the Lorenz system). The Cooper-Haines filter (CHF) [7], widely used in data assimilation in oceanography, is also applied in Section 7 to produce an estimate of the ocean state; it serves as a reference against which the estimate produced by the AF in the high-dimensional setting is compared.
In the next section, the variational method (VM), which is widely used in data assimilation for high-dimensional systems in meteorology and oceanography, is outlined. Section 3 presents the recently developed AF approach to data assimilation. The main idea of the AF is to take the innovation representation of the input-output system as a departure point for formulating the optimization problem, with the parameters of the filter gain as control variables. Section 4 presents the tools to implement the AF in a simple and efficient way, adapted to the high-dimensional setting. This includes the objective function, filter stability, the structure of the error covariance matrix (ECM), gain parameterization, and the optimization algorithm known as simultaneous perturbation stochastic approximation (SPSA). It is shown how the ECM is estimated using an ensemble of samples of the prediction error (PE) and the hypothesis of separation of vertical and horizontal variables structure (SeVHS) in the ECM. A computational comparison between the VM and AF is also given here. Section 5 includes a simple numerical experiment showing in detail how the VM and AF work, the difficulties of the VM in searching for an optimal solution, and why no similar difficulties are encountered with the AF. A more complicated experiment with the chaotic system known as the Lorenz system is carried out in Section 6; the difficulty here is the extreme sensitivity of its solution to small errors in the initial condition. Section 7 presents the performance of different filters in a data assimilation experiment with the high-dimensional ocean model MICOM in a North Atlantic configuration. Conclusions are given in Section 8.

Thus, in the VM, we seek optimal solutions in the functional space (the space of sequences {x_k}). For systems of high dimension, this task is impossible to perform, and a simplification is required.
Suppose the system Eq. (1.1) is linear and perfect, that is, w_k = 0. Expressing all x_k as functions of the initial state x_0,

x_k = Φ(k, 0) x_0,  Φ(k, l) = Φ_{k−1} ⋯ Φ_l (k > l),  Φ(k, k) = I,  (2.5)

where I is the identity matrix of appropriate dimension, and substituting x_k, ∀k, from Eq. (2.5) into the observation equation, at each kth observation instant a set of observations on x_0 is available. Under the perfect-model assumption, the optimization problem Eqs. (2.1)-(2.3) simplifies to

J[x_0] → min over x_0.  (2.7)

The gradient of this cost involves products of the form Φ^T(k, 0) y for some y. As Φ^T(k, 0) is impossible to store, the approach known as the adjoint equation (AE) is used, which requires constructing AE code for computing the product Φ^T(k, 0) y. Each iteration in the minimization of Eqs. (2.7) and (2.8) thus requires one integration of the model over the assimilation period, followed by one adjoint integration. The cost of one adjoint integration is about twice the cost of one direct integration, so that one minimization requires the equivalent of between 50 and 100 integrations of the model over the assimilation period (p. 205 of Ref. [9]). In the next section, we will see that SPSA can be used to solve this problem at a much lower cost.
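To make the adjoint mechanics concrete, here is a minimal Python sketch for a linear, perfect-model case: the gradient of a quadratic observation misfit J[x_0] is accumulated by one forward model integration followed by one backward sweep with Φ^T. The matrix Φ, the dimensions, the identity observation operator, and the noise level are all illustrative, not taken from the chapter.

```python
import numpy as np

rng = np.random.default_rng(0)

n, N = 4, 20                     # state dimension, number of observation instants
Phi = np.eye(n) + 0.01 * rng.standard_normal((n, n))  # linear model x_{k+1} = Phi x_k
x0_true = rng.standard_normal(n)

# Generate observations z_k = x_k + v_k under the perfect-model hypothesis
xs = [x0_true]
for _ in range(N):
    xs.append(Phi @ xs[-1])
zs = [x + 0.1 * rng.standard_normal(n) for x in xs[1:]]

def cost_and_gradient(x0):
    """J(x0) = 1/2 sum_k ||x_k - z_k||^2; gradient via one backward (adjoint) sweep."""
    # Forward integration of the direct model
    traj = [x0]
    for _ in range(N):
        traj.append(Phi @ traj[-1])
    J = 0.5 * sum(np.sum((traj[k + 1] - zs[k]) ** 2) for k in range(N))
    # Backward (adjoint) integration accumulates the Phi^T products
    lam = np.zeros(n)
    for k in reversed(range(N)):
        lam = Phi.T @ (lam + (traj[k + 1] - zs[k]))
    return J, lam

J, g = cost_and_gradient(np.zeros(n))
# Finite-difference check of the adjoint gradient in the first component
eps, i = 1e-6, 0
e = np.zeros(n); e[i] = eps
J_plus, _ = cost_and_gradient(e)
assert abs((J_plus - J) / eps - g[i]) < 1e-3
```

One forward and one backward sweep per gradient evaluation is exactly the cost structure discussed above; only the storage of the forward trajectory is needed, never Φ^T itself as a stored matrix power.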

Comment 2.3
As θ (the initial state) has a physical meaning, it is important to introduce constraints enforcing a physically realistic structure of the correction to θ during the estimation process. A poor (in the physical sense) structure of the guess for the initial state can lead to large estimation errors.

Adaptive filtering (AF)
To overcome the difficulties listed in Comments 2.1-2.3, an adaptive filtering (AF) approach has been proposed in [10]. The main difference between the AF and the VM lies in the choice of the innovation representation of the original input-output system Eqs. (1.1) and (1.2) as the departure point for formulating the optimization problem. It is well known that under standard conditions, the optimal (in the MSE sense) estimate x̂_k can be obtained by the KF. As the innovation process for the system output in the KF forms a white sequence, Kailath [11] developed the innovation approach, in an elegant way, to derive the optimal filter for more general linear systems (nonstationary systems, filtering problems with Markovian model and observation errors). The innovation approach to linear least-squares estimation problems is first to "whiten" the output data and then to treat the resulting, simpler, white-noise observation problem. Consider the observation (output) sequence z_k. The innovation process associated with z_k is written as

ζ_k = z_k − E[z_k | z_1^{k−1}].  (3.1)

Under standard conditions (Gaussianity, uncorrelated noise sequences, …),

E[z_k | z_1^{k−1}] = H_k x̂_{k/k−1},  (3.2)

where x̂_{k/k−1} is the optimal (in the MSE sense) one-step-ahead prediction of x_k given z_1^{k−1}. Using ζ_k instead of z_k, one can write out the formula for the estimate x̂_k and the KF under standard conditions. The filter has the form

x̂_{k+1} = Φ_k x̂_k + K_{k+1} ζ_{k+1},  K_{k+1} = M_{k+1} H_{k+1}^T (H_{k+1} M_{k+1} H_{k+1}^T + R_{k+1})^{−1},  (3.3)

where M_k is the ECM of the prediction x̂_{k/k−1}. This matrix is found as the solution of the Riccati equation

M_{k+1} = Φ_k (I − K_k H_k) M_k Φ_k^T + Q_k.  (3.4)

Due to the very expensive computational burden of time stepping the ECM M_k in Eq. (3.4), as well as the insufficient memory storage, the KF is impractical for solving data assimilation problems in a very high-dimensional setting. The idea of the AF is based on the fact that when the filter is optimal, the innovation ζ_k has minimum variance. If we assume that the gain K_k belongs to a set of parameterized gains, that is,
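A scalar Python sketch of the KF in innovation form may help fix ideas: the gain and the ECM are propagated by the Riccati recursion, and (when the filter is optimal) the resulting innovation sequence is white. All numerical values here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

# Scalar linear system: x_{k+1} = phi*x_k + w_k,  z_k = h*x_k + v_k
phi, h, q, r = 0.95, 1.0, 0.01, 0.1
x_true = 0.5
x, P = 0.0, 1.0                          # filtered estimate and its error covariance

innovations = []
for _ in range(500):
    x_true = phi * x_true + np.sqrt(q) * rng.standard_normal()
    z = h * x_true + np.sqrt(r) * rng.standard_normal()

    x_pred = phi * x                     # one-step-ahead prediction
    M = phi * P * phi + q                # prediction ECM (Riccati time update)
    zeta = z - h * x_pred                # innovation
    K = M * h / (h * M * h + r)          # filter gain
    x = x_pred + K * zeta                # filtered estimate
    P = (1.0 - K * h) * M                # analysis ECM (Riccati measurement update)
    innovations.append(zeta)

# Optimality check: the innovation sequence should be (nearly) white
zs = np.array(innovations[50:])
lag1 = np.corrcoef(zs[:-1], zs[1:])[0, 1]
assert abs(lag1) < 0.2
```

The whiteness of the innovations is the property the AF exploits: it is observable from the data, unlike the statistics of the model noise itself.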

K_k = K_k(θ),  θ ∈ Θ,  (3.5)

the optimal AF can be sought within a class of parameterized filters of a given structure. The following objective function is introduced:

J(θ) = E[Ψ(ζ_k)] → min over θ,  (3.6)

where Ψ(ζ_k) is a sample cost function of the innovation (e.g., Ψ(ζ_k) = ζ_k^T ζ_k). In Ref. [12], different classes of parameterized filters are found which belong to the class of stable reduced-order filters (ROF) [10, 13].
As an example of one class of ROFs, consider the gain structure

K_k = λ P_{r,k} K_{e,k},  (3.9)

where λ is a scalar tuning parameter, K_{e,k}: R^p → R^{n_e} represents the gain mapping the innovation vector from the observational space to the reduced space R^{n_e} of dimension n_e ≤ n, and P_{r,k} maps the reduced space R^{n_e} to the full space R^n. The choice of the reduced space is of primary importance, since the main characteristic of the filter, its stability, depends on it. As proved in [12], under a detectability condition, stability of the filter is ensured by forming the columns of P_{r,k} from the unstable and stable eigenvectors (or singular vectors, Schur vectors) of the fundamental matrix Φ_k, and one can choose the tuning parameter θ := λ subject to the stability condition given in Eq. (3.10).

Batch data formulation
We now list the main differences between the two approaches, VM and AF, from which the advantages of the AF over the VM become clear.
To ease the comparison between the two approaches, let us write out the objective function Eq. (3.6) using its representation in a sample space,

J_N(θ) = (1/N) Σ_{k=1}^{N} Ψ(ζ_k) → min over θ.  (4.1)

Note that in the practical implementation of the AF, the optimization algorithm is constructed not on the basis of Eq. (4.1), but on Eq. (3.6). This is because Eq. (4.1) is written in a batch form, which requires optimization over the whole time interval [1, N], resulting in a very high computational burden. Minimizing Eq. (3.6) allows applying the SPSA method, which is much less demanding in both computation and memory. Below, the main differences between the VM and AF are listed:
(D1) Dynamical system (DS): in Eq. (2.1), the DS is the initial system Eq. (1.1); in Eq. (4.1), the DS is the filtering Eq. (3.3). This difference has an interesting consequence: while in practice very little is known about the statistics of w_k, the sequence ζ_k is observed, and hence it is possible to estimate the statistics of ζ_k.
(D2) The system noise w k in Eq. (1.1) is white, while in Eq. (3.3), ζ k is a white sequence only if the filter is optimal. That allows us to easily apply different statistical tests for verifying the optimality of the assimilation procedure.
(D3) Control variable x 0 in the VM is the initial state, whereas the control variable in Eq. (3.6) is the parameter vector θ.
This difference has an important consequence: as x_0 must have a precise physical meaning (depending, for example, on the ocean domain of interest), the structure of the guess θ_0 := x̂_0^0 (for the initial state), as well as of the correction δx_0^ν generated by the iterative algorithm, must be chosen carefully so that at each iteration ν, the estimate x̂_0^ν = x̂_0^{ν−1} + δx_0^ν has a physically realistic structure. This is not an easy task. In the AF, on the other hand, the parameters usually have no direct physical meaning [see θ in Eq. (3.10)]; hence, the choice of structure for θ poses no such difficulty.
(D4) Suppose the DS Eq. (1.1) is unstable. This implies that the error in estimating x_0 will grow during integration of the direct model and of the AE. The AF, by contrast, keeps the filtering system Eq. (3.3) stable by construction. This can be seen by representing the filtering Eq. (3.3) through its fundamental matrix L_k = (I − K_k H_k)Φ_k. As shown in Ref. [12], the filter Eq. (3.3) can always be made stable by an appropriate choice of the gain.
One sees that for a batch of N observations, Eq. (4.3) requires computation of N terms (not counting the term M_0^{−1} e_0^ν). The kth term is associated with the assimilation instant k: one first needs to compute μ_k := Φ(k, 0) e_0^ν, that is, to integrate the direct model Φ_κ k times for κ = 1, …, k, and then to integrate the AE Φ_κ^T backward (k times as well). The larger the k, the bigger the amplification of the initial error e_0^ν and of the observation error v_k. The error e_0^ν is amplified doubly, since it is integrated by both the direct and adjoint models. But the amplification of v_k (and of w_k when w_k ≠ 0) is the most worrying, since it enters the gradient estimate, possibly making the gradient direction completely erroneous.

Implementation of AF
The choice of Eq. (3.6) is important in many respects for obtaining a simple and efficient data assimilation algorithm. The idea behind the criterion Eq. (3.6) is to select some pertinent parameters as control variables for minimizing the mean of the cost function Ψ(ζ_k). The solution of problem Eq. (3.6) can be found iteratively using a stochastic approximation (SA) algorithm

θ_{k+1} = θ_k − a_k ∇_θ Ψ(ζ_k),  (4.4)

where {a_k} is a sequence of positive scalars satisfying conditions that guarantee convergence of the estimation procedure. The standard conditions are

a_k → 0,  Σ_k a_k = ∞,  Σ_k a_k^2 < ∞.

The algorithm Eq. (4.4) is much simpler [compared to the computation of Eq. (4.3)] since it requires, at the kth assimilation instant, computing only the gradient of the sample cost function Ψ(ζ_k). The gradient ∇_θ Ψ(θ_k) of the sample objective function Ψ(θ_k) can be computed using the AE approach (in what follows, for simplicity, the subscript k is omitted to shorten the notation).
Thus, minimization of Eq. (3.6) by the gradient-based SA algorithm requires only one integration of the direct model and one backward integration of the AE code: a direct integration of x̂_k to produce the forecast x̂_{k+1/k} = Φ x̂_k, and a backward integration involving Φ^T H_k^T ζ_k in the computation of ζ′_{k+1}. For the gain structure Eq. (3.9), the objective function Ψ is quadratic with respect to θ; hence, one can easily find the optimal parameters.
A lower computational burden can be achieved by measuring only values of the sample objective function (rather than using a gradient formula): instead of computing the gradient by Eq. (4.6) based on the AE, one can approximate the gradient from values of the cost function [on the basis of a finite-difference scheme (FDSA)]. Traditionally, the ith component of the gradient is approximated by

ĝ_i(θ) = [Ψ(θ + c e_i) − Ψ(θ − c e_i)] / (2c),  (4.7)

where e_i is the unit vector with 1 in the ith component and 0 otherwise, and c > 0 is a small step size.
It is seen that FDSA algorithms do not require a formula for the gradient. However, for high-dimensional systems (n ≈ O(10^6)-O(10^7)), this algorithm is inapplicable due to the component-wise derivative approximation: approximating each partial derivative of the cost function requires two integrations of the direct model.
In order to overcome the difficulties with the very high dimension of θ, the class of algorithms known as simultaneous perturbation SA (SPSA) has recently received great interest [14, 15]. The SPSA algorithm has the same structure as the FDSA Eq. (4.7), with the difference residing in the way all components of θ are perturbed stochastically and simultaneously.
Let Δ_k = (Δ_k(1), …, Δ_k(n))^T be a random vector whose components Δ_k(i), i = 1, …, n, are Bernoulli independent identically distributed (iid), taking the values ±1. The gradient of the objective function is estimated as

ĝ_i(θ_k) = [Ψ(θ_k + c_k Δ_k) − Ψ(θ_k − c_k Δ_k)] / (2 c_k Δ_k(i)),  i = 1, …, n.

It is seen that in SPSA, all directions are perturbed at the same time (the numerator is identical in all n components). Thus, SPSA uses only two (or three) integrations of the model, independently of the dimension of θ, which makes it applicable to high-dimensional optimization problems. Generally, SPSA converges in the same number of iterations as FDSA, and it follows approximately the steepest descent direction, behaving like the gradient method [14]. On the other hand, SPSA, with its random search direction, does not follow the gradient path exactly. On average, though, it tracks the gradient nearly, because the gradient approximation is an almost unbiased estimator of the gradient, as shown in Ref. [15].
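The SPSA recursion described above can be sketched compactly in Python. The gain sequences follow the standard power-law form; the quadratic toy cost merely stands in for the innovation-based objective, and all numerical values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)

def spsa_minimize(f, theta, n_iter=2000, a=0.1, c=0.1, A=100, alpha=0.602, gamma=0.101):
    """Minimize f by simultaneous perturbation: two evaluations of f per
    iteration, whatever the dimension of theta."""
    for k in range(n_iter):
        ak = a / (k + 1 + A) ** alpha
        ck = c / (k + 1) ** gamma
        delta = rng.choice([-1.0, 1.0], size=theta.size)   # Bernoulli +/-1 perturbation
        # Identical numerator in all components; division by delta acts component-wise
        g_hat = (f(theta + ck * delta) - f(theta - ck * delta)) / (2 * ck * delta)
        theta = theta - ak * g_hat
    return theta

# Quadratic toy cost with a known minimizer
theta_star = np.array([1.0, -2.0, 0.5])
f = lambda th: np.sum((th - theta_star) ** 2)
theta_hat = spsa_minimize(f, np.zeros(3))
assert np.allclose(theta_hat, theta_star, atol=0.1)
```

Note that each iteration costs two evaluations of f regardless of the dimension of θ, which is precisely the property that makes SPSA attractive in the high-dimensional setting.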
For the SPSA algorithm, the conditions on {a_k} and {c_k} are

a_k, c_k > 0,  a_k → 0,  c_k → 0,  Σ_k a_k = ∞,  Σ_k (a_k/c_k)^2 < ∞.

On the operator P r
As shown in Ref. [12], span[P_r] (the subspace spanned by the columns of P_r) must be chosen so that the filter gain K ensures stability of the filter. Note that even the KF may suffer from instability. To ensure filter stability, P_r is constructed from all unstable and neutral eigenvectors of the fundamental matrix Φ (or from real Schur vectors (ScVs) or singular vectors). In practice, we choose P_r to consist of the column vectors of S (called S-PE samples), which result from integrating the leading ScVs (columns of X). The columns of S have the meaning of the PE for the system state and are used to approximate the ECM M. As for the ScVs, they are preferred to eigenvectors (or singular vectors) because the ScVs are real and their computation is numerically stable; note also that the computation of singular vectors requires adjoint code. The ensemble of columns of S plays the same role as an ensemble of PE samples in the ensemble-based filtering techniques for approximating the background ECM [16, 17].

On separation of vertical and horizontal variables structure in ECM [18]
Let us consider the situation when the DS is described by PDEs. The state vector at time instant k is x_k = x_k(i, j, l), where (i, j, l) represents a grid point in three-dimensional space. Introduce the stabilizing structure for the filter gain [12],

K = M H^T (H M H^T + R)^{−1} Λ,  (4.11)

where Λ is symmetric positive definite. As shown in Ref. [12], one can choose Λ to be diagonal, with the diagonal elements serving to regularize the amplitude of the correction. The matrix M in Eq. (4.11) is the ECM; if it is computed on the basis of the Riccati equation (3.4), K_k is the KF gain. For M = M_d, computation of M is feasible if the reduced dimension n_e is not too large. In practice, the ensemble size n_e is of order O(100), which is too small for M to be a good approximation of the true ECM. In Ref. [18], it is assumed that the estimated ECM is a member of the class of ECMs with separation of vertical and horizontal variables structure (SeVHS). Note that this hypothesis is not new; it has been used in modeling the ECM in meteorological data assimilation [19]. The optimal ECM is found as the solution of a minimization problem of the form

min over θ = (θ_1, θ_2) of ‖M_d − M_h(θ_2) ⊗ M_v(θ_1)‖.  (4.12)

As the number of vertical layers in today's numerical models is of order O(10), all elements of the vertical ECM M_v (included in θ_1) can be considered as tuning parameters to be estimated. As for M_h, it is often chosen in analytical form (e.g., a 1st or 2nd order autoregressive model). Parameters such as the correlation length can be selected as components of the control vector θ_2 in M_h.
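The drastic reduction in the number of free parameters under the SeVHS hypothesis can be illustrated with a small sketch: a separable ECM over a grid of n_h horizontal points and n_v layers is fully specified by an n_v × n_v vertical matrix plus a few horizontal correlation parameters, instead of (n_h n_v)^2 entries. The dimensions and the vertical matrix below are illustrative.

```python
import numpy as np

# Grid: n_h horizontal points, n_v vertical layers; the full ECM has (n_h*n_v)^2 entries
n_h, n_v = 100, 4
ii, jj = np.meshgrid(np.arange(n_h), np.arange(n_h))

# Horizontal part: 1st-order autoregressive (exponential) correlation, length L_d
L_d = 25.0
M_h = np.exp(-np.abs(ii - jj) / L_d)

# Vertical part: a small symmetric positive definite matrix (all O(10) entries tunable)
A = np.array([[2.0, 0.5, 0.2, 0.1],
              [0.5, 1.5, 0.4, 0.2],
              [0.2, 0.4, 1.0, 0.3],
              [0.1, 0.2, 0.3, 0.8]])
M_v = A @ A.T          # guarantees positive definiteness

M = np.kron(M_h, M_v)  # separable ECM: Kronecker product of M_h and M_v

# Parameter count: n_v*(n_v+1)/2 vertical entries + 1 correlation length,
# versus (n_h*n_v)^2 entries of an unconstrained ECM
assert M.shape == (n_h * n_v, n_h * n_v)
assert np.allclose(M, M.T)
```

Here 11 parameters stand in for a 400 × 400 covariance matrix; in the chapter's setting the same idea turns an O(10^10)-entry ECM into an O(10)-parameter estimation problem.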
Using dominant real Schur vectors has advantages that they are real and their computation is stable [12] while the computation of eigenvectors is unstable, and they may be complex.

Computational comparison between VM and AF
We give a brief comparison of the computational burden of the VM and AF algorithms. Table 1 lists the numbers of operations required to compute Eqs. (2.8) and (4.3), since they represent the dominant computational burdens of the two algorithms. These numbers are rounded according to the dimensions of the matrices involved. Here, N_ito is the number of iterations required to solve the minimization problem (2.1)-(2.3), and N_it is the number of iterations required to solve Eq. (4.13). In the VM algorithm, the computation of M_0^{−1} and R^{−1} is not taken into account. Table 1 also gives the number of operations required for the AROF, when the ECM M is given in the product decomposition form M = P_r P_r^T, P_r ∈ R^{n × n_e}. In this situation, instead of n^2 operations in the AF, we need to perform only 2 n n_e operations. For n_e << n, much lower computational and memory requirements suffice to run the AROF.
To give an idea of how the VM and AF work in practice, consider examples of experiments with two numerical models: MICOM (see Section 7) and HYCOM (Hybrid Coordinate Ocean Model) [20], developed at SHOM, Toulouse, France. The first experiment is performed with the MICOM model. The observations are available every 10 days (ds) during 2 years. For MICOM (state dimension n = 3 · 10^5), the 10 ds forecast requires 45 s (supercomputer Caparmor, IFREMER, France, sequential run). The 2-year integration takes 54.75 min (73 · 45 s). The AF thus needs 54.75 min · 3 ≈ 164 min (about 2 h 45 min) to perform the assimilation experiment over the 2-year period. In this context, the VM requires between 5.7 ds and 11.4 ds to perform the experiment (under the hypothesis of 50-100 integrations of MICOM over the 2-year window, Section 2.2). As for HYCOM (state dimension n = 7 · 10^7), the observations are available every 5 ds. The 5 ds forecast requires 1 h (supercomputer Beaufix, Météo-France, parallel run, 62 processors). A 2-year integration requires 146 · 1 h = 146 h (about 6 ds); the AF hence needs 18 ds to carry out the 2-year experiment, while the VM requires between 304 ds and 608 ds. That is one of the reasons why, in an operational setting, one has to choose a short window for assimilating observations with the VM. From Table 1, one sees that the dominant numbers n_d of operations in the VM and AF are n_d(VM) = n^2 N^2 N_ito and n_d(AF) = n^2 N N_it. If we assume that N_ito ≈ N_it (in fact, the number of iterations for solving the optimization problem Eqs. (2.7) and (2.8) is often larger than the number of iterations for solving the system of equations (4.13), and the situation is more critical for the VM when the DS is nonlinear), the number n_d(VM) is N times larger than n_d(AF).

Estimation problem
Consider the simple scalar dynamical process x(t) = sin(t). As ẋ(t) = cos(t), for t_k = kδt, using the approximation x(t_k + δt) − x(t_k) ≈ cos(t_k)δt, one has the discrete dynamical system

x_{k+1} = x_k + cos(t_k) δt + w_k.  (5.1)

In Eq. (5.1), w_k represents the Gaussian model error. Suppose that at each moment t_k, we observe the state x_k corrupted by the Gaussian noise v_k, hence

z_k = x_k + v_k,  E[v_k v_l] = σ_v^2 δ_kl,  (5.2)

where δ_kl is the Kronecker symbol. Suppose that the true initial state is x_0* = 0.5. The problem studied here is to estimate the system state x_k from the set of observations z_k, k = 1, …, N.
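This setup is easy to simulate; the short Python sketch below generates the true trajectory and the observations in the noisy-model situation. The parameter values follow the experiment described next; the random seed is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(3)

dt, N = 0.01, 1000
sigma_w2, sigma_v2 = 0.001, 0.1   # model and observation noise variances
x = 0.5                            # true initial state x*_0

xs, zs = [], []
for k in range(N):
    # Discrete dynamics x_{k+1} = x_k + cos(t_k)*dt + w_k with Gaussian model error
    x = x + np.cos(k * dt) * dt + np.sqrt(sigma_w2) * rng.standard_normal()
    xs.append(x)
    # Observation z_k = x_k + v_k corrupted by Gaussian noise
    zs.append(x + np.sqrt(sigma_v2) * rng.standard_normal())

xs, zs = np.array(xs), np.array(zs)
# Sanity check: the empirical observation-error variance is close to sigma_v2
assert abs(np.mean((zs - xs) ** 2) - sigma_v2) < 0.02
```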

Experiment: cost functions
In the experiment, δt = 0.01 and N = 1000. The two methods, VM and AF, are implemented to produce the state estimates. We study two situations: (S1) the model is considered perfect, that is, w_k = 0; and (S2) there exists a model error w_k with variance σ_w^2 = 0.001. As for v_k, σ_v^2 = 0.1 in both cases.
Let w_k = 0. To see the advantages of the AF over the VM in finding optimal solutions, Figures 1 and 2 display the curves (time-averaged variance of the distance between the true trajectory and those resulting from varying the control variables) as functions of the tuning parameters of the two methods: the initial state θ := x_0 and the parameter θ := λ (see Eq. (3.9)). We remark that for the filtering system (5.1), (5.2), as φ = 1, the system has one neutral eigenvalue and one eigenvector (the singular value and Schur decompositions have the same structure). The filter fundamental matrix L = (1 − Kh)φ = 1 − K is stable if K ∈ (0, 2). For the gain structure (4.11), L is stable for any M > 0, since K = M/(M + σ_v^2) for h = 1 and σ_v^2 = 0.1 > 0; we then have K ∈ (0, 1) ⊂ (0, 2). Thus, one can choose θ := M > 0 as the tuning parameter. This structure is of less interest compared to Eq. (3.9), since θ enters K in a nonlinear way, and K is in fact allowed to vary only in the interval (0, 1). For Eq. (3.9), P_r = 1, H_e = 1, hence K_e = 1/(1 + σ_v^2) < 1, and the filter is stable if θ satisfies Eq. (3.10); for this structure, the filter is stable for K ∈ (0, 2). We select this last structure as the departure point for optimizing the AF performance.
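The shape of the AF cost as a function of a gain parameter can be probed with a small scan. The sketch below uses a plain constant innovation gain K rather than the exact parameterization of Eq. (3.9), so it only illustrates the qualitative behavior (a well-defined interior minimum inside the stability range); all numbers are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)

dt, N = 0.01, 1000
sigma_w2, sigma_v2 = 0.001, 0.1

# Simulate the noisy-model situation: truth and observations
x_true, xs, zs = 0.5, [], []
for k in range(N):
    x_true = x_true + np.cos(k * dt) * dt + np.sqrt(sigma_w2) * rng.standard_normal()
    xs.append(x_true)
    zs.append(x_true + np.sqrt(sigma_v2) * rng.standard_normal())

def avg_error(K):
    """Time-averaged squared truth-filter distance for a constant innovation gain K."""
    x_hat, err = 0.0, 0.0
    for k in range(N):
        x_pred = x_hat + np.cos(k * dt) * dt     # model prediction step
        zeta = zs[k] - x_pred                    # innovation
        x_hat = x_pred + K * zeta                # filter update
        err += (x_hat - xs[k]) ** 2
    return err / N

gains = np.linspace(0.05, 1.95, 39)
costs = [avg_error(K) for K in gains]
best = float(gains[int(np.argmin(costs))])
assert 0.0 < best < 1.0    # interior optimum, well inside the stability range (0, 2)
```

The cost grows sharply as K approaches the stability boundary K = 2, while small gains under-weight the observations; the scan finds a small but nonzero optimal gain, consistent with the steady-state KF gain for these noise levels.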
From Figure 1, it is seen that the "noise-free" curve equals 0 at x̂_0 = 0.5 for S1, but the "noisy" curve attains its minimal value 0.121 at x̂_0 = 0.6. We note that almost the same picture is obtained for the cost function (2.1) (time-averaged variance of the distance between x_k, k = 1, …, 1000, and the observations z_k, k = 1, …, 1000) subject to x̂_0 ∈ [−1, 1]. Thus, although the curves in Figure 1 are quadratic, it is impossible to find the true initial state in the noisy situation. Figure 2 presents the same curves as those in Figure 1, obtained by applying the filter while letting the parameter θ in the gain vary in (0, 2). Figure 2 shows that for both the "noise-free" and "noisy" curves, the minimal values are attained at θ = 1.1 in both situations S1 and S2; moreover, the two minimal values are identical. The same picture is observed for the cost function Eq. (4.1) as a function of θ ∈ (0, 2). This means that independently of whether the model is perfect or not, the AF formulation allows optimization algorithms to find the optimal value of θ and, hence, to ensure optimality of the filter.
The two "noisy" curves in Figures 1 and 2 show that when the model is noisy, the minimal value of the "noisy" curve for the VM in Figure 1 (equal to 0.121) is much higher than that for the AF in Figure 2.

Lorenz equations
The Lorenz system is a chaotic dynamical system, noted for the butterfly shape of its attractor. It shows how the state of a dynamical system evolves over time in a complex, non-repeating pattern [21].
The attractor itself and the equations from which it is derived were introduced by Edward Lorenz [21], who derived it from the simplified equations of convection rolls arising in the equations of the atmosphere.
The equations that govern the Lorenz attractor are

dx/dt = σ(y − x),
dy/dt = x(ρ − z) − y,  (6.1)
dz/dt = xy − βz,

where σ is called the Prandtl number and ρ is called the Rayleigh number. All σ, β, ρ > 0, but usually σ = 10, β = 8/3, and ρ is varied. The system exhibits chaotic behavior for ρ = 28 but displays knotted periodic orbits for other values of ρ.

Numerical model
In the experiments, the parameters σ, ρ, β are chosen to have the values 10, 28, and 8/3 for which the "butterfly" attractor exists.
The numerical model is obtained by applying the Euler method (a first-order accurate method) to approximate Eq. (6.1). Symbolically, we have

y(t_{k+1}) = F′(y(t_k)),  y(t_k) := (y_1(t_k), y_2(t_k), y_3(t_k))^T,  (6.2)

where δt := t_{k+1} − t_k is the model time step. The observations arrive at the moments T_k, with ΔT_k := T_{k+1} − T_k. The experiment setup is similar to that described in Ref. [22].
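The Euler discretization of the Lorenz equations can be sketched in a few lines of Python; with δt = 0.005 and ΔT_k = 1, the model is stepped 200 times between consecutive observation instants. The initial state below is the true initial condition used later in this section.

```python
import numpy as np

sigma, rho, beta = 10.0, 28.0, 8.0 / 3.0
dt = 0.005

def lorenz_step(y):
    """One Euler step of the Lorenz system."""
    y1, y2, y3 = y
    dy = np.array([sigma * (y2 - y1),
                   y1 * (rho - y3) - y2,
                   y1 * y2 - beta * y3])
    return y + dt * dy

y = np.array([1.508870, -1.531271, 25.46091])   # true initial state (Section 6)
for _ in range(200):                            # integrate up to the first observation
    y = lorenz_step(y)

# The trajectory stays bounded on the attractor
assert np.all(np.isfinite(y)) and np.linalg.norm(y) < 100.0
```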

Observations: assimilation
In the experiment, δt = 0.005 and ΔT_k = 1; hence, the sequence of observations is given by z(k) := z(T_k), k = 1, …, N_o. The dynamical system describing the transition of the state between two time instants T_k and T_{k+1} is denoted as

x_{k+1} = F(x_k) + w_k.  (6.3)

In Eq. (6.3), w_k simulates the model error. The sequence w_k is assumed to be white noise with variances 2, 12.13, and 12.13 for the three components, respectively. The observation system is then given by

z_k = H x_k + v_k,  (6.4)

where the operator H = [h_1^T, h_2^T]^T, h_1 = (1, 0, 0), h_2 = (0, 0, 1); that is, the first and third components x_1, x_3 are observed at each time instant k = 1, …, 100. The noise sequence v_k is white with zero mean and variance R = 2 I_2, where I_n denotes the unit matrix of dimension n. The initial estimate in all filters is given by the initial condition x̂(0) = (1, −1, 24)^T.
The true system state x* is modeled as the solution of Eq. (6.3) subject to the initial condition x*(0) = (1.508870, −1.531271, 25.46091)^T. The problem considered in this experiment is to apply the extended KF (EKF), the nonadaptive filter (NAF), and the adaptive filter (AF) to estimate the true system state from the observations z_k, k = 1, 2, …, N_o, and to compare their performances.
Here, the NAF is in fact the prediction error filter (PEF). Note that the PEF is developed in Ref. [23], in which the prediction-error ECM is estimated on the basis of an ensemble of PE samples, that is,

M_d = (1/(L − 1)) Σ_{l=1}^{L} δx^{(l)} (δx^{(l)})^T,  (6.5)

where δx^{(l)}, l = 1, …, L, are the PE samples. The filter gain is taken in a time-invariant form based on M_d. At the same time, for comparison purposes, the EKF is also used to assimilate the observations. Figure 3 shows the evolution of the prediction errors resulting from the three filters: NAF, AF, and EKF. It is no surprise that the NAF produces estimates with larger estimation error. By adaptation, however, it is possible to obtain an AF which improves significantly on the PEF and even behaves better than the EKF.
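The PEF ingredients (sample ECM from an ensemble of PE samples, and a time-invariant gain of the form used in Eq. (4.11)) can be sketched as follows. The random ensemble here merely stands in for the sampling procedure of Ref. [23], which is not reproduced; dimensions match the Lorenz experiment.

```python
import numpy as np

rng = np.random.default_rng(5)

n, L = 3, 50                      # state dimension, ensemble size
H = np.array([[1.0, 0.0, 0.0],    # first and third state components observed
              [0.0, 0.0, 1.0]])
R = 2.0 * np.eye(2)               # observation-error covariance

# Hypothetical ensemble of prediction-error (PE) samples
dX = rng.standard_normal((n, L))

# Sample estimate of the prediction-error covariance matrix
M_d = dX @ dX.T / (L - 1)

# Time-invariant PEF-style gain built from the sample ECM
K = M_d @ H.T @ np.linalg.inv(H @ M_d @ H.T + R)

assert K.shape == (3, 2)
assert np.all(np.isfinite(K))
```

In the AF, selected entries of such a gain (or of the parameterized ECM behind it) become the tuning parameters θ adjusted by minimizing the innovation-based cost.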
Note that the VM is much less appropriate for assimilating observations in the Lorenz model, due to the choice of the initial state as the control vector. For simplicity, we simulate the situation when all three components of the system state are observed in additive noise, that is, H = I_3. Figure 4 displays the time-averaged variance of the difference between the true trajectory and the model trajectory (denoted AV(x*, x̂)), resulting from varying the third component of the initial state, for the two situations of noise-free and noisy models. Namely, we initialize the model with the true initial state x_0* = (1.508870, −1.531271, 25.46091)^T, with the difference that the third component x̂_3(0) varies in the interval [24.5, 26.5]. For the noise-free model, the global minimum is attained at x_0*(3) = 25.46091, as expected. However, if the system is initialized by an estimate in a vicinity, even not so far from x_3*(0), there is no guarantee that the VM can approach the true initial condition. For the noisy model, the global minimum is not attained at x_0*(3). As for the PEF, the function AV(x*, x̂) is quadratic with respect to the gain parameter, for both the noise-free and noisy models, as seen in Figure 5: here, the sample cost function (4.1) is computed over the whole assimilation period by varying the third parameter θ_3 in the gain (related to the third observed component of the system state).

Figure 4. Cost function in the VM as a function of the perturbed third component of the initial state. The global minimum is attained at the true initial condition (noise-free model), but there is no guarantee for the VM to approach the true initial state; for the noisy model, the global minimum is not attained at the true initial state. The curve "noisy model" is scaled by the factor C = 1/15.

Figure 5. Cost function in the PEF as a function of the perturbed third gain parameter θ_3. In the PEF, the cost function is quadratic with respect to the gain parameter in both the noise-free and noisy situations. The curve "noisy model" is scaled by the factor C = 1/50.
Assimilation in the high-dimensional ocean model MICOM

In this section, we show how the AF can be designed in a simple way to produce high-performance estimates of the ocean state in the high-dimensional ocean model MICOM. For details on the Miami Isopycnal Coordinate Ocean Model (MICOM) used here, see Ref. [24]. The model configuration is a domain situated in the North Atlantic from 30°N to 60°N and from 80°W to 44°W; for the exact model domain and the main features of the ocean current (mean and variability of the SSH, velocity) produced by the model, see Ref. [24]. The grid spacing is about 0.2° in longitude and in latitude, requiring N_h = II · JJ = 25200 (II = 140, JJ = 180) horizontal grid points. The number of layers in the model is KK = 4. The model is configured in a flat-bottom rectangular basin (1860 km · 2380 km · 5 km) driven by a periodic wind forcing. It relies on one prognostic equation for each component of the horizontal velocity field and one equation for mass conservation per layer. The state of the model is x := (h, u, v), where h = h(i, j, lr) is the thickness of the lr-th layer and u = u(i, j, lr), v = v(i, j, lr) are the two velocity components. The layer stratification is made in isopycnal coordinates; that is, each layer is characterized by a constant potential density of water. The model is integrated from the state of rest for 20 years. Averaging the sequence of states over years 17 and 18 gives a so-called climatology. During years 19 and 20, every 10 days (10 ds), we calculate the sea surface height (SSH) from the layer thickness h, which serves as a source for generating the observations used in the assimilation experiments (in total, there are 72 observations).

Different filters
The filter used for assimilating the SSH observations is of the form

x̂_{k+1} = F(x̂_k) + K ζ_{k+1},  (7.1)

where x̂_{k+1} is the filtered estimate of x_{k+1}, x_{k+1} = [h_{k+1}, u_{k+1}, v_{k+1}] is the system state at the (k + 1)th assimilation instant, F(·) represents the integration of the nonlinear MICOM model over 10 days, K is the filter gain, and ζ_{k+1} is the innovation vector. The gain K is of the form (4.11), where the ECM M is estimated from the MICOM model. In the experiment, to be close to realistic situations, only SSH values at the grid points i = 1, …, 140, j = 1, …, 180 are collected as observations; thus, the observations are not available at all model grid points.
The gain K is symbolically written as K = (K_h, K_u, K_v)^T, with K_u, K_v representing the operators which produce the correction for the velocity (u, v) from the layer thickness correction K_h ζ using the geostrophy hypothesis. The filter is thus of reduced order, with the gain K_h to be estimated from S-PE samples.

PEF: computation of ECM
In the experiment, two assimilation methods will be implemented. First the PEF is designed.
To do that, the data ECM M_d (see Eqs. (6.5) and (7.4) below) is formed by generating an ensemble of PE samples (as done in the experiment with the Lorenz system; see the sampling procedure in Ref. [23] for more detail). As the number of elements of the ECM is of order 10^10 (for the layer thickness component h alone), it is impossible to simulate a sufficient number of PE samples for M_d to be a good estimate of the ECM. The matrix M_d is therefore used only as data for estimating the parameters of a parameterized ECM, as follows (see Ref. [18]). Let M ∈ R^{n_h × n_h} be the ECM for the layer thickness h, that is, M = M(s, s′). One useful and efficient way to simplify the filter structure is to assume that the ECM M has a SeVHS.
Assume there exist two covariance matrices M_v and M_h such that

M = M_v ⊗ M_h,     (7.3)

where ⊗ denotes the Kronecker product between two matrices [25]. The main advantage of the separability hypothesis is that the number of parameters to be estimated in the covariance matrix is reduced drastically. As a consequence, even an ensemble of PE samples of small size can serve as a large data set for estimating the unknown parameters. This results in fast convergence of the estimation procedure. In addition, introducing the SeVHS hypothesis allows one to avoid the rank-deficiency problem in the estimation of the ECM. In fact, as only a few Schur vectors (ScVs) can be computed in very high-dimensional systems, approximation of the ECM M by Eq. (6.5) results in rank deficiency of M. With such an ECM, the resulting filter will probably produce worse results, not to mention the instability which may occur during the filtering process.
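The SeVHS hypothesis can be illustrated numerically with toy sizes (KK = 4 layers but only n_h = 6 horizontal points in place of the real n_h = 25200); the ordering of the factors in the Kronecker product is a convention of this sketch.

```python
import numpy as np

KK, n_h = 4, 6                    # toy: 4 layers, 6 horizontal grid points
rng = np.random.default_rng(1)

# Toy vertical and horizontal covariance factors (symmetric positive definite)
Av = rng.normal(size=(KK, KK)); M_v = Av @ Av.T + KK * np.eye(KK)
Ah = rng.normal(size=(n_h, n_h)); M_h = Ah @ Ah.T + n_h * np.eye(n_h)

# Separable ECM: M = M_v (x) M_h; block (i, j) equals M_v[i, j] * M_h
M = np.kron(M_v, M_h)

# Parameter counts: the full ECM has (KK*n_h)^2 entries, the separable
# form only KK^2 + n_h^2 -- the drastic reduction claimed in the text.
n_full, n_sep = (KK * n_h) ** 2, KK ** 2 + n_h ** 2
```

Because only KK² + n_h² parameters remain, even a small PE ensemble acts as a comparatively large data set for the estimation, and M inherits full rank from its two factors instead of being rank-limited by the ensemble size.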
Suppose we are given the ensemble of S-PE samples S_τ[L] = [δh^(1)_τ, …, δh^(L)_τ], obtained at the time instant τ by applying the sampling procedure in Ref. [23] subject to L perturbations. For τ = 1, …, T, the ECM M_d in Eq. (4.12) is estimated as the sample covariance of this ensemble; here d(y, y′) denotes the distance between two horizontal points y := (i, j) and y′. Considering M_d as a data matrix, the optimization problem for determining the vector θ takes the form

min_θ ‖M_d − M(θ)‖.     (7.4)

Note that the problem Eq. (7.4) is closely related to the Nearest Kronecker Product (NKP) problem [25]. In the experiment, the correlation length L_d is not estimated and is taken identical in the two filters PEF and CHF: L_d = 25.
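The construction of M_d and the fit of a parametrized ECM can be sketched as follows. The one-dimensional grid, the exponential correlation shape, and the single amplitude parameter θ are illustrative stand-ins for the actual parametrization; only the overall scheme (sample covariance as data, least-squares fit of θ with the correlation length L_d held fixed) follows the text.

```python
import numpy as np

rng = np.random.default_rng(2)
n_h, L = 10, 5                                 # toy grid size, PE ensemble size

# L prediction-error samples delta_h^(l) at one assimilation instant tau
samples = rng.normal(size=(L, n_h))

# Data ECM M_d: sample covariance of the PE ensemble (rank <= L << n_h)
M_d = samples.T @ samples / L

# Hypothetical one-parameter model M(theta)[y, y'] = theta * exp(-d(y, y')/L_d)
L_d = 2.5                                      # correlation length, kept fixed
d = np.abs(np.arange(n_h)[:, None] - np.arange(n_h)[None, :])  # toy 1-D distances
C = np.exp(-d / L_d)

# Least-squares fit of theta: minimize ||M_d - theta * C|| in the Frobenius norm
theta = np.sum(M_d * C) / np.sum(C * C)
```

The sketch shows why the data matrix alone cannot serve as the ECM: its rank is at most the ensemble size L, whereas the fitted model M(θ) has full rank.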

PEF: computation of gain
where G_{v,l} and Σ are defined in Eq. (7.9).

Cooper-Haines filter: CHF
To assess the performance of the AF, we also implement the so-called CHF. The CHF [7] is obtained from (7.9), (7.10) under three hypotheses [24]: (H1) the analysis error of the system output is canceled in the case of noise-free observations; (H2) conservation of the linear potential vorticity (PV); (H3) there is no correction for the velocity at the bottom layer. The AF in Ref. [24] is obtained by relaxing one or several of the hypotheses (H1)-(H3). From the standpoint of filtering theory, the difference between the PEF and the CHF lies in the way the elements of the ECM are estimated.
For the choice of the tuning parameters in the PEF, see Ref. [24].

Numerical results
First, we run the model initialized by the climatology. This run differs from that used for modeling the sequence of true ocean states only in that the initial state is replaced by the climatology. This run is denoted as "model." Figure 6 shows the (spatially) averaged variance between the SSH observations and that produced by the model. We see that the error grows as time progresses, indicating instability of the numerical model with respect to the perturbed initial condition. This fact signifies that the VM will have difficulties in producing high-performance estimates if the assimilation window is long.
Next, the two filters, CHF and PEF, are implemented under the same initial condition as that used in the "model" experiment. It is seen from Figure 6 that, initialized by the same initial condition, the CHF is much more efficient than the model in reducing the estimation error.

Figure 6. SSH prediction errors resulting from "model" and CHF: the growth of the prediction error in the "model" signifies instability of the numerical model with respect to the specification of the initial system state.

The performance comparison between the CHF and PEF is presented in Figure 7, where the (spatially averaged variance) SSH prediction errors resulting from the two filters, CHF and PEF, are displayed. The superiority of the PEF over the CHF is undoubted. It is clear from Figure 7 that the PEF is capable of producing better estimates, with a lower error level, along the whole assimilation period. On the other hand, while the estimation error of the CHF decreases continuously at the beginning of the assimilation and is more or less stabilized afterward (during the interval k ∈ (10 : 45)), it begins to increase considerably at the end of the assimilation period. This means that the PEF is more efficient than the CHF, and that the ECM in the PEF, constructed on the basis of the S-PE samples, has a stabilizing effect on its behavior.
To see the effect of adaptation, Figure 8 displays the filtered errors of the u-velocity component estimates at the surface, produced by the PEF and the APEF, respectively. The APEF is an AF obtained as an adaptive version of the PEF; here, the tuning parameters are optimized by the SPSA method. From the computational point of view, the SPSA requires much less integration time and memory storage compared with the traditional AE method: at each assimilation instant, we have to make only two integrations of the MICOM to approximate the gradient vector. From Figure 8, one sees that the adaptation allows a significant reduction of the estimation errors produced by the PEF.

Figure 7. SSH prediction errors resulting from the two filters, CHF and PEF. The PEF is capable of producing estimates with lower estimation errors and is stable along the whole assimilation period, whereas the CHF has difficulty maintaining the same performance at the end of the assimilation period. Overall, the PEF provides much better estimates of the system state than the CHF.
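The two-integration gradient approximation is the core of SPSA: one simultaneous random perturbation of all tuning parameters yields a gradient estimate from just two evaluations of the objective, regardless of the parameter dimension. A minimal sketch on a quadratic toy objective standing in for the innovation MSE (the gain sequence a_k and perturbation size c are illustrative choices, not those used in the experiment):

```python
import numpy as np

rng = np.random.default_rng(3)

def spsa_gradient(J, theta, c=0.1):
    """Two-evaluation SPSA gradient estimate of the objective J at theta."""
    delta = rng.choice([-1.0, 1.0], size=theta.shape)   # Bernoulli +/-1 perturbation
    g = (J(theta + c * delta) - J(theta - c * delta)) / (2 * c)
    return g / delta        # componentwise: the same two evaluations serve all coordinates

# Toy objective standing in for the innovation mean-squared error;
# its minimizer is theta = (1, 1, 1, 1)
J = lambda th: float(np.sum((th - 1.0) ** 2))

theta = np.zeros(4)
for k in range(200):
    a = 0.1 / (k + 1) ** 0.602                          # standard SPSA gain decay
    theta = theta - a * spsa_gradient(J, theta)
```

Each iteration costs exactly two objective evaluations, which is why, in the assimilation experiment, only two MICOM integrations per assimilation instant are needed no matter how many tuning parameters are adjusted.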

Conclusions
In this chapter, a comparison study on the performance of the AF against other existing methods has been presented in the context of its application to data assimilation problems in high-dimensional numerical models. As has been seen, in comparison with the standard VM, the AF is much simpler to implement and produces better estimates. The advantages of the AF over other methods such as the EKF and CHF have also been demonstrated. The principal reason for the high performance of the AF lies in the choice of the innovation representation for the initial input-output system and the selection of pertinent gain parameters as control variables to minimize the MSE of the innovation process. Whereas in the VM the choice of the structure for the initial state is the most important task (though insufficient to guarantee high performance), in the AF the initial state has little impact on performance. This happens because the AF is selected as optimal in the class of stable filters and, as a consequence, the error in the initial estimate is attenuated during the assimilation process. On the contrary, in the VM, the error in the specification of the initial guess is amplified during assimilation if the numerical model is unstable.
In conclusion, it is important to emphasize that the AF approach presented in this chapter is consolidated by following this road map: (i) generate the data ECM from S-PE samples, which grow in the directions of the dominant Schur vectors; (ii) select a parametrized structure for the ECM under the SeVHS hypothesis; (iii) choose the tuning parameters in the gain by minimizing the distance between the data ECM and that having the SeVHS structure; (iv) adjust the unknown parameters in the gain so as to minimize the PE error of the system output by applying the SPSA algorithm.
There is a wide variety of engineering problems to which the AF is applicable and which could be worthy of further study. Depending on the particular problem, other modifications would undoubtedly be helpful to improve the filter performance or to simplify its implementation. But the main features of the AF presented in this chapter remain the key points to follow in order to preserve its high performance.