Adaptive filters: stable but divergent

The pros and cons of a quadratic error measure have often been discussed in the context of various applications. In this tutorial, we argue that it is not merely a suboptimal but definitely the wrong choice for describing the stability behavior of adaptive filters. We take a walk through the past and recent history of adaptive filters and present 14 canonical forms of adaptive algorithms, and even more variants thereof, contrasting their mean-square stability conditions with their $l_2$-stability conditions. In safety-critical applications in particular, convergence in the mean-square sense turns out to provide wrong results, often not leading to stability at all. Only the robustness concept with its $l_2$-stability conditions ensures the absence of divergence.


Introduction: some historical background on adaptive-filter stability
The basic concept of a quadratic error measure, whose minimum can be found simply by differentiating and solving the resulting set of linear equations, was invented by C. F. Gauss in 1795 and has been the tool of choice for about 200 years. In [1], many arguments were put forward to question the usefulness of the mean squared error (MSE) in image and audio processing in view of our complex human perception, and these arguments were supported by many practical examples and observations. Such a quadratic error measure has also been employed in adaptive-filter theory as a practical means to derive convergence in the mean-square sense, starting with Ungerböck in 1972 [2], who applied the technique to Widrow and Hoff's famous least-mean-square (LMS) algorithm [3]. He also introduced the so-called independence assumption, which is not well argued for [4] but is a necessity once MSE techniques are applied. The concept was to evaluate not only the mean of the parameter-error vector $\tilde{\mathbf{w}}_k = \mathbf{w} - \hat{\mathbf{w}}_k$ (also known as weight-error vector) but also its mean square, typically in terms of the parameter-error vector covariance matrix $E[\tilde{\mathbf{w}}_k \tilde{\mathbf{w}}_k^H]$. Although originally derived in the context of machine learning, the LMS algorithm is a standard gradient-type algorithm mostly applied for system identification (see Algorithm 1). Many other applications, such as linear prediction or active noise control [5], can be brought into such a framework. Note that we refer to the estimates of the time-invariant reference $\mathbf{w}$ by $\hat{\mathbf{w}}_k$, that is, time-variant estimates. This paper sticks with the estimation of time-invariant systems; to describe the tracking behavior of learning systems, another notation would be required. Figure 1 depicts a typical setup for system identification. For the LMS algorithm, simply set F = I.
Correspondence: mrupp@nt.tuwien.ac.at, TU Wien, Institute of Telecommunications, Gusshausstr. 25, 1040 Wien, Austria
Note that we describe Algorithm 1 in terms of a finite impulse-response (FIR) filter. Other filter structures are also possible, e.g., infinite impulse-response (IIR) filters, described in the next section, or linear input combiners, in which the input is completely defined by instantaneous input vectors $\mathbf{u}_k$, with typical applications in adaptive antenna beamforming [6]. Table 1 lists the most commonly used variable names and their meaning.
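As a minimal sketch of the system-identification setup described above (Algorithm 1 itself is not reproduced here), the following Python fragment runs an LMS update on an FIR tapped-delay-line regressor with F = I. It assumes real-valued signals, white Gaussian input, and a normalized step-size $\mu_k = \alpha/\|\mathbf{u}_k\|_2^2$; the function name `lms_identify` and all parameter names are hypothetical and chosen for illustration only.

```python
import random


def lms_identify(w_true, num_steps=2000, alpha=0.5, noise_std=0.0, seed=0):
    """Identify the FIR system w_true with an LMS update (illustrative sketch).

    Assumptions (not taken from the source): real-valued signals, white
    Gaussian input, and the normalized step-size mu_k = alpha / ||u_k||^2.
    """
    rng = random.Random(seed)
    m = len(w_true)
    w_hat = [0.0] * m   # time-variant estimate, initialised at zero
    u = [0.0] * m       # FIR regression vector (tapped delay line)
    for _ in range(num_steps):
        # Shift a new input sample into the delay line.
        u = [rng.gauss(0.0, 1.0)] + u[:-1]
        noise = rng.gauss(0.0, noise_std) if noise_std > 0.0 else 0.0
        desired = sum(wi * ui for wi, ui in zip(w_true, u)) + noise
        # Disturbed a-priori error: desired output minus filter output.
        error = desired - sum(wi * ui for wi, ui in zip(w_hat, u))
        energy = sum(ui * ui for ui in u)
        mu = alpha / energy if energy > 0.0 else 0.0
        # Gradient-type update in the direction of the regression vector.
        w_hat = [wi + mu * ui * error for wi, ui in zip(w_hat, u)]
    return w_hat
```

In the noiseless case with a moderate `alpha`, the returned estimate should approach `w_true`; with `noise_std > 0`, it fluctuates around `w_true` in a nonzero steady state.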
This paper provides a historical overview of adaptive-filter theory spanning the past 50 years. In Section 2, we review the problems of filters including filtered errors as they emerged in the 1970s. In Section 3, we formally introduce the two different concepts of stability, MSE-stability and $l_2$-stability, and compare their properties. Section 4 then continues our historical walk into the 1990s, including newer and older algorithms that exhibit stability problems which were not observed at the time of their proposal. We even address adaptive algorithms for blind channel estimation and show their robustness. We further exploit the robustness concept and $l_2$-stability in Section 5 by a recently proposed singular-value-decomposition (SVD)-based method that is better suited to detect the instability of adaptive systems. We also provide an example of the so-called proportionate normalized LMS (PNLMS) algorithm, which shows that an adaptive-filter algorithm can be MSE-stable but still exhibit divergence. Based on this new framework, we investigate in Section 6 the stability of adaptive algorithms whose error signals are linearly coupled. As shown in Section 7, this sets the framework for all cascaded adaptive algorithms and finally allows us to describe the stability behavior of such an algorithmic family. Eventually, some open issues are addressed in Section 8. Altogether, we discuss 14 different adaptive-filter algorithms and many of their variants in terms of stability and robustness.

First stability problems found
With the MSE method being the favorite tool of most researchers in the field of adaptive filters, Feintuch introduced in 1976 an adaptive algorithm [7] (see Algorithm 2) that exhibited the first obstacles. He proposed the estimation of IIR filter coefficients (A, B), rather than the conventional FIR (B) coefficients, located in the filter weights $\hat{\mathbf{w}}_k$. Usually, when a stability issue occurs in adaptive filters, practitioners recommend lowering the step-size $\mu_k$, thus buying increased stability at the expense of slower adaptation. Immediate responses [8,9] to this publication showed that the MSE argumentation must be wrong, even though the originating author, with help from others, valiantly defended his MSE-based argument. Soon after, the notion of strict positive real (SPR) [10] was introduced (see Table 2). A linear time-invariant filter F is called SPR if its transfer function satisfies $\mathrm{Re}\{F(e^{-j\Omega})\} > 0$ for all frequencies $-\pi \le \Omega \le \pi$. Derived in the context of so-called hyperstability, it now became obvious that adaptive IIR filters, no matter whether updated with a gradient-type algorithm or a more complex Gauss-Newton-type algorithm, all share a common fate: the output error signal used for updating the filter weights first passes through a linear filter (see Fig. 1). If this filter exhibits the SPR property, a step-size small enough can be found to ensure convergence of the algorithm; if the property is not satisfied, sequences can be found for which the filter not only fails to converge but indeed diverges for all nonzero step-sizes. It thus became clear why this algorithm worked in some situations and not in others. Once the recursive part of the coefficients in 1 − A does not satisfy the desired SPR condition, the algorithm is doomed to be unstable. The history of this development is nicely summarized in [11], which the interested reader is encouraged to consult.
In fact, Feintuch's adaptive IIR filter algorithm is a special case of a so-called filtered-error-type algorithm (see Algorithm 4). A very simple instance of such a filtered-error-type algorithm is the so-called LMS algorithm with delayed updates (DLMS) [12-14]. It occurs if the error filter is a simple delay, which can easily happen when a pipelined chip structure is designed for the LMS algorithm and requires a delayed version of the error signal. If the error signal appears delayed by, say, K > 0 steps, the filter $F(e^{-j\Omega}) = e^{-j\Omega K}$ cannot be SPR, and thus a pipelined LMS algorithm can become unstable. However, the cure for this algorithm is simple: the regression vector is also delayed by K steps and an older estimate is applied, as shown in Algorithm 3.
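The SPR condition $\mathrm{Re}\{F(e^{-j\Omega})\} > 0$ can be probed numerically. The sketch below evaluates the real part of an FIR error filter on a uniform frequency grid; it is a necessary check only (a grid cannot prove positivity at every frequency), and the helper names `min_real_part` and `looks_spr` are hypothetical. It assumes F is given by real FIR coefficients $f_0, f_1, \ldots$ so that $F(e^{-j\Omega}) = \sum_i f_i e^{-j\Omega i}$.

```python
import cmath
import math


def min_real_part(coeffs, num_points=2048):
    """Smallest Re{F(e^{-j*Omega})} on a uniform grid over [-pi, pi].

    coeffs[i] is the (assumed real) FIR coefficient multiplying a delay of
    i samples.
    """
    smallest = float("inf")
    for n in range(num_points + 1):
        omega = -math.pi + 2.0 * math.pi * n / num_points
        value = sum(c * cmath.exp(-1j * omega * i) for i, c in enumerate(coeffs))
        smallest = min(smallest, value.real)
    return smallest


def looks_spr(coeffs, num_points=2048):
    """True if the real part stays positive at every grid frequency."""
    return min_real_part(coeffs, num_points) > 0.0


# Example filters: a mild FIR filter and a pure one-sample delay (K = 1).
mild_filter = [1.0, 0.5]   # Re = 1 + 0.5*cos(Omega), never below 0.5
pure_delay = [0.0, 1.0]    # Re = cos(Omega), negative for |Omega| > pi/2
```

For the pure delay, the real part is $\cos(\Omega K)$, which becomes negative on part of the frequency axis for any K > 0; this is the reason the delayed error path in the DLMS setting cannot be SPR.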
If the filter F is known and non-SPR, a rather simple cure is to apply an additional backward filter $F_R$ to the filtered error. This results in an SPR part $F^* F = |F|^2$ and a pure delay $e^{-j\Omega M}$ that can be treated by delaying the regression vector $\mathbf{u}_k$, similar to the DLMS cure in Algorithm 3. Note, however, that such treatment usually results in a rather slow update rate, as the error signal is now severely delayed.
Such behavior was also detected in the context of active noise control, where the linear filter F is not defined by the unknown recursive part of an IIR filter but by an acoustic-electrical transfer function, which is in turn determined by the mechanical construction of the concatenation of loudspeaker, free space, and microphone [15]. Different from the adaptive IIR filter, however, in active noise control the filter F can be observed, its impulse response can be identified first, and it can then be compensated for. An alternative idea that avoids the backward filter is to apply the error filter F to the regression vector. Many algorithms derived from this so-called Filtered-X-LMS algorithm (see Algorithm 5) were proposed during the 1980s to overcome the SPR condition once F is known. The essential idea is to compensate the impact of the filtered error by an identical filter on the regression vector (in this case $G_k[\cdot] = 1$).
In [16], robustness conditions for the Filtered-X-LMS algorithm were analyzed, and it was found that, although placing F on the regression vector is beneficial, the algorithm is in general only locally robust. The Filtered-X-LMS algorithm was then reformulated as a filtered-error-type algorithm; now, however, a new, time-variant linear operator $1/[1 - \mu_k C_k]$ acts on the filtered error. The coefficients of this time-variant operator $C_k$ depend on linearly filtered versions of the input signal $u_k$ as well as on the algorithm's step-size. As the coefficients of $\mu_k C_k$ are proportional to the step-size $\mu_k$, sufficiently small step-sizes can ensure that $1 - \mu_k C_k$ is SPR, however, at the price of a slowed-down adaptation. Only if a particular time-variant linear operator is additionally applied to the filtered error term can this be compensated for and the algorithm be sped up considerably.
The Filtered-X LMS algorithm has experienced a renaissance in recent years, as it appears to be the right choice for vibration control in car engines [17-20]. A novel aspect here is that car engines can be controlled without sensors, as the engine speed is known. The input signal $u_k$ can thus be generated artificially from weighted sine and cosine terms of the engine's rotation frequency and multiples thereof. A compact notation of this algorithm results in a complex-valued Filtered-X LMS algorithm. However, for physical reasons, the error signal must be real-valued, and therefore a complex-valued LMS algorithm is run with only a real-valued error fraction. In [21], it was shown that this variant indeed behaves in a robust way, while an alternative variant employing a complex-valued error and a real-valued regressor does not. Both variants show identical MSE behavior, though.

Stability of adaptive filters
After so much disturbing news on potential instability, it is time now to take a closer look into the stability of adaptive filters as we need to understand the various notions of stability.
MSE-stability: is based on minimizing $E[|\tilde{e}_{a,k}|^2]$ with respect to the parameter estimates $\hat{\mathbf{w}}_k$ [2, 22-26]. Depending on the application, a minimal remaining error energy can be desired (signal adaptation), but also the correct knowledge of the parameters $\mathbf{w}$ may be desired (system adaptation). In the classical MSE analysis, the parameter-error vector covariance matrix $P_k = E[\tilde{\mathbf{w}}_k \tilde{\mathbf{w}}_k^H]$ is studied, requiring the so-called independence assumptions on the participating processes $\mathbf{u}_k$ and $v_k$, and step-size conditions are derived to guarantee that $\mathrm{tr}(P_k)$ decreases. Due to this procedure, MSE-stability always includes some form of convergence. If an additive stationary noise process $v_k$ is assumed, the algorithm converges to a nonzero steady state.
$l_2$-stability: is based on robustness terms originating from control theory [27,28] in the form of $l_2$-norms of instantaneous regression vectors rather than their expectation values. In the context of adaptive filters, it was introduced in 1993 by Kailath, Sayed, and Hassibi [29]. Further work over the next 10 years [25, 30-33] showed that more and more adaptive filters exhibit this property. Loosely speaking, $l_2$-stability simply says that if the input sequence has a bounded Euclidean norm, so does the output sequence. Note that, different from the common treatment, the inputs of the scheme are now the additive noise $v_k$ as well as the initial parameter-error vector $\tilde{\mathbf{w}}_0 = \mathbf{w} - \hat{\mathbf{w}}_0$; the outputs are the undistorted a-priori error sequence $e_{a,k} = \mathbf{u}_k^T \tilde{\mathbf{w}}_{k-1}$ and possibly the a-posteriori parameter-error vector $\tilde{\mathbf{w}}_k = \mathbf{w} - \hat{\mathbf{w}}_k$. The driving sequence $\mathbf{u}_k$ only influences the algorithmic mapping from input to output.
Different from MSE-stability, the robustness concept leading to $l_2$-stability does not require any simplifying assumptions on the signals. It formulates the adaptive learning process in terms of a feedback filter structure with a lossless allpass (unitary transformation) in the feedforward path and a feedback path that usually contains all important signal and system properties as well as free parameters such as the step-size. A typical structure of this kind for the standard LMS algorithm with time-variant step-size $\mu_k$ is shown in Fig. 2.
As the stability result depends on the small-gain theorem [27,28], the resulting step-size bound is conservative. While for the classic LMS algorithm, the observation coincides very sharply with the predicted bounds, for many other algorithms, the bound obtained is indeed conservative.
In the context of robustness, the question of worst-case sequences was posed for the first time: which sequences $\{\tilde{\mathbf{w}}_0, v_0, v_1, \ldots, v_N\}$ can be envisaged for the worst behavior of the LMS algorithm (see Eq. (1))? For gradient-type algorithms, it was concluded that, if the noise sequence compensates the undistorted error, i.e., $v_k = -e_{a,k}$, the algorithms do not update and their maximum robustness level $\gamma$ is attained with equality; such sequences were thus considered to cause worst-case situations. Surprisingly, no worst-case condition was imposed on the driving sequence $\mathbf{u}_k$ as long as the step-size condition on $\mu_k$ was satisfied. The reason for this may lie in the fact that the method itself aims at a convergence of the undisturbed error sequence $e_{a,k} = \mathbf{u}_k^T \tilde{\mathbf{w}}_{k-1} \to 0$ rather than of the parameter-error vector $\tilde{\mathbf{w}}_k$. Not surprisingly, signal conditions only came up when requiring not only that the error energy $|e_{a,k}|^2$ tends to zero but also that the parameter-error vector $\tilde{\mathbf{w}}_{k-1} = \mathbf{w} - \hat{\mathbf{w}}_{k-1}$, i.e., the difference between the true system impulse response and its estimate, converges strongly (in norm) to zero. If the latter is also required, the driving signal vectors $\mathbf{u}_k$ need to be persistently exciting, i.e., consecutive vectors need to span the space of dimension M, where M denotes the filter order.
Fig. 2 Gradient-type method described as feedback structure with allpass (lossless) in the forward path and lossy feedback path
A consequence of the $l_2$-stability property is that an energy-bounded input sequence (noise $v_k$ and initial parameter-error vector $\tilde{\mathbf{w}}_0$) causes a bounded output of undistorted errors $e_{a,k}$. If the input sequence is a Cauchy sequence, so is the output. If, on the other hand, such a bound $\gamma$ cannot be guaranteed, it is likely that an input sequence exists that causes divergence. Convergence in this context means that a range of step-size parameters (or alternative design parameters) exists for which no divergence occurs, even under worst-case sequences.
The concept of $l_2$-stability is thus very different from MSE-stability: the existence of a single worst-case sequence (e.g., one among infinitely many sequences) does not preclude MSE-stability, since infinitely many working sequences outweigh the single worst-case sequence, but it does rule out $l_2$-stability. The idea of $l_2$-stability is thus more restrictive and is to be preferred in cases where safety is of utmost importance (smart cities, smart grids, transportation flow, automatically controlled cars, flight control, and so on), while MSE-stability might only be sufficient for typical applications in telecommunications, where corrupted data transmissions can be corrected by other means. We can conclude that for bounded random sequences, $l_2$-stability implies MSE-stability but not vice versa.
The robustness framework was even able to handle algorithms as different as the Gauss-Newton-type Algorithm 6, of which the recursive least squares (RLS) algorithm is the most famous special form, and single-layer neural network adaptations. The global robustness and $l_2$-stability of the RLS algorithm were shown in [34]; corresponding results for the entire Gauss-Newton algorithmic family with time-variant forgetting factor $0 < \lambda_k < 1$ as well as memory factor $0 < \beta_k \le 1$ are reported in [35]; and special results for least squares (LS) estimators including Kalman filters appeared in [36]. The real-valued Perceptron learning algorithm (PLA), see Algorithm 7, was shown to be $l_2$-stable in [37]. Even more complicated single-layer structures that include feedback with memory, such as the so-called Narendra and Parthasarathy structure, could be analyzed, and conditions for $l_2$-stability were provided.

Recently discovered evidence
Up to here, the occurrence of a linear filter in the error path may have been regarded as a curiosity among the many variations of adaptive-filter algorithms and applications, an exception that requires a different treatment while the majority of adaptive-filter algorithms work accurately according to an MSE-based theory. The developed robustness description allows stability conditions to be defined very accurately for all those cases. Back to our historical walk. In the 1990s, adaptive filters for neural networks and particular fast versions of LS techniques, so-called Fast-RLS algorithms, were in the focus. Their theory is also based on minimum MSE (MMSE) but, due to their deterministic nature, independence assumptions were not required. To include them in practical applications, their LS nature was often sacrificed, and time-variant step-sizes were introduced. With such step-sizes, however, their nature was closer to that of stochastic, gradient-type algorithms. One of these RLS derivatives is the affine projection (AP) algorithm [38], which speeds up convergence compared to its simpler gradient counterpart by taking P past regression directions into account. A fast version of it [39,40] is the basis for millions of copies of such algorithms running today in electrical echo cancellation devices to reduce the echoes of long-distance telephone cables. Unlike the original algorithm, they use a sophisticated step-size control to prevent unstable behavior in double-talk situations [41], that is, when both talkers are active. The resulting algorithm is called the pseudo affine projection (PAP) algorithm (see Algorithm 8), as with a moderate step-size the original property is lost.
Recently, it has been shown [42] that, depending on the correlation of the input signal, such a PAP algorithm can become unstable and that, depending on the input signal statistics, situations exist in which even small step-sizes do not result in stable behavior but larger ones are required; thus, depending on the steady state of the predictor coefficients $a_k$ (correspondingly denoted here as linear operator $A(q^{-1})$), lower bounds $\alpha_{\min}$ on the normalized step-size may exist as well as upper bounds $\alpha_{\max}$. However, this is not the only algorithm for which stability problems remained undiscovered for a long time. A well-known adaptive algorithm for zero forcing (ZF) equalization is the gradient algorithm by Lucky [43], see Algorithm 9. In the well-known textbook by Proakis [44] we can read: "The peak distortion has been shown by Lucky (1965) to be a convex function of the coefficients. That is, it possesses a global minimum and no relative minima. Its minimization can be carried out numerically, using, for example, the method of steepest descent".
The argumentation sounded very convincing until the algorithmic behavior was analyzed thoroughly in [45], and it was found that there indeed exist channel conditions and data sequences that cause the algorithm to diverge, even for the smallest step-sizes. Based on the channel impulse response $\{h_i\}$, step-size conditions can be derived only for MSE-stability. See also [46] for alternative non-robust ZF equalizer algorithms. Such examples may corroborate the suspicion that they are all related to some form of linear filter in the error path and thus depend on an SPR condition. Note, however, that neither for the ZF algorithm nor for the PAP algorithm does any SPR condition appear in the error path; thus, they do not fall under the existing knowledge of the early 1990s, their $l_2$-stability behavior being much different from their MSE behavior. In the meantime, however, they have been correctly analyzed by the now existing robustness techniques [42,45].
Algorithm 9. ZF equalizer algorithm
Moreover, other problems can cause stability trouble as well, namely those related to the persistent excitation of the driving signal. Once we consider algorithms with matrix inverses, such as RLS algorithms, it is well understood that, with a lack of persistent excitation, a null space opens up in the solution that offers the algorithm ample room to diverge. Also in applications such as stereo hands-free telephones [47], null spaces can occur as part of the solution and cause adaptive filters to diverge. In such cases, regularization and leakage factors are often applied to force the null spaces out of the obtained estimates.
But the existence of null spaces is not necessarily a reason for a lack of robustness. It may thus come as a surprise that there exists an adaptive algorithm for blind channel estimation [48,49] that is indeed robust [46,50], although it is known that most blind methods are non-robust [51]; see Algorithm 10 for details. It is the classical two-channel setup depicted in Fig. 3, in which the task of the algorithm is to estimate both channels, g and h, stacked into one concatenated vector. This result, however, does not mean that all blind algorithms are robust; some more comments on this topic are provided in Section 8.

A converse approach: worst-case scenarios that lead to divergence
While robustness conditions could now be shown for many known adaptive filters, for some of them the problem remained unsolved, as they cannot be brought into the feedback structure depicted in Fig. 2. Typically, these problematic adaptive filters employ the general update form
$$\hat{\mathbf{w}}_k = \hat{\mathbf{w}}_{k-1} + \mu_k \mathbf{x}_k \tilde{e}_{a,k},$$
that is, directions $\mathbf{x}_k$ are applied that differ from (are not parallel to) the driving process vector $\mathbf{u}_k$ that constitutes the error $\tilde{e}_{a,k} = e_{a,k} + v_k = \mathbf{u}_k^T \tilde{\mathbf{w}}_{k-1} + v_k$. We refer to these algorithms in the following as asymmetric, in contrast to symmetric algorithms such as LMS or RLS. Equivalently speaking, it remained unclear for adaptive filters of general asymmetric structure whether worst-case sequences exist that cause divergence no matter what the step-size ($\mu_k > 0$) is.
A more general view, which also includes the driving processes $\mathbf{u}_k$ in the worst-case scenarios and aims directly at the convergence or divergence of the parameter-error vector $\tilde{\mathbf{w}}_k$, is proposed in [52]. There, an argument similar to robustness is employed, but instead of using the small-gain theorem, the sub-multiplicative property of norms is applied in the context of the SVD.
The idea is the following: assume for simplicity the noiseless case and consider the parameter-error vector $\tilde{\mathbf{w}}_{k-1}$. An arbitrary gradient method may use its update error $e_{a,k} = \mathbf{u}_k^T \tilde{\mathbf{w}}_{k-1}$ and update the parameter-error vector through
$$\tilde{\mathbf{w}}_k = \tilde{\mathbf{w}}_{k-1} - \mu_k \mathbf{x}_k e_{a,k} = \left(I - \mu_k \mathbf{x}_k \mathbf{u}_k^T\right) \tilde{\mathbf{w}}_{k-1} = B_k \tilde{\mathbf{w}}_{k-1},$$
that is, not necessarily in direction $\mathbf{u}_k$ but in direction $\mathbf{x}_k$. Applying the update several times results in a product of matrices $B_k$ whose largest singular value should remain bounded to preserve stability. This is equivalent to requiring a norm of the product to remain bounded, and as $\|\prod_k B_k\| \le \prod_k \|B_k\|$ for many norms (sub-multiplicative property), we can conclude that $l_2$-stability can be guaranteed as long as the largest singular value satisfies $\sigma_{\max}(B_k) \le 1$. This condition is, similar to the small-gain theorem that was applied before, a conservative condition. However, due to the linear operators involved, it is now simpler to analyze converse conditions, i.e., bounds for instability rather than stability. Note that for the above example, the largest singular value turns out to be larger than one if $\mathbf{x}_k \neq \alpha \mathbf{u}_k$, that is, if these vectors are not parallel.
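The claim about the largest singular value can be checked numerically. The following sketch assumes the noiseless, real-valued update matrix $B_k = I - \mu_k \mathbf{x}_k \mathbf{u}_k^T$ implied by the text and estimates $\sigma_{\max}(B_k)$ by power iteration on $B_k^T B_k$; the function name `largest_singular_value` and the example vectors are hypothetical and serve only as an illustration.

```python
import math


def largest_singular_value(mu, x, u, iterations=500):
    """Estimate sigma_max of B = I - mu * x u^T by power iteration on B^T B.

    Assumes real-valued vectors x and u of equal length (an illustrative
    simplification, not a statement about any particular algorithm).
    """
    n = len(u)
    b = [[(1.0 if i == j else 0.0) - mu * x[i] * u[j] for j in range(n)]
         for i in range(n)]

    def apply_btb(v):
        # First B v, then B^T (B v).
        bv = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
        return [sum(b[i][j] * bv[i] for i in range(n)) for j in range(n)]

    v = [1.0 + 0.1 * i for i in range(n)]   # generic start vector
    for _ in range(iterations):
        w = apply_btb(v)
        norm = math.sqrt(sum(c * c for c in w))
        if norm == 0.0:
            return 0.0
        v = [c / norm for c in w]
    # Rayleigh quotient v^T B^T B v with unit-norm v gives sigma_max^2.
    w = apply_btb(v)
    return math.sqrt(sum(a * c for a, c in zip(v, w)))


# Symmetric (LMS-like) case: x parallel to u, with mu * ||u||^2 = 1.
sigma_symmetric = largest_singular_value(0.5, [1.0, 1.0], [1.0, 1.0])
# Asymmetric case: x is not parallel to u.
sigma_asymmetric = largest_singular_value(0.5, [1.0, 0.0], [1.0, 1.0])
```

In the symmetric example the estimate stays at one, while in the asymmetric example it exceeds one even though the step-size is small, in line with the statement that non-parallel update directions push the largest singular value above one.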
If observation noise is added again, conditions are obtained that are obviously weaker than the original ones in (1), as the terms of the undistorted errors $\sum_{k=1}^{N} \mu_k |e_{a,k}|^2$ are missing. Similar to the robustness method of the previous section, stability conditions for the step-size $\mu_k$ and boundedness conditions on the additive noise can now be derived. However, the bounds so obtained appear to be tighter than (or equivalent to) the previous ones based on robustness. If the largest singular value of the mapping $B_k$ is larger than one, $\gamma$ will not remain bounded as N grows to infinity, and thus robustness is potentially lost.
A good first example is the LMS algorithm, i.e., Algorithm 1. The classic robustness scheme showed $l_2$-stability as long as $\mu_k < 2/\|\mathbf{u}_k\|_2^2$. But how much larger can $\mu_k$ become until the algorithm really diverges? Due to the conservatism of the small-gain theorem, we cannot answer this. The SVD method, on the other hand, allows the worst-case sequence $\mathbf{u}_k$ to be derived so that divergence can be guaranteed [52]; indeed, for $\mu_k > 2/\|\mathbf{u}_k\|_2^2$, divergence can be ensured, that is, sequences that cause divergence can always be found. The stability bound of the LMS algorithm is thus tight, as both methods deliver the same bound. While the SVD-based method provides an identical bound in this case, for many other algorithms larger bounds could be identified.
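The tightness of the LMS bound can be illustrated with a small experiment. The sketch below assumes the noiseless, real-valued LMS recursion $\tilde{\mathbf{w}}_k = (I - \mu_k \mathbf{u}_k \mathbf{u}_k^T)\tilde{\mathbf{w}}_{k-1}$ and one simple adversarial choice of regressor, namely $\mathbf{u}_k$ parallel to the current parameter-error vector; this choice is an illustrative assumption and is not claimed to be the construction used in [52]. With $\mu_k = \alpha/\|\mathbf{u}_k\|_2^2$, the error norm is then scaled by $|1 - \alpha|$ in every step. The function name `worst_case_error_norms` is hypothetical.

```python
import math


def worst_case_error_norms(alpha, w_tilde0, num_steps=20):
    """Norms of the LMS parameter-error vector under an adversarial regressor.

    Assumptions for illustration: no noise, real-valued signals, step-size
    mu_k = alpha / ||u_k||^2, and u_k chosen parallel to the current error.
    """
    w_tilde = list(w_tilde0)
    norms = [math.sqrt(sum(c * c for c in w_tilde))]
    for _ in range(num_steps):
        u = list(w_tilde)                    # adversarial regressor
        energy = sum(c * c for c in u)
        if energy == 0.0:
            break                            # error already zero, nothing to excite
        mu = alpha / energy
        e_a = sum(ui * wi for ui, wi in zip(u, w_tilde))   # undistorted a-priori error
        w_tilde = [wi - mu * ui * e_a for wi, ui in zip(w_tilde, u)]
        norms.append(math.sqrt(sum(c * c for c in w_tilde)))
    return norms


growing = worst_case_error_norms(2.1, [1.0, -2.0, 0.5])     # alpha slightly above 2
shrinking = worst_case_error_norms(1.9, [1.0, -2.0, 0.5])   # alpha slightly below 2
```

For `alpha = 2.1` the error norm grows by a factor of 1.1 per step, while for `alpha = 1.9` it shrinks by a factor of 0.9, which matches the bound $2/\|\mathbf{u}_k\|_2^2$ separating the two regimes.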
Note, however, that having a singular value larger than one, and thus losing robustness, does not necessarily mean that the system must behave in an unstable manner. To cause instability, the driving signal must ensure that the condition is violated with every update step (or the majority of update steps) and not just once. As additional constraints are sometimes imposed on the adaptive filter, this potential worst-case sequence may not exist, and the algorithm may behave in a robust manner although one singular value is larger than one. Limitations on worst-case sequences typically arise from additional constraints due to the filter application and structure. A linear combiner with input $\mathbf{u}_k$ from an arbitrary alphabet is very likely to admit a worst-case sequence leading to divergence, while an adaptive filter of FIR structure allows only one degree of freedom per iteration, as all other elements of the update vectors are already given. If the driving sequence is further restricted by originating from a limited alphabet, say binary phase shift keying (BPSK), it can very well happen that a singular value larger than one exists but the excitation can never act in the direction of its corresponding vector. A systematic example is provided further ahead in the context of the PNLMS algorithm. This brings us to the question of systematically finding worst-case sequences. For the above-mentioned example, this is equivalent to requiring that the largest singular value of $B_k$, maximized over $\mathbf{x}_k$ and $\mathbf{u}_k$, exceeds one for some positive range of step-sizes $\mu_k > 0$. Indeed, for arbitrary vectors $\mathbf{x}_k, \mathbf{u}_k$, it is relatively straightforward to find such sequences and prove divergence. However, as soon as $\mathbf{x}_k = f(\mathbf{u}_k)$ and the application imposes more restrictions, finding the sequences can become challenging. It is very illustrative to view in this context the so-called proportionate normalized LMS (PNLMS) algorithm as a second example.
Originally derived by Duttweiler [53] in 2000, the algorithm can be viewed as a time-variant counterpart of the algorithm by Makino [54]; both variants are shown in Algorithm 11. Over the following 10 years, the algorithm became very popular, as a clever control of the diagonal step-size matrix can speed up the algorithm significantly [55]. Note that time-invariant matrix step-sizes that are positive definite or exhibit SPR properties are shown to be robust in [32,42], ensuring $l_2$-stability of Makino's algorithm. This can easily be shown, as the product of consecutive matrices $B_k$ is equivalent to Eq. (5).
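A sketch of how such a symmetrization can work is given below. It assumes, purely for illustration, an update matrix of the form $B_k = I - \mu_k L \mathbf{u}_k \mathbf{u}_k^T$ with a constant, symmetric, positive definite step-size matrix $L$; the auxiliary symbols $\bar{\mathbf{u}}_k$ and $S_k$ are introduced here and do not appear in the source, and the derivation is not claimed to reproduce Eq. (5) verbatim.

```latex
% Assumption: B_k = I - \mu_k L u_k u_k^T with constant L = L^T > 0.
\begin{align*}
  L^{-1/2} B_k L^{1/2}
    &= I - \mu_k L^{1/2} \mathbf{u}_k \mathbf{u}_k^T L^{1/2}
     = I - \mu_k \bar{\mathbf{u}}_k \bar{\mathbf{u}}_k^T =: S_k,
     \qquad \bar{\mathbf{u}}_k = L^{1/2} \mathbf{u}_k, \\
  B_N B_{N-1} \cdots B_1
    &= L^{1/2} \left( S_N S_{N-1} \cdots S_1 \right) L^{-1/2}.
\end{align*}
```

Each $S_k$ is symmetric and has the form of a standard LMS update matrix in the transformed regressor $\bar{\mathbf{u}}_k$, so the product of the asymmetric matrices differs from a product of symmetric ones only by the fixed factors $L^{1/2}$ and $L^{-1/2}$. The inner factors $L^{-1/2} L^{1/2}$ cancel only because $L$ is the same in every step, which is why this argument is not available for a time-variant $L_k$.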
The asymmetric form of matrix $B_k$ can thus be made symmetric, and standard theory can be applied. Duttweiler replaced $L$ by a time-variant diagonal matrix $L_k$, for which such a symmetry correction in the style of (5) no longer works. He showed his algorithm to be mean-square convergent. First attempts at showing robustness, however, turned out to require further, rather limiting conditions on $L_k$ [56]. In [57], it is finally shown that the PNLMS algorithm can indeed become non-robust even if the positive definite entries of $L_k$ fluctuate only a little.
Algorithm 11. PNLMS algorithm

Linearly-coupled and partitioned adaptive filters
We are further interested in understanding the stability of adaptive filters with asymmetric update forms. To this end, we apply the SVD-based method to so-called linearly-coupled adaptive filters, where two adaptive filters use linear combinations of their error terms; see Eq. (6). Such a coupling may be undesired and caused by the implementation, or desired in order to achieve particular convergence properties [58]. In the case of two coupled adaptive filters, the four coupling factors $\nu_{1,k}, \nu_{2,k}, \nu_{3,k}, \nu_{4,k}$ can also absorb the step-sizes so that only four degrees of freedom remain. Figure 4 depicts the setup. This structure turns out to be the vehicle to analyze cascaded and partitioned adaptive algorithms and is thus of high interest.
In a partitioned algorithm, the input vector is split up into two or more sections, each of which runs with a different (individual) step-size. This can facilitate parallel implementation and/or improve convergence speed. To simplify matters, let us envisage a simple form of a gradient algorithm in which we use two partitions with different step-sizes $\mu_{g,k}$ and $\mu_{h,k}$. We split the entire parameter-error vector into two parts, say g and h, and correspondingly we use two partitions, say $\mathbf{u}_k$ and $\mathbf{x}_k$, as regression vectors. The Bipartite PNLMS algorithm so obtained is summarized in Algorithm 12.
Based on such an algorithmic formulation, we recognize that the bipartite PNLMS algorithm is of the same kind as linearly-coupled adaptive filters with the special step-size/coupling factor choice $\nu_{1,k} = \nu_{2,k} = \mu_{g,k}$ and $\nu_{3,k} = \nu_{4,k} = \mu_{h,k}$. Even if the two step-sizes are not identical, the update error is still linearly dependent for both partitions, causing one singular value to be larger than one, thus violating robustness. Only the weaker MSE-stability remains. In the following, we demonstrate this behavior on a simple example in which we first run the PNLMS algorithm with worst-case sequences rather than random sequences.
Fig. 4 Two adaptive filters with their a-priori output errors $\tilde{e}_{g,k}$ and $\tilde{e}_{h,k}$ coupled by the multiplicative coupling factors $\{\nu_{1,k}, \nu_{2,k}, \nu_{3,k}, \nu_{4,k}\}$
Bipartite PNLMS simulation example: we simulate the bipartite PNLMS algorithm for a filter of length M = 10 and average over 20 Monte Carlo (MC) runs. We vary the filter structure {linear combiner, FIR} as well as the input symbols {Gaussian, bipolar}. We choose a diagonal matrix which nicely reveals the chimera character of the algorithm, being of PNLMS-type with a time-variant diagonal matrix $L_k$ as well as being of partitioned structure. If we further add the error terms, we recognize that which also reveals the linearly-coupled nature of the algorithm. We apply a small normalized step-size of $\mu_k = 0.1 / (\|x_k\|_2^2 + \rho_k \|u_k\|_2^2)$, for which all analyzed cases exhibit MSE-stability. In the first case, we set $\rho_k = \rho = 2$ and refer to this as the fixed matrix $L$ in Fig. 5, while in the second case, we select $\rho_k$ to be a random process uniformly distributed between zero and two and refer to this as the random matrix $L_k$ in Fig. 6. While for a fixed matrix the system mismatch has some potential to grow initially, it cannot keep growing and runs into a steady state. Note that for bipolar driving signals as well as FIR filter structures, the worst-case sequences are found relatively easily, as only a finite space has to be searched exhaustively. For a uniform input (in [−10, 10]) and a linear combiner structure, a more sophisticated search is required, though. As we recognize in Fig. 6, employing FIR filters restricts the search space for worst-case sequences dramatically, and the filter converges despite its largest singular value being larger than one. This is different from a general PNLMS algorithm with arbitrary $L_k$, where we observe that having FIR filter structures is usually not a sufficient constraint to ensure robustness. The PNLMS algorithm with its time-variant diagonal matrix $L_k$ is thus an excellent example of an adaptive algorithm that is MSE-stable but can behave in a divergent manner.

Fig. 5 The worst-case learning behavior of the bipartite PNLMS algorithm (fixed matrix) under various conditions: FIR filter vs. linear combiner and bipolar vs. Gaussian input. The algorithm always becomes $l_2$-stable even though the relative system mismatch initially rises

Fig. 6 The worst-case learning behavior of the bipartite PNLMS algorithm (random matrix) under various conditions: FIR filter vs. linear combiner and bipolar vs. Gaussian input. The PNLMS algorithm is potentially unstable, but by restricting the filter structure to FIR and the input alphabet to BPSK, the algorithm can become $l_2$-stable
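The role of the singular values in this example can be illustrated with a minimal Python sketch. It rests on assumptions that are not taken from the text: the parameter error is taken to evolve, noise-free, as $\tilde{w}_k = (I - \mu_k L_k z_k z_k^T)\tilde{w}_{k-1}$ with the stacked regressor $z_k = [x_k; u_k]$ and $L_k = \mathrm{diag}(I, \rho_k I)$, the filter is a linear combiner, and the input is random Gaussian rather than a worst-case sequence. The function names `step_matrix`, `largest_singular_value`, and `simulate` are hypothetical.

```python
import numpy as np


def step_matrix(z, rho, mu_bar=0.1):
    """Error transition matrix I - mu * L z z^T with L = diag(I, rho * I).

    Assumed form of one bipartite PNLMS step acting on the parameter error;
    the first half of z plays the role of x_k, the second half that of u_k.
    """
    m = z.size
    half = m // 2
    l_diag = np.concatenate([np.ones(half), rho * np.ones(m - half)])
    # Normalized step-size: mu_bar / (||x_k||^2 + rho * ||u_k||^2)
    mu = mu_bar / (z @ (l_diag * z))
    return np.eye(m) - mu * np.outer(l_diag * z, z)


def largest_singular_value(z, rho):
    """Largest singular value of the one-step error transition matrix."""
    return float(np.linalg.svd(step_matrix(z, rho), compute_uv=False)[0])


def simulate(num_iter=2000, m=10, rho_max=2.0, random_rho=False, seed=0):
    """Relative system mismatch ||w_err_k||^2 / ||w_err_0||^2 for random input."""
    rng = np.random.default_rng(seed)
    w_err = rng.standard_normal(m)          # initial parameter-error vector
    norm0 = w_err @ w_err
    mismatch = np.empty(num_iter + 1)
    mismatch[0] = 1.0
    for k in range(num_iter):
        z = rng.standard_normal(m)          # Gaussian regressor (assumption)
        # Fixed matrix L (rho = rho_max) or random L_k (rho uniform in [0, rho_max])
        rho = rng.uniform(0.0, rho_max) if random_rho else rho_max
        w_err = step_matrix(z, rho) @ w_err
        mismatch[k + 1] = (w_err @ w_err) / norm0
    return mismatch


if __name__ == "__main__":
    z = np.arange(1.0, 11.0)
    print("sigma_max, rho = 1:", largest_singular_value(z, 1.0))
    print("sigma_max, rho = 2:", largest_singular_value(z, 2.0))
    print("final mismatch, fixed L:", simulate()[-1])
```

Under these assumptions, the symmetric case (rho = 1) has a largest singular value of one, while the asymmetric case (rho = 2) exceeds one even though the mismatch for random input and a fixed matrix still decays. Searching for worst-case sequences, as done in the text, would replace the random regressor by an exhaustive or guided search.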

Cascaded adaptive algorithms
Cascaded or concatenated structures of adaptive filter algorithms have attracted many researchers in the past. The motivation can be as simple as dividing a long filter into shorter autonomous parts, or it can stem from structural considerations [59,60]. In the context of the identification of non-linear power amplifiers of large bandwidth, a concatenation of linear filter parts with memory and nonlinear parts without memory is very common. Depending on whether the linear filter comes first or not, we distinguish so-called Wiener or Hammerstein models [61,62].
Here, we pick the Wiener filter structure as an illustrative example. The architecture is depicted in Fig. 7. A linear filter g is followed by a nonlinear filter h. The coefficients in h are linear weights applied to nonlinearly mapped instantaneous outputs of g, which serve as the input of h. Typically, polynomial families with orthonormal properties such as Hermite or Legendre polynomials are applied. Consider a polynomial basis $\{p_i(x)\}$, $i = 1, 2, \ldots, M$. The gradient-type procedure is provided as Algorithm 13. In [62], only local robustness is shown for these algorithms. In [63], the algorithm is identified as a bipartite form of the PNLMS algorithm, as addressed in the previous section. We thus conclude that cascaded algorithms in general are potentially non-robust algorithms.
We introduce a new vector $d_k = [p_1(x), p_2(x), \ldots]^T$ whose entries are simply the values of $p_i(x)$ . . . , $M$ of this matrix. According to the mean-value theorem, this value is given by the derivative of $p_i(\cdot)$ evaluated at an argument in the range between $x_k$ and $\hat{x}_k$ (endnote 2). Recalling now that $x_k = u_k^T g$ and $\hat{x}_k = u_k^T g_{k-1}$, we eventually find Eq. (10). In Eq. (10), we recognize the linearly-coupled error term from Eq. (6) with $\nu_{1,k} = \mu_{g,k} d_k^T h$, $\nu_{2,k} = \mu_{g,k}$, $\nu_{3,k} = \mu_{h,k} d_k^T h$, and $\nu_{4,k} = \mu_{h,k}$, a common property of cascaded filter structures. An extension to more than two cascaded stages is straightforward but does not change the essential properties. As the same update error is applied in both stages (in all stages if the filter chain comprises more partitions), the update errors are linearly dependent, differing only in their potentially different step-sizes. For this particular case of linearly dependent error terms, we find that robustness is not possible, in particular as long as $d_k^T h$ is not exactly known. Recent results on this are provided in [64]. Cascaded structures also appear in multiple-input, multiple-output form in the context of Big Data [65,66].
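The mean-value step invoked in this derivation can be written out explicitly. The following is only a sketch in the notation of this section: it writes $\hat{x}_k = u_k^T g_{k-1}$ for the estimate of $x_k = u_k^T g$ (the accent is an assumption) and introduces $\xi_{i,k}$, a symbol used here purely for illustration, for the intermediate point.

```latex
p_i(x_k) - p_i(\hat{x}_k)
  = p_i'(\xi_{i,k})\,\bigl(x_k - \hat{x}_k\bigr)
  = p_i'(\xi_{i,k})\; u_k^T \bigl(g - g_{k-1}\bigr),
\qquad \xi_{i,k} \text{ between } x_k \text{ and } \hat{x}_k .
```

The difference of the nonlinearly mapped outputs is thus linear in the a-priori error of the linear stage, $u_k^T (g - g_{k-1})$, with a factor that depends on the unknown intermediate point. This is consistent with the remark that the coupling cannot be compensated as long as $d_k^T h$ is not exactly known.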

Outlook and conclusions
It may thus be surprising that many practically relevant adaptive algorithms are indeed non-robust although they are MSE-stable. While they appear to work properly in everyday situations, input sequences can be found that cause the algorithm to diverge. Once such a sequence is present, no step-size control can cure it.
Are all adaptive filters well understood now? No, there certainly are still blank spots left that are not as clear as they could be. Take for example the well-known backpropagation algorithm [67][68][69] (see Algorithm 14), an extension of the PLA to several layers. Early investigations [70] only showed that the algorithm is locally but not globally robust. Single-layer PLAs, however, are globally robust. Moreover, the search for worst-case sequences in asymmetric algorithms that exhibit singular values larger than one can remain inconclusive. Once such a sequence is found, non-robustness follows, but if the search space is too large and no sequence is found in a random or more sophisticated search, it remains unclear whether such a sequence does not exist or whether we simply cannot find it.
Once we need to rely on the algorithms, we should thus turn to the few robust algorithms rather than pray for stability in the mean-square sense. An open question in this context remains, however: Is the MSE really the troublemaker, or is it the independence assumption?
As the mean-square analysis typically comes with the independence assumption, the two are not easy to separate. The few known cases for which an analysis exists without the independence assumption [71,72] are valid for the LMS algorithm only (either for very short or infinitely long filters), and this is, as we know, a very robust algorithm of symmetrical form.
There are indeed many more algorithms that are worth mentioning. They cannot all be named due to limited space. Let us, however, briefly refer to the notion of stable in probability (almost sure convergence), as it provides another means of describing stable or unstable filter behavior. In [73], the LMS algorithm was analyzed in terms of almost sure convergence, showing that substantially larger step-sizes can be employed than those obtained from MSE analysis. Such methods were successfully applied to the constant modulus algorithm (CMA) [74] and the least mean fourth (LMF) algorithm [75], showing that divergence of the algorithms can indeed be readily obtained when modifying the signal properties of noise [76] and input [77,78], respectively. Extensions to exponents higher than four can be found in [79].
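The gap between the two step-size bounds can be made tangible with a toy case. The sketch below is not the analysis of [73]; it assumes a noise-free scalar LMS filter with i.i.d. standard Gaussian input, for which the parameter error obeys $\tilde{w}_k = (1 - \mu x_k^2)\tilde{w}_{k-1}$. The mean-square error then shrinks if $E[(1 - \mu x^2)^2] = 1 - 2\mu + 3\mu^2 < 1$, while almost sure convergence only requires $E[\ln|1 - \mu x^2|] < 0$. The function names are hypothetical.

```python
import numpy as np


def mse_growth_factor(mu):
    """E[(1 - mu * x^2)^2] for x ~ N(0, 1), using E[x^2] = 1 and E[x^4] = 3."""
    return 1.0 - 2.0 * mu + 3.0 * mu ** 2


def log_growth_rate(mu, num_samples=200_000, seed=0):
    """Monte Carlo estimate of E[ln|1 - mu * x^2|] for x ~ N(0, 1).

    A negative value means that |w_err_k| decays along almost every
    realization (toy scalar model, assumption of this sketch).
    """
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(num_samples)
    return float(np.mean(np.log(np.abs(1.0 - mu * x ** 2))))


if __name__ == "__main__":
    for mu in (0.5, 1.0):
        print(mu, mse_growth_factor(mu), log_growth_rate(mu))
```

In this toy model, a step-size of 1 violates the mean-square condition (which requires a step-size below 2/3) yet still yields a negative logarithmic growth rate, in line with the observation that almost sure convergence tolerates larger step-sizes than MSE analysis suggests.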

Endnotes
1 It is worth mentioning in a historical context that there were earlier algorithmic proposals by Robbins-Monro [80] and Kiefer-Wolfowitz [81] in the direction of stochastic gradient algorithms.