Adaptive Filters with Error Nonlinearities: Mean-Square Analysis and Optimum Design

This paper develops a unified approach to the analysis and design of adaptive filters with error nonlinearities. In particular, the paper performs stability and steady-state analysis of this class of filters under weaker conditions than what is usually encountered in the literature, and without imposing any restriction on the color or statistics of the input. The analysis results are subsequently used to derive an expression for the optimum nonlinearity, which turns out to be a function of the probability density function of the estimation error. Some common nonlinearities are shown to be approximations to the optimum nonlinearity. The framework pursued here is based on energy conservation arguments.


INTRODUCTION
The least-mean-squares (LMS) algorithm is a popular adaptive algorithm because of its simplicity and robustness [1,2]. Many LMS-like algorithms have been suggested and analyzed in the literature with the aim of retaining the desirable properties of LMS and simultaneously offsetting some of its limitations. Of particular importance is the class of least-meansquares algorithms with error nonlinearities. Table 1 lists examples from this class of algorithms for real-valued data. Despite the favorable behavior of many of these LMS variants, their choice and design are mostly justified by intuition rather than by rigorous theoretical arguments. Even the LMS algorithm, which has long been considered as an approximate solution to a least-mean-squares problem, has only recently been justified by a rigorous theory [3].
In this paper, we provide a unifying framework for the mean-square performance of adaptive filters that involve error nonlinearities in their update equations. We will use this analysis to design adaptive algorithms with optimized performance. Before discussing the features of the approach proposed herein and its contributions, we provide, as a motivation, a summary of selected studies dealing with adaptive filters with error nonlinearities. These studies can be classified into two categories.

I. Analysis using simplifying assumptions
Since adaptive algorithms with error nonlinearities are among the most difficult to analyze, it is not uncommon to resort to different methods and assumptions with the intent of performing tractable analysis. This includes: Linearization: here the error nonlinearity is linearized around an operating point and higher-order terms are discarded as in [4,5,6,7,8]. Analyses that are based on this technique fail to accurately describe the adaptive filter performance for large values of the error, for example, at early stages of adaptation. Linearization can be avoided by focusing on a specific nonlinearity (e.g., as in the sign algorithm [9]) or a subclass of nonlinearities (e.g., as in the case of the error saturation algorithm [10,11]).
Independence assumption: it is very common to assume that successive regression vectors are independent.
Assumptions on the statistics of the error signal: while statistical assumptions are usually imposed on the regression and noise sequences, it is also common to impose statistical conditions on error quantities. For example, in studying the sign-LMS algorithm, it was assumed in [17] that the elements of the weight-error vector are jointly Gaussian. More accurate is the assumption that the residual error is Gaussian, which was adopted in [6,9,10,11]. By central limit arguments, this assumption is justified for long adaptive filters. More importantly, the assumption is as valid in the early stages as in the final stages of adaptation.
Assuming Gaussian noise: noise is sometimes restricted to be i.i.d Gaussian as in [6,9,17,18], although Gaussianity is not as common as the previous assumptions.
Most studies of adaptive algorithms with error nonlinearities rely on a selection from the above array of assumptions/techniques. [8,19,20,21] Here one attempts to construct adaptive algorithms with optimum nonlinearities. A natural prerequisite is to evaluate some measures of performance (e.g., [6,7,16,22]) and then minimize them to arrive at optimum choices for the nonlinearity [8,19,20,21]. The difficulty, of course, is that the analysis is often plagued by the aforementioned assumptions and techniques. The result is that the optimum nonlinearities obtained are not any more valid than the restrictions imposed by the analysis.

The approach of this paper
In this paper, we address some of the above concerns. In particular, we present a unified approach to the mean-square analysis of adaptive algorithms with general error nonlinearities. The approach relies on energy conservation arguments and applies under weaker assumptions than what is available in the literature. Our performance results are subsequently optimized to obtain an expression for the optimum nonlinearity. In what follows, we list the contributions of the paper. This also serves as a layout for its organization.
(1) After introducing our notation, we set up the stage in the next section by defining the adaptive filtering problem. We also derive an energy relation that will be the starting point for much of the subsequent analysis.
(2) The energy relation is used in Section 3 to study mean-square stability. In particular, without relying on any independence-like assumptions, we derive bounds on the step-size for stability.
(3) Section 4 is devoted to studying the steady-state behavior, where we show that the mean-square error can be obtained as the fixed point of a nonlinear equation. The stability and steady-state analysis apply under weaker conditions than usual, and these conditions become reasonably accurate for long adaptive filters.
(4) The steady-state results are used in Section 5 to obtain an expression for the optimum nonlinearity, which is valid for all stages of adaptation. The nonlinearity turns out to be a function of the noise probability density function (pdf). We show how the nonlinearity manifests itself for different noise distributions and how it relates to more common nonlinearities.

Notation
We focus on real-valued data, although the extension to complex-valued data is immediate. Small boldface letters are used to denote vectors, for example, w. Also, the symbol T denotes transposition. The notation w 2 stands for the squared Euclidean norm of a vector, w 2 = w T w. All vectors are column vectors except for a single vector, namely the input data vector u i , which is taken to be a row vector. The time instant is placed as a subscript for vectors (e.g., w i ) and between parentheses for scalars (e.g., e(i)).

ADAPTIVE ALGORITHMS WITH ERROR NONLINEARITIES
An adaptive filter attempts to identify a weight vector w o by using a sequence of input (row) regressors {u i } and output samples {d(i)} that are related via Here v(i) accounts for measurement noise and modeling errors. Many adaptive schemes have been proposed in the literature for this purpose (cf. [1,2]). In this paper, we focus on the class of algorithms where w i is the estimate of w at time i, µ is the step size, is the estimation error, and f [e(i)] is a scalar function of the error e(i). Table 1 lists some common adaptive algorithms and their corresponding error nonlinearities.
Mean-square analysis of adaptive filters is best carried out in terms of the weight-error vector We can use these quantities to rewrite the adaptation equation Moreover, by combining the defining expressions (3) and (5), we obtain A relation between the estimation errors e a (i), e p (i), and e(i) can be obtained by pre-multiplying both sides of the adaptation equation (6) by u i , and incorporating the defining expressions (5), which yields Observe that the above questions are all conveniently phrased in terms of the error quantities {e a (i), e p (i),w i } or, more accurately, in terms of their energies. This fact motivates us to pursue an energy-based approach.

An energy conservation relation
More specifically, in order to address questions of this kind, we will rely on an energy equality that relates the squared norms of the error quantities {e a (i), e p (i),w i ,w i+1 }. To derive the energy relation, we combine (6) and (9) so as to eliminate the nonlinearity f [e(i)]: (14) This averaged form of the energy relation is the starting point of our analysis. In the course of answering the adaptive filtering questions, we will not attempt to develop (14) into a self-contained recursion as is usually done in literature. Rather, our efforts will be centered around manipulating the two expectations that appear on the right-hand side of (14) by imposing as little assumptions as necessary to answer the adaptive filtering question under consideration. In particular, the following two assumptions will be used throughout our analysis: (AN) The noise sequence {v(i)} is independent, identically distributed, and independent of the input sequence (AG) The filter is long enough such that e a (i) is Gaussian. The independence assumption on the noise is valid in many practical situations. Notice, however, that we make no assumption on the noise statistics which is contrary to the Gaussian restriction that is sometimes imposed in literature (e.g., [6,9,17,18]).
Assumption (AG) is justified for long filters by the central limit theorem. As such, the validity of the assumption is dependent on the filter order M. Nevertheless, the assumption remains as valid in the initial stages of adaptation as it is in the final stages. This comes contrary to the linearization arguments that are usually employed when dealing with error nonlinearities and which are only valid in the final stages of adaptation (see [6,7,8]). By expressing the two expectations in (14) in terms of the second-order moment E[e 2 a (i)], we basically bypass the need for linearization.

MEAN-SQUARE STABILITY
Stability is usually studied in literature by first developing (14) into a self-contained recursion and subsequently determining conditions on the step-size in order to guarantee the stability of the recursion. As we will now see, we can study stability directly from (14) thus doing away with the self-contained recursion and with any auxiliary assumptions that are invoked to develop this recursion. In particular, starting from (14), we pursue a Lyapunov approach to stability where we provide a nontrivial upper bound on µ for which E[ w i 2 ] remains uniformly bounded for all i. More specifically, we will show how to calculate a bound µ 0 for which for some constant C.

A monotone sequence of weight energies
Starting from (14), it is easy to see that Thus, if we choose µ such that for all i then the sequence {E[ w i 2 ]} will be decreasing and (being bounded from below) also convergent. Alternatively, a sufficient condition for stability would be where we appealed to the Cauchy-Schwartz inequality to bound the denominator by To proceed further, the convenience of the Gaussian assumption on e a (i) can be brought into fruit to evaluate the expectations in (18). In particular, the expectations can be written as functions of the second moment 1 E[e 2 a (i)]. This prompts 1 Since e a (i) is assumed Gaussian and independent of v(i), we can, for example, write For future reference, h G is tabulated in Table 2 for the error nonlinearities of Table 1. Upon substituting (20) and (21) into (18), we see that a sufficient condition for convergence is that where we moved the expectation E[ u i 4 ] outside the minimization since the input data is assumed stationary. Observe now that all terms in the above minimization are functions of E[e 2 a (i)]. We can therefore rewrite (22) as We emphasize that the minimization takes place over the possible values of E[e 2 a (i)] only; these values are not arbitrary but correspond to those assumed by the learning curve of the adaptive filter. As it stands, the bound (23) is still not useful, and we need to replace it by a time-independent bound.

Removing the time index (dependence)
We can replace the range of feasible values of E[e 2 a (i)] by the larger set Minimization over Ω is easier to carry out and we additionally have Almost always, however, minimization over Ω yields a null value and hence is useless. A more intelligent choice of the feasible set is thus called for.

A lower bound on E[e 2 a (i)]
A nonzero lower bound on E[e 2 a (i)] can be obtained by noting that E[e 2 a (i)] cannot be lower than the Cramer-Rao bound λ associated with the underlying estimation process, that is, the problem of estimating the random quantity u i w o by using u i w i (see [32, page 72]). Thus, we can write and subsequently replace Ω with the smaller set This results in the tighter bound µ ≤ 2

An upper bound on E[e 2 a (i)]
It turns out that for adaptive filters that employ a linear or sublinear error nonlinearities (e.g., the LMS and the sign algorithms), the feasible set Ω defined by (27) is good enough (as we will show in the examples further ahead). In other words, the set is small enough to produce a positive upper bound on the step size for stability. In other cases, for example, the LMF algorithm and its family, we need a more compact set and an upper bound on E[e 2 a (i)] is also necessary. Not surprisingly perhaps, the upper bound depends on the initial condition of the adaptive filter.
To this end, observe that since e a (i) is Gaussian, by assumption, it satisfies where the last line follows from the Cauchy-Schwartz inequality. To proceed further, we apply the same inequality to the expectation operator this time, splitting it as Here R denotes the covariance matrix of the regression vector u i . Now since the bounds (18) or (23) on µ ensure that This gives us the desired upper bound The two bounds (26) and (32) produce the alternative feasibility set which leads to the following conclusion.
Theorem 1 (stability). Consider an adaptive filter of the form where and λ is the Cramer-Rao bound associated with the problem of estimating the random quantity u i w o by using u i w i .
As indicated earlier, the bound (35) will be zero for superlinear functions f [·] (e.g., LMF and LMMN), in which case the tighter bound (36) will be more useful. Notice also that the above bounds are derived without relying on any independence-like assumptions.

The LMS algorithm
Instead of applying Theorem 1, we can in the LMS case be more specific. Thus starting from (17), we obtain where the last line follows from the Gaussian assumption (AG). By performing the minimization over Ω , we obtain the tighter bound For binary inputs, stability of the LMS algorithm can be established even without the Gaussian assumption (AG). For then, u i 2 = M and stability is guaranteed if E e a (i)e(i)

The sign algorithm
For the sign algorithm, the bound (35)

The LMF algorithm
For the LMF algorithm, we employ the tighter bound (36) µ ≤ 2 For design purposes, we could determine the infimum of a ]h G / h C over Ω . Here we are interested in the simpler task of establishing stability.
Remark. Linearization was employed in [4] to prove the stability of the LMF algorithm. While this might be reasonable for steady-state analysis, it need not be valid when stability is concerned. Notice that no linearization arguments were used here. Observe also that for the LMF algorithm, the infimum of E[e 2 a ]h G / h C over Ω is zero. That explains why we had to perform minimization over the smaller set Ω , which is initial-condition dependent. This suggests that the performance of the LMF is initial-condition dependent too, as is often confirmed by simulation. In general, this is expected to be the case for algorithms employing super-nonlinearities.

STEADY-STATE BEHAVIOR
To investigate the steady-state behavior, we again start from the averaged energy relation (14). Assuming that the filter is stable, it should attain a steady-state where it holds that Therefore, (14) where the function h G was tabulated in Table 2 for the nonlinearities in Table 1.
In a similar fashion, we now proceed to evaluate (rather than bound it as in the stability analysis). This prompts us to introduce the following "asymptotic" assumption: (AU) The random variables u i 2 and f 2 [e(i)] are asymptotically uncorrelated, that is, Assumption (AU) has the same spirit as the independence assumption 3 but it is weaker. For one thing, relation (48) is exact for constant modulus inputs while the independence assumption is not. Moreover, the separation property (48) need only be satisfied asymptotically. Fortunately, assumption (AU) acts in harmony with the Gaussianity assumption on e a (i) in that it also becomes more realistic as the filter gets longer. For then, by an ergodic argument, u i 2 behaves like the second moment of the input (scaled by the filter length M).

To proceed, we use the Gaussian assumption (AG) on e a (i) to express the expectation E[f 2 [e(i)]] as a function of the second-order moment E[e 2 a (i)], which motivates the definition
The function h U is tabulated in Table 3 for the nonlinearities  of Table 1. This definition, together with (48), yields Upon substituting (47) and (50) Then since both h U and h G are analytic in their arguments, 3 The independence assumption requires that the input regressors {u i } form an independent and identically distributed sequence. This assumption is heavily used in the adaptive filtering literature.
we have so that the MSE is the positive root of the nonlinear equation (55) To demonstrate the use of this theorem, we provide in what follows expressions for the mean-square error of some of the nonlinearities in Table 1.

Examples: MSE expressions 4.1.1 The LMS algorithm
In the LMS case, (54) reads or, equivalently, This is a well-known result that was derived in [13] by relying on the independence assumption. Using the energy-based approach of this paper, we only need the weaker assumption (AU), as indicated above and also in [27,29]. 4

The sign algorithm
In the case of the sign algorithm, we can show that the MSE satisfies This relation applies irrespective of assumption (AU). We have only appealed to the Gaussian assumption (AG) in arriving at (58)-see also [28]. The expectation that appears Sign error 1 1 Sat. nonlin. (58) is carried over the noise pdf. By specifying the noise statistics, we obtain the following special cases: Gaussian noise, where α = µ π/8 Tr(R). Each of these equations can be uniquely solved for the MSE. They were derived in [12] under the independence assumption and under an i.i.d restriction on the input, but are rederived here without relying on these restrictions.

The LMF algorithm
For the LMF algorithm, and with the aid of Tables 1 and 2, equation (54) takes the form where m v,4 and m v,6 denote the fourth and sixth moments of the noise v(i). Finding the MSE is thus equivalent to finding the roots of a third-order equation, which can be done numerically. We can avoid this in the Gaussian case and obtain a closed formula for the MSE.

Gaussian noise
In the Gaussian noise case, (60) simplifies to where α = 5µ Tr(R). This is a quadratic equation in S with two positive roots Only the larger root is meaningful. 5 Remarks. Although several of the MSE expressions above appeared previously in literature, the advantage of the energybased approach of this article is threefold: (1) All expressions are obtained as a fall out of the same approach (i.e., by using the same expression (54))-an approach that avoids the need for a self-contained recursion.
(3) Here we avoid the need for linearization as in the LMF example. It seems that calculating the steady-state error for super nonlinearities (e.g., the LMF and its family) has always involved some form of linearization arguments (e.g., [4,6,8,16]).

OPTIMUM CHOICE OF THE NONLINEARITY
In this section, we build upon the second-order analysis performed above to optimize the choice of the error nonlinearity f . To this end, consider expression (54) for the mean-square error written in a more explicit form Tr(R)

E f 2 [e(i)] E e a (i)f [e(i)] /E e 2 a (i)
.
(63) 5 We can show that the smaller root is O(µ 3 ). It is well known that the MSE is linearly proportional to the step size, and hence the smaller root can be ignored.
We would like to choose a nonlinearity f that minimizes the mean-square error. If we confine our attention to the class of smooth nonlinearities, we can write, using the Gaussian assumption (AG) and Price theorem [11,33],

E e a (i)f [e(i)] = E e a (i)f e a (i) + v(i)
= E e a (i)e(i) E f e(i) Thus, for a smooth error nonlinearity f , the MSE takes the alternative form The mean-square error cannot be minimized beyond the limit λ, which corresponds to the Cramer-Rao bound of the underlying estimation process. We can thus write (77)

Incorporating the Gaussian assumption on e a (i)
Implementing the optimum nonlinearity (67) or (76) requires that we evaluate the pdf of e(i) at each time instant. It turns out, however, that we can replace this task with the simpler task of evaluating the variance of e a (i) in addition to specifying the (time-invariant) noise pdf. To this end, recall that our derivation of the MSE expression relied on a Gaussian assumption on the error e a (i), and, hence, so does our subsequent derivation of the optimum error nonlinearity. Fortunately, this assumption helps us obtain a more explicit expression for the optimum nonlinearity (76). To see this, notice that the estimation error is the sum of two independent random variables, Therefore, its pdf, p e [e(i)], is the convolution of the pdf 's of e a (i) and v(i), that is, Here σ 2 ea denotes the variance of e a (i). 6 The above calculation reduces the determination of the optimum nonlinearity to the task of modeling the noise pdf.

Estimating the variance σ 2 ea
Perhaps the most challenging task of the algorithm is estimating the variance of the a priori error e a (i)-a nonstationary quantity. The easiest way out is to set σ 2 ea to some constant value. Alternatively, as done in the simulation, we first estimate the variance of e(i) using a window of samples of e(i), and subsequently estimate σ 2 ea from Furthermore, to avoid malfunctioning of the algorithm, we confine the estimate σ 2 ea to a bounded interval [a, b] that is determined by the designer.
Remarks. (1) From above, we see that the derivation of the optimum nonlinearity blends smoothly with the stability and steady-state analysis in that it relies on the same set of assumptions, and is also obtained as a fallout of the same energy conservation approach.
(2) Our derivation of the optimum nonlinearity shares another feature with the (stability) analysis in that it makes use of the fundamental limit set by the Cramer-Rao bound of the underlying estimation process.
(3) Also note that no heavy machinery is appealed to in developing the optimum nonlinearity, maintaining the general themes of clarity and simplicity. In particular, we avoid the variational approaches that are usually employed in literature in designing optimum adaptation schemes (see [8,20]).
(4) The nonlinearity (76) is derived under simpler assumptions compared to what is available in literature. For instance, we employ a weaker version of the independence assumption (compare with [8,10,19,20,21]) and make no restriction on the color or statistics of the input (compare with [8,10,20]). The nonlinearity (76) also applies irrespective of the noise statistics or whether its pdf is symmetric or not (contrary to what is assumed in [8,19]). We only require the noise to have zero mean.
(5) More importantly, perhaps, we avoid the need for any linearization arguments making the nonlinearity (76) accurate over all stages of adaptation. In contrast, the optimum nonlinearity that was derived in [8] using linearization arguments is only accurate in the final stages of adaptation. In fact, the more accurate expression (76) for the optimum nonlinearity collapses to (82) as the filter reaches its steady-state. (6) Notice further that expression (76) for the optimum nonlinearity applies irrespective of whether the noise pdf is smooth enough (differentiable) or not. Thanks to the 6 The time dependence of σ 2 ea is suppressed for notational convenience.
smoothing convolution operator (see (80)), we can, for example, directly calculate the optimum nonlinearity for binary and uniform noise (see examples below). This comes contrary to the nonlinearity (82) where an artificial smoothing kernel needs to be employed for such singular cases [8].

Examples
In what follows, we show how the error nonlinearity manifests itself for different noise statistics.

Gaussian noise
When v is Gaussian, so is the estimation error e(i) (since e(i) is then the sum of two Gaussian random variables, v(i) and e a (i)). In this case, the optimum nonlinearity (76) which, up to the scaling factor 1/σ 2 e , is the error function of the LMS. Therefore, the LMS is the optimum adaptive algorithm in the presence of Gaussian noise.

Laplacian noise
When v follows a Laplacian distribution, its pdf takes the form where erf is the error function defined by (87)

Binary noise
In the binary noise case, we have In this case, the optimum nonlinearity reads (92)

Simulations
Here we use simulations to illustrate the favorable behavior of the optimum algorithm in comparison to the LMS. The system to be identified is an FIR channel with 15 taps normalized so that the SNR relative to the input and output is the same (10 dB in our case). The input is taken to be Gaussian while the additive output noise is assumed to be binary or Laplacian. The variance σ 2 e is estimated using the most recent four samples of e(i), and the estimate is in turn used in (81) to estimate the variance of e a (i instead. The experiment is averaged over one thousand runs. The LMS and the optimum adaptive algorithms are compared (Figures 1 and 2) in terms of their learning curves; the evolution of E[ w i 2 ] with time (also known as the mean-square deviation or MSD). We also plot the nonlinearities employed by both algorithms. Since the optimum nonlinearity is time varying (through its dependence on σ 2 ea ), it has a stochastic nature. The plots thus show the optimum nonlinearities in their averaged forms.

Relation between optimum and other nonlinearities
The optimum nonlinearities (76) and (82)  This makes the nonlinearities difficult to implement since the pdf is usually unknown and/or time-varying. Even if the pdf is known, the corresponding nonlinearity would be expressed in terms of transcendental functions (e.g., as in (90) and (92)), which do not lend themselves to real-time implementations. This is compounded by the fact that different distributions (i.e., pdf 's) call for different nonlinearities. Thus, the optimum nonlinearities defy an important feature of least-mean-square algorithms, namely computational simplicity. A more alarming issue though is that these nonlinearities do not seem to relate to the ubiquitous LMS algorithm or its common variants. In the following, we address both of these issues for the nonlinearity (76) by representing the pdf p e [e(i)] in an Edgeworth expansion, which we now digress to introduce. The nonlinearity (82) can be dealt with similarly.

Edgeworth expansion of p e
Let γ j denote the jth cumulant of e(i) and let σ 2 e denote its variance. Assuming p e to be even, its Edgeworth expansion is given by [34] where φ is the standard (zero-mean, unit variance) Gaussian pdf and He 2j is the Hermite polynomial of degree 2j [35]. The coefficients a j are defined recursively by Since p e is even, we can show that a 2j+1 = 0 for all j, and only the even indexed coefficients appear in (93). Thus, p e [e(i)]/φ[e(i)/σ e ] is an infinite linear combination of Hermite polynomials of even degree. Alternatively, since He 2j [e(i)/σ e ] is a linear combination of even powers of its argument, so is the expansion (93). This series expansion is therefore similar to the familiar Taylor expansion except for the fact that it is expressed in statistically relevant termsthe a j 's, which are defined in terms of the cumulants rather than the derivatives of p e . The expansion (93) is also different from a Taylor series in that we are not interested in its convergence as much as in representing p e in as few terms  of (93) as possible. For most practical purposes, the first few terms of (93) are sufficient for a good approximation [34].

Relating the nonlinearities
We can finally put f opt in a more familiar form by approximating the rational function in the last equation by its Taylor series. As the rational function is odd, the Taylor series will contain only odd powers of e(i), and we can, therefore, write The c 2j+1 's can be written in terms of the a 2j 's but the explicit dependence is not essential for the subsequent discussion.
The following remarks are in order.