Sign-Error Adaptive Filtering Algorithms for Markovian Parameters

Motivated by reduction of computational complexity, this work develops sign-error adaptive filtering algorithms for estimating time-varying system parameters. Different from the previous work on sign-error algorithms, the parameters are time-varying and their dynamics are modeled by a discrete-time Markov chain. A distinctive feature of the algorithms is the multi-time-scale framework for characterizing parameter variations and algorithm updating speeds. This is realized by considering the stepsize of the estimation algorithms and a scaling parameter that defines the transition rates of the Markov jump process. Depending on the relative time scales of these two processes, suitably scaled sequences of the estimates are shown to converge to either an ordinary differential equation, or a set of ordinary differential equations modulated by random switching, or a stochastic differential equation, or stochastic differential equations with random switching. Using weak convergence methods, convergence and rates of convergence of the algorithms are obtained for all these cases.


Introduction
Adaptive filtering algorithms have been studied extensively, thanks to their simple recursive forms and wide applicability for diversified practical problems arising in estimation, identification, adaptive control, and signal processing [26].
Recent rapid advancement in science and technology has introduced many emerging applications in which adaptive filtering is of substantial utility, including consensus controls, networked systems, and wireless communications; see [1,2,4,5,8,7,12,13,14,16,17,18,19,20,23,24,27]. One typical feature of such new domains of applications is that the underlying systems are inherently time varying and their parameter variations are stochastic [29,30,31]. One important class of such stochastic systems involves systems whose randomly time-varying parameters can be described by Markov chains. For example, networked systems include communication channels as part of the system topology. Channel connections, interruptions, data transmission queuing and routing, and packet delays and losses are inherently random. Markov chain models become a natural choice for such systems. For control strategy adaptation and performance optimization, it is essential to capture time-varying system parameters during their operations, which leads to the problem of identifying Markovian regime-switching systems pursued in this paper.
When data acquisition, signal processing, and algorithm implementation are subject to resource limitations, it is highly desirable to reduce data complexity. This is especially important when data shuffling involves communication networks. This understanding motivates the main theme of this paper: the use of sign-error updating schemes in adaptive filtering algorithms, which carry much-reduced data complexity without detrimental effects on parameter estimation accuracy or convergence rates.
In our recent work, we developed a sign-regressor algorithm for adaptive filters [28]. The current paper further develops sign-error adaptive filtering algorithms. It is well-known that sign algorithms have the advantage of reduced computational complexity. The sign operator reduces the implementation of the algorithms to bits in data communications and simple bit shifts in multiplications. As such, sign algorithms are highly appealing for practical applications. The work [11] introduced sign algorithms and has inspired much of the subsequent developments in the field. On the other hand, employing sign operators in adaptive algorithms has introduced substantial challenges in establishing convergence properties and error bounds.
A distinctive feature of the algorithms introduced in this paper is the multi-time-scale framework for characterizing parameter variations and algorithm updating speeds. This is realized by considering the stepsize of the estimation algorithms and a scaling parameter that defines the transition rates of the Markov jump process. Depending on the relative time scales of these two processes, suitably scaled sequences of the estimates are shown to converge to either an ordinary differential equation, or a set of ordinary differential equations modulated by random switching, or a stochastic differential equation, or stochastic differential equations with random switching. Using weak convergence methods, convergence and rates of convergence of the algorithms are obtained for all these cases.
The rest of the paper is arranged as follows. Section 2 formulates the problems and introduces the two-time-scale framework. The main algorithms are presented in Section 3.
Mean-square error bounds for the parameter estimates are derived. By taking appropriate continuous-time interpolations, Section 4 establishes convergence properties of interpolated sequences of estimates from the adaptive filtering algorithms. Our analysis is based on weak convergence methods. The convergence properties are obtained by using martingale averaging techniques.
Section 5 further investigates the rates of convergence. Suitably interpolated sequences are shown to converge to either stochastic differential equations or randomly-switched stochastic differential equations, depending on relations between the two time scales. Numerical results by simulation are presented to demonstrate the performance of our algorithms in Section 6.

Problem Formulation
Let y_n = ϕ_n′α_n + e_n, n = 0, 1, …, where ϕ_n ∈ R^r is the sequence of regression vectors, e_n ∈ R is a sequence of zero-mean random variables representing the error or noise, α_n ∈ R^r is the time-varying true parameter process, and y_n ∈ R is the observation signal at time n.
Estimates of α_n are denoted by θ_n and are generated by the following adaptive filtering algorithm, which applies a sign operator to the prediction error:

θ_{n+1} = θ_n + µ ϕ_n sgn(y_n − ϕ_n′θ_n),    (2)

where µ > 0 is the stepsize and sgn(y) is defined as sgn(y) := 1_{y>0} − 1_{y<0} for y ∈ R. We impose the following assumptions.
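As a concrete illustration, the sign-error recursion can be sketched in a few lines of Python. This is a minimal sketch rather than the paper's experimental code: the dimension, stepsize, and noise level are illustrative choices, and the parameter is held constant here rather than Markov-switching.

```python
import numpy as np

def sgn(y):
    # sgn(y) = 1_{y>0} - 1_{y<0}; in particular sgn(0) = 0
    return float(y > 0) - float(y < 0)

def sign_error_step(theta, phi, y, mu):
    """One sign-error update: only the sign of the prediction
    error y - phi' theta enters the correction term."""
    return theta + mu * phi * sgn(y - phi @ theta)

# Track a fixed (non-switching) parameter with noisy observations.
rng = np.random.default_rng(0)
alpha = np.array([1.0, -0.5])   # illustrative true parameter
theta = np.zeros(2)
mu = 0.05                       # illustrative stepsize
for _ in range(5000):
    phi = rng.standard_normal(2)
    y = phi @ alpha + 0.1 * rng.standard_normal()
    theta = sign_error_step(theta, phi, y, mu)
```

Because only the sign of the prediction error is transmitted, each correction term costs one comparison and one scaled copy of ϕ_n, which is the source of the reduced data complexity discussed above.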
(A1) α_n is a discrete-time homogeneous Markov chain with state space M = {a_1, …, a_{m_0}} and transition probability matrix P^ε = I + εQ, where ε > 0 is a small parameter, I is the R^{m_0×m_0} identity matrix, and Q = (q_{ij}) ∈ R^{m_0×m_0} is an irreducible generator (i.e., Q satisfies q_{ij} ≥ 0 for i ≠ j and Σ_{j=1}^{m_0} q_{ij} = 0 for each i = 1, …, m_0) of a continuous-time Markov chain. For simplicity, assume that the initial distribution of the Markov chain α_n is independent of ε.
(A2) The sequence of signals {(ϕ_n, e_n)} is uniformly bounded, stationary, and independent of the parameter process {α_n}. Let F_n be the σ-algebra generated by {(ϕ_j, e_j), α_j : j < n; α_n}, and denote the conditional expectation with respect to F_n by E_n.
(A3) For each i = 1, …, m_0, define For each n and i, there is an A_n^{(i)} ∈ R^{r×r} such that, given α_n = a_i, (A4) There is a sequence of non-negative real numbers {φ(k)} with Σ_k φ^{1/2}(k) < ∞ such that, for each n, each j > n, and some K > 0, uniformly in i = 1, …, m_0.
Remark 2.1 Let us take a moment to justify the practicality of the assumptions. The boundedness assumption in (A2) is fairly mild. For example, we may use a truncated Gaussian process. In addition, it is possible to accommodate unbounded signals by treating martingale difference sequences (which make the proofs slightly simpler).
In (A3), we allow g_n(θ, i) to be non-smooth with respect to θ while its conditional expectation ḡ_n(θ, i) is a smooth function of θ. Condition (6) indicates that ḡ_n(θ, i) is locally (near a_i) linearizable. For example, this is satisfied if the conditional joint density of (ϕ_n, e_n) with respect to {ϕ_j, e_j, j < n, ϕ_n} is differentiable with bounded derivatives; see [6] for more discussion. Finally, (A4) is essentially a mixing condition, indicating that the remote past and distant future are asymptotically independent. Hence we may work with correlated signals as long as the correlation decays sufficiently quickly across iterates.
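To make the transition structure in (A1) concrete, the sketch below builds the one-step transition matrix P^ε = I + εQ from a hypothetical irreducible generator Q and simulates the resulting slowly switching chain; the specific Q, ε, and state labels are illustrative choices, not taken from the paper.

```python
import numpy as np

# Hypothetical irreducible generator: q_ij >= 0 off the diagonal,
# and each row sums to zero.
Q = np.array([[-1.0,  0.5,  0.5],
              [ 0.3, -0.6,  0.3],
              [ 0.2,  0.2, -0.4]])
eps = 0.01
P = np.eye(3) + eps * Q  # transition matrix P^eps = I + eps*Q

def simulate_chain(P, n_steps, init=0, seed=1):
    """Sample a path of the discrete-time chain alpha_n; each row
    of P is the conditional distribution of the next state."""
    rng = np.random.default_rng(seed)
    states = [init]
    for _ in range(n_steps - 1):
        states.append(rng.choice(len(P), p=P[states[-1]]))
    return np.array(states)

path = simulate_chain(P, 2000)
```

For small ε the rows of P^ε remain probability vectors, and the chain typically dwells on the order of 1/ε steps in each state before switching, which is what makes ε the natural measure of the parameter's transition rate.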

Mean-Square Error Bounds
Denote the sequence of estimation errors by θ̃_n := α_n − θ_n. We proceed to obtain bounds on the mean-square error in terms of the transition rate ε of the parameter process and the adaptation rate µ of the algorithm.
Theorem 3.1 Assume (A1)-(A4). Then there is an N_ε > 0 such that for all n ≥ N_ε, E|θ̃_n|² = O(µ + ε + ε²/µ).

Proof. Define the Liapunov function V(x) = (x′x)/2. Observe that By (A2), the Markov chain α_n is independent of (ϕ_n, e_n), and I_{α_n = a_i} is F_n-measurable.
Since the transition matrix is of the form P^ε = I + εQ, we obtain Similarly, Note that |θ̃_n| = |θ̃_n| · 1 ≤ (|θ̃_n|² + 1)/2, so Since the signals {(ϕ_n, e_n)} are bounded, we have Applying (14) to (10), we arrive at Note also that, by (A3), To treat the first three terms in (15), we define the following perturbed Liapunov functions: By virtue of (A4), we have Note also that the irreducibility of Q implies that of I + εQ for sufficiently small ε > 0. Thus there is an N_ε such that for all n ≥ N_ε, |(I + εQ)^k − 1lν_ε| ≤ λ_c^k for some 0 < λ_c < 1, where 1l is the column vector of ones and ν_ε denotes the stationary distribution associated with the transition matrix I + εQ. The difference of the (j + 1 − n)-step and (j − n)-step transition matrices is given by The last line above follows from the fact that The foregoing estimates lead to Σ_{j=n}^∞ E_n(α_{j+1} − α_j) = O(ε), and as a result and similarly so all the perturbations can be made small. Now, we note that where and Using (11), we have Thus, in view of (A4) and Putting together (22)-(27), we establish that Likewise, we can obtain and Now define the perturbed Liapunov function W(θ̃, n). Using this along with (10), (16), (28)-(30), and the inequality O(µε) = O(µ² + ε²), we arrive at Choose µ and ε small enough that there is a λ_0 > 0 satisfying λ_0 ≤ λ and Then we obtain Taking expectations in the iteration for W(θ̃_n, n) and iterating on the resulting inequality yields Finally, applying (18)-(21) again, we also obtain Thus the desired result follows.
Convergence Properties

We assume the adaptation rate and the transition frequency are of the same order, that is, ε = O(µ). For simplicity, we take µ = ε. To study the asymptotic properties of the sequence {θ_n}, we take a continuous-time interpolation of the process: define θ^µ(t) := θ_n for t ∈ [nµ, nµ + µ). We proceed to prove that θ^µ(·) converges weakly to a system of randomly switching ordinary differential equations.
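The piecewise-constant interpolation used throughout the convergence analysis is easy to state in code. This is a small illustrative helper with names of our own choosing, not code from the paper.

```python
def interpolate(seq, mu, t):
    """Piecewise-constant interpolation: returns seq[n] for
    t in [n*mu, n*mu + mu), clamped at the end of the sequence."""
    n = int(t / mu)
    return seq[min(n, len(seq) - 1)]

# A toy sequence of iterates with stepsize mu = 0.5: the path
# theta^mu(.) holds each value on an interval of length mu.
theta_seq = [0.0, 1.0, 2.0, 3.0]
```

As µ shrinks, the same real-time window [0, T] covers T/µ iterates, which is why the interpolated paths capture the long-run behavior of the recursion.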
Theorem 4.1 Assume (A1)-(A4) hold and ε = µ. Then the process (θ^µ(·), α^µ(·)) converges weakly to (θ(·), α(·)) such that α(·) is a continuous-time Markov chain generated by Q and the limit process θ(·) satisfies a Markov-switched ordinary differential equation.
The theorem is established through a series of lemmas. We begin by using a truncation device to bound the estimates. Define S_N = {θ ∈ R^r : |θ| ≤ N}, the ball of radius N, and let q^N(·) be a truncation function that equals 1 for θ ∈ S_N, 0 for θ ∉ S_{N+1}, and is sufficiently smooth in between. We then modify algorithm (2) so that the resulting sequence of estimates is bounded. As before, define the corresponding interpolation θ^{N,µ}(·). We shall first show that the sequence {θ^{N,µ}(·), α^µ(·)} is tight, so that by Prohorov's theorem we may extract a convergent subsequence. We will then show the limit satisfies a switched differential equation. Lastly, we let the truncation bound N grow and show that the untruncated sequence given by (2) is also weakly convergent.
Proof of Lemma 4.2. The sequence α^µ(·) is tight by virtue of [33, Theorem 4.3]. In addition, α^µ(·) converges weakly to a Markov chain generated by Q. To proceed, we examine the asymptotics of the sequence θ^{N,µ}(·). We have that for any δ > 0 and t, s > 0 satisfying s ≤ δ, For any T < ∞ and any 0 ≤ t ≤ T, using E_t^µ to denote the conditional expectation with respect to the σ-algebra F_t^µ, we have Applying the tightness criterion [15, p. 47], the tightness is proved.
Proof. To derive the martingale limit, we need only show that, for each C¹ function f(·, i) with compact support, each bounded and continuous function h(·), each t, s > 0, each positive integer κ, and each t_i ≤ t for i ≤ κ, To verify (36), we use the processes indexed by µ. As before, note that Subdivide the interval with end points t/µ and (t + s)/µ − 1 by choosing m_µ such that m_µ → ∞ as µ → 0 but δ_µ = µm_µ → 0. By the smoothness of f(·, i), it is readily seen that, as µ → 0, Next, we insert a term to examine the changes in the parameter α and the estimate θ^N separately: lim_{µ→0} First, we work with the last term in (39). Using a Taylor expansion on each interval indexed by l, we have where θ^{N,+}_{lm_µ} is a point on the line segment joining θ^N_{lm_µ} and θ^N_{lm_µ+m_µ}. Since and ∇f′(·, i) is smooth, the last term in (40) is o(1) in probability as µ → 0. To work with the first term, we insert the conditional expectation E_k and apply Then, for small µ, Letting µlm_µ → τ, by (7), Likewise, we can obtain Combining (40) Next, letting N → ∞, we show that the limit θ(·) of the untruncated sequence and the limit of θ^N(·) as N → ∞ coincide. The argument is similar to that of [17, pp. 249-250]; we outline the main steps below. Let P_0(·) and P_N(·) be the measures induced by θ(·) and θ^N(·), respectively. Since the martingale problem with operator L_1^N has a unique solution, the associated differential equation has a unique solution for each initial condition, and P_0(·) is unique. For each T < ∞ and t ≤ T, P_0(·) agrees with P_N(·) on all Borel subsets of the set of paths in D[0, ∞) with values in S_N. Using P_0(sup_{t≤T} |θ(t)| ≤ N) → 1 as N → ∞ and the weak convergence of θ^{N,µ}(·) to θ^N(·), we conclude that θ^µ(·) converges weakly to θ(·). This completes the proof of Theorem 4.1.

Remark 4.4
The following calculation will be used for both the slow and fast Markov chain cases; it is essentially a result on two-time-scale Markov chains considered in [33]. Define the probability vector p_n^ε = (P(α_n = a_1), …, P(α_n = a_{m_0})) ∈ R^{1×m_0}. Note that p_0^ε = (p_{0,1}, …, p_{0,m_0}) is independent of ε. Because the Markov chain is time-homogeneous, (P^ε)^n is the n-step transition probability matrix with P^ε = I + εQ. Then, for some 0 < λ_1 < 1, where p(t) = (p_1(t), …, p_{m_0}(t)) is the probability vector of the continuous-time Markov chain with generator Q, satisfying dp(t)/dt = p(t)Q for all t ≥ 0 with initial probability p(0) = p_0. In addition, where, with t_0 = εn_0 and t = εn, Define the continuous-time interpolation α^ε(t) of α_n by α^ε(t) := α_n for t ∈ [nε, nε + ε).
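The approximation behind this remark — the n-step law of the discrete chain tracking the continuous-time law p(t) at t = εn — can be checked numerically. The generator and initial law below are illustrative choices; the matrix exponential is computed with a truncated Taylor series, which is adequate for the small, mildly scaled matrices used here.

```python
import numpy as np

def expm(A, terms=30):
    # Truncated Taylor series for the matrix exponential
    # (sufficient here because ||A|| is modest).
    out = np.eye(len(A))
    term = np.eye(len(A))
    for k in range(1, terms):
        term = term @ A / k
        out = out + term
    return out

# Illustrative irreducible generator and initial law.
Q = np.array([[-1.0,  1.0,  0.0],
              [ 0.5, -1.0,  0.5],
              [ 0.0,  1.0, -1.0]])
eps, n = 0.001, 2000              # t = eps * n = 2.0
p0 = np.array([1.0, 0.0, 0.0])

# Discrete-time evolution p_n = p_0 (I + eps*Q)^n ...
p_disc = p0 @ np.linalg.matrix_power(np.eye(3) + eps * Q, n)
# ... is close to the continuous-time law p(t) = p_0 exp(Qt).
p_cont = p0 @ expm(Q * eps * n)
```

The gap between the two laws is O(ε) on a fixed time horizon, consistent with the error terms carried through the two-time-scale estimates above.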

Slowly-Varying Markov Chain: ε ≪ µ
In this case, since the Markov chain changes so slowly, the time-varying parameter process is essentially a constant. To facilitate the discussion and to fix notation, we take ε = µ 1+∆ for some ∆ > 0 in what follows.
The analysis is similar to the ε = O(µ) case. Begin by defining the continuous-time interpolation as before. While a truncation device is still needed, we omit it and assume the iterates are bounded, for notational brevity. The tightness of {θ^µ(·)} can be verified as in Lemma 4.2. To characterize the weak limit, we note that the estimates from the previous section remain valid, except those involving the Markov chain α_k. Thus we need only examine (from the second-to-last line of (43)) To obtain the last line above, we have used that, for lm_µ ≤ k ≤ lm_µ + m_µ, since ε = µ^{1+∆}, We omit the details and present the main result.
Theorem 4.5 Assume (A1)-(A4) hold and ε = µ^{1+∆} for some ∆ > 0. Then θ^µ(·) converges weakly to θ(·) such that θ(·) is the unique solution of the limiting ordinary differential equation.

Fast-Varying Markov Chain: µ ≪ ε
The idea for the fast-varying chain is that the parameter changes so fast that its distribution quickly approaches the stationary distribution of the Markov chain. As a result, the limit dynamic system is one that is averaged out with respect to the stationary distribution of the Markov chain. In this section, we take ε = µ^γ with 1/2 < γ < 1. Then, letting µlm_µ → τ as in the proof of Theorem 4.1, we have ε(k − lm_µ) = µ^γ(k − lm_µ) → ∞. Thus, for some 0 < λ_1 < 1, where ν = (ν_1, …, ν_{m_0}) is the stationary distribution of the continuous-time Markov chain with generator Q, and Ξ_{ij}(s_1, s_2) denotes the ij-th entry of the matrix Ξ(s_1, s_2). Therefore, we can show that, as µ → 0,

Theorem 4.6 Assume (A1)-(A4) hold and ε = µ^γ for some 1/2 < γ < 1. Then θ^µ(·) converges weakly to θ(·) such that θ(·) is the unique solution of the limiting averaged differential equation.

Rates of Convergence

Scaled Errors: ε = µ

Define u_n := θ̃_n/√µ = (α_n − θ_n)/√µ. Then In view of Theorem 3.1, there is an N_µ such that E|α_n − θ_n|² = O(µ) for n ≥ N_µ, with which we can show that {u_n : n ≥ N_µ} is tight. In addition, take N_µ large enough that, by (19), We then define the continuous-time interpolation u^µ(·) of {u_n} and proceed to study its asymptotic distribution. As before, a truncation device may be employed; for notational simplicity, we omit it and assume the iterates are bounded.
Proof. Note that We have used the convention that t/µ denotes the integer part of t/µ in the above.
Use E_t^µ to denote the conditional expectation with respect to the σ-algebra F_t^µ = σ{u^µ(τ) : τ ≤ t}. Then, by (55), Now we examine Since E|θ̃_k|² = O(µ) for k large (µ small), in the last term of (58) we have For the first term, we use the mixing inequality of (A4), For any T < ∞ and any 0 ≤ t ≤ T, The following is a variant of the well-known central limit theorem for mixing processes; see [3] or [10] for details.
with covariance Σt, where the covariance matrix Σ is given by

Theorem 5.3 u^µ(·) converges weakly to u(·) such that u(·) is the solution of the limiting stochastic differential equation, where w(·) is a standard Brownian motion.
Proof. As usual, extract a convergent subsequence of u^µ(·) (still denoted by u^µ(·)) with limit u(·). We will show that for each s, t > 0, the limit process satisfies Note from (56), Define We then expand the (negative of the) summand indexed by i in (65) as Note that for the second term above we used g_k(a_i, i) = o(|θ̃_k|) = o(√µ|u_k|) by (A3). First, we show the last term in (66) is o(1). Since ∆_k(i) is a martingale difference, we have The boundedness of ϕ_k and u_k implies √µ ϕ_k′u_k → 0 in probability uniformly in k as µ → 0. Hence the first term in (67) is o(1). Using (A3) and (A4), along with the boundedness of u_k, the second term of (67) also tends to zero. Hence E Σ_{k=t/µ}^{(t+s)/µ−1} √µ ∆_k(i) → 0 as µ → 0.
Hence, putting the above estimates together we obtain
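Although the precise coefficients of the limit equation in Theorem 5.3 depend on the matrices above, its qualitative behavior can be illustrated with a scalar linear (Ornstein-Uhlenbeck-type) equation simulated by the Euler-Maruyama scheme. The drift coefficient A and noise intensity sigma below are illustrative stand-ins, not the paper's quantities.

```python
import numpy as np

def euler_maruyama(A, sigma, u0, dt, n, seed=0):
    """Simulate du = -A u dt + sigma dw by the Euler-Maruyama
    scheme (illustrative scalar linear SDE)."""
    rng = np.random.default_rng(seed)
    u = np.empty(n + 1)
    u[0] = u0
    for k in range(n):
        u[k + 1] = (u[k] - A * u[k] * dt
                    + sigma * np.sqrt(dt) * rng.standard_normal())
    return u

# Mean-reverting diffusion started away from equilibrium.
path = euler_maruyama(A=1.0, sigma=0.5, u0=2.0, dt=0.01, n=2000)
```

The path contracts toward zero at rate A while fluctuating with stationary standard deviation sigma/√(2A), mirroring how the scaled errors u^µ(·) settle into a diffusion around the tracked parameter.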

Scaled Errors: ε ≪ µ
The analysis for the cases ε ≪ µ and ε ≫ µ is similar to that for ε = O(µ); we omit the details and present the main results. Recall that in the ε ≪ µ case the parameter is essentially a constant, so we look to the initial distribution to determine the asymptotic properties. Define the scaled error sequence {v_n} and its interpolation v^µ(·) analogously. Then we have the following: v^µ(·) converges weakly to v(·) such that v(·) is the solution of the limiting stochastic differential equation, where w(·) is a standard Brownian motion.

Scaled Errors: ε ≫ µ
Again, the idea here is that the parameter varies so quickly that its distribution rapidly approaches the stationary distribution ν = (ν_1, …, ν_{m_0}) of the Markov chain. Thus we look to the expectation with respect to the stationary distribution to determine the asymptotic properties.
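Computing ν from a generator Q amounts to solving νQ = 0 subject to Σ_i ν_i = 1, which can be done by appending the normalization constraint to the linear system. The symmetric generator below is a hypothetical example, chosen so that the stationary law is uniform.

```python
import numpy as np

def stationary_distribution(Q):
    """Solve nu @ Q = 0 with sum(nu) = 1 for a generator Q by
    stacking the normalization row onto the transposed system."""
    m = len(Q)
    A = np.vstack([Q.T, np.ones(m)])
    b = np.zeros(m + 1)
    b[-1] = 1.0
    nu, *_ = np.linalg.lstsq(A, b, rcond=None)
    return nu

# Symmetric illustrative generator: uniform stationary law.
Q = np.array([[-2.0,  1.0,  1.0],
              [ 1.0, -2.0,  1.0],
              [ 1.0,  1.0, -2.0]])
nu = stationary_distribution(Q)
```

Irreducibility of Q guarantees that this system has a unique solution, which is the ν used in the averaged limit dynamics.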
We have the following result.
Numerical Examples

In the simulations, Q is the generator of a continuous-time Markov chain whose stationary distribution is ν = (1/3, 1/3, 1/3); hence ᾱ = Σ_{i=1}^3 a_i ν_i = 0. We take the initial distribution of α_0 to be (3/4, 1/8, 1/8), so that α* = Σ_{i=1}^3 a_i P(α_0 = a_i) = −0.625. The sequences {ϕ_n} and {e_n} are i.i.d. N(0, 1) and N(0, 0.25), respectively. We observe 1000 iterations of the algorithm in each of the cases. To observe the tracking behavior of the SE algorithm in comparison with the SR and LMS algorithms, we overlay the respective plots for each case. When ε = O(µ), the LMS and SR estimates tend to be approximately equal, while the SE estimates show more deviation from the other two. The SE algorithm responds to changes in the parameter more quickly, whereas the LMS and SR algorithms adhere to the parameter more closely while it is stationary. In the ε ≪ µ case, this behavior is repeated, and all three estimates track the parameter. In the ε ≫ µ case, none of the algorithms tracks the parameter well at each iterate. However, when we observe the scaled error z_n against the stationary distribution of the Markov chain, the expected diffusion behavior is displayed. Examining the cumulative averages of the parameter and of the estimates, we note that the parameter average quickly converges to ᾱ. The LMS and SR estimate averages adhere closely to the parameter average, while the SE estimate average deviates slightly more.
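A rough sketch of the comparison described above might look as follows. The state values a_i, the generator Q, and the scalar (r = 1) setting are illustrative assumptions of ours; the paper's exact state space and generator are not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(42)
a = np.array([-1.0, 0.0, 1.0])        # hypothetical parameter states
Q = np.array([[-2.0,  1.0,  1.0],
              [ 1.0, -2.0,  1.0],
              [ 1.0,  1.0, -2.0]])    # illustrative generator, nu uniform
mu = eps = 0.05                        # the eps = O(mu) case
P = np.eye(3) + eps * Q

state = rng.choice(3, p=[0.75, 0.125, 0.125])   # initial law (3/4, 1/8, 1/8)
th_se = th_sr = th_lms = 0.0                     # scalar (r = 1) estimates
alpha_path, se_path = [], []
for _ in range(1000):
    phi = rng.standard_normal()                        # phi_n ~ N(0, 1)
    y = phi * a[state] + 0.5 * rng.standard_normal()   # e_n ~ N(0, 0.25)
    th_se += mu * phi * np.sign(y - phi * th_se)       # sign-error (SE)
    th_sr += mu * np.sign(phi) * (y - phi * th_sr)     # sign-regressor (SR)
    th_lms += mu * phi * (y - phi * th_lms)            # plain LMS
    alpha_path.append(a[state])
    se_path.append(th_se)
    state = rng.choice(3, p=P[state])
```

Overlaying the three estimate paths against alpha_path reproduces the qualitative comparison: the SE update moves by a fixed-magnitude step per iterate, so it reacts quickly to jumps but fluctuates more while the parameter is constant.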