Convergence rate for the moving least-squares learning with dependent sampling

We consider the moving least-squares (MLS) method within the regression learning framework under the assumption that the sampling process satisfies the α-mixing condition. We carry out a rigorous error analysis by using probability inequalities for dependent samples in the error estimates. When the dependent samples satisfy an exponential α-mixing condition, we derive a satisfactory learning rate and error bound for the algorithm.


Introduction
The least-squares (LS) method is an important global approximation method based on regular or concentrated data sample points. However, irregular or scattered samples also arise in many practical applications such as engineering and machine learning [1–4], and they too need to be analyzed in order to exploit their particular usefulness. The moving least-squares (MLS) method was introduced by McLain in [4] to draw a set of contours based on a cluster of scattered data sample points. It has turned out that the MLS method is a useful local approximation tool in various fields of mathematics such as approximation theory, data smoothing [5], statistics [6], and numerical analysis [7]. Recently, research effort has been devoted to studying the regression learning algorithm based on the MLS method; see [8–12]. The main advantage of the MLS regression learning algorithm is that the regression function can be learned in a simple function space, usually generated by polynomials.
We briefly recall the regression learning problem for the MLS method. Functions for learning are defined on a compact metric space X (the input space) and take values in Y = ℝ (the output space). The sampling process is governed by an unknown Borel probability measure ρ on Z = X × Y. The regression function is defined by
$$f_\rho(x) = \int_Y y \, d\rho(y|x), \qquad x \in X,$$
where ρ(·|x) is the conditional probability measure induced by ρ on Y given x ∈ X. The goal of regression learning is to find a good approximation of the regression function $f_\rho$ from a set of random samples $\mathbf z = \{z_i\}_{i=1}^{m} = \{(x_i, y_i)\}_{i=1}^{m} \subset Z^m$. The approximation $f_z$ of $f_\rho$ is defined pointwise through algorithm (1.1): for each $x \in X$ we set $f_z(x) := f_{z,\sigma,x}(x)$, where $f_{z,\sigma,x}$ minimizes the local moving empirical error over the hypothesis space. Here the hypothesis space $H \subseteq C(X)$ is a $\bar d$-dimensional Lipschitz function space, $\sigma = \sigma(m) > 0$ is a window width, and $\Phi : \mathbb R^n \times \mathbb R^n \to \mathbb R_+$ is called an MLS weight function; it is required to satisfy the conditions of [9, 10], in which the constants $q > n + 1$ and $c_q, \bar c_q > 0$ appear. The task of this paper is to derive a bound on the error $\|f_z - f_\rho\|_{\rho_X}$, with the norm $\|f\|_{\rho_X} := \big(\int_X |f(x)|^2 \, d\rho_X\big)^{1/2}$, in order to evaluate the approximation ability of $f_z$; see [13–22]. The error analysis of algorithm (1.1) for independent and identically distributed (i.i.d.) samples has been carried out in [8–10]. However, in some real data analysis tasks, such as market prediction, system diagnosis, and speech recognition, the samples are not independent, although they are not far from being independent. Mixing conditions quantify how close to independence a sequence of random samples is. In [14, 16, 23–25], the authors carried out the regression estimation of the least-squares algorithm with α-mixing samples. Up to now, no result on algorithm (1.1) has been obtained in the case of dependent samples. Hence we extend the analysis of algorithm (1.1) to the α-mixing sampling setting, which is quite easy to establish in practice; see [26].
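To make the pointwise construction concrete, the following is a minimal sketch of an MLS regression estimate, assuming a one-dimensional input space X = [0, 1], a hypothesis space of polynomials of degree at most one, and a compactly supported Wendland-type weight standing in for the MLS weight function; the function names, the noise model, and the choice σ = m^(−1/3) are illustrative and not taken from the paper.

```python
import numpy as np

def wendland_weight(t):
    """Compactly supported, non-negative weight: (1 - |t|)^4_+ * (4|t| + 1)."""
    t = np.abs(t)
    return np.where(t < 1.0, (1.0 - t) ** 4 * (4.0 * t + 1.0), 0.0)

def mls_fit(x, sample_x, sample_y, sigma, degree=1):
    """Evaluate the moving least-squares estimate at a single point x.

    Minimizes the locally weighted empirical error
        (1/m) * sum_i w((x_i - x)/sigma) * (p(x_i) - y_i)^2
    over polynomials p of the given degree and returns p(x).
    """
    weights = wendland_weight((sample_x - x) / sigma)
    if np.count_nonzero(weights) <= degree:
        # Too few points fall in the window: fall back to a (nearly) local average.
        return np.average(sample_y, weights=weights + 1e-12)
    # Vandermonde design matrix for the polynomial basis 1, u, ..., u^degree.
    V = np.vander(sample_x, N=degree + 1, increasing=True)
    sw = np.sqrt(weights)
    coef, *_ = np.linalg.lstsq(sw[:, None] * V, sw * sample_y, rcond=None)
    return np.polyval(coef[::-1], x)

# Usage: noisy samples of f_rho(x) = sin(2*pi*x); window width sigma = m**(-1/3).
rng = np.random.default_rng(0)
m = 200
sample_x = rng.uniform(0.0, 1.0, m)
sample_y = np.sin(2 * np.pi * sample_x) + 0.1 * rng.standard_normal(m)
sigma = m ** (-1.0 / 3.0)
print(mls_fit(0.5, sample_x, sample_y, sigma))  # should be close to sin(pi) = 0
```

A larger σ averages over more neighbors (smaller variance, larger bias), which is why the analysis below couples the window width σ = σ(m) to the sample size.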
Let $\mathcal M_a^b$ denote the σ-algebra of events generated by the random samples $\{z_i\}_{i=a}^{b}$. The sequence $\{z_i\}_{i \ge 1}$ is said to satisfy a strongly mixing condition (or α-mixing condition) if
$$\alpha(k) := \sup_{j \ge 1}\ \sup_{A \in \mathcal M_1^{j},\, B \in \mathcal M_{j+k}^{\infty}} \big|P(A \cap B) - P(A)P(B)\big| \to 0 \quad \text{as } k \to \infty.$$
Specifically, if there exist positive constants α > 0, β > 0, and c > 0 such that
$$\alpha(k) \le c \exp\{-\beta k^{\alpha}\}, \qquad k \ge 1, \qquad (1.7)$$
then it is said to satisfy an exponential strongly mixing condition.
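As an illustration of hypothesis (1.7), stationary Gaussian autoregressive processes are a standard source of exponentially α-mixing samples. The sketch below draws dependent inputs from such a chain and attaches noisy outputs; the AR(1) coefficient, the noise level, and the target function tanh are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def ar1_mixing_sample(m, a=0.5, noise=0.1, seed=0):
    """Draw m dependent samples (x_i, y_i) with inputs from a stationary Gaussian AR(1) chain."""
    rng = np.random.default_rng(seed)
    x = np.empty(m)
    x[0] = rng.standard_normal() * np.sqrt(1.0 / (1.0 - a ** 2))  # stationary initial law
    for i in range(1, m):
        x[i] = a * x[i - 1] + rng.standard_normal()  # dependence decays geometrically in the lag
    f_rho = np.tanh  # a bounded target regression function (illustrative)
    y = f_rho(x) + noise * rng.standard_normal(m)
    return x, y

x, y = ar1_mixing_sample(500)
# The lag-1 correlation is about a = 0.5, and correlations at lag k decay like a**k,
# mirroring the exponential decay of dependence that the alpha-mixing coefficients quantify.
print(np.corrcoef(x[:-1], x[1:])[0, 1])
```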
Our goal is to obtain the convergence rate of algorithm (1.1) as m → ∞ under hypothesis (1.7). The rest of the paper is organized as follows. In Sect. 2, we review some concepts and state our main results and the error decomposition. In Sect. 3, we present the estimate of the sample error. In Sect. 4, we provide the proofs of the main results.

Main results and error decomposition
Before giving the main results, we first need to recall some concepts that will be referred to throughout this paper; see [8–10].

Definition 2.2
We say that the hypothesis space H satisfies the norming condition with exponent ζ > 0 and $d \in \mathbb N$ if we can find points for which the corresponding norming inequality holds with constants $\sigma_0 > 0$ and $c_H > 0$, where d is chosen to be at least the dimension $\bar d$ of H.
Here we assume that |y| ≤ M almost surely, and all the constants such as $C$, $C_{H,\zeta}$, $A_{\tau,\zeta}$, $C_{H,\rho_X}$, and so on are independent of the key parameters δ, m, and σ throughout this paper. We now give our main results for algorithm (1.1).
Then we can obtain an explicit learning rate for algorithm (1.1) by selecting a suitable parameter σ = σ(m).

Theorem 2.2 Under the assumptions of Theorem 2.1, with a suitable choice of σ = σ(m), algorithm (1.1) attains the learning rate stated in (2.7).

Remark 2.1 The above theorem shows that the learning rate tends to $m^{-1/2}$ when σ → 1. For the i.i.d. case, the same rate has been obtained in [9, 10].
To estimate the total error $\|f_z - f_\rho\|_{\rho_X}$, we use the proposition from [8] stated below, which involves the local moving expected risk. By this proposition, we only need to provide an upper bound for the integral in (2.8). To do this, we decompose it as follows; what is left is then to estimate the sample error S(z, σ).
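For orientation, decompositions of this kind typically split the local excess risk into a sample part and a term that is nonpositive by the definition of the minimizer. A schematic version, in which $\mathcal E_{\sigma,x}$ and $\mathcal E_{z,\sigma,x}$ are our shorthand for the local moving expected and empirical risks (the paper's own display may group the terms differently), reads
$$\mathcal E_{\sigma,x}(f_{z,\sigma,x}) - \mathcal E_{\sigma,x}(f_{H,\sigma,x}) = \underbrace{\big[\mathcal E_{\sigma,x}(f_{z,\sigma,x}) - \mathcal E_{z,\sigma,x}(f_{z,\sigma,x})\big] + \big[\mathcal E_{z,\sigma,x}(f_{H,\sigma,x}) - \mathcal E_{\sigma,x}(f_{H,\sigma,x})\big]}_{\text{sample error}} + \underbrace{\big[\mathcal E_{z,\sigma,x}(f_{z,\sigma,x}) - \mathcal E_{z,\sigma,x}(f_{H,\sigma,x})\big]}_{\le\, 0},$$
where the last bracket is nonpositive because $f_{z,\sigma,x}$ minimizes the empirical risk over H, so the sample error S(z, σ) is the only random quantity that has to be controlled.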

Estimates for the sample error
In order to obtain the probability estimate of S(z, σ), we shall use upper bounds for $f_{z,\sigma,x}$ and $f_{H,\sigma,x}$. We first derive the confidence-based estimate for $f_{z,\sigma,x}$ as follows.
Then, with confidence at least 1 − δ, the stated bound on $f_{z,\sigma,x}$ holds. The proof is analogous to that of Theorem 3 in [8], except that we need to use the following Lemma 3.1 for the dependent sampling setting in place of Lemma 2 in [8].

Proposition 3.2 Suppose that
It then follows from the above proposition that, with confidence at least 1 − δ, the stated chain of estimates holds. Observe from $X \subseteq \bigcup_{j=1}^{N} B(v_j, \frac{r}{2})$ that for each $x \in X$ there exists some $j \in \{1, \ldots, N\}$ such that $x \in B(v_j, \frac{r}{2})$, i.e., $|v_j - x| \le \frac{r}{2}$. This proves Lemma 3.1.
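As a point of reference for the covering step, if, for instance, X is contained in the unit cube $[0,1]^n$, then balls of radius r/2 centered at the points of a grid of mesh $r/\sqrt{n}$ cover X, so one may take
$$N \le \Big(\Big\lceil \frac{\sqrt{n}}{r}\Big\rceil + 1\Big)^{n};$$
only the finiteness of N and the property $|v_j - x| \le r/2$ are used above, so any other choice of the centers $v_j$ works equally well.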
Now we are in a position to prove Proposition 3.1.
Finally, combining (3.19) with the following inequality we derive the desired result.
We also need to invoke Lemma 4 in [8], which provides the upper bound for $f_{H,\sigma,x}$.

Proposition 3.3 Assume that (2.1) and (2.2) hold. Then, for some constant C, the bound (3.21) on $f_{H,\sigma,x}$ holds.

Next we will bound the sample error. The estimate for S(z, σ) relies on the following ratio probability inequality for dependent samples, which can be found in [27].

Proposition 3.4 Suppose that (1.7) holds. Let G be a set of functions on Z and let c > 0 be such that, for each $g \in G$, $\mu(g) = \int_Z g(z)\, d\rho \ge 0$, $\mu(g^2) \le c\,\mu(g)$, and $|g(z) - \mu(g)| \le D$ almost surely. Then, for every ε > 0 and 0 < α ≤ 1, the ratio probability inequality of [27] holds.
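For orientation only, ratio inequalities of this type for α-mixing samples have the schematic shape
$$\mathrm{Prob}\bigg\{\sup_{g \in G} \frac{\mu(g) - \frac{1}{m}\sum_{i=1}^{m} g(z_i)}{\sqrt{\mu(g) + \varepsilon}} > \alpha\sqrt{\varepsilon}\bigg\} \le \big(1 + 4e^{-2}\alpha\big)\, \mathcal N(G, c_1\alpha\varepsilon)\, \exp\Big\{-\frac{c_2\,\alpha^{2}\, m^{(\alpha)}\,\varepsilon}{c + D}\Big\},$$
where $c_1, c_2$ are placeholder constants and $m^{(\alpha)}$ is the effective number of observations determined by the mixing exponents in (1.7); the precise form used below is the one quoted from [27], and $m^{(\alpha)}$ is the quantity that replaces m in the corresponding i.i.d. bounds.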
We obtain the upper bound estimate for S(z, σ ) by using Proposition 3.4.

Proposition 3.5
If the assumptions of Proposition 3.1 hold, then the sample error S(z, σ) admits the stated high-probability bound.

Proof Let the function g(u, y) be defined on Z and consider the function set $G_R$. With condition (1.5) and the bound $c_\rho$ of the density function of $\rho_X$, we obtain (3.33). Since, for any $g_1, g_2 \in G_R$, the difference of the corresponding functions is bounded by
$$4RD\sigma^{n}\,\big|f_1(u) - f_2(u)\big|, \qquad (3.34)$$
we then obtain (3.35). It follows from (3.33) that (3.36) holds. We set the term
$$\big(1 + 4e^{-2}\alpha\big)\, \mathcal N\Big(B_1, \frac{\varepsilon}{16 R^2 D \sigma^{n}}\Big) \exp\Big\{-\frac{3\, m^{(\alpha)}\, \varepsilon}{2048\, R^2 D \sigma^{n}}\Big\}$$
in the above inequality equal to δ/2. We also need to invoke the following lemma, which is proved by the same method as Proposition 4.3 in [21].

Lemma 3.2 Let $\eta^{*}(m^{(\alpha)}, \delta)$ be the smallest positive solution of the following inequality in η. If $\log \mathcal N(B_1, \eta) \le c_p \eta^{-p}$ for some $p \in (0, 2)$, $c_p > 0$, and all η > 0, then with confidence at least