Forward Selection of Relevant Factors by Means of MDR-EFE Method

Bulinski, Alexander

doi:10.3390/math12060831

Open AccessArticle

Forward Selection of Relevant Factors by Means of MDR-EFE Method

by

Alexander Bulinski

Faculty of Mathematics and Mechanics, Lomonosov Moscow State University, Leninskie Gory 1, 119991 Moscow, Russia

Mathematics 2024, 12(6), 831; https://doi.org/10.3390/math12060831

Submission received: 20 January 2024 / Revised: 5 March 2024 / Accepted: 8 March 2024 / Published: 12 March 2024

(This article belongs to the Special Issue New Trends in Stochastic Processes, Probability and Statistics)

Download Versions Notes

Abstract

:

The suboptimal procedure under consideration, based on the MDR-EFE algorithm, provides sequential selection of relevant (in a sense) factors affecting the studied, in general, non-binary random response. The model is not assumed linear, the joint distribution of the factors vector and response is unknown. A set of relevant factors has specified cardinality. It is proved that under certain conditions the mentioned forward selection procedure gives a random set of factors that asymptotically (with probability tending to one as the number of observations grows to infinity) coincides with the “oracle” one. The latter means that the random set, obtained with this algorithm, approximates the features collection that would be identified, if the joint distribution of the features vector and response were known. For this purpose the statistical estimators of the prediction error functional of the studied response are proposed. They involve a new version of regularization. This permits to guarantee not only the central limit theorem for normalized estimators, but also to find the convergence rate of their first two moments to the corresponding moments of the limiting Gaussian variable.

Keywords:

feature selection; relevant factors; MDR-EFE method; forward selection; suboptimal procedures; statistical estimators of error functional (of a response); regularized estimators; CLT; convergence of estimators moments

MSC:

62G20; 62H12; 62J02; 62L12

1. Introduction

This paper is dedicated to the eminent scientist Professor A.S. Holevo, academician of the Russian Academy of Sciences, on occasion of his remarkable birthday.

The classical problem of regression analysis consists in the search for deterministic function f, which, in a certain sense, “well” approximates the observed random variable (response) Y by the value

f (X)

, where

X = (X_{1}, \dots, X_{p})

is a vector of factors influencing the behavior of Y. This approach was initiated by the works of A.-M. Legendre and K. Gauss. At that time it found application in the processing of astronomical observations. Nowadays one widely uses the methods involving the appropriate choice of unknown real coefficients

β_{1}, \dots, β_{p}

for a linear model of the form

Y = \sum_{i = 1}^{p} β_{i} X_{i} + ε

, where

ε

describes a random error. Clearly,

X_{0} = 1

can be included in the collection of factors, then

Y = β_{0} + \sum_{i = 1}^{p} β_{i} X_{i} + ε

. For example, books [1,2] are devoted to regression. The close tasks also arise in observations classification, see, e.g., [3].

Since the end of the 20th century, stochastic models have been studied where the random response Y depended only on some subset of the factors in the set of

X_{1}, \dots, X_{p}

. So, in article [4], the LASSO method (Least Absolute Shrinkage and Selection Operator) was introduced, using the idea of regularization (going back to A.N.Tikhonov), which allowed to find factors included with non-zero coefficients in a “sparse” linear model. Somewhat earlier, this approach was used by several authors for the treatment of geophysical data. Generalizations of the mentioned method are considered in monograph [5]. We emphasize that the idea of identifying some of the factors having a principle (in a certain sense) impact on a response is also intensely developing within the framework of nonlinear models. Such direction of modern mathematical statistics is called Feature Selection (FS), i.e., the choice of features (variables, factors). In this regard, we refer, e.g., to monographs [6,7,8,9] and also to reviews [10,11,12,13,14]. In [10] the authors consider filter, wrapper and embedded methods of FS. They concentrate on feature elimination and also demonstrate the application of FS technique on standard datasets. In [11] the modern mainstream dimensionality reduction methods are analyzed including ones for small samples and those based on deep learning. In [12] FS machinery is considered based on filtering methods for detecting the cyber attacks. Survey [13] is devoted to FS methods in machine learning (the structured information is contained in 20 tables). The authors of [14] concentrate on applications of FS to stock market prediction and applications of FS in the analysis of credit risks are considered, e.g., in [15]. Beyond financial mathematics the choice of relevant factors is very important in medicine and biology. For instance, in the field of genetic data analysis there is an extensive research area called GWAS (Genome-Wide Association Studies) aimed at studying the relationships between phenotypes and genotypes, see, e.g., [16,17]. The authors of [18] provide the survey of starting methods used by genetic algorithms. Review [19] is devoted to the FS methods for predicting the risk of diseases. Thus, research in the field of FS is not only of theoretical interest, but also admits various applications.

Note that there are a number of complementary methods for identifying relevant factors. Much attention is paid to those employing the basic concepts of information theory such as entropy, mutual information, conditional mutual information, interaction information, various divergences, etc. Here statistical estimation of information characteristics plays an important role. One can mention, e.g., works [20,21]. In this article, the accent is made on identifying a set of relevant factors in the framework of a certain stochastic model, when the quality of the response approximation is evaluated by means of some metric.

Recall that J.B. Herrick in 1910 described the Sickle cell anemia (HbS). Later it was discovered that all clinical manifestations of the presence of HbS are the consequences of the single change in the B-globin gene. This famous example shows that even the search of a single feature having impact on a disease is reasonable. Nowadays the researchers concentrate on complex diseases provoked by several disorders of the human genome. Even identification of two SNPs (single nucleotide polymorphisms) having impact on a certain disease is of interest, see, e.g., [22].

Now we turn to the description of the studied mathematical model. All the considered random variables are defined on a probability space

(Ω, F, P)

. Let a random variable Y map

Ω

to some finite set

Y

. We assume that, for

k \in T : = {1, \dots, p}

, a random variable

X_{k} : Ω \to M_{k}

, where

M_{k}

is an arbitrary finite set. Then the vector

X = (X_{1}, \dots, X_{p})

takes the values in

X = M_{1} \times \dots \times M_{p}

. For a set

S = {i_{1}, \dots, i_{r}}

, where

1 \leq i_{1} < \dots < i_{r} \leq p

, we put

X_{S} : = (X_{i_{1}}, \dots, X_{i_{r}})

. Similarly, for

x \in X

,

x_{S}

denotes a vector

(x_{i_{1}}, \dots, x_{i_{r}})

. A collection of indices

S \subset T

(the symbol ⊂ is everywhere understood as a non-strict inclusion) is called relevant if the following relation holds for any

x \in X

and

y \in Y

:

P (Y = y | X = x) = P (Y = y | X_{S} = x_{S}),

(1)

whenever

P (Y = y | X = x) \neq 0

. In this case, the set of factors

X_{S}

is called relevant as well. If (1) takes place for some

S = S_{0}

then it will be obviously valid for any S containing

S_{0}

. Therefore, the natural desire is to identify a set S that satisfies (1) and has cardinality

r < p

(if such a set other than T exists). Note that there are different definitions of the relevant factors collection, see, e.g., [23,24] and the references therein.

It is assumed that a collection of relevant factors has r elements

(1 \leq r < p

), however, the set S itself, which appears in (1), is unknown and should be identified. We label this assumption as (A). There is no restriction that S satisfying (1) and containing r elements is unique. Usually the joint distribution of

(X, Y)

is also unknown. Therefore, a statistical estimator of S is constructed based on the first N observations

ξ_{N} : = (ξ^{(1)}, \dots, ξ^{(N)})

of a sequence

ξ^{(1)}, ξ^{(2)}, \dots

, consisting of i.i.d. random vectors, where, for

k \in N

,

ξ^{(k)} : = (X^{(k)}, Y^{(k)})

has the same distribution as the vector

(X, Y)

.

In 2001, the authors of [25] proposed a method for identifying relevant factors, called MDR (Multifactor Dimensionality Reduction). According to article [26], more than 800 publications were devoted to the development of this method and its applications in the period from 2001 to 2014. Research in this direction has continued over the last decade, see, e.g., [27,28,29]. In [30], for the binary response Y, a modification of the MDR method was introduced, namely, MDR-EFE (Error Function Estimation), based on statistical estimates of the error functional of the response prediction using the K-fold cross-validation procedure, see also [31]. Later this method was extended in [32] to study the non-binary response.

Recall how the MDR-EFE method is employed. Let a non-random function

f : X \to Y

be used to predict the response Y by the values of the factors vector X. Further we exclude considering the trivial case when

Y = y_{0}

with probability one for some

y_{0} \in Y

(hence, X and Y are independent). The prediction quality is determined by applying the following error functional

Err (f) : = E | Y - f (X) | ψ (Y),

(2)

where a penalty function

ψ : Y \to R_{+}

. The functional

Err

takes finite values for the discrete X and Y under consideration. The function

ψ

allows to take into account the importance of approximating a particular value of Y using

f (X)

.

In biomedical research, one often considers the binary response Y characterizing the patient’s state of health, say, the value

Y = 1

corresponds to illness, and

Y = - 1

means that the patient is healthy. In many situations it is more important to consider the disease detection, so the value of 1 is attributed more weight. Of interest is the situation when

Y = {- 1, 0, 1}

. Then the value 0 describes some intermediate state of uncertainty (“gray zone”). Following [32], we will consider a more general scheme when the set

Y : = {- m, \dots, 0, \dots, m}

for some

m \in N

. Lemma 1 in [32] describes for such model all optimal functions

f_{o p t}

that deliver a minimum to the error functional (2). Note that we can suppose that the set of values of Y is strictly contained in

{- m, \dots, m}

, i.e., some values are accepted with zero probability. For such y, we assume that

ψ (y) = 0

. Thus, it is possible to study Y taking values in an arbitrary finite subset of

Z

. In order to simplify the notation, we further consider

P (Y = y) > 0

for all

y \in Y = {- m, \dots, m}

.

It is proved that in the framework of model (1) the relation

f_{o p t} = f^{S}

is valid, where, for

x \in X

and

U \subset T

,

f^{U} (x) = f (x_{U})

and a function f is constructed in a due way. At the same time, for any

U \subset T

such that

♯ U = ♯ S

(♯ denotes the cardinality of a finite set) and S appearing in (1), the following inequality is true:

Err (f^{S}) \leq Err (f^{U}) .

(3)

For

U \subset T

, the function

f^{U}

is introduced further. It depends on the joint distribution of

(X, Y)

which is usually unknown. Thus we use observations

ξ_{N} = {(X^{(j)}, Y^{(j)}),

j = 1, \dots, N}

for statistical estimates of the functional

Err (f^{U})

, where

U \subset T

, and then select as an estimator of S the set U on which the minimum of the corresponding statistical estimate is attained. This approach is described in the next section of the article.

We underline that consideration of all subsets (of the set T) having the cardinality r in the mentioned comparison procedure (involving regularized estimators, as explained in Section 2) for statistical estimates of the error functional is practically unfeasible, when p is large and r is moderately large. Therefore, a number of suboptimal methods of sequential feature selection have emerged. Such methods are used in various approaches to identify sets of relevant factors.

Mainly, one aims either to sequentially add indexes at each step of the algorithm for constructing a statistical estimator of a set S appearing in (1), or to sequentially exclude features from the general set T. In [33], algorithms of forward selection, i.e., sequential addition of indexes to the initial set, based on information theory, are considered. The authors of [33] show that the various algorithms employed can be interpreted as procedures based on proper approximations of the certain objective function. In [34] the principle attention is paid to simple models describing the phenomenon of epistasis observed in genetics, when individual factors do not affect the response, and some combinations of them lead to essential effects (in statistics one says “synergy interaction” of factors). Besides we also demonstrated that a number of well-known algorithms, for instance, mRMR (Minimum Redundancy Maximum Relevance) using mutual information and/or interaction information with a sequential procedure for selecting relevant factors can lead to the identification of the desired set with probability which is negligibly small. In [35] a variant is proposed for sequential (forward) application of the MDR-EFE method within the binary response model involving the naive Bayesian classifier scheme. The latter means that, for any

y \in {- 1, 1}

and all

x \in X

, the following relation holds:

P (X = x | Y = y) = \prod_{k = 1}^{p} P (X_{k} = x_{k} | Y = y) .

(4)

In other words, the factors

X_{1}, \dots, X_{p}

are conditionally independent for a given response Y. In [35] the joint distribution of X and Y was assumed known.

The principle goal of our work is to derive, for a non-binary, in general, random response, the probability that a sequential selection of features based on the (forward) application of the MDR-EFE method, without assuming the validity of (4), leads to identifying a suboptimal set that would be constructed by means of the same method from observations with a known joint distribution of the response and the vector of factors.

This result builds on the central limit theorem (CLT) for statistical estimates of the prediction error functional for a possibly non-binary response, proved in [32], which extends the CLT for the binary response model studied by the author previously. In addition, for the purposes of this work, we found the convergence rate of the first two moments of the considered statistics to the corresponding moments of the limiting Gaussian variable as the number of observations tends to infinity.

The article has the following structure. Section 2 describes statistical estimates of the error functional (for a response prediction) based on the MDR-EFE method. We also introduce the regularized versions of these estimators. In Section 3, the convergence rate of the first two moments of the regularized estimators of the error functional to the corresponding moments of the limiting Gaussian variable is established. Section 4 contains the main result related to the forward selection of relevant factors. The concluding remarks are given in Section 5. The proof of elementary Lemma 2 is provided in Appendix A for completeness of exposition.

2. Error Functional Estimators

Consider, in general, a non-binary response, i.e., let

Y : = {- m, \dots, 0, \dots, m}

for some

m \in N

. In the framework of the introduced discrete model, Lemma 1 of [32] gives a complete description of the class of optimal functions

f_{o p t}

providing the minimum error

Err (f)

, determined by (2), in the class of all functions

f : X \to Y

. To define such a function (included in the optimal class) for

x \in X

, we deal with a vector

w (x)

having components

w_{y} (x) : = ψ (y) P (Y = y, X = x), y \in Y .

It can be easily seen that

Err (f) = \sum_{y, z \in Y} | y - z | ψ (y) P (Y = y, f (X) = z) = \sum_{z \in Y} \sum_{x \in A_{z}} w^{⊤} (x) q (z),

(5)

where

A_{z} : = {x \in X : f (x) = z}

,

q (z)

is a column of

(2 m + 1) \times (2 m + 1)

matrix Q having elements

q_{y, z} : = | y - z |

(the element

q_{- m, - m}

is located in the upper left corner of the matrix Q), ⊤ stands for the transposition of column vectors. In other words, one employs in (5) the scalar product of the vectors

w (x)

and

q (z)

. Thus, search for an optimal function

f_{o p t}

means finding the partition of

X

into such sets

A_{z}

,

z \in Y

, that provide the minimum value of the right-hand side of (5). Note also that, according to Formula (13) of [32], the error of response prediction can be written as follows:

Err (f) = \sum_{i = 0}^{2 m - 1} \sum_{i - m < | y | \leq m} ψ (y) P (Y = y, | f (X) - y | > i) .

(6)

Let, for

y \in Y

, the vector

Δ (y)

have the first

m + y

components equal to 1, and the remaining

m - y + 1

components equal to

(- 1)

. For any

x \in X

, we introduce a vector

L (x)

with

2 m

components having the form

L_{y} (x) : = w^{⊤} (x) Δ (y), y \in Y, y > - m .

(7)

According to formula (11) of [32] one infers that

f_{o p t} (x) = y ⟺ \{\begin{matrix} L_{- m + 1} (x) \geq 0, & y = - m, \\ L_{y + 1} (x) \geq 0, L_{y} (x) < 0, & y \neq \pm m, \\ L_{m} (x) < 0, & y = m . \end{matrix}

(8)

The joint distribution of

(X, Y)

is, in general, unknown. Therefore, the optimal function

f_{o p t}

cannot be found in practice, so an algorithm is used to predict it, i.e., to approximate by means of specified statistical estimators. The response prediction algorithm is defined as a function

{\hat{f}}_{P A} = {\hat{f}}_{P A} (x, ξ (W))

given for

x \in X

and a set of observations

ξ (W) : = {ξ^{(j)} = (X^{(j)}, Y^{(j)}), j \in W}, W \subset N, ♯ W < \infty .

(9)

The function

{\hat{f}}_{P A}

takes values in the set

Y

. It is assumed that the value of

{\hat{f}}_{P A} (x, ξ (W))

becomes close, in a certain sense, to

f (x)

for x in a specified subset of the set

X

when W is sufficiently “massive”. More precisely, we consider a family of functions

{\hat{f}}_{P A}

that depend on sets

ξ (W)

of different cardinalities, but we will not complicate the notation. Consider

M = {x \in X : P (X = x) > 0}

. For

x \in X

,

U \subset T

and

y \in Y

, introduce a vector

w^{U} (x)

with components

w_{y}^{U} (x) : = \{\begin{matrix} ψ (y) P (Y = y, X_{U} = x_{U}), & x \in M, \\ 0, & x \notin M . \end{matrix}

Set

L_{y}^{U} (x) : = {(w^{U} (x))}^{⊤} Δ (y), y \in Y, y > - m .

(10)

For

U \subset T

, let

f^{U}

be defined by means of a counterpart of formula (8), where

L_{y}^{U} (x)

is now written instead of

L_{y} (x)

. Then, according to Section 5 of [32] (the notation

α

is used there instead of U), in the framework of model (1), the optimal function

f_{o p t} = f^{S}

, where S appears in (1) and

♯ S = r

. Therefore relation (3) is valid for

f^{U}

corresponding to any

U \subset T

with

♯ U = r

(the assumption (A) holds).

To introduce an algorithm for predicting the function

f^{U}

, we employ statistical estimators of the penalty function

ψ

, as well as the values

L_{y}^{U} (x)

, where

x \in X

,

y \in Y

,

y > - m

. Consider

ψ (y) : = 1 / P (Y = y), w h e r e P (Y = y) > 0, y \in Y .

(11)

In the case of a binary response, such a choice of the penalty function was proposed in [36], the justification for this choice is given in [31], see also Section 4 in [32]. For the specified function

ψ (y)

and observations

ξ (W)

, where the finite set

W \subset N

, we use

\hat{ψ} (y, ξ (W)) : = \{\begin{matrix} \frac{1}{\hat{P} (y, ξ (W))}, & \hat{P} (y, ξ (W)) \neq 0, \\ 0, & \hat{P} (y, ξ (W)) = 0, \end{matrix}

(12)

where the frequency estimator of a probability

P (Y = y)

has the form

\hat{P} (y, ξ (W)) : = \frac{1}{♯ W} \sum_{j \in W} I {Y^{(j)} = y}, y \in N .

(13)

It is not difficult to see that the strong law of large numbers for arrays of random variables (see, e.g., [37]) entails for finite sets

W_{N} \subset N

, such that

♯ W_{N} \to \infty

, the relation

\hat{ψ} (y, ξ (W_{N})) \to ψ (y) a . s ., N \to \infty .

(14)

Let the prediction algorithm

{\hat{f}}_{P A}^{U} (x, ξ (W_{N}))

of a function

f^{U} (x)

be constructed by means of formula (8) analogue, where, for

x \in X

,

y \in Y

,

y > - m

, and

W_{N} \subset {1, \dots, N}

, one uses now statistical estimators

{\hat{L}}_{y}^{U, W_{N}} (x)

of functions

L_{y}^{U} (x)

introduced in (10). Namely, let us define the following random variables:

{\hat{w}}_{y}^{U, W_{N}} (x) : = \hat{ψ} (y, ξ (W_{N})) \frac{1}{♯ W_{N}} \sum_{j \in W_{N}} I {Y^{(j)} = y, X_{U}^{(j)} = x_{U}}, y \in Y,

where

\hat{ψ} (y, ξ (W_{N}))

is an estimator of

ψ (y)

appearing in (12). For

x \in X

,

y \in Y

,

y > - m

, set

{\hat{L}}_{y}^{U, W_{N}} (x) : = {\hat{w}}_{y}^{U, W_{N}} {(x)}^{⊤} Δ (y) .

Replace the value

L_{y} (x)

in (8) by

{\hat{L}}_{y}^{U, W_{N}} (x)

. Then one can claim that

{\hat{f}}_{P A}^{U} (x, ξ (W_{N})) = y ⟺ \{\begin{matrix} {\hat{L}}_{y}^{U, W_{N}} (x) \geq 0, & y = - m, \\ {\hat{L}}_{y + 1}^{U, W_{N}} (x) \geq 0, {\hat{L}}_{y}^{U, W_{N}} (x) < 0, & y \neq \pm m, \\ {\hat{L}}_{y}^{U, W_{N}} (x) < 0, & y = m . \end{matrix}

(15)

For

K \in N

,

K > 1

, we take a partition of a set

{1, \dots, N}

into subsets

D_{k} (N) : = {(k - 1) [N / K] + 1, \dots, k [N / K] I {k < K} + N I {K = N}},

(16)

here

k = 1, \dots, K

,

[a]

is an integer part of a number

a \in R

,

I {A}

is an indicator of a set A. These sets are applied in the K-fold cross-validation procedure increasing the stability of statistical inference (cross-validation procedure is studied, e.g., in [38]). Following [32], the estimator of the functional

Err (f^{U})

, i.e., a statistical estimator of the prediction error functional for a function

f^{U}

and observations

ξ_{N} : = ξ ({1, \dots, N})

, involving the K-fold cross-validation procedure, is given by the formula:

{\hat{E r r}}_{K, N} (f^{U}) : = \sum_{i = 0}^{2 m - 1} \sum_{i - m < | y | \leq m} \frac{1}{K} \sum_{k = 1}^{K} \hat{ψ} (y, ξ (D_{k} (N)))

\times \frac{1}{♯ D_{k} (N)} \sum_{j \in D_{k} (N)} I {Y^{(j)} = y, | {\hat{f}}_{P A}^{U} (X^{(j)}, ξ ({\bar{D}}_{k} (N))) - y | > i},

(17)

where

{\bar{D}}_{k} (N) : = {1, \dots, N} ∖ D_{k} (N)

and

\hat{ψ} (y, ξ (D_{k} (N)))

are evaluated according to (12) for

W_{N} = D_{k} (N)

,

k = 1, \dots, K

. The estimator (17) is a natural statistical analogue of the error functional (2) written in the form (6) when one employs the K-cross-validation procedure. Namely, instead of

ψ (y)

we apply its statistical estimator of the type (12) and instead of f we use its approximation by means of prediction algorithm based on the part

D_{k} (N)

of observations. To obtain the statistical estimators of the probability appearing in Formula (6) we write the corresponding average of indicator functions. One employs also the averaging over different parts of observations.

By Theorem 2 of [32], if

S = {i_{1}, \dots, i_{r}}

is a set of relevant factors, i.e., (1) holds, then, for each

ε > 0

and any set

U = {m_{1}, \dots, m_{r}} \subset T

, the following inequality takes place almost sure for all N large enough:

{\hat{E r r}}_{K, N} (f^{S}) \leq {\hat{E r r}}_{K, N} (f^{U}) + ε .

(18)

Thus, it is natural to consider all subsets

U = {m_{1}, \dots, m_{r}} \subset T

and choose as a statistical estimator of a relevant collection of indices

(i_{1}, \dots, i_{r})

a set U on which the minimum of

{\hat{E r r}}_{K, N} (f^{U})

is attained. Here we also note that, for the study of asymptotic properties of the error functional, the regularization of the prediction algorithm by means of a sequence of positive numbers

{(ε_{N})}_{N \in N}

such that

ε_{N} \to 0

, as

N \to \infty

, plays an important role. Namely, for

W_{N} \subset {1, \dots, N}

, we define

{\hat{f}}_{P A, ε_{N}}^{U} (x, ξ (W_{N})) = y ⟺ \{\begin{matrix} {\hat{L}}_{y}^{U, W_{N}} (x) + ε_{N} \geq 0, & y = - m, \\ {\hat{L}}_{y + 1}^{U, W_{N}} (x) + ε_{N} \geq 0, {\hat{L}}_{y}^{U, W_{N}} (x) + ε_{N} < 0, & y \neq \pm m, \\ {\hat{L}}_{y}^{U, W_{N}} (x) + ε_{N} < 0, & y = m . \end{matrix}

(19)

As in article [32], we assume that

ε_{N} \to 0 +, \sqrt{N} ε_{N} \to \infty, N \to \infty .

(20)

Now we introduce a statistical estimator

{\hat{Err}}_{K, N, ε_{N}} (f^{U})

using an analogue of Formula (17), where one employs

{\hat{f}}_{P A, ε_{N}}^{U}

instead of

{\hat{f}}_{P A}^{U}

. For the regularized statistical estimators, as mentioned in [32], the analogue of Formula (18) holds. In [32], for estimators

{\hat{f}}_{P A, ε_{N}}^{U}

constructed when condition (20) is met, the CLT is established. In the next section we apply a slightly different regularization for the error functional estimates, which will permit us to specify the convergence rate of the first two moments of these estimators to corresponding moments of the limiting Gaussian variable. This result is not only of independent interest, but is also applied in Section 4.

3. Asymptotic Behavior of the First Two Moments of Statistical Estimators of the Error Functional

As noted in Section 2, we will use the penalty function (11). Therefore, for

W_{N} = D_{k} (N)

, as a strongly consistent estimator

\hat{ψ} (y, D_{k} (N))

of

ψ (y)

we will employ the variable appearing in (12), denoted below as

{\hat{ψ}}_{N, k} (y)

, where

y \in Y

,

k = 1, \dots, K

,

N \in N

. Recall that the estimator

{\hat{Err}}_{K, N} (f^{U})

is defined by formula (2). If the regularized version

{\hat{f}}_{P A, ε_{N}}

is substituted into this estimator instead of

{\hat{f}}_{P A}^{U}

, where

x \in X

and

N \in N

, then the notation

{\hat{Err}}_{K, N, ε_{N}} (f^{U})

is used. We will apply the following Corollary 3 of [32] established in the framework of a model satisfying (1).

Theorem 1 ([32]).

Let U be an arbitrary subset of T having the cardinality r, the function

f^{U}

be defined after formula (10),

{\hat{f}}_{P A, ε_{N}}^{U}

appear in (19) for observations

ξ_{N}

, and the sequence

{(ε_{N})}_{N \in N}

satisfy condition (20). Then

\sqrt{N} ({\hat{Err}}_{K, N, ε_{N}} (f^{U}) - Err (f^{U})) \overset{D}{\to} Z \sim N (0, σ^{2} (U)), N \to \infty,

(21)

and in this case

σ^{2} (U)

is the variance of a random variable

V (U) : = \sum_{i = 0}^{2 m - 1} \sum_{i - m < | y | \leq m} \frac{I {Y = y}}{P (Y = y)} (I {| f^{U} (X) - y | > i} - P (| f^{U} (X) - y | > i | Y = y)) .

(22)

It is known that the convergence in distribution of random variables, in general, does not ensure the convergence of their moments even when the moments exist. We will manage to establish the convergence rate of the first two moments of the error functional statistical estimators to the corresponding moments of the limit random variable. For this purpose we slightly strength the condition of estimates regularization. We require that a sequence

{(ε_{N})}_{N \in N}

satisfies the following condition:

ε_{N} \to 0 +, \frac{ε_{N} \sqrt{N}}{\sqrt{log \frac{1}{ε_{N}}}} \to \infty, N \to \infty .

(23)

Clearly, (23) implies the validity of (20). Relation (23) holds if one takes

ε_{N} = N^{- δ}

,

N \in N

, where

δ \in (0, 1 / 2)

.

Lemma 1.

Let condition (23) be met. Then, for every

K \in N

,

K > 1

, and any

U \subset T

, the statistical estimators

{\hat{Err}}_{K, N, ε_{N}} (f^{U})

satisfy the following relation:

N E {({\hat{Err}}_{K, N, ε_{N}} (f^{U}) - Err (f^{U}))}^{2} \to σ^{2} (U), N \to \infty,

(24)

where

σ^{2} (U) = var V (U)

and

V (U)

is introduced in formula (22).

Proof of Lemma 1.

Let us fix an arbitrary set

U \subset T

. For each

N \in N

one has

\begin{matrix} Z_{N} & : = & \sqrt{N} ({\hat{Err}}_{K, N, ε_{N}} (f^{U}) - Err (f^{U})) = \sqrt{N} ({\hat{Err}}_{K, N, ε_{N}} (f^{U}) - {\hat{T}}_{N} (f^{U})) \\ + & \sqrt{N} ({\hat{T}}_{N} (f^{U}) - T_{N} (f^{U})) + \sqrt{N} (T_{N} (f^{U}) - Err (f^{U})), \end{matrix}

(25)

where

T_{N} (f^{U}) : = \sum_{i = 0}^{2 m - 1} \sum_{i - m < | y | \leq m} \frac{1}{K} \sum_{k = 1}^{K} \frac{ψ (y)}{♯ D_{k} (N)} \sum_{j \in D_{k} (N)} I {Y^{(j)} = y, | f^{U} (X^{(j)}) - y | > i},

(26)

{\hat{T}}_{N} (f^{U}) : = \sum_{i = 0}^{2 m - 1} \sum_{i - m < | y | \leq m} \frac{1}{K} \sum_{k = 1}^{K} \frac{{\hat{ψ}}_{N, k} (y)}{♯ D_{k} (N)} \sum_{j \in D_{k} (N)} I {Y^{(j)} = y, | f^{U} (X^{(j)}) - y | > i},

(27)

{\hat{ψ}}_{N, k} (y)

are defined by means of (12) for

W_{N} = D_{k} (N)

,

k = 1, \dots, K

,

N \in N

. The proof is divided into several steps.

Step 1. At first we consider

R_{N} : = \sqrt{N} ({\hat{Err}}_{K, N, ε_{N}} (f^{U}) - {\hat{T}}_{N} (f^{U})), N \in N .

To simplify the notation, we do not write that

R_{N}

also depends on K,

ξ_{N}

and

ε_{N}

. Our aim is to show that if (23) holds then

E R_{N}^{2} \to 0 a s N \to \infty .

(28)

In the light of formula (71) of [32], under condition (20) the following relation is valid:

R_{N} \overset{P}{\to} 0, N \to \infty .

(29)

Taking into account (29), by Theorem 5.4 of [39], relation (28) holds if (and only if) the sequence

{(R_{N}^{2})}_{N \in N}

is uniformly integrable. Due to theorem by De La Vallé - Poussin (see, e.g., Theorem 1.3.4 of [40]) it is sufficient to verify that

sup_{N \in N} E (R_{N}^{4}) < \infty .

For

x \in X

,

y \in Y

,

i \in Z_{+}

,

k = 1, \dots, K

and

N \in N

we introduce the following random variables:

F_{N, k}^{(i)} (x, y) = I {| {\hat{f}}_{P A, ε_{N}}^{U} (x, ξ ({\bar{D}}_{N, k})) - y | > i} - I {| f^{U} (x) - y | > i},

(30)

S_{k} (i, y) : = \frac{1}{♯ D_{k} (N)} \sum_{j \in D_{k} (N)} I {Y^{(j)} = y} F_{N, k}^{(i)} (X^{(j)}, y),

(31)

where, for

W \subset N

,

ξ (W)

is defined by Formula (9). Write

R_{N} = U_{N, 1} + U_{N, 2}

, here

\begin{matrix} U_{N, 1} & : = & \sqrt{N} (\frac{1}{K} \sum_{k = 1}^{K} \sum_{i = 0}^{2 m - 1} \sum_{i - m < | y | \leq m} ψ (y) S_{k} (i, y)), \\ U_{N, 2} & : = & \sqrt{N} (\frac{1}{K} \sum_{k = 1}^{K} \sum_{i = 0}^{2 m - 1} \sum_{i - m < | y | \leq m} ({\hat{ψ}}_{N, k} (y) - ψ (y)) S_{k} (i, y)) . \end{matrix}

Now note that, for any real numbers

a_{1}, \dots, a_{v}

, every

v \in N

and an arbitrary

γ > 1

, the Hölder inequality implies that

{(\sum_{r = 1}^{v} | a_{r} |)}^{γ} \leq v^{γ - 1} \sum_{r = 1}^{v} {| a_{r} |}^{γ} .

(32)

Evidently, (32) is true for

γ = 1

as well. Consequently, we get

R_{N}^{4} \leq 8 (U_{N, 1}^{4} + U_{N, 2}^{4}), N \in N .

(33)

Clearly, for all

x \in X

,

y \in Y

,

W_{N} \subset {1, \dots, N}

and

N \in N

, one has

{\hat{L}}_{y, ε_{N}}^{U, W_{N}} (x) : = {\hat{L}}_{y}^{U, W_{N}} (x) + ε_{N} = L_{y}^{U} (x) + {({\hat{w}}_{y}^{U, W_{N}} (x) - w_{y} (x))}^{⊤} Δ (y) + ε_{N},

(34)

where the functions appearing in (34) were introduced in Section 2. For any

x \in X

and

y \in Y

, the inequalities

L_{y}^{U} (x) \geq 0

,

L_{y + 1}^{U} (x) < 0

are satisfied if and only if, for arbitrary

δ_{N} (x, y; U) > 0

such that

δ_{N} (x, y; U) \to 0

, as

N \to \infty

, and all sufficiently large

N \in N

, the following inequalities are valid:

L_{y}^{U} (x) + δ_{N} (x, y; U) > 0

,

L_{y + 1}^{U} (x) + δ_{N} (x, y; U) < 0

(the analogous statement is true for inequalities corresponding to coordinates

y = m

and

y = - m

in Formula (19)). Obviously,

| {({\hat{w}}_{y}^{U, W_{N}} (x) - w_{y} (x))}^{⊤} Δ (y) |

\leq | \hat{ψ} (y, ξ (W_{N})) - ψ (y) | + ψ (y) |\frac{1}{♯ W_{N}} \sum_{q \in W_{N}} I {X_{U}^{(q)} = x_{U}, Y^{(q)} = y} - P (X_{U} = x_{U}, Y = y)|,

where

\hat{ψ} (y, ξ (W_{N}))

is defined in (12). One has

\begin{matrix} \sum_{x_{U}} (\frac{1}{♯ W_{N}} \sum_{q \in W_{N}} I {X_{U}^{(q)} = x_{U}, Y^{(q)} = y} - P (X_{U} = x_{U}, Y = y)) \\ = \frac{1}{♯ W_{N}} \sum_{q \in W_{N}} I {Y^{(q)} = y} - P (Y = y) \\ = \hat{P} (y, ξ (W_{N})) - P (Y = y) . \end{matrix}

(35)

For

x \in X

,

y \in Y

,

W_{N} \subset {1, \dots, N}

and

N \in N

, consider the following event

A_{W_{N}} (x, y) = \{|\frac{1}{♯ W_{N}} \sum_{q \in W_{N}} I {X_{U}^{(q)} = x_{U}, Y^{(q)} = y} - P (X_{U} = x_{U}, Y = y)| \leq \frac{p_{0}^{2} ε_{N}}{8 ♯ X}\},

(36)

where

p_{0} = {min}_{y \in Y} P (Y = y)

(we assumed that

P (Y = y) > 0

for

y \in Y

). More precisely one can write

A_{W_{N}} (x, y) = A_{W_{N}} (x, y, U; {(X^{(q)}, Y^{(q)}), q \in W_{N}})

. We will not include a set U in the list of arguments since this set is fixed. Then, for

ω \in A_{W_{N}} (x, y)

, in view of (35), we get

|\hat{P} (y, ξ (W_{N})) - P (Y = y)| \leq \frac{p_{0}^{2} ε_{N}}{8} .

(37)

Then by virtue of (37), for any

y \in Y

and all N large enough, i.e., for

N \geq N_{0} (Y, {(ε_{N})}_{N \in N})

, one has

\hat{P} (y, ξ (W_{N})) \geq P (Y = y) - \frac{p_{0}^{2} ε_{N}}{8} \geq P (Y = y) - \frac{ε_{N}}{8} > \frac{P (Y = y)}{2} > 0,

and hence the following relation holds

| \hat{ψ} (y, ξ (W_{N})) - ψ (y) | = \frac{| \hat{P} (y, ξ (W_{N})) - P (Y = y) |}{\hat{P} (y, ξ (W_{N})) P (Y = y)} \leq \frac{\frac{p_{0}^{2} ε_{N}}{8}}{\frac{P {(Y = y)}^{2}}{2}} \leq \frac{ε_{N}}{4} .

(38)

Thus if

ω \in A_{W_{N}} (x, y)

, where

x \in X

and

y \in Y

, then according to (36) and (38), for all N large enough, we can write

| {({\hat{w}}_{y}^{U, W_{N}} (x) - w_{y} (x))}^{⊤} Δ (y) | \leq \frac{ε_{N}}{4} + (\frac{1}{p_{0}}) \frac{p_{0}^{2} ε_{N}}{8 ♯ X} \leq \frac{ε_{N}}{2} .

Taking into account that the sets

X

and

Y

have finite cardinalities, we ascertain that, for any

x \in X

,

y \in Y

and all N large enough, for

ω \in A_{W_{N}} (x, y)

, one has

{\hat{f}}_{P A, ε_{N}}^{U, W_{N}} (x) = f^{U} (x) .

(39)

Consequently, for any

x \in X

,

y \in Y

,

i = 0, 1, \dots, 2 m - 1

,

ω \in A_{W_{N}} (x, y)

, where

W_{N} = {\bar{D}}_{k} (N)

,

k = 1, \dots, K

, for all N large enough (i.e.,

N \geq N_{1}

), the following inequality holds:

F_{N, k}^{(i)} (x, y) I {A_{{\bar{D}}_{k} (N)} (x, y)} = 0 .

(40)

Applying (32) we come to the inequality

| U_{N, 1} |^{4} \leq N^{2} \frac{{(2 m)}^{6}}{K} \sum_{k = 1}^{K} \sum_{i = 0}^{2 m - 1} \sum_{i - m < | y | \leq m} ψ {(y)}^{4} {(\frac{1}{♯ D_{k} (N)} \sum_{j \in D_{k} (N)} I {Y^{(j)} = y} F_{N, k}^{(i)} (X^{(j)}, y))}^{4} .

Let

\tilde{Σ}

denote the summation over all

x_{j} \in X

for

j \in D_{k} (N)

. For

N \geq N_{1}

one has

\begin{matrix} E {(\sum_{j \in D_{k} (N)} I {Y^{(j)} = y} F_{N, k}^{(i)} (X^{(j)}, y))}^{4} \\ = & E (\tilde{Σ} {(\sum_{j \in D_{k} (N)} I {Y^{(j)} = y} F_{N, k}^{(i)} (x_{j}, y))}^{4} I \{⋂_{j \in D_{k} (N)} {X^{(j)} = x_{j}}\}) \end{matrix}

\begin{matrix} = & E (\tilde{Σ} {(\sum_{j \in D_{k} (N)} I {Y^{(j)} = y} F_{N, k}^{(i)} (x_{j}, y) I {{\bar{A}}_{{\bar{D}}_{k} (N)} (x_{j}, y)})}^{4} I \{⋂_{j \in D_{k} (N)} {X^{(j)} = x_{j}}\}) \\ = & E {(\sum_{j \in D_{k} (N)} I {Y^{(j)} = y} F_{N, k}^{(i)} (X^{(j)}, y) I {{\bar{A}}_{{\bar{D}}_{k} (N)} (X^{(j)}, y)})}^{4} \\ \leq & E {(\sum_{j \in D_{k} (N)} I {{\bar{A}}_{{\bar{D}}_{k} (N)} (X^{(j)}, y)})}^{4}, \end{matrix}

here we employ (40) and take into account that

| F_{N, k}^{(i)} (x, y) | \leq 1

. We see that

| U_{N, 1} |^{4} \leq N^{2} \frac{{(2 m)}^{6}}{K} \sum_{k = 1}^{K} \sum_{i = 0}^{2 m - 1} \sum_{i - m < | y | \leq m} \frac{ψ {(y)}^{4}}{{(♯ D_{k} (N))}^{4}} {(\sum_{j \in D_{k} (N)} I {{\bar{A}}_{{\bar{D}}_{N} (k)} (X^{(j)}, y)})}^{4} .

(41)

For

W_{N} \subset {1, \dots, N}

,

y \in Y

and

j = 1, \dots, N

, introduce the functions

g_{W_{N}} (X^{(j)}, y) = I {{\bar{A}}_{W_{N}} (X^{(j)}, y)} = I {{\bar{A}}_{W_{N}} (X^{(j)}, y; {(X^{(q)}, Y^{(q)}), q \in W_{N}})} .

It is known (see, e.g., formula (15) in Chap. VI of [41]) that if a bounded Borel function

g : R^{n} \times R^{m} \to R

,

ξ

and

ζ

are independent random vectors taking values in

R^{n}

and

R^{m}

, respectively, then

E (g (ξ, ζ) | ζ = z) = E g (ξ, z), z \in R^{n} .

Due to independence of

(X^{(j)}, Y^{(j)})

,

j \in N

, we can apply the lemma on grouping random vectors (see, e.g., [42], p. 28) to get the relation

E ((\sum_{j \in D_{k} (N)} g_{{\bar{D}}_{k} (N)} (X^{(j)}, y; (X^{(q)}, Y^{(q)}), q \in {\bar{D}}_{k} (N))))^{4} |(X^{(q)}, Y^{(q)}) = (x_{q}, y_{q}), q \in {\bar{D}}_{k} (N)))

= E {(\sum_{j \in D_{k} (N)} g_{{\bar{D}}_{k} (N)} (X^{(j)}, y; (x_{q}, y_{q})), q \in {\bar{D}}_{N} (k))))}^{4} .

By the Rosenthal inequality (see, e.g., Theorem 2.9 of [43]), for independent centered random variables

Z_{1}, \dots, Z_{v}

, having

E | Z_{j} |^{t} < \infty

for some

t \in [2, \infty)

and each

j = 1, \dots, v

, one has

E {|\sum_{j = 1}^{v} Z_{j}|}^{t} \leq C (t) (\sum_{j = 1}^{v} E {| Z_{j} |}^{t} + {(\sum_{j = 1}^{v} E Z_{j}^{2})}^{\frac{t}{2}}),

(42)

where

C (t) > 0

depends on t but does not depend on v and distributions of variables

Z_{j}

,

j = 1, \dots, v

.

Set

η_{N, k}^{(j)} : = g_{{\bar{D}}_{k} (N)} (X^{(j)}, y; {(x_{q}, y_{q})), q \in {\bar{D}}_{N} (k)})

,

j \in N

. Note that

0 \leq η_{N, k}^{(j)} \leq 1

for all

j \in D_{N} (k)

. Then according to (42) we come to the inequality

E {(\sum_{j \in D_{k} (N)} (η_{N, k}^{(j)} - E η_{N, k}^{(j)}))}^{4} \leq C {(♯ D_{k} (N))}^{2},

where

k = 1, \dots, K

and

C = 2 C (4)

. Hence, applying (32) for

γ = 4

and

v = 2

, one has

E {(\sum_{j \in D_{k} (N)} η_{N, k}^{(j)})}^{4} \leq 8 (C {(♯ D_{k} (N))}^{2} + {(\sum_{j \in D_{k} (N)} E η_{N, k}^{(j)})}^{4})

\leq 8 C {(♯ D_{k} (N))}^{2} + 8 {(♯ D_{k} (N))}^{4} max_{j \in D_{k} (N)} {(E η_{N, k}^{(j)})}^{4} .

Evidently, we can write

E (η_{N, k}^{(j)}) = P ({\bar{A}}_{{\bar{D}}_{k} (N)} (X^{(j)}, y; {(x_{q}, y_{q})), q \in {\bar{D}}_{k} (N)}) .

Let

M_{k} = ♯ {\bar{D}}_{k} (N)

, where

M_{k} = M_{k} (N)

,

k = 1, \dots, K

. Set

ζ_{q} = I {X_{U}^{(q)} = x_{U}, Y^{(q)} = y}

, where

q \in {\bar{D}}_{k} (N)

,

σ_{0}^{2} = var ζ_{q}

. Clearly,

ζ_{q}

depends on

x_{U}

, y and U. Random variables

ζ_{q}

are identically distributed for

q \in N

. Therefore

σ_{0}^{2} = σ_{0}^{2} (U, x, y)

, but does not depend on q. If

σ_{0}^{2} = 0

, then the variables

ζ_{q}

are a.s. equal to some constant. According to (36), an event

{\bar{A}}_{{\bar{D}}_{k} (N)} (X^{(j)}, y; {(x_{q}, y_{q})), q \in {\bar{D}}_{k} (N)})

occurrence means that the variable which is equal to zero a.s. turns greater than

(p_{0}^{2} ε_{N}) / (8 ♯ X)

. Therefore, in the degenerate case one has

P ({\bar{A}}_{{\bar{D}}_{k} (N)} (X^{(j)}, y; (x_{q}, y_{q})), q \in {\bar{D}}_{k} (N))) = 0

and

E η_{N, k}^{(j)} = 0

for all

j = 1, \dots, N

. Consider now the case when

σ_{0}^{2} > 0

. Then we get

P ({\bar{A}}_{{\bar{D}}_{k} (N)} (X^{(j)}, y; {(x_{q}, y_{q}), q \in {\bar{D}}_{k} (N)}) = P (\frac{\sum_{q \in {\bar{D}}_{k} (N)} (ζ_{q} - E ζ_{q})}{σ_{0} \sqrt{M_{k}}} > \frac{p_{0}^{2} \sqrt{M_{k}} ε_{N}}{8 ♯ X σ_{0}}),

where

p_{0}

appeared in (36).

Now we employ the Berry-Esseen estimate of the convergence rate in CLT for i.i.d. random variables. Let

Z_{1}, \dots, Z_{v}

be i.i.d. random variables such that

E Z_{1} = 0

,

var Z_{1} = σ^{2} \in (0, \infty)

,

E | Z_{1} |^{3} = ρ < \infty

. We write F for the distribution function of

Z_{1}

and

F_{v}

stands for the distribution function of

(Z_{1} + \dots + Z_{v}) / (σ \sqrt{v})

. Then (see, e.g., Theorem 5.4 of [43]), for any

v \in N

,

sup_{u \in R} | F_{v} (u) - Φ (u) | \leq \frac{C_{0} ρ}{σ^{3} \sqrt{v}},

where

Φ (u)

is the distribution function of a standard normal random variable,

C_{0}

is a positive constant (

C_{0}

does not depend on distribution of

Z_{1}

and v). According to [44] one has

C_{0}

0.4693. Consequently, taking

Z \sim N (0, 1)

, we have

P (|\frac{\sum_{q \in {\bar{D}}_{k} (N)} (ζ_{q} - E ζ_{q})}{σ_{0} \sqrt{M_{k}}}| > \frac{p_{0}^{2} \sqrt{M_{k}} ε_{N}}{8 ♯ X σ_{0}}) \leq P (| Z | > \frac{p_{0}^{2} \sqrt{M_{k}} ε_{N}}{8 ♯ X σ_{0}}) + \frac{2 C_{0}}{σ_{0}^{3} \sqrt{M_{k}}}

(43)

since

E | ζ_{q} - E ζ_{q} |^{3} \leq 1

for

q \in {\bar{D}}_{k} (N)

, where

ζ_{q} = I {X_{U}^{(q)} = x_{U}, Y^{(q)} = y}

.

It is well-known (see, e.g., formula (29) of Chap. II of [41]), that, for

u > 0

, the following inequality is true:

P (| Z | \geq u) \leq \frac{\sqrt{2 / π}}{u} exp \{- \frac{u^{2}}{2}\} .

Therefore, by virtue of an inequality

σ_{0}^{2} \leq 1 / 4

(which is valid for the indicator variance) and as

(K - 1) [N / K] \leq M_{k} \leq N,

(44)

we can write under condition (23) that

P (| Z | > \frac{p_{0}^{2} \sqrt{M_{k}} ε_{N}}{8 ♯ X σ_{0}}) \leq \frac{8 ♯ X \sqrt{2} σ_{0}}{p_{0}^{2} \sqrt{π M_{k}} ε_{N}} exp \{- \frac{1}{2} {(\frac{p_{0}^{2} \sqrt{M_{k}} ε_{N}}{8 ♯ X σ_{0}})}^{2}\}

\begin{matrix} \leq & \frac{4 \sqrt{2} ♯ X}{p_{0}^{2} \sqrt{π M_{k}} ε_{N}} exp \{- \frac{1}{32} {(\frac{p_{0}^{2} \sqrt{M_{k}} ε_{N}}{♯ X})}^{2}\} \\ = & \frac{4 \sqrt{2} ♯ X}{p_{0}^{2} \sqrt{π M_{k}}} exp \{- \frac{1}{32} {(\frac{p_{0}^{2} \sqrt{M_{k}} ε_{N}}{♯ X})}^{2} + log (\frac{1}{ε_{N}})\} \leq \frac{C_{1}}{\sqrt{N}}, N \in N, \end{matrix}

and

C_{1}

does not depend on N.

Introduce

{\tilde{σ}}^{2} : = min_{U \subset T, x \in X, y \in Y} σ_{0}^{2} (U, x, y),

where one considers only strictly positive

σ_{0}^{2} (U, x, y)

. Then obviously

{\tilde{σ}}^{2} > 0

, as there exists only a finite collection of different variants. Thus in view of (44), for all x, y and U under consideration, one has

\frac{2 C_{0}}{{\tilde{σ}}^{3} \sqrt{M_{k}}} \leq \frac{C_{2}}{\sqrt{N}}, N \in N,

where

C_{0}

appeared in (43) and

C_{2}

does not depend on N.

Therefore, if condition (23) is satisfied then, for all

x \in X

,

y \in Y

,

k = 1, \dots, K

and

j \in D_{k} (N)

, the following inequality holds:

E η_{N, k}^{(j)} \leq \frac{C_{3}}{\sqrt{N}}, N \in N,

(45)

where

C_{3}

does not depend on

x, y

, k and N. Hence, in view of (44) we come to the relation

E {(\sum_{j \in D_{k} (N)} g_{{\bar{D}}_{k} (N)} (X^{(j)}, y; {(X^{(q)}, Y^{(q)}), q \in {\bar{D}}_{k} (N)}))}^{4}

\leq (8 C {(♯ D_{k} (N))}^{2} + 8 {(♯ D_{k} (N))}^{4} \frac{C_{3}^{4}}{N^{2}}) \sum_{(x_{q}, y_{q})), q \in {\bar{D}}_{N} (k)} P ((X^{(q)}, Y^{(q)}) = (x_{q}, y_{q})) \leq C_{4} N^{2},

where

C_{4}

does not depend on x, y, k and N. Thus according to (41), for all N large enough, we have proved the inequality

E U_{N, 1}^{4} \leq C_{5},

(46)

where

C_{5}

does not depend on N.

In a similar way (taking into account (42) and (45)), for

i = 0, \dots, 2 m - 1

,

y \in Y

,

k = 1, \dots, K

, and all N large enough, we get

E S_{k} {(i, y)}^{8} \leq C_{6} {(♯ D_{N} (k))}^{- 4},

(47)

where

S_{k} (i, y)

is introduced in (31), and

C_{6}

does not depend on N.

We will employ an elementary result for the Bernoulli scheme. Let

U_{1}, U_{2}, \dots,

be a sequence of i.i.d. random variables such that

P (U_{1} = 1) = p

and

P (U_{1} = 0) = 1 - p

, where

p \in (0, 1)

. Consider the following frequency estimator of a probability

p

:

{\hat{p}}_{N} : = \frac{1}{N} \sum_{j = 1}^{N} I {U_{j} = 1}, N \in N .

Define

{\hat{ψ}}_{N} : = \{\begin{matrix} \frac{1}{{\hat{p}}_{N}}, & {\hat{p}}_{N} \neq 0, \\ 0, & {\hat{p}}_{N} = 0 . \end{matrix}

(48)

Lemma 2.

For the Bernoulli scheme introduced above and the estimators

{\hat{ψ}}_{N}

provided by formula (48), for each

t \in N

, the following relation holds:

E {({\hat{ψ}}_{N} - \frac{1}{p})}^{t} = O (\frac{1}{N}), N \to \infty .

(49)

More precisely, the absolute value of the function in the left-hand side of (49), for all

N \in N

, admits a bound

c / N

where

c = c (p, t)

for

p \in (0, 1)

and

t \in N

.

For the sake of completeness the proof of this result is given in Appendix A.

Now we continue the proof corresponding to Step 1. For all considered k, i, y and any

N \in N

, the Cauchy - Bunyakovsky - Schwarz inequality yields

E {(({\hat{ψ}}_{N, k} (y) - ψ (y)) S_{k} (i, y))}^{4} \leq {(E {({\hat{ψ}}_{N, k} (y) - ψ (y))}^{8} E S_{k} {(i, y))}^{8})}^{\frac{1}{2}} .

Due to Lemma 2 one has

E {({\hat{ψ}}_{N, k} (y) - ψ (y))}^{8} = O (\frac{1}{N})

,

N \to \infty

. Employing the Minkowski inequality (to take into account the summation over i, y, k), for all

N \in N

, we come to the bound

E U_{N, 2}^{4} \leq N^{2} C_{7} {((\frac{1}{N}) (\frac{1}{N^{4}}))}^{\frac{1}{2}} = \frac{C_{7}}{\sqrt{N}},

(50)

where

C_{7}

does not depend on N.

Consequently, by virtue of (33), (46) and (50) the uniform integrability of a sequence

{(R_{N}^{2})}_{N \in N}

is established. Thus (28) is verified.

Step 2. Now we study the asymptotic behavior of the variables

\sqrt{N} ({\hat{T}}_{N} (f^{U}) - T_{N} (f^{U}))

, as

N \to \infty

, where

{\hat{T}}_{N} (f^{U})

and

T_{N} (f^{U})

are given by Formulas (26) and (27), respectively. For

j \in N

,

i = 0, \dots, 2 m - 1

,

y \in Y

, we set

Z_{i}^{(j)} (y) = I {Y^{(j)} = y, | f (X^{(j)}) - y | > i}

. One has

\sqrt{N} ({\hat{T}}_{N} (f^{U}) - T_{N} (f^{U})) = W_{N, 1} + W_{N, 2},

where

\begin{matrix} W_{N, 1} & = & \frac{\sqrt{N}}{K} \sum_{k = 1}^{K} \sum_{i = 0}^{2 m - 1} \sum_{i - m < | y | \leq m} \frac{({\hat{ψ}}_{N, k} (y) - ψ (y))}{♯ D_{k} (N)} \sum_{j \in D_{k} (N)} (Z_{i}^{(j)} (y) - E Z_{i}^{(j)} (y)), \\ W_{N, 2} & = & \frac{\sqrt{N}}{K} \sum_{k = 1}^{K} \sum_{i = 0}^{2 m - 1} \sum_{i - m < | y | \leq m} \frac{({\hat{ψ}}_{N, k} (y) - ψ (y))}{♯ D_{k} (N)} \sum_{j \in D_{k} (N)} P (Y^{(j)} = y, | f^{U} (X^{(j)}) - y | > i) \\ = & \frac{\sqrt{N}}{K} \sum_{k = 1}^{K} \sum_{i = 0}^{2 m - 1} \sum_{i - m < | y | \leq m} ({\hat{ψ}}_{N, k} (y) - ψ (y)) P (Y = y, | f^{U} (X) - y | > i) . \end{matrix}

(51)

The purpose of the second step is to prove that

E W_{N, 1}^{2} \to 0, N \to \infty .

(52)

For

k = 1, \dots, K

,

i = 0, \dots, 2 m - 1

and

y \in Y

introduce

G_{k} (i, y) = \frac{1}{♯ D_{k} (N)} \sum_{j \in D_{k} (N)} (Z_{i}^{(j)} (y) - E Z_{i}^{(j)} (y)) .

The Cauchy-Bunyakovsky-Schwarz inequality yields

E {(({\hat{ψ}}_{N, k} (y) - ψ (y)) G_{k} (i, y))}^{2} \leq {(E {({\hat{ψ}}_{N, k} (y) - ψ (y))}^{4})}^{\frac{1}{2}} {(E {(G_{k} (i, y))}^{4})}^{\frac{1}{2}} .

For each considered N, y, i and k, the variables

{Z_{i}^{(j)} (y), j \in D_{k} (N)}

are independent and

| Z_{i}^{(j)} (y) - E Z_{i}^{(j)} (y) | \leq 1

, so by virtue of the Rosenthal inequality (42) we obtain

E {(\sum_{j \in D_{k} (N)} (Z_{i}^{(j)} (y) - E Z_{i}^{(j)} (y)))}^{4} = O (♯ D_{k} {(N)}^{2}) .

Taking into account Lemma 2 for

t = 4

and in view of (44), for each

k = 1, \dots, K

, we get the relation

E W_{N, 1}^{2} = O (N^{- \frac{1}{2}}), N \to \infty .

Therefore, the goal of the second step has been achieved.

Step 3. The implementation of steps 1 and 2 permits to reduce the study of the asymptotic behavior (as

N \to \infty

) of

Z_{N}

given by Formula (25) to the study of variables

η_{N} : = \sqrt{N} (T_{N} (f^{U}) - Err (f^{U})) + W_{N, 2}, N \in N,

where

W_{N, 2}

is defined by Formula (51).

The aim of the third step is to prove that

E {(η_{N})}^{2} \to σ^{2} (U)

, as

N \to \infty

, where

σ^{2} (U)

is the variance of the random variable

V (U)

appearing in Formula (22).

On this way, we will show that the sum of certain part of the terms in a specified representation of the variables

η_{N}

does not affect (in the sense of

L^{2} (Ω, F, P)

) the limit behavior of these variables for growing N. For

y \in Y

and

W_{N} \subset {1, \dots, N}

, where

N \in N

, we introduce the event

B_{W_{N}} (y) : = {ω : \hat{P} (y, ξ (W_{N})) \neq 0},

(53)

where

\hat{P} (y, ξ (W_{N}))

is defined according to (13). Then, in view of the independence of observations

ξ^{(1)}, ξ^{(2)}, \dots

we have

P ({\bar{B}}_{W_{N}} (y)) = P (⋂_{j \in W_{N}} {Y^{(j)} \neq y}) = {(1 - P (Y = y))}^{♯ W_{N}} .

If

ω \in {\bar{B}}_{W_{N}} (y)

then

| \hat{ψ} (y, ξ (W_{N})) - ψ (y) | = ψ (y)

. Set

H_{N} : = \frac{\sqrt{N}}{K} \sum_{k = 1}^{K} \sum_{i = 0}^{2 m - 1} \sum_{i - m < | y | \leq m} ({\hat{ψ}}_{N, k} (y) - ψ (y)) I {B_{N, k} (y)} P (Y = y, | f (X) - y | > i),

where

B_{N, k} (y) : = B_{D_{k} (N)} (y)

and an event

B_{W_{N}} (y)

is introduced by Formula (53). Then

E {(W_{N, 2} - H_{N})}^{2} = E {(\frac{\sqrt{N}}{K} \sum_{k = 1}^{K} \sum_{i = 0}^{2 m - 1} \sum_{i - m < | y | \leq m} \frac{I {{\bar{B}}_{N, k} (y)}}{P (Y = y)} P (Y = y, | f^{U} (X) - y | > i))}^{2}

\leq \frac{N {(2 m)}^{4}}{p_{0}^{2}} max_{y \in Y} {(1 - P (Y = y))}^{[N / K]} \to 0, N \to \infty,

since

♯ D_{k} (N) \geq [N / K]

for

N \in N

,

k = 1, \dots, K

and because all

P (Y = y) > 0

for each

y \in Y

,

[\cdot]

stands for an integer part of a number.

We verify that

H_{N}

for large N is approximated in the space

L^{2} (Ω, F, P)

by the random variable

{\tilde{H}}_{N} : = \frac{\sqrt{N}}{K} \sum_{k = 1}^{K} \sum_{i = 0}^{2 m - 1} \sum_{i - m < | y | \leq m} I {B_{N, k} (y)} (\frac{P (Y = y) - {\hat{p}}_{N, k} (y)}{P {(Y = y)}^{2}}) P (Y = y, | f^{U} (X) - y | > i),

where

{\hat{p}}_{N, k} (y) : = \hat{P} (y, ξ (D_{k} (N)))

and

\hat{P} (y, ξ (W_{N}))

was introduced by (13) for

y \in Y

and

W_{N} \subset {1, \dots, N}

. Evidently,

0 \leq P (Y = y, | f^{U} (X) - y | > i) \leq 1

for all k, i, y and N under consideration. Consequently, it follows that

\begin{matrix} Δ_{N, k} (i, y) & : = & | \sqrt{N} I {B_{N, k} (y)} (\frac{1}{{\hat{p}}_{N, k} (y)} - \frac{1}{P (Y = y)}) P (Y = y, | f^{U} (X) - y | > i) \\ - & \sqrt{N} I {B_{N, k} (y)} (\frac{P (Y = y) - {\hat{p}}_{N, k} (y)}{P {(Y = y)}^{2}}) P (Y = y, | f^{U} (X) - y | > i) | \\ \leq & \sqrt{N} |\frac{P (Y = y) - {\hat{p}}_{N, k} (y)}{P (Y = y)}| |{\hat{ψ}}_{N, k} (y) - \frac{1}{P (Y = y)}| \\ = & \frac{\sqrt{N}}{P (Y = y) \sqrt{♯ D_{k} (N)}} |{\hat{ψ}}_{N, k} (y) - \frac{1}{P (Y = y)}| J_{N}, \end{matrix}

where

J_{N} : = \frac{1}{\sqrt{♯ D_{k} (N)}} \sum_{j \in D_{k} (N)} (I {Y^{(j)} = y} - P (Y^{(j)} = y)) .

For any considered k, i, y and N the Cauchy - Bunyakovsky - Schwarz inequality implies that

E {(Δ_{N, k} (i, y))}^{2} \leq \frac{N}{(P {(Y = y)}^{2} ♯ D_{k} (N)} {(E J_{N}^{4} E {({\hat{ψ}}_{N, k} (y) - \frac{1}{P (Y = y)})}^{4})}^{\frac{1}{2}} .

The Rosenthal inequality (42) yields that

E J_{N}^{4} \leq 2 C (4)

. By means of Lemma 2 (for

t = 4

and multipliers

c (p, t)

with

p = P (Y = y)

), for all considered i, y, k and any

N \in N

we come to the bound

E {(Δ_{N, k} (i, y))}^{2} \leq \frac{N}{(P {(Y = y)}^{2} ♯ D_{k} (N)} \frac{{(2 C (4) c (P (Y = y), 4))}^{\frac{1}{2}}}{\sqrt{N}} .

Therefore,

E {(H_{N} - {\tilde{H}}_{N})}^{2} \to 0

as

N \to \infty

.

Let us define the variable

G_{N}

by formula similar to

{\tilde{H}}_{N}

but without the multiplier

I {B_{N, k} (y)}

. In view of (44) it is easily seen that

E {({\tilde{H}}_{N} - G_{N})}^{2} \leq \frac{N {(2 m)}^{4}}{p_{0}^{4}} max_{y \in Y} {(1 - P (Y = y))}^{[N / K]} (\frac{1}{4}) max_{k = 1, \dots, K} \frac{1}{♯ D_{k} (N)} \to 0, N \to \infty .

Thus

E {(η_{N} - Q_{N})}^{2} \to 0

as

N \to \infty

, where

Q_{N} : = \sqrt{N} (T_{N} (f^{U}) - Err (f^{U})) + G_{N}, N \in N .

Taking into account Formula (6) for the function

f = f^{U}

, we come to the relation

\begin{matrix} Q_{N} & = & \frac{\sqrt{N}}{K} \sum_{k = 1}^{K} \sum_{i = 0}^{2 m - 1} \sum_{i - m < | y | \leq m} \frac{1}{♯ D_{k} (N)} \sum_{j \in D_{k} (N)} (\frac{I {Y^{(j)} = y, | f^{U} (X^{(j)}) - y | > i}}{P (Y = y)} \\ - & \frac{P (Y = y, | f^{U} (X) - y | > i)}{P (Y = y)} + \frac{(P (Y = y) - I {Y^{(j)} = y}) P (Y = y, | f^{U} (X) - y | > i)}{P {(Y = y)}^{2}}) \\ = & \frac{\sqrt{N}}{K} \sum_{k = 1}^{K} \frac{1}{♯ D_{k} (N)} \sum_{j \in D_{k} (N)} V^{(j)}, \end{matrix}

(54)

where, for

j \in N

,

V^{(j)} : = \sum_{i = 0}^{2 m - 1} \sum_{i - m < | y | \leq m} \frac{I {Y^{(j)} = y}}{P (Y = y)} (I {| f^{U} (X^{(j)}) - y | > i) - P (| f^{U} (X) - y | > i | Y = y)) .

(55)

The variables

{V^{(j)}, j \in N}

are centered, i.i.d. and uniformly bounded for all j (clearly,

V^{(j)} = V^{(j)} (U)

). For each

j \in N

, the distributions of

V^{(j)}

and

V (U)

coincide, where

V (U)

is introduced in (22). Thus, one has

var V^{(j)} = var V (U) = σ^{2} (U), j \in N .

(56)

According to the lemma on grouping independent random variables, for each

N \in N

, the variables

\sum_{j \in D_{k} (N)} V^{(j)}

,

k = 1, \dots, K

, are independent. Since

N / ♯ D_{k} (N) \to K

as

N \to \infty

, for

k = 1, \dots, K

, we come to the relation

E (Q_{N}^{2}) = var Q_{N} = \frac{N}{K^{2}} \sum_{k = 1}^{K} \frac{1}{{(♯ D_{k} (N))}^{2}} \sum_{j \in D_{k} (N)} var V^{(j)} = σ^{2} (U) \frac{1}{K^{2}} \sum_{k = 1}^{K} \frac{N}{♯ D_{k} (N)} \to σ^{2} (U),

as

N \to \infty

. Hence

E η_{N}^{2} \to σ^{2} (U)

,

N \to \infty

. The goal of the third step has been achieved.

In view of the above approximations (in

L^{2} (Ω, F, P)

) of the initial random variables

Z_{N}

, introduced by (25), we conclude that

E Z_{N}^{2} \to σ^{2} (U)

, as

N \to \infty

. Namely, we apply the following elementary statement: if

E α_{N}^{2} \to 0

and

E β_{N}^{2} \to σ^{2}

then

E {(α_{N} + β_{N})}^{2} \to σ^{2}

, as

N \to \infty

. Therefore, (24) is established. The proof of Lemma 1 is complete. □

Further we will also employ a result that immediately follows from Theorem 1.

Corollary 1.

Let the conditions of Lemma 1 be satisfied. Then the following relations hold:

\sqrt{N} E ({\hat{Err}}_{K, N, ε_{N}} (f^{U}) - Err (f^{U})) \to 0, N \to \infty,

(57)

var (\sqrt{N} {\hat{Err}}_{K, N, ε_{N}} (f^{U})) \to σ^{2} (U), N \to \infty,

(58)

where

σ^{2} (U)

is a variance of the random variable

V (U)

introduced in (22).

Proof.

Condition (23) implies (20). Thus, according to Theorem 1, we have

Z_{N} \overset{D}{\to} Z \sim N (0, σ^{2} (U)), N \to \infty,

(59)

where

Z_{N}

,

N \in N

, are defined in (25). Due to Lemma 1 one has the uniform integrability of the sequence

{(Z)}_{N \in N}

. Consequently, relation (59) implies (57), i.e.,

E Z_{N} \to E Z = 0

, as

N \to \infty

. Obviously,

var (\sqrt{N} {\hat{Err}}_{K, N, ε_{N}} (f^{U}))

= E {(\sqrt{N} ({\hat{Err}}_{K, N, ε_{N}} (f^{U}) - Err (f^{U}))}^{2} - {(\sqrt{N} E ({\hat{Err}}_{K, N, ε_{N}} (f^{U}) - Err (f^{U})))}^{2} .

Therefore, to obtain (58), it is sufficient to use Lemma 1 and take into account (57). The proof is complete. □

Note that (59) can be obtained directly under conditions of Lemma 1. For each

N \in N

and any

k = 1, \dots, K

, according to Lindeberg’s theorem applied to arrays

{V^{(j)}, j \in D_{k} (N)}

of centered i.i.d. uniformly bounded summands, where a sequence

{(V^{(j)})}_{j \in N}

is introduced in (55), taking into account (56) one has

V_{N, k} : = \frac{1}{\sqrt{♯ D_{k} (N)}} \sum_{j \in D_{k} (N)} V^{(j)} \overset{D}{\to} Z_{k} \sim N (0, σ^{2} (U)), N \to \infty .

(60)

For every

N \in N

, the random variables

V_{N, k}

,

k = 1, \dots, K

, are independent and

var V_{N, k} = σ^{2} (U)

. Since

N / ♯ D_{k} (N) \to K

as

N \to \infty

, for

k = 1, \dots, K

, by virtue of (60) we come to relation

Q_{N} \overset{D}{\to} Z \sim N (0, σ^{2} (U)), N \to \infty,

(61)

where in view of (54) one has

Q_{N} = \frac{1}{K} \sum_{k = 1}^{K} \sqrt{\frac{N}{♯ D_{k} (N)}} V_{N, k}

,

N \in N

. Applying (61) and Slutsky’s lemma, we arrive at (59).

Also note that relation (29) can be easily derived from (36) and (39) without employment of [32].

4. Forward Selection of Relevant Factors

Now we can turn to the sequential selection of factors based on MDR-EFE method. At the first step one searches for

j_{1} \in T

a point where the function

{\hat{E r r}}_{K, N, ε_{N}} (f^{{i}})

attains the minimum over all

i \in T

. If there are several such points, then we take, e.g., one with the smallest index value. Recall that according to (17) (more precisely, after regularization), the random variable

{\hat{E r r}}_{K, N, ε_{N}} (f^{{i}})

is in fact a function of

{\hat{f}}_{P A}^{{i}}

, which is a forecast of the function

f^{{i}}

. Then this procedure is repeated, namely, if at

(k - 1)

-th step the set

S_{k - 1} : = {j_{1}, \dots, j_{k - 1}}

is constructed, where

k \in {2, \dots, r}

, then

j_{k} \in T ∖ S_{k - 1}

is selected at step k in such a way that given

j_{1}, \dots, j_{k - 1}

the function

{\hat{E r r}}_{K, N, ε_{N}} (f^{{S_{k - 1}, i}})

takes the minimum value over

i \in T ∖ S_{k - 1}

for

i = j_{k}

. It is convenient to assume that an empty set is taken at the zero step. Then at each next step one new element is added to the previously constructed sets. If at some step there are several minimum points of the considered function then we take only one of them, e.g., with the minimal index.

Thus, for each

N \in N

the random sets

S_{k} (N) = S_{k} (N, ω) : = {j_{1}, \dots, j_{k}}

arise, where

k = 1, \dots, r

and

j_{m} = j_{m} (N, ω)

,

m = 1, \dots, r

. By construction one can write

j_{k} (N, ω) \in J_{k} (N, ω) : = \arg min_{i \in T ∖ S_{k - 1} (N, ω)} {\hat{E r r}}_{K, N, ε_{N}} (f^{{S_{k - 1} (N, ω), i}}),

where

S_{0} : = ⌀

and

{⌀, i} : = {i}

. In other words the choice

j_{k} (N, ω)

at step k means that, for

i \in T ∖ S_{k - 1} (N, ω)

,

{\hat{E r r}}_{K, N, ε_{N}} (f^{S_{k} (N, ω)}) \leq {\hat{E r r}}_{K, N, ε_{N}} (f^{{S_{k - 1} (N, ω), i}}),

(62)

moreover,

j_{k} (N, ω) = min {i : i \in J_{k} (N, ω)}

,

k = 1, \dots, r

. If the joint distribution of X and Y is known, then instead of the described scheme for constructing random sets,

S_{k} (N, ω)

we turn to considering the non-random “oracle” sets

T_{k} = {i_{1}, \dots, i_{k}}

, where

k = 1, \dots, r

,

i_{k} \in \arg min_{i \in T ∖ T_{k - 1}} Err (f^{{T_{k - 1}, i}}),

(63)

T_{0} : = ⌀

, and the functional

Err

is introduced by formula (2). If there are several

i_{k}

satisfying (63) we take among them that one which has the minimal value.

For

k \in {1, \dots, r}

and

i \in T ∖ T_{k}

introduce

C_{k, i} : = Err (f^{{T_{k - 1}, i}}) - Err (f^{T_{k}}) .

By construction of the sets

T_{k}

we have

C_{k, i} \geq 0

, where

k = 1, \dots, r

and

i \in T ∖ T_{k}

. We call a model, satisfying condition (1), regular whenever the following relation is true:

C_{k, i} > 0, k = 1, \dots, r, i \in T ∖ T_{k} .

(64)

In other words, for each

k = 1, \dots, r

, a point

i_{k}

in (63) is determined uniquely. Further we employ the penalty function introduced in (11). We also use its strongly consistent estimate of type (48) with

{\hat{p}}_{N} : = \frac{1}{W_{N}} \sum_{j \in W_{N}} I {Y^{(j)} = y},

(65)

W_{N} \subset {1, \dots, N}

and

♯ W_{N} \to \infty

as

N \to \infty

.

Theorem 2.

Let the considered model (1) with a collection of relevant factors having cardinality

r < p

, be regular, i.e., let (64) take place. Then, for the random sets

S_{r} (N)

introduced above, the following relation is valid

P (S_{r} (N) = T_{r}) \to 1, N \to \infty,

(66)

where

T_{r}

is defined by means of (63) for

k = 1, \dots, r

. In other words, with probability close to one, the described procedure of forward selection based on statistical estimates of the error functional leads to the “oracle” collection

T_{r}

, when N is large enough.

Proof.

For a random set

S_{r} (N, ω) = {j_{1} (N, ω), \dots, j_{r} (N, ω)}

, where

j_{k} (N, ω)

is an element taken at k-th step, one has

P (ω : S_{r} (N, ω) = T_{r}) \geq P (ω : j_{1} (N, ω) = i_{1}, \dots, j_{r} (N, ω) = i_{r}) .

Note that

P (ω : j_{1} (N, ω) = i_{1}, \dots, j_{r} (N, ω) = i_{r}) \geq P (⋂_{k = 1}^{r} A_{k} (N)),

where

A_{k} (N) : = ⋂_{i \in T ∖ T_{k - 1}} \{{\hat{E r r}}_{K, N, ε_{N}} (f^{T_{k}}) < {\hat{E r r}}_{K, N, ε_{N}} (f^{{T_{k - 1}, i}})\},

k = 1, \dots, r

. Thus, we obtain:

P (⋂_{k = 1}^{r} A_{k} (N)) = 1 - P (⋃_{k = 1}^{r} {\bar{A}}_{k} (N)) \geq 1 - \sum_{k = 1}^{r} P ({\bar{A}}_{k} (N))

\geq 1 - \sum_{k = 1}^{r} \sum_{i \in T ∖ T_{k - 1}} P ({\hat{E r r}}_{K, N, ε_{N}} (f^{T_{k}}) \geq {\hat{E r r}}_{K, N, ε_{N}} (f^{{T_{k - 1}, i}})),

(67)

where, as usual,

\bar{A} : = Ω ∖ A

for

A \subset Ω

. Then, for

k = 1, \dots, r

,

i \in T ∖ T_{k - 1}

and

N \in N

, we get

\begin{matrix} Δ_{k, i} (N) & : = & {\hat{E r r}}_{K, N, ε_{N}} (f^{T_{k}}) - {\hat{E r r}}_{K, N, ε_{N}} (f^{{T_{k - 1}, i}}) \\ = & ({\hat{E r r}}_{K, N, ε_{N}} (f^{T_{k}}) - E {\hat{E r r}}_{K, N, ε_{N}} (f^{T_{k}})) + (E {\hat{E r r}}_{K, N, ε_{N}} (f^{T_{k}}) - Err (f^{T_{k}})) \\ + & (Err (f^{T_{k}}) - Err (f^{{T_{k - 1}, i}})) + (Err (f^{{T_{k - 1}, i}}) - E {\hat{E r r}}_{K, N, ε_{N}} (f^{{T_{k - 1}, i}})) \\ + & (E {\hat{E r r}}_{K, N, ε_{N}} (f^{{T_{k - 1}, i}}) - {\hat{E r r}}_{K, N, ε_{N}} (f^{{T_{k - 1}, i}})) . \end{matrix}

(68)

For

U \subset T

, set

Z_{N} (U) : = {\hat{E r r}}_{K, N, ε_{N}} (f^{U}) - E {\hat{E r r}}_{K, N, ε_{N}} (f^{U}) .

For any

k = 1, \dots, K

,

i \in T ∖ T_{k - 1}

and each

δ \in (0, 1)

in light of formula (57) of Corollary 1, for all N large enough (

N \geq N_{2} (δ, k, i)

) it holds

P (Δ_{k, i} (N) \geq 0) \leq P (\sqrt{N} | Z_{N} (T_{k} (N)) | + \sqrt{N} | Z_{N} ({T_{k - 1} (N), i}) | \geq \sqrt{N} C_{k, i} - δ)

\leq P (\sqrt{N} | Z_{N} (T_{k} (N)) | \geq \frac{(1 - δ) \sqrt{N} C_{k, i}}{2}) + P (| Z_{N} ({T_{k - 1} (N), i}) | \geq \frac{(1 - δ) \sqrt{N} C_{k, i}}{2}),

where

C_{k, i}

are introduced in (66),

Δ_{k, i} (N)

is defined by (68).

Applying the Bienaymé - Chebyshev inequality and taking into account Formula (58) of Corollary 1, for each

U \subset T

and any

c > 0

, we come, for a centered random variable

Z_{N} (U)

, to the relation

P (\sqrt{N} | Z_{N} (U) | \geq c \sqrt{N}) \leq \frac{N var Z_{N} (U)}{N c^{2}} \sim \frac{var V (U)}{N c^{2}}, N \to \infty,

(69)

where

V (U)

is determined by Formula (22). According to (64), for

k \in {1, \dots, r}

and

i \in T ∖ T_{k}

, one has

C_{k, i} > 0

. Therefore, for all N large enough (

N \geq N_{3} (δ, k, i)

), the following inequality takes place:

P (Δ_{k, i} (N) \geq 0) \leq \frac{4 (var V (T_{k}) + var V ({T_{k - 1}, i})}{N {(1 - δ)}^{2} C_{k, i}^{2}} .

(70)

For a fixed

m \in N

, one can change the summation order over i and y to write Formula (22) as follows:

V (U) = \sum_{y = - m}^{m} \frac{I {Y = y}}{P (Y = y)} W (y, U),

where

W (y, U) = \sum_{0 \leq i < | y | + m} (I {| f^{U} (X) - y | > i} - P (| f^{U} (X) - y | > i | Y = y)) .

(71)

Thus, for any

U \subset T

, one has

| V (U) | \leq 2 m \sum_{y = - m}^{m} \frac{I {Y = y}}{P (Y = y)} .

Consequently, we come to the inequality

var V (U) \leq E V^{2} (U) \leq 4 m^{2} \sum_{y = - m}^{m} \frac{1}{P (Y = y)} = : a,

where

a = a (m, {(P (Y = y))}_{y \in Y})

. We see that

var V (T_{k}) + var V ({T_{k - 1}, i}) \leq 2 a

for all

k \in {1, \dots, r}

,

i \in T ∖ T_{k - 1}

and

N \in N

. For each

δ \in (0, 1)

, any

k \in {1, \dots, r}

,

i \in T ∖ T_{k - 1}

and all N large enough, we get the following bound:

P (Δ_{k, i} (N) \geq 0) \leq \frac{8 a}{N {(1 - δ)}^{2} C_{k, i}^{2}} .

Hence, for each

δ \in (0, 1)

and all N large enough, by virtue of (67) the following inequality holds:

P (S_{r} (N) = T_{r}) \geq 1 - \frac{8 a r}{N {(1 - δ)}^{2} C_{0}^{2}} (p + 1 - \frac{r + 1}{2}),

(72)

where

C_{0}^{2} : = {min}_{k = 1, \dots, r, i \in T ∖ T_{k - 1}} C_{k, i}^{2} > 0

according to (64). Thus relation (72) implies the validity of (66). □

Now note that according to (69) the following relation is true:

P (\sqrt{N} | Z_{N} (U) | \geq c \sqrt{N}) = O (\frac{1}{N}), N \to \infty .

(73)

The question arises whether this probability decreases like

C / N

where C is a positive constant or more rapidly. The answer depends on the variance of the random variable

V (U)

given by Formula (22). In view of (70) we will determine when the variable

V (U)

is degenerate, i.e., equal to a constant a.s. This is also of independent interest for the CLT established in Section 6 of [32] and given above as Theorem 1. The following result provides a simple characterization of the

V (U)

degeneracy.

Lemma 3.

For an arbitrary set

U \subset T

, the variance of the random variable

V (U)

, appearing in Formula (22), is zero if and only if, for every

y \in Y

, there is

k_{0} (y) \in {0, \dots, m + | y |}

such that

P (| f^{U} (X) - y | = k_{0} (y), Y = y) = P (Y = y) .

(74)

Thus, for each

y \in Y

, on the set

{Y = y}

the random variable

f^{U} (X)

does not necessarily take a constant value. Moreover, the values of

k_{0} (y)

need not coincide for different y.

Proof.

For

y = 0, \dots, m

and a random variable

W (y, U)

, introduced by Formula (71), one can write

\begin{matrix} W (y, U) & = & \sum_{0 \leq i < y + m} (I {| f^{U} (X) - y | > i} - P (| f^{U} (X) - y | > i | Y = y)) \\ = & \sum_{0 \leq i < y + m} \sum_{i < k \leq m + y} (I {| f^{U} (X) - y | = k} - P (| f^{U} (X) - y | = k | Y = y)) \\ = & \sum_{k = 1}^{m + y} \sum_{i = 0}^{k - 1} (I {| f^{U} (X) - y | = k} - P (| f^{U} (X) - y | = k | Y = y)) \\ = & \sum_{k = 1}^{m + y} k (I {| f^{U} (X) - y | = k} - P (| f^{U} (X) - y | = k | Y = y)) \\ = & \sum_{k = 1}^{m + y} k I {| f^{U} (X) - y | = k} - E (| f^{U} (X) - y | | Y = y) . \end{matrix}

In a similar way we consider

y = - m, \dots, - 1

. Thus, for all

y \in Y

, one gets

W (y, U) = \sum_{k = 1}^{m + | y |} k I {| f^{U} (X) - y | = k} - E (| f^{U} (X) - y | | Y = y) .

Recall that

P (Y = y) > 0

for all

y \in Y

. If, for some

y, k, j \in Y

,

k \neq j

, we have

P (| f^{U} (X) - y | = k, Y = y) > 0, P (| f^{U} (X) - y | = j, Y = y) > 0,

then on the events

{| f^{U} (X) - y | = k, Y = y}

and

{| f^{U} (X) - y | = j, Y = y}

the variable

W (y, U)

takes different values. Therefore,

V (U)

takes different values on these events. Hence

var V (U) > 0

, if (74) is not valid. Thus (74) is a necessary condition to guarantee that

var V (U) = 0

. Suppose now that, (74) holds. In this case we get

E (| f^{U} (X) - y | | Y = y) = k_{0} (y), y \in Y .

Clearly,

k_{0} (y)

depends on U as well. We see that

V (U)

on each set

{Y = y}

takes (up to the set of measure zero) the value

\frac{1}{P (Y = y)} (k_{0} (y) - k_{0} (y)) = 0

,

y \in Y

. Therefore,

var V (U) = 0

. Note that

k_{0} (y)

need not coincide for different

y \in Y

. The proof is complete. □

5. Concluding Remarks

The established asymptotical result (Theorem 2) is rather qualitative in nature, since relation (66) assumes increasing values of N. Relation (72) is more precise. However, (72) demonstrates that, loosely speaking, one has to employ

N > > r p

. As previously, we assume that assumption (A), introduced on page 2, is valid. Evidently, the sequential choice of relevant variables based on statistical estimators of the error functional (of response approximation), is attractive for implementation, although suboptimal. In this regard Theorem 2 shows that under certain conditions, forward (random) selection with a high probability leads to the same collection of factors, which is provided by the sequential procedure with known joint distribution of the vector of factors X and the response Y. In the future work, it would be reasonable to supplement the theoretical results by computer simulations (see, e.g., [45]).

Consideration of the proximity of the results of optimal and suboptimal procedures requires a separate study. In addition, we note that within the framework of linear models, estimates of the probability of correct identification of relevant factors are considered, e.g., in [46,47]. Theorem 2 does not assume the linearity of stochastic model. Presumably for the first time, in our work a forward selection of relevant factors affecting the non-binary random response is treated on the base of MDR-EFE method. It would be interesting to extend the conditions allowing to establish relation (66). Moreover, stability problems of FS deserve special attention, see, e.g., [48,49,50]. Algorithms stability for classification problems in the framework of random trees is treated in [51].

Finally, we emphasize that the problem of statistical estimation of the cardinality of a set of relevant factors appearing in definition (1) is very important and complex. Along with dealing with the deterministic number of selected factors, there is a research approach based on developing the rules for stopping the procedures used to identify the relevant set. In this regard, we indicate, e.g., article [52], dedicated to information methods for selecting relevant factors. The study of non-discrete stochastic models is also of undoubted interest, see, e.g., [53].

Further it would be interesting to study other functionals than (2) to measure the quality of a response approximation by means of functions defined on various collections of factors. One can also consider a random number of observations. In this regard we refer, e.g., to [27,54].

Funding

This research received no external funding.

Data Availability Statement

Data are contained within the article.

Acknowledgments

The author is very grateful to the Reviewers for careful reading the manuscript and making valuable remarks and suggestions. He would also like to thank Alexander Tikhomirov for invitation to present manuscript for this issue.

Conflicts of Interest

The author declares no conflict of interest.

Appendix A. Proof of Lemma 2

Proof.

For any

t \in N

and

p \in (0, 1)

, one has

\begin{matrix} E {({\hat{ψ}}_{N})}^{t} & = & N^{t} \sum_{j = 1}^{N} \frac{1}{j^{t}} (\binom{N}{j}) p^{j} {(1 - p)}^{N - j} \\ = & \frac{N^{t}}{p^{t} (N + 1) \dots (N + t)} \sum_{j = 1}^{N} \frac{(j + 1) \dots (j + t)}{j^{t}} (\binom{N + t}{j + t}) p^{j + t} {(1 - p)}^{(N + t) - (j + t)} \\ = & \frac{1}{p^{t}} (1 + h_{t} (N)) \sum_{i = t + 1}^{N + t} (1 + \frac{a_{1}}{i - t} + \dots + \frac{a_{t}}{{(i - t)}^{t}}) (\binom{N + t}{i}) p^{i} {(1 - p)}^{N + t - i}, \end{matrix}

where

h_{t} (N) = O (1 / N)

, as

N \to \infty

, and

a_{1}, \dots, a_{t} \in N

. We do not use the explicit formulas

a_{1} = t (t + 1) / 2, \dots, a_{t} = t!

. Note that

\sum_{i = t + 1}^{N + t} (\binom{N + t}{i}) p^{i} {(1 - p)}^{N + t - i} = 1 - \sum_{i = 0}^{t} (\binom{N + t}{i}) p^{i} {(1 - p)}^{N + t - i} = 1 - g_{t} (N),

where

g_{t} (N) : = \sum_{i = 0}^{t} g_{t, i} (N)

and, for

i = t + 1, \dots, N + t

, one has

0 \leq g_{t, i} (N) : = (\binom{N + t}{i}) p^{i} {(1 - p)}^{N + t - i} \leq {(N + t)}^{t} {(1 - p)}^{N} = O (1 / N), N \to \infty .

For each

k = 1, \dots, t

, introduce

q_{t, k} (N) : = \sum_{i = t + 1}^{N + t} \frac{1}{{(i - t)}^{k}} (\binom{N + t}{i}) p^{i} {(1 - p)}^{N + t - i}

= \frac{1}{p^{k} (N + t + 1) \dots (N + t + k)} \sum_{i = t + 1}^{N + t} \frac{(i + 1) \dots (i + k)}{{(i - t)}^{k}} (\binom{N + t + k}{i + k}) p^{i + k} {(1 - p)}^{(N + t + k) - (i + k)} .

Obviously, one can write

q_{t, k} (N) = O (1 / N^{k})

, as

(i + 1) \dots (i + k) {(i - t)}^{- k} \leq {(1 + t + k)}^{k} \leq {(1 + 2 t)}^{t}

for all

i \geq t + 1

,

k = 1, \dots, t

, and since

\sum_{i = t + 1}^{N + t} (\binom{N + t + k}{i + k}) p^{i + k} {(1 - p)}^{(N + t + k) - (i + k)} \leq 1 .

Consequently, for any

t \in N

, we get

E {({\hat{ψ}}_{N})}^{t} = \frac{1}{p^{t}} (1 + h_{t} (N)) (1 - g_{t} (N) + \sum_{k = 1}^{t} q_{t, k} (N)) = \frac{1}{p^{t}} + R_{t} (N),

where

R_{t} (N) = O (1 / N)

, as

N \to \infty

. Evidently,

E {({\hat{ψ}}_{N})}^{0} = 1

for

N \in N

. For each

N \in N

, set

R_{0} (N) = 0

. Thus, for

t \in N

, one has

E {({\hat{ψ}}_{N} - \frac{1}{p})}^{t} = \sum_{v = 0}^{t} (\binom{t}{v}) E {({\hat{ψ}}_{N})}^{v} {(- \frac{1}{p})}^{t - v} = \sum_{v = 0}^{t} (\binom{t}{v}) (\frac{1}{p^{v}} + R_{v} (N))) {(- \frac{1}{p})}^{t - v} = O (\frac{1}{N}),

because

\sum_{v = 0}^{t} (\binom{t}{v}) {(\frac{1}{p})}^{v} {(- \frac{1}{p})}^{t - v} = {(\frac{1}{p} - \frac{1}{p})}^{t} = 0, \sum_{v = 0}^{t} (\binom{t}{v}) {(\frac{1}{p})}^{t - v} = {(1 + \frac{1}{p})}^{t}

and

max_{v = 0, \dots, t} | R_{v} (N) | = O (1 / N), N \to \infty .

The proof of Lemma 2 is complete. □

References

Seber, G.A.F.; Lee, A.J. Linear Regression Analysis, 2nd ed.; J.Wiley and Sons Publication: Hoboken, NJ, USA, 2003. [Google Scholar]
Györfi, L.; Kohler, M.; Krzyz˙ak, A.; Walk, H. A Distribution-Free Theory of Nonparametric Regression; Springer: New York, NY, USA, 2002. [Google Scholar]
Matloff, N. Statistical Regression and Classification. From Linear Models to Machine Learning; CRC Press: Boca Raton, FL, USA, 2017. [Google Scholar]
Tibshirani, R. Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B Methodol. 1996, 58, 267–288. [Google Scholar] [CrossRef]
Hastie, T.; Tibshirani, R.; Wainwrigth, R. Statistical Learning with Sparsity. The Lasso and Generalizations; CRC Press: Boca Raton, FL, USA, 2015. [Google Scholar]
Bolón-Candedo, V.; Alonso-Betanzos, A. Recent Advances in Ensembles for Feature Selection; Springer: Cham, Switzerland, 2018. [Google Scholar]
Giraud, C. Introduction to High-Dimensional Statistics; CRC Press: Boca Raton, FL, USA, 2015. [Google Scholar]
Stańczyk, U.; Zielosko, B.; Jain, L.C. (Eds.) Advances in Feature Selection for Data and Pattern Recognition; Springer International Publishing AG: Cham, Switzerland, 2018. [Google Scholar]
Kuhn, M.; Johnson, K. Feature Engineering and Selection. A Practical Approach for Predictive Models; CRC Press: Boca Raton, FL, USA, 2020. [Google Scholar]
Chandrashekar, G.; Sahin, F. A survey on feature selection methods. Comput. Electr. Eng. 2014, 40, 16–28. [Google Scholar] [CrossRef]
Jia, W.; Sun, M.; Lian, J.; Hou, S. Feature dimensionality reduction: A review. Complex Intell. Syst. 2022, 8, 2663–2693. [Google Scholar] [CrossRef]
Lyu, Y.; Feng, Y.; Sakurai, K. A survey on feature selection techniques based on filtering methods for cyber attack detection. Information 2023, 14, 191. [Google Scholar] [CrossRef]
Pradip, D.; Chandrashekhar, A. A comprehensive survey on feature selection in the various fields of machine learning. Appl. Intell. 2023, 52, 4543–4581. [Google Scholar]
Htun, H.H.; Biehl, M.; Petkov, N. Survey of feature selection and extraction techniques for stock market prediction. Financ. Innov. 2023, 9, 26. [Google Scholar] [CrossRef] [PubMed]
Laborda, J.; Ryoo, S. Feature Selection in a Credit Scoring Model. Mathematics 2021, 9, 746. [Google Scholar] [CrossRef]
Emily, M. A survey of statistical methods for gene-gene interaction in case-control genomewide association studies. J. Société Fr. Stat. 2018, 159, 27–67. [Google Scholar]
Tsunoda, T.; Tanaka, T.; Nakamura, Y. (Eds.) Genome-Wide Association Studies; Springer: Singapore, 2019. [Google Scholar]
Luque-Rodriguez, M.; Molina-Baena, J.; Jimenez-Vilchez, A.; Arauzo-Azofra, A. Initialization of feature selection search for classification. J. Artif. Intell. Res. 2022, 75, 953–998. [Google Scholar] [CrossRef]
Pudjihartono, N.; Fadason, T.; Kempa-Liehr, A.W.; O’Sullivan, J.M. A review of feature selection methods for machine learning-based disease risk prediction. Front. Bioinform. 2022, 2, 927312. [Google Scholar] [CrossRef]
Coelho, F.; Braga, A.P.; Verleysen, M.A. Mutual information estimator for continuous and discrete variables applied to feature selection and classification problems. Int. J. Comput. Intell. Syst. 2016, 9, 726–733. [Google Scholar] [CrossRef]
Kozhevin, A.A. Feature selection based on statistical estimation of mutual information. Sib. Elektron. Mat. Izv. 2021, 18, 720–728. [Google Scholar] [CrossRef]
Latt, K.Z.; Honda, K.; Thiri, M.; Hitomi, Y.; Omae, Y.; Sawai, H.; Kawai, Y.; Teraguchi, S.; Ueno, K.; Nagasaki, M.; et al. Identification of a two-SNP PLA2R1 haplotype and HLA-DRB1 alleles as primary risk associations in idiopathic membranous nephropathy. Sci. Rep. 2018, 8, 15576. [Google Scholar] [CrossRef] [PubMed]
Vergara, J.R.; Estévez, P.A. A review of feature selection methods based on mutual information. Neural Comput. Applic 2014, 24, 175–186. [Google Scholar] [CrossRef]
AlNuaimi, N.; Masud, M.M.; Serhani, M.A.; Zaki, N. Streaming feature selection algorithms for big data: A survey. Appl. Comput. Inform. 2022, 18, 113–135. [Google Scholar] [CrossRef]
Ritchie, M.D.; Hahn, L.W.; Roodi, N.; Bailey, L.R.; Dupont, W.D.; Parl, F.F.; Moore, J.H. Multifactor-dimensionality reduction reveals high-order interactions among estrogen-metabolism genes in sporadic breast cancer. Am. J. Human Genet. 2001, 69, 138–147. [Google Scholar] [CrossRef]
Gola, D.; John, J.M.M.; van Steen, K.; Kónig, I.R. A roadmap to multifactor dimensionality reduction methods. Briefings Bioinform. 2016, 17, 293–308. [Google Scholar] [CrossRef]
Bulinski, A.; Kozhevin, A. New version of the MDR method for stratified samples. Stat. Optim. Inf. Comput. 2017, 5, 1–18. [Google Scholar] [CrossRef]
Abegaz, F.; van Lishout, F.; Mahachie, J.J.M.; Chiachoompu, K.; Bhardwaj, A.; Duroux, D.; Gusareva, R.S.; Wei, Z.; Hakonarson, H.; Van Steen, K. Performance of model-based multifactor dimensionality reduction methods for epistasis detection by controlling population structure. BioData Min. 2021, 14, 16. [Google Scholar] [CrossRef]
Yang, C.H.; Hou, M.F.; Chuang, L.Y.; Yang, C.S.; Lin, Y.D. Dimensionality reduction approach for many-objective epistasis analysis. Briefings Bioinform 2023, 24, bbac512. [Google Scholar] [CrossRef]
Bulinski, A.; Butkovsky, O.; Sadovnichy, V.; Shashkin, A.; Yaskov, P.; Balatskiy, A.; Samokhodskaya, L.; Tkachuk, V. Statistical Methods of SNP Data Analysis and Applications. Open J. Stat. 2012, 2, 73–87. [Google Scholar] [CrossRef]
Bulinski, A. On foundation of the dimensionality reduction method for explanatory variables. J. Math. Sci. 2014, 199, 113–122. [Google Scholar] [CrossRef]
Bulinski, A.V.; Rakitko, A.S. MDR method for nonbinary response variable. J. Multivar. Anal. 2015, 135, 25–42. [Google Scholar] [CrossRef]
Macedo, F.; Oliveira, M.R.; Pacheco, A.; Valadas, R. Theoretical Foundations of Forward Feature Selection Methods based on Mutual Information. Neurocomputing 2019, 325, 67–89. [Google Scholar] [CrossRef]
Bulinski, A.V. On relevant feature selection based on information theory. Theory Probab. Its Appl. 2023, 68, 392–410. [Google Scholar] [CrossRef]
Rakitko, A. MDR-EFE method with forward selection. In Proceedings of the The 5th International Conference on Stochastic Methods (ICSM-5), Moscow, Russia, 23–27 November 2020. [Google Scholar] [CrossRef]
Velez, D.R.; White, B.C.; Motsinger, A.A.; Bush, W.S.; Ritchie, M.D.; Williams, S.M.; Moore, J.H. A balanced accuracy function for epistasis modeling in imbalanced datasets using multifactor dimensionality reduction. Genet. Epidemiol. 2007, 31, 306–315. [Google Scholar] [CrossRef] [PubMed]
Hu, T.-C.; Moricz, F.; Taylor, R. Strong laws of large numbers for arrays of rowwise independent random variables. Acta Math. Hung. 1989, 54, 153–162. [Google Scholar] [CrossRef]
Arlot, S.; Celisse, A. A survey of cross-validation procedures for model selection. Stat. Surv. 2010, 4, 40–79. [Google Scholar] [CrossRef]
Billingsley, P. Convergence of Probability Measures; John Wiley and Sons: New York, NY, USA, 1968. [Google Scholar]
Borkar, V.S. Probability Theory: An Advanced Course; Springer: New York, NY, USA, 1995. [Google Scholar]
Bulinski, A.V.; Shiryaev, A.N. Theory of Stochastic Processes, 2nd ed.; Fizmatlit: Moscow, Russia, 2005. (In Russian) [Google Scholar]
Kallenberg, O. Foundations of Modern Probability; Springer: New York, NY, USA, 1997. [Google Scholar]
Petrov, V.V. Limit Theorems of Probability Theory: Sequences of Independent Random Variables; Clarendon Press: Oxford, UK, 1995. [Google Scholar]
Shevtsova, I.G. On absolute constants in the Berry-Esseen inequality and its structural and non-uniform refinements. Informatics Its Appl. 2013, 7, 124–125. [Google Scholar]
Bulinski, A.V.; Rakitko, A.S. Simulation and analytical approach to the identification of significant factors. Commun. Stat.-Simul. Comput. 2016, 45, 1430–1450. [Google Scholar] [CrossRef]
Shah, R.D.; Samworth, R.J. Variable selection with error control: Another look at stablity selection. J. R. Statist. Soc. B. 2012, 74, 1–26. [Google Scholar] [CrossRef]
Beinrucker, A.; Dogan, U.; Blanchard, G. Extensions of stability selection using subsamples of observations and covariates. Stat. Comput. 2016, 26, 1059–1077. [Google Scholar] [CrossRef]
Nogueira, S.; Sechidis, K.; Brown, G. On the stability of feature selection algorithms. J. Mach. Learn. Res. 2018, 18, 1–54. [Google Scholar]
Khaire, U.M.; Dhanalakshmi, R. Stability of feature selection algorithm: A review. J. King Saud Univ.-Comput. Inf. Sci. 2022, 34, 1060–1073. [Google Scholar] [CrossRef]
Bulinski, A. Stability properties of feature selection measures. Theory Probab. Appl. 2024, 69, 3–15. [Google Scholar]
Bénard, C.; Biau, G.; Da Veiga, S.; Scornet, E. SIRUS: Stable and Interpretable RUle Set for classification. Electron. J. Statist. 2021, 15, 427–505. [Google Scholar] [CrossRef]
Mielniczuk, J. Information theoretic methods for variable selection—A review. Entropy 2022, 24, 1079. [Google Scholar] [CrossRef]
Linke, Y.; Borisov, I.; Ruzankin, P.; Kutsenko, V.; Yarovaya, E.; Shalnova, S. Universal Local Linear Kernel Estimators in Nonparametric Regression. Mathematics 2022, 10, 2693. [Google Scholar] [CrossRef]
Rachev, S.T.; Klebanov, L.B.; Stoyanov, S.V.; Fabozzi, F.J. The Methods of Distances in the Theory of Probability and Statistics; Springer: New York, NY, USA, 2013. [Google Scholar]

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2024 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Bulinski, A. Forward Selection of Relevant Factors by Means of MDR-EFE Method. Mathematics 2024, 12, 831. https://doi.org/10.3390/math12060831

AMA Style

Bulinski A. Forward Selection of Relevant Factors by Means of MDR-EFE Method. Mathematics. 2024; 12(6):831. https://doi.org/10.3390/math12060831

Chicago/Turabian Style

Bulinski, Alexander. 2024. "Forward Selection of Relevant Factors by Means of MDR-EFE Method" Mathematics 12, no. 6: 831. https://doi.org/10.3390/math12060831

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

Forward Selection of Relevant Factors by Means of MDR-EFE Method

Abstract

1. Introduction

2. Error Functional Estimators

3. Asymptotic Behavior of the First Two Moments of Statistical Estimators of the Error Functional

4. Forward Selection of Relevant Factors

5. Concluding Remarks

Funding

Data Availability Statement

Acknowledgments

Conflicts of Interest

Appendix A. Proof of Lemma 2

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI