Article

Forward Selection of Relevant Factors by Means of MDR-EFE Method

by
Alexander Bulinski
Faculty of Mathematics and Mechanics, Lomonosov Moscow State University, Leninskie Gory 1, 119991 Moscow, Russia
Mathematics 2024, 12(6), 831; https://doi.org/10.3390/math12060831
Submission received: 20 January 2024 / Revised: 5 March 2024 / Accepted: 8 March 2024 / Published: 12 March 2024
(This article belongs to the Special Issue New Trends in Stochastic Processes, Probability and Statistics)

Abstract

The suboptimal procedure under consideration, based on the MDR-EFE algorithm, provides sequential selection of factors that are relevant (in a specified sense) to the studied response, which is, in general, non-binary. The model is not assumed to be linear, and the joint distribution of the factors vector and the response is unknown. The set of relevant factors has specified cardinality. It is proved that under certain conditions the mentioned forward selection procedure gives a random set of factors that asymptotically (with probability tending to one as the number of observations grows to infinity) coincides with the "oracle" one. The latter means that the random set obtained with this algorithm approximates the collection of features that would be identified if the joint distribution of the features vector and the response were known. For this purpose, statistical estimators of the prediction error functional of the studied response are proposed. They involve a new version of regularization. This permits us to guarantee not only the central limit theorem for the normalized estimators, but also to find the convergence rate of their first two moments to the corresponding moments of the limiting Gaussian variable.

1. Introduction

This paper is dedicated to the eminent scientist Professor A.S. Holevo, academician of the Russian Academy of Sciences, on the occasion of his remarkable birthday.
The classical problem of regression analysis consists in the search for a deterministic function f which, in a certain sense, "well" approximates the observed random variable (response) Y by the value $f(X)$, where $X = (X_1, \ldots, X_p)$ is a vector of factors influencing the behavior of Y. This approach was initiated by the works of A.-M. Legendre and C. F. Gauss. At that time it found application in the processing of astronomical observations. Nowadays, methods involving an appropriate choice of unknown real coefficients $\beta_1, \ldots, \beta_p$ for a linear model of the form $Y = \sum_{i=1}^{p} \beta_i X_i + \varepsilon$, where $\varepsilon$ describes a random error, are widely used. Clearly, $X_0 = 1$ can be included in the collection of factors; then $Y = \beta_0 + \sum_{i=1}^{p} \beta_i X_i + \varepsilon$. For example, books [1,2] are devoted to regression. Close tasks also arise in the classification of observations, see, e.g., [3].
Since the end of the 20th century, stochastic models have been studied in which the random response Y depends only on some subset of the factors $X_1, \ldots, X_p$. Thus, in article [4], the LASSO method (Least Absolute Shrinkage and Selection Operator) was introduced, employing the idea of regularization (going back to A. N. Tikhonov), which allowed one to find the factors entering a "sparse" linear model with non-zero coefficients. Somewhat earlier, this approach was used by several authors for the treatment of geophysical data. Generalizations of the mentioned method are considered in monograph [5]. We emphasize that the idea of identifying the factors having a principal (in a certain sense) impact on a response is also being intensely developed within the framework of nonlinear models. This direction of modern mathematical statistics is called Feature Selection (FS), i.e., the choice of features (variables, factors). In this regard, we refer, e.g., to monographs [6,7,8,9] and also to reviews [10,11,12,13,14]. In [10] the authors consider filter, wrapper and embedded methods of FS. They concentrate on feature elimination and also demonstrate the application of the FS technique to standard datasets. In [11] the modern mainstream dimensionality reduction methods are analyzed, including ones for small samples and those based on deep learning. In [12] FS machinery based on filtering methods is considered for detecting cyber attacks. Survey [13] is devoted to FS methods in machine learning (the structured information is contained in 20 tables). The authors of [14] concentrate on applications of FS to stock market prediction, and applications of FS in the analysis of credit risks are considered, e.g., in [15]. Beyond financial mathematics, the choice of relevant factors is very important in medicine and biology. For instance, in the field of genetic data analysis there is an extensive research area called GWAS (Genome-Wide Association Studies) aimed at studying the relationships between phenotypes and genotypes, see, e.g., [16,17]. The authors of [18] provide a survey of starting methods used by genetic algorithms. Review [19] is devoted to FS methods for predicting the risk of diseases. Thus, research in the field of FS is not only of theoretical interest, but also admits various applications.
Note that there are a number of complementary methods for identifying relevant factors. Much attention is paid to those employing the basic concepts of information theory such as entropy, mutual information, conditional mutual information, interaction information, various divergences, etc. Here statistical estimation of information characteristics plays an important role. One can mention, e.g., works [20,21]. In this article, the accent is made on identifying a set of relevant factors in the framework of a certain stochastic model, when the quality of the response approximation is evaluated by means of some metric.
Recall that J. B. Herrick described sickle cell anemia (HbS) in 1910. Later it was discovered that all clinical manifestations of the presence of HbS are consequences of a single change in the β-globin gene. This famous example shows that even the search for a single feature having an impact on a disease is reasonable. Nowadays researchers concentrate on complex diseases provoked by several disorders of the human genome. Even the identification of two SNPs (single nucleotide polymorphisms) having an impact on a certain disease is of interest, see, e.g., [22].
Now we turn to the description of the studied mathematical model. All the considered random variables are defined on a probability space $(\Omega, \mathcal{F}, \mathsf{P})$. Let a random variable Y map $\Omega$ to some finite set $\mathcal{Y}$. We assume that, for $k \in T := \{1, \ldots, p\}$, a random variable $X_k: \Omega \to M_k$, where $M_k$ is an arbitrary finite set. Then the vector $X = (X_1, \ldots, X_p)$ takes values in $\mathcal{X} = M_1 \times \cdots \times M_p$. For a set $S = \{i_1, \ldots, i_r\}$, where $1 \le i_1 < \cdots < i_r \le p$, we put $X_S := (X_{i_1}, \ldots, X_{i_r})$. Similarly, for $x \in \mathcal{X}$, $x_S$ denotes the vector $(x_{i_1}, \ldots, x_{i_r})$. A collection of indices $S \subset T$ (the symbol $\subset$ is everywhere understood as non-strict inclusion) is called relevant if the following relation holds for any $x \in \mathcal{X}$ and $y \in \mathcal{Y}$:
$$\mathsf{P}(Y = y \mid X = x) = \mathsf{P}(Y = y \mid X_S = x_S),$$
whenever $\mathsf{P}(Y = y \mid X = x) \ne 0$. In this case, the set of factors $X_S$ is called relevant as well. If (1) takes place for some $S = S_0$, then it is obviously valid for any $S$ containing $S_0$. Therefore, the natural desire is to identify a set $S$ that satisfies (1) and has cardinality $r < p$ (if such a set other than $T$ exists). Note that there are different definitions of the relevant factors collection, see, e.g., [23,24] and the references therein.
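To make definition (1) concrete, the following minimal Python sketch builds a hypothetical joint distribution in which Y depends on X only through its first coordinate and verifies relation (1) numerically; all names and the chosen probabilities are illustrative placeholders, not part of the paper.

```python
# Toy check of definition (1): Y depends on X = (X_1, X_2) only through X_1.
from itertools import product

p_x = {x: 0.25 for x in product([0, 1], repeat=2)}                      # uniform law of X
cond_y = lambda x: {0: 0.8, 1: 0.2} if x[0] == 0 else {0: 0.3, 1: 0.7}  # depends on X_1 only
joint = {(x, y): p_x[x] * cond_y(x)[y] for x in p_x for y in (0, 1)}    # P(X = x, Y = y)

def cond_prob(y, fixed):
    """P(Y = y | X_i = v for all (i, v) in fixed)."""
    num = sum(p for (x, yy), p in joint.items() if yy == y and all(x[i] == v for i, v in fixed))
    den = sum(p for (x, _), p in joint.items() if all(x[i] == v for i, v in fixed))
    return num / den

for x in p_x:
    for y in (0, 1):
        full = cond_prob(y, [(0, x[0]), (1, x[1])])     # condition on the whole vector X
        sub = cond_prob(y, [(0, x[0])])                 # condition on X_S, S = {first factor}
        assert abs(full - sub) < 1e-12                  # relation (1) holds for this S
print("S consisting of the first factor is relevant for this toy model")
```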
It is assumed that a collection of relevant factors has r elements ($1 \le r < p$); however, the set $S$ itself, which appears in (1), is unknown and should be identified. We label this assumption as (A). There is no restriction that the $S$ satisfying (1) and containing $r$ elements is unique. Usually the joint distribution of $(X, Y)$ is also unknown. Therefore, a statistical estimator of $S$ is constructed based on the first $N$ observations $\xi_N := (\xi(1), \ldots, \xi(N))$ of a sequence $\xi(1), \xi(2), \ldots$ consisting of i.i.d. random vectors, where, for $k \in \mathbb{N}$, $\xi(k) := (X(k), Y(k))$ has the same distribution as the vector $(X, Y)$.
In 2001, the authors of [25] proposed a method for identifying relevant factors, called MDR (Multifactor Dimensionality Reduction). According to article [26], more than 800 publications were devoted to the development of this method and its applications in the period from 2001 to 2014. Research in this direction has continued over the last decade, see, e.g., [27,28,29]. In [30], for the binary response Y, a modification of the MDR method was introduced, namely, MDR-EFE (Error Function Estimation), based on statistical estimates of the error functional of the response prediction using the K-fold cross-validation procedure, see also [31]. Later this method was extended in [32] to study the non-binary response.
Recall how the MDR-EFE method is employed. Let a non-random function $f: \mathcal{X} \to \mathcal{Y}$ be used to predict the response Y by the values of the factors vector X. In what follows we exclude the trivial case when $Y = y_0$ with probability one for some $y_0 \in \mathcal{Y}$ (hence X and Y are independent). The prediction quality is determined by applying the following error functional
$$\mathrm{Err}(f) := \mathsf{E}\,|Y - f(X)|\,\psi(Y),$$
where $\psi: \mathcal{Y} \to \mathbb{R}_+$ is a penalty function. The functional Err takes finite values for the discrete X and Y under consideration. The function $\psi$ allows one to take into account the importance of approximating a particular value of Y by $f(X)$.
In biomedical research, one often considers a binary response Y characterizing the patient's state of health; say, the value $Y = 1$ corresponds to illness, and $Y = -1$ means that the patient is healthy. In many situations the detection of the disease is more important, so the value 1 is attributed more weight. Of interest is also the situation when $\mathcal{Y} = \{-1, 0, 1\}$. Then the value 0 describes some intermediate state of uncertainty ("gray zone"). Following [32], we consider a more general scheme in which $\mathcal{Y} := \{-m, \ldots, 0, \ldots, m\}$ for some $m \in \mathbb{N}$. Lemma 1 in [32] describes for such a model all optimal functions $f_{opt}$ that deliver a minimum to the error functional (2). Note that we may suppose that the set of values of Y is strictly contained in $\{-m, \ldots, m\}$, i.e., some values are taken with zero probability. For such y, we assume that $\psi(y) = 0$. Thus, it is possible to study Y taking values in an arbitrary finite subset of $\mathbb{Z}$. To simplify the notation, we further assume that $\mathsf{P}(Y = y) > 0$ for all $y \in \mathcal{Y} = \{-m, \ldots, m\}$.
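As a small illustration of the error functional (2) for a ternary response, the following Python sketch evaluates Err(f) for a hypothetical joint distribution, a hypothetical penalty that puts more weight on detecting the value 1, and an arbitrary predictor; none of these choices come from the paper.

```python
# Minimal sketch of the error functional (2) for Y in {-1, 0, 1}.
from itertools import product

xs = list(product([0, 1], repeat=2))                                   # values of X = (X_1, X_2)
ys = [-1, 0, 1]
pmf = {(x, y): 1.0 / (len(xs) * len(ys)) for x in xs for y in ys}      # hypothetical P(Y = y, X = x)
psi = {-1: 1.0, 0: 1.0, 1: 2.0}                                        # more weight on detecting y = 1

def f(x):                                                              # an arbitrary predictor X -> Y
    return 1 if sum(x) >= 1 else -1

err = sum(abs(y - f(x)) * psi[y] * pmf[(x, y)] for x in xs for y in ys)
print("Err(f) =", err)
```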
It is proved that in the framework of model (1) the relation $f_{opt} = f_S$ is valid, where, for $x \in \mathcal{X}$ and $U \subset T$, $f_U(x) = f(x_U)$ and the function f is constructed in a due way. At the same time, for any $U \subset T$ such that $\sharp U = \sharp S$ ($\sharp$ denotes the cardinality of a finite set), with $S$ appearing in (1), the following inequality is true:
$$\mathrm{Err}(f_S) \le \mathrm{Err}(f_U).$$
For $U \subset T$, the function $f_U$ is introduced below. It depends on the joint distribution of $(X, Y)$, which is usually unknown. Thus we use the observations $\xi_N = \{(X(j), Y(j)),\ j = 1, \ldots, N\}$ to construct statistical estimates of the functional $\mathrm{Err}(f_U)$, $U \subset T$, and then select as an estimator of $S$ the set $U$ on which the minimum of the corresponding statistical estimate is attained. This approach is described in the next section of the article.
We underline that consideration of all subsets of the set T having cardinality r in the mentioned comparison procedure (involving regularized estimators, as explained in Section 2) for statistical estimates of the error functional is practically infeasible when p is large and r is moderately large. Therefore, a number of suboptimal methods of sequential feature selection have emerged. Such methods are used in various approaches to identify sets of relevant factors.
Mainly, one aims either to sequentially add indices at each step of the algorithm constructing a statistical estimator of a set S appearing in (1), or to sequentially exclude features from the whole set T. In [33], forward selection algorithms, i.e., algorithms of sequential addition of indices to the initial set, based on information theory, are considered. The authors of [33] show that the various algorithms employed can be interpreted as procedures based on proper approximations of a certain objective function. In [34] the principal attention is paid to simple models describing the phenomenon of epistasis observed in genetics, when individual factors do not affect the response while some combinations of them lead to essential effects (in statistics one speaks of "synergy interaction" of factors). Besides, we also demonstrated there that a number of well-known algorithms, for instance, mRMR (Minimum Redundancy Maximum Relevance), using mutual information and/or interaction information with a sequential procedure for selecting relevant factors, can lead to identification of the desired set with a probability that is negligibly small. In [35] a variant is proposed for sequential (forward) application of the MDR-EFE method within the binary response model involving the naive Bayesian classifier scheme. The latter means that, for any $y \in \{-1, 1\}$ and all $x \in \mathcal{X}$, the following relation holds:
$$\mathsf{P}(X = x \mid Y = y) = \prod_{k=1}^{p} \mathsf{P}(X_k = x_k \mid Y = y).$$
In other words, the factors $X_1, \ldots, X_p$ are conditionally independent given the response Y. In [35] the joint distribution of X and Y was assumed to be known.
The principal goal of our work is to estimate, for a random response which is in general non-binary and without assuming the validity of (4), the probability that the sequential selection of features based on the forward application of the MDR-EFE method identifies the suboptimal set that would be constructed by means of the same method if the joint distribution of the response and the vector of factors were known.
This result builds on the central limit theorem (CLT) for statistical estimates of the prediction error functional for a possibly non-binary response, proved in [32], which extends the CLT for the binary response model studied by the author previously. In addition, for the purposes of this work, we found the convergence rate of the first two moments of the considered statistics to the corresponding moments of the limiting Gaussian variable as the number of observations tends to infinity.
The article has the following structure. Section 2 describes statistical estimates of the error functional (for a response prediction) based on the MDR-EFE method. We also introduce the regularized versions of these estimators. In Section 3, the convergence rate of the first two moments of the regularized estimators of the error functional to the corresponding moments of the limiting Gaussian variable is established. Section 4 contains the main result related to the forward selection of relevant factors. The concluding remarks are given in Section 5. The proof of elementary Lemma 2 is provided in Appendix A for completeness of exposition.

2. Error Functional Estimators

Consider, in general, a non-binary response, i.e., let $\mathcal{Y} := \{-m, \ldots, 0, \ldots, m\}$ for some $m \in \mathbb{N}$. In the framework of the introduced discrete model, Lemma 1 of [32] gives a complete description of the class of optimal functions $f_{opt}$ providing the minimal error $\mathrm{Err}(f)$, determined by (2), within the class of all functions $f: \mathcal{X} \to \mathcal{Y}$. To define such a function (belonging to the optimal class), for $x \in \mathcal{X}$ we deal with a vector $w(x)$ having components
$$w_y(x) := \psi(y)\,\mathsf{P}(Y = y,\ X = x), \quad y \in \mathcal{Y}.$$
It can be easily seen that
$$\mathrm{Err}(f) = \sum_{y, z \in \mathcal{Y}} |y - z|\,\psi(y)\,\mathsf{P}\bigl(Y = y,\ f(X) = z\bigr) = \sum_{z \in \mathcal{Y}} \sum_{x \in A_z} w(x)^{\top} q(z),$$
where $A_z := \{x \in \mathcal{X}: f(x) = z\}$, $q(z)$ is a column of the $(2m+1) \times (2m+1)$ matrix $Q$ with elements $q_{y,z} := |y - z|$ (the element $q_{-m,-m}$ is located in the upper left corner of the matrix $Q$), and $\top$ stands for the transposition of column vectors. In other words, one employs in (5) the scalar product of the vectors $w(x)$ and $q(z)$. Thus, the search for an optimal function $f_{opt}$ amounts to finding the partition of $\mathcal{X}$ into sets $A_z$, $z \in \mathcal{Y}$, that provides the minimal value of the right-hand side of (5). Note also that, according to Formula (13) of [32], the error of response prediction can be written as follows:
$$\mathrm{Err}(f) = \sum_{i=0}^{2m-1}\ \sum_{y:\, i-m < |y| \le m} \psi(y)\,\mathsf{P}\bigl(Y = y,\ |f(X) - y| > i\bigr).$$
Let, for $y \in \mathcal{Y}$, the vector $\Delta(y)$ have the first $m + y$ components equal to 1 and the remaining $m - y + 1$ components equal to $-1$. For any $x \in \mathcal{X}$, we introduce a vector $L(x)$ with $2m$ components of the form
$$L_y(x) := w(x)^{\top} \Delta(y), \quad y \in \mathcal{Y},\ y > -m.$$
According to formula (11) of [32] one infers that
$$f_{opt}(x) = y \quad \text{if} \quad \begin{cases} L_{-m+1}(x) \ge 0, & y = -m,\\ L_{y+1}(x) \ge 0,\ L_y(x) < 0, & y \ne \pm m,\\ L_m(x) < 0, & y = m. \end{cases}$$
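The following Python sketch implements the rule given by (7) and (8) when the joint distribution is known; "pmf" and "psi" are hypothetical placeholders for $\mathsf{P}(Y = y, X = x)$ and the penalty function, and the function names are illustrative only.

```python
# Sketch of the optimal rule (7)-(8): a penalty-weighted "median"-type decision.
m = 1
ys = list(range(-m, m + 1))

def w_vec(x, pmf, psi):
    """Components w_y(x) = psi(y) * P(Y = y, X = x), ordered as y = -m, ..., m."""
    return [psi[y] * pmf.get((x, y), 0.0) for y in ys]

def L(y, x, pmf, psi):
    """L_y(x) = w(x)^T Delta(y); Delta(y) has m + y leading components equal to +1
    and the remaining m - y + 1 components equal to -1."""
    delta = [1] * (m + y) + [-1] * (m - y + 1)
    return sum(wi * di for wi, di in zip(w_vec(x, pmf, psi), delta))

def f_opt(x, pmf, psi):
    """Piecewise rule (8)."""
    if L(-m + 1, x, pmf, psi) >= 0:
        return -m
    for y in range(-m + 1, m):                              # cases y != +-m
        if L(y + 1, x, pmf, psi) >= 0 and L(y, x, pmf, psi) < 0:
            return y
    return m                                                # remaining case: L_m(x) < 0

xs = [(0,), (1,)]
pmf = {(x, y): 1.0 / (len(xs) * len(ys)) for x in xs for y in ys}
psi = {y: 1.0 / sum(pmf[(x, y)] for x in xs) for y in ys}
print({x: f_opt(x, pmf, psi) for x in xs})
```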
The joint distribution of $(X, Y)$ is, in general, unknown. Therefore, the optimal function $f_{opt}$ cannot be found in practice, and an algorithm is used to predict it, i.e., to approximate it by means of specified statistical estimators. The response prediction algorithm is defined as a function $\widehat{f}_{PA} = \widehat{f}_{PA}(x, \xi(W))$ given for $x \in \mathcal{X}$ and a set of observations
$$\xi(W) := \{\xi(j) = (X(j), Y(j)),\ j \in W\}, \quad W \subset \mathbb{N},\ \sharp W < \infty.$$
The function $\widehat{f}_{PA}$ takes values in the set $\mathcal{Y}$. It is assumed that the value of $\widehat{f}_{PA}(x, \xi(W))$ becomes close, in a certain sense, to $f(x)$ for $x$ in a specified subset of $\mathcal{X}$ when $W$ is sufficiently "massive". More precisely, we consider a family of functions $\widehat{f}_{PA}$ that depend on sets $\xi(W)$ of different cardinalities, but we will not complicate the notation. Consider $\mathcal{M} = \{x \in \mathcal{X}: \mathsf{P}(X = x) > 0\}$. For $x \in \mathcal{X}$, $U \subset T$ and $y \in \mathcal{Y}$, introduce a vector $w^U(x)$ with components
$$w_y^U(x) := \begin{cases} \psi(y)\,\mathsf{P}(Y = y,\ X_U = x_U), & x \in \mathcal{M},\\ 0, & x \notin \mathcal{M}. \end{cases}$$
Set
$$L_y^U(x) := (w^U(x))^{\top} \Delta(y), \quad y \in \mathcal{Y},\ y > -m.$$
For $U \subset T$, let $f_U$ be defined by means of a counterpart of formula (8), where $L_y^U(x)$ is now written instead of $L_y(x)$. Then, according to Section 5 of [32] (the notation $\alpha$ is used there instead of $U$), in the framework of model (1) the optimal function $f_{opt} = f_S$, where $S$ appears in (1) and $\sharp S = r$. Therefore, relation (3) is valid for $f_U$ corresponding to any $U \subset T$ with $\sharp U = r$ (assumption (A) holds).
To introduce an algorithm for predicting the function $f_U$, we employ statistical estimators of the penalty function $\psi$ as well as of the values $L_y^U(x)$, where $x \in \mathcal{X}$, $y \in \mathcal{Y}$, $y > -m$. Consider
$$\psi(y) := 1/\mathsf{P}(Y = y), \quad \text{where}\ \mathsf{P}(Y = y) > 0,\ y \in \mathcal{Y}.$$
In the case of a binary response, such a choice of the penalty function was proposed in [36]; the justification for this choice is given in [31], see also Section 4 in [32]. For the specified function $\psi(y)$ and observations $\xi(W)$, where $W \subset \mathbb{N}$ is a finite set, we use
$$\widehat{\psi}(y, \xi(W)) := \begin{cases} \dfrac{1}{\widehat{P}(y, \xi(W))}, & \widehat{P}(y, \xi(W)) \ne 0,\\ 0, & \widehat{P}(y, \xi(W)) = 0, \end{cases}$$
where the frequency estimator of a probability P ( Y = y ) has the form
$$\widehat{P}(y, \xi(W)) := \frac{1}{\sharp W} \sum_{j \in W} \mathbb{I}\{Y(j) = y\}, \quad y \in \mathcal{Y}.$$
It is not difficult to see that the strong law of large numbers for arrays of random variables (see, e.g., [37]) entails, for finite sets $W_N \subset \mathbb{N}$ such that $\sharp W_N \to \infty$ as $N \to \infty$, the relation
$$\widehat{\psi}(y, \xi(W_N)) \to \psi(y) \quad \text{a.s.}, \quad N \to \infty.$$
Let the prediction algorithm $\widehat{f}_{PA}^{\,U}(x, \xi(W_N))$ of a function $f_U(x)$ be constructed by means of an analogue of formula (8), where, for $x \in \mathcal{X}$, $y \in \mathcal{Y}$, $y > -m$, and $W_N \subset \{1, \ldots, N\}$, one now uses statistical estimators $\widehat{L}_y^{U, W_N}(x)$ of the functions $L_y^U(x)$ introduced in (10). Namely, let us define the following random variables:
$$\widehat{w}_y^{U, W_N}(x) := \widehat{\psi}(y, \xi(W_N))\, \frac{1}{\sharp W_N} \sum_{j \in W_N} \mathbb{I}\{Y(j) = y,\ X_U(j) = x_U\}, \quad y \in \mathcal{Y},$$
where $\widehat{\psi}(y, \xi(W_N))$ is the estimator of $\psi(y)$ appearing in (12). For $x \in \mathcal{X}$, $y \in \mathcal{Y}$, $y > -m$, set
$$\widehat{L}_y^{U, W_N}(x) := \bigl(\widehat{w}^{U, W_N}(x)\bigr)^{\top} \Delta(y).$$
Replace the value $L_y(x)$ in (8) by $\widehat{L}_y^{U, W_N}(x)$. Then one can claim that
$$\widehat{f}_{PA}^{\,U}(x, \xi(W_N)) = y \quad \text{if} \quad \begin{cases} \widehat{L}_{-m+1}^{\,U, W_N}(x) \ge 0, & y = -m,\\ \widehat{L}_{y+1}^{\,U, W_N}(x) \ge 0,\ \widehat{L}_y^{\,U, W_N}(x) < 0, & y \ne \pm m,\\ \widehat{L}_m^{\,U, W_N}(x) < 0, & y = m. \end{cases}$$
For $K \in \mathbb{N}$, $K > 1$, we take a partition of the set $\{1, \ldots, N\}$ into subsets
$$D_k(N) := \bigl\{(k-1)[N/K] + 1, \ldots, k[N/K]\,\mathbb{I}\{k < K\} + N\,\mathbb{I}\{k = K\}\bigr\},$$
where $k = 1, \ldots, K$, $[a]$ denotes the integer part of a number $a \in \mathbb{R}$, and $\mathbb{I}\{A\}$ is the indicator of a set $A$. These sets are applied in the K-fold cross-validation procedure, which increases the stability of statistical inference (the cross-validation procedure is studied, e.g., in [38]). Following [32], the estimator of the functional $\mathrm{Err}(f_U)$, i.e., a statistical estimator of the prediction error functional for a function $f_U$ and observations $\xi_N := \xi(\{1, \ldots, N\})$, involving the K-fold cross-validation procedure, is given by the formula:
$$\widehat{\mathrm{Err}}_{K,N}(f_U) := \sum_{i=0}^{2m-1}\ \sum_{y:\, i-m < |y| \le m} \frac{1}{K} \sum_{k=1}^{K} \widehat{\psi}\bigl(y, \xi(D_k(N))\bigr) \frac{1}{\sharp D_k(N)} \sum_{j \in D_k(N)} \mathbb{I}\bigl\{Y(j) = y,\ \bigl|\widehat{f}_{PA}^{\,U}\bigl(X(j), \xi(\overline{D}_k(N))\bigr) - y\bigr| > i\bigr\},$$
where $\overline{D}_k(N) := \{1, \ldots, N\} \setminus D_k(N)$, and $\widehat{\psi}(y, \xi(D_k(N)))$ is evaluated according to (12) for $W_N = D_k(N)$, $k = 1, \ldots, K$. The estimator (17) is a natural statistical analogue of the error functional (2) written in the form (6) when the K-fold cross-validation procedure is employed. Namely, instead of $\psi(y)$ we apply its statistical estimator of the type (12), and instead of $f$ we use its approximation by means of the prediction algorithm based on the complementary part $\overline{D}_k(N)$ of the observations. To obtain statistical estimators of the probability appearing in Formula (6) we write the corresponding average of indicator functions. One also employs averaging over the different parts of the observations.
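The Python sketch below is a compact illustration of the cross-validated estimator (17), not the exact implementation of [30,32]. The inner double sum over i and y in (17) collapses to $|\widehat{f} - Y(j)|$ because $\sum_{i \ge 0} \mathbb{I}\{|d| > i\} = |d|$ for an integer d. The function names are hypothetical, and the toy predictor merely stands in for the plug-in rule built from (15), (16) and the analogue of (8); any prediction algorithm with the same interface can be substituted.

```python
import numpy as np

def kfold_blocks(N, K):
    """Partition {0, ..., N-1} into K consecutive blocks D_1, ..., D_K."""
    size = N // K
    return [np.arange(k * size, (k + 1) * size if k < K - 1 else N) for k in range(K)]

def err_hat(X, Y, U, K, fit_predict):
    """Statistical estimator of Err(f_U) with K-fold cross-validation."""
    N = len(Y)
    total = 0.0
    for D_k in kfold_blocks(N, K):
        train = np.setdiff1d(np.arange(N), D_k)          # complementary observations
        f_hat = fit_predict(X[train][:, U], Y[train])
        block_err = 0.0
        for j in D_k:
            p_hat = np.mean(Y[D_k] == Y[j])              # frequency estimator of P(Y = Y(j)) on the block
            psi_hat = 1.0 / p_hat if p_hat > 0 else 0.0  # penalty estimator of type (12)
            block_err += psi_hat * abs(f_hat(X[j, U]) - Y[j])
        total += block_err / len(D_k)
    return total / K

def fit_weighted_mode(X_U, Y_train):
    """Toy plug-in predictor: return the response value with the largest estimated
    penalty-weighted probability (a stand-in for the empirical counterpart of (8))."""
    values, counts = np.unique(Y_train, return_counts=True)
    psi = {y: len(Y_train) / c for y, c in zip(values, counts)}
    def predict(x_u):
        best, best_w = values[0], -np.inf
        for y in values:
            w = psi[y] * np.mean((Y_train == y) & np.all(X_U == x_u, axis=1))
            if w > best_w:
                best, best_w = y, w
        return best
    return predict

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 4))
Y = np.where(X[:, 0] + X[:, 1] >= 1, 1, -1)              # response driven by factors 0 and 1
print(err_hat(X, Y, U=[0, 1], K=5, fit_predict=fit_weighted_mode))
```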
By Theorem 2 of [32], if $S = \{i_1, \ldots, i_r\}$ is a set of relevant factors, i.e., (1) holds, then, for each $\varepsilon > 0$ and any set $U = \{m_1, \ldots, m_r\} \subset T$, the following inequality takes place almost surely for all N large enough:
$$\widehat{\mathrm{Err}}_{K,N}(f_S) \le \widehat{\mathrm{Err}}_{K,N}(f_U) + \varepsilon.$$
Thus, it is natural to consider all subsets $U = \{m_1, \ldots, m_r\} \subset T$ and to choose as a statistical estimator of the relevant collection of indices $\{i_1, \ldots, i_r\}$ a set $U$ on which the minimum of $\widehat{\mathrm{Err}}_{K,N}(f_U)$ is attained. Here we also note that, for the study of the asymptotic properties of the error functional estimators, an important role is played by the regularization of the prediction algorithm by means of a sequence of positive numbers $(\varepsilon_N)_{N \in \mathbb{N}}$ such that $\varepsilon_N \to 0$ as $N \to \infty$. Namely, for $W_N \subset \{1, \ldots, N\}$, we define
$$\widehat{f}_{PA, \varepsilon_N}^{\,U}(x, \xi(W_N)) = y \quad \text{if} \quad \begin{cases} \widehat{L}_{-m+1}^{\,U, W_N}(x) + \varepsilon_N \ge 0, & y = -m,\\ \widehat{L}_{y+1}^{\,U, W_N}(x) + \varepsilon_N \ge 0,\ \widehat{L}_y^{\,U, W_N}(x) + \varepsilon_N < 0, & y \ne \pm m,\\ \widehat{L}_m^{\,U, W_N}(x) + \varepsilon_N < 0, & y = m. \end{cases}$$
As in article [32], we assume that
$$\varepsilon_N \to 0+, \qquad \sqrt{N}\,\varepsilon_N \to \infty, \qquad N \to \infty.$$
Now we introduce a statistical estimator $\widehat{\mathrm{Err}}_{K, N, \varepsilon_N}(f_U)$ using an analogue of Formula (17), where one employs $\widehat{f}_{PA, \varepsilon_N}^{\,U}$ instead of $\widehat{f}_{PA}^{\,U}$. For the regularized statistical estimators, as mentioned in [32], the analogue of Formula (18) holds. In [32], the CLT is established for the estimators constructed by means of $\widehat{f}_{PA, \varepsilon_N}^{\,U}$ when condition (20) is met. In the next section we apply a slightly different regularization for the error functional estimates, which permits us to specify the convergence rate of the first two moments of these estimators to the corresponding moments of the limiting Gaussian variable. This result is not only of independent interest, but is also applied in Section 4.

3. Asymptotic Behavior of the First Two Moments of Statistical Estimators of the Error Functional

As noted in Section 2, we will use the penalty function (11). Therefore, for $W_N = D_k(N)$, as a strongly consistent estimator $\widehat{\psi}(y, D_k(N))$ of $\psi(y)$ we will employ the variable appearing in (12), denoted below by $\widehat{\psi}_{N,k}(y)$, where $y \in \mathcal{Y}$, $k = 1, \ldots, K$, $N \in \mathbb{N}$. Recall that the estimator $\widehat{\mathrm{Err}}_{K,N}(f_U)$ is defined by Formula (17). If the regularized version $\widehat{f}_{PA, \varepsilon_N}^{\,U}$ is substituted into this estimator instead of $\widehat{f}_{PA}^{\,U}$, where $x \in \mathcal{X}$ and $N \in \mathbb{N}$, then the notation $\widehat{\mathrm{Err}}_{K, N, \varepsilon_N}(f_U)$ is used. We will apply the following Corollary 3 of [32], established in the framework of a model satisfying (1).
Theorem 1 ([32]).
Let U be an arbitrary subset of T having cardinality r, the function $f_U$ be defined after Formula (10), $\widehat{f}_{PA, \varepsilon_N}^{\,U}$ appear in (19) for observations $\xi_N$, and let the sequence $(\varepsilon_N)_{N \in \mathbb{N}}$ satisfy condition (20). Then
$$\sqrt{N}\Bigl(\widehat{\mathrm{Err}}_{K, N, \varepsilon_N}(f_U) - \mathrm{Err}(f_U)\Bigr) \xrightarrow{\ \mathcal{D}\ } Z \sim N\bigl(0, \sigma^2(U)\bigr), \quad N \to \infty,$$
and in this case $\sigma^2(U)$ is the variance of the random variable
$$V(U) := \sum_{i=0}^{2m-1}\ \sum_{y:\, i-m < |y| \le m} \frac{\mathbb{I}\{Y = y\}}{\mathsf{P}(Y = y)} \Bigl(\mathbb{I}\bigl\{|f_U(X) - y| > i\bigr\} - \mathsf{P}\bigl(|f_U(X) - y| > i \mid Y = y\bigr)\Bigr).$$
It is known that convergence in distribution of random variables does not, in general, ensure the convergence of their moments, even when these moments exist. We establish the convergence rate of the first two moments of the statistical estimators of the error functional to the corresponding moments of the limiting random variable. For this purpose we slightly strengthen the regularization condition imposed on the estimates. We require that the sequence $(\varepsilon_N)_{N \in \mathbb{N}}$ satisfies the following condition:
$$\varepsilon_N \to 0+, \qquad \frac{\varepsilon_N \sqrt{N}}{\log(1/\varepsilon_N)} \to \infty, \qquad N \to \infty.$$
Clearly, (23) implies the validity of (20). Relation (23) holds if one takes $\varepsilon_N = N^{-\delta}$, $N \in \mathbb{N}$, where $\delta \in (0, 1/2)$.
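Indeed, for this example a direct (routine) check, stated here for convenience under the form of (23) displayed above, reads
$$\varepsilon_N = N^{-\delta} \to 0+, \qquad \frac{\varepsilon_N \sqrt{N}}{\log(1/\varepsilon_N)} = \frac{N^{1/2 - \delta}}{\delta \log N} \to \infty, \quad N \to \infty,$$
since $1/2 - \delta > 0$ and any positive power of N dominates $\log N$; similarly $\sqrt{N}\,\varepsilon_N = N^{1/2 - \delta} \to \infty$, so (20) also holds.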
Lemma 1.
Let condition (23) be met. Then, for every $K \in \mathbb{N}$, $K > 1$, and any $U \subset T$, the statistical estimators $\widehat{\mathrm{Err}}_{K, N, \varepsilon_N}(f_U)$ satisfy the following relation:
$$N\, \mathsf{E}\bigl(\widehat{\mathrm{Err}}_{K, N, \varepsilon_N}(f_U) - \mathrm{Err}(f_U)\bigr)^2 \to \sigma^2(U), \quad N \to \infty,$$
where $\sigma^2(U) = \operatorname{var} V(U)$ and $V(U)$ is introduced in Formula (22).
Proof of Lemma 1.
Let us fix an arbitrary set U T . For each N N one has
$$Z_N := \sqrt{N}\Bigl(\widehat{\mathrm{Err}}_{K,N,\varepsilon_N}(f_U) - \mathrm{Err}(f_U)\Bigr) = \sqrt{N}\Bigl(\widehat{\mathrm{Err}}_{K,N,\varepsilon_N}(f_U) - \widehat{T}_N(f_U)\Bigr) + \sqrt{N}\Bigl(\widehat{T}_N(f_U) - T_N(f_U)\Bigr) + \sqrt{N}\Bigl(T_N(f_U) - \mathrm{Err}(f_U)\Bigr),$$
where
$$T_N(f_U) := \sum_{i=0}^{2m-1}\ \sum_{y:\, i-m<|y|\le m} \frac{1}{K}\sum_{k=1}^{K} \frac{\psi(y)}{\sharp D_k(N)} \sum_{j \in D_k(N)} \mathbb{I}\bigl\{Y(j) = y,\ |f_U(X(j)) - y| > i\bigr\},$$
$$\widehat{T}_N(f_U) := \sum_{i=0}^{2m-1}\ \sum_{y:\, i-m<|y|\le m} \frac{1}{K}\sum_{k=1}^{K} \frac{\widehat{\psi}_{N,k}(y)}{\sharp D_k(N)} \sum_{j \in D_k(N)} \mathbb{I}\bigl\{Y(j) = y,\ |f_U(X(j)) - y| > i\bigr\},$$
and $\widehat{\psi}_{N,k}(y)$ are defined by means of (12) for $W_N = D_k(N)$, $k = 1, \ldots, K$, $N \in \mathbb{N}$. The proof is divided into several steps.
Step 1. At first we consider
$$R_N := \sqrt{N}\Bigl(\widehat{\mathrm{Err}}_{K,N,\varepsilon_N}(f_U) - \widehat{T}_N(f_U)\Bigr), \quad N \in \mathbb{N}.$$
To simplify the notation, we do not write that R N also depends on K, ξ N and ε N . Our aim is to show that if (23) holds then
$$\mathsf{E} R_N^2 \to 0 \quad \text{as}\ N \to \infty.$$
In the light of formula (71) of [32], under condition (20) the following relation is valid:
$$R_N \xrightarrow{\ \mathsf{P}\ } 0, \quad N \to \infty.$$
Taking into account (29), by Theorem 5.4 of [39], relation (28) holds if (and only if) the sequence $(R_N^2)_{N \in \mathbb{N}}$ is uniformly integrable. By the de la Vallée Poussin theorem (see, e.g., Theorem 1.3.4 of [40]) it is sufficient to verify that
$$\sup_{N \in \mathbb{N}} \mathsf{E}(R_N^4) < \infty.$$
For $x \in \mathcal{X}$, $y \in \mathcal{Y}$, $i \in \mathbb{Z}_+$, $k = 1, \ldots, K$ and $N \in \mathbb{N}$ we introduce the following random variables:
$$F_{N,k}^{(i)}(x, y) = \mathbb{I}\bigl\{\bigl|\widehat{f}_{PA,\varepsilon_N}^{\,U}\bigl(x, \xi(\overline{D}_k(N))\bigr) - y\bigr| > i\bigr\} - \mathbb{I}\bigl\{|f_U(x) - y| > i\bigr\},$$
$$S_k(i, y) := \frac{1}{\sharp D_k(N)} \sum_{j \in D_k(N)} \mathbb{I}\{Y(j) = y\}\, F_{N,k}^{(i)}(X(j), y),$$
where, for W N , ξ ( W ) is defined by Formula (9). Write R N = U N , 1 + U N , 2 , here
$$U_{N,1} := \sqrt{N}\,\frac{1}{K}\sum_{k=1}^{K} \sum_{i=0}^{2m-1}\ \sum_{y:\, i-m<|y|\le m} \psi(y)\, S_k(i, y), \qquad U_{N,2} := \sqrt{N}\,\frac{1}{K}\sum_{k=1}^{K} \sum_{i=0}^{2m-1}\ \sum_{y:\, i-m<|y|\le m} \bigl(\widehat{\psi}_{N,k}(y) - \psi(y)\bigr)\, S_k(i, y).$$
Now note that, for any real numbers $a_1, \ldots, a_v$, every $v \in \mathbb{N}$ and an arbitrary $\gamma > 1$, the Hölder inequality implies that
$$\Bigl(\sum_{r=1}^{v} |a_r|\Bigr)^{\gamma} \le v^{\gamma - 1} \sum_{r=1}^{v} |a_r|^{\gamma}.$$
Evidently, (32) is true for $\gamma = 1$ as well. Consequently, we get
$$R_N^4 \le 8\bigl(U_{N,1}^4 + U_{N,2}^4\bigr), \quad N \in \mathbb{N}.$$
Clearly, for all x X , y Y , W N { 1 , , N } and N N , one has
L ^ y , ε N U , W N ( x ) : = L ^ y U , W N ( x ) + ε N = L y U ( x ) + ( w ^ y U , W N ( x ) w y ( x ) ) Δ ( y ) + ε N ,
where the functions appearing in (34) were introduced in Section 2. For any x X and y Y , the inequalities L y U ( x ) 0 , L y + 1 U ( x ) < 0 are satisfied if and only if, for arbitrary δ N ( x , y ; U ) > 0 such that δ N ( x , y ; U ) 0 , as N , and all sufficiently large N N , the following inequalities are valid: L y U ( x ) + δ N ( x , y ; U ) > 0 , L y + 1 U ( x ) + δ N ( x , y ; U ) < 0 (the analogous statement is true for inequalities corresponding to coordinates y = m and y = m in Formula (19)). Obviously,
| ( w ^ y U , W N ( x ) w y ( x ) ) Δ ( y ) |
| ψ ^ ( y , ξ ( W N ) ) ψ ( y ) | + ψ ( y ) 1 W N q W N I { X U ( q ) = x U , Y ( q ) = y } P ( X U = x U , Y = y ) ,
where ψ ^ ( y , ξ ( W N ) ) is defined in (12). One has
x U 1 W N q W N I { X U ( q ) = x U , Y ( q ) = y } P ( X U = x U , Y = y ) = 1 W N q W N I { Y ( q ) = y } P ( Y = y ) = P ^ ( y , ξ ( W N ) ) P ( Y = y ) .
For x X , y Y , W N { 1 , , N } and N N , consider the following event
A W N ( x , y ) = 1 W N q W N I { X U ( q ) = x U , Y ( q ) = y } P ( X U = x U , Y = y ) p 0 2 ε N 8 X ,
where p 0 = min y Y P ( Y = y ) (we assumed that P ( Y = y ) > 0 for y Y ). More precisely one can write A W N ( x , y ) = A W N ( x , y , U ; { ( X ( q ) , Y ( q ) ) , q W N } ) . We will not include a set U in the list of arguments since this set is fixed. Then, for ω A W N ( x , y ) , in view of (35), we get
P ^ ( y , ξ ( W N ) ) P ( Y = y ) p 0 2 ε N 8 .
Then by virtue of (37), for any y Y and all N large enough, i.e., for N N 0 ( Y , ( ε N ) N N ) , one has
P ^ ( y , ξ ( W N ) ) P ( Y = y ) p 0 2 ε N 8 P ( Y = y ) ε N 8 > P ( Y = y ) 2 > 0 ,
and hence the following relation holds
| ψ ^ ( y , ξ ( W N ) ) ψ ( y ) | = | P ^ ( y , ξ ( W N ) ) P ( Y = y ) | P ^ ( y , ξ ( W N ) ) P ( Y = y ) p 0 2 ε N 8 P ( Y = y ) 2 2 ε N 4 .
Thus if ω A W N ( x , y ) , where x X and y Y , then according to (36) and (38), for all N large enough, we can write
| ( w ^ y U , W N ( x ) w y ( x ) ) Δ ( y ) | ε N 4 + 1 p 0 p 0 2 ε N 8 X ε N 2 .
Taking into account that the sets X and Y have finite cardinalities, we ascertain that, for any x X , y Y and all N large enough, for ω A W N ( x , y ) , one has
f ^ P A , ε N U , W N ( x ) = f U ( x ) .
Consequently, for any x X , y Y , i = 0 , 1 , , 2 m 1 , ω A W N ( x , y ) , where W N = D ¯ k ( N ) , k = 1 , , K , for all N large enough (i.e., N N 1 ), the following inequality holds:
F N , k ( i ) ( x , y ) I { A D ¯ k ( N ) ( x , y ) } = 0 .
Applying (32) we come to the inequality
| U N , 1 | 4 N 2 ( 2 m ) 6 K k = 1 K i = 0 2 m 1 i m < | y | m ψ ( y ) 4 1 D k ( N ) j D k ( N ) I { Y ( j ) = y } F N , k ( i ) ( X ( j ) , y ) 4 .
Let Σ ˜ denote the summation over all x j X for j D k ( N ) . For N N 1 one has
E j D k ( N ) I { Y ( j ) = y } F N , k ( i ) ( X ( j ) , y ) 4 = E Σ ˜ j D k ( N ) I { Y ( j ) = y } F N , k ( i ) ( x j , y ) 4 I j D k ( N ) { X ( j ) = x j }
= E Σ ˜ j D k ( N ) I { Y ( j ) = y } F N , k ( i ) ( x j , y ) I { A ¯ D ¯ k ( N ) ( x j , y ) } 4 I j D k ( N ) { X ( j ) = x j } = E j D k ( N ) I { Y ( j ) = y } F N , k ( i ) ( X ( j ) , y ) I { A ¯ D ¯ k ( N ) ( X ( j ) , y ) } 4 E j D k ( N ) I { A ¯ D ¯ k ( N ) ( X ( j ) , y ) } 4 ,
here we employ (40) and take into account that | F N , k ( i ) ( x , y ) | 1 . We see that
| U N , 1 | 4 N 2 ( 2 m ) 6 K k = 1 K i = 0 2 m 1 i m < | y | m ψ ( y ) 4 ( D k ( N ) ) 4 j D k ( N ) I { A ¯ D ¯ N ( k ) ( X ( j ) , y ) } 4 .
For W N { 1 , , N } , y Y and j = 1 , , N , introduce the functions
g W N ( X ( j ) , y ) = I { A ¯ W N ( X ( j ) , y ) } = I { A ¯ W N ( X ( j ) , y ; { ( X ( q ) , Y ( q ) ) , q W N } ) } .
It is known (see, e.g., formula (15) in Chap. VI of [41]) that if a bounded Borel function g : R n × R m R , ξ and ζ are independent random vectors taking values in R n and R m , respectively, then
E ( g ( ξ , ζ ) | ζ = z ) = E g ( ξ , z ) , z R n .
Due to independence of ( X ( j ) , Y ( j ) ) , j N , we can apply the lemma on grouping random vectors (see, e.g., [42], p. 28) to get the relation
E ( ( j D k ( N ) g D ¯ k ( N ) ( X ( j ) , y ; ( X ( q ) , Y ( q ) ) , q D ¯ k ( N ) ) ) ) 4 ( X ( q ) , Y ( q ) ) = ( x q , y q ) , q D ¯ k ( N ) )
= E j D k ( N ) g D ¯ k ( N ) ( X ( j ) , y ; ( x q , y q ) ) , q D ¯ N ( k ) ) ) 4 .
By the Rosenthal inequality (see, e.g., Theorem 2.9 of [43]), for independent centered random variables $Z_1, \ldots, Z_v$ having $\mathsf{E}|Z_j|^{t} < \infty$ for some $t \in [2, \infty)$ and each $j = 1, \ldots, v$, one has
$$\mathsf{E}\Bigl|\sum_{j=1}^{v} Z_j\Bigr|^{t} \le C(t)\biggl(\sum_{j=1}^{v} \mathsf{E}|Z_j|^{t} + \Bigl(\sum_{j=1}^{v} \mathsf{E} Z_j^2\Bigr)^{t/2}\biggr),$$
where $C(t) > 0$ depends on t but depends neither on v nor on the distributions of the variables $Z_j$, $j = 1, \ldots, v$.
Set η N , k ( j ) : = g D ¯ k ( N ) ( X ( j ) , y ; { ( x q , y q ) ) , q D ¯ N ( k ) } ) , j N . Note that 0 η N , k ( j ) 1 for all j D N ( k ) . Then according to (42) we come to the inequality
E j D k ( N ) ( η N , k ( j ) E η N , k ( j ) ) 4 C ( D k ( N ) ) 2 ,
where k = 1 , , K and C = 2 C ( 4 ) . Hence, applying (32) for γ = 4 and v = 2 , one has
E j D k ( N ) η N , k ( j ) 4 8 C ( D k ( N ) ) 2 + j D k ( N ) E η N , k ( j ) 4
8 C ( D k ( N ) ) 2 + 8 ( D k ( N ) ) 4 max j D k ( N ) ( E η N , k ( j ) ) 4 .
Evidently, we can write
E ( η N , k ( j ) ) = P ( A ¯ D ¯ k ( N ) ( X ( j ) , y ; { ( x q , y q ) ) , q D ¯ k ( N ) } ) .
Let M k = D ¯ k ( N ) , where M k = M k ( N ) , k = 1 , , K . Set ζ q = I { X U ( q ) = x U , Y ( q ) = y } , where q D ¯ k ( N ) , σ 0 2 = var ζ q . Clearly, ζ q depends on x U , y and U. Random variables ζ q are identically distributed for q N . Therefore σ 0 2 = σ 0 2 ( U , x , y ) , but does not depend on q. If σ 0 2 = 0 , then the variables ζ q are a.s. equal to some constant. According to (36), an event A ¯ D ¯ k ( N ) ( X ( j ) , y ; { ( x q , y q ) ) , q D ¯ k ( N ) } ) occurrence means that the variable which is equal to zero a.s. turns greater than ( p 0 2 ε N ) / ( 8 X ) . Therefore, in the degenerate case one has
P ( A ¯ D ¯ k ( N ) ( X ( j ) , y ; ( x q , y q ) ) , q D ¯ k ( N ) ) ) = 0
and E η N , k ( j ) = 0 for all j = 1 , , N . Consider now the case when σ 0 2 > 0 . Then we get
P ( A ¯ D ¯ k ( N ) ( X ( j ) , y ; { ( x q , y q ) , q D ¯ k ( N ) } ) = P q D ¯ k ( N ) ( ζ q E ζ q ) σ 0 M k > p 0 2 M k ε N 8 X σ 0 ,
where p 0 appeared in (36).
Now we employ the Berry–Esseen estimate of the convergence rate in the CLT for i.i.d. random variables. Let $Z_1, \ldots, Z_v$ be i.i.d. random variables such that $\mathsf{E} Z_1 = 0$, $\operatorname{var} Z_1 = \sigma^2 \in (0, \infty)$, $\mathsf{E}|Z_1|^3 = \rho < \infty$. We write F for the distribution function of $Z_1$, and $F_v$ stands for the distribution function of $(Z_1 + \cdots + Z_v)/(\sigma\sqrt{v})$. Then (see, e.g., Theorem 5.4 of [43]), for any $v \in \mathbb{N}$,
$$\sup_{u \in \mathbb{R}} |F_v(u) - \Phi(u)| \le \frac{C_0\, \rho}{\sigma^3 \sqrt{v}},$$
where $\Phi(u)$ is the distribution function of a standard normal random variable and $C_0$ is a positive constant ($C_0$ depends neither on the distribution of $Z_1$ nor on v). According to [44] one has $C_0 \le 0.4693$. Consequently, taking $Z \sim N(0, 1)$, we have
P q D ¯ k ( N ) ( ζ q E ζ q ) σ 0 M k > p 0 2 M k ε N 8 X σ 0 P | Z | > p 0 2 M k ε N 8 X σ 0 + 2 C 0 σ 0 3 M k
since E | ζ q E ζ q | 3 1 for q D ¯ k ( N ) , where ζ q = I { X U ( q ) = x U , Y ( q ) = y } .
It is well-known (see, e.g., formula (29) of Chap. II of [41]), that, for u > 0 , the following inequality is true:
P ( | Z | u ) 2 / π u exp u 2 2 .
Therefore, by virtue of an inequality σ 0 2 1 / 4 (which is valid for the indicator variance) and as
( K 1 ) [ N / K ] M k N ,
we can write under condition (23) that
P | Z | > p 0 2 M k ε N 8 X σ 0 8 X 2 σ 0 p 0 2 π M k ε N exp 1 2 p 0 2 M k ε N 8 X σ 0 2
4 2 X p 0 2 π M k ε N exp 1 32 p 0 2 M k ε N X 2 = 4 2 X p 0 2 π M k exp 1 32 p 0 2 M k ε N X 2 + log 1 ε N C 1 N , N N ,
and C 1 does not depend on N.
Introduce
σ ˜ 2 : = min U T , x X , y Y σ 0 2 ( U , x , y ) ,
where one considers only strictly positive σ 0 2 ( U , x , y ) . Then obviously σ ˜ 2 > 0 , as there exists only a finite collection of different variants. Thus in view of (44), for all x, y and U under consideration, one has
2 C 0 σ ˜ 3 M k C 2 N , N N ,
where C 0 appeared in (43) and C 2 does not depend on N.
Therefore, if condition (23) is satisfied then, for all x X , y Y , k = 1 , , K and j D k ( N ) , the following inequality holds:
E η N , k ( j ) C 3 N , N N ,
where C 3 does not depend on x , y , k and N. Hence, in view of (44) we come to the relation
E j D k ( N ) g D ¯ k ( N ) ( X ( j ) , y ; { ( X ( q ) , Y ( q ) ) , q D ¯ k ( N ) } ) 4
8 C ( D k ( N ) ) 2 + 8 ( D k ( N ) ) 4 C 3 4 N 2 ( x q , y q ) ) , q D ¯ N ( k ) P ( ( X ( q ) , Y ( q ) ) = ( x q , y q ) ) C 4 N 2 ,
where C 4 does not depend on x, y, k and N. Thus according to (41), for all N large enough, we have proved the inequality
E U N , 1 4 C 5 ,
where C 5 does not depend on N.
In a similar way (taking into account (42) and (45)), for i = 0 , , 2 m 1 , y Y , k = 1 , , K , and all N large enough, we get
E S k ( i , y ) 8 C 6 ( D N ( k ) ) 4 ,
where S k ( i , y ) is introduced in (31), and C 6 does not depend on N.
We will employ an elementary result for the Bernoulli scheme. Let $U_1, U_2, \ldots$ be a sequence of i.i.d. random variables such that $\mathsf{P}(U_1 = 1) = p$ and $\mathsf{P}(U_1 = 0) = 1 - p$, where $p \in (0, 1)$. Consider the following frequency estimator of the probability p:
$$\widehat{p}_N := \frac{1}{N} \sum_{j=1}^{N} \mathbb{I}\{U_j = 1\}, \quad N \in \mathbb{N}.$$
Define
$$\widehat{\psi}_N := \begin{cases} \dfrac{1}{\widehat{p}_N}, & \widehat{p}_N \ne 0,\\ 0, & \widehat{p}_N = 0. \end{cases}$$
Lemma 2.
For the Bernoulli scheme introduced above and the estimators $\widehat{\psi}_N$ given by Formula (48), for each $t \in \mathbb{N}$ the following relation holds:
$$\mathsf{E}\Bigl(\widehat{\psi}_N - \frac{1}{p}\Bigr)^{t} = O\Bigl(\frac{1}{N}\Bigr), \quad N \to \infty.$$
More precisely, the absolute value of the left-hand side of (49) admits, for all $N \in \mathbb{N}$, a bound $c/N$, where $c = c(p, t)$ for $p \in (0, 1)$ and $t \in \mathbb{N}$.
For the sake of completeness the proof of this result is given in Appendix A.
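As a quick numerical illustration of the rate in Lemma 2, the small Monte Carlo sketch below estimates $\mathsf{E}(\widehat{\psi}_N - 1/p)^t$ for $t = 2$ over several sample sizes; the chosen parameters are arbitrary and not taken from the paper.

```python
# Simulation sketch for Lemma 2: N * E(psi_hat_N - 1/p)^t stays bounded as N grows.
import numpy as np

rng = np.random.default_rng(1)
p, t, reps = 0.3, 2, 200_000

for N in (50, 200, 800):
    p_hat = rng.binomial(N, p, size=reps) / N                           # p_hat_N over many replications
    psi_hat = np.where(p_hat > 0, 1.0 / np.where(p_hat > 0, p_hat, 1.0), 0.0)
    moment = np.mean((psi_hat - 1.0 / p) ** t)
    print(N, moment, N * moment)                                        # the last column stays bounded
```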
Now we continue the proof corresponding to Step 1. For all considered k, i, y and any N N , the Cauchy - Bunyakovsky - Schwarz inequality yields
E ( ψ ^ N , k ( y ) ψ ( y ) ) S k ( i , y ) 4 E ( ψ ^ N , k ( y ) ψ ( y ) ) 8 E S k ( i , y ) ) 8 1 2 .
Due to Lemma 2 one has E ( ψ ^ N , k ( y ) ψ ( y ) ) 8 = O 1 N , N . Employing the Minkowski inequality (to take into account the summation over i, y, k), for all N N , we come to the bound
E U N , 2 4 N 2 C 7 1 N 1 N 4 1 2 = C 7 N ,
where C 7 does not depend on N.
Consequently, by virtue of (33), (46) and (50) the uniform integrability of a sequence ( R N 2 ) N N is established. Thus (28) is verified.
Step 2. Now we study the asymptotic behavior of the variables N ( T ^ N ( f U ) T N ( f U ) ) , as N , where T ^ N ( f U ) and T N ( f U ) are given by Formulas (26) and (27), respectively. For j N , i = 0 , , 2 m 1 , y Y , we set Z i ( j ) ( y ) = I { Y ( j ) = y , | f ( X ( j ) ) y | > i } . One has
$$\sqrt{N}\bigl(\widehat{T}_N(f_U) - T_N(f_U)\bigr) = W_{N,1} + W_{N,2},$$
where
$$W_{N,1} = \frac{\sqrt{N}}{K}\sum_{k=1}^{K} \sum_{i=0}^{2m-1}\ \sum_{y:\, i-m<|y|\le m} \frac{\widehat{\psi}_{N,k}(y) - \psi(y)}{\sharp D_k(N)} \sum_{j \in D_k(N)} \bigl(Z_i^{(j)}(y) - \mathsf{E} Z_i^{(j)}(y)\bigr),$$
$$W_{N,2} = \frac{\sqrt{N}}{K}\sum_{k=1}^{K} \sum_{i=0}^{2m-1}\ \sum_{y:\, i-m<|y|\le m} \frac{\widehat{\psi}_{N,k}(y) - \psi(y)}{\sharp D_k(N)} \sum_{j \in D_k(N)} \mathsf{P}\bigl(Y(j) = y,\ |f_U(X(j)) - y| > i\bigr) = \frac{\sqrt{N}}{K}\sum_{k=1}^{K} \sum_{i=0}^{2m-1}\ \sum_{y:\, i-m<|y|\le m} \bigl(\widehat{\psi}_{N,k}(y) - \psi(y)\bigr)\, \mathsf{P}\bigl(Y = y,\ |f_U(X) - y| > i\bigr).$$
The purpose of the second step is to prove that
E W N , 1 2 0 , N .
For k = 1 , , K , i = 0 , , 2 m 1 and y Y introduce
G k ( i , y ) = 1 D k ( N ) j D k ( N ) ( Z i ( j ) ( y ) E Z i ( j ) ( y ) ) .
The Cauchy-Bunyakovsky-Schwarz inequality yields
E ( ψ ^ N , k ( y ) ψ ( y ) ) G k ( i , y ) 2 E ( ψ ^ N , k ( y ) ψ ( y ) ) 4 1 2 E G k ( i , y ) 4 1 2 .
For each considered N, y, i and k, the variables { Z i ( j ) ( y ) , j D k ( N ) } are independent and | Z i ( j ) ( y ) E Z i ( j ) ( y ) | 1 , so by virtue of the Rosenthal inequality (42) we obtain
E j D k ( N ) ( Z i ( j ) ( y ) E Z i ( j ) ( y ) ) 4 = O ( D k ( N ) 2 ) .
Taking into account Lemma 2 for t = 4 and in view of (44), for each k = 1 , , K , we get the relation
E W N , 1 2 = O N 1 2 , N .
Therefore, the goal of the second step has been achieved.
Step 3. The implementation of steps 1 and 2 permits to reduce the study of the asymptotic behavior (as N ) of Z N given by Formula (25) to the study of variables
η N : = N ( T N ( f U ) Err ( f U ) ) + W N , 2 , N N ,
where W N , 2 is defined by Formula (51).
The aim of the third step is to prove that E ( η N ) 2 σ 2 ( U ) , as N , where σ 2 ( U ) is the variance of the random variable V ( U ) appearing in Formula (22).
On this way, we will show that the sum of certain part of the terms in a specified representation of the variables η N does not affect (in the sense of L 2 ( Ω , F , P ) ) the limit behavior of these variables for growing N. For y Y and W N { 1 , , N } , where N N , we introduce the event
B W N ( y ) : = { ω : P ^ ( y , ξ ( W N ) ) 0 } ,
where P ^ ( y , ξ ( W N ) ) is defined according to (13). Then, in view of the independence of observations ξ ( 1 ) , ξ ( 2 ) , we have
P ( B ¯ W N ( y ) ) = P j W N { Y ( j ) y } = ( 1 P ( Y = y ) ) W N .
If ω B ¯ W N ( y ) then | ψ ^ ( y , ξ ( W N ) ) ψ ( y ) | = ψ ( y ) . Set
H N : = N K k = 1 K i = 0 2 m 1 i m < | y | m ( ψ ^ N , k ( y ) ψ ( y ) ) I { B N , k ( y ) } P ( Y = y , | f ( X ) y | > i ) ,
where B N , k ( y ) : = B D k ( N ) ( y ) and an event B W N ( y ) is introduced by Formula (53). Then
E ( W N , 2 H N ) 2 = E N K k = 1 K i = 0 2 m 1 i m < | y | m I { B ¯ N , k ( y ) } P ( Y = y ) P ( Y = y , | f U ( X ) y | > i ) 2
N ( 2 m ) 4 p 0 2 max y Y ( 1 P ( Y = y ) ) [ N / K ] 0 , N ,
since D k ( N ) [ N / K ] for N N , k = 1 , , K and because all P ( Y = y ) > 0 for each y Y , [ · ] stands for an integer part of a number.
We verify that H N for large N is approximated in the space L 2 ( Ω , F , P ) by the random variable
H ˜ N : = N K k = 1 K i = 0 2 m 1 i m < | y | m I { B N , k ( y ) } P ( Y = y ) p ^ N , k ( y ) P ( Y = y ) 2 P ( Y = y , | f U ( X ) y | > i ) ,
where p ^ N , k ( y ) : = P ^ ( y , ξ ( D k ( N ) ) ) and P ^ ( y , ξ ( W N ) ) was introduced by (13) for y Y and W N { 1 , , N } . Evidently, 0 P ( Y = y , | f U ( X ) y | > i ) 1 for all k, i, y and N under consideration. Consequently, it follows that
Δ N , k ( i , y ) : = | N I { B N , k ( y ) } 1 p ^ N , k ( y ) 1 P ( Y = y ) P ( Y = y , | f U ( X ) y | > i ) N I { B N , k ( y ) } P ( Y = y ) p ^ N , k ( y ) P ( Y = y ) 2 P ( Y = y , | f U ( X ) y | > i ) | N P ( Y = y ) p ^ N , k ( y ) P ( Y = y ) ψ ^ N , k ( y ) 1 P ( Y = y ) = N P ( Y = y ) D k ( N ) ψ ^ N , k ( y ) 1 P ( Y = y ) J N ,
where
J N : = 1 D k ( N ) j D k ( N ) ( I { Y ( j ) = y } P ( Y ( j ) = y ) ) .
For any considered k, i, y and N the Cauchy - Bunyakovsky - Schwarz inequality implies that
E ( Δ N , k ( i , y ) ) 2 N ( P ( Y = y ) 2 D k ( N ) E J N 4 E ψ ^ N , k ( y ) 1 P ( Y = y ) 4 1 2 .
The Rosenthal inequality (42) yields that E J N 4 2 C ( 4 ) . By means of Lemma 2 (for t = 4 and multipliers c ( p , t ) with p = P ( Y = y ) ), for all considered i, y, k and any N N we come to the bound
E ( Δ N , k ( i , y ) ) 2 N ( P ( Y = y ) 2 D k ( N ) ( 2 C ( 4 ) c ( P ( Y = y ) , 4 ) ) 1 2 N .
Therefore, E ( H N H ˜ N ) 2 0 as N .
Let us define the variable G N by formula similar to H ˜ N but without the multiplier I { B N , k ( y ) } . In view of (44) it is easily seen that
E ( H ˜ N G N ) 2 N ( 2 m ) 4 p 0 4 max y Y ( 1 P ( Y = y ) ) [ N / K ] 1 4 max k = 1 , , K 1 D k ( N ) 0 , N .
Thus E ( η N Q N ) 2 0 as N , where
Q N : = N ( T N ( f U ) Err ( f U ) ) + G N , N N .
Taking into account Formula (6) for the function f = f U , we come to the relation
Q N = N K k = 1 K i = 0 2 m 1 i m < | y | m 1 D k ( N ) j D k ( N ) ( I { Y ( j ) = y , | f U ( X ( j ) ) y | > i } P ( Y = y ) P ( Y = y , | f U ( X ) y | > i ) P ( Y = y ) + ( P ( Y = y ) I { Y ( j ) = y } ) P ( Y = y , | f U ( X ) y | > i ) P ( Y = y ) 2 ) = N K k = 1 K 1 D k ( N ) j D k ( N ) V ( j ) ,
where, for j N ,
$$V(j) := \sum_{i=0}^{2m-1}\ \sum_{y:\, i-m<|y|\le m} \frac{\mathbb{I}\{Y(j) = y\}}{\mathsf{P}(Y = y)} \Bigl(\mathbb{I}\bigl\{|f_U(X(j)) - y| > i\bigr\} - \mathsf{P}\bigl(|f_U(X) - y| > i \mid Y = y\bigr)\Bigr).$$
The variables { V ( j ) , j N } are centered, i.i.d. and uniformly bounded for all j (clearly, V ( j ) = V ( j ) ( U ) ). For each j N , the distributions of V ( j ) and V ( U ) coincide, where V ( U ) is introduced in (22). Thus, one has
var V ( j ) = var V ( U ) = σ 2 ( U ) , j N .
According to the lemma on grouping independent random variables, for each N N , the variables j D k ( N ) V ( j ) , k = 1 , , K , are independent. Since N / D k ( N ) K as N , for k = 1 , , K , we come to the relation
E ( Q N 2 ) = var Q N = N K 2 k = 1 K 1 ( D k ( N ) ) 2 j D k ( N ) var V ( j ) = σ 2 ( U ) 1 K 2 k = 1 K N D k ( N ) σ 2 ( U ) ,
as N . Hence E η N 2 σ 2 ( U ) , N . The goal of the third step has been achieved.
In view of the above approximations (in $L^2(\Omega, \mathcal{F}, \mathsf{P})$) of the initial random variables $Z_N$, introduced by (25), we conclude that $\mathsf{E} Z_N^2 \to \sigma^2(U)$ as $N \to \infty$. Namely, we apply the following elementary statement: if $\mathsf{E}\alpha_N^2 \to 0$ and $\mathsf{E}\beta_N^2 \to \sigma^2$, then $\mathsf{E}(\alpha_N + \beta_N)^2 \to \sigma^2$ as $N \to \infty$. Therefore, (24) is established. The proof of Lemma 1 is complete. □
Further we will also employ a result that immediately follows from Theorem 1.
Corollary 1.
Let the conditions of Lemma 1 be satisfied. Then the following relations hold:
$$\sqrt{N}\, \mathsf{E}\bigl(\widehat{\mathrm{Err}}_{K,N,\varepsilon_N}(f_U) - \mathrm{Err}(f_U)\bigr) \to 0, \quad N \to \infty,$$
$$\operatorname{var}\bigl(\sqrt{N}\,\widehat{\mathrm{Err}}_{K,N,\varepsilon_N}(f_U)\bigr) \to \sigma^2(U), \quad N \to \infty,$$
where $\sigma^2(U)$ is the variance of the random variable $V(U)$ introduced in (22).
Proof. 
Condition (23) implies (20). Thus, according to Theorem 1, we have
$$Z_N \xrightarrow{\ \mathcal{D}\ } Z \sim N\bigl(0, \sigma^2(U)\bigr), \quad N \to \infty,$$
where $Z_N$, $N \in \mathbb{N}$, are defined in (25). Due to Lemma 1, the sequence $(Z_N)_{N \in \mathbb{N}}$ is uniformly integrable. Consequently, relation (59) implies (57), i.e., $\mathsf{E} Z_N \to \mathsf{E} Z = 0$ as $N \to \infty$. Obviously,
$$\operatorname{var}\bigl(\sqrt{N}\,\widehat{\mathrm{Err}}_{K,N,\varepsilon_N}(f_U)\bigr) = \mathsf{E}\, N\bigl(\widehat{\mathrm{Err}}_{K,N,\varepsilon_N}(f_U) - \mathrm{Err}(f_U)\bigr)^2 - \Bigl(\sqrt{N}\,\mathsf{E}\bigl(\widehat{\mathrm{Err}}_{K,N,\varepsilon_N}(f_U) - \mathrm{Err}(f_U)\bigr)\Bigr)^2.$$
Therefore, to obtain (58), it is sufficient to use Lemma 1 and take into account (57). The proof is complete. □
Note that (59) can be obtained directly under conditions of Lemma 1. For each N N and any k = 1 , , K , according to Lindeberg’s theorem applied to arrays { V ( j ) , j D k ( N ) } of centered i.i.d. uniformly bounded summands, where a sequence ( V ( j ) ) j N is introduced in (55), taking into account (56) one has
V N , k : = 1 D k ( N ) j D k ( N ) V ( j ) D Z k N ( 0 , σ 2 ( U ) ) , N .
For every N N , the random variables V N , k , k = 1 , , K , are independent and var V N , k = σ 2 ( U ) . Since N / D k ( N ) K as N , for k = 1 , , K , by virtue of (60) we come to relation
Q N D Z N ( 0 , σ 2 ( U ) ) , N ,
where in view of (54) one has Q N = 1 K k = 1 K N D k ( N ) V N , k , N N . Applying (61) and Slutsky’s lemma, we arrive at (59).
Also note that relation (29) can be easily derived from (36) and (39) without employment of [32].

4. Forward Selection of Relevant Factors

Now we can turn to the sequential selection of factors based on the MDR-EFE method. At the first step one searches for $j_1 \in T$, a point where the function $\widehat{\mathrm{Err}}_{K,N,\varepsilon_N}(f_{\{i\}})$ attains its minimum over all $i \in T$. If there are several such points, then we take, e.g., the one with the smallest index. Recall that according to (17) (more precisely, after regularization), the random variable $\widehat{\mathrm{Err}}_{K,N,\varepsilon_N}(f_{\{i\}})$ is in fact a function of $\widehat{f}_{PA}^{\,\{i\}}$, which is a forecast of the function $f_{\{i\}}$. Then this procedure is repeated; namely, if at the $(k-1)$-th step the set $S_{k-1} := \{j_1, \ldots, j_{k-1}\}$ has been constructed, where $k \in \{2, \ldots, r\}$, then $j_k \in T \setminus S_{k-1}$ is selected at step k in such a way that, given $j_1, \ldots, j_{k-1}$, the function $\widehat{\mathrm{Err}}_{K,N,\varepsilon_N}(f_{\{S_{k-1}, i\}})$ takes its minimal value over $i \in T \setminus S_{k-1}$ at $i = j_k$. It is convenient to assume that the empty set is taken at the zero step. Then at each next step one new element is added to the previously constructed set. If at some step there are several minimum points of the considered function, then we take only one of them, e.g., the one with the minimal index.
Thus, for each $N \in \mathbb{N}$ the random sets $S_k(N) = S_k(N, \omega) := \{j_1, \ldots, j_k\}$ arise, where $k = 1, \ldots, r$ and $j_m = j_m(N, \omega)$, $m = 1, \ldots, r$. By construction one can write
$$j_k(N, \omega) \in J_k(N, \omega) := \mathop{\mathrm{arg\,min}}_{i \in T \setminus S_{k-1}(N, \omega)} \widehat{\mathrm{Err}}_{K,N,\varepsilon_N}\bigl(f_{\{S_{k-1}(N, \omega),\, i\}}\bigr),$$
where $S_0 := \varnothing$ and $\{\varnothing, i\} := \{i\}$. In other words, the choice of $j_k(N, \omega)$ at step k means that, for $i \in T \setminus S_{k-1}(N, \omega)$,
$$\widehat{\mathrm{Err}}_{K,N,\varepsilon_N}\bigl(f_{S_k(N, \omega)}\bigr) \le \widehat{\mathrm{Err}}_{K,N,\varepsilon_N}\bigl(f_{\{S_{k-1}(N, \omega),\, i\}}\bigr),$$
moreover, $j_k(N, \omega) = \min\{i: i \in J_k(N, \omega)\}$, $k = 1, \ldots, r$. If the joint distribution of X and Y is known, then instead of the described scheme for constructing the random sets $S_k(N, \omega)$ we turn to the non-random "oracle" sets $T_k = \{i_1, \ldots, i_k\}$, where $k = 1, \ldots, r$,
$$i_k \in \mathop{\mathrm{arg\,min}}_{i \in T \setminus T_{k-1}} \mathrm{Err}\bigl(f_{\{T_{k-1},\, i\}}\bigr),$$
$T_0 := \varnothing$, and the functional Err is introduced by Formula (2). If there are several $i_k$ satisfying (63), we take the one among them with the minimal value of the index.
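The greedy loop just described admits a very short Python sketch; here "err_hat" is any callable mapping an index set U to an estimate of Err(f_U), e.g. a partial application of the cross-validation sketch given after Formula (17), treated as a black box, and all names are illustrative.

```python
# Forward (greedy) selection: append the index minimizing the cross-validated
# error estimate at each step, breaking ties by the smallest index.
def forward_select(p, r, err_hat):
    """Return the greedily selected index list [j_1, ..., j_r]."""
    selected = []
    for _ in range(r):
        candidates = [i for i in range(p) if i not in selected]
        # min() scans candidates in increasing order, so in case of ties the
        # minimizer with the smallest index is taken
        best = min(candidates, key=lambda i: err_hat(selected + [i]))
        selected.append(best)
    return selected

# The "oracle" sets T_k of (63) are produced by the same loop with the true
# error functional Err(f_U) passed in place of its statistical estimate.
```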
For $k \in \{1, \ldots, r\}$ and $i \in T \setminus T_k$ introduce
$$C_{k,i} := \mathrm{Err}\bigl(f_{\{T_{k-1},\, i\}}\bigr) - \mathrm{Err}\bigl(f_{T_k}\bigr).$$
By construction of the sets $T_k$ we have $C_{k,i} \ge 0$, where $k = 1, \ldots, r$ and $i \in T \setminus T_k$. We call a model satisfying condition (1) regular whenever the following relation is true:
$$C_{k,i} > 0, \quad k = 1, \ldots, r,\ i \in T \setminus T_k.$$
In other words, for each $k = 1, \ldots, r$, the point $i_k$ in (63) is determined uniquely. Further we employ the penalty function introduced in (11). We also use its strongly consistent estimator of the type (48) with
$$\widehat{p}_N := \frac{1}{\sharp W_N} \sum_{j \in W_N} \mathbb{I}\{Y(j) = y\},$$
where $W_N \subset \{1, \ldots, N\}$ and $\sharp W_N \to \infty$ as $N \to \infty$.
Theorem 2.
Let the considered model (1), with a collection of relevant factors having cardinality $r < p$, be regular, i.e., let (64) take place. Then, for the random sets $S_r(N)$ introduced above, the following relation is valid:
$$\mathsf{P}\bigl(S_r(N) = T_r\bigr) \to 1, \quad N \to \infty,$$
where $T_r$ is defined by means of (63) for $k = 1, \ldots, r$. In other words, with probability close to one, the described forward selection procedure based on statistical estimates of the error functional leads to the "oracle" collection $T_r$ when N is large enough.
Proof. 
For a random set S r ( N , ω ) = { j 1 ( N , ω ) , , j r ( N , ω ) } , where j k ( N , ω ) is an element taken at k-th step, one has
$$\mathsf{P}\bigl(\omega: S_r(N, \omega) = T_r\bigr) \ge \mathsf{P}\bigl(\omega: j_1(N, \omega) = i_1, \ldots, j_r(N, \omega) = i_r\bigr).$$
Note that
$$\mathsf{P}\bigl(\omega: j_1(N, \omega) = i_1, \ldots, j_r(N, \omega) = i_r\bigr) \ge \mathsf{P}\Bigl(\bigcap_{k=1}^{r} A_k(N)\Bigr),$$
where
$$A_k(N) := \bigcap_{i \in T \setminus T_k} \Bigl\{\widehat{\mathrm{Err}}_{K,N,\varepsilon_N}\bigl(f_{T_k}\bigr) < \widehat{\mathrm{Err}}_{K,N,\varepsilon_N}\bigl(f_{\{T_{k-1},\, i\}}\bigr)\Bigr\},$$
k = 1 , , r . Thus, we obtain:
P k = 1 r A k ( N ) = 1 P k = 1 r A ¯ k ( N ) 1 k = 1 r P A ¯ k ( N )
1 k = 1 r i T T k 1 P E r r ^ K , N , ε N ( f T k ) E r r ^ K , N , ε N ( f { T k 1 , i } ) ,
where, as usual, A ¯ : = Ω A for A Ω . Then, for k = 1 , , r , i T T k 1 and N N , we get
Δ k , i ( N ) : = E r r ^ K , N , ε N ( f T k ) E r r ^ K , N , ε N ( f { T k 1 , i } ) = ( E r r ^ K , N , ε N ( f T k ) E E r r ^ K , N , ε N ( f T k ) ) + ( E E r r ^ K , N , ε N ( f T k ) Err ( f T k ) ) + ( Err ( f T k ) Err ( f { T k 1 , i } ) ) + ( Err ( f { T k 1 , i } ) E E r r ^ K , N , ε N ( f { T k 1 , i } ) ) + ( E E r r ^ K , N , ε N ( f { T k 1 , i } ) E r r ^ K , N , ε N ( f { T k 1 , i } ) ) .
For U T , set
$$Z_N(U) := \widehat{\mathrm{Err}}_{K,N,\varepsilon_N}(f_U) - \mathsf{E}\,\widehat{\mathrm{Err}}_{K,N,\varepsilon_N}(f_U).$$
For any k = 1 , , K , i T T k 1 and each δ ( 0 , 1 ) in light of formula (57) of Corollary 1, for all N large enough ( N N 2 ( δ , k , i ) ) it holds
P ( Δ k , i ( N ) 0 ) P ( N | Z N ( T k ( N ) ) | + N | Z N ( { T k 1 ( N ) , i } ) | N C k , i δ )
P N | Z N ( T k ( N ) ) | ( 1 δ ) N C k , i 2 + P | Z N ( { T k 1 ( N ) , i } ) | ( 1 δ ) N C k , i 2 ,
where C k , i are introduced in (66), Δ k , i ( N ) is defined by (68).
Applying the Bienaymé–Chebyshev inequality and taking into account Formula (58) of Corollary 1, for each $U \subset T$ and any $c > 0$ we come, for the centered random variable $Z_N(U)$, to the relation
$$\mathsf{P}\bigl(\sqrt{N}\,|Z_N(U)| \ge c\sqrt{N}\bigr) \le \frac{N \operatorname{var} Z_N(U)}{N c^2} \sim \frac{\operatorname{var} V(U)}{N c^2}, \quad N \to \infty,$$
where $V(U)$ is determined by Formula (22). According to (64), for $k \in \{1, \ldots, r\}$ and $i \in T \setminus T_k$, one has $C_{k,i} > 0$. Therefore, for all N large enough ($N \ge N_3(\delta, k, i)$), the following inequality takes place:
$$\mathsf{P}\bigl(\Delta_{k,i}(N) \ge 0\bigr) \le \frac{4\bigl(\operatorname{var} V(T_k) + \operatorname{var} V(\{T_{k-1},\, i\})\bigr)}{N (1 - \delta)^2\, C_{k,i}^2}.$$
For a fixed m N , one can change the summation order over i and y to write Formula (22) as follows:
V ( U ) = y = m m I { Y = y } P ( Y = y ) W ( y , U ) ,
where
W ( y , U ) = 0 i < | y | + m I { | f U ( X ) y | > i } P ( | f U ( X ) y | > i | Y = y ) .
Thus, for any U T , one has
| V ( U ) | 2 m y = m m I { Y = y } P ( Y = y ) .
Consequently, we come to the inequality
$$\operatorname{var} V(U) \le \mathsf{E}\, V^2(U) \le 4 m^2 \sum_{y = -m}^{m} \frac{1}{\mathsf{P}(Y = y)} =: a,$$
where a = a ( m , ( P ( Y = y ) ) y Y ) . We see that var V ( T k ) + var V ( { T k 1 , i } ) 2 a for all k { 1 , , r } , i T T k 1 and N N . For each δ ( 0 , 1 ) , any k { 1 , , r } , i T T k 1 and all N large enough, we get the following bound:
$$\mathsf{P}\bigl(\Delta_{k,i}(N) \ge 0\bigr) \le \frac{8 a}{N (1 - \delta)^2\, C_{k,i}^2}.$$
Hence, for each δ ( 0 , 1 ) and all N large enough, by virtue of (67) the following inequality holds:
P ( S r ( N ) = T r ) 1 8 a r N ( 1 δ ) 2 C 0 2 p + 1 r + 1 2 ,
where $C_0^2 := \min_{k = 1, \ldots, r,\ i \in T \setminus T_k} C_{k,i}^2 > 0$ according to (64). Thus relation (72) implies the validity of (66). □
Now note that according to (69) the following relation is true:
P ( N | Z N ( U ) | c N ) = O 1 N , N .
The question arises whether this probability decreases like C / N where C is a positive constant or more rapidly. The answer depends on the variance of the random variable V ( U ) given by Formula (22). In view of (70) we will determine when the variable V ( U ) is degenerate, i.e., equal to a constant a.s. This is also of independent interest for the CLT established in Section 6 of [32] and given above as Theorem 1. The following result provides a simple characterization of the V ( U ) degeneracy.
Lemma 3.
For an arbitrary set $U \subset T$, the variance of the random variable $V(U)$ appearing in Formula (22) is zero if and only if, for every $y \in \mathcal{Y}$, there is $k_0(y) \in \{0, \ldots, m + |y|\}$ such that
$$\mathsf{P}\bigl(|f_U(X) - y| = k_0(y),\ Y = y\bigr) = \mathsf{P}(Y = y).$$
Thus, for each y Y , on the set { Y = y } the random variable f U ( X ) does not necessarily take a constant value. Moreover, the values of k 0 ( y ) need not coincide for different y.
Proof. 
For y = 0 , , m and a random variable W ( y , U ) , introduced by Formula (71), one can write
W ( y , U ) = 0 i < y + m I { | f U ( X ) y | > i } P ( | f U ( X ) y | > i | Y = y ) = 0 i < y + m i < k m + y I { | f U ( X ) y | = k } P ( | f U ( X ) y | = k | Y = y ) = k = 1 m + y i = 0 k 1 I { | f U ( X ) y | = k } P ( | f U ( X ) y | = k | Y = y ) = k = 1 m + y k ( I { | f U ( X ) y | = k } P ( | f U ( X ) y | = k | Y = y ) ) = k = 1 m + y k I { | f U ( X ) y | = k } E ( | f U ( X ) y | | Y = y ) .
In a similar way we consider y = m , , 1 . Thus, for all y Y , one gets
W ( y , U ) = k = 1 m + | y | k I { | f U ( X ) y | = k } E ( | f U ( X ) y | | Y = y ) .
Recall that P ( Y = y ) > 0 for all y Y . If, for some y , k , j Y , k j , we have
P ( | f U ( X ) y | = k , Y = y ) > 0 , P ( | f U ( X ) y | = j , Y = y ) > 0 ,
then on the events { | f U ( X ) y | = k , Y = y } and { | f U ( X ) y | = j , Y = y } the variable W ( y , U ) takes different values. Therefore, V ( U ) takes different values on these events. Hence var V ( U ) > 0 , if (74) is not valid. Thus (74) is a necessary condition to guarantee that var V ( U ) = 0 . Suppose now that, (74) holds. In this case we get
E ( | f U ( X ) y | | Y = y ) = k 0 ( y ) , y Y .
Clearly, k 0 ( y ) depends on U as well. We see that V ( U ) on each set { Y = y } takes (up to the set of measure zero) the value 1 P ( Y = y ) ( k 0 ( y ) k 0 ( y ) ) = 0 , y Y . Therefore, var V ( U ) = 0 . Note that k 0 ( y ) need not coincide for different y Y . The proof is complete. □

5. Concluding Remarks

The established asymptotic result (Theorem 2) is rather qualitative in nature, since relation (66) assumes increasing values of N. Relation (72) is more precise. However, (72) demonstrates that, loosely speaking, one has to employ $N \gg rp$. As previously, we assume that assumption (A), introduced on page 2, is valid. Evidently, the sequential choice of relevant variables based on statistical estimators of the error functional (of the response approximation) is attractive for implementation, although suboptimal. In this regard, Theorem 2 shows that, under certain conditions, forward (random) selection leads with high probability to the same collection of factors as that provided by the sequential procedure with known joint distribution of the vector of factors X and the response Y. In future work, it would be reasonable to supplement the theoretical results with computer simulations (see, e.g., [45]).
Consideration of the proximity of the results of the optimal and suboptimal procedures requires a separate study. In addition, we note that, within the framework of linear models, estimates of the probability of correct identification of relevant factors are considered, e.g., in [46,47]. Theorem 2 does not assume linearity of the stochastic model. Presumably for the first time, our work treats forward selection of relevant factors affecting a non-binary random response on the basis of the MDR-EFE method. It would be interesting to extend the conditions under which relation (66) can be established. Moreover, stability problems of FS deserve special attention; see, e.g., [48,49,50]. The stability of algorithms for classification problems in the framework of random trees is treated in [51].
Finally, we emphasize that the problem of statistical estimation of the cardinality of the set of relevant factors appearing in definition (1) is very important and complex. Along with dealing with a deterministic number of selected factors, there is a research approach based on developing stopping rules for the procedures used to identify the relevant set. In this regard we mention, e.g., article [52], dedicated to information-theoretic methods for selecting relevant factors. The study of non-discrete stochastic models is also of undoubted interest; see, e.g., [53].
Furthermore, it would be interesting to study functionals other than (2) for measuring the quality of a response approximation by means of functions defined on various collections of factors. One could also consider a random number of observations; in this regard we refer, e.g., to [27,54].

Funding

This research received no external funding.

Data Availability Statement

Data are contained within the article.

Acknowledgments

The author is very grateful to the Reviewers for carefully reading the manuscript and making valuable remarks and suggestions. He would also like to thank Alexander Tikhomirov for the invitation to submit the manuscript to this Special Issue.

Conflicts of Interest

The author declares no conflict of interest.

Appendix A. Proof of Lemma 2

Proof. 
For any $t \in \mathbb{N}$ and $p \in (0,1)$, one has
$$
\begin{aligned}
\mathsf{E}(\hat{\psi}_N)^{-t} &= N^t \sum_{j=1}^{N} \frac{1}{j^t} \binom{N}{j} p^j (1-p)^{N-j} \\
&= \frac{N^t}{p^t (N+1)\cdots(N+t)} \sum_{j=1}^{N} \frac{(j+1)\cdots(j+t)}{j^t} \binom{N+t}{j+t} p^{j+t} (1-p)^{(N+t)-(j+t)} \\
&= \frac{1}{p^t}\bigl(1 + h_t(N)\bigr) \sum_{i=t+1}^{N+t} \Bigl(1 + \frac{a_1}{i-t} + \cdots + \frac{a_t}{(i-t)^t}\Bigr) \binom{N+t}{i} p^i (1-p)^{N+t-i},
\end{aligned}
$$
where $h_t(N) = O(1/N)$ as $N \to \infty$, and $a_1, \ldots, a_t \in \mathbb{N}$. We do not use the explicit formulas $a_1 = t(t+1)/2, \ldots, a_t = t!$. Note that
$$\sum_{i=t+1}^{N+t} \binom{N+t}{i} p^i (1-p)^{N+t-i} = 1 - \sum_{i=0}^{t} \binom{N+t}{i} p^i (1-p)^{N+t-i} = 1 - g_t(N),$$
where $g_t(N) := \sum_{i=0}^{t} g_{t,i}(N)$ and, for $i = 0, \ldots, t$, one has
$$0 \le g_{t,i}(N) := \binom{N+t}{i} p^i (1-p)^{N+t-i} \le (N+t)^t (1-p)^N = O(1/N), \qquad N \to \infty.$$
For each $k = 1, \ldots, t$, introduce
$$
\begin{aligned}
q_{t,k}(N) &:= \sum_{i=t+1}^{N+t} \frac{1}{(i-t)^k} \binom{N+t}{i} p^i (1-p)^{N+t-i} \\
&= \frac{1}{p^k (N+t+1)\cdots(N+t+k)} \sum_{i=t+1}^{N+t} \frac{(i+1)\cdots(i+k)}{(i-t)^k} \binom{N+t+k}{i+k} p^{i+k} (1-p)^{(N+t+k)-(i+k)}.
\end{aligned}
$$
Obviously, one can write $q_{t,k}(N) = O\bigl(1/N^k\bigr)$, as
$$\frac{(i+1)\cdots(i+k)}{(i-t)^k} \le (1+t+k)^k \le (1+2t)^t$$
for all $i \ge t+1$, $k = 1, \ldots, t$, and since
$$\sum_{i=t+1}^{N+t} \binom{N+t+k}{i+k} p^{i+k} (1-p)^{(N+t+k)-(i+k)} \le 1.$$
Consequently, for any $t \in \mathbb{N}$, we get
$$\mathsf{E}(\hat{\psi}_N)^{-t} = \frac{1}{p^t}\bigl(1 + h_t(N)\bigr)\Bigl(1 - g_t(N) + \sum_{k=1}^{t} a_k\, q_{t,k}(N)\Bigr) = \frac{1}{p^t} + R_t(N),$$
where $R_t(N) = O(1/N)$ as $N \to \infty$. Evidently, $\mathsf{E}(\hat{\psi}_N)^{0} = 1$ for $N \in \mathbb{N}$. For each $N \in \mathbb{N}$, set $R_0(N) = 0$. Thus, for $t \in \mathbb{N}$, one has
$$\mathsf{E}\Bigl(\hat{\psi}_N^{-1} - \frac{1}{p}\Bigr)^{t} = \sum_{v=0}^{t} \binom{t}{v} \mathsf{E}(\hat{\psi}_N)^{-v} \Bigl(-\frac{1}{p}\Bigr)^{t-v} = \sum_{v=0}^{t} \binom{t}{v} \Bigl(\frac{1}{p^v} + R_v(N)\Bigr) \Bigl(-\frac{1}{p}\Bigr)^{t-v} = O\Bigl(\frac{1}{N}\Bigr),$$
because
$$\sum_{v=0}^{t} \binom{t}{v} \frac{1}{p^v} \Bigl(-\frac{1}{p}\Bigr)^{t-v} = \Bigl(\frac{1}{p} - \frac{1}{p}\Bigr)^{t} = 0, \qquad \sum_{v=0}^{t} \binom{t}{v} \frac{1}{p^{t-v}} = \Bigl(1 + \frac{1}{p}\Bigr)^{t}$$
and
$$\max_{v=0,\ldots,t} |R_v(N)| = O(1/N), \qquad N \to \infty.$$
The proof of Lemma 2 is complete. □
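The $O(1/N)$ rate in Lemma 2 is easy to probe numerically. The sketch below is only illustrative: it assumes, as a particular regularization convention, $\hat{\psi}_N = \max\{S_N, 1\}/N$ with $S_N \sim \mathrm{Bin}(N, p)$ (the precise definition of $\hat{\psi}_N$ is the one given earlier in the paper), and estimates $\mathsf{E}\bigl(\hat{\psi}_N^{-1} - 1/p\bigr)^t$ by simulation; the product of this moment with $N$ should then stabilize as $N$ grows.

```python
import numpy as np

rng = np.random.default_rng(1)
p, t, reps = 0.3, 2, 200_000

for N in (100, 200, 400, 800):
    S = rng.binomial(N, p, size=reps)
    psi = np.maximum(S, 1) / N                    # illustrative regularization on {S_N = 0}
    moment = np.mean((1.0 / psi - 1.0 / p) ** t)  # Monte Carlo estimate of E(psi^{-1} - 1/p)^t
    print(N, moment, N * moment)                  # N * moment stays roughly constant
```

For $t = 2$ the stabilized value of $N \cdot \mathsf{E}\bigl(\hat{\psi}_N^{-1} - 1/p\bigr)^2$ is close to $(1-p)/p^3$, as the delta method suggests.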

References

1. Seber, G.A.F.; Lee, A.J. Linear Regression Analysis, 2nd ed.; J. Wiley and Sons Publication: Hoboken, NJ, USA, 2003.
2. Györfi, L.; Kohler, M.; Krzyżak, A.; Walk, H. A Distribution-Free Theory of Nonparametric Regression; Springer: New York, NY, USA, 2002.
3. Matloff, N. Statistical Regression and Classification. From Linear Models to Machine Learning; CRC Press: Boca Raton, FL, USA, 2017.
4. Tibshirani, R. Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B Methodol. 1996, 58, 267–288.
5. Hastie, T.; Tibshirani, R.; Wainwright, M. Statistical Learning with Sparsity. The Lasso and Generalizations; CRC Press: Boca Raton, FL, USA, 2015.
6. Bolón-Canedo, V.; Alonso-Betanzos, A. Recent Advances in Ensembles for Feature Selection; Springer: Cham, Switzerland, 2018.
7. Giraud, C. Introduction to High-Dimensional Statistics; CRC Press: Boca Raton, FL, USA, 2015.
8. Stańczyk, U.; Zielosko, B.; Jain, L.C. (Eds.) Advances in Feature Selection for Data and Pattern Recognition; Springer International Publishing AG: Cham, Switzerland, 2018.
9. Kuhn, M.; Johnson, K. Feature Engineering and Selection. A Practical Approach for Predictive Models; CRC Press: Boca Raton, FL, USA, 2020.
10. Chandrashekar, G.; Sahin, F. A survey on feature selection methods. Comput. Electr. Eng. 2014, 40, 16–28.
11. Jia, W.; Sun, M.; Lian, J.; Hou, S. Feature dimensionality reduction: A review. Complex Intell. Syst. 2022, 8, 2663–2693.
12. Lyu, Y.; Feng, Y.; Sakurai, K. A survey on feature selection techniques based on filtering methods for cyber attack detection. Information 2023, 14, 191.
13. Pradip, D.; Chandrashekhar, A. A comprehensive survey on feature selection in the various fields of machine learning. Appl. Intell. 2023, 52, 4543–4581.
14. Htun, H.H.; Biehl, M.; Petkov, N. Survey of feature selection and extraction techniques for stock market prediction. Financ. Innov. 2023, 9, 26.
15. Laborda, J.; Ryoo, S. Feature selection in a credit scoring model. Mathematics 2021, 9, 746.
16. Emily, M. A survey of statistical methods for gene-gene interaction in case-control genome-wide association studies. J. Société Fr. Stat. 2018, 159, 27–67.
17. Tsunoda, T.; Tanaka, T.; Nakamura, Y. (Eds.) Genome-Wide Association Studies; Springer: Singapore, 2019.
18. Luque-Rodriguez, M.; Molina-Baena, J.; Jimenez-Vilchez, A.; Arauzo-Azofra, A. Initialization of feature selection search for classification. J. Artif. Intell. Res. 2022, 75, 953–998.
19. Pudjihartono, N.; Fadason, T.; Kempa-Liehr, A.W.; O’Sullivan, J.M. A review of feature selection methods for machine learning-based disease risk prediction. Front. Bioinform. 2022, 2, 927312.
20. Coelho, F.; Braga, A.P.; Verleysen, M. A mutual information estimator for continuous and discrete variables applied to feature selection and classification problems. Int. J. Comput. Intell. Syst. 2016, 9, 726–733.
21. Kozhevin, A.A. Feature selection based on statistical estimation of mutual information. Sib. Elektron. Mat. Izv. 2021, 18, 720–728.
22. Latt, K.Z.; Honda, K.; Thiri, M.; Hitomi, Y.; Omae, Y.; Sawai, H.; Kawai, Y.; Teraguchi, S.; Ueno, K.; Nagasaki, M.; et al. Identification of a two-SNP PLA2R1 haplotype and HLA-DRB1 alleles as primary risk associations in idiopathic membranous nephropathy. Sci. Rep. 2018, 8, 15576.
23. Vergara, J.R.; Estévez, P.A. A review of feature selection methods based on mutual information. Neural Comput. Appl. 2014, 24, 175–186.
24. AlNuaimi, N.; Masud, M.M.; Serhani, M.A.; Zaki, N. Streaming feature selection algorithms for big data: A survey. Appl. Comput. Inform. 2022, 18, 113–135.
25. Ritchie, M.D.; Hahn, L.W.; Roodi, N.; Bailey, L.R.; Dupont, W.D.; Parl, F.F.; Moore, J.H. Multifactor-dimensionality reduction reveals high-order interactions among estrogen-metabolism genes in sporadic breast cancer. Am. J. Hum. Genet. 2001, 69, 138–147.
26. Gola, D.; John, J.M.M.; van Steen, K.; König, I.R. A roadmap to multifactor dimensionality reduction methods. Briefings Bioinform. 2016, 17, 293–308.
27. Bulinski, A.; Kozhevin, A. New version of the MDR method for stratified samples. Stat. Optim. Inf. Comput. 2017, 5, 1–18.
28. Abegaz, F.; van Lishout, F.; Mahachie, J.J.M.; Chiachoompu, K.; Bhardwaj, A.; Duroux, D.; Gusareva, R.S.; Wei, Z.; Hakonarson, H.; Van Steen, K. Performance of model-based multifactor dimensionality reduction methods for epistasis detection by controlling population structure. BioData Min. 2021, 14, 16.
29. Yang, C.H.; Hou, M.F.; Chuang, L.Y.; Yang, C.S.; Lin, Y.D. Dimensionality reduction approach for many-objective epistasis analysis. Briefings Bioinform. 2023, 24, bbac512.
30. Bulinski, A.; Butkovsky, O.; Sadovnichy, V.; Shashkin, A.; Yaskov, P.; Balatskiy, A.; Samokhodskaya, L.; Tkachuk, V. Statistical methods of SNP data analysis and applications. Open J. Stat. 2012, 2, 73–87.
31. Bulinski, A. On foundation of the dimensionality reduction method for explanatory variables. J. Math. Sci. 2014, 199, 113–122.
32. Bulinski, A.V.; Rakitko, A.S. MDR method for nonbinary response variable. J. Multivar. Anal. 2015, 135, 25–42.
33. Macedo, F.; Oliveira, M.R.; Pacheco, A.; Valadas, R. Theoretical foundations of forward feature selection methods based on mutual information. Neurocomputing 2019, 325, 67–89.
34. Bulinski, A.V. On relevant feature selection based on information theory. Theory Probab. Its Appl. 2023, 68, 392–410.
35. Rakitko, A. MDR-EFE method with forward selection. In Proceedings of the 5th International Conference on Stochastic Methods (ICSM-5), Moscow, Russia, 23–27 November 2020.
36. Velez, D.R.; White, B.C.; Motsinger, A.A.; Bush, W.S.; Ritchie, M.D.; Williams, S.M.; Moore, J.H. A balanced accuracy function for epistasis modeling in imbalanced datasets using multifactor dimensionality reduction. Genet. Epidemiol. 2007, 31, 306–315.
37. Hu, T.-C.; Móricz, F.; Taylor, R. Strong laws of large numbers for arrays of rowwise independent random variables. Acta Math. Hung. 1989, 54, 153–162.
38. Arlot, S.; Celisse, A. A survey of cross-validation procedures for model selection. Stat. Surv. 2010, 4, 40–79.
39. Billingsley, P. Convergence of Probability Measures; John Wiley and Sons: New York, NY, USA, 1968.
40. Borkar, V.S. Probability Theory: An Advanced Course; Springer: New York, NY, USA, 1995.
41. Bulinski, A.V.; Shiryaev, A.N. Theory of Stochastic Processes, 2nd ed.; Fizmatlit: Moscow, Russia, 2005. (In Russian)
42. Kallenberg, O. Foundations of Modern Probability; Springer: New York, NY, USA, 1997.
43. Petrov, V.V. Limit Theorems of Probability Theory: Sequences of Independent Random Variables; Clarendon Press: Oxford, UK, 1995.
44. Shevtsova, I.G. On absolute constants in the Berry-Esseen inequality and its structural and non-uniform refinements. Informatics Its Appl. 2013, 7, 124–125.
45. Bulinski, A.V.; Rakitko, A.S. Simulation and analytical approach to the identification of significant factors. Commun. Stat.-Simul. Comput. 2016, 45, 1430–1450.
46. Shah, R.D.; Samworth, R.J. Variable selection with error control: Another look at stability selection. J. R. Stat. Soc. B 2012, 74, 1–26.
47. Beinrucker, A.; Dogan, U.; Blanchard, G. Extensions of stability selection using subsamples of observations and covariates. Stat. Comput. 2016, 26, 1059–1077.
48. Nogueira, S.; Sechidis, K.; Brown, G. On the stability of feature selection algorithms. J. Mach. Learn. Res. 2018, 18, 1–54.
49. Khaire, U.M.; Dhanalakshmi, R. Stability of feature selection algorithm: A review. J. King Saud Univ.-Comput. Inf. Sci. 2022, 34, 1060–1073.
50. Bulinski, A. Stability properties of feature selection measures. Theory Probab. Appl. 2024, 69, 3–15.
51. Bénard, C.; Biau, G.; Da Veiga, S.; Scornet, E. SIRUS: Stable and Interpretable RUle Set for classification. Electron. J. Stat. 2021, 15, 427–505.
52. Mielniczuk, J. Information theoretic methods for variable selection—A review. Entropy 2022, 24, 1079.
53. Linke, Y.; Borisov, I.; Ruzankin, P.; Kutsenko, V.; Yarovaya, E.; Shalnova, S. Universal local linear kernel estimators in nonparametric regression. Mathematics 2022, 10, 2693.
54. Rachev, S.T.; Klebanov, L.B.; Stoyanov, S.V.; Fabozzi, F.J. The Methods of Distances in the Theory of Probability and Statistics; Springer: New York, NY, USA, 2013.