Ultra-high Dimensional Variable Screening via Density Weighted Variance

Density Weighted Variance (DWV), a novel model-free feature screening criterion, is proposed for mean regression with ultrahigh-dimensional covariates. Compared with existing model-free screening criteria, the DWV criterion possesses a faster convergence rate for inactive covariates and the same convergence rate as most existing variable screening procedures for active covariates. Furthermore, the DWV criterion is extended to the quantile regression and multiple response regression settings. Finally, numerical simulations and a real data analysis are conducted to show the finite sample performance of the proposed methods. Citation: Zhou J, Chen Y (2018) Ultra-high Dimensional Variable Screening via Density Weighted Variance. J Biom Biostat 9: 401. doi: 10.4172/2155-6180.1000401


Introduction
High dimensional data analysis has been stirring the statistical community as data collection becomes more and more convenient. However, many conventional statistical methods cannot be applied directly to high dimensional data; variable selection and dimension reduction are two main techniques to bridge the gap between conventional statistical methods and modern high dimensional data. In this paper, we restrict our attention to high dimensional variable selection procedures.
Fan and Lv [1] provided a selective overview of variable selection methods existing before 2008. Fan and Lv [2] proposed marginal correlation coefficient screening for the ultra-high dimensional linear regression model. After that seminal paper, many other marginal criteria were proposed for linear regression, generalized linear models, and other parametric or nonparametric regression models; see, for example, Wang [3], Fan and Song [4], Fan et al. [5], and Li et al. [6]. Recently, model-free variable screening procedures have received attention for settings where the model cannot be specified. Based on the distance correlation measure [7], Li et al. [8] proposed the DC-SIS procedure. For the mean regression model, Shao and Zhang [9] modified the definition of distance correlation and proposed the MDC-SIS variable screening method. For discriminant analysis and discrete response regression models, Cui et al. [10] proposed MV-SIS by comparing empirical distribution functions.
The purpose of this paper is to propose a new model-free variable screening criterion, Density Weighted Variance (DWV, hereafter), for the mean regression model, and to extend it to other general regression models that model functionals of the conditional distribution of the response variable Y given covariates X. Compared with existing model-free screening criteria, the DWV criterion has a faster convergence rate for inactive covariates and the same convergence rate as most existing variable screening procedures for active covariates. Furthermore, the proposed procedure imposes fewer constraints on the response variable; it can be used for univariate or multivariate, discrete or continuous responses.
The remainder of this paper is organized as follows. In section 2, we develop DWV-SIS for mean regression. In section 3, we extend the proposed variable screening procedure to quantile regression. In section 4, the multiple response regression model is considered. Numerical simulations and a real data analysis are conducted in section 5 to show the finite sample performance of the proposed methods. All technical proofs are given in the Appendix.

DWV Criterion for Mean Regression
This section is divided into three parts. Subsection 1 presents preliminaries on the smooth test statistic and the motivation of the DWV criterion. Subsection 2 develops the DWV-SIS method for mean regression with continuous covariates. Subsection 3 considers mean regression with mixed discrete and continuous covariates.

Preliminaries and motivation
Consider a parametric mean regression model (1), Y = g(X, θ) + ε, where g(·) is a known function and θ ∈ Θ is an unknown parameter. Given i.i.d. data {(x_i, y_i), i=1, 2, …, n}, to test the goodness of fit of the parametric model (1), let ε = Y − g(X, θ_0), where θ_0 is the projection of Y onto the parametric family {g(X, θ): θ ∈ Θ}, that is, θ_0 = argmin_{θ∈Θ} E[Y − g(X, θ)]². However, the curse of dimensionality still befalls such tests because of the nonparametric nature of the estimators involved.
Another paper that motivates our work is Shao and Zhang [9]. Inspired by the definition of Distance Correlation (DC) [7], they proposed a new dependence measure, Martingale Difference Correlation (MDC), for mean regression, built on testing whether E(Y|X)=E(Y) almost surely. Compared with the traditional Pearson correlation, MDC can capture nonlinear dependence; on the other hand, DC(X, Y)=0 is equivalent to independence of X and Y, which is stronger than needed for mean regression, whereas MDC suffices.
Moreover, the curse of dimensionality can be avoided, since in variable screening we compute the criterion for each covariate one at a time, while the appealing properties of U-statistics are preserved. For ease of presentation, we write:

DWV-SIS for continuous covariates
Obviously, from the population point of view, we have V_k = 0 for any k ∈ I, while V_k > 0 for any k ∈ A. As with most variable screening procedures, we select the covariates with large V̂_k as active, that is, Â = {k : V̂_k ≥ α}, where α > 0 is a preassigned threshold. In real data applications it is more convenient to rank the DWV criterion and select the d covariates with the largest values as active.
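Since the display formulas do not survive here, the following Python sketch illustrates one plausible form of the estimator, assuming V̂_k is the Zheng-type density weighted U-statistic with centred responses; the Gaussian kernel, the rule-of-thumb bandwidth, and the names `dwv` and `dwv_screen` are illustrative choices, not the paper's exact specification.

```python
import numpy as np

def dwv(x, y, h):
    """Density Weighted Variance for one covariate (a sketch).

    Assumes the estimator is the Zheng-type U-statistic
        V_hat = 1/(n(n-1)) * sum_{i != j} K_h(x_i - x_j) e_i e_j,
    with e_i = y_i - mean(y) and Gaussian kernel K_h(u) = K(u/h)/h.
    """
    n = len(y)
    e = y - y.mean()
    u = (x[:, None] - x[None, :]) / h
    k = np.exp(-0.5 * u**2) / (np.sqrt(2 * np.pi) * h)  # Gaussian kernel
    np.fill_diagonal(k, 0.0)            # exclude the i == j terms
    return (e[:, None] * k * e[None, :]).sum() / (n * (n - 1))

def dwv_screen(X, y, d, h=None):
    """Rank covariates by DWV and keep the d largest."""
    n, p = X.shape
    h = h or n ** (-1 / 5)              # rule-of-thumb bandwidth (assumption)
    v = np.array([dwv(X[:, k], y, h) for k in range(p)])
    return np.argsort(v)[::-1][:d]
```

On a toy linear model, the active covariates should receive the largest V̂_k and so rank ahead of the noise covariates.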
Next, we study the theoretical properties of the proposed DWV-SIS procedure. The following assumptions and conditions are imposed to facilitate the technical proofs, although they may not be the weakest ones.
This assumption is identical to condition (C.1) in Li et al. [8], and under this assumption we have E{exp[s(Y−E(Y))²]} < ∞. This assumption is similar to the first part of Assumption 1 in Zheng [11]; ours concerns only the marginal density function, whereas Zheng [11] makes an assumption on the joint density function, which implies ours.
This assumption is similar to the second part of Assumption 1 in Zheng [11].
(C.1) The kernel function K(u) is symmetric and satisfies that K_h(u) is uniformly bounded with respect to u and h > 0; that is, there is a positive constant M_1 > 0 such that sup_{h>0, u} K_h(u) ≤ M_1 < ∞. Moreover, the bandwidth h satisfies h → 0 and nh → ∞ as n → ∞.
For a kernel with bounded support, this condition is equivalent to K(u) being bounded, which is a pervasive condition in the nonparametric kernel literature. For a kernel with unbounded support, the condition requires that the rate at which K(u) vanishes in the tails is not slower than that of h, and the habitual Gaussian kernel satisfies it.
First, we present the convergence results, which follow the arguments of Zheng [11]. Under the foregoing assumptions and Condition (C.1), for any k ∈ I, V̂_k converges to zero, and for any k ∈ A, V̂_k converges to V_k > 0. From the variable selection viewpoint, we have the following sure screening property.
It is worth mentioning that Fan et al. [12] provided a nonasymptotic bound for a sure independence screening procedure based on a variant of SIS under an exchangeability condition; the bound also holds for our DWV-SIS procedure, since the proof is independent of the screening criterion. We restate the result here; more details can be found in Fan et al. [12].
(A.5) (Exchangeability condition) Let r ∈ N, the set of natural numbers. We say the model satisfies the exchangeability condition at level r if the set of random vectors {(Y, X_A, X_{j1}, X_{j2}, …, X_{jr}): j_1, j_2, …, j_r are distinct elements of A^c} is exchangeable.

DWV-SIS for discrete covariates
To deal with nonparametric kernel estimation for categorical covariates, Aitchison and Aitken [13] proposed the binomial kernel K(x, y; λ) = λ^{1−d(x,y)}(1−λ)^{d(x,y)}, where 1/2 ≤ λ ≤ 1 is a smoothing parameter and the habitual definition of d(x, y) is 0 when x = y and 1 when x ≠ y. Subsequently, Hsiao et al. [14] suggested a variation that can also treat ordinal data, K(x, y; λ) = λ^{|x−y|}, with smoothing parameter λ ∈ [0, 1]. In this paper, the variational binomial kernel (10) is used to handle discrete covariates. Our DWV criterion for a discrete covariate replaces the continuous kernel accordingly, and we have the following convergence result.
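A small sketch of the two categorical kernels and the resulting discrete DWV estimator; the kernel forms shown are the standard binary Aitchison–Aitken kernel and the ordinal variation (assumed to match the paper's displays (9) and (10)), and `dwv_discrete` is an illustrative name.

```python
import numpy as np

def aitchison_aitken(x, y, lam):
    """Binomial kernel of Aitchison and Aitken (binary form):
    lam when x == y, (1 - lam) otherwise, with 1/2 <= lam <= 1."""
    d = (x != y).astype(float)       # d(x, y): 0 if equal, 1 otherwise
    return lam ** (1 - d) * (1 - lam) ** d

def variational_kernel(x, y, lam):
    """Variation of Hsiao et al. that also handles ordinal data:
    lam ** |x - y| with lam in [0, 1] (assumed form)."""
    return lam ** np.abs(x - y)

def dwv_discrete(x, y, lam):
    """DWV U-statistic with the ordinal kernel in place of K_h."""
    n = len(y)
    e = y - y.mean()
    k = lam ** np.abs(x[:, None] - x[None, :]).astype(float)
    np.fill_diagonal(k, 0.0)         # exclude the i == j terms
    return (e[:, None] * k * e[None, :]).sum() / (n * (n - 1))
```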

Theorem 4:
Under assumption (A.2) and Condition (C.1), for any k ∈ I, V̂_k converges to zero, and for any k ∈ A, V̂_k converges to V_k > 0. From the variable selection viewpoint, we have the following sure screening property.

DWV-SIS for Multiple Response Regression
We need to consider the multiple response regression model in some situations; for instance, several dummy variables must be introduced when the original response Y takes multi-categorical nominal values. On the other hand, joint modeling of multiple responses may improve performance when the components of Y are dependent on each other. For the multiple response regression model, denoting by λmax(·) the largest eigenvalue of a matrix, we propose the following DWV criteria:
for k ∈ C (continuous covariates) and for k ∈ D (discrete covariates).
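As a sketch of the multivariate criterion, assuming it is the largest eigenvalue of the matrix-valued analogue of the univariate U-statistic (an assumption, since the displays are omitted here; the name `dwv_multi` is illustrative):

```python
import numpy as np

def dwv_multi(x, Y, h):
    """Multivariate DWV (sketch): form the matrix-valued U-statistic
        M_hat = 1/(n(n-1)) * sum_{i != j} K_h(x_i - x_j) e_i e_j^T,
    with e_i = Y_i - mean(Y), symmetrize, and return lambda_max."""
    n, q = Y.shape
    E = Y - Y.mean(axis=0)                       # centred responses, n x q
    u = (x[:, None] - x[None, :]) / h
    k = np.exp(-0.5 * u**2) / (np.sqrt(2 * np.pi) * h)
    np.fill_diagonal(k, 0.0)                     # exclude the i == j terms
    M = E.T @ (k @ E) / (n * (n - 1))            # q x q matrix U-statistic
    M = (M + M.T) / 2                            # symmetrize before eigen-decomposition
    return np.linalg.eigvalsh(M)[-1]             # largest eigenvalue
```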
To study the asymptotic properties, we make the following assumptions. (A.5) There are positive constants c > 0 and 0 ≤ κ < 1/2 such that the signal strength of the active covariates is bounded below at rate n^{−κ}; the response variable Y satisfies the sub-exponential tail probability condition, that is, there exists a positive constant s_0 such that the corresponding exponential moment is finite for all 0 < s ≤ 2s_0; and the conditional mean is twice continuously differentiable and bounded by a measurable function b(x_k). Before stating the theoretical results, we first restate the following lemma, extracted from Corollary 3.2.6 of Bhatia [15].

Lemma 1 (Weyl's inequality):
Let A and B be Hermitian matrices. Then |λmax(A) − λmax(B)| ≤ ║A − B║, where ║·║ is the operator norm of a matrix, that is, its largest singular value (for a Hermitian matrix, the largest absolute eigenvalue). By the equivalence between matrix norms, there is a constant c > 0 such that ║A − B║ ≤ c║A − B║_F, where ║·║_F is the Frobenius norm of a matrix.
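Both the eigenvalue bound and the norm equivalence can be checked numerically on random symmetric (real Hermitian) matrices:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(5, 5)); A = (A + A.T) / 2   # random symmetric matrix
B = rng.normal(size=(5, 5)); B = (B + B.T) / 2

gap = abs(np.linalg.eigvalsh(A)[-1] - np.linalg.eigvalsh(B)[-1])
op = np.linalg.norm(A - B, 2)        # operator (spectral) norm
fro = np.linalg.norm(A - B, "fro")   # Frobenius norm

assert gap <= op + 1e-12             # Weyl: |lmax(A) - lmax(B)| <= ||A - B||
assert op <= fro + 1e-12             # norm equivalence holds with c = 1 here
```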
By assumption (A.5) and Theorem 7, we can choose the threshold so that the sure screening property holds with probability tending to one, where |A| denotes the cardinality of A.

DWV Criterion for Quantile Regression
In this section, we extend the DWV-SIS procedure to the quantile regression setting by means of a transformation. Consider the quantile regression model Q_τ(Y | X) = g(X). If some covariate X_k has no contribution to the τ-quantile of Y, then Q_τ(Y | X) does not depend on X_k. Note that, for a continuous response Y, we have P(Y ≤ Q_τ(Y | X) | X) = τ. Thus, the quantile regression model (20) is equivalent to E(1(Y ≤ g(X)) | X) = τ.
Based on this fact, we propose the DWV criterion for quantile regression as follows. Obviously, V_k = 0 for k ∈ I and V_k > 0 for k ∈ A.
Given an i.i.d. sample, we can estimate the criterion accordingly. Furthermore, we can develop a DWV criterion for general regression, which models a general functional of the conditional distribution of Y given X. Letting τ ∼ U(0, 1), we propose the DWV criterion for general regression analogously. The sure screening property can be proved similarly.
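A hedged sketch of the quantile and general-regression versions, assuming the transform 1(Y ≤ q_τ) − τ with the unconditional sample τ-quantile standing in for the unknown conditional quantile function, and a finite τ-grid approximating τ ∼ U(0, 1); both simplifications are assumptions, not the paper's exact estimators.

```python
import numpy as np

def dwv_quantile(x, y, tau, h):
    """DWV for tau-quantile screening (sketch): replace the centred
    response by 1(y_i <= q_tau) - tau, with q_tau the unconditional
    sample tau-quantile (an assumption here)."""
    n = len(y)
    e = (y <= np.quantile(y, tau)).astype(float) - tau
    u = (x[:, None] - x[None, :]) / h
    k = np.exp(-0.5 * u**2) / (np.sqrt(2 * np.pi) * h)
    np.fill_diagonal(k, 0.0)             # exclude the i == j terms
    return (e[:, None] * k * e[None, :]).sum() / (n * (n - 1))

def dwv_general(x, y, h, grid=None):
    """General-regression version: average the quantile criterion
    over a grid of tau values, approximating tau ~ U(0, 1)."""
    if grid is None:
        grid = np.linspace(0.05, 0.95, 19)
    return np.mean([dwv_quantile(x, y, t, h) for t in grid])
```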

Determination of bandwidth
Bandwidth selection is a critical step for nonparametric estimation, and DWV-SIS is no exception. Most existing methods depend on an accuracy measure of the regression model based on the selected active covariates. In this paper, inspired by the stability selection method, we propose a bootstrap strategy that does not rely on the regression model. Denote by {h_s}_{s=1}^{L} the candidate bandwidths.
Step 1: For each candidate bandwidth h_s, generate B bootstrap samples and apply DWV-SIS to each to obtain the active covariate sets Â_{s,b}, b = 1, …, B.
Step 2: For each covariate, count the number of times it is selected across the sets Â_{s,b}, and denote the covariates whose selection count exceeds [αLB] as the estimated "true" active set Â.
Step 3: The optimal bandwidth h_opt is determined by comparing each candidate's selections against Â.
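The three steps above can be sketched as follows; since the Step 3 criterion is not fully specified here, maximum average overlap with Â is used as an assumed stand-in, and the function names and defaults are illustrative.

```python
import numpy as np

def select_bandwidth(X, y, bandwidths, screen, d, B=50, alpha=0.8, seed=0):
    """Bootstrap bandwidth selection (sketch of the three-step strategy).

    `screen(X, y, d, h)` returns the indices of the top-d covariates.
    """
    rng = np.random.default_rng(seed)
    n, L = len(y), len(bandwidths)
    counts = np.zeros(X.shape[1])
    sets = []                              # A_hat_{s,b} for every (s, b)
    for h in bandwidths:                   # Step 1: bootstrap screening per h_s
        per_h = []
        for _ in range(B):
            idx = rng.integers(0, n, n)    # bootstrap resample
            per_h.append(set(screen(X[idx], y[idx], d, h)))
        sets.append(per_h)
        for s_ in per_h:
            for k in s_:
                counts[k] += 1
    # Step 2: covariates selected more than [alpha * L * B] times
    A_hat = {k for k in range(X.shape[1]) if counts[k] > alpha * L * B}
    # Step 3 (assumed criterion): h maximizing mean overlap with A_hat
    score = [np.mean([len(s_ & A_hat) for s_ in per_h]) for per_h in sets]
    return bandwidths[int(np.argmax(score))]
```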

Numerical simulation
In this section, we assess the finite sample performance of the proposed method via Monte Carlo simulation. We set p=1000 and d=[n/log(n)]. All simulation results are based on 500 replications. We consider a linear model in Example 1, an additive model in Example 2, multiple response regression in Example 3, balanced and unbalanced categorical response regression in Example 4, an interaction effect regression in Example 5, and a heteroscedastic model in Example 6. We evaluate performance through three criteria:
1. S: the minimum model size needed to include all active covariates. We report the 5%, 25%, 50%, 75%, and 95% quantiles of S over the 500 replications.
2. Ps: the proportion of replications in which an individual active covariate is selected for a given model size d.
3. Pa: the proportion of replications in which all active covariates are selected for a given model size d.
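The three criteria can be computed from the per-replication rankings as follows; the helper name `evaluate` and the array layout are illustrative choices.

```python
import numpy as np

def evaluate(ranks, active, d):
    """Summarize screening performance over replications.

    `ranks` is a (replications, p) array whose r-th row orders the
    covariates from the largest criterion value down; `active` is the
    set of truly active covariate indices."""
    # S: minimum model size needed to include every active covariate
    S = np.array([1 + max(np.where(np.isin(row, list(active)))[0])
                  for row in ranks])
    quantiles = np.quantile(S, [0.05, 0.25, 0.5, 0.75, 0.95])
    top = [set(row[:d]) for row in ranks]
    Ps = {k: np.mean([k in t for t in top]) for k in active}  # per covariate
    Pa = np.mean([active <= t for t in top])                  # all at once
    return quantiles, Ps, Pa
```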

Example 1:
This example is designed to compare the finite sample performance of DWV-SIS with SIS [2], DC-SIS [8], MDC-SIS [9] and SIRS [16]. In this example, we consider a linear model, where the covariates X are generated from Z ∼ N(0, I_p) and ε is generated from the standard normal distribution. We set n = 100.

Tables 1 and 2 summarize the simulation results of Example 1. DWV-SIS is comparable with SIS and MDC-SIS, and performs better than DC-SIS and SIRS.
Example 2: In this example, we investigate the performance for additive models. As in Shao and Zhang [9], we consider the additive model Y = g1(X1) + g2(X2) + … + ε, where X1, X2, …, Xp are i.i.d. U(0, 1) and ε is generated from the standard normal distribution. As in Example 2 of Shao and Zhang [9], let n = 400, g1(x) = x, and g2(x) = (2x − 1)².
Example 3: In this example, we consider multiple response regression, where the covariates X are generated from Z ∼ N(0, I_p) and ε is generated from the standard normal distribution. We set n = 200. Tables 6 and 7 show that DWV-SIS is significantly better than DC-SIS.

Example 4:
In this example, we investigate the performance for categorical response models. As in Cui et al. [10], we consider two cases. Case 3a (balanced), with classes r = 1, 2, …, R: given Y = r, the covariate vector X_i is generated by letting X_i = µ_r + ε_i, where the mean vector µ_r has rth component µ_{r,r} = 3 and all other components zero, and ε_i = (ε_{i1}, ε_{i2}, …, ε_{ip})^T comes from N(0, I_p). Let R = 10 and n = 100. In this example, we add MV-SIS because it is a discriminant analysis screening method for categorical responses. Tables 8 and 9 show that DWV-SIS slightly outperforms MV-SIS and is significantly better than DC-SIS.
Example 5: In this example, we consider the following interaction regression model, where X2, X4, X6, X8 are i.i.d. U(0, 1) and the other covariates are i.i.d. t(3). Because X2, X4, X6, X8 have no effect on the mean of Y, mean-based screening methods such as SIS, NIS and MDC-SIS cannot distinguish these covariates. We therefore only compare DWV-SIS with SIRS and DC-SIS.
Tables 10 and 11 summarize the simulation results and show that the proposed DWV-SIS performs better than SIRS and DC-SIS.

Example 6:
In this example, we consider the heteroscedastic model, where Z, X2, X3, …, Xp and ε are i.i.d. N(0, 1), and X1 is generated from a Poisson distribution with parameter Z. As in Example 5, X1, X6, X7, X8 have no effect on the mean of Y. The sample size is 400.

Real data analysis
The Amazon commerce reviews dataset can be found at https://archive.ics.uci.edu/ml/datasets/Amazon+Commerce+reviews+set; it is used for authorship identification in online Writeprint, a new research field of pattern recognition. The data are derived from customer reviews on the Amazon Commerce website, collected as 30 reviews for each of 50 of the most active users (each represented by a unique ID and username). Based on lexical, content-specific and idiosyncratic aspects, 10000 characteristic variables are collected, such as usage of digits and punctuation, word and sentence lengths, and usage frequency of words. Compared with the sample size n = 1500, the dimension p = 10000 is very large.
The response is a categorical variable. We use the R function unique to extract the unique elements of each covariate. From Table 14, the number of covariates that take only two values is 2178, and the maximum number of unique elements is 297. Hence, the covariates can be viewed as categorical variables. DWV-SIS, MV-SIS and DC-SIS each rank ten characteristic variables in Table 15. To compare the three methods, we fit linear discriminant analysis using the R function lda in the MASS package. MV-SIS is the best with a total accuracy of 32.5%, and DWV-SIS outperforms DC-SIS.