A general framework for tensor screening through smoothing

Abstract: Screening is an important technique for analyzing high-dimensional data. Most screening tools have been developed for vectors and are marginal in the sense that each variable is evaluated individually. Nowadays, many datasets are generated in the form of multi-dimensional arrays (tensors). In addition to being high-dimensional, these data have a tensor structure that should be exploited for more efficient analysis: variables adjacent to each other in a tensor tend to be important or unimportant at the same time. Such information is ignored by marginal screening methods. In this article, we propose a general framework for tensor screening called smoothed tensor screening (STS). STS combines the strength of current marginal screening methods with tensor structural information by aggregating the information of the adjacent variables when evaluating each variable. STS is widely applicable, since the statistical utility used in screening can be chosen based on the underlying model or data type of the responses and predictors. Moreover, we establish the SURE screening property for STS under mild conditions. Numerical studies demonstrate that STS outperforms marginal screening methods.


Introduction
A large number of tensor datasets have been appearing in modern scientific research, attracting much attention to the analysis of such datasets. For example, advanced neuroimaging technology often generates imaging data in the form of tensors. Electroencephalography monitors brain activity at several locations for a period of time, resulting in 2-way tensors. Magnetic resonance imaging produces 3-way tensors as scans of brains. Similarly, functional magnetic resonance imaging data are 4-way tensors with the 4th dimension being the time domain. Tensor data are also frequently seen in other fields, such as computational biology, personalized recommendation, and image recognition analysis.
In principle, we could vectorize the tensors and then apply existing vector methods to the dataset. However, there is a wide consensus that vectorization can lead to a loss of efficiency in analysis, as the tensor structure contains valuable information that usually cannot be adequately modeled by existing vector models. In this article, we focus on an especially important tensor structure, the smoothness structure.
Oftentimes, tensor data are collected in a way such that elements close to each other are similar, and thus their coefficients are smooth. For example, [38,39,53] argue that brain images often have the smoothness structure in that voxels adjacent to each other tend to be all important or unimportant at the same time, because biologically voxels in the same brain region usually function together. [59] consider tensor generalized linear models with low-rank and sparse coefficients, as they want to perform region selection instead of variable selection. [49] impose the smoothness structure to capture the dynamic nature of the tensor. When analyzing brain imaging data collected over time, they assume that the brain activities change smoothly along the time domain. More concretely, in Section 5.2 we consider an electroencephalography (EEG) dataset, in which voltage fluctuations are collected from 64 electrodes placed on the scalp and sampled at 256 Hz for 1 second. As a result, the EEG images are stored as 2-way tensors, with smoothness structure along rows (i.e., time points) and columns (i.e., locations of the electrodes). Vectorization of tensor data destroys such smoothness structure; the adjacent elements may no longer be close to each other after vectorization. Instead, it is highly desirable to preserve and leverage the tensor structure rather than to vectorize the data.
Another common property of tensor data is their high dimensionality. Many tensor data naturally have a large number of elements in them. For example, the EEG dataset in Section 5.2 is of dimension 256 × 64, with 16,384 elements in total. [24] and [42] analyzed the attention deficit hyperactivity disorder (ADHD) data of dimension 30 × 36 × 30, with 32,400 elements in total. Such high dimensionality calls for additional assumptions for accurate modeling. Therefore, many researchers borrow the popular sparsity assumption from high-dimensional statistics to enforce variable selection and thus reduce the model complexity in tensor data analysis [e.g., 59, 48, 42].
However, these high-dimensional tensor methods generally take the penalized approach, in which we consider an optimization problem that combines an appropriate loss function and a sparsity-inducing penalty. When the dimension is excessively high, penalized methods can be time-consuming or computationally unstable [10]. In such cases, screening is widely regarded as a computationally efficient preprocessing tool. For vector data, a large family of screening methods have been proposed [e.g., 10, 12, 18, 13, 8, 25, 35, 3, 19, 11, 36, 4]. These methods marginally rank the variables by some properly chosen screening utilities. Only the variables that appear marginally important are preserved for further analysis.
Although screening methods effectively mitigate the impact of high dimensionality on vector data, when applied to tensor data they are typically incapable of utilizing the tensor data structure. Since the screening utilities are calculated on individual variables, screening does not recognize the aforementioned spatial structure in tensor data, nor does it encourage the selection of important regions. Ignoring this important piece of information can drastically decrease the efficiency of our statistical analysis, especially given the high dimensionality and the relatively limited sample sizes of tensor data.
To tackle this challenge, we propose a general framework for tensor screening that explicitly takes advantage of the tensor structure. We incorporate the common smoothness structure in tensor data into the screening procedure, resulting in a smoothed tensor screening (STS) framework. When we calculate the screening utility for one variable, we not only consider this individual variable, but also aggregate the information from adjacent variables. Consequently, adjacent variables tend to receive similar rankings, and we encourage the selection of important regions instead of scattered elements. STS can be combined with any existing marginal screening method to exploit the tensor structure for better variable selection. Moreover, STS can be completed with the same order of computational cost as the corresponding marginal screening method and thus preserves the most attractive feature of marginal screening. We show that STS enjoys the so-called SURE screening property, i.e., it preserves all the important variables with a probability tending to 1 under mild conditions.
After STS, refined analysis can be performed on the reduced set. One could use either vector-based methods or tensor-based methods for this purpose. There are many methods developed for tensor model fitting. For example, for regression problems, see [59,20,43,53,24,57,31]. For classification problems, see [32,42]. Many of these methods utilize the low-rank assumption, which is related to tensor decomposition [e.g., 5, 50, 28, 56]. Most model-fitting methods can be easily combined with STS, either directly or with slight data augmentation. Some of them involve variable selection and thus can further exclude more variables from the dataset. Others may not perform variable selection on their own. STS is a convenient way to add variable selection to these methods, besides boosting their computational efficiency. Our numerical studies demonstrate superior performance for STS combined with many popular model-fitting methods.
The rest of this article is organized as follows. In Section 2, we start with a review of marginal screening. We also introduce some useful tensor notation. Section 3 presents the procedure of STS and the analysis afterwards. The SURE screening property is established in Section 4. In Section 5, we present simulation results as well as a real data analysis example. Section 6 summarizes our contributions and discusses some future research directions. Additional numerical studies and technical proofs are given in the Appendix.

Marginal screening
We first briefly review marginal screening for vector data, as the main purpose of this article is to generalize these methods to tensor data. Consider a random pair (U, Y), where U ∈ R^p is a p-dimensional vector of predictors, and Y is the univariate response that can be either continuous or discrete. Some works consider multivariate Y, but we focus on the univariate response case for ease of presentation. Nevertheless, our framework easily extends to the multivariate response. We observe n independent and identically distributed copies of the random pair, denoted as (U_i, Y_i), i = 1, ..., n. Consider high-dimensional problems where p is much larger than n. In this scenario, fitting models with all variables often leads to a drastic loss of efficiency and accuracy. Rather, statisticians often perform variable selection under the celebrated sparsity assumption. The sparsity assumption means that only a small subset of variables are related to the response Y. More rigorously, we define the set of important variables [60] as D = {j : F(y | U) functionally depends on U_j for some y}, where U_j is the jth feature of U and F(y | U) = pr(Y < y | U) is the conditional distribution function of Y given U. The sparsity assumption implies that |D| ≪ p. Hence, if we can identify D, estimation and prediction can be performed within a much lower-dimensional space.
Marginal screening is a family of computationally efficient techniques to identify D. They are usually carried out in the following procedure. First, we choose a proper screening utility φ_nj that measures the marginal dependence between U_j and Y such that a larger value of φ_nj indicates stronger dependence. For example, if Y is continuous, we can choose φ_nj to be the absolute value of the Pearson correlation between Y and U_j [10]. If Y is binary, φ_nj could be the absolute value of the t-statistic of U_j across the two levels of Y [7]. As mentioned in Section 1, more sophisticated utilities have been proposed in the literature as well to accommodate more complicated statistical models. With a chosen screening utility, all the variables are ranked by their corresponding φ_nj. Only the variables with the highest ranks are selected, i.e., we select the set S_φ(d_n) = {j : φ_nj is among the d_n largest of φ_n1, ..., φ_np}, where d_n is a positive integer predefined by users. Since most penalized methods can only deal with o(n) variables, the common choices for d_n are n and ⌈n/log n⌉, where ⌈a⌉ = min{i : i ≥ a and i is an integer} for a > 0.
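The ranking-and-thresholding procedure above takes only a few lines. The following is a minimal numpy sketch (our own illustration, not the authors' code), using the absolute Pearson correlation as the utility φ_nj and ⌈n/log n⌉ as d_n:

```python
import numpy as np

def marginal_screening(U, Y, d_n):
    """Rank variables by a marginal utility and keep the d_n top-ranked ones.

    Here the utility phi_nj is |Pearson correlation(U_j, Y)|, one common
    choice for continuous responses; U is (n, p), Y is (n,).
    """
    Uc = U - U.mean(axis=0)
    Yc = Y - Y.mean()
    # |corr(U_j, Y)| per column; the small epsilon guards constant columns
    phi = np.abs(Uc.T @ Yc) / (np.linalg.norm(Uc, axis=0) * np.linalg.norm(Yc) + 1e-12)
    # S_phi(d_n): indices of the d_n largest utilities
    return set(np.argsort(phi)[::-1][:d_n])

rng = np.random.default_rng(0)
n, p = 100, 500
U = rng.standard_normal((n, p))
# only variables 3 and 7 drive the response in this toy model
Y = 2.0 * U[:, 3] - 1.5 * U[:, 7] + 0.1 * rng.standard_normal(n)
d_n = int(np.ceil(n / np.log(n)))  # the ceiling(n / log n) convention
selected = marginal_screening(U, Y, d_n)
```

With strong signals, the two active variables survive the screening while the vast majority of the 500 variables are discarded.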
Since we only fit the model on S_φ(d_n), it is of utmost importance that S_φ(d_n) contains all the important variables in D. More formally, a screening method is said to enjoy the SURE screening property if pr(D ⊆ S_φ(d_n)) → 1. Most existing screening methods enjoy the SURE screening property under two types of interpretable conditions. Define φ_j as the population counterpart of φ_nj. The two conditions are:
Condition (V1). There exists a set S such that D ⊆ S and δ_S = min_{j∈S} φ_j − max_{j∉S} φ_j > 0.
Condition (V2). There exist a constant ε_0 > 0 and a monotonically decreasing function ζ_n(ε) such that for any 0 < ε < ε_0, pr(|φ_nj − φ_j| > ε) ≤ ζ_n(ε) for each j.
Condition (V1) guarantees the validity of screening on the population level; if we knew the true model, φ_j would provide a reasonable ranking such that the important variables are ranked higher than the unimportant ones. Condition (V2) requires φ_nj to be an accurate approximation of φ_j, so that we can preserve the ranking on the sample level. Condition (V2) is replaced with suitable lower-level conditions for specific screening methods (see, e.g., [13, 25, 35]).

Tensor notation
We introduce some tensor notation that will be used throughout the rest of the article. See [21] for a review of tensor notation and operations. A tensor is a multi-dimensional array, and the number of dimensions is called the order, or the number of ways, of the tensor. An R-dimensional array A ∈ R^{p_1 × ··· × p_R} is a tensor of order R. Vectors and matrices can be viewed as order-one (R = 1) and order-two (R = 2) tensors, respectively. The dimensions of a tensor are also known as its modes. The mode-r product of a tensor A with a matrix α ∈ R^{d × p_r} is denoted by A ×_r α and yields a tensor of size p_1 × ··· × p_{r−1} × d × p_{r+1} × ··· × p_R. The Tucker decomposition of A, defined as A = C ×_1 G_1 ··· ×_R G_R, decomposes A into the product of a core tensor C ∈ R^{d_1 × ··· × d_R} and R factor matrices G_r ∈ R^{p_r × d_r}, r = 1, ..., R. The Tucker decomposition is often written in the shorthand notation A = ⟦C; G_1, ..., G_R⟧. If all elements in A independently follow the standard normal distribution and X = μ + ⟦A; Σ_1^{1/2}, ..., Σ_R^{1/2}⟧, then X follows a tensor normal (TN) distribution, denoted as X ∼ TN(μ, Σ_1, ..., Σ_R). If R = 2, the tensor normal distribution reduces to the matrix normal (MN) distribution [17].
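The mode-r product and the Tucker form are easy to realize with standard array operations. Below is a short numpy sketch (function names are ours) that contracts mode r of A against the columns of a factor matrix:

```python
import numpy as np

def mode_product(A, alpha, r):
    """Mode-r product A x_r alpha for alpha of size d x p_r."""
    Ar = np.moveaxis(A, r, 0)                       # bring mode r to the front
    out = np.tensordot(alpha, Ar, axes=([1], [0]))  # contract p_r, leaving d
    return np.moveaxis(out, 0, r)                   # restore the mode order

def tucker_product(C, factors):
    """[[C; G_1, ..., G_R]] = C x_1 G_1 x_2 G_2 ... x_R G_R."""
    A = C
    for r, G in enumerate(factors):
        A = mode_product(A, G, r)
    return A

rng = np.random.default_rng(1)
C = rng.standard_normal((2, 3, 4))            # core tensor
G = [rng.standard_normal((5, 2)),
     rng.standard_normal((6, 3)),
     rng.standard_normal((7, 4))]             # factor matrices G_r
A = tucker_product(C, G)                      # a 5 x 6 x 7 tensor
```

Each mode-r product replaces the length-p_r mode with a length-d mode, so the Tucker product of a 2 × 3 × 4 core with these factors yields a 5 × 6 × 7 tensor.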

The proposed smoothed tensor screening procedure
We consider screening on tensor data. We are interested in the random pair (X, Y), where X ∈ R^{p_1 × ··· × p_R} is an R-way tensor predictor and Y is the univariate response that can be either continuous or discrete. The tensor predictor X has p = ∏_{r=1}^R p_r elements, which is often an intimidating number. We observe a random sample (X_i, Y_i), i = 1, ..., n, where n is much smaller than p.
We want to perform screening on the observed data to reduce the number of variables. Also, recall that we want to leverage the tensor structure for better screening results. To this end, we assume that variables adjacent to each other tend to be important or unimportant at the same time. Let J = (j_1, ..., j_R) be an index and X_J be the Jth element of X. The tensor version of the set of important variables is D = {J : F(y | X) functionally depends on X_J for some y}.
For high-dimensional tensors, we propose a general smoothed tensor screening (STS) framework for (X, Y ). STS consists of the following three steps.
First, we choose an appropriate measurement of the dependence between Y and X J , φ nJ , for each J . Most existing marginal utilities can be candidates as long as their corresponding models are reasonable. For example, if we believe that Y and X are related through a linear regression model, φ nJ could be the Pearson correlation [10]. If a generalized linear model is suitable, we can take φ nJ to be the coefficient of the marginal generalized linear model [13]. If we wish to perform screening in a model-free fashion, distance correlation can be applied [25]. We will demonstrate our proposed framework with three popular choices in later sections.
Second, we exploit the tensor structure by incorporating the neighborhood information. For each J = (j_1, ..., j_R), define Ω_J = {I = (i_1, ..., i_R) : |i_r − j_r| ≤ 1, r = 1, ..., R} \ {J}. Apparently, Ω_J contains all the predictors adjacent to X_J. Then we obtain the smoothed screening utility
φ^Smooth_nJ = φ_nJ + (c/|Ω_J|) Σ_{I∈Ω_J} φ_nI,
where c ≥ 0 is a smoothing constant. If desired, c can be chosen by cross-validation. However, we discover that this is generally not necessary if the smoothness structure exists: the screening results are not sensitive to the choice of c, as long as it is in a reasonable range. Define ω = max_J |Ω_J|. Generally, if p_r ≥ 3 for r = 1, ..., R, we have ω = 3^R − 1. If ω/2 ≤ c ≤ ω, the screening results are roughly constant in all our simulation models that satisfy the smoothness assumption. When c = 0, the procedure is equivalent to marginal screening, which ignores the tensor structure and does not effectively separate the signals and the noises. After smoothing the statistics, we have a better recovery of the true signal block.
Third, we rank the variables by φ^Smooth_nJ and select the set S^Smooth_φ(d_n) = {J : φ^Smooth_nJ is among the d_n largest values}. Following the convention in marginal screening, we could set d_n to be ⌈n/log n⌉ or n.
Our proposed screening procedure is different from marginal screening in that it smoothes the screening utilities across the locations over the tensor. Hence, we refer to it as smoothed tensor screening (STS). STS is a general framework for tensor screening because it can be combined with any marginal screening method to achieve smoothed variable selection on tensor data. By combining φ_nJ with its neighbors' utilities, we utilize the natural spatial structure in tensor data to obtain better variable selection. Moreover, STS has the same order of computation cost as marginal screening. For example, if the computation cost for φ_nJ is O(n^γ) for some γ > 0, then the computation cost for the corresponding marginal screening is O(n^γ · p). STS has an additional smoothing step, with a computation cost of O(ωp) = o(n^γ · p) as n becomes large. Hence, the total computation cost of STS remains the same as that of the marginal screening method in use.
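The smoothing step can be sketched directly. The toy below (our own illustration; the constant-weight neighbor average is our reading of the smoothed-utility formula) shows how, after smoothing, a block of moderate utilities outranks an isolated noise spike:

```python
import numpy as np
from itertools import product

def sts_smooth(phi, c):
    """Smoothing step of STS: add to each utility c times the average of its
    (at most 3^R - 1) adjacent utilities; boundary cells use only real neighbors."""
    shape = phi.shape
    R = phi.ndim
    out = np.empty(shape, dtype=float)
    for J in product(*(range(d) for d in shape)):
        nbrs = []
        for shift in product((-1, 0, 1), repeat=R):
            if all(s == 0 for s in shift):
                continue                              # Omega_J excludes J itself
            I = tuple(j + s for j, s in zip(J, shift))
            if all(0 <= i < d for i, d in zip(I, shape)):
                nbrs.append(phi[I])
        out[J] = phi[J] + c * np.mean(nbrs)
    return out

# 1-D toy: an isolated noise spike vs. a smooth block of moderate signals
phi = np.array([0., 0., 5., 0., 0., 3., 3., 3., 0., 0.])
phi_c0 = sts_smooth(phi, 0.0)  # c = 0 recovers marginal screening
phi_c2 = sts_smooth(phi, 2.0)  # after smoothing, the block outranks the spike
```

The spike at position 2 gains nothing from its flat neighborhood (5 + 2·0 = 5), while the block center at position 6 is boosted to 3 + 2·3 = 9 and now ranks first.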
We further present a toy example to illustrate the procedure of STS.

Example 1.
Consider a binary classification problem where the response Y = 1 or 2 with equal probabilities and the predictor X ∈ R^{64×64}. We generate X from the matrix normal distribution, X | (Y = k) ∼ MN(μ_k, Σ_1, Σ_2), k = 1, 2. In the first class, μ_{1,D} = 0.35 on the block of important variables D and μ_{1,D^c} = 0. In the second class, μ_2 = 0. For each class, we simulate 150 samples, which is a small sample size compared to the 4,096 elements in X.
Our model is known as a tensor discriminant analysis model [42]. It is similar to a discriminant analysis model, in which the t-statistics can be used for marginal screening [7]. Denote T_nJ as the absolute value of the t-statistic calculated on (X_J, Y). We then compute the smoothed utilities for c = 0, 1, 4, 8. Without smoothing, many important elements have small T_nJ, while many unimportant elements have misleadingly large T_nJ. However, as we exploit the tensor structure by increasing c, the signal block stands out clearly. While the block for c = 1 (Panel (c)) is somewhat blurry, it becomes much more distinctive for c = 4, 8 (Panels (d) & (e)). Hence, by encouraging smoothness, STS is more efficient in distinguishing the important elements from the unimportant ones. Moreover, Panels (d) & (e) look similar, which demonstrates our earlier claim that STS is not sensitive to the choice of c as long as it is reasonably large.
With the screening utilities, STS chooses the top d n predictors. We let d n = n and plot the proportion of selected active predictors and that of selected inactive predictors, corresponding to c = 0, 1, . . . , 8 in Figure 2. With the increase of c, most active predictors are selected and very few inactive predictors are selected. Also, the proportions are very stable for c ≥ 2, indicating that there is no need to finely tune c under the smoothness assumption.
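A scaled-down variant of this experiment is easy to reproduce. The sketch below (our own toy: a 20 × 20 matrix with a 5 × 5 signal block and 75 samples per class, rather than the paper's 64 × 64 setting) computes elementwise t-statistic utilities, smooths them with c = ω/2 = 4, and records the fraction of truly active elements captured among the top d_n = n:

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(42)
p1 = p2 = 20
n_k = 75                                   # samples per class
mu1 = np.zeros((p1, p2))
mu1[8:13, 8:13] = 0.35                     # signal block, as in Example 1
X1 = mu1 + rng.standard_normal((n_k, p1, p2))
X2 = rng.standard_normal((n_k, p1, p2))    # class 2 has zero mean

# |two-sample t-statistic| per element: the marginal utility T_nJ
T = np.abs(X1.mean(0) - X2.mean(0)) / np.sqrt(
    X1.var(0, ddof=1) / n_k + X2.var(0, ddof=1) / n_k)

def sts_smooth2d(phi, c):
    """Add c times the average of the (up to 8) adjacent utilities."""
    out = np.empty_like(phi)
    for J in product(range(p1), range(p2)):
        nbrs = [phi[J[0] + a, J[1] + b]
                for a in (-1, 0, 1) for b in (-1, 0, 1)
                if (a, b) != (0, 0)
                and 0 <= J[0] + a < p1 and 0 <= J[1] + b < p2]
        out[J] = phi[J] + c * np.mean(nbrs)
    return out

active = np.zeros((p1, p2), bool)
active[8:13, 8:13] = True
d_n = 2 * n_k                              # d_n = n

def frac_active(phi):
    """Fraction of truly active elements among the top-d_n ranked ones."""
    sel = np.zeros(p1 * p2, bool)
    sel[np.argsort(phi, axis=None)[::-1][:d_n]] = True
    return (sel.reshape(p1, p2) & active).sum() / active.sum()

frac_c0 = frac_active(T)                    # marginal screening (c = 0)
frac_c4 = frac_active(sts_smooth2d(T, 4.0)) # STS with c = omega/2 = 4
```

With smoothing, the signal block is recovered almost in full, mirroring the behavior reported for Figure 2.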
Finally, we comment on the robustness of STS. STS is proposed to exploit the smoothness structure in the data. If there is no smoothness, STS is not recommended, as its screening results are expected to be less accurate than those of marginal screening. However, STS is resistant to partial violations of the smoothness assumption. By our construction, STS is most efficient when the tensor is smooth along all modes. But sometimes in practice, the tensor is only smooth along some modes. For example, for a matrix predictor, it could be the case that the tensor is only smooth along the rows, but not the columns. Our numerical studies show that STS still works well in this partially smooth scenario. Moreover, recall that, in smooth models, we recommend ω/2 ≤ c ≤ ω. Among these choices, c = ω/2 is the most robust to possible violations of the smoothness assumption. See Section 5.1 for empirical evidence of our discussion.

Other possible approaches for smoothed tensor screening
Compared to marginal screening, the most important innovation for STS is to aggregate information from neighbors to take advantage of the smoothness structure. We achieve this goal by taking the weighted average of an element and its immediately adjacent neighbors. However, there are other possible ways to smooth the screening utilities. We discuss two possibilities.
First, STS only uses the immediately adjacent elements, resulting in neighborhoods of size 3 along each mode. One could expand the neighborhoods to obtain more smoothed results. We investigate the effect of neighborhood size in Appendix C. We find that if there is a strong smoothness structure, a neighborhood of size 5 or 7 (i.e., two or three adjacent elements are used instead of one along each direction) can have slightly better results than size 3, but the improvement is minimal. Further increasing the neighborhood size no longer helps, and overly large neighborhoods eventually hurt the performance of STS. Moreover, even for size 5, there is a notable drop in robustness: when the smoothness assumption is not fully satisfied, STS has much worse results with a neighborhood size of 5 than with 3. Therefore, larger neighborhoods should only be used when the true model is very smooth. Otherwise, the neighborhood size of 3 is safer.
Second, there are other methods for the smoothing step in STS. For example, the Gaussian filter is a very popular image processing technique [16]. Although the Gaussian filter is often employed to smooth the data itself rather than the screening utilities, it can be applied straightforwardly in the screening context. With the Gaussian filter, we still take the weighted average of the screening utilities within a neighborhood, but the weights are generated by a Gaussian kernel instead of being a constant as in STS. See Appendix D for an introduction to the Gaussian filter and a comparison between it and STS. In STS, the constant c controls the degree of smoothness. In the Gaussian filter, the standard deviation of the Gaussian kernel serves the same purpose. When the standard deviation is chosen well, the Gaussian filter performs almost identically to STS. Hence, throughout the rest of this article, we restrict our attention to STS to avoid redundancy.
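To make the comparison concrete, the following small sketch (our own illustration, not from [16]; `gaussian_neighbor_weights` is a hypothetical helper) builds the 3 × 3 kernel weights that would replace the constant weights of STS:

```python
import numpy as np

def gaussian_neighbor_weights(sigma, radius=1):
    """Normalized Gaussian weights on a (2*radius+1)^2 window: nearby cells
    receive larger weights, and as sigma grows the weights flatten toward
    the constant weights used in STS."""
    ax = np.arange(-radius, radius + 1)
    xx, yy = np.meshgrid(ax, ax, indexing="ij")
    w = np.exp(-(xx ** 2 + yy ** 2) / (2.0 * sigma ** 2))
    return w / w.sum()

w_narrow = gaussian_neighbor_weights(sigma=0.5)  # sharply peaked at the center
w_wide = gaussian_neighbor_weights(sigma=5.0)    # nearly uniform weights
```

A small standard deviation concentrates the weight on the center element (little smoothing), while a large one makes the weights nearly uniform, effectively recovering the constant-weight average of STS.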

Analysis after smoothed tensor screening
After STS, we perform model fitting on the selected variables. This is usually straightforward if we choose a vector-based method: we simply apply the method to (Y, X_STS), where X_STS contains all the elements preserved by STS. However, it might be difficult to directly use a tensor-based analysis tool, because the reduced set X_STS may no longer be a tensor. Hence, if we hope to use a tensor-based analysis tool, we first augment the reduced set into a tensor. Given S^Smooth_φ(d_n), we find the index set for each mode. Specifically, for r = 1, ..., R, define M_r = {j_r : (j_1, ..., j_R) ∈ S^Smooth_φ(d_n)}. Then we keep all the variables in the sub-tensor X̃ = X_{M_1 × ··· × M_R}. The resulting X̃ is the smallest tensor that contains X_STS. Further analysis can be performed on X̃. We continue to use the model setting in Example 1 to illustrate this augmentation step.
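The mode-wise augmentation amounts to collecting, for each mode, the coordinates that appear in some selected index. A minimal sketch with a hypothetical scattered selection (`augment_to_tensor` is our name for the helper):

```python
import numpy as np

def augment_to_tensor(selected, shape):
    """For each mode r, collect M_r, the coordinates appearing in some selected
    index; the smallest enclosing sub-tensor is indexed by M_1 x ... x M_R."""
    R = len(shape)
    M = [sorted({J[r] for J in selected}) for r in range(R)]
    kept = np.zeros(shape, dtype=bool)  # True marks elements kept after augmentation
    kept[np.ix_(*M)] = True
    return M, kept

# a scattered selection on a 6 x 6 matrix that is not itself a sub-tensor
selected = [(1, 2), (2, 2), (2, 3), (4, 2)]
M, kept = augment_to_tensor(selected, (6, 6))
```

Here M_1 = {1, 2, 4} and M_2 = {2, 3}, so the augmented block has 3 × 2 = 6 elements: four originally selected plus two grey fill-in elements, exactly the behavior illustrated in Example 1 (cont'd).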

Example 1. (Cont'd)
Under the same setting in Example 1, we plot the selected variables in Figure 3. Apparently, the selected variables (black) no longer form a tensor. We augment the selection results by further including the grey elements in Figure 4. The black and grey elements together form a smaller tensor. When c is reasonably large and STS works well, the augmentation does not include too many extra variables.
When the signal in the tensor is smooth and STS produces an accurate selection result, the augmented data X̃ is often a small tensor, as is the case in Example 1. But if one is concerned with the inclusion of extra variables in the augmentation, the screening utilities can be combined with a penalty to filter out these variables. For example, denote B ∈ R^{|M_1| × ··· × |M_R|} as the parameter of interest and Σ_J λ_J P(B_J) as the penalty function we choose. Then we can let λ_J = λ/φ^Smooth_nJ for some λ > 0 so that the added variables are more heavily penalized than the ones preserved by screening. However, we do not observe significant improvement with this penalized procedure, because X̃ usually only includes a small number of extra variables when STS is performed with a suitable c.

Smoothed tensor screening with three popular screening methods
In what follows, we present three examples of STS combined with three popular marginal screening utilities: the t-statistic, the maximum marginal likelihood estimator (MMLE), and the distance correlation. The t-statistic is a natural screening statistic for classification problems, MMLE works for the generalized linear model, and distance correlation is a successful model-free screening method. All these methods were originally proposed for vector data, but their direct generalizations to tensor data are straightforward. We first give a short review of the three statistical utilities in their original model setting, and then we extend them to tensor data. Throughout the rest of this section, we use U ∈ R^p to denote the vector predictor and X ∈ R^{p_1 × ··· × p_R} to denote the tensor predictor. Our observations are denoted as (U_i, Y_i), i = 1, ..., n, for vector data and (X_i, Y_i), i = 1, ..., n, for tensor data.

The t-statistic
Consider a binary classification problem where Y ∈ {1, 2} denotes the class label and U ∈ R^p denotes the vector predictor. When Y = k, k = 1, 2, we assume that U = μ_k + ε_k, where μ_k ∈ R^p is the mean vector of class k and ε_k ∈ R^p is the error term, in which each element has mean zero. [7] proposed to calculate the t-statistic of each U_j across the two levels of Y and rank the importance of U_j by the magnitude of its corresponding t-statistic.
For tensor data, consider the class label Y ∈ {1, 2} and the R-way tensor predictor X ∈ R^{p_1 × ··· × p_R}. When Y = k, we assume that X = μ_k + ε_k, where μ_k ∈ R^{p_1 × ··· × p_R} is the mean tensor of class k and ε_k ∈ R^{p_1 × ··· × p_R} is the error term where each element has a mean of zero. We continue to use the two-sample t-statistic on tensor data. In a dataset with n observations, assume that there are n_k samples within the class Y = k. We calculate
φ^t_nJ = |X̄_{1J} − X̄_{2J}| / {n(σ̂²_{1J}/n_1 + σ̂²_{2J}/n_2)}^{1/2},
where X̄_{kJ} and σ̂²_{kJ} are the sample mean and variance of the Jth variable in class k. Note that φ^t_nJ is the t-statistic divided by n^{1/2}. Since φ^t_nJ and n^{1/2}φ^t_nJ give us the same ranking of predictors, they are the same for the sake of screening. But such rescaling helps us define the population statistical utility
φ^t_J = |μ_{1J} − μ_{2J}| / (σ²_{1J}/π_1 + σ²_{2J}/π_2)^{1/2},
where μ_kJ and σ²_kJ are the true population mean and variance of the Jth variable in class k and π_k = pr(Y = k).
With φ^t_nJ, we apply STS by calculating
φ^{t.STS}_nJ = φ^t_nJ + (c/|Ω_J|) Σ_{I∈Ω_J} φ^t_nI.
The variables are then ranked by φ^{t.STS}_nJ and we select the subset
S^{t.STS}(d_n) = {J : φ^{t.STS}_nJ is among the d_n largest values}.
We refer to this procedure as STS-t screening. Similar to t screening, STS-t screening is suitable when the response Y is a categorical variable. However, STS-t screening is designed for tensor predictors instead of vector predictors. By taking into account the smoothness structure of tensor data, we achieve more efficient screening. After STS-t screening, we could apply a wide range of model-fitting methods. For example, if we focus on the selected predictors X_STS, we can apply sparse discriminant analysis methods for vector data, such as [2,6,45,54,9,37,55,34]. Alternatively, with the data augmentation in Section 3.3, we can apply tensor classification methods such as covariate-adjusted tensor classification (CATCH, [42]) and tensor regression based on the CP decomposition (CP-GLM, [59]). Note that the tensor methods cannot be easily combined with marginal t screening, because t screening does not honor the tensor structure.

The marginal maximum likelihood estimator
Suppose that the response Y is from an exponential family whose probability density function has the canonical form
f(y; θ) = exp{θy − b(θ) + c(y)},
where b(·) and c(·) are known functions and θ is the natural parameter. Suppose that the predictor U is a p-dimensional vector and β = (β_0, β_1, ..., β_p) is the parameter. The following generalized linear model is assumed:
E(Y | U) = g^{−1}(β_0 + Σ_{j=1}^p β_j U_j),
where g = (b′)^{−1} is the link function. Under this model, [13] proposed the screening method based on the maximum marginal likelihood estimator (MMLE), which is obtained from componentwise regression. The marginal estimator β̂^M_j is defined by
(β̂^M_{j,0}, β̂^M_j) = argmin_{β_0, β_j} n^{−1} Σ_{i=1}^n l(β_0 + β_j U_{ij}, Y_i),
where l(·, ·) is the negative log-likelihood. Then the marginal screening utility is chosen to be φ^GLM_nj = |β̂^M_j|. For tensor predictors, the marginal screening utility is analogously φ^GLM_nJ = |β̂^M_J|. Its population version φ^GLM_J is similarly obtained by minimizing E(l(β_0 + β_J X_J, Y)). To apply STS, we further calculate
φ^{GLM.STS}_nJ = φ^GLM_nJ + (c/|Ω_J|) Σ_{I∈Ω_J} φ^GLM_nI.
The variables are then ranked by φ^{GLM.STS}_nJ and we select the subset
S^{GLM.STS}(d_n) = {J : φ^{GLM.STS}_nJ is among the d_n largest values}.
We refer to this procedure as STS-GLM screening. Whenever the generalized linear model is suitable for the response, e.g., binomial or normal Y, we can apply STS-GLM screening. By using the smoothness structure of tensor data, STS-GLM screening is more efficient than vector-based GLM screening on tensor data. After obtaining the screened data, we can apply vector-based methods such as [14], or, with the data augmentation step, we can also apply tensor regression methods such as CP-GLM [59].
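For a binary response, the componentwise MMLE reduces to fitting a one-predictor logistic regression per variable. A minimal sketch (our own implementation, not the authors' code) via Newton-Raphson:

```python
import numpy as np

def marginal_logistic_slope(u, y, n_iter=25):
    """MMLE for one predictor: maximize the marginal logistic likelihood over
    (beta_0, beta_j) by Newton-Raphson and return |beta_j| as the utility."""
    Z = np.column_stack([np.ones_like(u), u])
    beta = np.zeros(2)
    for _ in range(n_iter):
        prob = 1.0 / (1.0 + np.exp(-Z @ beta))
        grad = Z.T @ (y - prob)                          # score of the log-likelihood
        hess = (Z * (prob * (1 - prob))[:, None]).T @ Z  # observed information
        beta += np.linalg.solve(hess, grad)
    return abs(beta[1])

rng = np.random.default_rng(3)
n, p = 300, 50
U = rng.standard_normal((n, p))
signal = 1.0 / (1.0 + np.exp(-1.5 * U[:, 0]))  # only variable 0 matters
Y = (rng.random(n) < signal).astype(float)
phi = np.array([marginal_logistic_slope(U[:, j], Y) for j in range(p)])
```

The single active variable receives by far the largest marginal coefficient, so it tops the ranking.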

Distance correlation screening
Distance correlation screening [25] is a model-free screening method that works for any statistical model. It uses distance correlation [51] to measure the dependence between the response and the predictor. We first briefly review distance correlation.
Consider two random variables V and W. The distance correlation can be calculated for a pair of random vectors, but we only consider univariate random variables V, W ∈ R here for ease of presentation. The squared distance covariance is defined as
dcov²(V, W) = ∫_{R²} |α_{V,W}(t, s) − α_V(t)α_W(s)|² (π²t²s²)^{−1} dt ds,
where α_{V,W} is the joint characteristic function of (V, W), α_V is the characteristic function of V and α_W is the characteristic function of W. The distance correlation dcorr(V, W) between V and W is defined as
dcorr(V, W) = dcov(V, W) / {dcov(V, V) dcov(W, W)}^{1/2}.
The distance correlation is a measurement of dependence between V and W because dcorr(V, W) = 0 if and only if V and W are independent. In practice, with n samples, the squared distance covariance is estimated by
dĉov²(V, W) = Ŝ_1 + Ŝ_2 − 2Ŝ_3,
where Ŝ_1 = n^{−2} Σ_{i,j} |V_i − V_j||W_i − W_j|, Ŝ_2 = (n^{−2} Σ_{i,j} |V_i − V_j|)(n^{−2} Σ_{i,j} |W_i − W_j|), and Ŝ_3 = n^{−3} Σ_{i,j,l} |V_i − V_l||W_j − W_l|, and dcorr(V, W) can be estimated accordingly. In distance correlation screening, the estimated distance correlation is used as the screening utility.
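The moment-form estimator translates directly into numpy. The following is a short sketch (our reading of the univariate estimator), illustrating that the sample distance correlation detects a purely nonlinear dependence that the Pearson correlation would miss:

```python
import numpy as np

def dcov2(v, w):
    """Squared sample distance covariance S1 + S2 - 2*S3 from pairwise distances."""
    a = np.abs(v[:, None] - v[None, :])
    b = np.abs(w[:, None] - w[None, :])
    s1 = (a * b).mean()
    s2 = a.mean() * b.mean()
    s3 = (a.mean(axis=1) * b.mean(axis=1)).mean()  # the n^-3 triple sum
    return s1 + s2 - 2.0 * s3

def dcorr(v, w):
    """Sample distance correlation; defined as zero when a denominator vanishes."""
    denom = np.sqrt(dcov2(v, v) * dcov2(w, w))
    return np.sqrt(max(dcov2(v, w), 0.0) / denom) if denom > 0 else 0.0

rng = np.random.default_rng(7)
v = rng.standard_normal(500)
r_dep = dcorr(v, v ** 2)                    # uncorrelated but dependent pair
r_ind = dcorr(v, rng.standard_normal(500))  # independent pair
```

Although v and v² have zero Pearson correlation, their sample distance correlation is clearly larger than that of an independent pair.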
To apply distance correlation screening with tensor predictors, we consider the response Y and the tensor predictor X ∈ R^{p_1 × ··· × p_R}. We define the marginal screening utility φ^DC_nJ = dcorr_n(X_J, Y), the sample distance correlation between X_J and Y. To apply STS, we further compute the smoothed screening utility
φ^{DC.STS}_nJ = φ^DC_nJ + (c/|Ω_J|) Σ_{I∈Ω_J} φ^DC_nI.
Then we keep the variables with the top d_n values of φ^{DC.STS}_nJ, that is,
S^{DC.STS}(d_n) = {J : φ^{DC.STS}_nJ is among the d_n largest values}.
We refer to this procedure as STS-DC screening. Since distance correlation screening is a model-free screening method, STS-DC screening is also model-free, and almost all model-fitting methods can be applied after it. Moreover, STS-DC is a screening method for tensor predictors and is expected to achieve better screening results when there is a smoothness structure in the tensor.

A generic theorem
In this section, we present the theoretical properties of STS. Since STS is a general framework that can be combined with any screening utility, we first consider its properties for a generic utility φ_nJ. Then we present more concrete results for STS with the three popular screening methods mentioned in Section 3.4. All our proofs are included in Appendix A. For a generic φ_nJ and its smoothed version φ^Smooth_nJ, we respectively define their population counterparts as φ_J and φ^Smooth_J = φ_J + (c/|Ω_J|) Σ_{I∈Ω_J} φ_I. We consider the following two conditions.
Condition (T1). There exists a set S_1 such that D ⊆ S_1 and δ^Smooth_{S_1} = min_{J∈S_1} φ^Smooth_J − max_{J∉S_1} φ^Smooth_J > 0.
Condition (T2). There exist a constant ε_0 > 0 and a monotonically decreasing function ζ_n(ε) such that for any 0 < ε < ε_0, we have pr(|φ_nJ − φ_J| > ε) ≤ ζ_n(ε) for each J.
Apparently, Conditions (T1) & (T2) are similar to Conditions (V1) & (V2) introduced in Section 2.1 for marginal screening; see more discussion after Theorem 4.1. Recall that ω is the maximum number of neighbors, and c is the weight for neighbors. We have the following theorem.
Theorem 4.1. Suppose that Condition (T2) holds. We have the following conclusions.
(i) There exists ε_0 > 0 such that for any 0 < ε < ε_0, pr(|φ^Smooth_nJ − φ^Smooth_J| > ε) ≤ (1 + ω)ζ_n{ε/(1 + c)} for each J.
(ii) If, in addition, Condition (T1) holds and d_n ≥ |S_1|, then pr(D ⊆ S^Smooth_φ(d_n)) ≥ 1 − p(1 + ω)ζ_n{δ^Smooth_{S_1}/(2 + 2c)}.
Theorem 4.1(i) shows that the smoothed utility inherits the concentration of the marginal utility, and Theorem 4.1(ii) establishes the SURE screening property of STS. Condition (T1) requires that, on the population level, the smoothed utilities should be able to accurately detect the important variables. In this sense, Condition (T1) is similar to Condition (V1) for vector methods. In what follows, we further show that when the true signal is smooth, (T1) is a natural generalization of (V1) to tensor data. First, we rewrite Condition (V1) for tensor data. With a little abuse of terminology, we still refer to the tensor version as Condition (V1).

Condition (V1). There exists a set S such that D ⊆ S and δ_S = min_{J∈S} φ_J − max_{J∉S} φ_J > 0.
Next, we consider a smoothness assumption. Let Φ ∈ R^{p_1 × ··· × p_R} denote the tensor with its Jth element being the screening utility φ_J. We make the following assumption.
Assumption (A1). The tensor Φ is piece-wise constant: its index set can be partitioned into J_0 regions such that φ_J is constant within each region.
Assumption (A1) is an ideal case where the true signals are piece-wise constant. It can be seen as an approximation to the practical case where the true signals slightly fluctuate within each region. A similar assumption has been used by [44] to study the theoretical properties of the fused lasso.
For a set S that satisfies Condition (V1), we consider the set S_1 = S ∪ (∪_{J∈S} Ω_J). The set S_1 contains all the elements in S and their neighbors. We have the following lemma.

Lemma 4.1. Under Condition (V1) and Assumption (A1), for any non-negative c, the set S_1 satisfies Condition (T1), with a corresponding signal gap δ^Smooth_{S_1}.
It can be seen that Conditions (V1) & (A1) together imply Condition (T1). Hence, Condition (T1) is very intuitive. Moreover, we generally hope that the gap δ^Smooth_{S_1} is large. Lemma 4.1 suggests that, when c is considerably smaller than ω, the lower bound for δ^Smooth_{S_1} is also much smaller than δ_S. Hence, it makes sense to only consider c comparable to ω, which is in accordance with our proposal of c = ω/2 or c = ω. In the meantime, since R is a fixed constant and J_0 ≤ |S|, the bound |S_1| ≤ 3^{R−1}(|S| + 2J_0) indicates that |S_1| is of the same order as |S|. Consequently, STS can handle the same level of sparsity as the marginal screening utility it is combined with.
Finally, we note an important difference between screening and hypothesis testing, although many screening utilities are historically test statistics, such as the t-statistic we discussed. In screening, the refined analysis will be performed exclusively on the reduced data. Consequently, if an important variable is falsely removed by screening, it is impossible to recover it in the final modeling. This is why we need the SURE screening property to ensure that there is no false negative. On the other hand, if an unimportant variable survives the screening step, a sparse method in the second step can still identify it. Screening is tolerant of false positives in this sense.
In contrast, in multiple testing problems, it is essential to control the false discovery rate. For example, [41] proposed to use a smoothing procedure to detect clusters. In particular, a bias adjustment is developed to prevent the smoothing from inflating the false discovery rate. No such adjustment is needed in STS because STS is concerned with false negatives rather than false positives.

Theoretical properties for STS-t, STS-GLM and STS-DC
In what follows, we study the theoretical properties of STS-t, STS-GLM, and STS-DC. For all three examples, our study verifies that φ^Smooth_{nJ} uniformly converges to φ^Smooth_J at the same rate as φ_{nJ} converges to φ_J, by showing that they have the same order in their probability bounds. For each method, we also replace Condition (T2) with suitable lower-level conditions.

The STS-t screening
According to Theorem 4.1, the SURE screening property of STS-t depends on Conditions (T1) and (T2). We introduce the following condition that guarantees Condition (T2). To apply STS-t screening, we assume the model in (3.4).

Condition (T3). Within each class k, the variables are uniformly sub-Gaussian, and the within-class variances are bounded away from 0 uniformly.
A straightforward proof shows that Condition (T3) implies Condition (T2); see Appendix A.3. Condition (T3) is a widely used assumption in high-dimensional statistics. Hence, Condition (T2) is indeed very mild. As a consequence of Theorem 4.1, we have the following result concerning the SURE screening property of STS-t.

Corollary 4.1. Suppose that Conditions (T1) and (T3) hold. If the signal gap dominates (Σ_{r=1}^R log(p_r)/n)^{1/2}, then STS-t enjoys the SURE screening property with probability tending to 1.

Corollary 4.1 implies that φ^{t.STS}_{nJ} converges to its population counterpart at the same rate as φ^t_{nJ}. Moreover, STS-t enjoys the SURE screening property even when the dimension of each mode of X grows at an exponential rate of n. Thus, STS-t is suitable for very high-dimensional tensor datasets.

The STS-GLM screening
For STS-GLM, we study the SURE screening property for the two most important generalized linear models: the linear regression model and the logistic regression model. We use the following condition.
Condition (T4). Both X and Y satisfy the sub-exponential tail probability uniformly. That is, there exists a positive constant s_0 such that, for all 0 ≤ s ≤ 2s_0, E{exp(sX_J^2)} and E{exp(sY^2)} are bounded uniformly over J.

Condition (T4) is a mild condition that implies Condition (T2); a detailed proof can be found in Appendix A.4. Similar conditions have been used in [13] to guarantee the SURE screening property of MMLE for vector data. Under Condition (T4), we establish the SURE screening property of STS-GLM using Theorem 4.1.

Corollary 4.2.
Suppose that Condition (T4) holds. We have the following conclusions.
Corollary 4.2 suggests that STS-GLM can also handle tensor data with very high dimensionality along each mode.

The STS-DC screening
To study the theoretical properties of STS-DC, we again consider Condition (T4). Condition (T4) was used in [25] to guarantee the SURE screening property of the distance correlation screening method for vector data. In Appendix A.4, we show that Condition (T4) implies Condition (T2). Therefore, we can establish the SURE screening property of STS-DC in the following corollary.

Corollary 4.3.
Suppose that Condition (T4) holds. We have the following conclusions.
(i) There exists ε_0 > 0 such that, for any 0 < ε < ε_0 and any 0 < v < 1/2, the conclusions of Theorem 4.1 hold for STS-DC.

By Corollaries 4.1, 4.2 and 4.3, all three STS screening methods enjoy the SURE screening property when the dimension of each mode of the tensor grows at an exponential rate of the sample size. We also note that their convergence rates are identical to those of their marginal counterparts. Hence, the STS procedure preserves the nice theoretical properties of the marginal methods while taking advantage of the tensor structure.

Structure of the screened data
As suggested by a referee, we further examine the impact of screening on modeling. We are interested in whether the screened data follow the same type of model as the original data. For this study, although screening can be performed in a model-free fashion, we consider two popular tensor models that could be fitted after screening: the tensor discriminant analysis (TDA) model [42] and the generalized linear tensor regression model [59].
There are two possible outcomes of STS. On one hand, if we simply apply STS, the set of retained predictors may no longer form a tensor and cannot be modeled by tensor models. On the other hand, if we combine STS with the augmentation introduced in Section 3.3, we end up with a sub-tensor of X. For simplicity, we pay more attention to the latter case, as it is more relevant for tensor data analysis. The former case is briefly discussed afterwards.
Let V denote a subset of indices such that X_V is also a tensor. In the context of screening, V could be the target set for STS with augmentation.

For the TDA model, Y ∈ {1, . . . , K} is the class label. We assume that X | (Y = k) ∼ TN(μ_k, Σ_1, . . . , Σ_R), where 0 < π_k < 1 with Σ_k π_k = 1 are the prior probabilities for Class k, μ_k ∈ R^{p_1×···×p_R} is the within-class mean, and Σ_r, r = 1, . . . , R are positive definite covariance matrices.

K. Min and Q. Mai
Note that STS-t, STS-GLM and STS-DC are all applicable under this model. For example, STS-t is suitable because the TDA model satisfies Condition (T3), which eventually leads to the SURE screening property. To see why Condition (T3) holds, we note that the TDA model can be equivalently written as (3.4) with the additional assumption that ε_k ∼ TN(0, Σ_1, . . . , Σ_R). It follows that, for each J, ε_{kJ} ∼ N(0, σ^2_{kJ}) with σ^2_{kJ} = ∏_{r=1}^R σ_{r,j_r j_r}. Hence, ε_{kJ} is sub-Gaussian with variance proxy σ^2_{kJ}, as required by Condition (T3). Under the TDA model, we have the following lemma for the sub-tensor X_V.
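The tensor normal distribution used here can be simulated by scaling an i.i.d. standard normal tensor along each mode, via the representation X = μ + Z ×_1 Σ_1^{1/2} ×_2 ··· ×_R Σ_R^{1/2} (Z with i.i.d. N(0, 1) entries), which is the same construction used in the proof of Proposition A.8. A minimal numpy sketch (function names are ours):

```python
import numpy as np

def mode_product(T, M, axis):
    """Mode-r product T ×_r M: multiply the matrix M along the given axis of T."""
    out = np.tensordot(M, T, axes=(1, axis))  # contracted axis becomes axis 0
    return np.moveaxis(out, 0, axis)

def rtensor_normal(mu, sigmas, rng):
    """One draw from TN(mu, Sigma_1, ..., Sigma_R)."""
    X = rng.standard_normal(mu.shape)
    for r, S in enumerate(sigmas):
        X = mode_product(X, np.linalg.cholesky(S), r)
    return mu + X
```

Under this construction, the variance of entry (j_1, . . . , j_R) is the product of the corresponding diagonal entries of the Σ_r, matching the expression for σ^2_{kJ} above.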

Lemma 4.2. Under the TDA model, for any index set V = V_1 × · · · × V_R, we have X_V | (Y = k) ∼ TN(μ_{k,V}, Σ_{1,V_1}, . . . , Σ_{R,V_R}), where Σ_{r,V_r} is the sub-matrix of Σ_r containing the elements with indices in V_r × V_r.

Lemma 4.2 implies that any sub-tensor X_V preserves the TDA model. Therefore, if (X, Y) follows the TDA model, we can still fit a TDA model after STS with data augmentation.
Meanwhile, the generalized linear tensor regression model considers a univariate response Y that could be continuous or discrete. It assumes that g(μ) = β_0 + ⟨B, X⟩, where g(·) is the link function, μ = E(Y | X), β_0 is the intercept, and B has rank H. The inner product between two tensors is defined as ⟨B, X⟩ = Σ_J B_J X_J.
We also refer to this model as the rank-H generalized linear tensor regression model to highlight the low-rank structure. Recall that we assume B is sparse and is only nonzero over a set D. We have the following lemma for X_V.
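The low-rank structure and the tensor inner product above are easy to make concrete. The sketch below (helper names are ours) builds a rank-H coefficient tensor from CP factors a^1_h ∘ ··· ∘ a^R_h and evaluates ⟨B, X⟩ = Σ_J B_J X_J, so that the linear predictor of the model is β_0 + ⟨B, X⟩ for a chosen link:

```python
import numpy as np

def cp_tensor(factors):
    """B = sum_h a^1_h ∘ ... ∘ a^R_h from factor matrices of shape (p_r, H)."""
    H = factors[0].shape[1]
    B = 0.0
    for h in range(H):
        comp = factors[0][:, h]
        for A in factors[1:]:
            comp = np.multiply.outer(comp, A[:, h])  # outer product, one mode at a time
        B = B + comp
    return B

def inner(B, X):
    """Tensor inner product <B, X> = sum_J B_J X_J."""
    return float(np.sum(B * X))
```

For H = 1 and R = 2 this reduces to a rank-one matrix `np.outer(a, b)`, which is a quick way to sanity-check the construction.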

Lemma 4.3. When D ⊆ V, we have g(μ) = β_0 + ⟨B_V, X_V⟩, where B_V also has rank H.

Lemma 4.3 indicates that, if a sub-tensor X_V contains all the important elements, then it is connected to Y through the rank-H generalized linear tensor regression model. In our context, take V to be the target set of STS with data augmentation. If D ⊆ V, the rank-H generalized linear tensor regression model is preserved after STS with data augmentation. The assumption D ⊆ V is reasonable because we expect STS to enjoy the SURE screening property when we apply it.
Finally, we point out that STS alone (without data augmentation) also preserves some important properties of the original model. For example, if (X, Y) follows the TDA model, then it can be shown that the screened data follow the linear discriminant analysis model, which is the counterpart of the TDA model for vector data. Similarly, if (X, Y) follows the generalized linear tensor regression model, the screened data follow a generalized linear model as long as all the important variables are kept.

Simulations
We present the numerical performance of STS. In all simulations, we generate independent training, validation, and testing sets. The training set and validation set contain n observations to be specified, while the testing set contains 10000 observations. All matrix predictors (R = 2) have dimension 64 × 64. Three-way tensor predictors (R = 3) have dimension 30 × 36 × 30.
We study classification models and regression models under both smooth and non-smooth settings. We first introduce the classification models with smoothness. In each model, there are 2 classes and 75 observations within each class. The predictor X is generated from the tensor discriminant model with K = 2 and π_k = pr(Y = k), k = 1, 2. By [42], the best classifier under this model depends on the coefficient tensor B. Therefore, to ensure sparsity on the population level, we specify a sparse B, along with Σ_1, . . . , Σ_R. Then the mean tensors are set to be μ_1 = 0 and μ_2 = ⟦B; Σ_1, . . . , Σ_R⟧. In our model settings we only specify the elements of B in the active set D; it is always assumed that b_J = 0 for any J ∈ D^c. For an index set A, we use the notation B_A = a to denote that b_J = a for any J ∈ A, where a is a number. For ease of presentation, we define two sets of integers, L_1 = 31:34 and L_2 = 21:23. For a covariance matrix Σ, we use the following notations. Let Σ = AR(ρ) denote that Σ is autoregressive, i.e., σ_ij = ρ^{|i−j|}. Let Σ = CS(ρ) denote that Σ has the compound symmetry structure, i.e., σ_ij = ρ for i ≠ j and σ_ii = 1. In all simulations we consider π_1 = π_2 = 1/2. We consider three smooth classification models.

For regression models, we set n = 200 and define B as the regression coefficient tensor with active set D. We consider four models. In particular, D_1 = {(i, j) : i, j ∈ L_1}, and D_2 is a triangular area with 9 variables whose three vertexes are located at (50, 50), (52, 48) and (52, 52). To study the robustness of our proposed method, we also include four models with model misspecification.

For classification models (Models 1-3, 8-9), we consider STS-t, STS-GLM and STS-DC. For regression models (Models 4-7, 10-11), we use STS-GLM and STS-DC. We choose c from {ω/2, ω}, that is, {4, 8} for matrix models and {13, 26} for three-way tensor models. We compare these methods with the corresponding marginal screening methods.
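The AR(ρ) and CS(ρ) covariance structures can be constructed directly from their definitions; a small numpy sketch (function names are ours):

```python
import numpy as np

def ar_cov(p, rho):
    """Autoregressive covariance: sigma_ij = rho^|i-j|."""
    idx = np.arange(p)
    return rho ** np.abs(idx[:, None] - idx[None, :])

def cs_cov(p, rho):
    """Compound symmetry: sigma_ij = rho for i != j, 1 on the diagonal."""
    return np.full((p, p), rho) + (1.0 - rho) * np.eye(p)
```

For example, `ar_cov(64, 0.5)` and `cs_cov(64, 0.3)` give the two covariance types used along each mode of the matrix predictors.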
The minimum numbers of variables needed to recover all active predictors are reported in Table 1. The reported numbers are medians over 500 replicates, with standard errors computed via bootstrap. A number closer to |D| indicates a better screening technique. It can be seen that the STS methods are uniformly more efficient than their marginal counterparts in identifying the important variables in Models 1-7. This demonstrates the benefits of leveraging the tensor structure for better screening when the smoothness structure is present. In these models, often both c = ω/2 and c = ω give good results. For Models 8 & 10, where there is no smoothness structure at all, STS is worse than marginal screening due to model misspecification, although c = ω/2 is significantly better than c = ω. Thus, when the smoothness assumption does not hold at all, STS is not recommended. However, for Models 9 & 11, where there is a partial smoothness structure along one direction, STS again outperforms marginal screening with both choices of c, but c = ω/2 is better than c = ω. Therefore, if it is believed that some level of smoothness exists, c = ω/2 is a robust choice in STS.
To further demonstrate that STS can improve analysis accuracy, we apply different classification and regression methods to the screened data. For all models, if screening is applied, we let d_n = ⌊n/log n⌋. For classification, we include two tensor-based methods, covariate-adjusted tensor classification (CATCH, [42]) and tensor regression based on the CP decomposition (CP-GLM, [59]), and two vector-based methods, the ℓ_1-penalized generalized linear model (ℓ_1-GLM, [14]) and ℓ_1-penalized Fisher's discriminant analysis (ℓ_1-FDA, [54]). For regression, we include CP-GLM and ℓ_1-GLM. For the implementation of CP-GLM, we use the toolbox from https://hua-zhou.github.io/TensorReg. STS based on the three statistical utilities gives similar patterns of performance. For the sake of space, here we only present the results using regression-based screening, that is, logistic regression for Models 1-3 & 8-9, linear regression for Models 4-6 & 10-11, and Poisson regression for Model 7. The simulation results using the t-statistic and distance correlation for classification models can be found in Appendix B.
We report the model fitting results of 100 replicates in Table 2. For classification models, we report the classification errors and their standard errors, while for regression models we report the root mean square error (RMSE) on the testing set, defined as RMSE = {Σ_i (Ŷ_i − Y_i)^2 / n_test}^{1/2}. We further report the true positive rate (TPR) and the false positive rate (FPR) for classification models in Table 3. Let D̂ be the set of active predictors in the final model. The TPR and FPR are defined as TPR = |D̂ ∩ D|/|D| and FPR = |D̂ ∩ D^c|/|D^c|.

According to Table 2, the comparison among different c and the original data suggests that smoothing the screening statistics is critical to tensor screening when the smoothness structure exists (i.e., Models 1-7). Even though the classification errors and RMSE are reduced for some methods with marginal screening, the minimum error is obtained at c = ω/2 or ω, and the decrease is significant. In Table 3, STS successfully increases the true positive rate while reducing the false positive rate: higher TPR and lower FPR are achieved at c = ω/2 or ω compared to c = 0. This coincides with the trend in classification errors, which further shows the importance of smoothing the screening statistics. Moreover, in several cases, such as CP-GLM in Models 3, 5 & 6, the standard errors are lowered after applying screening with or without STS. This indicates that screening tends to make the analysis more stable. For the non-smooth models (Models 8 & 10), STS has lower accuracy and worse variable selection results than marginal screening, which is a consequence of model misspecification. However, STS is again better than marginal screening combined with most model-fitting methods when partial smoothness exists (Models 9 & 11).
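The TPR and FPR above compare the fitted active set D̂ with the true set D over all p variables; a direct implementation (the function name is ours):

```python
def tpr_fpr(selected, active, p):
    """TPR = |D_hat ∩ D| / |D|;  FPR = |D_hat ∩ D^c| / |D^c|."""
    sel, act = set(selected), set(active)
    tpr = len(sel & act) / len(act)
    fpr = len(sel - act) / (p - len(act))
    return tpr, fpr
```

A perfect selection gives TPR = 1 and FPR = 0; screening aims to guarantee the former while tolerating a modest FPR.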

Real data analysis
Electroencephalography (EEG) is used to record the electrical activity of human brains and to detect the associated brain disorders. The dataset we use arises from a study that analyzes the relationship between EEG and alcoholism, available at http://kdd.ics.uci.edu/databases/eeg/eeg.data.html. There are 122 subjects, including 77 alcoholic individuals and 45 nonalcoholic individuals. Each subject is exposed to either a single stimulus or to two stimuli, which are pictures chosen from [47]. For two stimuli, the two pictures can be either identical or different. Each subject completes 120 trials under different stimuli. In all trials, 64 electrodes placed on the subject's scalp are sampled at 256 Hz (3.9-msec epoch) for 1 second, and the voltage fluctuations are collected. More information about the collection process can be found in [58]. The same dataset is also analyzed in [23], where only the single-stimulus condition is considered and the trials under that condition are averaged. We use the same part of the data. Thus, each predictor is an EEG image of size 256 × 64, and the binary response indicates whether the subject is alcoholic or nonalcoholic.
Given the large number of predictors, we assume a sparse model for the prediction of the alcoholic status. For the predictor X ∈ R 256×64 , we assume that there exist a few entries X ij that are responsible for the prediction. In the context of the EEG data, we assume that only the measurements of a small number of electrodes at a few time points are helpful for predicting the alcoholic status. We further assume that the sparsity pattern is smooth, in that X ij 's close to each other tend to be important or unimportant at the same time.
The smoothness of EEG data results from two aspects. First, for each row of an EEG observation, the numbers are the voltages collected over a continuous time period. The voltage will fluctuate when the brain responds to the stimulus and then decay gradually in time. Second, each column of the image represents the voltages collected from 64 different electrodes. Most of these electrodes are entered into the dataset in a way such that two adjacent columns correspond to electrodes positioned symmetrically on the left and right sides of the brain. For example, Column 1 contains measurements from the position FP1, and Column 2 contains those from FP2. FP1 and FP2 are on the same location of the left and right hemispheres [46]. Symmetric locations are functionally similar and thus may resemble each other in the response to stimuli.
Data are standardized before use. We randomly split the data 500 times; in each replicate, the data are split with a 4:1 ratio into a training set of 97 subjects and a testing set of 25 subjects. For all classification methods that involve cross-validation, we set the number of folds to 5. We choose d_n = ⌊n/log(n)⌋ = 21, and the weight c is chosen from the set {0, 4, 8} in screening. We use STS based on three different statistical utilities: the t-statistic, logistic regression, and distance correlation. Classification errors on the original data and the screened data for the three methods are reported in Table 4. The results show that screening can either maintain or lower the error rate for most methods.
Comparing the results at c = ω/2 or ω with the results at c = 0, it can be seen that error rates are further reduced after we add a weight in screening, which supports the application of STS. We also want to mention that, in almost all the analyses, we directly perform screening and/or model fitting on the original dataset of dimension 256 × 64, with the only exception of CP-GLM without screening. CP-GLM requires the sample size to be no smaller than the dimension of X on each mode, so when it is applied to the original dataset, we downsize the predictor to 32 × 32 first and then fit a rank-1 model. This can also be viewed as an advantage of STS. With STS, the dimension of the predictor is greatly reduced so that CP-GLM can be fit without further downsizing, and we observe a uniform decrease in classification errors this way.
To further validate the smoothness assumption among electrodes, we randomly shuffle the columns of the matrix predictors so that it is certain that no smoothness structure exists along this mode. We perform STS and model fitting on the shuffled data. The classification error rates are reported in Table 5. It can be seen that the error rates are generally higher when we change the order of the columns, indicating that the original ordering has some useful information. Moreover, the lowest error rate achieved after shuffling is significantly larger than the error rate without shuffling. Since the lowest error rate is achieved by ℓ_1-GLM in Table 4, we perform a paired t-test for the results before and after shuffling for ℓ_1-GLM at c = ω/2. The p-values under STS-t, STS-GLM and STS-DC are 2.0 × 10^{−12}, 1.2 × 10^{−7} and 9.6 × 10^{−10}, respectively. Thus, analysis on the original dataset is significantly better, which again provides evidence of the smoothness among electrodes. Another interesting fact is that, even without smoothness along the columns, STS continues to give comparable error rates to marginal screening. This is likely because we still have smoothness along the time domain, and, as noted in Section 5.1, STS is capable of exploiting partial smoothness. Moreover, the average computation time to perform screening is reported in Table 6, which confirms that STS is as computationally efficient as the corresponding marginal screening methods.

Conclusion
In this article, we propose STS, a general screening framework for tensors. STS integrates traditional marginal screening methods with tensor structural information. With a wide selection of statistical utilities in screening, STS is not limited to any model setting or data type of the responses and predictors. We establish the SURE screening property for the procedure and give three examples. Moreover, we examine the performance by comparing the classification error and regression accuracy on screened data. STS gives better results than directly applying traditional screening methods to the vectorized tensor in both the simulations and the real data study. In practice, researchers can combine STS with other suitable screening utilities to improve their performance on tensor data, but an exhaustive study along this direction is beyond the scope of this article.

Table 5: The means and standard errors of binary classification error rates on the EEG dataset after shuffling electrodes. We only report the results for STS because the results without screening and those with marginal screening are the same as in Table 4.

If we are fitting a specific kind of tensor model, it is also possible to modify the screening utilities to further take advantage of the tensor structure (with or without the smoothness assumption). For example, under the TDA model, the variance σ^2_{kJ} can be estimated much more accurately [33,42,40] than by the sample estimate. The improved estimation could benefit screening. We tested this idea on our simulation models, and the results are comparable to those of STS, so we do not report them. However, it is still likely that screening utilities calculated by exploiting the tensor structure will be helpful in other settings or under other models. It is a topic worth exploring in future research.
We focus on the problem where the predictor is a tensor but the response is a scalar. There is considerable interest in the literature in problems where the response is a tensor as well [e.g., 20, 31, 29, 15, 26, 30, 1, 27]. These papers generally study fitting tensor-on-tensor regression models. It would be interesting to develop screening methods under such models as well, but such developments are beyond the scope of this paper for two reasons. On one hand, it is unclear how to generalize the smoothness assumption to tensor-on-tensor problems; there are several possibilities that could be explored, such as smoothness in the response alone, in the predictor alone, or in both. On the other hand, it is critical for screening methods to enjoy the SURE screening property under ultra-high dimensions, but the aforementioned tensor-on-tensor works either have few results on statistical properties or only have results for low-dimensional problems. Hence, full development of the theory could be challenging for screening methods in the same context. A relevant screening method is proposed in [22], where the response is a matrix (a two-way tensor) and the predictor is a vector. Our problem is different in that our predictor, instead of the response, is a tensor of arbitrary order. Nevertheless, it will be intriguing to investigate in the future whether some of their results can assist in tensor-on-tensor screening.
Next, we derive the upper bound for |S_1|. Suppose that B_j ∈ R^{d_1×···×d_R}; then |B_j| = d_1 × · · · × d_R. We use B̄_j to denote the set containing all elements in B_j and their neighbors. By definition, we have |B̄_j| = (d_1 + 2) × · · · × (d_R + 2). For fixed |B_j|, when d_k = |B_j| for some k and d_i = 1 for all i ≠ k, |B̄_j| reaches its maximum 3^{R−1}(|B_j| + 2). Therefore, |B̄_j| ≤ 3^{R−1}(|B_j| + 2).
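The counting argument above can be checked numerically for small blocks: a d_1×···×d_R block together with its one-cell border has ∏(d_r + 2) elements, which never exceeds 3^{R−1}(∏ d_r + 2), with equality when all but one d_r equal 1 (the maximizing configuration in the proof). A quick exhaustive check:

```python
import itertools
import math

def expanded_size(dims):
    """|B_bar_j| = prod (d_r + 2): the block plus its one-cell border."""
    return math.prod(d + 2 for d in dims)

def upper_bound(dims):
    """The bound 3^{R-1} (|B_j| + 2) from the proof."""
    R = len(dims)
    return 3 ** (R - 1) * (math.prod(dims) + 2)

# exhaustively verify the bound over all small blocks with up to 3 modes
for R in (1, 2, 3):
    for dims in itertools.product(range(1, 7), repeat=R):
        assert expanded_size(dims) <= upper_bound(dims)
```

For instance, `dims = (1, 1, 6)` attains equality: both sides equal 72.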

A.3. Proof of Corollary 4.1
The proof of Corollary 4.1 relies on the following proposition. We first prove Proposition A.1 and then prove Corollary 4.1.
The proof of Proposition A.1 is straightforward, but we include it for completeness. Before proving it, we state several propositions that are used in our proofs; they are extracted from [52].
The screening utility is the two-sample t-statistic, φ^t_{nJ} = |x̄_{1J} − x̄_{2J}| / (σ̂^2_{1J}/n_1 + σ̂^2_{2J}/n_2)^{1/2}, where σ̂^2_{kJ} is the sample variance of variable J within class k. Recall the screening utility and its population counterpart defined in (3.5) & (3.6); we further define an intermediate quantity between them. To make the proof easier to read, we introduce shorthand notation for a given J. For Condition (T3), we use positive constants a, σ^2_{1,min}, σ^2_{1,max}, σ^2_{2,min} and σ^2_{2,max} to denote the bounds. In the proof, we use γ_i, i = 1, 2, . . . to denote positive constants whose values can differ each time they appear.
Proof of Proposition A.1. We bound the probabilities L_1 and L_2 separately. For L_1, we first give an upper bound for |a_{nJ} − a_J|, where we use Proposition A.2 to obtain (A.4) from (A.3).
As n_1 and n_2 are sums of n independent and identically distributed Bernoulli random variables, Hoeffding's inequality yields (A.5). Next, we give an upper bound for |b_{nJ} − b_J|, which follows from the definition.

Proposition A.6 is a straightforward extension of the main result of Theorem 4 in [13] to the tensor case. Proposition A.7 is a straightforward extension of Theorem 1 in [25] to the tensor case. The original theorems were developed for vectors; we simply rewrite them in tensor notation. Moreover, we replace n^{−κ} with ε in the probability bounds. By Propositions A.6 and A.7, Condition (T4) implies Condition (T2).

A.5. Proof of Lemmas 4.2 and 4.3
We first present the following proposition and its proof. It is used to prove Lemma 4.2.
Proposition A.8. Let X ∈ R^{p_1×···×p_R} be an R-way random tensor following a tensor normal distribution, X ∼ TN(μ, Σ_1, . . . , Σ_R). Let M ∈ R^{d_1×···×d_R} be a constant tensor and U_r ∈ R^{d_r×p_r}, r = 1, . . . , R be matrices with full row rank. Then M + X ×_1 U_1 ×_2 · · · ×_R U_R ∼ TN(M + μ ×_1 U_1 ×_2 · · · ×_R U_R, U_1Σ_1U_1^T, . . . , U_RΣ_RU_R^T).

Proof of Proposition A.8. By definition, there exists a random tensor Z such that Z ∼ TN(0, I_1, . . . , I_R) and X = μ + Z ×_1 Σ_1^{1/2} ×_2 · · · ×_R Σ_R^{1/2}, where we use the property that X ×_r A ×_r B = X ×_r (BA) from [21].

Proof of Lemma 4.2. By construction, X_V is a sub-tensor of X. Therefore, there exist matrices G_1, . . . , G_R such that X_V = X ×_1 G_1 ×_2 · · · ×_R G_R, where G_r ∈ R^{l_r×p_r} and l_r = |V_r|. Specifically, the ith row of G_r is the V_{r,i}th standard basis vector e_{V_{r,i}} of R^{p_r}, where V_{r,i} is the ith element of the set V_r. Therefore, G_r has full row rank, and Proposition A.8 applies to X_V. Similar to the construction of X_V, we have μ_k ×_1 G_1 · · · ×_R G_R = μ_{k,V}. Moreover, the (i, j)th element of the matrix G_rΣ_rG_r^T can be written as e_{V_{r,i}}^T Σ_r e_{V_{r,j}}, which is exactly the (i, j)th element of the matrix Σ_{r,V_r}. Therefore, G_rΣ_rG_r^T = Σ_{r,V_r}, and we complete the proof of Lemma 4.2.

Proof of Lemma 4.3. In the rank-H generalized linear tensor regression model [59], we have g(μ) = β_0 + ⟨B, X⟩, where g(·) is the link function, μ = E(Y | X) and β_0 is the intercept. In addition, the coefficient tensor B ∈ R^{p_1×···×p_R} can be decomposed as B = Σ_{h=1}^H a^1_h ∘ · · · ∘ a^R_h, where a^r_{i_r h} is the i_rth element of a^r_h. Since V = V_1 × · · · × V_R, we let [a^r_h]_{V_r} denote the sub-vector of a^r_h restricted to the index set V_r. Similar to the construction of X_V, we define B_V. It is easy to see that B_V = Σ_{h=1}^H [a^1_h]_{V_1} ∘ · · · ∘ [a^R_h]_{V_R}. Therefore, B_V is also a tensor of rank H. When D ⊆ V, B_J = 0 for J ∉ V, so the tensor regression model (A.14) can be rewritten as g(μ) = β_0 + ⟨B_V, X_V⟩. Thus, we can fit a rank-H generalized linear tensor regression model on X_V and Y.

Appendix B: Additional simulation results
We show additional simulation results for Models 1-3 and 8-9 using the t-statistic and distance correlation in the screening procedure. Tables 7 and 8 report the classification errors on the original data and on the screened data with different screening weights. The results at c = ω/2 and ω are generally similar, and they are better than the results at c = 0. This indicates that STS outperforms the marginal screening methods and is not overly sensitive to the choice of c.

Appendix C: Effect of the neighborhood size
Define the neighborhood size as the number of elements in a neighborhood along each mode. The default STS proposed in Section 3.1 has a neighborhood size of 3. To study the effect of the neighborhood size in STS, we let it vary from 3 to 9. For all the models, we select the first d_n = ⌊n/log n⌋ predictors and compare the true positive rates (TPR) of the screened data. We conducted our simulations with all methods on all models considered in Section 5.1. The pattern is similar across all models. For the sake of space, we only present the results for STS-GLM on two models, the smooth Model 1 and the non-smooth Model 8. The TPRs are plotted in Figure 5. Results show that, in the smooth Model 1, sizes of 5 and 7 give the best results, but a size of 3 is only slightly worse. If we take a large neighborhood size of 9, there is a notable decrease in TPR. On the other hand, in the non-smooth Model 8, a size of 3 with c = ω/2 gives a reasonable TPR close to 80%, but larger neighborhoods have drastically worse performance. Hence, unless there is a very strong belief in smoothness, a neighborhood size of 3 should be preferred.

Appendix D: Comparison with Gaussian filter
In this section, we compare STS with the Gaussian filter, an image processing technique that blurs an image with the Gaussian function [16]. The Gaussian filter is usually used in a somewhat different context from ours: it is often applied to two-way or three-way tensors, and it smoothes images instead of screening utilities. However, it is straightforward to generalize it to our problem of interest. For ease of presentation, we directly present a definition for R-way tensor screening. The Gaussian filter function for the R-dimensional space is defined as

H(u_1, . . . , u_R) = exp{−(u_1^2 + · · · + u_R^2)/(2η^2)},

where u_r is the distance from the origin in the rth mode and η is the standard deviation. Let d_nb be an odd number that denotes the size of the filter, and let D_nb = {−(d_nb − 1)/2, . . . , (d_nb − 1)/2}. Evaluating H over D_nb^R gives an R-way filter tensor, also referred to as the Gaussian kernel. After obtaining the kernel, we convolve it with the screening utility tensor. Let Φ ∈ R^{p_1×···×p_R} denote the screening utility tensor, where Φ_{j_1,...,j_R} is the screening statistic corresponding to the predictor X_{j_1,...,j_R}. The filtered screening utility tensor Φ^GF is obtained by

Φ^GF_{j_1,...,j_R} = Σ_{u_1,...,u_R ∈ D_nb} H(u_1, . . . , u_R) Φ_{j_1+u_1,...,j_R+u_R}. (D.1)

We pad the tensor on the boundary if j_r + u_r exceeds the index range, and then use Φ^GF_{j_1,...,j_R} for screening. We performed screening with the Gaussian filter combined with the three screening utilities on all the models in Section 5.1. For all models, we select the first d_n = ⌊n/log n⌋ predictors and calculate the true positive rate (TPR). We consider two filter sizes, 3 and 5, and a range of η ∈ [0.2, 2]. For the sake of space, only results for STS-GLM in the classification Model 1 and the regression Model 4 are shown in Figure 6; other models exhibit the same pattern. Figure 6 shows that when the parameters in the Gaussian filter are chosen appropriately, it performs similarly to STS with c = ω/2. It is particularly important to choose η in a reasonable range, but the screening results are not sensitive within this range. Note that the larger η is, the flatter the Gaussian filter, and thus the smoother the screening result. Figure 6 confirms that η should be somewhat large so that smoothness is encouraged to achieve better screening results.
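A minimal sketch of this filtering step (helper names are ours; we additionally normalize the kernel to sum to one, which rescales (D.1) by a constant and leaves the screening ranking unchanged):

```python
import numpy as np

def gaussian_kernel(d_nb, R, eta):
    """R-way Gaussian kernel of (odd) size d_nb per mode, bandwidth eta."""
    half = d_nb // 2
    ax = np.arange(-half, half + 1)
    grids = np.meshgrid(*([ax] * R), indexing="ij")
    H = np.exp(-sum(g ** 2 for g in grids) / (2.0 * eta ** 2))
    return H / H.sum()  # normalization: a constant utility field is unchanged

def filter_utility(phi, d_nb, eta):
    """Convolve the utility tensor with the kernel, edge-padding boundaries."""
    H = gaussian_kernel(d_nb, phi.ndim, eta)
    half = d_nb // 2
    padded = np.pad(phi, half, mode="edge")  # boundary padding as in (D.1)
    out = np.empty(phi.shape)
    for J in np.ndindex(phi.shape):
        window = padded[tuple(slice(j, j + d_nb) for j in J)]
        out[J] = np.sum(H * window)
    return out
```

As η grows, the kernel flattens toward a uniform average over the window, which matches the observation above that larger η gives smoother screening results.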
Given the similar performance, we recommend STS in practice unless there is strong prior knowledge that the Gaussian filter should be preferred. In general, the weighted average in STS is easier to interpret, and the role of c in STS is more easily understood than the parameter η by researchers with relatively little background in statistics.