Lazy FSCA for Unsupervised Variable Selection

Various unsupervised greedy selection methods have been proposed as computationally tractable approximations to the NP-hard subset selection problem. These methods rely on sequentially selecting the variables that best improve performance with respect to a selection criterion. Theoretical results exist that provide performance bounds and enable efficient "lazy greedy" implementations for selection criteria that satisfy a diminishing returns property known as submodularity. This has motivated the development of variable selection algorithms based on mutual information and frame potential. Recently, the authors introduced Forward Selection Component Analysis (FSCA), which uses variance explained as its selection criterion. While this criterion is not submodular, FSCA has been shown to be highly effective for applications such as measurement plan optimisation. In this paper a "lazy" implementation of the FSCA algorithm (L-FSCA) is proposed, which, although not equivalent to FSCA due to the absence of submodularity, has the potential to yield comparable performance while being up to an order of magnitude faster to compute. The efficacy of L-FSCA is demonstrated by performing a systematic comparison with FSCA and five other unsupervised variable selection methods from the literature using simulated and real-world case studies. Experimental results confirm that L-FSCA yields almost identical performance to FSCA while reducing computation time by between 22% and 94% for the case studies considered.


Introduction
When dealing with high dimensional datasets, unsupervised dimensionality reduction is often performed as a pre-processing step to achieve efficient storage of the data and robustness of machine learning algorithms. The underlying assumption is that such datasets have high levels of correlation among variables and hence redundancy that can be exploited. Principal Component Analysis (PCA) is the most popular linear technique for unsupervised dimensionality reduction (Jolliffe, 1986; Van Der Maaten et al., 2009), consisting of the transformation of variables into a set of orthogonal components that correspond to the directions of maximum variance in the data.
In some applications it is desirable to achieve unsupervised dimensionality reduction through the selection of a subset of initial variables that best represent the information contained in the full dataset, for example in sensor selection problems (Krause et al., 2008; McLoone et al., 2018). Working with a subset of the original variables also facilitates more transparent and interpretable models than those based on principal components (Flynn and McLoone, 2011). The unsupervised variable selection problem may be formulated as follows. Given a dataset X ∈ R^{m×v} containing m measurements of v variables, and the index set for the variables (i.e. columns of X) I_X = {1, 2, . . . , v − 1, v}, and defining I_S and I_U as the indices of the selected and unselected variables (I_S ∪ I_U = I_X), respectively, we wish to solve

$$I_S^* = \underset{I_S \subseteq I_X,\; |I_S| = k}{\operatorname{argmax}} \; g(I_S) \qquad (1)$$

where I_S^* denotes the optimum subset, k is a cardinality constraint on I_S and g(·): 2^v → R is a performance metric (set function) which measures the suitability of the selected variables. An alternative formulation of the variable selection problem places a constraint τ on the target value for the performance metric, that is:

$$I_S^* = \underset{I_S \subseteq I_X}{\operatorname{argmin}} \; |I_S| \quad \text{subject to} \quad g(I_S) \geq \tau \qquad (2)$$

Finding the optimal subset of variables from a set of candidate variables is an NP-hard combinatorial problem and is therefore intractable in practice even for relatively low dimension problems. Consequently, developing computationally efficient techniques that approximate the optimum solution has been the focus of research for many years. Existing techniques can be split into four categories as depicted in Fig. 1. Forward greedy search methods, which are the focus of this study, estimate the optimum subset of variables by recursively adding them one at a time, such that the variable added at each iteration is the one that gives the optimal improvement in the chosen performance metric g(·). The general structure of a forward greedy selection algorithm for Eq. (1) is as defined in Algorithm 1.
The corresponding greedy algorithm for solving Eq.
(2) can be obtained by replacing the while loop condition in line 2 with g(I S ) < τ .

Algorithm 1 Forward greedy variable selection
Input: X, k
1: I_S ← ∅, I_U ← I_X
2: while |I_S| < k do
3:   i* ← argmax_{i ∈ I_U} g(I_S ∪ {i})
4:   I_S ← I_S ∪ {i*}; I_U ← I_U \ {i*}
5: end while
6: return I_S

In general the greedy solution carries no optimality guarantee. However, if g(·) is a monotone increasing, submodular 1 metric and g(∅) = 0, then the greedy search solution g(I_S) is guaranteed to be within 63% of the optimal solution and a more efficient implementation of the greedy search algorithm, referred to as the lazy greedy implementation (Minoux, 1978), is possible.
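The companion code for the paper is written in MATLAB; purely as an illustration, the structure of Algorithm 1 can be sketched in Python/NumPy as follows, using variance explained as an example selection metric (function names are ours, not from the paper's code base):

```python
import numpy as np

def variance_explained(X, idx):
    """Percentage of ||X||_F^2 captured by projecting X onto X[:, idx]."""
    if not idx:
        return 0.0
    Xs = X[:, idx]
    # least-squares projection of every column of X onto span(Xs)
    Xhat = Xs @ np.linalg.lstsq(Xs, X, rcond=None)[0]
    return 100.0 * (1.0 - np.linalg.norm(X - Xhat) ** 2
                    / np.linalg.norm(X) ** 2)

def greedy_select(X, k, gain):
    """Algorithm 1 sketch: add the variable with the best gain each step."""
    selected = []
    unselected = list(range(X.shape[1]))
    while len(selected) < k:
        # evaluate g(I_S ∪ {i}) for every unselected candidate
        scores = [gain(X, selected + [i]) for i in unselected]
        best = unselected[int(np.argmax(scores))]
        selected.append(best)
        unselected.remove(best)
    return selected
```

Replacing the while-loop condition with a check on `gain` against a target τ gives the variant for Eq. (2).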
Several greedy search based unsupervised variable selection algorithms have been proposed in the literature using a number of different performance metrics. Variance explained (VE) was employed in the forward selection component analysis (FSCA) algorithm introduced in Puggini and McLoone (2017) and Prakash et al. (2012), while the squared multiple correlation was employed by Wei and Billings (2007) in their forward orthogonal selection maximising overall dependency (FOS-MOD) algorithm. VE also underpins the orthogonal principal feature selection (PFS) algorithm proposed in Cui and Dy (2008). In PFS variables are selected based on their correlation with the first principal component of the subspace orthogonal to that spanned by the currently selected variables. In contrast, the unsupervised forward selection (UFS) algorithm by Whitley et al. (2000) selects the variables having the lowest squared multiple correlation with the subspace spanned by the currently selected variables. While these metrics are natural choices for unsupervised variable selection given their close relationship with linear regression and PCA, they are not submodular functions. Motivated by the optimality bound and more efficient algorithm implementation that follow from submodularity, Krause et al. (2008) used mutual information (MI), Ranieri et al. (2014) used frame potential (FP) and Joshi and Boyd (2008) and Rao et al. (2015) used log-det as submodular metrics to design near-optimal greedy sensor selection algorithms.
To date the literature lacks a comparison of the main unsupervised greedy selection algorithms. There is also a lack of awareness of the benefit of using a lazy greedy implementation even when submodularity does not apply. While VE does not satisfy the conditions for submodularity, and therefore does not enjoy convenient-to-compute theoretical bounds on its performance, practical experience shows that greedy search algorithms based on VE perform very well.

1 Formally defined in Section 2.
This may mean that for many problems VE is close to being submodular, and indeed, as observed by Das and Kempe (2008), it is provably so in certain cases.
Motivated by these observations, this paper makes the following novel contributions.
• A lazy greedy implementation of FSCA (denoted L-FSCA) is proposed that is able to achieve comparable performance to FSCA while being up to an order of magnitude faster to compute.
• L-FSCA and the main unsupervised greedy variable selection algorithms proposed in the literature are formalised within a common framework and compared with respect to Big-O computational complexity.
• The algorithms are then evaluated experimentally with respect to the aforementioned performance metrics (VE, MI and FP) on a diverse set of simulated and real world benchmark datasets. While demonstrating the efficacy of L-FSCA, this also serves as the first systematic comparison of unsupervised variable selection algorithms in the literature.
In addition, to facilitate benchmarking against other unsupervised variable selection algorithms in the future, the code for the algorithms and experimental studies presented in the paper has been made available on GitHub 2.
The remainder of the paper is organised as follows: Section 2 provides a summary of the main theoretical results on performance guarantees that apply to the unsupervised variable selection problem at hand, and introduces the lazy greedy algorithm that can be exploited when selection metrics are submodular.
Section 3 then briefly introduces the unsupervised variable selection algorithms and performance metrics under investigation and provides an analysis of their computational complexity. The performance comparison on simulated and real datasets is presented in Section 4 and conclusions are provided in Section 5.

2 https://github.com/fedezocco/GreedyVarSel-MATLAB
The following notation is adopted: matrices and vectors are indicated with bold capital and lower case letters, respectively, while sets and scalars are denoted by capital and lower case letters, respectively.

Theoretical Guarantees on Performance
Let v, k, I_S and I_S^* denote the total number of candidate variables, the number of selected variables, the index set of the variables selected via greedy search and the corresponding optimal index set, respectively. If the greedy unsupervised variable selection metric g(·) is a monotone increasing, submodular set function and g(∅) = 0, then, according to the seminal result by Nemhauser et al. (1978), the greedy search solution g(I_S) is guaranteed to be within 63% of the optimal solution, or more specifically:

$$g(I_S) \geq \left(1 - \frac{1}{e}\right) g(I_S^*) \triangleq B_N \qquad (3)$$

A set function g(·) is defined as being monotone increasing if ∀A ⊆ B ⊆ X, it holds that g(A) ≤ g(B). It is defined as submodular (Das and Kempe, 2008) if

$$g(A \cup \{x\}) - g(A) \geq g(B \cup \{x\}) - g(B), \quad \forall A \subseteq B \subseteq X,\; x \in X \setminus B \qquad (4)$$

If the condition in Eq. (4) holds with equality then g(·) is a modular function and the greedy solution is guaranteed to be the optimal solution, i.e. g(I_S) = g(I_S^*). In order to improve on B_N and obtain results for more general forms of set function, the concepts of curvature (Conforti and Cornuéjols, 1984; Iyer et al., 2013; Sviridenko et al., 2017) and submodularity ratio (Das and Kempe, 2011, 2018) have been introduced as measures of how far a submodular function is from being modular and how far a non-submodular function is from being submodular, respectively. Bian et al. (2017) recently provided a unified treatment of these concepts and showed that if g(·) is a non-negative non-decreasing set function with curvature α ∈ [0, 1], defined as the smallest scalar such that

$$g(B \cup \{x\}) - g(B) \geq (1 - \alpha)\left[g(A \cup \{x\}) - g(A)\right], \quad \forall A \subseteq B \subseteq X,\; x \in X \setminus B \qquad (5)$$

and submodularity ratio γ ∈ [0, 1], defined as the largest scalar such that

$$\sum_{x \in B \setminus A} \left[g(A \cup \{x\}) - g(A)\right] \geq \gamma \left[g(A \cup B) - g(A)\right], \quad \forall A, B \subseteq X \qquad (6)$$

then the following lower bound applies to the greedy solution

$$g(I_S) \geq \frac{1}{\alpha}\left(1 - e^{-\alpha\gamma}\right) g(I_S^*) \triangleq B_{\alpha\gamma} \qquad (7)$$

When γ = 1, g(·) is submodular. If γ = 1 and α = 1, B_{αγ} yields Nemhauser et al.'s bound (Eq. 3) and if α = 0, g(·) is supermodular and B_{αγ} reduces to

$$g(I_S) \geq \gamma \, g(I_S^*) \qquad (8)$$

Finally, when α = 0 and γ = 1, g(·) is modular and g(I_S) = g(I_S^*). Other definitions of curvature have also been introduced for non-submodular functions and used as a measure of how close a function is to being submodular, such as those by Sviridenko et al. (2017), Wang et al. (2016) and Hashemi et al. (2019), but the resulting bounds are weaker than B_{αγ}.
In general, computing these bounds is as computationally intractable as the original subset selection problem. However, for specific problem formulations bounds can be computed for the curvature and submodularity ratio parameters that are computationally tractable, allowing weaker versions of the performance guarantees to be computed (Bian et al., 2017;Das and Kempe, 2011). Overall these bounds are highly conservative with greedy search algorithms frequently achieving near optimal solutions. As such, their primary value is providing a theoretical foundation for the effectiveness of greedy based search methods.
It should be noted that even when g(·) is not a modular or submodular set function in general it may have these properties for restricted forms of X.
For example, while variance explained (squared multiple correlation) is not a submodular function, Das and Kempe (2008) show that if X does not contain any suppressor variables, then g(·) will be submodular and Nemhauser et al.'s performance bound B_N applies. Similarly, if X is a set of uncorrelated variables, i.e. X is an orthogonal matrix, then g(·) will be modular and the greedy search algorithm yields the optimal solution.
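The modular behaviour for orthogonal X is easy to verify numerically. The following NumPy check (our own illustration, not part of the paper's MATLAB code) confirms that for orthonormal columns the variance explained by a set of variables equals the sum of the variances explained by its members, so greedy selection is optimal:

```python
import numpy as np

rng = np.random.default_rng(0)
# Orthonormalise a random matrix so its columns are uncorrelated
Q, _ = np.linalg.qr(rng.standard_normal((50, 4)))

def ve(X, idx):
    """Variance explained (%) by the columns of X indexed by idx."""
    Xs = X[:, idx]
    Xhat = Xs @ np.linalg.lstsq(Xs, X, rcond=None)[0]
    return 100.0 * (1.0 - np.linalg.norm(X - Xhat) ** 2
                    / np.linalg.norm(X) ** 2)

# Modularity: the gain of a set is the sum of the individual gains
gap = abs(ve(Q, [0, 1]) - (ve(Q, [0]) + ve(Q, [1])))
```

Here `gap` is zero to machine precision; for correlated columns it is generally non-zero.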

Submodularity and the Lazy Greedy Implementation
Submodularity is essentially a diminishing returns property. Defining the marginal gain (improvement in performance) of adding x to I_S at the k-th iteration of the greedy search algorithm as

$$\Delta g_k(x) = g(I_{S_k} \cup \{x\}) - g(I_{S_k}) \qquad (9)$$

where I_{S_k} denotes the index set selected after k iterations, Eq. (4) implies that Δg_j(x) ≤ Δg_k(x) for all j > k, that is, the marginal gain achievable with a given variable decreases (or at best remains the same) the later in the sequence it is added. Consequently, the marginal gain given by an element, i.e. variable, x at the current iteration is an upper bound on its marginal gain at subsequent iterations. This property can be exploited to arrive at a greedy search algorithm for submodular functions that has much lower computational complexity than the conventional greedy search algorithm, as first proposed by Minoux (1978). The algorithm, usually referred to as the lazy greedy search algorithm, operates by maintaining a descending-ordered list of the upper bounds on the marginal gain of each candidate variable (which is initialized at the first iteration), and then, at each iteration, computing the marginal gains sequentially from the top of the list until a candidate variable is found whose marginal gain is greater than the next largest upper bound in the list. In this way, the number of set function evaluations required to identify the best candidate variable at each iteration can be substantially reduced.
The pseudocode for a lazy greedy variable selection algorithm is given in Algorithm 2. Here G_U^o is a descending-ordered list of the marginal gain bounds for the set of currently unselected candidate variables, I_U^o is the corresponding index set, and EB_flag^o is a Boolean set used to track which entries in the ordered list are exact marginal gains (= 1) and which are upper bounds (= 0). The function reorder(·) is used to denote the insertion of the updated marginal gain value in the appropriate location in the ordered list and the corresponding reordering of the entries in I_U^o and EB_flag^o. This step can be efficiently implemented taking advantage of the existing ordering of the lists.

Algorithm 2 Lazy greedy variable selection
Input: X, k
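As a sketch of the lazy strategy described above (our own Python illustration; the paper's implementation is in MATLAB), the ordered bound list can be maintained with a max-heap, refreshing only the top candidate's marginal gain at each step:

```python
import heapq
import numpy as np

def ve(X, idx):
    """Variance explained (%) by the columns of X indexed by idx."""
    Xs = X[:, idx]
    Xhat = Xs @ np.linalg.lstsq(Xs, X, rcond=None)[0]
    return 100.0 * (1.0 - np.linalg.norm(X - Xhat) ** 2
                    / np.linalg.norm(X) ** 2)

def lazy_greedy_select(X, k, gain):
    """Algorithm 2 sketch: keep stale marginal-gain upper bounds in a
    max-heap (negated values) and refresh only the top candidate."""
    selected, g_cur = [], 0.0
    heap = [(-gain(X, [i]), i) for i in range(X.shape[1])]
    heapq.heapify(heap)
    evals = X.shape[1]
    while len(selected) < k and heap:
        _, i = heapq.heappop(heap)
        delta = gain(X, selected + [i]) - g_cur   # exact marginal gain
        evals += 1
        if not heap or delta >= -heap[0][0]:
            selected.append(i)                    # still the best: select
            g_cur += delta
        else:
            heapq.heappush(heap, (-delta, i))     # re-insert updated bound
    return selected, evals

X = np.array([[1., 1., 0.], [1., 1., 0.], [0., 0., 1.]])
sel, n_evals = lazy_greedy_select(X, 2, ve)
```

For submodular metrics this returns the same selection as the full greedy search; for VE it is the heuristic the paper evaluates as L-FSCA.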

Candidate Algorithms
This section provides a description of the six baseline algorithms of our comparative study. Without loss of generality, the data matrix X is assumed to have mean-centered columns.

Forward Selection Component Analysis (FSCA)
FSCA (Puggini and McLoone, 2017), first proposed in Ragnoli et al. (2009) in the context of optical emission spectroscopy channel selection, has been found to be an effective tool for data-driven metrology plan optimization in the semiconductor manufacturing industry (Prakash et al., 2012; Susto et al., 2019). The algorithm, which can be regarded as the unsupervised counterpart of forward selection regression (Bendel and Afifi, 1977), selects, in a greedy fashion, the variables that provide the greatest contribution to the variance in the dataset X. As depicted in Algorithm 3, FSCA operates recursively, selecting at each iteration the variable that maximizes the variance explained with respect to the residual matrix R obtained by removing from X the contribution of the variables selected during the previous iterations (Step 6). The variance explained (VE) performance metric is defined as

$$V_R(\hat{R}(r_i)) = 100\left(1 - \frac{\|R - \hat{R}(r_i)\|_F^2}{\|R\|_F^2}\right) \qquad (10)$$

where R̂(r_i) is the matrix R reconstructed by regressing on r_i, the i-th column of R, i.e.

$$\hat{R}(r_i) = r_i (r_i^T r_i)^{-1} r_i^T R \qquad (11)$$

Here, ||R − R̂(r_i)||_F is the Frobenius norm of the difference between the reconstructed and actual residual matrix. Maximising the explained variance in (10) is equivalent to maximising the Rayleigh quotient of RR^T with respect to r_i and this can be exploited to achieve a computationally efficient implementation of FSCA as described in Puggini and McLoone (2017).
In terms of the greedy search algorithm formulation in Algorithm 1, FSCA corresponds to setting g(I_S) = V_X(X̂(X_S)), where X_S is the subset of columns of X corresponding to I_S and X̂(X_S) is the projection of X on X_S, that is:

$$\hat{X}(X_S) = X_S (X_S^T X_S)^{-1} X_S^T X \qquad (12)$$

A lazy greedy implementation of the FSCA algorithm can then be realised as set out in Algorithm 2. This will be referred to as L-FSCA hereafter.
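A minimal NumPy sketch of the FSCA recursion with residual deflation follows (our own illustration, not the paper's MATLAB implementation; the candidate score uses the identity ||R̂(r_i)||_F² = ||r_i^T R||² / (r_i^T r_i), which follows from Eq. 11):

```python
import numpy as np

def fsca(X, k):
    """FSCA sketch: at each iteration pick the column of the residual R
    explaining most of R's variance, then deflate R."""
    R = X.astype(float).copy()
    selected = []
    for _ in range(k):
        scores = np.full(X.shape[1], -np.inf)
        for i in range(X.shape[1]):
            if i in selected:
                continue
            r = R[:, i]
            denom = r @ r
            if denom > 1e-12:
                c = r @ R
                # ||R_hat(r_i)||_F^2 = ||r_i^T R||^2 / (r_i^T r_i)
                scores[i] = (c @ c) / denom
        best = int(np.argmax(scores))
        selected.append(best)
        r = R[:, best]
        R = R - np.outer(r, r @ R) / (r @ r)   # remove r's contribution
    return selected
```

Swapping the selection loop into the lazy scheme of Algorithm 2 yields L-FSCA; with a VE metric the two can in principle diverge, which is what the experiments in Section 4 examine.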
Forward Orthogonal Selection Maximising Overall Dependency (FOS-MOD)

FOS-MOD, proposed by Wei and Billings (2007), selects the most representative variables based on their similarity with the unselected variables, as measured by the squared correlation coefficient. Given two vectors x_i and x_j, the squared correlation coefficient is defined as

$$\rho^2(x_i, x_j) = \frac{(x_i^T x_j)^2}{(x_i^T x_i)(x_j^T x_j)} \qquad (13)$$

The rationale behind the algorithm is to select, at each iteration, the residual direction r_i having the highest average ρ² with the unselected variables. Hence, the variable selection function is defined as

$$i^* = \underset{i \in I_U}{\operatorname{argmax}} \; \frac{1}{v} \sum_{j \in I_X} \rho^2(r_i, x_j) \qquad (14)$$

Note that, since r_i is orthogonal to the selected variables, the contribution of these variables to the summation is zero. The pseudocode for FOS-MOD is given in Algorithm 4.
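The FOS-MOD selection rule (average squared correlation of a candidate residual direction with the variables) can be sketched as follows (our own Python illustration; the paper's code is MATLAB):

```python
import numpy as np

def fos_mod_scores(R, X, unselected):
    """FOS-MOD rule sketch: mean squared correlation of each candidate
    residual direction R[:, i] with all columns of X."""
    scores = {}
    for i in unselected:
        r = R[:, i]
        if r @ r < 1e-12:          # fully explained direction: skip
            continue
        rho2 = [(r @ x) ** 2 / ((r @ r) * (x @ x)) for x in X.T]
        scores[i] = float(np.mean(rho2))
    return scores

X = np.array([[1., 1., 0.], [1., 1., 0.], [0., 0., 1.]])
s = fos_mod_scores(X, X, [0, 1, 2])   # first iteration: R = X
```

The variable with the largest score is selected and the residual matrix is then deflated exactly as in FSCA.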

Algorithm 4 FOS-MOD
Input: X, k
⋮
7: end for
8: return I_S

Information Theoretic Feature Selection (ITFS)

ITFS was originally proposed by Krause et al. (2008) for a sensor placement problem where an initial sensor deployment is exploited to perform a data-driven optimal placement maximising the mutual information (MI) between the selected and unselected sensor locations, such that the information lost by removing a sensor at each step is minimized. Specifically, a discrete set of locations V is split into the set of selected locations S and the set of unselected locations U, respectively.
The goal is to place k sensors so as to minimise the uncertainty over the unselected locations, i.e.

$$S^* = \underset{S \subseteq V,\; |S| = k}{\operatorname{argmax}} \; H(U) - H(U|S) \qquad (15)$$

where H(·) is the entropy over a set. Hence, S^* is the set that maximizes the reduction of entropy (uncertainty) over the unselected locations. To arrive at a mathematical expression for the MI it is assumed that the variables are distributed according to a multivariate Gaussian distribution. Under this assumption, a Gaussian process regression model (Kersting et al., 2007) can be employed to estimate the multivariate distribution of a set of unselected variables U given the selected ones in S. For a measurement vector s ∈ S the multivariate distribution of the unmeasured variables is still multivariate Gaussian with mean µ* and covariance matrix Σ*, i.e. U ∼ N(µ*, Σ*). These are computed as:

$$\mu^* = K(U, S)\left[K(S, S) + \sigma^2 I\right]^{-1} s \qquad (16)$$

and

$$\Sigma^* = K(U, U) - K(U, S)\left[K(S, S) + \sigma^2 I\right]^{-1} K(S, U) \qquad (17)$$

where I ∈ R^{|S|×|S|} is the identity matrix and K(·, ·) is a covariance matrix that can be estimated from the available historical data. In general, given two sets of variables A and B indexed by I_A and I_B, we have

$$K(A, B) = \frac{1}{m-1} X_A^T X_B \qquad (18)$$

where X_A and X_B are the matrices formed by the columns of X indexed by I_A and I_B, respectively. The hyperparameter σ takes into account the measurement noise. The problem of maximizing the MI belongs to the NP-complete class of problems (Krause et al., 2008), hence an approximate solution can be found using a greedy approach that sequentially selects the variable x_i (i.e. the i-th column of X) that maximizes the increment in mutual information (Krause et al., 2008):

$$i^* = \underset{i \in I_U}{\operatorname{argmax}} \; H(x_i | S) - H(x_i | U \setminus \{x_i\}) \qquad (19)$$

with the conditional distribution x_i | U ∼ N(µ*_U, σ*_U), and µ*_U and σ*_U estimated via Eq. 16 and 17. Recalling that for a Gaussian random variable p the entropy is

$$H(p) = \frac{1}{2}\log\left(2\pi e\,\sigma_p^2\right) \qquad (20)$$

Eq. 19 can be expressed as

$$i^* = \underset{i \in I_U}{\operatorname{argmax}} \; \frac{\sigma_{x_i|S}^2}{\sigma_{x_i|U \setminus \{x_i\}}^2} \qquad (21)$$

The pseudocode for ITFS is given in Algorithm 5. While MI is not in general monotonically increasing, Krause et al. (2008) show that, when |I_S| ≪ v, the diminishing returns property (Eq. 4) applies, hence ITFS is guaranteed to be within 63% of the optimal value in terms of the MI of X.
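The greedy MI rule can be sketched in Python as follows (our own illustration; `sigma2` plays the role of the measurement-noise hyperparameter σ², and the conditional variances come from the standard Gaussian conditioning formula applied to the sample covariance):

```python
import numpy as np

def cond_var(K, y, A, sigma2=1e-2):
    """Var(x_y | x_A) under a Gaussian model with covariance K."""
    if not A:
        return K[y, y]
    KAA = K[np.ix_(A, A)] + sigma2 * np.eye(len(A))
    KyA = K[y, A]
    return K[y, y] - KyA @ np.linalg.solve(KAA, KyA)

def itfs(X, k, sigma2=1e-2):
    """ITFS sketch: greedily maximise var(x_i|S) / var(x_i|U \\ {i})."""
    m, v = X.shape
    K = (X.T @ X) / (m - 1)        # sample covariance (X mean-centred)
    S, U = [], list(range(v))
    for _ in range(k):
        best, best_score = None, -np.inf
        for i in U:
            rest = [j for j in U if j != i]
            score = cond_var(K, i, S, sigma2) / \
                    (cond_var(K, i, rest, sigma2) + 1e-12)
            if score > best_score:
                best, best_score = i, score
        S.append(best)
        U.remove(best)
    return S

rng = np.random.default_rng(1)
Xd = rng.standard_normal((200, 5))
Xd -= Xd.mean(axis=0)
sel = itfs(Xd, 2)
```

The selected variables are those that are both predictable from the data and informative about the remaining unselected ones.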

Algorithm 5 ITFS
Input: X, k
⋮

Principal Feature Selection (PFS)

PFS, presented in Algorithm 6, is a PCA guided approach to variable selection introduced by Cui and Dy (2008). At each iteration it selects the variable most correlated with the first principal component p_1 of the current residual matrix R (updated at the deflation step at line 7, with R̂ defined as in Eq. 11). The correlation is measured using Pearson's correlation coefficient which, for mean-centred residual vector r_i and principal component p_1, is defined as

$$\rho(r_i, p_1) = \frac{r_i^T p_1}{\|r_i\|\,\|p_1\|} \qquad (22)$$

Algorithm 6 PFS
Input: X, k
⋮
8: end for
9: return I_S

Forward Selection Frame Potential (FSFP)
The FSFP algorithm introduced in Zocco and McLoone (2017) was inspired by the frame potential based sensor selection method proposed by Ranieri et al. (2014) for linear inverse problems. The frame potential (FP) of a matrix X is defined as

$$FP(X) = \sum_{i=1}^{v} \sum_{j=1}^{v} \langle x_i, x_j \rangle^2 \qquad (23)$$

where x_j denotes the j-th column of X. It is an attractive metric for variable selection as minimising FP encourages orthogonality among the selected columns of X. Ranieri et al. (2014) showed that the set function f(I_U) = FP(X) − FP(X_S), where I_U is the index set of the eliminated variables and X_S the matrix of those retained, is a monotone increasing submodular function with f(∅) = 0. Therefore it satisfies the conditions for Nemhauser's bound (Eq. 3) and its greedy maximisation is guaranteed to be within 63% of the global optimum solution. However, with this formulation variable selection is achieved by backward elimination, that is, the algorithm begins with all variables selected and then unselects variables one at a time until k variables remain. Therefore, the algorithm must iterate v − k times, at which point the desired variables are given by I_S = I_X \ I_U. Backward elimination is computationally much more demanding than forward selection, especially when k ≪ v. In our previous work (Zocco and McLoone, 2017) we proposed an FP based forward selection algorithm (i.e. FSFP) where the variable selection metric is given by

$$g(I_S) = -FP(X_S) \qquad (24)$$

The resulting algorithm (summarised in Algorithm 7) is computationally much more efficient than backward elimination, but no longer meets the requirements for the B_N bound (Eq. 3) since g(·) is a monotone decreasing function and g(∅) = 0. However, submodularity does apply, allowing a lazy greedy implementation.
As suggested in Ranieri et al. (2014), FP-based greedy selection provides better results when matrix X is normalized to have columns with unitary norm.
However, this presents a challenge for FSFP as every individual variable then has the same FP, precisely equal to one, leading to an ambiguous choice for the first variable.
Different solutions can be adopted to address this ambiguity. The simplest solution is to randomly select the first variable, however this leads to an output I S that is not unique for a given X. Two alternatives are to select the optimum pair of variables instead of a single variable at the first step, or to evaluate FSFP considering all the possible variables as first choice and then choosing the best one, but both these methods lead to a huge increase in computational cost. The approach we have chosen instead is to select the first variable using FSCA. The resulting algorithm, denoted as FSFP-FSCA, is given in Algorithm 8.
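The FSFP-FSCA procedure can be sketched as follows (our own Python illustration, not the paper's MATLAB code): columns are first normalised to unit norm, the first variable is chosen with the FSCA criterion, and each subsequent variable is the one adding the least to the frame potential of the selected set:

```python
import numpy as np

def fsca_first(X):
    """FSCA criterion for a single variable (variance of X explained)."""
    scores = [((X[:, i] @ X) ** 2).sum() / (X[:, i] @ X[:, i])
              for i in range(X.shape[1])]
    return int(np.argmax(scores))

def fsfp_fsca(X, k):
    """FSFP-FSCA sketch: FSCA breaks the FP tie on the first variable;
    the rest minimise the frame-potential increase."""
    Xn = X / np.linalg.norm(X, axis=0)   # unit-norm columns
    G2 = (Xn.T @ Xn) ** 2                # squared inner products
    first = fsca_first(X)
    S = [first]
    U = [i for i in range(X.shape[1]) if i != first]
    while len(S) < k:
        # FP(X_S ∪ {x_i}) - FP(X_S) = 2 * sum_{j in S} <x_i, x_j>^2 + 1
        inc = [2.0 * G2[i, S].sum() + 1.0 for i in U]
        best = U[int(np.argmin(inc))]
        S.append(best)
        U.remove(best)
    return S
```

Because the FP increase depends only on pairwise inner products, the Gram matrix `G2` can be precomputed once, which is what makes the algorithm cheap relative to backward elimination.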

Algorithm 7 FSFP
Input: X, k
⋮

Unsupervised Forward Selection (UFS)

UFS, proposed by Whitley et al. (2000), assesses each candidate variable x via its squared multiple correlation with the subspace spanned by the currently selected variables C_{S_k}, defined (for mean-centred x) as

$$R^2(x, C_{S_k}) = \frac{x^T C_{S_k}(C_{S_k}^T C_{S_k})^{-1} C_{S_k}^T x}{x^T x} \qquad (25)$$

Using this metric the UFS algorithm selects at each iteration the variable with the smallest R², while discarding variables whose R² value exceeds a user defined similarity threshold R²_max, until all variables have either been selected or discarded. To maintain consistency with the forward greedy variable selection algorithm structure considered in this paper (Algorithm 1), in our implementation of UFS we omit the similar variable pruning step and instead terminate when k variables have been selected, as described in Algorithm 9.
The UFS algorithm is equivalent to defining the selection metric g(·) in Algorithm 1 as

$$g(I_{S_k}) = -\sum_{j=1}^{k} R^2(x_{(j)}, C_{S_{j-1}}) \qquad (26)$$

where x_{(j)} denotes the variable selected at the j-th iteration, which corresponds to defining the marginal gain as −R²(x, C_{S_k}) and replacing argmin with argmax at step 6 of Algorithm 9. It can be shown that this metric is a submodular function, hence an exact lazy greedy implementation of UFS is possible. However, g(I_S) is a monotone decreasing function, rather than increasing, and therefore the B_N performance bound (Eq. 3) does not apply.
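A sketch of the UFS selection rule follows (our own Python illustration; seeding the first two variables with the least-correlated pair reflects our reading of Whitley et al. (2000) and should be treated as an assumption here):

```python
import numpy as np

def r_squared(x, C):
    """Squared multiple correlation of x with the columns of C."""
    coef, *_ = np.linalg.lstsq(C, x, rcond=None)
    xhat = C @ coef
    return float((xhat @ xhat) / (x @ x))

def ufs(X, k):
    """UFS sketch: seed with the least-correlated pair (assumed rule),
    then repeatedly pick the variable least explained by the selection."""
    v = X.shape[1]
    C = np.abs(np.corrcoef(X, rowvar=False))
    np.fill_diagonal(C, np.inf)
    i, j = np.unravel_index(np.argmin(C), C.shape)
    S = [int(i), int(j)]
    U = [t for t in range(v) if t not in (i, j)]
    while len(S) < k and U:
        scores = [r_squared(X[:, t], X[:, S]) for t in U]
        best = U[int(np.argmin(scores))]
        S.append(best)
        U.remove(best)
    return S
```

Negating the score and using argmax instead of argmin recovers the g(·) formulation of Algorithm 1 discussed above.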
Algorithm 9 UFS
Input: X, k
⋮
9: end for
10: return I_S

Computational Complexity
All the algorithms considered share a similar greedy selection structure, and all rely on the existence of linear dependency between the selected and unselected variables. The element that distinguishes one algorithm from another is the variable selection criterion employed (see Table 1). The differences in these criteria result in algorithms with substantially different computational complexity, as summarised in Table 2.

Performance Metrics
The comparison is based on six metrics. Three of these are the variance explained V_X(X_S), the frame potential FP(X_S) and the mutual information MI(X_S), as previously defined in Eq. (10), (23) and (19), respectively. These are complemented by three metrics introduced to characterise the evolution of V_X(X_S) with the number of selected variables k, namely, the area under the variance curve AUC, the relative performance r, and the number of variables needed to reach n% variance explained, k_n%. AUC is defined as

$$AUC = \frac{1}{100(v-1)} \sum_{k=1}^{v-1} V_X(\hat{X}_k) \qquad (27)$$

where X̂_k denotes the approximation of X with X_S, where |I_S| = k, as defined in Eq. (12). The factor 100(v − 1) normalises the expression to yield a maximum value of 1 when 100% of the variance is explained by a single variable. The relative performance r expresses how well an algorithm performs relative to the other algorithms over the first k*_99% selected variables. Specifically, it is the percentage of times over the interval k = 1, ..., k*_99% that the algorithm is the top ranked, i.e. yields the maximum value of V_X(X̂_k), that is,

$$r = \frac{100}{k_{99\%}^*} \sum_{k=1}^{k_{99\%}^*} \text{is\_top\_rank}(k) \qquad (28)$$

Here k*_99% denotes the minimum value of k required to have all the algorithms exceeding 99% variance explained and is_top_rank(k) is 1 if the algorithm is the best among those being evaluated and zero otherwise, for |I_S| = k. Finally, k_n% provides a measure of the level of compression provided for a given target reconstruction accuracy n, and is defined as

$$k_{n\%} = \min \{k : V_X(\hat{X}_k) \geq n\} \qquad (29)$$

Note that the AUC, r and k_n% metrics are functions of VE, hence algorithms optimizing VE are implicitly optimizing them too. The choice of having four out of six metrics based on VE is motivated by the fact that, in contrast to MI and FP, VE is a direct measure of the approximation error X − X̂ which, in practice, is the quantity we wish to minimize.
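The VE-derived metrics are straightforward to compute from a selection order. A small Python sketch (our own illustration) of AUC and k_n%:

```python
import numpy as np

def ve(X, idx):
    """Variance explained (%) by the columns of X indexed by idx."""
    Xs = X[:, idx]
    Xhat = Xs @ np.linalg.lstsq(Xs, X, rcond=None)[0]
    return 100.0 * (1.0 - np.linalg.norm(X - Xhat) ** 2
                    / np.linalg.norm(X) ** 2)

def auc_and_kn(X, order, n=95.0):
    """AUC of the VE-vs-k curve and k_n% for a given selection order."""
    v = X.shape[1]
    curve = [ve(X, list(order[:k])) for k in range(1, v)]
    auc = sum(curve) / (100.0 * (v - 1))
    # first k whose VE reaches the n% target (v if never reached)
    kn = next((k for k, val in enumerate(curve, start=1) if val >= n), v)
    return auc, kn
```

The relative performance r additionally requires the VE curves of all competing algorithms, as it counts how often each one is top ranked.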

Results
This section compares the performance of the algorithms described above, namely FSCA, L-FSCA, FOS-MOD, PFS, ITFS, FSFP-FSCA and UFS, on the simulated and real world datasets summarized in Table 3. For reasons of space, in this section FOS-MOD will be referred to as FOS, FSFP-FSCA as FP-CA and L-FSCA as L-CA. The code that generated the results is written in MATLAB and is publicly available 3 to facilitate further benchmarking of greedy search algorithms for unsupervised variable selection. Our experiments were conducted using MATLAB R2019b running on an Intel(R) Core(TM) i7-9700K CPU @ 3.60GHz with 32 GB RAM.
The indices of the variables selected by each algorithm for selected datasets are shown in Table 5. Each table entry is the index of the variable selected at the k-th iteration. The sequence of variables selected by a given algorithm is therefore defined by the corresponding table column. In all cases FSCA and its lazy implementation L-CA select the same variables, which further supports the hypothesis that variance explained is behaving as a submodular function.

Simulated Datasets
Simulated dataset 1. In this dataset, which is from Puggini and McLoone (2017), four independent random variables w, x, y, z ∼ N(0, 1) are used to generate the noisy replicates w_i = w + ε_{w_i}, x_i = x + ε_{x_i}, y_i = y + ε_{y_i}, z_i = z + ε_{z_i} for i = 1, . . . , 5, and h_1 = w + x + ε_21, h_2 = y + z + ε_22, where the ε terms denote i.i.d. Gaussian noise. The final matrix is then defined as

X = [w, w_1, . . . , w_5, x, x_1, . . . , x_5, y, y_1, . . . , y_5, z, z_1, . . . , z_5, h_1, h_2]   (30)

with the columns generated as m = 1000 realizations of the random variables, yielding a dataset X ∈ R^{1000×26}. Figure 2 shows the explained variance as a function of the number of selected variables for each algorithm. With the exception of UFS, all the methods achieve a VE greater than 90% with 4 variables. Recalling Table 4, in this case FOS is the best according to all three metrics, whereas UFS requires more than 6 variables to exceed 95% VE, which is beyond the horizontal axis limit of the figure.
As observed in Table 5, for this dataset the algorithms select similar variable sequences, with the exception of UFS.

Simulated dataset 2. This dataset is also taken from Puggini and McLoone (2017) and is characterized by having two blocks of variables. The first block X_I is composed of u independent variables, while the second block X_D is linearly dependent on the first block and is obtained as a random linear combination of the variables in X_I perturbed by additive Gaussian noise. More formally, the dataset is defined as follows:

• X_I ∈ R^{m×u}, with (X_I)_{i,j} ∼ N(0, 1)
• X_D = X_I Φ + E ∈ R^{m×(v−u)}
• X = [X_I, X_D] ∈ R^{m×v}

The matrices Φ ∈ R^{u×(v−u)} and E ∈ R^{m×(v−u)} are generated by sampling each element from a Gaussian distribution, in particular Φ_{i,j} ∼ N(0, 1) and E_{i,j} ∼ N(0, 0.1). The algorithms were evaluated for the case u = 25, v = 50 and m = 1000. Table 6 reports the values of VE, MI and FP of each algorithm for k = 5, 10.
All three metrics increase with k. For k = 5 the best value of each metric is achieved by the algorithm designed to optimize it, that is, VE, MI and FP achieve their best results with FSCA/L-CA, ITFS and FP-CA, respectively.
For k = 10 this pattern is broken for FP, with UFS yielding the lowest FP.

Pitprops Dataset
The 'Pitprops' dataset was first introduced by Jeffers (1967) for PCA performance analysis. The data are measurements of physical properties taken from different species of pit props. The scope of that analysis was to determine whether the considered pit props were sufficiently strong for deployment in mines. The considered population included samples of different species, sizes and geographical regions. In Jeffers (1967) the correlation matrix of the original dataset is provided. An approximation of the original 13 attributes is then defined in order to obtain a correlation matrix close to the original one, yielding a data matrix X ∈ R^{180×13}. The small size of v in this dataset permits an exhaustive search to be performed for the optimal subset of variables for a fixed value of k. Table 7 lists the variables selected at each step by each algorithm and compares them with the optimal solution for |I_S| = 7. The number of selected variables that each method has in common with the optimal solution is given by the parameter n_b in the final row of the table (i.e. n_b = |I*_S ∩ I_S|). ITFS finds the optimum solution. The next closest are UFS with n_b = 5 and FP-CA with n_b = 4. ITFS also yields the highest VE. FSCA, L-CA, PFS and FOS only have 3 variables in common with the optimum subset, but in terms of VE they give the second highest value (85.3%), followed by FP-CA (83.9%) and UFS (78.3%). FSCA/L-CA and FOS select the same seven features, but 'length' and 'ringbut' are selected as the first and third variables, respectively, by FSCA/L-CA, whereas FOS selects them as the third and first variables, respectively. Similarly, PFS selects the same variables, but in a different order. Consequently, the VE at k = 7 for FSCA, L-CA, FOS and PFS is the same.
The fact that these four methods achieve a VE that is 98% of the optimum solution with n b = 3, while UFS only achieves a VE that is 90% of the optimum with n b = 5, highlights how the complex correlation relationships between variables makes optimum variable selection such a challenging problem.

Wafer Profile Reconstruction
This case study was provided by a semiconductor manufacturer and is concerned with determining a reduced and optimal set of measurement sites distributed over a silicon wafer surface in order to adequately monitor the uniformity of the thickness of a layer of material being deposited by a chemical vapour deposition production process. The dataset consists of measurements taken at 50 candidate sites from a set of 316 wafers, hence X ∈ R^{316×50}. As process monitoring is a time consuming activity, it is desirable to have a reduced set of measurement sites. The VE values with increasing k are reported in Table 8. The values of VE, FP and MI obtained with each variable selection algorithm are compared in Table 9 for k = 3, 7. All three metrics increase with k. While the best values of FP and MI are achieved by the corresponding algorithms, the best value of VE for k = 7 is given by PFS, followed by FOS and then ITFS.
This contrasts with the results for k = 3 and the results for the 'Simulated 2' dataset reported in Table 6, and is a consequence of the high level of redundancy in this dataset with several combinations of 7 variables sufficient to achieve 99% VE, as evident from Table 5. This is also reflected in the almost identical FP values achieved by all algorithms. For k = 3, the second ranked algorithm in terms of FP is UFS and in terms of MI it is PFS, while for k = 7 it is FSCA/L-CA and FOS, respectively. Therefore we can see that the algorithms rank quite differently for different metrics, although the overall pattern remains with VE based algorithms performing well in terms of MI, and UFS and FP-CA  performing similarly well for FP, but poorly for the other metrics.

Case Studies from the UCI Repository
This section considers four datasets taken from the UCI Machine Learning Repository. The datasets are listed and briefly described below.
Arrhythmia (Guvenir et al., 1997). This dataset is taken from the medical records of arrhythmia patients. It was originally intended to train a classifier able to distinguish between 16 different classes of arrhythmia, defined over m = 452 patients. The v = 279 features comprise both patient identifiers, e.g. age, sex and weight, and sampled electrocardiogram signal recordings.
Since the original dataset contains some missing values, we have reduced the number of variables to v = 258. In this scenario, the output of the algorithms, I S , can be interpreted as a reduced set of patient data, enabling faster data collection and training with regard to the classifier development. Additional information on the dataset can be found in Guvenir et al. (1997).
Sales (Tan and San Lau, 2014). This dataset is composed of weekly purchase quantities of m = 811 products over v = 52 weeks. The objective is to identify which weeks to record in order to obtain the most accurate prediction of the purchases over the remaining weeks. For a more detailed dataset description the reader should refer to Tan et al. (2015).
Gases (Vergara et al., 2012;Rodriguez-Lujan et al., 2014). This dataset, which was also considered in our previous work Zocco and McLoone (2017), consists of measurements of 8 gas-related parameters provided by each of 16 chemical sensors used to monitor gas level concentrations. Here we considered just the third batch of data, hence X ∈ R 1586×129 .
Music (Zhou et al., 2014). A set of m = 1059 musical tracks from different countries, each one v = 68 samples long, leads to the final dataset X ∈ R 1059×68 . Here variable selection identifies a reduced set of samples/columns, each corresponding to a specific time point within the musical tracks. Refer to Zhou et al. (2014) for further details.
The results for the UCI case studies are summarized in Table 10 for four different compression thresholds k_n%. The best performing algorithm is highlighted in bold in each case. Overall, FSCA/L-CA achieves the highest compression performance with 12 entries in bold, followed by PFS with 9, and then FOS and ITFS with 3 each. The 'Music' dataset is the most difficult to compress; for a reconstruction accuracy of 80% the number of samples to record is reduced by 59% with the best algorithm, but it is only reduced by 29% when the desired reconstruction accuracy is 95%. 'Sales' only requires a large number of variables at the 99% accuracy level; for lower threshold levels, data from far fewer weeks is required.

Computation Times
To assess the efficiency of L-FSCA relative to FSCA, and to compare the computational requirements of all the algorithms considered in the paper, this section presents two tables of timing results. In Table 11, the speed-up achieved by L-FSCA over FSCA is computed for a range of dataset dimensions (m × v) and k values for: (1) randomly generated data, X_i,j ∼ N(0, 1); and (2) data with the same correlation structure as the 'Simulated dataset 2' case study, but with u = 50. In Table 12, the speed-up ratio relative to FSCA is reported for each algorithm for the four largest datasets for selected values of k. Results are also included for the 'Sim. 2 (u = 50)' and 'Random' datasets with m = 2194 and v = 2046. The corresponding FSCA execution times (in seconds) are also recorded.
The speed-up ratio, s_i, for the ith algorithm is defined as s_i = t_FSCA/t_i, where t_i is the execution time of the ith algorithm and t_FSCA is the execution time of FSCA. If s_i < 1 the algorithm is slower than FSCA, in which case the value is reported as a fraction 1/q_i, where q_i indicates how many times slower algorithm i is than FSCA. The results in Table 11 and Table 12 show that the speed-up achievable with L-FSCA (L-CA) varies considerably depending on the problem dimension, the value of k, and the correlation structure of the data. The lowest speed-up observed is 1.29, with the highly correlated 'PlasmaEtch' dataset (k = 5), and the largest is 17.01, with the 2194 × 2046 randomly generated data (k = 20). In general, the larger the problem dimension (m × v), the larger the value of k, and the less correlated the data, the greater the speed-up achievable with L-FSCA.
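The reporting convention for the speed-up ratio amounts to the following (a trivial helper of our own, shown only to make the convention concrete):

```python
def speedup(t_fsca, t_i):
    """Speed-up ratio of algorithm i relative to FSCA: s_i = t_FSCA / t_i."""
    return t_fsca / t_i

def report(t_fsca, t_i):
    """Report s_i directly, or as 1/q_i when algorithm i is slower than FSCA."""
    s = speedup(t_fsca, t_i)
    return f"{s:.2f}" if s >= 1.0 else f"1/{1.0 / s:.1f}"

print(report(10.0, 2.0))   # 5.00   -> five times faster than FSCA
print(report(10.0, 40.0))  # 1/4.0  -> four times slower than FSCA
```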
It is interesting to note that PFS proves exceptionally efficient to compute for 'PlasmaEtch' (16−30 times faster than FSCA). This contrasts with L-FSCA, which only achieves modest speed-ups for this problem (1.3−2.3). These differences are largely due to the highly correlated nature of the problem, which results in: (1) a covariance matrix spectral distribution that favours rapid convergence of the NIPALS procedure; and (2) a reduction in the efficiency of the lazy greedy procedure due to the need to search through more of the ordered list of marginal gains before locating the next best variable. As evidence of this, when PFS is applied to 'Sim. 2' and 'Random' datasets of the same dimension as 'PlasmaEtch', it is only 1.7−2.9 times faster than FSCA, while L-FSCA is 4.7−17.0 times faster. Hence, it can be concluded that problem characteristics have a big impact on the computational efficiency of both algorithms.
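The lazy greedy mechanism discussed above can be illustrated with a minimal sketch. This is our own simplified implementation: the helper names are ours, and the benchmarked L-FSCA uses efficient residual-deflation updates rather than the repeated least-squares solves shown here.

```python
import heapq
import numpy as np

def ve(X, S):
    """Percentage of the variance of X explained by the columns indexed by S."""
    if not S:
        return 0.0
    Xs = X[:, S]
    B, *_ = np.linalg.lstsq(Xs, X, rcond=None)  # least-squares reconstruction
    return 100.0 * (1.0 - np.linalg.norm(X - Xs @ B) ** 2 / np.linalg.norm(X) ** 2)

def lazy_greedy_select(X, k):
    """Greedy VE maximisation with lazy re-evaluation of marginal gains.

    Stale gains are kept in a max-heap and a candidate is only re-evaluated
    when it reaches the top of the heap. Without submodularity this is a
    heuristic, but it typically evaluates far fewer candidates per iteration
    than the exhaustive greedy sweep."""
    selected = []
    # Heap entries: (negated gain, iteration the gain was computed in, index).
    heap = [(-ve(X, [j]), 0, j) for j in range(X.shape[1])]
    heapq.heapify(heap)
    for it in range(1, k + 1):
        while True:
            neg_gain, computed_at, j = heapq.heappop(heap)
            if computed_at == it - 1:  # gain is current: select this variable
                selected.append(j)
                break
            # Stale entry: refresh its gain w.r.t. the current selection.
            gain = ve(X, selected + [j]) - ve(X, selected)
            heapq.heappush(heap, (-gain, it - 1, j))
    return selected
```

In the worst case every heap entry must be refreshed in every iteration, and the cost matches a full greedy sweep; this is why highly correlated data, which reshuffles the gain ordering between iterations, erodes the speed-up.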
UFS is the most scalable of the variable selection algorithms with respect to problem dimension (m × v). It is 17−36 times faster to compute than FSCA for 'PlasmaEtch' and 5−16 times faster for 'YaleB'. However, its performance deteriorates with increasing k, in contrast to L-FSCA, which becomes more competitive as k increases.
As expected, FP-CA and ITFS are the most computationally expensive algorithms and quickly become prohibitive to compute as the problem size increases: they are 36−74 and 72−236 times slower than FSCA, respectively, for the case studies considered. FOS is also at least an order of magnitude slower than FSCA for most problems.
It should be noted that the execution time of an algorithm depends on how it is implemented (in contrast with the Big-O complexity reported in Table 2). Hence, when considering the results in Table 12, the reported speed-ups should be interpreted as indicative of relative performance rather than as definitive.

Conclusions
This paper has proposed a novel greedy unsupervised variable selection algorithm, L-FSCA, a lazy implementation of FSCA. Of the algorithms benchmarked, UFS is the least sensitive to problem dimension, v, making it much more efficient to compute than the other algorithms when v is large. However, its computational advantages diminish with increasing m and k, with the result that it is outperformed by the other algorithms when m >> v, particularly for larger values of k (see, for example, the 'WaferIPCM' case study).
The relative performance of PFS and L-FSCA is problem dependent, with PFS faster than L-FSCA for 5 of the 18 dataset-variable selection combinations investigated in Table 12, the biggest differences occurring for the highly correlated 'PlasmaEtch' case study. This arises because the complexity of PFS is O(Nkmv), with N determined by the implementation of NIPALS and the spectral characteristics of the data covariance matrix, while the complexity of L-FSCA tends towards O(kmv) depending on the efficiency of the lazy search; both are impacted by the level of correlation in the data. High levels of correlation among variables are detrimental to the efficiency of L-FSCA, but beneficial to the rate of convergence of the NIPALS procedure in PFS.
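The dependence of N on the covariance spectrum can be made concrete with a minimal NIPALS power-iteration sketch (our own illustrative version, not the implementation benchmarked here): the iteration converges geometrically at a rate governed by the ratio of the two leading covariance eigenvalues, so highly correlated data with one dominant eigenvalue needs very few iterations.

```python
import numpy as np

def nipals_first_pc(X, tol=1e-8, max_iter=500):
    """First principal component of (mean-centred) X by NIPALS power iteration.

    Returns the loading vector p and the number of iterations N taken; N grows
    as the gap between the two leading covariance eigenvalues shrinks."""
    t = X[:, 0].copy()  # initialise the score vector with an arbitrary column
    for n in range(1, max_iter + 1):
        p = X.T @ t
        p /= np.linalg.norm(p)  # loading vector, kept at unit norm
        t_new = X @ p           # updated score vector
        if np.linalg.norm(t_new - t) < tol * np.linalg.norm(t_new):
            return p, n
        t = t_new
    return p, max_iter
```

On near rank-one data the loop exits after a handful of iterations, which is why PFS benefits from exactly the correlation structure that penalises the lazy search in L-FSCA.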
The main limitation of L-FSCA is that its computation time is problem dependent and therefore cannot be determined a priori, a characteristic it shares with PFS, its closest competitor. However, unlike PFS, since its computation time is upper bounded by that of FSCA, it is guaranteed to be at least as fast as FSCA, whose computation time is deterministic. Also, as already noted, L-FSCA does not enjoy theoretical performance guarantees because VE does not meet the submodularity requirements that underpin the available theory, a limitation it shares with FSCA, PFS and FOS. However, this in no way diminishes its practical value and effectiveness, as is evident from its excellent performance over the broad range of case studies investigated. Thus, considering overall computational efficiency, VE performance, and simplicity of use, L-FSCA is the algorithm of choice for unsupervised variable selection.