Recovery of Linear Components: Reduced Complexity Autoencoder Designs

Reducing dimensionality is a key preprocessing step in many data analysis applications, used to address the negative effects of the curse of dimensionality and collinearity on model performance and computational complexity, to denoise data, or to reduce storage requirements. Moreover, in many applications it is desirable to reduce the input dimensions by choosing a subset of variables that best represents the entire set when no a priori information is available. Unsupervised variable selection techniques provide a solution to this second problem. An autoencoder, if properly regularized, can solve both unsupervised dimensionality reduction and variable selection, but the training of large neural networks can be prohibitive in time-sensitive applications. We present an approach called Recovery of Linear Components (RLC), which serves as a middle ground between linear and non-linear dimensionality reduction techniques, reducing autoencoder training times while enhancing performance over purely linear techniques. With the aid of synthetic and real-world case studies, we show that RLC, when compared with an autoencoder of similar complexity, achieves higher accuracy, similar robustness to overfitting, and faster training times. Additionally, at the cost of a relatively small increase in computational complexity, RLC is shown to outperform the current state-of-the-art for a semiconductor manufacturing wafer measurement site optimization application.


I. INTRODUCTION
UNSUPERVISED variable selection and dimensionality reduction are fundamental aspects of data analysis, either as objectives in their own right, for example for sensor selection [1], factor analysis [2], and data visualisation applications [3], or as preprocessing steps to address the curse of dimensionality and collinearity issues that arise with high-dimensional data. Principal Component Analysis (PCA) is a well-established linear technique for unsupervised dimensionality reduction. It employs a linear transformation that maps the original variables in the data onto a new set of orthogonal components, ordered such that they are successively orientated in the directions of maximum variance in the data [4]. A reduced dimension model is then obtained by retaining the first k components. Other linear transformation techniques include Fisher's linear discriminant analysis (LDA) [5], canonical correlation analysis (CCA) [6], and multidimensional scaling (MDS) [7]. For a comprehensive review of linear dimensionality reduction the interested reader is referred to [8].
Rather than transforming to a new set of variables, it is often desirable to reduce dimensionality by selecting a subset of the original variables (or features) as this enhances the transparency and interpretability of models [9], and in some applications it is a necessity, for example in sensor selection and measurement plan optimisation problems [10], [11]. Since determining the best subset of variables from a candidate set, with respect to a given performance metric, is an NP-hard combinatorial optimisation problem, practical implementations of unsupervised variable selection usually involve approximate solutions such as those provided by greedy search algorithms [12], [13], [14], [15], [16].
The first four of the aforementioned neural network based approaches employ an autoencoder architecture, as depicted in Fig. 1. This consists of two sub-networks in cascade: an encoding network, t = M_k^e(x), that maps a v-dimensional input vector x to a k < v dimensional intermediate vector t; and a decoding network, x̂ = M_k^d(t), that attempts to reconstruct x from t. M_k^e(x) and M_k^d(t) are in general nonlinear functions whose parameters are determined by training the overall network to produce the identity mapping, i.e. M_k^d(M_k^e(x)) → x. Dimensionality reduction is then achieved by virtue of the bottleneck layer of size k.
These neural network based techniques have the advantage of being able to exploit recent advances in deep learning to improve their accuracy, but this comes at the expense of substantially greater computational complexity and training times [35], [36]. Consequently, in this paper we propose an unsupervised dimensionality reduction and feature selection methodology, referred to as Recovery of Linear Components (RLC), that is designed to be a middle ground between autoencoders and linear dimensionality reduction techniques. The methodology involves two stages: first, linear techniques are used to produce an ordered set of k_lin linearly transformed components from the original v dimensional dataset; then, the enhanced representational capabilities of neural networks are exploited to predict these k_lin components from the subset formed by the first k of them. The advantage of this two-stage approach is that it requires less complex networks compared to autoencoders, thereby accelerating the training phase.
Fig. 1. Autoencoder neural network topology: the network is trained to reconstruct its own input x as its output via a hidden (bottleneck) layer of size k < v, whose output t then defines the reduced dimension representation of x. The encoding and decoding functions, M_k^e(x) and M_k^d(t), respectively, are in general nonlinear multilayer neural networks.

The main contributions of this paper are the development of the RLC methodology and training algorithm, and the benchmarking of its performance on a range of case studies. In particular:
• we show experimentally that the RLC, when compared with a conventional nonlinear autoencoder of similar complexity, shows higher accuracy, similar robustness to overfitting, and faster training times;
• for the specific case of a semiconductor manufacturing wafer surface measurement plan optimisation application which motivated this work, we show that RLC outperforms the state-of-the-art, allowing fewer measurements to be taken while still accurately recovering the full wafer profile [37], [12].
The remainder of the paper is organized as follows. Section II provides a brief description of the dimensionality reduction, variable selection and autoencoder algorithms that underpin the work. The RLC methodology and training algorithm are introduced in Section III. Experimental results are presented in sections IV and V for unsupervised variable selection and unsupervised dimensionality reduction, respectively. Finally, conclusions are provided in Section VI.

II. PRELIMINARIES
The following notation and conventions are adopted throughout the paper. Matrices and vectors are denoted by bold capital and lowercase letters, respectively, while scalars are denoted by non-bold lowercase letters. The dimensionality reduction/variable selection problem is with respect to a dataset consisting of m measurements of v variables collected in a matrix X ∈ R^{m×v}, hereinafter assumed to be scaled to have zero-mean columns. The objective is to find a lower dimensional representation of X that adequately captures the information contained in X. Denoting this representation as T_k ∈ R^{m×k}, the encoding process can be expressed as T_k = M_k^e(X) and the decoding process (i.e. the reconstruction of X from T_k) as X̂ = M_k^d(T_k). The adequacy of T_k as a representation of X can be quantified in terms of the percentage variance explained VEX(·) between the original matrix X and its reconstruction X̂, that is
VEX(X, X̂) = 100 (1 − ||X − X̂||²_F / ||X||²_F) %,
where ||·||_F denotes the Frobenius norm. Alternatively, the lossiness of T_k can be measured in terms of the mean squared reconstruction error MSE(·), defined as
MSE(X, X̂) = (1/(mv)) ||X − X̂||²_F;
hence, as an optimization objective, maximizing VEX(·) is equivalent to minimizing MSE(·).
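The two metrics can be computed directly; a minimal numpy sketch (not the paper's MATLAB code), assuming the standard definitions of VEX and MSE consistent with the surrounding text:

```python
import numpy as np

def vex(X, X_hat):
    """Percentage variance explained between X and its reconstruction X_hat."""
    return 100.0 * (1.0 - np.linalg.norm(X - X_hat, 'fro')**2
                    / np.linalg.norm(X, 'fro')**2)

def mse(X, X_hat):
    """Mean squared reconstruction error, averaged over all m*v entries."""
    m, v = X.shape
    return np.linalg.norm(X - X_hat, 'fro')**2 / (m * v)

# A perfect reconstruction explains 100% of the variance with zero error.
X = np.random.default_rng(0).standard_normal((100, 5))
X = X - X.mean(axis=0)          # zero-mean columns, as assumed in the text
assert abs(vex(X, X) - 100.0) < 1e-9 and mse(X, X) == 0.0
```

Maximising the first quantity and minimising the second are equivalent, since both are monotone in the same Frobenius-norm residual.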
The dimensionality reduction techniques that are the foundation of RLC can be classified as either linear or nonlinear. In the former both the encoding and decoding processes are linear transformations, while in the latter they are nonlinear.

A. Linear techniques
Dimensionality reduction: In linear dimensionality reduction the encoding function is given by the linear transformation T_k = X P_k, with P_k ∈ R^{v×k}. The corresponding decoding transformation is X̂ = T_k B^T, where B = X^T T_k (T_k^T T_k)^{−1} defines the least squares error linear reconstruction of X from T_k. The encoding transformation matrix P_k maximising the variance explained, that is, maximising VEX(X, X̂), is then given by the first k principal components (PCs) of the principal component analysis (PCA) of X [4]. These may be computed as the eigenvectors of the sample covariance matrix corresponding to its k largest eigenvalues [38]. Under these conditions, the decoding transformation reduces to B = P_k.
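The PCA encode/decode step can be sketched in a few lines of numpy (an illustrative stand-in for the paper's MATLAB code); the least squares decoder B is computed explicitly to show that, with P_k taken as the leading eigenvectors of the sample covariance matrix, it reduces to P_k:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((200, 6))
X = X - X.mean(axis=0)                      # zero-mean columns
m, v = X.shape
k = 2

# P_k: eigenvectors of the sample covariance for the k largest eigenvalues.
C = X.T @ X / (m - 1)
eigvals, eigvecs = np.linalg.eigh(C)        # eigh returns ascending order
P_k = eigvecs[:, ::-1][:, :k]

T_k = X @ P_k                               # linear encoding
X_hat = T_k @ P_k.T                         # decoding with B = P_k

# The general least squares decoder B = X^T T_k (T_k^T T_k)^{-1} coincides with P_k.
B = X.T @ T_k @ np.linalg.inv(T_k.T @ T_k)
assert np.allclose(B, P_k)
```

The final assertion holds because T_k = X P_k and the columns of P_k are orthonormal eigenvectors of X^T X, so the normal equations collapse to B = P_k.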
Variable selection: When performing dimensionality reduction via variable selection, T_k is constrained to be a subset of the columns of X, or equivalently the columns of the encoding transformation matrix P_k are constrained to be a subset of the standard basis vectors of X. A standard basis vector e_i is a vector with its i-th element equal to 1 and all other elements equal to zero, hence the identity matrix I_v can be written as I_v = [e_1 e_2 . . . e_v]. Using this notation, the variable selection optimisation problem can be compactly expressed as maximising VEX(X, X̂) subject to the columns of P_k being drawn from {e_1, . . . , e_v}. While finding the optimum subset of k variables from a candidate set of v variables is an NP-hard combinatorial optimisation problem, and therefore computationally intractable in general, satisfactory approximate solutions can be found using greedy search methods such as Forward Selection Component Analysis (FSCA) [12]. FSCA performs greedy selection using VEX(·) as a variable ranking criterion. More precisely, at each iteration it adds to the set of currently selected variables the one that maximizes the variance explained. Since this is a constrained version of the PCA optimisation problem, it follows that the number of PCs required to achieve a specific value of VEX(·) is a lower bound on the number of variables needed to achieve the same reconstruction accuracy [12].
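The greedy criterion can be sketched as a brute-force numpy loop (illustrative only; it makes no attempt at the efficiency optimisations of the actual FSCA implementation in [12]):

```python
import numpy as np

def fsca(X, k):
    """Greedy forward selection: at each step add the column of X whose
    inclusion maximises the variance explained of the reconstruction."""
    m, v = X.shape
    selected = []
    for _ in range(k):
        best_j, best_score = None, -np.inf
        for j in range(v):
            if j in selected:
                continue
            T = X[:, selected + [j]]
            # Least-squares reconstruction of X from the candidate subset.
            X_hat = T @ np.linalg.lstsq(T, X, rcond=None)[0]
            score = 1.0 - (np.linalg.norm(X - X_hat, 'fro')**2
                           / np.linalg.norm(X, 'fro')**2)
            if score > best_score:
                best_score, best_j = score, j
        selected.append(best_j)
    return selected

# Columns 0-2 are collinear, so one of them should be picked first.
rng = np.random.default_rng(2)
x0 = rng.standard_normal(300)
X = np.column_stack([x0, 2 * x0, -x0, rng.standard_normal(300)])
X = X - X.mean(axis=0)
assert fsca(X, 1)[0] in (0, 1, 2)
```

Refitting the least squares reconstruction for every candidate makes the cost of this naive version grow quickly with v; it is intended only to make the ranking criterion concrete.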
To reduce the gap between the FSCA solution and the optimum solution, should one exist, [12] propose several refinements to the basic FSCA algorithm. Among these, the two approaches that provided the best accuracy-complexity compromise were the Single-Pass Backward Refinement (SPBR) and the Multi-Pass Backward Refinement (MPBR) algorithms. Since these algorithms can be viewed as refinements of the encoding map M_k^e(X) in order to improve its performance, they are complementary to the methodology proposed herein, which can be interpreted as a refinement of FSCA that focuses on improving the decoding map M_k^d(T_k).

B. Neural network based nonlinear techniques
Dimensionality reduction: The autoencoder, as depicted in Fig. 1, is a well established approach to achieving unsupervised nonlinear dimensionality reduction. It consists of a multilayer neural network model with a bottleneck layer whose structure and parameters are determined by solving a nonconvex optimization problem [29]. A classical single hidden layer artificial neural network (ANN) generates a nonlinear mapping between the desired target Y ∈ R^{m×n_y} and the input X of the form
Ŷ = f_2(W_2 f_1(W_1 X^T + B_1) + B_2),   (10)
where n_y and n_h are the number of neurons in the output and hidden layer, respectively, W_1 ∈ R^{n_h×v} and W_2 ∈ R^{n_y×n_h} are the weight matrices between the input and the hidden layer, and between the hidden layer and the output, respectively, B_1 ∈ R^{n_h×m} and B_2 ∈ R^{n_y×m} are the corresponding bias matrices, f_l(·) is the activation function of the l-th layer, and Ŷ ∈ R^{n_y×m} is the network output (i.e. the estimate of the desired target Y, transposed). When (10) is configured as a single hidden layer autoencoder, Y = X and n_h = k. If linear activation functions f_l(·) are used, the autoencoder is linear and yields equivalent VEX(·) performance to PCA. However, the practical value of an autoencoder is its ability to produce more general nonlinear encoding-decoding transformations through the use of nonlinear activation functions such as the logistic sigmoid, the hyperbolic tangent, or the rectified linear unit (ReLU). Denoting the output of the autoencoder as
X̂(θ) = N(X; θ),   (11)
where for compactness of notation we introduce θ = {W_1, W_2, B_1, B_2}, the parameters of the optimum autoencoder N*(X; θ*) can be computed as θ* = argmin_θ J(X; θ), with J(X; θ) usually chosen as the mean square error cost function MSE(X, X̂(θ)), that is
J(X; θ) = (1/(mv)) ||X − X̂(θ)||²_F.   (14)
This non-convex optimisation problem is typically solved using a gradient descent algorithm [39], a quasi-Newton algorithm [40], or the Levenberg-Marquardt algorithm [41].
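The single hidden layer mapping is easy to make concrete; a minimal numpy sketch of the forward pass, assuming tanh for the hidden activation f_1 and a linear output f_2 (one of several activation choices mentioned in the text), with bias vectors broadcast across the m samples:

```python
import numpy as np

rng = np.random.default_rng(3)
m, v, k = 50, 8, 3                 # samples, input dim, bottleneck size (n_h = k)

X = rng.standard_normal((m, v))
W1 = 0.1 * rng.standard_normal((k, v)); b1 = np.zeros((k, 1))
W2 = 0.1 * rng.standard_normal((v, k)); b2 = np.zeros((v, 1))

def autoencoder(X):
    """Single hidden layer autoencoder: tanh encoder, linear decoder."""
    T = np.tanh(W1 @ X.T + b1)     # bottleneck activations, shape (k, m)
    return (W2 @ T + b2).T         # reconstruction X_hat, shape (m, v)

X_hat = autoencoder(X)
J = np.sum((X - X_hat)**2) / (m * v)   # MSE cost to be minimised over the weights
assert X_hat.shape == X.shape and J > 0
```

Training would then adjust (W1, b1, W2, b2) to minimise J; the weight values above are just a random initialisation.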
A single hidden layer autoencoder has limited nonlinear compression capabilities since its representational capabilities are a function of the number of hidden neurons, which is required to be small to achieve dimensionality reduction. To achieve a high level of input space compression, i.e. k ≪ v, a more complex autoencoder structure has to be designed. Thus, further hidden layers are added to obtain a deep structure called a stacked autoencoder (SAE) [42]. Assuming the general case of an ANN having L layers, the autoencoder equation (11) becomes
X̂(θ)^T = f_L(W_L f_{L−1}(· · · f_1(W_1 X^T + B_1) · · ·) + B_L).
However, increasing the number of layers makes training the autoencoder much more challenging due to the much greater computational complexity, the vanishing gradient problem and the potential to overfit the data when using deep neural networks [36]. Many different strategies have been proposed to address these challenges. Among the most effective are: early stopping, where training is stopped when the error on a validation set starts to increase [43]; introducing different types of operation in the hidden layers, such as convolution or pooling, to achieve weight sharing and network size reduction [44], [30]; Dropout [45] and its extension DropConnect [46], where neurons are randomly dropped out with a fixed probability during the training phase as a form of regularisation; and greedy layer-wise pre-training [47], [48] of networks to improve regularization and training performance through good weight initialization. Various approaches to achieving greedy layer-wise pre-training have been proposed, including using a deep belief net (DBN) [49], a stacked denoising autoencoder (SDAE) [50], a k-sparse autoencoder [51], and a sparse autoencoder [52]. In our work, a sparse autoencoder is used, hence, during pre-training, the basic cost function (14) is augmented with regularisation and sparsity terms to become
J_s(X; θ) = MSE(X, X̂(θ)) + λ Σ_l ||W_l||²_F + γ Σ_i KL(ρ||ρ̂_i).   (16)
Here, λ and γ are the weightings applied to the weight decay (i.e. L_2 or ridge regression) and sparsity terms, respectively.
The sparsity penalty KL(ρ||ρ̂_i), defined as
KL(ρ||ρ̂_i) = ρ log(ρ/ρ̂_i) + (1 − ρ) log((1 − ρ)/(1 − ρ̂_i)),
is the Kullback-Leibler (KL) divergence between two Bernoulli random variables with means ρ and ρ̂_i, respectively, where ρ̂_i is computed as the mean of the output of the i-th neuron in the bottleneck layer averaged over the training data, and ρ is the target value for this parameter. This is chosen to be close to zero (e.g. 0.05) to encourage inactivation of the hidden layer neurons, and hence sparsity.
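The penalty is cheap to evaluate; a small numpy sketch (assuming sigmoid-style activations so that every mean activation ρ̂_i lies strictly in (0, 1)):

```python
import numpy as np

def kl_sparsity(rho, rho_hat):
    """Sum over units of KL(Bernoulli(rho) || Bernoulli(rho_hat_i))."""
    rho_hat = np.asarray(rho_hat, dtype=float)
    return np.sum(rho * np.log(rho / rho_hat)
                  + (1 - rho) * np.log((1 - rho) / (1 - rho_hat)))

# The penalty is zero when every unit's mean activation hits the target rho...
assert abs(kl_sparsity(0.05, [0.05, 0.05])) < 1e-12
# ...and grows as the mean activations drift away from it.
assert kl_sparsity(0.05, [0.5, 0.5]) > kl_sparsity(0.05, [0.1, 0.1]) > 0
```

During pre-training this quantity, weighted by γ, is added to the reconstruction cost, pushing the average bottleneck activations toward the small target ρ.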
Variable selection: Practically, nonlinear variable selection is not computationally tractable, even using greedy search methods, due to the huge overhead associated with training the nonlinear models for each variable combination evaluated. Consequently, implicit approaches are preferred in the literature when using neural network models, whereby the training cost function J(X; θ) is modified to include a term that induces sparsity in the connection matrix between the inputs and the first hidden layer (i.e. W_1). For example, in [34], [53] training is split into two phases, with the first phase learning the input-output mapping and the second phase focusing on shrinking some of the input layer weights to zero by solving an optimization problem based on the nonnegative garrote. The L_1 regularization term, also known as the least absolute shrinkage and selection operator (LASSO), is added to the cost function (14) in [33] to achieve input selection in a multilayer perceptron neural network, while [31] and [32] integrated a row-sparse regularization term and a graph regularization trace-ratio criterion, respectively, within the training of an autoencoder to indirectly obtain nonlinear variable selection. Even when using implicit approaches, the computational burden associated with nonlinear variable selection can be prohibitive in time-constrained applications such as dynamic spatial sampling for silicon wafer process monitoring [11], especially when training large neural models. This has motivated our investigation of a methodology that requires the training of less complex nonlinear models, and can be thought of as a middle ground between linear and nonlinear variable selection techniques.

III. PROPOSED ALGORITHM
The methodology we propose is called Recovery of Linear Components (RLC). The key idea of our approach is to retain a linear encoding block, but employ a nonlinear model for the decoding block to enhance representational capabilities in the reconstruction phase. Specifically, assuming that the desired reconstruction accuracy of X can be achieved using k_lin linear components, then k̄ of these are discarded to further reduce the dimensionality to k < k_lin. A nonlinear (neural network) model is then used to recover the discarded linear components from the subset that has been retained. In this way, the nonlinear model is only needed to capture the nonlinear relationships between the retained and discarded linear components, leading to a much lower complexity model than required for a full v to v autoencoder. In particular, the sum of the input and output dimensions of the RLC neural network model is k + k̄ = k_lin < v, compared to 2v for a standard autoencoder. The methodology can be used for both unsupervised dimensionality reduction and unsupervised variable selection if combined with either PCA or FSCA, respectively, to implement the linear encoding step. Hereinafter, the RLC algorithm will be referred to as PCA-RLC or FSCA-RLC whenever it is used in combination with PCA or FSCA, respectively.

Algorithm 1 Recovery of Linear Components (RLC)
1: Input: X, τ, Task, k, neural model hyperparameters
2: Output: P_klin, N*, β
3: k_lin = Define_klin(X, τ, Task)
4: switch Task do
5:   case DR
6:     P_klin ← first k_lin PCA loading vectors of X ⟨defines P_klin⟩
7:   case VS
8:     P_klin ← k_lin standard basis vectors selected by FSCA ⟨defines P_klin⟩
9: end switch
10: T_klin = X P_klin ⟨uses P_klin⟩
11: if k_lin ≤ k then
12:   switch Task do
13:     case DR
14:       X̂ = T_klin P_klin^T
15:     case VS
16:       β = T_klin^+ X ⟨defines β⟩
17:       X̂ = T_klin β ⟨uses β⟩
18:   end switch
19: else
20:   T_k: first k ordered columns of T_klin
21:   T̄_k: remaining k_lin − k ordered columns of T_klin
22:   Recover the linear space of size k_lin by training neural model N(T_k; θ) to minimize (18) ⟨defines N*⟩
The implementation of RLC is described in Algorithm 1. The inputs to the algorithm are the data matrix X ∈ R^{m×v}, the threshold τ for the desired level of variance explained, the string Task ∈ {DR, VS} indicating whether dimensionality reduction (DR) or variable selection (VS) is being requested, the number of linear components to be retained, k, and the neural network hyperparameters (number of layers, number of units per layer, maximum number of training epochs, and regularization parameters).
The algorithm begins by calling the function Define_klin, which is described in Algorithm 2. This returns k_lin, the number of components needed to achieve the desired reconstruction accuracy τ with the selected linear technique (PCA or FSCA). The goal of RLC is then to achieve this level of accuracy using only k components, where k < k_lin, allowing k̄ = k_lin − k components to be discarded. The Nonlinear Iterative Partial Least Squares (NIPALS) algorithm [54], which computes principal components sequentially in order of significance, can be used to efficiently compute the first k_lin principal components if k_lin ≪ v; otherwise they can be computed using singular value decomposition (SVD) [55] or by the covariance method described in the previous section.
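A minimal numpy sketch of NIPALS, assuming the standard score/loading iteration with deflation (illustrative only; the starting score and stopping tolerance are arbitrary choices):

```python
import numpy as np

def nipals(X, k, tol=1e-10, max_iter=500):
    """Compute the first k PCA loading vectors sequentially by NIPALS.
    Returns loadings P (v x k); X is deflated after each component."""
    X = X - X.mean(axis=0)
    v = X.shape[1]
    P = np.zeros((v, k))
    for j in range(k):
        t = X[:, [np.argmax(X.var(axis=0))]]   # starting score vector
        for _ in range(max_iter):
            p = X.T @ t / (t.T @ t)            # loading estimate
            p = p / np.linalg.norm(p)
            t_new = X @ p                      # updated score
            if np.linalg.norm(t_new - t) < tol:
                t = t_new
                break
            t = t_new
        P[:, [j]] = p
        X = X - t @ p.T                        # deflate before next component
    return P

rng = np.random.default_rng(4)
X = rng.standard_normal((100, 5))
P = nipals(X, 2)
# Loadings agree (up to sign) with the eigenvectors of the covariance matrix.
Xc = X - X.mean(axis=0)
_, V = np.linalg.eigh(Xc.T @ Xc)
ref = V[:, ::-1][:, :2]
assert np.allclose(np.abs(P.T @ ref), np.eye(2), atol=1e-4)
```

Because components are extracted one at a time, the loop can stop as soon as k_lin components have been found, which is what makes NIPALS attractive when k_lin ≪ v.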
The outputs of the RLC algorithm are the linear encoding matrix P_klin ∈ R^{v×k_lin}, the linear decoding matrix β, and the trained neural network model N*, whose parameters are obtained by minimizing
J(T_k; θ) = MSE(T̄_k, N(T_k; θ)) = (1/(m k̄)) ||T̄_k − N(T_k; θ)||²_F.   (18)
The matrix T_k ∈ R^{m×k} is the reduced dimensionality dataset consisting of either the first k PCs or FSCA selected variables, and T_k^+ ∈ R^{k×m} is its pseudo-inverse. In the algorithm pseudocode, the text on the right between ⟨·⟩ highlights the distinction between the training and test phases. The steps defining the models belong to the training phase, whereas steps using the models belong to the test phase.
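The overall RLC idea can be sketched end-to-end in numpy (a simplified illustration, not the paper's MATLAB implementation: the toy data, network size, learning rate and iteration count are arbitrary choices, and plain gradient descent stands in for the training algorithm):

```python
import numpy as np

rng = np.random.default_rng(5)

# Toy data with nonlinear structure: every column is a function of one factor z,
# so the discarded principal components are nonlinear functions of the first one.
z = rng.uniform(-1, 1, (400, 1))
X = np.hstack([z, z**2, z**3]) + 0.01 * rng.standard_normal((400, 3))
X = X - X.mean(axis=0)

# Linear stage: compute the principal components and keep only k = 1 of them.
_, _, Vt = np.linalg.svd(X, full_matrices=False)
P = Vt.T                                  # P_klin, here k_lin = v = 3
T = X @ P
T_k, T_bar = T[:, :1], T[:, 1:]           # retained / discarded components

# Nonlinear stage: small tanh network trained to recover T_bar from T_k.
h, lr = 16, 0.1
W1 = 0.5 * rng.standard_normal((1, h)); b1 = np.zeros(h)
W2 = 0.5 * rng.standard_normal((h, 2)); b2 = np.zeros(2)
for _ in range(3000):
    H = np.tanh(T_k @ W1 + b1)
    E = (H @ W2 + b2) - T_bar             # prediction error on T_bar
    G = 2.0 / E.size                      # gradient scale of the mean squared error
    dW2, db2 = G * H.T @ E, G * E.sum(0)
    dZ = (E @ W2.T) * (1 - H**2)          # backprop through tanh
    dW1, db1 = G * T_k.T @ dZ, G * dZ.sum(0)
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

T_bar_hat = np.tanh(T_k @ W1 + b1) @ W2 + b2
X_rlc = np.hstack([T_k, T_bar_hat]) @ P.T         # RLC reconstruction
X_pca = T_k @ P[:, :1].T                          # plain 1-component PCA

def vex(X, Xh):
    return 100 * (1 - np.linalg.norm(X - Xh)**2 / np.linalg.norm(X)**2)

assert vex(X, X_rlc) > vex(X, X_pca)   # recovering T_bar improves on truncation
```

The point of the sketch is the division of labour: the recovery network has only k inputs and k̄ outputs, so it is far smaller than a v-to-v autoencoder on the same data.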
For benchmarking purposes we implemented two additional algorithms to allow comparisons with more complete deep neural network autoencoder implementations. The first of these, presented in Algorithm 3, tackles the unsupervised dimensionality reduction problem using PCA and a stacked autoencoder (SAE) in cascade, and is called PCA-SAE. PCA reduces the dimension of the input space from v to k_lin, then the encoder block of the SAE reduces the dimensionality to k. The decoder block, followed by linear regression, is then applied to recover X. The training of the SAE is performed in two stages: first a sparse autoencoder based greedy layer-wise pre-training is performed, then the output layer is added and the full network is fine-tuned using the Levenberg-Marquardt training algorithm with early stopping, with the mean squared reconstruction error of X as the fine-tuning objective. The second algorithm, Algorithm 4, tackles the unsupervised variable selection problem. In common with PCA-SAE it employs a deep neural network model and requires a two stage training process, but differs from PCA-SAE by virtue of the fact that the encoding map M_k^e(X) is fully linear, as in the RLC algorithm, and only the decoder block is nonlinear. The objective function when training the stacked decoder (SDE) block is
J(T_k; θ) = MSE(X, N(T_k; θ)) = (1/(mv)) ||X − N(T_k; θ)||²_F.   (20)
The algorithm, referred to as FSCA-SDE, has been developed to benchmark the RLC algorithm against a technique that shares the same linear encoding map, but exploits the accuracy of a deep neural network in the decoding map M_k^d(T_k). The RLC, PCA-SAE and FSCA-SDE architectures can be interpreted as large neural network models consisting of two or three sub-neural models that are independently trained and connected in series. This is illustrated graphically in Fig. 2.

12: Greedy layer-wise SDE pre-training via sparse autoencoder (16) with λ = 0.001, γ = 1, ρ = 0.05
13: Include the output decoder and fine-tune the full SDE N(T_k; θ) to minimize (20) ⟨defines N*⟩
14: X̂ = N*(T_k; θ*) ⟨uses N*⟩
15: end if

IV. RESULTS FOR UNSUPERVISED VARIABLE SELECTION
The FSCA, SPBR and MPBR algorithms proposed in [12] are considered for benchmarking purposes. We begin with comparisons on a simulated dataset and then consider 6 real world datasets, 4 of which are publicly available. The algorithms were implemented in MATLAB with MATLAB's Deep Learning Toolbox used to build the autoencoders. All code developed is available on GitHub 1 . The simulations were executed either on an Intel i5-7440HQ CPU with 8 GB of RAM or on an Intel i7-6700 CPU with 16 GB of RAM.

A. Synthetic dataset
The variable selection algorithms have been applied to this dataset with v = 50 and m = 500 for two different noise levels, σ² = 0 and σ² = 0.01, with the experimental parameters as described in Table I. The average and standard deviation (over 100 Monte Carlo simulations) of the variance explained by each algorithm as a function of k is reported in Fig. 3. FSCA-SDE and FSCA-RLC are the best performing methods, with almost equivalent variance explained profiles in terms of mean and standard deviation. SPBR and MPBR perform similarly to FSCA with the exception of k = 3 (σ² = 0) and k = 4 (σ² = 0.01), where they are marginally superior. All methods are relatively robust to variations in the dataset, as suggested by the low standard deviation for each k and their convergence to the same VEX when k > 5. In the absence of noise the nonlinear methods are able to achieve VEX > 98% with 3 variables (k = 3), whereas 5 variables are required to achieve the same level of accuracy with the linear methods. The frequency with which each variable is selected in the linear encoding step when k = 10 is shown in Fig. 4. This reveals that the first 10 variables are selected with increasing frequency as we move from FSCA to SPBR to MPBR, and are almost exclusively chosen when using MPBR. The focusing of selection on the first 10 variables is more pronounced with the noisy version of the dataset. This is a consequence of the reduced coupling between the first 10 nonlinear variables and the linearly related variables with the addition of noise. While the variables selected vary considerably between methods and between simulation runs, they yield similar levels of variance explained, as evident from Fig. 3.
The algorithm computation times, relative to FSCA, are reported in Table II for several values of k (σ² = 0.01). FSCA is used as the baseline as it has the lowest computational complexity and is executed as the first step in each of the other methods. SPBR is the quickest method, followed by MPBR, and then FSCA-RLC. FSCA-RLC is 35 times more computationally intensive than MPBR for k = 2, but becomes relatively more efficient with increasing k, requiring only twice the computation time of MPBR for k = 8. In contrast, FSCA-SDE is more than two orders of magnitude more computationally intensive than FSCA-RLC.

B. Real world benchmark datasets
To assess the performance of the variable selection algorithms more generally, they are further evaluated on six real world datasets, namely, data on weekly purchases of products (Xbusiness [56]), chemical sensor array data (Xchemistry [57], [58]), the extended Yale face database B (YaleB [59], [60]), the USPS handwritten digit dataset (USPS [61], [62]), and two industrial case studies on semiconductor manufacturing (Xwafer [37] and Xsemicon [63]). Further details on these datasets and the experimental settings employed are provided in Table I. The computation times (relative to FSCA) for each algorithm are reported in Table II, while their performance in terms of variance explained as a function of k is presented in Fig. 5. Whenever results have been omitted for an algorithm it is because it proved too computationally expensive to compute compared to the other algorithms. Overall, the most effective variable selection method is FSCA-SDE followed closely by FSCA-RLC, but the former is at least 100 times more computationally complex to train than the latter, hence it was deemed intractable for the largest datasets. Consistent with the findings of [12], MPBR has an accuracy comparable to SPBR and a computational complexity that increases more rapidly with k. For the lowest values of k SPBR is the fastest algorithm to compute, but it is less accurate than FSCA-RLC. As k increases, the difference in variance explained between algorithms decreases, and for k ≥ 40 FSCA-RLC becomes faster to compute than SPBR, as observed with YaleB and USPS. This can be explained by noting that SPBR has O(2v²k³ + 8vk³) complexity, and hence O(k³) complexity with respect to k, whereas the RLC neural network has a complexity of O(mθe), where e is the number of epochs required to satisfy early stopping and θ is the number of weights in the network [20].
In these experiments θ = kh + h + hk̄ + k̄ = hk_lin + h + k_lin − k, hence the size of the network decreases as k increases, as does the computational complexity per iteration (assuming that e does not increase as the network gets smaller).
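As a quick numeric check of this parameter count (with arbitrary illustrative values of k_lin and h):

```python
# Number of weights in the RLC recovery network N for fixed k_lin and hidden size h:
# theta = k*h + h + h*kbar + kbar, with kbar = k_lin - k discarded components.
def theta(k, k_lin, h):
    kbar = k_lin - k
    return k * h + h + h * kbar + kbar

k_lin, h = 40, 20
sizes = [theta(k, k_lin, h) for k in range(1, k_lin)]

# The identity theta = h*k_lin + h + k_lin - k holds, so the network
# shrinks monotonically as more components are retained.
assert all(theta(k, k_lin, h) == h * k_lin + h + k_lin - k
           for k in range(1, k_lin))
assert sizes == sorted(sizes, reverse=True)
```

So, in contrast to SPBR's O(k³) growth, the per-iteration cost of the RLC network actually falls as k increases.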
Extended analysis of Xwafer: The development of RLC was initially motivated by a semiconductor manufacturing spatial sampling design problem, as exemplified by the Xwafer dataset [37], [64]. Here, each column of the data matrix represents the measurements of a process parameter, such as etch depth or deposition layer thickness, at a specific wafer site (location on the wafer surface) for a batch of wafers. The objective is to identify the minimum number of measurement sites needed to accurately monitor the full wafer surface, or equivalently, to find the minimum subset of columns of the matrix that can achieve a desired level of accuracy in terms of reproducing the full matrix. An important requirement for practical application is to have a relatively low computation time for the algorithm performing site selection and surface reconstruction. Therefore, complex deep learning techniques are not suitable in this context.
The site selection problem has already been addressed in [37], [64] and the state-of-the-art is defined by the SPBR and MPBR algorithms proposed in [12], where for a target surface reconstruction accuracy of 99% or greater they yield a minimum measurement plan of 6 sites, in contrast to 7 sites with FSCA. To assess the performance of FSCA-RLC and FSCA-SDE for this problem, Table III focuses on the results for k = 5 and k = 6, since these are the minimum values of k required to achieve the desired reconstruction accuracy with at least one of the algorithms. FSCA-RLC yields the best performance on average and exceeds the 99% threshold more frequently than the other methods, with FSCA-SDE a close second. SPBR is the least computationally complex of the 4 algorithms and, along with MPBR, exhibits the smallest standard deviation in performance over the 100 Monte Carlo simulations. The situation for k = 5 is further clarified with the boxplot in Fig. 6, where the 99% threshold is indicated by the yellow dashed line. Note that, even though FSCA-RLC does not employ greedy layer-wise pre-training as part of the learning process, it achieves robustness to overfitting comparable to FSCA-SDE, with the added benefit of being more than 150 times faster to train.

V. RESULTS FOR UNSUPERVISED DIMENSIONALITY REDUCTION
In this section three unsupervised dimensionality reduction methods are compared: PCA, PCA-SAE and PCA-RLC. The experimental setup and software tools employed are as introduced in Section IV. The MATLAB code developed for the experiments is available on GitHub 1 . Four datasets are used as case studies, namely, Xsynthetic, Xwafer, Xchemistry and USPS. For each one three different PCA-SAE and PCA-RLC training scenarios are investigated:
Scenario A: In addition to employing the settings detailed in Table IV, the algorithms are implemented in accordance with Algorithms 1 and 3, with different randomly generated training and test datasets used for each Monte Carlo simulation;
Scenario B: Here the same training and test datasets are used in all simulations, hence any variation in VEX is a consequence of the neural network training process only. In addition, greedy layer-wise pre-training is not performed with PCA-SAE;
Scenario C: This is an extension of Scenario B where a network size equivalence condition (23) is imposed on PCA-SAE and PCA-RLC, as described later.
Fig. 7 shows the variance explained as a function of k for the four different datasets and the three training scenarios. Overall, Fig. 7(a) shows that PCA-SAE is the most accurate method, followed by PCA-RLC and PCA. While on Xsynthetic PCA-RLC has almost equivalent performance to PCA-SAE, in the other case studies PCA-SAE significantly outperforms PCA-RLC at lower k values. With the USPS dataset PCA-SAE struggles at the higher values of k and is the worst performing of the three methods for k > 10. The variance in performance over the Monte Carlo simulations is relatively small for each algorithm and value of k, which confirms two positive outcomes: first, the training set contains enough information to define a model that generalizes to the test set; and second, the algorithms are robust to overfitting.
Note also that the differences in variance explained between PCA-SAE and PCA-RLC is greater than the corresponding differences between FSCA-SDE and FSCA-RLC. This is simply a reflection of the fact that PCA-SAE benefits from having the additional representational power of a nonlinear encoding map whereas the other methods (including FSCA-SDE) have an encoding map that is entirely linear.
The most striking feature of the results for Scenario B (Fig. 7(b)) is the impact on PCA-SAE performance of not employing greedy pre-training. This autoencoder is now the worst performing method, particularly for lower values of k, with both a lower mean and much higher dispersion in V_EX at each value of k, highlighting the importance of pre-training as a means of regularizing deep neural networks. Since the training and test datasets remain the same for all simulations in this scenario, the observed dispersion is caused exclusively by the sub-optimal neural network training process. In contrast to PCA-SAE, the dispersion in V_EX decreases for PCA-RLC relative to Scenario A and is zero for PCA. PCA-SAE shows contrasting behaviour at higher values of k in the USPS case study, where it achieves superior results to the other methods, whereas the opposite was the case in Scenario A.
Overall, PCA-RLC yields stable training performance at each k for all case studies, and does so without the use of any pre-training. This can largely be attributed to the lower complexity neural network models that arise with the RLC topology compared to PCA-SAE, as detailed in Table IV. The PCA-RLC neural network N has fewer neurons and only a single hidden layer, leading to a less challenging training problem than with the deeper PCA-SAE architecture. Therefore, as a final evaluation, Scenario C compares the performance of the two approaches when they are constrained to have the same number of parameters and layers, as expressed in the equivalence condition (23). Equation (21) defines the number of parameters of a single-hidden-layer PCA-SAE architecture, while equation (22) is the equivalent expression for a PCA-RLC with h hidden neurons. In both topologies the neural network hidden layers are implemented with hyperbolic tangent activation functions, while the output layers are linear.
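The parameter-count equivalence underlying Scenario C can be sketched as follows. This is an illustrative Python sketch assuming the standard count for fully connected layers with bias terms (v input variables, k retained components, h hidden neurons); the precise expressions are those given in equations (21)-(23) of the paper and may differ in detail.

```python
def sae_params(v, h):
    """Single-hidden-layer autoencoder v -> h -> v, with biases (assumed form)."""
    return (v * h + h) + (h * v + v)

def rlc_params(v, k, h):
    """PCA-RLC decoder network k -> h -> v, with biases (assumed form)."""
    return (k * h + h) + (h * v + v)

def equivalent_rlc_hidden(v, k, h_sae):
    """Smallest h giving the RLC network at least as many parameters as the SAE."""
    target = sae_params(v, h_sae)
    h = 1
    while rlc_params(v, k, h) < target:
        h += 1
    return h
```

Since the RLC network has only k inputs rather than v, matching the parameter budget of the SAE allows it a somewhat wider hidden layer.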
The results for Scenario C are presented in Fig. 7(c). Since PCA-SAE no longer employs a deep network, as it did in Scenario B, its training performance is much more stable even without pre-training, but its representational capabilities are greatly diminished, with the result that it can only achieve the same level of explained variance as linear PCA. Under these conditions PCA-RLC is substantially superior to PCA-SAE, while also having much shorter training times (see Table V).
To gain some insight into why training times are much shorter for PCA-RLC than for PCA-SAE in Scenario C, even though both networks have the same number of weights, the number of epochs required to train each network in each simulation, and the mean computational time over the first few epochs, are plotted in Figures 8 and 9, respectively, for the four case studies. Two observations can be made: first, recalling that early stopping is employed as part of the training process, the total number of epochs required to train a PCA-RLC averages between 15 and 50 and seldom exceeds 100, whereas early stopping rarely interrupts the training of PCA-SAE, which often reaches the limit of 1000 epochs allowed (see Table IV); and second, performing a single epoch of PCA-RLC is on average 30% faster than one epoch of PCA-SAE. This suggests that the RLC framework leads to inherently better conditioned, and therefore easier to solve, training problems than those that arise with the standard SAE architecture. Note that the different pattern observed with PCA-SAE on USPS in Fig. 8 is due to the limit of 25 epochs (instead of 1000) imposed on PCA-SAE training for this case study, which was necessary because of the prohibitive training times for the SAE with a dataset of this size.
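The early-stopping behaviour described above can be sketched generically. This is an illustrative Python sketch, not the paper's MATLAB training routine; `step_fn`, `val_loss_fn`, and the patience value are hypothetical placeholders.

```python
def train_with_early_stopping(step_fn, val_loss_fn, max_epochs=1000, patience=6):
    """Generic early-stopping loop.

    step_fn():     runs one training epoch (updates the model in place).
    val_loss_fn(): returns the current validation loss.
    Training stops when the validation loss has not improved for
    `patience` consecutive epochs, or when max_epochs is reached.
    """
    best, since_best = float("inf"), 0
    epoch = 0
    for epoch in range(1, max_epochs + 1):
        step_fn()
        loss = val_loss_fn()
        if loss < best - 1e-12:       # strict improvement resets the counter
            best, since_best = loss, 0
        else:
            since_best += 1
            if since_best >= patience:
                break                  # early stop
    return epoch, best
```

Under this scheme, a well-conditioned training problem (as with PCA-RLC) triggers the stopping criterion after a few tens of epochs, whereas a poorly conditioned one keeps improving marginally and runs to the epoch limit.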
The PCA-RLC and PCA-SAE autoencoder training times for Scenario A and Scenario C, expressed as a fraction of the computation time for PCA, are given in Table V. Overall, PCA-SAE takes between 30 and 400 times longer to train than PCA-RLC for Scenario A, and between 5 and 25 times longer for Scenario C, with the differential increasing with larger values of k.
VI. CONCLUSIONS

Recovery of Linear Components (RLC) has been proposed as a novel methodology for unsupervised dimensionality reduction and variable selection, whereby, starting with a set of k_lin ordered linear components, a neural network is used to recover the full set of components from a subset consisting of the first k components. Crucially, by employing a linear encoding block and a nonlinear decoding block, RLC provides a compromise between fully nonlinear autoencoders and linear techniques such as PCA and FSCA, delivering improved representational capabilities over linear methods while requiring significantly lower complexity autoencoder designs than conventional SAEs.
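For concreteness, the PCA form of this pipeline can be sketched end to end: a linear encoder given by the first k PCA loadings, followed by a nonlinear single-hidden-layer tanh decoder. This is an illustrative NumPy sketch, not the paper's MATLAB implementation; the hidden-layer size, learning rate, and plain full-batch gradient descent are simplifying assumptions.

```python
import numpy as np

def pca_rlc_fit(X, k, h=16, epochs=500, lr=1e-2, seed=0):
    """Minimal PCA-RLC sketch: linear encoding via the first k PCA scores,
    nonlinear tanh decoding trained by gradient descent to reconstruct X."""
    rng = np.random.default_rng(seed)
    mu = X.mean(axis=0)
    Xc = X - mu
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    P = Vt[:k].T                          # linear encoding map (v x k)
    T = Xc @ P                            # scores = compressed representation
    n, v = X.shape
    W1 = rng.normal(scale=0.1, size=(k, h)); b1 = np.zeros(h)
    W2 = rng.normal(scale=0.1, size=(h, v)); b2 = np.zeros(v)
    for _ in range(epochs):
        H = np.tanh(T @ W1 + b1)          # tanh hidden layer
        Y = H @ W2 + b2                   # linear output layer
        E = Y - Xc                        # reconstruction error
        gW2 = H.T @ E / n; gb2 = E.mean(axis=0)
        dH = (E @ W2.T) * (1.0 - H ** 2)  # backpropagate through tanh
        gW1 = T.T @ dH / n; gb1 = dH.mean(axis=0)
        W1 -= lr * gW1; b1 -= lr * gb1
        W2 -= lr * gW2; b2 -= lr * gb2
    decode = lambda S: np.tanh(S @ W1 + b1) @ W2 + b2 + mu
    return P, mu, decode
```

Only the small decoder network is trained; the encoder is fixed by PCA, which is what keeps the training problem shallow and well conditioned relative to a full SAE.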
The FSCA and PCA implementations of RLC (FSCA-RLC and PCA-RLC), representative of the variable selection and dimensionality reduction formulations, respectively, have been presented and evaluated on a synthetic dataset and several real world case studies. The results confirm that RLC achieves significantly higher levels of variance explained than linear techniques when high levels of compression are required (i.e. when the number of components k is relatively small). It is also robust to overfitting and substantially faster to train than SAEs that achieve similar or better performance. In particular, for the semiconductor manufacturing wafer measurement plan optimisation problem that motivated this work, FSCA-RLC achieves state-of-the-art performance with a modest increase in computational complexity, and is two orders of magnitude faster to train than the equivalently performing FSCA-SDE.