Broad Autoencoder Features Learning for Classification Problem

Activation functions such as tanh and sigmoid are widely used in deep neural networks (DNNs) for pattern classification problems. To take advantage of different activation functions, this work proposes the broad autoencoder features (BAF). The BAF consists of four parallel-connected stacked autoencoders (SAEs), each using a different activation function: sigmoid, tanh, ReLU, or softplus. With such a broad setting, the final learned features merge various nonlinear mappings of the original input features. This not only helps to extract more information from the original input features by exploiting different activation functions, but also provides information diversity and increases the number of input nodes for the classifier through the parallel-connected strategy. Experimental results show that the BAF yields better-learned features and better classification performance.


INTRODUCTION
With the rapid advancement and deployment of information technologies, organizations make gigantic volumes of data accessible on the Internet, for example, video, image, and medical data. The need to mine useful knowledge from these huge data collections poses a great challenge to the AI community. Typical machine learning techniques that require hand-crafted features cannot discover hidden information in the data and may suffer from either information loss or overfitting. In contrast, DNNs have been successfully applied in a wide range of AI applications and have delivered excellent results in computational intelligence, e.g., video processing, image classification, speech recognition, and computer vision.
In general, machine learning techniques can be partitioned into generative and discriminative strategies. At present, the most commonly utilized DNNs are generative models, e.g., the Deep Belief Network (Hinton et al., 2006), the Restricted Boltzmann Machine (Salakhutdinov et al., 2007), and the Deep Boltzmann Machine (Salakhutdinov et al., 2009). These techniques approximate the log-probability gradient using MCMC-based strategies, which become progressively less reliable as training advances because samples from the Markov chains cannot mix between modes sufficiently quickly. Moreover, models such as the Autoencoder (AE) (Bengio et al., 2009), the Variational Autoencoder (Kingma & Welling, 2014; Mescheder et al., 2017; Tan et al., 2018), and the Importance Weighted Autoencoder (Burda et al., 2016) have been developed to use direct backpropagation for training, avoiding the difficulties caused by MCMC-based training. Each of these methods can be regarded as a projection that yields good classification results by mapping samples from the original feature space into a projected space with better class separability for pattern classification problems (Wasikowski et al., 2010). Among them, the AE (Bengio et al., 2009) is an unsupervised feature learning technique that aims to learn a representation from which the original inputs can be approximately reconstructed. For feature representation learning, the number of hidden units is normally larger than the number of feature dimensions. The projection at the hidden layer of the AE yields a useful representation of the original inputs (Bengio et al., 2009).
In practice, parts of the original data may be missing or obscured, for example, an occluded region of an object in an image or a missing word in text data. The Dual Autoencoder Features method (DAF) (Ng et al., 2016) combines two activation functions to deal with this problem. With different activation functions, the DAF uses two stacked AEs (SAEs) to learn features that capture different characteristics of the training data. However, the DAF only uses the sigmoid and tanh functions, leaving aside other activation functions such as ReLU and softplus. The sigmoid function works better when the differences among features are complicated or not particularly large, but it suffers from the vanishing gradient problem during backpropagation, which makes it difficult to train a deep neural network. The tanh function works very well when the differences among features are obvious, and these differences are continuously amplified during training. Moreover, SGD converges much faster with ReLU than with sigmoid or tanh, while softplus can be regarded as a smoothed version of ReLU. These observations motivate us to utilize the advantages of different activation functions, and we propose the Broad Autoencoder Features (BAF) to extend the advantages of the DAF. Four SAEs with different activation functions learn mappings from the input space to the representation space that capture different characteristics of the original data. The learned features are concatenated in parallel to form the BAF, which provides information diversity and increases the number of input nodes for the Softmax classifier. Note that although different activation functions have different advantages and disadvantages, the BAF does not automatically select among them to optimize model performance; it simply combines all four.
The paper is organized as follows: Section II gives a brief review of related work, Section III proposes the BAF, and Sections IV and V give the experimental results and the conclusions, respectively.

RELATED WORK
The Autoencoder (AE) is a type of neural network that can be used for dimensionality reduction (Bengio et al., 2009). The AE learns low-dimensional features of the original inputs by unsupervised learning (Baldi, 2012). The representation capability of the learned features can be further improved by a deep architecture obtained by stacking several AEs. In general, an AE consists of an input layer, an encoding layer, and a decoding layer. The input data x is mapped by a deterministic mapping to a hidden representation f(x) at the encoding layer:

f(x) = S_e(W_e x + b_e), (1)

where S_e(·), b_e, and W_e denote the activation function of the encoding layer, the bias vector, and the weight matrix, respectively. The decoding layer maps f(x) onto a reconstruction

g(f(x)) = S_d(W_d f(x) + b_d), (2)

where S_d(·), b_d, and W_d denote the activation function of the decoding layer, the bias vector, and the weight matrix, respectively. The AE's goal is to minimize the reconstruction error between the inputs x and the outputs g(f(x)), which is done by finding a set of optimal parameters θ = {W_e, b_e, W_d, b_d}. Hence, the optimization process of an AE is defined as follows:

θ* = argmin_θ Σ_x ||x − g(f(x))||². (3)

The AE's feature learning has been widely used in different application fields, such as a very deep AE for content-based image retrieval using features learned from images (Krizhevsky & Hinton, 2011). Heydarzadeh et al. propose a robust and accurate algorithm for in-bed posture classification using an AE.
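As a concrete illustration of the encoding (1), decoding (2), and reconstruction objective (3), the following is a minimal NumPy sketch of a single AE. The layer sizes, weight initialization scale, and the squared-error loss are illustrative assumptions, not the paper's exact settings.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class Autoencoder:
    """Minimal single-hidden-layer AE: f(x) = S_e(W_e x + b_e),
    g(f(x)) = S_d(W_d f(x) + b_d), trained to minimize ||x - g(f(x))||^2."""

    def __init__(self, n_in, n_hidden, activation=sigmoid, seed=0):
        rng = np.random.default_rng(seed)
        self.W_e = rng.normal(scale=0.01, size=(n_hidden, n_in))
        self.b_e = np.zeros(n_hidden)
        self.W_d = rng.normal(scale=0.01, size=(n_in, n_hidden))
        self.b_d = np.zeros(n_in)
        self.S = activation

    def encode(self, x):
        return self.S(self.W_e @ x + self.b_e)   # hidden representation f(x)

    def decode(self, h):
        return self.S(self.W_d @ h + self.b_d)   # reconstruction g(f(x))

    def reconstruction_error(self, x):
        return 0.5 * np.sum((x - self.decode(self.encode(x))) ** 2)
```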
It is commonly assumed that the number of hidden neurons is smaller than the number of input features, so that the AE compresses the original inputs at the hidden (encoding) layer. If a random Gaussian sequence is fed to the AE as the input, meaningful features will fail to be extracted from the original inputs. Under these circumstances, a large number of hidden neurons is needed for the AE (the number may be larger than the number of input neurons) to achieve feature extraction, which may cause the AE to fail to compress the input features. From another perspective, the AE can discover correlated relationships among input features. Therefore, the Sparse Autoencoder (Ng, 2011) adds a sparsity constraint to the hidden neurons and discovers the latent structure of the data. The Sparse Autoencoder has been applied to detect the human body in depth images (Su et al., 2015), obtain learning-based representations (Tang et al., 2016; Wong et al., 2018), and achieve soft sensor modeling (Yuan et al., 2018).
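The sparsity constraint mentioned above is typically imposed as a Kullback-Leibler penalty on the average activation of each hidden unit. The following is a minimal sketch of such a penalty; the target sparsity rho and the weight beta are hypothetical hyperparameters for illustration, not values from the paper.

```python
import numpy as np

def kl_sparsity_penalty(hidden_activations, rho=0.05, beta=3.0):
    """KL-divergence sparsity penalty on the average activation of each
    hidden unit, added to the reconstruction error of a sparse AE.
    hidden_activations: array of shape (n_samples, n_hidden), values in (0, 1)."""
    rho_hat = np.clip(hidden_activations.mean(axis=0), 1e-8, 1 - 1e-8)
    kl = rho * np.log(rho / rho_hat) + (1 - rho) * np.log((1 - rho) / (1 - rho_hat))
    return beta * kl.sum()
```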
The AE is pre-trained to determine initial values of the weight parameters W ahead of the training process, but the AE easily overfits due to several issues with the initial model, for example, noise in the training data, the large scale of the training data, and the complexity of the AE model. To prevent overfitting, noise is often added to the input data (the input layer of the AE), so that the learned W enhances the AE model's generalization capability. The Denoising Autoencoder (DAE) (Vincent et al., 2010) extends the classical AE to learn a projection that is more stable with respect to corrupted data. Some input dimensions of randomly selected training samples are randomly dropped to improve the AE's robustness (Vincent et al., 2010). Feature learning with the Stacked Denoising Autoencoder (SDAE) has been widely used in speech recognition (Weninger et al., 2014) and pose-invariant face recognition (Kang et al., 2013). There are other applications of the DAE, such as electrocardiogram active classification (Al Rahhal et al., 2016) and saliency detection (Han et al., 2016). Jiang et al. propose a novel feature representation learning method for wind turbine gearbox fault diagnosis that uses stacked multilevel denoising autoencoders.
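The input corruption step used by the DAE can be sketched as follows, assuming masking noise that zeroes a random fraction of input dimensions; the corruption level of 0.3 is an illustrative assumption.

```python
import numpy as np

def corrupt_inputs(x, drop_fraction=0.3, rng=None):
    """Masking corruption as in the Denoising Autoencoder: a random
    fraction of input dimensions is set to zero; the AE is then trained
    to reconstruct the clean x from the corrupted version."""
    if rng is None:
        rng = np.random.default_rng()
    mask = rng.random(x.shape) >= drop_fraction
    return x * mask
```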
A good feature representation should both reconstruct the input data well and be robust to partial disturbance of the inputs. The traditional AE, the stacked AE, and the sparse AE meet the first criterion, while the DAE and the SDAE satisfy the second. Robustness is more important in many cases. To learn robust features, the AE should be robust to small perturbations added to the training samples. Rifai et al. propose the Contractive Autoencoder (CAE), which trains the AE by minimizing the reconstruction error plus a new penalty: the Frobenius norm of the Jacobian of the hidden representation with respect to the inputs. Moreover, a method using stacked CAEs has been applied for hierarchical feature extraction (Masci et al., 2011).

THE PROPOSED BROAD AUTOENCODER FEATURES (BAF)
Figure 1 shows the workflow of the BAF. The greedy layer-wise training algorithm without the fine-tuning step is applied to train each 2-layer SAE with a different activation function independently (a sketch of this layer-wise procedure is given below). As shown in Figure 1, each 2-layer SAE consists of two AEs. The first AE learns to encode the inputs with its encoding layer by minimizing the reconstruction error between the outputs of its decoding layer and the original inputs from the training dataset. The second AE then uses the outputs of the encoding layer of the first AE as its inputs and learns its encoding by minimizing the reconstruction error between the outputs of the encoding layer of the first AE and the outputs of its own decoding layer. The outputs of the encoding layer of the second AE are used as the learned features. In this way, the four 2-layer SAEs with different activation functions learn four different sets of features, which are then concatenated to form the BAF. Figure 2 shows the detailed algorithm of the feature learning of the BAF.
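A minimal sketch of the greedy layer-wise procedure for one 2-layer SAE follows. Here `train_autoencoder` is a hypothetical stand-in for the L-BFGS-based AE training described later in the training subsection, and the Autoencoder interface is the one sketched earlier.

```python
import numpy as np

def train_two_layer_sae(X, n_hidden1, n_hidden2, activation, train_autoencoder):
    """Greedy layer-wise training of one 2-layer stacked AE (no fine-tuning).
    `train_autoencoder(X, n_hidden, activation)` is assumed to return a trained
    Autoencoder (see the earlier sketch)."""
    # First AE: learn to reconstruct the raw inputs.
    ae1 = train_autoencoder(X, n_hidden1, activation)
    H1 = np.array([ae1.encode(x) for x in X])
    # Second AE: learn to reconstruct the first AE's codes.
    ae2 = train_autoencoder(H1, n_hidden2, activation)
    H2 = np.array([ae2.encode(h) for h in H1])
    return (ae1, ae2), H2   # H2 are the learned features of this SAE
```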
When a sample arrives, it is encoded by the encoding layers of the AEs in the sigmoid SAE, the tanh SAE, the ReLU SAE, and the softplus SAE, respectively, to form four sets of learned features. These four sets of learned features are then concatenated to form the BAF for this sample, as in the sketch below. Figure 3 shows the detailed algorithm of the feature encoding procedure of the BAF.
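The encoding of a new sample can be sketched as below, assuming `saes` holds the four trained (ae1, ae2) pairs produced by the layer-wise training sketch above; the names and the dictionary layout are illustrative.

```python
import numpy as np

def baf_encode(x, saes):
    """Encode one sample with the BAF: pass it through the encoding layers
    of the sigmoid, tanh, ReLU, and softplus SAEs and concatenate the four
    sets of learned features."""
    features = []
    for name in ("sigmoid", "tanh", "relu", "softplus"):
        ae1, ae2 = saes[name]
        features.append(ae2.encode(ae1.encode(x)))
    return np.concatenate(features)   # input vector for the Softmax classifier
```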

Training of the BAF
At the beginning of training, all weights are randomly initialized and biases are set to zero. Weights are then updated using the standard backpropagation algorithm. Samples are fed to the AE in the forward pass to generate the output values of the decoding (i.e., output) layer. These values are then used to compute the gradients of the reconstruction error J with respect to the weights and the bias of the decoding layer:

∂J/∂W_d = δ_d f(x)^T, (4)
∂J/∂b_d = δ_d, (5)

where δ_d = −(x − g(f(x))) ∘ S_d'(W_d f(x) + b_d) and ∘ denotes element-wise multiplication. The gradients with respect to the weights and the bias of the encoding layer are computed as:

∂J/∂W_e = δ_e x^T, (6)
∂J/∂b_e = δ_e, (7)

where δ_e = (W_d^T δ_d) ∘ f' and f' denotes the partial derivative of f(x). Weights and biases of the AEs in the BAF are updated repeatedly using (4)-(7) with the Limited-memory BFGS (L-BFGS) method (Byrd et al., 1995) until convergence.
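Under the squared-error objective (3) and sigmoid activations as in the earlier sketches, the per-sample gradients (4)-(7) can be computed as follows; the resulting gradients can then be flattened and passed, for example, to scipy.optimize.minimize with method="L-BFGS-B" to reproduce the L-BFGS updates. This is a sketch under those assumptions, not the paper's exact implementation.

```python
import numpy as np

def ae_gradients(x, ae):
    """Gradients (4)-(7) of the squared reconstruction error for one sample,
    assuming sigmoid activations so that S'(z) = S(z) * (1 - S(z)).
    `ae` is an Autoencoder instance from the earlier sketch."""
    h = ae.encode(x)                      # f(x)
    y = ae.decode(h)                      # g(f(x))
    delta_d = -(x - y) * y * (1.0 - y)    # error signal at the decoding layer
    delta_e = (ae.W_d.T @ delta_d) * h * (1.0 - h)
    grad_W_d = np.outer(delta_d, h)       # (4)
    grad_b_d = delta_d                    # (5)
    grad_W_e = np.outer(delta_e, x)       # (6)
    grad_b_e = delta_e                    # (7)
    return grad_W_e, grad_b_e, grad_W_d, grad_b_d
```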

EXPERIMENTAL RESULTS
In the experiments, we use four benchmark datasets to compare the performance of the BAF, the DAF, and the AFs. Each AF is a BAF with only one type of activation function; for instance, the Sigmoid-AF uses the sigmoid function for all of its SAEs. The four benchmark datasets are the MNIST dataset, the Fashion_MNIST dataset, the COIL20 dataset, and the ORL dataset. MNIST contains 60,000 grayscale handwritten digit images in its training set and 10,000 images in its testing set; each image shows a digit from 0 to 9. Fashion_MNIST also contains 60,000 training images and 10,000 test images in 10 fashion categories; each image is a 28 × 28 grayscale image. The dataset can be regarded as an evolved version of the MNIST dataset. The category labels are T-shirt/top, trouser, pullover, dress, coat, sandal, shirt, sneaker, bag, and ankle boot. The COIL20 dataset contains 1,440 images of 20 objects taken from varying angles at intervals of 5°; each object has 72 images of size 32 × 32 pixels. The ORL dataset contains 400 images of 40 subjects with different facial expressions and details, such as open/closed eyes, glasses/no glasses, and smiling/not smiling. Table 1 shows the characteristics of the datasets mentioned above: the second and third columns show the numbers of features and classes, and the fourth and fifth columns show the sizes of the training and test sets. Figure 4 shows some examples from the four datasets. The experimental results are evaluated by the mean and the standard deviation of the accuracy on the test set.
As mentioned earlier, the BAF combines four 2-layer SAEs with four different activation functions in parallel: sigmoid, tanh, ReLU, and softplus. The DAF (Ng et al., 2016), which was proposed for imbalanced pattern classification problems, uses a setting similar to that of the BAF. The Sigmoid-AF, the Tanh-AF, the Relu-AF, and the Softplus-AF use the same architecture as the BAF, while each of them uses only one type of activation function. All four AFs, the BAF, and the DAF use Softmax as the classifier. In the experimental comparison, default values are used for the parameters of these methods.
All four AFs, the BAF, and the DAF use the same numbers of hidden neurons in each SAE's first and second layers (i.e., the same m_1 and m_2). The experiments are run for 10 independent times. In the experiments, the number of hidden units in the first layer is set to 400, and the number of hidden units in the second layer ranges from 100 to 300 in steps of 100 for all SAEs. Tables 2-5 report the averages and standard deviations of the accuracy (ACC) over the 10 runs, along with the number of hidden units in the second layer. The proposed BAF performs best compared with the other methods in these tables.
Moreover, to further demonstrate the BAF's superiority, we adopt one-tailed t-tests to verify whether the BAF's performance is better than that of the other methods. For instance, the comparison of the BAF with the DAF is expressed as BAF vs. DAF. The statistical significance level is set to 0.01 in our experiments. Table 6 presents the p-values of the pairwise one-tailed t-tests on the models' accuracies. All the pairwise one-tailed t-test p-values are less than 0.01, which shows that the proposed BAF performs significantly better than the other methods. Table 7 shows the overall averages and standard deviations of the accuracies over the four datasets, which indicate that the BAF has the best average performance. According to Figure 5, the BAF has the highest average performance on all datasets. Moreover, Figure 6 presents the average classification accuracy of all the methods, in which the BAF yields the highest average result. A plausible reason why the BAF outperforms the DAF and the other AFs is that more effective features are learned by AEs using different activation functions.
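For reference, this kind of one-tailed test can be carried out with SciPy as sketched below; the accuracy arrays are hypothetical placeholders, not the paper's actual results, and since the paper does not state whether runs are paired, an independent two-sample test is shown.

```python
import numpy as np
from scipy import stats

def one_tailed_ttest(acc_baf, acc_other, alpha=0.01):
    """One-tailed two-sample t-test of H1: mean(acc_baf) > mean(acc_other),
    applied to the per-run test accuracies of two methods."""
    t_stat, p_value = stats.ttest_ind(acc_baf, acc_other, alternative="greater")
    return p_value, p_value < alpha

# Hypothetical per-run accuracies over 10 runs (placeholders only).
acc_baf = np.array([0.981, 0.979, 0.982, 0.980, 0.983, 0.978, 0.981, 0.982, 0.980, 0.979])
acc_daf = np.array([0.974, 0.972, 0.975, 0.973, 0.976, 0.971, 0.974, 0.975, 0.973, 0.972])
p, significant = one_tailed_ttest(acc_baf, acc_daf)
```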

CONCLUSION
In this paper, we propose the BAF to deal with image pattern classification problems. Stacked AEs are used in the BAF with four different activation functions to obtain a more robust feature learning process compared with the DAF and the four SAEs that each use one type of activation function. Experimental results show that the BAF yields statistically significantly higher accuracies than the feature-learning-based DAF and the other four methods on different datasets. The classification accuracies obtained in the experiments show that the BAF with sigmoid, tanh, ReLU, and softplus stacked AEs provides a significantly better feature representation. In future work, we will further study the network architecture to improve the performance of the BAF; the automatic selection of activation functions and of the BAF's network architecture is another important direction.