CAST: A Constant Adaptive Skipping Training Algorithm for Improving the Learning Rate of Multilayer Feedforward Neural Networks

The Multilayer Feedforward Neural Network (MFNN) has been applied widely to a broad range of supervised pattern recognition tasks. The major problem in the MFNN training phase is its long training time, especially when it is trained on very large training datasets. Accordingly, this research paper proposes an enhanced training algorithm called the Constant Adaptive Skipping Training (CAST) algorithm, which focuses on reducing the training time of the MFNN through stochastic presentation of the training dataset. The stochastic presentation is accomplished by partitioning the training dataset into two completely separate classes, the classified and the misclassified class, based on a comparison of the calculated error measure with a threshold value. Only the input samples in the misclassified class are presented to the MFNN for training in the next epoch, whereas the correctly classified class is skipped for a constant number of epochs, which dynamically reduces the number of training input samples presented at every single epoch. Decreasing the size of the training dataset in this way reduces the total training time and thereby speeds up the training process. The CAST algorithm can be merged with any training algorithm used for supervised tasks, can be used to train on datasets with any number of patterns, and is very simple to implement. The proposed CAST algorithm is evaluated on the benchmark datasets Iris, Waveform, Heart Disease and Breast Cancer for different learning rates. The simulation study showed that the CAST training algorithm results in faster training than the LAST and standard BPN algorithms.


INTRODUCTION
The Multilayer Feedforward Neural Network (MFNN) with a single hidden layer has been regarded as the best neural network architecture for nonlinear classification problems because of its capability to approximate any nonlinear function mapping (Mehra and Wah, 1992; Hornik et al., 1989; Huang et al., 2000). Back Propagation (BPN) is the most popular supervised training algorithm and has been used extensively to train the MFNN for the past two decades (Razavi and Tolson, 2011). It is divided into two phases: the Training Phase (also called the Learning Phase) and the Testing Phase (also called the Evaluation Phase). Of these two phases, the training phase plays the more important role in establishing nonlinear models. To obtain good performance, BPN still requires many epochs to train the MFNN even on a simple problem, so it is unfortunately very slow. In addition, BPN training performance is closely associated with the type and size of the network architecture, the number of epochs and patterns to be trained, the training speed, and the dimensionality of the training datasets.
Among the factors that affect training performance, the training speed is considered very important. The training speed depends heavily on the dimensionality of the training dataset. In general, training the MFNN with a larger training dataset makes the network generalize well. However, a larger training dataset needs a lengthy training time (Behera et al., 2006), which lowers the training speed.
This research proposes a new training algorithm that improves the training speed by reducing the training time of the MFNN through the stochastic presentation of the training dataset. The correctly classified input samples in the training dataset are skipped from training for the next n consecutive epochs. The CAST algorithm thereby dynamically reduces the number of training input samples presented at every single epoch. Reducing the size of the training dataset in this way reduces the total training time and thereby speeds up the training process. Hence, the overall training time of the MFNN is substantially lower than with the standard training algorithm. The method can be merged into any algorithm used for training on a supervised task.
The remainder of this research paper is organized as follows. A brief review of previous work relevant to the research problem is given, and then the formulation of the research problem is shown. The proposed CAST algorithm is presented next, followed by a performance evaluation of CAST, simulated on benchmark datasets for classification problems. Finally, the experimental results are summarized and analyzed along with the conclusions of the research paper.

RELATED WORKS
To speed up the MFNN training process, many researchers have investigated the above drawbacks and devoted research to approaches that range from amendments of existing algorithms to the development of new algorithms. The approaches for improving the training speed while maintaining generalization include initialization of optimal initial weights (Nguyen and Widrow, 1990; Varnava and Meade Jr., 2011), adaptation of the learning rate (Plagianakos et al., 1998), adaptation of the momentum term (Shao and Zheng, 2009), adaptation of the momentum term in parallel with learning rate adaptation (Behera et al., 2006), and the use of second order algorithms (Ampazis and Perantonis, 2002; Wilamowski and Yu, 2010; Yu and Wilamowski, 2012).
Proper initialization of the weights scales down the number of iterations during the training process, which in turn increases the training speed. Some of the techniques applied for initializing the weights are discussed here. Nguyen and Widrow (1990) initialize the layer's intermediate weights within a specified range for faster learning. Varnava and Meade Jr. (2011) used polynomial mathematical models to obtain the initial values of the network's synaptic weights. The learning rate is one of the training parameters; it controls the size of the change applied to the network's old weights during learning. Assigning a constant value to the learning rate degrades the speed of training, which results in slow convergence. Plagianakos et al. (1998) therefore proposed adapting the learning rate using the Barzilai and Borwein method in order to improve the convergence speed. Several dynamic methods for assigning the learning rate adaptively have been formulated, depending on the factor under investigation. Behera et al. (2006) developed two new algorithms, designated LF I and LF II, from the Lyapunov theory of stability, in which the learning rate takes adaptive values instead of a fixed value. Next come the algorithms that use second order derivatives of the cost function to update the weights during the training process. The most popular second order training algorithms are quasi-Newton methods or Levenberg-Marquardt (LM) (Wilamowski and Yu, 2010; Yu and Wilamowski, 2012) and conjugate gradient (CG) methods (Ampazis and Perantonis, 2002). Even though these second order approaches achieve good results, they are computationally very expensive.
Ampazis and Perantonis (2002) drew on the strengths of the Levenberg-Marquardt and Conjugate Gradient methods and derived two second order algorithms, Levenberg-Marquardt with adaptive momentum (LMAM) and optimized Levenberg-Marquardt with adaptive momentum (OLMAM). Wilamowski and Yu (2010) applied vector multiplication instead of matrix multiplication for determining the gradient vector and Hessian matrix (Yu and Wilamowski, 2012), which significantly reduces the memory cost of training and thereby improves the training speed.
However, the disadvantages found in the traditional method are not overcome by the techniques discussed above. All of the above-mentioned efforts focus directly or indirectly on tuning the network's training parameters.
Besides, the approaches discussed above consume all the input samples until training terminates. If a large amount of high-dimensional training data is presented for classification, these techniques slow down. So, the intention of this research is to present a simple new algorithm, CAST, for training the ANN quickly by presenting the training input samples randomly based on their classification.
Problem formulations: The BPN algorithm is an iterative gradient training algorithm designed to estimate the coefficients of the weight matrices that minimize the total Root Mean Squared Error (RMSE). The RMSE is defined between the desired output and the actual output, summed over all the training patterns input to the network.
Here P is the total number of training sample patterns, m is the number of nodes in the output layer, $t_{pk}$ is the target output of the kth node for the pth sample pattern, and $y_{pk}$ is the actual output of the kth node estimated by the network for the pth sample pattern.
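For reference, writing $t_{pk}$ for the target output and $y_{pk}$ for the actual output (assumed symbol names), the RMSE takes the familiar form sketched below. Averaging over both the P patterns and the m output nodes is an assumption; other common normalizations differ only by a constant factor.

```latex
\mathrm{RMSE} = \sqrt{\frac{1}{P\,m} \sum_{p=1}^{P} \sum_{k=1}^{m} \left( t_{pk} - y_{pk} \right)^{2}}
```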
According to Equation (2), the correctly classified input samples do not take part in the updating of the weights, since the error value generated by such a sample pattern is zero. The intention of this research is therefore to partition the training input samples into two distinct classes, the classified and the misclassified class, based on a comparison of the calculated error measure with the maximum threshold value. The training input samples whose actual output is the same as the target output belong to the classified class; the remaining training input samples belong to the misclassified class. Only the input samples in the misclassified class are presented in the next epoch (an epoch is one complete cycle of presenting the entire set of training samples to the MFNN once) for training, whereas the correctly classified class is not presented again for the subsequent n epochs. The adaptive skipping training algorithm is used to estimate the value of the skipping factor. In the LAST algorithm (Devi et al., 2013), the value of the skipping factor is increased linearly, that is, the input samples are skipped linearly. In the proposed CAST algorithm, the correctly classified input samples are skipped from training for a constant n consecutive epochs. The CAST algorithm thereby dynamically reduces the number of training input samples presented at every single epoch. Reducing the size of the training dataset in this way reduces the total training time and thereby speeds up the training process. The advantage of the CAST algorithm is that its implementation is extremely simple, and it can lead to significant gains in training speed.
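The partitioning described here can be illustrated with a minimal sketch. The function name `partition_by_error`, the use of the largest per-node absolute error as the error measure, and the strict `<` comparison against the threshold are assumptions made for illustration; the text only states that the error measure is compared with a maximum threshold value.

```python
def partition_by_error(targets, outputs, d_max):
    """Split sample indices into classified and misclassified lists.

    Assumption: a sample counts as correctly classified when the largest
    absolute error |t - y| over its output nodes is below d_max.
    """
    classified, misclassified = [], []
    for i, (t_vec, y_vec) in enumerate(zip(targets, outputs)):
        worst = max(abs(t - y) for t, y in zip(t_vec, y_vec))
        (classified if worst < d_max else misclassified).append(i)
    return classified, misclassified


# Toy target and output vectors for three samples (illustrative values only).
targets = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
outputs = [[0.95, 0.02], [0.60, 0.40], [0.90, 0.15]]
classified, misclassified = partition_by_error(targets, outputs, d_max=0.2)
print(classified, misclassified)  # [0, 2] [1]
```

Only the indices in `misclassified` would be presented in the next epoch; those in `classified` would sit out.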

PROPOSED CAST METHOD
Overview of CAST Architecture: The CAST algorithm that is contained in the prototypical MFNN architecture is outlined in Fig. 1.
Assume that the network contains n input nodes in the input layer, p hidden nodes in the hidden layer and m output nodes in the output layer. Since the network is fully interconnected, the nodes in each layer are connected to all the nodes in the next layer. Let P represent the number of input patterns in the training dataset. The input matrix, X, of size P × n is presented to the network. The number of nodes in the input layer is equal to the number of columns in the input matrix, X. Each row in X is considered to be a real-valued vector $x_i \in \Re^{n+1}$ where $1 \le i \le n$. The summed real-valued vector generated from the hidden layer is represented as $z_i \in \Re^{p+1}$ where $1 \le i \le p$. The estimated output real-valued vector generated from the network is denoted as $y_i \in \Re^{m}$ where $1 \le i \le m$, and the corresponding target vector is represented as $t_i \in \Re^{m}$. Let $f_N(x)$ and $f_L(x)$ be the non-linear logistic activation function and the linear activation function used for computation in the hidden and output layers respectively. Let $v_{ij}$ be the n × p weight matrix containing the input-to-hidden weight coefficients for the link from input node i to hidden node j, and let $v_{0j}$ be the bias weight to hidden node j. Let $w_{jk}$ be the p × m weight matrix containing the hidden-to-output weight coefficients for the link from hidden node j to output node k, and let $w_{0k}$ be the bias weight to output node k.
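To make the dimensions concrete, the following sketch runs one forward pass through a network of this shape using only the Python standard library. The function names, the sizes n = 4, p = 5, m = 3, the example input and the random weights are illustrative assumptions, not values taken from the experiments.

```python
import math
import random


def logistic(a):
    """Non-linear logistic activation used in the hidden layer."""
    return 1.0 / (1.0 + math.exp(-a))


def forward(x, v, v0, w, w0):
    """Forward pass for one pattern.

    x: n inputs; v: n x p weights; v0: p hidden biases;
    w: p x m weights; w0: m output biases.
    """
    n, p, m = len(x), len(v0), len(w0)
    # Hidden layer: weighted sum plus bias, passed through the logistic function.
    z = [logistic(v0[j] + sum(x[i] * v[i][j] for i in range(n))) for j in range(p)]
    # Output layer: weighted sum plus bias with a linear (identity) activation.
    y = [w0[k] + sum(z[j] * w[j][k] for j in range(p)) for k in range(m)]
    return z, y


rng = random.Random(0)
n, p, m = 4, 5, 3
v = [[rng.uniform(-0.5, 0.5) for _ in range(p)] for _ in range(n)]
v0 = [rng.uniform(-0.5, 0.5) for _ in range(p)]
w = [[rng.uniform(-0.5, 0.5) for _ in range(m)] for _ in range(p)]
w0 = [rng.uniform(-0.5, 0.5) for _ in range(m)]

z, y = forward([5.1, 3.5, 1.4, 0.2], v, v0, w, w0)
print(len(z), len(y))  # 5 3
```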
Proposed CAST Algorithm: The working principle of the CAST algorithm, as incorporated in the BPN algorithm, is summarized below:
Step 1: Weight initialization: Initialize the weights to small random values.
Step 2: Furnish the input sample: Present to the input layer an input sample vector $x_k$ having the desired output vector $t_k$.
Step 3: Forward phase: Starting from the first hidden layer and propagating towards the output layer:
• Calculate the activation values for the hidden layer:
o Estimate the net output value.
o Estimate the actual output.
• Calculate the activation values for the output layer:
o Estimate the net output value.
o Estimate the actual output.
Step 4: Output errors: Calculate the error terms at the output layer. Differentiate the activation function in Equation (6) and substitute the resultant value of Equation (8) in Equation (7).
Step 5: Backward phase: Propagate the error backward to the input layer through the hidden layer using the error term. Differentiate the activation function in Equation (4) and substitute the resultant value of Equation (11) in Equation (10).
Step 6: Weight amendment: Update the weights using the Delta-Learning Rule:
• Weight amendment for the output units.
• Weight amendment for the hidden units.
Step 7: CAST algorithm: Incorporate the CAST algorithm:
• Compare the error value, $|t_k - y_k|$, with the threshold value, $d_{max}$. If Equation (15) generates 0, then $x_i$ is correctly classified.
• Compute the probability value for all input samples.
• Calculate the skipping factor, $sf_i$, for all input samples:
o Initialize the value of $sf_i$ to zero (for the first epoch).
o Increment the value of $sf_i$ by a constant amount for correctly classified samples alone.
• Skip the training samples with prob = 0 for the next $sf_i$ epochs.
Step 8: Repeat steps 2-7 until the halting criterion is satisfied, which may be chosen as the Root Mean Square Error (RMSE), the number of elapsed epochs or the desired accuracy.
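For reference, the forward phase, error terms and weight amendments can be written out as follows for a logistic hidden layer and a linear output layer. This is a sketch in standard backpropagation notation rather than a transcription of the numbered equations: the net-input names $z_{in,j}$ and $y_{in,k}$, the target symbol $t_k$ and the learning rate symbol $\eta$ are assumed.

```latex
\begin{align*}
z_{in,j} &= v_{0j} + \sum_{i=1}^{n} x_i v_{ij}, & z_j &= f_N(z_{in,j}) = \frac{1}{1 + e^{-z_{in,j}}} \\
y_{in,k} &= w_{0k} + \sum_{j=1}^{p} z_j w_{jk}, & y_k &= f_L(y_{in,k}) = y_{in,k} \\
\delta_k &= (t_k - y_k)\, f_L'(y_{in,k}) = t_k - y_k \\
\delta_j &= \Big( \sum_{k=1}^{m} \delta_k w_{jk} \Big) f_N'(z_{in,j}) = \Big( \sum_{k=1}^{m} \delta_k w_{jk} \Big) z_j (1 - z_j) \\
w_{jk} &\leftarrow w_{jk} + \eta\, \delta_k z_j, & w_{0k} &\leftarrow w_{0k} + \eta\, \delta_k \\
v_{ij} &\leftarrow v_{ij} + \eta\, \delta_j x_i, & v_{0j} &\leftarrow v_{0j} + \eta\, \delta_j
\end{align*}
```

Because the output activation is linear, its derivative is 1 and the output error term reduces to $t_k - y_k$, which is why a sample whose output matches its target contributes nothing to the weight update.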
Working flow of CAST: The block diagram of the proposed strategy is illustrated in the Fig. 2.
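The overall flow can also be sketched in code. The following is a minimal illustration, not the authors' MATLAB implementation: it assumes online (per-sample) backpropagation with a logistic hidden layer and a linear output layer, and it interprets constant skipping as sitting a correctly classified sample out for a fixed `skip_epochs` epochs. The function name `train_cast`, all parameter names and defaults, and the toy data are hypothetical.

```python
import math
import random


def train_cast(samples, targets, hidden=4, lr=0.1, d_max=0.2,
               skip_epochs=3, max_epochs=200, seed=0):
    """Online backpropagation that skips correctly classified samples.

    Returns the weights and the number of samples presented in each epoch.
    """
    rng = random.Random(seed)
    n, m = len(samples[0]), len(targets[0])
    v = [[rng.uniform(-0.5, 0.5) for _ in range(hidden)] for _ in range(n)]
    v0 = [rng.uniform(-0.5, 0.5) for _ in range(hidden)]
    w = [[rng.uniform(-0.5, 0.5) for _ in range(m)] for _ in range(hidden)]
    w0 = [rng.uniform(-0.5, 0.5) for _ in range(m)]
    skip_left = [0] * len(samples)  # epochs each sample still has to sit out
    presented = []

    for _ in range(max_epochs):
        # Only samples whose skip counter has run out are trained this epoch.
        active = [i for i, s in enumerate(skip_left) if s == 0]
        skip_left = [max(s - 1, 0) for s in skip_left]
        presented.append(len(active))
        for i in active:
            x, t = samples[i], targets[i]
            # Forward phase: logistic hidden layer, linear output layer.
            z = [1.0 / (1.0 + math.exp(-(v0[j] + sum(x[a] * v[a][j] for a in range(n)))))
                 for j in range(hidden)]
            y = [w0[k] + sum(z[j] * w[j][k] for j in range(hidden)) for k in range(m)]
            # Error terms for the output layer and the hidden layer.
            dk = [t[k] - y[k] for k in range(m)]
            dj = [sum(dk[k] * w[j][k] for k in range(m)) * z[j] * (1.0 - z[j])
                  for j in range(hidden)]
            # Delta-rule weight amendments.
            for j in range(hidden):
                for k in range(m):
                    w[j][k] += lr * dk[k] * z[j]
                v0[j] += lr * dj[j]
            for k in range(m):
                w0[k] += lr * dk[k]
            for a in range(n):
                for j in range(hidden):
                    v[a][j] += lr * dj[j] * x[a]
            # Skipping step: a sample whose error from this forward pass is
            # within the threshold sits out the next skip_epochs epochs.
            if max(abs(e) for e in dk) < d_max:
                skip_left[i] = skip_epochs
    return (v, v0, w, w0), presented


# Toy four-sample problem (illustrative data only).
samples = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
targets = [[0.0], [1.0], [1.0], [0.0]]
weights, presented = train_cast(samples, targets, max_epochs=50)
print(presented[:10])
```

The `presented` list plays the role of the per-epoch sample counts plotted for each dataset: with standard BPN it would stay at the full dataset size in every epoch, whereas here it drops whenever samples are being skipped.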

Empirical result and analysis:
This section describes the datasets used for the research, the experimental design and the results.
Dataset properties: In this section, the performance of the proposed CAST algorithm is evaluated on benchmark two-class and multiclass classification problems. The benchmark datasets used for the multiclass classification problem are the Iris and Waveform data sets, and those used for the two-class classification problem are the Heart and Breast Cancer data sets. The aforementioned datasets are taken from the UCI (University of California at Irvine) Machine Learning Repository (Asuncion and Newman, 2007). The results are compared with the existing BPN and LAST algorithms for both two-class and multiclass classification problems. The specification of the benchmark datasets used for training in this research is summarized in Table 1.
Experimental design: A 3-layer feedforward neural network is adopted for the simulations of all the training algorithms, with the selected training architecture and training parameters given in Table 2. The simulations of all the training algorithms are repeated for two different learning rates, 1e-4 (0.0001) and 1e-3 (0.001).
The simulations of all the above training algorithms are done using MATLAB R2010b on a machine with an Intel® Core i5-3210M processor, 4 GB of RAM and a CPU speed of 2.50 GHz.
Following the idea of the Nguyen-Widrow algorithm (Nguyen and Widrow, 1990), the MFNN weight coefficients are initialized with random values within the specified range of -0.5 to +0.5. The fivefold cross validation method is applied to train and test the above training algorithms. Each dataset is split into five disjoint subsets. Among these subsets, a single subset is retained for testing, and the remaining four subsets are used for training. The validation process is repeated five times, with each of the five subsets used exactly once for testing.
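The fivefold procedure can be sketched as follows. The helper name `five_fold_indices`, the shuffling step and the fixed seed are assumptions made for illustration; the text does not say whether the samples were shuffled or stratified before being split.

```python
import random


def five_fold_indices(num_samples, folds=5, seed=0):
    """Yield (train, test) index lists, each subset used once for testing."""
    indices = list(range(num_samples))
    random.Random(seed).shuffle(indices)  # assumed: shuffle before splitting
    subsets = [indices[f::folds] for f in range(folds)]  # disjoint subsets
    for f in range(folds):
        test = subsets[f]
        train = [i for g, s in enumerate(subsets) if g != f for i in s]
        yield train, test


# Example with 150 samples, the size of the Iris data set.
splits = list(five_fold_indices(150))
for train, test in splits:
    print(len(train), len(test))  # 120 30 on every fold
```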

Experimental results:
Multiclass problems:
Iris data set: The IRIS dataset contains 150 iris flower samples collected equally from three different varieties of iris flowers: Iris Setosa, Iris Versicolour and Iris Virginica. These varieties are identified based on four characteristics of the iris flower: the width and length of the sepal, and the width and length of the petal. Among these varieties, Iris Setosa is easier to separate from the other two, while Iris Virginica and Iris Versicolour partially overlap and are harder to distinguish.
The total number of IRIS input samples consumed by the BPN, LAST and CAST training algorithms at every single epoch is shown graphically in Fig. 3 and 4 for the learning rates of 1e-4 and 1e-3 respectively. Figures 5 and 6 illustrate the epoch-wise training time comparison between the BPN, LAST and CAST training algorithms for the learning rates 1e-4 and 1e-3 respectively.
Waveform data set: The Waveform database generator data set consists of measurements of 5000 wave samples. The 5000 samples are distributed equally (about 33% each) among the three classes of waves (Asuncion and Newman, 2007). Each class is generated from a combination of 2 of 3 "base" waves. The data set contains 21 numeric attributes that are involved in the categorization of each class of waves.
The total number of Waveform input samples consumed by the BPN, LAST and CAST training algorithms at every single epoch is shown graphically in Fig. 7 and 8 for the learning rates of 1e-4 and 1e-3 respectively. Figures 9 and 10 illustrate the epoch-wise training time comparison between the BPN, LAST and CAST training algorithms for the learning rates 1e-4 and 1e-3 respectively.

Two-Class problem:
Heart data set: The Statlog Heart disease database consists of 270 patient samples. The presence or absence of heart disease in each patient is predicted using 13 attributes. Of these 270 samples, 150 are samples in which heart disease is 'absent' and 120 are samples in which heart disease is 'present'.
The total number of Heart input samples consumed by the BPN, LAST and CAST training algorithms at every single epoch is shown graphically in Fig. 11 and 12 for the learning rates of 1e-4 and 1e-3 respectively. Figures 13 and 14 illustrate the epoch-wise training time comparison between the BPN, LAST and CAST training algorithms for the learning rates 1e-4 and 1e-3 respectively.
Breast cancer data set: The Wisconsin Breast Cancer Diagnosis Dataset contains 569 patient breast samples, of which 357 are diagnosed as benign and 212 as malignant. Each patient's characteristics are recorded using 32 numerical features.
The total number of Breast Cancer input samples consumed by the BPN, LAST and CAST training algorithms at every single epoch is shown graphically in Fig. 15 and 16 for the learning rates of 1e-4 and 1e-3 respectively. Figures 17 and 18 illustrate the epoch-wise training time comparison between the BPN, LAST and CAST training algorithms for the learning rates 1e-4 and 1e-3 respectively.
Tables 3 to 10 show the experimental results of the BPN, LAST and CAST algorithms observed at each step across five repeats of fivefold cross validation using the two learning rates 1e-4 and 1e-3. From Tables 3 to 10, the CAST algorithm yields improved computational training speed, in terms of both the total number of trained input samples and the total training time, compared with BPN and LAST. However, when the skipping factor goes higher, the accuracy of the system is strongly affected.

Training samples comparison:
The comparison results of the total number of input samples consumed for training by BPN, LAST and CAST with the learning rates of 1e-4 and 1e-3 are shown in Fig. 19 to 26. Fig. 19 shows that the total number of IRIS data samples consumed by the CAST algorithm for training under the learning rate of 1e-4 is reduced by an average of nearly 67% relative to the BPN algorithm and 44% relative to the LAST algorithm.
Fig. 21 shows that the total number of Waveform data samples consumed by the CAST algorithm for training under the learning rate of 1e-4 is reduced by an average of nearly 50% relative to the BPN algorithm and 40% relative to the LAST algorithm.
Fig. 22 shows that the total number of Waveform data samples consumed by the CAST algorithm for training under the learning rate of 1e-3 is reduced by an average of nearly 51% relative to the BPN algorithm and 41% relative to the LAST algorithm.
Fig. 23 shows that the total number of Heart data samples consumed by the CAST algorithm for training under the learning rate of 1e-4 is reduced by an average of nearly 51% relative to the BPN algorithm and 17% relative to the LAST algorithm.
Fig. 24 presents the corresponding comparison for the Heart data samples consumed by the CAST algorithm for training under the learning rate of 1e-3.
Fig. 25 shows that the total number of Breast Cancer data samples consumed by the CAST algorithm for training under the learning rate of 1e-4 is reduced by an average of nearly 66% relative to the BPN algorithm and 42% relative to the LAST algorithm.
Fig. 26 shows that the total number of Breast Cancer data samples consumed by the CAST algorithm for training under the learning rate of 1e-3 is reduced by an average of nearly 63% relative to the BPN algorithm and 38% relative to the LAST algorithm.
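The arithmetic behind an average percentage reduction can be sketched as below. The per-fold totals are hypothetical numbers chosen only to show the calculation, and averaging the per-fold percentages is an assumption about how such figures are aggregated; neither is taken from the reported tables.

```python
def percent_reduction(baseline, reduced):
    """Percentage by which `reduced` falls below `baseline`."""
    return 100.0 * (baseline - reduced) / baseline


# Hypothetical per-fold totals of samples presented during training.
bpn_totals = [1000, 1000, 1000, 1000, 1000]
cast_totals = [350, 320, 330, 340, 310]

per_fold = [percent_reduction(b, c) for b, c in zip(bpn_totals, cast_totals)]
average_reduction = sum(per_fold) / len(per_fold)
print(round(average_reduction, 1))  # 67.0
```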

Training time comparison:
Decreasing the number of trained input samples reduces the training time, as shown in this section, and thereby increases the speed of the training process. Figures 27 to 34 illustrate the training time comparison between BPN, LAST and CAST. From Fig. 27, the total training time for training the IRIS dataset by the CAST algorithm is reduced to an average of 67% of the BPN algorithm and 37% of the LAST algorithm for the learning rate of 1e-4.
From Fig. 28, the total training time for training the IRIS dataset by the CAST algorithm is reduced to an average of 70% of the BPN algorithm and 43% of the LAST algorithm for the learning rate of 1e-3.
From Fig. 29, the total training time for training the Waveform dataset by the CAST algorithm is reduced to an average of 56% of the BPN algorithm and 40% of the LAST algorithm for the learning rate of 1e-4.
From Fig. 30, the total training time for training the Waveform dataset by the CAST algorithm is reduced to an average of 56% of the BPN algorithm and 39% of the LAST algorithm for the learning rate of 1e-3.
From Fig. 31, the total training time for training the Heart dataset by the CAST algorithm is reduced to an average of 60% of the BPN algorithm and 45% of the LAST algorithm for the learning rate of 1e-4.
From Fig. 32, the total training time for training the Heart dataset by the CAST algorithm is reduced to an average of 52% of the BPN algorithm and 28% of the LAST algorithm for the learning rate of 1e-3.
From Fig. 33, the total training time for training the Breast Cancer dataset by the CAST algorithm is reduced to an average of 80% of the BPN algorithm and 68% of the LAST algorithm for the learning rate of 1e-4.
From Fig. 34, the total training time for training the Breast Cancer dataset by the CAST algorithm is reduced to an average of 69% of the BPN algorithm and 50% of the LAST algorithm for the learning rate of 1e-3.
Although CAST trains faster, its accuracy rate still suffers because of the high skipping factor. Further work should therefore also concentrate on how to improve the accuracy rate of the training algorithm.

CONCLUSION
In this paper, a simple and fast training algorithm called the Constant Adaptive Skipping Training (CAST) algorithm is presented. The simulation results showed that, compared with the other training methods, the new algorithm improves the training speed by significantly reducing the total number of training input samples consumed by the MFNN for training at every single epoch. Hence, the overall training time of the MFNN is reduced by an average of about 50% compared with the standard training algorithm. It is concluded that the proposed CAST algorithm is much faster than the standard BPN and LAST algorithms, and that it can be merged with any algorithm used for training on a real-world supervised classification task.