A Novel Strategy for Speed up Training for Back Propagation Algorithm via Dynamic Adaptive the Weight Training in Artificial Neural Network

The Back Propagation (BP) algorithm has three drawbacks: it trains slowly, it converges easily to a local minimum and it suffers from training saturation. To overcome these problems, we created a new dynamic function for each of the training rate and the momentum term. In this study, we present the BPDRM algorithm, which trains with a dynamic training rate and a dynamic momentum term. We also propose a new strategy, consisting of multiple steps, to avoid inflation of the gross weight when the training rate and the momentum term are both added as dynamic functions. In this strategy, fitting is done by establishing a relationship between the dynamic training rate and the dynamic momentum. As a result, the dynamic momentum term is placed implicitly in the dynamic training rate, so that $\alpha_{dmic}$ is expressed as a function of $\eta_{dmic}$. This procedure keeps the weights as moderate as possible (neither too small nor too large). The 2-dimensional XOR problem and the buba data were used as benchmarks for testing the effects of the new strategy. All experiments were performed in Matlab (R2012a). The experimental results show that the BPDRM algorithm trains faster than the BP algorithm at the same limited error.


INTRODUCTION
The Back Propagation (BP) algorithm is commonly used in robotics, automation and the Global Positioning System (GPS) (Thiang and Pangaldus, 2009; Tieding et al., 2009). The BP algorithm is used successfully to train multilayer feed-forward neural networks (Bassil, 2012; Abdulkadir et al., 2012; Kwan et al., 2013; Shao and Zheng, 2009). It led to a tremendous breakthrough in the application of multilayer perceptrons (Moalem and Ayoughi, 2010; Oh and Lee, 1995). It has been applied successfully in many areas and is an efficient training algorithm for the multilayer perceptron (Iranmanesh and Mahdavi, 2009). Gradient descent is commonly used to adjust the weights according to the change in the training error, but it is not guaranteed to find the global minimum of the error, because training is slow and converges easily to a local minimum (Kotsiopoulos and Grapsa, 2009; Nand et al., 2012; Shao and Zheng, 2009; Zhang, 2010). The main problem of the BP algorithm is slow training; it needs a long learning time to obtain a result (Scanzio et al., 2010). Moreover, training becomes stuck at a local minimum when the outputs of the hidden layers and the output $O_r$ of the output layer approach 1 or 0 very closely (Dai and Liu, 2012; Shao and Zheng, 2009; Zakaria et al., 2010).
To overcome these problems, several techniques increase the learning speed of the BP algorithm or help it escape the local minimum, such as the flat-spot method, gradient descent with a magnified slope and changing the gain of the activation function. The heuristic approach is another such technique; it focuses on the training rate and the momentum term. In this study, we propose a dynamic function for each of the training rate and the momentum term.
This problem has been discussed thoroughly by many researchers. Zhang et al. (2008) gave the BP algorithm faster convergence by modifying the gain parameter of the sigmoid function. In addition, the weight change $\Delta w_{jk}$ is affected by the slope value: a small slope makes back propagation very slow during training, whereas a large slope may make training faster. The values of the gain and momentum parameters directly influence the slope of the activation function, so Nawi et al. (2011) adapted both the gain and the momentum to remove saturation, whereas Oh and Lee (1995) focused on magnifying the slope. The objective of this study is to improve the training speed of the back propagation algorithm by adapting the training rate and the momentum with dynamic functions.
Current work on the slow training of the back propagation algorithm adapts a parameter (e.g., the training rate or the momentum term) that controls the weight adjustment along the descent direction (Iranmanesh and Mahdavi, 2009; Asaduzzaman et al., 2009). Several studies improve the speed of the back propagation algorithm by adapting the training rate or the momentum with a dynamic function. Xiaozhong and Qiu (2008) improved the back propagation algorithm by adapting the momentum term.
Their new algorithm was tested on the 2-dimensional XOR problem, and the experimental results demonstrated that it is better than the BP algorithm. Burse et al. (2010) proposed a new method for avoiding the local minimum by adding a momentum term and a PF term. Shao and Zheng (2009) proposed a new algorithm, PBP, based on adaptive momentum; the simulation results show that it converges faster and smooths oscillation. Zhixin and Bingqing (2010) improved the back propagation algorithm based on an adaptive momentum term. Their algorithm was tested using the 2-dimensional XOR problem, and the simulation results show that it is better than the BP algorithm. Other studies focus on an adaptive training rate. Latifi and Amiri (2011) presented a novel method based on adapting a variable steep learning rate to increase the convergence speed of the EBP algorithm; the proposed method converges faster than the back propagation algorithm. Gong (2009) proposed a novel algorithm (NBPNN) based on a self-adaptive training rate; in the experimental results, NBPNN gives more accurate results than the others. Iranmanesh and Mahdavi (2009) proposed different training rates for different locations in the output layer. Yang and Xu (2009)

MATERIALS AND METHODS
This research belongs to the heuristic methods. The heuristic method involves two parameters, the training rate and the momentum term. This study will be

NEURAL NETWORKS MODEL
In this section, we propose the ANN model, a neural network composed of an input layer, a hidden layer and an output layer. The input layer consists of the nodes {$x_1, x_2, \dots, x_i$}, whose number depends on the attributes of the data. The hidden layer is made of two layers with four nodes. The output layer is made of one layer with one neuron. Three biases are used, two in the hidden layers and one in the output layer, denoted by $u_{0h}$, $v_{0j}$ and $w_{0r}$. Finally, the sigmoid function is employed as the activation function, which is linear for the output layer (Hamid et al., 2012). The proposed neural network can be defined as {I, T, W, A}, where I denotes the set of input nodes, T denotes the topology of the NN (the number of hidden layers and the number of neurons), W denotes the set of weights and A denotes the activation function, as shown in Fig. 1.
Before presenting the BPDRM algorithm, let us briefly define some of the notation used in the algorithm:
$Z_h$: First hidden layer, neuron h, h = 1, …, q
$ZZ_j$: Second hidden layer, neuron j, j = 1, …, p
$\delta_j$: The error back-propagated at neuron j

CREATING THE DYNAMIC FUNCTIONS FOR THE TRAINING RATE AND MOMENTUM TERM
One way to escape the local minimum and save training time in the BP algorithm is to use a large value of η at the start of training. Conversely, a small value of η leads to slow training (Huang, 2007). In the BP algorithm, the training rate is selected from experience as a trial value in (0, 1) (Li et al., 2009, 2010). Several studies have proposed techniques that increase the value of η through a dynamic function to speed up the BP algorithm. However, if η becomes too large, the training output oscillates (Negnevitsky, 2005), so even a large value of η is unsuitable for training the BP algorithm. The weight update between neuron k of the output layer and neuron j of the hidden layer is given by Eq. (1), in which the weight is updated at every epoch; whether training is slow or fast depends on the parameters that affect the weight update. The key to the convergence of the training error is a monotonic function (Zhang, 2009).
Many studies adapt the training rate and momentum with a monotonic function; for example, Shao and Zheng (2009) and Yang and Xu (2009) used an exponential function, which is monotonic, to increase the speed of the BP algorithm. We propose a dynamic training rate $\eta_{dmic}$, defined in Eq. (2). Substituting $\eta_{dmic}$ from Eq. (2) into Eq. (1) gives Eq. (3). Alternatively, Eq. (1) can be extended with a momentum term to give Eq. (4). In the back propagation algorithm, the values of the momentum term and the training rate are selected as trial values from the interval [0, 1], or 0 < α ≤ 1.
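For reference, the classical gradient-descent update with a momentum term can be written in the following standard textbook form. This is only a sketch, assuming that Eq. (1) and Eq. (4) follow the usual convention; the symbols $E$ (training error) and $t$ (epoch index) are introduced here for illustration, and the exact form of the equations in this study may differ.

```latex
\begin{aligned}
\Delta w_{jk}(t) &= -\eta\,\frac{\partial E}{\partial w_{jk}} + \alpha\,\Delta w_{jk}(t-1),\\
w_{jk}(t+1) &= w_{jk}(t) + \Delta w_{jk}(t), \qquad 0 < \eta \le 1,\quad 0 < \alpha \le 1 .
\end{aligned}
```

With fixed trial values, η scales the current gradient step and α re-applies a fraction of the previous step; a dynamic scheme replaces both constants with functions that change during training.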
In this study, we propose a new strategy, consisting of two steps, to avoid inflation of the gross weight when the training rate and the momentum term are both added as dynamic functions. The strategy avoids the gross weight through a fitting procedure that creates a relationship between the dynamic training rate and the dynamic momentum: we place an implicit momentum function in the training rate, which was defined as the implicit training rate in Eq. (2). From the previous discussion, we propose the dynamic function of the momentum term given in Eq. (5). From Eq. (5), the relationship between $\alpha_{dmic}$ and $\eta_{dmic}$ is inverse. This keeps the weights moderate (neither too large nor too small), which avoids gross weights and overshooting during training. Substituting $\eta_{dmic}$ from Eq. (2) into Eq. (5), the dynamic momentum term is defined by Eq. (6). The value of $\alpha_{dmic}$ lies in (0, 1) at every epoch. The small value of $\alpha_{dmic}$ avoids gross weights in each of Eq. (25) to (30) when the weights are updated.
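The effect of tying the momentum inversely to the training rate can be illustrated with a small Python sketch. The coupling used here, `alpha = 1 / (1 + eta)`, is a hypothetical stand-in chosen only because it is inverse in η and stays inside (0, 1); it is not the function of Eq. (5) or Eq. (6), and the names `coupled_momentum` and `update_weight` are likewise assumptions made for illustration.

```python
def coupled_momentum(eta: float) -> float:
    """Hypothetical inverse coupling: alpha shrinks as eta grows, 0 < alpha < 1.

    This is NOT the paper's Eq. (5)/(6); it only mimics the stated properties.
    """
    if eta <= 0.0:
        raise ValueError("eta must be positive")
    return 1.0 / (1.0 + eta)


def update_weight(w: float, grad_step: float, prev_step: float, eta: float):
    """Apply one weight update with the coupled momentum term.

    grad_step is the raw correction term for this epoch and prev_step is
    the total change that was applied at the previous epoch.
    """
    alpha = coupled_momentum(eta)
    step = eta * grad_step + alpha * prev_step
    return w + step, step


if __name__ == "__main__":
    # A larger training rate is paired with a smaller momentum coefficient.
    for eta in (0.1, 1.0, 10.0):
        print(eta, coupled_momentum(eta))
```

Because the two coefficients move in opposite directions, a large training rate is never combined with a large momentum, which is one way to keep the total step moderate.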

BACK PROPAGATION WITH DYNAMIC TRAINING RATE AND MOMENTUM (BPDRM) ALGORITHM
The back propagation (BP) algorithm is trained with trial values of the training rate and momentum in the ranges 0 < η ≤ 1 and 0 < α ≤ 1. Many techniques for enhancing the BP algorithm aim to speed up training, using the flat-spot method, gradient descent and heuristic techniques, the last of which involve the training rate and the momentum term. The new BPDRM algorithm updates the weights at every epoch (iteration) between any neurons {j, k, …, r} of any hidden layer or the output layer as follows.
Forward propagation: In the feed-forward phase, each input unit $x_i$ receives an input signal and broadcasts it to the next layer, and so on until the last layer of the network.
Equation 6 indicates the weight update for the new algorithm, which we denote BPDRM. The best value of ε is ε = 0.0042. Each hidden unit $Z_h$ (h = 1, …, q) then calculates its input signal (Eq. 7) and computes its activation to obtain the output signal $Z_h$ (Eq. 8), which it sends to all the units in the second hidden layer. Each hidden unit $ZZ_j$ (j = 1, 2, …, p) calculates its input signal (Eq. 9) and its output signal (Eq. 10), and sends this output to the output layer. Each output unit $O_r$ then calculates its input signal (Eq. 11) and its output (Eq. 12).
Backward propagation: This step starts when the feed-forward pass reaches the output of the last layer; the feedback path is shown in Fig. 1. The feedback allows the connecting weights between each pair of layers to be adjusted. The goal of BP is to minimize the training error between the desired output and the actual output, Eq. (13). The backward pass proceeds as follows:
Calculate the error signal $\delta_r$ at each output unit $O_r$ (Eq. 14).
Calculate the weight correction term, used later to update $w_{jr}$ (Eq. 15), and the bias correction term, used later to update $w_{0r}$ (Eq. 16), then send $\delta_r$ to the hidden units.
Each hidden unit $ZZ_j$ (j = 1, …, p) sums its weighted inputs from the units in the layer above (Eq. 17) and calculates its local gradient $\delta_j$ (Eq. 18).
Calculate the weight correction term, used later to update $v_{hj}$ (Eq. 19), and the bias correction term, used later to update $v_{0j}$ (Eq. 20), then send $\delta_j$ to the hidden units.
Each hidden unit $Z_h$ (h = 1, …, q) sums its weighted inputs from the units in the layer above (Eq. 21) and calculates its local gradient $\delta_h$, expressed in terms of $x_i$ (Eq. 22).
Calculate the weight correction term, used later to update $u_{ih}$ (Eq. 23), and the bias correction term, used later to update $u_{0h}$ (Eq. 24).
Update the weights: The weights of all layers are adjusted simultaneously, based on the correction terms calculated above. The weights are then updated dynamically for every layer under the effect of Eq. (2) and (6): for the output layer (j = 0, 1, 2, …, p; r = 1, …, m), the weights $w_{jr}$ and the bias $w_{0r}$; for the second hidden layer, the weights $v_{hj}$ and the bias $v_{0j}$; and for the first hidden layer (i = 0, …, n; h = 1, …, q), the weights $u_{ih}$ and the bias $u_{0h}$.
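The forward and backward passes described above can be summarised in the standard textbook form for a network with two hidden layers. This is a sketch under the assumption that the algorithm follows the usual back propagation derivation; $f$ is the activation function, $t_r$ (the target of output unit $r$) and the superscript "in" (the net input of a unit) are notation introduced here, and the dynamic terms of Eq. (2) and Eq. (6) are not included.

```latex
\begin{aligned}
z^{\mathrm{in}}_h &= u_{0h} + \sum_{i=1}^{n} x_i\,u_{ih}, & z_h &= f\!\left(z^{\mathrm{in}}_h\right), & h &= 1,\dots,q,\\
zz^{\mathrm{in}}_j &= v_{0j} + \sum_{h=1}^{q} z_h\,v_{hj}, & zz_j &= f\!\left(zz^{\mathrm{in}}_j\right), & j &= 1,\dots,p,\\
o^{\mathrm{in}}_r &= w_{0r} + \sum_{j=1}^{p} zz_j\,w_{jr}, & o_r &= f\!\left(o^{\mathrm{in}}_r\right), & r &= 1,\dots,m,\\[4pt]
E &= \tfrac{1}{2}\sum_{r=1}^{m}\left(t_r - o_r\right)^2, & \delta_r &= \left(t_r - o_r\right) f'\!\left(o^{\mathrm{in}}_r\right),\\
\delta^{\mathrm{in}}_j &= \sum_{r=1}^{m} \delta_r\,w_{jr}, & \delta_j &= \delta^{\mathrm{in}}_j\, f'\!\left(zz^{\mathrm{in}}_j\right),\\
\delta^{\mathrm{in}}_h &= \sum_{j=1}^{p} \delta_j\,v_{hj}, & \delta_h &= \delta^{\mathrm{in}}_h\, f'\!\left(z^{\mathrm{in}}_h\right).
\end{aligned}
```

For the sigmoid $f(x) = 1/(1+e^{-x})$ the derivative is $f'(x) = f(x)\,(1 - f(x))$, so each error signal shrinks towards zero when the unit's output approaches 0 or 1, which is the saturation effect discussed in the introduction.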

IMPLEMENTATION OF THE BPDRM ALGORITHM WITH XOR-2BIT AND BUBA DATA SET
In this section, we implement the dynamic BPDRM algorithm on the XOR problem, which is widely used for training the BP algorithm. The XOR problem gives the response true if exactly one of the input values is true; otherwise the response is false. It has two inputs and four patterns. The buba data set is also widely used; it consists of 6 inputs and 345 patterns. The structure of the BP and BPDRM algorithms is 2:2:1 for the XOR problem and 6:2:1 for the buba data.
Step 1: Read the number of neurons in the hidden layer
Step 2: Read the patterns from XOR-2Bit, get the target and set the limited error = $10^{-6}$
Step 3: Read the dynamic training rate and momentum
Step 4: While (MSE > limited error), do steps 5-18
Step 5: For each training pair, do steps 6-17

Forward propagation:
Step 6: Compute the input of hidden layer Z using Eq. (7) and its output using Eq. (8).
Step 7: Compute the input of hidden layer ZZ using Eq. (9) and its output using Eq. (10).
Step 8: Compute the input of output layer $O_r$ using Eq. (11) and its output using Eq. (12).

Back propagation:
Step 9: Calculate the training error using Eq. (13).
Step 10: Compute the error signal $\delta_r$ at neuron r using Eq. (14).
Step 11: Calculate the weight correction $\Delta w_{jr}$ and the bias correction $\Delta w_{0r}$ using Eq. (15) and (16), respectively.
Step 12: Send $\delta_r$ to $zz_j$ and calculate the error signal $\delta_{\mathrm{in},j}$ and the local gradient $\delta_j$ using Eq. (17) and (18), respectively.
Step 13: Calculate the weight correction $\Delta v_{hj}$ and the bias correction $\Delta v_{0j}$ using Eq. (19) and (20), respectively.
Step 14: Send $\delta_j$ to $z_h$ and calculate the error signal $\delta_{\mathrm{in},h}$ and the local gradient $\delta_h$ using Eq. (21) and (22), respectively.
Step 15: For layer $z_h$, calculate the weight correction $\Delta u_{ih}$ and the bias correction $\Delta u_{0h}$ using Eq. (23) and (24), respectively.
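The steps above can be sketched as a short Python training loop. This is a minimal illustration that uses classical back propagation with fixed trial values of η and α, since the dynamic functions of Eq. (2) and Eq. (6) are not reproduced here. The function names, the uniform weight initialisation, the seed, the `max_epochs` cap and the example topology `[2, 2, 2, 1]` are all assumptions made for this sketch.

```python
import math
import random


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def init_weights(topology, rng, scale=0.5):
    """One matrix per layer pair; row 0 of each matrix holds the bias weights."""
    return [[[rng.uniform(-scale, scale) for _ in range(n_out)]
             for _ in range(n_in + 1)]
            for n_in, n_out in zip(topology, topology[1:])]


def forward(weights, x):
    """Steps 6-8: return the activations of every layer, input included."""
    acts = [list(x)]
    for layer in weights:
        prev = [1.0] + acts[-1]  # prepend the constant bias input
        acts.append([sigmoid(sum(p * row[k] for p, row in zip(prev, layer)))
                     for k in range(len(layer[0]))])
    return acts


def backward(weights, acts, target):
    """Steps 10-14: error signals (deltas) of every non-input layer."""
    deltas = [[(t - o) * o * (1.0 - o) for t, o in zip(target, acts[-1])]]
    for l in range(len(weights) - 1, 0, -1):
        layer, a = weights[l], acts[l]
        # Row j + 1 belongs to unit j because row 0 is the bias row.
        d_in = [sum(d * layer[j + 1][k] for k, d in enumerate(deltas[0]))
                for j in range(len(a))]
        deltas.insert(0, [d * z * (1.0 - z) for d, z in zip(d_in, a)])
    return deltas


def train(patterns, topology, eta=0.5, alpha=0.9, limit=1e-6,
          max_epochs=2000, seed=1):
    """Train until MSE <= limit or max_epochs is reached; return weights, MSE history."""
    rng = random.Random(seed)
    weights = init_weights(topology, rng)
    prev = [[[0.0] * len(row) for row in layer] for layer in weights]
    history = []
    for _ in range(max_epochs):  # Step 4, with an added epoch cap
        sq_err = 0.0
        for x, target in patterns:  # Step 5
            acts = forward(weights, x)
            deltas = backward(weights, acts, target)
            sq_err += sum((t - o) ** 2 for t, o in zip(target, acts[-1]))
            for l, layer in enumerate(weights):  # update all layers together
                inputs = [1.0] + acts[l]
                for i, row in enumerate(layer):
                    for k in range(len(row)):
                        step = eta * deltas[l][k] * inputs[i] + alpha * prev[l][i][k]
                        row[k] += step
                        prev[l][i][k] = step
        history.append(sq_err / len(patterns))
        if history[-1] <= limit:
            break
    return weights, history


if __name__ == "__main__":
    xor = [([0, 0], [0]), ([0, 1], [1]), ([1, 0], [1]), ([1, 1], [0])]
    _, history = train(xor, topology=[2, 2, 2, 1])
    print(len(history), history[-1])
```

All error signals are computed before any weight is changed, so the layers are adjusted simultaneously, as the text requires. A dynamic variant would recompute `eta` and `alpha` inside the loop instead of holding them constant.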

EXPERIMENTAL RESULTS
In this section, we report the results obtained when experimenting with our proposed method on the 2-bit XOR parity problem and the buba data as benchmarks. We use Matlab R2012a running on a Windows machine. There is no theory for determining the value of the limited error, but the range of the limited error affects the training time. Kotsiopoulos and Grapsa (2009) set the error tolerance to 1 to a power of -5; the convergence rate is very slow and takes 500000 epochs. Cheung et al. (2010) set the limited error to less than 3 to a power of -4; the convergence rate is very slow and takes 1000 epochs.

Experiments on the BPDRM algorithm:
We run the BPDRM algorithm with the dynamic functions given in Eq. (2) and (6). Ten experiments were carried out at a limited error of 1.0E-05. The average training time and number of epochs for all experiments are tabulated in Table 1.
From Table 2, the best performance of the BP algorithm is achieved at η = α = 0.9, where the training time is 9.3020 sec. The worst performance is achieved at η = α = 1, where the training time is 1920 sec. The training time therefore lies in the range 9.3020 ≤ t ≤ 1920 sec: we take 1920 sec as the maximum training time and 9.3020 sec as the minimum training time. In addition,

BPDRM Algorithm experiments using the data training set:
We test the performance of our contribution, given in Eq. (2) and (6), using 178 patterns as the training set. Ten experiments were carried out; the simulation results are tabulated in Table 3.
From Table 3,

Experiments of the BP algorithm with buba-training set:
In this part, we test the performance using 180 patterns as the training set. One hundred experiments were carried out and their results averaged. The results are tabulated in Table 4.
From Table 4, the best performance of the BP algorithm is achieved at η = 0.6, α = 0.5, where the training time is 16.61482 sec. The worst performance is achieved at η = 0.9, α = 0.99, where the training time is 4750.909 sec. The average training time therefore lies in the range 16.61482 ≤ t ≤ 4750.909 sec: we take 4750.909 sec as the maximum training time and 16.61482 sec as the minimum training time. The BP algorithm suffers the highest saturation when the training rate η and the momentum term α are both 1. The training curve is shown in Fig. 5.

Experiments of the BPDRM algorithm with buba-testing set:
In this section, we implement the BPDRM algorithm using the buba testing set. One hundred and twenty patterns were used as the test set. The number of input nodes equals the number of attributes of the data, so the structure of the BPDRM algorithm becomes 6:2:1. All experiments are reported in Table 5.
From Table 5,

Experiments on the BP algorithm for the buba-testing set:
We implement the BP algorithm using 120 patterns, which represent the test data set. One hundred experiments were carried out in Matlab. The experimental results are tabulated in Table 6.
From Table 6, the best performance of the BP algorithm is achieved at η = 0.6 and α = 0.5, which also gives the fastest training, with a training time of 24.69 sec. The worst performance is achieved at η = 0.9, α = 1, where the training time is 4330.909 sec and MSE = 0.5. The average training time therefore lies in the range 24.69 ≤ t ≤ 4330.909 sec: we take 4330.909 sec as the maximum training time and 24.69 sec as the minimum training time. The BP algorithm suffers the highest saturation at η = 0.9 and α = 1.

DISCUSSION
In this part, we discuss and compare the BPDRM algorithm with the BP algorithm on three criteria: the training time, the MSE and the number of epochs. Following (Saki et al., 2013; Nasr and Chtourou, 2011; Scanzio et al., 2010), we calculate the training speed-up by the formula:
$$\text{Speed up} = \frac{\text{training time of the BP algorithm}}{\text{training time of the BPDRM algorithm}}$$
For the XOR problem, dynamic propagation provides better training, as shown in Table 7.
From Table 7, it is evident that the BPDRM algorithm provides superior performance over the BP algorithm.
For the buba training set, we compare the BPDRM algorithm and the BP algorithm on three criteria: training time, MSE and number of epochs, to discover which gives superior training. The comparison is given in Table 8.
From Table 8, it is clear that the BPDRM algorithm has superior performance over the BP algorithm: the BPDRM algorithm is 40.32 ≈ 40 times faster than the BP algorithm at the maximum training time, and 3.6398 ≈ 4 times faster at the minimum training time.
For the buba testing set, dynamic propagation provides better training, as shown in Table 9.
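The speed-up ratios can be checked with a few lines of Python. This is a minimal sketch, assuming the speed-up is the BP training time divided by the BPDRM training time. The function name `speed_up` and the constant names are illustrative, and pairing the 7.2813 sec average with the buba testing set is an assumption inferred from the reported ratios.

```python
def speed_up(t_bp: float, t_bpdrm: float) -> float:
    """Ratio of BP training time to BPDRM training time (both in seconds)."""
    if t_bpdrm <= 0:
        raise ValueError("training time must be positive")
    return t_bp / t_bpdrm


# Training times in seconds, as quoted in the text.
XOR_BPDRM = 1.0315                          # average BPDRM time on XOR
XOR_BP_MAX, XOR_BP_MIN = 1920.0, 9.3020     # slowest and fastest BP runs on XOR
TEST_BPDRM = 7.2813                         # assumed: BPDRM on the buba testing set
TEST_BP_MAX, TEST_BP_MIN = 4330.909, 24.69  # slowest and fastest BP runs, testing set

if __name__ == "__main__":
    print("XOR, maximum:", speed_up(XOR_BP_MAX, XOR_BPDRM))
    print("XOR, minimum:", speed_up(XOR_BP_MIN, XOR_BPDRM))
    print("buba testing, maximum:", speed_up(TEST_BP_MAX, TEST_BPDRM))
    print("buba testing, minimum:", speed_up(TEST_BP_MIN, TEST_BPDRM))
```

The ratio depends strongly on which BP run is used as the baseline: the maximum figures compare against the slowest, saturated BP settings, while the minimum figures compare against the best hand-tuned η and α.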

CONCLUSION
The back propagation (BP) algorithm is widely used in tasks such as robot control, GPS and image restoration, but it suffers from slow training. Many techniques have been proposed to increase the speed of the back propagation algorithm. In this study, we focused on a heuristic method involving two parameters, the training rate and the momentum term. This study introduces the BPDRM algorithm, which trains with a dynamic function for each of the training rate and the momentum. The dynamic functions influence the weights of each hidden layer and the output layer. The main advantages of the dynamic training rate and momentum term are reductions in the training time, the training error and the number of epochs. All algorithms were implemented in Matlab R2012a. The XOR problem and the buba data were used as benchmarks. For the XOR problem, the BPDRM algorithm is 1862 times faster than the BP algorithm at the maximum training time and 9 times faster at the minimum training time. For the buba training set, the BPDRM algorithm is 976 times faster than the BP algorithm at the maximum training time. For the buba testing set, the BPDRM algorithm is 595 times faster than the BP algorithm at the maximum training time.
have proposed modifying the training rate with a mathematical formula based on a two-step function. In their experimental results, the new algorithm gives superior performance compared to the back propagation algorithm. Al-Duais et al. (2013) improved the BP algorithm by creating a mathematical formula for the training rate. Their experimental results show that the dynamic BP algorithm trains faster than the BP algorithm.

Fig. 1: Training of back propagation

gradient for an output derivative of the activation function of $O_r$ to get:

helps the back propagation algorithm to reduce the training time, where t = 1.0315 sec; the average MSE is very small at every training epoch. The training curve is shown in Fig. 2. From Fig. 2, the training curve vibrates slightly at the beginning of training and then decays inversely with the epoch index. The curve is smooth and converges quickly to the global minimum.

Experiments on the BP algorithm:
We run the BP algorithm, given in Eq. (1), with trial (manual) values for the training rate and the momentum term. The values of η and α are chosen from [0, 1]. The experimental results are tabulated in Table 2.

the value of MSE and the number of epochs at η = α = 1, where MSE = 0.0960 and the number of epochs is 1259443. The MSE is also large at η = 0.9, η = 0.1. That means the weight change is very slight or equal for every epoch. The training curve is shown in Fig. 3.
we see that the average training time is very short and the number of epochs is very small. This indicates that the dynamic training rate and momentum term help the back propagation algorithm to remove training saturation and reach the global minimum. The training curve of the BPDRM algorithm on the buba data is shown in Fig. 4.

Fig. 2: Training curve of BPDRM algorithm

From Fig. 4, we can see that the BP starts training with a small training error, where the MSE is 4.5 power -3; the MSE then decays quickly, inversely with the epoch number. At around 10 epochs, MSE = 0.005, and training then reaches the global minimum.
the dynamic training rate and momentum reduce the training time and enhance the convergence of the MSE. The average training time is 7.2813 sec at an epoch of 180. The training curve is shown in Fig. 6. From Fig. 6, we can see that the back propagation training curve starts with a small training error. The average value of the MSE decays fast

Table 1: Average training time of BPDRM algorithm with XOR

Table 3: Average training time of BPDRM algorithm with buba training set

Table 4: Average training time of BP algorithm with buba training set

Table 7: Speed-up of BPDRM versus BP algorithm with XOR

Table 8: Speed-up of BPDRM versus BP algorithm with buba training set

Table 9: Speed-up of BPDRM versus BP algorithm with buba testing set

≈ 595 times faster than the BP algorithm at the maximum training time; on the other hand, the BPDRM algorithm is 3.3908 ≈ 3.4 times faster than the BP algorithm at the minimum training time.