Minimization Algorithm for Training Feed Forward Neural Network

In this paper we suggest a new learning rate that improves the classical Backpropagation (BP) algorithm. The derivation of this learning rate is based on approximating the error function E by a quadratic function in a sufficiently small neighborhood of the optimal weight vector. The suggested algorithm (Spectral Backpropagation, SBP) is tested, and the experimental results show that the SBP learning strategy improves on the considered methods.


1-Introduction
The batch training of a feed forward neural network (FNN) is consistent with the theory of unconstrained optimization [5] and can be viewed as the minimization of the function E; that is, to find a minimizer $w^*$ such that

$$E(w^*) = \min_{w} E(w), \quad (1)$$

where E is the batch error measure, defined as the sum over all patterns p and output neurons j of the squared difference between the actual output value at the jth output-layer neuron for pattern p and the target output value. The scalar p is an index over the input-output pairs.

The widely used batch Back Propagation (BP) [13] is a first-order neural network training algorithm which minimizes the error function using the steepest descent (SD) method [4]:

$$w_{k+1} = w_k - \eta\, g_k,$$

where k indicates iterations (k = 0, 1, …), the gradient vector $g_k = \nabla E(w_k)$ is usually computed by back-propagating the error through the layers of the FNN (see [9]), and $\eta$ is a constant, heuristically chosen learning rate (or step length). Appropriate learning rates help to avoid convergence to a saddle point or a maximum. In practice a small constant learning rate $0 < \eta < 1$ is chosen [12] in order to secure the convergence of the BP training algorithm and to avoid oscillation in a direction where the error function is steep. It is well known that this approach tends to be inefficient [11]. For the difficulties in obtaining convergence of the BP training algorithm with a constant learning rate, see [7]. On the other hand, there are theoretical results that guarantee convergence when the learning rate is constant [12]. In this case the learning rate is proportional to the inverse of the Lipschitz constant of the gradient, as expressed by the Lipschitz condition (3).
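The constant-rate steepest descent iteration described above can be sketched in a few lines of Python. This is a minimal illustration and not the paper's MATLAB setup: the single linear neuron, the three training patterns, and the names `batch_error_and_gradient`, `steepest_descent` and `eta` are assumptions introduced here.

```python
def batch_error_and_gradient(w, patterns):
    """Batch sum of squared errors of a linear neuron y = w[0]*x + w[1],
    together with its gradient with respect to w."""
    error = 0.0
    grad = [0.0, 0.0]
    for x, target in patterns:
        diff = w[0] * x + w[1] - target
        error += diff * diff
        grad[0] += 2.0 * diff * x
        grad[1] += 2.0 * diff
    return error, grad


def steepest_descent(w, patterns, eta=0.05, error_goal=1e-6, max_epochs=1000):
    """Batch update w <- w - eta * g with a constant learning rate eta."""
    history = []
    for _ in range(max_epochs):
        error, grad = batch_error_and_gradient(w, patterns)
        history.append(error)
        if error <= error_goal:
            break
        # Weights change only after the whole pattern set has been seen.
        w = [wi - eta * gi for wi, gi in zip(w, grad)]
    return w, history


# Hypothetical patterns generated by the target rule t = 2x + 1.
patterns = [(-1.0, -1.0), (0.0, 1.0), (1.0, 3.0)]
w_final, errors = steepest_descent([0.0, 0.0], patterns)
```

For this toy problem the Hessian of E is diagonal with entries 4 and 6, so the iteration is stable only for `eta` below 1/3. This illustrates why a constant rate has to be chosen conservatively.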

2-Proposed Spectral Learning
An interesting new idea is the choice of step length proposed in [1] for the steepest descent (SD) method for unconstrained optimization. The derivation of our new algorithm is based on the following theorem.

Theorem (1) [3]:
The general function behaves like a quadratic function in a sufficiently small neighborhood of $w^*$.

As a consequence of Theorem (1), relation (5) holds. Note that equation (5) is true only in a small neighborhood of $w^*$, where $G_k$ is the Hessian matrix of the error function; therefore we may use the Barzilai-Borwein approximation to it [8], given in (6).
For the unconstrained optimization problem given in equation (1), a necessary condition for the point $w^*$ to be an optimal solution is

$$g(w^*) = 0. \quad (7)$$

This is a system of nonlinear equations which must be solved to obtain the optimal solution $w^*$. In order to fulfill this optimality condition, the following continuous gradient flow reformulation of the problem is suggested [6]. Solve the following system of ordinary differential equations:

$$\frac{dw(t)}{dt} = -g(w(t)), \quad (8)$$

with initial condition

$$w(0) = w_0. \quad (9)$$

The solution of system (8) with initial condition (9) converges to the optimal solution, which is a minimum of the function given in (1), according to the following theorems; see [2].

Theorem (2) [2]:
Consider that $w^*$ is a point satisfying (7) and that the Hessian $G(w^*)$ is positive definite. If the initial point $w_0$ is close enough to $w^*$, then $w(t)$, the solution to (8), tends to $w^*$ as $t \to \infty$.

Theorem (3) [2]:
Let $w(t)$ be the solution of (8). For fixed $t_0 \ge 0$, if $g(w(t)) \ne 0$ for all $t > t_0$, then $E(w(t))$ is strictly decreasing with respect to $t$ for all $t > t_0$.
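The behaviour stated in Theorems (2) and (3) can be observed numerically by integrating the gradient flow with small forward Euler steps. The sketch below is only an illustration: the quadratic test function, its minimizer (1, -0.5), the step `h` and the helper names are assumptions made here and do not come from the paper.

```python
def error(w):
    """Hypothetical test function with minimizer w* = (1, -0.5)."""
    return (w[0] - 1.0) ** 2 + 2.0 * (w[1] + 0.5) ** 2


def gradient(w):
    """Gradient g(w) of the test function above."""
    return [2.0 * (w[0] - 1.0), 4.0 * (w[1] + 0.5)]


def integrate_gradient_flow(w0, h=0.01, steps=500):
    """Forward Euler steps w <- w - h * g(w) for the flow dw/dt = -g(w)."""
    w = list(w0)
    values = [error(w)]
    for _ in range(steps):
        g = gradient(w)
        w = [wi - h * gi for wi, gi in zip(w, g)]
        values.append(error(w))
    return w, values


w_end, energy = integrate_gradient_flow([3.0, 2.0])
```

With a small step the recorded values of E decrease at every step and the trajectory approaches the minimizer, which mirrors the continuous-time statements for this particular function.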
As we have seen, solving the unconstrained optimization problem (1) has been reduced to integrating the ordinary differential equation (8) with initial condition (9). One simple algorithm for solving (8) and (9) is the discretization (10), where $\theta \in [0,1]$ is a scalar. If $\theta = 0$, the discretization is the explicit forward Euler scheme; if $\theta = 1$, it is the implicit backward Euler scheme. But

$$g_{k+1} = g_k + G_k s_k, \quad (11)$$

and combining (10) and (11) gives the algorithm (12).

The method based on algorithm (12) performs quite well if $G_k$ is positive definite and has desirable features, but it is not recommended for practical use: its major drawback is the computation required at each iteration, and no value of $h$ is specified. However, one can deduce a simple implementation of the algorithm given in (12) that preserves its useful theoretical features, as follows. Since $E(w)$ is continuously differentiable, the gradient vector $g_k$ is Lipschitz continuous and satisfies equation (3); without loss of generality we may take $h_k = L_k$, which is relation (13). From (6), (12) and (13) we get (14); therefore we can adjust the weight vector according to equation (15), where the quantities involved are given by (16).

We summarize the above algorithm (15)-(16) (the spectral step size SBP) as follows:
Step (1): Initialization: number of epochs k = 1, $\eta_k \in (0,1)$, error goal = eg, weight vector = $w_k$, stopping criterion = $\varepsilon$, $g_k = \nabla E(w_k)$.
Step (2
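To give a feel for a spectral step size, the following sketch runs steepest descent with the standard Barzilai-Borwein step $s_k^T s_k / s_k^T y_k$, where $s_k = w_{k+1} - w_k$ and $y_k = g_{k+1} - g_k$. It should not be read as the SBP algorithm of this paper, whose equations (15)-(16) may differ. The test gradient, the initial step `first_step` and the safeguard that keeps the previous step when $s_k^T y_k \le 0$ are assumptions made for this illustration.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def bb_gradient_descent(grad_fn, w, first_step=0.05, tol=1e-8, max_epochs=200):
    """Steepest descent whose step is re-estimated each epoch from
    s = w_new - w and y = g_new - g (standard Barzilai-Borwein formula)."""
    g = grad_fn(w)
    step = first_step
    epochs = 0
    while dot(g, g) ** 0.5 > tol and epochs < max_epochs:
        w_new = [wi - step * gi for wi, gi in zip(w, g)]
        g_new = grad_fn(w_new)
        s = [a - b for a, b in zip(w_new, w)]
        y = [a - b for a, b in zip(g_new, g)]
        sy = dot(s, y)
        if sy > 0.0:  # assumed safeguard: otherwise keep the previous step
            step = dot(s, s) / sy
        w, g = w_new, g_new
        epochs += 1
    return w, epochs


def quadratic_gradient(w):
    """Gradient of the hypothetical error 2*(w0 - 2)^2 + 3*(w1 - 1)^2."""
    return [4.0 * (w[0] - 2.0), 6.0 * (w[1] - 1.0)]


w_bb, epochs_bb = bb_gradient_descent(quadratic_gradient, [0.0, 0.0])
```

The step is built only from differences of successive weights and gradients, so no Hessian has to be formed or inverted, which is the practical appeal of this family of step sizes.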

3-Experiments and Results:
A computer simulation has been developed to study the performance of the learning algorithms. The simulations have been carried out using MATLAB version 5.4. The performance of the spectral step size BP (SBP) has been evaluated and compared with batch versions of BP in the Neural Network Toolbox: constant learning rate BP (CBP), known as traingd (see Appendix), adaptive BP (ABP) (traingda), and BP with momentum (MBP) (traingdx). Toolbox default values for the heuristic parameters of the above algorithms are used unless stated otherwise. The algorithms were tested using the same initial weights, initialized by the Nguyen-Widrow method [10], and received the same sequence of input patterns. The weights of the network are updated only after the entire set of patterns to be learned has been presented.
For each of the test problems, a table summarizing the performance of the algorithms for the simulations that reached a solution is presented. The reported parameters are: min, the minimum number of epochs; mean, the mean number of epochs; max, the maximum number of epochs; Tav, the average total time; and succ., the number of successful simulations out of 100 trials within the error function evaluations limit.
If an algorithm fails to converge within the above limit, it is considered to have failed to train the FNN, and its epochs are not included in the statistical analysis of the algorithms. One gradient evaluation and one error function evaluation are necessary at each epoch.
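The reported statistics can be computed from per-trial records as in the sketch below. The record layout `(epochs, seconds, converged)`, the function name `summarize_trials` and the four example trials are hypothetical. Averaging the time over successful runs only and reporting succ. as a percentage are assumptions, since the text does not settle either point.

```python
def summarize_trials(trials, epoch_limit):
    """Summarize a list of (epochs, seconds, converged) records.
    Failed runs count toward the success rate but not toward the
    epoch statistics."""
    succeeded = [t for t in trials if t[2] and t[0] <= epoch_limit]
    if not succeeded:
        return {"min": None, "mean": None, "max": None, "tav": None, "succ": 0.0}
    epochs = [t[0] for t in succeeded]
    times = [t[1] for t in succeeded]
    return {
        "min": min(epochs),
        "mean": sum(epochs) / len(epochs),
        "max": max(epochs),
        "tav": sum(times) / len(times),  # assumed: successful runs only
        "succ": 100.0 * len(succeeded) / len(trials),
    }


# Made-up records for illustration; these are not results from the paper.
example = [(120, 0.8, True), (300, 2.0, True), (1000, 6.5, False), (180, 1.2, True)]
summary = summarize_trials(example, epoch_limit=1000)
```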

Problem (1): (SPECT Heart Problem):
This data set contains data instances derived from cardiac Single Photon Emission Computed Tomography (SPECT) images from the University of Colorado [8]. The network architecture for this medical classification problem consists of one hidden layer with 6 neurons and an output layer of one neuron. The termination criterion is set to E ≤ 0.1 within a limit of 1000 epochs. Table (1) summarizes the results of all algorithms for 100 simulations: the minimum number of epochs for each algorithm is listed in the first column (Min), the maximum number of epochs in the second column, the third column contains the average time (Tav) over the 100 simulations, and the last column contains the percentage of successful runs out of the 100 simulations.

Problem (2): Continuous function Approximation:
The second test problem we consider is the approximation of the continuous trigonometric function: f(x)=sin(x)*cos(3x).
The network architecture for this problem is a 1-15-1 FNN (thirty weights, sixteen biases), which is trained to approximate the function f(x), where x ∈ [-π, π]. The network is trained until the sum of the squares of the errors becomes less than the error goal 0.1. The network is based on hidden neurons with logistic activations and biases and on a linear output neuron with bias. Comparative results are shown in table (2). Figure (1) shows the performance of SBP.
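The 1-15-1 architecture and its error measure can be written out as a short Python sketch. Several choices below are assumptions for illustration only: the input interval is taken as [-π, π], the 21 equally spaced sample points are hypothetical, and the weights are drawn uniformly at random instead of using the Nguyen-Widrow initialization of the experiments.

```python
import math
import random


def target(x):
    """Function to approximate: f(x) = sin(x) * cos(3x)."""
    return math.sin(x) * math.cos(3.0 * x)


def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))


def init_network(hidden=15, seed=0):
    """Uniform random weights (an assumption, not Nguyen-Widrow)."""
    rng = random.Random(seed)
    return {
        "w_hidden": [rng.uniform(-1.0, 1.0) for _ in range(hidden)],
        "b_hidden": [rng.uniform(-1.0, 1.0) for _ in range(hidden)],
        "w_out": [rng.uniform(-1.0, 1.0) for _ in range(hidden)],
        "b_out": rng.uniform(-1.0, 1.0),
    }


def forward(net, x):
    """Logistic hidden layer followed by a linear output neuron."""
    hidden = [logistic(w * x + b) for w, b in zip(net["w_hidden"], net["b_hidden"])]
    return sum(v * h for v, h in zip(net["w_out"], hidden)) + net["b_out"]


def sum_squared_error(net, xs):
    return sum((forward(net, x) - target(x)) ** 2 for x in xs)


def count_parameters(net):
    weights = len(net["w_hidden"]) + len(net["w_out"])
    biases = len(net["b_hidden"]) + 1
    return weights, biases


xs = [-math.pi + 2.0 * math.pi * i / 20 for i in range(21)]
net = init_network()
n_weights, n_biases = count_parameters(net)
sse = sum_squared_error(net, xs)
```

Counting the parameters reproduces the figures quoted in the text: 15 input-to-hidden plus 15 hidden-to-output weights, and 15 hidden biases plus one output bias. Training would then continue until `sse` falls below the error goal 0.1.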

Appendix
1-traingd: a MATLAB function (in the MATLAB toolbox) that uses the steepest descent direction with a constant step size to minimize the error function E (i.e. to train the network); known as standard Backpropagation.
2-traingda: a MATLAB function (in the MATLAB toolbox) that uses the steepest descent direction with an adaptive step size to minimize the error function E (i.e. to train the network); known as adaptive Backpropagation.
3-traingdx: a MATLAB function (in the MATLAB toolbox) that uses the steepest descent direction with momentum and an adaptive step size to minimize the error function E (i.e. to train the network).