
Neural Networks

Volume 18, Issue 10, December 2005, Pages 1341-1347

Stability analysis of a three-term backpropagation algorithm

https://doi.org/10.1016/j.neunet.2005.04.007

Abstract

Efficient learning by the backpropagation (BP) algorithm is required for many practical applications. The BP algorithm calculates the weight changes of artificial neural networks, and a common approach is to use a two-term algorithm consisting of a learning rate (LR) and a momentum factor (MF). The major drawbacks of the two-term BP learning algorithm are the problems of local minima and slow convergence speeds, which limit the scope for real-time applications. Recently, the addition of an extra term, called a proportional factor (PF), to the two-term BP algorithm was proposed. The third term increases the speed of the BP algorithm. However, the PF term can also degrade the convergence of the BP algorithm, and criteria for evaluating convergence are required to facilitate the application of the three-term BP algorithm. This paper analyzes the convergence of the new three-term backpropagation algorithm. If the learning parameters of the three-term BP algorithm satisfy the conditions given in this paper, then it is guaranteed that the system is stable and will converge to a local minimum. It is proved that if at least one of the eigenvalues of the matrix F (composed of the Hessian of the cost function and the system Jacobian of the error vector at each iteration) is negative, then the system becomes unstable. The paper also shows that all the local minima of the three-term BP algorithm cost function are stable. Relationships between the learning parameters are established such that the stability conditions are met.

Introduction

The backpropagation (BP) algorithm is commonly used for training Artificial Neural Networks (ANN) (Rumelhart & McClelland, 1986). Training generally consists of iteratively updating the weights, usually employing the negative gradient of a mean-squared error function, which is the difference between the desired and actual output values, multiplied by the slope of a sigmoidal activation function. The error signal is then backpropagated to the lower layers. Traditionally, two parameters, called the learning rate (LR) and momentum factor (MF), are used for controlling the weight adjustment along the steepest-descent direction and for dampening oscillations. The BP algorithm is popular and is used in many applications. Unfortunately, its convergence rate is relatively slow, especially for networks with more than one hidden layer. The reason for this is the saturation behaviour of the activation function used for the hidden and output layers. When the output of a unit lies in the saturation region, the corresponding descent gradient takes a very small value, even if the output error is large, leading to very little progress in the weight adjustment.
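For reference, the two-term update described above can be written compactly. This is the standard textbook form, with α the learning rate and β the momentum factor; it coincides with the three-term update of Section 3 when the proportional term is omitted:

$$\Delta W(k)=-\alpha\,\nabla E\bigl(W(k)\bigr)+\beta\,\Delta W(k-1),\qquad W(k+1)=W(k)+\Delta W(k).$$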

The problem of improving the efficiency and convergence rate of the back-propagation algorithm has been investigated by a number of researchers. Yam and Chow (2000) proposed an approach for finding the optimal weights of feedforward neural networks. The rationale behind the approach was to reduce the initial network error while preventing the network from getting stuck as a result of the initial weights. The approach ensured that the outputs of the hidden units are in an active region, where the derivative of the activation function has a large value. From the outputs of the last hidden layer and the given output pattern, the optimal values of the last layer of weights are evaluated by a least-squares method. It is noted that many local-minima difficulties are closely related to neuron saturation in the hidden layer. Once such saturation occurs, neurons in the hidden layer lose their sensitivity to input signals, and the propagation of information is severely blocked. In some cases, the network can no longer learn (Wang, Tang, Tamura, & Ishii, 2004). Kamarthi and Pittner (1999) proposed an extrapolation-based algorithm to accelerate the BP algorithm. This requires the error surface to have a smooth variation along the main axes, so that extrapolation is possible. To perform the extrapolation, the convergence behaviour of each network weight is examined individually at the end of each epoch. Cho and Chow (1999) presented an approach based on the least-squares method to determine the weights between the output layer and the hidden layer, in order to maintain convergence. The weights between the hidden layer and the input layer were evaluated by the well-known penalized optimization method when problems of local minima arise. Ampazis, Perantonis, and Taylor (1999) studied the escaping behaviour of the BP algorithm in the vicinity of temporary minima, for a network with an arbitrary number of input units, two hidden-layer units and one output unit. Temporary minima correspond to the phase when the network remains in the vicinity of critical points of the plane trajectory, and are in fact saddle points. At these points, the network moves away slowly from the critical points, because the largest eigenvalue of the Jacobian matrix of the linearized system is very small and the system therefore evolves very slowly. However, as training continues, small perturbations in the coefficients of the system are reflected in small perturbations in the eigenvalues, eventually causing them to bifurcate. At that point, the largest eigenvalue evolves at a much faster rate, and the error curve drops to a significantly lower level. Lisboa and Perantonis (1991) obtained solutions for the excitation values that may occur at the local minima of the XOR problem, using the gradient backpropagation algorithm. The back-propagation algorithm with the momentum term was analyzed by Phansalkar and Sastry (1994), and it was shown that all local minima of the cost function are stable. Hahnloser (1998) proposed a framework for local and simple learning algorithms based on interpreting a neural network as a set of configuration constraints. The results can be useful both for the analysis and for the synthesis of learning algorithms. Empirical studies that assist the learning of three-layer network structures are discussed by Carl (1996).
The necessary conditions for a given output unit to saturate prematurely are established by Vitela and Reifman (1997), who conclude that the momentum term plays the leading role in the occurrence of premature saturation of the output units of feedforward multilayer networks during training with the standard backpropagation algorithm. Ellacott (1994) gives a detailed analysis of the delta rule algorithm, indicating why one implementation leads to a stable numerical process. The effect of filtering and other preprocessing of the input data is discussed, with a new result on the effect of linear filtering on the rate of convergence of the delta rule.

Some of the proposed modifications of BP algorithms require complex and costly calculations at each iteration, which offset their faster rates of convergence. Another disadvantage of most acceleration techniques is that they must often be tuned to fit a particular application.

Convergence analysis of the three-term BP algorithm was recently investigated (Zweiri, Whidborne, & Seneviratne, 2002) to bound the learning parameters such that the algorithm allows the system to learn robustly.

A new approach to calculate the change of weight for the link joining the jth unit to the ith unit was proposed recently (Zweiri, Whidborne, & Seneviratne, 2003). In this approach, a new term is introduced in addition to the learning rate (LR) and momentum factor (MF); this being a proportional factor (PF). The new three-term algorithm can be viewed as being analogous to the common three-term PID algorithm used in feedback control. The comparative test results in Zweiri et al. (2003) indicate that the new algorithm offers much higher speeds of convergence than the standard BP algorithm. Consequently, the improvements presented provide a valuable and viable alternative to existing training methods.
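To make the analogy concrete, a minimal sketch of one iteration of the three-term update is given below, following the form of the update analyzed in Section 3. The function name, and the assumption that the error term e(W(k)) has been arranged to be conformable with the weight vector, are illustrative; the paper defines these quantities precisely.

    import numpy as np

    def three_term_step(W, dW_prev, grad_E, e, alpha, beta, gamma):
        # One iteration of the three-term BP update:
        #   W(k+1) = W(k) - alpha*grad E(W(k)) + beta*dW(k-1) + gamma*e(W(k))
        # W       : current weight vector W(k)
        # dW_prev : previous weight change dW(k-1)
        # grad_E  : gradient of the cost E at W(k)
        # e       : error term e(W(k)), assumed conformable with W
        # alpha, beta, gamma : learning rate (LR), momentum factor (MF),
        #                      proportional factor (PF)
        dW = (-alpha * np.asarray(grad_E)
              + beta * np.asarray(dW_prev)
              + gamma * np.asarray(e))
        return W + dW, dW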

In this paper, the convergence behaviour of the three-term backpropagation algorithm is studied, and it is shown that if the coefficients of the three-term BP algorithm satisfy certain conditions, then it is guaranteed that the system is stable and will converge to a local minimum. It is shown that the local minima of the cost function are the only locally asymptotically stable points of the algorithm. The paper is organized as follows. In Section 2, the backpropagation algorithm and the proportional factor term are described. Convergence analysis of the cost function is reported in Section 3. The relationships between the learning parameters are established in Section 4. Numerical validation is given in Section 5 and conclusions are drawn in Section 6.

Section snippets

Back-propagation

The back-propagation algorithm for multi-layer networks is a gradient descent procedure used to minimize a least-squares objective function (error function). Assume a batch of training sample pairs $(I_1,T_1),\ldots,(I_n,T_n)$, where $I_s$, $1\le s\le n$, represents the $s$th input in the batch, and $T_s$, $1\le s\le n$, is the corresponding desired output (target). For an arbitrary number of hidden-layer neurons, the least-squares objective function in the weight space of the network is
$$E=\frac{1}{nZ_M}\sum_{s=1}^{n}\bigl[T_s-O_s^M\bigr]^{\mathrm{T}}\bigl[T_s-O_s^M\bigr],$$
where $O_s^M$ is the output
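As a concrete reading of the objective above, the following sketch evaluates E for a batch of targets and network outputs. Treating $Z_M$ as the number of output-layer units is an assumption, since the snippet is truncated before $Z_M$ is defined.

    import numpy as np

    def batch_objective(T, O):
        # T, O : arrays of shape (n, Z_M) holding the desired outputs T_s
        #        and the network outputs O_s^M for the n training pairs.
        # Assumes Z_M is the number of output-layer units (not defined in
        # the truncated snippet above).
        n, Z_M = T.shape
        err = T - O                                  # T_s - O_s^M, row by row
        return float(np.sum(err * err) / (n * Z_M))  # (1/(n Z_M)) * sum_s ||T_s - O_s^M||^2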

Stability analysis

In this section, the convergence of the modified three-term BP algorithm is analyzed, and it is shown that the local minima of the least-squares error function are the only locally asymptotically stable points of the algorithm. Eq. (5) can be written as
$$W(k+1)=W(k)-\alpha\,\nabla E\bigl(W(k)\bigr)+\beta\,\Delta W(k-1)+\gamma\,e\bigl(W(k)\bigr).\tag{6}$$

Let $\varrho_1(k)=W(k)$ and $\varrho_2(k)=W(k)-W(k-1)$. Then a state-variable representation for Eq. (6) is
$$\varrho_1(k+1)=\varrho_1(k)-\alpha\,\nabla E\bigl(\varrho_1(k)\bigr)+\beta\,\varrho_2(k)+\gamma\,e\bigl(\varrho_1(k)\bigr)$$
and
$$\varrho_2(k+1)=-\alpha\,\nabla E\bigl(\varrho_1(k)\bigr)+\beta\,\varrho_2(k)+\gamma\,e\bigl(\varrho_1(k)\bigr).$$
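To illustrate the recursion, the sketch below simply iterates the state-variable form above; grad_E and e are caller-supplied callables standing in for the gradient and error terms, and starting from $\varrho_2(0)=0$ is an illustrative choice rather than a condition taken from the paper.

    import numpy as np

    def iterate_state_form(grad_E, e, W0, alpha, beta, gamma, n_steps=1000):
        # Iterates the state-variable form:
        #   rho1(k+1) = rho1(k) - alpha*grad_E(rho1(k)) + beta*rho2(k) + gamma*e(rho1(k))
        #   rho2(k+1) =         - alpha*grad_E(rho1(k)) + beta*rho2(k) + gamma*e(rho1(k))
        # grad_E and e are caller-supplied callables; rho2(0) is taken as zero.
        rho1 = np.asarray(W0, dtype=float)
        rho2 = np.zeros_like(rho1)
        for _ in range(n_steps):
            step = -alpha * grad_E(rho1) + beta * rho2 + gamma * e(rho1)
            rho1 = rho1 + step
            rho2 = step
        return rho1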

Lemma 1

$c=(c_1,c_2)$ is an equilibrium point of

Relationship between the learning parameters

From inequality (40), the value of the momentum factor β is bounded, while the values of the learning rate α, the proportional factor γ and the eigenvalue of F are bounded together in inequality (31). The aim of this section is to investigate sufficient stability conditions such that relationships between the learning parameters α and γ can be found and possible bounds for each of them can be established. The investigation is carried out by examining the conditions on the matrix D such

Numerical validation

This section illustrates the convergence behaviour of the three-term BP algorithm using a numerical example taken from Aleksander and Morton (1995). An inversion problem is considered, where the input pattern is a vector consisting of ones and zeros. The desired output is set to one if the corresponding input pattern is made up of zeros; otherwise the desired output is zero. In the first case, 50 sets of the learning rate, momentum factor and proportional factor for the algorithm are selected such that the
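A minimal reconstruction of the toy data set described above might look as follows; the bit width and the exhaustive enumeration of the binary patterns are illustrative assumptions, as the snippet does not specify them.

    import numpy as np
    from itertools import product

    def inversion_dataset(n_bits=4):
        # All binary input vectors of length n_bits; following the reading
        # of the problem above, the target is 1 when the pattern consists
        # entirely of zeros and 0 otherwise.  The bit width and exhaustive
        # enumeration are illustrative choices, not taken from the paper.
        X = np.array(list(product([0, 1], repeat=n_bits)), dtype=float)
        y = (X.sum(axis=1) == 0).astype(float)
        return X, y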

Conclusions

In this paper, the necessary and sufficient conditions for the convergence and stability behaviour of the three-term backpropagation algorithm are established. It is shown that if the coefficients of the three-term BP algorithm satisfy conditions (38) and (39), then it is guaranteed that the system is stable and will converge to a local minimum. Condition (39) may be violated if the eigenvalue of matrix F is relatively large, but in most cases all the minima which are of interest lie within a

Acknowledgements

The authors would like to thank Professor J.G. Taylor for his advice.
