SAGRAD: A Program for Neural Network Training with Simulated Annealing and the Conjugate Gradient Method

SAGRAD (Simulated Annealing GRADient), a Fortran 77 program for computing neural networks for classification using batch learning, is discussed. Neural network training in SAGRAD is based on a combination of simulated annealing and Møller’s scaled conjugate gradient algorithm, the latter a variation of the traditional conjugate gradient method, better suited for the nonquadratic nature of neural networks. Different aspects of the implementation of the training process in SAGRAD are discussed, such as the efficient computation of gradients and multiplication of vectors by Hessian matrices that are required by Møller’s algorithm; the (re)initialization of weights with simulated annealing required to (re)start Møller’s algorithm the first time and each time thereafter that it shows insufficient progress in reaching a possibly local minimum; and the use of simulated annealing when Møller’s algorithm, after possibly making considerable progress, becomes stuck at a local minimum or flat area of weight space. Outlines of the scaled conjugate gradient algorithm, the simulated annealing procedure and the training process used in SAGRAD are presented together with results from running SAGRAD on two examples of training data.


Introduction
Neural networks are computational models that work by simulating the way the brain processes information. They are often used to recognize patterns in a data set, say X , in Euclidean d − dimensional space, d some positive integer. Once the neural network is appropriately trained on representative sample patterns of X , it can then be used for attempting to recognize other patterns in X as they are fed through the network. Accordingly, it is assumed X is partitioned into n distinct types/classes of patterns, n some positive integer.
Let A be a set of training data for X , i.e., a subset of X in which the n distinct types/classes of patterns are well represented. The basic structure of a neural network associated with X (to be trained on A ) consists of layers or columns of mostly computing nodes, or neurons, arranged from left to right (see Fig. 1) in such a way that the result of a computation at each neuron in a layer contributes to the input of neurons in the next layer. The layer at the extreme left of the network is called the input layer of the network (see  k , = 1, , 1 k d+  , coordinate k a is assigned to neuron k (neuron with label k ) and as such interpreted to be the output of neuron k (neuron 1 d + is called a bias neuron and its output is 1 for all patterns). The layer immediately to the right of the input layer, unlike the input layer, consists of computing neurons (except for the last neuron which is a bias neuron). From left to right in the network it is the first layer with computing neurons and as such is called the first layer of the network (the layer immediately to the right of this layer is called the second layer of the network, and so on). Like the input layer, the first layer has 1 d + neurons which are then labeled with integers from 2 d + to 2 2 d + . Given integer i , x is designated the input to neuron i (in the first layer) which is a weighted sum of the outputs of the input layer (the coordinates of the augmented pattern a ) expressed as Here for each k , = 1, , 1 k d+  , ki w is the weight modifying the pattern coordinate k a before it is fed into neuron i (as part of i x ). In order to make neuron i into a computing neuron, the sigmoid activation function l , 2 l , 1 2 < l l , integers such that neurons in layer L are labeled with integers from 1 l to 2 l ; and neuron l , a neuron in layer L , The layer at the extreme right of the network is called the output layer of the network (see Fig. 1). Layers between the input layer and the output layer are called hidden layers (in Fig. 1 the first layer and second layer of the network are the only hidden layers), and hidden layers to the right of the first layer (there is only one, the second layer of the network, in the network of Fig. 1) are all assumed to be of the same length, i.e., to consist of the same number of neurons, a number greater than 1 and preferably greater than d and n . For consistency with definitions above involving consecutive layers M and L we assume at first that the output layer contains a bias neuron besides n computing neurons. As will become apparent below, there is a one-to-one correspondence between the n computing neurons in this layer and the n classes of patterns (as defined for X ) into which the set A of training data can be partitioned. After reducing the number of neurons in the output layer to n by dropping the dummy bias neuron in the layer, so that for some positive integer nq , nq is the total number of neurons in the network, neurons in the output layer are then labeled with integers from 1 nq n − + to nq . Additionally, letting nw be the total number of weights in the network, a natural order can be established for weights so that any given set of nw weights can be uniquely identified with a vector, called a weight vector, in weight space, the Euclidean space of dimension nw , and vice versa. Given a pattern a in A , then for some q , 1 q n ≤ ≤ , a is in class q , and an n − dimensional vector

∑ ∑∑
where w is the unique vector in weight space corresponding to the current set of weights in the network. As E is implicitly defined in terms of compositions of linear functions between layers in the network and activation functions assigned to neurons in the network, E has partial derivatives of all orders at any w . Accordingly, any optimization method of the gradient kind can be applied for the purpose of hopefully minimizing E . If the result of training the neural network on A , i.e., minimizing E (with gradients, metaheuristics, etc.), is a weight vector w at which E is zero then it must also be true that the neural network defined by w classifies correctly all patterns in the set A of training data, i.e., identifies correctly the class to which each pattern belongs. We say then that w is a reasonable solution. Additionally, if a subset of \ X A is also available in which the n distinct types/classes of patterns are also well represented, and each pattern in the subset is of known classification, then the neural network defined by w should be applied on such a subset for classification results. If the results for a good percentage of the patterns in the subset, say over 90 %, are correct then we say that besides being a reasonable solution, w is also a quality solution.
In this paper we discuss SAGRAD, a Fortran 77 program for computing neural networks for classification using batch learning. Classification is one of the most important applications of neural networks. An extensive survey on neural networks for classification can be found in [18]. On the other hand, batch learning is exactly the type of training described above where all patterns in training data are introduced into the network before the training of the network or minimization of the total error E begins. This is in contrast with on-line learning where training of the network is done one pattern at a time: each time a pattern in the training data is introduced into the network, training of the network takes place immediately starting at the current solution obtained from introducing the previous pattern, and the training is done only on exactly those patterns, including the current one, that have been introduced into the network so far. Neural network training in SAGRAD is based on a mixture of simulated annealing [15] and Møller's scaled conjugate gradient algorithm [7,9], the latter a variation of the traditional conjugate gradient method [5], better suited for the nonquadratic nature of neural networks. In what follows an outline of Møller's algorithm is presented that closely resembles the implementation of the algorithm in SAGRAD. In addition, other aspects of the implementation of the training process in SAGRAD are discussed such as the efficient computation of gradients and multiplication of vectors by Hessian matrices that take place in Møller's algorithm; the (re)initialization of weights with simulated annealing required to (re)start Møller's algorithm the first time and each time thereafter that it shows insufficient progress in reaching a possibly local minimum; and the use of simulated annealing when Møller's algorithm, after possibly making considerable progress, becomes stuck at a local minimum or flat area of weight space. Outlines of the simulated annealing procedure and the training process used in SAGRAD are also presented together with results from training with SAGRAD data for two examples. A copy of SAGRAD can be found at http://math.nist.gov/~JBernal.

Scaled Conjugate Gradient Algorithm
SAGRAD is based on a combination of simulated annealing [15] and Møller's scaled conjugate gradient algorithm [7,9] for minimizing the total squared error E as a function of weights. Møller's algorithm, an outline of which is presented below, is based on the well-known conjugate gradient method [5] which works well for quadratic or nearly-quadratic functions. Since the Hessian matrix ( ) E w ′′ of the squared error function E at w may not be positive definite for w in certain areas of weight space, Møller modified the conjugate gradient method based on the approach of the Levenberg-Marquardt algorithm [2]. If at some point during the execution of the conjugate gradient method for some p and w in nw is computed resulting in a nonpositive δ , one makes δ positive by adding t p p λ to it for some > 0 λ , i.e., by scaling the Hessian matrix ( ) E w ′′ with the appropriate > 0 λ so that δ becomes ( ( ) ) t p E w I p λ ′′ + , I the identity matrix. Once λ is initialized it is used and adjusted appropriately throughout the execution of the algorithm so that each δ computed as above remains positive. However, since the accuracy of the conjugate gradient method depends on approximating ( ) E w with a quadratic function that involves ( ) E w ′′ , care must be taken that the scaled ( ) E w ′′ does not produce a bad approximation. This is again taken care of by appropriately raising and lowering λ . The outline of the scaled conjugate gradient algorithm below includes the manipulations for raising and lowering λ . Here the column vector ( ) E w ′ is the gradient of E at weight vector w . The outline closely resembles the implementation of Møller's algorithm in SAGRAD.

Computing the Gradient
In order to attempt to minimize the error E as a function of w using the scaled conjugate gradient algorithm as described above, the capability must exist for the efficient computation of the gradient ( ) E w ′ of E at w and multiplication of a vector by the Hessian matrix ( ) E w ′′ of E at w . In this section we develop formulas used in SAGRAD for the computation of the gradient ( ) E w ′ as presented in [4]. They originate from the so-called delta rule in [13], [17].
Given a A ∈ , w in weight space, the error at w due to a is 2 =1 . Therefore, by fixing a in A , in what follows it will suffice to develop only the formulas associated with a E .
As will become apparent from the formulas below, the calculations of the partial derivatives of a E with these formulas must take place in a specific order, from right to left in the network. This is because each calculation corresponding to a given weight depends on calculations corresponding to other weights in the network to the right of the given weight. Computing in this manner is called backpropagation, originally described in [16]. However, it is an implementation issue that is taken care of in SAGRAD and not necessary for the development of the formulas. Case 1. Layer K is not the input layer and layer M is a hidden layer so that layer L is either a hidden layer or the output layer.

Fast Exact Multiplication by the Hessian
In this section we develop the formulas used in the implementation of the scaled conjugate gradient algorithm in SAGRAD for the fast exact computation of the product of the Hessian matrix ( ) E w ′′ with an nw − dimensional vector v in the context of Møller's algorithm. With these formulas the calculation of the complete Hessian matrix is avoided. These formulas were originally derived by Pearlmutter [12] and Møller [8,9], and involve the so-called {} ⋅  operator. As in the case of the gradient ( ) E w ′ , by fixing a in A , in what follows it will suffice to develop only the formulas associated with a E . These formulas depend on the formulas developed above for the computation of the gradient ( ) E w ′ , thus simultaneously as ( ) E w ′ is computed with backpropagation, the exact product of v and ( ) E w ′′ is computed with these formulas in either a feed-forward fashion or in the manner of backpropagation. But again this is an implementation issue that is taken care of in SAGRAD and not necessary for the development of the formulas.
Let f be a differentiable function from nw − dimensional Euclidean space into any other finitedimensional Euclidean space.
, it also follows that for each k , = 1, , w v  With these equations and the formulas obtained in the previous section for the components of ( ) a E w ′ , the formulas for the components of ( ) a E w v ′′ can be derived. In what follows weights are doubly indexed. Since there is a one-to-one correspondence between the components of w and v then the components of v will be similarly indexed.
As Case 1. Layer K is not the input layer and layer M is a hidden layer so that layer L is either a hidden layer or the output layer.
Applying {} ⋅  on i x and i y , we get the feed-forward formulas: a i E y ∂ ∂ , as computed in the previous section for case 1, we get the backpropagation formulas:

Simulated Annealing
An outline of the simulated annealing procedure used in SAGRAD follows. It is based on the simulated annealing procedure presented in [15].
In the outline that follows ε is a tolerance reasonably chosen (e.g., 3 = 10 ε − ). Accordingly if b w is the current best solution found by the procedure and the squared error for b w , i.e., = ( ) b b E E w , is less than ε then b w is declared to be a reasonable solution and the procedure is terminated. 1. Input: i w , nw ; i w a weight vector of nw coordinates, > 10 nw .
2. Initialize 1 K , 2 K , temprture , tfactor , coef , ε (e.g., 1 Else if < E E (the input solution i w was improved and c w now equals the best solution found by the procedure). 9. Output: c w , c E , 0 k (if 0 > 0 k then the input solution i w was improved and c w equals the best solution found by the procedure; if 0 = 0 k then the initial solution i w was not improved and c w equals the last solution that was not an improvement but was still accepted by the procedure).
Two versions of the simulated annealing procedure outlined above are used in the training process in SAGRAD. One of low intensity with a relatively small number of iterations, high initial temperature and small neighborhood of exploration (currently with 1  As pointed out in [6], neural network training is typically a two-step process. First, with a method such as the low-intensity version of the simulated annealing procedure mentioned above that tends to elude local minima, weights are initialized. Then an optimization algorithm such as the conjugate gradient method is applied in hopes of finding a global minimum. In the training process in SAGRAD we follow a slight variation of this two-step strategy while adding an additional step. The additional step involves the high-intensity version of the simulated annealing procedure mentioned above that intensively exploits weight space for a possible global solution. It is used principally when the scaled conjugate gradient algorithm (in the second step), after possibly making considerable progress, becomes stuck at a local minimum or flat area of weight space.

Training Process
In SAGRAD, neural network training is essentially a three-step process that while still following the two-step strategy in [6], combines in a slightly different manner the two versions of the simulated annealing procedure mentioned above with Møller's scaled conjugate gradient algorithm. An outline of the training process in SAGRAD in terms of the three steps follows below. There and in what follows a weight vector will be declared to be a reasonable solution if for some reasonably chosen ε (e.g., 3 = 10 ε − ), the squared error for the weight vector is less than ε . It should also be noted that at different times during the execution of the process, in order to provide the user with the option of getting out of a possibly bad run of the training process, the user will be asked to decide on whether or not to terminate the current run of the training process. If the run is terminated then the user will be asked to decide on whether SAGRAD should stop or do a new cold start of the training process.
In the outline of the training process below note that before the third step is executed (at most once), the first two steps, one following the other, may be executed several times in hopes that the scaled conjugate gradient algorithm in the second step will eventually compute a reasonable solution. The scaled conjugate gradient algorithm in the second step uses as its initial solution the output weights from the last execution of the low-intensity simulated annealing in the first step. On the other hand, the low-intensity simulated annealing in the first step uses as input the best weights found so far among all executions of the scaled conjugate gradient algorithm in the second step except at the start of the execution of the process. At the start of the execution of the process the low-intensity simulated annealing in the first step uses input weights that are randomly generated in the interval ( 1,1) − . It should be noted that all executions of the low-intesity simulated annealing in the first step tend to produce good initial solutions for the scaled conjugate gradient algorithm in the second step. However it is the first execution of the low-intensity simulated annealing in the first step that usually reduces considerably the squared error while all others usually produce no reduction at all. Eventually, as the first two steps are repeatedly executed, either a reasonable solution is found and the process is terminated, or the third step, which involves an execution of http://dx.doi.org/10.6028/jres.120.009

Cushing Syndrome Classification
Here we present results from running SAGRAD on a small example associated with the so-called Cushing syndrome. This is an example used in [3] as an application of neural networks for classification.
The Cushing syndrome is a disorder that occurs when the body is exposed to high levels of the hormone cortisol for a long time. Three types of the syndrome are recognized: adenoma, bilateral hyperplasia, and carcinoma. In the presence of the Cushing syndrome the following observations were made that represent urinary excretion rates (mg/24h) of the steroid metabolites tetrahydrocortisone (in the second column below) and pregnanetriol (in the third column). Each line of observations has a label that appears in the first column, and each of the lines corresponds to an individual identified with each of the observations in the line, an individual with a known type of the syndrome. Accordingly, lines labeled a1, ..., a6 correspond to individuals with the adenoma type; lines labeled b1, ..., b10 correspond to individuals with the bilateral hyperplasia type; and lines labeled c1, ..., c5 correspond to individuals with the carcinoma type. Lines labeled u1, ..., u6 correspond to individuals with the syndrome, each individual with an unknown type of the syndrome. Finally, the fourth and fifth columns of the data below have the same data as the second and third columns, respectively, but on a log scale. From these results it appears that adenoma is the type corresponding to line u4; bilateral hyperplasia is the type corresponding to lines u1, u3, u5, u6; and carcinoma is the type corresponding to line u2. This classification of these lines is consistent with the classification of the same lines in [3].

Wine Classification
Data in [1] is the result of a chemical analysis of wines produced in the same region in Italy from three different cultivars. Each line in the data corresponds to a wine and contains quantities of 13 constituents in the wine that were determined through the chemical analysis.
The 13 constituents were: 1) Alcohol 2) Malic acid 3) Ash 4) Alkalinity of ash 5) Magnesium 6) Total phenols 7) Flavanoids 8) Nonflavanoid phenols 9) Proanthocyanins 10) Color intensity 11) Hue 12) OD280/OD315 of diluted wines 13) Proline Training data for SAGRAD was obtained from [1] as follows. The first 50 lines of data for wine from the first cultivar were extracted from the data and identified as Class 1 training data; the first 60 lines of data for wine from the second cultivar were extracted from the data and identified as Class 2 training data; and the first 40 lines of data for wine from the third cultivar were extracted from the data and identified as Class 3 training data. In all cases each line of data consisted of 13 numbers corresponding in the same order to the quantities of constituents listed above. For example, the first line in the Class 1 training data appeared exactly as follows: Using this data, SAGRAD was then executed, and after two cold starts of training process, training was completed on a 4-layer network associated with the data. The input layer of this network had 14 nodes, the first layer had 14, the second layer had 15, and the output layer had 3. For the purpose of testing the trained network the remaining 9 lines of data for wine from the first cultivar were extracted from the data and identified as Class 1 independent data; the remaining 11 lines of data for wine from the second cultivar were extracted from the data and identified as Class 2 independent data; and the remaining 8 lines of data for wine from the third cultivar were extracted from the data and identified as Class 3 independent data. http://dx.doi.org/10.6028/jres.120.009 gradient algorithm which is a variation of the traditional conjugate gradient method, better suited for the nonquadratic nature of neural networks, an outline of Møller's algorithm was presented that resembles its implementation in SAGRAD. Important aspects of the implementation of the training process in SAGRAD were discussed such as the efficient computation of gradients and multiplication of vectors by Hessian matrices that are required by Møller's algorithm. Accordingly, formulas for the product of vectors by Hessian matrices depending on those for the gradients used in SAGRAD were developed. Because of this dependence it was pointed out that calculations with these formulas of the gradient at a vector and the product of the Hessian at the same vector with another vector in the context of Møller's algorithm occur simultaneously and take place in either a feed-forward fashion or in the manner of backpropagation.
As neural network training in SAGRAD is also based on simulated annealing, an outline of the simulated annealing procedure implemented in SAGRAD was also presented. It was then pointed out that two versions of this procedure are used in the training process in SAGRAD, one a low-intensity version for the (re)initialization of weights required to (re)start the scaled conjugate gradient algorithm the first time and each time thereafter that it shows insufficient progress in reaching a possibly local minimum; and the other a high-intensity version to be used once the scaled conjugate gradient algorithm has possibly reduced the squared error considerably but becomes stuck at a local minimum or flat area of weight space. An outline of the training process was then presented.
Finally, results from executions of SAGRAD were reported. SAGRAD was run on two essentially small examples of training data consisting of sample patterns of dimension 2 and 13, respectively. The trainings with SAGRAD of the training data for the two examples were declared to be good as reasonable solutions were obtained. Classification results were then presented from applying the corresponding neural networks on the trained data, and other independent data of known and unknown classification, and these results were also declared to be good as for over 90 % of the patterns in the data the results were correct.
It should be noted that in general it may take several cold starts of the training process in SAGRAD before a reasonable solution is obtained. Such a solution should then be tested for quality by applying the corresponding neural network on independent sample patterns of known classification. A copy of SAGRAD can be obtained at http://math.nist.gov/~JBernal.