A New Oscillating-Error Technique for Classifiers

This paper describes a new method for reducing the error in a classifier. It uses an error correction update that includes the very simple rule of either adding or subtracting the error adjustment, based on whether the variable value is currently larger or smaller than the desired value. While a traditional neuron would sum the inputs together and then apply a function to the total, this new method can change the function decision for each input value. This gives added flexibility to the convergence procedure, where through a series of transpositions, variables that are far away can continue towards the desired value, whereas variables that are originally much closer can oscillate from one side to the other. Tests show that the method can successfully classify some benchmark datasets. It can also work in a batch mode, with reduced training times and can be used as part of a neural network architecture. Some comparisons with an earlier wave shape paper are also made.


DCS 21 November 2017

Introduction
Neural networks, and classifiers in general, are statistical processors. They all work by trying to reduce the error in the system through an error correction method that includes transposition through a function. Neural networks in particular are based loosely on the human brain, with a distributed architecture of relatively simple processing units. Each neural unit solves a small part of the problem, where collectively, they are able to solve the whole problem. Being statistical classifiers, they try to converge to some solution without any level of intelligence outside of the pre-defined function. This works very well for a statistical system, but the simulation of a brain-like neuron could include a little bit more. A real neuron does get involved in different kinds of biochemical reaction [4] [29] and may even have a type of memory [26]. For this paper, the neuron is able to react to its input and apply a very simple rule of either adding or subtracting the error adjustment, based on whether the variable value is currently larger or smaller than the desired value, and on a variable by variable basis. The decision is based on the most basic of reactions and so it could be part of an automatic theory. It is also well known that resonance is a feature of real brain operations and other simulation models [3] [14]. The idea of resonance would be to use the data shape to determine what values go together, where earlier research [13] and this paper suggest that the data shape can be represented by a single averaged value. The procedure is shown to work surprisingly well and to be very flexible, and so it should be taken seriously as a general mechanism.
The rest of this paper is organised as follows: section 2 briefly outlines the reasons for the new method. Section 3 introduces some related work and section 4 describes the theory behind the new classifier. Section 5 runs through a very simple test example, while section 6 gives the result of some tests on real datasets. Finally, section 7 gives some conclusions to the work.

Reasons for the New Method
The proposed method would give the component slightly more flexibility, or if arguing for a neural component, then a small amount of intelligence, but still keep it at a most basic and automatic level. Each variable can reduce its error in a way that best suits it, with a dampening effect that is independent of the other variables. Basically, if the data point (variable value) is less than the desired value, the weight adjustment is added to it and if it is larger than the desired value, the weight adjustment is subtracted from it. This means that variables of the same input set to the neuron could be treated differently when the neuron applies the function, which gives added flexibility to the convergence procedure. Through a series of transpositions or levels in the classifier, a variable that is far from the correct value can be adjusted by the full amount in the same direction each time. A variable that is at the correct value can oscillate around it and therefore some of the adjustment size can even be removed. The method is implemented here in matrix form, but as it uses a neuron-like architecture, it can be compared more closely with neural networks, or simply be seen as a general update mechanism. The weight correction is added or subtracted and not multiplied. The data works best with some form of normalisation, but considering a binary-style of reduction, it does not take many steps for the error to reduce. The error correction is also calculated by using the input and desired output values only and not any intermediary error value sets, although this perhaps treats the whole matrix as a single hidden unit. One other advantage of the method is the fact that it is not necessary to fine-tune the classifier, with appropriate random weight sets, for example. The weight correction procedure will always be the same and only a stopping criterion is required, along with the dataset pre-processing.

Related Work
Related work would therefore include neural networks [27][31] and the resonance type in particular [3] [14]. The Adaptive Resonance Theory is an example of trying to use resonance, created by a matching agreement, as part of a neural network model. It is also categorical in nature, but can learn category patterns and includes a long-term memory component that is a matrix of weight updates. The primary intuition behind the ART model is that object identification and recognition generally occur as a result of the interaction of 'top-down' observer expectations with 'bottom-up' sensory information and the idea of resonance is the agreement between these two processes. Resonance suggests a repeating value or state, which then suggests an averaged value, which is why it may be possible to represent a wave shape that way. The Fuzzy-ART system uses what is called a one-shot learning process, where each input item can be categorised after just one presentation. Cellular automata possibly have some relation as well [32] [5], because the new neural component is at a similar level of complexity. It is not usual for a neural component to make a decision, but the decision is so simple that it might be compared to a reaction. The paper [15] is also interesting in this respect, with their Gauss-Newton gradient descent Marquardt algorithm.
It uses batch processing to compute the average sum of squares over the dataset error, and can add or subtract a value from the step value, which is also a feature of the related Marquardt-Levenberg algorithm. So in fact, these algorithms do make a similar decision, although it applies to the weight rather than the value itself. The rule that the new neuron uses can probably make the best fit result non-linear, even if it is linear with respect to time.
Attempts to optimise the learning process have been made since the early days of neural networks. Kolmogorov's theorem [2] [22] is often used to support the idea that a neural network can successfully define any arbitrary function using just one hidden layer [17]. The theorem states that each multivariate continuous real-valued function can be represented as a superposition and composition of continuous functions of only one variable. While Deep Learning has improved on this, the single hidden layer would still be an idea behind the model of this paper. The paper [10] gives a summary of some early attempts, including batch processing and even the inclusion of rules, but as part of different types of learning frameworks. It is interesting that rules and discrete categories or activations are all quite old ideas. More recently, the deep learning neural network models [18] adopt a policy of many more levels than the earlier backpropagation ones. These new networks include a feedback from one level to previous ones, as well as continuously refining the function, to learn mid-level structures or features. Some Convolutional Neural Networks can also be trained in a one-shot mode. The paper [19], for example, can train the network using only one labelled example per category, as part of a data reduction or transformation process. One-shot learning therefore appears to be the term that was originally used. The paper [12] also uses batch processing or averaging of the input dataset, and uses the term single-pass to mean a similar thing.
Resonance is mentioned because an earlier neural network paper [13] tried to encapsulate the dataset shape into a single averaged value and these papers [3][12] that are interested in resonance also try to condense the input data rows into vectors of single averaged values.
In that case, a relative size of a scalar becomes important, but discriminating comparisons must still be made. To help with this, the dataset is separated for each output category, so that the averaged value applies to one category only. The justification is that each neuron always has to accommodate all of the data that passes through it and so it has to produce an average evaluation for that. Thus, averaging the input data could become a very cheap way of describing the data shape. While the closest classifier might be a neural network, this new model uses a matrix-like structure that contains a number of transitions from one layer to the next. These are however relatively simple transformations of adding or subtracting a value and are really just steps in the same error reduction procedure.

Background Theory and Method Description
The theory of the new mechanism started with looking at the wave shape paper [13], which is described first with some new details. After that, the new oscillating error mechanism is described.

Wave Shape Algorithm
This was proposed in [13] as an alternative way of looking at the relative input and output value sets. The idea was that the value differences would describe a type of wave shape and similar shapes could be combined in the synapses, as they would produce the same type of resonance. That design also uses average values, where both the input and the output can be summed and averaged over each column (all data rows), to represent each variable field with the average value. Tests do in fact show a substantial reduction in the error of the average input to the average output using this method and even on established datasets, such as the Wine dataset [7] [28]. The problem was that while the error could be reduced, it was reduced to an average output value that is not very accurate for each specific instance.
For example, if the output values are 1, 2 and 3, then the input dataset could be averaged to produce a value close to 2, but this is not very helpful when trying to achieve an exact output value of 1 or 3. That procedure, based strongly on shape, could be more useful for modelling the synapses, whereas the neuron needs to compare with the desired result.
Therefore, using actual values instead of differences is probably more appropriate. For example, if the input dataset is 2, 8, 4, 5, 10; then you can measure the average of these values, or the average of their differences: 6, -4, 1, 5. As part of a theory, the synapses could consider shape more than an actual value, as they try to sync with each other, while the neuron compares with the actual result. So possibly, modelling the network can consider that neurons and synapses are measuring a different type of quantity over the same value set and for a different purpose: one to reinforce a type of signal (synapse) and one to produce a more exact output (neuron). As stated however, averaging over the whole dataset makes the network too general and so possibly the ideas of the next section can be tried.
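
The two quantities in the example above can be computed directly. The following is only a small Python sketch of that arithmetic; the variable names are illustrative and are not taken from the original test program.

```python
# Values from the example in the text.
values = [2, 8, 4, 5, 10]

# Differences between consecutive values: 8-2, 4-8, 5-4, 10-5.
differences = [b - a for a, b in zip(values, values[1:])]

# Average of the actual values (what the neuron would compare with a result).
mean_value = sum(values) / len(values)

# Average of the differences (a description of the shape only).
mean_difference = sum(differences) / len(differences)

print(differences)      # [6, -4, 1, 5]
print(mean_value)       # 5.8
print(mean_difference)  # 2.0
```

The averaged difference depends only on the first and last values (10 - 2 = 8, spread over 4 steps), which illustrates why a shape measure on its own says little about the exact value that has to be produced.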

Oscillating-Error Method
This is the new algorithm of the paper and resulted from trying to make the input to output mapping of the last section more accurate. The new neuron can take an input from each variable or column and adjust it by either adding or subtracting the weight update, on a variable by variable basis. As the error oscillates from one side to the other, a bit of it gets removed, as the current difference, and so it will necessarily reduce in size. The new neuron is therefore the same as a traditional one, except for the inclusion of the rule as part of the calculation and separate weight sets for each category, during training. The new mechanism has been tried using batch values, as for section 4.1, but the learning procedure is different to the earlier models mentioned in section 3. It has been implemented in a matrix form of levels that pass each input to the next level and is not as flexible as a neural network, but the units that are used would be suitable for neural networks in general. The calculations are really only the ones described later and the equations suggest that time would be linear with increasing dataset size or number of levels. The tested datasets required only a second or less to be classified, where additional time to create the initial category groupings might be the only consideration. The pre-processing however creates the batch rows, only 1 for each category, and so far fewer rows are subsequently used for training. This paper only considers categorical data, where each input row belongs to a single category. If represented by a single output neuron however, this can still produce a range of output values, but they represent a discrete set instead of a continuous one. In the case of the Wine dataset [7], the 3 output categories can be represented by the values 0, 0.5 and 1.0, for example. As described in section 4.1, the current wave shape method is not accurate enough, as it averages over all categories.
The new method therefore sums and averages over each category group separately. In effect, it divides the dataset into batches, representing the rows in each category and produces an averaged data row for each category group. For the Wine dataset, there are therefore three sets of input data, one for each category, represented by 3 averaged data rows. These then update the classifier separately, which stores different sets of weight or error correction values for each category group. The weight value sets can then be combined into a single weight value set after they are learned, to be used over any new input. For the Wine dataset, during training for example, the structure would store 3 sets of 13 weight or error correction values, relating to the 3 output categories and the 13 input variables. After the error corrections have been determined, the 3 values for each variable are summed and averaged to produce the value to be used by the classifier on any classification task. This also becomes the starting set of weight update values for the next network layer. The method also vertically adjusts the error, instead of using a multiplication factor.
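
The batching and the later combination of the per-category weight sets might look as follows. This is a minimal Python sketch and not the author's C# implementation; the function names `category_means` and `combine_weight_sets`, and the toy data, are assumptions made for illustration.

```python
from collections import defaultdict

def category_means(rows, labels):
    """Average every column over the rows of each category.

    Returns a dict: category label -> one averaged data row.
    """
    groups = defaultdict(list)
    for row, label in zip(rows, labels):
        groups[label].append(row)
    return {
        label: [sum(column) / len(column) for column in zip(*members)]
        for label, members in groups.items()
    }

def combine_weight_sets(weight_sets):
    """Average the per-category weight sets, variable by variable,
    into the single weight set stored for a layer."""
    return [sum(column) / len(column) for column in zip(*weight_sets)]

# Hypothetical toy data: 3 rows, 2 variables, 2 categories.
rows = [[1.0, 2.0], [3.0, 4.0], [8.0, 6.0]]
labels = ["a", "a", "b"]
means = category_means(rows, labels)
print(means)  # {'a': [2.0, 3.0], 'b': [8.0, 6.0]}

# Two hypothetical per-category weight sets averaged into one.
print(combine_weight_sets([[1.0, 2.0], [3.0, 6.0]]))  # [2.0, 4.0]
```

For the Wine dataset the same calls would produce 3 averaged rows of 13 values each, and `combine_weight_sets` would reduce 3 sets of 13 error corrections to a single set of 13.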

Training Algorithm
The following algorithm helps to describe the process:
1. Group all data rows for each output category. Each group is then processed separately during training.
   a. For each category group, sum and average all input points for each variable (or data column) to produce an averaged data row for that category.
2. Create a new matrix layer:
   a. For each category group, process each value of its averaged data row:
      i. If the value is smaller than the desired output value, then add the previous layer's averaged weight correction value to it.
      ii. If the value is larger than the desired output value, then subtract the previous layer's averaged weight correction value from it.
      iii. Measure the difference between the new weight-corrected value and the desired category output. Take the absolute value of that as the weight error correction value for the data point in the category group.
      iv. The error value can also be summed and compared with earlier layers, to evaluate the stopping criterion.
   b. The weight update method is essentially a single event that sets the value for the category group in the layer.
   c. After evaluating the weight sets for each category group separately, average over them and store the averaged list as a new transposition layer in the matrix.
3. The transposed values can also be stored as each new layer is added, to make the next learning phase quicker. It can continue from the last layer, instead of running the values through the whole matrix again.
4. Go to step 2 to create the next matrix layer in the structure, and repeat the process until a stopping criterion is met.
5. A stopping criterion can be a number of iterations, or the point at which the total error no longer reduces by a substantial amount.
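
The layer-building loop described above can be sketched in Python as follows. This is one possible reading of the steps, not the original C# program. It assumes that the averaged row for each category has already been produced, that no correction is applied before the first layer (weights of 0), and that a value already equal to the desired output is left unchanged. The names `adjust`, `train`, `max_layers` and `min_gain` are hypothetical.

```python
def adjust(value, target, weight):
    """Add the weight if the value is below the target, subtract it if above."""
    if value < target:
        return value + weight
    if value > target:
        return value - weight
    return value  # assumption: a value already on target is not moved

def train(mean_rows, targets, max_layers=20, min_gain=1e-6):
    """Build the matrix of layers from one averaged row per category.

    mean_rows: dict category -> averaged data row
    targets:   dict category -> desired output value
    Returns a list of layers, each an averaged list of weight corrections.
    """
    n_vars = len(next(iter(mean_rows.values())))
    # Transposed values are kept between layers, so each new layer
    # continues from the last one instead of re-running the whole matrix.
    current = {c: list(row) for c, row in mean_rows.items()}
    weights = [0.0] * n_vars  # assumption: nothing to apply before layer 1
    layers = []
    previous_total = float("inf")
    for _ in range(max_layers):
        per_category = []
        total = 0.0
        for c in current:
            t = targets[c]
            moved = [adjust(v, t, w) for v, w in zip(current[c], weights)]
            errors = [abs(v - t) for v in moved]
            current[c] = moved
            per_category.append(errors)
            total += sum(errors)
        # Stopping criterion: the total error no longer reduces enough.
        if previous_total - total < min_gain:
            break
        # Average the category weight sets into one layer.
        weights = [sum(col) / len(col) for col in zip(*per_category)]
        layers.append(weights)
        previous_total = total
    return layers

# Hypothetical averaged rows for two categories with two variables.
layers = train({"a": [1.0, 3.0], "b": [6.0, 9.0]}, {"a": 2.0, "b": 8.0})
print(layers)  # [[1.5, 1.0], [0.5, 0.0], [0.0, 0.0]]
```

In this toy case the first variable overshoots in one category and undershoots in the other after the first correction, and the averaged weight then halves, which is the oscillating reduction that the text describes.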

Example Trace of a Scenario
The following scenario traces through the process for a dataset with 5 variables. The example assumes that they have already been grouped for the output category and is intended to demonstrate the error correction procedure only. The desired output category value is '4'. The following steps show how the variables can converge to that value at each iterative step.
• Determine the new difference from the desired output to get the new weight set.
Input plus/minus error correction to layer 2: 4, 4, 4, 4, 4
Input-Output Differences = Abs(4 - 4), Abs(4 - 4), Abs(4 - 4), Abs(4 - 4), Abs(4 - 4)
Absolute error = 0, 0, 0, 0, 0
Continue until the stopping criterion is met. In this case, the error is now 0.
It is interesting that with a single output category, this method reduces the error to 0 in 1 step. If there are several output categories and their weights sets are averaged, then the weight update will not necessarily reduce the error to 0. Also, if there was another layer, then it would adjust input values that are '0, 0, 0, 0, 0' and not the original input value set.
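
A trace of this kind can be reproduced with a few lines of Python. The five input values below are hypothetical, chosen only for illustration, and the sketch assumes that the first weight set is simply the absolute difference between each input and the desired output.

```python
target = 4.0
inputs = [1.0, 3.0, 5.0, 6.0, 2.0]  # hypothetical input values

# Layer 1: the absolute input-output differences become the weight set.
weights = [abs(v - target) for v in inputs]

# Layer 2: add the weight to values below the target, subtract it otherwise.
layer2 = [v + w if v < target else v - w for v, w in zip(inputs, weights)]

# New absolute error after the correction.
errors = [abs(v - target) for v in layer2]

print(weights)  # [3.0, 1.0, 1.0, 2.0, 2.0]
print(layer2)   # [4.0, 4.0, 4.0, 4.0, 4.0]
print(errors)   # [0.0, 0.0, 0.0, 0.0, 0.0]
```

With a single category the weight for each variable is exactly its own distance from the target, so one correction lands on the target. Once the weight sets of several categories are averaged, that exact match is lost, and the residual error is what later layers continue to reduce.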

Test Results
A test program has been written in the C# .Net language. It can read in a data file, normalise it, generate the classifier from it and measure how many categories it subsequently evaluates correctly. The classifier was designed with only one output node, as described in section 4.2. The input values were also normalised. Therefore, 3 categories would produce desired output values of 0, 0.5 and 1. The conversion from a category to a real number is not implicit in the data and so it is possible to use a value range to represent each category, just as easily as a single value. It might be interesting however for numerical data, if specific output values can be learned accurately. The error margin that is discussed as part of the result does not relate to distributions, but relates to the smallest margin around the output value representing the category that will give the best percentage of correct classifications.
The representative value is still what the classifier tries to learn, but then a value range around that can only reduce the number of errors. For example, consider 3 categories again.
These are represented by the output values 0 (category 1), 0.5 (category 2) and 1.0 (category 3), which gives a gap of '0.5' between each value. It would therefore be possible to measure up to 49% of that gap, either side of a category value and still be 100% reliable with respect to the category classification. A 20% error margin, for example, would be calculated as 0.5 * 20 / 100 = 0.1. This would mean that a range of 0.4 to 0.6 would be classified as category 2 and anything outside of this range could be classified as incorrect. A 15% margin of error would mean that the range would have to be 0.425 to 0.575, and so on. So a smaller error margin would simply indicate that the classifier could be more accurate to an exact real value and there is no ambiguity over the results presented in this paper. Binary data could also be handled equally easily.
The process is completely deterministic. There are no random variables and so a dataset with the same parameter set will always produce the same result. Two types of result were measured. The first was an average error for each row in the dataset, after the classifier was trained, calculated as the average difference between actual output and the desired output value. The second measurement was how many categories were correctly classified, but also with a consideration of the value range (error margin) just discussed. If increasing the margin around a category value did not substantially increase the number of correct classifications, then maybe it would not be worthwhile.
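
The margin arithmetic and the two measurements can be sketched in Python as follows. The function names and the sample outputs are assumptions for illustration and do not come from the original test program.

```python
def margin_range(category_value, gap, margin_pct):
    """Range around a category value for a margin given as a
    percentage of the gap between neighbouring category values."""
    half_width = gap * margin_pct / 100
    return category_value - half_width, category_value + half_width

def mean_abs_error(outputs, desired):
    """First measurement: average difference between actual and desired output."""
    return sum(abs(o - d) for o, d in zip(outputs, desired)) / len(outputs)

def count_correct(outputs, desired, gap, margin_pct):
    """Second measurement: rows whose output falls inside the margin
    around the desired category value."""
    half_width = gap * margin_pct / 100
    return sum(1 for o, d in zip(outputs, desired) if abs(o - d) <= half_width)

# Category 2 (value 0.5) with a gap of 0.5 between category values.
low20, high20 = margin_range(0.5, 0.5, 20)  # approximately 0.4 and 0.6
low15, high15 = margin_range(0.5, 0.5, 15)  # approximately 0.425 and 0.575

# Hypothetical classifier outputs and their desired category values.
outputs = [0.05, 0.45, 0.7, 1.0]
desired = [0.0, 0.5, 0.5, 1.0]
correct = count_correct(outputs, desired, gap=0.5, margin_pct=20)
average_error = mean_abs_error(outputs, desired)
print(correct, "of", len(outputs), "within a 20% margin")
```

Only the third output (0.7 against a desired 0.5) falls outside the 20% margin here, so widening the margin trades exactness of the real-valued output for a higher count of correct classifications.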

Benchmark Datasets with Train Versions Only
The classifier was first tested on 3 datasets from the UCI Machine Learning Repository [28].
Recent work [12] has tested some benchmark categorical datasets, including the Wine Recognition database [7], Iris Plants database [6]
The paper [20] tested a number of datasets, including Iris, Wine and Zoo, using k-NN and neural network classifiers, with maybe 95.67%, 96% or 94.5% as the best results from one of the classifiers respectively. The values presented here are therefore probably better than that. It also tested the Hayes-Roth dataset, but to only 50% accuracy. Other papers have quoted better results and there is a test dataset available, but without any specified categories. None of the other quoted results are close to 100% however. The paper [11] tested the Liver dataset [24] to 74% accuracy using a sparse grid method, but the new method achieves 100% accuracy in only 2 iterations. The table shows that for all datasets, the error between the desired and the actual output values has reduced to practically zero, but different margins of error are required for the number of correct classifications to be optimised. The percentages still compare favourably with the other researchers' results.

Separate Train and Test Datasets
Four datasets were tried here, where two of them, User Modelling [21] and Bank Notes [25], were also tested in [12]. They have separate test datasets to the train datasets. This is typically what a supervised neural network should be able to do and the results of this section, given in Table 2
The User Modelling dataset [21] was used as part of a knowledge-modelling project that produced a new type of classifier in that paper. Their classifier was shown to be much better than the standard ones for the particular problem of web page use, classifying to 97.9% accuracy. This was compared to 85% accuracy for a k-NN classifier and 73.8% for a Bayes classifier. This new model however appears to classify even better, at 98.5% accuracy.
Another test tried to classify the bank notes dataset [25]. These were scanned variable values from 'real' or 'fake' banknotes, where the output was therefore binary. This is another different type of problem, where a Wavelet transform might typically be used. The dataset again contained a train and a test dataset, where the best classification realised 100% accuracy. In that paper they quote maybe only 61% correct classification, but other papers have quoted close to 100% correct for similar problems.
A third dataset was a heart classifier from SPECT images [23]. While they noted 84% accuracy on the test dataset using a sparse grid method, the new method can achieve 100% accuracy. A fourth dataset was a letter recognition task [8]. Letters were categorised into one of 26 alphabet types, where there were 20000 instances in total, with 16000 instances in the train set and 4000 instances in the test set. They used a fuzzy exemplar-based rule creation method, but achieved 82% accuracy as compared to 92% accuracy here.

Conclusions
This paper describes a new type of weight adjustment method that can be used as part of a classifier, or a neural network in particular. It is basically a neural unit with the addition of a very simple rule. The inclusion of the comparison rule however gives the mechanism much more control over weight updates and the unit could still operate in an almost automatic manner. The classifier does not need to learn any complex data rules, but for best results, data normalisation would be required. Another feature is the fact that the weight value can be added or subtracted, and not multiplied, which is the usual mechanism. Another potential advantage is the fact that it can be calculated using only the input and the output values. It is not therefore necessary to fine-tune the classifier with initial weights, or increment/decrement factor amounts, to start with. A stopping criterion should be added however, where each iteration adds a new transposition layer to the matrix. Looking at related work, the learning algorithm is possibly more similar to the Gauss or Pseudo-Newton gradient descent ones [15]. So again, while the method appears to be new, there are similarities with older models. The test results are very surprising. The new classifier appears to work best of all classifiers and across a range of problems. It is also very fast, requiring only a second or less and the setup is really minimal.
Each learning iteration produces a new set of error correction values and so when used, any input value goes through a series of transformations, which is separate for each variable or column value. It is thought that the weight adjustment performs a type of dampening on the error, and so it should reduce for each transposition stage. The orthogonal nature allows the variables to behave slightly differently to each other, where a variable that is close to the desired output value can oscillate around it, while one that is still far away can make larger corrections towards it. There are probably several examples of this type of phenomenon in nature. Another paper that uses an even more orthogonal design is [12], although the results for this paper are maybe slightly better.

Acknowledgement
The author wishes to acknowledge an email discussion with Charles Sauerbier of the US Navy, mainly because of its timing. He pointed out a belief that neural networks were a form of cellular automata and several other points, which the author did not fully appreciate, but the simple rule of this paper would push a neural element in that direction. The research itself however derived from a different place, looking at wave shapes and possibly some earlier ideas.

Addendum
It has not been made clear in the paper that the classifier actually used the correct output category value to converge to a result when classifying any of the datasets. So even if the classifier had not seen the dataset before, it still used its output category as part of the classification process. This is a major constraint that might be resolved by testing with each output category and selecting the category with the smallest error. However, the average error is also incorrect as it did not take account of negative totals, but it can still be in the hundredths or thousandths after being corrected. So, the results are correct for what is described, apart from the error, but that can still be from a similar scale. A new paper 'An Improved Oscillating-Error Classifier with Branching' has solved the other problems and should also be read.
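
The resolution suggested in the addendum, testing with each output category and selecting the one with the smallest error, might be sketched as below. This is only an illustration of that suggestion in Python and is not the method of the later paper. The function names, the single hypothetical layer and the example values are all assumptions, and a value already equal to the target is assumed to be left unchanged.

```python
def adjust(value, target, weight):
    """Add the weight if the value is below the target, subtract it if above."""
    if value < target:
        return value + weight
    if value > target:
        return value - weight
    return value  # assumption: a value already on target is not moved

def residual_error(row, target, layers):
    """Pass a row through every layer towards one candidate category
    value and return the total absolute error that remains."""
    values = list(row)
    for weights in layers:
        values = [adjust(v, target, w) for v, w in zip(values, weights)]
    return sum(abs(v - target) for v in values)

def classify(row, targets, layers):
    """Try every category value and keep the one with the smallest error."""
    return min(targets, key=lambda c: residual_error(row, targets[c], layers))

# Hypothetical trained structure: one layer, two variables, two categories.
layers = [[1.0, 1.0]]
targets = {"low": 0.0, "high": 4.0}
chosen = classify([3.5, 4.5], targets, layers)
print(chosen)  # high
```

Here the row is corrected to within 1.0 of the 'high' value in total, but remains 6.0 away from the 'low' value, so 'high' is selected without the classifier being told the correct category in advance.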