Experimental Evaluation of Memory Capacity of Recurrent Neural Networks

In this article, the importance of a correct representation of input data for recurrent neural networks is analysed experimentally on two tasks: recognizing handwritten digits and incrementing an integer. The same information is provided to the neural network in different forms and the quality of classification is evaluated. We found that a simple permutation of the inputs decreases quality by several percentage points for short sequences (e.g. incrementing a 32-bit integer in binary) and by up to 15% for long ones (784 steps). In addition, we observed that models examining handwritten digits presented in a horizontal way converge on average faster than analogous models with a vertical digit representation.

In general, a neural network is a complex non-linear function f(x) of a special form, whose parameters are optimized with gradient methods and the error backpropagation algorithm (e.g. (LeCun et al., 1998)). During training, the algorithm receives a collection of pairs S = ((x_1, y_1), … , (x_m, y_m)), x_i ∈ X, y_i ∈ Y, where x_i is a task input (e.g. a graphical file with a handwritten digit) and y_i is the corresponding answer (e.g. the recognized digit). The task of the training algorithm is to find a function f that ensures the best approximation of the mapping X ↦ Y on those x ∈ X that were not included in S.
Usually f is a composition f(x) = f_n(f_{n-1}(… f_1(x) …)), where each f_i is a non-linear function consisting of a matrix multiplication followed by a nonlinearity (e.g. a sigmoid). In such a case, the parameters are usually the elements of the matrices that participate in the multiplications.
Below, the general application of such a model to recognizing handwritten digits is reviewed. Assume that each digit is presented by a graphical file of size 100 by 100 pixels. Each pixel is characterized by 3 numbers: the intensities of the R, G and B components in the RGB colour representation scheme. In such a case the information stored in the file can be represented as a numeric vector x_i of size 3 · 100 · 100 = 30000 (in our MNIST experiments we have grayscale images of size 28 by 28, so our vectors have size 1 · 28 · 28 = 784). The problem solution, i.e. y_i, can be presented as a 10-dimensional vector in which all elements are equal to zero except for exactly one element equal to 1, which corresponds to the digit shown in the picture. Then f(x_i) can be seen as a vector in which each element is the probability, from the point of view of the model, that the corresponding digit is shown.
In that case f can be selected as a sequence of matrix multiplications and nonlinearities that starts with a multiplication by a matrix A_1, whose size is determined by the input size, and ends with a function P that transforms an arbitrary numeric vector into a probability distribution. The resulting model is called a neural network with one hidden layer of 128 neurons (Fig. 1). The size and the number of hidden layers are selected based on the problem and on the amount of data used for model training.
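As an illustration of such a model, the sketch below computes a forward pass of a network with one hidden layer of 128 neurons on a 784-element input. The exact formula is not given in the text, so the form f(x) = P(A_2 · sigmoid(A_1 · x)), the choice of softmax for P, and the names `a1`, `a2`, `forward` and `init_matrix` are assumptions made for the example; the weights are random, not trained.

```python
import math
import random

def softmax(z):
    """P: map an arbitrary real vector to a probability distribution."""
    m = max(z)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def matvec(matrix, vector):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def init_matrix(rows, cols, rng, scale=0.01):
    """Small random weights; a stand-in for trained parameters."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]

def forward(x, a1, a2):
    """Assumed form: f(x) = P(A2 * sigmoid(A1 * x))."""
    hidden = [sigmoid(v) for v in matvec(a1, x)]
    return softmax(matvec(a2, hidden))

rng = random.Random(0)
a1 = init_matrix(128, 784, rng)  # input (784 pixels) -> hidden layer (128 neurons)
a2 = init_matrix(10, 128, rng)   # hidden layer -> scores for the 10 digits
x = [rng.random() for _ in range(784)]  # stand-in for a flattened 28 by 28 image
probs = forward(x, a1, a2)       # 10 probabilities, one per digit
```

Note that the shape of `a1` is tied to the input size, which is exactly the limitation discussed next.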
However, neural networks of this type are not suitable for all tasks. Suppose that the picture to be analysed can have an arbitrary size and is not limited to 100 by 100 pixels. Then it is not clear which size of the matrix A_1 should be selected. One option is to choose it for the largest expected picture; however, in such a case the majority of parameters are useless for smaller pictures. Also, it is rather difficult to find the necessary number of training samples for the parameters that are responsible for the bigger picture sizes. Of course, such an approach cannot solve a problem in which the input size is not limited, as in translation or speech recognition. Another option is to preprocess the incoming data. For example, in image recognition problems the image can be brought to a canonical size by stretching or compression before the model is applied. Such an approach cannot be applied to the mentioned translation or speech recognition tasks either.
For such tasks, recurrent neural networks (e.g. (Jain and Medsker, 1999)) are used. In many cases a problem with a variable-size input can be presented as a sequence of data items of the same type. For example, for machine translation the input text can be presented as a sequence of words: x = (x_1, x_2, … , x_n). For speech recognition, in many cases the initial audio file is split into segments of a fixed length, e.g. 25 ms (see (Hinton et al., 2012)).
At a high level, a recurrent neural network is a function g(x_i, s_{i-1}) = (y_i, s_i). This function has two inputs: x_i is the next sequence element and s_{i-1} is the state of the neural network at the previous step. It returns two outputs: y_i is the answer at the i-th step and s_i is the state at the current step. This means that the same function g is used at each step, and the number of its parameters does not depend on the sequence length (see Fig. 2).

Fig. 2. Recurrent neural network application algorithm
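The application scheme g(x_i, s_{i-1}) = (y_i, s_i) can be sketched as a loop that threads the state through the sequence. The cell below is a plain tanh cell with a linear read-out and hand-picked toy weights; these, and the names `make_cell` and `run_rnn`, are assumptions made for the illustration and not the models used in the experiments.

```python
import math

def make_cell(w_in, w_rec, w_out):
    """Build a step function g(x, s_prev) -> (y, s) with fixed weights."""
    def g(x, s_prev):
        # New state: tanh of a linear map of the input and the previous state.
        s = [
            math.tanh(
                sum(w * v for w, v in zip(w_in[j], x))
                + sum(w * v for w, v in zip(w_rec[j], s_prev))
            )
            for j in range(len(s_prev))
        ]
        # Answer at this step: a linear read-out of the new state.
        y = [sum(w * v for w, v in zip(row, s)) for row in w_out]
        return y, s
    return g

def run_rnn(g, xs, s0):
    """Apply the same g at every step, passing the state along."""
    s, ys = list(s0), []
    for x in xs:
        y, s = g(x, s)
        ys.append(y)
    return ys, s

# Toy weights (hypothetical): input size 1, state size 2, output size 1.
g = make_cell(
    w_in=[[0.5], [-0.3]],
    w_rec=[[0.1, 0.2], [0.0, 0.4]],
    w_out=[[1.0, -1.0]],
)
short = [[1.0], [0.0], [1.0]]
long = short + [[0.0], [1.0], [1.0], [0.0]]
ys_short, _ = run_rnn(g, short, [0.0, 0.0])
ys_long, _ = run_rnn(g, long, [0.0, 0.0])
```

The same `g`, with the same number of parameters, processes sequences of length 3 and 7; only the number of loop iterations changes.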
Such a construction proved to be very powerful, i.e. able to compute extremely complex functions: such networks are used for speech recognition (e.g. (Miao et al., 2015)). Moreover, it was proved in (Siegelmann and Sontag, 1995) that for any function computable on a Turing machine it is possible to find a recurrent network that approximates this function with a given precision.
However, in practice training recurrent neural networks is not an easy task (refer to (Pascanu et al., 2012) for a review of model training difficulties). An additional problem is that even the format of the input data has a great influence on the effectiveness of model training, i.e. it is not enough just to send the available data to the input; it is also necessary to structure it in a convenient way.
The article is organized as follows: Part 2 gives a short literature review on the topic, Part 3 describes the experiments performed, and Part 4 discusses the results achieved and future research topics.

Prior and related works
Recurrent neural networks were developed in the 1980s (refer to (Goodfellow et al., 2016) for a review). Hopfield networks were among the first of them.
Neural networks are difficult to train. The biggest obstacles are usually related to exploding and vanishing gradients. They arise mainly because the gradient, which carries information about the error, passes through a large number of non-linearities (proportional to the depth of the network and, more importantly, to the length of the sequence). This means that if the gradient norm is multiplied by a number greater than 1 while passing through each non-linearity, the norm grows exponentially, causing numerical stability problems. Conversely, if it is multiplied by a number less than 1, the norm decays exponentially and the error information is lost.
The main method for dealing with exploding gradients is gradient clipping (please refer to (Pascanu et al., 2012) for a more detailed discussion of the problem). The vanishing gradient problem is usually solved by a proper choice of architecture for the recurrent neural network. The most popular architectures in this case are LSTM (Hochreiter, 1997) and GRU (Cho et al., 2014a). They solve the problem by providing a special path along which the gradient norm does not change. As a result, even if the gradient has vanished on the main path, the additional path protects against information loss. In (Chung et al., 2014) the authors compared different types of recurrent units and found that gated units (such as LSTM or GRU) are indeed better than traditional units and that GRU is comparable to LSTM. In (Jozefowicz et al., 2015) it was empirically found that careful initialization of the forget gate bias in LSTM closes the gap between LSTM and GRU models in all problems evaluated by the authors. We use LSTM, GRU and simple RNN models in our increment experiments.
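Gradient clipping can be sketched in a few lines. The version below rescales the gradient when its Euclidean norm exceeds a threshold, which is one common variant; the function name `clip_by_norm`, the threshold and the growth factor 1.1 are illustrative assumptions, not settings taken from the experiments.

```python
import math

def clip_by_norm(grad, max_norm):
    """Rescale grad so that its Euclidean norm does not exceed max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= max_norm:
        return list(grad)  # small gradients pass through unchanged
    scale = max_norm / norm
    return [g * scale for g in grad]

# A norm multiplied by 1.1 at each of 784 steps grows exponentially.
growth = 1.1 ** 784

exploded = [3.0 * growth, 4.0 * growth]
clipped = clip_by_norm(exploded, max_norm=1.0)  # direction kept, norm cut to 1
small = clip_by_norm([0.3, 0.4], max_norm=1.0)  # left as is
```

Clipping keeps the direction of the update and only bounds its length, so it addresses exploding gradients but does nothing for vanishing ones.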
A general review of neural networks and examples of their successful practical applications can be found in (Karpathy, 2015) and (Goodfellow et al., 2016).
From a theoretical point of view, many facts about the computational power of recurrent neural networks have been proved. The most important result is the proof of their equivalence to the Turing machine (Siegelmann and Sontag, 1995). In (Khrulkov et al., 2017) a theorem was proved stating that a special kind of recurrent neural network is exponentially better than one-layer convolutional networks, i.e. it requires fewer parameters to achieve the same quality of results. For our digit recognition experiments one-layer recurrent neural networks are used. For quicker convergence, the initialization scheme proposed in (Le et al., 2017) is applied.
Regarding the MNIST dataset and object recognition in general, convolutional neural networks are better suited for the task. For example, (Ciresan et al., 2012; Jarrett et al., 2009; Ranzato et al., 2006) report several ways to build and train CNN models that reach a recognition accuracy of 99.5% or higher. The core idea of this paper is not to achieve better accuracy, but to investigate the storage capacity of recurrent models.

Experiments
In the following subsections we present two experiments supporting the importance of a correct representation of sequential data for recurrent neural networks. We use numerical data in the task of incrementing an integer (subsection 3.1) and images in the handwritten digit recognition problem (subsection 3.2). Both experiments show that the rate of convergence depends strongly on the chosen data representation, even though the same amount of information is given to the model.

Increment task
In this subsection we describe the experiments related to the task of incrementing an integer. A number of N digits in a base-K numeral system, with possible leading zeros, was considered. The task of the model is to read the given number and to output the number increased by one (or zero if K^N − 1 was given as input).
In the first experiment (let us call it "simple") the number was provided to the input digit by digit, from the lowest order to the highest. In the second experiment (to be called "complicated") the digits were provided in a random order that was fixed within one experiment.
In the "complicated" experiment the model does not have enough information to output "the first" digit of the increased number at the first step, since in the permuted version this digit can depend on lower orders that the model has not yet seen. That is why the initial number was given to the input twice, and we expected to obtain the unchanged digits for the first N outputs and the digits of the increased number for the second N outputs. Explanations are provided in Fig. 3 and Fig. 4. Three architectures of recurrent neural networks were used: vanilla RNN (the same as in the experiment with MNIST), LSTM (Hochreiter, 1997) and GRU (Cho et al., 2014b). In each experiment we used minibatches of size 16, 5 epochs of 10240 minibatches each, a single-layer neural network with a recurrent cell of 16 elements, and the RMSprop optimizer with a learning rate of η = 3 · 10^−4. At the end of each minibatch the accuracy of the obtained model was calculated on a hold-out set. This set consisted of all K^N possible numbers or, if K^N exceeded 3 · 10^5, of 3 · 10^5 random numbers drawn from the same distribution as the training set.
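The input and target sequences of the increment task can be sketched as follows, using the example from Fig. 3 (N = 5, K = 4, x = 543). The helper names `to_digits` and `make_example` are hypothetical. The sketch also assumes that the number is fed twice in both experiments, as in Fig. 3, and that the targets of the "complicated" case follow the same fixed permutation as the inputs; the text does not state these details explicitly.

```python
import random

def to_digits(x, base, n):
    """Digits of x in the given base, lowest order first, padded to n digits."""
    digits = []
    for _ in range(n):
        digits.append(x % base)
        x //= base
    return digits

def make_example(x, base, n, order=None):
    """Return (inputs, targets) of length 2n for the increment task.

    order is the fixed digit order; None means lowest order first ("simple").
    """
    if order is None:
        order = list(range(n))
    src = to_digits(x, base, n)
    inc = to_digits((x + 1) % base ** n, base, n)  # wraps to zero after K^N - 1
    inputs = [src[i] for i in order] * 2           # the number is fed twice
    targets = [src[i] for i in order] + [inc[i] for i in order]
    return inputs, targets

# "Simple" case: 543 = 20133 in base 4, fed from the lowest order up.
simple_in, simple_out = make_example(543, base=4, n=5)

# "Complicated" case: a random order, fixed for the whole experiment.
order = random.Random(0).sample(range(5), 5)
hard_in, hard_out = make_example(543, base=4, n=5, order=order)
```

In the "simple" case the second half of the targets holds the digits of 544 = 20200 in base 4, lowest order first.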
If a number is selected uniformly out of the K^N options, rather simple and "uninteresting" samples, in which only the lowest order has to be increased by one, dominate the measured metric. Because of that, the distribution was modified to select an "interesting" sample with probability 0.1 in the following way:
• choose an integer T uniformly from 0 to N;
• return (K^T − 1) + R_K(N − T) · K^T, where R_K(N − T) is a random number of N − T digits in the base-K numeral system.
In such an interesting sample there are at least T carries from lower to higher orders.
As can be seen from Figures 5, 6 and 7, this experiment shows behaviour similar to that of the MNIST experiment: when the data is provided to the input in the "correct" order, the model converges to the optimum much quicker. In the "complicated" case convergence is visible, but it is always much slower.
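The modified sampling distribution of the increment task can be sketched as below. The function names are hypothetical, and the sketch assumes that T is drawn from 0 to N inclusive and that R_K(N − T) is uniform over all numbers of N − T base-K digits.

```python
import random

def interesting_number(base, n, t, rng):
    """(K^T - 1) + R_K(N - T) * K^T: the T lowest digits all equal K - 1."""
    high = rng.randrange(base ** (n - t))  # R_K(N - T), leading zeros allowed
    return (base ** t - 1) + high * base ** t

def sample_number(base, n, rng, p_interesting=0.1):
    """Draw an 'interesting' sample with probability 0.1, else a uniform one."""
    if rng.random() < p_interesting:
        t = rng.randint(0, n)              # T uniform on 0..N (assumed inclusive)
        return interesting_number(base, n, t, rng)
    return rng.randrange(base ** n)

rng = random.Random(0)
batch = [sample_number(4, 5, rng) for _ in range(16)]  # one minibatch of numbers
```

Because the T lowest digits equal K − 1, adding one produces a carry out of each of them, which is what makes such samples hard.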

Recognizing of handwritten digits
In this subsection we present the experiment carried out with the MNIST dataset (LeCun and Cortes, 2010), which is used for recognizing handwritten digits. This dataset contains 60 000 samples for training and 10 000 samples for testing. Each sample is an image of size 28 by 28 pixels. Several samples from the dataset are presented in Fig. 8.
The same image information was given to the model in the following representations:
• original: each picture is a sequence of 28 vectors of size 28, one per image row, presented from top to bottom;
• transposed: equivalent to original, but each sequence element is a column of the image, not a row;
• original random permute: equivalent to original, but the order of the vectors is random (not necessarily top to bottom); it is still fixed for the training and testing sets;
• transposed random permute: equivalent to original random permute, but each sequence element is a column of the image, not a row;
• flattened: each picture is a sequence of 784 vectors of size 1; pixels are presented left to right, then top to bottom;
• transposed flattened: each picture is a sequence of 784 vectors of size 1; pixels are presented top to bottom, then left to right;
• flattened random: each picture is a sequence of 784 vectors of size 1; pixels are presented in a random order, but this order is fixed for the training and testing sets.
Table 1 shows the different representations of the same test image.
The flattened random and flattened cases are in fact Permuted MNIST and Sequential MNIST from (Le et al., 2015).
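The representations can be sketched as simple reorderings of one image stored as a list of 28 rows. The function names are hypothetical, and the definitions of the row-wise and column-wise cases are assumptions inferred from the descriptions of their permuted variants. A synthetic image whose pixel value encodes its position stands in for an MNIST sample.

```python
import random

def original(img):
    """28 vectors of size 28: image rows, top to bottom (assumed definition)."""
    return [list(row) for row in img]

def transposed(img):
    """28 vectors of size 28: image columns instead of rows."""
    return [list(col) for col in zip(*img)]

def permuted(seq, order):
    """Reorder sequence elements by a permutation fixed for the whole dataset."""
    return [seq[i] for i in order]

def flattened(img):
    """784 vectors of size 1: left to right, then top to bottom."""
    return [[p] for row in img for p in row]

def transposed_flattened(img):
    """784 vectors of size 1: top to bottom, then left to right."""
    return flattened(transposed(img))

# Synthetic image: pixel (r, c) holds 28 * r + c, so positions are traceable.
img = [[28 * r + c for c in range(28)] for r in range(28)]

rng = random.Random(0)
row_order = rng.sample(range(28), 28)      # shared by training and testing sets
pixel_order = rng.sample(range(784), 784)

original_random_permute = permuted(original(img), row_order)
transposed_random_permute = permuted(transposed(img), row_order)
flattened_random = permuted(flattened(img), pixel_order)
```

Every representation contains exactly the same 784 pixel values; only the grouping and the order differ.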
It can be noticed that the flattened representation complicates digit recognition, since adjacent image pixels can be distant from one another in the representation. So a neural network that works with rows has to learn that, for example, in the case of digit 1 white pixels should follow every 28 steps to form a white vertical line. The same can be said of a neural network working with columns in the case of the horizontal line in digit 7.
The task is even more complicated for a model that works with any kind of random permutation. In these cases, proximity in the representation does not imply proximity in the image. Nevertheless, for any function of the ordered sequence there is a function that returns the same answers on the permuted sequence. That means that with an ideal selection of training parameters and a large amount of data the same result will be achieved in both cases.
For each representation option we trained a single-layer recurrent network of 128 neurons with the RMSProp algorithm (Tieleman and Hinton, 2012); the batch size was 16 images and the learning rate was 3 · 10^−5 or 10^−6. Two learning rates were used so that the comparison of the data representation options would be more reliable.
For all random variants, training was performed on three different permutations in order to reduce random fluctuations. Training with the Keras framework (Chollet, 2015) took two weeks on a 4-core computer. Models were trained with a cross-entropy loss between the last element of the output sequence and the correct answer.
The source code used in experiments can be found at https://github.com/kolesov93/rnn-memory .
The best models for each option are presented in Table 2. Table 3 provides more detailed information on each training run.
As can be seen, the transposed model makes its decision 4-5 steps (out of 28!) earlier than the analogous original model, and the same tendency remains for the permuted cases. It can also be noticed that permutation decreases the classification quality by a relative 0.7% for the transposed case, by 1.4% for the original, by 16% for transposed flattened and by 14% for flattened. Stretching the image into a line also decreases the quality of classification: for example, for original the accuracy decreased by 7% (from 0.979 to 0.910), and the number of mistakes increased 4.2 times. It should be stressed that this decrease in quality can be explained by the fact that the initial model has more parameters (since the matrix multiplication is performed with a 28-column matrix, while in the flattened case the matrix has 1 column).
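The arithmetic behind the comparison of the original and flattened cases can be checked with a short calculation. It uses only the rounded accuracies 0.979 and 0.910 quoted above, so the ratio of mistakes comes out at about 4.3, which agrees with the reported 4.2 up to rounding of the accuracies. The count of input weights assumes a plain recurrent cell with 128 neurons and one weight per input element and neuron.

```python
acc_original = 0.979   # accuracy of the original (row-wise) representation
acc_flattened = 0.910  # accuracy after stretching the image into a line

# Relative decrease in accuracy, about 7%.
relative_drop = (acc_original - acc_flattened) / acc_original

# Ratio of error rates: how many times more mistakes the flattened model makes.
error_ratio = (1 - acc_flattened) / (1 - acc_original)

# Input-to-hidden weights for 128 neurons (assumed plain recurrent cell).
hidden = 128
input_weights_original = 28 * hidden   # each step receives a 28-element row
input_weights_flattened = 1 * hidden   # each step receives a single pixel

print(f"relative drop: {relative_drop:.3f}")
print(f"error ratio:   {error_ratio:.2f}")
print(f"input weights: {input_weights_original} vs {input_weights_flattened}")
```

The recurrent and output weights are the same in both cases, so the difference in the number of parameters comes from the input matrix alone.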
Fig. 9 presents the accuracy curves on the testing set for the best model in each of the groups.

Conclusions and future work
It can be stated that the data representation format strongly influences convergence. This was shown empirically for two data types: numerical data (in the integer increment problem) and images (handwritten digit recognition).
However, it can be expected that a neural network is able to discover dependencies even in complicated conditions. For example, Fig. 9 shows that the best cases reach a result close to the optimum as early as the 10th epoch, while weaker cases take longer to improve. Still, in practice we cannot expect every data representation to lead to success, because the amount of data and the computational budget are limited. For example, the original random permute case was not able to close the gap to the best three cases even after 250 epochs.
We noticed that models examining handwritten digits presented in a horizontal way converge to the optimum on average faster than analogous models with a vertical digit representation. We carried out experiments supporting our hypothesis that such models require fewer steps to make a classification decision, so that more computational resources can be spent on classification itself rather than on memorizing the sequence. However, additional research on that topic is planned.
In addition, the following topics for future work are defined:
• Bigger depth allows models to represent more sophisticated functions. It is necessary to check how an increase in depth influences the speed of sequence memorization.
• It is necessary to verify the achieved results on other data types, such as audio or text.

Fig. 1. Outline of a neural network for solving handwritten digit recognition problem.

Fig. 3. Sample input data for the increment task, N = 5, K = 4. The number x = 20133_4 = 543_10, repeated two times, is provided to the input (please note that the number is provided in the order from lower to higher digits). On output we expect to obtain 2N digits: first the N digits of the initial number, then the N digits of the increased number (in this case 544_10 = 20200_4).

Fig. 8. Samples of input data from the MNIST dataset

Fig. 9. Training curves for best cases in each of the groups

Table 2. Average number of steps necessary for the model to stop changing the answer.

Table 4. Best models for each option