Efficient Training of Multi-Layer Neural Networks to Achieve Faster Validation

Artificial neural networks (ANNs) are one of the hottest topics in computer science and artificial intelligence due to their potential and advantages in analyzing real-world problems in various disciplines, including but not limited to physics, biology, chemistry, and engineering. However, ANNs lack several key characteristics of biological neural networks, such as sparsity, scale-freeness, and small-worldness. The concept of sparse and scale-free neural networks has been introduced to fill this gap. Network sparsity is implemented by removing weak weights between neurons during the learning process and replacing them with random weights. When the network is initialized, the neural network is fully connected, which means the number of weights grows quadratically with the number of neurons. In this study, considering that a biological neural network has some degree of initial sparsity, we design an ANN with a prescribed level of initial sparsity. The neural network is tested on handwritten digits, Arabic characters, CIFAR-10, and Reuters newswire topics. Simulations show that it is possible to reduce the number of weights by up to 50% without losing prediction accuracy. Moreover, in both cases, the testing time is dramatically reduced compared with fully connected ANNs.


Introduction
The powerful tools of artificial intelligence are increasingly attractive for the analysis of real-life problems. Due to the proven efficiency of deep learning, artificial neural networks (ANNs) are the most frequently used strategy, with substantial applications and results in areas such as physics [1], engineering [2], and biology [3]. ANNs are usually composed of multiple layers of (artificial) neurons designed to replicate the functioning of biological neurons. In traditional ANN designs, all neurons are interconnected; that is, a weight connects every pair of neurons, meaning that the number of weights is quadratic in the number of neurons. This makes the learning process, which amounts to solving an error minimization problem with respect to these weights, significantly complex. When the number of neurons is small, the complexity is manageable, but it grows quickly as the number of neurons increases.
To resolve this issue, various fast and efficient learning strategies have been developed, including the scaled [4] and adaptive [5] conjugate gradient algorithms, adaptive learning rate method [6], sensitivity analysis [7], normalized perceptron-based methods [8], improved search methods [9], distributed Newton methods [10], and accelerated methods [11]. Alaba et al. [12] offered a comprehensive review of some recent advanced deep learning strategies. Pruning algorithms have also been shown to be quite effective in the analysis of various types of data. For a comparative study, see [13].
A novel method inspired by biological neural networks was proposed by LeCun et al. [14] and developed by Mocanu et al. [15], in which weights between neurons are adaptively updated. Specifically, network initialization is accompanied by the creation of random weights. At each learning stage (epoch), weak weights, that is, those smaller than a prescribed threshold, are removed and new weights are created (see Fig. 1). Eventually, the training of the network is carried out only for strongly connected neurons, substantially reducing the complexity of the associated minimization problem. The pseudocode of the training algorithm, called the sparse evolutionary training (SET) algorithm, is shown in Algorithm 1. A similar approach [16] suggested randomly dropping neurons together with their connections to other neurons. This "dropout" technique is often used to overcome overfitting problems. However, as shown below, it can also be used to train sparse ANNs.
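The dropout idea mentioned above can be illustrated with a minimal sketch. The function name `dropout`, the use of a list of activations, and the rescaling of surviving activations (the common "inverted dropout" convention) are assumptions made for illustration; they are not taken from [16] or from the implementation benchmarked later.

```python
import random


def dropout(activations, rate, rng):
    """Zero each activation with probability `rate` (hypothetical helper).

    Surviving activations are divided by the keep probability so that the
    expected value of each entry is unchanged (inverted dropout, an assumption).
    """
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


rng = random.Random(0)
hidden = [0.2, 0.9, 0.5, 0.7]
dropped = dropout(hidden, rate=0.5, rng=rng)
# Each entry of `dropped` is either 0.0 or twice the original activation.
```

Dropping a neuron's activation in this way removes the contribution of all of its outgoing connections for that training step, which is why the technique can be viewed as temporarily sparsifying the network.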
This study aims to develop the idea of sparse connectivity by training networks with an initially reduced number of weights. Unlike Mocanu et al. [15], whose network was initially designed as fully connected and whose weights were adaptively updated during the learning process, we suggest the training with initial sparsity (TIS) method, which uses a reduced number of weights at network initialization and removes weak weights during training. The simplest gradient descent algorithm is used for backpropagation (BP). A more advantageous learning algorithm can be developed using more advanced minimization techniques.
The rest of the paper is organized as follows. Section 2 reviews the related literature. Section 3 presents the details and pseudocode of the algorithm. In Section 4, we apply TIS to the recognition of handwritten digits and Arabic characters and document the performance advantages of ANNs with reduced connectivity over the fully connected network. In Section 5, we compare TIS with similar methods. Conclusions and future research directions are discussed in Section 6.

Literature Review
Multi-layer neural networks have received significant attention and wide interest from scholars and scientists, as they are a powerful tool and efficient method in many fields, such as image processing [17], pattern recognition [18], natural language processing [19], VLSI design [20], and system diagnosis [21].
Yu et al. [22] developed a new ANN training technique in which computations are carried out neuron by neuron; they also implemented two different methods, namely, the Levenberg-Marquardt (LM) and error backpropagation (EBP) methods. Schwenk et al. [23] presented a detailed survey on the best methods and practices of training large ANNs on more than 5.5 billion words. Adhikari et al. [24] developed a novel high-performance strategy for implementation in chip learning in multi-layer neural network hardware.
Recently, researchers have used metaheuristic algorithms to train multi-layer perceptrons (MLPs). These metaheuristics include the lightning search algorithm [25], ant lion optimization [26], the salp swarm algorithm [27], moth-flame optimization [28], the whale optimization algorithm [29][30][31][32], and the crow search algorithm [33]. These algorithms have been successfully applied in training MLPs. In MLP training, metaheuristics can be employed in three different ways. First, they can be used to obtain the optimal connection weights. Second, they can be used to find the optimal architecture. Finally, they can be used to tune the parameters of the learning algorithm.

TIS Algorithm
TIS is a training algorithm for an ANN with reduced connectivity. By reducing the number of weights, we aim to make the testing (validation) process computationally cheaper than for a fully connected ANN. The reduction of weights should not substantially affect the accuracy. Thus, we seek faster validation without reducing accuracy, while possibly improving it.
To illustrate, we construct a standard neural network with two hidden layers. It has weight matrices between the input and first hidden layer, the first and second hidden layers, and the second hidden layer and output layer; $b^{H_1}$, $b^{H_2}$, and $b^{O}$ are the bias vectors of the layers corresponding to their superscripts; and $n_I$, $n_1$, $n_2$, and $n_O$ are the numbers of neurons in the input, first hidden, second hidden, and output layers, respectively.
The training process has two steps: forward and backward propagation. Forward propagation can be described as follows. Given an input vector $x_j \in \mathbb{R}^{n_I}$ and a target vector $t_j \in \mathbb{R}^{n_O}$, $j = 1, 2, \ldots, n_1$, the standard sigmoid function $\sigma(z) = 1/(1 + e^{-z})$ is used as the activation function and is applied to the weighted inputs of the first and second hidden layers to obtain their responses. The real output $y_j \in \mathbb{R}^{n_O}$ is then evaluated from the response of the second hidden layer.
In the BP stage, the standard quadratic error function $E$ between the real outputs and the targets is considered, and its gradient is computed with respect to the weight matrices and bias vectors. At the NN initialization step, random weights and biases are set. A prescribed level of initial sparsity is imposed by randomly removing the corresponding portion of the weights. Standard gradient descent is then used to update the remaining weights and biases: at each iteration, every weight and bias is decreased by $\alpha$ times the corresponding partial derivative of $E$, where $\alpha$ is a constant learning rate.
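For orientation, one standard way to write the quantities described in this passage is sketched below. The weight-matrix symbols $W^{H_1}$, $W^{H_2}$, $W^{O}$ (chosen by analogy with the bias notation), the hidden responses $h^{1}_j$, $h^{2}_j$, the sigmoid on the output layer, and the factor $1/2$ in the error are all assumptions; the paper's own displayed formulas may differ in these details.

```latex
% Forward propagation (assumed notation)
\sigma(z) = \frac{1}{1 + e^{-z}}, \qquad
h^{1}_j = \sigma\!\left(W^{H_1} x_j + b^{H_1}\right), \qquad
h^{2}_j = \sigma\!\left(W^{H_2} h^{1}_j + b^{H_2}\right), \qquad
y_j = \sigma\!\left(W^{O} h^{2}_j + b^{O}\right)

% Quadratic error for sample j
E_j = \tfrac{1}{2}\,\lVert y_j - t_j \rVert^{2}

% Gradient components (\odot is the elementwise product)
\delta^{O} = (y_j - t_j) \odot y_j \odot (1 - y_j), \qquad
\frac{\partial E_j}{\partial W^{O}} = \delta^{O} (h^{2}_j)^{\top}, \qquad
\frac{\partial E_j}{\partial b^{O}} = \delta^{O}

\delta^{H_2} = \big((W^{O})^{\top} \delta^{O}\big) \odot h^{2}_j \odot (1 - h^{2}_j), \qquad
\frac{\partial E_j}{\partial W^{H_2}} = \delta^{H_2} (h^{1}_j)^{\top}, \qquad
\frac{\partial E_j}{\partial b^{H_2}} = \delta^{H_2}

\delta^{H_1} = \big((W^{H_2})^{\top} \delta^{H_2}\big) \odot h^{1}_j \odot (1 - h^{1}_j), \qquad
\frac{\partial E_j}{\partial W^{H_1}} = \delta^{H_1} x_j^{\top}, \qquad
\frac{\partial E_j}{\partial b^{H_1}} = \delta^{H_1}

% Gradient descent update, applied only to weights that have not been removed
W \leftarrow W - \alpha \,\frac{\partial E_j}{\partial W}, \qquad
b \leftarrow b - \alpha \,\frac{\partial E_j}{\partial b}
```

Under these assumptions, a removed weight simply stays at zero and is skipped in the update, so the cost of each step scales with the number of remaining weights.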
At each epoch except the last, after updating all weights, the weak weights (those with absolute values close to zero) are removed and replaced by random weights. At the last epoch, weak weights are removed and not replaced. The pseudocode of this process is presented as Algorithm 2, and the corresponding flowchart is shown in Fig. 2. As a simple application of the algorithm, we compare the performance of ANNs with reduced connectivity to that of fully connected ANNs. The algorithm was implemented in MATLAB and run on an Intel Core i7-4700MQ CPU @ 2.40 GHz (8 CPUs).
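The removal and replacement step described above can be sketched for a single layer as follows. This is a minimal illustration and not the paper's MATLAB implementation: the function names, the dictionary storage of weights, the threshold value, the uniform range for random weights, and the choice to place new weights at any currently empty position are all assumptions.

```python
import random


def init_sparse_weights(n_in, n_out, sparsity, rng):
    """Random weights for an n_in x n_out layer, keeping a (1 - sparsity) share."""
    all_edges = [(i, j) for i in range(n_in) for j in range(n_out)]
    n_keep = round((1.0 - sparsity) * len(all_edges))
    return {edge: rng.uniform(-1.0, 1.0) for edge in rng.sample(all_edges, n_keep)}


def prune_and_regrow(weights, n_in, n_out, threshold, rng, regrow=True):
    """Remove weights with |w| < threshold; optionally add as many random ones."""
    strong = {e: w for e, w in weights.items() if abs(w) >= threshold}
    n_removed = len(weights) - len(strong)
    if regrow and n_removed > 0:
        # Candidate positions: every connection that currently has no weight.
        free = [(i, j) for i in range(n_in) for j in range(n_out)
                if (i, j) not in strong]
        for edge in rng.sample(free, n_removed):
            strong[edge] = rng.uniform(-1.0, 1.0)
    return strong


rng = random.Random(0)
n_in, n_out = 8, 4
w = init_sparse_weights(n_in, n_out, sparsity=0.5, rng=rng)
n_initial = len(w)  # 16 of the 32 possible weights

n_epochs = 3
for epoch in range(n_epochs):
    # A gradient-descent update of the surviving weights would go here.
    is_last = epoch == n_epochs - 1
    w = prune_and_regrow(w, n_in, n_out, threshold=0.1, rng=rng,
                         regrow=not is_last)
```

In this sketch the number of weights stays constant during training and can only shrink at the final epoch, which matches the description that weak weights are removed but not replaced at the end.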

Handwritten Digits (MNIST Dataset)
First, we consider the publicly accessible MNIST dataset of handwritten digits, which contains $6 \cdot 10^4$ samples for learning and $10^4$ for testing$^{1}$ (see also [34] for a comparison of different machine learning algorithms on the MNIST dataset). Each sample is a 28 pixel × 28 pixel image. The data are in the format $\{x_j, t_j\} \in \mathbb{R}^{785}$, $j = 1, 2, \ldots, 7 \cdot 10^4$, where $x_j$ is the input vector and $t_j$ is the target output. We set $n_I = 784$, $n_1 = 16$, $n_2 = 500$, and $n_O = 10$. Training was carried out with a fixed learning rate $\alpha = 0.1$ in 200 epochs with 5,000 learned samples per epoch.
By experimenting with different levels of initial sparsity, we observed that it is possible to achieve accuracy comparable to that of a corresponding fully connected NN even with substantially fewer weights. Tab. 1 shows that the NN with a reduced number of weights has slightly better accuracy than the fully connected NN. Specifically, the accuracy of the NN with 50% initial sparsity was 0.65% higher.
Tab. 2 presents more quantitative information about the number of remaining weights, including the complexity of error minimization in the appropriate layer. Specifically, for a fully connected network, the error function (1) must be minimized with respect to 25,544 weights; with 50% initial sparsity, this number is 11,497. However, as expected, higher levels of initial sparsity affect convergence. As Fig. 3 shows, when the level of initial sparsity is 50%, the error starts to decrease later than for a fully connected network. The difference occurs within the first 40 epochs.
The connectivity of visible neurons is plotted at different epochs of the training process in Fig. 4. A connectivity pattern emerges over the epochs, with strong weights concentrating at the center of the sample. Weights near the sample boundary are weak and are therefore removed. This is to be expected, as the digits appear precisely in the region of the samples where the visible neurons have the strongest weights.
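The weight count quoted above follows directly from the layer sizes. The short check below uses a hypothetical helper, `dense_weight_count`, and counts only the connection weights (biases excluded), which is the convention that reproduces the reported figure.

```python
def dense_weight_count(layer_sizes):
    """Number of weights in a fully connected feed-forward net (biases excluded)."""
    return sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:]))


mnist_layers = [784, 16, 500, 10]  # input, first hidden, second hidden, output
full = dense_weight_count(mnist_layers)
remaining = 11_497  # reported for 50% initial sparsity

print(full)                        # 25544
print(round(remaining / full, 4))  # 0.4501
```

So roughly 45% of the fully connected weights remain at the end of training with 50% initial sparsity, slightly fewer than the initial half because weak weights are removed without replacement at the last epoch.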

Handwritten Arabic Characters
The TIS algorithm was also applied to analyze the publicly accessible Arabic Handwritten Characters Dataset,$^{2}$ which contains 13,440 samples for learning and 3,360 for testing. Each sample is a 32 pixel × 32 pixel image that was reshaped to a row vector of 1,024 elements. We set $n_I = 1024$, $n_1 = 250$, $n_2 = 250$, and $n_O = 28$. Training was carried out with a fixed learning rate of $\alpha = 0.5$ in 100 epochs with 5,000 learned samples per epoch.
In this case also, initially sparse NNs provided accuracies comparable to those of fully connected NNs even with substantially fewer weights. It can be seen from Tab. 3 that the indicators of accuracy of the NN with 5%, 15%, and 50% initial sparsity were higher than those of the corresponding fully connected NN by 0.6%, 1.02%, and 0.21%, respectively.
The number of remaining weights for different levels of initial sparsity is presented in Tab. 4. It is worth noticing that compared with 325,500 weights in the case of a fully connected network, there are only 146,476 weights for the case of 50% initial sparsity.
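As a quick consistency check, the fully connected count can be recovered from the layer sizes given for this experiment (connection weights only, biases excluded), and the remaining share follows from the two reported numbers:

```latex
n_I n_1 + n_1 n_2 + n_2 n_O
  = 1024 \cdot 250 + 250 \cdot 250 + 250 \cdot 28
  = 256{,}000 + 62{,}500 + 7{,}000
  = 325{,}500,
\qquad
\frac{146{,}476}{325{,}500} \approx 0.45 .
```

That is, about 45% of the fully connected weights remain for 50% initial sparsity, the same share as in the MNIST experiment (11,497 of 25,544).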
The connectivity of visible neurons at different epochs of the training process is plotted for this case in Fig. 5. Over the epochs, strong weights were concentrated around the region where Arabic letters appear. Weaker weights were concentrated near sample boundaries and were therefore removed. A straightforward consequence of the reduced number of weights is the dramatic reduction of the time spent evaluating the test samples (see Fig. 6).
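One way to see why fewer weights can shorten testing is a forward pass that visits only the stored connections, so that the number of multiply-add operations equals the number of remaining weights. The sketch below is illustrative only: the dictionary storage and the function names are assumptions, and the paper's MATLAB implementation may exploit sparsity differently.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sparse_layer_forward(weights, bias, x):
    """Response of one layer whose weights are stored as {(i, j): w}.

    (i, j) connects input i to neuron j. Only stored weights are visited,
    so the loop performs exactly len(weights) multiply-add operations.
    """
    z = list(bias)
    for (i, j), w in weights.items():
        z[j] += w * x[i]
    return [sigmoid(v) for v in z]


# Two of the four possible connections of a 2 x 2 layer are present.
layer = {(0, 0): 1.0, (1, 1): -1.0}
out = sparse_layer_forward(layer, bias=[0.0, 0.0], x=[0.0, 0.0])
print(out)  # [0.5, 0.5]
```

In practice, the actual speed-up depends on the storage format and hardware, so the operation count should be read as an upper bound on the possible gain rather than a measured one.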
It is easy to see that our algorithm does not depend on the choice of the BP algorithm.

Comparing Validation Accuracies of Various Training Algorithms
As mentioned above, we chose the simplest training algorithm for BP, namely, gradient descent. Here, we compare its validation time at different levels of initial sparsity with that of other well-known BP algorithms: conjugate gradient, quasi-Newton, and Levenberg-Marquardt (see Tab. 5).

Comparison with Existing Methods
We compare TIS with the SET [15] and DO [16] methods on three datasets: MNIST, CIFAR-10,$^{3}$ and the Keras dataset of Reuters newswire topics (RNT). The CIFAR-10 dataset consists of 60,000 32 pixel × 32 pixel color images in 10 classes with 6,000 images per class. There are 50,000 training samples and 10,000 test samples. RNT consists of 11,228 newswires from Reuters, labeled over 46 topics. Each wire is encoded as a sequence of word indices. We used the following BP algorithms: conjugate gradient, gradient descent, quasi-Newton, and Levenberg-Marquardt. Comparative analysis indicates that TIS has advantages over SET and DO for all BP algorithms. Therefore, in the interest of brevity, we present the comparative analysis only for the gradient descent BP algorithm.
Open-source implementations are freely available for both SET$^{4}$ and DO.$^{5}$ We use these implementations to benchmark the performance of TIS against the other two methods.

MNIST Dataset
We compare the performance of SET, DO, and TIS on the MNIST dataset considered in Section 3.1. For TIS with 50% initial sparsity (Fig. 7b), the accuracy of TIS increases much faster. It is also evident that TIS ensures a faster convergence rate. Fig. 7b shows that TIS with 50% initial sparsity has the highest convergence acceleration among the tested methods.

CIFAR-10 Dataset
We compare the performance of TIS, SET, and DO on the CIFAR-10 dataset. Fig. 8 demonstrates that TIS has an advantage over SET and DO in both accuracy and convergence.

RNT Dataset
We compare the performance of SET, DO, and TIS on the RNT dataset. For TIS with 50% initial sparsity (Fig. 9b), the accuracy of TIS increases much more quickly. It is also evident that TIS ensures a faster convergence rate. Fig. 9b shows that TIS with 50% initial sparsity has the highest convergence acceleration among the tested methods.

Conclusions and Future Work
As artificial neurons are designed to mimic the functioning of biological neurons, it is natural to expect that artificial neural networks should possess the key features of biological neural networks, which would lead to efficient learning. Features reported to have a significant impact on learning efficiency include sparsity [35,36], scale-freeness [37], and small-worldness [38]. Mocanu et al. [15] designed a sparse and scale-free ANN, which was shown to substantially enhance learning efficiency. In their method, at the initial step, the ANN is assumed to be fully connected; weights with small absolute values are removed; and new random weights are added (see Algorithm 1).
In this study, we introduced the concept of initial sparsity, that is, the ANN is assumed to be sparse at the initial step, with the possibility to prescribe the level of initial sparsity. At each training epoch, weights that are close to zero in absolute value are removed, and random weights are added (see Algorithm 2). The test network has two hidden layers. Comparative analysis shows that networks with initial sparsity of up to 50% exhibit better accuracy than the initially fully connected network. Moreover, it is observed that after training is finished, the testing time dramatically decreases with the increase of the initial sparsity level. It is also observed that convergence of the error is slower for initial sparsity levels greater than 40% (compared with an initially fully connected network). Comparative analysis shows that the conjugate gradient, quasi-Newton, and Levenberg-Marquardt BP algorithms increase the accuracy of validation compared with gradient descent BP (with 50% initial sparsity).
Figure 9: Benchmarking TIS, SET, and DO on RNT dataset: (a) accuracy plot for TIS with 25% initial sparsity, DO with 15% rate, and SET with 25% sparsity; (b) accuracy plot for TIS with 50% initial sparsity, DO with 10% rate, and SET with 50% sparsity; (c) convergence plot for TIS with 25% initial sparsity, DO with 15% rate, and SET with 25% sparsity; (d) convergence plot for TIS with 50% initial sparsity, DO with 10% rate, and SET with 50% sparsity
The proposed method was also compared with other similar methods, namely, SET and DO. An analysis was carried out on the MNIST, CIFAR-10, and RNT datasets. The analysis showed that TIS outperforms both SET and DO in accuracy and convergence rate. These observations apply to the four tested BP algorithms: conjugate gradient, gradient descent, quasi-Newton, and Levenberg-Marquardt.
These observations motivate us to improve the general algorithm, which will be a focus of future work. In this study, we used gradient descent, one of the simplest BP methods, for error minimization. A priority of future work will be to implement the developed algorithm with more advanced minimization strategies, such as the modified conjugate gradient descent and distributed Newton methods, combined with a more efficient line search strategy. We also intend to test the algorithm on some variants of convolutional neural networks. Another challenging problem is the optimal choice of the level of initial sparsity in the context of the network structure and the particular dataset.