Enhancing deep neural network training efficiency and performance through linear prediction

Deep neural networks have achieved remarkable success in various fields. However, training an effective deep neural network still poses challenges. This paper proposes a method to optimize the training of deep neural networks, with the goal of improving their performance. Firstly, based on the observation that the parameters (weights and biases) of a deep neural network change according to certain regularities during training, the potential of parameter prediction for improving training efficiency is identified. Secondly, we show that parameter prediction can also improve the performance of a deep neural network through the noise injection introduced by prediction errors. Then, taking the practical limitations into account, a Parameter Linear Prediction method for deep neural networks is developed. Finally, performance and hyperparameter sensitivity are validated on several representative backbones. Experimental results show that, compared with SGD alone, the proposed Parameter Linear Prediction method yields an increase of approximately 1% in the accuracy of the optimal model, along with a reduction of about 0.01 in top-1/top-5 error. Moreover, it exhibits stable performance under various hyperparameter settings, which demonstrates the effectiveness of the proposed method and validates its capacity to enhance training efficiency and performance.


Introduction
From the epoch-making Convolutional Neural Network (CNN) [1] to the Deep Belief Network (DBN) [2] and various other effective neural network structures [3], [4], [5], [6], [7], [8], [9], Deep Neural Networks (DNNs) have undoubtedly become the mainstream of machine learning and have demonstrated remarkable success in tasks such as computer vision, natural language processing, and speech recognition. However, as DNN models grow larger and more complex, training them remains a challenging and time-consuming task, often requiring extensive computational resources and careful hyperparameter tuning.
The training effectiveness of DNN models is crucial for their success and widespread adoption. Despite their impressive capabilities, DNNs are susceptible to several challenges that can hinder their training and limit their performance, such as vanishing or exploding gradients and overfitting. These problems have been studied since the early days of DNNs, and many creative methods have been proposed to solve them. For the gradient problems, the ReLU activation function is one of the most famous remedies [10]. In addition, structures such as Resnet and DenseNet have proved effective in mitigating gradient problems in deeper models [6], [7]. For the overfitting problem, early stopping [11], data augmentation [12] and regularization [13] are now almost standard components of DNN training.
Furthermore, even when equipped with the methods above, researchers still have to tune hyperparameters carefully to gain each 1% of accuracy, because better DNN performance usually depends on the availability of exceptionally large computational resources on specialized hardware accelerators during training, which entails similarly substantial energy consumption [14], [15]. Moreover, the long time needed for DNN training also prolongs the validation cycle of an algorithm, which often results in researchers investing time and effort in vain. Therefore, research on how to improve the training efficiency and performance of DNNs is still necessary.
To obtain better DNN training efficiency and performance than the normal training method (i.e., the method that only uses SGD for parameter optimization during training), in this paper we propose a DNN Parameter (weights and biases) Linear Prediction (PLP) method. Instead of only using SGD to find the optimal parameter values step by step, the proposed PLP method treats every 4 iterations as a cycle. Firstly, the parameter values obtained from SGD in the first 3 iterations are stored. Secondly, the slope of the median line of the triangle formed by these 3 results is calculated, and the midpoint of the last 2 stored results is taken as the start point. Then, a linear prediction of the parameters is made using the slope and the start point. Finally, the predicted parameters are written back to the model for the next optimization iteration.
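The cycle described above can be illustrated on a toy problem. The snippet below is a minimal sketch rather than the authors' implementation: it uses plain gradient descent on a one-dimensional quadratic loss, and the names `plp_predict`, `grad` and `train`, as well as the choice to apply the prediction right after the third update so that the fourth iteration starts from the predicted value, are assumptions made for illustration.

```python
def plp_predict(p1, p2, p3, step=1):
    """Linear prediction from three consecutive parameter values."""
    m12 = (p1 + p2) / 2.0   # midpoint of the 1st and 2nd results
    m23 = (p2 + p3) / 2.0   # midpoint of the 2nd and 3rd results (start point)
    slope = m23 - m12       # slope of the line through the two midpoints
    return m23 + slope * step


def grad(w, target=3.0):
    """Gradient of the toy loss 0.5 * (w - target) ** 2."""
    return w - target


def train(w0, lr, iterations, use_plp):
    """Gradient descent, optionally applying PLP once per 4-iteration cycle."""
    w = w0
    history = []                 # results stored within the current cycle
    for it in range(iterations):
        w = w - lr * grad(w)     # ordinary update step
        if not use_plp:
            continue
        position = it % 4        # position inside the 4-iteration cycle
        if position < 3:
            history.append(w)
        if position == 2:        # three results stored: predict and overwrite
            w = plp_predict(*history)
        if position == 3:        # cycle finished: clear the stored results
            history = []
    return w


w_plain = train(0.0, lr=0.1, iterations=40, use_plp=False)
w_plp = train(0.0, lr=0.1, iterations=40, use_plp=True)
print(abs(w_plain - 3.0), abs(w_plp - 3.0))
```

On this smooth, noise-free loss the run with prediction ends closer to the minimum after the same number of iterations. The toy setting says nothing about noisy mini-batch training, where the prediction error behaves like injected noise.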
With the proposed PLP method, better training results (lower average loss and higher accuracy on the validation set, higher accuracy and lower top-1/top-5 error on the test set) are obtained compared to the normal method with the same settings and number of training epochs. The experimental results on several representative backbones further indicate that the proposed method is general and can be extended to other tasks with minor effort.
The rest of this paper is organized as follows. In section 2, related works that aim to improve DNN training efficiency and performance are introduced. In section 3, we detail the implementation of the proposed PLP method. In section 4, we provide experimental results on several representative backbones (Vgg, Resnet and GoogLeNet) with the CIFAR-100 dataset. Finally, conclusions are drawn in section 5.

Related Works
Researchers have proposed various methods to obtain better DNN training efficiency and performance. These methods can be roughly divided into 2 categories, i.e., Model based methods and Algorithm based methods.
Model based method. Model based methods aim to obtain better DNN training results by optimizing the model structure. Well-known works include Vgg [4], GoogLeNet [5] and Resnet [6] among Convolutional Neural Networks (CNN), LSTM [16] and GRU [17] among Recurrent Neural Networks (RNN), and DCGAN [18] and StackGAN [19] among Generative Adversarial Networks (GAN). These networks are now often adopted as backbones for different applications with corresponding modifications [20], [21], [22]. The boundary between Model based methods and Algorithm based methods is relatively vague, as the introduction of a new algorithm usually results in certain changes to the model.
Algorithm based method. Algorithm based methods introduce new theories or techniques to improve training efficiency and performance. One widely used Algorithm based method is early stopping [11], which attempts to avoid overfitting by stopping the training when the validation error starts to increase. For the same purpose, dropout [3], which randomly removes hidden units during training, was proposed by Hinton. Data augmentation increases the diversity of data samples through random transformation or expansion of the training data, so as to improve the generalization ability of the DNN model [12]. Another commonly used method is regularization, which introduces additional penalty terms into the loss function to limit the size or complexity of the model parameters and thereby improve generalization [13].
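As an illustration of the early stopping rule mentioned above, the following minimal sketch stops when the validation error has not improved for a given number of consecutive epochs. The function name `early_stopping_epoch` and the `patience` parameter are assumptions made for this example; they are not taken from [11].

```python
def early_stopping_epoch(val_errors, patience=1):
    """Return the epoch index at which training would stop, or None.

    val_errors: validation error measured after each epoch.
    patience:   hypothetical number of consecutive non-improving epochs
                tolerated before stopping.
    """
    best = float("inf")
    bad_epochs = 0
    for epoch, err in enumerate(val_errors):
        if err < best:
            best = err           # new best validation error
            bad_epochs = 0
        else:
            bad_epochs += 1      # validation error did not improve
            if bad_epochs >= patience:
                return epoch
    return None                  # the stopping rule never triggered


print(early_stopping_epoch([0.9, 0.7, 0.6, 0.65, 0.7], patience=1))
```

A larger `patience` makes the rule less sensitive to single noisy epochs, at the cost of training for longer after the best model has been reached.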
These classical methods are continuously being optimized and applied to DNN training. Li et al. [23] proposed a dual-dropout method based on Pearson correlation analysis and a random dropout function to prevent overfitting and improve training efficiency. Considering that the randomness introduced by dropout causes a non-negligible inconsistency between training and inference, Wu et al. [24] proposed the R-Drop strategy, which forces the output distributions of the different sub-models generated by dropout to be consistent with each other. Chen et al. [25] proposed a data augmentation method, 'GridMask', which deletes regions of the input image and showed good performance. Tran et al. [26] proposed a principled framework, termed Data Augmentation Optimized for GAN, that enables the use of augmented data in GAN training to improve the learning of the original distribution. To stabilize GAN training, Zhang et al. [27] introduced consistency regularization and achieved impressive results.
Currently, most DNN training optimization ideas focus on improving the above methods, and good results have been achieved for specific problems. Nevertheless, further exploration of ways to enhance DNN training efficiency and performance is still meaningful.

Proposed Method
In this section, we first discuss why the proposed PLP method works. Then the implementation details of the proposed method are introduced. Fig. 2 shows the basic architecture of the proposed Linear Prediction method.

Problem Formulation
The proposed method is based on the observation that, from their initial values until model convergence, DNN parameters (weights and biases) change gradually and follow certain regularities rather than fluctuating randomly like noise. This suggests that parameter prediction is a feasible way to improve the training performance of a DNN model under the same settings and the same amount of training. Fig. 1 shows the change curves of a weight and a bias randomly picked from the training process of a Vgg16 network trained on the CIFAR-100 dataset. Although the parameter changes exhibit some instability due to the use of SGD, the smoothed curves show that the changes still follow a certain regularity. Therefore, if the value of each parameter can be predicted accordingly during training, it is possible to achieve better training efficiency than the normal method under the same conditions. Meanwhile, due to the characteristics of SGD and to different DNN settings, such as initial parameter values, learning rate or batch size, the same parameter in the same DNN trained on the same dataset may follow different trajectories. Given this, and considering that DNNs nowadays have millions or even billions of parameters, it is impractical to fit a dedicated prediction model for each parameter in real time during training with more accurate regression algorithms, such as Least Squares or Support Vector Regression. Additionally, since SGD inherently tolerates noisy gradient computations [28], the SGD process itself has a certain tolerance for loss of parameter prediction accuracy. Therefore, the prediction method must respect the real-time, memory and computation limits of DNN training, while some prediction accuracy can be traded off.
Furthermore, since noise injection can have a regularization effect on DNN training [29], the noise introduced by parameter prediction error affects model generalization and also has the potential to help the SGD optimization process escape from local optima. Therefore, it is reasonable to expect that parameter prediction can also improve the accuracy and generalization capability of the trained DNN model.
Based on the discussion above, we propose a Parameter Linear Prediction (PLP) method for DNN parameter prediction.

Parameter Linear Prediction
As shown in Fig. 2, the proposed Parameter Linear Prediction (PLP) method treats every 4 iterations as a cycle (an iteration refers to one mini-batch going through a complete forward and backward propagation). In each cycle, firstly, the results of the first 3 iterations computed by SGD, i.e., the parameters wn_1, bn_1, wn_2, bn_2 and wn_3, bn_3, are stored (n refers to the nth layer of the model).
Secondly, the midpoints m12 and m23 of the two consecutive pairs are calculated as follows (taking the weights as an example; the calculation for the biases is exactly the same):
$$m_{12} = \frac{w_{n\_1} + w_{n\_2}}{2}, \qquad m_{23} = \frac{w_{n\_2} + w_{n\_3}}{2} \tag{1}$$
Then, as shown in eq. 2, the slope of the median line of the triangle formed by the first 3 results is calculated from the midpoints m12 and m23 (if the 3 points lie on the same straight line, the slope of that line is adopted):
$$\mathrm{slope} = m_{23} - m_{12} \tag{2}$$
Afterward, the midpoint m23 is taken as the starting point, and the linear prediction for the parameters is made using the slope and the starting point, as shown in eq. 3:
$$\hat{w}_{n} = m_{23} + \mathrm{slope} \cdot \mathrm{step} \tag{3}$$
The "step" in eq. 3 refers to the number of steps to predict; to avoid introducing too much prediction error, we set "step" to 1.
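To make the arithmetic concrete, the prediction can be expanded in terms of the three stored values. The derivation below is a sketch under the assumption that the midpoints are plain averages, that the slope is the difference of the two midpoints, and that the prediction is the start point plus slope times step; $w_1, w_2, w_3$ are shorthand for the three stored results of a single weight, and $\hat{w}$ is a symbol introduced here for the predicted value.

```latex
\begin{align*}
m_{12} &= \tfrac{1}{2}(w_1 + w_2), \qquad m_{23} = \tfrac{1}{2}(w_2 + w_3) \\
\mathrm{slope} &= m_{23} - m_{12} = \tfrac{1}{2}(w_3 - w_1) \\
\hat{w} &= m_{23} + \mathrm{slope} \cdot \mathrm{step}
         = \tfrac{1}{2}\left(-w_1 + w_2 + 2 w_3\right) \quad (\mathrm{step} = 1) \\[4pt]
\text{Example: } & w_1 = 0.10,\; w_2 = 0.14,\; w_3 = 0.16
  \;\Rightarrow\; m_{12} = 0.12,\; m_{23} = 0.15,\; \mathrm{slope} = 0.03,\; \hat{w} = 0.18
\end{align*}
```

Under these assumptions the slope equals half the change between the first and third results, so a single noisy middle value $w_2$ shifts the start point but does not affect the slope.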
The proposed method uses 3 results for prediction, rather than 2 or more than 3, because a 3-point prediction is smoother than a 2-point prediction and requires less memory and computation than a prediction using more points.
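The prediction step for one layer amounts to element-wise arithmetic over the three stored results. The following sketch uses flat Python lists in place of parameter tensors; the function names `predict_values` and `predict_layer` and the dictionary layout of the stored results are hypothetical and only serve the illustration.

```python
def predict_values(v1, v2, v3, step=1):
    """Element-wise linear prediction from three stored results of one
    parameter tensor, given here as flat lists of floats."""
    predicted = []
    for a, b, c in zip(v1, v2, v3):
        m12 = (a + b) / 2.0      # midpoint of the 1st and 2nd results
        m23 = (b + c) / 2.0      # midpoint of the 2nd and 3rd results
        slope = m23 - m12
        predicted.append(m23 + slope * step)
    return predicted


def predict_layer(snapshots, step=1):
    """snapshots: three dicts {"weight": [...], "bias": [...]} stored after
    the 1st, 2nd and 3rd iteration of a cycle (hypothetical layout)."""
    s1, s2, s3 = snapshots
    return {name: predict_values(s1[name], s2[name], s3[name], step)
            for name in s1}


snapshots = [
    {"weight": [1.0, 0.0], "bias": [0.0]},
    {"weight": [2.0, 0.0], "bias": [1.0]},
    {"weight": [3.0, 0.0], "bias": [4.0]},
]
print(predict_layer(snapshots))
```

Only three results per parameter need to be kept, and each prediction costs a handful of additions and multiplications per element, which matches the memory and computation argument made for the 3-point choice.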

Experiments
In this section, the experimental details are described first. We then perform experiments to evaluate the performance and efficiency of the proposed method.

Implementation Details
We experiment with 3 representative backbones, i.e., Vgg16 [4], Resnet18 [6] and GoogLeNet [5]. All networks are randomly initialized under the default settings of PyTorch, with no pretraining on any external dataset. We evaluate our method on these networks with the CIFAR-100 dataset, which has 100 categories of 600 images each, for a total of 50000 training images and 10000 test images. We randomly split off 20% of the training set as the validation set. An NVIDIA GeForce RTX 3060 Laptop GPU is used to implement and evaluate the method.
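The random 20% hold-out described above can be sketched with index lists. The helper name `split_indices` and the fixed seed are assumptions made for reproducibility of the example; the paper does not state how the split was seeded.

```python
import random


def split_indices(n_samples, val_fraction, seed=0):
    """Randomly split sample indices into training and validation parts."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)      # seeded shuffle (assumed seed)
    n_val = int(round(n_samples * val_fraction))
    return indices[n_val:], indices[:n_val]   # (training, validation)


train_idx, val_idx = split_indices(50000, 0.2)
print(len(train_idx), len(val_idx))
```

With 50000 training images this leaves 40000 images for training and 10000 for validation.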
The experiments are implemented with Python 3.10.9 and PyTorch 2.0.1. All models are trained for 100 epochs with SGD, and the mini-batch size is set to 128. We adopt a warm-up strategy in the first epoch, and for the remaining 99 epochs we use the cyclic learning rate update strategy (i.e., torch.optim.lr_scheduler.CyclicLR in PyTorch) with a base learning rate of 0.001 and a max learning rate of 0.002. The weight decay is 1e-4 and the momentum is 0.9 in our evaluations. To make the validation of the PLP method more objective and to avoid erroneous results caused by randomness, the accuracy and top-1/top-5 error evaluations are repeated 10 times (i.e., for each of the 3 selected networks we trained 10 models with the normal method and 10 models with the proposed PLP method, and tested each of them on the test set), and we report the averages.
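For intuition about the cyclic learning rate, the sketch below computes a triangular schedule that oscillates linearly between the base and max learning rates given above. It is a stand-alone approximation of the triangular policy, not a reproduction of the experimental setup: the half-cycle length `step_size` is not stated in the paper, so the value 2000 is only a placeholder, and the function name `triangular_lr` is hypothetical.

```python
import math


def triangular_lr(iteration, base_lr=0.001, max_lr=0.002, step_size=2000):
    """Learning rate of a triangular cyclic schedule at a given iteration.

    step_size is the number of iterations in half a cycle (placeholder value).
    """
    cycle = math.floor(1 + iteration / (2 * step_size))
    x = abs(iteration / step_size - 2 * cycle + 1)   # distance from the peak
    return base_lr + (max_lr - base_lr) * max(0.0, 1.0 - x)


for it in (0, 1000, 2000, 3000, 4000):
    print(it, triangular_lr(it))
```

The rate starts at the base value, reaches the max value after `step_size` iterations, and returns to the base value after `2 * step_size` iterations, after which the pattern repeats.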

Results & Analysis
Fig. 3 shows the accuracy of the proposed PLP method and the normal method on the training set, together with their difference. The accuracy of the proposed PLP method is superior to that of the normal method during almost the whole training process. Combined with the loss curves on the validation set shown in Fig. 4, it can be concluded that the models start to overfit around the 40th epoch. This explains the "up-down" tendency of the accuracy difference curve in Fig. 3: the parameters gradually converge in the overfitting stage, so the performance of the normal method gradually catches up with that of the proposed PLP method. Fig. 4 shows the loss of the proposed PLP method and the normal method on the validation set, together with their difference. There are some outliers in Fig. 4 (e.g., points A, B and C) where the proposed PLP method performs worse, i.e., its validation loss is larger than that of the baseline models. The root cause of these outliers is that SGD may lead to sudden changes in parameter values, which the proposed PLP method is unable to follow. Apart from these points and the overfitting stage after about the 40th epoch, the proposed method shows better performance during training than the baseline models. Fig. 3 and Fig. 4 show 3 pairs of training runs selected from the 10 sets of trained models of the 3 selected networks (Vgg16, Resnet18 and GoogLeNet), so as to demonstrate the ability of the proposed PLP method to improve DNN training performance. To better illustrate the capacity of the proposed method to improve the training efficiency of DNNs, Table 1, Table 2 and Table 3 list the epoch and loss values corresponding to the optimal model obtained in each of the 10 tests for the 3 selected networks.
Table 1 shows the comparison between the PLP method and the normal method based on Vgg16. Except for the 5th test, in which the PLP method obtains a higher loss at a later epoch, and the 3rd and 7th tests, in which it obtains a lower loss at a later epoch, the PLP method reaches better performance at an earlier epoch than the normal method. The reason for the abnormal behavior in the 3rd, 5th and 7th tests is that the randomness of SGD may cause abrupt gradient changes during training, so that the noise introduced by the PLP method exceeds the tolerance of SGD and the PLP method needs more training epochs than the normal method to converge. However, even under this condition, the PLP method may still achieve better training performance thanks to the regularization effect of the introduced noise (3rd and 7th tests). Table 2 and Table 3 show the comparison between the PLP method and the normal method based on Resnet18 and GoogLeNet. As shown in the tables, there are also some outliers similar to those in Table 1, but in most cases the PLP method obtains the optimal model faster than the normal method, which supports the capacity of the proposed method to improve the training efficiency of DNNs. To further demonstrate the effectiveness of the proposed method in enhancing DNN training efficiency and performance, the average accuracy and top-1/top-5 error results on the CIFAR-100 test set are shown in Table 4. The models trained with the proposed PLP method obtain higher accuracy than the baseline models, with an average improvement of more than 1%, showing that the better training performance of the proposed PLP method is not achieved at the expense of generalization and that generalization is even better than that of the baseline models. The proposed PLP method also obtains smaller top-1/top-5 error on average, which further supports its effectiveness in achieving better training performance.
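For reference, top-k error and its average over repeated runs can be computed as in the following sketch. The function names `top_k_error` and `mean` are hypothetical, the error is returned as a fraction between 0 and 1, and the scores are made-up toy values rather than results from the paper.

```python
def top_k_error(scores, labels, k):
    """Fraction of samples whose true label is not among the k classes
    with the highest scores."""
    errors = 0
    for row, label in zip(scores, labels):
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        if label not in ranked[:k]:
            errors += 1
    return errors / len(labels)


def mean(values):
    """Average of per-run results, e.g. over repeated training runs."""
    return sum(values) / len(values)


# Toy scores for 3 samples and 3 classes (illustrative values only).
scores = [[0.1, 0.7, 0.2], [0.5, 0.3, 0.2], [0.2, 0.3, 0.5]]
labels = [1, 1, 0]
print(top_k_error(scores, labels, k=1), top_k_error(scores, labels, k=2))
```

Averaging the per-run errors with `mean` corresponds to reporting one number per network and method over the repeated tests.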

Conclusion
In this paper, we proposed a new DNN training methodology called the Parameter Linear Prediction (PLP) method. It is based on the observation that, during training, DNN parameters (weights and biases) change gradually and follow certain regularities from their initial values until model convergence, which indicates the feasibility of parameter prediction for improving the training efficiency of DNN models. In addition, the noise introduced by the inevitable prediction error brings a regularization effect to DNN training, which helps to improve the generalization ability of the model. Thus, the proposed PLP method has the potential to improve DNN training efficiency and performance. The proposed PLP method has been applied to several commonly used and representative backbones, i.e., Vgg16, Resnet18 and GoogLeNet, and evaluated on the CIFAR-100 dataset. The results show an average accuracy improvement of 1% over the normal method, while the top-1 and top-5 errors are reduced by about 0.01 on average, supporting the capability of PLP to improve DNN training efficiency and performance.
Furthermore, the PLP method takes real-time requirements, data volume and computational complexity into account. It is also easy to implement and hardware friendly, and it introduces no extra hyperparameters.

Figure 1: Example of parameter changing tendency during Vgg16 training.

Figure 2: Overview of Linear Prediction method. Note that wn_1/2/3 and bn_1/2/3 denote the weights and biases of the nth layer of the model, where 1/2/3 indicates the parameter obtained from the 1st/2nd/3rd iteration in this cycle. In step b and c

Figure 3: Accuracy and comparison of proposed PLP and baseline models. The "Acc Diff" in the figure refers to the PLP method accuracy minus the normal method accuracy.

Figure 4: Loss and comparison on validation set of proposed PLP method and baseline models. The "Loss Diff" in the figure refers to the PLP method loss minus the normal method loss.

Table 4: Average accuracy and top-1/top-5 error results on test set.