Improving the efficiency of RMSProp optimizer by utilizing Nesterov in deep learning

Several methods have been developed to improve the performance of Deep Learning (DL). Many of them reach the best performance of their models by tuning techniques such as Transfer Learning, Data Augmentation, Dropout, and Batch Normalization, while others do so by selecting the best optimizer and the best architecture for their model. This paper is mainly concerned with optimization algorithms in DL. It proposes a modified version of the Root Mean Squared Propagation (RMSProp) algorithm, called NRMSProp, to improve the speed of convergence and to find the minimum of the loss function faster than the original RMSProp optimizer. NRMSProp takes the original algorithm a step further by exploiting the advantages of Nesterov Accelerated Gradient (NAG): it takes into consideration the direction of the gradient at the next step, with respect to the history of previous gradients, and adapts the value of the learning rate accordingly. As a result, this modification helps NRMSProp converge faster than the original RMSProp, without any increase in complexity. In this work, many experiments were conducted to evaluate the performance of NRMSProp by running several tests with deep Convolutional Neural Networks (CNNs) on different datasets using the RMSProp, Adam, and NRMSProp optimizers. The experimental results show that NRMSProp achieves effective performance, with accuracy up to 0.97 in most cases, in comparison to RMSProp and Adam, without any increase in the complexity of the algorithm and with a reasonable amount of memory and time.


Literature review
Background. In this subsection, the focus is on how optimizers work in Deep Neural Networks (DNNs). It presents some previous enhancement techniques of different traditional optimizers that deal with DL problems. Then, some modified versions of these optimizers are presented, with a discussion of how they work on different problems under different conditions. Training deep models effectively remains one of the most demanding tasks for researchers and practitioners in both real-world DL research and application work. So far, the vast majority of deep model training is based on the back-propagation algorithm, which propagates the errors from the output layer backward and uses gradient-descent-based optimization algorithms to update the parameters layer by layer. Therefore, in order to achieve an effective model, a suitable optimizer for the problem at hand should be chosen. There are different types of optimizers, such as Gradient Descent based Learning (GD), Momentum based Learning, Adaptive Gradient based Learning, and Momentum Adaptive Gradient based Learning algorithms.
Gradient descent based learning (GD) algorithms. GD 13 is utilized to minimize a function by iteratively moving in the direction of steepest descent, as defined by the negative of the gradient. It is used to update the model parameters in DL. There are multiple variants of GD, such as batch (vanilla) gradient descent 14 , stochastic gradient descent, and mini-batch gradient descent. The main difference between Batch Gradient Descent (BGD) 15 and Stochastic Gradient Descent (SGD) 16 is that in SGD the cost of only one example is computed per update, whereas in BGD the cost over all training examples in the dataset has to be computed. This considerably speeds up the training of neural networks, and is basically what motivates the use of SGD. SGD is utilized to update parameters in DL, as shown in Eq. (1). This equation is employed to update parameters in the backward pass, with the help of back propagation 17 , to compute the gradient. Each parameter, θ, is updated by taking the original parameter and subtracting the learning rate times the rate of change.
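Since Eq. (1) is not reproduced in this excerpt, the sketch below follows the standard SGD rule described in the text, θ ← θ − η·∇J(θ); the function and variable names are illustrative, not taken from the paper.

```python
import numpy as np

def sgd_update(theta, grad, lr=0.01):
    """One vanilla SGD step: subtract the learning rate times the
    gradient from each parameter (the rule described for Eq. (1))."""
    return theta - lr * grad

# Example: one update of a two-parameter model.
theta = np.array([1.0, -2.0])
grad = np.array([0.5, -0.5])
theta = sgd_update(theta, grad, lr=0.1)  # -> [0.95, -1.95]
```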
It is noteworthy that in order to overcome these drawbacks of SGD, an enhancement is needed. Mini-Batch Gradient Descent 18 would be adopted because it combines the best of the two approaches: it performs an update for each mini-batch of n training examples. The following sub-section is devoted to discussing the other types of optimization algorithms.
Momentum based learning algorithms. Instead of depending only on the gradient of the current step to steer the search, momentum also considers the gradients of the previous steps 19 . The gradient descent equations are changed as given in Eq. (2).
In the end, the parameters are updated through θ_(τ+1) = θ_τ − µ_τ. This allows the updates to be fine-tuned to the slopes of the error function, which speeds up SGD, as shown in Figure 1, which illustrates how momentum speeds up SGD in the training process 20 . Moreover, it helps adapt the updates to every individual parameter, executing larger or smaller updates depending on their impact. In the sub-section below, some adaptive algorithms are introduced.
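Eq. (2) is not reproduced in this excerpt, so the sketch below uses the standard momentum form consistent with the update θ_(τ+1) = θ_τ − µ_τ, where µ is the velocity; the names and the decay coefficient γ are illustrative assumptions.

```python
import numpy as np

def momentum_update(theta, mu, grad, lr=0.1, gamma=0.9):
    """One momentum step: the velocity mu accumulates a decayed sum of
    past gradients, and the parameters move by the velocity:
    theta_{tau+1} = theta_tau - mu_tau."""
    mu = gamma * mu + lr * grad   # standard velocity recursion (Eq. (2) form)
    theta = theta - mu
    return theta, mu

# Two steps on a constant gradient: the step grows as velocity builds up.
theta, mu = momentum_update(1.0, 0.0, grad=1.0)   # theta = 0.9
theta, mu = momentum_update(theta, mu, grad=1.0)  # theta = 0.71
```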
The Nesterov Momentum update 21 is a slightly modified adaptation of the momentum update that has recently gained popularity. It has stronger convergence guarantees for convex functions, and in practice it also performs slightly better than ordinary momentum. When the current parameter vector is at x, it can be inferred from the momentum update above that the momentum term alone is about to nudge the parameter vector by µ·υ. Therefore, to compute the gradient, the approximate future position, x + µ·υ, should be treated as a look-ahead 22 at the point we will shortly stop at. As a result, instead of computing the gradient at the current position, x, it is computed at x + µ·υ, as clarified in Fig. 2.
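The look-ahead idea can be sketched as follows. Since the paper's exact notation is not reproduced here, the gradient function, the decay coefficient, and the sign convention (velocity subtracted from the parameters) are illustrative assumptions.

```python
import numpy as np

def nag_update(theta, v, grad_fn, lr=0.1, gamma=0.9):
    """One Nesterov step: evaluate the gradient at the look-ahead
    point (where momentum alone would carry us), not at theta."""
    lookahead = theta - gamma * v           # approximate future position
    v = gamma * v + lr * grad_fn(lookahead) # velocity uses the look-ahead gradient
    return theta - v, v

# Minimizing f(x) = x**2, whose gradient is 2*x.
theta, v = nag_update(1.0, 0.0, grad_fn=lambda x: 2 * x)  # theta = 0.8
```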
Adaptive gradient based learning algorithms. This subsection investigates a group of learning algorithms, such as Adagrad, Adadelta, and RMSProp, that use adaptive learning rates to update variables. Adagrad 23 is a technique that permits the learning rate to adapt depending on the parameters: it makes big updates for infrequent parameters and small updates for frequent parameters, and it is well known for being appropriate for handling sparse data. This technique uses a different learning rate for each parameter at each time step, based on the past gradients that were calculated for that parameter. Adadelta 24 is an extension of AdaGrad that aims to eliminate its continually decaying learning rate. This strategy restricts the window of accumulated past gradients to a fixed size by using a decaying average, rather than accumulating all past squared first-order derivatives. RMSProp can be thought of as an extension of AdaGrad that uses a moving average of squared gradients to compute learning rates for every parameter; individual updates for every parameter can thus be calculated and applied independently.
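The moving-average idea behind RMSProp can be sketched as below. The paper's RMSProp equations are not reproduced in this excerpt, so the decay coefficient β and the small constant ϵ are the commonly used defaults, not values quoted from the text.

```python
import numpy as np

def rmsprop_update(theta, s, grad, lr=0.1, beta=0.9, eps=1e-7):
    """One RMSProp step: s is a moving average of squared gradients,
    and each parameter's step is scaled by 1/sqrt(s)."""
    s = beta * s + (1 - beta) * grad ** 2
    theta = theta - lr * grad / (np.sqrt(s) + eps)
    return theta, s

# A large gradient is damped by its own running magnitude.
theta, s = rmsprop_update(1.0, 0.0, grad=2.0)
```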
Momentum and adaptive gradient based learning algorithms. This subsection discusses learning algorithms that combine the benefits of the previous two families, such as Adaptive Moment Estimation (Adam) 25 , an approach that calculates adaptive learning rates for every parameter. Like AdaDelta, it keeps an average of past squared gradients. Adam is one of the most effective adaptive optimization approaches. It is an SGD-based algorithm that relies on the concept of momentum to arrive quickly and effectively at the loss function's global minimum. This helps modify the learning rate of each parameter efficiently, hastening convergence to the minimum. Adam calculates distinct adaptive learning rates based on the first and second moment values of the various parameters, as clarified in Eqs. (5) and (6),
where µ is the mean and υ is the variance of the first-order derivative, respectively. Equation (7) gives the final step of updating θ.
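Eqs. (5)-(7) are not reproduced in this excerpt; the sketch below follows the standard Adam update consistent with the description above (µ the first moment, υ the second moment), including the usual bias correction. The default coefficients are the commonly used ones, not values quoted from the text.

```python
import numpy as np

def adam_update(theta, mu, v, grad, t, lr=0.001,
                beta1=0.9, beta2=0.999, eps=1e-7):
    """One Adam step: first moment mu (mean of gradients), second
    moment v (uncentered variance), both bias-corrected before use."""
    mu = beta1 * mu + (1 - beta1) * grad             # Eq. (5), standard form
    v = beta2 * v + (1 - beta2) * grad ** 2          # Eq. (6), standard form
    mu_hat = mu / (1 - beta1 ** t)                   # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * mu_hat / (np.sqrt(v_hat) + eps)  # Eq. (7), standard form
    return theta, mu, v

# First step (t = 1): bias correction makes the step size close to lr.
theta, mu, v = adam_update(0.0, 0.0, 0.0, grad=1.0, t=1)
```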
Adam performs well in practice and provides results that compare favorably to other optimization algorithms, since it reaches the optimum fast and its performance is quite quick and effective. Adaptive techniques are also able to correct issues encountered in other optimization algorithms that may cause fluctuation in the loss function. One tricky issue concerns making the learning rate "just right": if it is selected too small, there will be no progress; on the other hand, if it is too large, the solution will fluctuate and, in the worst case, may even diverge. So, when the learning rate is specified and selected automatically, or this step is avoided altogether, second-order techniques, which search not only for the value and gradient of the objective but also for its curvature, can be beneficial.
Related work. Dozat 26 improved Adam and pointed out how to incorporate Nesterov Accelerated Gradient (NAG) in a way that is more direct and precise concerning performance. In that work, the step of modifying the parameters with just the momentum procedure to compute the gradient and then rolling back to the main parameter state is not executed; instead, the momentum step is applied only once, during the update of the previous time phase, and then continues again during the actual update. Tato and Nkambou 27 introduced additional hyperparameters to the Adam optimizer that preserve the direction of the gradient throughout the optimization run. Keskar and Socher 28 created a modified version of Adam (AAdam) to accelerate its performance. They aim to obtain a better minimum for the loss function, in comparison with the original algorithm, by borrowing ideas from momentum-based optimizers and the exponential-decay methodology. They also clarified that the step magnitude produced by Adam to adjust the parameters is such that the new step takes into account both the direction of the gradient and the modification applied in the previous steps. The authors used the MNIST data set for evaluation. The results showed that AAdam achieves the best results, particularly on the validation set, even in cases that require more memory; it surpasses Adam and NAdam in decreasing training and validation loss, and achieves better accuracy than the other methodologies.
Figure 2. The effect of Nesterov on the gradient step.
Another study 29 introduced a combined hybrid methodology that begins with an adaptive technique and then switches to SGD when convenient. The authors presented SWATS, a simple methodology that switches from Adam to SGD when a triggering condition is satisfied.
The introduced condition is related to the projection of the Adam steps on the gradient subspace. The cost of monitoring this condition is very low and does not expand the number of hyperparameters in the optimizer. Furthermore, both the switchover point and the learning rate for SGD after the switch are assigned as part of the algorithm, so no extra tuning effort is added. Additionally, the authors demonstrated the adequacy of this methodology on ImageNet data sets. The results clarified that the proposed methodology is comparable to SGD, while retaining beneficial qualities of Adam such as hyperparameter insensitivity and quick initial progress. Hoseini et al. 30 proposed an algorithm, AdaptAhead, that enables the training model in a DCNN to switch between the RMSProp and Nesterov optimizers while guaranteeing that no change is made to the structure of the RMSProp and Nesterov algorithms. AdaptAhead builds on three switches: the first is responsible for setting the value of the hyper-parameter norm, which corresponds to norm-1, the Euclidean norm, or the max-norm. The second switch determines whether gradients are computed in the normal or in the Nesterov manner. The third switch determines whether the learning rate works by applying the calculated norm in an adaptive manner, or in the normal manner based on the Nesterov method. Xue et al. 31 suggested an approach to enhance the training of feed-forward NNs that integrates the advantages of Differential Evolution and Adam. This approach explores the search space using a population-based method and adaptively modifies the learning rate to hasten convergence. Their results show that the proposed approach exhibits impressive outcomes in terms of accuracy and convergence speed.
Wang et al. 32 presented an architecture for communication-efficient compressed federated adaptive gradient optimization, FedCAMS, which tackles the adaptivity problem in federated optimization techniques while substantially reducing communication overhead. They suggested a universal adaptive federated optimization framework, FedAMS, as a base for FedCAMS; FedAMS includes different variants of Adam's characteristic max-stabilization technique. They offer an enhanced theoretical examination of the convergence of adaptive federated optimization, based on which they demonstrate that their suggested FedCAMS accomplishes the same convergence rate as its uncompressed counterpart FedAMS with orders of magnitude less communication cost in the nonconvex stochastic optimization context.

Proposed model NRMSProp
The main contribution of this work is a model that enhances the performance of Adaptive Gradient based Learning algorithms in terms of both time and accuracy. The proposed NRMSProp model gains its power from the advantages of the Nesterov approach and the way in which RMSProp calculates the gradients. The steps of NRMSProp are shown in Algorithm 1.
Step 1: τ = τ + 1
Step 2: compute the gradient g_t at step τ
Step 3: calculate the Nesterov momentum vector µ_t
Step 4: calculate the square of the exponential moving average with the term (g_t − µ_t)
Step 5: compute the bias correction μ̂_t of µ_t
Step 6: apply the update of θ
End While
As clarified in Algorithm 1, Step 1 advances the time step, τ, and Step 2 calculates the gradient g_t, which is the main quantity that DL optimizers use to update the overall weights and the other parameters of the optimization process at the end of the loop in Step 6. In the first iteration, g_t is computed at an initial random point. As the goal of the optimization process is to reduce the difference between this start point and the actual target minimum, NRMSProp computes the Nesterov vector µ in Step 3, adding the term β_2^t to its computation, which effectively determines how the look-ahead technique is applied: the gradients are calculated not at the current point, but with respect to the approximate future point. In Step 4, NRMSProp keeps the value of all past gradients as a history of each movement and calculates the exponential moving average (s). Keeping this history helps NRMSProp be more stable and prevents overshoots when the target minimum is very close.
As in any adaptive learning technique, the step size is not a static value across iterations; NRMSProp adapts the step size toward its optimal value depending on the distance between the current position and the minimum. Hence the gradient is penalized with µ in the term (g_t − µ_t), to reduce the step size when the target point is near the current position. To decide the next step and choose the right direction, NRMSProp calculates the bias correction μ̂_t in Step 5. Finally, in Step 6, the weights θ are updated in the form of Eq. (12), adding Nesterov by replacing (μ̂_{t−1}) of the earlier step with (μ̂_t) of the current momentum vector to obtain θ_{t+1} as the last step of NRMSProp. To examine the proposed method, experiments were conducted on the Fashion-MNIST, CIFAR-10, and Tiny-ImageNet datasets using three optimizers: Adam, RMSProp, and the proposed NRMSProp.
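The six steps above can be sketched in code. Since Eqs. (8)-(12) are not reproduced in this excerpt, the exact update forms below (the momentum recursion, the (g_t − µ_t) term inside the moving average, and the bias-corrected numerator) are an interpretation of Algorithm 1's description rather than the authors' verbatim formulas; the coefficient defaults are those listed for the experiments (β1 = 0.9, β2 = 0.999, ϵ = 10⁻⁷).

```python
import numpy as np

def nrmsprop_update(theta, mu, s, grad, t, lr=1e-3,
                    beta1=0.9, beta2=0.999, eps=1e-7):
    """One NRMSProp step, as a sketch of Algorithm 1."""
    mu = beta1 * mu + (1 - beta1) * grad              # Step 3: Nesterov momentum vector
    s = beta2 * s + (1 - beta2) * (grad - mu) ** 2    # Step 4: EMA with the (g_t - mu_t) term
    mu_hat = mu / (1 - beta1 ** t)                    # Step 5: bias correction
    theta = theta - lr * mu_hat / (np.sqrt(s) + eps)  # Step 6: update with current mu_hat
    return theta, mu, s

# One step from theta = 1.0 with a unit gradient moves toward the minimum.
theta, mu, s = nrmsprop_update(1.0, 0.0, 0.0, grad=1.0, t=1)
```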

Experiments
Datasets description. Fashion-MNIST 33 consists of 60,000 28 × 28 grayscale images of 10 fashion categories, along with a test set of 10,000 images. Figure 3 shows these 10 categories 34 . The CIFAR-10 dataset 35 includes 6000 images per class in 10 classes, totaling 60,000 32 × 32 colour images; samples of this dataset are shown in Fig. 4. These images are split into 50,000 training images and 10,000 test images. The Tiny ImageNet dataset 36 is a version of the ImageNet dataset. It contains 200 categories with 100,000 images, plus 10,000 images each for the validation and test processes. Samples of this dataset are shown in Fig. 5.
The model structure. To examine NRMSProp, two models were used: a simple Convolutional Neural Network (CNN) 39 model and a ResNet model. Figure 6 illustrates the structure of the CNN layers 40,41 . Table 1 illustrates the structure of the NRMSProp model and its layers, trained for 50 epochs, where the values of the hyper-parameters are: η is the learning rate (the default is 10⁻³), β1 and β2 are the smoothing parameters (β1 = 0.9, and β2 = 0.999), and ϵ is a small number, usually set to 10⁻⁷.
Residual Network (ResNet) 42 is a family of DNN models that has attained outstanding results on a variety of computer-vision tasks, including semantic segmentation, object recognition, and image classification. The introduction of residual links, which enable the network to learn residual mappings that can be quickly optimized using gradient-based approaches, is the primary innovation of the ResNet models. These residual mappings are computed as the difference between the input and the output of a group of convolutional layers, which is then added back to the input. Instead of trying to learn the complete mapping from scratch, the network can in this manner learn to concentrate on the disparities between the input and the desired output. ResNet models come in many depths, ranging from the original ResNet-18 to the far deeper ResNet-152, and have already been trained on massive datasets like ImageNet. Here, ResNet_V2 is used; Figure 7 illustrates the structure of its layers. The results of the two models on the three datasets are illustrated in the next section.
In the next experiment, the classes of the CIFAR-10 dataset are labeled as follows: Class 1 refers to airplane, Class 2 to automobile, Class 3 to bird, Class 4 to cat, Class 5 to deer, Class 6 to dog, Class 7 to frog, Class 8 to horse, Class 9 to ship, and Class 10 to truck. Table 5 illustrates the overall performance of Adam on the CIFAR-10 dataset, in which it achieves its highest precision, 83%, on class ship, its highest recall, 81%, on class truck, and its highest F1-score, 78%, on class automobile. Table 6 illustrates the overall performance of RMSProp on the CIFAR-10 dataset, in which it achieves its highest precision, 91%, on class automobile, its highest recall, 89%, on class ship, and its highest F1-score, 81%, on class automobile. Table 7 illustrates the overall performance of NRMSProp on the CIFAR-10 dataset, in which it achieves its highest precision, 85%, on class automobile, its highest recall, 87%, on class ship, and its highest F1-score, 81%, on class automobile.

Results of the experiments.
From Figs. 2, 3, 4, 5, 6, and 7, the overall performance of NRMSProp is shown to be higher than that of Adam and RMSProp, reaching high values in most classes of the Fashion-MNIST and CIFAR-10 datasets.
The second criterion for the evaluation stage is constructing a confusion matrix 43 , which gives an accurate assessment of the model's accuracy in terms of true positives, true negatives, false positives, and false negatives. This aids in comprehending the model's performance and locating potential improvement areas. It also offers a more in-depth picture of the model's performance and can assist in pinpointing certain instances where the model is functioning correctly or incorrectly. The model's general effectiveness can be improved by detecting problem areas with the use of the confusion matrix, which can also be used to guide parameter adjustment and refinement. Three confusion matrices are constructed for each dataset, as given below. Figures 8, 9, and 10 show the confusion matrices for the Fashion-MNIST dataset. Figure 8 illustrates that Adam gets confused mostly between similar classes when actual and predicted values are compared. Some of the higher misclassification values are between these classes: (T-shirt/top and Shirt), (Pullover and Shirt), (Coat and Shirt), (Coat and Pullover), with values (117, 68, 165, 105). Figure 9 illustrates that RMSProp gets confused in many classes: some of the higher misclassification values are between (T-shirt/top and Shirt), (Pullover and Coat), (Dress and Coat), (Coat and Pullover), (Shirt and Coat), (Shirt and Pullover), with values (224, 66, 52, 92, 154, 168). Figure 10 illustrates that NRMSProp gets confused only between very similar classes: some of the higher misclassification values are between (T-shirt/top and Shirt), (Pullover and …), with values (147, 123, 92). Based on these experiments on the Fashion-MNIST dataset, NRMSProp is shown to have a lower degree of confusion than Adam and RMSProp.
Figures 11, 12, and 13 show the confusion matrices for the CIFAR-10 dataset. Figure 11 illustrates that Adam gets confused in many classes when actual and predicted values are compared; focusing on some of the higher misclassification values, many classes show high confusion, such as airplanes, cats, deer, dogs, and trucks. Figure 12 illustrates that RMSProp also gets confused in many classes; analyzing its behavior, most of the values are very close to each other, and RMSProp hardly distinguishes images between … optimizers under all measurement criteria. The overall performance of the NRMSProp optimizer is more efficient after adding the Nesterov term to its steps. Therefore, the power of Nesterov can be utilized to enhance the accuracy and to speed up the optimizer in general. Moreover, NRMSProp has an effective feature that keeps the history of the current point, making it more efficient in speeding up the decision of which direction should be chosen. When adding Nesterov to NRMSProp's steps, it gives