Unified Algorithm Framework for Nonconvex Stochastic Optimization in Deep Neural Networks

This paper presents a unified algorithmic framework for nonconvex stochastic optimization, which is needed to train deep neural networks. The unified algorithm includes the existing adaptive-learning-rate optimization algorithms, such as Adaptive Moment Estimation (Adam), Adaptive Mean Square Gradient (AMSGrad), Adam with weighted gradient and dynamic bound of learning rate (GWDC), AMSGrad with weighted gradient and dynamic bound of learning rate (AMSGWDC), and Adapting stepsizes by the belief in observed gradients (AdaBelief). The paper also gives convergence analyses of the unified algorithm for constant and diminishing learning rates. When using a constant learning rate, the algorithm can approximate a stationary point of a nonconvex stochastic optimization problem; when using a diminishing learning rate, it converges to a stationary point of the problem. Hence, the analyses lead to the finding that, in theory, the existing adaptive-learning-rate optimization algorithms can be applied to nonconvex stochastic optimization in deep neural networks. Additionally, this paper provides numerical results showing that the unified algorithm can train deep neural networks in practice. Moreover, it provides numerical comparisons of the unified algorithm with certain heuristic intelligent optimization algorithms on unconstrained minimization of benchmark functions. The comparisons show that a teaching-learning-based optimization algorithm and the unified algorithm perform well.


I. INTRODUCTION
A USEFUL way to train deep neural networks is to solve a nonconvex optimization problem defined in terms of the networks [1], [2], [3] and thereby find suitable parameters for them. Many algorithms have been presented to solve nonconvex optimization problems. The simplest algorithm for the problem is stochastic gradient descent (SGD) (see, e.g., [4], [5] for recent studies of SGD). Adaptive-learning-rate optimization algorithms (see Subsection 8.5 in [6]) are powerful methods that adapt the learning rates of the model parameters to solve the problem quickly. Examples include Adaptive Gradient (AdaGrad) [7], Root Mean Square Propagation (RMSProp) [6, Algorithm 8.5], Adaptive Moment Estimation (Adam) [8], Adaptive Mean Square Gradient (AMSGrad) [9], Adam with weighted gradient and dynamic bound of learning rate (GWDC) [2, Algorithm 2], AMSGrad with weighted gradient and dynamic bound of learning rate (AMSGWDC) [2, Algorithm 3], and Adapting stepsizes by the belief in observed gradients (AdaBelief) [10]. Note that the existing adaptive-learning-rate optimization algorithms use the inverses of certain positive-definite matrices at each iteration; in other words, their formulations depend on how those positive-definite matrices are defined.
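To make this shared structure concrete, the NumPy sketch below shows how Adam, AMSGrad, and AdaBelief can each be read as building a diagonal positive-definite matrix from a second-moment estimate. The function name, the simplified update rules, and the parameter values are our illustrative assumptions, not code from the cited papers.

import numpy as np

def second_moment_update(method, v, v_hat, g, m, delta=0.999):
    # Sketch of the second-moment rules behind Adam, AMSGrad, and AdaBelief.
    # g: stochastic gradient, m: first-moment (momentum) estimate,
    # delta: exponential-decay parameter (0.999 in the experiments below).
    if method == "adam":         # moving average of the squared gradient
        v = delta * v + (1 - delta) * g * g
        v_hat = v
    elif method == "amsgrad":    # element-wise maximum keeps v_hat nondecreasing
        v = delta * v + (1 - delta) * g * g
        v_hat = np.maximum(v_hat, v)
    elif method == "adabelief":  # deviation of the gradient from the momentum
        v = delta * v + (1 - delta) * (g - m) ** 2
        v_hat = v
    return v, v_hat

# Each method then works with the diagonal positive-definite matrix
# H = np.diag(np.sqrt(v_hat) + eps) for some small eps > 0.

Since v_hat is nonnegative and eps is positive, H is positive definite in every case; this common property is what the unified framework exploits.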
In this paper, we first show that the positive-definite matrices used in the existing adaptive-learning-rate optimization algorithms satisfy common conditions (Assumption III.1). Next, we present an algorithm [11] (Algorithm 1) whose convergence is guaranteed under Assumption III.1. This implies that the algorithm is a unification of the existing ones (see also Example III.1). For the convergence analyses of the algorithm, we use constant and diminishing learning rates. We show that, for a constant learning rate, the algorithm can approximate a stationary point of the nonconvex stochastic optimization problem (Theorem III.1), while for a diminishing rate, it converges to a stationary point of the problem (Theorem III.2).

Table 1. The convergence rate for nonconvex (resp. convex) optimization is measured by min_{k=1,2,...,n} E[‖∇f(x_k)‖²] (resp. R(T)/T), where n denotes the number of iterations and T denotes the number of training samples (see Table 2 for the definitions of the notation). C and C_i (i = 1, 2) are constants that are independent of n, T, and the constant learning rates α and β. A diminishing learning rate is α_n = 1/√n.

Table 1 summarizes the convergence rate results for SGD and the adaptive-learning-rate optimization algorithms, for nonconvex and convex optimization, that were reported in 2020 and 2021. Our first contribution is to provide convergence analyses of the existing adaptive-learning-rate optimization algorithms, i.e., analyses that could not be obtained using the methods in [2], [4], [5], [8], [9], [10]. The previously reported results (see, e.g., [2], [8], [9]) aimed at minimizing the regret (see Table 2 for the definition) for convex optimization. However, regret minimization does not always lead to solutions of optimization problems in deep learning (Subsection IV-E). In contrast to the previous results, this paper explicitly shows that the existing adaptive-learning-rate optimization algorithms can solve such problems (Subsections IV-A to IV-D). For nonconvex optimization, AdaBelief [10] has an O(log n/√n) convergence rate, where n denotes the number of iterations (see Table 1). Theorem III.2 ensures that using a diminishing learning rate α_n = 1/√n allows the proposed algorithm (Algorithm 1), which includes AdaBelief, to achieve an O(1/√n) convergence rate (see also [11]). While GWDC and AMSGWDC [2] with diminishing learning rates can only be applied to convex optimization (see Table 1), the proposed algorithm (Algorithm 1) can be applied to both convex and nonconvex optimization (see Table 1).
In particular, we would like to emphasize that the existing algorithms with constant learning rates can be applied to the problem (Subsection IV-E and Table 1). The results for a constant learning rate are significant from the viewpoints of both theory and practice, since algorithms with a constant learning rate work well in practice, whereas algorithms with a learning rate diminishing to zero often do not. Moreover, we would also like to emphasize that not only Adam and AMSGrad but also GWDC and AMSGWDC can be applied to the problem. This is in contrast to [2], which presented a regret minimization only for GWDC and AMSGWDC with a diminishing learning rate, and [11], which presented convergence analyses only for Adam and AMSGrad with constant and diminishing learning rates. Using constant learning rates allows Algorithm 1 to achieve approximately an O(1/n) convergence rate for nonconvex optimization (see Table 1).
The second contribution of this paper is to show that the algorithm with a constant learning rate tends to be superior for training neural networks, while the one with a diminishing rate tends to perform poorly. Here, we focus on image classification using several neural networks and show the effectiveness of the algorithm with a constant learning rate (Subsections V-A and V-B). Moreover, we consider the unconstrained optimization problems in [12] and provide numerical comparisons of Algorithm 1 with heuristic intelligent optimization algorithms, such as the genetic algorithm (GA), particle swarm optimization (PSO), biogeography-based optimization (BBO), atom search optimization (ASO), and teaching-learning-based optimization (TLBO). The numerical comparisons show that, in particular, TLBO and Algorithm 1 perform well (Subsection V-C).
This paper is organized as follows. Section II states the main problem. Section III presents the proposed algorithm for solving the main problem and analyzes its convergence. Section IV compares the analyses in Section III with the ones in the previous reports. Section V numerically compares the behaviors of the proposed algorithm with those of the existing ones. Section VI concludes the paper with a brief summary.

II. OPTIMIZATION IN DEEP NEURAL NETWORKS
The notation used in this paper is summarized in Table 2.
In general, an optimization problem in a deep neural network can be expressed as the following nonconvex optimization problem.

Table 2. Notation
• A ⊙ B: the Hadamard product of matrices A and B (x ⊙ x := (x_i²))
• P_X: the metric projection onto a nonempty, closed convex set X (⊂ R^d)
• P_{X,H}: the metric projection onto X under the H-norm
• E[Y]: the expectation of a random variable Y
• ξ: a random vector whose probability distribution P is supported on a set Ξ
• ∇f: the gradient of f
• G(x, ξ): the stochastic gradient for a given (x, ξ)
• X⋆: the set of stationary points of the problem of minimizing f over X
• f⋆: the optimal objective function value for the problem of minimizing f over X
• R(T): the regret on a sequence (x_t)_{t=1}^T, defined by R(T) := Σ_{t=1}^T f_t(x_t) − min_{x∈X} Σ_{t=1}^T f_t(x)
• y_n = O(x_n): there exist c ∈ R and n_0 ∈ N such that, for all n ≥ n_0, y_n ≤ c x_n, where (x_n)_{n∈N}, (y_n)_{n∈N} ⊂ R_+

Problem II.1 Assume that (A1) X ⊂ R^d is a nonempty, closed convex set onto which the projection can be easily computed, and (A2) f(x) := E[F(x, ξ)] is well defined, where F(·, ξ) is continuously differentiable for almost every ξ ∈ Ξ and ξ is a random vector whose probability distribution P is supported on a set Ξ ⊂ R^{d_1}. We would like to find a minimizer of f over X, i.e., x⋆ ∈ argmin_{x∈X} f(x).
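As a small illustration of the metric projections P_X and P_{X,H} in Table 2, the sketch below computes the projection onto a box under the H-norm for a diagonal H; the box constraint set and the function name are assumptions made for this example.

import numpy as np

def projection_box(x, lo, hi, h_diag):
    # Projection of x onto the box X = [lo, hi]^d under the H-norm, i.e.,
    # the minimizer of sum_i h_i * (y_i - x_i)^2 over y in X. For a
    # diagonal H and a box, the problem separates by coordinate, so the
    # H-projection reduces to the ordinary coordinate-wise clip.
    assert np.all(np.asarray(h_diag) > 0)  # H must be positive definite
    return np.clip(x, lo, hi)

# Example: projection_box(np.array([3.0, -2.0]), -1.0, 1.0,
#                         np.array([0.5, 2.0])) returns [1.0, -1.0].

For a box, the H-projection coincides with the Euclidean one because the objective separates coordinate-wise; the choice of H changes the projection only for constraint sets that do not separate.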
Even if Problem II.1 is deterministic, i.e., F does not depend on ξ, the existing algorithms, such as the steepest descent method, the Newton method, quasi-Newton methods, and conjugate gradient methods, can find only a stationary point of the problem of minimizing f over X. Given this fact, we will focus on the following stationary point problem [11] associated with Problem II.1.

Problem II.2 Under (A1) and (A2), we would like to find a stationary point x⋆ of Problem II.1, i.e., x⋆ ∈ X⋆ := {x⋆ ∈ X : ⟨x − x⋆, ∇f(x⋆)⟩ ≥ 0 for all x ∈ X}.

The relationships between Problems II.1 and II.2 are as follows: argmin_{x∈X} f(x) ⊂ X⋆. We also have that X⋆ = {x⋆ ∈ R^d : ∇f(x⋆) = 0} when X = R^d.
We will consider Problem II.2 under the following conditions.
Algorithm 1 Adaptive-learning-rate optimization algorithm for Problem II.2
6: Find d_n ∈ R^d that solves H_n d = −m_n
7: x_{n+1} := P_{X,H_n}(x_n + α_n d_n)
8: n ← n + 1
9: end loop

We need the following conditions to analyze Algorithm 1.
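For readers who want the surviving steps in executable form, here is a minimal NumPy sketch of one iteration of Algorithm 1 on a box constraint, with an AMSGrad-style diagonal H_n standing in for any matrix satisfying Assumption III.1. The variable names, the box constraint, and this specific choice of H_n are our assumptions for the example, not the paper's definitive implementation.

import numpy as np

def algorithm1_step(x, m, v, v_hat, grad_fn, xi, alpha, beta, lo, hi,
                    delta=0.999, eps=1e-8):
    # One iteration of the unified scheme on the box X = [lo, hi]^d.
    g = grad_fn(x, xi)                   # stochastic gradient G(x_n, xi_n)
    m = beta * m + (1 - beta) * g        # momentum term m_n
    v = delta * v + (1 - delta) * g * g  # second-moment estimate
    v_hat = np.maximum(v_hat, v)         # AMSGrad-style monotone choice
    h = np.sqrt(v_hat) + eps             # diagonal of H_n (positive definite)
    d = -m / h                           # step 6: solve H_n d = -m_n
    x = np.clip(x + alpha * d, lo, hi)   # step 7: x_{n+1} = P_{X,H_n}(x_n + alpha_n d_n)
    return x, m, v, v_hat

Swapping the two lines that update v and v_hat for the Adam or AdaBelief rules sketched in Section I recovers the corresponding special cases of the framework.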

A. CONVERGENCE ANALYSIS OF ALGORITHM 1 WITH A CONSTANT LEARNING RATE
The following is the convergence analysis of Algorithm 1 with a constant learning rate. The proof of Theorem III.1 is given in the proof of Theorem 1 in [11].

B. CONVERGENCE ANALYSIS OF ALGORITHM 1 WITH A DIMINISHING LEARNING RATE
The following is a convergence analysis of Algorithm 1 with a diminishing learning rate. The proof of Theorem III.2 is given in the proof of Theorem 2 in [11].
where f⋆ denotes the optimal value of the problem of minimizing f over X. Moreover, under η ∈ [1/2, 1), any accumulation point of (x̄_n)_{n∈N} defined by x̄_n := (1/n) Σ_{k=1}^n x_k almost surely belongs to X⋆ = argmin_{x∈X} f(x), and Algorithm 1 achieves the following convergence rate: Additionally, for the regret on the sequence F(·, t) := f_t(·) (t = 1, 2, . . . , T), Algorithm 1 satisfies

A. COMPARISON OF ADAM WITH ALGORITHM 1 IN THE CASE OF EXAMPLE III.1(I)
Theorem 4.1 in [8] indicates that Adam with α_n = O(1/√n) and β_n = λ^n (λ ∈ (0, 1)) ensures that there exists a positive real number D such that R(T)/T ≤ D/√T. Unfortunately, Theorem 1 in [9] shows that a counterexample to Theorem 4.1 in [8] exists. Meanwhile, Algorithm 1 with (1) resembles the Adam algorithm (see Footnote 1 for details). Proposition III.2 indicates that Algorithm 1 with (1), η = 1/2, and β_n = λ^n satisfies R(T)/T ≤ C/√T. This implies that Algorithm 1 based on Adam achieves an O(1/√T) convergence rate, which could not be obtained using the analysis in [8].

E. DISCUSSION
First of all, we would like to emphasize that regret minimization does not always lead to solutions of problem (5) (see also [14]); a worked one-dimensional example is given at the end of this subsection. This is because, even if (x_t)_{t=1}^T satisfies the regret bounds (6), (7), (8), and (9), so that R(T)/T ≈ 0 for a sufficiently large number T, a small average regret places no constraint on the final output x_T. Accordingly, from only (6), (7), (8), and (9), we cannot evaluate whether or not the output x_T generated by the existing adaptive-learning-rate optimization algorithms approximates the solution of problem (5). Meanwhile, Proposition III.2 and Subsections IV-A, IV-B, IV-C, and IV-D lead to the finding that Algorithm 1, including Adam, AMSGrad, GWDC, and AMSGWDC, ensures (10) under η ∈ (1/2, 1] and (11) under η ∈ [1/2, 1). The results in (10) and (11) imply that Algorithm 1 can solve problem (5), in contrast to (6), (7), (8), and (9), which are results for regret minimization. Moreover, we would like to emphasize that Algorithm 1 with a constant learning rate can approximate the solution of problem (5) in the sense of the result in Proposition III.1. This result indicates that using small constant learning rates would be a good way to solve problem (5). Proposition III.2 shows that any accumulation point of x̄_n := (1/n) Σ_{k=1}^n x_k belongs to argmin_{x∈X} Σ_{t=1}^T f_t(x). Accordingly, using a diminishing learning rate is, in theory, a more robust way to solve problem (5) than using a constant learning rate, as shown in [11], [14], [15]. However, Algorithm 1 with a diminishing learning rate might not work for a rather large number N of iterations, since step 7 in Algorithm 1 with α_N ≈ 0 satisfies x_{N+1} = P_{X,H_N}(x_N + α_N d_N) ≈ x_N, i.e., the iterates essentially stop moving. This implies that using a diminishing learning rate would not be good in practice. The next section shows that using a constant learning rate tends to be superior to using a diminishing one. This tendency was also observed in [11], [14], [15].
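The following one-dimensional example, ours rather than the paper's, makes the first point concrete: the average regret can vanish even though the final iterate is far from optimal. Take

\[
f_t(x) = x^2 \quad (t = 1, \dots, T), \qquad x_t = \frac{1}{\sqrt{t}} \ (t < T), \qquad x_T = 1.
\]

Then

\[
\frac{R(T)}{T} = \frac{1}{T}\Biggl(\sum_{t=1}^{T-1} \frac{1}{t} + 1\Biggr) \le \frac{\ln T + 2}{T} \longrightarrow 0 \quad (T \to \infty),
\]

yet f_T(x_T) = 1, whereas min_x f_T(x) = 0. Hence, a regret bound of the form (6)-(9) by itself says nothing about the quality of the output x_T.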

V. NUMERICAL EXPERIMENTS
We examined the behavior of Algorithm 1 with different learning rates. The adaptive-learning-rate optimization algorithms used in the experiments, with δ = 0.999 [8], [9], were as follows, where the initial points were set automatically by PyTorch and l_n and u_n were based on [16, Section 4].
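To make the constant and diminishing setups concrete, the following PyTorch sketch wires up one optimizer of each kind; the base learning rate of 1e-3 and the single linear layer are placeholder assumptions, not the paper's exact settings.

import math
import torch

model = torch.nn.Linear(28 * 28, 10)

# Constant learning rate (the "-C" variants): the rate stays fixed for all n.
opt_const = torch.optim.Adam(model.parameters(), lr=1e-3, betas=(0.9, 0.999))

# Diminishing learning rate (the "-D" variants): alpha_n = alpha_0 / sqrt(n + 1).
opt_dim = torch.optim.Adam(model.parameters(), lr=1e-3, betas=(0.9, 0.999))
scheduler = torch.optim.lr_scheduler.LambdaLR(
    opt_dim, lr_lambda=lambda n: 1.0 / math.sqrt(n + 1))

# In the training loop, call opt_dim.step() and then scheduler.step()
# so that the learning rate decays as the iteration count n grows.

Here the second moving-average parameter is set to δ = 0.999 to match the experiments, while the momentum parameter 0.9 is the PyTorch default.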
The experiments used a fast scalar computation server at Meiji University. The environment comprises two Intel Xeon Gold 6148 (2.4 GHz, 20 cores) CPUs, an NVIDIA Tesla V100 (16 GB, 900 GB/s) GPU, and the Red Hat Enterprise Linux 7.6 operating system. The experimental code was written in Python 3.8.2 with the NumPy 1.17.3 and PyTorch 1.3.0 packages.

A. FEEDFORWARD NEURAL NETWORK MODELS
First, we experimented with two feedforward neural network models for image classification: a single-layer perceptron and a double-layer perceptron, which differ in their hidden layers. We used the MNIST dataset, a multi-class dataset of handwritten digits (0-9) collected by the National Institute of Standards and Technology (NIST); the data were gathered from Census Bureau employees and high-school students. The training set contains 60,000 grayscale images (28 × 28), and the test set contains 10,000 grayscale images (28 × 28). Figures 1 and 2 indicate that Algorithm 1 with a constant learning rate tended to perform better than Algorithm 1 with a diminishing one in terms of training and test accuracy. Moreover, Figure 2 shows that ADAM-C1 and ADAM-C2 minimized the training loss function faster than the other algorithms. Meanwhile, Figure 4 shows that GWDC-C1, AMSGWDC-C1, and AMSGWDC-C2 minimized the test loss function faster than the other algorithms. Figures 4 and 8 also show that the algorithm with a constant learning rate outperformed the one with a diminishing rate.

1) Single-layer perceptron
2) Double-layer perceptron
Figures 9 and 11 and Figures 13 and 15 indicate that Algorithm 1 with a constant learning rate tended to perform better than Algorithm 1 with a diminishing one in terms of training and test accuracy. Figure 10 shows that AMSG-C1 and AMSG-C2 minimized the training loss function faster than the other algorithms, while Figure 12 shows that GWDC-C1, GWDC-C2, and AMSGWDC-C2 performed well. We can see that GWDC and AMSGWDC with constant learning rates were superior at training the neural network. Figures 12 and 16 also show that the algorithm with a constant learning rate outperformed the one with a diminishing rate.
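For reference, the two feedforward models can be written compactly in PyTorch as below; the hidden width of 128 in the double-layer perceptron is our assumption, since the exact layer sizes are not reproduced in this excerpt.

import torch.nn as nn

# Single-layer perceptron: the 28 x 28 MNIST image maps directly to 10 classes.
single = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))

# Double-layer perceptron: one hidden layer between input and output.
double = nn.Sequential(
    nn.Flatten(),
    nn.Linear(28 * 28, 128),
    nn.ReLU(),
    nn.Linear(128, 10),
)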

B. CONVOLUTIONAL NEURAL NETWORK MODELS
Next, we experimented with a dense convolutional network (DenseNet) and a residual network (ResNet), both of which are relatively deep models based on convolutional neural networks (CNNs), for image classification. We used the CIFAR-10 dataset, which is a benchmark for image classification. The dataset is a collection of color images collected by Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton; it includes ten classes. Figures 29 and 31 indicate that Algorithm 1 with a constant learning rate outperformed Algorithm 1 with a diminishing one in terms of training and test accuracy. In particular, ADAM-D1 and AMSG-D1 did not work and had low accuracies. Figure 28 shows that AMSG-C2, AMSG-C3, and GWDC-C1 minimized the training loss function, while Figure 30 shows that ADAM-D2, ADAM-D3, AMSG-D2, and AMSG-D3 minimized the test loss function faster than the other algorithms, such as GWDC-Di and AMSGWDC-Di (i = 1, 2, 3). Meanwhile, Figure 32 shows that all algorithms except ADAM-D1 and AMSG-D1 minimized the test loss function, and Figures 28 and 32 show that the algorithm with a constant learning rate performed better than the one with a diminishing rate.
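A typical way to reproduce this setup is via torchvision, as sketched below; the specific DenseNet and ResNet depths (densenet121, resnet18) and the batch size are illustrative choices on our part, since the exact variants are not specified in this excerpt.

import torch
import torchvision
import torchvision.transforms as transforms

# CIFAR-10: ten classes of 32 x 32 color images.
train_set = torchvision.datasets.CIFAR10(
    root="./data", train=True, download=True,
    transform=transforms.ToTensor())
train_loader = torch.utils.data.DataLoader(
    train_set, batch_size=128, shuffle=True)

# Off-the-shelf CNN backbones adjusted to ten output classes.
densenet = torchvision.models.densenet121(num_classes=10)
resnet = torchvision.models.resnet18(num_classes=10)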

C. UNCONSTRAINED OPTIMIZATION USING BENCHMARK FUNCTIONS
We performed optimizations of the seven unimodal benchmark functions in Table 3 [12, Table 2] and compared Algorithm 1 with the following heuristic intelligent optimization methods:
• GA: genetic algorithm, where the population size was 100, the probability of performing crossover was 0.95, the probability of mutation was 0.025, and the maximum iteration number was 500.
• BBO: biogeography-based optimization algorithm, where the population size was 100, the mutation probability was 0.01, the number of elites was 2, and the maximum iteration number was 500.
• ASO [19]: atom search optimization algorithm, where the population size was 100, the depth weight was 50, the multiplier weight was 0.2, and the maximum iteration number was 500.
• TLBO [20]: teaching-learning-based optimization algorithm, where the population size was 100 and the maximum iteration number was 500. A minimal sketch of TLBO is given after this list.
The stopping condition for all the algorithms was n = 500.
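Since TLBO turned out to be one of the strongest performers, we include a minimal NumPy sketch of it; this is our simplified reading of the teacher and learner phases in [20], with the population size and iteration budget matching the settings above and everything else (function name, clipping to a box) assumed for the example.

import numpy as np

def tlbo(f, lo, hi, dim, pop_size=100, max_iter=500, seed=0):
    # Minimal TLBO sketch for box-constrained minimization of f.
    rng = np.random.default_rng(seed)
    X = rng.uniform(lo, hi, size=(pop_size, dim))
    fit = np.apply_along_axis(f, 1, X)
    for _ in range(max_iter):
        teacher = X[np.argmin(fit)]      # best learner acts as the teacher
        mean = X.mean(axis=0)
        for i in range(pop_size):
            # Teacher phase: move toward the teacher, away from the class mean.
            TF = rng.integers(1, 3)      # teaching factor, 1 or 2
            cand = np.clip(X[i] + rng.random(dim) * (teacher - TF * mean), lo, hi)
            fc = f(cand)
            if fc < fit[i]:
                X[i], fit[i] = cand, fc
            # Learner phase: move toward a better random peer, away from a worse one.
            j = int(rng.integers(pop_size - 1))
            j += j >= i                  # ensures j != i
            sign = 1.0 if fit[i] < fit[j] else -1.0
            cand = np.clip(X[i] + sign * rng.random(dim) * (X[i] - X[j]), lo, hi)
            fc = f(cand)
            if fc < fit[i]:
                X[i], fit[i] = cand, fc
    return X[np.argmin(fit)], float(fit.min())

# Example: minimize the sphere function, a standard unimodal benchmark.
# best_x, best_f = tlbo(lambda x: float(np.sum(x * x)), -100.0, 100.0, dim=30)

A notable design feature of TLBO is that, apart from the population size and iteration budget, it has no algorithm-specific tuning parameters, which makes it an easy baseline to run fairly.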

VI. CONCLUSION
This paper presented a unification of the existing adaptive-learning-rate optimization algorithms for nonconvex stochastic optimization in deep neural networks. It also presented two convergence analyses of the algorithm. The first analysis showed that the algorithm approximates a stationary point of the problem when it uses a constant learning rate. The second analysis showed that the algorithm converges to a stationary point of the problem when it uses a diminishing learning rate. The advantage of the proposed convergence analyses over the existing ones is that they show that the existing adaptive-learning-rate optimization algorithms can be applied to nonconvex stochastic optimization in deep neural networks. The experiments provided support for the convergence analyses. In particular, the numerical results for training neural networks showed that the algorithm with a constant learning rate performed better than the one with a diminishing rate, as promised by the convergence analyses. Moreover, the numerical results for minimizing benchmark functions showed that the proposed algorithm and TLBO performed well.

VII. ACKNOWLEDGMENT
We are sincerely grateful to the Associate Editor, Nishant Unnikrishnan, and the two anonymous reviewers for helping us improve the original manuscript.