Parallel Simultaneous Perturbation Optimization

Stochastic computer simulations enable users to gain new insights into complex physical systems. Optimization is a common problem in this context: users seek to find model inputs that maximize the expected value of an objective function. The objective function, however, is time-intensive to evaluate, and cannot be directly measured. Instead, the stochastic nature of the model means that individual realizations are corrupted by noise. More formally, we consider the problem of optimizing the expected value of an expensive black-box function with continuously-differentiable mean, from which observations are corrupted by Gaussian noise. We present Parallel Simultaneous Perturbation Optimization (PSPO), which extends a well-known stochastic optimization algorithm, simultaneous perturbation stochastic approximation, in several important ways. Our modifications allow the algorithm to fully take advantage of parallel computing resources, like high-performance cloud computing. The resulting PSPO algorithm takes fewer time-consuming iterations to converge, automatically chooses the step size, and can vary the error tolerance by step. Theoretical results are supported by a numerical example. To demonstrate the performance of the algorithm, we implemented the algorithm to maximize the pseudo-likelihood of a stochastic epidemiological model to data of a measles outbreak.


I. INTRODUCTION
Stochastic optimization is of core practical importance in many fields of science and engineering. For instance, epidemiological systems involve individuals subject to exogenous disturbances, which emphasizes the need for models and methods capable of dealing with stochasticity. While many optimization methodologies exist for deterministic systems, these algorithms can give misleading results when applied to stochastic systems. Closed-form solutions do not generally exist for stochastic optimization problems, thus we seek an iterative algorithm which guarantees convergence to the local optimal point. A key property of convex optimization problems is that the local optimum is the same as global optimum. The algorithm proposed here guarantees convergence to a local optimal point, which in the case of convexity, this local optimum is equal to the unique global optimum.
Many approximation algorithms have been developed to solve a wide variety of problems. The steepest descent method [1] is the most prominent iterative method for optimizing a complex objective function. In this method, the user provides a guess value from which a series of steps proceed until convergence criteria are satisfied. On each step, the state is moved in the direction that the function decreases most rapidly, which is the direction of the negative gradient. The main problem of the steepest descent methods is that the steps are often along directions that undo previous progress. Conjugate gradient algorithms have the desirable property of finding the best solution at every step. Conjugate gradient for nonlinear problems was developed by Fletcher and Reeves [2], based on the work of Davidon [3], Fletcher and Powell [4]. The nonlinear conjugate gradient algorithm, which is an extension of the linear conjugate gradient algorithm, is suitable for optimizing a general nonlinear function. In conjugate gradient algorithm, the update direction is defined as a linear combination of the gradient estimate at the current iteration and the update direction from the previous update direction [5]. A simple form of this smoothing technique has been applied on the gradient estimate [6], [7], and later, it has been shown that this smoothing technique can often offer improved performance [8].
The gradient-based algorithms, such as Robbins-Monro [9], Newton-Raphson [10], and neural network backpropagation [11], rely on direct measurements of the gradient of the objective function with respect to the optimization parameter. In this paper, we consider a particular type of optimization problem called gradient-free, referring to the case wherein the gradient of the loss function is not available. This is a common occurrence, for example, in complex systems, such as the optimization problem given in [12], the exact functional relationship between the loss function value and the parameters is not known, and the loss function is evaluated by measurements on the system or by running simulation. In such cases, we can use an approximation method instead, such as Kiefer-Wolfowitz types of algorithms, to estimate the gradient. The original Kiefer-Wolfowitz algorithm [13] is a gradient search algorithm that aimed at finding the maximum of a regression function using gradient estimates computed from finite differences. After 40 years, Spall proposed a stochastic approximation scheme for optimization that does a random search in the parameter space and only requires two function evaluations per iteration, regardless of the parameter dimension [14]. Spall's approach involves perturbing the state space parameter randomly using specific i.i.d. random variables. Later, Spall presented the secondorder modification of his algorithm, called SPSA [15]. The convergence of this algorithm to the optimal value in the stochastic almost sure sense in fewer iterations than other finite difference approximation methods makes it suitable in many applications. Other researchers have looked at ways of enhancing the convergence of the SPSA algorithm, e.g. iterate averaging is an approach aimed at achieving higher convergence rate in a stochastic setting [16]. The SPSA algorithm is used extensively in many different areas, e.g. signal timing for traffic control [17], [18], and some large scale machine learning problems [19]. A variant of SPSA in which the parameter perturbations are based on deterministic, instead of random, perturbations was considered in [20]. While SPSA is straightforward to implement, its performance depends on a set of parameters that need to be properly set. For example, depending on the choice of the step size for updating, the algorithm may be very slow or even diverge. Furthermore, the convergence rate of the SPSA is low in some applications in the literature [21]. In these cases, considering the Kesten's rule for choosing adaptive step sizes could be advantageous [22].
The stochastic nature of some complex models, e.g. epidemiological models, means that each set of parameters maps to a distribution of outcomes, from which each sample (model run) can take several hours to obtain. In the quickto-evaluate deterministic model setting, almost any classical optimization algorithm, such as steepest descent or Newton-Raphson, can be used. However, care must be taken when applying deterministic methods to stochastic objective functions, as the inherent noise causes unexpected behavior. In this paper, we introduce PSPO, an algorithm for optimization of stochastic objective functions. Despite many advantageous properties of the SPSA algorithm, it is a serial algorithm that evaluates the (stochastic) function a few points at a time. The PSPO algorithm we employ here, are thus needed to take advantage of fundamentally-parallel resources like high-performance cloud-based computing. Our method is appropriate for problems with very noisy gradients. We also derive a relationship between the number of parallel rounds of computation and error tolerance of the gradient for each iteration. We demonstrate that PSPO works well in practice and compares favorably to the conventional simultaneous perturbation optimization algorithm (SPSA). The rest of the paper is structured as follows. In Section II, we give a review of the simultaneous perturbation optimization algorithm. The Parallel Simultaneous Perturbation Optimization (PSPO) algorithm is presented in Section III. The numerical simulations are given in Section IV, and Section V concludes the paper.

II. PRELIMINARIES
Consider the problem of minimizing a cost function L(θ) : [14], [15], [23] presented an efficient stochastic algorithm called Simultaneous Perturbation Stochastic Approximation (SPSA). The SPSA algorithm estimates the gradient and the Hessian matrix by finite difference. This algorithm basically consists of two parallel recursions for estimating the optimization parameter, θ, and the Hessian matrix, H(θ). The first recursion is a stochastic equivalence of the Newton-Raphson algorithm, and the second one estimates the Hessian matrix. These two recursions are where a k is a positive scalar factor, P denotes the set of all positive definite matrices, and Π P (·) is the projection into the admissible set P. Here,ĝ andĤ k are the estimated gradient and Hessian at iteration k. The SPSA approach for estimating theĝ k (θ) andĤ k (θ) follows. Let ∆ k ∈ R p be vectors of p mutually independent zeromean random variables satisfying the condition of E{∆ −1 k } be bounded. An admissible distribution is a Bernoulli ±1 distribution. The positive scalars c k are chosen such that they usually get smaller as k gets larger. The two-sided estimate of the gradient at iteration k is given by: where y (±) k are the noisy measurements of the cost function at θ k ± c k ∆ k . Note that in (2), ∆ −1 k is the element-wise inverse of ∆ k . In the case of second order SPSA, it is suggested to use a one-sided gradient approximation, given by:ĝ Now let∆ k ∈ R p be vectors of p mutually independent zero-mean random variables satisfying the same condition of ∆ k . The positive scalarsc k are also chosen such that they get smaller as k gets larger. The numerical value of c k is suggested to be chosen smaller than c k . The following formula gives a per-iteration estimate of the Hessian matrix. where The practical implementation details of the algorithm is presented in [15].

III. PARALLEL SIMULTANEOUS PERTURBATION
OPTIMIZATION: PSPO While SPSA is an efficient optimization algorithm for high dimensional problems, it requires many iterations to converge, particularly for high-noise problems. In the case of high uncertain problems, the computation of the gradient and Hessian from noisy objective function evaluations benefits from more evaluations. The idea here is using parallel computing of the gradient and Hessian with different perturbation vectors, and use the estimated gradient and Hessian in a conjugate gradient type optimization algorithm.
The gradient vector at each iteration can be obtained by performing multiple evaluations of (3) for different values of perturbation. Let g be the gradient at a point θ, and g i represents the directional gradient in ∆ i direction at this point. Then we know that: Letĝ i be the estimated gradient using the one-sided gradient approximation (3) and perturbation vector ∆ i . If ∆ i is a Bernoulli ±1 vector, we havê and then,ĝ where, If {∆ i } 1≤i≤M span R p , then ∆∆ T is invertible. So, the least square estimation ofĝ is given by: On the other hand, if M < p, then (6) is an underdetermined system which has non-unique solutions. Then, the solution of the minimum Euclidean norm, ĝ 2 , among all solutions is given by: The Parallel Simultaneous Perturbation (PSP) algorithm for estimating the gradient at a given point is given in Algorithm 1. In order to improve the efficiency of this algorithm, we can compute L(θ) once, out of the loop. Doing this, we need M + 1 function evaluations for estimating the gradient.
In Algorithm 1, we suggest an efficient strategy to choose independent ∆ vectors. First, we sample ∆ 0 from a Bernoulli ±1 distribution. In round j, 1 ≤ j ≤ p, we need to switch the sign of the jth element in ∆ 0 to generate ∆ j . The same procedure repeats for the next p rounds of computations. Lemma 1. Given the suggested procedure in Algorithm 1 for choosing ∆ i vectors, the vectors {∆ j } 1≤j≤p spans R p for all p = 2.
Since a i = ±1, ∆ (3) is full rank. Now, assume we know that rank(∆ (k) ) = k. We need to prove that rank(∆ (k+1) ) = k + 1. From the procedure in Algorithm 1, we have Since adding a row to a matrix does not change the rank of that matrix, it is clear that Now, we need to prove that ∆ k+1 (the last column of ∆ (k+1) ) is not in the span of the columns of the matrix ∆ (k) . Let c i represent the columns of ∆ (k) . If ∆ k+1 is not independent of the columns, c i , then there should exists a nonzero vector, say α = 0, such that It is easy to show that there is not any real valued α which satisfies these conditions. Thus, ∆ k+1 is independent of the columns of ∆ (k+1) , and adding the ∆ k+1 column, which generates ∆ (k+1) , increases the rank of the matrix by one. Therefore, rank ∆ (k+1) = k + 1, which proves the lemma by induction.
Lemma 2. Given the suggested procedure for choosing ∆ vectors, if ∆ 0 = 1, where 1 is a vector that all its elements are one, then for all p = 2 and M ≥ 4, then Tr(∆∆ T ) −1 ≤ p 4 and Tr(∆ T ∆) −1 ≤ p 4 . Proof. Using ∆ 0 = 1, we have ∆ i = 1 − 2e i . Since (∆∆ T ) is a symmetric full rank matrix, we can write it as its eigenvalues-decomposition, given by: and v i is the eigenvector corresponding to eigenvalue λ i . Then, We know Since M − 4 ≥ 0 and 11 T 0, then (M − 4)11 T + 4I ∆∆ T −1 4 ∆∆ T −1 . Therefore, Proving Tr(∆ T ∆) −1 ≤ p 4 is very similar to the above proof. Due to this similarity, the proof is skipped here. Now let y = L(θ) and y i = L(θ+c∆ i ) represent the noisy measurements at θ and θ + c∆ i respectively. Assume these measurements can be expressed as normal random variables as y ∼ N (µ, σ 2 ) and y i ∼ N (µ i , σ 2 ). It is assumed that the noise variance is constant over the whole state space, θ. Lemma 3. Given y ∼ N (µ, σ 2 ) and y i ∼ N (µ i , σ 2 ), if M ≥ p, then the expected value of the gradient from Algorithm 1 is equal to the gradient g.
Proof. Based on our assumption, For small c, we have If M ≥ p, using (7), we have Theorem 1. If the rounds of computation, M , satisfies then the error in estimated gradient from Algorithm 1 is bounded by Proof. Using the assumptions of y ∼ N (µ, σ 2 ) and y i ∼ N (µ i , σ 2 ), the function difference δf i (θ) can be expressed as: Using (7) for estimating the gradient and Lemma 3, the error in gradient, (ĝ − g), is a normal random variable given by: After M rounds of computation ofĝ, Chebyshev inequality implies that Using the results of Lemma 2, we have Algorithm 2 presents a step-by-step summary of the proposed approach in this paper. This algorithm is a conjugate gradient type algorithm which uses the PSP algorithm to estimate the gradient and Hessian at each iteration. The stopping criterion in this algorithm can be defined by the size of the gradient at the final point, the update size, or even the number of iterations.

Remarks
• PSPO does not need any determination of the step sizes. Instead, the step size is computed in a conjugate gradient setup. • Unlike conventional SPSA, the PSPO algorithm gives the chance to set the error tolerance for each step. One is able to start a coarse search at first iterations and decrease the error tolerance as the algorithm gets closer to the optimal point (by increasing the number of the rounds of computation). • The perturbation vector for computing Hessian (∆ in SPSA andd in PSPO) is not randomly chosen. Instead, the perturbation vector is equal to the update direction at each step. In the gradient descent algorithms, the Hessian matrix scales the update vector based on the curvature of the objective function for each feature. The idea, here, is to estimate the curvature of the objective function for each feature at each iteration, but only in the direction of the update in that iteration. In fact, we only need to estimate the effect of the Hessian matrix in the update direction, and we are not interested on the effect of the Hessian matrix in other directions. Therefore, the perturbation direction is set to the update direction. We only need to make sure that the elements of the perturbation vector be non-zero. So, ε is added to each zero element of the update vector.

IV. SIMULATION
To demonstrate the main contribution of this paper, we implemented the methodology on a variety of examples. We begin with a simple three-dimensional quadratic function to clearly illustrate the efficiency of the algorithm in the case we know the exact optimal solution.
The basic concept of the algorithm can be seen more clearly on a simple toy problem. The objective function, here, is where, w is a Gaussian noise w ∼ N (0, 0.02 2 ). The result of the PSPO optimization algorithm from different initial conditions is plotted in Fig. 1. The optimal point of this function is known and we can easily test our algorithm.  To empirically evaluate the proposed method, we investigated the efficiency of PSPO and SPSA algorithms on a stochastic optimization problem. We use the same initialization condition when comparing two optimization algorithms. We evaluate our proposed method on the following highly noisy objective function where, w is a Gaussian noise w ∼ N (0, 3 2 ). This convex objective function is suitable for comparison of different optimizers without worrying about local minimum issues.
Each algorithm run 200 times for different initial conditions. According to Fig. 2, we found that the PSPO algorithm converges faster than SPSA in average. Note that the maximum number of iteration set to 100 for both algorithms. In Fig. 3, the progress of two optimization algorithms lowering the cost is shown for a fixed initial condition. As it can be seen in this figure, even though the initial cost of the PSPO is larger than the SPSA (due to noise in objective function), but the PSPO converges considerably faster than the conventional SPSA for this example. The efficiency of the two algorithms is also tested through running the algorithms for a fixed initial condition many times. Fig. 4 shows the result.

V. CONCLUSIONS
In many stochastic optimization problems that noise presents, obtaining an analytical solution is hardly possible. In this paper, we described an algorithm for optimal parameter estimation in discretely observed stochastic model. We have introduced PSPO, which is a simple and computationally efficient algorithm for gradient-based optimization of stochastic objective functions. Our method is aimed towards highly uncertain objective functions. We found the minimum number of parallel rounds of computation in order to have a bounded error in the estimated gradient.