Directed particle swarm optimization with Gaussian-process-based function forecasting

Particle swarm optimization (PSO) is an iterative search method that moves a set of candidate solutions around a search space towards the best known global and local solutions with randomized step lengths. PSO frequently accelerates optimization in practical applications where gradients are not available and function evaluations are expensive. Yet the traditional PSO algorithm ignores the knowledge of the objective function that could be gained from the observations made by individual particles. Hence, we draw upon concepts from Bayesian optimization and introduce a stochastic surrogate model of the objective function. That is, we fit a Gaussian process to past evaluations of the objective function, forecast its shape, and then adapt the particle movements accordingly. Our computational experiments demonstrate that baseline implementations of PSO (i.e., SPSO2011) are outperformed. Furthermore, compared to state-of-the-art surrogate-assisted evolutionary algorithms, we achieve substantial performance improvements on several popular benchmark functions. Overall, we find that our algorithm attains desirable properties for exploratory and exploitative behavior.


Introduction
Stochastic optimization methods incorporate random variables into a search process (Gentle et al., 2012, Chapter 7) and often improve performance in a large variety of practical settings (Hoos & Stützle, 2005). They are frequently utilized in black-box optimization settings, in particular for derivative-free optimization (Pham & Castellani, 2014; Rios & Sahinidis, 2013). Here, no assumptions regarding the analytic form of the objective function are made and, consequently, gradients are unavailable. That is, one can only query a function f at single points x, for which the corresponding evaluation f(x) is then returned. Such problems are prevalent in numerous applications from engineering, medicine, and economics, among others, where the underlying function is computationally or economically expensive to evaluate (Rios & Sahinidis, 2013).
A prominent example of stochastic search methods is particle swarm optimization (PSO), which presents the focus of this paper. The idea behind PSO stems from population-based simulations of animal behavior (Kennedy & Eberhart, 1995). As such, a set of candidate solutions (i.e., particles) is initialized at random positions, which are then moved around a search space. The particles proceed towards a mixture of the best-known swarm position and the best location from their own past trajectories. Several variants of the original PSO algorithm have been developed (e.g., SPSO2011; Zambrano-Bigiarini et al., 2013), which we summarize in our review section. For detailed surveys of PSO variants, we refer to Poli et al. (2007) and Bonyadi & Michalewicz (2017).
PSO is straightforward to implement and makes few assumptions about the underlying optimization problem. In particular, the optimization is derivative-free and can also adapt to non-convex problems. As a result, it is used for multi-objective optimization (e.g., Liu et al., 2017; Zouache et al., 2018; Yu et al., 2017; Sethanan & Neungmatcha, 2016), as well as in various areas of application, such as energy, supply chain management (Hong et al., 2018), and operations research in general (Etgar et al., 2017; Tasgetiren et al., 2007). PSO can further be extended to constrained optimization problems (Bonyadi & Michalewicz, 2017), as well as integer programming (Laskari et al., 2002). Convergence guarantees can be made for a wide range of settings (Jiang et al., 2007).
The information actually utilized by PSO and its variants is limited: each particle is merely controlled by its personal best position, as well as the best of a selection of other particles.
In other words, no knowledge regarding the past trajectories in the search space is retained and all corresponding function evaluations are "lost". As a result, particles might re-visit a neighborhood that has previously been explored, thereby leading to potentially redundant function calls (Iqbal & de Oca, 2006). On the other hand, all particle movements are guided towards the best known positions. This strategy could miss a global optimum that is hidden in an unexplored region (cf. Pham & Castellani, 2014) and may lead to convergence to a non-optimum (van den Bergh & Engelbrecht, 2002).
This work aims at improving the search strategy that controls the movements of the particle swarm. We thus propose to combine the PSO methodology with a stochastic surrogate model of the objective function that informs the swarm intelligence. This allows us to estimate the surface of the objective function (i.e., the landscape) from the past search trajectory of the particles. Based on this surrogate model, we can then enhance particle movements in the search process with respect to exploratory and exploitative behavior. More specifically, we develop a variant (A) in which we modify the movements of all particles by incorporating a third direction pointing to the global optimum of the surrogate. Further variants re-locate individual particles in order to let them (B) directly exploit global optima suggested by the surrogate model and (C) explore regions with high uncertainty in the surrogate.
Mathematically, we draw upon Gaussian processes (GPs) as the stochastic model for the surrogate. This choice is motivated by common procedures in Bayesian optimization (Snoek et al., 2012) and response surface methods via kriging (Rios & Sahinidis, 2013). In addition, it provides rigorous uncertainty estimates for the function approximation, thereby facilitating exploration of locations that promise the highest probability of improvement in the surrogate model. Our function approximation thus reveals similarities to Bayesian optimization, where one also formulates a stochastic model over the function space (Močkus, 1975). However, one then chooses a sequence of points in order to improve the overall objective, as if a single particle jumps through the search space (see Online Appendix C for details). Different from PSO, these movements are deterministic and can thus not benefit from randomization in the optimization method.
We finally perform a series of computational experiments in order to demonstrate the improved convergence of our proposed algorithm. For this purpose, we adhere to earlier works that develop modifications to particle swarm optimization, that is, we use the SPSO2011 implementation (Zambrano-Bigiarini et al., 2013). We further follow the literature with regard to the evaluation setting, that is, we use the CEC2013 suite of benchmark functions (Liang et al., 2013;Zambrano-Bigiarini et al., 2013) that is common for evaluating PSO approaches.
Our algorithm can outperform SPSO2011 for the entire set of benchmark functions. We find considerable improvements in the convergence in early stages of the iterative search. Statistical tests further demonstrate that the improvements are significant at common significance levels.

Our work entails multiple contributions to the field of stochastic optimization. That is, we present an innovative combination of swarm search and machine learning tools (i.e., Gaussian processes). Thereby, we contribute to the state-of-the-art by suggesting a statistical procedure to better leverage search trajectories in PSO. This introduces strategic control into the otherwise randomized search process. The resulting performance improvements stem from a better trade-off between exploration and exploitation in the stochastic swarm search.
This may prove useful in a wide variety of real-world optimization settings, where one utilizes PSO.
The rest of this paper is structured as follows. Section 2 reviews the mathematical specification of particle swarm optimization. Based on it, Section 3 proposes our algorithm which combines PSO with a Gaussian-process-based surrogate model. Section 4 then performs a series of computational experiments that demonstrate the acceleration in convergence of our proposed algorithms. Section 5 discusses our algorithm, while Section 6 finally concludes.

Background
Mathematically, we define the black-box optimization problem as follows. Let f : D → R denote an arbitrary (and potentially non-convex) function with a known (closed) domain D.
Then our goal is to find x* ∈ arg min_{x ∈ D} f(x), where we assume that D ⊆ R^p is connected and f is continuous. In many applications, the underlying function is expensive to evaluate and, in these cases, we might prefer to terminate the search process after a certain number of iterations or when the relative convergence fulfills predefined criteria. The last iteration of the algorithm can then be assessed via the expectation of a fixed loss function, for example, the mean squared loss in Equation (1).

Particle swarm optimization
Black-box optimization is often approached by particle swarm optimization. In the following, we provide a brief specification of particle swarm optimization (cf. Bonyadi & Michalewicz, 2017, for a detailed overview). PSO stems from the simulation of socially-coordinated behavior of animal swarms (Kennedy & Eberhart, 1995). It relies on a swarm of particles j = 1, ..., n_par serving as candidate solutions. These particles move through the domain of the function in a coordinated fashion, where each particle is guided by individual experience, as well as observations of other particles. For this reason, each particle carries the following information at iteration i:
• x_j^(i) is the position of particle j,
• v_j^(i) is its velocity, and
• p_j^(i) is its personal best position up to iteration i.
During the initialization phase, the particle positions x_j^(0) are distributed across D. This initialization may be uniform at random, or according to some other initialization scheme (e.g., a Sobol sequence). The initial velocities v_j^(0) are also chosen at random according to a predefined initialization procedure.
In subsequent iterations, the particles move around the domain D, updating their respective velocity based on the personal best position p_j^(i) and on the swarm intelligence. In the original PSO, the latter component is set via the global best-known position g^(i), while some PSO variants only consider a certain neighborhood N_j of particles, depending on the particle. This yields the position update

  x_j^(i+1) = x_j^(i) + v_j^(i+1)    (4)

once the new velocity v_j^(i+1) has been computed. Accordingly, the velocity update is key in controlling the movements of the swarm in space.
In the original PSO (OPSO), the velocity update is given by

  v_j^(i+1) = v_j^(i) + φ_p R_p,j ⊙ (p_j^(i) − x_j^(i)) + φ_g R_g,j ⊙ (g^(i) − x_j^(i)),

where R_p,j and R_g,j are random vectors with components drawn from a uniform distribution U(0, 1), and ⊙ denotes component-wise multiplication. This yields a linear combination for computing the new velocity with pre-defined coefficients φ_p and φ_g that determine the relative influence of the personal and global best directions. The vectors p_j^(i) − x_j^(i) and g^(i) − x_j^(i) are referred to as the cognitive influence (CI) and social influence (SI), respectively.
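As a minimal sketch of this update rule, the OPSO step for a single particle can be written as follows; the default values of φ_p and φ_g are illustrative assumptions, not prescribed by the text:

```python
import numpy as np

def opso_step(x, v, p_best, g_best, phi_p=2.0, phi_g=2.0, rng=None):
    """One OPSO update for a single particle (minimization setting).

    x, v   : current position and velocity, arrays of shape (d,)
    p_best : personal best position (source of the cognitive influence)
    g_best : global best position (source of the social influence)
    """
    rng = rng or np.random.default_rng()
    r_p = rng.uniform(size=x.shape)  # R_{p,j}: component-wise U(0,1)
    r_g = rng.uniform(size=x.shape)  # R_{g,j}: component-wise U(0,1)
    # Linear combination of old velocity, cognitive and social terms;
    # '*' acts as the component-wise (Hadamard) product.
    v_new = v + phi_p * r_p * (p_best - x) + phi_g * r_g * (g_best - x)
    return x + v_new, v_new
```

Note that when a particle sits exactly on both its personal and global best, the cognitive and social terms vanish and the particle keeps its previous velocity.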

Extension to SPSO2011
Various modifications to the above update rules have been proposed in the literature. For instance, PSO has been extended by adjustments to the method itself (e. g. Yin et al., 2010) or combinations with other optimization schemes (e. g. Fan & Zahara, 2007). For the sake of establishing a benchmark moving forward and focusing on reproducible insights, Bonyadi & Michalewicz (2017) refer to the following variant as the standard PSO (SPSO in short).
Another standard version of PSO was introduced by Zambrano-Bigiarini et al. (2013), which is called SPSO2011. Instead of mixing component-wise uniformly random shifts into the velocity vector, particles are accelerated according to a hyperspherical distribution. The velocity update is here given by

  v_j^(i+1) = ω v_j^(i) + x'_j − x_j^(i)  with  x'_j ∼ H(G_j, ‖G_j − x_j^(i)‖),

where H(x, r) refers to a hyperspherical distribution over the sphere with center x and radius r, and G_j denotes the center of gravity of the particle position and its attractors towards the personal and global best positions. SPSO2011 is regarded as state-of-the-art and should be used for benchmarking (Zambrano-Bigiarini et al., 2013). Hence, it is later used in our experiments.
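The hyperspherical acceleration can be sketched as below. The center-of-gravity construction and the constants w and c follow common descriptions of SPSO2011 and are assumptions of this sketch rather than values taken from the text:

```python
import numpy as np

def sample_in_sphere(center, radius, rng):
    """Draw a point uniformly from the ball H(center, radius)."""
    d = center.shape[0]
    direction = rng.normal(size=d)
    direction /= np.linalg.norm(direction)
    # The radius scaling u = U^(1/d) makes the sample uniform over the ball.
    u = rng.uniform() ** (1.0 / d)
    return center + radius * u * direction

def spso2011_step(x, v, p_best, g_best, w=0.721, c=1.193, rng=None):
    """SPSO2011-style update: accelerate towards a random point drawn from a
    hypersphere spanned by the particle, its personal best and the
    (neighborhood) best."""
    rng = rng or np.random.default_rng()
    # Attractors shifted from the current position towards the best positions.
    l_p = x + c * (p_best - x)
    l_g = x + c * (g_best - x)
    center = (x + l_p + l_g) / 3.0          # center of gravity G_j
    r = np.linalg.norm(center - x)          # radius ||G_j - x||
    x_prime = sample_in_sphere(center, r, rng)
    v_new = w * v + (x_prime - x)
    return x + v_new, v_new
```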
The update rules described above lead to convergence of the individual particles. This convergence behavior has been studied extensively (Bonyadi & Michalewicz, 2017). There is, however, no guarantee that the particles converge towards a global optimum. Other variants exist, such as modifications that replace the linear position update by arbitrary functions (van den Bergh & Engelbrecht, 2002), or which limit the neighborhood N in specific ways (e.g., Bratton & Kennedy, 2007). For a detailed review, we refer to Bonyadi & Michalewicz (2017).

Surrogate-assisted PSO
PSO can often require numerous function calls until convergence, which might not be feasible for problems with expensive function evaluations. Here, previous research has proposed to replace certain evaluations of the actual function f by an approximation f̂ (e.g., Bird & Li, 2010; Parno et al., 2012; Regis, 2014b; Sun et al., 2013). Such approaches are also referred to as response surface methods, surrogate models, or fitness functions. However, in the aforementioned stream of research, the approximation is learned from the function calls to f (i.e., surrogates are predicted based on past observations), while leaving the rest of the algorithm unchanged; the approximation is then merely used as a proxy for f, and no probabilistic interpretation further guides the search process. Hence, the only similarity with our approach is that these approaches also draw upon a function approximation, but for a completely different purpose.
Following the above rationale, surrogate-assisted PSO has been developed earlier (e.g., Sun et al., 2017; Yu et al., 2018; Wang et al., 2017). Here, algorithms directly model surrogates (e.g., sigmoid-like inertia weights in Modified PSO) or use function approximations as a proxy for f. In fact, (non-stochastic) function approximation has been shown to improve evolutionary algorithms, as the surrogate can be used to evaluate additional candidate solutions within a local neighborhood, while keeping the number of function calls to f unchanged (Regis, 2014a). Examples are surrogate-assisted variants of Gaussian PSO (e.g., Krohling, 2004; Melo & Watada, 2016; Varma et al., 2013; Barman et al., 2016; Liu et al., 2013; Gao et al., 2020), Bayesian PSO (e.g., Zhang et al., 2015; Chen & Yu, 2017; Kang et al., 2018), and modified PSO (e.g., Tian & Shi, 2018; Liu et al., 2015). However, surrogate-assisted algorithms are primarily used to speed up runtime (due to fewer evaluations of f) while exhibiting similar convergence characteristics. That is, the swarm movements are not directly adapted to the surrogate, whereas we use the surrogate to guide the swarm search with respect to exploration and exploitation.
Bird & Li (2010) compute a local regression of the function f around a certain point x in order to yield an interpolation f̂ within a close neighborhood and then move the worst particle to the best location of the local regression. This approach resembles our algorithm (B), but the specification in Bird & Li (2010) is restricted to a local neighborhood and can thus take only a small region into consideration. Conversely, our Gaussian process regression computes the stochastic function approximation from all particles and thus yields a global stochastic function approximation. Thereby, it is able to relocate particles not just locally but globally. In addition, Bird & Li (2010) merely address local exploitation but ignore our idea of further exploration.
Several attempts have been made to use surrogate models for directing a random search, yet outside of PSO. Liu et al. (2013) propose a surrogate-assisted genetic algorithm based on Gaussian processes (GPEME for short). Here, the surrogate is used for prescreening potential function evaluations, whereas our algorithm integrates the surrogate in a way that guides the search process (towards exploration-exploitation). There are further surrogate-assisted evolutionary algorithms for hierarchical swarm searches (Yu et al., 2018), committee-based active learning approaches (Wang et al., 2017), and co-operative optimization (Sun et al., 2017). However, different from the above, we develop a PSO variant on top of a Gaussian-process-based surrogate that leverages the uncertainty estimates in the GP, so that the swarm search is driven by both exploration and exploitation.

High-level description
Our previous literature review has shown that particle swarm optimization suffers from returning local instead of global optima, as well as ignoring valuable information about the underlying functional landscape. Accordingly, our goal is to extend the strengths of this stochastic, population-based search method for black-box optimization problems with a more efficient search strategy. For this reason, we propose a combination of the PSO mechanism with a stochastic surrogate model of the objective function, so that the swarm search can be directed strategically.
The stochastic surrogate model sheds light on the objective, and we can leverage its landscape in the search process of PSO. Thereby, we aim at addressing two problems of traditional PSO algorithms, namely the risk of finding local optima and redundant function evaluations, i.e., slow convergence in terms of evaluations. Our remedy to both is provided by learning a stochastic surrogate model: it can identify regions that have only barely been explored and where we thus face a high uncertainty regarding the landscape. On this basis, we can direct the swarm towards both exploration and exploitation.
Key to the strategic search is an approximation of the underlying objective function. This is formulated as a learning task in which a stochastic model of the underlying function (the surrogate model) is fitted. Here, we make use of ideas from the realm of Bayesian optimization to augment our particle swarm through a Gaussian-process-based surrogate model. The latter then guides the swarm towards regions that entail the optimum in our approximation of the objective, or where we have great uncertainty in our knowledge of the underlying function. The idea of a stochastic response surface is also the basis of Bayesian optimization methods; we give a review in Online Appendix C. We choose a Gaussian-process-based surrogate model for several reasons: (1) Gaussian processes are common for modeling uncertainty in multivariate spaces (cf. Ackermann et al., 2011; Buche et al., 2005). The underlying Gaussian distribution makes few assumptions on the actual curvature of the function. (2) Gaussian processes are fairly parsimonious (i.e., they have fewer model parameters compared to many other surrogate models such as, e.g., artificial neural networks or support vector machines) and are thus beneficial when the swarm search still has little information. (3) Gaussian distributions (which underlie GPs) have been found to be highly effective in state-of-the-art Bayesian optimization (cf. Snoek et al., 2012; Swersky et al., 2013).
In principle, the stochastic function model can facilitate swarm movements with regard to both exploration and exploitation, yet it is unclear whether it should influence only the search directions, or even the velocity governing the step sizes. In the following, we thus suggest different approaches to how we integrate the Gaussian-process-based surrogate model into the search process and adapt the corresponding swarm intelligence: (A) PSO with heuristic direction. Traditional implementations of particle swarm optimization move the walkers towards a mixture of personal and global best-known solutions. Conversely, we extend the velocity update and introduce a third direction that points towards the expected optimum from the approximation. This changes the velocity update fundamentally: while we previously merely utilized information from positions that we have already visited, the swarm can now even move towards directions that differ from the existing search trajectories. With each iteration, we augment our knowledge of the objective function and obtain an increasingly accurate estimate of the landscape, including the approximate location of the corresponding optima. As a result, the swarm is additionally attracted by the optimum of the stochastic function model.
The optimum of the stochastic function model is considered as an effective guess to the optimum of the function, and we use this information in combination with cognitive and social influence as a third heuristic influence on the particles.
(B) PSO with heuristic exploitation. Often, an adjustment of the direction is not sufficient to accomplish a fast convergence behavior and we thus propose a variant where we directly relocate unsuccessful particles to the optimum in the approximation model.
In other words, the particle is moved to the best average location given the stochastic function model. As a result, this accelerates convergence, as we explicitly query points which are likely to improve our swarm according to the stochastic function model, as opposed to relying on sheer luck. At the same time, the stochastic movement of the other particles gives us continuous exploitation of the best-known solutions. This corresponds to the combination of PSO with a response surface, albeit with a very general framework for response surfaces.
(C) PSO with heuristic exploration. The previous version largely ignores uncertainty in the stochastic function model, and thus has no incentive for exploration. For this reason, we utilize the stochastic nature of the GP model, which allows us to obtain uncertainty estimates for the stochastic function model. We further modify the exploration behavior of the swarm and, in each iteration, relocate the particle with the worst function value to a location with high uncertainty in the stochastic function model.
As a result, this search strategy entails an explicit and formalized exploration process, which differs from the swarm movements in traditional variants of PSO. Mathematically, it also allows for a guaranteed convergence of the same complexity class as Bayesian optimization.
In the following, we provide a definition of Gaussian processes as our function approximation and, based on this, we subsequently specify the actual optimization routines.

Gaussian processes
A Gaussian process (GP) over a domain D is given by a mean function m : D → R and a covariance function K : D² → R. The covariance function must define a valid covariance form, i.e., it is a positive semi-definite kernel over the space D². Strict positive definiteness corresponds to non-degenerate GPs. Now let GP(m, K) denote the distribution given by the GP with mean m and covariance K, which is given by the following definition. We refer to Rasmussen & Williams (2008) for definitions, posteriors, and common choices of covariance functions.

Definition 3.1 (Gaussian Process). For any finite set of points X = {x_1, ..., x_m} ⊂ D, the vector of function values f(X) = (f(x_1), ..., f(x_m))^T follows a multivariate normal distribution

  f(X) ∼ N(m(X), K(X, X)),

where m(X)_i = m(x_i) and K(X, X)_{i,j} = K(x_i, x_j).

Remark 3.2 (Gaussian Process Posterior). Let Y = {y_1, ..., y_m} be another set of points, and let K(X, Y) be the matrix with K(X, Y)_{i,j} = K(x_i, y_j). The definitions of K(Y, X) and K(Y, Y) are analogous. Given an observation of f(X), the posterior distribution over f is then given by

  f(Y) | f(X) ∼ N(m̃(Y), K̃(Y, Y)),
  m̃(Y) = m(Y) + K(Y, X) K(X, X)^{-1} (f(X) − m(X)),
  K̃(Y, Y) = K(Y, Y) − K(Y, X) K(X, X)^{-1} K(X, Y),

for any set Y ⊂ D.
As one can readily see, this means that the posterior distribution of a GP is again a GP with modified m and K. Thus any posterior evaluation f (Y ) follows a multivariate normal distribution. We recall that the marginals of a multivariate normal distribution are also normal distributions. This allows us to efficiently calculate the distribution of function values of the GP model at arbitrary locations.
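A numerical sketch of these conditioning formulas, with a squared-exponential kernel as an illustrative covariance choice and a small jitter term for numerical stability:

```python
import numpy as np

def rbf_kernel(A, B, length_scale=1.0, variance=1.0):
    """Squared-exponential covariance matrix K(A, B)."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return variance * np.exp(-0.5 * d2 / length_scale**2)

def gp_posterior(X, fX, Y, kernel=rbf_kernel, jitter=1e-8):
    """Posterior mean and covariance of f(Y) given observations f(X) = fX,
    under a zero-mean GP prior (standard conditioning formulas)."""
    K_XX = kernel(X, X) + jitter * np.eye(len(X))
    K_YX = kernel(Y, X)
    K_YY = kernel(Y, Y)
    K_inv = np.linalg.inv(K_XX)
    mean = K_YX @ K_inv @ fX                 # m~(Y) with m = 0
    cov = K_YY - K_YX @ K_inv @ K_YX.T       # K~(Y, Y)
    return mean, cov
```

Evaluating the posterior at the training points themselves recovers the observed values with near-zero variance, which matches the interpolation behavior of a noise-free GP.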
The choices we must make when specifying a GP as a stochastic surrogate model for our function are the covariance K and the mean m. In practice, one sets m = 0 and concentrates on adapting K to the problem setting (Rasmussen & Williams, 2008, Chapter 4). This allows the user to encapsulate properties of the function that are known a priori; we later give guidance concerning this choice.

Stochastic surrogate model through Gaussian processes
We now define our stochastic surrogate model, based on which we approximate the objective function. Let θ denote the additional parameters in a covariance K_θ. Then, the interpolation requires one to identify suitable parameters θ. Their values can be inferred from the observed data based on a maximum likelihood estimation,

  θ* ∈ arg max_θ p(f(X) | θ),    (10)

which gives the maximum a posteriori estimates of θ. The above optimization can be solved with a numerical optimization procedure of choice, e.g., quasi-Newton methods. 1 Even though this optimization problem is generally tractable, the matrix inversion involved in evaluating Equation (10) scales cubically with the number of observations and thus becomes computationally expensive as evaluations accumulate. This is why we use a greedy heuristic to only retain certain evaluations when fitting our stochastic surrogate model. That is, we use only a subset M ⊂ X of our observations, our memory, and fit the model via arg max_θ p(f(M) | θ). We describe the selection process in more detail in the following section.

Efficient learning of stochastic surrogate model
We only retain a subset M ⊂ X of our observations. The corresponding selection is made by a function χ which extracts the points that are considered most informative. We define it in the following.
Recall the notation from Section 2. We write W^(i) for the set of particles at step i and M^(i) = {y_1, ..., y_k} for the set of observations we keep in memory. As outlined in the previous section, we use a greedy heuristic to only retain a subset of the locations and evaluations at each iteration in M^(i). At each time step, we use the current stochastic surrogate model GP(m, K_θ) and set

  χ(m, K_θ, W^(i)) = { x ∈ W^(i) : |f(x) − m(x)| > ρ σ(x) }

for a fixed value of ρ, where σ(x) denotes the posterior standard deviation at x. We set ρ consistently to correspond to the 75 % confidence interval, i.e., ρ ≈ 1.15. This heuristic corresponds to only keeping informative observations, that is, observations which change our assumptions about the function. All observations which appear likely under the current stochastic surrogate model are forgotten. We additionally keep the most recent observations in memory to ensure that our stochastic function model is precise near our swarm.
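The memory filter χ can be sketched as a simple predicate on each observation: keep a point only if its true value falls outside the ρ-standard-deviation band of the surrogate prediction. The mean_fn/std_fn interface to the surrogate is an assumption of this sketch:

```python
def select_informative(candidates, f_vals, mean_fn, std_fn, rho=1.15):
    """Greedy memory filter chi: keep only 'surprising' observations, i.e.
    those whose true value lies outside the rho-standard-deviation band of
    the current surrogate (rho = 1.15 corresponds to a 75% confidence
    interval). mean_fn / std_fn return the surrogate's posterior mean and
    standard deviation at a point."""
    kept = []
    for x, fx in zip(candidates, f_vals):
        if abs(fx - mean_fn(x)) > rho * std_fn(x):
            kept.append((x, fx))  # observation changes our beliefs: retain
    return kept                   # likely observations are forgotten
```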

Optimization methods
We now formalize the different optimization methods combining PSO and the Gaussian-process-based surrogate model. To this end, we first specify the general form that is shared across all three algorithms (see Algorithm 1) and later discuss the individual modifications. The different algorithms only differ in the update rule given in Step 6 of Algorithm 1, where each variant inserts its own rule. The general layout of Algorithm 1 first consists of initializations, where the initial particle positions are determined and the initial set of points for the stochastic function approximation is set. A subsequent loop moves the swarm around the function space. Within this loop, we fit the stochastic function model in Steps 4 and 5. Then, Step 6 moves the particles around, while Step 7 updates the memory that underlies our heuristic for accelerating the learning of the GP-based surrogate model.
We now give the update rules corresponding to the individual algorithm variants.

Algorithm 1 General form of PSO with a Gaussian-process-based surrogate model.
Input: Domain D, parameterized family of kernels K_θ, function f
Output: Approximate global minimum x* of f
1: Initialize set of n_par particles W^(0) uniformly at random over D with multivariate normal velocities
2: Set initial memory M^(0)
3: for i = 1, 2, ... do
4:   Estimate the kernel parameters θ from the observations in M^(i−1)
5:   Compute the stochastic function approximation GP(m̂, K̂_θ) as the posterior of GP(m, K_θ) for the observations in M^(i−1)
6:   Move the particles, where the update rule depends on the algorithm variant
7:   Update memory via M^(i) ← M^(i−1) ∪ χ(m̂, K̂_θ, W^(i−1))
8: end for
9: return Global best solution from W^(end)

(A) PSO with heuristic direction. This algorithm extends the traditional velocity update by an additional direction towards the optimum from the stochastic surrogate model. More precisely, we take the mean of the Gaussian process as a response surface and use the minimum of this response surface as a heuristic direction, i.e., h^(i) = arg min_{x∈D} m(x). We then set the heuristic influence (HI) of particle j to h^(i) − x_j^(i), which enters the velocity update as a third term alongside the cognitive and social influences.

(B) PSO with heuristic exploitation. Here, we choose the particle j* with the worst function value f(x_j^(i)) and determine its new location from the GP mean, i.e., x* ∈ arg min_{x∈D} m(x). Its new location and velocity are then initialized around x*.

(C) PSO with heuristic exploration. Here, the worst particle is relocated to the minimizer of an acquisition function R(K, m, ·), and we experiment with two choices of R. In version (C1), we let the worst particle move to the optimum of the lower bound of the 90 % confidence interval, that is, R(K, m, x) = m(x) − ρ σ(x) for ρ ≈ 1.6. In the other version (C2), we move it to the position of maximal uncertainty, that is, we use R(K, m, x) = −σ(x), so that minimizing R maximizes the uncertainty. Both versions perform added exploration, either by explicitly querying high-uncertainty points or by being very optimistic as to the location value.
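The relocation step shared by variants (B) and (C) can be sketched as follows. The inner minimization over D is done here by random search, a simple stand-in for the quasi-Newton optimization; the candidate count and box-shaped domain are assumptions of this sketch:

```python
import numpy as np

def relocate_worst(positions, f_vals, acquisition, bounds,
                   rng=None, n_candidates=1000):
    """Relocate the worst particle to the minimizer of an acquisition
    function over the domain. For variant (B), acquisition is the GP mean
    m(x); for (C1) the lower confidence bound m(x) - rho * sigma(x); for
    (C2) the negative standard deviation -sigma(x).

    positions : array of shape (n_par, d)
    f_vals    : function values; the worst particle has the largest value
                (minimization setting)
    bounds    : (lo, hi) of a box-shaped domain D
    """
    rng = rng or np.random.default_rng()
    lo, hi = bounds
    worst = int(np.argmax(f_vals))   # worst particle under minimization
    candidates = rng.uniform(lo, hi, size=(n_candidates, positions.shape[1]))
    acq = np.array([acquisition(c) for c in candidates])
    positions[worst] = candidates[int(np.argmin(acq))]
    return positions, worst
```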
The greedy heuristic in the stochastic function approximation makes a direct transfer of convergence guarantees difficult. However, if we include all past values in our Gaussian process, we can guarantee the same convergence behavior as the associated Bayesian optimization scheme, albeit with a higher number of concurrent function evaluations.
Each of the algorithms (A)-(C) relies internally on finding a new location from the stochastic function approximation. We note here that the identification is computationally efficient, as the location is either analytically known or one can use numerical optimization (e. g. quasi-Newton methods). Multiple restarts are replaced by initializing the search with the location of the current optimum.

Choice of covariance function
We consistently choose the covariance function

  K_θ(x, y) = α_1 + α_2 exp(−‖x − y‖² / (2ρ²)) + α_3 δ_{x,y},

where the parameters α_1, α_2, α_3, and ρ are adapted by the estimation process, and δ_{x,y} equals 1 if x = y and 0 otherwise. This choice of covariance corresponds to the assumption that our function has the form f(x) = m + f̃(x) + ε(x), where m is a global mean, f̃ is a smooth function with mean 0, and ε is white noise, i.i.d.
at every point. This residual white noise is introduced to account for non-smoothness in the underlying function.
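A sketch of one plausible reading of the covariance described above (a constant term for the global mean level, a squared-exponential term for the smooth component, and a white noise term for the i.i.d. residual); the exact parametrization in the paper may differ, and the default parameter values are illustrative:

```python
import numpy as np

def covariance(x, y, alpha1=1.0, alpha2=1.0, alpha3=0.01, rho=1.0):
    """K(x, y) = alpha1 + alpha2 * exp(-||x - y||^2 / (2 rho^2))
                 + alpha3 * delta(x, y).

    alpha1 : constant term (global mean level m)
    alpha2 : signal variance of the smooth component f-tilde
    alpha3 : white noise variance of the residual epsilon
    rho    : length scale of the smooth component
    """
    x, y = np.asarray(x, float), np.asarray(y, float)
    smooth = alpha2 * np.exp(-np.sum((x - y) ** 2) / (2 * rho**2))
    white = alpha3 if np.array_equal(x, y) else 0.0
    return alpha1 + smooth + white
```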

Illustrative example
We give a short example of the proposed algorithm. For this, we look at several snapshots of the algorithm running on the two-dimensional Ackley function given by

  f(x, y) = −20 exp(−0.2 √(0.5(x² + y²))) − exp(0.5(cos 2πx + cos 2πy)) + e + 20,

with e as Euler's number. The domain is set to [−5, 5]², for which the function has a global minimum of 0 at x = y = 0.
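The two-dimensional Ackley function can be written directly from its definition:

```python
import math

def ackley_2d(x, y):
    """Two-dimensional Ackley function with global minimum 0 at x = y = 0."""
    term1 = -20.0 * math.exp(-0.2 * math.sqrt(0.5 * (x**2 + y**2)))
    term2 = -math.exp(0.5 * (math.cos(2 * math.pi * x)
                             + math.cos(2 * math.pi * y)))
    return term1 + term2 + math.e + 20.0
```

Its many shallow local minima make it a common stress test for the exploration behavior of swarm methods.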
As an illustrative example, we now execute PSO with heuristic exploration, more precisely, variant (C1) of our algorithm. We use 10 walkers and otherwise the same parameters as in Section 4. Figure 1 depicts different iterations of the algorithm, namely, after initialization, after 6 steps (i. e. 70 function evaluations) and after 18 steps (i. e. 190 function evaluations).
After initialization, the stochastic surrogate model has learned little about the true nature of the wavy objective function and, hence, its mean suggests a rather smooth curve with a high variance associated to it. The acquisition function shows the lower bound of the 90 % confidence interval. Accordingly, the heuristic exploration guides the algorithm to further explore areas that promise both high chances of improvements, while taking the point-specific uncertainty into consideration.
As the algorithm progresses, the particles quickly converge towards the optimum in the center of the plot, while the mean surface yields a better resolution and serves as an increasingly accurate function approximation. Hence, the proposed location from the heuristic exploration quickly coincides with the true optimum. We can also see the convergence of the particles towards the global optimum of the Ackley function at x = y = 0. After 18 steps, we can barely recognize the positions of individual particles as they are so closely scattered around the optimum. Here, we also note the low variance of the approximation around the optimum.
Similarly, the acquisition function also attains its minimum around x = y = 0 and thus guides the swarm to explore this area.
Figure 1: Illustrative execution of the PSO with heuristic exploration, i. e. variant (C1). Panels show, from left to right, the objective function, the approximation (mean), the approximation (variance), and the acquisition function. Shown is the state of the algorithm after initialization, after 6 steps, and after 18 steps. Highlighted are the global optimum (yellow cross), the current heuristic exploration (white circle), the current best solution (orange square), and the particles (brown dots). The algorithm quickly learns a crisp approximation of the true objective and, based on it, guides the search direction of the particle swarm. Darker colors correspond to lower values. After 18 steps, almost all particles have converged to the true optimum and are so closely scattered around x = y = 0 that one can barely recognize them any longer.

Setup
We performed an extensive series of computational experiments. For reasons of comparability, we adhere to the CEC2013 benchmark functions and adopt the same experimental setup as in Liang et al. (2013). The benchmark functions are reviewed in Online Appendix A.
These are challenging in different ways (e. g., their landscape suffers from multi-modality and poor scaling).
In accordance with the literature (Bonyadi & Michalewicz, 2017), we experiment with different parameter settings (see Table 1). All PSO variants are compared with 50 particles (Yu et al., 2018). In practice, PSO is often used in settings where function evaluations are computationally or economically costly (Rios & Sahinidis, 2013), for instance, because complex industrial simulations are involved (e. g. Hong et al., 2018). This is the setting to which our algorithms are tailored, and, hence, our evaluations compare the efficiency of the algorithms after 100 · D function evaluations following Liu et al. (2013); Varma et al. (2013). For reasons of space, we report the results for ten-dimensional (D = 10) objective functions. In addition, we utilize Bayesian optimization as another comparison, as our approach borrows concepts from the underlying GP-based approximation of functions, though the stochastic process in PSO yields a completely different search strategy. We evaluated Bayesian optimization in the standard parallelized implementation of the Python package GPyOpt, which uses local penalization (González et al., 2015). For each experiment, we measure the performance regarding the mean, minimum, maximum, median, and the standard deviation across 51 independent runs, as in Liang et al. (2013); Zambrano-Bigiarini et al. (2013).
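The per-function summary statistics can be computed with a few lines of NumPy; the run results below are synthetic placeholders, not values from our experiments:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical final best objective values from 51 independent runs.
final_values = rng.lognormal(mean=0.0, sigma=1.0, size=51)

# The five statistics reported per experiment.
summary = {
    "mean": float(np.mean(final_values)),
    "min": float(np.min(final_values)),
    "max": float(np.max(final_values)),
    "median": float(np.median(final_values)),
    "std": float(np.std(final_values, ddof=1)),  # sample standard deviation
}
```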

Algorithm Parameters
Table 1: Algorithm parameters for each version.

Numerical performance
The results of the computational experiments are provided in Tables 2 and 3. We further compare the performance through statistical tests (see Online Appendix B).
That is, we compare the performance of the favored variant (A3) against the performance of the benchmark algorithms to see whether the improvement is statistically significant. For a significance threshold of α = 5 %, variant (A3) is superior to SPSO2011 for 27 of the 28 objective functions at a statistically significant level. This confirms the effectiveness of the proposed algorithms. The GP-based surrogate introduces computational overhead, which matters most when function evaluations themselves are cheap. However, we remind the reader that many applications of PSO in practice involve functions f that are computationally costly and, therefore, the relative overhead due to the GP is likely to be lower in practice.

Convergence plots
As above, we provide results for unimodal, multimodal, and composition functions. Experiments are performed on an Intel Core i5 with 3.1 GHz and 16 GB memory. Bayesian optimization is omitted for reasons of space, as it is already shown to be inferior in the above numerical experiments.

Comparison against surrogate-assisted evolutionary algorithms
We further compare our PSO variants against state-of-the-art surrogate-assisted evolutionary algorithms. Specifically, we compare against five popular variants: (1) a Gaussian-process-based genetic search (GPEME; Liu et al., 2013); (2) an ensemble-based algorithm with reweighting according to surrogate errors (WTA1; Goel et al., 2007); (3) a surrogate-assisted memetic algorithm with surrogate-based local searches (GS-SOMA; Lim et al., 2009); (4) an evolution strategy assisted by local Gaussian random field metamodels (MAES-ExI; Emmerich et al., 2006); and (5) a PSO extended by committee-based active learning (CAL-SAPSO; Wang et al., 2017). The evaluation setting is identical to that in prior literature (Wang et al., 2017). This is done to allow for a fair comparison against existing surrogate-assisted evolutionary algorithms.
Specifically, the evaluation is based on four popular multi-modal benchmark functions as in Wang et al. (2017): the Ackley function, the Griewank function, the Rastrigin function, and the Rosenbrock function. The performance of the algorithms is evaluated after 11 · D function evaluations, analogous to Wang et al. (2017). For reasons of space, we present the results for ten-dimensional (D = 10) objective functions. We report the best candidate solution values (mean) across 20 independent runs.
The results are in Table 4. We observe that all considered surrogate-assisted evolutionary algorithms are outperformed by variant (A3) for two out of four functions. For instance, for the Ackley function, variant (A3) registers a better result than the state-of-the-art benchmark (i. e., MAES-ExI; Emmerich et al., 2006), and it also outperforms GPEME on the Griewank benchmark. The comparison against GPEME is particularly interesting, as GPEME is based upon a combination of a Gaussian process and a genetic algorithm. Here we find that variant (A3) is superior in three out of four cases. This can be directly attributed to the difference between both: GPEME uses the surrogate only for prescreening candidate solutions, whereas we actually integrate the surrogate to guide the swarm search.
Overall, this establishes the strong performance of the proposed algorithms, as they are widely preferred.
Here the corresponding objective function was minimized. Results were obtained across 20 independent runs; reported is the mean (±SD) of the best candidate solution values across all runs.

Discussion
As shown above, the proposed algorithms improve the efficiency over standard PSO; that is, with the same number of iterations, a better solution is obtained. This is especially crucial in practical settings where function evaluations are economically or computationally costly (cf. Rios & Sahinidis, 2013, for a discussion). For potential users of our algorithms, there are direct implications. In a cost analysis, we showed that the benefits of our algorithm relate to the cost (or runtime) of making function evaluations. If function evaluations are (computationally) cheap, a standard PSO appears beneficial, while, otherwise, our Gaussian-process-based PSO has benefits. Similarly, the benefits of our algorithms over standard PSO diminish if many function evaluations can be made; yet, when this number is limited, as in the above experiments, our algorithms are favorable.
Our work developed variations of PSO by using the information gained about the underlying function to model this function itself as a stochastic process. In other words, this could also be seen as an extension of concepts from Bayesian optimization to PSO. By drawing upon Gaussian processes, we allow for a wide family of possible function structures and benefit from recent advances in Bayesian optimization, which currently serves as a popular algorithm in high-expense black-box function optimization. Along with variant (A3), variant (C) with a combination of PSO and GP-based exploitation showed promising results in the numerical experiments. Besides that, this variant has interesting theoretical properties, as it yields the same performance guarantees as sequential Bayesian optimization (subject to a fixed factor).
As opposed to pure Bayesian optimization, the PSO search remains agnostic to the underlying function. It might therefore serve as an initial exploratory scheme until the function can be adequately modeled. Here we point to Buche et al. (2005), who use an evolutionary algorithm followed by Bayesian optimization in similar fashion.

Conclusion
This work proposed an innovative direction for improving stochastic optimization methods by guiding the search process with a stochastic function approximation. Specifically, we combined particle swarm optimization with Gaussian-process-based function approximation between the walkers' positions in order to improve particle movements with regard to exploration and exploitation. This allows us to profit from the stochastic convergence of the swarm, while also accelerating convergence and reducing the risk of finding local instead of global optima. To this end, three different approaches were developed: (A) using the stochastic surrogate model to locate an approximate global optimum and include this in the original PSO search direction; (B) using the approximate global optimum to leapfrog the worst particle for fast convergence; (C) a hybridization of PSO and Bayesian optimization for increased exploitation. As a result, these modifications can improve the stochastic search process over the original PSO by obtaining a more effective swarm behavior with respect to exploration and exploitation. Finally, our computational experiments demonstrated that enhancing PSO by leveraging a Gaussian-process-based surrogate model as a further source of information improves its performance.
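To make variant (A) concrete, the following sketch shows one way the surrogate's approximate optimum could enter the velocity update. The coefficient values and the function name are illustrative assumptions, not the exact update rule from the paper:

```python
import numpy as np

rng = np.random.default_rng(1)

def velocity_update(v, x, p_best, g_best, g_surr,
                    w=0.72, c1=1.19, c2=1.19, c3=1.19):
    """Variant (A)-style update: the usual PSO terms plus an extra pull
    towards the surrogate's approximate global optimum g_surr.
    Coefficients here are illustrative, not taken from the paper."""
    r1, r2, r3 = rng.random(3)
    return (w * v
            + c1 * r1 * (p_best - x)   # cognitive term (personal best)
            + c2 * r2 * (g_best - x)   # social term (swarm best)
            + c3 * r3 * (g_surr - x))  # surrogate-guided term

# One illustrative step for a single particle.
v = np.zeros(2)
x = np.array([1.0, -1.0])
v_new = velocity_update(v, x, p_best=x, g_best=np.zeros(2), g_surr=np.zeros(2))
x_new = x + v_new
```

Setting c3 = 0 recovers a standard PSO velocity update, which makes the surrogate-guided term an additive extension rather than a replacement of the original search direction.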
The above method opens avenues for future research. Specifically, one could investigate the performance when replacing the GP approximation with alternative models, such as Bayesian neural networks, random forests, or other models that allow for a stochastic interpretation of functions. Here one could also decompose the approximation into a stochastic one for uncertainty estimates and a non-stochastic counterpart for the response surface, as this could empower more accurate approximations at lower computational costs. Our approach of a GPguided stochastic optimization might also be helpful for other population-based algorithms that involve a combination of exploratory and exploitative behavior (e. g., evolutionary algorithms, cross-entropy methods and bees algorithms). Beyond that, we leave it to future research to adapt the proposed algorithm to parallel multiple swarm optimization and multiobjective optimization.

Online Appendix B. Statistical tests for performance comparisons
We report p-values for the significance of the results presented in Tables 2 and 3. To this end, we calculate p-values of a one-sided t-test on the means of the favored variant (A3) versus the remaining variants and the benchmark algorithms.
Table B.1: Computed p-values of one-sided t-tests checking if the means of variant (A3) are significantly lower than the means of the remaining variants and benchmark algorithms for the CEC2013 functions. We highlight p-values below a significance level of 5 % in bold.
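A test of this form can be reproduced with SciPy's ttest_ind and alternative="less"; the run data below are synthetic placeholders, not our experimental results:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
# Hypothetical final errors across 51 runs for variant (A3) and one benchmark.
errors_a3 = rng.normal(loc=1.0, scale=0.3, size=51)
errors_bench = rng.normal(loc=1.5, scale=0.3, size=51)

# One-sided test: is the mean error of (A3) significantly LOWER
# than that of the benchmark?
t_stat, p_value = stats.ttest_ind(errors_a3, errors_bench, alternative="less")
significant = p_value < 0.05  # significance level used in Table B.1
```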