Consensus-Based Global Optimization with Personal Best

In this paper we propose a variant of a consensus-based global optimization (CBO) method that uses personal best information in order to compute the global minimum of a non-convex, locally Lipschitz continuous function. The proposed approach is motivated by the original particle swarm algorithms, in which particles adjust their position with respect to their personal best, the current global best, and some additive noise. The personal best information along an individual trajectory is included with the help of a weighted mean, which enters the dynamics via an additional drift term. We illustrate the dynamics with a toy example, analyze the respective memory-dependent stochastic system and compare the performance with the original CBO with component-wise noise for several benchmark problems.


Introduction
Interacting particle systems play an important role in many applications in science: on the one hand as a modeling framework for social and biological systems, on the other as a tool for computational algorithms used in data science. In the latter case the collective behavior of interacting particle systems is used to solve high-dimensional problems, often resulting from non-convex optimization tasks in data science. Well-known algorithms include particle swarm optimization (PSO) [1], ant colony optimization [2], evolutionary algorithms [3] and genetic algorithms [4].
PSO was first introduced in [1] and has been successfully used in engineering applications [5]. Each particle in a PSO algorithm adjusts its position based on information about the global best, its personal best and a noise term that allows for exploration of its neighborhood. Consensus-based optimization (CBO) [6,7] combines the idea of swarm intelligence with consensus formation techniques [8,9,10] to obtain a global optimization algorithm for non-convex high-dimensional problems. On the one hand, particles explore the state space via an amplitude-modulated random walk; on the other, a drift term convects them towards the weighted global best. The method was first introduced in [6] and analyzed at the mean-field level in [7]. Recent developments of CBO include component-wise diffusion and utilize random mini-batch ideas to reduce the computational cost of calculating the weighted average [11]. Other contributions investigate CBO dynamics restricted to the sphere [12,13]. Also, convergence and error estimates for time-discrete consensus-based optimization algorithms have been discussed [14].
The model proposed in this work is based on the component-wise diffusion variant introduced in [11] and combines it with personal best information. This adjustment is motivated by the original work on PSO by Eberhart and Kennedy [1], where the particles move towards a (stochastic) linear combination of their personal best and the common global best position.
The new information leads to an additional drift term in the dynamics. We investigate two types of memory effects: either using a weighted personal best over time or the personal best value in the past. The latter corresponds to record processes, see [15] for an overview. The former is used in the presented analysis and approximates the personal best of each particle. We expect, by arguments similar to the Laplace principle, that the weighted mean converges towards the personal best.
The proposed stochastic dynamics with weighted personal best fall into the class of stochastic functional differential equations. These equations are in general non-Markovian and their mean-field limit has been investigated in special cases only. For example, Gadat and Panloup [16] investigated a non-Markovian process with memory, which corresponds to the weighted average of the drift along the particle's entire trajectory. This memory term is of a special form, allowing them to rewrite the system as a 2d-dimensional non-homogeneous Markovian dynamical system. Moreover, they exploit this special structure to analyze the existence and long-time behavior of solutions as well as the mean-field limit. The proposed generalization of CBO with weighted personal best does not fall into this category; hence the derivation and analysis of the respective mean-field dynamics, which often give useful insights, is, to the best of the author's knowledge, open. The same applies to the true personal best, where the update of the best function value corresponds to a record process. Hence we focus on the well-posedness of the stochastic system as well as a detailed computational investigation of the dynamics.

This paper is organized as follows: we introduce the particle dynamics with (weighted) personal best in Section 2 and illustrate its behavior with first toy examples. Section 3 discusses well-posedness and existence of solutions to the SDE model with weighted personal best. Section 4 presents extensive computational experiments for various benchmark optimization problems.

Consensus based optimization with personal best
In this section we discuss how personal best information can be included in consensus based optimization algorithms as proposed by Carrillo and co-workers in [6,7,11]. We start by introducing the notation before continuing with the modeling.
2.1. Notation. We denote the Euclidean norm by |x| = (x_1^2 + · · · + x_d^2)^{1/2} for x ∈ R^d and set |Y| = (∑_{i,j=1}^{dN} Y_{ij}^2)^{1/2} for matrices Y ∈ R^{dN×dN}. The set of natural numbers without 0 is denoted by N* = {1, 2, 3, . . .} and the half-line [0, ∞) by R_+. A vector-valued function or vector x ∈ R^{dN} is assumed to be of the form x = (x^1, . . . , x^N) with x^i ∈ R^d. When discussing the stochastic systems we follow the notation of [17]: (Ω, F, P, {F_t}_{t≥0}) denotes the stochastic basis with sample space Ω, filtration {F_t}_{t≥0} and probability measure P. Moreover, S^p_d[0, T] is the space of (equivalence classes of) P-measurable continuous stochastic processes X : Ω × [0, T] → R^d with E[sup_{t∈[0,T]} |X_t|^p] < ∞. Two processes X, Y are called equivalent if (X_t = Y_t for all t ∈ [0, T]) P-almost surely (P-a.s.). Furthermore, S^p_d is the space of (equivalence classes of) P-measurable continuous stochastic processes X : Ω × R_+ → R^d such that for all T > 0 the restriction X|_{[0,T]} of X to [0, T] belongs to S^p_d[0, T]. Analogously, we define Λ^p_d(0, T) as the space of (equivalence classes of) P-measurable processes X : Ω × (0, T) → R^d with E[∫_0^T |X_t|^p dt] < ∞, and Λ^p_d as the space of (equivalence classes of) P-measurable processes X : Ω × (0, +∞) → R^d for which for all T > 0 the restriction X|_{[0,T]} of X to [0, T] belongs to Λ^p_d(0, T). Moreover, for any φ ∈ C(R_+, R^{dN}) we define |φ|_t := sup_{0≤s≤t} |φ(s)|.

2.2. The model. We wish to approximate the global minimum of a given non-negative, continuous objective function f : R^d → R. In doing so we consider N ∈ N particles and denote the position of the i-th particle at time t by X^i_t := X^i(t) ∈ R^d, i = 1, . . . , N. Note that we use X_t = X(t) = (X^1(t), . . . , X^N(t)) ∈ R^{dN} when referring to the positions of all particles at time t. In CBO, particles compare their current function value with a weighted mean value based on the current information of the whole system. A particle moves towards the position of the weighted mean if the function value of the weighted mean is better.
Following the ideas of [6,7], we use the weighted average

v_f = v_f[X_t] = ( ∑_{i=1}^N X^i_t e^{-α f(X^i_t)} ) / ( ∑_{i=1}^N e^{-α f(X^i_t)} ),   (2.2)

with α > 0, to approximate the global best. Note that even though the weighted average uses only information of the current time step, it is considered global, as a particle that is close to the weighted average experiences only a small drift and diffusion. Moreover, the Laplace principle ensures that v_f converges to the global best as α → ∞.
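In a discrete-time implementation, the weighted average (2.2) can be evaluated stably even for large α by shifting the exponent before exponentiating; the shift cancels in the quotient. A minimal sketch, where the function name and interface are ours, not the paper's:

```python
import numpy as np

def weighted_average(X, f, alpha):
    """Weighted average v_f of particle positions, cf. eq. (2.2).

    X : (N, d) array of particle positions, f : objective function, alpha > 0.
    The weights are shifted by the minimal function value (log-sum-exp trick)
    so that large alpha does not underflow; the shift cancels in the quotient.
    """
    fx = np.apply_along_axis(f, 1, X)      # (N,) function values
    w = np.exp(-alpha * (fx - fx.min()))   # shifted weights, best particle gets weight 1
    return (w[:, None] * X).sum(axis=0) / w.sum()

# As alpha grows, v_f approaches the position of the currently best particle
# (the Laplace principle mentioned above).
f = lambda x: np.sum(x**2)
X = np.array([[1.0, 0.0], [0.2, 0.1], [2.0, 2.0]])
v = weighted_average(X, f, alpha=50.0)
```

For α = 50 the weight of the best particle dominates, so v lies very close to (0.2, 0.1) here.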
In the original version of PSO, see [18], particles compare their current position with the global best as well as with their personal best value up to that time. We propose two different approaches to include the personal best p^i of the i-th particle. First, we consider the true personal best by setting

P^i_f(t) = X^i_{s^*(t)},   s^*(t) ∈ argmin_{s∈[0,t]} f(X^i_s).   (2.3)

Moreover, the personal best can be approximated similarly to the global best v_f defined in (2.2). Hereby, we use the entire trajectory in the past and refer to this trajectory by X = (X^1, . . . , X^N) with X^i ∈ C(R_+, R^d) for all i = 1, . . . , N. Let X^i_0 denote the initial position of the i-th particle at time t = 0. The weighted mean over time of the i-th particle is defined by

p^i_f(t) = ( X^i_0 e^{-β f(X^i_0)} + ∫_0^t X^i_s e^{-β f(X^i_s)} ds ) / ( e^{-β f(X^i_0)} + ∫_0^t e^{-β f(X^i_s)} ds ),   (2.4)

with β > 0. Note that the well-posedness result presented in Section 3 holds for the weighted personal best (2.4) only. Again, by the Laplace principle, we expect that p^i_f(t) → P^i_f(t) as β → ∞. We recall that particles move either towards the global or towards the personal best state. The respective CBO dynamics for the i-th particle, i = 1, . . . , N, are then given by the following SDE:

dX^i_t = λ H(f(X^i_t) − f(v_f)) H(f(p^i_f) − f(v_f)) (v_f − X^i_t) dt
       + μ H(f(X^i_t) − f(p^i_f)) H(f(v_f) − f(p^i_f)) (p^i_f − X^i_t) dt
       + σ diag(X^i_t − v_f) dB^i_t,   (2.5)

with drift parameters λ, μ > 0. The function H corresponds to the Heaviside function and σ > 0 denotes the diffusivity. System (2.5) is supplemented with the initial condition X^i_0 = ξ^i, i = 1, . . . , N. The drift and diffusion are motivated by the following considerations.
(1) If the global best v_f is better than the current position X^i_t and the personal best p^i_f, the particle moves towards the current global best v_f.
(2) If the personal best p^i_f is better than the current position X^i_t and the global best v_f, the particle moves towards the personal best p^i_f.
(3) If neither of the above holds, the particle moves due to Brownian motion, unless it is located at the global best v_f.
Note that the drift coefficients depend on the past of each particle, hence system (2.5) is non-Markovian. The form of the memory does not allow us to use existing results, such as [17], to rewrite the system. Hence the existence and form of the respective mean-field model is, up to the author's knowledge, not known.
Remark 1. The CBO version proposed in [11] can be recovered by setting

μ = 0 and H ≡ 1.   (2.6)

Throughout this manuscript we will refer to the dynamics defined by (2.5) with (2.6) as CBO, and to (2.5) with (2.3) or (2.4) as personal best (PB) or weighted personal best (wPB), respectively.
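The drift logic in (1)–(3) translates directly into an Euler–Maruyama step, and the time integrals in the weighted personal best (2.4) can be accumulated online with one quadrature step per iteration. A sketch under our own naming conventions (the parameters lam, mu, sigma and the sharp Heaviside are illustrative choices, not quoted from the paper):

```python
import numpy as np

def pb_cbo_step(X, num, den, f, v_f, dt, beta, lam=1.0, mu=1.0, sigma=0.5, rng=None):
    """One Euler-Maruyama step of the personal-best dynamics (illustrative sketch).

    X        : (N, d) particle positions
    num, den : running numerator / denominator of the weighted personal best (2.4)
    v_f      : current weighted average (global best approximation)
    """
    rng = rng or np.random.default_rng()
    N, d = X.shape
    fx = np.apply_along_axis(f, 1, X)
    # accumulate the time integrals in (2.4) and form the weighted personal best
    w = np.exp(-beta * fx)
    num = num + dt * w[:, None] * X
    den = den + dt * w
    p = num / den[:, None]
    fp = np.apply_along_axis(f, 1, p)
    fv = f(v_f)
    H = lambda s: (s > 0).astype(float)  # Heaviside gating, rules (1)-(3) above
    drift_global = lam * (H(fx - fv) * H(fp - fv))[:, None] * (v_f - X)
    drift_personal = mu * (H(fx - fp) * H(fv - fp))[:, None] * (p - X)
    noise = sigma * (X - v_f) * rng.normal(size=(N, d)) * np.sqrt(dt)
    return X + dt * (drift_global + drift_personal) + noise, num, den
```

Setting mu = 0 switches the personal-best drift off and leaves a CBO-type update with component-wise multiplicative noise, matching Remark 1.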
2.3. Toy example: CBO vs. PB dynamics. In the following we illustrate the differences between CBO and (w)PB using a one-dimensional toy objective function f and 3 particles. We consider a double-well-type f for which the global and local minimum, located at x = −1.00125 and x = 0.998748 respectively, attain very similar function values; the function is shown in Figure 1. In the following we perform 1000 Monte Carlo (MC) simulations with deterministic initial conditions ξ. We count a run as successful if the final positions of the particles, X* = X(T), satisfy |v_f(T) − X*| < 0.4. The final time is set to T = 100, the time step size to dt = 10^{-3}, and β in (2.4) to β = 30. We study the dynamics for the following two initial conditions:
(IC1) Initialize 2 particles near the local and 1 particle near the global minimizer.
(IC2) Initialize 1 particle near the local and 2 particles near the global minimizer.
The initial positions for (IC1) and (IC2) correspond to the gray dots in Figure 1 and Figure 3, respectively. We discuss (IC1) first. In this situation the weighted average v_f is located near x = 0.9; thus the Heaviside functions are zero and the system would be stationary for σ = 0. For σ > 0, the diffusion term drives the dynamics and the particles explore their neighborhood. Due to the multiplicative factor, the particle on the left is exposed to more diffusion than the particles on the right. In case of the CBO scheme, the particle on the left has a high probability of jumping out of the basin of the global minimum. Then all particles concentrate near the local minimum. For one run, this behavior is illustrated by the positions of v_f shown in Figure 1 (left). This alone does not reflect the concentration, which becomes apparent in Figure 2. In fact, the orange lines fluctuate for small times but stabilize quickly, indicating that no diffusion is present and thus that all particles have concentrated. This behavior changes when personal best information is included. Here, particles still explore their neighborhood; however, at some point their current positions are worse than their personal best, and hence the drift starts pulling them back towards their personal best. This behavior is also reflected in the success rates stated in Table 1. We see that (w)PB outperforms CBO for large and small values of α. The 'pull-back' effect slows down the convergence of (w)PB: we observe that the respective energies decrease more slowly than for CBO in Figure 2. Nevertheless, the personal best variants find the global minimum.
Next we consider initial condition (IC2). Again, in the deterministic case σ = 0 the initial configuration is stationary. For σ > 0 the particles on the left are less diffusive than the particle on the right. Therefore, it is more likely that the particle on the right jumps into the basin of the global minimum. This is illustrated in Figure 3 and confirmed by the success rates in Table 2. Again CBO converges faster than (w)PB, see Figure 4. Nevertheless, the function values at the point of concentration are smaller for (w)PB, which means that the slower algorithms find better approximations. Note that the scale of the time axis is much smaller than in Figure 2.
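The toy comparison can be reproduced in a few lines. Since the exact toy objective is given above only through its minimizers, the tilted double well below is a hypothetical stand-in with the same qualitative features (two nearly equal wells near x = ±1); the success criterion is the one stated above, |v_f(T) − X*| < 0.4.

```python
import numpy as np

def f(x):
    # Hypothetical stand-in double well: non-negative, two wells near x = +-1
    # whose minimal values are very close, mimicking the toy setting above.
    return (x**2 - 1.0)**2 + 0.01 * (x + 1.0) + 0.02

def v_f(X, alpha=30.0):
    # weighted average (2.2), with shifted exponents for numerical stability
    w = np.exp(-alpha * (f(X) - f(X).min()))
    return np.sum(w * X) / np.sum(w)

def run_cbo(X, dt=1e-3, T=10.0, sigma=0.5, rng=None):
    """Plain CBO run with multiplicative 1D noise; a run counts as 'successful'
    if all final positions lie within 0.4 of v_f(T)."""
    rng = rng or np.random.default_rng()
    for _ in range(int(T / dt)):
        v = v_f(X)
        X = X + dt * (v - X) + sigma * (X - v) * rng.normal(size=X.shape) * np.sqrt(dt)
    return X, bool(np.all(np.abs(v_f(X) - X) < 0.4))
```

Averaging the success flag over many such runs with fixed initial conditions gives Monte Carlo success rates in the spirit of Tables 1 and 2.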

Well-posedness results
In the following we discuss the well-posedness of the wPB model. We begin by considering CBO with component-wise diffusion, which was proposed in [11].
[Figure 2 caption: The plot on the right shows the mean energy ∑_{i=1}^{3} |X^i(t) − x^*|^2, averaged over 1000 Monte Carlo runs. As expected, the particles following the CBO scheme concentrate very fast. The methods with personal best information need more time to stabilize; the one with weighted personal best is slightly faster than the one with true personal best values.]
A detailed proof can be found in the Appendix. Let us just emphasize that the estimates in the proof of Theorem 3.1 are independent of the dimension, d, as was already highlighted in [11] for the mean-field setting. This is in contrast to [6,7], where the estimates depend on d.

3.2. Well-posedness in case of weighted personal best. Next, we present an existence and uniqueness result for the proposed SDE model with weighted personal best and smoothed Heaviside functions. Different proofs for SDEs with local Lipschitz conditions as well as for path-dependent SDEs can be found in the literature [17]. Up to the author's knowledge, none of them covers the case of path-dependent SDEs with local Lipschitz conditions. In the following we present a proof which combines the two techniques to obtain a well-posedness result. We assume that the regularized Heaviside function H satisfies the following conditions:
(A1) H is continuous with 0 ≤ H(x) ≤ 1 for all x ∈ R,
(A2) there exists a constant C > 0 such that |H(x) − H(y)| ≤ C |x − y| for all x, y ∈ R.
Replacing the Heaviside functions in (2.5) by H yields the regularized problem (3.2). Moreover, we assume that the objective function f satisfies the following properties:
(A3) Positivity: it holds 0 ≤ f(x) for all x ∈ R^d,
(A4) Quasi-local Lipschitz condition: for any n < ∞ and |x|, |y| ≤ n it holds |f(x) − f(y)| ≤ L_f |x − y| with a constant L_f > 0 depending on n only.
Remark 2. We will use a well-known regularization of the Heaviside function which satisfies assumptions (A1) and (A2). Note that in the context of optimization problems, the positivity assumption on f is not too restrictive: since f corresponds to a minimization functional, it is naturally bounded from below and can be shifted to satisfy the positivity constraint.
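One standard smoothing that satisfies (A1) and (A2) is a tanh sigmoid; we assume this choice here for illustration, the paper's exact regularization is not quoted:

```python
import numpy as np

def heaviside_eps(x, eps=1e-2):
    """Smooth Heaviside H_eps(x) = (1 + tanh(x / eps)) / 2 (one standard
    regularization; assumed here, not quoted from the paper).

    H_eps takes values in [0, 1], is globally Lipschitz with constant
    1 / (2 * eps), and converges pointwise to the Heaviside function for
    x != 0 as eps -> 0, so it satisfies (A1) and (A2)."""
    return 0.5 * (1.0 + np.tanh(np.asarray(x, dtype=float) / eps))
```

Smaller eps gives a sharper switch at the cost of a larger Lipschitz constant C = 1/(2 eps) in (A2).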
The following proof is based on a combination of arguments of Theorem 3.17 and Theorem 3.27 in [17]; this yields the well-posedness of (3.2). We begin with two lemmata providing the necessary estimates. To clarify the dependencies, we write the averages as functionals of the paths, v_f[φ] and p^i_f[φ^i]. The averages satisfy local Lipschitz conditions for all t ∈ [0, ∞) with |ϕ|_t, |φ|_t ≤ n, with constants depending on n only. Moreover, the drift and diffusion coefficients satisfy growth and local Lipschitz conditions for all x, y ∈ R^{dN}, ϕ, ψ ∈ C(R_+, R^{dN}) and all R > 0.
Proof. To show (i) we compute the growth bound directly, with constants C_1 and C_2 given by (3.6) with n = R. The Lipschitz bound (iii) follows from similar arguments using the diagonal structure of σ. Equipped with this lemma, we have everything at hand to prove the main theorem.
The proof combines arguments of Theorem 3.17 (path-dependent SDE) and Theorem 3.27 (SDE with local Lipschitz coefficients) in [17].
Proof. We start by proving uniqueness. Let X, X̃ ∈ S^0_{dN} be two solutions to (3.2) corresponding to initial data ξ, ξ̃ ∈ L^0(Ω, F_0, P; R^{dN}), respectively, with ξ = ξ̃ P-a.s. Define the stopping time τ_n(ω) = inf{t ≥ 0 : |X_t(ω)| + |X̃_t(ω)| ≥ n}. Then τ_n → ∞ for n → ∞, and b and σ admit global Lipschitz constants for all times t ∈ [0, τ_n] with n arbitrary but fixed. This allows us to use Theorem 3.8 in [17] (see the proof of Theorem 3.27 in [17] for more details) to obtain, for some V_R depending on the local Lipschitz constants L_R, ℓ_R, a_R, b_R, some γ > 1 and arbitrary p ≥ 2, an estimate which implies the uniqueness of the solution on [0, τ_n]. As τ_n → ∞ for n → ∞, this allows us to conclude the global uniqueness of X ∈ S^0_{dN}.
Next, we show the existence of solutions. Let M ∈ N* and 0 = T_0 < T_1 < · · · < T_M = τ_n with T_i = i τ_n / M. We employ a fixed-point argument for the mapping Γ given by the integral form of (3.2), where no stopping times are needed since t < τ_n on [0, T_1]. Indeed, the mapping Γ is well-defined. Because of the Lipschitz continuity, the stochastic processes b(·, U) and σ(·, U) are progressively measurable for all U ∈ S^p_{dN}[0, τ_n], with b(·, U) ∈ L^p(Ω, L^1(0, τ_n)) and σ(·, U) ∈ Λ^p_{dN×dN}(0, τ_n). We will show that the operator Γ is a strict contraction on the complete metric space S^p_{dN}[0, T_1]. By the Burkholder-Davis-Gundy inequality, Γ is indeed a strict contraction in S^p_{dN}[0, T_1], and thus (3.2) has a unique solution X ∈ S^p_{dN}[0, T_1]. We extend the solution to the interval [0, T_2] by defining the analogous mapping Γ on [0, T_2], and iterate.

Numerical results
The numerical simulations are based on the direct simulation of system (2.5) using the Euler-Maruyama scheme, in which we do not approximate the Heaviside function H. The final time is set to T = 15, discretised into 3 × 10^4 time steps. All presented results are averaged over M = 5000 realizations. The diffusion parameter of the Brownian motion is set to σ = 0.5, while the number of agents depends on the dimension of the search space. In particular, we set the number of agents to 3, 5 or 10 times the space dimension. The initial positions of the particles are drawn from a uniform distribution on a function-specific domain. The parameters α and β used to compute the global and personal best are set to α = 10 and β = 10.
A realization is successful if the averaged mean is close to the function minimum f_min. We compare the performance of the CBO scheme (with µ = 0), PB and wPB for the following benchmark problems:
(1) Alpine [19]: this non-convex differentiable function has a global minimum at x_min = (0, . . . , 0).
(2) Ackley [20]: this function is continuous, non-differentiable and non-convex and has its global minimum at x_min = (0, . . . , 0).
The choice of these functions is based on their different characteristics, see Figure 5 for plots in 2D. Table 3 shows the results for the Alpine function (4.1) and the Ackley function (4.2). We observe that the success rate increases with the number of particles, and decreases in higher space dimensions. Weighted personal best and personal best give comparable results, which is not surprising since wPB approximates PB for large values of β.
We conclude by investigating a 2D version of the toy problem considered in Section 2.3. This function has a global minimum at x_min = (−1.00125, 0) and a local minimum at (0.998748, 0). We wish to explore the dynamics of this 2D version for different numbers of particles. In doing so we consider 4, 8 or 16 particles and initialize the particle schemes with 2, 4 and 8 particles placed in each of the two wells with a random perturbation. Furthermore we set α = 10 and β = 20. Table 5 shows the results as the number of particles increases. We observe that CBO and (w)PB perform equally well for large numbers of particles, and that (w)PB outperforms CBO for few particles. This could be explained by the fact that the probability of all particles deviating from the global minimum decreases as their number increases.
Table 5. Success rates of the consensus based optimization (CBO), personal best (PB) and weighted personal best (wPB) schemes for the 2D toy problem with different numbers of particles.
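For reference, the two named benchmarks can be written down directly. We assume the common "Alpine 1" and Ackley definitions with their default constants (the paper's exact formulas (4.1)-(4.2) are not reproduced in this extract); both attain their minimum value 0 at the origin:

```python
import numpy as np

def alpine(x):
    # Alpine 1 (assumed variant): non-convex, many local minima,
    # global minimum f = 0 at x = (0, ..., 0).
    x = np.asarray(x, dtype=float)
    return float(np.sum(np.abs(x * np.sin(x) + 0.1 * x)))

def ackley(x):
    # Ackley (standard constants a=20, b=0.2, c=2*pi): continuous,
    # non-convex, global minimum f = 0 at x = (0, ..., 0).
    x = np.asarray(x, dtype=float)
    d = x.size
    return float(-20.0 * np.exp(-0.2 * np.sqrt(np.sum(x**2) / d))
                 - np.exp(np.sum(np.cos(2.0 * np.pi * x)) / d) + 20.0 + np.e)
```

Both functions are non-negative, so they fit assumption (A3) without shifting.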

Conclusion
In this paper we introduced a consensus based global optimization scheme which includes the personal best information of each particle. The proposed generalization is motivated by the original works on particle swarm algorithms, in which particles adjust their position via a linear combination of moving towards the current global best and their personal best value. We discussed how information about the personal best can be included in consensus based optimization schemes, leading to a system of functional stochastic differential equations. A well-posedness result for the respective regularized non-Markovian SDEs was presented. New features of the algorithm with personal best were illustrated and compared in computational experiments. The numerical results indicate that information about the personal best leads to higher success rates in the case of few particles, and that the corresponding weighted means are better approximations of the global function minima.
Proof. [Theorem 3.1] In the following we denote, in abuse of notation, by v_f[X] the vector (v_f, . . . , v_f) ∈ R^{dN}. We rewrite (2.5) in vector form with M[X(t)] = diag(X(t) − v_f[X(t)]) ∈ R^{dN×dN} being the block-diagonal matrix with blocks d_k = diag(X^k(t) − v_f) ∈ R^{d×d} for k = 1, . . . , N, and B_t a dN-dimensional Brownian motion. The argument follows the lines of the well-posedness proof in [7]. Let n ∈ N be arbitrary. We have to check that there exists a constant C_n such that the corresponding local Lipschitz estimate holds for every |X(t)| ≤ n. Note that f(X(t)) is bounded for |X(t)| ≤ n due to its local Lipschitz continuity. Hence, the estimate for the first term on the right-hand side is identical to the one in [7]. Similarly, we decompose the difference involving the weights e^{-αf(φ^i(t))}, i = 1, . . . , N, into three terms J_1 + J_2 + J_3 and estimate each separately. Combining these bounds and taking squares leads to the desired estimate.