Progress analysis of a multi-recombinative evolution strategy on the highly multimodal Rastrigin function

A first and second order progress rate analysis is conducted for the intermediate multi-recombinative Evolution Strategy (μ/μI, λ)-ES with isotropic scale-invariant mutations on the highly multimodal Rastrigin test function. Closed-form analytic solutions for the progress rates are obtained in the limit of large dimensionality and large populations. The first order results model the one-generation progress including local attraction phenomena. Furthermore, a second order progress rate is derived, yielding additional correction terms and further improving the progress model. The obtained results are compared to simulations and show good agreement, even for moderately large populations and dimensionality. The progress rates are applied within a dynamical systems approach, which models the evolution using difference equations. The obtained dynamics are compared to real averaged optimization runs and yield good agreement; the agreement improves further when dimensionality and population size are increased. Local and global convergence are investigated within the given model, showing that large mutations are needed to maximize the probability of global convergence, which comes at the expense of efficiency. An outlook regarding future research goals is provided.


Introduction
The theoretical analysis of the performance of Evolution Strategies (ES) [8] optimizing functions f(y) in real-valued N-dimensional search spaces y ∈ R^N is a challenge. This is due to the probabilistic nature of these algorithms, which up to now has allowed the dynamic progress analysis only on simple test functions such as the sphere model [2,5], the ridge function class [3,14], and the ellipsoid model [7]. These test functions are simple w.r.t. their optimization landscape (also referred to as fitness landscape) in that they have at most one optimizer (i.e., the location ŷ of the optimum). Analyzing the dynamical behavior of ES on more complex and multimodal test functions appears to be even more demanding. However, ES and other evolutionary algorithms are especially designed to optimize such problems. There is empirical evidence that ES are able to globally optimize highly multimodal optimization problems [11] with a number of local optima that grows exponentially in N. The question arises how and when these ES are able to locate the global optimizer. It is the long-term goal to find conditions the ES must fulfill to not get trapped in the vast number of local optimizers. Ideally, a theoretical analysis should provide answers regarding the success probability P_S (of locating the global optimum) depending on the ES parameters, such as the population size λ, and the test function to be optimized. Furthermore, one is interested in the computational complexity of the optimization process.
One approach successfully applied to the analysis of the ES performance on the simple unimodal test functions mentioned above is the dynamical systems approach [5], which is based on progress rate analysis. The progress rate is a measure of expected positional change in search space between two generations depending on location, strategy, and test function parameters. The idea of investigating global search behavior from expected local progress was successfully applied, among others, in [3,7]. It will be shown in this paper that this approach can be extended to the highly multimodal Rastrigin test function

    f(y) := ∑_{i=1}^{N} f_i(y_i),    (1)

where y ∈ R^N, with oscillation amplitude A and frequency parameter α. The i-th fitness component in Eq. (1) is defined as

    f_i(y_i) := y_i² + A(1 − cos(α y_i)).    (2)

Depending on A and α a finite number M of local minima can be observed for each component i. Therefore, the overall number of local minima scales as M^N, posing a highly multimodal minimization problem with the global optimizer located at ŷ = 0. An exemplary optimization landscape of the Rastrigin function is shown in Fig. 1. The remarkable observation is that ES, unlike classical nonlinear optimization algorithms (e.g., BFGS), do not follow the local gradient or Hessian into one of the M^N − 1 local optimizers. That is, ES perform a rather global search.
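As a concrete reference for the test function, a minimal NumPy sketch of Eqs. (1) and (2) is given below (A = 10 and α = 2π are merely illustrative example values, not prescribed by the text):

```python
import numpy as np

def rastrigin(y, A=10.0, alpha=2.0 * np.pi):
    """Component-wise Rastrigin fitness f(y) = sum_i [y_i^2 + A(1 - cos(alpha*y_i))]."""
    y = np.asarray(y, dtype=float)
    return float(np.sum(y**2 + A * (1.0 - np.cos(alpha * y))))
```

The global optimizer is ŷ = 0 with f(ŷ) = 0, while every component contributes its own local oscillation.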
A deeper understanding of this behavior is still missing. Recently, attempts have been made to analyze the problem from the viewpoint of relaxation using kernel smoothing [15]. However, the sampling process needed to transform the original problem into a convex optimization problem is still lacking a link to the ES.
In this paper a simplified and scale-invariant (μ/μ_I, λ)-ES, see Algorithm 1, is analyzed with the step-size control defined in Eq. (4). Starting from the so-called parental centroid vector y^{(g)}, a population of λ offspring is generated by adding isotropic Gaussian mutations x ∼ σN(0, 1) with mutation strength σ in Lines 6 and 7. Thereafter, the fitness is evaluated in Line 8. Selection of the μ best individuals is done in Line 10. It is performed for a given selection (truncation) ratio defined as

    ϑ := μ/λ,    (3)

with ϑ ∈ (0, 1). It will be an essential quantity for the progress rate results in the limit of large population sizes. Using intermediate recombination with equal weights, the best m = 1, ..., μ individuals are recombined in Line 11 and the new parental centroid y^{(g+1)} is obtained. In the following, the subscript "m; λ" can be read as the m-th best solution out of λ candidate solutions. In Line 12 the simplified step-size adaptation is performed. To this end, a constant normalized mutation strength σ* using the spherical normalization with ‖y^{(g)}‖ = R^{(g)} is defined as

    σ* := σ^{(g)} N / R^{(g)},  i.e.,  σ^{(g)} = σ* R^{(g)} / N.    (4)

This property ensures scale invariance and therefore global convergence of the algorithm, as the mutation strength σ^{(g)} decreases if and only if the residual distance R^{(g)} decreases. The quantity σ* is unknown during black-box optimizations, but it is very useful for theoretical investigations to obtain scale-invariant mutation strengths.
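Algorithm 1 can be condensed into a short NumPy sketch (illustrative only; A = 10 and α = 2π are assumed example values, and the line numbering of Algorithm 1 is not reproduced):

```python
import numpy as np

def es_run(y0, sigma_star, mu, lam, A=10.0, alpha=2.0 * np.pi, gens=200, seed=0):
    """One run of the simplified scale-invariant (mu/mu_I, lambda)-ES."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y0, dtype=float).copy()
    N = y.size
    R_hist = [float(np.linalg.norm(y))]
    for _ in range(gens):
        R = np.linalg.norm(y)
        sigma = sigma_star * R / N                       # scale-invariant step size, Eq. (4)
        X = sigma * rng.standard_normal((lam, N))        # isotropic Gaussian mutations
        Y = y + X                                        # lambda offspring
        f = np.sum(Y**2 + A * (1.0 - np.cos(alpha * Y)), axis=1)  # Rastrigin fitness
        sel = np.argsort(f)[:mu]                         # truncation selection of mu best
        y = Y[sel].mean(axis=0)                          # intermediate recombination
        R_hist.append(float(np.linalg.norm(y)))
    return y, R_hist
```

With σ* held constant, σ shrinks proportionally to R, which is exactly the scale invariance referred to above.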

[Algorithm 1, excerpt of Lines 6-7: generate mutation x_l and offspring ỹ_l ← y^{(g)} + x_l.]

The remainder of this paper is organized as follows. In the next section the local performance measures will be introduced, being the basis for both the progress rate analysis and the dynamical systems approach. Section 3 is devoted to the determination and evaluation of the first order progress rate. Section 4 describes the derivation of the second order progress rate, which relies on the first order progress rate results. Section 5 uses the local performance measures to establish the evolution equations that govern the dynamical behavior of the ES. Experiments will be presented to show the usefulness of the approach. In the final Section 6 conclusions will be drawn and, based on open problems, further research directions will be outlined.

Local performance measures and quality gain distribution
The performance of an ES between two generations can be evaluated in both fitness and search space. The quality gain Q_y(x) of fitness f at a position y^{(g)} due to an isotropic mutation x ∼ σN(0, 1) is defined as

    Q_y(x) := f(y^{(g)} + x) − f(y^{(g)}),    (5)

and yields in the case of a fitness improvement (minimization considered) a negative value Q_y < 0. The definition (5) measures the fitness change before selection and will be needed for the evaluation of the two progress rates (7) and (8). The quality gain components are decomposed using f_i from Eq. (2) as

    Q_y(x) = ∑_{i=1}^{N} Q_i(x_i),  with  Q_i(x_i) := f_i(y_i + x_i) − f_i(y_i).    (6)

That is, the quality gain corresponds to the difference between the fitness values before and after the mutation application. A probabilistic model for the distribution of quality values will be presented below. It will be important for the subsequent progress rate derivations, as selection is based on fitness values.
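A direct sketch of definitions (5) and (6), again with illustrative parameter values:

```python
import numpy as np

def quality_gain(y, x, A=10.0, alpha=2.0 * np.pi):
    """Q_y(x) = f(y + x) - f(y) for the component-wise Rastrigin fitness."""
    y = np.asarray(y, dtype=float)
    x = np.asarray(x, dtype=float)
    fi = lambda v: v**2 + A * (1.0 - np.cos(alpha * v))   # component fitness, Eq. (2)
    return float(np.sum(fi(y + x) - fi(y)))               # sum of component gains, Eq. (6)
```

At the global optimizer y = 0 every nonzero mutation worsens the fitness, so Q_y(x) > 0 there; an improving mutation yields Q_y < 0.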
Analyzing the progress towards the optimizer in search space, the first order progress rate on the Rastrigin function has already been investigated in [17] as a first approach. In this paper, a new approach is presented which significantly improves the prediction quality.
The first order progress rate between two generations for the parental component y_i is defined as

    φ_i := E[y_i^{(g)} − y_i^{(g+1)} | y^{(g)}, σ^{(g)}],    (7)

given parental position y^{(g)} and mutation strength σ^{(g)} at generation g. It is a measure of the expected positional difference in search space. Positive expected progress φ_i > 0 is obtained in the case y_i^{(g)} > 0 with E[y_i^{(g+1)}] < y_i^{(g)}. In this case the distance to the optimizer ŷ_i = 0 is reduced in expectation. This assumption is only valid as long as the sign of E[y_i^{(g+1)}] does not change, i.e., for small mutations compared to the residual distance. Therefore φ_i has limited applicability when studying the convergence behavior in the vicinity of the optimizer. As has been shown in [7] regarding the performance analysis on the ellipsoid model, a second order progress rate is needed. It is defined as

    φ_i^{II} := E[(y_i^{(g)})² − (y_i^{(g+1)})² | y^{(g)}, σ^{(g)}].    (8)

Squaring the positions yields φ_i^{II} > 0 independent of the sign, if the distance to ŷ_i = 0 decreases in expectation. Additionally, the derivation will yield expressions containing a progress gain and a loss part, which is necessary for a more accurate model of convergence. Both progress rates will be expressed using integral equations for the expected values, and approximations will be necessary to find closed-form solutions. In a second step the progress rates can be applied within difference equations to model the expected dynamics over many generations in order to investigate the global convergence behavior. The selection of individuals is based on the attained fitness values. The quality gain measures the fitness change before selection according to (5). When the progress rate of an ES is modeled, the cumulative distribution function (CDF) P_Q(q) of the quality gain and its probability density function (PDF) p_Q(q) are needed as functions of y and σ. Obtaining an exact expression appears infeasible; instead, for large N an application of the Central Limit Theorem seems appropriate to show that the distribution is asymptotically normal.¹
However, proving its validity rigorously seems hard or even impossible for arbitrary y. Therefore, we resort to normality as an approximation for the quality gain distribution. This is backed up by the experimental results in Fig. 2, where sampled Q_y(x)-values are compared to the normal approximation. A standard Anderson-Darling test was performed to check whether the sampled data was drawn from a normal distribution with known mean and variance according to (9). The hypothesis test fails to reject the normality assumption at p-values p = 0.48 (left) and p = 0.53 (right), where rejection is usually defined for p < 0.05. Even at the relatively small N = 10 the results agree well. Good experimental agreement is also observed under variation of the location y and mutation strength σ (not shown). Therefore, the normality assumption does not pose a strong restriction on the overall prediction quality of the progress rates in the subsequent sections, such that we approximate Q as normally distributed. Furthermore, abbreviations for the quality gain moments are introduced: the expected value E_Q := E[Q] with component expectation E_{Q_i} := E[Q_i] in (10), and the variance D_Q² := Var[Q] in (11). At this point an additional assumption for the coordinates y = (y_1, ..., y_N) has to be made to justify the subsequent variance approximations (13) and (14). Given the search vector y = (y_1, ..., y_N) and residual distance R² = ‖y‖², it is assumed that the components contribute approximately equally (in expectation) to the residual distance, i.e., there is no dominating component, such that

    y_i² ≃ R²/N, for all i = 1, ..., N.    (12)
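The normality check can be reproduced in a small sketch; instead of the Anderson-Darling statistic used in Fig. 2, a Kolmogorov-Smirnov distance between standardized Q-samples and the standard normal CDF is computed here (a substitute statistic, chosen to keep the sketch dependency-free; parameter values are illustrative):

```python
import numpy as np
from math import erf, sqrt

def ks_distance_to_normal(y, sigma, A=10.0, alpha=2.0 * np.pi, trials=20000, seed=0):
    """Kolmogorov-Smirnov distance between standardized samples of Q_y(x)
    and the standard normal CDF."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y, dtype=float)
    fi = lambda v: v**2 + A * (1.0 - np.cos(alpha * v))
    X = sigma * rng.standard_normal((trials, y.size))      # isotropic mutations
    Q = np.sum(fi(y + X) - fi(y), axis=1)                  # sampled quality gains
    Z = np.sort((Q - Q.mean()) / Q.std())                  # standardize and sort
    cdf = 0.5 * (1.0 + np.vectorize(erf)(Z / sqrt(2.0)))   # standard normal CDF
    ecdf_hi = np.arange(1, trials + 1) / trials
    ecdf_lo = np.arange(0, trials) / trials
    return float(max(np.max(ecdf_hi - cdf), np.max(cdf - ecdf_lo)))
```

For moderately large N the distance is small, consistent with the CLT-based normality approximation.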
Property (12) will also be referred to as component equipartition. The concept was introduced in [6] and proven for the noisy ellipsoid in [12]. Its applicability to the Rastrigin function was shown in [19]. The equipartition assumption is necessary in order to justify certain approximation steps and to provide a closed-form solution for the progress rate. Furthermore, it will be a reasonable assumption to obtain a model of the algorithm's progress and dynamics in expectation. This assumption also justifies a linear scaling of the variance with dimensionality N provided that the components are contributing equally to the overall variance, such that

    Var[Q] = ∑_{i=1}^{N} Var[Q_i] ≃ N Var[Q_i].    (13)

Additionally, for large N an important approximation will be used for the variance to significantly simplify the obtained lengthy results. If no single i-th component is dominating the sum, i.e., Var[Q_i] / ∑_{j≠i} Var[Q_j] → 0 (for any i in the limit N → ∞), the contribution of a single term is negligible for N → ∞. Therefore, the two sums over N and N − 1 terms, respectively, are asymptotically equal with

    D_i² ≃ D_Q².    (14)

Note that the quantity D_i² is formally introduced in (20). Returning to Eq. (9), the expression is rewritten using a standardized random variate Z as

    Q = E[Q] + D_Q Z.    (15)

Approximation 1 (Quality gain distribution). The local quality gain at position y due to random mutation vector x ∼ N(0, σ² 1) is approximately normally distributed. Therefore, P_Q(q) and p_Q(q) can be approximated as

    P_Q(q) ≈ Φ((q − E[Q]) / D_Q),  p_Q(q) ≈ (1/D_Q) φ((q − E[Q]) / D_Q).    (16)

Within the normal approximation (16) the inverse P_Q^{−1}(p) given some probability p can be easily obtained by using the quantile function Φ^{−1}(p) of the normal distribution. This relation will be used later to obtain a quality gain for some given probability p using

    q = P_Q^{−1}(p) ≈ E[Q] + D_Q Φ^{−1}(p).    (18)

For the derivation of the i-th component progress rate the conditional distribution function P_Q(q|x_i) of the quality gain is needed for a given component x_i. In this case the expected value and variance are given by

    E[Q|x_i] = Q_i(x_i) + ∑_{j≠i} E[Q_j],    (19)
    D_i² := Var[Q|x_i] = ∑_{j≠i} Var[Q_j],    (20)

where the sum over j ≠ i is taken for fixed i over the remaining N − 1 components. Therefore, a normal approximation for the conditional CDF is introduced using (19) and (20).

¹ For independently distributed quality gain components Q_i(x_i) with finite mean and variance the Central Limit Theorem holds [10], if for some δ > 0 the Lyapunov condition

    lim_{N→∞} (1/D_Q^{2+δ}) ∑_{i=1}^{N} E[|Q_i − E[Q_i]|^{2+δ}] = 0

holds. The validation of the condition could be approached using Eqs. (24), (25), and (26) for the respective quantities by evaluating higher order moments.
Approximation 2 (Quality gain distribution given x_i). The quality gain distribution at position y given fixed mutation component x_i and random mutation vector (x)_{j≠i} ∼ (N(0, σ² 1))_{j≠i} is approximately normally distributed. Therefore, P_Q(q|x_i) and p_Q(q|x_i) can be approximated as

    P_Q(q|x_i) ≈ Φ((q − E[Q|x_i]) / D_i),  p_Q(q|x_i) ≈ (1/D_i) φ((q − E[Q|x_i]) / D_i).    (21)

Having derived approximations of the quality gain distribution functions, the quantities E[Q_i] and Var[Q_i] remain to be determined. As the components are independent, it is sufficient to consider a single component and then perform the summation. Starting from definition (6), one can evaluate the quality gain of a single component Q_i(x_i). After applying the trigonometric identity cos(α(y_i + x_i)) = cos(α y_i) cos(α x_i) − sin(α y_i) sin(α x_i), expectations of the form E[x_i^k cos(α x_i)] and E[x_i^k sin(α x_i)] need to be evaluated. The results will be expressed as expected values containing trigonometric functions. As a remark, terms containing moments of x_i ∼ N(0, σ²), i.e., E[x_i^k] with k ≥ 1, are silently evaluated as they are assumed to be widely known.
where the odd moments E[x_i^k] = 0, which also yields E[sin(α x_i)] = 0 by symmetry. Expectations of the form E[x_i^k cos(α x_i)] and E[x_i^k sin(α x_i)] for k ≥ 0 can be obtained by using the definition of the characteristic function χ of a random variate x ∼ N(μ, σ²) and its known result [1]

    χ(α) = E[e^{ıαx}] = e^{ıαμ − α²σ²/2},    (27)

with the imaginary unit denoted by ı = √−1 in (27) and (28). Now the k-th derivatives with respect to α can be applied to both sides of

    E[cos(αx)] + ı E[sin(αx)] != e^{−α²σ²/2} [cos(αμ) + ı sin(αμ)],    (28)

such that corresponding real and imaginary parts can be identified by comparing both sides (denoted by !=) of Eq. (28).
Given μ = 0, for k ∈ {0, 1, 2} the required expectations of trigonometric terms can be derived. Additionally, the trigonometric identities cos²(x) = 1/2 + cos(2x)/2 and sin²(x) = 1/2 − cos(2x)/2 are used. The results are

    E[cos(α x_i)] = e^{−(ασ)²/2},            E[sin(α x_i)] = 0,
    E[x_i cos(α x_i)] = 0,                   E[x_i sin(α x_i)] = ασ² e^{−(ασ)²/2},    (29)
    E[x_i² cos(α x_i)] = (σ² − α²σ⁴) e^{−(ασ)²/2},  E[x_i² sin(α x_i)] = 0.

Inserting relations (29) into (25) and (26), summing over all N components and collecting the resulting terms, one obtains the expected value

    E[Q] = Nσ² + A (1 − e^{−(ασ)²/2}) ∑_{i=1}^{N} cos(α y_i).    (30)

Analogously, the variance of the Rastrigin quality gain yields

    D_Q² = ∑_{i=1}^{N} [ 2σ⁴ + 4σ² y_i² + (A²/2)(1 + e^{−2(ασ)²} cos(2α y_i)) − A² e^{−(ασ)²} cos²(α y_i)
            + 2Aασ² e^{−(ασ)²/2} (ασ² cos(α y_i) + 2 y_i sin(α y_i)) ].    (31)

The quantities E[Q|x_i] from (19) and D_i² from (20) are given analogously by summing over N − 1 components. Expressions E_Q and D_Q could be inserted into (16), and E[Q|x_i] with Q_i(x_i) and D_i into (21). However, this is omitted at this point for better readability.
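The trigonometric expectations of (29) are easy to validate by Monte Carlo; the closed forms below follow from the characteristic function argument above:

```python
import numpy as np

def e_cos(alpha, sigma):
    """E[cos(alpha*x)] for x ~ N(0, sigma^2), from the characteristic function."""
    return float(np.exp(-0.5 * (alpha * sigma) ** 2))

def e_x_sin(alpha, sigma):
    """E[x * sin(alpha*x)] for x ~ N(0, sigma^2)."""
    return float(alpha * sigma**2 * np.exp(-0.5 * (alpha * sigma) ** 2))

def e_x2_cos(alpha, sigma):
    """E[x^2 * cos(alpha*x)] for x ~ N(0, sigma^2)."""
    return float((sigma**2 - alpha**2 * sigma**4) * np.exp(-0.5 * (alpha * sigma) ** 2))
```

Sampling x ∼ N(0, σ²) and averaging the corresponding trigonometric terms reproduces these values up to Monte Carlo error.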
As an important remark, expression (23) can be linearized w.r.t. the mutation x_i to obtain analytically solvable progress rate integrals, see also the discussion after Eq. (51). Taylor-expanding f_i around y_i for small x_i gives f_i(y_i + x_i) ≈ f_i(y_i) + f_i'(y_i) x_i, such that

    Q_i(x_i) ≈ f_i'(y_i) x_i,    (32)

and evaluating the derivative one has

    f_i'(y_i) = k_i + d_i,  with  k_i := 2 y_i  and  d_i := α A sin(α y_i),    (33)

applied to (32). Component k_i is the derivative of the quadratic term y_i², cf. Eq. (2), which follows the global quadratic structure of the function. Conversely, derivative d_i follows the local oscillation, such that it will be very important for the model of local attraction during the progress rate derivations in Secs. 3 and 4.
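The decomposition f_i' = k_i + d_i of Eq. (33) can be checked against a numerical derivative (parameter values again illustrative):

```python
import numpy as np

def f_i(y, A=10.0, alpha=2.0 * np.pi):
    """Component fitness f_i(y) = y^2 + A(1 - cos(alpha*y)), Eq. (2)."""
    return y**2 + A * (1.0 - np.cos(alpha * y))

def f_i_prime(y, A=10.0, alpha=2.0 * np.pi):
    """Derivative f_i'(y) = k_i + d_i, Eq. (33)."""
    k = 2.0 * y                          # slope of the global quadratic term
    d = alpha * A * np.sin(alpha * y)    # slope of the local oscillation
    return k + d
```

A central finite difference reproduces the analytic derivative to high accuracy.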

First order progress rate
While the first order progress rate (7) does not suffice to completely describe the convergence behavior of the ES on Rastrigin, see Sec. 5, it is a necessary step in the calculation of the second order progress rate in Sec. 4. Given definition (7) and the parental location y^{(g)}, one has to find the expected value of the i-th component location E[y_i^{(g+1)}]. The positional update y^{(g)} → y^{(g+1)} performed by the ES is realized by consecutively applying mutation, selection, and recombination (see Algorithm 1), such that one can write

    y^{(g+1)} = y^{(g)} + (1/μ) ∑_{m=1}^{μ} x_{m;λ},    (34)

where x_{m;λ} denotes the mutation vector of the m-th best offspring after selection. Considering the i-th component of Eq. (34), abbreviating the mutation component as x_{m;λ} := (x_{m;λ})_i, and taking the expected value thereof yields

    E[y_i^{(g+1)}] = y_i^{(g)} + (1/μ) ∑_{m=1}^{μ} E[x_{m;λ}].    (35)

The progress rate can therefore be evaluated by inserting (35) into (7), giving

    φ_i = −(1/μ) ∑_{m=1}^{μ} E[x_{m;λ}].    (36)

Before starting the derivation of (36), the important large population theorem is stated, which will be used during the derivation of both the first and second order progress rates. Its application also yields the so-called asymptotic generalized progress coefficients presented in Eq. (45).

Theorem 1. Let λ > μ + 1 and μ > a with a ≥ 1 and ϑ = μ/λ with 0 < ϑ < 1, such that t^{λ−μ−1}(1 − t)^{μ−a} exhibits its maximum on (0, 1) and vanishes at t ∈ {0, 1}. Let f_x(t) be a function defined for constant x ∈ R, such that f_x : [0, 1] → [0, 1] with bounded derivatives on [0, 1], and let B denote the beta function. Furthermore, let p_x denote the PDF of a normally distributed variate and let p_n(x) denote a polynomial of degree n in x. For infinitely large μ, λ → ∞ and constant ϑ = μ/λ the limit (37) holds.

Proof. The dominated convergence theorem is applied. First, the sequence g_μ(x) is defined for μ = 1, 2, ..., with λ(μ) = μ/ϑ and constant ϑ. Note that g_μ is integrated over the density of the normal distribution. In [18] it was shown that g_μ(x) converges for any x according to the limit (39).
An upper bound of g_μ can be estimated using 0 ≤ f_x ≤ 1 and the definition of the beta function (40). A lower bound for the denominator of (40) is given in (41). Inequality (41) can be shown easily by setting μ = a + k with integers a ≥ 1 and k ≥ 1 (ensuring μ > a). This yields an inequality which is fulfilled for any a ≥ 1 and k ≥ 1. Using (41) in (40) one obtains a constant upper bound of g_μ(x). It therefore remains to show that the dominating bound is integrable over x, which holds due to the normal density p_x(x). Hence, the limit in Eq. (37) can be exchanged with the integral over x. Using the limit of (39), the desired result is obtained.
The limit (39) is readily used in [16] to define the so-called asymptotic generalized progress coefficients e_ϑ^{a,b} for integers a ≥ 1, b ≥ 0, and truncation ratio 0 < ϑ < 1, see Eq. (45). These are characteristic coefficients describing the progress in the limit μ, λ → ∞ with constant ϑ = μ/λ, and are related to the generalized progress coefficients [5, Eq. (5.112)]. They will reappear during the derivation of both φ_i and φ_i^{II}. The derivation of φ_i is presented now.
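As a numerical illustration of such coefficients, the sketch below compares a Monte Carlo estimate of c_{μ/μ,λ} (the expected average of the μ largest out of λ standard normal samples) with the truncation selection intensity φ(Φ^{−1}(1 − ϑ))/ϑ; identifying the latter with c_ϑ = e_ϑ^{1,0} is an assumption of this sketch, not a formula quoted from [16]:

```python
import numpy as np
from math import erf, sqrt, pi, exp

def c_theta(theta):
    """Truncation selection intensity phi(Phi^{-1}(1 - theta)) / theta
    (assumed asymptotic form of the lowest-order progress coefficient)."""
    lo, hi = -10.0, 10.0
    target = 1.0 - theta
    for _ in range(200):                              # bisection for Phi^{-1}(1 - theta)
        mid = 0.5 * (lo + hi)
        if 0.5 * (1.0 + erf(mid / sqrt(2.0))) < target:
            lo = mid
        else:
            hi = mid
    t = 0.5 * (lo + hi)
    return exp(-0.5 * t * t) / (sqrt(2.0 * pi) * theta)

def c_mu_lambda_mc(mu, lam, trials=2000, seed=0):
    """Monte Carlo estimate of c_{mu/mu,lambda}: expected mean of the mu
    largest out of lam standard normal samples."""
    rng = np.random.default_rng(seed)
    Z = rng.standard_normal((trials, lam))
    Z.sort(axis=1)
    return float(Z[:, lam - mu:].mean())
```

For μ, λ → ∞ at constant ϑ = μ/λ the Monte Carlo estimate approaches the asymptotic value, in line with the large population limit above.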
Proposition 1. Let μ, λ ∈ N with μ ≥ 1 and μ < λ, and let p_x denote the PDF of the random mutation x ∼ N(0, σ²). Let x_{m;λ} denote the m-th best value (out of λ) of the i-th mutation component (x_{m;λ})_i. Furthermore, let P_Q and P_Q^{−1} denote the quality gain CDF and its inverse, respectively, with B denoting the beta function. Then, the first order component-wise progress rate is given by Eq. (46).
Proof. From now on the conditional dependency on y^{(g)} and σ^{(g)} will be implicitly assumed as given for better readability of the equations. The expected value of the i-th mutation component x_{m;λ} after selection can be expressed as an integral over the order statistic density p_{m;λ}(x_i) of the m-th best individual, such that (36) is rewritten as (47). The subsequent task will be to derive the density p_{m;λ} as a function of the mutation and quality gain distributions. Mutations are distributed normally with zero mean and variance σ² according to the normal density (48). Given mutation x_i (and implicitly position y), a random quality gain value Q is distributed according to a conditional probability density p_Q(q|x_i). Given that the m-th best individual attains a quality gain within [q, q + dq], there must be m − 1 better individuals having a smaller quality value and λ − m worse individuals having a larger one. Accounting for all relevant combinations, the factors 1/(m − 1)! and 1/(λ − m)! exclude the irrelevant combinations among the two groups of better and worse individuals, respectively. The conditional density for the m-th individual as a function of the quality gain q yields (49). By integrating (49) over all attainable quality gain values q ∈ [q_l, q_u], one arrives at the density (50). Inserting the order statistic density from (50) into the progress rate (47), one obtains the intermediate result (51).

A few important remarks can be made regarding Eq. (51). A closed-form analytic solution cannot be obtained without applying further approximations. It can be approached in an analogous way to the φ_i-derivation for the ellipsoid in [13] to obtain a solution in terms of the well-known progress coefficient c_{μ/μ,λ} [5, p. 216]. However, a closed-form solution with this approach requires a linear relation of Q_i w.r.t. x_i, see relation (32). The effect of a linearized quality gain on the progress rate of the Rastrigin function was already studied in [17], where it was shown that the progress due to local attraction is not modeled correctly, as the oscillation terms have to be either dropped or linearized for small x_i.
Therefore a different approach is followed here assuming the infinite population limit, an approach which was applied within the analysis of functions with noise-induced multi-modality [9]. The approach will yield correction terms including the effects of the trigonometric terms from (24), in contrast to only taking the linearized terms from (32). Starting from Eq. (51) and moving the sum including the m-dependent prefactors into the innermost integral yields (52). Now a transformation can be applied for the sum ∑_m (•), yielding an expression as a function of the regularized incomplete beta function [5, p. 147], see (53). Furthermore, one can rewrite the resulting population-dependent factor as in (54), where we have used the property of the gamma function Γ(n) = (n − 1)! (for any integer n > 0) and the known relation between the gamma and beta functions Γ(x)Γ(y)/Γ(x + y) = B(x, y). These replacements will be useful later. After replacing the sum and refactoring, we arrive at the progress rate integral (55). Now the integration order of t and q is exchanged. In Eq. (55) one has the bounds (56). Defining the inverse transformation q = P_Q^{−1}(1 − t) and integrating over t first, one obtains the new ranges (57). The progress rate yields (58). Now the innermost integral can be solved using p_Q(q|x_i) = dP_Q(q|x_i)/dq, where the probability P_Q(q_l|x_i) = Pr(Q ≤ q_l|x_i) = 0 for any lower bound value q_l, giving (59). Inserting (59) into (58), we arrive at the progress rate integral (46).
Unfortunately a closed-form solution of (46) after inserting Approximation 1 and Approximation 2 for the quality gain CDF is not possible due to the underlying structure of the integrand. Hence, asymptotic approximations will be introduced assuming large populations and large dimensionality to successively simplify the integral in a way that closed-form solutions can be provided. First, the large population theorem will be applied and then the quality gain CDF is inserted. Thereafter, the normal CDF is Taylor-expanded, with the first two terms yielding analytically solvable results and higher order terms vanishing as O(1/N). The results are further simplified in the end assuming component equipartition (12), which finally gives the progress rate result in (96).

Theorem 2. Let p_x denote the PDF of the random mutation x ∼ N(0, σ²). Let P_Q denote the quality gain CDF with its quantile function given by P_Q^{−1}. For a truncation ratio ϑ = μ/λ with 0 < ϑ < 1, the component-wise progress rate for large populations yields the limit (60) for μ, λ → ∞ at constant ϑ.
Proof. Starting from Eq. (46) and applying the infinite population size limit, the result of Theorem 1 can be applied with (61), which yields the result (60).
The next step requires the use of Approximation 1 and Approximation 2 for the quality gain distributions in Eq. (60).
To this end, one uses the conditional normal distribution function Φ((q − E[Q|x_i])/D_i), see (21), and the inverse transformation (18). One obtains (62). Given the normal approximation (62), an expression for E[Q|x_i] is needed. Using definition (19) with the Q_i-result (24), the (conditional) expected value is written as (63), where the definitions (64) are introduced as abbreviations. Given Eq. (63), the quantity δ_i(x_i) includes all non-linear terms in x_i. This will be important when the normal CDF is expanded and analytically solved. Inserting relation (63) into (62) and the result into (60) yields (65). A closed-form solution of (65) cannot be obtained with Φ(δ_i(x_i)) containing non-linear terms in x_i. However, a solution in terms of a Taylor expansion can be provided by introducing the decomposition Φ(g(x_i) + h(x_i)), with g(x_i) being a linear function and h(x_i) being a small non-linear perturbation, according to (66) and (67). In (66), the abbreviation E_{Q_i} from (10) is used to denote the expected value of the i-th summand of the quality gain (6). Using the functions g(x_i) and h(x_i), Eq. (65) becomes (68).

Approximation 3 (Truncated cumulative distribution function series). Under the assumption of a normally distributed quality gain, see Approximation 1 and Approximation 2, and a quality gain variance scaling with N according to Eq. (13), the CDF of the normal distribution is expanded at g(x_i) in the limit N → ∞ as (69).

Relation (69) is derived now. Starting from (68), the Taylor expansion of Φ(•) up to first order with the remainder denoted by r yields (70). Note that all derivatives of the normal distribution exist, as d^n φ(x)/dx^n = (−1)^n He_n(x) φ(x) with He_n(x) denoting the n-th order probabilist's Hermite polynomials. In the following, the scaling properties of the remainder as a function of N are investigated. It will be shown that r = O(1/N). To this end, (70) is rewritten as (71). For the further analysis of r(N) the equipartition of components is assumed, as introduced in Eqs. (12), (13), and (14). Hence, the quantity D_i can be written as a function of N as in (72), where the prefactor s = s(N) depends on A, α, y, and σ. With these assumptions the functions g and h are written as in (73), dropping the subscript i for brevity. As h → 0 for N → ∞, the remainder (71) vanishes accordingly. Therefore, in order to show r(N) = O(1/N), the limit (74) is evaluated. To this end, the derivative of r from (71) w.r.t. N is evaluated in (75). The term (e^{−gh − h²/2} − 1) of (75) is expanded up to first order, discarding higher orders, which gives (76). The derivatives of g and h from Eq. (73) are given in (77). Inserting (73) and (77) into (76) yields (78) after refactoring. Taking the limit (74) of (78) therefore yields (79), such that the remainder r(N) can be given as (80), which concludes the derivation of (69).
Both integrals of (69) are analytically solvable.² The zeroth order term yields a closed-form solution due to g(x_i) being linear w.r.t. x_i and gives progress contributions due to the sphere function, i.e., the linear part of the quality gain (63). The first order term can be solved by applying quadratic completion to the Gaussian product p_x(x_i) φ(g(x_i)), yielding an expected value over a normal density. The expected value over h(x_i) can be regarded as a perturbation of the sphere containing the A and α dependencies. The determination of φ_i via (69) was done in [18] by evaluating both integrals. As the derivation and the final result for φ_i are very lengthy and therefore not practical for further analytic treatment, the obtained expression for φ_i was simplified as a last step assuming large dimensionality N. However, the same result as in [18] can be obtained in a quicker way by simplifying the integrands of (69) under the same assumptions before the integration, instead of simplifying the result afterwards. This will enable a more concise derivation of the final progress rate result.
First the functions g and h from (66) and (67), respectively, are simplified. For large N, the quality gain standard deviation fulfills D_i ≃ D_Q using (14). As E_{Q_i} is just the quality gain expectation of a single component, it can be neglected compared to D_Q scaling as √N using (13). Hence, one obtains the simplified forms (81) and (82). Another approximation is introduced regarding the density p_x(x_i) φ(g(x_i)) for the second term of (69). By completing the square one can derive a resulting normal density with mean m and variance ς² by demanding (83). Simple calculations yield m and ς² in (84). For large N the quantities m and ς² from (84) yield the asymptotic results (85), such that the density of the first order term yields (86). Using the results from Eqs. (81), (82), and (86), the progress rate integral (69) is further simplified. The prefactors of the resulting integral yield the asymptotic progress coefficient (45) in (87).

Approximation 4 (Progress rate integral for large dimensionality). Based on the result of Approximation 3, only the first two terms are considered. Furthermore, the integrands of (69) are approximated and simplified assuming large dimensionality using Eqs. (81), (82), (86), and (87). Hence, one obtains the simplified progress rate integral, with the integrals I_i^0 and I_i^1 given in (89). Calculating I_i^0 from (89) by inserting the mutation density p_x(x_i) from (48) and applying the substitution z = x_i/σ, one gets (90).
The following integral identity [5, Eq. (A.12)] can be applied:

    ∫_{−∞}^{∞} φ(t) Φ(a t + b) dt = Φ(b / √(1 + a²)).    (92)

Evaluating (92) with a = −k_i σ/D_Q and b = Φ^{−1}(ϑ) yields (93) for the right-hand side of (92). Again assuming (k_i σ)² ≪ D_Q², expression (93) simplifies and the result for (89) is obtained with (87) as (94). Now the second integral is evaluated, where h(x_i) is integrated over a density with zero mean. Therefore, all odd functions of x_i yield no contribution and only the term x_i sin(α x_i) needs to be evaluated. One gets (95). In the second line of (95) the expected value definition is used. From the second to the third line the expected value of x_i sin(α x_i) is evaluated using (29). In the last line the derivative d_i = αA sin(α y_i) from (33) is recovered. Using the results from (94) and (95), the first order progress rate approximation for large N and μ can finally be given.
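A Gaussian smoothing identity of this type is easy to verify numerically; the form below, ∫ φ(t) Φ(b + a t) dt = Φ(b/√(1 + a²)), is assumed here as the relevant variant of [5, Eq. (A.12)]:

```python
import numpy as np
from math import erf, sqrt

def Phi(z):
    """Standard normal CDF."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def smoothed_lhs(a, b, span=12.0, n=120001):
    """Trapezoidal approximation of the integral of phi(t) * Phi(b + a*t) over R."""
    t = np.linspace(-span, span, n)
    phi = np.exp(-0.5 * t**2) / np.sqrt(2.0 * np.pi)
    Phi_vals = 0.5 * (1.0 + np.vectorize(erf)((b + a * t) / np.sqrt(2.0)))
    integrand = phi * Phi_vals
    dt = t[1] - t[0]
    return float(dt * (integrand.sum() - 0.5 * (integrand[0] + integrand[-1])))

def smoothed_rhs(a, b):
    """Closed form Phi(b / sqrt(1 + a^2))."""
    return Phi(b / sqrt(1.0 + a * a))
```

The identity holds for any real slope a and intercept b, which is what makes the substitution with a = −k_i σ/D_Q and b = Φ^{−1}(ϑ) applicable.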

First order progress rate
The first order component-wise progress rate on the Rastrigin function in the asymptotic limits of infinitely large population size μ (constant ϑ = μ/λ) and infinitely large dimensionality N yields

    φ_i ≃ c_ϑ (σ²/D_Q) (k_i + e^{−(ασ)²/2} d_i).    (96)

The expressions for c_ϑ = e_ϑ^{1,0} from (45) and D_Q from (31) were not inserted to improve readability. Result (96) shows very interesting properties compared to [17, Eq. (26)], where a linearized quality gain approximation resulted in

    φ_i ≃ c_{μ/μ,λ} (σ²/D_i) (k_i + d_i).    (97)

First note that the progress coefficient was replaced by its asymptotic form c_{μ/μ,λ} → c_ϑ. The difference between the variance terms in the denominators of (96) and (97) is negligible for large N with D_i² ≃ D_Q², see also (14). However, the most notable difference lies between the derivative term f_i' = k_i + d_i, see definition (33), and the newly obtained term

    k_i + e^{−(ασ)²/2} d_i.    (98)

It contains an unchanged sphere-dependent term k_i and an exponentially decaying Rastrigin-specific term d_i. This characteristic form will be discussed in the subsequent part. The result (96) will be essential for the determination of the second order progress rate φ_i^{II}. At this point one-generation experiments can be performed and compared to the progress rate (96) to investigate its accuracy. To this end, a random position vector y is initialized isotropically with ‖y‖ = R given some residual distance R.
Then, repeated simulations are performed and quantity (7) is averaged over 10^6 trials. The issue with the choice of R is that the "interesting" region with a high density of local minima scales with N, such that a relation R(N) is needed. The following argument can be given. Assume w.l.o.g. y > 0 and that all components of the parental position are at some given local minimum denoted by ŷ^(j). The index j identifies the local attractor along the half-axis, e.g. j ∈ {1, 2, 3} in Fig. 1 on the right side. For N = 1 one has y = [ŷ^(j)] and therefore R² = (ŷ^(j))². Having N components at the same j-th local minimum gives R² = N(ŷ^(j))². A scaling R ∝ √N is therefore needed to stay within a certain region of local attractors when N is increased.
[Figure caption fragment: ... √N with y_i = 0.116 (right). As in Fig. 3, black dots depict the simulation, while the red dash-dotted line shows result (96). The error bars are very small and therefore not visible.]
The progress rates of two exemplary components for a single experiment are shown in Fig. 3. For both plots σ ∈ [0, 1] was chosen in order to investigate the effects of the oscillation with α = 2π. On the left, one observes enhanced progress for moderate σ-values due to local attraction, as both the local and the global attractor are aligned along the same direction. On the right, there is negative progress for moderate σ, as the local attractor is driving the ES away from the global attractor. For larger σ, the overall spherical shape dominates and both components exhibit positive progress. A decomposition of the progress rate into its k_i- and d_i-dependent terms is also displayed in Fig. 3. It shows the large-scale behavior of the k_i-term (dashed cyan) and the limited range of the d_i-term (dotted green). As k_i = ∂(y_i²)/∂y_i, its progress term models the global quadratic structure of Rastrigin, see the derivative definitions (33). The second term e^{−(ασ)²/2} d_i models the Rastrigin-specific local oscillation having limited range depending on the mutation strength σ (or α). By defining scale-invariant mutations using (4) with σ = σ*R/N, the oscillations vanish via e^{−(ασ*R/N)²/2} for large residual distance R, where the sphere function is recovered. This model significantly improves the progress rate formula (97) from [17].
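The damped derivative term can be sketched directly from the definitions above (k_i = 2y_i and d_i = αA sin(αy_i) from (33)); the function name f_tilde is ours:

```python
from math import exp, sin, pi

A, alpha = 1.0, 2.0 * pi  # Rastrigin parameters used in the experiments

def f_tilde(y_i: float, sigma: float) -> float:
    """Effective derivative term k_i + exp(-(alpha*sigma)^2 / 2) * d_i
    appearing in (96); naming is ours, definitions from (33)."""
    k_i = 2.0 * y_i                      # sphere-dependent part, d(y_i^2)/dy_i
    d_i = alpha * A * sin(alpha * y_i)   # Rastrigin-specific oscillation
    return k_i + exp(-0.5 * (alpha * sigma) ** 2) * d_i

# For sigma = 0 the full oscillation acts; for large sigma only 2*y_i remains
print(f_tilde(0.3, 0.0), f_tilde(0.3, 2.0))
```

The suppression factor e^{−(ασ)²/2} makes the second value collapse to 2·0.3 = 0.6, illustrating why large mutations recover the sphere-like behavior.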
As a note, changing one of the fitness parameters A or α directly affects Fig. 3. A change of the amplitude A rescales both the (local) peak and dip heights accordingly, increasing the effects of local attraction for larger A. Increasing the frequency α has mostly short-range effects, as the overall range is reduced due to suppression via e^{−(ασ)²/2} in (96). In the subsequent parts, the progress rate is investigated for A = 1 and α = 2π as an example. In Figs. 4 and 5 the progress rate is evaluated over the scale-invariant σ* for two different N-values and population sizes. One can see that the approximation quality improves for larger N and μ, as expected from the applied approximations.
The overall agreement between simulation and approximation is good for larger and smaller residual distances R, see the left and right plots, respectively. The σ*-range was chosen large enough such that the progress rate of the corresponding sphere function [5, Eq. (6.54)] reaches negative values due to mutations being too large. This boundary directly translates to Rastrigin, as the global structure is the same. However, due to ϕ_i being first order, no negative progress occurs even for large σ*. Therefore the second order progress rate ϕ_i^II needs to be derived in Sec. 4, where loss terms will provide additional correction terms.
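Such one-generation experiments can be reproduced with a short Monte Carlo sketch. The following assumes the component-wise progress rate is measured as E[y_i^(g) − y_i^(g+1)] (quantity (7) is not reproduced above) and uses far fewer trials than the 10^6 of the paper:

```python
import numpy as np

A, alpha = 1.0, 2.0 * np.pi

def rastrigin(Y):
    """Rastrigin fitness, summed over the last axis."""
    return np.sum(Y**2 + A * (1.0 - np.cos(alpha * Y)), axis=-1)

def progress_rate_i(y, sigma, mu, lam, i=0, trials=2000, seed=1):
    """Monte Carlo estimate of the component-wise progress rate
    phi_i = E[y_i^(g) - y_i^(g+1)] for one (mu/mu_I, lambda) generation
    with isotropic mutations and intermediate recombination."""
    rng = np.random.default_rng(seed)
    gains = np.empty(trials)
    for t in range(trials):
        Z = rng.standard_normal((lam, y.size))
        off = y + sigma * Z                      # lambda offspring
        best = np.argsort(rastrigin(off))[:mu]   # mu best by fitness
        y_new = off[best].mean(axis=0)           # intermediate recombination
        gains[t] = y[i] - y_new[i]
    return gains.mean()

y = np.full(10, 3.0)   # example parent far from the oscillating region
phi0 = progress_rate_i(y, sigma=0.5, mu=5, lam=20)
# expected to be positive here (sphere-dominated regime)
```

This only serves to cross-check the sign and rough magnitude of ϕ_i; the parameter choices (N = 10, μ = 5, λ = 20) are ours, not those of Figs. 4 and 5.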

Second order progress rate
The second order progress rate (8) requires the evaluation of E[(y_i^(g+1))²]. Starting with intermediate result (34) and referring to the i-th component, squaring yields expression (98). The square of the last term is evaluated by separating the sum into equal and unequal indices x_{k;λ} x_{l;λ}, giving (99). Inserting (99) into (98) and taking the expected value (the conditional variables y^(g) and σ^(g) are implicitly assumed to be given) yields (100). Noting that ϕ_i = −(1/μ) Σ_{m=1}^{μ} E[x_{m;λ}], see Eq. (36), and using (100) in the ϕ_i^II-definition (8) yields the second order i-th component progress rate (101), for which the two following expected values (102) and (103) need to be determined. In the subsequent parts the solutions to Eqs. (102) and (103) will be derived. Starting with (102), the solution requires the order statistic density (50) for the m-th individual, the large population identity (37), and the expansion of the normal CDF (69) up to first order. The resulting two integrals can then be solved analytically for large N and the results simplify significantly.

Proposition 2. Let μ, λ ∈ N with μ ≥ 1 and μ < λ and let p_x denote the PDF of the random mutation x ∼ N(0, σ²). Let x_{m;λ} denote the m-th best value (out of λ) of the i-th mutation component (x_{m;λ})_i. Furthermore, let P_Q and P_Q^{−1} denote the quality gain CDF (and its inverse), respectively, with B denoting the beta function. Then, the second order expected value reads (104).

Proof. Starting from (102) and rewriting the expected value as an integral over the order statistic density p_{m;λ} yields (105). Both (47) and (105) have the same structure after inserting p_{m;λ}(x_i) from (50), and the integration over the squared mutation component is performed as the last step. The same steps as presented in the proof of Proposition 1 can therefore be applied with the squared quantity x_i², which directly gives the result (104).
Analogously to the derivation of the first order progress rate in Sec. 3, a closed-form solution for (104) can only be provided by first applying the limit of large populations and then introducing approximations assuming large dimensionality N.
Theorem 3. Let p_x denote the PDF of the random mutation x ∼ N(0, σ²) and let x_{m;λ} denote the m-th best value (out of λ) of the i-th mutation component (x_{m;λ})_i. Let P_Q denote the quality gain CDF with its quantile function given by P_Q^{−1}. For a truncation ratio ϑ, the limit lim_{μ,λ→∞, ϑ=const.} of the second order expected value reads (106).
Proof. Starting from Eq. (104) and applying the infinite population size limit, the result of Theorem 1 can be applied, which yields the result (106).
Given result (106), approximations are again applied to provide closed-form solutions. Inserting the quality gain Approximation 1 and Approximation 2 via Eq. (62) into (106) leads (again) to an analytically unsolvable integral due to non-linear terms in x_i within the CDF argument. Therefore, the CDF is expanded using Approximation 3, neglecting higher order terms O(1/N). Finally, the integrands are simplified assuming large dimensionality using Approximation 4. The result is therefore given, after inserting g(x_i) and h(x_i) from (81) and (82), as (107). The two integrals abbreviated as I_i^0 and I_i^1 are evaluated now. For I_i^0, the substitution z = x_i/σ is introduced. The following integral identity [16] is applied for real parameters a and b; together with (108) it yields the right-hand side of (111).

Assuming (k_iσ)² ≪ D_Q² for large N further simplifies (112) and one obtains the result (113). This leads to the following result (115) for the first integral. The second integral I_i^1 from (109) is expressed using expected values over the normal density p_x of the terms given by x_i² δ(x_i). With δ(x_i) given in Eq. (64) one gets (116). One has E[x_i⁴] = 3σ⁴ and E[x_i²] = σ². Using the results from (29) the remaining expected values read (117). Therefore, one gets (118). Collecting the results (115) and (118) with k_i = 2y_i and inserting them back into (107), the expected value finally reads (119).

The solution of the second expected value (1/μ²) E^(1,1) from (103) is presented now. First an exact integral is derived. Then, approximations are applied to give closed-form solutions.

Proposition 3. Let μ, λ ∈ N with μ ≥ 1 and μ < λ and let p_x denote the PDF of the random mutation x ∼ N(0, σ²). Let x_{k;λ} denote the k-th best value (out of λ) of the i-th mutation component (x_{k;λ})_i. Furthermore, let P_Q and P_Q^{−1} denote the quality gain CDF (and its inverse), respectively, with B denoting the beta function. Then, the second order expected value reads (120).

Proof. First, a joint order statistic density has to be derived for the expected value. Then, the double sum is converted into a single integral using a known identity. The resulting five-fold integration is restructured by exchanging bounds and then successively solved.
Starting with (103), the double sum includes mixed contributions from the k-th and l-th best elements of the i-th mutation component. To avoid confusion with the summation indices k and l, the integration variables associated with the k-th element will be denoted as x_1 (mutation) and q_1 (quality), while the l-th element is integrated over x_2 and q_2. The ordering 1 ≤ k < l ≤ λ is assumed, with k yielding a smaller (better) quality value q_1 < q_2. Additionally, the joint probability density p_{k,l;λ}(x_1, x_2) is needed, such that the expected value can be formulated as (121). The mutation densities are independent and denoted by p_x(x_1) and p_x(x_2), respectively. Given the mutation components x_1 and x_2, the conditional densities of obtaining the quality values q_1 and q_2 are p_Q(q_1|x_1) and p_Q(q_2|x_2), respectively. Given q_1 and q_2, one has k − 1 values smaller than q_1, l − k − 1 values between q_1 and q_2, and λ − l values larger than q_2, with the corresponding probabilities expressed via the quality gain CDF P_Q(q) in (122). The joint probability density can therefore be written as (123), with integration ranges q_min ≤ q_1 < ∞ and q_1 < q_2 < ∞ as k < l. The lower bound q_min denotes the smallest possible quality value, which is resolved later. The factorials exclude the irrelevant combinations among the three groups given in (122). Plugging (123) into (121) and moving the sum into the innermost integral gives (124). The double sum of (124) over the P_Q-values will be expressed by an integral. This can be done using an identity from [4, p. 113]. Setting ν = 2 and identifying the indices as i_1 = l and i_2 = k, the identity yields (125) for real values Q_1 and Q_2, with integers ν ≤ μ < λ. Now a substitution can be performed and the double sum of (124) can be recognized by comparing with (125). Applying the identity therefore yields (126). Hence, Eq. (124) is expressed as (127).

The prefactor of Eq. (127) can be evaluated as (128). Now the integration order will be exchanged twice in (127). First the order between t and q_2 is exchanged. Then the order between t and q_1 is exchanged, such that both q-integrations are performed before the t-integration, enabling the application of the large population identity (37). Starting with the integration bounds (129) and using the inverse function P_Q^{−1}, the exchanged bounds between t and q_2 are 0 ≤ t ≤ 1 − P_Q(q_1), as given in (130). Using the factor (128) and the exchanged bounds (130), the expression (127) is reformulated as (131). Now the integration order between t and q_1 is exchanged starting from (132). Therefore, one arrives at the following integral (134) to be solved (the beta function has been moved inside as it will be evaluated during the t-integration). Now the integrals in (134) will be solved successively. Starting with the integral {•} over q_2 one has (135). The q_1-integration within [•] using (135) yields (136).

Theorem 4. Let p_x denote the density of the i-th component mutation x ∼ N(0, σ²) and let x_{k;λ} denote the k-th best value (out of λ) of the i-th mutation component (x_{k;λ})_i. Let P_Q denote the quality gain CDF with its quantile function given by P_Q^{−1}. For a truncation ratio ϑ, the limit lim_{μ,λ→∞, ϑ=const.} of the second order expected value (1/(μ(μ−1))) E^(1,1) reads (144).
Proof. Starting from Eq. (120), the μ-dependent prefactor was rearranged in a way that the factor (μ − 1)/μ in (120) is retained in the final result. Formally one could include (μ − 1)/μ in the sequence (38) and take the limit. However, it is desirable to keep the factor in the progress rate as a correction for finite μ-values. As 0 ≤ f_x(t) ≤ 1, the same bound estimation as in (43) holds. Furthermore, both mutation integrals over the density p_x are finite, see also (44). Therefore, the limit is evaluated with x_i re-introduced in the last line to denote the i-th mutation component, which gives Eq. (144).
In [•] of result (144), one can identify the first order progress rate −ϕ_i within the large population limit derived in Eq. (60). Refactoring (144) to obtain (1/μ²) E^(1,1), one can insert the ϕ_i-approximation from (96). Noting that c_ϑ² = e_ϑ^{2,0} via (45), one gets (146). Finally, inserting the results from (119) and (146) into (101), one obtains the second order progress rate (147), which serves as an approximation in the asymptotic limit of infinitely large dimensionality and population size. However, experimental investigations will also show good agreement for finite N, μ, and λ.
For future investigations of the convergence and step-size adaptation properties of the (μ/μ_I, λ)-ES, a simpler expression than (147) is needed. To this end, the N-dependency of the terms within {•} of (147) is investigated. It will be shown that for N → ∞ and μ = o(N) only the term −σ²/μ yields relevant contributions. The relevant terms in {•} of Eq. (147) are abbreviated according to their respective factors as e_ϑ^{1,1}, c_ϑ/D_Q, and e_ϑ^{2,0}. In order to maximize the absolute value of the individual terms, a lower bound for D_Q² is needed. Given the form of D_Q² from Eq. (31), no useful lower bound for the variance could be established satisfying D_Q² > 0 for any y_i due to the trigonometric terms. Therefore, we restrict the analysis to the sphere limit case A → 0. This assumption might seem crude. However, the most important characteristics are already contained in the first ϕ_i-dependent term of (147), referred to as the gain term in sphere model theory [5]. On the other hand, the loss terms in {•} are mostly dominated by the first term −σ²/μ. Experiments will affirm this assumption.
As the ϕ_i^II-approximation shall be valid for a constant σ* given any R-value, the mutation strength is re-normalized using (4). Setting A = 0, σ = σ*R/N, and Σ_i y_i² = R² in (31), one obtains the sphere variance for constant normalized mutation strength as (149). In the limit N → ∞ the second term of (149) is negligible for constant σ*, giving (150). Having obtained the sphere variance asymptotic in (150), the terms within {•} of (147) are evaluated. The term with prefactor e_ϑ^{1,1} yields (151) with σ = σ*R/N and using (150). It was used in (151) that a single component y_i² contributes in expectation 1/N to the residual distance R² = Σ_{j=1}^{N} y_j², see also (12). The second term with prefactor c_ϑ/D_Q is evaluated analogously, giving (152). The last term with prefactor e_ϑ^{2,0} yields (153) with A = 0 and using (150). In (153) the notation μ(N) was introduced to emphasize that the population size is usually chosen depending on the dimensionality of the search space. Finally, inserting the results of the loss term investigation for the three terms (151), (152), and (153) back into the progress rate (147), one gets (154) for the loss term in {•} of (147). Provided that the population size μ = o(N), i.e., increasing sub-linearly with N, all terms except "1" in {•} can be neglected for N → ∞. Theoretical results concerning population sizing, i.e., choosing the necessary μ(N) to achieve a high global convergence probability (success probability), are not available at this point. This is one of the main future goals of the current research project. Note that treating μ as a constant is also not satisfactory, since for large N an increase of μ is necessary to maintain a high success rate on a highly multimodal problem. However, experimental investigations on the Rastrigin function including step-size adaptation suggest a sub-linear relation, which validates the approximation. Finally, the lengthy result (147) is simplified using the loss term asymptotic of (154) and the second order progress rate approximation is obtained.

Second order progress rate
The second order component-wise progress rate on the Rastrigin function in the asymptotic limits of infinitely large population size μ (constant ϑ = μ/λ) and infinitely large dimensionality N with μ = o(N) yields result (155), with the explicit Rastrigin form given in (156).
[Figure caption fragment: the dashed blue curves show Eq. (147) and the dash-dotted red curves Eq. (156).]
The expressions for c_ϑ = e_ϑ^{1,0} from (45) and D_Q from (31) were not inserted to improve readability. The first line (155) emphasizes the dependence ϕ_i^II(ϕ_i) and can be thought of as a more general formula, provided that ϕ_i is known and the loss term behaves similarly to the sphere function loss term −σ²/μ. The second line (156) shows the explicit result for the Rastrigin function. The results (155) and (156) can be mapped to the Evolutionary Progress Principle [5], as the expressions contain a progress gain and loss term, respectively. Here, the gain part scales with c_ϑ and is a y_i-dependent expression.
Hence, depending on the sign of y_i sin(αy_i), it may also yield negative contributions due to local attraction moving the ES away from the global optimizer, cf. Fig. 3. The loss term −σ²/μ is characteristic for intermediate recombination. It introduces significant loss for large σ, but can be decreased using a larger μ due to recombination effects.
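The −σ²/μ scaling can be illustrated by a toy check: the intermediate recombinant of μ i.i.d. N(0, σ²) mutation components has variance σ²/μ. Note that this ignores the correlations induced by selection and only demonstrates the averaging effect:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma, mu, trials = 0.8, 16, 200_000

# Variance of the intermediate recombinant of mu i.i.d. N(0, sigma^2) mutations
recombinants = rng.normal(0.0, sigma, size=(trials, mu)).mean(axis=1)
print(recombinants.var(), sigma**2 / mu)  # both ≈ 0.04
```

Increasing μ by a factor of four halves the standard deviation of the recombinant, which is the "genetic repair" effect behind the reduced loss term.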
Results of one-generation experiments are presented in Figs. 6 and 7 by evaluating (8) over 10^6 trials (black dots with vanishing error bars) and comparing with the obtained approximations. The red dash-dotted line shows the simplified result (156), while the blue dashed line shows (147). The positions y were initialized randomly (given R) and kept constant over all repetitions. Fig. 6 shows a smaller dimensionality N = 20 and truncation ratio ϑ = 1/4, while Fig. 7 shows the larger values N = 100 with ϑ = 1/2. This was done to exemplarily investigate the results for different parameter sets.
The first thing to note is that the loss term allows negative progress for large σ*, which was not the case for ϕ_i. The approximation quality is good for different R-values (see the left and right plots, respectively) and improves for larger N and μ in Fig. 7, as expected. The simplified expression ϕ_i^II from (156) [red, dash-dotted] yields good results compared to (147) [blue, dashed], with (147) giving slightly better results for smaller σ* and (156) better results at larger σ*. This indicates that additional terms of the Taylor expansion (70) would be needed to further improve the results of (147). However, this would make the expression more involved, which is not desired. Furthermore, the results of Fig. 6 are relatively good considering that a rather small population (10/10, 40)-ES was used at the low dimensionality N = 20. One can conclude that (156) yields very good results considering its "simplicity". It will therefore be used in Sec. 5 to investigate the dynamical behavior of the ES. It should be noted that at this point there is no aggregated progress measure over all N components, such as the R-dependent sphere progress rate. Given some y^(g) one can evaluate all i = 1, ..., N values of ϕ_i^II and obtain a progress vector, but the overall effect on R^(g) → R^(g+1) is not known. This will be part of future research. However, the cumulative effect of all N progress rates can be evaluated within a dynamical systems model, to be shown in the next chapter.

Evolution equations
In the previous sections one-generation experiments were conducted and compared against the progress rate results (96), (147), and (156). In order to have an aggregated measure over all components and many generations, ϕ_i and ϕ_i^II will be used within the evolution equations and compared to real optimization runs of Algorithm 1. Using this method the (mean) global convergence behavior can be investigated.
The two terms ε^(1) and ε^(2) can be interpreted as fluctuations w.r.t. the expected values (provided by ϕ_i and ϕ_i^II). Thus, it holds E[ε^(1)] = 0 = E[ε^(2)]. However, the exact transition densities for g → g + 1 are not known at this point. In principle, they could be approximated using a finite number of higher order moments (or cumulants) to model the fluctuations [5, Ch. 7]. However, for a first study of the progress rate results on the dynamics, the fluctuations are neglected by setting ε^(1) = 0 = ε^(2). Therefore, one arrives at the (deterministic) equations (159) and (160) describing the mean-value dynamics of the parental position coordinates, with the constant normalized mutation strength σ* from Eq. (4) giving (161). Two important issues need to be discussed. Firstly, the positional iterations are defined for a single component i. For large N, however, it is not feasible to display each component individually. While the components will be iterated separately, the dynamics will be presented as a function of the residual distance R = ‖y^(g)‖. Secondly, for the evaluation of ϕ_i^II being a function of y^(g), the square root of the components (y_i^(g))² has to be taken after iteration, giving two solutions ±y_i; both solutions are equivalent. In the following, the deterministic iterations (159) and (160) using the mutation strength rescaling (161) are compared to real optimization runs. For the initialization, y^(0) is chosen randomly such that ‖y^(0)‖ = R^(0) for a given R^(0). The starting position is kept constant for consecutive runs of the same experiment. For the magnitude of R^(0) it is ensured that the strategy starts far enough away from the local minima landscape. Given Fig. 1 with A = 1, the position of the farthermost local minimizer can be read off. Considering the choice of σ*, one observes in experiments that larger mutation strengths (compared to a sphere-optimal σ*) increase the success probability P_S of individual trials to converge to the global optimizer. This is due to the fact that large steps tend to overcome local attraction more easily. However, this comes at the expense of efficiency, since large steps often overshoot the global optimizer. Therefore, in Fig. 8, σ* is chosen larger than the sphere-optimal value σ*_sph, which can be obtained numerically from [5, Eq. (6.54)], but small enough to prevent negative progress. The aim was to obtain P_S ≈ 1.
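The structure of the deterministic iteration (159)-(161) can be sketched as follows. The progress function phi is a placeholder to be replaced by (96) or (156), and the toy ϕ_i ∝ y_i used in the example is ours, illustrating only the sphere-dominated linear-convergence phase:

```python
import numpy as np

def iterate_mean_dynamics(y0, sigma_star, phi, G=100):
    """Deterministic mean-value iteration in the spirit of (159)-(161):
    each component follows y_i <- y_i - phi(y_i, sigma), with the
    scale-invariant sigma = sigma_star * R / N recomputed per generation.
    `phi` stands in for the progress rate (96) or (156)."""
    y = np.array(y0, dtype=float)
    N = y.size
    R_hist = [np.linalg.norm(y)]
    for g in range(G):
        sigma = sigma_star * R_hist[-1] / N          # rescaling (161)
        y = y - np.array([phi(yi, sigma) for yi in y])
        R_hist.append(np.linalg.norm(y))
    return np.array(R_hist)

# Toy sphere-like progress rate (illustration only, not Eq. (156)):
# phi_i proportional to y_i yields a geometric decrease of R.
R = iterate_mean_dynamics(np.full(10, 3.0), sigma_star=1.0,
                          phi=lambda yi, s: 0.1 * yi)
```

Plotting R over g for such an iteration reproduces the straight line (on a log scale) of the linear convergence phase; the local-attraction slow-down only appears once a Rastrigin-specific ϕ_i is supplied.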
In order to aggregate the R^(g)-data of multiple dynamic experiments, the median has proven to be a suitable measure of central tendency. The main issue is that due to fluctuations the R^(g)-values of distinct ES-runs may differ by orders of magnitude, such that the mean yields biased results due to a skewed distribution. The median is more suitable in this case and a more stable measure.
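The skewness argument can be illustrated with synthetic data; the log-normal spread below is an assumption chosen only to mimic runs whose residual distances differ by orders of magnitude:

```python
import numpy as np

rng = np.random.default_rng(3)
# Residual distances of "runs" spread over orders of magnitude (log-normal model)
R = 10.0 ** rng.normal(-3.0, 2.0, size=1000)

print(np.median(R))  # close to the typical run
print(np.mean(R))    # dominated by a few large outliers
```

The mean is pulled far above the typical run by the heavy right tail, whereas the median tracks the bulk of the runs, which motivates the median aggregation used in Figs. 8-10.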
In Fig. 8 one can observe three phases within the dynamics. First, linear convergence is observed for large R^(g)-values, where the sphere function dominates. Then, a slow-down is observed due to increasing effects of local attraction. For small R^(g)-values, the ES descends into the global attractor basin and linear convergence can be observed again. One can see that the ϕ_i-iteration (blue) shows by far too much progress compared to the ϕ_i^II-iteration. This is due to the first order model, which does not include loss terms and therefore overestimates the progress significantly, see also the discussion of result (96). The iteration via ϕ_i^II (red) shows good results compared to the median curve, especially for larger μ and N (right plot). The better agreement for large populations is also due to reduced fluctuation effects, which were neglected at the beginning of Sec. 5. In Fig. 9 the effect of a reduced σ* is investigated, which increases the probability of local convergence. The left plot shows σ* = 5 with no globally converging runs, as the mutation strength is too low. Technically, for constant σ* there is no local convergence, as the algorithm never stops if R is not decreasing. Still, the experiments are stopped after some g-threshold is reached. The stagnating behavior of the ES around some R^(g) can be illustrated using Fig. 3. For σ = 0.2 one has σ* = σN/R ≈ 0.9, which is small compared to σ*_sph ≈ 5.7. Both the left and right progress components of Fig. 3 are significantly influenced by the local attraction region at σ = 0.2. While some components may be improved (positive value, left), others are worsened (negative value, right), resulting in a cumulative effect of R^(g)-stagnation. One way out can be increasing σ (or equivalently σ*). However, the local minima landscape changes with changing R and an arbitrary σ*-increase is not possible. Stagnation may appear at different σ*- and R^(g)-values depending on fitness and strategy parameters. For an active step-size adaptation, changing σ appropriately, without converging locally, poses a major challenge.
In the central plot of Fig. 9, roughly half of the runs are globally converging at the increased σ* = σ*_sph. In this case the deterministic iteration follows a single converging path, as no fluctuations are modeled. The residual distance of the locally converging runs is reduced compared to the ES-runs with σ* = 5. Note that the convergence speed is faster (steeper negative slope) for the globally converging runs compared to σ* = 30 of Fig. 8 due to the sphere-optimal σ*_sph. However, this comes with the disadvantage of a lower P_S, as more trials converge locally. The right plot with σ* = 25 is similar to σ* = 30 of Fig. 8, but with several non-converging runs. Again, the ES convergence speed is faster if σ* is chosen closer to σ*_sph, but shows a slightly reduced P_S-value. The overall prediction quality of the iterative mapping (160) is good and the results affirm the expectation that relatively large mutations are favorable to maximize P_S on the Rastrigin function.
To confirm the expectation that the approximation quality increases further for larger μ and N, experiments are shown in Fig. 10. The first thing to notice is that the positional fluctuations of the ES trials decrease further, such that nearly all runs show similar R-dynamics. This is related to the intermediate recombination, see Eq. (34), as the position y^(g+1) is obtained by averaging over a large number of individuals. One can see good agreement, but for the left plot there is still some room for improvement. This is related to the truncation ratio ϑ = 1/4, such that the Taylor expansion point in Eq. (70) via the function g(x_i) is shifted by −Φ^(−1)(ϑ). For ϑ = 1/2 and even larger N and μ (right plot), very good agreement is observed.

Conclusion and outlook
In this paper the full first and second order progress rate analysis of the (μ/μ_I, λ)-ES has been presented. In order to obtain closed-form expressions for ϕ_i and ϕ_i^II it was necessary to apply the large dimensionality and large population assumptions. While the latter does not present a serious issue, because large populations are needed to ensure global convergence anyway, it was the key prerequisite to solve and simplify the expected value integrals. As the experiments have shown, the approximation quality of the progress rate expressions is rather good even for N as small as 20 and comparably small populations of μ = 10. For larger N and μ the approximation quality improves further, as expected. The first order progress rate result is able to model the local attraction effects on the Rastrigin function. This is a very important step, as all subsequent investigations in this paper are based on the ϕ_i-results. The second order progress rate derivation was needed to obtain additional loss terms completing the progress model, which was especially needed for larger mutation strengths and close to the global optimizer.
Using the progress rate expressions, the dynamics of the evolution process have been investigated. There is good agreement between the iterations and real ES-runs using median aggregation of the residual distance R to the global optimizer. As has been shown, depending on the choice of the normalized mutation strength, one can model global as well as local convergence behavior. Additionally, one observes a trade-off between efficiency and success rate, as relatively large mutations have to be chosen to maximize the success probability.
The conducted experiments assume scale-invariance, i.e., the mutation strength is controlled by the residual distance R.
This is in contrast to the fully self-adaptive ES, where σ evolves during the ES run either by mutative self-adaptation (SA), cumulative step-size adaptation (CSA), or Meta-ES. The incorporation of the self-adaptation process will be the next step completing the analysis of the (μ/μ_I, λ)-ES on Rastrigin. To this end, the self-adaptation response (SAR) function must be derived. Combining the N progress rates with the SAR function yields N + 1 evolution equations. In order to get manageable expressions that allow for analytic population sizing and expected runtime investigations, additional aggregation is needed. One possible approach would be the aggregation of the individual parental y_i components into the parental distance R, modeling the expected progress as a function of the residual distance. This would reduce the number of evolution equations to two, making further analytic treatment more accessible. A first step in this direction has been done in [19].
Finally, the presented approach to model the ES-dynamics is based on mean value considerations. That is, fluctuations are not considered so far. Whether the approach presented can be extended to allow for the calculation of the global attractor convergence probability as a function of strategy and fitness parameters remains an open question.

Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work in this paper.

Fig. 1 .
Fig. 1. The heat map shows the optimization landscape for A = 1, α = 2π, and N = 2. The global minimizer located at the origin (dark blue) is surrounded by multiple local minima. On the right side the same parameter set is shown for N = 1. For increasing y the oscillation contribution is decreasing. (For interpretation of the colors in the figure(s), the reader is referred to the web version of this article.)
[Algorithm 1 fragment: ỹ_{1;λ}, ..., ỹ_{μ;λ} ← sort ỹ w.r.t. ascending f; σ^(g+1) ← σ* ‖y^(g+1)‖/N; g ← g + 1; until termination criterion]

[Footnote fragments: the asymptotic generalized progress coefficient definition e_ϑ^{1,1} from (45) can be applied with parameters a = 1 and b = ; the corresponding terms of ϕ_i^II and D_Q²(y) are even in y^(g).]