

Escaping Local Optima Using Crossover With Emergent Diversity

Duc-Cuong Dang, Tobias Friedrich, Timo Kötzing, Martin S. Krejca, Per Kristian Lehre, Pietro S. Oliveto, Dirk Sudholt, and Andrew M. Sutton

Abstract—Population diversity is essential for avoiding premature convergence in genetic algorithms (GAs) and for the effective use of crossover. Yet the dynamics of how diversity emerges in populations are not well understood. We use rigorous runtime analysis to gain insight into population dynamics and GA performance for the (μ + 1) GA and the Jump test function. We show that the interplay of crossover followed by mutation may serve as a catalyst leading to a sudden burst of diversity. This leads to significant improvements of the expected optimization time compared to mutation-only algorithms like the (1 + 1) evolutionary algorithm. Moreover, increasing the mutation rate by an arbitrarily small constant factor can facilitate the generation of diversity, leading to even larger speedups. Experiments were conducted to complement our theoretical findings and further highlight the benefits of crossover on the function class.

I. INTRODUCTION
GENETIC algorithms (GAs) are powerful general-purpose optimizers that perform surprisingly well in many applications, including those where the problem is not understood well enough to apply a tailored algorithm. Their widespread success is based on a number of factors: using populations to diversify search, using mutation to generate novel solutions, and using crossover to combine features of good solutions.
Prügel-Bennett [29] has given several reasons for the success of populations and crossover. Crossover can combine building blocks of good solutions and help to focus the search on bits where parents disagree [29]. For both tasks, the population needs to be diverse enough; without sufficient diversity in the population, crossover is not effective. A common problem in the application of GAs is the loss of diversity when the population converges to copies of the same search point, often called premature convergence. Understanding how populations gain and lose diversity during the course of the optimization is vital for understanding the working principles of GAs and for tuning the design of GAs to get the best possible performance.
However, understanding population diversity and crossover has proved elusive. The first example function where crossover was proven to be beneficial is called Jump_k. In this problem, GAs have to overcome a fitness valley such that all local optima have Hamming distance k to the global optimum. Jansen and Wegener [18] showed that, while mutation-only algorithms such as the (1 + 1) EA require expected time Ω(n^k), a simple (μ + 1) GA with crossover only needs time O(μn^2 k^3 + 4^k/p_c). This time is O(4^k/p_c) for large k, and hence significantly faster than mutation-only GAs. However, their analysis requires an unrealistically small crossover probability p_c ≤ 1/(ckn) for a large constant c > 0.
Kötzing et al. [20] later refined these results toward a crossover probability p_c ≤ k/n, which is still unrealistically small. Both approaches focus on creating diversity through a sequence of lucky mutations, relying on crossover to create the optimum once sufficient diversity has been created. Their arguments break down if crossover is applied frequently. Hence, these analyses do not reflect the typical behavior in GA populations with constant crossover probabilities p_c = Ω(1) as used in practice [22].
Lehre and Yao [21] analyzed the runtime of the (μ + 1) GA with deterministic crowding for arbitrary crossover rates p_c > 0, showing exponential runtime gaps between the cases p_c = 0 and p_c > 0. The gain in performance in that analysis stems from the ability of a diverse population to optimize multiple, separated paths in parallel using a diversity-preservation mechanism. Similar results have also been shown for instances of the vertex cover problem by generating diversity, either as in [25] or through island models [23]. Recently, in [7], we showed that a small change to the tie-breaking rule of the (μ + 1) GA, introducing many common principles of preserving diversity, can lead to a sizeable advantage in the expected optimization time on the Jump_k function. Those results hold for realistic crossover probabilities p_c = 1 − Ω(1). In this paper, we will consider a very different effect.
We provide a novel approach loosely inspired by population genetics: we show that diversity can also be created by crossover, followed by mutation. Note that the perspective of crossover creating diversity is common in population genetics [19], [33]. A frequent assumption is that crossover mixes all alleles in a population, leading to a situation called linkage equilibrium, where the state of a population is described by the frequency of its alleles [3].
For the maximum crossover probability p_c = 1, we show that on Jump_k diversity emerges naturally in a population: the interplay of crossover, followed by mutation, can serve as a catalyst for creating a diverse range of search points out of few different individuals. This naturally emerging diversity allows us to prove a speedup of order n/log n for k ≥ 3 and the standard mutation rate p_m = 1/n compared to mutation-only algorithms such as the (1 + 1) EA. Increasing the mutation rate to p_m = (1 + δ)/n for an arbitrarily small constant δ > 0 leads to a speedup of order n. The details can be seen in Table I.
Both operators are proven to be vital: mutation requires Ω(n^k) expected iterations to hit the optimum from a local optimum. Using crossover on its own does not help much either. As shown in [20, Th. 8], using only crossover with p_c = Ω(1) but no mutation following crossover, diversity reduces quickly, leading to inefficient running times for small population sizes (μ = O(log n)).
All our analyses are based on observing the dynamic behavior of the size of the largest species, where we refer to a collection of identical genotypes as a species. A population contains no diversity when only one species is present. However, mutation can create further species, and then the combination of crossover and mutation is able to rapidly create further species in a highly stochastic process. This diversity can then be exploited to find the global optimum on Jump_k efficiently. A higher mutation rate facilitates the generation of new species and leads to better performance, with respect to both rigorous upper runtime bounds and empirical performance.
Using Jump_k as a case study, our analyses shed light on how diversity emerges in populations and how to facilitate the emergence of diversity by tuning the mutation rate. The general proof strategy we take is as follows. We characterize the size of the largest species as a stochastic process and calculate the transition probabilities of this process, taking into account both mutation and crossover. We prove that the size of the largest species is described either by an almost-fair random walk (for standard mutation rates) or by an unfair random walk that is biased toward increased diversity (for higher mutation rates). This ultimately allows us to bound the expected time until sufficient diversity is present in the population to perform a crossover that successfully generates the global optimum. Our main results are stated in Theorems 2 and 3, which yield our runtime bounds under the assumed conditions. Critical lemmas are Lemma 1, which estimates the time until the entire population has reached the plateau using the method of fitness-based partitions, and Lemma 3, which bounds the transition probabilities for the random walk dynamics of the size of the largest species. The proof of Lemma 3 is carried out by a careful analysis of the different events that can occur while the entire population resides on the plateau.
This paper is based upon our preliminary study published in [6]. Here we extend the analysis to higher mutation rates, leading to the surprising conclusion that increasing the mutation rate leads to smaller runtime bounds compared to the standard mutation rate 1/n. Furthermore, the analysis of standard mutation rates in [6] was restricted to very short jumps, k = O(1). Here we generalize the results to a much larger class of Jump_k functions, only requiring k = o(n). Experiments were conducted to complement the theoretical results and further highlight the benefits of combining crossover with mutation. In fact, the experimental results show that the setting of a high mutation rate can be as competitive as using the specific diversity mechanisms from [7].

II. PRELIMINARIES
The Jump_k : {0, 1}^n → N class of pseudo-Boolean fitness functions was originally introduced by Jansen and Wegener [18]. The function value increases with the number of 1 bits in the bit string until a plateau of local optima is reached, consisting of all points with n − k 1 bits. However, its only global optimum is the all-ones string 1^n. Between the plateau and the global optimum there is a valley of bad fitness, which we call the gap, of length k, and the algorithm has to jump over this gap to optimize the function. The function is formally defined as

Jump_k(x) = k + |x|_1 if |x|_1 ≤ n − k or x = 1^n, and Jump_k(x) = n − |x|_1 otherwise,

where |x|_1 = Σ_{i=1}^{n} x_i is the number of 1 bits in x. Fig. 1 illustrates the function, with the number of 1 bits on the horizontal axis and the function value on the vertical axis.
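The definition above can be transcribed directly into code; the following is a plain Python sketch (the function name and the representation of bit strings as 0/1 lists are our own choices, not from the paper):

```python
def jump(x, k):
    """Jump_k fitness of a bit string x (a list of 0s and 1s).

    Points with at most n - k ones, as well as the optimum 1^n, have
    fitness k + |x|_1; points inside the gap have fitness n - |x|_1.
    """
    n = len(x)
    ones = sum(x)
    if ones <= n - k or ones == n:
        return k + ones
    return n - ones
```

For example, with n = 10 and k = 3, every point with seven 1 bits lies on the plateau of fitness 10, any point with eight or nine 1 bits falls into the gap, and the optimum 1^10 has fitness 13.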
We will analyze the performance of a standard steady-state (μ + 1) GA [18] using uniform crossover (i.e., each bit of the offspring is chosen uniformly at random from one of the parents) and standard bit mutation (i.e., each bit is flipped with probability p_m). The algorithm uses a population of μ individuals. In each generation, a new individual is created. With probability p_c, it is created by selecting two parents from the population uniformly at random, crossing them over, and then applying mutation to the resulting offspring. With probability 1 − p_c, instead, one single individual is selected and only mutation is applied. The generation is concluded by removing the worst individual from the population, breaking ties uniformly at random. Algorithm 1 shows the pseudocode for the (μ + 1) GA. Note that P is a multiset.
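The generation loop just described can be sketched as follows. This is a minimal Python rendition of Algorithm 1 rather than its verbatim pseudocode; the function and parameter names are our own:

```python
import random

def mu_plus_one_ga(fitness, n, mu, p_c, p_m, max_gens, seed=None):
    """Sketch of the steady-state (mu + 1) GA described above.

    Each generation creates one offspring: with probability p_c by uniform
    crossover of two parents chosen uniformly at random followed by
    mutation, otherwise by mutation of a single parent. Then one worst
    individual is removed, breaking ties uniformly at random.
    Returns the final population and the best fitness value seen.
    """
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n)] for _ in range(mu)]
    best = max(fitness(ind) for ind in pop)
    for _ in range(max_gens):
        if rng.random() < p_c:
            x, y = rng.choice(pop), rng.choice(pop)
            child = [a if rng.random() < 0.5 else b for a, b in zip(x, y)]
        else:
            child = list(rng.choice(pop))
        child = [1 - bit if rng.random() < p_m else bit for bit in child]
        best = max(best, fitness(child))
        pop.append(child)                      # population temporarily has mu + 1
        worst = min(fitness(ind) for ind in pop)
        worst_idx = [i for i, ind in enumerate(pop) if fitness(ind) == worst]
        pop.pop(rng.choice(worst_idx))         # remove a worst, ties u.a.r.
    return pop, best
```

On an easy function such as OneMax (fitness = number of 1 bits, i.e., `sum`), the sketch quickly reaches the all-ones string; note that the scheme is elitist, since only a worst individual is ever removed.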
The most interesting behavior of the population presented in this paper occurs after the entire population is stuck at local optima, i.e., on the so-called plateau. This is because, under the right conditions, population diversity emerges during this stage. Once sufficient progress is made in diversity, crossover and mutation can work together on the plateau to create an optimal solution in o(n^k) time. This is captured by Lemma 10, which will be presented later in this paper.
For the sake of completeness, in the next section we provide the time bounds for the population to reach the plateau in the general setting of p_c = Ω(1). This covers the case p_c = 1, which we focus on in the main results.

III. TIME TO PLATEAU
In the setting of p_c = Ω(1), we direct our attention to the steps in which crossover occurs. We make use of the following general result, which provides an upper bound on the expected time for the (μ + 1) GA to reach some region A_m of the search space. Here we consider a fitness-based partition (see [17] for a formal definition) into levels A_1, ..., A_m.

Theorem 1: Given a fitness-based partition A_1, ..., A_m of the search space, suppose there exist ε > 0 and s_1, ..., s_{m−1} > 0 such that for all 1 ≤ j < m:
1) min_{x ∈ A_≥j, y ∈ A_≥j+1} Pr(mutate(crossover(x, y)) ∈ A_≥j+1) ≥ ε and
2) min_{x, y ∈ A_j} Pr(mutate(crossover(x, y)) ∈ A_≥j+1) ≥ s_j;
then the expected number of iterations until the entire population of the (μ + 1) GA with p_c = Ω(1) is in A_m is O((μm/ε) log μ + Σ_{j=1}^{m−1} 1/s_j).

Proof: The proof follows [5], but we avoid a detailed drift analysis because the algorithm is elitist, i.e., the maximum fitness in the population does not decrease. Let the current level be the smallest i ∈ [m] such that the population contains fewer than μ/2 individuals in A_≥i+1. By definition, there are at least μ/2 individuals in A_≥j, where j is the current level.
Since the algorithm is elitist, the number of individuals in A_≥j is nondecreasing for any j ∈ [m]. For an upper bound, we ignore any improvements where only mutation is used (i.e., lines 8 and 9 in Algorithm 1).
Assume that there are i individuals in A_≥j+1, hence 0 ≤ i < μ/2. If i = 0, then an individual in A_≥j+1 can be created by selecting two individuals from A_j, crossing them over, and mutating the offspring such that it is in A_≥j+1, while an individual not in A_≥j+1 is removed. The probability of this event is at least p_c s_j/4, where the 1/4 is the probability of selecting two individuals from A_j, which contains at least μ/2 individuals.
If 0 < i < μ/2, then the number of individuals in A_≥j+1 can be increased by selecting an individual in A_≥j and an individual in A_≥j+1, crossing them over, and mutating the offspring such that it is in A_≥j+1, while one of the μ − i > μ/2 individuals not in A_≥j+1 is removed. This event occurs with probability at least (p_c/2)(i/μ)ε.
The expected time to increase the number of individuals in A_≥j+1 from 0 to μ/2, i.e., to increase the current level by at least one, is at most 4/(p_c s_j) + (2μ/(p_c ε)) Σ_{i=1}^{μ/2} 1/i. Hence, the expected time until at least half of the population is in A_m is O((μm/ε) log μ + Σ_{j=1}^{m−1} 1/s_j). We now consider the time to remove individuals from the lowest fitness level in the population, assuming that at least half of the population has reached the last level A_m. Assume that there are 0 < i′ < μ/2 individuals in the lowest level j < m. The number of individuals in level j can be reduced by crossing over an individual in level j and one of the at least μ/2 individuals in level m, and mutating the offspring so that it belongs to A_≥j+1. By Condition 1, this event occurs with probability at least p_c (ε/2)(i′/μ). Hence, the expected time to remove all individuals from the lowest level j is at most (2μ/(p_c ε)) Σ_{i′=1}^{μ/2} 1/i′ = O((μ/ε) log μ). The expected time until all individuals in fitness levels lower than m have been removed is therefore O((μm/ε) log μ).
We apply Theorem 1 to bound the time until the entire population reaches the plateau.
Lemma 1: Consider the (μ + 1) GA optimizing Jump_k with p_c = Ω(1) and p_m = Θ(1/n). Then the expected time until either the optimum has been found or the entire population is on the plateau is O(n√k (μ log μ + log n)).

Proof: We divide the search space into m := n fitness levels A_1, ..., A_n, where A_j contains all search points of fitness value j for j < n, and A_n contains all search points of fitness value at least n, i.e., the plateau and the optimum. We call any search point x ∈ {0, 1}^n with n − k < |x|_1 < n a gap-point. Gap-points have worse fitness than any other search point; hence, once there are no gap-points left in the population, the algorithm will not accept any further gap-points. We can therefore divide the run into two phases, with phase 1 lasting as long as the population contains at least one gap-individual, followed by phase 2, which lasts until the optimum has been found or the entire population is on the plateau.
We bound the durations of the two phases by applying Theorem 1 twice, once for each phase.
We start by estimating the expected duration of phase 2 using Theorem 1 with respect to levels A_k to A_n. We claim that the probability of producing a gap-point by crossing over two individuals x ∈ A_≥j and y ∈ A_≥j+1 with k ≤ j < n satisfies

Pr(crossover(x, y) is a gap-point) ≤ 1 − Ω(1/√k).   (1)

To see why this claim holds, we first argue that the probability of producing a gap-point is highest when both parents, x and y, have n − k 1 bits: obtaining x′ from x (and y′ from y) by flipping arbitrary 0 bits until exactly k 0 bits remain can only increase the probability that the offspring falls into the gap. The probability of obtaining a search point with exactly k 0 bits when crossing over two bit strings with k 0 bits each is minimized when all positions of the 0 bits in the two bit strings differ. Hence, for the bit strings x′ and y′, we have by Stirling's approximation the lower bound

Pr(crossover(x′, y′) has exactly k 0 bits) ≥ C(2k, k) · 2^{−2k} = Ω(1/√k).   (2)

Uniform crossover of the bit strings x′ and y′ creates two complementary bit strings u′ and v′ and returns either u′ or v′ with equal probability. The event that the returned offspring has exactly k 0 bits implies that it is not a gap-point, and the claimed inequality (1) follows.

We now show that Condition 1 of Theorem 1 holds for a parameter ε = Ω(1/√k). Assume that x ∈ A_≥k+j and y ∈ A_≥k+j+1 for j ≥ 0. By the same arguments as above, crossover creates two complementary offspring u′ and v′ with |u′|_1 + |v′|_1 ≥ 2j + 1, where we assume without loss of generality that |u′|_1 ≥ |v′|_1. A crossover between x and y therefore produces an offspring u′ with |u′|_1 ≥ j + 1, hence

Pr(|crossover(x, y)|_1 ≥ j + 1) ≥ 1/2.   (3)

Combining (1) and (3) now yields Pr(crossover(x, y) ∈ A_≥k+j+1) = Ω(1/√k). Finally, with probability (1 − p_m)^n = Ω(1), none of the bits are flipped during mutation, which implies Pr(mutate(crossover(x, y)) ∈ A_≥k+j+1) ≥ ε.
We now show that Condition 2 of Theorem 1 holds. Assume that x, y ∈ A_{k+j} for j ≥ 0. Then, following the same arguments as above, crossover produces an offspring in A_≥k+j that is not a gap-point with probability Ω(1/√k). The probability that the mutation operator flips at least one of the n − j 0 bits, and no other bits, is at least (n − j) p_m (1 − p_m)^{n−1} = Ω((n − j)/n). Hence, we can use the parameter s_j = Ω((n − j)/(n√k)). We have shown that both Conditions 1 and 2 hold during phase 2, which by Theorem 1 implies that the expected duration of phase 2 is O(n√k (μ log μ + log n)). To estimate the expected duration of phase 1, we again apply Theorem 1, but this time with respect to levels A_1 to A_k. We can reuse the bounds from phase 2, except that we count the number of 0 bits rather than the number of 1 bits, and we do not need to account for the probability of producing gap individuals. Hence, we obtain the same upper bound for the expected duration of phases 1 and 2, and the lemma follows.
In the following sections, we first show that once the plateau of Jump_k has been reached by the (μ + 1) GA with p_c = 1, population diversity can emerge naturally from the interaction between crossover and mutation. Based on this result on the population dynamics, bounds on the expected optimization time of the function class are then deduced for two different settings of the algorithm: standard and high mutation rates.

IV. POPULATION DYNAMICS
Assume that the algorithm has reached a population in which all individuals are identical and on the plateau, i.e., the least diverse setting. We refer to a set of identical individuals as a species; hence, in this case, there is only one species. Eventually, a mutation will create a different search point on the plateau, leading to the creation of a new species. Both species may shrink or grow in size, and there is a chance that the new species will disappear and that we return to one species only.
However, the existence of two species also serves as a catalyst for creating further species in the following sense. Say two parents 0001111111 and 0010111111 are recombined; then crossover has a good chance of creating an individual with n − k + 1 1s, e.g., 0011111111. Then mutation has a constant probability of flipping any one of the n − k − 1 unrelated 1 bits to 0, leading to a new species, e.g., 0011111011. This may lead to a sudden burst of diversity in the population.
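A back-of-the-envelope calculation for this example (our own illustration, not from the paper): with two species at Hamming distance 2, the probability that a single generation creates a third plateau species in exactly this way is bounded below by a constant.

```python
def third_species_prob_lower_bound(n, k, chi):
    """Lower bound on the catalyst event described above: crossover of two
    plateau parents at Hamming distance 2 produces an offspring with
    n - k + 1 ones (probability 1/4: both differing bits set to 1), and
    mutation with rate chi/n then flips exactly one of the n - k - 1 ones
    shared by both parents and nothing else, giving a new plateau point."""
    p = chi / n
    return 0.25 * (n - k - 1) * p * (1 - p) ** (n - 1)
```

For χ = 1 this tends to e^{−1}/4 ≈ 0.092 as n grows, i.e., a constant probability per generation, conditional on such a pair of parents actually being selected for recombination.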
To further investigate these dynamics, we set up a preliminary experiment for n = 500 and k = 3, with population size μ = 100 and mutation parameter χ ∈ {0.1, 0.2, ..., 2.0}. Since we are only interested in the dynamics on the plateau, the optimum is always rejected, and the population is initialized with copies of a single plateau solution. One hundred independent runs are performed for each setting, and as an indicator of diversity, the size of the largest species is recorded for the first 10^5 iterations of each run. Fig. 2 illustrates the obtained results. Clearly, we see that new species emerge from time to time and, more importantly, if the mutation rate χ/n is sufficiently large, then a diverse population can be maintained (the size of the largest species remains close to 1) after some time.
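This plateau experiment can be reproduced in miniature with the following sketch (our own code with hypothetical parameter names; the paper's experiment uses n = 500, k = 3, μ = 100 and 10^5 iterations):

```python
import random
from collections import Counter

def largest_species_trace(n, k, mu, chi, iters, seed=0):
    """Track the size of the largest species on the plateau of Jump_k.

    The population starts as mu copies of one plateau point, and the
    (mu + 1) GA with p_c = 1 runs with the optimum rejected, as in the
    experiment above: any offspring without exactly n - k ones (the
    optimum by fiat, everything else as the unique worst individual)
    is discarded immediately.
    """
    rng = random.Random(seed)
    start = tuple([0] * k + [1] * (n - k))  # one plateau point: n - k ones
    pop = [start] * mu
    p_m = chi / n
    trace = []
    for _ in range(iters):
        x, y = rng.choice(pop), rng.choice(pop)
        child = tuple(a if rng.random() < 0.5 else b for a, b in zip(x, y))
        child = tuple(1 - bit if rng.random() < p_m else bit for bit in child)
        if sum(child) == n - k:  # stays on the plateau: all fitness-equal,
            pop.append(child)    # so one of the mu + 1 is removed u.a.r.
            pop.pop(rng.randrange(len(pop)))
        trace.append(max(Counter(pop).values()))
    return trace
```

Plotting such traces for increasing χ reproduces the qualitative picture of Fig. 2: for small χ the largest species repeatedly recovers to size μ, while for larger χ its size collapses and stays low.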
The above simulation indicates that the mutation rate and the size of the largest species are important factors for describing the population diversity. With a large enough mutation rate, the size of the largest species can perform a random walk biased toward a reduction of its value. Once its size has decreased significantly from its maximum μ, there is a good chance of recombining two parents from different species. This helps in finding the global optimum, as crossover can increase the number of 1s in the offspring compared to its parents, such that fewer bits need to be flipped by mutation to reach the optimum. This is formalized in the following lemma.
Lemma 2: The probability that the global optimum is constructed by a uniform crossover of two parents on the plateau having Hamming distance 2d, followed by mutation, is

Σ_{i=0}^{2d} C(2d, i) 2^{−2d} p_m^{k+d−i} (1 − p_m)^{n−(k+d−i)} ≥ 2^{−2d} p_m^{k−d} (1 − p_m)^{n−k+d}.

Proof: For a pair of search points on the plateau with Hamming distance 2d, both parents have d 1s among the 2d bits that differ between the parents and n − k − d 1s outside this area. Assume that crossover sets i out of these 2d bits to 1, which happens with probability C(2d, i) · 2^{−2d}. Then mutation needs to flip the remaining k + d − i 0s to 1 and no other bits. The probability that such a pair creates the optimum is hence the sum above. The second bound is obtained by ignoring all summands with i < 2d in the sum.
Note that even a Hamming distance of 2, i.e., d = 1, leads to a probability of Ω(n^{−k+1}), provided that such parents are selected for reproduction. This probability is larger by a factor of order n than the probability Θ(n^{−k}) of mutation without crossover reaching the optimum from the plateau.
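The gap between these two probabilities is easy to check numerically. The following snippet (our own illustration) evaluates the exact sum from Lemma 2 against the probability that mutation alone jumps the gap:

```python
from math import comb

def crossover_jump_prob(n, k, d, chi):
    """Exact probability from Lemma 2: parents on the plateau at Hamming
    distance 2d; uniform crossover sets i of the 2d differing bits to 1
    with probability C(2d, i) / 2^(2d), and mutation (rate chi/n) must
    then flip the remaining k + d - i zeros and leave all other bits."""
    p = chi / n
    return sum(
        comb(2 * d, i) * 2 ** (-2 * d)
        * p ** (k + d - i) * (1 - p) ** (n - (k + d - i))
        for i in range(2 * d + 1)
    )

def mutation_jump_prob(n, k, chi):
    """Mutation alone: flip all k zeros of a plateau point, nothing else."""
    p = chi / n
    return p ** k * (1 - p) ** (n - k)
```

For n = 100, k = 3, χ = 1, and d = 1, the crossover-assisted probability is already more than an order of magnitude larger than the mutation-only probability, in line with the factor of order n discussed above.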
We will show that this effect leads to a speedup of nearly n for the (μ + 1) GA, compared to the expected time of Θ(n^k) for the (1 + 1) EA [10] and other EAs using only mutation.
The idea behind the analysis is to investigate the random walk underlying the size of the largest species. We bound the expected time for this size to decrease to μ/2 and then argue that the (μ + 1) GA is likely to spend a good amount of time with a population of good diversity, where the probability of creating the optimum in every generation is Ω(n^{−k+1}) due to the chance of recombining parents of Hamming distance at least 2.
In the following, we refer to Y(t) as the size of the largest species in the population at time t. Define

p_+(y) := Pr(Y(t + 1) = y + 1 | Y(t) = y) and p_−(y) := Pr(Y(t + 1) = y − 1 | Y(t) = y);

that is, p_+(y) is the probability that the size of the largest species increases from y to y + 1, and p_−(y) is the probability that it decreases from y to y − 1.
The following lemma gives bounds on these transition probabilities, except for the case that two parents of Hamming distance larger than 2 are selected for recombination (this case will be treated later in Lemma 4). We formulate the lemma for arbitrary mutation rates χ/n = Θ(1/n) and restrict our attention to sizes Y(t) ≥ μ/2, as we are only interested in the expected time for the size to decrease to μ/2.
Lemma 3: For every population on the plateau of Jump_k with k = o(n), the following holds. Either the (μ + 1) GA with mutation rate χ/n = Θ(1/n) performs a crossover of two parents whose Hamming distance is larger than 2, or the size Y(t) of the largest species changes according to transition probabilities satisfying p_−(μ) = Ω(k/n) and, for μ/2 ≤ y < μ, corresponding bounds on p_+(y) and p_−(y).

Proof: We call an individual belonging to the current largest species a y individual and all others non-y individuals. In each generation, there is either no change, or one individual is added to the population and one individual chosen uniformly at random is removed from the population. In order to increase the number of y individuals, it is necessary that a y individual is added to the population and a non-y individual is removed from the population. Analogously, in order to decrease the number of y individuals, it is necessary that a non-y individual is added to the population and a y individual is removed from the population.
Given that Y(t) = y, let p(y) be the probability that a y individual is created at time t + 1, and q(y) the probability that a non-y individual is created. Since all considered individuals are on the plateau, the individual for deletion is selected uniformly at random. Multiplying by the survival probabilities, we have

p_−(y) = q(y) · y/(μ + 1)   (6)

and

p_+(y) = p(y) · (μ − y)/(μ + 1).   (7)

We now estimate an upper bound on p(y). We may assume that the Hamming distance between parents is at most 2, as otherwise there is nothing to prove. A y individual can be created in the following three ways.
3) Two non-y individuals are selected. These two individuals are either identical or have Hamming distance 2 (by assumption). In the first case, they both have one of the k 0-bit positions of a y individual set to 1. In the second case, they either both have one of the k 0-bit positions of a y individual set to 1, or they both have one of the n − k 1-bit positions set to 0. In both cases, crossover cannot change the value of such a bit. Thus, at least one specific bit position must be flipped, which occurs with probability O(1/n). Taking into account the probabilities of the three selection events above, this yields the upper bound on the probability p(y) of producing a y individual.

We then estimate a lower bound on q(y). In the case where y = μ, a non-y individual can be added to the population if:
1) two y individuals are selected and the mutation operator flips one of the k 0 bits and one of the n − k 1 bits. This event occurs with probability at least k(n − k)(χ/n)²(1 − χ/n)^{n−2} = Ω(k/n), where we used that k = o(n) in the last equality.
In the other case, where y < μ, a non-y individual can be added to the population in the following two ways.
1) A y individual and a non-y individual are selected. First, crossover produces a copy of the non-y individual with probability 1/4, which is left unchanged by mutation with probability (1 − χ/n)^n. Second, with probability 1/4, crossover produces an individual with k − 1 0 bits; mutation then creates a non-y individual by flipping a single one of the n − k 1-bit positions that does not lead to recreating y. Third, again with probability 1/4, crossover produces an individual with k + 1 0 bits; mutation then creates a non-y individual by flipping a single one of k of its 0 bits that does not lead back to y. These three events, conditional on selecting a y individual and a non-y individual, lead to a total probability of Ω(1).
2) Two non-y individuals are selected. In the worst case, the selected individuals are different; hence, crossover produces an individual on the plateau with probability at least 1/2, which mutation does not destroy with probability (1 − χ/n)^n.
Assuming that μ/2 ≤ y < μ and that n is sufficiently large, these cases yield the lower bound on the probability q(y) of adding a non-y individual. Plugging p(y) and q(y) into (6) and (7) yields the claimed transition probabilities.

Steps in which crossover recombines two parents with larger Hamming distance were excluded from Lemma 3, as they require different arguments. The following lemma shows that the conditional transition probabilities in this case are favorable, in the sense that the size of the largest species is more likely to decrease than to increase.
Lemma 4: Assume that y ≥ μ/2 and that the (μ + 1) GA on Jump_k with k = o(n) and mutation rate χ/n = Θ(1/n) selects two individuals on the plateau with Hamming distance larger than 2. Then, for the conditional transition probabilities p*_−(y) and p*_+(y) for decreasing or increasing the size of the largest species, p*_−(y) ≥ 2p*_+(y).

Proof: Assume that the population contains two individuals x and z with Hamming distance 2ℓ ≤ 2k, where ℓ ≥ 2. Without loss of generality, let us assume that they differ in the first 2ℓ bit positions.
First, assume that the individual y representing the largest species has 0 bits in the first 2ℓ positions. Then a y individual may be produced by crossover creating the 0 bits and 1 bits in exactly the right positions, followed by a mutation that flips none of these bits. Alternatively, at least one specific bit has to be flipped by mutation.
Then the probability of producing a y individual from x and z and replacing a non-y individual with it is less than (2^{−2ℓ}(1 − χ/n)^n + O(1/n)) · (μ − y)/(μ + 1). On the other hand, the probability of producing an individual on the plateau different from y and replacing a y individual is at least (C(2ℓ, ℓ) − 1) 2^{−2ℓ} (1 − χ/n)^n · y/(μ + 1) ≥ 2p*_+(y) for sufficiently large n.
In the other case, assume that the individual y does not have 0 bits in the first 2ℓ bit positions. Then the mutation operator must flip at least one specific bit among the last n − 2ℓ positions to produce y, which occurs with probability O(1/n). The probability of producing a non-y individual on the plateau is bounded from below by the probability of the event that recombining x and z produces a bit string with exactly ℓ 0 bits in the first 2ℓ bit positions, none of the bits is mutated, and a majority individual is replaced, that is, at least C(2ℓ, ℓ) 2^{−2ℓ} (1 − χ/n)^n · y/(μ + 1) = Ω(1/√ℓ), where the inequality follows from Stirling's inequality. Taking into account the assumption k = o(n), it holds for sufficiently large n that p*_−(y) ≥ 2p*_+(y).

V. STANDARD MUTATION RATE
We first analyze the (μ + 1) GA with the standard mutation rate 1/n, i.e., χ = 1. We show that the diversity emerging in the (μ + 1) GA leads to a speedup of nearly n compared to the expected time of Θ(n^k) for the (1 + 1) EA [10] and other EAs using only mutation.
Theorem 2: The expected optimization time of the (μ + 1) GA with p_c = 1 and μ ≤ κn, for a sufficiently small constant κ > 0, on Jump_k with k = o(n) is O(n^k/μ + n^{k−1} log μ + μn√k log μ + n√k log n).

For k ≥ 3, the best speedup is of order n/log n, attained for μ = κn. For k = 2, the best speedup is of order √n/log n, attained for μ = Θ(√n/log n). Note that for mutation rate 1/n, the dominant terms in Lemma 3 are equal; hence the size of the largest species performs a fair random walk, up to a bias resulting from small-order terms. This confirms our intuition from observing the simulations. The following lemma formalizes this fact: in steps where the size Y(t) of the largest species changes, an almost fair random walk is performed.
Lemma 5: For the random walk induced by the size of the largest species, conditional on the current size y changing, for μ/2 < y < μ, the probability of increasing y is at most 1/2 + O(1/n), and the probability of decreasing it is at least 1/2 − O(1/n).
Proof: We only have to estimate the conditional probability of increasing y, as the two probabilities sum to 1. The sought probability is given by p_+(y)/(p_+(y) + p_−(y)), which is strictly increasing in p_+(y). Whenever the (μ + 1) GA recombines two parents of Hamming distance larger than 2, the claim on the conditional probabilities follows directly from Lemma 4. Hence, we assume in the following that this does not happen.
Using the upper bound for p_+(y) and the lower bound for p_−(y) from Lemma 3, with implicit constant c_+ in the asymptotic term for p_+, we obtain an upper bound on the sought conditional probability, where in the last step we multiply the last fraction by μ/(μ − y). The numerator is then O(1/n). Since μ/2 < y < μ, we have y(μ + y)/(μ(μ + 1)) = Θ(1), and the claim follows.

We use these transition probabilities to bound the expected time for the random walk to hit μ/2.

Lemma 6: Consider the random walk of Y(t), starting in state X_0 ≥ μ/2. Let T be the first hitting time of state μ/2. Then E[T] = O(μn + μ² log μ).
Proof: For μ/2 < y < μ, the probability of leaving state y is always (regardless of Hamming distances between species) bounded from below by the probability of selecting two y individuals as parents, not flipping any bits during mutation, and choosing a non-y individual for replacement:

(y/μ)² (1 − 1/n)^n (μ − y)/(μ + 1) ≥ (1/4) · (1/4) · 2(μ − y)/(3μ) = (μ − y)/(24μ),

as y ≥ μ/2, μ + 1 ≤ 3μ/2 (since μ ≥ 2), and (1 − 1/n)^n ≥ 1/4 for n ≥ 2. Hence the expected time for leaving state i toward either state i + 1 or state i − 1 is at most 24μ/(μ − i). Using conditional transition probabilities 1/2 ± δ with δ = O(1/n) according to Lemma 5, the expected hitting times E_i satisfy a standard gambler's-ruin recurrence. Introducing α := (1/2 + δ)/(1/2 − δ) = 1 + O(1/n), an induction over the states shows that α^μ = (1 + O(1/n))^{O(n)} = O(1) for μ ≤ κn. Bounding both α^{j−i} and α^{μ−i} in this way and summing the waiting times over all states bounds the total expected hitting time.

Hence, we get E[T] = O(μn + μ² log μ). Now we show that when the largest species has decreased its size to μ/2, there is a good chance that the optimum will be found within the following O(μ²) generations.
Lemma 7: Consider the (μ + 1) GA with p_c = 1 on Jump_k. If the largest species has size at most μ/2 and μ ≤ κn for a small enough constant κ > 0, then the probability that the global optimum is found during the next cμ² generations, for some constant c > 0, is Ω(min{1, μ²/n^(k−1)}).
Proof: We show that during the cμ² generations the size of the largest species never rises above (3/4)μ with at least constant probability. Then we calculate the probability of jumping to the optimum during the phase, given that this happens.
Let X_i, 1 ≤ i ≤ cμ², be random variables indicating the change in the number of individuals of the largest species at generation i. We pessimistically ignore self-loops and assume that the size of the species either increases or decreases in each generation, thus X_i ∈ {−1, +1}. Using the conditional probabilities from Lemma 5, the expected increase in each step is at most (1/2 + O(1/n)) − (1/2 − O(1/n)) = O(1/n). Then the expected increase in the size of the largest species at the end of the phase is at most cμ² · O(1/n) = O(cκμ), which is at most μ/8 when κ is chosen small enough, where we use that μ ≤ κn.
While the size does not exceed (3/4)μ, in every step there is a probability of at least (1/4) · (3/4) = Ω(1) of selecting parents from two different species. As these have Hamming distance 2d for some d ≥ 1, by Lemma 2 the probability of creating the optimum in one generation is Ω(n^(−(k−1))). Finally, by [1, Lemma 10], the probability that at least one successful generation occurs in a phase of cμ² steps, i.e., that the optimum is found in one of these steps, is Ω(min{1, μ²/n^(k−1)}).
We now assemble all lemmas to prove our main theorem of this section.
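The jump event used in the proof can be checked exactly for the smallest Hamming distance (d = 1, i.e., distance 2). The sketch below is an illustration under the assumption of mutation rate 1/n; it enumerates the four uniform-crossover outcomes on the two differing positions of two plateau parents and computes the exact probability that mutation then creates the all-ones optimum.

```python
from itertools import product

def jump_prob_distance_2(n, k):
    """Exact probability that uniform crossover of two plateau parents at
    Hamming distance 2, followed by standard bit mutation with rate 1/n,
    creates the all-ones optimum of Jump_k (illustrative assumption)."""
    # Parents share n - k - 1 one-bits and k - 1 zero-bits; each carries
    # exactly one extra one-bit where the other has a zero.
    p1 = [1] * (n - k - 1) + [0] * (k - 1) + [1, 0]
    p2 = [1] * (n - k - 1) + [0] * (k - 1) + [0, 1]
    total = 0.0
    # Uniform crossover only matters on the two differing positions.
    for b1, b2 in product([0, 1], repeat=2):
        child = p1[: n - 2] + [p1[n - 2] if b1 else p2[n - 2],
                               p1[n - 1] if b2 else p2[n - 1]]
        # Probability that mutation flips exactly the zero-bits and no others.
        pm = 1.0
        for bit in child:
            pm *= (1 - 1 / n) if bit == 1 else (1 / n)
        total += 0.25 * pm  # each crossover mask on 2 bits has probability 1/4
    return total

# For k = 2 the dominant term is (1/4) * (1/n) * (1 - 1/n)**(n - 1), matching
# the Omega(n^{-(k-1)}) rate used in the proof; smaller terms come from the
# other crossover outcomes.
p = jump_prob_distance_2(10, 2)
```

For n = 10 and k = 2 the result is about 0.012, dominated by the crossover outcome that sets both differing bits to one.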
Proof of Theorem 2: The expected time for the whole population to reach the plateau is O(μn + n log n). Once the population is on the plateau, we wait until the largest species has decreased its size to at most μ/2. According to Lemma 6, the time for the largest species to reach size μ/2 is O(μn + μ² log μ). By Lemma 7, the probability that in the next cμ² steps the optimum is found is Ω(min{1, μ²/n^(k−1)}). If not, we repeat the argument. The expected number of such trials is O(1 + n^(k−1)/μ²), and the expected length of one trial is O(μn + μ² log μ) + cμ² = O(μn + μ² log μ). The expected time for reaching the optimum from the plateau is hence at most O((μn + μ² log(μ)) · (1 + n^(k−1)/μ²)). Adding up all times and subsuming the μ² log(μ) terms proves the theorem.
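The nearly unbiased random walk from Lemmas 5 and 6 is easy to probe numerically. The sketch below is an illustration under simplifying assumptions, not the (μ + 1) GA itself: self-loops are ignored (except the forced reflection at the upper boundary μ), and the O(1/n) bias is instantiated with constant 1.

```python
import random

def hitting_time(mu, n, start=None, rng=None):
    """Simulate the conditional +/-1 walk of Lemma 5: from state y, move up
    with probability 1/2 + 1/n (an illustrative instantiation of the
    1/2 + O(1/n) bound) and down otherwise. Returns the number of steps
    until state mu // 2 is hit."""
    rng = rng or random.Random(0)
    y = start if start is not None else mu
    steps = 0
    while y > mu // 2:
        if rng.random() < 0.5 + 1.0 / n:  # upward bias at most 1/2 + O(1/n)
            y = min(y + 1, mu)            # reflect at the upper boundary
        else:
            y -= 1
        steps += 1
    return steps

# For mu << n the walk is nearly fair, so the time to drop from mu to mu/2
# grows roughly quadratically in mu, matching the mu^2 log(mu)-type terms.
rng = random.Random(42)
avg = sum(hitting_time(64, 10_000, rng=rng) for _ in range(200)) / 200
```

For μ = 64 and n = 10000 the averaged hitting time lands near the quadratic ballpark (μ/2)² suggested by the unbiased case.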

VI. HIGH MUTATION RATES
We now consider the runtime of the (μ + 1) GA with mutation rate χ/n = (1 + δ)/n for an arbitrary constant δ > 0. The following theorem states that in this setting the algorithm has at least a linear speedup compared to the (μ + 1) EA without crossover [34]. By assuming a slightly higher mutation rate, we not only obtain a bound that is better by a log-factor than Theorem 2, but the analysis also becomes significantly simpler.
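For concreteness, the algorithm under study can be sketched as follows. This is a minimal illustration, not the paper's exact pseudocode: the random initialization and the tie-breaking in the replacement step are simplifying assumptions, while uniform parent selection, uniform crossover with probability p_c, and standard bit mutation with rate χ/n follow the setting described above.

```python
import random

def jump_k(x, k):
    """Jump_k fitness: a OneMax slope, a gap of width k, optimum at all ones."""
    n, ones = len(x), sum(x)
    return k + ones if ones <= n - k or ones == n else n - ones

def mu_plus_one_ga(n, k, mu, chi=1.0, pc=1.0, max_evals=10**6, seed=0):
    """(mu + 1) GA with uniform crossover and mutation rate chi / n.
    Returns the number of evaluations to reach the optimum, or None."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n)] for _ in range(mu)]
    for evals in range(1, max_evals + 1):
        if rng.random() < pc:
            a, b = rng.choice(pop), rng.choice(pop)  # uniform parent selection
            child = [x if rng.random() < 0.5 else y for x, y in zip(a, b)]
        else:
            child = list(rng.choice(pop))
        # Standard bit mutation: flip each bit independently with prob chi/n.
        child = [1 - bit if rng.random() < chi / n else bit for bit in child]
        if sum(child) == n:
            return evals
        pop.append(child)
        worst = min(jump_k(x, k) for x in pop)
        # Remove a uniformly chosen worst individual (tie-breaking assumption).
        candidates = [i for i, x in enumerate(pop) if jump_k(x, k) == worst]
        pop.pop(rng.choice(candidates))
    return None

evals = mu_plus_one_ga(n=30, k=2, mu=8, chi=1.1, seed=1)
```

Even for small instances such as n = 30 and k = 2, the GA with p_c = 1 and a slightly elevated mutation rate reliably finds the optimum well within the evaluation budget.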
Theorem 3: The (μ + 1) GA with mutation rate (1 + δ)/n, for a constant δ > 0, and population size μ ≥ ck ln(n) for a sufficiently large constant c > 0, has for k = o(n) expected optimization time O(n√k μ log(μ) + μ² + n^(k−1)) on Jump_k.
We again study the random walk corresponding to the size of the largest species on the plateau. For mutation rate 1/n, this is almost an unbiased random walk. For slightly higher mutation rates, we will see that the random walk changes to an unfair random walk in which the size of the largest species decreases by Ω(1/μ) in expectation. Formally, our analysis assumes the following condition.
Condition 1: For a constant δ > 0 and all y with μ/2 ≤ y ≤ μ.
The following lemma (Lemma 8) states that it is sufficient to increase the mutation rate slightly above 1/n to satisfy the diversity condition.
Proof: The first two inequalities follow directly from Lemmas 3 and 4. For the third, Lemma 3 implies, for any constant ε > 0, a lower bound on p−(y). Thus, given that μ/2 < y < μ and χ ≥ 1 + δ for some constant δ > 0, Condition 1 is satisfied when ε is sufficiently small.
Given Condition 1, the additive drift theorem [16] implies that the largest species quickly decreases to half the population size.
Lemma 9: If Condition 1 holds, then the expected time until the largest species has size at most μ/2 is O(μ² + n).
Proof: Let Y(t) denote the size of the largest species at time t. We consider the drift with respect to the distance function h(y) := f(y) + g(y), with the two terms f(y) := y and g(y) := (n/μ)e^(−κ(μ−y)), where κ := ln(1 + δ), over the interval y ∈ [μ/2, μ]. Due to linearity of expectation, we can consider the drift of the two terms f(y) and g(y) separately. The second term g(y) is introduced to handle the case y = μ; it decreases exponentially in μ − y to avoid negative drift in the case y = μ − 1. The total distance is h(μ) − h(μ/2) = O(μ + n/μ); hence, we need to prove that the drift of the process h(Y(t)) is Ω(1/μ).
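The claimed bound on the total distance follows from evaluating both terms at the endpoints of the interval:

```latex
h(\mu) - h(\mu/2)
  = \underbrace{\mu - \mu/2}_{f\text{-term}}
  + \underbrace{\frac{n}{\mu}\left(e^{-\kappa \cdot 0} - e^{-\kappa \mu/2}\right)}_{g\text{-term}}
  \le \frac{\mu}{2} + \frac{n}{\mu}
  = O\!\left(\mu + \frac{n}{\mu}\right).
```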
We first consider the drift with respect to the first term f(y) = y.

VII. EXPERIMENTS

The number of fitness evaluations until the global optimum is found, averaged over all runs, is reported as the runtime. The population size is set to μ = 4e ln n, so that a realistic population of at least 40 individuals is always used (e.g., even for n = 50 in Fig. 3). The impact of the jump length k on the runtime is illustrated in Fig. 4(a). The experiment was set up with n in [100, . . ., 5000] (with a step size of 100) and k in {3, 4, 5}. We notice that increasing k does not imply a large change in the average runtime; the average runtime seems to still scale linearly with n in this setting, even for k = 4. Fixing k = 3, we also experimented with different mutation rates, i.e., p_m in {0.9/n, 1.0/n, 1.1/n, 2.0/n}. The results are displayed in Fig. 4(b). We notice that mutation rates above 1/n reduce the average runtime, while a slightly lower mutation rate increases it considerably. With mutation rate 2/n, both the average runtime and the stability of the runs are distinctly improved.

A. Impact of Crossover and Mutation Rates
On the other hand, an excessive increase of the mutation rate may deteriorate the average runtime, because multiple bits are then likely to flip, which implies harmful mutations. This can be observed in the experiment depicted in Fig. 5 (in log-scale) for n = 500. In this experiment, k is in {2, 3, 4}, and the range of χ = p_m · n is set to [0.6, . . ., 8] (with a step size of 0.1). We note that the larger k is, the more pronounced the negative effect of high mutation rates becomes. Moreover, too low mutation rates are also bad for the runtime. This matches our theoretical analysis, in which a low mutation rate could have made the random walk associated with the size of the largest species biased toward the wrong direction. This may lead to a reduction of the population diversity and the loss of the benefit from crossover.

B. Comparison With the Use of Diversity Mechanisms
In a previous study [7], we have shown that many common mechanisms to preserve population diversity can significantly reduce the expected optimization time of the (μ + 1) GA (with standard mutation rate) on Jump_k when crossover is enabled. The aim of this section is to compare, by experiments, the setting of a high mutation rate with the results taken directly from [7] for six mechanisms: duplicate minimization, duplicate elimination, deterministic crowding, convex hull maximization, fitness sharing, and the island model.
Fig. 6. Performance of the diversity mechanisms for jump length 4; the mutation rate p_m is set to 1/n unless specified.
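As an illustration, one of the simpler mechanisms, duplicate elimination, can be sketched as a replacement rule. This is a generic reading of the idea (prefer deleting genotypic duplicates among the worst individuals), not the exact procedure evaluated in [7]:

```python
def replace_with_duplicate_elimination(pop, fitness):
    """(mu + 1) replacement step: among the worst-fitness individuals, prefer
    to delete one that duplicates another population member, preserving
    diversity. Generic sketch, not the exact procedure of [7]."""
    scores = [fitness(x) for x in pop]
    worst = min(scores)
    worst_idx = [i for i, s in enumerate(scores) if s == worst]
    for i in worst_idx:
        # A duplicate is an individual genotypically identical to another one.
        if any(pop[i] == pop[j] for j in range(len(pop)) if j != i):
            pop.pop(i)
            return pop
    pop.pop(worst_idx[0])  # no duplicate among the worst: remove a worst one
    return pop

# Example: two identical all-zero strings tie for worst; a duplicate goes.
pop = [[0, 0], [0, 0], [1, 0]]
pop = replace_with_duplicate_elimination(pop, fitness=sum)
```

The rule only changes which individual is deleted, so it composes with any parent selection and variation operators.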
Again full crossover is enabled (p_c = 1.0), but the problem size n is varied in [100, 1000] (with a step size of 25). The results for k = 4 are shown in Fig. 6, which also includes the setting of the (μ + 1) GA with standard mutation rate and without any diversity mechanism as a reference. Here the high mutation rate is set to p_m = 2.6/n (the best choice for n = 500 and k = 4, as suggested by Fig. 5). An interesting observation from the experimental results is that the high-mutation-rate setting can apparently be as efficient as the implementation of specific diversity mechanisms. Specifically, in Fig. 6 the high-mutation-rate setting is only worse than convex hull maximization and fitness sharing.

VIII. CONCLUSION
A rigorous analysis of the (μ + 1) GA has been presented showing how combining the use of crossover with that of mutation considerably speeds up the runtime for Jump k compared to algorithms using mutation only.
It is traditionally believed that crossover is useful only in the presence of sufficient diversity, and the emergence of this diversity is typically attributed to the mutation operator [11], [15], [35]. In general, the dynamics of mutation and crossover are vastly complex, and the question of how the two operators interact to balance exploration and exploitation has been open for decades [30]. Nevertheless, previous theoretical results on the benefit of crossover have relied solely on mutation for establishing the diversity necessary for recombination. For example, on the Jump_k function (with the exception of our own work in [7]), proofs have required an unrealistically small crossover probability in order to force long phases during which mutation alone builds up enough diversity before a useful crossover operation can be applied.
Diversity can also be enforced using artificial mechanisms, and such techniques lead to more efficient evolutionary algorithms both empirically [4], [32] and theoretically [13], [28]. Artificially enforced diversity can also be used in proofs that crossover is beneficial without having to rely on mutation alone to create sufficient variation [7].
The question to what degree the interplay between crossover and mutation promotes the natural emergence of diversity in the population has so far been open. Our analysis shows that this interplay on the plateau of local optima of the Jump_k function quickly leads to a burst of diversity that is then exploited by both operators to reach the global optimum.
The balance between the amount of mutation and crossover impacts the runtime considerably. While mutation rates lower than the standard 1/n rate considerably increase the expected runtime, rates that are slightly higher than 1/n lead to improved performance. These rates also depend on the presence of crossover. For instance, for k = 4, the best rate for a mutation-only algorithm is 4/n, while the best rate for the (μ + 1) GA with p_c = 1 is considerably lower than 4/n and higher than 1/n.
It is an open problem for future work whether crossover can lead to more than linear speedups on Jump_k for realistic crossover probabilities. Our analysis could be improved by taking into account crossover between plateau individuals with Hamming distance larger than 2. For large k, this could lead to super-linear speedups. In fact, our experiments reveal that the average runtime of the (μ + 1) GA does not increase considerably when k is increased from 2 to 4. However, completely new techniques may be required to improve our analysis. Finally, future work should address the interplay between mutation and crossover on fitness landscapes with different characteristics than the Jump_k function, such as those featuring neutral networks.

Fig. 1. Illustration of the Jump_k fitness function for the case n = 10 and k = 3, including the levels A_1, . . ., A_10 defined in the proof of Lemma 1.

Fig. 3. (a) and (b): Performance of the GA (p_c = 1.0) compared to the algorithm using only mutation (p_c = 0.0) under the same setting (p_m = 1/n). The range of n in this experiment is [50, . . ., 300] with a step size of 10, and k is in {2, 3}. Even with these small values of k and n, a strong reduction of the average runtime can be observed, up to a multiplicative factor of 10^4.

TABLE I. Some examples of runtime bounds we obtain for the (μ + 1) GA on Jump_k.