Self-adjusting Population Sizes for the $(1, \lambda)$-EA on Monotone Functions

We study the $(1,\lambda)$-EA with mutation rate $c/n$ for $c\le 1$, where the population size is adaptively controlled with the $(1:s+1)$-success rule. Recently, Hevia Fajardo and Sudholt have shown that this setup with $c=1$ is efficient on OneMax for $s<1$, but inefficient if $s \ge 18$. Surprisingly, the hardest part is not close to the optimum, but rather at linear distance. We show that this behavior is not specific to OneMax. If $s$ is small, then the algorithm is efficient on all monotone functions, and if $s$ is large, then it needs superpolynomial time on all monotone functions. In the former case, for $c<1$ we show an $O(n)$ upper bound for the number of generations and $O(n\log n)$ for the number of function evaluations, and for $c=1$ we show $O(n\log n)$ generations and $O(n^2\log\log n)$ evaluations. We also show formally that optimization is always fast, regardless of $s$, if the algorithm starts in proximity of the optimum. All results also hold in a dynamic environment where the fitness function changes in each generation.


Introduction
Randomized Optimization Heuristics (ROHs) like evolutionary algorithms (EAs) are simple general-purpose optimizers. One of their strengths is that they can often be applied with little adaptation to the problem at hand. However, ROHs usually come with parameters, and their efficiency often depends on the parameter settings. Therefore, parameter control is a classical topic in the design and analysis of ROHs [2]. It aims at providing methods to automatically tune parameters over the course of optimization. The goal is not to remove parameters altogether; the parameter control mechanisms themselves introduce new meta-parameters. Nevertheless, there are two objectives that can sometimes be achieved with parameter control mechanisms.
Firstly, some ROHs are rather sensitive to small changes in the parameters, and an inadequate setting can slow down or even prevent success. Two examples that are relevant for this paper are the (1, λ)-EA, which fails to optimize even the easy OneMax benchmark if λ is too small [3,4,5], and the (1 + 1)-EA, which fails on monotone functions if the mutation rate is too large [6,7,8]. In both cases, changing the parameters just by a constant factor makes all the difference between finding the optimum in time O(n log n), and not even finding an ε-approximation of the optimal solution in polynomial time. So these algorithms are extremely sensitive to small changes of parameters. In such cases, one hopes that performance is more robust with respect to the meta-parameters, i.e., that the parameter control mechanism manages to find a decent parameter setting regardless of its meta-parameters.
Secondly, often there is no single parameter setting that is optimal throughout the course of optimization. Instead, different phases of the optimization process profit from different parameter settings, and the overall performance with dynamically adapted parameters is better than for any static parameters [9,10,11,12,13]. This topic, which has always been studied in continuous optimization, has taken longer to gain traction in discrete domains [14,15,16,11] but has attracted increasing interest in recent years [17,18,19,20,21,22,23,24,25,26,27,28,29,30]. Instead of a detailed discussion we refer the reader to the book chapter [31] for an overview of theoretical results, and to [32] for a discussion of some recent developments.
One of the most traditional and influential methods for parameter control is the (1 : s + 1)-success rule [33], independently developed several times [34,35,36] and traditionally used with s = 4 as the one-fifth rule in continuous domains, e.g. [37]. This rule has been used for controlling the offspring population size in discrete domains [11,16], in particular for the (1, λ)-EA [32,38], where it yields the so-called self-adjusting (1, λ)-EA or SA-(1, λ)-EA, also called (1, {λ/F, F^{1/s} λ})-EA. As in the basic (1, λ)-EA, in each generation the algorithm produces λ offspring, and selects the fittest of them as the unique parent for the next generation. The difference to the basic (1, λ)-EA is that the parameter λ is replaced by λ/F if the fittest offspring is fitter than the parent, and by λ · F^{1/s} otherwise. We will give a more thorough discussion of this algorithm in Section 2 below. Thus, the (1 : s + 1)-success rule replaces the parameter λ by two parameters s and F. As outlined above, there are two hopes associated with this scheme: (i) that the performance is more robust with respect to F and s than with respect to λ; (ii) that the scheme can adaptively find the locally optimal value of λ throughout the course of optimization.
Recently, Hevia Fajardo and Sudholt have investigated both hypotheses on the OneMax benchmark [38,32]. They found a negative result for (i), and a (partial) positive result for (ii). The negative result says that performance is at least as fragile with respect to the parameters as before: if s < 1, then the SA-(1, λ)-EA finds the optimum of OneMax in O(n) generations, but if s ≥ 18 and F ≤ 1.5 the runtime becomes exponential with overwhelming probability. Experimentally, they find that the range of bad parameter values even seems to include the standard choice s = 4, which corresponds to the 1/5-rule. On the other hand, they show that for s < 1, the algorithm successfully achieves (ii): they show that the expected number of function evaluations is O(n log n), which is optimal among all unary unbiased blackbox algorithms [12,39]. Moreover, they show that the algorithm makes steady progress over the course of optimization, needing O(b−a) generations to increase the fitness from a to b whenever b − a ≥ C log n for a suitable constant C. The crucial point is that this is independent of a and b, so independent of the current state of the algorithm. It implies that the algorithm chooses λ = O(1) in early stages when progress is easy, and (almost) linear values λ = Ω(n) in the end when progress is hard. Thus, it achieves (ii) conditional on having appropriate parameter settings. Interestingly, it is shown in [32] that for s ≥ 18, the SA-(1, λ)-EA fails in a region far away from the optimum, more precisely in the region with 85% one-bits. Consequently, it also fails for every other function that is identical with OneMax in the range of [0.84n, 0.85n] one-bits, which includes other classical benchmarks like Jump, Cliff, and Ridge. It is implicit that the algorithm would be efficient in regions that are closer to the optimum. This is remarkable, since usually optimization is harder close to the optimum. 
Such a reversed failure profile has previously only been observed in very few situations. One is the (µ + 1)-EA with mutation rate c/n for an arbitrary constant c > 0 on certain monotone functions. This algorithm is efficient close to the optimum, but fails to cross some region in linear distance of the optimum if µ > µ_0 for some µ_0 that depends on c [40]. A similar phenomenon has been shown for µ = 2 and a specific value of c in the dynamic environment Dynamic BinVal [41,42]. These are the only examples for this phenomenon that the authors are aware of.
A limitation of [32] is that it studies only a single benchmark, the OneMax function. Although the negative result also holds for functions that are identical to OneMax in some range, the agreement with OneMax in this range must be perfect, and the positive result does not extend to other functions in such a way. This leaves the question of what happens for larger classes of benchmarks:
(a) Is there a safe choice for s that makes the algorithm efficient for a whole class of functions?
(b) Does the positive result (ii) extend to other benchmarks than OneMax?
In this paper, we will answer both questions with Yes for the set of all (strictly) monotone pseudo-Boolean functions, i.e., functions where flipping a zero-bit into a one-bit always increases the fitness.¹ This is a very large class; for example, it contains all linear functions. In fact, all our results hold in an even more general dynamic setting: the fitness function may be different in each generation, as long as it is a monotone function every time and therefore shares the same global optimum (1, . . . , 1). We show an upper bound of O(n) generations (Theorem 18) and O(n log n) function evaluations (Theorem 28) if the mutation rate is c/n for some c < 1, which is a very natural assumption for monotone functions as many algorithms become inefficient for large values of c [7,8,45,46]. Those results are as strong as the positive results in [32], except that we replace the constant "1" in the condition s < 1 by a different constant that may depend on c. For c = 1 we still show a bound of O(n log n) generations (Theorem 19) and O(n^2 log log n) evaluations (Theorem 29). It is in line with general frameworks for elitist algorithms that the number of function evaluations stops being quasi-linear [47,7], although the bounds in other contexts are better than quadratic [48].
¹ Shortly after our work, Hevia Fajardo and Sudholt also provided such a class in [43]. This is the class of everywhere hard functions, for which the chance of creating a strict improvement does not exceed n^{−ε} anywhere in the search space. This includes the popular LeadingOnes benchmark, but not OneMax. In fact, it does not include any monotone function, since from the all-zero string (and from all other strings with Ω(n) zero-bits) the probability of an improvement is Ω(1) on monotone functions. Hence, their class is disjoint from ours. In [43] it was shown that for any constant s the SA-(1, λ)-EA imitates the elitist SA-(1 + λ)-EA on everywhere hard functions, which by design can never lose fitness. This arguably makes the comma variant a bit pointless for everywhere hard functions, since its potential benefit of escaping from local optima [44] is suppressed in this case.
Both parts of the answer are encouraging news for the SA-(1, λ)-EA. It means that, at least for this class of benchmarks, there is a universal parameter setting that works in all situations. This resembles the role of the mutation rate c/n for the (1 + 1)-EA on monotone functions: If c < 1 + ε, then the (1 + 1)-EA is efficient on all monotone functions [7,8,45], and for c < 1 this is known for many other algorithms as well [46]. On the other hand, the (µ + 1)-EA is an example where such a safe parameter choice for c does not exist: for any c > 0 there is µ such that the (µ + 1)-EA with mutation rate c/n needs super-polynomial time to find the optimum of some monotone functions.
We do not just strengthen the positive result, but we show that the negative result generalizes in a surprisingly strong sense, too: for any arbitrary mutation rate c/n where c < 1, if s is sufficiently large, then the SA-(1, λ)-EA needs exponential time on every monotone function (Theorem 40). Thus, the failure mode for large s is not specific to OneMax. On the other hand, we also generalize the result (implicit in [32]) that the only hard region is in linear distance from the optimum: for any value of s, if the algorithm starts close enough to the optimum (but still in linear distance), then with high probability it optimizes every monotone function efficiently (Theorem 35). Finally, we complement the theoretical analysis with simulations in Section 7. These simulations show another interesting aspect: in a 'middle region' of s, it seems to depend on F whether the algorithm is efficient or not. Thus, we conjecture that there does not exist an efficiency threshold s_0 such that all parameters s < s_0 and F > 1 are efficient, while all s > s_0 and F > 1 are inefficient. Note that our results show that this dependency on F can only appear in a 'middle region', since our results for small s and large s are independent of F (which improves the negative result in [32]). A similar effect was observed for the self-adjusting (1 + (λ, λ))-GA in [49], but for different reasons. There, the effect was caused by a universal upper bound on the success probability of ≈ 0.31 < 1, independent of λ. This can cause problems if the target success rate is larger than 0.31, and thus unachievable, see [49, Section 6.4] for a full discussion. In our setting, the success probability approaches one as λ grows, so the problem does not exist, and the reason for the impact of F seems different.
Our proofs build on ideas from [32]. In particular, we use a potential function of the form g(x^t, λ^t) = Zm(x^t) + h(λ^t), where Zm(x^t) is the number of zero-bits in x^t and h(λ^t) is a penalty term for small values of λ^t. Similar decompositions have been used before [50]. The exact form of h depends on the situation; sometimes it is very similar to the choices in [32] (positive result for c < 1, negative result), but some cases are completely different (positive results for c = 1 and close to the optimum). With these potential functions, we obtain a positive or negative drift and upper or lower bounds on the number of generations, depending on the situation. When translating the number of generations into the number of function evaluations, while some themes from [32] reappear (e.g., to consider the best-so-far Zm value), the overall argument is different. In particular, we do not use the ratchet argument from [32], see Remark 31 for a discussion of the reasons.

Discussion of the SA-(1, λ)-EA
Let us give a short explanation of the concept of the (1 : s + 1)-success rule (or (1 : s + 1)-rule for short). For given λ and given position x in the search space, the algorithm has some success probability p, where success means that f(y) > f(x) for the fittest of λ offspring y of x. For simplicity we will ignore the rounding effect coming from λ ∈ N and will assume that p(λ) ≤ 1/(s + 1) for λ = 1, and that 0 < p < 1. The success probability p = p(λ) is obviously an increasing function in λ, since additional offspring can only increase the chances of finding an improvement. Moreover, it is strictly increasing due to 0 < p < 1. Hence there is a value λ* such that p(λ) < 1/(s + 1) for λ < λ* and p(λ) > 1/(s + 1) for λ > λ*. Now consider the potential log_F λ. This potential decreases by 1 with probability p and increases by 1/s with probability 1 − p. So in expectation it changes by −p + (1 − p)/s = (1 − (s + 1)p)/s. Hence, the expected change is positive if λ < λ* and negative if λ > λ*. Therefore, λ has a drift towards λ* from both sides (in a logarithmic scaling). So the rule implicitly has a target population size λ*, and this population size λ* corresponds to the target success rate p = 1/(s + 1).
Note that a drift towards λ* does not necessarily imply that λ always stays close to λ*. Firstly, p depends on the current state x of the algorithm, and might vary rapidly as the algorithm progresses (though this does not seem a very typical situation). In this case, the target value λ* also varies.
Secondly, even if λ* remains constant, there may be random fluctuations around this value, see [51,52] for treatments on when drift towards a target guarantees concentration. However, we note that the (1 : s + 1)-rule for controlling λ gives stronger guarantees than the same rule for controlling other parameters like step size or mutation rate. The difference is that other parameters do not necessarily influence p in a monotone way, and therefore we cannot generally guarantee that there is a drift towards success probability 1/(s + 1) when the (1 : s + 1)-rule is used to control them. Only when controlling λ are we guaranteed a drift in the right direction.
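The drift computation above can be illustrated numerically. The following is a minimal sketch, assuming a hypothetical toy model in which each offspring is an improvement independently with a fixed probability q, so that p(λ) = 1 − (1 − q)^λ; this model and the function names are illustrative assumptions and are not part of the analysis in this paper.

```python
import math

def expected_log_change(p: float, s: float) -> float:
    """Expected one-step change of log_F(lambda): -1 w.p. p, +1/s w.p. 1 - p."""
    return -p + (1.0 - p) / s

def toy_success_probability(lam: float, q: float) -> float:
    """Hypothetical model: each of lam offspring improves independently w.p. q."""
    return 1.0 - (1.0 - q) ** lam

def target_population_size(q: float, s: float) -> float:
    """Solve 1 - (1 - q)^lam = 1/(s + 1) for lam in the toy model."""
    return math.log(s / (s + 1.0)) / math.log(1.0 - q)

if __name__ == "__main__":
    s, q = 4.0, 0.01
    lam_star = target_population_size(q, s)
    for lam in (lam_star / 2, lam_star, 2 * lam_star):
        drift = expected_log_change(toy_success_probability(lam, q), s)
        print(f"lambda = {lam:8.2f}, expected change of log_F(lambda) = {drift:+.4f}")
```

In this toy model the expected change is positive below the target value, zero at it, and negative above it, which is the two-sided drift towards λ* described above.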

Preliminaries and Definitions
Throughout the paper we will assume that c > 0, s > 0 and F > 1 are constants independent of n while n → ∞. Note that s need not be an integer. Our search space is always {0, 1}^n, and we denote by supp(x) := {i ∈ [n] | x_i = 1} the support of a bit string x ∈ {0, 1}^n. We say that an event E = E(n) holds with high probability or whp if Pr[E] → 1 for n → ∞. We denote the negation of an event E by Ē, and by 1_E the indicator of E, i.e., 1_E = 1 if E holds and 1_E = 0 otherwise.

The Algorithm: SA-(1, λ)-EA
We will consider the self-adjusting (1, λ)-EA with (1 : s + 1)-success rule, with mutation rate c/n, success ratio s and update strength F, and we denote this algorithm by SA-(1, λ)-EA. It is given by the following pseudocode. Note that the parameter λ may take non-integral values during the execution of the algorithm; however, the number of children generated at each step is chosen to be the closest integer ⌊λ⌉ to λ.

Algorithm 1 SA-(1, λ)-EA with success rate s, update strength F and mutation rate c/n for maximizing a fitness function f : {0, 1}^n → R.
Initialization: Choose x^0 ∈ {0, 1}^n uniformly at random and λ^0 := 1
Optimization: for t = 0, 1, . . . do
  Mutation: for j ∈ {1, . . . , ⌊λ^t⌉} do
    y^{t,j} ← mutate(x^t) by flipping each bit independently with prob. c/n
  Selection: Choose y^t = arg max_j f(y^{t,j}), breaking ties randomly
  Update: if f(y^t) > f(x^t) then λ^{t+1} ← max{1, λ^t/F}, else λ^{t+1} ← F^{1/s} λ^t; x^{t+1} ← y^t

We will often omit the index t if it is clear from the context.
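To make the description of the algorithm concrete, the following is a minimal Python sketch of one possible implementation. The function names, the default parameter values, the generation budget, the early stop at the all-ones string, and the floor of 1 on λ are assumptions of this sketch; only offspring evaluations are counted.

```python
import random
from typing import Callable, List, Optional, Tuple

def one_max(x: List[int]) -> int:
    """OneMax: number of one-bits (a strictly monotone function)."""
    return sum(x)

def sa_one_comma_lambda_ea(
    f: Callable[[List[int]], float],
    n: int,
    c: float = 0.5,
    s: float = 0.5,
    F: float = 1.5,
    max_generations: int = 100_000,
    rng: Optional[random.Random] = None,
) -> Tuple[List[int], int, int]:
    """Return (final search point, generations used, offspring evaluations)."""
    rng = rng or random.Random()
    x = [rng.randint(0, 1) for _ in range(n)]
    lam = 1.0
    evaluations = 0
    for generation in range(max_generations):
        if all(x):  # all-ones string: optimum of every monotone function
            return x, generation, evaluations
        fx = f(x)
        num_offspring = max(1, int(lam + 0.5))  # closest integer to lam
        best: List[int] = []
        best_fitness = float("-inf")
        ties = 0
        for _ in range(num_offspring):
            # Standard bit mutation: flip each bit independently w.p. c/n.
            y = [bit ^ int(rng.random() < c / n) for bit in x]
            fy = f(y)
            evaluations += 1
            if fy > best_fitness:
                best, best_fitness, ties = y, fy, 1
            elif fy == best_fitness:
                ties += 1
                if rng.randrange(ties) == 0:  # uniform tie-breaking
                    best = y
        # (1 : s+1)-success rule; the floor at 1 is an assumption of this sketch.
        if best_fitness > fx:
            lam = max(1.0, lam / F)
        else:
            lam = lam * F ** (1.0 / s)
        x = best  # comma selection: the parent is always replaced
    return x, max_generations, evaluations
```

A call such as `sa_one_comma_lambda_ea(one_max, n=30, rng=random.Random(1))` runs the sketch on OneMax; since the selection is non-elitist, the fitness of the parent may decrease between generations.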

The Benchmark: Dynamic Monotone Functions
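Whenever we speak of "monotone" functions in this paper, we mean strictly monotone pseudo-Boolean functions, defined as follows.
Definition 1. A function f : {0, 1}^n → R is (strictly) monotone if f(x) < f(y) for all x ≠ y ∈ {0, 1}^n with x_i ≤ y_i for all i ∈ [n].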
In this paper we will consider the following set of benchmarks. For each t ∈ N, let f^t : {0, 1}^n → R be a monotone function that may change at each step depending on x^t. Then the selection step in the t-th generation of Algorithm 1 is performed with respect to f^t. By slight abuse of notation we will still speak of a dynamic monotone function f.
All our results (positive and negative) hold in this dynamic setup. This set of benchmarks is quite general. Of course, it contains the static setup in which we only have a single monotone function to optimize, which includes linear functions and OneMax as special cases. It also contains the setup of Dynamic Linear Functions (originally introduced as Noisy Linear Functions in [53]) and Dynamic BinVal [41,42]. On the other hand, all monotone functions share the same global optimum (1, . . . , 1), have no local optima, and flipping a zero-bit into a one-bit strictly improves the fitness. In the dynamic setup, these properties still hold "locally", within each selection step. Thus, the setup falls into the general framework by Jansen [47], which was extended to the partially ordered EA (PO-EA) by Colin, Doerr, and Férey [48]. This implies that the (1 + 1)-EA with mutation rate c/n finds the optimum of every such Dynamic Monotone Function in expected time O(n log n) if c < 1, and in time O(n^{3/2}) if c = 1.
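As an illustration of this benchmark class, the sketch below draws a fresh linear function with strictly positive weights in every generation, which is one simple instance of a dynamic monotone function. The weight range [1, 2] and the helper names are assumptions made for this example; the brute-force monotonicity check is only feasible for very small n.

```python
import random
from typing import Callable, List

def sample_dynamic_linear_function(
    n: int, rng: random.Random
) -> Callable[[List[int]], float]:
    """Draw one linear function with strictly positive weights (assumed range [1, 2])."""
    weights = [rng.uniform(1.0, 2.0) for _ in range(n)]

    def f(x: List[int]) -> float:
        return sum(w * bit for w, bit in zip(weights, x))

    return f

def is_strictly_monotone(f: Callable[[List[int]], float], n: int) -> bool:
    """Brute force: every single 0 -> 1 flip must strictly increase f.

    By following a chain of single-bit flips, this implies f(x) < f(y)
    whenever x != y and x <= y component-wise.
    """
    for mask in range(2 ** n):
        x = [(mask >> i) & 1 for i in range(n)]
        fx = f(x)
        for i in range(n):
            if x[i] == 0:
                y = x.copy()
                y[i] = 1
                if f(y) <= fx:
                    return False
    return True

if __name__ == "__main__":
    rng = random.Random(0)
    # A new function f^t for each generation t; all share the optimum (1, ..., 1).
    for t in range(3):
        f_t = sample_dynamic_linear_function(5, rng)
        print(t, is_strictly_monotone(f_t, 5), f_t([1] * 5))
```

Each sampled function has a different fitness landscape, but all of them are maximised by the all-ones string.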

Drift Analysis and Potential Functions
Drift analysis is a key instrument in the theory of EAs. To apply it, one must define a potential function and compute the expected change of this potential. Common potentials for simple problems in EAs are the OneMax and ZeroMax potentials of the current state x^t, which assign to each search point x ∈ {0, 1}^n the number of one-bits and zero-bits, respectively:
Om(x) := Σ_{i=1}^n x_i and Zm(x) := n − Om(x).
Note that for two bit strings x and y, Om(|x − y|) computes their Hamming distance, where the difference and absolute value are taken component-wise.
For our purposes, this potential function will not be sufficient since there is an intricate interplay between progress and the value of λ. Following [38,32], we use a composite potential function of the form g(x^t, λ^t) = Zm(x^t) + h(λ^t), where h(λ^t) varies from application to application (Definitions 20, 24, 36, 41). We will write Z^t := Zm(x^t), H^t := h(λ^t) and G^t := g(x^t, λ^t) throughout the paper.
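The potentials above are straightforward to compute. The following sketch shows them in Python; the penalty term passed as `h` in the usage example is a purely hypothetical placeholder and is not one of the penalty terms defined later in the paper.

```python
from typing import Callable, List

def om(x: List[int]) -> int:
    """OneMax potential: number of one-bits."""
    return sum(x)

def zm(x: List[int]) -> int:
    """ZeroMax potential: number of zero-bits."""
    return len(x) - sum(x)

def hamming_distance(x: List[int], y: List[int]) -> int:
    """Om(|x - y|), with difference and absolute value taken component-wise."""
    return om([abs(a - b) for a, b in zip(x, y)])

def composite_potential(
    x: List[int], lam: float, h: Callable[[float], float]
) -> float:
    """g(x, lambda) = Zm(x) + h(lambda) for a caller-supplied penalty term h."""
    return zm(x) + h(lam)

if __name__ == "__main__":
    x = [1, 0, 0, 1]
    # Hypothetical penalty that is large for small lambda (illustration only).
    print(composite_potential(x, 2.0, lambda lam: 1.0 / lam))
```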
Once the drift is established, the positive and negative statements about generations then follow from standard drift analysis [52]. In particular, we will use the Additive, Multiplicative, and Negative Drift Theorem, given below.
Theorem 2 (Additive Drift Theorem [52]). Let (X_t)_{t≥0} be a sequence of non-negative random variables over a bounded state space S ⊂ R_0^+ containing the origin and let T := inf{t ≥ 0 | X_t = 0} denote the hitting time of 0. Assume there exists δ > 0 such that for all t < T,
E[X_t − X_{t+1} | X_t] ≥ δ.
Then E[T] ≤ E[X_0]/δ.
Theorem 3 (Multiplicative Drift Theorem [54]). Let (X_t)_{t≥0} be a sequence of non-negative random variables over a bounded state space S ⊂ R_0^+ containing the origin and such that x_min := min{x ∈ S : x > 0} is well defined. Let T := inf{t ≥ 0 | X_t ≤ 0} denote the hitting time of 0. Suppose that there exists a constant δ > 0 such that for all t < T,
E[X_t − X_{t+1} | X_t] ≥ δ X_t.
Then, E[T | X_0] ≤ (1 + ln(X_0/x_min))/δ.
Theorem 4 (Negative Drift Theorem With Scaling [55]). Let (X_t)_{t≥0} be a sequence of random variables over a state space S ⊂ R. Suppose there exists an interval [a, b] ⊆ R and, possibly depending on ℓ := b − a, a drift bound ε := ε(ℓ) > 0 as well as a scaling factor r := r(ℓ) such that for all t ≥ 0 the following three conditions hold:
1. E[X_{t+1} − X_t | X_0, . . . , X_t; a < X_t < b] ≥ ε,
2. Pr[|X_{t+1} − X_t| ≥ jr | X_0, . . . , X_t; a < X_t] ≤ e^{−j} for all j ∈ N_0,
3. 1 ≤ r^2 ≤ εℓ/(132 log(r/ε)).
Then for the first hitting time T* := min{t ≥ 0 : X_t ≤ a | X_0 ≥ b} it holds that Pr[T* ≤ e^{εℓ/(132 r^2)}] = O(e^{−εℓ/(132 r^2)}).
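As a sanity check of how such drift bounds are used, the sketch below compares the multiplicative drift bound in its standard form E[T | X_0] ≤ (1 + ln(X_0/x_min))/δ with an exactly computable hitting time. The example process is randomized local search on OneMax (flip one uniformly random bit, accept only improvements), which is not studied in this paper and is used here only as an assumption-free textbook case: with Z zero-bits the drift is exactly Z/n, and the expected hitting time is the sum of n/k over k = 1, ..., Z_0.

```python
import math

def rls_expected_hitting_time(n: int, z0: int) -> float:
    """Exact expected time of RLS on OneMax from z0 zero-bits: sum_{k=1}^{z0} n/k."""
    return sum(n / k for k in range(1, z0 + 1))

def multiplicative_drift_bound(x0: float, x_min: float, delta: float) -> float:
    """Upper bound (1 + ln(x0 / x_min)) / delta on the expected hitting time of 0."""
    return (1.0 + math.log(x0 / x_min)) / delta

if __name__ == "__main__":
    n, z0 = 100, 50
    exact = rls_expected_hitting_time(n, z0)
    bound = multiplicative_drift_bound(z0, 1.0, 1.0 / n)
    print(f"exact expectation: {exact:.1f}, drift bound: {bound:.1f}")
```

The bound is only slightly larger than the exact expectation here, which illustrates why multiplicative drift is the natural tool when progress becomes harder close to the target.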

Concentration of Hitting Times
In our analysis, we will prove concentration of the number of steps needed to improve the number of 1-bits. We will use the following results of Kötzing [51], which we slightly reformulate for convenience.
Definition 5 (Sub-Gaussian [51]). Let (X_t)_{t≥0} be a sequence of random variables and F = (F_t)_{t≥0} an adapted filtration. We say that (X_t)_{t≥0} is (γ, δ)-sub-Gaussian if
E[e^{z X_{t+1}} | F_t] ≤ e^{γ z^2/2}
for all t and all z ∈ [0, δ].
Theorem 6 (Tail Bounds Imply Sub-Gaussian [51]). For every 0 < α, 0 < β < 1, there exist γ, δ > 0 such that the following holds. Let (X_t)_{t≥0} be a random sequence and F an adapted filtration. Assume that E[X_{t+1} | F_t] ≤ 0 for all times t and that for all x ≥ 0 we have
Pr[|X_{t+1}| > x | F_t] ≤ α/(1 + β)^x.
Then (X_t)_{t≥0} is (γ, δ)-sub-Gaussian.
Theorem 7 (Concentration of Hitting Times [51]). For every γ, δ, ε > 0 there exists a D > 0 such that the following holds. Let (X_t)_{t≥0} be a random sequence and F an adapted filtration satisfying the following properties:
1. E[X_{t+1} | F_t] ≥ ε for all times t;
2. (ε − X_t)_{t≥0} is (γ, δ)-sub-Gaussian.
Let T denote the first point in time when Σ_{t=1}^T X_t ≥ N, then for all τ ≥ 2N/ε,
Pr[T ≥ τ] ≤ exp(−Dτ).
To prove concentration of the number of steps spent improving the fitness under multiplicative drift, we will use the following theorem.
Theorem 8 (Multiplicative Drift, Tail Bound [56]). Let (X_t)_{t≥0} be non-negative random variables over a state space S ⊂ R_0^+. Assume that X_0 ≤ b and let T be the random variable that denotes the first point in time t ∈ N for which X_t ≤ a, for some a ≤ b. Suppose that there exists δ > 0 such that for all t < T,
E[X_t − X_{t+1} | X_t] ≥ δ X_t.
Then, for all r > 0,
Pr[T > (r + ln(b/a))/δ] ≤ e^{−r}.

Further Tools
We will use the FKG inequality (Fortuin-Kasteleyn-Ginibre inequality), which is a standard tool in percolation theory, but less commonly used in the theory of EAs. We only give a special case, which is known as the Harris inequality.
Theorem 9 (FKG inequality [57, Section 2.2]). Let I be a finite set, and consider a product probability space Ω = ∏_{i∈I} Ω_i, where all Ω_i have binary sample space {0, 1}. A real-valued random variable X is called increasing if X(ω) ≤ X(ω′) holds for all elementary events ω, ω′ in Ω with ω_i ≤ ω′_i for all i ∈ I. It is called decreasing if −X is increasing.
1. If two random variables X, Y are both increasing or are both decreasing, then E[XY] ≥ E[X] E[Y].
2. If X is increasing and Y is decreasing, or vice versa, then E[XY] ≤ E[X] E[Y].
We also say that X and Y are positively correlated and negatively correlated in the first and second case respectively. Note that the FKG inequality also applies to probabilities. E.g., if A, B are increasing events (which just means that their indicators 1_A and 1_B are increasing), then Pr[A ∩ B] ≥ Pr[A] Pr[B].
To switch between differences and exponentials, we will frequently make use of the following estimates, taken from Lemma 1.4.2 to Corollary 1.4.6 in [58].
1. For all r ≥ 1 and 0 ≤ s ≤ r,
Finally, we will also use standard Chernoff bounds.
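The FKG (Harris) inequality of Theorem 9, in its usual form E[XY] ≥ E[X] E[Y] for two increasing random variables on a product space of independent bits (and E[XY] ≤ E[X] E[Y] for one increasing and one decreasing variable), can be checked by exhaustive enumeration on small examples. The sketch below does this; the bit probabilities and the example random variables are arbitrary choices made for illustration.

```python
from itertools import product
from typing import Callable, Sequence, Tuple

def expectation(
    g: Callable[[Tuple[int, ...]], float], probs: Sequence[float]
) -> float:
    """Exact expectation of g over independent bits with Pr[bit i = 1] = probs[i]."""
    total = 0.0
    for omega in product((0, 1), repeat=len(probs)):
        weight = 1.0
        for bit, p in zip(omega, probs):
            weight *= p if bit else 1.0 - p
        total += weight * g(omega)
    return total

if __name__ == "__main__":
    probs = [0.2, 0.5, 0.7]
    X = lambda w: float(sum(w))        # increasing
    Y = lambda w: float(any(w))        # increasing
    Z = lambda w: -float(w[0])         # decreasing
    e_xy = expectation(lambda w: X(w) * Y(w), probs)
    e_xz = expectation(lambda w: X(w) * Z(w), probs)
    print(e_xy >= expectation(X, probs) * expectation(Y, probs))  # positive correlation
    print(e_xz <= expectation(X, probs) * expectation(Z, probs))  # negative correlation
```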

Technical Definitions and Results
As mentioned previously, our techniques resemble those of Hevia Fajardo and Sudholt [38,32]. The key is analysing a suitable potential function g(x, λ) = Zm(x) + h(λ) which combines the distance Zm(x) to the optimum (as defined in Section 2) with a penalty term for small λ. When this function has strong positive drift, we can establish that the optimum is reached fast; conversely, when g has (strong) negative drift, the optimisation takes superpolynomial time. In some cases, we use a similar penalty term h(λ) (and thus potential function) as [32], in other cases very different ones. However, the potential always contains the number of zero-bits Z^t = Zm(x^t) at time t as an additive term, so the drift of Z enters the drift of the potential in all cases. The goal of this section is to compute this drift in Lemma 13. Moreover, the definitions and results for showing this lemma are also used at other places in the paper.
Finally, we show that if the penalty term h(λ) is 'reasonable', then the truncated change of g at each step min{C, G^t − G^{t+1}} has exponential tail bounds and is thus sub-Gaussian. This allows us to apply the concentration results from Section 2.4 to establish concentration from above of the optimisation time; see Remark 14 for more details.

Definition and Properties of Basic Events
Because our analysis deals with an entire class of functions, we will not be able to precisely compute the probability of finding a fitness improvement. However, since we study (dynamic) monotone functions we can relate that probability to the probability of 1) having a child that flips no 1-bit of the parent, and 2) having a child that flips at least one 0-bit of the parent. Understanding those two events, which we respectively denote by A and B and formally define below, is the backbone of our approach and this subsection is devoted to their analysis.
Recall that x^t and λ^t are the search point and the offspring population size at time t respectively, and that y^{t,j} denotes the j-th offspring at time t. For all times t we define
A^{t,j} := {supp(x^t) ⊆ supp(y^{t,j})} and A^t := ∪_{j=1}^{⌊λ^t⌉} A^{t,j}.
In words, A^{t,j} is the event that the j-th offspring at time t does not flip any one-bit of the parent, and A^t is the event that such a child exists at time t. We also define
B^{t,j} := {supp(y^{t,j}) ⊄ supp(x^t)} and B^t := ∪_{j=1}^{⌊λ^t⌉} B^{t,j}
respectively as the event that the j-th child does flip a zero-bit of the parent, and the event that such a child exists. We drop the superscript t when the time is clear from context, and just write x, y^j and λ for parent, offspring, and population size at time t, and A^j, A, B^j, B for the events defined above.
We also observe that all the events {A^j}_j ∪ {B^j}_j are independent and in particular A and B are independent.
In the lemma below we estimate the probability of A t and B t in terms of Z t and λ t . We also provide a bound on the probability of not finding a fitness improvement.
Lemma 12. For any mutation rate c ≤ 1, there exist constants b_1, b_2, b_3 > 0 depending only on c such that at all times t with Z^t ≥ 1 we have
Proof. Let us start with the first inequality. The event A^j happens with probability (1 − c/n)^{n−Z^t} ≥ e^{−c} by Lemma 10, so Ā = ∩_j Ā^j has probability Pr[Ā] ≤ (1 − e^{−c})^{⌊λ⌉} ≤ e^{−e^{−c} ⌊λ⌉}, again using Lemma 10. We conclude the first proof by observing that ⌊λ⌉ ≥ λ/1.5.
The event B̄ happens if none of the ⌊λ⌉ offspring flips a zero-bit of the parent. This happens with probability Pr[B̄] = (1 − c/n)^{Z^t ⌊λ⌉}. The upper bound is obtained as above: (1 − c/n) ≤ e^{−c/n} by Lemma 10 and ⌊λ⌉ ≥ λ/1.5. For the lower bound, we see that (1 − c/n) ≥ e^{−2c/n} for c/n ≤ 1/2 by Lemma 10 and ⌊λ⌉ ≤ 2λ so that Pr[B̄] ≥ e^{−4c Z^t λ/n}.

For the last inequality, for every j the event
since an offspring y^j is always an improvement if it is obtained by flipping a single zero-bit and no one-bit. Since all A^j and B^j are independent and
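The probabilities of the events A and B can be written down exactly from their verbal definitions, since the offspring are independent and each bit flips independently with probability c/n. The sketch below computes them and compares the complements with the elementary exponential estimates used in the proof above; the function names and the concrete parameter values are assumptions made for illustration, and `num_offspring` stands for the rounded population size.

```python
import math

def prob_A(n: int, z: int, c: float, num_offspring: int) -> float:
    """Pr[some offspring flips none of the n - z one-bits of the parent]."""
    p_single = (1.0 - c / n) ** (n - z)
    return 1.0 - (1.0 - p_single) ** num_offspring

def prob_B(n: int, z: int, c: float, num_offspring: int) -> float:
    """Pr[some offspring flips at least one of the z zero-bits of the parent]."""
    return 1.0 - (1.0 - c / n) ** (z * num_offspring)

if __name__ == "__main__":
    n, z, c, k = 100, 10, 1.0, 5
    not_A = 1.0 - prob_A(n, z, c, k)
    not_B = 1.0 - prob_B(n, z, c, k)
    # Pr[not A] <= (1 - e^{-c})^k, valid for c <= 1 and z >= 1.
    print(not_A, (1.0 - math.exp(-c)) ** k)
    # e^{-2 c z k / n} <= Pr[not B] <= e^{-c z k / n}, valid for c/n <= 1/2.
    print(math.exp(-2 * c * z * k / n), not_B, math.exp(-c * z * k / n))
```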

The Drift of Z t
With the definitions introduced in the previous subsection, we may now state and prove the key result of this section, that is, we compute the drift of Z in terms of Pr[A], Pr[B], Z and λ.
Lemma 13. Consider the SA-(1, λ)-EA with mutation rate 0 < c ≤ 1. There exist constants a_1, a_2, b > 0 depending only on c such that at all times t with Z^t > 0 we have
This also holds if we replace Z^t − Z^{t+1} by min{1, Z^t − Z^{t+1}}.
Remark 14. Theorems 6 and 7, which we use to prove concentration of hitting times, require that the probability of having large jumps is small. This is not true in general: when the number of children λ is large, there is an increased probability of flipping many bits.
In order to still be able to prove concentration, we consider the situation in which the number of 0-bits may decrease by at most 1 at each step; this is why we cap the difference Z^t − Z^{t+1} at 1. Even under this pessimistic assumption, we prove (in Sections 4 and 5) that the drift is positive and the optimum is reached fast.
The proof of Lemma 13 will be obtained using the following claims.
This also holds if we replace Z^t − Z^{t+1} by min{1, Z^t − Z^{t+1}}.
Proof of Claim 15. First, let us define K to be the index of the fittest offspring, i.e. y^K = x^{t+1}. A first step in proving the claim will be to show that (2) holds for all i ∈ supp(x). Note that (2) would hold with equality if we replaced K by a fixed j ∈ [⌊λ⌉] and omitted conditioning on A, so the task is to show that conditioning on A and conditioning on the offspring being selected can only decrease the probability. To show this, we use a multiple exposure of the randomness: we let u^1, . . . , u^{⌊λ⌉} respectively be obtained from x by only revealing the flips (or non-flips) of the 0-bits of x in each of the ⌊λ⌉ children (where we abbreviate x = x^t and λ = λ^t). The child y^j, j ∈ [⌊λ⌉] may then be obtained from u^j by revealing the rest of the bits, i.e. the flips of the 1-bits of x.
Consider an index i such that x_i = 1, and decompose For a given j, we observe that if we additionally condition on the 1-bit flips in other children (y^ℓ)_{ℓ≠j}, then 1_{K=j, A} is a decreasing function of the 1-bit flips in the j-th child, while 1_{y^j_i = 0} is an increasing function of those flips. The FKG inequality, Theorem 9, thus gives where the last line simply comes from the fact that the i-th bit flip in y^j is independent of what happens in the other children and in the 0-bit flips of y^j. Using the law of total probability over (y^ℓ)_{ℓ≠j} gives and plugging this into (3) gives A similar decomposition gives and combining (4) and (5) gives To obtain (2) it suffices to apply the law of total probability over (u^ℓ)_ℓ.
We can now compute the drift conditioned on A, B^K: since B^K implies that y^K turns (at least) one 0-bit into a 1-bit, we obtain Note that the first step in (6) remains correct if we replace Z^t − Z^{t+1} with min{1, Z^t − Z^{t+1}}, and this difference does not play a role in any other parts of the proof. To continue the proof, we decompose (6). To conclude the proof, it thus suffices to show (7) and (8). We start by proving (8): we will argue that if both A and B̄^K hold, then x^{t+1} = x^t. Indeed, B̄^K implies that f(x^{t+1}) ≤ f(x^t) since no bit is flipped from 0 to 1. Additionally, the equality holds if and only if x^{t+1} = x^t, since flipping any 1-bit to 0 would decrease f by strict monotonicity. On the other hand, one easily observes that Finally, we prove (7). Once (u^ℓ)_{ℓ∈[⌊λ⌉]} is revealed, we can define J as the set of those indices j which maximise f(u^j). We observe that if B and ∪_{j∈J} A^j hold, then B^K also does. Indeed, if we let J′ ⊆ J be the set of indices for which A^j holds, and if J′ ≠ ∅, then the set of children which maximise f(y^j) is exactly J′. Hence we have The second inequality is simply obtained by noting that, under the assumption that B holds and since J is not empty, the probability of ∪_{j∈J} A^j is at least that of A^j for a single (arbitrary) j in J. The event A^j has probability (1 − c/n)^{n−Z^t}, is independent of B and positively correlated with A. The last inequality follows from Lemma 10 since Z^t ≥ 1.
Proof of Claim 16. Recall that Ā is the event that every offspring flips at least one one-bit. Let K be the index of the fittest child and let N^j be the number of one-bits flipped in the j-th offspring, we want to show that , i.e. the fittest offspring does not flip more one-bits than an arbitrary offspring in expectation. Note that conditioning on Ā leads to dependent bit flips within each individual offspring, but once we know that a specific one-bit is flipped, the remaining one-bit flips are independent. Therefore, we can couple the one-bit flips given Ā with the following procedure. Assume there are m one-bits in x, we first sample the position of the first (= left-most) one-bit l to be flipped. Afterwards, we still flip each bit to the right of l independently with probability c/n. This gives the usual distribution of one-bit flips, conditioned on Ā. To make this formal, we sample l ∈ [m] with probability p_l(m) := (c/n)(1 − c/n)^{l−1} / (1 − (1 − c/n)^m) and flip the l-th one-bit. It is easy to verify that Σ_{l=1}^m p_l(m) = 1 since p_l is a geometric sequence. Then for each l′ ∈ [m]\[l], we flip the l′-th one-bit independently with probability c/n. The probability that a specific one-bit is flipped given Ā is (c/n)(1 − (1 − c/n)^m)^{−1}. By our procedure, the probability that the k-th one-bit is flipped is p_k(m) + (c/n) Σ_{l=1}^{k−1} p_l(m) = (c/n)(1 − (1 − c/n)^m)^{−1}, which is exactly the desired conditional probability.
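The two-stage sampling procedure described above can be sketched as follows. The code assumes that the position of the left-most flipped one-bit follows the truncated geometric distribution p_l(m) = q(1 − q)^{l−1} / (1 − (1 − q)^m) with q = c/n, which is consistent with the geometric-sequence remark and with the stated marginal flip probability; the function names are hypothetical. The marginal probability is computed exactly rather than estimated by sampling.

```python
import random
from typing import List

def leftmost_flip_distribution(m: int, q: float) -> List[float]:
    """p_l(m) for l = 1..m: law of the left-most flipped one-bit, given >= 1 flip."""
    norm = 1.0 - (1.0 - q) ** m
    return [q * (1.0 - q) ** (l - 1) / norm for l in range(1, m + 1)]

def marginal_flip_probability(k: int, m: int, q: float) -> float:
    """Probability that the k-th one-bit is flipped under the two-stage procedure."""
    p = leftmost_flip_distribution(m, q)
    # Either k is the left-most flip, or some l < k is and k is then flipped w.p. q.
    return p[k - 1] + q * sum(p[: k - 1])

def sample_flips_given_at_least_one(m: int, q: float, rng: random.Random) -> List[int]:
    """Sample the (1-based) positions of flipped one-bits, conditioned on >= 1 flip."""
    p = leftmost_flip_distribution(m, q)
    first = rng.choices(range(1, m + 1), weights=p)[0]
    later = [l for l in range(first + 1, m + 1) if rng.random() < q]
    return [first] + later

if __name__ == "__main__":
    m, q = 7, 0.1
    target = q / (1.0 - (1.0 - q) ** m)
    print([round(marginal_flip_probability(k, m, q) - target, 15) for k in range(1, m + 1)])
    print(sample_flips_given_at_least_one(m, q, random.Random(0)))
```

Every position has the same marginal flip probability (c/n)(1 − (1 − c/n)^m)^{−1}, which is what the coupling argument requires.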
Therefore, we can get rid of Ā as follows. Let N^j_{a:b} be the random number of bit flips when we flip the a-th to the b-th one-bit independently with probability c/n for offspring j, and let l_j be the index l sampled for offspring j.

Now we may use the FKG inequality to show
The proof is similar to that used in the previous claim: one conditions on (u^ℓ)_ℓ and (y^ℓ)_{ℓ≠j} in order to have a product space on which N^j_{i+1:m} is increasing and 1_{K=j} is decreasing. One then applies the FKG inequality. Observing that N^j_{i+1:m} is independent of (u^ℓ)_ℓ and (y^ℓ)_{ℓ≠j} and using the law of total probability over (u^ℓ)_ℓ, (y^ℓ)_{ℓ≠j} gives (9). Continuing the previous derivation, where we use the fact that which proves Claim 16.
We now combine the two claims above to obtain Lemma 13.
Proof of Lemma 13. The drift of Z t = Zm(x t ) may be decomposed as follows, where we omitted the conditioning on x, λ on the right-hand side for brevity. As observed above, A, B are independent so we get Pr Also, we observe that the second conditional expectation in (10) must be 0: if B̄ holds then no child is a strict improvement of the parent, but A guarantees that some children are at least as good. Hence, if A and B̄ hold, we must have x t = x t+1 . Combining those remarks with the bounds of Claims 15 and 16 gives

Improvements are Sub-Gaussian
In Sections 4.2 and 5 we will prove that the number of time steps needed to optimise a function is tightly concentrated. We provide the following result, based on Theorems 6 and 7, which allows us to relate strong positive drift and concentration of hitting times.
Proof. We will use Theorem 6 to prove that (ε − Γ t ) is sub-Gaussian. To be able to apply this theorem, we must prove that E[ε − Γ t ] ≤ 0 and that there exist 0 < α and 0 < β < 1 such that Pr[|ε − Γ t | > w] ≤ α/(1 + β)^w for all w ≥ 0. The first immediately holds by (i), so we focus on the second. Let y 1 , . . . , y λ be the children at step t and let K = arg max j f (y j ) be the index of the fittest child. We also let N 1 , . . . , N λ be the number of 1-bit flips in y 1 , . . . , y λ and we define Let w > C 2 + ε and w' = w − ε − C 2 . Since the value of h may only decrease by C 2 at each step by (ii), to have ε − Γ t ≥ w it must be that Z t+1 − Z t ≥ w − ε − C 2 = w', and in particular we must have N ≥ w'. We compute Above, the second inequality is obtained using the FKG inequality in the same fashion as in the proof of Claim 15. The above implies that for all w > C 2 + ε, the probability of having ε − Γ t ≥ w is bounded by α/(1 + β)^w for α = 2^{ε+C 2} and β = 1. Up to possibly increasing α to be a large constant, the same relation holds for w ∈ [0, C 2 + ε]. Since Γ t is upper bounded by the constant 1 + C 1 , the quantity of interest ε − Γ t is lower-bounded by ε − 1 − C 1 so we can also trivially achieve by possibly increasing α again.

Monotone Functions Are Efficiently Optimized for Small Success Rates
In this section we analyse the SA-(1, λ)-EA when the success rate s is small, and the mutation rate is c/n for a constant 0 < c ≤ 1. We show that if s is sufficiently small then for any strictly monotone fitness function, the optimum is found efficiently both in the number of generations and evaluations. We distinguish between the cases c < 1 and c = 1.

Bound on the Number of Generations
In this subsection, we study the number of generations required to reach the optimum and show that for c ≤ 1, the SA-(1, λ)-EA finds the optimum efficiently. We start with the case c < 1.
Theorem 18. Let 0 < c < 1 < F be constants. Then there exist C, s 0 > 0 such that for all 0 < s ≤ s 0 and for every dynamic monotone function the expected number of generations of the SA-(1, λ)-EA with success rate s, update strength F and mutation probability c/n is at most Cn.
For c = 1, we additionally need to assume that the update strength F is bounded from above by a suitable constant F 0 > 1. As we will show experimentally, the update strength can have a notable impact on performance, but it remains open whether this effect vanishes for sufficiently small s.
Theorem 19. There exist constants F 0 > 1, s 0 > 0 and C > 0 such that for all 1 < F < F 0 , all 0 < s ≤ s 0 , and for all dynamic monotone functions the expected number of generations of the SA-(1, λ)-EA with success rate s, update strength F and mutation probability 1/n is at most Cn log n.
Our approach will be essentially the same for both theorems and will follow the ideas of Hevia Fajardo and Sudholt [32,38]. We prove them in the following two subsections.

Expected Number of Generations When c < 1
We will prove Theorem 18 in this section; for the remainder of this section, we assume that 0 < c < 1 < F and the dynamic monotone function f are all given and we will show the existence of a suitable s 0 independent of f . Recall that x t is the search point at time t, its children are y t,1 , . . . , y t,⌊λ t⌉ , and λ t is the value of λ at time t. In particular, the latter does not need to be an integer, and the actual number of offspring at time t is the closest integer ⌊λ t⌉. Whenever the time t is clear from the context, we will remove it from the superscript.
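To make the process (x t , λ t ) concrete, the following is a minimal sketch of the SA-(1, λ)-EA under a few assumptions that are not spelled out in this section: the fitness function is static (the results also cover dynamic functions), the start is uniformly random with λ = 1, λ is kept at least 1, and the run stops as soon as the all-ones string is reached. The update rule (divide λ by F on a strict improvement, multiply by F^{1/s} otherwise) and the rounding to the closest integer follow the text; the function and parameter names are hypothetical.

```python
import random


def sa_one_comma_lambda_ea(f, n, s, F, c=1.0, max_generations=100_000, seed=0):
    """Sketch of the SA-(1, lambda)-EA with mutation rate c/n.

    Returns (x, generations, evaluations). Stops when x is the all-ones
    string or when the generation budget is exhausted.
    """
    rng = random.Random(seed)
    x = [rng.randint(0, 1) for _ in range(n)]  # assumption: uniform random start
    fx = f(x)
    lam = 1.0                                  # real-valued population size
    generations = 0
    evaluations = 0
    while sum(x) < n and generations < max_generations:
        offspring_count = int(lam + 0.5)       # closest integer to lambda
        best, best_fitness = None, None
        for _ in range(offspring_count):
            # standard bit-wise mutation: flip each bit independently w.p. c/n
            y = [bit ^ (rng.random() < c / n) for bit in x]
            fy = f(y)
            if best is None or fy > best_fitness:
                best, best_fitness = y, fy
        evaluations += offspring_count
        generations += 1
        if best_fitness > fx:
            lam = max(1.0, lam / F)            # success: shrink the population
        else:
            lam = lam * F ** (1.0 / s)         # no success: grow the population
        x, fx = best, best_fitness             # comma selection: parent is discarded
    return x, generations, evaluations


if __name__ == "__main__":
    onemax = lambda bits: sum(bits)
    x, gens, evals = sa_one_comma_lambda_ea(onemax, n=50, s=0.5, F=1.5)
    print(gens, evals)
```

Because of comma selection the parent is always replaced by the fittest child, so Z t may increase in a step; this is the reason the analysis works with a potential that combines Z t with a term depending on λ t .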
We show that for an appropriate function g, the drift E[g(x t , λ t ) − g(x t+1 , λ t+1 )] is positive. Our choice of g will guarantee that g(x, ·) = 0 implies x = (1, · · · , 1), and the Additive Drift Theorem 2 will allow us to bound the time until this happens. For this section, we use the following g = g 1 .
Our first lemma states that g 1 (x, λ) does not deviate much from Zm(x) for all x, λ, and that it suffices to show that g 1 reaches 0, since then the optimum is found.
Proof. The lemma follows trivially from the fact that We will now compute the drift of G t 1 . The drift of Z t was already computed in the previous section, so it suffices to compute that of H t 1 .
Claim 22. At all times t ≥ 0 we have Proof of Claim 22. We first give a general bound that we will use for the case that the fitness increases. We have λ t+1 ≥ λ t /F and thus log F (λ max /λ t+1 ) ≤ 1 + log F (λ max /λ t ) and H t+1 In particular, This gives Claims 15, 16 and 22 may now be combined to obtain the following drift of G t 1 . We again drop the index t from x t and λ t .

Corollary 23.
There exists a constant s 0 > 0 such that for all 0 < s ≤ s 0 the following holds. There is a constant δ and a choice of K 1 such that for all t with Z t > 0, This also holds if G t − G t+1 is replaced by min{1 + K 1 /s, G t − G t+1 }.
Proof. Combining Lemma 13 and Claim 22, one obtains that the drift of G 1 is at least for some constants α 1 , α 2 , β > 0. We choose K 1 = α 1 /2 so that the drift for any t with Z t > 0 is at least • If λ < n: then as we want s small enough, we may assume that s < 1. In this setting the drift is lower bounded by Note that there is λ 0 = λ 0 (α 2 , β, K 1 ) such that for λ ≥ λ 0 the last term can be bounded as α 2 e −βλ ≤ K 1 /2, in which case the drift is at least K 1 /2. For the remaining case, recall Lemma 12 guarantees that Pr[B] ≥ e −b 2 λ for some constant b 2 > 0 depending only on c. Hence, for a choice of s small enough we can achieve Pr[B]K 1 (1 − s)/s ≥ α 2 e −βλ , so the drift stays above The first term is at least Pr (1); this implies that the drift is at least (1−e −c ) 2 K 1 for sufficiently large n. To see why the statement also holds for min{1 + K 1 /s , G t − G t+1 }, we recall that G t = Z t + H t 1 , that Lemma 13 holds for min{1, Z t − Z t+1 } and that H 1 may increase by at most K 1 /s in each step. This implies that the first formula (11) also holds if we replace G t − G t+1 by min{1 + K 1 /s , G t − G t+1 } and all following arguments are unchanged.
We are now ready to prove the main theorem of this section.
Proof of Theorem 18. Corollary 23 guarantees that for s sufficiently small there is δ > 0 such that the drift of G 1 is at least E[G t 1 − G t+1 1 | x, λ] ≥ δ whenever Z t > 0. Let T be the first point in time when either G T 1 = 0 or Z T = 0. Then the drift bound for G 1 applies to all t < T , and by Theorem 2 we have E[T ] ≤ G 0 1 /δ ≤ (n + K 1 log(λ max ))/δ = O(n). By Lemma 21, G T 1 = 0 implies Z T = 0, so in particular at time T we have x T = (1, . . . , 1) and Theorem 18 is proved.

Expected Number of Generations When c = 1
We will now prove Theorem 19; that is, we will show that the self-adjusting EA is also efficient when the mutation rate is 1/n. This case needs a separate treatment because of the expected number of bits gained when increasing the fitness. If we set c = 1, the drift obtained in Claim 15 is no longer constant but proportional to Z t /n. In particular, in the last stages of the exploration, the drift is much smaller, which results in a looser bound for the number of generations. Still, the proof is similar to the one for c < 1, but we need to choose a different potential function.
Definition 24 (Potential function for c = 1). Let where K 2 is a constant to be chosen later, and λ max = F 1/s n. Then for x ∈ {0, 1} n and λ ∈ [1, ∞) we define and we set H t 2 := h 2 (x t , λ t ) and G t 2 := g 2 (x t , λ t ), and as before Z t := Zm(x t ).
As before, we have a lemma stating that the deviation between Zm(x) and g 2 (x, λ) is small.
Proof. Similar to the proof of Lemma 21, the proof follows from the fact that The drift of Z is known from Lemma 13, so to compute the drift of G it suffices to compute that of H 2 . As before, we abbreviate x = x t and λ = λ t where the index is clear from the context.
Proof of Claim 26. Similar to the proof of Claim 22, we analyse the drift of H t 2 . We first give a general bound that we will use for the case that the fitness increases. We have λ t+1 ≥ λ t /F and thus This gives where the conditioning on x, λ is implicit. Recall that B is the event that at least one of the offspring flips a 0-bit of X t , which is a necessary condition for

by its upper bound Pr[B] and replacing Pr[f (x t+1 ) ≤ f (x t )] by its lower bound Pr[B], we conclude the proof.
We can bound the drift of G t 2 from below as follows.
Corollary 27. There exist constants s 0 > 0 and F 0 > 1 such that the following holds. For all 0 < s ≤ s 0 and all 1 < F ≤ F 0 there exists a choice of K 2 and a constant δ > 0 such that for all times t with Z t > 0,
Proof. We will first show for some δ > 0. Combining Lemma 13 together with Claim 26 we obtain that the drift of G 2 is at least for some constants α 1 , α 2 , β > 0. We will argue that if F > 1 and s > 0 are both small enough and if K 2 is chosen appropriately, then the drift of G t 2 is of order Z t /n. We will choose F, s later but we may already choose K 2 = α 1 /(2(F − 1)) so that the drift is at least Our proof is based on a case distinction. Let b 3 be the constant of Lemma 12, γ := 1 − e^{−b 3} and let λ̄ > 0 be such that γα 1 /(4λ) − α 2 e^{−βλ} ≥ 0 holds for all λ ≥ λ̄.
• If λ ≤ max{λ̄, n/Z t }: then by ignoring the first positive contribution in (13), the drift is at least Splitting the positive contribution into three equal parts gives Recall that Lemma → +∞, so a choice of F, s small enough (but constant) guarantees that the second and third line in (14) are both non-negative. This means that in the range of λ considered, the drift is at least some multiple of 1/λ, which is at least δ Z t /n for a small enough constant δ .
• If λ > max{λ̄, n/Z t }: then by ignoring the last positive contribution in (13) we see that the drift is at least Lemma 12 states that Pr[B] ≥ 1 − e^{−b 3 λZ t /n} ≥ 1 − e^{−b 3} = γ since λ > n/Z t and by definition of γ. Since λ > λ̄, the last two contributions sum up to a non-negative constant so that the drift is at least γα 1 Z t /(4n) ≥ δZ t /n. At any time when Z t ≥ 1 and λ t ≥ 1 we have This proves (12). To relate Z t to G t 2 , recall that by Lemma 25, we have To obtain a multiplicative drift, we distinguish two cases. If Hence, for all times t < T we have which concludes the proof. Now we are ready to prove the main result of this subsection.

Bound on the Number of Evaluations
We have proved that the number of generations is respectively O(n) or O(n log n) if c < 1 or c = 1. We will now turn our attention to the total number of function evaluations. For c = 1, we again need to assume that F is sufficiently close to one. More precisely, we will show the following theorems.
Theorem 28. Let 0 < c < 1 < F be constants. Then there exist constants C, s 0 > 0 such that for all s ≤ s 0 and every dynamic monotone function, the expected number of function evaluations of the SA-(1, λ)-EA with success rate s, update strength F and mutation probability c/n is at most Cn log n.

Theorem 29.
There exist constants C, s 0 > 0 and F 0 > 1 such that for all s ≤ s 0 , all 1 < F < F 0 and every dynamic monotone function, the expected number of function evaluations of the SA-(1, λ)-EA with success rate s, update strength F and mutation probability 1/n is at most Cn^2 log log n.
Remark 30. Theorem 28 is tight since any unary unbiased algorithm needs at least Ω(n log n) function evaluations to optimize OneMax [39]. On the other hand, Theorem 29 is not tight. A more precise calculation would allow us to replace the log log n factor by an even smaller factor. However, we suspect that even the main order n^2 is not tight, since the (1+1)-EA with c = 1 is known to need time O(n^{3/2}) even in the pessimistic PO-EA model [48], which includes every dynamic monotone function. The order n^{3/2} is tight for the PO-EA, but a stronger bound of O(n log^2 n) is known for all static monotone functions [45], and an O(n log n) bound is known for all dynamic linear functions [53].
We conjecture that the number of function evaluations required to optimise static monotone functions is linear up to some logarithmic factors (in fact, we conjecture O(n log n)) even for c = 1. However, the methods used in [45] are rather different from the ones in this paper, so it remains unclear whether they can be transferred.
We also conjecture that dynamic monotone functions are harder to optimise, i.e., that O(n log n) generations and O(n 3/2 ) evaluations are tight. More precisely, we conjecture that the 'adversarial' Dynamic BinVal described in our conclusion is the hardest dynamic monotone function for the SA-(1, λ)-EA, and requires Ω(n log n) generations and Ω(n 3/2 ) evaluations.
Remark 31. Our approach uses the best-so-far ZeroMax value Z t * defined below as in [32,38]. However, apart from that our proof is rather different. In fact, we believe that the proof in these papers is not fully correct. In the proof of Theorem 3.5 in [32], the authors bound the number of evaluations per generation by identically distributed random variables, and use Wald's equation to bound the total number of evaluations. However, Wald's equation is only true for the sum of independent random variables (or for similar conditions, e.g. [59]), a condition that is not satisfied in this situation. (The random variables are identically distributed, but not independently identically distributed.) Thus we need to use a different approach.
To avoid the issue mentioned above, we will decompose the interval [n] into smaller 'sub-intervals' and we will show that with very high probability, the time needed for Zm(x) to 'traverse' such an interval is of the expected order. We will compute the expected number of children at each of those steps, and will conclude using linearity of expectation.
To prove concentration of the time needed to traverse 'sub-intervals' we will use Theorem 7 in the case c < 1 and Theorem 8 in the case c = 1.
The key ideas are summarised in the following lemmas. The first one is an adaptation of one proved by Hevia Fajardo and Sudholt. Below, we let Z t * := min t ≤t (Z t ) be the smallest value of Z t observed until time t. Naturally, the process is unaware of this value but it will turn out useful for the analysis. We will apply the following lemmas to intermediate stages of a run, so we will consider an arbitrary starting population size λ init in them.
Lemma 32 (Hevia Fajardo, Sudholt [32,38]). Consider the SA-(1, λ)-EA as in Theorems 28 or 29, with an arbitrary initial search point and an initial value of λ = λ init . There exists a constant C > 0 such that at all times t ≥ 0 and for all z > 0 we have
Lemma 33. Consider the self-adjusting (1, λ)-EA as in Theorems 28 or 29, with an arbitrary initial search point and an initial value of λ = λ init . Let T denote the first time t at which λ t ≤ 8en log n/Z t . There exists an absolute constant C > 0 such that
Lemma 34. Let (a, b) be an interval of length b − a = log n. Consider the self-adjusting (1, λ)-EA with c < 1 as in Theorem 28, with an initial search point x = x init such that Zm(x init ) ≤ b, and an arbitrary initial value of λ. Let T be the first time t at which Z t ≤ a. Then there exists an absolute constant D > 0 such that T ≤ D log n with probability at least 1 − n^{−4}.
Proof of Lemma 32. We will compute the expectation using the following formula Let ℓ be an integer; if ℓ ≤ max{λ init /F^t , n/z}, then Pr[λ t ≥ ℓ] = 1. Otherwise, observe that for λ t ≥ ℓ there has to exist a time before t when λ increases, i.e. when the fitness does not improve. We may then write Note that if (t − k) is the last time when λ increases, then the number of children at this time must be λ t · F k−1/s ≥ −1 + ℓ · F k−1/s . Naturally, if Z t * ≥ z, we must also have Z t−k * ≥ z. In particular, we find that if the following event holds then so does If Z t−k * ≥ z, the probability of a single child improving the fitness is at least where the last step holds by Lemma 10 since cz/n ≤ 1. In particular, this implies that the probability of the event in (17) is at most (16) and using F^k = e^{k log F} ≥ 1 + k log F gives

Replacing in
, for a sufficiently large constant C since ℓ ≥ n/z. Now using Equation (15) and taking the trivial upper bound of 1 for the probability of the first max{n/z, λ init /F^t } terms, we obtain for a large constant C > 0.
Proof of Lemma 33. Consider a time t < T such that λ t ≥ 8en log n/Z t . The probability that a child improves the fitness is at least where the last step holds by Lemma 10 since cZ t /n ≤ 1. Hence the probability that all the λ t children fail to improve the fitness is at most This recursively implies that E λ t · 1 t<T ≤ λ init · F −t/2 . Using λ t ≤ λ t−1 F 1/s , we can now conclude, for some constant C.
Proof of Lemma 34. Recall the definition In other words, T ≤ T' and we will show the desired tail bound for T'.
To obtain the tail bound for T', we observe that h 1 is decreasing and that h(λ) − h(λF ) ≤ K 1 for all λ. By Corollary 23, the drift of Γ t is at least a constant ε > 0, and Lemma 17 applied with C 1 = K 1 /s, C 2 = K 1 guarantees that ε − Γ t is sub-Gaussian. Thus Theorem 7 is applicable. Let D be the constant from Theorem 7. Then we choose τ = max{4/D, 2/ε} · log n and Theorem 7 immediately implies that Pr[T > τ ] ≤ Pr[T' > τ ] ≤ n^{−4}.
We are now ready to prove Theorem 28. We look at a slight alteration of the SA-(1, λ)-EA, working in exactly the same way as the 'normal' process except that we introduce some idle steps in which the algorithm does not do anything. Moreover, we divide a run of the algorithm into blocks and phases as follows. For simplicity, we will assume in the following that n/ log n is an integer. A block starts with an initialisation phase which lasts until the condition λ t ≤ F 1/s 8en log n/Z t is met. Once this phase is over, the block runs for n/ log n phases of length D log n, with D the constant of Lemma 34. During the i-th such phase the process attempts to improve Z t from n−i log n to n−(i+1) log n. If such an improvement is made before the D log n steps are over, then the process remains idle during the remaining steps of that phase. We call the non-idle steps active.
If a phase fails to make the correct improvement in D log n generations, or if λ t ≥ F 1/s 8en log n/Z t * at any point after the initialisation phase is over, then the whole block is considered a failure, and the next block starts. Obviously, the entire process stops (and succeeds) if the optimum is found. With this partitioning of a run, we will prove Theorem 28.
Proof of Theorem 28. The proof relies on the following two facts: (i) every block finds the optimum whp; (ii) consider a block starting with λ = λ init , then the expected number of function evaluations during this block is at most K(λ init + n log n) for a constant K = K(c, F, s).
It is rather easy to see how those two items imply the theorem. The algorithm starts with an initial value of λ = 1, so the expected number of function evaluations in the first block is at most O(n log n). Recall that a block terminates as soon as λ goes above F^{1/s} · 8en log n/Z t * after the initialisation phase. This means that any block run after the first one will start with λ init ≤ F^{2/s} · 8en log n/Z t * and by (ii) its expected total number of evaluations is also O(n log n). The success probability of each block is at least 1 − o(1) for all possible x init , λ init it starts with, so by (i) we have an expected (1 + o(1)) blocks, each requiring an expected O(n log n) evaluations, hence the result.
To conclude, we will now prove the two items above. Let us start with (i), which is simpler: the initialisation never fails since it runs until it succeeds, i.e., until λ gets small enough or the optimum is found. By Lemma 34, 'crossing' any interval of size log n fails with probability at most n^{−4}. By a union bound over all n/ log n phases of the block, the probability that one of them fails for this reason is at most n^{−3} log^{−1} n = o(n^{−3}). Finally, the block might also fail because we have λ t ≥ F^{1/s} · 8en log n/Z t * at some point, which means that the (t − 1)-th step was not successful despite λ t−1 ≥ 8en log n/Z t−1 , since Z t * ≤ Z t−1 . This happens with probability at most n^{−2} by Lemma 12, so by a union bound over the Dn generations in the different phases, the probability that this happens at some time during a given block is at most o(1).
We now prove ii. Consider any 1 ≤ i ≤ n/ log n; we will show that the number of function evaluations in the i-th phase of the block is at most of order n n−i log n . In a slight abuse of notation, for t ∈ [D log n] we will let λ t i denote the value of λ in the t-th step of the i-th phase, and we set λ t i to be 0 if step t is idle, e.g. if the improvement to Z t ≤ n − (i + 1) log n has already been found, or if the i-th phase does not happen because the block has failed before that.
Since the previous phase is successful, the value of λ at the start of this phase must be λ^0_i ≤ C (n log n)/(n − i log n). By definition, t is only active if Z t has never been at or below n − (i + 1) log n, so by Lemma 32 we have Note that in each round, the number of function evaluations is ⌊λ^t_i⌉ ≤ 1 + λ^t_i. Summing over the D log n steps of the i-th phase, we see that the total number of function evaluations during this phase satisfies for a constant C > 0. Hence, excluding the initialisation phase, the total number of function evaluations is in expectation at most C Σ_{i=1}^{n/log n} (n log n)/(n − i log n) = C Σ_{j=1}^{n/log n} (n log n)/(j log n) ≤ C' n log n for a new constant C' > 0. Lemma 33 immediately gives that the expected number of function evaluations of the initialisation phase is at most C λ init , and (ii) is proved.
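The last step of the computation above is a harmonic sum: Σ_{j=1}^{n/log n} (n log n)/(j log n) = n · H_{n/log n}, which is at most of order n log n. The following short sketch evaluates this sum numerically; it assumes natural logarithms and rounds n/log n down, and the function name is hypothetical.

```python
import math


def phase_evaluation_sum(n):
    """Sum of (n log n) / (j log n) over j = 1 .. floor(n / log n)."""
    log_n = math.log(n)
    phases = int(n / log_n)  # assumption: round n / log n down to an integer
    return sum(n * log_n / (j * log_n) for j in range(1, phases + 1))


if __name__ == "__main__":
    for n in (10**2, 10**3, 10**4, 10**5):
        total = phase_evaluation_sum(n)
        # ratio to n log n stays bounded, as the O(n log n) bound requires
        print(n, total / (n * math.log(n)))
```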
The proof of Theorem 29 is extremely close to that above, so we will only give a sketch of it. The main difference is the way the phases of a block are defined: during the i-th phase, the algorithm attempts to improve Z t from n/ log^{i−1} n to n/ log^i n in Dn log log n steps for a large constant D. One checks that n/ log^i n = n/e^{i log log n} is less than 1 for i > log n/ log log n, so we have 1 + log n/ log log n phases in a block. Since G t 2 has multiplicative drift, Theorem 8 immediately gives that the probability that a phase fails to improve Z t is at most log^{−2} n, so the probability that any phase in a fixed block fails is O(log n/ log log n · log^{−2} n) = o(1). In the i-th phase, the expected value of λ at each step is at most O(log^i n). The total number of function evaluations per block after the initialization phase is then O(1) · Σ_{i=0}^{log n/ log log n} n log log n · log^i n = O(n^2 log log n), which implies Theorem 29.
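The final sum in this sketch is geometric in i and is dominated by its last term, since log^i n ≤ n for i ≤ log n/ log log n. The following numerical check illustrates this; it assumes natural logarithms, sets the constants D and O(1) to 1, and rounds the number of phases down, and the function name is hypothetical.

```python
import math


def block_evaluation_bound(n):
    """Sum over phases i of (n log log n) * log^i n, for i = 0 .. floor(log n / log log n)."""
    log_n = math.log(n)
    log_log_n = math.log(log_n)
    last_phase = int(log_n / log_log_n)   # largest i with log^i n <= n
    steps_per_phase = n * log_log_n       # D n log log n generations, with D = 1
    return sum(steps_per_phase * log_n ** i for i in range(last_phase + 1))


if __name__ == "__main__":
    for n in (10**2, 10**3, 10**4, 10**6):
        total = block_evaluation_bound(n)
        # ratio to n^2 log log n stays bounded by a constant
        print(n, total / (n * n * math.log(math.log(n))))
```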

Close to the Optimum Success Rates Become Asymptotically Irrelevant
In this section, we will show that one can still have efficient search even when s is large, provided one starts close enough to the optimum. More precisely, we will prove the following theorem. For simplicity, we only treat the case c < 1.
Theorem 35. Let 0 < c < 1 < F be constants. For every s > 0, there exists an ε > 0 such that for any initial search point x 0 satisfying Zm(x 0 )/n ≤ ε and for any initial population size λ init ≥ 1 the following holds. For every dynamic monotone function, with high probability the number of generations of the SA-(1, λ)-EA with success rate s, update strength F , and mutation probability c/n and initial state (x 0 , λ init ) is O(n). Additionally, the number of function evaluations is ω(n log n) with probability o(1).
The approach will be essentially the same as in Section 4.2. The main difference lies in the potential function: we will need to introduce a second penalty term into the part h(λ) that depends on λ. Moreover, when s is large, no potential function can have strong positive drift towards the optimum for all values of Z t , as it would otherwise contradict the negative results from Section 6. Hence, we will only show that the potential has positive drift when Z t /n ≤ 2ε. Then by the Negative Drift Theorem, starting from Z t /n ≤ ε it is unlikely that the exploration reaches a search point for which Z t /n > 2ε in polynomial time. Hence, the algorithm stays in a range where the drift is positive, and by the Additive Drift Theorem the optimum is found efficiently.
Definition 36 (Potential function for positive result near optimum). For all λ ≥ 1, we set where λ max = F 1/s n and the constants K 1 , K 2 , K 3 > 0 will be fixed later.
As in Section 4, A = A t denotes the event that some child of x t flips no one-bit and B = B t the event that some child of x t flips (at least) one zero-bit. Also recall that we abbreviate x = x t and λ = λ t when t is clear from the context.
Proof. The proof is extremely similar to that of Claim 22. In particular, the contribution of the first term K 1 max {0, log F λ max /λ} is exactly the same, so we will only compute the contribution of the second term. Since that term is non-negative at time t and is at most K 2 e −K 3 λ/F at time t + 1 due to In the case f (x t+1 ) ≤ f (x t ) of non-success we have λ t+1 = λF 1/s , so we may use the exact contribution to H t 3 − H t+1 3 , which is and using the law of total expectation gives the result.
• If λ ≤ F K 3 log 2α 2 K 2 K 1 : then, ignoring the (positive) contribution of Pr[B](α 1 − K 1 ) and Pr[B]K 1 /s, we see that the drift of g is at least where in the second step we bounded e −K 3 λ/F ≤ 1 and used β ≥ K 3 and , and in the third step we used λ ≤ F/K 3 ·log(2α 2 K 2 /K 1 . Then the drift must be at least for a δ > 0 chosen small enough. • If F K 3 log 2α 2 K 2 K 1 ≤ λ < n: then the first contribution in (18) is at least Pr[B]K 1 /2. Hence, the drift is at least where the second line holds since K 3 = β/2 and K 2 = 2α 3 /(α 4 K 3 ). Recall that Pr[B] = (1−c/n) λ Z t ≥ e −4cλε whenever Z t /n ≤ 2ε and n is sufficiently large. Choosing ε small guarantees that this is larger than e −K 3 λ for all λ, meaning that the drift of G t 3 is at least K 1 / max{2, s} ≥ δ for a suitable δ > 0.
For the last part of the statement, we use the same argument as for Corollary 23, using that (18) also holds for min{1, G t −G t+1 } since Claims 15 and 16 do, and h may only increase by K 1 /s + K 2 at each step.
We may now prove Theorem 35.
Proof of Theorem 35. Let ε, δ, K 1 , K 2 , K 3 > 0 be the constants from Corollary 39 and assume the initial search point x 0 is such that Zm(x 0 )/n ≤ ε. Analogously to the proof of Theorem 28, we define Γ t = min{1 + K 1 /s + K 2 , G t 3 − G t+1 3 }, let T be the first time t when Z t = 0 or G t 3 = 0, and observe that this actually implies Z T = 0. Moreover, we define T' as the first time when Z t /n > 2ε.
By Corollary 39 the drift at any time t < min{T, T'} is at least As in the proof of Theorem 28, we use Lemma 17 to see that δ − Γ t is sub-Gaussian since h 3 is decreasing and may not increase too much at each step.
In particular, Theorem 7 gives that, for a suitable constant D > 0, the event E := {T > Dn and Dn τ =0 Γ τ < εn + K 1 log λ max + K 2 } has probability Pr[E] = e −Ω(n) . If the second event does not happen, Dn τ =0 Γ τ ≥ εn + K 1 log λ max + K 2 , then by the Sandwich Lemma 37 this implies Z t ≤ 0 for t = Dn and thus T ≤ Dn. Hence, Pr[T ≤ Dn] ≥ Pr[Ē] = 1−e −Ω(n) , and the statement about the number of generations is proven. For the number of function evaluations, in the proof in Theorem 28 we use the potential function as a black box (except for the Sandwich Lemma), so the proof carries over.

Small Success Rates Yield Exponential Runtimes
The aim of this section is to show that for large s, that is, for a small enough success rate, the SA-(1, λ)-EA needs super-polynomial time to find the optimum of any dynamic monotone function. The reason is that the algorithm has negative drift in a region that is still far away from the optimum, in linear distance. In fact, as we have shown in Section 5, the drift is positive close to the optimum. Thus the hardest region for the SA-(1, λ)-EA is not around the optimum. This surprising phenomenon was discovered for OneMax in [38]. We show that it is not caused by any specific property of OneMax, but that it occurs for every dynamic monotone function. Even in the OneMax case, our result is slightly stronger than [32], since they show their result only for 1 < F < 1.5, while ours holds for all F > 1. On the other hand, they give an explicit constant s 1 = 18 for OneMax.
Theorem 40. Let 0 < c ≤ 1 < F . For every ε > 0, there exists s 1 > 0 such that for all s ≥ s 1 the following holds. For every dynamic monotone function and every initial search point x init satisfying Zm(x init ) ≥ εn the number of generations of the SA-(1, λ)-EA with success rate s, update strength F , and mutation probability c/n is e^{Ω(n/ log^2 n)} with high probability.
Definition 41 (Potential function for negative result). Given F , we define with K 4 a positive constant to be chosen later. As before, we define the potential function to be the sum of Zm(x) and h 4 (λ): As usual, we set G t 4 := g 4 (x t , λ t ), H t 4 := h 4 (λ t ) and Z t := Zm(x t ). Contrary to the previous sections, we are now aiming to show that the difference G t+1 4 − G t 4 is positive in expectation. (Note the switched order of t + 1 and t.) This will require approaches slightly different from the ones we used so far.
The theorem will be proved using the following lemmas. Recall from Section 3.1 the event B that at least one child flips a zero-bit.
Lemma 42. There exists a constant α 1 > 0 depending only on c such that at all times t we have
Lemma 43. There exist constants ε, α 2 > 0 depending only on c, F such that if Z t ≤ εn and λ ≤ F , then
Lemma 44. Assume that s ≥ 1 ≥ c. At all times t with Z t > 0 we have
Proof of Lemma 42. The event B̄ implies supp(x t+1 ) ⊆ supp(x t ) and thus Z t+1 − Z t ≥ 0. Hence, E[Z t+1 − Z t | B̄] ≥ 0. By the law of total probability, we may thus bound To bound the conditional expectation, let N j be the number of zero-bits flipped by the j-th individual, and let N : The events B j are positively correlated with the event N ≥ z, for every z ≥ 1. Therefore, As in the proof of Claim 16, we can couple the one-bit flips in y j given B j by first sampling the position l of the left-most one-bit flip, and then flipping all bits to the right of l independently with probability c/n. Since there are fewer than n positions to the right of l, this shows that N j is dominated by 1 + N , where N follows a Bin(n, c/n) distribution. In particular, by the Chernoff bound, Theorem 11, for a constant α 0 that only depends on c. Hence, for a suitable constant α 1 > 0. Combining this with (20), we obtain as desired.
Proof of Lemma 43. For all $j \in [\lambda]$, let us denote by $M_j$ the number of one-bits flipped by the j-th offspring and $M = \min_j M_j$. We also define $N_j$ as the number of zero-bits flipped by the j-th child and let $N = \max_j N_j$.
Observe that M is the minimum of $\lambda \le F$ i.i.d. random variables following a binomial distribution $\mathrm{Bin}(n - Z^t, c/n)$. In particular,
Observe now that $N \le \sum_j N_j$. Since each $N_j$ follows a binomial distribution $\mathrm{Bin}(Z^t, c/n)$, the expected value of N is at most
Choosing ε small enough and
If we now condition on $f(x^{t+1}) > f(x^t)$ and assume $\lambda \ge F$, we have
We observe that $h_4$ is decreasing with λ, so when λ < F we may simply lower-bound the drift by 0.
The law of total probability then gives For a choice of s large enough this is at least δ, for some constant δ > 0.
We are now ready to prove the main theorem of this section. Essentially, it follows from Corollary 45 and the Negative Drift Theorem 4. Compared to the other sections, there is a slight complication, since the difference $|G_4^t - Z^t| = K_4 \log_F^2(\lambda F)$ is not bounded. However, we will prove that with overwhelming probability the difference does not grow larger than $K_4 \sqrt{n}$.
Proof of Theorem 40. Let $\Lambda := n^2/F$ and let T be the first point in time when $Z^t \le \varepsilon n/2$. We first show that with overwhelming probability, we have $\lambda^t \le \Lambda$ for all $1 \le t \le \min\{T, e^n\}$. Indeed, to obtain some $\lambda > \Lambda$, it would be necessary to have a step with $\lambda > \Lambda F^{-1/s}$ that does not improve the fitness. If this were to happen before time T, it must happen in a step with $Z^t \ge \varepsilon n/2$. By Lemma 12, the probability to have a non-improving step is $e^{-\Omega(\lambda)}$. By a union bound, the probability that such a step happens before time $e^n$ is at most $e^{n - \Omega(\lambda)} = o(1)$. Hence, w.h.p. $\lambda^t \le \Lambda$ for all $1 \le t \le \min\{T, e^n\}$. Note that in this case we have $|G_4^t - Z^t| \le 4K_4 \log_F^2 n$, so in particular, $G_4^t > 4K_4 \log_F^2 n$ implies that $Z^t > 0$ for $\lambda \le \Lambda$.
In the following, we will apply the Negative Drift Theorem 4 to $G_4^t$. The drift condition is satisfied by Corollary 45 whenever $Z^t \in [\varepsilon n/2, \varepsilon n]$, which is implied whenever $G_4^t \in [\varepsilon n/2 + 4K_4 \log_F^2 n, \varepsilon n]$ and $\lambda \le \Lambda$. For the step size condition, let $L_j$ denote the total number of bits flipped in $y_j$, and $L := \max_j L_j$. Since $L_j$ follows a Bin(n, c/n) distribution, by the Chernoff bound, Theorem 11, there is a constant $\beta > 0$ such that $\Pr[L_j \ge z] \le e^{-\beta z}$ for all $z \ge 0$. Let $r := 4K_4 \log_F n/\beta$, and note that we can achieve $|H_4^{t+1} - H_4^t| \le r/2$ when $\lambda \le \Lambda$, by making $\beta > 0$ smaller if necessary. Then for all $j \ge 1$,
where the last inequality holds for n sufficiently large. Thus the step size condition of Theorem 4 is satisfied, and we obtain that w.h.p. $G_4^t \ge \varepsilon n/2 + 4K_4 \log_F^2 n$ for $e^{\Omega(n/\log^2 n)}$ steps if $\lambda^t \le \Lambda$ during this time. Since the latter also holds w.h.p., this implies $T = e^{\Omega(n/\log^2 n)}$ w.h.p., which concludes the proof.
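The bound on $|G_4^t - Z^t|$ used in the proof can be checked by a short calculation. The following worked step assumes only the expression $|G_4^t - Z^t| = K_4 \log_F^2(\lambda F)$ stated before the proof, the choice $\Lambda = n^2/F$, and $\lambda^t \ge 1$ (so that $\log_F(\lambda^t F) \ge 1$ and the square is increasing in $\lambda^t$).

```latex
\begin{align*}
|G_4^t - Z^t|
  &= K_4 \log_F^2(\lambda^t F)
   \le K_4 \log_F^2(\Lambda F)
   && \text{(since } 1 \le \lambda^t \le \Lambda\text{)} \\
  &= K_4 \log_F^2(n^2)
   = K_4 \,(2 \log_F n)^2
   = 4 K_4 \log_F^2 n .
\end{align*}
```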

Simulations
In this section, we provide simulations that complement our theoretical analysis. The functions optimized in our simulations include OneMax, Binary, HotTopic [46], BinaryValue, and Dynamic BinVal [41], where Binary is defined as $f(x) = \sum_{i=1}^{n/2} x_i n + \sum_{i=n/2+1}^{n} x_i$, and BinaryValue is defined as $f(x) = \sum_{i=1}^{n} 2^{i-1} x_i$. The definition of HotTopic can be found in [46], and we set the parameters to L = 100, α = 0.25, β = 0.05, and ε = 0.05. Dynamic BinVal is the dynamic environment which applies the BinaryValue function to a random permutation of the n bit positions, see [41] for its formal definition. In all experiments, we start the SA-(1, λ)-EA with a randomly sampled search point and an initial offspring size of $\lambda^{init} = 1$. The algorithm terminates when the optimum is found or after 500n generations. The code for the simulations can be found at https://github.com/zuxu/OneLambdaEA.
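To make the experimental setup concrete, the following is a minimal Python sketch of the static benchmark functions and of one run of the SA-(1, λ)-EA. It is not the code from the repository above. It assumes the usual form of the (1:s+1)-success rule (λ is divided by F after an improving generation, but never drops below 1, and is multiplied by $F^{1/s}$ otherwise), rounding of λ to the nearest integer, and a split at `n // 2` for Binary. All function and parameter names are hypothetical.

```python
import math
import random


def onemax(x):
    """Number of one-bits."""
    return sum(x)


def binary(x):
    """First half of the bits has weight n, second half weight 1 (split at n // 2 is an assumption)."""
    n = len(x)
    half = n // 2
    return n * sum(x[:half]) + sum(x[half:])


def binary_value(x):
    """sum_{i=1}^{n} 2^{i-1} x_i, with x[0] playing the role of x_1."""
    return sum(bit << i for i, bit in enumerate(x))


def sa_one_lambda_ea(fitness, n, s, F, c=1.0, seed=0, max_gen_factor=500):
    """Sketch of the SA-(1, lambda)-EA; returns (generations used, final search point)."""
    rng = random.Random(seed)
    x = [rng.randint(0, 1) for _ in range(n)]  # random initial search point
    lam = 1.0                                  # initial offspring size
    max_generations = max_gen_factor * n       # budget of 500n generations
    for generation in range(max_generations):
        if sum(x) == n:                        # all-ones string is the optimum
            return generation, x
        fx = fitness(x)
        num_children = max(1, math.floor(lam + 0.5))  # assumed rounding scheme
        best, best_f = None, None
        for _ in range(num_children):
            # standard bit mutation with rate c/n
            y = [b ^ 1 if rng.random() < c / n else b for b in x]
            fy = fitness(y)
            if best is None or fy > best_f:
                best, best_f = y, fy
        if best_f > fx:
            lam = max(1.0, lam / F)            # success: shrink the population
        else:
            lam = lam * F ** (1.0 / s)         # failure: grow the population
        x = best                               # comma selection: parent is always replaced
    return max_generations, x
```

For example, `sa_one_lambda_ea(onemax, n=100, s=0.5, F=1.5)` runs the sketch in the regime s < 1, while large values of s can be used to observe the slow regime. Dynamic BinVal and HotTopic are omitted here since they need the definitions from [41] and [46].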

Threshold of s
In Figure 1, we follow the same setup as in [32], but for a larger set of functions. We observe exactly the same threshold s = 3.4 for OneMax. For the other monotone functions that we tested, the threshold effect occurs at values below s = 3.4. This suggests that some hard monotone functions tolerate only smaller values of s than OneMax, contrary to the conjecture of Hevia Fajardo and Sudholt in [32].

Effect of F
We have shown that the SA-(1, λ)-EA with c < 1 optimizes every dynamic monotone function efficiently when s is sufficiently small and is inefficient when s is too large. Both results hold for arbitrary F. It is natural to assume that there is a threshold $s_0$ between the efficient and inefficient regime. However, Figure 2 below shows that the situation might be more complicated. For this plot, we have first empirically determined an efficiency threshold for s on Dynamic BinVal (see Figure 1), then fixed s slightly below this threshold and systematically varied the value of F. For this intermediate value of s, we see that there is a phase transition in terms of F. Hence, we conjecture that there is no threshold $s_0$ such that the SA-(1, λ)-EA is efficient for all $s < s_0$ and all F > 1, and inefficient for all $s > s_0$ and all F > 1. Rather, we conjecture that there is a 'middle range' of values of s for which it depends on the value of F whether the SA-(1, λ)-EA is efficient. Note that we know from this paper that this phenomenon can only occur for a 'middle range': both for sufficiently small s (Theorems 18, 28), and for sufficiently large s (Theorem 40), the value of F does not play a role.
In general, smaller values of F seem to be beneficial. However, the correlation is not perfect, see for example the dip for c = 0.98 and F = 5.5 in the left subplot of Figure 2. These dips also happen for some other combinations of s, F and c (not shown), and they seem to be consistent, i.e., they do not disappear with a larger number of runs or larger values of n up to n = 5000. To test whether this is due to the rounding scheme, we checked whether the effect disappears if we round λ in each generation stochastically to the next integer; e.g., λ t = 2.6 means that in generation t we create two offspring with probability 40% and three offspring with probability 60%.
The effect remains, and the runtime still seems to depend on F in a non-monotone fashion, see the right subplot of Figure 2.
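The stochastic rounding scheme described above can be sketched in a few lines of Python. This is an illustration of the rule "round up with probability equal to the fractional part", not the code used for the experiments; the function name is hypothetical.

```python
import math
import random


def stochastic_round(lam, rng):
    """Round lam down or up at random so that the expected result equals lam."""
    lower = math.floor(lam)
    frac = lam - lower           # fractional part in [0, 1)
    # Round up with probability frac, down with probability 1 - frac.
    return lower + 1 if rng.random() < frac else lower


if __name__ == "__main__":
    rng = random.Random(42)
    samples = [stochastic_round(2.6, rng) for _ in range(10000)]
    # About 40% of the samples should be 2 and about 60% should be 3.
    print(samples.count(2) / len(samples), samples.count(3) / len(samples))
```

With this rule, λ = 2.6 yields two offspring with probability 40% and three offspring with probability 60%, so the expected number of offspring equals λ.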
The impact of F is visible for all ranges c < 1, c = 1 and c > 1. For c = 1 we have only proven efficiency for sufficiently small F. However, we conjecture that there is no real phase transition at c = 1, and the 'only' difference is that our proof methods break down at this point. For fixed s, the range of suitable values of F becomes narrower with increasing c and is restricted to smaller values, while smaller values of c admit a larger range of values for F.

Conclusion
In this paper, we have studied the SA-(1, λ)-EA on dynamic monotone functions. Hevia Fajardo and Sudholt had shown an extremely strong dependency of the performance on the success rate s for the OneMax benchmark. We have shown that there is nothing specific to OneMax about the situation. The same effect happens for any (static or dynamic) monotone fitness function: for small values of s, the SA-(1, λ)-EA is efficient on all dynamic monotone functions, while for large values of s, the SA-(1, λ)-EA is inefficient on every dynamic monotone function. In the latter case, the bottleneck is not around the optimum, but rather in some area of linear distance from the optimum. Thus the SA-(1, λ)-EA is one of the surprising examples showing that some algorithms may fail in easy fitness landscapes, but succeed in hard fitness landscapes.
Hevia Fajardo and Sudholt have conjectured that the problem becomes worse the easier the fitness landscape is. Concretely, they conjectured that any parameter choice that works for OneMax should also give good results for any other landscape [38]. In a companion paper [60], we disprove this conjecture, but for an unexpected reason: there are different ways to measure 'easiness' of a fitness landscape. While it is theoretically proven that OneMax is the easiest fitness function with respect to decreasing the distance from the optimum [54], this is not the aspect that matters for the SA-(1, λ)-EA. Here, the important aspect is how easy it is to find a fitness improvement, since this may induce too small target population sizes in the SA-(1, λ)-EA. For finding fitness improvements, there are easier functions than OneMax, for example the dynamic BinVal function [41] or HotTopic functions [46], see [60] for details. It remains open to determine the easiest dynamic monotone function $f_{easiest}$ with respect to fitness improvements. A candidate for $f_{easiest}$ might be the 'adversarial' Dynamic BinVal, which we define as Dynamic BinVal (see Section 7) with the exception that the permutation is not random but chosen so that any 0-bit is heavier than any 1-bit. With this fitness function, any 0-bit flip gives a fitter child, regardless of the number of 1-bit flips, so it is intuitively convincing that it should be the easiest function with respect to fitness improvement.
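The 'adversarial' Dynamic BinVal can be illustrated with a small Python sketch. It assumes that in each generation the weights $2^0, \ldots, 2^{n-1}$ are assigned relative to the current parent, with the lightest weights on the parent's 1-bits and the heaviest on its 0-bits; the order within each group is an arbitrary choice made here, and the function name is hypothetical.

```python
def adversarial_binval_for(parent):
    """Return the BinVal-type fitness function used in the generation with this parent.

    Positions holding a 1-bit in the parent receive the weights 2^0, ..., 2^(k-1);
    positions holding a 0-bit receive the heavier weights 2^k, ..., 2^(n-1).
    """
    ones = [i for i, b in enumerate(parent) if b == 1]
    zeros = [i for i, b in enumerate(parent) if b == 0]
    weight = [0] * len(parent)
    for rank, pos in enumerate(ones + zeros):
        weight[pos] = 2 ** rank
    return lambda y: sum(w * b for w, b in zip(weight, y))


if __name__ == "__main__":
    parent = [1, 0, 1, 1, 0]
    f = adversarial_binval_for(parent)
    child = [0, 1, 0, 0, 0]  # flips one 0-bit and all three 1-bits
    print(f(parent), f(child))
```

In this sketch the parent's fitness is $2^k - 1$, where k is its number of 1-bits, while every child that flips at least one 0-bit has fitness at least $2^k$. This matches the observation that any 0-bit flip gives a fitter child, regardless of the number of 1-bit flips.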
Moreover, the conjecture of Hevia Fajardo and Sudholt might still hold if we replace OneMax by $f_{easiest}$. That is, is it true that any parameter choice that works for $f_{easiest}$ also works for any other dynamic monotone function, and perhaps even in yet more general settings?
Apart from that, the most puzzling part of the picture is the experimental finding that in a 'middle regime' of success rates, the update strength F seems to play a role in a non-monotone way (for fixed success rate s). It is open to prove theoretically that there is indeed such a 'middle regime' where F plays a role at all. For why this effect is non-monotone in F , we do not even have a good hypothesis. As outlined in Section 7, it does not seem to be a rounding effect. This shows that we are still missing important parts of the overall picture.