Theoretical Analyses of Multi-Objective Evolutionary Algorithms on Multi-Modal Objectives



Introduction
Real-world problems often contain multiple conflicting objectives. For such problems, a single best solution cannot be determined without additional information. One solution concept for such problems is to compute a set of solutions, each of which cannot be improved in one objective without worsening in at least one other objective (Pareto optima), and then let a decision maker choose one of these solutions. With their population-based nature, multiobjective evolutionary algorithms (MOEAs) are a natural choice for this approach and have indeed been applied very successfully here [ZQL+11].
Similar to the situation for single-objective evolutionary algorithms, the rigorous theoretical understanding of MOEAs falls far behind the success of these algorithms in practical applications. The classic works in this area have defined multiobjective, especially bi-objective, counterparts of well-analyzed single-objective benchmark functions used in evolutionary computation theory and have analyzed the performance of mostly very simple MOEAs on these benchmarks.
For example, in the problems COCZ [LTZ+02] and OneMinMax [GL10], the two objectives are both (conflicting) variants of the classic OneMax benchmark. The classic benchmark LeadingOnes was used to construct the LOTZ [LTZ04b] and WLPTNO [QYZ13] problems. These multiobjective benchmark problems are among the most intensively studied [Gie03, BFN08, DKV13, DGN16, BQT18, HZCH19, HZ20, COGNS20, ZLD22, BQ22, ZD22]. We note that these problems are all very easy. They are composed of unimodal objectives, and they have the further property that from each set of solutions P, a set P′ witnessing the Pareto front can be computed by repeatedly selecting a solution from P, flipping a single bit in it, adding it to P, and removing dominated solutions from P. They are thus relatively easy to solve, as witnessed by typical runtimes such as O(n^2 log n) on the OneMax-type problems or O(n^3) on the LeadingOnes-inspired benchmarks. These runtimes naturally are higher than for the single-objective counterparts because the Pareto front of the multiobjective versions has size Θ(n), hence Θ(n) times more solutions have to be computed than in the single-objective setting.
We defer a detailed discussion of the state of the art of rigorous analyses of MOEAs to Section 3 and state here only that, to the best of our knowledge, there is not a single runtime analysis work discussing in detail how MOEAs cope with problems composed of multimodal objectives.
Our contributions. This paper aims at a deeper understanding of how MOEAs cope with multiobjective problems consisting of natural, well-analyzed, multimodal objectives. In the theory of single-objective evolutionary computation, the class of Jump functions proposed by Droste et al. [DJW02] is a natural and intensively used multimodal function class that has inspired many ground-breaking results. Hence, in this paper, we design a bi-objective counterpart of the Jump function with problem size n and jump size k, called OneJumpZeroJump_{n,k}. It consists of one Jump function w.r.t. the number of ones and one Jump function w.r.t. the number of zeros, hence both objectives are isomorphic to classic jump functions. We compute its Pareto front in Theorem 7 and, as intended, observe that, different from the easy benchmarks discussed above, the Pareto front is not connected, that is, one cannot transform any solution on the Pareto front into any other solution on the Pareto front via one-bit flips. From this observation, we easily show that for all n ∈ N and all k ∈ [2..n/2], the simple evolutionary multiobjective optimizer (SEMO) cannot find the full Pareto front of the OneJumpZeroJump benchmark (Theorem 9).
The heart of this work are mathematical runtime analyses for the global SEMO (GSEMO). We first show that the classic version of this algorithm for all n and k ∈ [2..⌊n/2⌋] finds the Pareto front of OneJumpZeroJump_{n,k} in O((n − 2k + 3)n^k) iterations (and fitness evaluations) in expectation (Theorem 12). We show a matching lower bound of Ω((n − 2k)n^k) for k ∈ [4..n/2 − 1] (Theorem 13). Here and in the remainder, the asymptotic notation only hides constants independent of n and k. We note that our actual bounds are more precise. In particular, for 4 ≤ k = o(n), the expected runtime is (3/2)en^{k+1} apart from lower-order terms. This result might be the first runtime analysis of an MOEA that is tight apart from lower-order terms.
We then try to reuse two ideas that led to performance gains for multimodal problems in single-objective optimization. Doerr et al. [DLMN17] proposed the fast mutation operator, in which the number of flipped bits follows a power-law distribution with (negative) exponent β > 1. They showed that the (1 + 1) EA with this operator optimizes jump functions with jump size k by a factor of k^{Ω(k)} faster than the standard (1 + 1) EA. We show that the GSEMO with fast mutation also witnesses such an improvement over the standard GSEMO. More precisely, we prove an upper bound of O((n − 2k)(en)^k / k^{k+0.5−β}) for this algorithm (Theorem 20). We also provide a result with explicit leading constant, for the first time for an algorithm using fast mutation. We are optimistic that our non-asymptotic estimates for the number of bits flipped by this operator will be useful also in other works.
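As an illustration, the heavy-tailed choice of the mutation strength can be sketched as follows. This is a simplified sketch in the spirit of the operator of [DLMN17]; the truncation at n/2 and the function names are our assumptions, not the exact formulation used later in this paper.

```python
import random

def sample_power_law(n, beta=1.5):
    """Sample a mutation strength k in [1..n/2] with Pr[k] proportional to k^(-beta)."""
    upper = max(n // 2, 1)
    weights = [i ** (-beta) for i in range(1, upper + 1)]
    return random.choices(range(1, upper + 1), weights=weights)[0]

def fast_mutation(x, beta=1.5):
    """Heavy-tailed mutation: draw k from the power law, then flip each bit
    independently with rate k/n (so k bits are flipped in expectation)."""
    n = len(x)
    k = sample_power_law(n, beta)
    return [b ^ (random.random() < k / n) for b in x]
```

Because the power law puts non-negligible weight on large values of k, the operator still performs mostly small mutations, but escapes a gap of size k with a probability larger by roughly a k^{Θ(k)} factor than standard bit-wise mutation with rate 1/n.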
The stagnation-detection strategy of Rajabi and Witt [RW22] is a second way to cope with multimodality (in single-objective optimization). So far only used in conjunction with the (1 + λ) EA (and there mostly with the (1 + 1) EA), it consists of counting for how long no improvement has been found and using this information to set the mutation rate (higher mutation rates when no improvement was made for a longer time). With suitable choices of the parameters of this strategy, the (1 + 1) EA can optimize jump functions with jump size k faster than the standard (1 + 1) EA by a factor of k^{Ω(k)}. This is the same rough estimate as for the (1 + 1) EA with fast mutation. A more detailed calculation, however, shows that the stagnation-detection (1 + 1) EA is faster than the fast (1 + 1) EA by a factor of Θ(k^{β−0.5}). We note that this difference usually is not extremely large since β is usually chosen small (β = 1.5 was recommended by [DLMN17]) and k has to be small to admit reasonable runtimes (recall that the runtime dependence on n is Ω(n^k)). Nevertheless, in experiments conducted by [RW22], the stagnation-detection (1 + 1) EA was faster than the fast (1 + 1) EA by around a factor of three. Different from fast mutation, it is not immediately clear how to incorporate stagnation detection into the GSEMO. Being an algorithm with a non-trivial parent population, one question is whether one should count unsuccessful iterations globally or individually for each parent individual. Also, clearly the parameters of the algorithm need to be adjusted to the longer waiting times for an improvement. We succeeded in defining a stagnation-detection version of the GSEMO that effectively solves the OneJumpZeroJump problem, more precisely, that computes the full Pareto front in an expected runtime of O((n − 2k)(en)^k / k^k), again a k^{Ω(k)} factor improvement over the classic GSEMO, and reducing the runtime guarantee for the heavy-tailed GSEMO by a small factor of Ω(k^{β−0.5}), see Theorem 25.
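To make the counter-and-rate mechanism concrete, here is a minimal single-individual sketch in the spirit of the stagnation-detection (1 + 1) EA of [RW22], applied to a generic maximization problem. The threshold formula and the parameter R below are simplified assumptions of ours, not the exact choices of [RW22] nor of our GSEMO variant.

```python
import math
import random

def sd_ea(f, n, R, fmax):
    """(1+1) EA with a simplified stagnation-detection schedule: mutate with
    rate r/n; if no improvement occurred for about (en/r)^r * ln(nR)
    iterations, increase the strength r (capped at n/2) and reset the counter."""
    x = [random.randint(0, 1) for _ in range(n)]
    r, counter = 1, 0
    while f(x) < fmax:
        y = [b ^ (random.random() < r / n) for b in x]
        counter += 1
        if f(y) > f(x):
            x, r, counter = y, 1, 0  # improvement: restart with strength 1
        elif counter > (math.e * n / r) ** r * math.log(n * R):
            r, counter = min(r + 1, n // 2), 0  # stagnation: raise strength
    return x
```

The threshold is chosen so large that, with high probability, the current strength r has been given enough time to succeed before it is abandoned; this is the mechanism that must be re-parameterized for the longer waiting times of a population-based algorithm like the GSEMO.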
Via a small set of experiments, we demonstrate that these are not only asymptotic differences. Compared to the standard GSEMO, we observe roughly a factor-5 speed-up with heavy-tailed mutation and a factor-10 speed-up with our stagnation-detection GSEMO, and this already for jump size k = 4 and problem sizes n between 10 and 50.
We note that this work is an extended version of the 7-page (excluding acknowledgments and references) preliminary paper [DZ21] and differs from it in the following ways. We estimate the leading constant of the asymptotic upper bound of the runtime of the GSEMO with heavy-tailed mutation for not too large jump sizes, which was not shown in our preliminary version, nor in earlier theoretical works on heavy-tailed mutation. Besides, this version adds a formal discussion of multimodality in multiobjective problems in Section 3, and contains all mathematical proofs that had to be omitted in [DZ21] for reasons of space.
The remainder of this paper is organized as follows. Section 2 introduces the basic definitions that will be used throughout this work. In Section 3, we discuss the relevant previous works. The OneJumpZeroJump_{n,k} function class is defined in Section 4. Our mathematical runtime analyses are presented in Sections 5 to 8. Section 9 shows our experimental results. A conclusion is given in Section 10.

Basic Definitions
A multiobjective optimization problem consists of optimizing multiple objectives simultaneously. In this paper, we restrict ourselves to the maximization of bi-objective pseudo-Boolean problems f = (f_1, f_2) : {0, 1}^n → R^2. We note that only sporadic theoretical works exist that discuss the optimization of more than two objectives [LTZ04b, BQT18, HZLL21].
We briefly review the standard notation in multiobjective optimization. For any two search points x, y ∈ {0, 1}^n, we say that

• x weakly dominates y, denoted by x ⪰ y, if and only if f_1(x) ≥ f_1(y) and f_2(x) ≥ f_2(y);

• x dominates y, denoted by x ≻ y, if and only if x weakly dominates y and at least one of the two inequalities is strict.
We call x ∈ {0, 1}^n Pareto optimal if and only if there is no y ∈ {0, 1}^n such that y ≻ x. All Pareto optimal solutions form the Pareto set S* of the problem. The set F* = f(S*) of the objective values of all Pareto optimal solutions is called the Pareto front. For most multiobjective problems, the objectives are at least partially conflicting and thus there is usually not a single Pareto optimum. Since a priori it is not clear which of several incomparable Pareto optima to prefer, the most common target is to compute the Pareto front, that is, to compute a set P of solutions such that f(P) := {(f_1(x), f_2(x)) | x ∈ P} equals the Pareto front F*. This is our objective in this work as well. We note that if the Pareto front is excessively large, then one has to resort to approximating it in a suitable manner, but this will not be our problem here.
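These notions are easy to state in code. The following sketch (our own illustration, intended for brute-force use on tiny n only) computes the Pareto front of a bi-objective function by enumerating all of {0,1}^n:

```python
from itertools import product

def weakly_dominates(fx, fy):
    """fx weakly dominates fy: at least as good in both objectives."""
    return fx[0] >= fy[0] and fx[1] >= fy[1]

def dominates(fx, fy):
    """Strict domination: weak domination plus at least one strict inequality."""
    return weakly_dominates(fx, fy) and fx != fy

def pareto_front(f, n):
    """All non-dominated objective values of f : {0,1}^n -> R^2 (exponential time)."""
    values = {f(x) for x in product((0, 1), repeat=n)}
    return {v for v in values if not any(dominates(w, v) for w in values)}
```

For instance, for an OneMinMax-style pair f(x) = (|x|_1, n − |x|_1), every reachable objective value lies on the front.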
We will use |x| 1 and |x| 0 to denote the number of ones and the number of zeros of the search point x ∈ {0, 1} n .We use [a..b] to denote the set {a, a + 1, . . ., b} for a, b ∈ Z and a ≤ b.

Previous Works and A Discussion of Multimodality in Multiobjective Problems
Since this work conducts a mathematical runtime analysis of an evolutionary algorithm for discrete search spaces, we primarily discuss the state of the art inside this subarea of evolutionary computation. The mathematical runtime analysis as a rigorous means to understand evolutionary algorithms was started in the 1990s with works like [Müh92, Bäc93, Rud97]. The first analyses of MOEAs were conducted by Laumanns et al. [LTZ+02, LTZ04b], Giel [Gie03], and Thierens [Thi03].
The theory of MOEAs here often followed the successful examples given by single-objective works. For example, the multiobjective benchmarks COCZ [LTZ04b] and OneMinMax [GL10] are composed of two contradicting copies of the classic OneMax problem, and the problems LOTZ [LTZ04b] and WLPTNO [QYZ13] follow the main idea of the LeadingOnes problem [Rud97]. Due to the higher complexity of analyzing MOEAs, many topics with interesting results on single-objective EAs received almost no attention for MOEAs. For example, the first runtime analysis of a single-objective algorithm using crossover was conducted by Jansen and Wegener [JW01] and was followed up by a long sequence of important works. In contrast, only sporadic works discuss crossover in MOEAs [NT10, QYZ13, HZCH19, BQ22]. Dynamic parameter settings for single-objective EAs were first discussed by Jansen and Wegener [JW05] and then received a huge amount of attention (see, e.g., the book chapter [DD20]), whereas the only such result for MOEAs appears to be one result by Doerr et al. [DHP22]. How single-objective evolutionary algorithms cope with noise was discussed already by Droste [Dro04], but despite numerous follow-up works in single-objective optimization, there is only a single mathematical runtime analysis of an MOEA in the presence of noise [Gut12].
In this work, we regard another topic that is fairly well-understood in the evolutionary theory community on the single-objective side, but where essentially nothing is known on the multiobjective side. It is well-known in general that local optima are challenging for evolutionary algorithms. Already one of the first runtime analysis works, the ground-breaking paper by Droste et al. [DJW02], proposes a multimodal (that is, having local optima different from the global optimum) benchmark of scalable difficulty. The jump function with difficulty parameter k, later called gap size, resembles the easy OneMax function except that it has a valley of very low fitness around the optimum. This valley consists of the k − 1 first Hamming levels around the optimum, hence the search points at distance k form a local optimum that is easy to reach, but hard to leave. Mutation-based algorithms need to flip the right k bits to go from any point on the local optimum to a search point of better fitness (which here is the global optimum). While other multimodal benchmarks have been proposed and have received some attention, most notably trap functions [JS07] and the DeceptiveLeadingBlocks problem [LN19], it is safe to say that the Jump benchmark is by far the most studied one and that these studies have led to fundamental insights on how evolutionary algorithms cope with local optima.
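For intuition on why the gap size k governs the difficulty, the probability that standard bit-wise mutation with rate 1/n flips exactly the k missing bits (and nothing else) can be computed directly; the instance sizes below are arbitrary examples of ours:

```python
def gap_jump_probability(n, k):
    """Probability that standard bit-wise mutation (rate 1/n) flips exactly
    k prescribed bits and leaves the other n - k bits unchanged."""
    return (1 / n) ** k * (1 - 1 / n) ** (n - k)

# The expected waiting time for this event is 1/p, which grows like e * n^k.
p = gap_jump_probability(20, 3)
```

Since (1 − 1/n)^{n−k} ≈ e^{−1}, the expected waiting time 1/p is roughly e·n^k, matching the Θ(n^k) runtime of the (1 + 1) EA on jump functions discussed below.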
For reasons of space, we only discuss some of the most significant works on jump functions. The first work [DJW02] determined the runtime of the (1 + 1) EA on Jump functions with gap size k ≥ 2 to be Θ(n^k), showing that local optima can lead to substantial performance losses. This bound was tightened and extended to arbitrary mutation rates by Doerr et al. [DLMN17]. This analysis led to the discovery of a heavy-tailed mutation operator and, more generally, established the use of power-law distributed parameters, an idea that subsequently led to many interesting results (see, e.g., [AntipovBD21gecco, CorusOY21foga, AntipovBD22, DangELQ22]). Again, an analysis of how single-trajectory heuristics optimize jump functions led to the definition of a stagnation-detection mechanism, which in a very natural manner leads to a reduction of the time an EA is stuck on a local optimum [RW22]. Jump functions are among the first and most convincing examples showing that crossover can lead to significant speed-ups over using only mutation as variation operator [JW02, DFK+18, ADK22]. That estimation-of-distribution algorithms and ant-colony optimizers may find it easier to cope with local optima was also shown via jump functions [HS18, Doe21, BBD21, Wit23].
In contrast to this intensive discussion of multimodal functions in single-objective evolutionary optimization, almost no such results exist in multiobjective optimization. Before discussing this aspect in detail, let us agree (for this work) on the following terminology. We say that a function f : {0, 1}^n → R is unimodal if it has a unique maximum and if all other search points have a Hamming neighbor with strictly larger fitness. We say that f is multimodal if it has local optima different from the global optimum. Here we call a set S of search points a local optimum if f(x) = f(y) for all x, y ∈ S and if all neighbors of S, that is, all z ∉ S having a Hamming neighbor in S, have a smaller f-value. Note that a global optimum is also a local optimum according to this definition. With these definitions, unimodal functions are those where a hillclimber can reach the optimum via one-bit flips regardless of the initial solution. In contrast, on multimodal functions a single-trajectory heuristic definitely needs to flip more than one bit (when started in a suitable solution) or needs to accept truly inferior solutions. With our definition, plateaus of constant fitness render a function not unimodal, but they do not imply multimodality. This is convenient for our purposes as we shall not deal with such plateaus, for the simple reason that they behave differently from both unimodal functions and true local optima. We note that the role of plateaus in multiobjective evolutionary optimization has been extensively studied, among others, in [BFH+07, FHN10, FHN11, QTZ16, LZZZ16].
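The unimodality definition above can be checked mechanically for small n; the following brute-force sketch (our own illustration) tests for a unique maximum and for strictly improving Hamming neighbors everywhere else:

```python
from itertools import product

def hamming_neighbors(x):
    """All bit strings at Hamming distance one from x."""
    return [x[:i] + (1 - x[i],) + x[i + 1:] for i in range(len(x))]

def is_unimodal(f, n):
    """Unimodal per the definition above: a unique maximum, and every other
    search point has a Hamming neighbor with strictly larger f-value."""
    points = list(product((0, 1), repeat=n))
    fmax = max(f(x) for x in points)
    if sum(1 for x in points if f(x) == fmax) != 1:
        return False
    return all(f(x) == fmax or any(f(y) > f(x) for y in hamming_neighbors(x))
               for x in points)
```

OneMax passes this test; a jump function with gap size k ≥ 2 fails it, since the points at distance k from the optimum have no strictly better Hamming neighbor.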
We extend the definitions of unimodality and multimodality to multiobjective problems in the natural way: a multiobjective problem is unimodal if all its objectives are unimodal functions. Many such problems are easy, and even simple MOEAs employing one-bit flips as their only variation operator, e.g., the SEMO, can find the full Pareto front. Examples of such problems include the well-studied benchmarks inspired by the unimodal single-objective benchmarks OneMax and LeadingOnes. However, somewhat surprisingly, it is not true that the SEMO can cover the full Pareto front of all unimodal multiobjective problems, as we now show.
Lemma 1. There exists a unimodal multiobjective problem such that, with positive probability, the SEMO does not cover its full Pareto front within any finite number of iterations.
We now show that once the optimization process has reached the state described in the last line of Table 1, the Pareto front point (3, 3) cannot be reached anymore.Note that 110 is the only Pareto optimum for (3, 3), and its Hamming neighbors are 010, 100, and 111.Since f (010) = (0, 2), f (100) = (4, 0), and f (111) = (0, 4), we know that 010 and 111 are strictly dominated by 011, and that 100 is strictly dominated by 101.Hence, even if any of them is generated, it cannot survive to the next generation when the current population is {101, 011}.Therefore, 110 cannot be generated (and (3, 3) cannot be reached) via the SEMO with the one-bit mutation.
We say that a multiobjective problem is multimodal if at least one objective is multimodal. We note that this does not automatically imply that the SEMO cannot find the full Pareto front; in fact, as the following proof of Lemma 2 shows, there are multiobjective problems consisting only of multimodal objectives such that the SEMO, regardless of the initial solution, computes the full Pareto front very efficiently. Clearly, our interest in this work is in multimodal multiobjective problems that are harder to optimize. To avoid misunderstandings, we note that a second notion of multimodality exists. In multimodal optimization, the target is to find all or a diverse set of solutions having some optimality property [Pre15, LYQ16, TI20]. This notion is unrelated to our notion of multimodality.
Lemma 2. There is a bi-objective problem f = (f_1, f_2) : {0, 1}^n → R^2 such that both objectives f_1 and f_2 have several local optima, but the SEMO algorithm, regardless of its initial solution, finds the full Pareto front in time O(n^2 log n).
for all x ∈ {0, 1}^n. Hence f_1 agrees with the classic OneMax benchmark when |x|_1 is a multiple of ℓ. Solutions x with |x|_1 ≡ 1 (mod ℓ) are local optima different from the global optimum, which is x = 1^n. We define f_2 analogously, but with the roles of zeros and ones interchanged, again for all x ∈ {0, 1}^n. Clearly, f_2 has the same characteristics as f_1; in particular, it is highly multimodal with its n/ℓ + 1 local optima.
Figure 1 plots these two functions for (n, ℓ) = (50, 10). We can clearly see the n/ℓ + 1 = 6 local optima of both objectives, namely the points with |x|_1 ≡ 1 (mod ℓ) and x = 1^n for f_1, and the points with |x|_0 ≡ 1 (mod ℓ) and x = 0^n for f_2. We note that any solution is a Pareto optimum of f = (f_1, f_2). Since all solutions are Pareto optima and since f has the same classes of search points of equal fitness as OneMinMax, the optimization process of the SEMO on f is identical to the one on OneMinMax. With [GL10, Theorem 3] and its proof, our claim is proven.
Having made precise what we understand as multimodal multiobjective problem, we review the known results on such problems.
To the best of our knowledge, only very few runtime analysis works on multimodal multiobjective problems exist, and none of them focuses on the possible difficulties arising from the multimodality. Roughly speaking, these can be divided into works that construct artificial problems to demonstrate, often in an extreme way, a particular strength or weakness of an algorithm, and works on classic combinatorial optimization problems that happen to be multimodal. The first set of works gave rise to the multimodal problems ZPLG, SPG, and Dec-obj-MOP. The first two of these were proposed by Qian et al. [QTZ16] and are defined as follows.

Definition 3 (ZPLG [QTZ16]). The function ZPLG
Definition 4 (SPG [QTZ16]). The function SPG : {0, 1}^n → R × {0, 1} is defined by

In order to investigate the effect of mixing low-level heuristics, Qian et al. [QTZ16] showed that mixing fair selection w.r.t. the decision space and the objective space is beneficial for ZPLG, and that mixing 1-bit and 2-bit mutation is efficient for SPG.
In the first theoretical study of decomposition-based MOEAs, Li et al. [LZZZ16] analyzed how the MOEA/D solves several multiobjective problems, among them the following multimodal one.

Definition 5 (Dec-obj-MOP [LZZZ16]). The function Dec-obj-MOP
We note that the three problems ZPLG, SPG, and Dec-obj-MOP all appear slightly artificial. Also, each of them has only one multimodal objective. The first objective of ZPLG has local optima on x = 1^n; the first objective of SPG has local optima on x = 0^n and x = 1^i 0^{n−i}, i mod 3 = 0, i ∈ [1..n]; and the second objective of Dec-obj-MOP has two local optima on x ∈ {0^n, 1^n}.
The second type of works dealing with multimodal multiobjective problems are those which regard combinatorial optimization problems, for example, [LTZ04a, KB06, Neu07, NR08, Gre09, Hor09, QYZ13, QYZ15, QZTY18, FQT19, QBF20, RBN20, RNNF22]. Combinatorial optimization problems are almost always multimodal, simply because already a simple cardinality constraint suffices to render a problem multimodal. We note that for some such problems the multimodality is not very strong, e.g., for minimum spanning tree problems, flipping two bits suffices to leave a local optimum. Overall, all these works are relatively problem-specific, and we could not distill from them any general insights on how MOEAs cope with local optima.

The OneJumpZeroJump Problem
To study via mathematical means how MOEAs cope with multimodality, we now define and analyze a class of bi-objective functions of scalable difficulty. As mentioned above, this is strongly influenced by the single-objective Jump function class proposed in [DJW02], which is intensively used in the theory of single-objective evolutionary computation and which gave rise to many interesting results, including that larger mutation rates help in the optimization of multimodal functions, like [DLMN17], that crossover can help to cope with multimodality, like [JW02, DFK+18], and that estimation-of-distribution algorithms and the (1 + (λ, λ)) GA can significantly outperform classic evolutionary algorithms on multimodal problems, like [HS18, Doe21, AD20, ADK22].
We recall that for all n ∈ N and k ∈ [1..n], the jump function Jump_{n,k} : {0, 1}^n → R with problem size n and gap size k is defined by

Jump_{n,k}(x) = k + |x|_1 if |x|_1 ∈ [0..n − k] ∪ {n}, and Jump_{n,k}(x) = n − |x|_1 otherwise.

This function has a valley of low fitness around its optimum x* = 1^n, which can be crossed only by flipping k specific bits (or by accepting solutions with very low fitness). We define the OneJumpZeroJump_{n,k} function as a bi-objective counterpart of the function Jump_{n,k}.
Hence the first objective of OneJumpZeroJump_{n,k} is just the classic Jump_{n,k} function. The second objective has a fitness landscape isomorphic to it, but with the roles of zeros and ones interchanged, that is, f_2(x) = Jump_{n,k}(x̄), where x̄ is the bitwise complement of x.
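The two objectives can be written down directly; a minimal sketch (the function names are ours):

```python
def jump(x, k):
    """Jump_{n,k}: k + |x|_1 if |x|_1 <= n - k or |x|_1 = n, else n - |x|_1."""
    n, ones = len(x), sum(x)
    return k + ones if ones <= n - k or ones == n else n - ones

def one_jump_zero_jump(x, k):
    """f1 is Jump_{n,k} on x; f2 is Jump_{n,k} on the bitwise complement of x."""
    return jump(x, k), jump([1 - b for b in x], k)
```

For example, for n = 6 and k = 2, the all-ones string evaluates to (n + k, k) = (8, 2), while any string with three ones lies on the inner part of the Pareto front with value (5, 5).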
Proof. We firstly prove that for any given x ∈ S*, y ⊁ x holds for all y ∈ {0, 1}^n. If y ≻ x, then f_1(y) ≥ f_1(x) and f_2(y) ≥ f_2(x), and at least one of the two inequalities is strict. W.l.o.g., let f_1(y) > f_1(x). Since f_1(z) + f_2(z) ≤ 2k + n holds for all z ∈ {0, 1}^n, with equality exactly for z ∈ S*, we have f_2(y) ≤ 2k + n − f_1(y) < 2k + n − f_1(x) = f_2(x). That is, we obtain f_2(y) < f_2(x), which contradicts y ≻ x. Hence, S* is the Pareto set. We secondly prove that any given y ∈ {0, 1}^n \ S* is strictly dominated by some x ∈ S*. W.l.o.g., let |y|_1 ∈ [n − k + 1..n − 1], that is, |y|_0 ∈ [1..k − 1]. Then f_1(y) = n − |y|_1 ≤ k − 1 and f_2(y) = k + |y|_0 ≤ 2k − 1. Then we could take x with |x|_1 = n − k, and have f_1(x) = n > f_1(y) and f_2(x) = 2k > f_2(y). Hence, F* is the Pareto front and this theorem is proven.
Theorem 7 implies that for k > n/2, the Pareto front consists only of the two extremal solutions 0^n and 1^n. This appears to be a relatively special situation that is not very representative for multimodal multiobjective problems. For this reason, in the remainder, we shall only regard the case k ≤ n/2. In this case, again from Theorem 7, we easily obtain in the following corollary a general upper bound on the size of any set of solutions without pairwise weak domination, and thus also on the size of the population in the algorithms discussed in this work.

SEMO Cannot Optimize OneJumpZeroJump Functions
The simple evolutionary multiobjective optimizer (SEMO), proposed by [LTZ+02], is a well-studied basic benchmark algorithm in multiobjective evolutionary theory [QYZ13, LZZZ16]. It is a multiobjective analogue of the randomized local search (RLS) algorithm, which starts with a random individual and tries to improve it by repeatedly flipping a single random bit and accepting the better of parent and offspring. As a multiobjective algorithm trying to compute the full Pareto front, the SEMO naturally has to work with a non-trivial population. This is initialized with a single random individual. In each iteration, a random parent is chosen from the population. It generates an offspring by flipping a random bit. The offspring enters the population if it is not weakly dominated by some individual already in the population. In this case, any individual dominated by it is removed from the population. The details of the SEMO are shown in Algorithm 1. We note that more recent works such as [FHN10] define the (G)SEMO minimally differently, namely so that in case of equal objective values the offspring enters the population and replaces the existing individual with this objective value. Preferring offspring in case of equal objective values allows EAs to traverse plateaus of equal fitness, and this is generally preferred over keeping the first solution with a certain objective value. For our purposes, this difference is not important since all points with equal fitness behave identically.
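The SEMO loop just described can be sketched as follows (a minimal illustration of ours, keeping the first solution per objective value as in the variant used in this paper):

```python
import random

def semo(f, n, iterations):
    """SEMO sketch: one-bit mutation; an offspring enters only if no stored
    objective value weakly dominates it, and it then replaces all values it
    weakly dominates. Returns the final population as {f(x): x}."""
    x = tuple(random.randint(0, 1) for _ in range(n))
    pop = {f(x): x}
    for _ in range(iterations):
        parent = random.choice(list(pop.values()))
        i = random.randrange(n)
        child = parent[:i] + (1 - parent[i],) + parent[i + 1:]
        fc = f(child)
        if not any(v[0] >= fc[0] and v[1] >= fc[1] for v in pop):
            pop = {v: y for v, y in pop.items()
                   if not (fc[0] >= v[0] and fc[1] >= v[1])}
            pop[fc] = child
    return pop
```

On an easy problem such as OneMinMax, this loop quickly collects the whole front; its one-bit mutation is exactly what makes it fail on disconnected fronts.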
In the following Theorem 9, we show that the SEMO cannot cope with the multimodality of the OneJumpZeroJump_{n,k} function. Even with infinite time, it does not find the full Pareto front of any OneJumpZeroJump function. We note that this result is not very surprising and we prove it rather for reasons of completeness. It is well-known in single-objective optimization that the randomized local search heuristic with positive (and often very high) probability fails on multimodal problems (the only published reference we found is [DJK08, Theorem 9], but surely this was known before). It is for the same reason that the SEMO algorithm cannot optimize the OneJumpZeroJump function.

Algorithm 1 SEMO
1: Generate x ∈ {0, 1}^n uniformly at random and P ← {x}
2: loop
3: Uniformly at random select one individual x from P
4: Generate x′ via flipping one bit of x chosen uniformly at random
5: if there is no y ∈ P such that x′ ⪯ y then
6: P ← (P \ {z ∈ P | x′ ⪰ z}) ∪ {x′}
7: end if
8: end loop
From the definition of OneJumpZeroJump_{n,k}, we know that any newly generated search point with a number of ones in [k..n − k] will lead to the removal from the current population of all individuals with a number of zeros or ones in [1..k − 1] and will also prevent any such individual from entering the population in the future.
W.l.o.g., suppose that the initial individual has at most n/2 ones. We show that 1^n can never enter the population. Assume first that never in this run of the SEMO a search point with n − k ones is generated. Since each offspring has a Hamming distance of one from its parent, the statement "never an individual with n − k or more ones is generated" holds in each generation. In particular, the search point 1^n is never generated.
Hence assume now that at some time t for the first time an individual x with n − k ones is generated.As in the previous paragraph, up to this point no search point with more than n − k ones is generated.In particular, the point 1 n is not generated so far.Since x lies on the Pareto front, it enters the population.From this point on, by our initial considerations, no individual with n − k + 1 to n − 1 ones can enter the population.In particular, at no time a parent with n − 1 ones can be selected, which is a necessary precondition for generating the point 1 n .Consequently, the point 1 n will never be generated.

Runtime Analysis for the GSEMO
As we have shown in the previous section, to deal with the OneJumpZeroJump benchmark, a mutation-based algorithm needs to be able to flip more than one bit. The global SEMO (GSEMO), proposed by [Gie03], is a well-analyzed MOEA that has this ability. Generalizing the (1 + 1) EA, it uses standard bit-wise mutation, that is, each bit is flipped independently with the same probability of, usually, 1/n. The details are shown in Algorithm 2.

Algorithm 2 GSEMO
1: Generate x ∈ {0, 1}^n uniformly at random and P ← {x}
2: loop
3: Uniformly at random select one individual x from P
4: Generate x′ via flipping each bit of x independently with probability 1/n
5: if there is no y ∈ P such that x′ ⪯ y then
6: P ← (P \ {z ∈ P | x′ ⪰ z}) ∪ {x′}
7: end if
8: end loop

Theorem 12 will show that the GSEMO can find the Pareto front. To make the main proof clearer, some propositions are extracted as the following lemmas.
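A runnable sketch of the GSEMO (ours; it differs from a SEMO implementation only in the mutation operator):

```python
import random

def gsemo(f, n, iterations):
    """GSEMO sketch: standard bit-wise mutation with rate 1/n; survivor
    selection keeps exactly the non-weakly-dominated objective values."""
    x = tuple(random.randint(0, 1) for _ in range(n))
    pop = {f(x): x}
    for _ in range(iterations):
        parent = random.choice(list(pop.values()))
        child = tuple(b ^ (random.random() < 1 / n) for b in parent)
        fc = f(child)
        if not any(v[0] >= fc[0] and v[1] >= fc[1] for v in pop):
            pop = {v: y for v, y in pop.items()
                   if not (fc[0] >= v[0] and fc[1] >= v[1])}
            pop[fc] = child
    return pop
```

Since every bit flips independently, any specific k-bit jump occurs with positive probability in each iteration, which is why the GSEMO, unlike the SEMO, can cross the fitness valleys of OneJumpZeroJump.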
where we use that the population size is at most n − 2k + 3, see Corollary 8. Hence, by a simple Markov chain argument (similar to Wegener's fitness level method [Weg01]), the expected time to find all (a, 2k + n − a), a ∈ [b..n], in the Pareto front, where (b, 2k + n − b) denotes the objective value of the first solution in the inner region, is at most

sum_{a=b}^{n−1} en(n − 2k + 3)/(n + k − a). (1)

Similarly, the probability to find (a − 1, 2k + n − a + 1) in the Pareto front once the population contains a solution with function value (a, 2k + n − a) is at least (a − k)/(en(n − 2k + 3)), and the expected time to find all (a, 2k + n − a), a ∈ [2k..b], in the Pareto front is at most

sum_{a=2k+1}^{b} en(n − 2k + 3)/(a − k). (2)

By comparing the individual summands, the sum of (1) and (2) is at most 2en(n − 2k + 3)(ln⌈n/2⌉ + 1).

From a population covering the whole inner part of the Pareto front, the two missing extremal points of the front can be found in roughly O(n^{k+1}) iterations, as the following Lemma 11 shows.

Proof. We pessimistically calculate the time to generate the two elements (a, 2k + n − a), a ∈ {k, n + k}, of the Pareto front, ignoring the fact that the current population may already contain the corresponding solutions 0^n and 1^n. Similar to the proof of Lemma 10, we note that the probability to find (k, k + n) in the Pareto front from a solution with function value (2k, n) is at least 1/(en^k(n − 2k + 3)). The same estimate holds for the probability to find (k + n, k) in the Pareto front from a solution with function value (n, 2k). Hence, the expected time to find one of (k, n + k) and (k + n, k) is at most (e/2)n^k(n − 2k + 3), and the expected time to find the remaining element of the Pareto front is at most en^k(n − 2k + 3) additional iterations.

Now we establish our main result for the GSEMO.
Theorem 12. The expected runtime of the GSEMO optimizing the OneJumpZeroJump_{n,k} function is at most en(n − 2k + 3)(1 + 2(ln⌈n/2⌉ + 1) + (3/2)n^{k−1}).

Proof. We first consider the time until the population P contains an x such that f(x) = (n, 2k), that is, with |x|_0 = k; such an x lies on the Pareto front and thus dominates all search points with m ∈ [1..k − 1] zeros. Hence, we pessimistically add up the expected waiting times for increasing the number of zeros one by one until one individual with k zeros is found, which needs an expected number of iterations of at most en(n − 2k + 3). (3)

Then from Lemma 10, we know that all (a, 2k + n − a), a ∈ [2k..n], in the Pareto front will be found in an expected number of at most 2en(n − 2k + 3)(ln⌈n/2⌉ + 1) iterations. After that, from Lemma 11, the points (k, n + k) and (n + k, k) in the Pareto front will be found in an expected number of at most (3/2)en^k(n − 2k + 3) additional iterations. Hence, the expected number of iterations to find the Pareto front is at most en(n − 2k + 3)(1 + 2(ln⌈n/2⌉ + 1) + (3/2)n^{k−1}).

We finally show that this bound is very tight. For convenience, we only consider the case k ≥ 4. In this case, it is sufficiently hard to generate an extremal point of the Pareto front from the inner part, so that we do not need a careful analysis of the population dynamics in the short initial phase in which the inner part of the Pareto front is detected; such an analysis is not the main theme of this work. That said, we are very optimistic that it would pose no greater difficulties to show that with high probability, at the first time when a solution with less than n/4 or more than 3n/4 ones enters the population, the population already contains Θ(n) individuals.
Hence, from this point on, an outermost point of the inner region is chosen as parent only every Θ(n) iterations, and this should suffice to prove an Ω(n^{k+1}) lower bound also for k = 2 and k = 3.
The expected runtime of the GSEMO optimizing the OneJumpZeroJump_{n,k} function is at least Ω((n − 2k)n^k); this matches the upper bound of Theorem 12 apart from lower-order terms.
Proof. We first prove that with high probability the inner region of the Pareto front will be fully covered before 0^n or 1^n is generated, and then use this to bound the time until the two extremal points of the Pareto front are found.
For the initial individual x, the additive Chernoff bound (see, e.g., [Doe20, Theorem 1.10.7]) shows that with high probability we have (1/4)n < |x|_1 < (3/4)n. If k ≤ |x|_1 ≤ n − k, a point of the inner part of the Pareto front is already found. Otherwise, w.l.o.g., let |x|_1 < k. We now show that with probability at least 1 − n^{−n/4+2} − 2n^{−n+2k} − en^{−1}, an individual x′ with k ≤ |x′|_1 ≤ n − k is generated earlier than 1^n or 0^n. Note that before such an x′ appears, the population consists of one individual with (1/4)n < |x|_1 < k, except if 1^n, 0^n, or a y with |y|_1 > n − k is generated. The probability to generate one of these exceptions from one individual with (1/4)n < |x|_1 < k in one iteration is superpolynomially small (here we use k ≥ 1). Hence, with probability at least 1 − n^{−n/4+2} − 2n^{−n+2k}, the population consists of only one individual with (1/4)n < |x|_1 < k during the first n^2 iterations. From a calculation similar to (3), except changing n − 2k + 3 to 1, it takes an expected time of at most en iterations to generate such an x′. Via Markov's inequality, with probability at least 1 − en/n^2 = 1 − e/n, such an x′ is generated within n^2 iterations. Hence, with probability at least 1 − n^{−n/4+2} − 2n^{−n+2k} − en^{−1}, an x′ with k ≤ |x′|_1 ≤ n − k is generated before 1^n or 0^n.

From this point on, we first show that with probability at least 1 − 8e(ln n + 1)/n^{k−3}, all (a, 2k + n − a), a ∈ [2k..n], in the Pareto front are found before 0^n or 1^n is generated. It is not difficult to see that any search point y in either of the two gaps, that is, with |y|_1 ∈ [1..k − 1] ∪ [n − k + 1..n − 1], has both objective values less than any search point z with |z|_1 ∈ [k..n − k], and thus cannot enter the population. Therefore, the probability to generate 0^n or 1^n from some parent x with |x|_1 ∈ [k..n − k] in one iteration is at most 2n^{−k}. (4)

From Lemma 10, we know that all (a, 2k + n − a), a ∈ [2k..n], in the Pareto front will be found in an expected number of at most 2en(n − 2k + 3)(ln⌈n/2⌉ + 1) iterations. Via Markov's inequality, with probability at least 1 − 1/n, all (a, 2k + n − a), a ∈ [2k..n], in the Pareto front will be found within 2en^3(ln n + 1) iterations. Hence, with (4), the event that during these 2en^3(ln n + 1) iterations neither 0^n nor 1^n is generated has probability at least 1 − 4e(ln n + 1)/n^{k−3}.

Now all (a, 2k + n − a), a ∈ [2k..n], in the Pareto front are found and the current population has size n − 2k + 1. From this time on, the probability to generate one of the remaining Pareto optimal solutions 0^n and 1^n in one iteration is at most 2/((n − 2k + 1)n^k). Hence, the expected time to find 0^n or 1^n is at least (n − 2k + 1)n^k/2. W.l.o.g., 0^n is found first. Now the current population size is n − 2k + 2, and the probability to find the remaining 1^n in one iteration is at most 1/((n − 2k + 2)n^k), so that at least (n − 2k + 2)n^k additional iterations are needed in expectation to cover the whole Pareto front.
In summary, together with the above-discussed probability for the initial individual and the probability that all (a, 2k + n − a), a ∈ [2k..n], in the Pareto front are found before 0^n or 1^n is generated, we obtain the claimed lower bound on the time to cover the Pareto front.

GSEMO with Heavy-Tailed Mutation
In the previous section, we have shown that the GSEMO can optimize our multimodal optimization problem, but, similar to the single-objective world (say, the optimization of Jump functions via simple evolutionary algorithms [DJW02]), the runtime increases significantly with the distance k that a solution on the Pareto front can have from all other solutions on the front. As discussed in [DLMN17], increasing the mutation rate can improve the time it takes to jump over such gaps. However, that work also showed that a deviation from the optimal mutation rate can be costly: a small constant-factor deviation from the optimal rate k/n leads to a performance loss exponential in k. For this reason, a heavy-tailed mutation operator was proposed. Compared to using the optimal (usually unknown) rate, it only loses a small polynomial factor (in k) in performance.
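To make this sensitivity concrete, the following small computation (an illustration with parameters of our choosing, not taken from [DLMN17]) evaluates the probability that standard bit-wise mutation with rate r/n flips exactly a prescribed set of k bits and no other bit, which is the event needed to cross a gap of size k:

```python
def jump_probability(n, k, r):
    """Probability that bit-wise mutation with rate r/n flips exactly
    the k prescribed bits (and none of the other n - k bits)."""
    p = r / n
    return p ** k * (1 - p) ** (n - k)

n, k = 100, 8
for r in (k / 2, k, 2 * k):  # half, optimal, and double the optimal rate
    print(f"rate {r}/n: jump probability {jump_probability(n, k, r):.3e}")
```

For n = 100 and k = 8, both halving and doubling the rate k/n already lose a noticeable constant factor per gap bit, hence a factor exponential in k overall, which motivates randomizing the rate instead of guessing it.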
We now equip the GSEMO with the heavy-tailed mutation operator from [DLMN17] and observe similar advantages.We start by introducing the heavy-tailed mutation operator and giving a first non-asymptotic estimate for the probabilities to flip a certain number of bits.

Heavy-Tailed Mutation Operator
The following capped power-law distribution will be used in the definition of the heavy-tailed mutation operator.
The heavy-tailed mutation operator proposed by [DLMN17], in the remainder denoted by mut_β(·), in each application independently samples a number α from the power-law distribution D^β_{n/2} and then uses standard bit-wise mutation with mutation rate α/n, that is, flips each bit independently with probability α/n. [DLMN17] shows the following two properties of mut_β.
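A minimal Python sketch of this operator (the function names are our own; Pr[α = i] is taken proportional to i^{−β} on [1..⌊n/2⌋], following the capped power-law distribution above):

```python
import random

def sample_power_law(beta, u):
    """Sample alpha from the capped power-law distribution D^beta_u,
    i.e., Pr[alpha = i] is proportional to i^(-beta) for i in 1..u."""
    weights = [i ** (-beta) for i in range(1, u + 1)]
    return random.choices(range(1, u + 1), weights=weights)[0]

def mut_beta(x, beta=1.5):
    """Heavy-tailed mutation: draw alpha ~ D^beta_{n/2}, then flip each
    bit of x independently with probability alpha/n."""
    n = len(x)
    alpha = sample_power_law(beta, n // 2)
    return [1 - b if random.random() < alpha / n else b for b in x]
```

Note that small values of α (and thus rates close to 1/n) receive constant probability mass, while every rate up to 1/2 is still sampled with polynomial probability in n.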

Lemma 15 ([DLMN17]). Let x ∈ {0, 1}^n and y ∼ mut_β(x), and let H(x, y) denote the Hamming distance between x and y. Then we have the following bounds.

In order to obtain an understanding of the leading constants when generating a solution with a not too large Hamming distance, we prove the following estimates, which have not been shown before and which can also be applied to other previous theoretical analyses of heavy-tailed mutation. We suspect that tighter estimates of the leading constants exist, but leave this as interesting future work.

Lemma 16. Let x ∈ {0, 1}^n and y ∼ mut_β(x), and let H(x, y) denote the Hamming distance between x and y. Then we have the following bounds.

The proof of Lemma 16 will use an elementary estimate for the binomial coefficient (n choose k) when k ≤ √n.
Since we have not found a good reference for it and since we believe that it might be useful for other runtime analyses, we formulate it as a separate lemma and give a formal proof.
Proof. Our claim holds trivially for i = 0 and i = 1. For larger i, a direct computation proves the claim.

Now we prove Lemma 16.
Proof of Lemma 16. We calculate (5), where the second inequality uses Stirling's approximation. Using that j ↦ j^i(1 − j/n)^{n−i} is unimodal with its maximum at j = i, we compute the desired bound, where the first inequality uses i ≤ √n and the third inequality uses n ≥ 2.
Then we obtain the claim, where the last inequality uses (5). Hence, the lemma is proven.
Equipping the standard GSEMO with this mutation operator mut_β gives Algorithm 3, which we call GSEMO-HTM.

Algorithm 3
The GSEMO-HTM algorithm with power-law exponent β > 1
1: Generate x ∈ {0, 1}^n uniformly at random, and P ← {x}
2: loop
3: Uniformly at random select one individual x from P
4: Generate x′ ← mut_β(x)
5: if there is no y ∈ P that strictly dominates x′ then
6: P ← {z ∈ P | x′ does not weakly dominate z} ∪ {x′}
7: end if
8: end loop
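For concreteness, the loop of Algorithm 3 can be sketched in Python as follows (a self-contained illustration; the function names, the dictionary-based population bookkeeping, and the OneMinMax usage example are our own choices, not part of the paper):

```python
import random

def dominates(u, v):
    """u strictly dominates v: at least as good everywhere, better somewhere."""
    return all(a >= b for a, b in zip(u, v)) and any(a > b for a, b in zip(u, v))

def gsemo_htm(f, n, beta=1.5, budget=50_000):
    """Sketch of the GSEMO-HTM main loop: the standard GSEMO population
    update with heavy-tailed instead of standard bit-wise mutation."""
    weights = [i ** (-beta) for i in range(1, n // 2 + 1)]
    x = tuple(random.randint(0, 1) for _ in range(n))
    pop = {x: f(x)}                        # genotype -> objective vector
    for _ in range(budget):
        parent = random.choice(list(pop))
        alpha = random.choices(range(1, n // 2 + 1), weights=weights)[0]
        child = tuple(1 - b if random.random() < alpha / n else b
                      for b in parent)
        fc = f(child)
        if not any(dominates(fv, fc) for fv in pop.values()):
            # keep only solutions the child does not weakly dominate
            pop = {p: fv for p, fv in pop.items()
                   if not (fv == fc or dominates(fc, fv))}
            pop[child] = fc
    return pop

# Usage: bi-objective OneMinMax, whose Pareto front has n + 1 points.
n = 10
front = gsemo_htm(lambda x: (sum(x), n - sum(x)), n)
```

On OneMinMax the sketch quickly collects one solution per Pareto front point; on OneJumpZeroJump the heavy-tailed rate is what makes the final k-bit jumps affordable.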

GSEMO-HTM on OneJumpZeroJump n,k
We now analyze the runtime of the GSEMO-HTM on OneJumpZeroJump_{n,k}. We start with an estimate for the time it takes to find the inner part of the Pareto front.

Lemma 18. Consider the GSEMO-HTM optimizing the OneJumpZeroJump_{n,k} function. If in a certain iteration the point (a_0, 2k + n − a_0) is covered by the current population, then all (a, 2k + n − a), a ∈ [2k..n], in the Pareto front will be found in an expected number of at most 2n(n − 2k + 3)(ln⌈n/2⌉ + 1)/P^β_1 iterations.

Proof. The proof is very similar to the one of Lemma 10. As there, we ignore any positive effect of a mutation flipping more than one bit. Noting that the corresponding solution for (a, 2k + n − a) in the Pareto front has a − k ones, the probability to find (a + 1, 2k + n − a − 1) in the Pareto front, conditional on the population containing one solution with function value (a, 2k + n − a), is at least the probability that this solution is chosen as parent and mut_β flips exactly one of its zeros. Hence, the expected time to find all (a, 2k + n − a), a ∈ [a_0..n], is at most the sum (6). Similarly, the probability to find (a − 1, 2k + n − a + 1) in the Pareto front, conditional on the population containing a solution with function value (a, 2k + n − a), admits the analogous bound, and the expected time to find all (a, 2k + n − a) for a ∈ [2k..a_0] is at most the sum (7). By comparing the individual summands, the sum of (6) and (7) is at most the bound claimed in the lemma.

We now estimate the time to find the two extreme Pareto front points after the coverage of the inner part of the Pareto front.

Lemma 19. Consider the GSEMO-HTM optimizing the OneJumpZeroJump_{n,k} function. Assume that the population covers the whole inner part of the Pareto front. Then the two extremal points of the Pareto front will be found in an expected number of at most (3/2)n^k(n − 2k + 3)/P^β_k additional iterations.

Proof. We note that the probability to find (k, k + n) in the Pareto front from one solution with function value (2k, n), or to find (k + n, k) in the Pareto front from one solution with function value (n, 2k), is at least P^β_k/(n^k(n − 2k + 3)). Hence, the expected time to find the first of (k, n + k) and (k + n, k) is at most n^k(n − 2k + 3)/(2P^β_k); w.l.o.g., (k, n + k) is found first. Then the probability to find (k + n, k) in the front from one solution with function value (n, 2k) is at least P^β_k/(n^k(n − 2k + 3)). Hence, the expected additional time to find the last missing element of the Pareto front is at most n^k(n − 2k + 3)/P^β_k. Then this lemma is proven.

Now we are ready to show the runtime for the GSEMO-HTM on OneJumpZeroJump_{n,k}.
Theorem 20. The expected runtime of the GSEMO-HTM optimizing the OneJumpZeroJump_{n,k} function is at most the bound (8) stated in the proof.

Proof. We first consider the time to find one (a, 2k + n − a), a ∈ [2k..n], in the Pareto front. If the initial search point has a number of ones in [k..n − k], then such a point is already found. Otherwise, the initial search point has at most k − 1 ones or at most k − 1 zeros. W.l.o.g., we consider an initial point with m ≤ k − 1 zeros. Similar to the discussion in the proof of Theorem 12, we pessimistically add up the waiting times to increase the number of zeros one by one until one individual with k zeros is found. Then from Lemma 18, we know that all (a, 2k + n − a), a ∈ [2k..n], in the Pareto front will be found in the expected time stated there. After that, from Lemma 19, the points (k, n + k) and (n + k, k) enter the population in the expected time stated there. Hence, the expected time to cover the Pareto front is at most (8), where the first equality uses Lemma 15.
To give a non-asymptotic runtime bound, by Lemma 16, for k ∈ [2..⌊√n⌋], the first line of (8) can be computed explicitly. Via Stirling's approximation √(2π) n^{n+0.5} e^{−n} ≤ n! ≤ e n^{n+0.5} e^{−n}, and due to the fact that n/(n − k) ≤ 2, we easily obtain the following runtime bound.

Theorem 21. For k ∈ [2..⌊√n⌋], the expected runtime of the GSEMO-HTM optimizing the OneJumpZeroJump_{n,k} function is at most the simplified version of (8) computed above.

Comparing Theorem 12 with Theorem 20 (and Theorem 21), we see that the asymptotic expected runtime of the GSEMO-HTM on OneJumpZeroJump_{n,k} is smaller than that of the GSEMO by a factor of around k^{k+0.5−β}/e^k.
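To get a feeling for the size of this factor, it can be evaluated numerically (an illustration of ours for the recommended β = 1.5, where the factor becomes k^{k−1}/e^k):

```python
import math

beta = 1.5
for k in range(4, 11):
    # speed-up factor k^(k + 0.5 - beta) / e^k of the GSEMO-HTM over the GSEMO
    factor = k ** (k + 0.5 - beta) / math.e ** k
    print(f"k = {k:2d}: speed-up about {factor:.3g}")
```

Already for moderate gap sizes the factor grows super-exponentially in k; this is the k^{Ω(k)} advantage discussed in the conclusion.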

GSEMO with Stagnation Detection
In this section, we discuss the non-trivial question of how to adapt the stagnation-detection strategy proposed by [RW22], which increases the mutation rate with the time during which no improvement has been found, to multiobjective optimization. We define a stagnation-detection variant of the GSEMO that has a slightly better asymptotic performance on OneJumpZeroJump than the GSEMO with heavy-tailed mutation. In contrast to this positive result on OneJumpZeroJump, we speculate that this algorithm may have difficulties with plateaus of constant fitness.

[RW22] proposed the following strategy to adjust the mutation rate during the run of the (1 + 1) EA. We recall that the (1 + 1) EA is a very elementary EA working with a parent and offspring population size of one, that is, it generates in each iteration one offspring from the unique parent via mutation and accepts it if it is at least as good as the parent. The classic mutation operator for this algorithm is standard bit-wise mutation with mutation rate 1/n, that is, the offspring is generated by flipping each bit of the parent independently with probability 1/n.

The Stagnation-Detection Strategy of Rajabi and Witt
The main idea of the stagnation-detection approach is as follows. Assume that the (1 + 1) EA has not accepted any new solution for a longer time, say at least 10n ln n iterations. Then, with high probability, it has generated (and rejected) all Hamming neighbors of the parent. Consequently, there is no use in generating these solutions again, and the algorithm should better concentrate on solutions further away from the parent. This can be achieved by increasing the mutation rate. For example, with a mutation rate of 2/n, the rate at which Hamming neighbors are produced is already significantly reduced; however, Hamming neighbors can still be generated, which is important in case we were unlucky so far and missed one of them.
More generally and more precisely, to implement this stagnation-detection approach, the (1 + 1) EA maintains a counter ("failure counter") that keeps track of how long the parent individual has not given rise to a better offspring. This counter determines the current mutation rate. This dependency is governed by a safety parameter R, which is recommended to be at least n. Then, for r = 1, 2, . . . in this order, the mutation rate r/n is used for T_r iterations, with T_r as defined in (9); the rate ⌊n/2⌋/n is used until an improvement is found, that is, the mutation rate is never increased above 1/2. When a strictly improving solution is found, the counter is reset to zero, and consequently, the mutation rate starts again at 1/n. [RW22] showed that with this strategy the (1 + 1) EA optimizes Jump_{n,k} in an expected runtime that, apart from lower-order terms, is (en/k)^k. This is faster than the runtime of Θ(k^{β−0.5}(en/k)^k) proven by [DLMN17] for the (1 + 1) EA with heavy-tailed mutation with power-law exponent β > 1 by a factor of k^{β−0.5}. For the recommended choice β = 1.5, this factor is Θ(k).
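The strategy can be sketched for the (1 + 1) EA as follows (a Python illustration under our reading that the phase length is T_r = ⌈2(en/r)^r ln(nR)⌉, consistent with the phase lengths used later for the SD-GSEMO; only strict improvements are accepted, matching the counter-reset rule):

```python
import math
import random

def sd_one_plus_one_ea(f, n, R=None, budget=100_000):
    """(1 + 1) EA with stagnation detection: after T_r unsuccessful
    iterations at rate r/n, the rate is raised to (r + 1)/n, capped at
    roughly 1/2; a strict improvement resets the rate to 1/n (maximization)."""
    R = R or n
    x = [random.randint(0, 1) for _ in range(n)]
    fx, r, fails = f(x), 1, 0
    for _ in range(budget):
        y = [1 - b if random.random() < r / n else b for b in x]
        fy = f(y)
        if fy > fx:                       # strict improvement: reset
            x, fx, r, fails = y, fy, 1, 0
        else:
            fails += 1
            T_r = math.ceil(2 * (math.e * n / r) ** r * math.log(n * R))
            if fails >= T_r and r < n // 2:
                r, fails = r + 1, 0       # stagnation detected: raise rate
    return x, fx
```

On a unimodal function the rate essentially stays at 1/n; on a jump-type landscape the counter eventually pushes the rate towards k/n, where the gap is crossed fastest.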

Adaptation of the Stagnation-Detection Strategy to Multiobjective Optimization
The (1 + 1) EA being an algorithm without a real population, it is clear that certain adaptations are necessary to use the stagnation-detection strategy in multiobjective optimization.

Global or Individual Failure Counters
The first question is how to count the number of unsuccessful iterations. The following two obvious alternatives exist.

Individual counters: From the basic idea of the stagnation-detection strategy, the most natural solution is to equip each individual with its own counter. Whenever an individual is chosen as parent in the GSEMO, its counter is increased by one. New solutions entering the population, as well as the random initial individual, start with a counter value of zero (see the following subsection for an important technicality of what "new" shall mean).
A global counter: Algorithmically simpler is the approach to use only one global counter. This counter is increased in each iteration. When a new solution enters the population, the global counter is reset to zero.
We suspect that for many problems, both ways of counting give similar results. The global counter appears wasteful in the sense that when a new individual enters the population, also parents that have been contained in the population for a long time re-start generating offspring with mutation rate 1/n, despite the fact that they have, with very high probability, already generated all nearby solutions as offspring. On the other hand, often these "old individuals" do not generate solutions that enter the population anyway, so that an optimized choice of the mutation rate is less important for them.
For the OneJumpZeroJump problem, it is quite clear that this second effect is dominant. A typical run starts with some individual in the inner region of the Pareto front. In a relatively short time, the whole middle region is covered, and for this it suffices that relatively recent solutions generate a suitable Hamming neighbor as offspring. The runtime is dominated by the time to find the two extremal solutions, and this will almost always happen from the closest parent in the middle region of the front (regardless of whether individual counters or a global counter is used). For this reason, we analyze in the following the simpler approach using a global counter.

Dealing with Indifferent Solutions
One question that becomes critical when using stagnation detection is how to deal with indifferent solutions, that is, which solution to put or keep in the population in the case that an offspring y has the same (multiobjective) fitness as an individual x already in the population. Since f(x) = f(y), neither solution dominates the other, that is, both solutions do an equally good job in dominating others and thus in approximating the Pareto front. In early works, e.g., [LTZ+02] proposing the SEMO algorithm, such later-generated indifferent solutions do not enter the population. This is partially justified by the fact that in many of the problems regarded in these works, search points with equal fitness are fully equivalent for the future run of the algorithm. We note that our OneJumpZeroJump problem also has this property, hence all results presented so far are valid regardless of how indifferent solutions are treated.
When non-equivalent search points with equal fitness exist, it is less obvious how to deal with indifferent solutions. In particular, it is clear that larger plateaus of constant fitness can be traversed much more easily when a new indifferent solution is accepted, as this allows imitating a random-walk behavior on the plateau [BFH+07]. For that reason, and in analogy to single-objective optimization [JW01], it seems generally more appropriate to let a new indifferent solution enter the population, and this is also what most of the later works on the SEMO and GSEMO algorithms do [FHN10, FHN11, QTZ16, LZZZ16, BQT18, COGNS20].
Unfortunately, as mentioned in Section 1, it is not so clear how to handle indifferent solutions together with stagnation detection. In principle, when a new solution enters the population, the failure counter has to be reset to zero in order to reset the mutation rate to 1/n. Otherwise, the continued use of a high mutation rate would prohibit finding good solutions in the direct neighborhood of the new solution. However, the acceptance of indifferent solutions can also lead to unwanted resets. For the OneJumpZeroJump problem, for example, it is easy to see by mathematical means that in a typical run, an indifferent solution is generated very frequently. If such a solution enters the population with a reset of a global failure counter (or an individual counter), then the regular resets prevent the counters from reaching interesting values. In a quick experiment for n = 50, k = 4, and a global counter, the largest counter value ever reached in a run of over 500,000,000 iterations was 5. Consequently, this SD-GSEMO was far from ever increasing the mutation rate and just imitated the classic GSEMO.
For this reason, in this work we regard the GSEMO with stagnation detection only in the variant that does not accept indifferent solutions, and we note that our positive results on the stagnation-detection mechanism will therefore not carry over to problems with non-trivial plateaus of constant fitness.

Adjusting the Self-Adjustment
In the (1 + 1) EA with stagnation detection, [RW22] increased the mutation rate from r/n to (r + 1)/n once the rate r/n had been used for T_r iterations, with T_r as defined in (9). This choice ensured that any particular target solution in Hamming distance r is found in this phase with probability at least 1 − (nR)^{−2}, see the proof of Lemma 2 in [RW22]. Since in a run of the GSEMO with current population size |P|, each member of the population is chosen as parent only an expected number of T_r/|P| times in a time interval of length T_r, we need to adjust the parameter T_r. Not surprisingly, by taking

T̃_r = ⌈2|P|(en/r)^r ln(nR)⌉, (10)

that is, roughly |P| T_r, the probability to generate any particular solution in Hamming distance r in phase r is at least 1 − (nR)^{−2}, which is sufficient for our purposes. Note that the population size |P| changes only if a new solution enters the population, and in this case the mutation rate is reset to 1/n. Hence the definition of T̃_r, with the convention that we suppress |P| in the notation to ease reading, is unambiguous.

The GSEMO with Stagnation Detection: SD-GSEMO
Putting the design choices discussed so far together, we obtain the following variant of the GSEMO, called SD-GSEMO. Its pseudocode is shown in Algorithm 4.
Algorithm 4 SD-GSEMO with safety parameter R
1: Generate x ∈ {0, 1}^n uniformly at random, and P ← {x}
2: r ← 1 and u ← 0
3: loop
4: Uniformly at random select one individual x from P
5: Generate x′ via independently flipping each bit value of x with probability r/n
6: if there is no y ∈ P that weakly dominates x′ then
7: P ← {z ∈ P | x′ does not dominate z} ∪ {x′}
8: r ← 1 and u ← 0
9: else
10: u ← u + 1
11: if u ≥ T̃_r then
12: r ← min{r + 1, ⌊n/2⌋} and u ← 0
13: end if
14: end if
15: end loop

We now analyze the runtime of the SD-GSEMO on the OneJumpZeroJump function class. This will show that its expected runtime is smaller than that of the heavy-tailed GSEMO by at least a factor of Ω(k) (which in turn was smaller than that of the standard GSEMO by a factor of k^{Ω(k)}).
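A self-contained Python sketch of the SD-GSEMO with the global counter (our own illustrative implementation; the threshold follows the definition of T̃_r in (10), and indifferent offspring are rejected as discussed above):

```python
import math
import random

def dominates(u, v):
    """u strictly dominates v: at least as good everywhere, better somewhere."""
    return all(a >= b for a, b in zip(u, v)) and any(a > b for a, b in zip(u, v))

def sd_gsemo(f, n, R=None, budget=60_000):
    """Sketch of the SD-GSEMO: global failure counter u, mutation rate r/n
    raised once u reaches ceil(2 |P| (en/r)^r ln(nR)); equal-fitness
    offspring are rejected (maximization)."""
    R = R or n
    x = tuple(random.randint(0, 1) for _ in range(n))
    pop = {x: f(x)}
    r, u = 1, 0
    for _ in range(budget):
        parent = random.choice(list(pop))
        child = tuple(1 - b if random.random() < r / n else b for b in parent)
        fc = f(child)
        if not any(fv == fc or dominates(fv, fc) for fv in pop.values()):
            pop = {p: fv for p, fv in pop.items() if not dominates(fc, fv)}
            pop[child] = fc
            r, u = 1, 0                    # success: reset rate and counter
        else:
            u += 1
            threshold = math.ceil(2 * len(pop) * (math.e * n / r) ** r
                                  * math.log(n * R))
            if u >= threshold and r < n // 2:
                r, u = r + 1, 0            # stagnation: raise the rate
    return pop
```

Note how any accepted solution resets the rate, which is exactly why accepting indifferent solutions would keep the counter near zero.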
We extract one frequently used calculation into Lemma 23 to make our main proof clearer. The proof of Lemma 23 uses the following lemma.

Proof. We compute the claimed bound, where the first inequality uses Lemma 22.
As for all algorithms discussed in this work, the most costly part of the optimization process is finding the two extremal points of the Pareto front. Borrowing many ideas from the proof of [RW20, Theorem 3.2], we show the following estimate for the time to find these extremal points when the inner part of the Pareto front is already covered.
Lemma 24. Consider the SD-GSEMO optimizing the OneJumpZeroJump_{n,k} function, and consider the first time T_0 at which all (a, 2k + n − a), a ∈ [2k..n], are covered by the current population. Then the two possibly missing elements (a, 2k + n − a), a ∈ {k, n + k}, of the Pareto front will be found in an expected number of additional generations of order at most (n − 2k)(en/k)^k, plus lower-order terms involving ln(nR), plus 2k.

Proof. Let T′ denote the first time (additional number of iterations) at which one of 0^n and 1^n is found. Let phase r be the period in which the SD-GSEMO uses the bit-wise mutation rate r/n, and let A_r denote the event that either 0^n or 1^n is found in phase r. With the pessimistic assumption that neither 0^n nor 1^n is found during phase r′ for all r′ < k, we obtain (11). Conditional on the event A_r, the time T′ to find the first extremal point includes all time spent in phases 1 to r − 1, that is, at most T̃_1 + ... + T̃_{r−1} with T̃_i defined as in (10) for |P| = n − 2k + 3. In phase r, each iteration has a probability of at least p_r to find an extremal point (via choosing one of the boundary points of the inner region as parent and then flipping exactly the right k bits in it). Consequently, the time spent in phase r follows a geometric law with success probability p_r, conditional on being at most T̃_r. Since for a geometrically distributed random variable X and any number T with Pr[X ≤ T] > 0 we have E[X | X ≤ T] ≤ E[X], the expected number of iterations spent in phase r conditional on A_r is at most 1/p_r. From this, and recalling from Corollary 8 that the population size is at most n − 2k + 3, we compute (12), where the second inequality uses Lemma 22. We observe that for r ≥ k, the probability that during phase r neither 0^n nor 1^n is found is at most (nR)^{−2}. Hence, noting that the estimate E[T′ | A_r] ≤ ∑_{i=1}^{r} T̃_i holds trivially for r < n/2, and that it also holds for r = n/2 due to 1/p_r ≤ ⌈2(n − 2k + 3)(2e)^{n/2} ln(nR)⌉ = T̃_{n/2}, we compute (13), where we use Lemma 23 for the second inequality. Noting Pr[A_k] ≤ 1, together with (11) and (13), we obtain the claimed bound for the first extremal point. We omit a precise statement of the very similar calculation for the time to generate the second extremal point; the only differences are minor changes in the corresponding expressions. With this, the lemma is proven.

Now we are ready to show the runtime for the SD-GSEMO on the OneJumpZeroJump_{n,k} function.

Theorem 25. The expected runtime of the SD-GSEMO optimizing the OneJumpZeroJump_{n,k} function is at most (3/2)(n − 2k + 3)(en/k)^k plus terms of lower order (which depend on ln(nR)).

Proof. We first consider the expected time until all (a, 2k + n − a), a ∈ [2k..n], are covered by the current population. Our main argument is the following adaptation of the analysis of this period for the standard GSEMO in the proof of Theorem 12. The main argument there was that the time spent in this period can be estimated by a sum of at most n − 2 waiting times for the generation of a particular Hamming neighbor of a particular member of the population. The same argument, naturally, is valid for the SD-GSEMO, and fortunately, we can also reuse the computation of these waiting times. Consider the time such a subperiod now takes with the SD-GSEMO. If we condition on the subperiod ending before the rate is increased above the initial value of 1/n, then the expected time is at most the expected time of this subperiod for the standard GSEMO (this is the same argument as used in the proof of Lemma 24). Consequently, the total time of this period for the SD-GSEMO is at most the corresponding time for the standard GSEMO (en(n − 2k + 3)(ln n + 1) + 2en(n − 2k + 3)(ln(n − k) + 1), as determined in the proof of Theorem 12) plus the additional time caused by subperiods using rates r/n with r ≥ 2.
We prove an upper bound for this time that is valid uniformly for all subperiods. Let phase r be the time interval of such a subperiod in which the SD-GSEMO uses the bit-wise mutation rate r/n. We recall that at all times the population contains an individual x with the following property: there is a search point y that would dominate at least one individual in the current population, and this y is a Hamming neighbor of x, that is, it can be generated from x by flipping a single bit. Hence, for all r ≥ 1, the probability that this y is not found in phase r is at most (nR)^{−2}. Thus, with probability at least 1 − 1/(nR)^2, the desired search point y will be found in phase r. Hence, if y is not found in phase r = 1, then, analogous to (13) in the proof of Lemma 24 (also noting that the estimate E[T′ | A_r] ≤ ∑_{i=1}^{r} T̃_i holds trivially for r < n/2, and that it also holds for r = n/2 since 1/p_r ≤ |P|2^n ≤ ⌈2|P|(2e)^{n/2} ln(nR)⌉ = T̃_{n/2}), the expected time spent in phases 2 to n/2 is a term of lower order. Together with the time to find the remaining two elements of the Pareto front from Lemma 24, this theorem is proven.
Assume that, as suggested in [RW20], the control parameter R is set to n. Then the dominating element of the upper bound in Theorem 25 becomes (3/2)(n − 2k + 3)(en/k)^k.

Experiments
To understand the performance of the algorithms discussed in this work for concrete problem sizes (for which an asymptotic mathematical analysis cannot give definite answers), we now conduct a simple experimental analysis. Since the SEMO cannot find the Pareto front, we did not include it in this investigation. We did include the variant of the SD-GSEMO, denoted by SD-GSEMO-Ind, in which each individual has its own failure counter (see the discussion in Section 8). Our experimental settings are the same for all algorithms.
• Maximal number of iterations before termination: n^{k+3}. This number of iterations was reached in none of the experiments, i.e., the results reported are the true runtimes until the full Pareto front was found.
• 20 independent runs for each setting.
Figure 4 shows the median number of function evaluations of the GSEMO, GSEMO-HTM, SD-GSEMO, and SD-GSEMO-Ind on the OneJumpZeroJump_{n,k} function. To see how the experimental results compare with our bounds, we also plot (i) the curve 1.5e(n − 2k)n^k corresponding to the bounds for the GSEMO in Theorems 12 and 13, (ii) the curve (n − 2k)(en)^k/k^{k−1} for the GSEMO-HTM with β = 1.5, where, since the leading constant in Theorem 20 is implicit, we chose a constant such that the curve matches the experimental data (we note that for n ≥ 18, Theorem 21 only gives an upper bound of more than 6 × 10^{11} for the leading constant, which is far larger than 1 and indicates that further improvements of the leading constant estimate are possible), and (iii) the curve 1.5(n − 2k)(en)^k/k^k corresponding to the upper bound for the SD-GSEMO with R = n in Theorem 25.
We clearly see that these curves, in terms of shape and, where known, in terms of leading constants, match well the estimates of our theoretical runtime results. We also see, as predicted by informal considerations, the similarity of the performance of the SD-GSEMO and the SD-GSEMO-Ind. Finally, our experiments show that the different runtime behaviors are already visible for moderate (and thus realistic) problem sizes, and not only in the asymptotic sense in which they were proven. In particular, we observe a performance improvement by a factor of roughly 5 through the use of heavy-tailed mutation, and by a factor of roughly 10 with the stagnation-detection strategy.

Conclusion and Outlook
Noting that, different from single-objective evolutionary computation, no broadly accepted multimodal benchmarks exist in the theory of MOEAs, we proposed the OneJumpZeroJump benchmark, a multiobjective analogue of the single-objective Jump problem. We used this benchmark to show that several insights from single-objective EA theory extend to the multiobjective world. Naturally, algorithms using 1-bit mutation such as the SEMO cannot optimize our benchmark, that is, cannot compute a population that covers the full Pareto front. In contrast, for all problem sizes n and jump sizes k ∈ [4..n/2 − 1], the GSEMO covers the Pareto front in Θ((n − 2k)n^k) iterations in expectation. Heavy-tailed mutation and the stagnation-detection mechanism of [RW22] give a runtime improvement over the standard GSEMO by a factor of k^{Ω(k)}, with the stagnation-detection GSEMO slightly ahead by a small polynomial factor in k. These are the same gains as observed for the single-objective Jump benchmark with gap size k. Our experiments confirm this performance ranking already for moderate problem sizes, with the stagnation-detection GSEMO slightly further ahead than its asymptotically small advantage suggests. On the downside, adapting the stagnation-detection mechanism to MOEAs requires several design choices, among which the question of how to treat indifferent solutions could be difficult for problems having larger plateaus of constant fitness.
Overall, this work suggests that the recently developed ideas to cope with multimodality in single-objective evolutionary optimization can be effective in multiobjective optimization as well. In this first work in this direction, we only concentrated on mutation-based algorithms. The theory of evolutionary computation has also observed that crossover and estimation-of-distribution algorithms can be helpful in multimodal optimization. Investigating to what degree these results extend to multiobjective optimization is clearly an interesting direction for future research.
Also, we only covered very simple MOEAs in this work. Analyzing more complex MOEAs amenable to mathematical runtime analyses, such as the decomposition-based MOEA/D [ZL07, LZZZ16] or the NSGA-II [DPAM02, ZLD22], would be highly interesting. For the MOEA/D, this would most likely require an adaptation of our benchmark problem: since the difficult-to-find extremal points of the front are just the solutions of the two single-objective sub-problems, and thus of the two problems that naturally are part of the set of subproblems regarded by the MOEA/D, this algorithm might have an unfair advantage on the OneJumpZeroJump problem.
Work conducted after this research: After the publication of [DZ21], the following works have used the OneJumpZeroJump benchmark. In [DQ22], the performance of the NSGA-II, both with standard bit-wise mutation and with heavy-tailed mutation, on the OneJumpZeroJump benchmark was analyzed. When the population size N is at least four times the size n − 2k + 3 of the Pareto front and 2 < k ≤ n/4, the standard NSGA-II finds the Pareto front in time O(Nn^k). Hence the NSGA-II and the GSEMO satisfy the same asymptotic runtime guarantees when N is chosen asymptotically optimal (that is, N = Θ(n − 2k)). With heavy-tailed mutation, again a k^{Ω(k)} improvement is obtained. In this work, an NSGA-II was regarded that does not use crossover, but it is clear that the same upper bounds could have been shown when crossover with rate less than one was used. Doerr and Qu [DQ23b] show that the upper bound of O(Nn^k) for the NSGA-II with standard bit-wise mutation is asymptotically tight in many cases: when N = o(n^2/k^2), the runtime is Ω(Nn^k). Consequently, in this regime the NSGA-II does not profit from larger population sizes (not even when regarding the parallel runtime, that is, the number of iterations). Doerr and Qu [DQ23a] prove that uniform crossover can give super-constant speed-ups for the NSGA-II optimizing OneJumpZeroJump functions. This result is not totally surprising given that Dang et al. [DFK+18] have shown that crossover speeds up the (µ + 1) EA on single-objective jump functions, but the reason for the speed-ups shown in [DQ23a] is different (and, in fact, can be translated to different, and sometimes stronger, speed-ups for the (µ + 1) EA on jump functions). These recent results show that a jump-like benchmark, as proposed in this work, is useful in multiobjective evolutionary computation.
The set of function values of the Pareto set is called the Pareto front.

Figure 1: The values of f_1 and f_2 with respect to |x|_1, the number of ones in the search point x.

Figure 2: The values of the two objectives of OneJumpZeroJump_{n,k} with respect to |x|_1, the number of ones in the search point x.

In the following theorem we determine the Pareto set and front of the OneJumpZeroJump_{n,k} function. As Figure 2 suggests, for k ≥ 2 the Pareto set consists of an inner region of all search points x with |x|_1 ∈ [k..n − k] and the two extremal points 1^n and 0^n, as visualized in Figure 3. We shall call the Pareto front values {(a, 2k + n − a) | a ∈ [2k..n]} the inner part (or inner region) of the Pareto front in this paper.

Theorem 7. The Pareto set of the OneJumpZeroJump_{n,k} function is S* = {x | |x|_1 ∈ [k..n − k] ∪ {0, n}}, and the Pareto front is F* = {(a, 2k + n − a) | a ∈ [2k..n] ∪ {k, n + k}}.
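Theorem 7 can be checked by brute force on small instances. The following sketch (a minimal illustration; the helper names are ours, and the Jump-based definition of the two objectives is the standard OneJumpZeroJump formulation) computes all objective values and filters out the dominated ones:

```python
def jump(m: int, n: int, k: int) -> int:
    """Jump value of a point with m ones: k + m on [0..n-k] and at m = n,
    and n - m in the gap region (n-k..n)."""
    return k + m if (m <= n - k or m == n) else n - m

def one_jump_zero_jump(ones: int, n: int, k: int) -> tuple:
    """Both objectives depend only on |x|_1: the first jumps in the
    number of ones, the second in the number of zeros."""
    return (jump(ones, n, k), jump(n - ones, n, k))

def pareto_front(n: int, k: int) -> set:
    """Brute-force the Pareto front (maximization in both objectives)."""
    values = {one_jump_zero_jump(m, n, k) for m in range(n + 1)}
    return {v for v in values
            if not any(w != v and w[0] >= v[0] and w[1] >= v[1]
                       for w in values)}
```

For n = 8, k = 2 this reproduces exactly the front {(a, 12 − a) | a ∈ [4..8]} ∪ {(2, 10), (10, 2)} stated by Theorem 7.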

Corollary 8. Let k ≤ n/2. Consider any set P of solutions such that x and y are incomparable w.r.t. OneJumpZeroJump_{n,k} for all x, y ∈ P with x ≠ y. Then |P| ≤ n − 2k + 3.

Proof. We first note that any solution on the Pareto front dominates any solution not on the Pareto front. Hence either P is a subset of the Pareto front or it contains no point of the Pareto front. In the first case, Theorem 7 immediately shows our claim. Hence assume that P contains no point of the Pareto front. Let x, y ∈ P with |x|_1, |y|_1 ∈ [1..k − 1]. Assume without loss of generality that |x|_1 ≤ |y|_1. Then y weakly dominates x, so x = y. Hence P contains at most one solution x with |x|_1 ∈ [1..k − 1]. A symmetric argument shows that P contains at most one solution x with |x|_1 ∈ [n − k + 1..n − 1]. Hence |P| ≤ 2 ≤ n − 2k + 3.
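Corollary 8 can likewise be sanity-checked exhaustively on small instances. Since both objectives depend only on |x|_1, it suffices to search for a largest set of pairwise incomparable objective values; the following is a hypothetical helper sketch (all names are ours):

```python
from itertools import combinations

def ojzj(ones: int, n: int, k: int) -> tuple:
    # OneJumpZeroJump value of a point with the given number of ones
    f1 = k + ones if (ones <= n - k or ones == n) else n - ones
    zeros = n - ones
    f2 = k + zeros if (zeros <= n - k or zeros == n) else n - zeros
    return (f1, f2)

def incomparable(u: tuple, v: tuple) -> bool:
    # neither value weakly dominates the other
    return not ((u[0] >= v[0] and u[1] >= v[1]) or
                (v[0] >= u[0] and v[1] >= u[1]))

def max_incomparable_set_size(n: int, k: int) -> int:
    """Size of a largest antichain of objective values, by brute force."""
    values = list({ojzj(m, n, k) for m in range(n + 1)})
    for r in range(len(values), 0, -1):
        for sub in combinations(values, r):
            if all(incomparable(u, v) for u, v in combinations(sub, 2)):
                return r
    return 0
```

On the instances tried, the maximum is attained by the Pareto front itself and equals the bound n − 2k + 3 from Corollary 8.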

1: Generate x ∈ {0, 1}^n uniformly at random and P ← {x}
2: loop
3: Uniformly at random select one individual x from P
4: Generate x′ via independently flipping each bit value of x with probability 1/n
5: If x′ is not weakly dominated by any y ∈ P, then P ← {y ∈ P | x′ does not weakly dominate y} ∪ {x′}

Lemma 10 shows that once a solution in the inner part of the Pareto front is part of the population, it takes O(n^2 log n) iterations until the whole inner part of the Pareto front is covered.

Lemma 10. Consider optimizing the function f := OneJumpZeroJump_{n,k} via the GSEMO. Let a_0 ∈ [2k..n]. Assume that at some time the population contains an individual x with f(x) = (a_0, 2k + n − a_0). Then the expected number of iterations until all (a, 2k + n − a), a ∈ [2k..n], are covered by the population is at most 2en(n − 2k + 3)(ln⌈n/2⌉ + 1).

Proof. Note that any solution corresponding to (a, 2k + n − a) on the Pareto front contains exactly a − k ones. If the population contains such an individual, then the probability to find (a + 1, 2k + n − a − 1) on the Pareto front is at least
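The GSEMO loop above can be sketched as follows. This is a minimal illustrative implementation, assuming the standard survival selection (keep the offspring iff no current individual weakly dominates it, then discard everything the offspring weakly dominates); all names are ours, and OneMinMax serves only as a demo objective:

```python
import random

def weakly_dominates(u: tuple, v: tuple) -> bool:
    return u[0] >= v[0] and u[1] >= v[1]

def gsemo(f, n: int, iterations: int, seed: int = 0) -> dict:
    rng = random.Random(seed)
    x = tuple(rng.randint(0, 1) for _ in range(n))
    pop = {x: f(x)}  # genotype -> objective value
    for _ in range(iterations):
        parent = rng.choice(list(pop))  # step 3: uniform parent selection
        # step 4: standard bit mutation with rate 1/n
        child = tuple(b ^ (rng.random() < 1 / n) for b in parent)
        fc = f(child)
        # step 5: keep the child iff no current point weakly dominates it
        if not any(weakly_dominates(fv, fc) for fv in pop.values()):
            pop = {y: fy for y, fy in pop.items()
                   if not weakly_dominates(fc, fy)}
            pop[child] = fc
    return pop

# demo objective: OneMinMax (maximize the ones and the zeros simultaneously)
def one_min_max(x):
    return (sum(x), len(x) - sum(x))
```

By construction, the population always consists of solutions with pairwise incomparable objective values, which is what Corollary 8 exploits to bound the population size.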

Lemma 11. Consider the GSEMO optimizing the OneJumpZeroJump_{n,k} function. Assume that at some time, all (a, 2k + n − a), a ∈ [2k..n], are covered by the current population. Then the two missing points (a, 2k + n − a), a ∈ {k, n + k}, of the Pareto front will be found in an expected number of at most (3/2)e n^k (n − 2k + 3) additional generations.
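A plausible reconstruction of the waiting-time argument behind the bound in Lemma 11 (a sketch under the standard assumptions that the parent is chosen uniformly from a population of size at most n − 2k + 3, by Corollary 8, and that each missing extremal point is created from a suitable parent by flipping exactly k specific bits):

```latex
% Probability that one iteration creates a fixed missing extremal point:
% choose the right parent, then flip exactly the right k bits.
\[
  p \;\ge\; \frac{1}{|P|}\, n^{-k}\Bigl(1-\frac{1}{n}\Bigr)^{n-k}
    \;\ge\; \frac{1}{e\,n^{k}\,(n-2k+3)}.
\]
% While both extremal points are missing, either of the two can be created,
% doubling the success probability; afterwards only one target remains.
\[
  \mathbb{E}[T] \;\le\; \frac{1}{2p} + \frac{1}{p}
    \;\le\; \tfrac{1}{2}\,e\,n^{k}(n-2k+3) + e\,n^{k}(n-2k+3)
    \;=\; \tfrac{3}{2}\,e\,n^{k}(n-2k+3).
\]
```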
for some a ∈ [2k..n]. If the initial search point has a number of ones in [k..n − k], then such an x is already found. Otherwise, the initial search point has at most k − 1 ones or at most k − 1 zeros. W.l.o.g., we consider an initial point with m ≤ k − 1 zeros. If m = 0, that is, the initial point is 1^n, then its function value is (n + k, k). Since any point x with m ∈ [1..k − 1] zeros has function value (m, k + m), such a point is not dominated by 1^n, and thus will be kept in the population. After that, since (m, k + m) increases with respect to m, an individual with larger m ∈ [1..k − 1] will replace one with smaller m ∈ [1..k − 1]. Finally, any individual with m …

3: Uniformly at random select one individual x from P
4: Sample α from D^β_{n/2} and generate x′ via independently flipping each bit value of x with probability α/n
5: If x′ is not weakly dominated by any y ∈ P, then P ← {y ∈ P | x′ does not weakly dominate y} ∪ {x′}
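The heavy-tailed mutation of the GSEMO-HTM (step 4 above) can be sketched as follows, assuming D^β_{n/2} denotes the power-law distribution on [1..n/2] with Pr[α = i] proportional to i^(−β); all function names here are ours:

```python
import random

def sample_power_law(upper: int, beta: float, rng) -> int:
    """Draw alpha from the power-law distribution on [1..upper]
    with Pr[alpha = i] proportional to i^(-beta)."""
    weights = [i ** (-beta) for i in range(1, upper + 1)]
    r = rng.random() * sum(weights)
    for i, w in enumerate(weights, start=1):
        r -= w
        if r <= 0:
            return i
    return upper  # numerical safety net

def heavy_tailed_mutate(x: list, beta: float, rng) -> list:
    """Flip each bit independently with probability alpha/n,
    where alpha ~ D^beta_{n/2}."""
    n = len(x)
    alpha = sample_power_law(n // 2, beta, rng)
    return [b ^ (rng.random() < alpha / n) for b in x]
```

The point of the heavy tail is that mutation strengths around α = k still receive probability mass of order k^(−β), which is what yields the k^Ω(k) speed-up over standard bit mutation for the k-bit jumps.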

Lemma 19. Consider the GSEMO-HTM optimizing the OneJumpZeroJump_{n,k} function. Assume that at some time, all (a, 2k + n − a), a ∈ [2k..n], are covered by the current population. Then the two missing points (a, 2k + n − a), a ∈ {k, n + k}, of the Pareto front will be found in an expected number of at most 3n^k(n − 2k + 3)/(2P_k^β) additional generations.

The 2/(n − 2k + 3) in the definition of p has to be replaced by 1/(n − 2k + 3), and the expression 2/|P| in (12) has to be replaced by 1/|P|, both due to the fact that now only one parent individual can generate the desired solution via a k-bit flip. With this, we obtain that the expected additional number of iterations to generate the second extremal point is at most … ≤ 4en|P| ln(nR) · 3/(2n) = 6e|P| ln(nR), where we use Lemma 23 for the first inequality. With this, the total additional runtime caused by the stagnation-detection strategy, compared with the GSEMO, to find all (a, 2k + n − a), a ∈ [2k..n], on the Pareto front is at most (n − 2) · 6e|P| ln(nR). Therefore, together with the runtime for the GSEMO to find all (a, 2k + n − a), a ∈ [2k..n], on the Pareto front shown in the proof of Theorem 12, the expected time until all (a, 2k + n − a), a ∈ [2k..n], are covered by the current population is at most

en(n − 2k + 3)(ln n + 1) + 2en(n − 2k + 3)(ln(n − k) + 1) + (n − 2) · 6e(n − 2k + 3) ln(nR)
≤ 3e(n − 2k + 3)(n(ln n + 1) + 2(n − 2) ln(nR)).
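The stagnation-detection strategy referred to above follows the Rajabi–Witt scheme: after sufficiently many unsuccessful mutations at flip strength r, the strength is increased. The following is a minimal sketch; the class interface and the exact threshold 2(en/r)^r ln(nR) are our assumptions, not the paper's precise formulation:

```python
import math

class StagnationDetection:
    """Rate schedule sketch: escalate the mutation strength after
    2*(e*n/r)^r * ln(n*R) unsuccessful steps at strength r (assumed
    threshold), and reset to strength 1 on any success."""

    def __init__(self, n: int, R: int):
        self.n, self.R = n, R
        self.rate = 1      # current number of bits to flip
        self.counter = 0   # unsuccessful steps at the current rate

    def threshold(self) -> float:
        r = self.rate
        return 2 * (math.e * self.n / r) ** r * math.log(self.n * self.R)

    def step(self, success: bool) -> int:
        if success:
            self.rate, self.counter = 1, 0
        else:
            self.counter += 1
            if self.counter > self.threshold():
                self.rate = min(self.rate + 1, self.n // 2)
                self.counter = 0
        return self.rate
```

With n = 20 and R = 20, roughly 2·e·20·ln(400) ≈ 651 failures at strength 1 trigger the step to strength 2, matching the ln(nR)-type terms appearing in the bound above.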