1 Introduction

Estimation of Distribution Algorithms (EDAs) are a class of randomised search heuristics with many practical applications [15, 20, 24, 47, 48]. Unlike traditional Evolutionary Algorithms (EAs) which search for optimal solutions using genetic operators such as mutation or crossover, EDAs build and maintain a probability distribution of the current population over the search space, from which the next generation of individuals is sampled. Several EDAs have been developed over the last decades. The algorithms differ in how they capture interactions among decision variables, as well as in how they build and update their probabilistic models. EDAs are often classified as either univariate or multivariate; the former treats each variable independently, while the latter also considers variable dependencies [40]. Well-known univariate EDAs include the compact Genetic Algorithm (cGA [21]), the Population-Based Incremental Learning Algorithm (PBIL [4]), and the Univariate Marginal Distribution Algorithm (UMDA [37]). Given a problem instance of size n, univariate EDAs represent probabilistic models as an n-vector, where each vector component is called a marginal. Some Ant Colony Optimisation (ACO) algorithms and even certain single-individual EAs can be cast in the same framework as univariate EDAs (or n-\({{\mathrm{Bernoulli}}}\)-\(\lambda \)-EDA, see, e.g., [18, 22, 25, 42]). Multivariate EDAs, such as the Bayesian Optimisation Algorithm, which builds a Bayesian network with nodes and edges representing variables and conditional dependencies respectively, attempt to learn relationships between decision variables [22]. The surveys [1, 22, 39] describe further variants and applications of EDAs.

Recently, EDAs have drawn growing attention from the theory community of evolutionary computation [10, 13, 18, 26,27,28, 32, 44,45,46]. The aim of the theoretical analyses of EDAs in general is to gain insights into the behaviour of the algorithms when optimising an objective function, especially in terms of the optimisation time, that is, the number of function evaluations required by the algorithm until an optimal solution is found for the first time. Droste [14] provided the first rigorous runtime analysis of an EDA, specifically the cGA. Introduced in [21], the cGA samples two individuals in each generation and updates the probabilistic model according to the fittest of these individuals. A quantity of \(\pm 1/K\) is added to the marginals for each bit position where the two individuals differ. The reciprocal K of this quantity is often referred to as the abstract population size of a genetic algorithm that the cGA is supposed to model. Droste showed a lower bound \(\varOmega (K\sqrt{n})\) on the expected optimisation time of the cGA for any pseudo-Boolean function [14]. He also proved the upper bound \(\mathcal {O}(nK)\) for any linear function, where \(K=n^{1/2+\varepsilon }\) for any small constant \(\varepsilon >0\). Note that each marginal of the cGA considered in [14] is allowed to reach the extreme values zero and one. Such an algorithm is referred to as an EDA without margins, since in contrast it is possible to enforce margins (also called borders) on the range of values for each marginal to keep it away from the extreme probabilities, often within the interval \([1/n,1-1/n]\). An EDA without margins can prematurely converge to suboptimal solutions; thus, the runtime bounds of [14] were in fact conditioned on the event that early convergence never happens. Very recently, Witt [45] studied an effect called domino convergence on EDAs, where bits with heavy weights tend to be optimised before bits with light weights. By deriving a lower bound of \(\varOmega (n^2)\) on the expected optimisation time of the cGA on BinVal for any value of \(K>0\), Witt confirmed the claim made earlier by Droste [14] that BinVal is a harder problem for the cGA than OneMax is. Moreover, Lengler et al. [32] considered \(K=\mathcal {O}\left( \sqrt{n}/\log ^2 n\right) \), which was not covered by Droste in [14], and obtained a lower bound \(\varOmega (K^{1/3}n+n\log n)\) on the expected optimisation time of the cGA on OneMax. Note that if \(K=\varTheta (\sqrt{n}/\log ^2 n)\), the above lower bound becomes \(\varOmega (n^{7/6}/\log ^2 n)\), which further tightens the bounds on the expected optimisation time of the cGA.
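For concreteness, the following minimal Python sketch shows one cGA generation as described above; the function name, its interface and the optional capping of the marginals to borders are illustrative assumptions rather than part of any cited pseudocode.

```python
import random

def cga_generation(p, K, f, borders=None):
    """One cGA generation: sample two offspring from the marginals p, then move each
    marginal by 1/K towards the bit value of the fitter offspring where the two differ."""
    n = len(p)
    x = [1 if random.random() < p[i] else 0 for i in range(n)]
    y = [1 if random.random() < p[i] else 0 for i in range(n)]
    winner, loser = (x, y) if f(x) >= f(y) else (y, x)
    for i in range(n):
        if winner[i] != loser[i]:
            p[i] += 1.0 / K if winner[i] == 1 else -1.0 / K
        if borders is not None:      # cGA with margins; omit to allow the extreme values 0 and 1
            lo, hi = borders
            p[i] = min(max(p[i], lo), hi)
    return p
```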

An algorithm closely related to the cGA with margins is the 2-Max Min Ant System with iteration best (\(2\)-MMAS\(_{\text {ib}}\)). The two algorithms differ only in the update procedure of the model, and \(2\)-MMAS\(_{\text {ib}}\) is parameterised by an evaporation factor \(\rho \in (0,1)\). Sudholt and Witt [42] proved the lower bounds \(\varOmega (K\sqrt{n}+n\log {n})\) and \(\varOmega (\sqrt{n}/\rho +n\log {n})\) for the two algorithms on OneMax under any parameter setting, and the upper bounds \(\mathcal {O}(K\sqrt{n})\) and \(\mathcal {O}(\sqrt{n}/\rho )\) when K and \(1/\rho \) are in \(\varOmega (\sqrt{n}\log {n})\). Thus, the optimal expected optimisation time \(\varTheta (n\log {n})\) of the cGA and the \(2\)-MMAS\(_{\text {ib}}\) on OneMax is achieved by setting K and \(1/\rho \) to \(\varTheta (\sqrt{n}\log {n})\). The analyses revealed that choosing lower parameter values results in strong fluctuations that may cause many marginals (or pheromones in the context of ACO) to fix early at the lower margin, which then need to be repaired later. On the other hand, choosing higher parameter values resolves this issue but may slow down the learning process.

Friedrich et al. [18] pointed out two behavioural properties of univariate EDAs at each bit position: a balanced EDA would be sensitive to signals in the fitness, while a stable one would remain uncommitted under a biasless fitness function. During the optimisation of LeadingOnes, when some bit positions are temporarily neutral while others are not, both properties appear useful to avoid committing to wrong decisions. Unfortunately, many univariate EDAs without margins, including the cGA, the UMDA, the PBIL and some related algorithms, are balanced but not stable [18]. A more stable version of the cGA, the so-called stable cGA (scGA), was then introduced in [18]. Under appropriate settings, it optimises LeadingOnes within \(\mathcal {O}(n\log n)\) fitness evaluations with high probability. Furthermore, a recent study by Friedrich et al. [17] showed that the cGA can cope with higher levels of noise more efficiently than mutation-only heuristics do.

Introduced by Baluja [4], the PBIL is another univariate EDA. Unlike the cGA, which samples two solutions in each generation, the PBIL samples a population of \(\lambda \) individuals, from which the \(\mu \) fittest individuals are selected to update the probabilistic model: each marginal of the new model is a convex combination, with smoothing parameter \(\rho \in (0,1]\), of the current marginal and the frequency of ones among the selected individuals at that bit position. The PBIL can be seen as a special case of the cross-entropy method [38] on the binary hypercube \(\{0,1\}^n\). Wu et al. [46] analysed the runtime of the PBIL on OneMax and LeadingOnes. The authors argued that due to the use of a sufficiently large population size, it is possible to prevent the marginals from reaching the lower border early even when a large smoothing parameter \(\rho \) is used. Runtime results were proved for the PBIL without margins on OneMax and the PBIL with margins on LeadingOnes, and were then compared to the runtime of some Ant System approaches. However, the required population size is large, i.e. \(\lambda =\omega (n)\). Very recently, Lehre and Nguyen [28] obtained an upper bound \(\mathcal {O}(n\lambda \log \lambda +n^2)\) on the expected optimisation time for the PBIL with margins on BinVal and LeadingOnes, which improves the previously known upper bound \(\mathcal {O}(n^{2+\varepsilon })\) in [46] by a factor of \(n^{\varepsilon }\), where \(\varepsilon \) is some positive constant, for smaller population sizes \(\lambda =\varOmega (\log n)\).
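The PBIL update described above can be sketched in Python as follows; the function name and interface are illustrative, and the sorting by fitness stands in for the selection of the \(\mu \) fittest individuals.

```python
def pbil_update(p, population, fitness, mu, rho):
    """PBIL model update: each new marginal is a convex combination, with smoothing
    parameter rho, of the current marginal and the frequency of ones among the mu
    fittest individuals at that bit position."""
    n = len(p)
    selected = sorted(population, key=fitness, reverse=True)[:mu]
    freq = [sum(x[i] for x in selected) / mu for i in range(n)]
    return [(1.0 - rho) * p[i] + rho * freq[i] for i in range(n)]
```

Setting \(\rho =1\) recovers the UMDA update discussed below.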

The UMDA is a special case of the PBIL with the largest smoothing parameter \(\rho = 1\), that is, the probabilistic model for the next generation depends solely on the selected individuals in the current population. The algorithm has a wide range of applications, not only in computer science, but also in other areas like population genetics and bioinformatics [20, 48]. Moreover, the UMDA relates to the notion of linkage equilibrium [36, 41], which is a popular model assumption in population genetics. Thus, studies of the UMDA can contribute to the understanding of population dynamics in population genetics.

Despite an increasing momentum in the runtime analysis of EDAs over the last few years, our understanding of the UMDA in terms of runtime is still limited. The algorithm was first analysed in a series of papers [5,6,7,8], where time-complexities of the UMDA on simple uni-modal functions were derived. These results showed that the UMDA with margins often outperforms the UMDA without margins, especially on functions like BVLeadingOnes, which is a uni-modal problem. A possible reason for the failure of the UMDA without margins is fixation, which prevents any further progress on the affected decision variables. The UMDA with margins is able to avoid this by ensuring that each search point always has a positive chance to be sampled. Shapiro investigated the UMDA with a selection mechanism different from truncation selection [40]. In particular, this variant of the UMDA selects individuals whose fitnesses are no less than the mean fitness of all individuals in the current population when updating the probabilistic model. By representing the UMDA as a Markov chain, the paper showed that the population size has to be at least \(\sqrt{n}\) for the UMDA to prevent the probabilistic model from quickly converging to the corners of the hypercube. This phenomenon is well-known as genetic drift [2]. A decade later, the first upper bound on the expected optimisation time of the UMDA on OneMax was revealed [10]. Working on the standard UMDA using truncation selection, Dang and Lehre [10] proved an upper bound \(\mathcal {O}(n\lambda \log \lambda )\) on the expected optimisation time of the UMDA on OneMax, assuming a population size \(\lambda =\varOmega (\log n)\). If \(\lambda = \varTheta (\log n)\), then the upper bound is \(\mathcal {O}(n\log n\log \log n)\). Inspired by the previous work of [42] on the cGA/\(2\)-MMAS\(_{\text {ib}}\), Krejca and Witt [26] obtained a lower bound \(\varOmega (\mu \sqrt{n}+n\log n)\) for the UMDA on OneMax via drift analysis, where \(\lambda = (1+\varTheta (1))\mu \). Compared to [42], the analysis is much more involved since, unlike in the cGA/\(2\)-MMAS\(_{\text {ib}}\), where each change of the marginals between consecutive generations is small and bounded by the smoothing parameter, large changes are always possible in the UMDA. From these results, we observe that the latest upper and lower bounds for the UMDA on OneMax still differ by a factor of \(\varTheta (\log \log n)\). This raises the question of whether this gap could be closed.

This paper derives upper bounds on the expected optimisation time of the UMDA on the following problems: OneMax, BinVal, and LeadingOnes. The preliminary versions of this work appeared in [10] and [27]. Here we use the improved version of the level-based analysis technique [9]. The analyses for LeadingOnes and BinVal are straightforward and similar to each other, both yielding the runtime bound \(\mathcal {O}(n\lambda \log \lambda +n^2)\); hence, they serve to introduce the technique in the context of EDAs. In particular, we only require a population size \(\lambda = \varOmega (\log {n})\) for LeadingOnes, which is much smaller than previously thought [6,7,8]. For OneMax, we give a more detailed analysis so that an expected optimisation time \(\mathcal {O}(n\log n)\) is derived if the population size is chosen appropriately. This significantly improves the results in [9, 10] and matches the recent lower bound in [26]. More specifically, we assume \(\lambda \ge b \mu \) for a sufficiently large constant \(b>0\), and separate two regimes of small and large selected populations: the upper bound \(\mathcal {O}(\lambda n)\) is derived for \(\mu = \varOmega (\log n) \cap \mathcal {O}(\sqrt{n})\), and the upper bound \(\mathcal {O}(\lambda \sqrt{n})\) is shown for \(\mu = \varOmega (\sqrt{n}\log n)\). These results demonstrate the applicability of the level-based technique in the runtime analysis of (univariate) EDAs. Table 1 summarises the latest results on the runtime analysis of univariate EDAs on simple benchmark problems; see [25] for a recent survey on the theory of EDAs.

Related independent work Witt [44] independently obtained the upper bounds \(\mathcal {O}(\lambda n)\) and \(\mathcal {O}(\lambda \sqrt{n})\) on the expected optimisation time of the UMDA on OneMax for \(\mu = \varOmega (\log n)\cap o(n)\) and \(\mu = \varOmega (\sqrt{n}\log n)\), respectively, and \(\lambda =\varTheta (\mu )\) using an involved drift analysis. While our results do not hold for \(\mu =\varOmega (\sqrt{n})\cap \mathcal {O}\left( \sqrt{n}\log n\right) \), our methods yield significantly easier proofs. Furthermore, our analysis also holds when the parent population size \(\mu \) is not proportional to the offspring population size \(\lambda \), which is not covered in [44].

This paper is structured as follows. Section 2 introduces the notation used throughout the paper and the UMDA with margins. We also introduce the techniques used, including the level-based theorem, which is central to the paper, and a sharp anti-concentration bound for sums of Bernoulli random variables. Given all necessary tools, Sect. 3 presents upper bounds on the expected optimisation time of the UMDA on both LeadingOnes and BinVal, and Sect. 4 derives upper bounds on the expected optimisation time of the UMDA on OneMax; the latter consists of two subsections according to two different ranges of the parent population size. Section 5 presents a brief empirical analysis of the UMDA on LeadingOnes, BinVal and OneMax to support the theoretical findings in Sects. 3 and 4. Finally, our concluding remarks are given in Sect. 6.

Table 1 Expected optimisation time (number of fitness evaluations) of univariate EDAs on the three problems OneMax, LeadingOnes and BinVal

2 Preliminaries

This section describes the three standard benchmark problems, the algorithm under investigation and the level-based theorem, which is a general method to derive upper bounds on the expected optimisation time of non-elitist population-based algorithms. Furthermore, Feige's inequality and a sharp anti-concentration bound for sums of independent Bernoulli trials, which is essential in the runtime analysis of the UMDA on OneMax for small population sizes, are also presented.

We use the following notation throughout the paper. The natural logarithm is denoted as \(\ln (\cdot )\), and \(\log (\cdot )\) denotes the logarithm with base 2. Let [n] be the set \(\{1,2,\ldots ,n\}\). The floor and ceiling functions are \(\lfloor x\rfloor \) and \(\lceil x\rceil \), respectively, for \(x \in \mathbb {R}\). For two random variables X and Y, we use \(X \preceq Y\) to indicate that Y stochastically dominates X, that is, \(\hbox {Pr}\left( X \ge k\right) \le \hbox {Pr}\left( Y \ge k\right) \) for all \(k \in \mathbb {R}\).

We consider a partition of the finite search space \(\mathcal {X}=\{0,1\}^n\) into m ordered subsets \(A_1,\ldots ,A_m\) called levels, i.e. \(A_i \cap A_j = \emptyset \) for any \(i \ne j\) and \(\cup _{i=1}^{m}A_i = \mathcal {X}\). The union of all levels from j upwards is denoted \(A_{\ge j}:=\cup _{i=j}^m A_i\). An optimisation problem on \(\mathcal {X}\) is assumed, without loss of generality, to be the maximisation of some function \(f:\mathcal {X} \rightarrow \mathbb {R}\). A partition is called fitness-based (or f-based) if for any \(j \in [m-1]\) and all \(x \in A_{j}\), \(y \in A_{j+1}\), it holds that \(f(y)>f(x)\). An f-based partition is called canonical when \(x,y \in A_j\) if and only if \(f(x) =f(y)\).

Given the search space \(\mathcal {X}\), each \(x\in \mathcal {X}\) is called a search point (or individual), and a population is a vector of search points, i.e. \(P \in \mathcal {X}^{\lambda }\). For a finite population \(P= \left( x^{(1)},\ldots ,x^{(\lambda )}\right) \), we define \(|P \cap A_j| := |\{i \in [\lambda ] \mid x^{(i)} \in A_j\}|\), i.e. the number of individuals in population P which are in level \(A_j\). Truncation selection, denoted as \((\mu ,\lambda )\)-selection for some \(\mu <\lambda \), applied to population P transforms it into a vector \(P'\) (called selected population) with \(|P'|=\mu \) by discarding the \(\lambda - \mu \) worst search points of P with respect to some fitness function f, where ties are broken uniformly at random.
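A minimal Python sketch of \((\mu ,\lambda )\)-truncation selection with uniform tie-breaking; the function name and the use of a random secondary sort key are illustrative choices.

```python
import random

def truncation_selection(population, mu, f):
    """Keep the mu fittest of the lambda individuals in the population;
    ties in fitness are broken uniformly at random via a random secondary key."""
    ranked = sorted(population, key=lambda x: (f(x), random.random()), reverse=True)
    return ranked[:mu]
```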

2.1 Three Problems

We consider the three pseudo-Boolean functions: OneMax, LeadingOnes and BinVal, which are defined over the finite binary search space \(\mathcal {X}=\{0,1\}^n\) and widely used as theoretical benchmark problems in runtime analyses of EDAs [10, 14, 26, 28, 44, 46]. Note in particular that these problems serve only to describe and compare the behaviour of EDAs on problems with well-understood structure. The first problem, as its name suggests, simply counts the number of ones in the bitstring and is widely used to test the performance of EDAs as hill climbers [25]. While the bits in OneMax contribute equally to the overall fitness, BinVal, which aims at maximising the binary value of the bitstring, has exponentially scaled weights relative to bit positions. In contrast, LeadingOnes counts the number of leading ones in the bitstring. Since bits in this particular problem are highly correlated, it is often used to study the ability of EDAs to cope with dependencies among decision variables [25].

The global optimum for all three functions is the all-ones bitstring, i.e. \(1^n\). For any bitstring \(x=(x_1,\ldots ,x_n) \in \mathcal {X}\), the functions are defined as follows:

Definition 1

\(\textsc {OneMax} (x) := \sum \limits _{i=1}^{n}x_i\).

Definition 2

\(\textsc {LeadingOnes} (x) := \sum \limits _{i=1}^{n}\prod \limits _{j=1}^{i}x_j\).

Definition 3

\(\textsc {BinVal} (x) := \sum \limits _{i=1}^{n}2^{n-i}x_i\).
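These definitions translate directly into code; the following minimal Python sketch represents bitstrings as lists of 0/1 values (the function names are illustrative).

```python
def onemax(x):
    """Number of one-bits in x."""
    return sum(x)

def leadingones(x):
    """Number of consecutive one-bits counted from the left."""
    count = 0
    for bit in x:
        if bit != 1:
            break
        count += 1
    return count

def binval(x):
    """Binary value of x, with x[0] as the most significant bit (weight 2^(n-1))."""
    n = len(x)
    return sum(x[i] * 2 ** (n - 1 - i) for i in range(n))
```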

2.2 Univariate Marginal Distribution Algorithm

Introduced by Mühlenbein and Paaß [37], the Univariate Marginal Distribution Algorithm (UMDA; see Algorithm 1) is one of the simplest EDAs; it assumes independence between decision variables. To optimise a pseudo-Boolean function \(f:\{0,1\}^n\rightarrow \mathbb {R}\), the algorithm follows an iterative process: sample a population of \(\lambda \) offspring independently and identically from the current probabilistic model and update the model using the \(\mu \) fittest individuals. Each sample-and-update cycle is called a generation (or iteration). The probabilistic model in generation \(t\in \mathbb {N}\) is represented as a vector \(p_t=\left( p_t(1),\ldots ,p_t(n)\right) \in [0,1]^n\), where each component (or marginal) \(p_t(i)\in [0,1]\) for \(i\in [n]\) and \(t\in \mathbb {N}\) is the probability of sampling a one at the i-th bit position of an offspring in generation t. Each individual \(x=(x_1,\ldots ,x_n)\in \{0,1\}^n\) is therefore sampled from the joint probability distribution

$$\begin{aligned} \hbox {Pr}\left( x \mid p_t\right) =\prod _{i=1}^{n}p_t(i)^{x_i}\left( 1-p_t(i)\right) ^{(1-x_i)}. \end{aligned}$$
(1)

Note that the probabilistic model is initialised as \(p_0(i):=1/2\) for each \(i\in [n]\). Let \(x_t^{(1)},\ldots ,x_t^{(\lambda )}\) be the \(\lambda \) individuals sampled from the joint probability distribution (1), from which the \(\mu \) fittest are selected to obtain the next model \(p_{t+1}\). Let \(x_{t,i}^{(k)}\) denote the value of the i-th bit position of the k-th individual in the current sorted population \(P_t\). For each \(i \in [n]\), the corresponding marginal of the next model is

$$\begin{aligned} p_{t+1}(i):=\frac{1}{\mu }\sum _{k=1}^{\mu }x_{t,i}^{(k)}, \end{aligned}$$

which can be interpreted as the frequency of ones among the \(\mu \) fittest individuals at bit-position i.

The extreme probabilities zero and one must be avoided for each marginal \(p_t(i)\); otherwise, the bit in position i would remain fixed forever at either zero or one, making some regions of the search space unreachable. To avoid this, all marginals \(p_{t+1}(i)\) are usually restricted to the closed interval \([1/n,1-1/n]\); the values \(1/n\) and \(1-1/n\) are called the lower and upper borders, respectively. The algorithm in this case is known as the UMDA with margins.

[Algorithm 1: The Univariate Marginal Distribution Algorithm (UMDA) with margins]
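A minimal Python sketch of the UMDA with margins as described in this subsection; the function interface, the generation budget and the stopping criterion (a known optimal fitness value) are illustrative assumptions and not part of Algorithm 1.

```python
import random

def umda_with_margins(f, n, lam, mu, f_opt, max_gens=100_000):
    """Sample lam offspring from the product distribution given by p, select the mu
    fittest, and set each marginal to the frequency of ones among them, restricted
    to the closed interval [1/n, 1 - 1/n]."""
    p = [0.5] * n                                   # p_0(i) = 1/2 for all i
    for _ in range(max_gens):
        pop = [[1 if random.random() < p[i] else 0 for i in range(n)]
               for _ in range(lam)]
        pop.sort(key=f, reverse=True)               # ties broken arbitrarily in this sketch
        if f(pop[0]) >= f_opt:                      # stop once an optimum has been sampled
            return pop[0]
        selected = pop[:mu]                         # (mu, lambda)-truncation selection
        for i in range(n):
            freq = sum(x[i] for x in selected) / mu
            p[i] = min(max(freq, 1.0 / n), 1.0 - 1.0 / n)
    return None
```

Using the earlier OneMax sketch, a call such as umda_with_margins(onemax, 100, lam=200, mu=20, f_opt=100) runs this sketch on OneMax with \(n=100\).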

2.3 Level-Based Theorem

We are interested in the optimisation time of the UMDA, which is a non-elitist algorithm; thus, tools for analysing the runtime of this class of algorithms are of importance. In the literature, drift theorems have often been used to derive upper and lower bounds on the expected optimisation time of the UMDA (see, e.g., [26, 44]) because they allow us to examine the dynamics of each marginal in the vector-based probabilistic model. In this paper, we take a different perspective and consider the population of individuals. To do this, we make use of the so-called level-based theorem.

Introduced by Corus et al. [9], the level-based theorem is a general tool that provides upper bounds on the expected optimisation time of many non-elitist population-based algorithms on a wide range of optimisation problems [9]. It has been applied to analyse the expected optimisation time of Genetic Algorithms with or without crossover on various pseudo-Boolean functions and combinatorial optimisation problems [9], self-adaptive EAs [11], the UMDA with margins on OneMax and LeadingOnes [10], and very recently the PBIL with margins on LeadingOnes and BinVal [28].

The theorem assumes that the algorithm to be analysed can be described in the form of Algorithm 2. The population \(P_t\) of \(\lambda \) individuals in generation \(t\in \mathbb {N}\) is represented as a vector \((P_t(1),\ldots ,P_t(\lambda ))\in \mathcal {X}^\lambda \). The theorem is general because it does not assume specific fitness functions, selection mechanisms, or genetic operators like mutation and crossover. Rather, the theorem assumes that there exists, possibly implicitly, a mapping \(\mathcal {D}\) from the set of populations \(\mathcal {X}^\lambda \) to the space of probability distributions over the search space \(\mathcal {X}\). The distribution \(\mathcal {D}(P_t)\) depends on the current population \(P_t\), and all individuals in population \(P_{t+1}\) are sampled identically and independently from this distribution [9]. The assumption of independent sampling of the individuals holds for the UMDA and many other algorithms.

[Algorithm 2: Non-elitist population-based algorithm sampling \(P_{t+1}\) independently from \(\mathcal {D}(P_t)\)]

Furthermore, the theorem assumes a partition \(A_1,\ldots ,A_m\) of the finite search space \(\mathcal {X}\) into m subsets, which we call levels. We assume that the last level \(A_m\) consists of all optimal solutions. Given a partition of the search space \(\mathcal {X}\), we can state the level-based theorem as follows:

Theorem 4

[9] Given a partition \((A_1,\ldots ,A_m)\) of \(\mathcal {X}\), define \( T:=\min \{t\lambda \mid |P_t\cap A_m|>0\},\) where for all \(t\in \mathbb {N}\), \(P_t\in \mathcal {X}^\lambda \) is the population of Algorithm 2 in generation t. If there exist \(z_1,\ldots ,z_{m-1}, \delta \in (0,1]\), and \(\gamma _0\in (0,1)\) such that for any population \(P_t \in \mathcal {X}^\lambda \),

  • (G1) for each level \(j\in [m-1]\), if \(|P_t\cap A_{\ge j}|\ge \gamma _0\lambda \) then

    $$\begin{aligned} {\hbox {Pr}}_{y \sim \mathcal {D}(P_t)}\left( y \in A_{\ge j+1}\right) \ge z_j. \end{aligned}$$
  • (G2) for each level \(j\in [m-2]\) and all \(\gamma \in (0,\gamma _0]\), if \(|P_t\cap A_{\ge j}|\ge \gamma _0\lambda \) and \(|P_t\cap A_{\ge j+1}|\ge \gamma \lambda \) then

    $$\begin{aligned} {\hbox {Pr}}_{y \sim \mathcal {D}(P_t)}\left( y \in A_{\ge j+1}\right) \ge \left( 1+\delta \right) \gamma . \end{aligned}$$
  • (G3) and the population size \(\lambda \in \mathbb {N}\) satisfies

    $$\begin{aligned} \lambda \ge \left( \frac{4}{\gamma _0\delta ^2}\right) \ln \left( \frac{128m}{z_*\delta ^2}\right) , \end{aligned}$$

    where \(z_* :=\min _{j\in [m-1]}\{z_j\}\), then

    $$\begin{aligned} \mathbb {E}\left[ T\right] \le \left( \frac{8}{\delta ^2}\right) \sum _{j=1}^{m- 1}\left[ \lambda \ln \left( \frac{6\delta \lambda }{4+z_j\delta \lambda }\right) +\frac{1}{z_j}\right] . \end{aligned}$$

Informally, the first condition (G1) requires that the probability of sampling an individual in levels \(A_{\ge j+1}\) is at least \(z_j\) given that at least \(\gamma _0\lambda \) individuals in the current population are in levels \(A_{\ge j}\). Condition (G2) further requires that if, additionally, at least \(\gamma \lambda \) of them are in levels \(A_{\ge j+1}\), then the probability of sampling an offspring in levels \(A_{\ge j+1}\) is at least \((1+\delta )\gamma \). The last condition (G3) sets a lower limit on the population size \(\lambda \). As long as the three conditions are satisfied, an upper bound on the expected time to reach the last level \(A_m\) of a population-based algorithm is guaranteed.
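To make the interplay of these quantities concrete, the following illustrative Python sketch checks the population-size requirement (G3) and evaluates the expected-time bound of Theorem 4 for given values of \(z_1,\ldots ,z_{m-1}\), \(\delta \), \(\gamma _0\) and \(\lambda \); it is a numerical aid only, not part of the theorem.

```python
import math

def level_based_bound(z, delta, gamma0, lam):
    """Return whether lam satisfies condition (G3) of Theorem 4 and the value of the
    upper bound on E[T], given the per-level probabilities z = [z_1, ..., z_{m-1}]."""
    m = len(z) + 1
    z_star = min(z)
    lam_required = (4.0 / (gamma0 * delta ** 2)) * math.log(128.0 * m / (z_star * delta ** 2))
    expected_time = (8.0 / delta ** 2) * sum(
        lam * math.log(6.0 * delta * lam / (4.0 + zj * delta * lam)) + 1.0 / zj
        for zj in z
    )
    return lam >= lam_required, expected_time
```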

To apply the level-based theorem, it is recommended to follow the five-step procedure in [9]: (1) identify a partition of the search space, (2) find appropriate parameter settings such that condition (G2) is met, (3) estimate a lower bound \(z_j\) to satisfy condition (G1), (4) ensure that the population size is large enough, and (5) derive the upper bound on the expected time to reach level \(A_m\).

Note in particular that Algorithm 2 assumes a mapping \(\mathcal {D}\) from the space of populations \(\mathcal {X}^{\lambda }\) to the space of probability distributions over the search space. The mapping \(\mathcal {D}\) is often said to depend on the current population only [9]; however, this is not strictly necessary. Very recently, Lehre and Nguyen [28] applied Theorem 4 to analyse the expected optimisation time of the PBIL with a sufficiently large offspring population size \(\lambda =\varOmega (\log n)\) on LeadingOnes and BinVal, when the population for the next generation is sampled using a mapping that depends on the previous probabilistic model \(p_t\) in addition to the current population \(P_t\). The rationale behind this is that, in each generation, the PBIL draws \(\lambda \) samples from the probability distribution (1), that correspond to \(\lambda \) individuals in the current population. If the number of samples \(\lambda \) is sufficiently large, it is highly likely that the empirical distributions for all positions among the entire population cannot deviate too far from the true distributions, i.e. marginals \(p_t(i)\) [28], due to the Dvoretzky–Kiefer–Wolfowitz inequality [34].

2.4 Feige’s Inequality

In order to verify conditions (G1) and (G2) of Theorem 4 for the UMDA on OneMax using a canonical f-based partition \(A_1,\ldots ,A_m\), we later need a lower bound on the probability of sampling an offspring in given levels, that is \(\hbox {Pr}_{y\sim p_t}(y \in A_{\ge j})\), where y is the offspring sampled from the joint probability distribution (1). Let Y denote the number of ones in the offspring y. It is well-known that the random variable Y follows a Poisson–Binomial distribution with expectation \(\mathbb {E}\left[ Y\right] =\sum _{i=1}^{n}p_t(i)\) and variance \(\sigma _n^2=\sum _{i=1}^{n}p_t(i)\left( 1-p_t(i)\right) \). A general result due to Feige [16] provides such a lower bound when \(Y<\mathbb {E}\left[ Y\right] \); however, for our purposes, it will be more convenient to use the following variant [10].

Theorem 5

(Corollary 3 in [10]) Let \(Y_1,\ldots ,Y_n\) be n independent random variables with support in [0, 1], define \(Y = \sum _{i=1}^{n} Y_i\) and \(\mu = \mathbb {E}\left[ Y\right] \). It holds for every \(\varDelta >0\) that

$$\begin{aligned} \hbox {Pr}\left( Y > \mu - \varDelta \right) \ge \min \left\{ \frac{1}{13},\frac{\varDelta }{1+\varDelta }\right\} . \end{aligned}$$
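As an illustration only (not part of the analysis), the bound of Theorem 5 can be checked empirically for Bernoulli summands by Monte Carlo simulation; the function name and its parameters are assumptions of this sketch.

```python
import random

def check_feige(probs, Delta, trials=100_000):
    """Estimate Pr(Y > E[Y] - Delta) for Y a sum of independent Bernoulli(p_i) variables
    and return it together with the guaranteed lower bound min(1/13, Delta/(1+Delta))."""
    mean = sum(probs)
    hits = sum(
        1 for _ in range(trials)
        if sum(1 for pi in probs if random.random() < pi) > mean - Delta
    )
    return hits / trials, min(1.0 / 13.0, Delta / (1.0 + Delta))
```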

2.5 Anti-concentration Bound

In addition to Feige’s inequality, it is also necessary to compute an upper bound on the probability of sampling an offspring in a given level, that is \(\hbox {Pr}_{y\sim p_t}\left( y \in A_j \right) \) for any \(j \in [m]\), where \(y\sim \hbox {Pr}(\cdot \mid p_t)\) as defined in (1). Let Y be the random variable that follows a Poisson–Binomial distribution as introduced in the previous subsection. Baillon et al. [3] derived the following sharp upper bound on the probability \(\hbox {Pr}_{y\sim p_t}\left( y \in A_j \right) \).

Theorem 6

(Adapted from Theorem 2.1 in [3]) Let Y be an integer-valued random variable that follows a Poisson–Binomial distribution with parameters n and \(p_t\), and let \(\sigma _n^2=\sum _{i=1}^{n}p_t(i)(1-p_t(i))\) be the variance of Y. For all n, y and \(p_t\), it then holds that

$$\begin{aligned} \sigma _n\cdot \hbox {Pr}\left( Y=y\right) \le \eta , \end{aligned}$$

where \(\eta \) is an absolute constant given by

$$\begin{aligned} \eta =\max _{x\ge 0}\sqrt{2x} e^{-2x}\sum _{k=0}^{\infty }\left( \frac{x^k}{k!}\right) ^2 \approx 0.4688. \end{aligned}$$
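The constant \(\eta \) and the bound of Theorem 6 can be illustrated numerically: the sketch below computes the exact Poisson–Binomial probability mass function by convolution and returns \(\sigma _n\cdot \max _y\hbox {Pr}\left( Y=y\right) \) together with \(\eta \). This is an illustrative check under assumed inputs, not part of the proofs.

```python
import math

def poisson_binomial_pmf(probs):
    """Exact pmf of Y = sum of independent Bernoulli(p_i) variables, by convolution."""
    pmf = [1.0]
    for pi in probs:
        nxt = [0.0] * (len(pmf) + 1)
        for k, q in enumerate(pmf):
            nxt[k] += q * (1.0 - pi)
            nxt[k + 1] += q * pi
        pmf = nxt
    return pmf

def check_anti_concentration(probs, eta=0.4688):
    """Return (sigma_n * max_y Pr(Y = y), eta); Theorem 6 states the first value is at most eta."""
    sigma = math.sqrt(sum(pi * (1.0 - pi) for pi in probs))
    return sigma * max(poisson_binomial_pmf(probs)), eta
```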

3 Runtime of the UMDA on LeadingOnes and BinVal

As a warm-up example, and to illustrate the method of level-based analysis, we consider the two functions—LeadingOnes and BinVal—as defined in Definitions 2 and 3. It is well-known that the expected optimisation time of the (1+1) EA on LeadingOnes is \(\varTheta (n^2)\), and that this is optimal for the class of unary unbiased black-box algorithms [29]. Early analysis of the UMDA on LeadingOnes [8] required an excessively large population, i.e. \(\lambda = \omega (n^2\log n)\). Our analysis below shows that a population size \(\lambda = \varOmega (\log n)\) suffices to achieve the expected optimisation time \(\mathcal {O}(n^2)\).

Theorem 7

The UMDA (with margins) with parent population size \(\mu \ge c \log {n}\) for a sufficiently large constant \(c>0\), and offspring population size \(\lambda \ge (1+\delta )e\mu \) for any constant \(\delta >0\), has expected optimisation time \(\mathcal {O}(n\lambda \log {\lambda }+n^2)\) on LeadingOnes and BinVal.

Proof

We apply Theorem 4 by following the guidelines from [9].

Step 1 For both functions, we define the levels

$$\begin{aligned} A_j := \{ x\in \{0,1\}^n \mid \textsc {LeadingOnes} (x)=j-1\}. \end{aligned}$$

Thus, there are \(m = n+1\) levels ranging from \(A_1\) to \(A_{n+1}\). Note that the parameter \(\gamma _0\) appearing later in this proof is set to \(\gamma _0 := \mu /\lambda \), which coincides with the selective pressure of the UMDA.

For LeadingOnes, the partition is clearly f-based since it is the canonical partition for this function. For BinVal, however, note that since all the \(j-1\) leading bits of any \(x \in A_j\) are ones, the contribution of these bits to \(\textsc {BinVal} (x)\) is \(\sum _{i=1}^{j-1}2^{n-i}\). On the other hand, the contribution of bit position j is 0, and that of the last \(n-j\) bits is between 0 and \(\sum _{i=j+1}^n 2^{n-i} = \sum _{i=0}^{n-j-1} 2^i = 2^{n-j}-1\), so overall

$$\begin{aligned} \sum _{i=1}^{j}2^{n-i} - 1 \ge \textsc {BinVal} (x) \ge \sum _{i=1}^{j-1}2^{n-i}. \end{aligned}$$

Therefore, for any \(j \in [n]\) and all \(x \in A_j\), and all \(y \in A_{j+1}\) we have that

$$\begin{aligned} \textsc {BinVal} (y) \ge \sum _{i=1}^{j}2^{n-i} > \sum _{i=1}^{j}2^{n-i}-1 \ge \textsc {BinVal} (x); \end{aligned}$$

thus, the partition is also f-based for BinVal. This observation allows us to carry over the proof arguments of LeadingOnes to BinVal.

Step 2 In (G2), for any level \(j \in [n-1]\) satisfying \(|P_{t} \cap A_{\ge j}| \ge \gamma _0\lambda = \mu \) and \(|P_{t} \cap A_{\ge j+1}| \ge \lambda \gamma \) for some \(\gamma \in (0,\gamma _0]\), we seek a lower bound \((1+\delta )\gamma \) for \(\hbox {Pr}\left( y \in A_{\ge j+1}\right) \) where \(y \sim \mathcal {D}(P_t)\). The given conditions on j imply that the \(\mu \) fittest individuals of \(P_t\) have at least \(j-1\) leading 1-bits and among them at least \(\lceil \gamma \lambda \rceil \) have at least j leading 1-bits. Hence, \(p_{t+1}(i)=1-1/n\) for \(i \in [j-1]\) and \(p_{t+1}(j) \ge \max (\min (1-1/n, \gamma \lambda /\mu ),1/n)\ge \min (1-1/n,\gamma /\gamma _0)\), so

$$\begin{aligned} \hbox {Pr}\left( y \in A_{\ge j+1}\right)&\ge \prod _{i=1}^{j} p_{t+1}(i) \ge \min \left\{ \left( 1 - \frac{1}{n}\right) ^{j}, \left( 1 - \frac{1}{n}\right) ^{j-1} \cdot \frac{\gamma \lambda }{\mu }\right\} \\&\ge \min \bigg \{\frac{1}{e}, \frac{\gamma }{e\gamma _0}\bigg \} = \frac{\gamma }{e\gamma _0} = \frac{\lambda \gamma }{e\mu } \ge (1 +\delta )\gamma , \end{aligned}$$

due to \(\gamma \le \gamma _0\) and \(\lambda \ge (1+\delta )e\mu \) for any constant \(\delta >0\). Therefore, condition (G2) is now satisfied.

Step 3 In (G1), for any level \(j \in [n]\) satisfying \(|P_{t} \cap A_{\ge j} | \ge \gamma _0\lambda = \mu \) we need a lower bound \(\hbox {Pr}\left( y \in A_{\ge j+1}\right) \ge z_j\). Again, the condition on level j gives that the \(\mu \) fittest individuals of \(P_t\) have at least \(j-1\) leading 1-bits, that is, \(p_{t+1}(i)=1-\frac{1}{n}\) for \(i \in [j-1]\). Due to the imposed lower margin, we can pessimistically assume that \(p_{t+1}(j) = \frac{1}{n}\). Hence,

$$\begin{aligned} \hbox {Pr}\left( y \in A_{\ge j+1}\right)&\ge \prod _{i=1}^{j} p_{t+1}(i) \ge \left( 1 - \frac{1}{n}\right) ^{j-1} \cdot \frac{1}{n} \ge \frac{1}{en} =: z_j. \end{aligned}$$

So, (G1) is satisfied for \(z_j:=\frac{1}{en}\).

Step 4 Considering (G3), because \(\delta \) is a constant, and both \(1/z_*\) and m are \(\mathcal {O}(n)\), there must exist a constant \(c>0\) such that \(\mu \ge c \log n \ge (4/\delta ^2)\ln (128 m / (z_* \delta ^2))\). Note that \(\lambda = \mu /\gamma _0\), so (G3) is satisfied.

Step 5 All conditions of Theorem 4 are satisfied, so the expected optimisation time of the UMDA on LeadingOnes is

$$\begin{aligned} \mathbb {E}\left[ T\right]&= \mathcal {O}\left( \sum _{j=1}^{n} \left( \lambda \ln \left( \frac{\lambda }{1 + \lambda /n}\right) + n \right) \right) = \mathcal {O}\left( n \lambda \log \lambda + n^2\right) . \end{aligned}$$

We now consider BinVal. In both problems, all that matters for determining the level of a bitstring is the position of the leftmost zero-bit. For two bitstrings in the same level of BinVal, their ranking after the population is sorted may additionally depend on less significant bits; however, the proof above never takes these bits into account. Hence, the upper bound on the expected optimisation time of the UMDA with margins (using truncation selection) on LeadingOnes carries over to BinVal.

\(\square \)

4 Runtime of the UMDA on OneMax

We consider the problem in Definition 1, i.e., maximisation of the number of ones in a bitstring. It is well-known that OneMax can be optimised in expected time \(\varTheta (n\log n)\) using the simple \((1+1)\) EA. The level-based theorem yielded the first upper bound \(\mathcal {O}(n\lambda \log \lambda )\) on the expected optimisation time of the UMDA on OneMax, assuming that \(\lambda =\varOmega (\log n)\) [10]. This leaves open whether an improved bound \(\mathcal {O}(n\lambda )\) can be obtained for the UMDA (with margins) on problem OneMax.

We now introduce additional notation used throughout the section. The following random variables related to the sampling of a Poisson Binomial distribution with the parameter vector \(p_t = \left( p_t(1),\ldots ,p_t(n)\right) \) are often used in the proofs.

  • Let \(Y:=(Y_1,Y_2,\ldots ,Y_n)\) denote an offspring sampled from the probability distribution (1) in generation t, where \(\hbox {Pr}(Y_i=1)=p_t(i)\) for each \(i\in [n]\).

  • Let \(Y_{i,j}:=\sum _{k=i}^j Y_k\) denote the number of ones sampled from the sub-vector \(\left( p_t(i),p_t(i+1),\ldots ,p_t(j)\right) \) of the model \(p_t\) where \(1\le i\le j\le n\).

4.1 Small Parent Population Size

Our approach refines the analysis in [10] by considering anti-concentration properties of the random variables involved. As already discussed in Sect. 2.3, we need to verify the three conditions (G1), (G2) and (G3) of Theorem 4 to derive an upper bound on the expected optimisation time. The range of possible values for the marginals is (assuming that \(\mu < n\))

$$\begin{aligned} p_t(i) \in \left\{ \frac{k}{\mu }\mid k\in [\mu -1]\right\} \cup \left\{ 1-\frac{1}{n}, \frac{1}{n} \right\} . \end{aligned}$$

When \(p_t(i)=1-1/n\) or \(1/n\), we say that the marginal is at the upper or lower border (or margin), respectively. Therefore, we can categorise the values of \(p_t(i)\) into three groups: those at the upper margin \(1-1/n\), those at the lower margin \(1/n\), and those within the closed interval \([1/\mu , 1-1/\mu ]\). For OneMax, all bits have the same weight and the fitness is just the sum of the bit values, so re-arranging the bit positions has no impact on the distribution of the fitness of a sampled offspring. Given the current sorted population, recall that \(X_i:=\sum _{k=1}^{\mu } x_{t,i}^{(k)}\), and without loss of generality, we can re-arrange the bit positions so that for two integers \(k,\ell \ge 0\), it holds

  • for all \(i\in [1,k]\), \(1\le X_i \le \mu -1\) and \(p_t(i)=X_i/\mu \),

  • for all \(i\in (k,k+\ell ]\), \(X_i = \mu \) and \(p_t(i)=1-1/n\), and

  • for all \(i\in (k+\ell ,n]\), \(X_i =0\) and \(p_t(i)=1/n\).

We define the levels using the canonical f-based partition

$$\begin{aligned} A_j&:=\left\{ x\in \{0,1\}^n \mid \textsc {OneMax} (x)=j-1\right\} . \end{aligned}$$
(2)

Note that the probability appearing in conditions (G1) and (G2) of Theorem 4 is the probability of sampling an offspring in levels \(A_{\ge j+1}\), that is \(\hbox {Pr}\left( Y_{1,n}\ge j\right) \).

We aim at obtaining an upper bound \(\mathcal {O}(n\lambda )\) on the expected optimisation time of the UMDA on OneMax using the level-based theorem. The logarithmic factor \(\mathcal {O}(\log \lambda )\) in the previous upper bound \(\mathcal {O}(n\lambda \log \lambda )\) in [10] stems from the lower bound \(\varOmega (1/\mu )\) on the parameter \(z_j\) in the condition (G1) of Theorem 4. We aim for the stronger bound \(z_j=\varOmega (\frac{n-j+1}{n})\). Note that in the following proofs, we choose the parameter \(\gamma _0:=\mu /\lambda \).

Assume that the current level is \(A_j\), that is, \(|P_t\cap A_{\ge j}|\ge \gamma _0\lambda =\mu \), which, together with the two variables k and \(\ell \), implies that each of the \(\mu \) fittest individuals has at least \(j-\ell -1\) ones among the first k bit positions. To verify conditions (G1) and (G2) of Theorem 4, we need to calculate the probability of sampling an offspring in levels \(A_{\ge j+1}\). To this end, it suffices for the algorithm to maintain the \(\ell \) ones at all bit positions \(i\in (k,k+\ell ]\) (this happens with probability at least \(1/e\)) and to sample at least \(j-\ell \) ones from the remaining \(n-\ell \) bit positions. This leads us to consider three distinct cases according to different configurations of the current population with respect to the two parameters k and j in Step 3 of Theorem 8 below.

  1.

    \(k\ge \mu \). In this situation, the variance of \(Y_{1,k}\) is not too small. By the result of Theorem 6, the distribution of \(Y_{1,k}\) cannot be too concentrated on its mean \(\mathbb {E}\left[ Y_{1,k}\right] =j-\ell -1\), and with probability at least \(\varOmega (1)\), the algorithm can sample at least \(j-\ell \) ones from the first k bit positions to obtain an offspring with at least \((j-\ell )+\ell =j\) ones. Thus, the probability of sampling at least j ones is bounded from below by

    $$\begin{aligned} \hbox {Pr}(Y_{1,n}\ge j) \ge \hbox {Pr}(Y_{1,k}\ge j-\ell ) \hbox {Pr}(Y_{k+1, k+\ell }=\ell ) =\varOmega (1). \end{aligned}$$
  2.

    \(k<\mu \) and \(j\ge n+1-\frac{n}{\mu }\). In this case, the current level is very close to the last level \(A_{n+1}\), and the bitstring has few zeros. As already shown in [10], the probability of sampling an offspring in \(A_{\ge j+1}\) in this case is \(\varOmega (\frac{1}{\mu })\). Since the condition can be rewritten as \(\frac{1}{\mu }\ge \frac{n-j+1}{n}\), it ensures that \(z_j=\varOmega (\frac{1}{\mu })=\varOmega (\frac{n-j+1}{n})\).

  3.

    The remaining cases. We will later prove that if \(\mu \le \sqrt{n(1-c)}\) for some constant \(c\in (0,1)\), then excluding the two cases above implies \(0\le k<(1-c)(n-j+1)\). In this case, k is relatively small, and \(\ell \) is not too large since the current level is not very close to the last level \(A_{n+1}\). This implies that most zeros must be located among bit positions \(i\in (k+\ell ,n]\), and it suffices to sample an extra one from this region to get at least \((j-\ell -1)+\ell +1=j\) ones. The probability of sampling an offspring in levels \(A_{\ge j+1}\) is then \(z_j=\varOmega (\frac{n-j+1}{n})\).

We now present our detailed runtime analysis for the UMDA on OneMax, when the population size is small, that is, \(\mu = \varOmega (\log n) \cap \mathcal {O}(\sqrt{n})\).

Theorem 8

For some constant \(a>0\) and any constant \(c\in (0,1)\), the UMDA (with margins) with parent population size \(a\ln (n)\le \mu \le \sqrt{n(1-c)}\), and offspring population size \(\lambda \ge (13e/(1-c))\mu \), has expected optimisation time \(\mathcal {O}\left( n\lambda \right) \) on OneMax.

Proof

We re-arrange the bit positions as explained above and follow the recommended 5-step procedure for applying Theorem 4 [9].

Step 1 The levels are defined as in Eq. (2). There are exactly \(m=n+1\) levels from \(A_1\) to \(A_{n+1}\), where level \(A_{n+1}\) consists of the optimal solution.

Step 2 We verify condition (G2) of Theorem 4. In particular, for some \(\delta \in (0,1)\), for any level \(j\in [m-2]\) and any \(\gamma \in (0,\gamma _0]\), assuming that the population is configured such that \(|P_t\cap A_{\ge j}|\ge \gamma _0\lambda =\mu \) and \(|P_t\cap A_{\ge j+1}|\ge \gamma \lambda >0\), we must show that the probability of sampling an offspring in levels \(A_{\ge j+1}\) is at least \((1+\delta )\gamma \). By the re-arrangement of the bit positions mentioned earlier, it holds that

$$\begin{aligned} \sum _{i=k+1}^{k+\ell }X_i=\mu \ell \quad \text{ and } \quad \sum _{i=k+\ell +1}^{n}X_i=0, \end{aligned}$$
(3)

where \(X_i\) for all \(i\in [n]\) are given in Algorithm 1. By assumption, the \(\mu \) fittest individuals of the current population \(P_t\) include at least \(\gamma \lambda \) individuals with at least j ones, while the remaining ones have at least \(j-1\) ones. Therefore,

$$\begin{aligned} \sum _{i=1}^{n}X_i\ge \gamma \lambda j + \left( \mu -\gamma \lambda \right) (j-1) = \gamma \lambda +\mu \left( j-1\right) . \end{aligned}$$
(4)

Combining (3), (4) and noting that \(\lambda =\mu /\gamma _0\) yield

$$\begin{aligned} \sum _{i=1}^{k}X_i&=\sum _{i=1}^{n}X_i-\sum _{i=k+1}^{k+\ell }X_i-\sum _{i=k+\ell +1}^{n}X_i \\&\ge \gamma \lambda +\mu \left( j-1\right) -\mu \ell =\mu \left( \frac{\gamma }{\gamma _0}+j-1-\ell \right) . \end{aligned}$$

Let \(Z=Y_{1,k}+Y_{k+\ell +1,n}\) be the integer-valued random variable, which describes the number of ones sampled in the first k and the last \(n-k-\ell \) bit positions. Since \(k+\ell \le n\), the expected value of Z is

$$\begin{aligned} \mathbb {E}\left[ Z\right] = \sum _{i=1}^{k}p_t(i) +\sum _{i=k+\ell +1}^{n}p_t(i) = \frac{1}{\mu }\sum _{i=1}^{k}X_i +\frac{n-k-\ell }{n} \ge j-\ell -1+\frac{\gamma }{\gamma _0} . \end{aligned}$$
(5)

In order to obtain an offspring with at least j ones, it is sufficient to sample \(\ell \) ones in positions \(k+1\) to \(k+\ell \) and at least \(j-\ell \) ones from the other positions. The probability of this event is bounded from below by

$$\begin{aligned} \hbox {Pr}\left( Y_{1,n} \ge j\right) \ge \hbox {Pr}\left( Z\ge j-\ell \right) \cdot \hbox {Pr}\left( Y_{k+1,k+\ell }=\ell \right) . \end{aligned}$$
(6)

The probability of obtaining \(\ell \) ones in the middle interval from position \(k+1\) to \(k+\ell \), where \(\ell \le n-1\) since otherwise the optimum would already have been sampled, is

$$\begin{aligned} \hbox {Pr}\left( Y_{k+1,k+\ell }=\ell \right) =\left( 1-\frac{1}{n}\right) ^\ell \ge \left( 1-\frac{1}{n}\right) ^{n-1}\ge \frac{1}{e} \end{aligned}$$
(7)

by the result of Lemma 10 for \(t=-1\). We now estimate the probability \(\hbox {Pr}\left( Z\ge j-\ell \right) \) using Feige’s inequality. Since Z takes integer values only, it follows by (5) that

$$\begin{aligned} \hbox {Pr}\left( Z\ge j-\ell \right) =\hbox {Pr}\left( Z> j-\ell -1\right) \ge \hbox {Pr}\left( Z>\mathbb {E}\left[ Z\right] -\frac{\gamma }{\gamma _0}\right) . \end{aligned}$$

Applying Theorem 5 for \(\varDelta =\gamma /\gamma _0\le 1\) and noting that we chose \(\mu \) and \(\lambda \) such that \(1/\gamma _0=\lambda /\mu \ge 13e/(1-c)= 13e(1+\delta )\) yield

$$\begin{aligned} \hbox {Pr}\left( Z\ge j-\ell \right) \ge \min \bigg \{\frac{1}{13},\frac{\varDelta }{\varDelta +1}\bigg \} \ge \frac{\varDelta }{13} = \frac{\gamma }{13\gamma _0} \ge e\left( 1+\delta \right) \gamma . \end{aligned}$$
(8)

Combining (6), (7), and (8) yields \( \hbox {Pr}\left( Y_{1,n}\ge j\right) \ge \left( 1+\delta \right) \gamma \), and, thus, condition (G2) of Theorem 4 holds.

Step 3 We now consider condition (G1) for any level j. Let \(P_t\) be any population where \(|P_t\cap A_{\ge j}|\ge \gamma _0\lambda =\mu \). For a lower bound on \(\hbox {Pr}\left( Y_{1,n}\ge j\right) \), we modify the population such that any individual in levels \(A_{\ge j+1}\) is moved to level \(A_j\). Thus, the \(\mu \) fittest individuals belong to level \(A_j\). By the definition of the UMDA, this will only reduce the probabilities \(p_{t+1}(i)\) on the OneMax problem. Hence, by Lemma 13, the distribution of \(Y_{1,n}\) for the modified population is stochastically dominated by \(Y_{1,n}\) for the original population. A lower bound \(z_j\) that holds for the modified population therefore also holds for the original population. All the \(\mu \) fittest individuals in the current sorted population \(P_t\) have exactly \(j-1\) ones, and, therefore,\( \sum _{i=1}^{n}X_i= \mu \left( j-1\right) \) and \( \sum _{i=1}^{k}X_i= \mu \left( j-\ell -1\right) \). There are four distinct cases that cover all situations according to different values of variables k and j. We aim to show that in all four cases, we can use the parameter \(z_j=\varOmega (\frac{n-j+1}{n})\).

Case 0: \(k=0\). In this case, \(p_t(i)=1-1/n\) for \(1\le i\le j-1\), and \(p_t(i)=1/n\) for \(j\le i\le n\). To obtain at least j ones, it suffices to sample only ones in the first \(j-1\) positions and a single one among the remaining positions, i.e.,

$$\begin{aligned} \hbox {Pr}\left( Y_{1,n}\ge j\right) \ge \frac{n-j+1}{n}\left( 1-\frac{1}{n}\right) ^{n-1} =\varOmega \left( \frac{n-j+1}{n}\right) . \end{aligned}$$

Case 1: \(k\ge \mu \). We will apply the anti-concentration inequality in Theorem 6. To lower bound the variance of the number of ones sampled in the first k positions, we use the bounds \(1/\mu \le p_t(i)\le 1-1/\mu \), which hold for \(1\le i\le k\). In particular,

$$\begin{aligned} \mathrm {Var}\left[ Y_{1,k}\right] =\sum _{i=1}^{k}p_t(i)\left( 1-p_t(i)\right) \ge \frac{k}{\mu }\left( 1-\frac{1}{\mu }\right) \ge \frac{9k}{10\mu }\ge \frac{9}{10}, \end{aligned}$$

where the second inequality holds for sufficiently large n because \(\mu \ge a\ln (n)\) for some constant \(a>0\). Theorem 6 applied with \(\sigma _k\ge \sqrt{9/10}\) now gives

$$\begin{aligned} \hbox {Pr}\left( Y_{1,k}=j-\ell -1\right) \le \eta /\sigma _k. \end{aligned}$$

Furthermore, since \(\mathbb {E}\left[ Y_{1,k}\right] \) is an integer, Lemma 11 implies that

$$\begin{aligned} \hbox {Pr}\left( Y_{1,k}\ge \mathbb {E}\left[ Y_{1,k}\right] \right) \ge 1/2. \end{aligned}$$
(9)

By combining these two probability bounds, the probability of sampling an offspring with at least \(j-\ell \) ones from the first k positions is

$$\begin{aligned} \hbox {Pr}\left( Y_{1,k}\ge j-\ell \right)&=\hbox {Pr}\left( Y_{1,k}\ge j-\ell -1\right) -\hbox {Pr}\left( Y_{1,k}=j-\ell -1\right) \nonumber \\&=\hbox {Pr}\left( Y_{1,k}\ge \mathbb {E}\left[ Y_{1,k}\right] \right) -\hbox {Pr}\left( Y_{1,k}=j-\ell -1\right) \nonumber \\&\ge \frac{1}{2}-\frac{\eta }{\sigma _k} >\frac{1}{2}-\frac{0.4688}{\sqrt{9/10}} = \varOmega (1). \end{aligned}$$

In order to obtain an offspring in levels \(A_{\ge j+1}\), it is sufficient to sample at least \(j-\ell \) ones from the k first positions and \(\ell \) ones from position \(k+1\) to position \(k+\ell \). Therefore, using (7) and the above lower bound, this event happens with probability bounded from below by

$$\begin{aligned} \hbox {Pr}\left( Y_{1,n}\ge j\right)&\ge \hbox {Pr}\left( Y_{1,k} \ge j-\ell \right) \cdot \hbox {Pr}\left( Y_{k+1,k+\ell }=\ell \right) \\&> \varOmega (1)\cdot \frac{1}{e} =\varOmega \left( \frac{n-j+1}{n}\right) . \end{aligned}$$

Case 2: \(1\le k<\mu \) and \(j\ge n(1-1/\mu )+1\). The second condition is equivalent to \(1/\mu \ge (n-j+1)/n\). The probability of sampling an offspring in levels \(A_{\ge j+1}\) is then bounded from below by

$$\begin{aligned} \hbox {Pr}\left( Y_{1,n}\ge j\right)&\ge \hbox {Pr}\left( Y_{1,1}=1\right) \hbox {Pr}\left( Y_{2,k}\ge j-\ell -1\right) \hbox {Pr}\left( Y_{k+1,k+\ell }=\ell \right) \\&\ge \frac{1}{\mu }\hbox {Pr}\left( Y_{2,k}\ge j-\ell -1\right) \frac{1}{e} \ge \frac{1}{14e\mu }, \end{aligned}$$

where we used the inequality \(\hbox {Pr}\left( Y_{2,k}\ge j-\ell -1\right) \ge 1/14\) for \(\mu \ge 14\) proven in [10]. Since \(1/\mu \ge (n-j+1)/n\), we can conclude that

$$\begin{aligned} \hbox {Pr}\left( Y_{1,n}\ge j\right) \ge \frac{1}{14e\mu } \ge \frac{n-j+1}{14en}=\varOmega \left( \frac{n-j+1}{n}\right) . \end{aligned}$$

Case 3: \(1\le k<\mu \) and \(j<n(1-1/\mu )+1\). This case covers all the remaining situations not included in the previous cases. The latter inequality can be rewritten as \(n-j+1\ge n/\mu \). We also have \(\mu \le \sqrt{n(1-c)}\), so \(n/\mu \ge \mu /(1-c)\). It then holds that

$$\begin{aligned} (1-c)(n-j+1) \ge (1-c)(n/\mu ) \ge (1-c)\mu /(1-c)=\mu >k. \end{aligned}$$

Thus, the two conditions can be shortened to \(1\le k<(1-c)(n-j+1)\). In this case, the probability of sampling j ones is

$$\begin{aligned} \hbox {Pr}(Y_{1,n}\ge j)&\ge \hbox {Pr}\left( Y_{1,k}\ge j-\ell -1\right) \hbox {Pr}\left( Y_{k+1,k+\ell }=\ell \right) \hbox {Pr}\left( Y_{k+\ell +1,n}\ge 1\right) \\&\ge \frac{1}{2}\cdot \frac{1}{e}\cdot \frac{n-k-\ell }{n} = \frac{n-k-\ell }{2en}, \end{aligned}$$

where the \(1/2\) factor in the last inequality is due to (9). Since \(\ell \le j-1\) and \(k<(1-c)(n-j+1)\), it follows that

$$\begin{aligned} \hbox {Pr}\left( Y_{1,n}\ge j\right) > \frac{n-(1-c)(n-j+1)-j+1}{2en} =\varOmega \left( \frac{n-j+1}{n}\right) . \end{aligned}$$

Combining all four cases, the probability of sampling an offspring in levels \(A_{\ge j+1}\) satisfies

$$\begin{aligned} \hbox {Pr}\left( Y_{1,n}\ge j\right) = \varOmega \left( \frac{n-j+1}{n}\right) , \end{aligned}$$

and by defining \(z_j=c'\cdot \frac{n-j+1}{n}\) for a sufficiently small constant \(c'>0\) and choosing \(z_*:=\min _{j\in [n]}\{z_j\}=\varOmega (1/n)\), condition (G1) of Theorem 4 is satisfied.

Step 4 We consider condition (G3) regarding the population size. We have \(1/\delta ^2=\mathcal {O}(1)\), \(1/z_*= \mathcal {O}(n)\), and \(m=\mathcal {O}(n)\). Therefore, there must exist a constant \(a>0\) such that

$$\begin{aligned} \left( \frac{a}{\gamma _0}\right) \ln (n) \ge \left( \frac{4}{\gamma _0\delta ^2}\right) \ln \left( \frac{128m}{z_*\delta ^2}\right) . \end{aligned}$$

The requirement \(\mu \ge a\ln (n)\) now implies that

$$\begin{aligned} \lambda = \frac{\mu }{\mu /\lambda } \ge \left( \frac{a}{\gamma _0}\right) \ln (n) \ge \left( \frac{4}{\gamma _0\delta ^2}\right) \ln \left( \frac{128m}{z_*\delta ^2}\right) ; \end{aligned}$$

hence, condition (G3) is satisfied.

Step 5 We have verified all three conditions (G1), (G2), and (G3). By Theorem 4 and the bound \(z_j=\varOmega ((n-j+1)/n)\), the expected optimisation time is therefore

$$\begin{aligned} \mathbb {E}\left[ T\right] = \mathcal {O}\left( \lambda \sum _{j=1}^{n}\ln \left( \frac{n}{n-j+1}\right) + \sum _{j=1}^{n}\frac{n}{n-j+1}\right) . \end{aligned}$$

We simplify the two terms separately. By Stirling’s approximation (see Lemma 12), the first term is

$$\begin{aligned} \mathcal {O}\left( \lambda \sum _{j=1}^{n}\ln \left( \frac{n}{n-j+1}\right) \right) =\mathcal {O}\left( \lambda \ln \prod _{j=1}^n\frac{n}{n-j+1}\right) \\ =\mathcal {O}\left( \lambda \ln \left( \frac{n^n}{n!}\right) \right) =\mathcal {O}\left( \lambda \ln \frac{n^n\cdot e^n}{n^{n+1/2}}\right) = \mathcal {O}\left( n\lambda \right) . \end{aligned}$$

The second term is

$$\begin{aligned} \mathcal {O}\left( \sum _{j=1}^{n}\frac{n}{n-j+1}\right) =\mathcal {O}\left( n\sum _{k=1}^{n}\frac{1}{k}\right) = \mathcal {O}\left( n\log n\right) . \end{aligned}$$

Since \(\lambda > \mu = \varOmega (\log n)\), the expected optimisation time is

$$\begin{aligned} \mathbb {E}\left[ T\right] =\mathcal {O}\left( n\lambda \right) + \mathcal {O}\left( n\log n\right) =\mathcal {O}\left( n\lambda \right) . \end{aligned}$$

\(\square \)

4.2 Large Parent Population Size

For larger parent population sizes, i.e., \(\mu = \varOmega (\sqrt{n}\log n)\), we prove the upper bound \(\mathcal {O}(\lambda \sqrt{n})\) on the expected optimisation time of the UMDA on OneMax. Witt [44] obtained a similar result, and we actually rely on one of his lemmas to derive our improved result. Overall, our proof is not only significantly simpler but also holds for different settings of \(\mu \) and \(\lambda \), that is, \(\lambda =\varOmega (\mu )\) instead of \(\lambda =\varTheta (\mu )\).

Theorem 9

For sufficiently large constants \(a>1\) and \(c>0\), the UMDA (with margins) with offspring population size \(\lambda \ge a\mu \), and parent population size \(\mu \ge c\sqrt{n}\log n\), has expected optimisation time \(\mathcal {O}\left( \lambda \sqrt{n}\right) \) on OneMax.

Here, we are mainly interested in the parent population size \(\mu \ge c\sqrt{n}\log n\) for a sufficiently large constant \(c>0\). In this case, Witt [44] found that \(\hbox {Pr}(T\le n^{cc'})=\mathcal {O}(n^{-cc'})\), where \(c'\) is another positive constant and \(T:=\min \{t\ge 0\;|\;p_t(i)\le 1/4\}\) for an arbitrary bit \(i\in [n]\). This result implies that the probability of not sampling an optimal solution within \(n^{cc'}\) generations is bounded by \(\mathcal {O}(n^{-cc'})\). Therefore, the UMDA needs \(\mathcal {O}(n\lambda \log \lambda )/\lambda = \mathcal {O}(n\log \lambda )\) generations [10] with probability \(\mathcal {O}(n^{-cc'})\) and \(\mathcal {O}(\lambda \sqrt{n})/\lambda =\mathcal {O}(\sqrt{n})\) generations with probability \(1-\mathcal {O}(n^{-cc'})\) to optimise OneMax. The expected number of generations is

$$\begin{aligned} \mathcal {O}(n^{-cc'})\cdot \mathcal {O}(n\log \lambda ) + (1-\mathcal {O}(n^{-cc'}))\cdot \mathcal {O}(\sqrt{n}) \end{aligned}$$

If we choose the constant c large enough, then \(n^{-cc'}\) decays faster than the reciprocal of any fixed polynomial; since \(n\log \lambda \in \text {poly}(n)\), this leads to \(\mathcal {O}(n^{-cc'})\cdot \mathcal {O}(n\log \lambda ) = \mathcal {O}(1)\). Therefore, the overall expected number of generations is still bounded by \(\mathcal {O}(\sqrt{n})\), so the expected optimisation time is \(\mathcal {O}(\lambda \sqrt{n})\).

In addition, the analysis by Witt [44] implies that all marginals will generally move to higher values and are unlikely to drop by a large distance. We then pessimistically assume that all marginals are lower bounded by a constant \(p_{\min }=1/4\). Again, we rearrange the bit positions such that there exist two integers \(0\le k,\ell \le n\), where \(k+\ell =n\) and

  • \(p_t(i)\in \left[ p_{\min }, 1-\frac{1}{\mu }\right] \) for all \(1\le i \le k\),

  • \(p_t(i) =1-\frac{1}{n}\) for all \(k+1\le i \le n\).

Note that \(k>0\) because if \(k=0\) we would have sampled a globally optimal solution.

Proof of Theorem 9

We apply Theorem 4.

Step 1 We partition the search space into the m subsets \(A_1, \ldots ,A_m\) (i.e. levels) defined by

$$\begin{aligned} A_{i}&:= \{ x \in \{0,1\}^n \mid f_{i-1} \le \textsc {OneMax} (x) < f_{i} \} \text { for } i \in [m-1], \\ \text { and } A_{m}&:= \{ 1^n \}, \end{aligned}$$

where the sequence \((f_i)_{i \in \mathbb {N}}\) is defined with some constant \(d\in (0,1]\) as

$$\begin{aligned} f_0&:= 0 \text { and } f_{i+1} := f_{i} + \lceil d\sqrt{n-f_{i}} \rceil . \end{aligned}$$
(10)

The range of d will be specified later; for now, note that \(m = \min \{i \mid f_{i} = n\} + 1\). Due to Lemma 15, the sequence \((f_i)_{i \in \mathbb {N}}\) is well-behaved: it starts at 0, increases by at least 1 per level, eventually reaches n exactly and remains there afterwards. Moreover, the number of levels satisfies \(m = \varTheta (\sqrt{n})\).
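As a quick numerical sanity check of this level structure (not part of the proof), the following minimal Python sketch computes the sequence \((f_i)\) from (10) and the number of levels m for a few problem sizes; the value \(d=0.5\) is purely illustrative and not the value prescribed by the proof.

```python
import math

def level_sequence(n, d):
    """f_0 = 0 and f_{i+1} = f_i + ceil(d * sqrt(n - f_i)), as in Eq. (10)."""
    f = [0]
    while f[-1] < n:
        # For d in (0,1] the increment is at least 1 and at most n - f_i,
        # so the sequence increases steadily and hits n exactly (cf. Lemma 15).
        f.append(f[-1] + math.ceil(d * math.sqrt(n - f[-1])))
    return f

d = 0.5  # illustrative only; the proof requires 0 < d <= d_2 from Eq. (13)
for n in (100, 400, 1600, 6400):
    m = len(level_sequence(n, d))             # m = min{i : f_i = n} + 1
    print(n, m, round(m / math.sqrt(n), 2))   # the ratio m / sqrt(n) stays roughly constant
```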

Step 2 For (G2), we assume that \(|P_t\cap A_{\ge j}|\ge \gamma _0\lambda =\mu \) and \(|P_t\cap A_{\ge j+1}|\ge \gamma \lambda \). Additionally, we make the pessimistic assumption that \(|P_t\cap A_{\ge j+2}|= 0\), i.e. the current population contains exactly \(\gamma \lambda \) individuals in \(A_{j+1}\), \(\mu -\gamma \lambda \) individuals in level \(A_{j}\), and \(\lambda -\mu \) individuals in the levels below \(A_j\). In this case,

$$\begin{aligned} \sum _{i=1}^n X_i = \gamma \lambda f_j +(\mu -\gamma \lambda )f_{j-1} = \mu \left( f_{j-1}+\frac{\gamma }{\gamma _0}\left( f_j-f_{j-1}\right) \right) , \end{aligned}$$

and

$$\begin{aligned} \sum _{i=1}^k X_i = \sum _{i=1}^n X_i - \sum _{i=k+1}^n X_i =\mu \left( f_{j-1}+\frac{\gamma }{\gamma _0}\left( f_j-f_{j-1}\right) -\ell \right) . \end{aligned}$$

The expected value of \(Y_{1,k}\) is

$$\begin{aligned} \mathbb {E}\left[ Y_{1,k}\right] =\frac{1}{\mu }\sum _{i=1}^k X_i = (f_{j-1}-\ell ) +\frac{\gamma }{\gamma _0}\left( f_j-f_{j-1}\right) . \end{aligned}$$

Due to the assumption \(p_t(i)\ge p_{\min }=1/4\), the variance of \(Y_{1,k}\) is

$$\begin{aligned} \mathrm {Var}\left[ Y_{1,k}\right]&=\sum _{i=1}^k p_t(i)(1-p_t(i))\\&\ge p_{\min }(k-\mathbb {E}\left[ Y_{1,k}\right] )\\&= \frac{1}{4}\left( n-\ell -\mathbb {E}\left[ Y_{1,k}\right] \right) \\&= \frac{1}{4}\left( n-\ell -f_{j-1}-\frac{\gamma }{\gamma _0}\left( f_j-f_{j-1}\right) +\ell \right) \\&\ge \frac{1}{4}\left( n-f_{j-1}- d\left( n-f_{j-1}\right) \right) =\frac{1}{4}\left( n-f_{j-1}\right) \left( 1-d\right) . \end{aligned}$$

The probability of sampling an offspring in \(A_{\ge j+1}\) is bounded from below by

$$\begin{aligned} \hbox {Pr}\left( Y_{1,n} \ge f_{j}\right) \ge \hbox {Pr}(Y_{1,k}\ge f_j-\ell )\cdot \hbox {Pr}(Y_{k+1,n} = \ell ), \end{aligned}$$

where

$$\begin{aligned} \hbox {Pr}(Y_{k+1,n}=\ell )=\left( 1-\frac{1}{n}\right) ^{\ell }\ge \left( 1-\frac{1}{n}\right) ^{n-1}\ge \frac{1}{e}, \end{aligned}$$

and

$$\begin{aligned} \hbox {Pr}\left( Y_{1,k}\ge f_j-\ell \right)&\ge \hbox {Pr}\left( Y_{1,k}\ge \mathbb {E}\left[ Y_{1,k}\right] \right) -\hbox {Pr}\left( \mathbb {E}\left[ Y_{1,k}\right] \le Y_{1,k}\le f_j-\ell \right) . \end{aligned}$$
(11)

By Theorem 6, we have

$$\begin{aligned} \hbox {Pr}\left( \mathbb {E}\left[ Y_{1,k}\right] \le Y_{1,k}\le f_j-\ell \right)&\le \frac{\eta \left( f_j-\ell -\mathbb {E}[Y_{1,k}]\right) }{\sqrt{\mathrm {Var}\left[ Y_{1,k}\right] }}\\&= \eta \left( 1-\frac{\gamma }{\gamma _0}\right) \frac{f_j-f_{j-1}}{\sqrt{\mathrm {Var}\left[ Y_{1,k}\right] }}\\&\le 2\eta \left( 1-\frac{\gamma }{\gamma _0}\right) \frac{d}{\sqrt{1-d}}\\&\le \left( 1-\frac{\gamma }{\gamma _0}\right) \frac{d}{\sqrt{1-d}}. \end{aligned}$$

The last inequality follows from \(\eta \approx 0.4688 <1/2\). Note that \(\hbox {Pr}\left( Y_{1,k}\ge \mathbb {E}\left[ Y_{1,k}\right] \right) \ge \psi = \varOmega (1)\) due to Lemma 16, so (11) becomes

$$\begin{aligned} \hbox {Pr}(Y_{1,k}\ge f_j-\ell ) \ge \psi - \left( 1-\frac{\gamma }{\gamma _0}\right) \frac{d}{\sqrt{1-d}} \ge \psi \frac{\gamma }{\gamma _0}. \end{aligned}$$
(12)

The last inequality is satisfied for every \(j\in [m-1]\) if

$$\begin{aligned} \frac{d}{\sqrt{1-d}}\le \psi&\iff \psi ^{-2} d^2+d-1 \le 0. \end{aligned}$$

The discriminant of this quadratic is \(\varDelta = 1+4\psi ^{-2}>0\). By Vieta's formulas [43], the product of the two roots is negative, so the equation has one negative and one positive real root, \(d_1<0\) and \(d_2>0\). Specifically,

$$\begin{aligned} d_{1} = -(1 + \sqrt{\varDelta })\psi ^{2}/2 <0 \quad \text { and } \quad d_{2} = (-1+ \sqrt{\varDelta })\psi ^{2}/2\in (0,1). \end{aligned}$$
(13)

Therefore, if we choose any value of d such that \(0<d \le d_2\), then inequality (12) always holds. The probability of sampling an offspring in \(A_{\ge j+1}\) is therefore bounded from below by

$$\begin{aligned} \hbox {Pr}(Y_{1,n}\ge f_j)\ge \frac{1}{e}\cdot \psi \frac{\gamma }{\gamma _0} \ge (1+\delta )\gamma . \end{aligned}$$

The last inequality holds if we choose the population sizes of the UMDA such that \(\mu /\lambda = \gamma _0 \le \psi /((1+\delta )e)\), where \(\delta \in (0,1]\). Condition (G2) then follows.
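The constants in Step 2 can be made concrete numerically. The sketch below uses a hypothetical value \(\psi =0.2\) (the actual constant guaranteed by Lemma 16 is not computed here) and a \(\delta \in (0,1]\) of our choosing, and evaluates \(d_2\) from (13) together with the resulting admissible selective pressure \(\gamma _0\le \psi /((1+\delta )e)\).

```python
import math

def step2_constants(psi, delta):
    """d_2 from Eq. (13) and the largest admissible gamma_0 = mu/lambda for condition (G2)."""
    disc = 1.0 + 4.0 / psi ** 2                      # discriminant of psi^{-2} d^2 + d - 1 = 0
    d2 = (-1.0 + math.sqrt(disc)) * psi ** 2 / 2.0   # the positive root; it lies in (0, 1)
    gamma0_max = psi / ((1.0 + delta) * math.e)      # mu/lambda must not exceed this value
    return d2, gamma0_max

psi, delta = 0.2, 1.0                                # psi is hypothetical, delta in (0, 1]
d2, gamma0_max = step2_constants(psi, delta)
assert abs(d2 ** 2 / psi ** 2 + d2 - 1.0) < 1e-9     # d_2 is a root of psi^{-2} d^2 + d - 1
print(d2, gamma0_max)                                # approx. 0.181 and 0.037
```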

Step 3 Assume that \(|P_t\cap A_{\ge j}|\ge \gamma _0\lambda =\mu \). This means that the \(\mu \) fittest individuals in the current sorted population \(P_t\) belong to levels \(A_{\ge j}\). In other words,

$$\begin{aligned} \sum _{i=1}^n X_i\ge \mu f_{j-1}, \end{aligned}$$

and

$$\begin{aligned} \sum _{i=1}^k X_i = \sum _{i=1}^n X_i-\sum _{i=k+1}^n X_i \ge \mu f_{j-1}-\mu \ell =\mu (f_{j-1}-\ell ). \end{aligned}$$

The expected value of \(Y_{1,n}\) is

$$\begin{aligned} \mathbb {E}\left[ Y_{1,n}\right] =\sum _{i=1}^n p_t(i) = \frac{1}{\mu }\sum _{i=1}^{k}X_i +\sum _{i=k+1}^n \left( 1-\frac{1}{n}\right) \ge f_{j-1}-\frac{\ell }{n}. \end{aligned}$$
(14)

An individual belonging to the higher levels \(A_{\ge j+1}\) must have at least \(f_j\) ones, so the probability of sampling an offspring \(y \in A_{\ge j+1}\) is exactly \(\hbox {Pr}(Y_{1,n}\ge f_j)\). According to the level definitions and by Lemma 17, we have

$$\begin{aligned} \hbox {Pr}\left( Y_{1,n}\ge f_j\right)&= \hbox {Pr}\left( Y_{1,n}\ge f_{j-1}+\lceil d\sqrt{n-f_{j-1}}\rceil \right) \\&\ge \hbox {Pr}\left( Y_{1,n}\ge \mathbb {E}\left[ Y_{1,n}\right] +d\sqrt{n-\mathbb {E}\left[ Y_{1,n}\right] }\right) . \end{aligned}$$

In order to obtain a lower bound on \(\hbox {Pr}\left( Y_{1,n}\ge f_j\right) \), it suffices to bound the probability \(\hbox {Pr}\left( Y_{1,n}\ge \mathbb {E}\left[ Y_{1,n}\right] +d\sqrt{n-\mathbb {E}\left[ Y_{1,n}\right] }\right) \) from below by a constant. We obtain such a bound by applying Lemma 14 with a constant \(d^*\ge 1/p_{\min }=4\) and \(d \le d^*\), which yields

$$\begin{aligned} \hbox {Pr}\left( Y_{1,n}\ge f_j\right)&\ge \hbox {Pr}\left( Y_{1,n}\ge \mathbb {E}\left[ Y_{1,n}\right] +d\sqrt{n-\mathbb {E}\left[ Y_{1,n}\right] }\right) \\&\ge \hbox {Pr}\left( Y_{1,n}\ge \min \bigg \{\mathbb {E}\left[ Y_{1,n}\right] +d^*\sqrt{n-\lfloor \mathbb {E}\left[ Y_{1,n}\right] \rfloor },n\bigg \}\right) \\&\ge \kappa >0, \end{aligned}$$

where \(\kappa \) is a constant. Hence, the probability of sampling an offspring in levels \(A_{\ge j+1}\) is bounded from below by a positive constant \(z_j:=\kappa \) independent of n.

Step 4 We consider condition (G3) regarding the population size. We have \(1/\delta ^2=\mathcal {O}(1)\), \(1/z_*= \mathcal {O}(1)\), and \(m=\mathcal {O}(\sqrt{n})\). Therefore, there must exist a constant \(c>0\) such that

$$\begin{aligned} \left( \frac{c}{\gamma _0}\right) \sqrt{n}\ln (n) \ge \left( \frac{4}{\gamma _0\delta ^2}\right) \ln \left( \frac{128m}{z_*\delta ^2}\right) . \end{aligned}$$

The requirement \(\mu \ge c\sqrt{n}\ln (n)\) now implies that

$$\begin{aligned} \lambda = \frac{\mu }{\mu /\lambda } \ge \left( \frac{c}{\gamma _0}\right) \sqrt{n}\ln (n) \ge \left( \frac{4}{\gamma _0\delta ^2}\right) \ln \left( \frac{128m}{z_*\delta ^2}\right) ; \end{aligned}$$

hence, condition (G3) is satisfied.
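To illustrate the (G3) check, the following sketch evaluates the right-hand side for hypothetical constants \(\gamma _0\), \(\delta \) and \(z_*\) (none of these values are fixed by the analysis) and compares it with an offspring population size of order \(\sqrt{n}\log n/\gamma _0\).

```python
import math

def g3_rhs(m, gamma0, delta, z_star):
    """Right-hand side of (G3): (4 / (gamma0 * delta^2)) * ln(128 m / (z_star * delta^2))."""
    return (4.0 / (gamma0 * delta ** 2)) * math.log(128.0 * m / (z_star * delta ** 2))

gamma0, delta, z_star = 0.03, 1.0, 0.01    # hypothetical constants, chosen only for illustration
for n in (10 ** 3, 10 ** 4, 10 ** 5):
    m = 4 * math.sqrt(n)                       # m = Theta(sqrt(n)); the factor 4 is illustrative
    lam = math.sqrt(n) * math.log(n) / gamma0  # lambda = mu / gamma0 with mu = sqrt(n) ln(n)
    print(n, lam >= g3_rhs(m, gamma0, delta, z_star))   # (G3) holds for these settings
```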

Step 5 The probability of sampling an offspring in levels \(A_{\ge j+1}\) is bounded from below by \(z_j=\kappa \). Having verified all three conditions, Theorem 4 guarantees the following upper bound on the expected optimisation time of the UMDA on OneMax for \(\mu =\varOmega (\sqrt{n}\log n)\):

$$\begin{aligned} \mathbb {E}\left[ T\right] =\mathcal {O}\left( \lambda \sum _{j=1}^m \frac{1}{z_j}+\sum _{j=1}^m \frac{1}{z_j}\right) =\mathcal {O}(m\lambda ) = \mathcal {O}\left( \lambda \sqrt{n}\right) \end{aligned}$$

since \(m=\varTheta (\sqrt{n})\) due to Lemma 15. \(\square \)

5 Empirical Results

We have proved upper bounds on the expected optimisation time of the UMDA on OneMax, LeadingOnes and BinVal. However, these are only asymptotic upper bounds in terms of the problem and population sizes; they provide no information about the multiplicative constants or the influence of lower-order terms. We are also interested in the runtime behaviour for larger populations. To complement the theoretical findings, we therefore carried out experiments, running the UMDA on the three functions.
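For concreteness, the following is a minimal Python sketch of the UMDA with margins \([1/n, 1-1/n]\) on OneMax, in the form analysed above; it is meant only to make the experimental setup tangible and is not the exact implementation used for the experiments.

```python
import math
import random

def onemax(x):
    return sum(x)

def umda_onemax(n, mu, lam, max_gens=10 ** 6):
    """(mu, lambda) UMDA with margins [1/n, 1 - 1/n] on OneMax.
    Returns the number of fitness evaluations until the optimum 1^n is sampled."""
    p = [0.5] * n                                        # initial marginals
    for gen in range(1, max_gens + 1):
        pop = [[1 if random.random() < p[i] else 0 for i in range(n)]
               for _ in range(lam)]                      # sample lambda offspring bit-wise
        pop.sort(key=onemax, reverse=True)               # rank by fitness
        if onemax(pop[0]) == n:                          # optimum found
            return gen * lam
        for i in range(n):
            x_i = sum(ind[i] for ind in pop[:mu])        # ones at position i among the mu fittest
            p[i] = min(max(x_i / mu, 1 / n), 1 - 1 / n)  # update marginal and clamp to the margins
    return None                                          # give up (not expected for these settings)

if __name__ == "__main__":
    n, lam = 100, 100                                    # lambda = n, as in the experiments
    mu = int(math.sqrt(n) * math.log(n))                 # the "large population" setting
    print(umda_onemax(n, mu, lam))
```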

For each function, the parameters were chosen consistently with the theoretical analyses. Specifically, we set \(\lambda =n\) and \(n \in \{100, 200,\ldots ,4500\}\). Although the theoretical results imply that significantly smaller population sizes would suffice, e.g. \(\lambda =\mathcal {O}(\log n)\) for Theorem 8, we chose a larger population size in the experiments to observe the impact of \(\lambda \) on the running time of the algorithm more easily. The results are shown in Figs. 1, 2 and 3. For each value of n, the algorithm was run 100 times, and the average runtime was computed. The average runtime for each value of n is estimated with a \(95\%\) confidence interval using the bootstrap percentile method [30] with 100 bootstrap samples. Each average point is plotted with two error bars indicating the upper and lower limits of the confidence interval.
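The percentile-bootstrap intervals can be computed with a few lines of Python; the sketch below assumes a list of 100 observed runtimes for one value of n (the data shown here are placeholders) and uses 100 bootstrap resamples, mirroring the description above.

```python
import random
import statistics

def bootstrap_percentile_ci(runtimes, n_boot=100, alpha=0.05):
    """95% percentile-bootstrap confidence interval for the mean runtime."""
    means = []
    for _ in range(n_boot):
        resample = [random.choice(runtimes) for _ in runtimes]   # sample with replacement
        means.append(statistics.mean(resample))
    means.sort()
    lo = means[int((alpha / 2) * n_boot)]            # approx. 2.5th percentile of the means
    hi = means[int((1 - alpha / 2) * n_boot) - 1]    # approx. 97.5th percentile of the means
    return statistics.mean(runtimes), lo, hi

# hypothetical usage: runtimes collected from 100 runs of the UMDA for one value of n
runtimes = [random.gauss(50_000, 5_000) for _ in range(100)]     # placeholder data
print(bootstrap_percentile_ci(runtimes))
```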

5.1 OneMax

In Sect. 4, we obtained two upper bounds on the expected optimisation time of the UMDA on OneMax, which are tighter than the earlier upper bound \(\mathcal {O}(n\lambda \log \lambda )\) in [10], as follows

  • \(\mathcal {O}\left( \lambda n\right) \) with parent population sizes \(\mu =\varOmega (\log n)\cap \mathcal {O}(\sqrt{n})\),

  • \(\mathcal {O}(\lambda \sqrt{n})\) with parent population sizes \(\mu = \varOmega (\sqrt{n}\log (n))\).

Fig. 1 Average runtime of the UMDA on OneMax with 95% confidence intervals plotted with error bars in red colour. Models are also fitted via non-linear regression. a Small \(\mu \). b Large \(\mu \) (Color figure online)

We therefore experimented with two different settings for the parent population size: \(\mu =\sqrt{n}\) and \(\mu =\sqrt{n}\log (n)\). We refer to the first setting as small population and to the second as large population. The empirical runtimes are shown in Fig. 1. With \(\lambda =n\), Theorems 8 and 9 imply the upper bounds \(\mathcal {O}(n^2)\) for the small population setting and \(\mathcal {O}(n^{3/2})\) for the large population setting, respectively. Following [30], we identify the three positive constants \(c_1, c_2\) and \(c_3\) that best fit the models \(c_1 n\log n\), \(c_2 n^{3/2}\) and \(c_3n^{2}\) by non-linear least-squares regression. These models were chosen because they are close to the theoretical results. The correlation coefficient \(\rho \) is then calculated for each model to identify the best-fit model.
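This fitting step can be reproduced with standard non-linear least squares, e.g. via scipy.optimize.curve_fit; the sketch below fits the three candidate models to hypothetical averaged runtimes (the data-loading line is a placeholder) and reports the correlation coefficient between observed and fitted values for each model.

```python
import numpy as np
from scipy.optimize import curve_fit

# Candidate models from the analysis, each with one free constant
models = {
    "c*n*log n": lambda n, c: c * n * np.log(n),
    "c*n^(3/2)": lambda n, c: c * n ** 1.5,
    "c*n^2":     lambda n, c: c * n ** 2,
}

# Placeholder data: problem sizes and averaged empirical runtimes (replace with real measurements)
n_vals = np.arange(100, 4600, 100, dtype=float)
avg_runtime = 0.8 * n_vals ** 1.5 + np.random.normal(0, 1e3, n_vals.size)

for name, model in models.items():
    (c_opt,), _ = curve_fit(model, n_vals, avg_runtime, p0=[1.0])
    fitted = model(n_vals, c_opt)
    rho = np.corrcoef(avg_runtime, fitted)[0, 1]       # correlation between data and fitted model
    print(f"{name}: c = {c_opt:.4f}, rho = {rho:.5f}")
```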

Table 2 Correlation coefficient \(\rho \) for the best-fit models in the experiments with OneMax shown in Fig. 1a, b

In Table 2, we observe that for the small parent population (i.e. \(\mu = \sqrt{n}\)), the model \(0.8104\;n^{3/2}\) fits the empirical data best, while the quadratic model gives the worst fit. For the larger parent population (i.e. \(\mu = \sqrt{n}\log n\)), the model \(1.0767\;n^{3/2}\) fits the empirical data best among the three models. Since \(0.8104\;n^{3/2} \in \mathcal {O}(n^2)\), these findings are consistent with the theoretical expected optimisation times and further suggest that the quadratic bound for the small population setting may not be tight.

5.2 LeadingOnes

We conducted experiments with \(\mu =\sqrt{n}\) and \(\lambda =n\). According to Theorem 7, the upper bound on the expected runtime in this case is \(\mathcal {O}(n\lambda \log \lambda + n^2) = \mathcal {O}(n^2\log n)\). Figure 2 shows the empirical runtime. As for OneMax, we fit the empirical runtime with four different models—\(c_1n\log n\), \(c_2n^{3/2}\), \(c_3n^2\) and \(c_4n^2\log n\)—using non-linear regression. The best values of the four constants are shown in Table 3, along with the correlation coefficients of the models.

Fig. 2 Average runtime of the UMDA on LeadingOnes with 95% confidence intervals plotted with error bars in red colour. Models are also fitted via non-linear regression (Color figure online)

Table 3 Correlation coefficient \(\rho \) for the best-fit models in the experiments with LeadingOnes shown in Fig. 2

Figure 2 and Table 3 show that both the model \(1.5223\;n^2\) and the model \(0.1851\;n^2\log n\), which have the same correlation coefficient, fit the empirical data well (the empirical data lie between these two curves). This finding is consistent with the theoretical runtime bound \(\mathcal {O}(n^2\log n)\). Note also that these two models differ asymptotically only by a factor of \(\varTheta (\log n)\), suggesting that our analysis of the UMDA on LeadingOnes is nearly tight.

5.3 BinVal

Finally, we consider BinVal. The upper bound \(\mathcal {O}(n\lambda \log \lambda + n^2)\) from Theorem 7 for this function is identical to the bound for LeadingOnes. Since BinVal, like OneMax, is a linear function, we set up the experiments as for OneMax, i.e. with the two parent population sizes \(\mu =\sqrt{n}\) and \(\mu =\sqrt{n}\log n\). The empirical results are shown in Fig. 3. Again, the empirical runtime is fitted to the three models \(c_1 n\log n\), \(c_2 n^{3/2}\) and \(c_3 n^{2}\). The best values of \(c_1, c_2\) and \(c_3\) are listed in Table 4, along with the correlation coefficient for each model.

Fig. 3 Average runtime of the UMDA on BinVal with 95% confidence intervals plotted with error bars in red colour. Models are also fitted via non-linear regression. a Small \(\mu \). b Large \(\mu \) (Color figure online)

Table 4 Correlation coefficient \(\rho \) for the best-fit models in the experiments with BinVal shown in Fig. 3a, b

For \(\lambda =n\), Theorem 7 gives the upper bound \(\mathcal {O}(n^2\log n)\) on the expected runtime for BinVal. However, Fig. 3 and Table 4 clearly show that the model \(1.4605\;n^{3/2}\) fits the empirical runtime best for \(\mu =\sqrt{n}\), while the empirical runtime lies between the two models \(11.973\;n\log n\) and \(1.6586\;n^{3/2}\) for \(\mu =\sqrt{n}\log n\). These observations are consistent with the theoretical upper bound, since both \(\mathcal {O}(n^{3/2})\) and \(\mathcal {O}(n\log n)\) are contained in \(\mathcal {O}(n^2\log n)\), but they also suggest that the bound of Theorem 7 for BinVal may be loose.

6 Conclusion

Despite the popularity of EDAs in real-world applications, little is known about their optimisation time, even for apparently simple settings such as the UMDA on toy functions. Results for the UMDA on such simple problems with well-understood structure provide a way to describe and compare the performance of the algorithm with that of other search heuristics. Furthermore, results about the UMDA are relevant not only to evolutionary computation, but also to population genetics, where it corresponds to the notion of linkage equilibrium [36, 41].

We have analysed the expected optimisation time of the UMDA on three benchmark problems: OneMax, LeadingOnes and BinVal. For both LeadingOnes and BinVal, we proved the upper bound \(\mathcal {O}(n\lambda \log \lambda +n^2)\), which holds for \(\lambda =\varOmega (\log n)\). For OneMax, we obtained the two upper bounds \(\mathcal {O}(\lambda n)\) and \(\mathcal {O}(\lambda \sqrt{n})\) for \(\mu = \varOmega (\log n)\cap \mathcal {O}(\sqrt{n})\) and \(\mu = \varOmega (\sqrt{n}\log n)\), respectively. Although our results assume \(\lambda \ge (1+\beta )\mu \) for some constant \(\beta >0\), they no longer require \(\lambda =\varTheta (\mu )\) as in [44]. Note that for \(\lambda =\varTheta (\log n)\), we obtain a tight bound \(\varTheta (n\log n)\) on the expected optimisation time of the UMDA on OneMax, matching the well-known tight bound \(\varTheta (n\log n)\) for the (\(1+1\)) EA on the class of linear functions. Although we did not obtain a runtime bound for parent population sizes \(\mu = \varOmega (\sqrt{n})\cap \mathcal {O}(\sqrt{n}\log n)\), our results close the existing \(\varTheta (\log \log n)\)-gap between the first upper bound \(\mathcal {O}(n\log n \log \log n)\) for \(\lambda =\varOmega (\mu )\) [10] and the more recent lower bound \(\varOmega (\mu \sqrt{n}+n\log n)\) for \(\lambda = (1+\varTheta (1))\mu \) [26].

Our analysis further demonstrates that the level-based theorem can yield, relatively easily, asymptotically tight upper bounds for non-trivial, population-based algorithms. An important additional component of the analysis was the use of anti-concentration properties of the Poisson–Binomial distribution: as long as the variance of the sampled individuals is not too small, the distribution of the population cannot be too concentrated anywhere, not even around its mean, which yields sufficient diversity to discover better solutions. We expect that similar arguments will lead to new results in the runtime analysis of evolutionary algorithms.