On the convergence rate issues of general Markov search for global minimum

This paper focuses on the convergence rate problem of general Markov search for the global minimum. Many existing methods are designed to overcome a very hard problem: how to efficiently localize and approximate the global minimum of a multimodal function f when the only information available is the f-values evaluated at generated points. Because such methods use poor information on f, the following problem may occur: the closer to the optimum, the harder it is to generate a "better" (in the sense of the cost function) state. This paper explores this issue on a theoretical basis. To do so, the concept of lazy convergence for a globally convergent method is introduced: a globally convergent method is called lazy if the probability of generating a better state from one step to the next goes to zero with time. This issue is the cause of very undesirable convergence properties. This paper shows when an optimization method has to be lazy, and the presented general results cover, in particular, the class of simulated annealing algorithms and monotone random search. Furthermore, some attention is paid to accelerated random search and evolution strategies.


Introduction
Dawid Tarłowski (dawid.tarlowski@im.uj.edu.pl; dawid.tarlowski@gmail.com), Faculty of Mathematics and Computer Science, Institute of Mathematics, Jagiellonian University, Łojasiewicza 6, 30-348 Kraków, Poland

Let (A, d) be a separable metric space and let f : A → [0, ∞) be a Borel measurable function attaining its global minimum f* = min f(A). A great number of iterative numerical methods are used for finding a global minimum of f when the global minimization problem min_{x∈A} f(x) cannot be solved analytically. Many of those iterative techniques are designed for solving difficult, irregular or multimodal, real-world problems. This paper focuses on the class of Markov methods which, as A is assumed to be separable, admit the general representation X_{t+1} = T(X_t, Y_t), where Y_t is an independent sequence, independent of X_0, see [8]. We will say that X_t is globally convergent if it converges stochastically to the set A* of global minimizers. It is often an easy task to examine the global convergence property of such methods on a theoretical basis (of course there are exceptions, especially in the case of self-adaptive methods), and general techniques are based, in particular, on the Borel–Cantelli lemma and classical probability theory [24,31], Lyapunov functions [2,22,30,32,33], and Markov chains [1,4,12]. At the same time, theoretical convergence rate analysis is usually extremely difficult. The convergence rate analysis must take into account the optimization scheme, the initial parameters of the given procedure and the relevant properties of the given problem function (in general, the function can be multimodal and complex, which strongly determines the algorithm's efficiency, [23,37]). While it is theoretically justified that gradient-based local search methods are fast [21], the existing theoretical results regarding derivative-free global random search techniques usually indicate a slow convergence rate or concern special cases; for instance, in many cases of convex optimization the
derivative-free methods may efficiently use gradient estimates, see [10,11]. However, many global random search methods are designed to overcome a general and almost impossible problem: how to efficiently localize and approximate the global minimum of a multimodal function when the only information available is the f-values evaluated at generated points. Furthermore, the global minimal value is usually unknown. The derivative may not exist or may be unavailable (for instance, in the case of so-called "black box" problems, usually all one has is the possibility of computing the value f(x) at a given state x ∈ A, and this computation often requires much effort), and hence many methods belong to the class of derivative-free algorithms, [27]. Because the given method uses poor information on f, its convergence may have very undesirable properties based on the following issue: the closer to the optimum, the harder it is to generate a "better" (in the sense of the cost function) state. This paper explores this issue on a theoretical basis. To do so, the concept of lazy convergence for a globally convergent method is introduced: a globally convergent method is called lazy if the probability of generating a better state from one step to the next goes to zero with time. It is shown, in particular, that a monotonic method X_t [in the sense f(X_{t+1}) ≤ f(X_t)] is lazy iff for any k ∈ N we have P(X_{t+k} = X_{t+k−1} = ··· = X_t) → 1 as t → ∞, and the expected length of constant finite subsequences X_t = X_{t+1} = ··· = X_{t+k} goes to infinity with time t. The above property is extended to the case of nonmonotonic methods as the property of the corresponding best iterate sequence. This paper shows when an optimization method is lazy, and the presented general results cover, in particular, the class of simulated annealing algorithms and monotone random search. To provide an application example from the class of methods based on parameter self-adaptation, it is shown that the finite descent Accelerated Random Search [3] converges lazily. As discussed further, the undesired lazy convergence property appears to be a property of the optimization method rather than a property of the problem function f. The author of this paper believes that methods based on parameter self-adaptation may be a good way to overcome the convergence issues presented here, and the last, additional, chapter of this paper focuses on this class of methods. Finally, it is worth mentioning that this paper is about the convergence behaviour of the given optimization method X_t as it approaches the global minimum. An alternative convergence problem is also the object of analysis in the literature: given ε > 0, how to analyze the expected time needed by the given optimization method X_t to hit the ε-neighbourhood of the global minimum and, in particular, how this time changes as ε goes to zero, see [35]. Those two research aspects are based on different approaches, and one of the reasons is that in the first case the target (the global minimum) usually has zero Lebesgue measure, while in the second case the target (the ε-ball) has positive Lebesgue measure.
This paper is organized as follows. Section 2 introduces the general assumptions and corresponding notation, and then it introduces and discusses the concept of lazy convergence. Section 3 presents general results for the class of comparison-based monotone homogeneous Markov search. Section 4 successively develops the framework of Sect. 3 up to full generality. Both sections discuss the results and present corresponding illustrative examples. The main result of Sect. 4 is Theorem 5, which covers, in particular, the class of Simulated Annealing algorithms and monotone random search. Additionally, the lazy convergence of finite descent ARS is provided as a conclusion of Theorem 3. The last section is an additional chapter which briefly indicates that self-adaptive methods may be a good way to overcome the issues analysed here.

General assumptions and lazy convergence
This section presents the general assumptions and notation which will hold throughout the paper. Next, it introduces and discusses the concept of lazy convergence.

General assumptions
We assume that A ⊂ R^d. The presented methodology can be extended to more general spaces; however, full generality is not a purpose of this paper, as the clarity of the presented ideas is more important. We will assume that the metric d on A either is a metric for the Euclidean topology or, in the case A = I^d := (0, 1]^d, is the d-dimensional torus metric d_T given by:

Now we introduce general notation:
(1) We will always assume that the measurable problem function f : A → [0, ∞) satisfies the natural conditions (A1) and (A2). Condition (A1) means that for any ε > 0 we have inf{f(x) : x ∈ A, d(x, A*) ≥ ε} > f*. Condition (A2) is satisfied, for example, if the set A* of global minimizers is finite and the function f is continuous at the points of A*. If A* is infinite, condition (A2) still holds true if for some ε > 0 the set A*(ε) is compact and f is continuous on A*(ε). Under conditions (A1), (A2), for any sequence x_t ∈ A we have: f(x_t) → f* if and only if d(x_t, A*) → 0. Let (Ω, Σ, P) be a probability space and let {X_t}_{t=0}^∞ be a measurable sequence which represents the successive states of the given optimization method. Under (A1) and (A2) different types of basic global convergence modes are equivalent, see Observation 1 and Theorem 1 in [33]. For instance, it is easy to see that under (A1) and (A2) the following conditions are equivalent: (B1) f(X_t) → f* stochastically; (B2) d(X_t, A*) → 0 stochastically. In this paper we will say that an optimization method X_t is globally convergent (has the global convergence property) if it satisfies conditions (B1), (B2). Condition (B2) represents, in fact, the stochastic convergence of X_t to A*, and thus the global convergence of X_t will be denoted by X_t → A*.

Lazy convergence
The aim of this paper is to explain on a theoretical basis why many global search methods cannot converge quickly. The main attention is paid to methods with the global convergence property. Still, the general results of the next sections cover the case of methods which are not necessarily convergent towards A*. Below we introduce the definition of lazy convergence, which expresses the undesired convergence behaviour of many random search techniques.

Definition 1
We will say that a globally convergent sequence X_t converges lazily towards A*, or shortly that X_t is lazy, if it satisfies lim_{t→∞} P(f(X_{t+1}) < f(X_t)) = 0. This will be denoted by X_t −l.s.→ A*. Proposition 1 presents some rather straightforward consequences of this definition. Theorem 1 provides the proper intuition behind this concept and, we believe, explains the use of the term "lazy" for this convergence type.

Proposition 1 Assume that X_t −l.s.→ A*. Then conditions (C1)–(C3) stated below hold.

Proof Denote C_t := {f(X_{t+1}) ≥ f(X_t)}. We have P(C_t) → 1 and thus, for any k ∈ N, P(C_t ∩ C_{t+1} ∩ ··· ∩ C_{t+k−1}) → 1. To see condition (C1) it remains to note the corresponding inclusion of events. To see (C2) note that for any k ∈ N it follows from condition (C1). To see (C3) note that for any n ∈ N and M ∈ N \ {0}, from the definition of τ^X_n we obtain the corresponding lower bound; hence, from (2.1), lim_{n→∞} Eτ^X_n ≥ M. This finishes the proof, as M can be arbitrarily large.
From (C1) it follows, in particular, that for a monotonic sequence [in the sense f(X_{t+1}) ≤ f(X_t)] we have, for any k ∈ N, that the states X_t, …, X_{t+k} coincide with probability tending to 1, and that the expected length of constant finite subsequences goes to infinity with time t (condition (C3)). If the method X_t is not monotonic, then we can consider the associated current best iterate sequence X̂_t, obtained by taking X̂_t to be the best state, in the sense of f, among X_0, …, X_t. It is an easy observation that if the sequence X_t converges lazily towards A*, then the current best iterate X̂_t is a monotonic sequence which converges lazily towards A*; in fact, {f(X̂_{t+1}) < f(X̂_t)} ⊂ {f(X_{t+1}) < f(X_t)}. The theorem below presents the properties of lazy convergence which provide the proper intuition behind this notion.
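To see the effect numerically, the following sketch runs pure random search (the PRS scheme mentioned in the introduction) on a toy problem and records how often an improving state is generated; the test function, domain and step counts are illustrative assumptions, not taken from the paper.

```python
import random

def pure_random_search(f, sample, steps, seed=0):
    """Monotone pure random search: a candidate is drawn from a fixed
    distribution and accepted only if it improves f.  Returns the final
    state and a 0/1 indicator per step telling whether that step improved."""
    rng = random.Random(seed)
    x = sample(rng)
    improvements = []
    for _ in range(steps):
        q = sample(rng)              # candidate; distribution independent of x
        better = f(q) < f(x)
        improvements.append(1 if better else 0)
        if better:
            x = q
    return x, improvements

# Toy problem: f(x) = x^2 on A = [-1, 1], so A* = {0}.
f = lambda x: x * x
sample = lambda rng: rng.uniform(-1.0, 1.0)
x, imp = pure_random_search(f, sample, steps=4000)

# Improving steps become rare over time: compare the two halves of the run.
first_half, second_half = sum(imp[:2000]), sum(imp[2000:])
print(first_half, second_half)
```

The count of improving steps in the second half of the run is (with overwhelming probability) far smaller than in the first half, which is exactly the behaviour captured by conditions (C1)–(C3).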

Theorem 1 If the sequence X_t converges lazily towards A*, then the associated best iterate sequence X̂_t satisfies, for any k ∈ N, P(X̂_{t+k} = X̂_{t+k−1} = ··· = X̂_t) → 1 as t → ∞, and the expected length of constant finite subsequences X̂_t = X̂_{t+1} = ··· = X̂_{t+k} goes to infinity with time t.
The above rather simple result gives some insight into the properties of stopping conditions for the class of lazy methods. Fix k, n ∈ N with n ≥ k and let h : A → [0, ∞) be a given function. Consider the stopping criterion τ(h,n,k): it represents the outcome of a process which performs at least n iterations and then stops when the improvement achieved during the last k steps does not exceed the value determined by the function h. The stopping condition will take the corresponding form with a threshold parameter ε > 0. Theorem 1 immediately implies the following observation.
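The stopping rule τ(h, n, k) described above can be sketched as follows; the driver loop, the constant threshold h and the toy search step are illustrative assumptions, not the paper's definitions.

```python
import random

def run_with_stopping(step, f, x0, n, k, h, max_iter=10000, seed=0):
    """Iterate a minimiser and stop, after at least n >= k iterations, once the
    improvement of the best value over the last k steps does not exceed h(x),
    where x is the current best point.  Returns (best point, iterations used)."""
    rng = random.Random(seed)
    best_x, best_f = x0, f(x0)
    history = [best_f]                  # best value after each iteration
    t = 0
    while t < max_iter:
        t += 1
        x = step(best_x, rng)
        if f(x) < best_f:
            best_x, best_f = x, f(x)
        history.append(best_f)
        if t >= n and history[-k - 1] - history[-1] <= h(best_x):
            break
    return best_x, t

# Toy use: pure random search on f(x) = x^2 over [-1, 1], constant threshold.
f = lambda x: x * x
step = lambda x, rng: rng.uniform(-1.0, 1.0)
best, used = run_with_stopping(step, f, x0=1.0, n=50, k=20, h=lambda x: 1e-9)
print(best, used)
```

For a lazy method, long runs of unchanged best values appear well before the optimum is reached, so a criterion of this form tends to fire prematurely; this is the content of Observation 1.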
Observation 1 If we have X_t −l.s.→ A*, then for any k ∈ N and h : A → [0, ∞) we have:

3 Monotone homogeneous Markov search
This section presents the general result for comparison-based monotone homogeneous Markov search. This class of methods was the initial motivation for the research presented in this paper. The methodology of this section is extended to the general case of inhomogeneous Markov search techniques in the next section.

Illustrative examples
First we will discuss some illustrative examples. For now, to clarify the presentation, we will assume the most natural case, when A* = {a} is a singleton. We will focus on the class of monotonic homogeneous random search methods, which can be described as follows. Given the current state x_t, the algorithm samples a candidate for the next step q_t from the probability kernel P_Q(x, ·), which depends on the point x = x_t. The new candidate is chosen as the next state x_{t+1} if it is "better" than the current state, so we have f(x_{t+1}) ≤ f(x_t). From the theoretical perspective this scheme admits the general representation (3.1): X_{t+1} = Q(X_t, Y_t) if f(Q(X_t, Y_t)) < f(X_t), and X_{t+1} = X_t otherwise, where Y_t are i.i.d. random variables, independent of the initial state X_0. We will sometimes use the more compact form X_{t+1} = T(X_t, Y_t), where the mapping T is uniquely defined by equation (3.1). We also denote by P_Q(x, ·) the distribution of the candidate Q(x, Y_t), see (3.3). To give some simple examples: if P_Q does not depend on x, then P_Q(x, C) = P_Q(C) is a probability measure on A and a method (3.1) represents the PRS (pure random search) algorithm (if A is bounded, then P_Q is usually defined as the uniform distribution on A). In the case A = R^d, another simple example of P_Q is the normal distribution centered at x with some covariance matrix Σ. As we will show later, the following natural property (⋆) of the probability kernel P_Q is the cause of the insufficient convergence behaviour of methods (3.1). In the present case A* = {a}, the (⋆) condition states that sup_{x∈B(a,ε)} P_Q(x, B(a, ε)) → 0 as ε → 0, of course. To provide some intuition for the commonness of the (⋆) property, we will shortly discuss some examples. Consider for a moment the class of Markov monotone symmetric search methods which was analysed in the papers [34–36]. Methods from this class are natural for spaces which exclude boundary issues connected with defining symmetric densities. Those methods satisfy the general scheme (3.1) and the candidate points are sampled from some density p(x_t, ·) on (B, d_B) = (A, d), where x_t is the current state and p(x_t, y) is a nonincreasing function of the distance between x_t and y. We thus have p(x_t, y) = h(d(x_t, y)) for some nonincreasing function h : (0, ∞) → [0, ∞) which satisfies the normalization condition ∫_A h(d(x_t, y)) dy = 1. Assume for now that the algorithm (3.1) satisfies the above symmetry condition (3.4). In the case A = R^d one can consider, for example, sampling from the normal distribution. One obtains an estimate with some function ϕ : (0, ∞) → [0, 1], and from the continuity property of a probability measure it follows immediately that ϕ(ε) → 0 as ε → 0. This implies that the (⋆) condition is satisfied. Note that we did not put any assumptions on the extremum a ∈ A in the above case, so the analogous condition holds true for any point. Now we go back to the situation (3.1) (no symmetries assumed). Assume for now that A is a closed subset of R^d with the induced Euclidean metric. Property (3.8) is too strong to be satisfied in a bounded domain situation because of the issues of the boundary regions. Still, some modifications of it will hold true. For example, consider the class of methods that generate a candidate point from some distribution on R^d around the current position x_t; if the candidate is created outside of the set of admissible solutions, then it is taken back to the boundary of this set according to some procedure. This mechanism causes an efficient search of the boundary of the domain. To see this, assume for a moment that the algorithm (3.1) satisfies two natural conditions; the only assumption on Y_t is very natural too, as there is no sense in generating a candidate equal to the current state. Note that there is a nonnegative function ϕ with ϕ(ε) → 0 as ε → 0 such that for any x from the interior of A and for any ε > 0 the corresponding estimate holds. Now we can repeat the previous argument to obtain that if the global minimum a belongs to the interior of A, then it satisfies condition (⋆). Methods with more sophisticated rules of taking a candidate back to the admissible domain also satisfy the (⋆) condition under natural assumptions, and proving that would be based more or less on the same idea regarding the algorithm's behaviour: the closer to the optimum, the harder it is to sample an appropriate candidate. Below we start the general theoretical justification of this issue.
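As a concrete instance of scheme (3.1), the sketch below uses a Gaussian proposal kernel P_Q(x, ·) = N(x, σ²) in one dimension; the fixed step size σ, the starting point and the test function are assumptions chosen for illustration.

```python
import random

def monotone_gaussian_search(f, x0, sigma, steps, seed=0):
    """Scheme (3.1) with a Gaussian kernel: the candidate Q(x, Y_t) is
    x + sigma * N(0, 1) and is accepted only if it strictly improves f.
    The kernel does not change over time (the homogeneous case)."""
    rng = random.Random(seed)
    x = x0
    for _ in range(steps):
        q = x + rng.gauss(0.0, sigma)   # candidate drawn from P_Q(x, .)
        if f(q) < f(x):                 # comparison-based acceptance
            x = q
    return x

# f(x) = x^2: the search drifts towards 0, but with fixed sigma the
# probability of hitting the shrinking improvement region B(0, |x|) decays,
# which is the (star) property discussed above.
x = monotone_gaussian_search(lambda v: v * v, x0=5.0, sigma=0.5, steps=2000)
print(x)
```

With a fixed σ, the improvement region near the optimum has vanishing mass under P_Q(x, ·), so the method keeps rejecting candidates; this is exactly the mechanism formalized by condition (⋆).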

Theory
From now on we drop the assumption that A* is a singleton and instead assume that the set A* is compact. Recall that condition (⋆) takes the corresponding general form, and note that under conditions (A1), (A2) it is equivalent to the following condition: sup_{x∈A_δ} P_Q(x, A_δ) → 0 as δ → 0. Let B(A) denote the family of Borel subsets of A and let M_1(A) denote the topological space of Borel probability measures on A with the weak convergence topology, see [7] or [13] for the general theory. As a direct consequence of Proposition 5 from the next section (i.e. of Conclusion 2 stated there), we will have that if P_Q satisfies the two conditions: P_Q(x, A*) = 0 for all x ∈ A, and the mapping A ∋ x ↦ P_Q(x, ·) ∈ M_1(A) is continuous, then P_Q satisfies the (⋆) condition. Note that the assumption P_Q(x, A*) = 0, x ∈ A, is satisfied, for example, if A* has zero Lebesgue measure and the distributions P_Q(x, ·) are absolutely continuous.
Below we present the main result of this section. For any δ > 0, if P(X_t ∈ A_δ) = 0, then we simply put the conditional probability P(f(X_{t+1}) < f(X_t) | X_t ∈ A_δ) equal to zero.

Theorem 2 Assume that X_t is a method of the form (3.1) such that condition (⋆) is satisfied. Then, for any 0 < C < 1, there is δ > 0 such that for any x ∈ A_δ the estimate (3.7) holds. Furthermore, we have lim_{δ→0} sup_{t∈N} P(f(X_{t+1}) < f(X_t) | X_t ∈ A_δ) = 0 and, if X_t is globally convergent, then it converges lazily towards A*.

Proof First, note that from the construction of the algorithm X_t and Y_t are independent, and hence, by Fubini's theorem, the improvement probability at step t can be obtained by integrating the improvement probability of the kernel against the distribution of X_t. Fix C ∈ (0, 1) and let ϕ(ε) := sup_{x∈B(A*,ε)} P_Q(x, B(A*, ε)). From (⋆) it follows that there is ε > 0 with ϕ(ε) < 1 − C, and from (A1) it follows that there is δ > 0 with A_δ ⊂ B(A*, ε). The constants δ > 0 and ε > 0 are chosen in such a way that the required estimate holds for any x ∈ A_δ. The above proves (3.7). Now, fix C ∈ (0, 1) and let δ > 0 be small enough for condition (3.7) to hold. Using Fubini's theorem we obtain that for any t ∈ N with P_{X_t}(A_δ) > 0 the conditional improvement probability satisfies the same bound. The above proves lim_{δ→0} sup_{t∈N} P(f(X_{t+1}) < f(X_t) | X_t ∈ A_δ) = 0. Hence, if X_t is globally convergent, then for any δ > 0 it satisfies P(X_t ∈ A_δ) → 1 and the lazy convergence follows (in fact, lim_{t→∞} P(f(X_{t+1}) < f(X_t)) = 0 can be nicely derived from (3.7) for a globally convergent method).
Note that the undesired convergence properties expressed by the lazy convergence notion are in fact consequences of the algorithm's general scheme and the (⋆) property of the probability kernel P_Q. Indeed, we put practically no assumptions on the properties of the problem function. Thus the methods of the form (3.1) are in some sense condemned to "lazy convergence": the information on the function f which is used by method (3.1) is insufficient to maintain "good" convergence behaviour as the method approaches the extremum. The next section extends Theorem 2 to cover the general inhomogeneous case. More advanced examples will be presented.

General inhomogeneous Markov search

General case
From now on we assume that the sequence X_t is given by the general recursive equation (4.1): X_{t+1} = T_t(X_t, Y_t), where the mappings T_t : A × B → A and the distributions of Y_t : Ω → B can change over time. Naturally, the sequence X_0, Y_1, Y_2, … is assumed to be independent.
We will say that a sequence X_t given by (4.1) satisfies the (♦) condition if sup_{t∈N} sup_{x∈A_δ} P(f(T_t(x, Y_t)) < f(x)) → 0 as δ → 0.

Proposition 2 Assume that X_t is given by (4.1) and that condition (♦) is satisfied. Then lim_{δ→0} sup_{t∈N} P(f(X_{t+1}) < f(X_t) | X_t ∈ A_δ) = 0. In particular, if X_t is globally convergent, then it converges lazily towards A*.
Proof The proof follows from the proof of Theorem 2. In fact, from (♦) we have that for any 0 < C < 1 there is δ_C > 0 such that the inhomogeneous analogue of inequality (3.7) from the proof of Theorem 2 holds, and it has been shown in that proof that (3.7) implies the thesis in the homogeneous case; this part of the proof can be directly repeated in the present inhomogeneous case.

Example: simulated annealing
Consider for a moment a nonmonotone generalization of scheme (3.1). Assume that we have {Y¹_t : Ω → B¹}_{t∈N} (B¹ is assumed to be a separable metric space) and {Y²_t : Ω → [0, 1]}_{t∈N}, and that X_0, Y¹_1, Y²_1, Y¹_2, Y²_2, … are independent. Assume that the method X_t is given by equation (4.2) and satisfies inequality (4.3) for some mapping Q. A good illustration of the above inequality is a method X_t which at every step t samples a candidate Q(X_t, Y¹_t) and then accepts this candidate with some probability p_t(X_t, Y¹_t). Assume that the Y²_t are uniformly distributed on [0, 1] and that we have measurable functions p_t : A × B¹ → [0, 1], t ∈ N. We see that the resulting scheme (4.4) satisfies equation (4.2) and inequality (4.3). This scheme describes the well-known Simulated Annealing method (although the candidate distributions are constant over time, for now).
Proposition 3 Assume that X_t is a method given by (4.2) which satisfies condition (4.3) for some mapping Q, and such that the probability kernel P_Q (given by (3.3)) satisfies the (⋆) condition. Then X_t satisfies condition (♦) and thus, if X_t is globally convergent, then X_t converges lazily towards A*.
Proof Fix δ > 0 and x ∈ A_δ. The improvement probability can be estimated from above by the corresponding probability for the monotone scheme; this is equivalent to (⋆), as f satisfies conditions (A1) and (A2). Proposition 2 finishes the proof.
The conclusion below is a direct consequence of Proposition 3.

Conclusion 1
Assume that X_t is a Simulated Annealing method given by (4.4) and that the probability kernel P_Q satisfies the (⋆) condition; then, if X_t is globally convergent, it converges lazily towards A*. The most common acceptance probability p_t for the Simulated Annealing algorithm given by (4.4) depends on the value of the difference Δ_t = f(Q_t) − f(X_t) and on the time t, and is given by the Metropolis formula p_t = min{1, exp(−Δ_t/β_t)}, where β_t is a sequence (the so-called "cooling schedule") with β_t → 0. This formula means that a good candidate Q_t [in the sense f(Q_t) ≤ f(X_t)] is always accepted, while a candidate with f(Q_t) > f(X_t) still has a positive acceptance probability equal to exp(−Δ_t/β_t). Various acceptance probability formulas (the most frequently analysed aspect is the convergence rate of β_t towards zero) have been analysed in the context of the global convergence property. The various choices of the p_t formula can help the SA method avoid local minima. Many papers focus on methods (4.4) and the global convergence is achieved by various techniques, see [1,12] for applications of Markov chain theory or [16] for a more classical approach. Theorem 5 in [33] gives a condition on the probability kernel P_Q under which the condition β_t → 0 ensures global convergence regardless of the convergence rate of β_t. However, as stated in Conclusion 1, regardless of the acceptance probabilities, the convergence of this method cannot be quick if the probability kernel for sampling a candidate is constant over time. The assumptions of this chapter allow p_t to depend only on the time t, X_t, and Y¹_t [and thus on Q(X_t, Y¹_t), of course]; however, the extension of the presented results to the more general case p_t = p_t(X_0, …, X_t, Y¹_t) is obvious.
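A minimal sketch of scheme (4.4) with the Metropolis acceptance rule follows; the Gaussian candidate kernel (constant over time, as in Conclusion 1), the cooling schedule and the test function are illustrative choices, not prescriptions from the paper.

```python
import math
import random

def simulated_annealing(f, x0, sigma, betas, seed=0):
    """Scheme (4.4): at step t a candidate Q_t = x + sigma * N(0, 1) is drawn
    and accepted with the Metropolis probability min(1, exp(-delta / beta_t)),
    where delta = f(Q_t) - f(x).  A non-worse candidate is always accepted;
    a worse one is still accepted with positive probability exp(-delta/beta_t)."""
    rng = random.Random(seed)
    x = x0
    for beta in betas:
        q = x + rng.gauss(0.0, sigma)
        delta = f(q) - f(x)
        if delta <= 0 or rng.random() < math.exp(-delta / beta):
            x = q
    return x

# Cooling schedule beta_t -> 0; late steps behave almost greedily, so the
# candidate kernel alone controls the late-stage improvement probability.
betas = [1.0 / (1 + t) for t in range(3000)]
x = simulated_annealing(lambda v: v * v, x0=3.0, sigma=0.3, betas=betas)
print(x)
```

Since the candidate kernel here is constant over time, Conclusion 1 applies: no choice of cooling schedule can prevent the lazy behaviour near the optimum.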
Further in this section we will present a general result, Theorem 4, which covers, in particular, the class of SA algorithms in which the probability kernel for sampling a candidate may change over time.

General case
In order to provide the final result for the general case (4.1), we need to extend the (⋆) condition to the case of a family of probability kernels (Markov kernels). We say that a mapping K : A × B(A) → [0, 1] is a probability kernel on A if it satisfies the following conditions: (1) for any Borel set C ∈ B(A), the mapping A ∋ x ↦ K(x, C) is measurable; (2) for any x ∈ A, the mapping B(A) ∋ C ↦ K(x, C) is a probability measure on the Borel sets B(A).
Let K( A) denote the set of probability kernels on A.
Definition 2 We will say that a family C ⊂ K(A) satisfies the (⋆) condition if any of the following equivalent (under (A1) and (A2)) conditions is satisfied: sup_{K∈C} sup_{x∈A_δ} K(x, A_δ) → 0 as δ → 0, or the analogous condition formulated with the neighbourhoods B(A*, ε). If C is some metric space and a mapping P : C → K(A) is given, then we will say that the set C satisfies the (⋆) condition if the family of probability kernels {P_Q}_{Q∈C} satisfies it. The set C can represent, for example, the set of parameters of a given method. It can also be a subset of M(A × B, A) × M_1(B), where M(A × B, A) denotes the space of measurable functions from A × B to A; a pair (Q, ν) ∈ M(A × B, A) × M_1(B) uniquely defines the corresponding probability kernel P_(Q,ν). We assume that the space M(A × B, A) is equipped with the topology of uniform convergence of functions and M_1(B) with the topology of weak convergence of probability measures, and we consider the product space with the product topology. Let P_(Q,ν) be given by (4.7): P_(Q,ν)(x, C) = ν({y ∈ B : Q(x, y) ∈ C}). The following characterization of continuity in this case is a consequence of Proposition 1 stated in [22].
Then, the mapping (Q, ν) ↦ P_(Q,ν) is continuous.
The following proposition provides exemplary sufficient conditions for (⋆) which are verifiable in practical cases. Some examples will be given further on.

Proposition 5 Let C be a compact metric space and P : C → K(A) a given mapping.
Proof Recall that, by the Weierstrass theorem, an upper semi-continuous function attains its supremum on a compact set. We will use the Weierstrass theorem several times in this proof and, in order to simplify the presentation of the argument, we will not always be explicit about it.
Fix n > 0. The continuity of (Q, x) ↦ P_Q(x, ·) and the upper semicontinuity of the relevant functionals imply that the supremum in question is upper semi-continuous, and thus it is attained at some Q̂_n [the existence of Q̂_n follows from the Weierstrass theorem applied to the upper semicontinuous function appearing in (4.8)], which gives us the bound (4.9). As n ∈ N can be arbitrarily large, to finish the proof it remains to show (4.10) and (4.11). To show (4.10), assume for a contradiction that there is a subsequence t_k ∈ N violating it; passing to the limit yields a value exceeding the supremum by some ε > 0, a contradiction with the definition of Q̂_n given by (4.9). As n ∈ N was chosen arbitrarily, it remains to note that condition (4.11) is satisfied. In fact, if for some ε > 0 there is a subsequence violating (4.11), then it is easy to see that in the limit we obtain P_Q̂(â, A*) ≥ ε > 0, which contradicts the basic assumption sup_{a∈A} P_Q(a, A*) = 0, Q ∈ C.

Conclusion 2 Assume that P_Q(x, A*) = 0 for all x ∈ A and that the mapping A ∋ x ↦ P_Q(x, ·) ∈ M_1(A) is continuous. Then P_Q satisfies the (⋆) condition.

Proof It is enough to note that for the set C := {P_Q} and the identity mapping P : C → C the assumptions of Proposition 5 are satisfied.
The following result is a simple consequence of Proposition 2.
Theorem 3 Let C be a metric space and let P : C → K(A) be a given mapping. Assume that for some neighbourhood U of A* a method (4.1) satisfies condition (4.12). If the family {P_Q}_{Q∈C} satisfies condition (⋆), then lim_{δ→0} sup_{t∈N} P(f(X_{t+1}) < f(X_t) | X_t ∈ A_δ) = 0 and, in particular, if X_t is globally convergent, then it converges lazily towards A*.

The above and Proposition 5 immediately yield the following conclusion: let C be a compact metric space, assume that the mapping C × A ∋ (Q, x) ↦ P_Q(x, ·) is continuous and that sup_{a∈A} P_Q(a, A*) = 0 for any Q ∈ C, and let X_t be a method (4.1) such that (4.12) holds true. If X_t is globally convergent, then X_t converges lazily towards A*.
Proof of Theorem 3 Based on Proposition 2, to prove the theorem it is enough to verify the (♦) condition, i.e. to show that the corresponding supremum goes to 0 as δ → 0.
For any δ > 0 and t ∈ N, from (4.12) we can bound the improvement probability from above by the corresponding supremum over the family {P_Q}_{Q∈C}. As condition (⋆) is satisfied, the latter goes to 0 as δ goes to zero.

The main result
Consider the following nonhomogeneous generalization (4.13) of scheme (4.4), in which the measurable mappings Q_t : A × B → A and the distributions of Y¹_t : Ω → B can change over time. The appropriate acceptance probabilities p_t can represent Simulated Annealing or monotone random search. Naturally, the candidate distribution at time t is the kernel P_(Q_t, P_{Y_t}) given by (4.7). Thus, based on Theorem 3, the following result, Theorem 4, is immediate. The extension of Theorem 4 to the case p_t = p_t(X_0, …, X_t, Y¹_t) is straightforward.
Theorem 4 Assume that X_t is given by (4.13) and that the family {P_(Q_t, P_{Y_t})}_{t∈N} satisfies the (⋆) condition. Then the conclusion of Theorem 3 holds; in particular, if X_t is globally convergent, it converges lazily towards A*. The above, based on Propositions 4 and 5, immediately leads to the following result.
Theorem 5 If X_t of the form (4.13) is globally convergent and any of the following conditions is satisfied: (1) for some compact set C and a neighbourhood U of A* there is a continuous function with the stated properties, then X_t converges lazily towards A*. Assume for simplicity the natural case A* = {a}. Roughly speaking, from the above theorem it follows that a globally convergent method (4.13) which does not converge lazily towards A* must have a subsequence of probability kernels P_(Q_{t_k}, P_{Y_{t_k}}) ∈ K(A) which converges (in some sense) to a kernel P_(Q,ν) ∈ K(A) with P_(Q,ν)(a, {a}) > 0. To give a simple example, assume that the candidate Q_t(x, Y_t) is uniformly distributed on a ball centered at the current state x according to the corresponding radius update, with N > log_c(1/ρ). Theorem 3 finishes the proof, as the family {P_{Q_i}}_{i∈{0,1,…,N}} satisfies condition (⋆).

Self-adaptation
As discussed earlier, a basic reason for the slow convergence rate of many techniques is related to the following issue: the probability of sampling a better candidate point q_t goes to zero as the current state x_t approaches the optimum. One approach to overcoming this difficulty is to set the parameters of a nonhomogeneous search in such a way that the optimization method gradually moves from global search to local search. The proper procedure for changing parameter values should ensure that the method does not lose the global convergence property while performing a reasonable local search at the same time. Going back to the example, it is easy to show that the condition σ_t → 0 excludes the (⋆) condition; on the other hand, finding a proper pattern for σ_t (for a given class of problems) is a very difficult open problem.
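The trade-off can be seen in a small experiment: with a deterministic schedule σ_t → 0 the (⋆) condition no longer applies, but whether the schedule actually works depends on how fast it shrinks relative to the (unknown) distance to the optimum. Both schedules and the test function below are assumptions for illustration.

```python
import random

def monotone_search(f, x0, sigma_of_t, steps, seed=0):
    """Monotone Gaussian search whose step size sigma_t = sigma_of_t(t) may
    shrink over time; a candidate is accepted only if it improves f."""
    rng = random.Random(seed)
    x = x0
    for t in range(1, steps + 1):
        q = x + rng.gauss(0.0, sigma_of_t(t))
        if f(q) < f(x):
            x = q
    return x

f = lambda v: v * v
slow = monotone_search(f, x0=2.0, sigma_of_t=lambda t: 1.0 / t ** 0.5, steps=2000)
fast = monotone_search(f, x0=2.0, sigma_of_t=lambda t: 1.0 / t ** 2, steps=2000)
# A schedule that shrinks too fast can freeze the search far from the
# optimum, while a slower one keeps making progress on this problem.
print(slow, fast)
```

The "right" rate depends on the problem function, which is exactly why a fixed schedule is hard to choose in advance and why self-adaptation is attractive.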

Self-adaptation
An important class of methods which partially avoid the difficulties mentioned above are those which use self-adaptive mechanisms for parameter changes. Numerical experiments indicate that methods based on self-adaptation can achieve very fast convergence towards the global minimum. The existing theoretical results, see [4,9], show that in some special cases many such methods converge towards the minimum at a very fast (exponential) rate. Those results are based on very restrictive assumptions on the problem function, but they still indicate that this fast convergence mode can hold in more natural cases. This section is an additional chapter and presents a simple explanation of how "proper" self-adaptation overcomes the problems analysed in the previous sections.
Self-adaptive methods can be, in general, written in the following form: where: (1) the sequence X t : → A n represents the successive states of the algorithm, (2) To simplify further analysis we will consider the case n = 1, k = 1 and A = R d , and we will focus on the class methods satisfying: where Y t is uniformly distributed on [−1, 1] and σ t : → (0, ∞) > 0. Given the current values (X t , σ t ) = (x, σ ) we see that the candidate x t+1 = Q(x, σ, Y t ) for the next state is uniformly distributed on the ball B(x, σ ).We will write The σ t parameter is adjusted according to some general procedure This procedure may naturally use the values of X t , Q(X t , σ t Y t ), f (Q(X t , σ t , Y t )), f (X t ), Y t .This method is an example of evolutionary algorithm (1+1).Recall that evolution strategies (μ + λ) are methods which at every step of evolution (every time-step) transform the population of μ individuals by producing λ descendents (candidates) and next choosing μ the best fitted individuals among the population of (μ + λ) parents and descendents, see [6].
If the self-adaptive mechanism for σ_t keeps a proper balance between d(x_t, a) and σ_t, then the local search capabilities are adjusted to the current position of the algorithm in such a way that the issues analysed in the previous sections do not occur. This is expressed in Theorem 7. Below we present an exemplary condition (5.2) for the above-mentioned proper balance. From now on, we will assume that A* = {0} and that d is the maximum metric d(x, y) = max_i |x_i − y_i|, so we will write K(x, r) for the corresponding balls. For any δ > 0 and 0 < M_1 < M_2 < ∞ we denote by A(δ, M_1, M_2) the corresponding set of pairs (x, σ).

Theorem 7 Let X_t be a method (5.1). We have: (1) for any (x, σ) ∈ A(δ, M_1, M_2), P_Q(x, σ, K(0, ε)) ≥ (M_1/M_2)^d; (2) if condition (5.2) is satisfied and X_t is globally convergent, then the improvement probabilities do not vanish in the limit.

Proof To prove (1), fix M_2 > M_1 > 0, ε > 0 and choose δ > 0 such that for any x from the ball K(0, δ) we have K(x, M_1·|x|) ⊂ K(0, ε). For any (x, σ) from the set A(δ, M_1, M_2) the required bound follows. To see (2), note that for fixed ε > 0 and δ > 0 chosen as above, the claim follows from (1) and (5.2).

We mention that proving condition (5.2), together with some weak form of "asymptotic independence" between the behaviour of the sequences X_t/σ_t and X_t, may be a good basis for proving a geometric convergence rate for some classes of self-adaptive methods and problem functions. While in general proving such a result will be a difficult task, it is already known that in some special cases X_t/σ_t is a Markov chain which converges to some stationary distribution Π supported on A, see [4]. This, of course, implies that condition (5.2) is satisfied for any 0 < M_1 < M_2. While in the general situation the sequence X_t/σ_t is not a Markov chain, the sequence (X_t/σ_t, X_t) is Markov, and we believe that the analysis of (X_t/σ_t, X_t) based on Markov chain theory may be a good direction for the development of theoretical tools for the convergence rate analysis of self-adaptive methods.
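As a concrete self-adaptive instance of a (1+1)-type method, the sketch below implements a (1+1) evolution strategy with the classical 1/5-th success rule; this particular update rule for σ_t is a standard choice and an assumption here, since the paper leaves the adaptation procedure abstract. Its effect is to keep σ_t roughly proportional to the distance from the optimum, which is the kind of balance required by condition (5.2).

```python
import random

def one_plus_one_es(f, x0, sigma0, steps, seed=0):
    """(1+1) evolution strategy with a 1/5-th success rule: sigma is expanded
    after a successful step and contracted after a failure, so that the
    empirical success probability is pushed towards roughly 1/5."""
    rng = random.Random(seed)
    x, sigma = x0, sigma0
    up, down = 1.5, 1.5 ** -0.25          # one success balances four failures
    for _ in range(steps):
        q = x + sigma * rng.gauss(0.0, 1.0)
        if f(q) < f(x):
            x = q
            sigma *= up
        else:
            sigma *= down
    return x, sigma

# On f(x) = x^2 the step size tracks the distance to the optimum, so the
# improvement probability stays bounded away from zero (no lazy behaviour).
x, sigma = one_plus_one_es(lambda v: v * v, x0=10.0, sigma0=1.0, steps=500)
print(x, sigma)
```

Unlike the fixed-kernel schemes of the previous sections, here the ratio σ_t/|x_t| remains stable, which is exactly the mechanism behind the fast (geometric) convergence results cited from [4,9].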
[After small modifications those two examples also work for (A, d) = (I^d, d_T).] Condition (3.4) implies that for any ε > 0 and x ∈ A the corresponding mapping is continuous, and thus assumption (1) holds true for the compact set in question.