Black-box optimization using geodesics in statistical manifolds

Information geometric optimization (IGO) is a general framework for stochastic optimization problems aiming at limiting the influence of arbitrary parametrization choices. The initial problem is transformed into the optimization of a smooth function on a Riemannian manifold, defining a parametrization-invariant first order differential equation. However, in practice, it is necessary to discretize time, and then, parametrization invariance holds only at first order in the step size. We define the Geodesic IGO update (GIGO), which uses the Riemannian manifold structure to obtain an update entirely independent from the parametrization of the manifold. We test it with classical objective functions. Thanks to Noether's theorem from classical mechanics, we find an efficient way to write a first order differential equation satisfied by the geodesics of the statistical manifold of Gaussian distributions, and thus to compute the corresponding GIGO update. We then compare GIGO, pure rank-$\mu$ CMA-ES \cite{CMATuto} and xNES \cite{xNES} (two previous algorithms that can be recovered by the IGO framework), and show that while the GIGO and xNES updates coincide when the mean is fixed, they are different in general, contrary to previous intuition. We then define a new algorithm (Blockwise GIGO) that recovers the xNES update from abstract principles.


Introduction
Consider an objective function f : X → R to be minimized. We suppose we have absolutely no knowledge about f : the only thing we can do is ask for its value at any point x ∈ X (black-box optimization), and that the evaluation of f is a costly operation. We are going to study algorithms that can be described in the IGO framework (see [OAAH11]), A possible way to optimize such a function is the following: We choose (P θ ) θ∈Θ a family of probability distributions 1 on X, and an initial probability distribution P θ 0 . Now, we replace f by F : Θ → R (for example F (θ) = E x∼P θ [f (x)]), and we optimize F with a gradient descent: However, because of the gradient, this equation depends entirely on the parametrization we chose for Θ, which is disturbing: we do not want to have two different updates because we chose different numbers to represent the objects we are working with. That is why invariance is a design principle behind IGO. More precisely, we want invariance with respect to monotone transformations of f , and invariance under reparametrization of θ.
IGO provides a differential equation on θ with the desired properties, but because of the discretization of time needed to obtain an explicit algorithm, we lose invariance under reparametrization of θ: two IGO algorithms for the same problem, but with different parametrizations, coincide only at first order in the step size. A possible solution to this problem is Geodesic IGO (GIGO), introduced here 2 : the initial direction of the update at each step of the algorithm remains the same as in IGO, but instead of moving straight for the chosen parametrization, we use the structure of Riemannian manifold of our family of probability distributions (see [AN07]) by following its geodesics.
Finding the geodesics of a Riemannian manifold is not always easy, but Noether's theorem will allow us to obtain quantities that are preserved along the geodesics. In the case of Gaussian distributions, it is possible to find enough invariants to obtain a first order differential equation satisfied by the geodesics, which makes their computation easier.
Although the geodesic IGO algorithm is not strictly speaking parametrization invariant when no closed form for the geodesics is known, it is possible to compute them at arbitrary precision without increasing the numbers of objective function calls.
The first two sections are preliminaries: in Section 1, we recall the IGO algorithm, introduced in [OAAH11], and in Section 2, after a reminder about Riemannian geometry, we state Noether's theorem, which will be our main tool to compute the GIGO update for Gaussian distributions.
In Section 3, we consider Gaussian distributions with covariance matrix proportional to the identity matrix: this space is isometric to the hyperbolic space, and the geodesics of the latter are known.
In Section 4.1, we consider the general Gaussian case, and we use Noether's theorem to obtain two different sets of equations to compute the GIGO update. The equations were already known, see [Eri87], [CO91] and [ITW11], but the connection with Noether's theorem has not been mentioned. We then give the explicit solution for these equations, from [CO91].
In Section 5, we recall quickly the xNES and CMA updates and we introduce a slight modification of the IGO algorithm to incorporate the direction-dependent learning rates used in CMA-ES and xNES. We then compare these different algorithms, prove that xNES is not GIGO in general and we finally introduce a new family of algorithms extending GIGO and recovering xNES from abstract principles.
Finally, Section 6 presents numerical experiments, which suggest that when using GIGO with Gaussian distributions the step size must be chosen carefully.
To simplify notation, we will use R n instead of a general manifold as much as possible, but the latter is the "right" formal context.
Acknowledgement. I would like to thank Yann Ollivier for his numerous remarks about this article, and Frédéric Barbaresco for finding the reference [CO91].

Definitions: IGO, GIGO
In this section, we recall what the IGO framework is, and we define the geodesic IGO update. Consider again Equation 1: Its main problems are that • The gradient depends on the parametrization of our space of probability distributions (see 1.3 for an example).
• The equation is not invariant under monotone transformations of f . For example, the optimization for 10f moves ten times faster than the optimization for f .
In this section, we recall how IGO deals with this (see [OAAH11] for a better presentation).

Invariance under reparametrization of θ: Fisher metric
In order to achieve invariance under reparametrization of θ, it is possible to turn our family of probability distributions into a Riemannian manifold (it is the main topic of information geometry, see [AN07]), and therefore, there is a canonical gradient (called the natural gradient).
Definition 1. Let P, Q be two probability distributions on X. The Kullback-Leibler divergence of Q from P is defined by: By definition, it does not depend on the parametrization. It is not symmetrical, but if for all x, the application θ → P θ (x) is C 2 , then a second-order expansion yields: where This is enough to endow the family (P θ ) θ∈Θ with a Riemannian manifold structure 3 . The matrix I(θ) is called "Fisher information matrix", the metric it defines is called the "Fisher metric". Given a metric, it is possible to define a gradient attached to this metric: the key property of the gradient is that for any smooth function f where x, y = x T Iy is the dot product in metric I. Therefore, in order to keep the property of Equation 5, we must have ∇f = I −1 ∂f ∂x . We have therefore the following gradient (called "natural gradient", see [AN07]):∇ and since the Kullback-Leibler divergence does not depend on the parametrization, neither does the natural gradient. Later in this paper, we will study families of Gaussian distributions. The following proposition gives the Fisher metric for these families.
Proposition 2. Let (P θ ) θ∈Θ be a family of normal probability distributions: P θ = N (µ(θ), Σ(θ)). If µ and Σ are C 1 , the Fisher metric is given by: Proof. This is a non-trivial calculation. See [PF86] for more details. As we will often be working with Gaussian distributions, we introduce the following notation: Notation 3. G d is the manifold of Gaussian distributions in dimension d, equipped with the Fisher metric.
G d is the manifold of Gaussian distributions in dimension d, with covariance matrix proportional to identity in the canonical basis of R d , equipped with the Fisher metric.

IGO flow, IGO algorithm
In IGO [OAAH11], invariance with respect to monotone transformations is achieved by replacing f by the following transform: we set a non-increasing function w : [0; 1] → R is chosen (the selection scheme), and finally W f θ (x) = w(q(x)). 4 By performing a gradient descent on E x∼P θ [W θ t f (x)], we obtain the "IGO flow": For practical implementation, the integral in (9) has to be approximated. For the integral itself, the Monte-Carlo method is used: N values (x 1 , ..., x N ) are sampled from the distribution P θ t , and the integral becomes and we approximate 1 5 We now have an algorithm that can be used in practice if the Fisher information matrix is known.
Definition 4. The IGO update associated with parametrisation θ, sample size N , step size δt and selection scheme w 6 is given by the following update rule: We call IGO speed the vector

Geodesic IGO
Although the IGO flow associated with a family of probability distributions is intrinsic (it only depends on the family itself, not the parametrization we choose for it), the IGO update is not. However, two different IGO updates are "close" when δt → 0: the difference between two steps of IGO that differ only by the parametrization is a O(δt 2 ). Intuitively, the reason for this difference is that two IGO algorithms start at the same point, and follow "straight lines" with the same initial speed, but the definition of "straight lines" changes with the parametrization.
For instance, in the case of Gaussian distributions, let us consider two different IGO updates with Gaussian distributions in dimension 1, the first one with 4 This definition has to be slightly changed if the probability of a tie is not zero. See [OAAH11] for more details.
5 It can be proved (see [OAAH11]) that lim N →∞ Nŵ i = W θ t f (x i ). Here again, we are assuming there are no ties.
6 One could start directly with theŵ i rather than w, as we will do later.
6 parametrization (µ, σ), and the second one with parametrization (µ, c := σ 2 ). We suppose the IGO speed for the first algorithm is (μ,σ). The corresponding IGO speed in the second parametrization is given by the identityċ = 2σσ. Therefore, the first algorithm gives the standard deviation σ new,1 = σ old + δtσ, and the variance c new,1 = (σ new,1 ) 2 = c old + 2δtσ oldσ + δt 2σ2 = c new,2 + δt 2σ2 . The geodesics of a Riemannian manifold are the generalization of the notion of straight line: they are curves that locally minimize length 7 . The notion will be explained precisely in Section 2, but let us define the geodesic IGO algorithm, which follows the geodesics of the manifold instead of following the straight lines for an arbitrary parametrization.
Definition 5 (GIGO). The geodesic IGO update (GIGO) associated with sample size N , step size δt and selection scheme w is given by the following update rule: is the IGO speed and exp θ t is the exponential of the Riemannian manifold Θ. Namely, exp θ t (Y δt) is the endpoint of the geodesic of Θ starting at θ t , with initial speed Y , after a time δt. By definition, this update does not depend on the parametrization θ.
Notice that while the GIGO update is compatible with the IGO flow (in the sense that when δt → 0 and N → ∞, a parameter θ t updated according to the GIGO algorithm is a solution of Equation 9, the equation defining the IGO flow), it not necessarily an IGO update. More precisely, the GIGO update is an IGO update if and only the geodesics of Θ are straight lines for some parametrization 8 .
The main problem with this update is that in general, obtaining equations for the geodesics is a difficult problem. In the next section, we will state Noether's Theorem, which will be our main tool to compute the GIGO update for Gaussian distributions.

Riemannian geometry
The goal of this section is to state Noether's theorem. We will not prove anything here, see [AVW89] for the proofs, [Bou07] or [JLJ98] for a more detailed presentation. Noether's theorem states that if a system has symmetries, then, there are invariants attached to this symmetries. Firstly, we need some definitions.
Definition 6 (Motion in a Lagrangian system). Let M be a differentiable manifold, T M the set of tangent vectors on M 9 , and L : T M → R (q, v) → L(q, v) a differentiable function (called the Lagrangian function 10 ). A "motion in the lagrangian system (M, L) from x to y" is map γ : [t 0 , t 1 ] → M such that: • γ is a local extremum of the functional: among all curves c : [t 0 , t 1 ] → M such that c(t 0 ) = x, and c(t 1 ) = y.
For example, when (M, g) is a Riemannian manifold, the length of a curve γ between γ(t 0 ) and γ(t 1 ) is: The curves that follow the shortest path between two points x, y ∈ M are therefore the minima γ of the functional (15) such that γ(t 0 ) = x and γ(t 1 ) = y, and the corresponding Lagrangian function is (q, v) → g(v, v). The solution to the problem of finding a parametrized curve with a shortest length is not unique: any curve following the shortest trajectory will have minimum length. For example, if γ 1 : [a, b] → M is a curve of shortest path, so is γ 2 : t → γ 1 (t 2 ): these two curves define the same trajectory in M , but they do not travel along this trajectory at the same speed. This leads us to the following definition: Definition 7 (Geodesics). Let I be an interval of R, (M, g) be a Riemannian manifold. A curve γ : I → M is called a geodesic if for all t 0 , t 1 ∈ I, γ| [t0,t1] is a motion in the Lagrangian system (M, L) from γ(t 0 ) to γ(t 1 ), where It can be shown (see [Bou07]) that geodesics are curves that locally minimize length, with constant velocity 11 , which solves the previous uniqueness problem. More precisely, given a starting point and a starting speed, the geodesic is unique. This motivates the definition of the exponential of a Riemannian manifold.
Definition 8. Let (M, g) be a Riemannian manifold. We call exponential of M the application: such that for any x ∈ M , if γ is the geodesic of M satisfying γ(0) = x and γ (0) = v, then exp x (v) = γ(1). 9 A tangent vector is identified by the point at which it is tangent, and a vector in the tangent space.
In order to find an extremal of a functional, the most commonly used result is called the "Euler-Lagrange equations" (see [AVW89] for example). Using them, it is possible to show that the geodesics of a Riemannian manifold follow the "geodesic equations":ẍ where the are called "Christoffel symbols" of the metric g. However, these coefficients are tedious (and sometimes difficult) to compute, and (17) is a second order differential equation. Noether's theorem will give us a first order equation to compute the geodesics.

Noether's Theorem
where dh is the differential of h.
If M is obvious, we will sometimes say that L is invariant under h.
An example will be given in the proof of Theorem 18. We can now state Noether's theorem (see for example [AVW89]).
Theorem 10 (Noether's Theorem). If the Lagrangian system (M, L) admits the one-parameter group of symmetries h s : M → M , s ∈ R, then the following quantity remains constant during the motions in the system (M, L). Namely, does not depend on t if γ is a motion in (M, L). Now, we are going to apply this theorem to our problem: computing the geodesics of Riemannian manifolds of Gaussian distributions.

GIGO inG d
We do not know of a closed form for the geodesics of G d . However, the situation is better if we force the covariance matrix to be either diagonal or proportional to the identity matrix. In the former case, the manifold we are considering is (G 1 ) d , and in the latter case, it isG d . The geodesics of (G 1 ) d are given by Proposition 11. Let M be a Riemannian manifold, let d ∈ N, let Φ be the Riemannian exponential of M d , and let φ be the Riemannian exponential of M . We have: In particular, knowing the geodesics of G 1 is enough to compute the geodesics of (G 1 ) d . This is true because a block of the product metric does not depend on variables of the other blocks.
Consequently, a GIGO update with a diagonal covariance matrix with the sample (x i ) is equivalent to d separate 1-dimensional GIGO updates using the same samples. Moreover, G 1 ∼ =G1, the geodesics of which are given below.
We will show thatG d and the "hyperbolic space", of which the geodesics are known, are isometric.
We also recall the expression of its geodesics (see for example [GHL04]): Proposition 13 (Geodesics of the Poincaré half-plane). The geodesics of the Poincaré half-plane are exactly the: with ad − bc = 1, and v > 0.
The geodesics are half-circles perpendicular to the line y = 0, and vertical lines. The generalization to higher dimension is the following: Definition 14 (Hyperbolic space). We call "hyperbolic space of dimension n" the Riemannian manifold with the metric ds 2 = dx 2 1 +...+dx 2 n−1 +dy 2 y 2 . Its geodesics stay in a plane containing the direction y and the initial speed. 12 The induced metric on this plane is the metric of the Poincaré half-plane. The geodesics are therefore given by the following proposition: Proposition 15 (Geodesics of the hyperbolic space). If γ : t → (x 1 (t), ..., x n−1 (t), y(t)) = (x(t), y(t)) is a geodesic of H n , then, there exists a, b, c, d ∈ R such that ad−bc = 1 and v > 0 such that

Computing the GIGO update inG d
If we want to implement the GIGO algorithm inG d , we need to compute the natural gradient inG d , and to be able to compute the Riemannian exponential ofG d .
Using Proposition 2, we can compute the metric ofG d in the parametrization (µ, σ) → N (µ, σ 2 I). We find: Since this matrix is diagonal, it is easy to invert, and we immediately have the natural gradient, and, consequently, the IGO speed.
Proposition 16. InG d , the IGO speed Y is given by: Proof. We recall the IGO speed is defined by The result follows.

The metric defined by Equation 24
is not exactly the metric of the hyperbolic space, but with the substitution µ ← µ Theorem 17 (Geodesics ofG d ). If γ : t → N (µ(t), σ(t) 2 I) is a geodesic ofG d , then, there exists a, b, c, d ∈ R such that ad − bc = 1 and v > 0 such that Now, in order to implement the corresponding GIGO algorithm, we only need to be able to find the coefficients a, b, c, d, v corresponding to an initial position (µ 0 , σ 0 ), and an initial speed (μ 0 ,σ 0 ). It is a tedious but easy computation, the result of which is given in Proposition 34.
The pseudocode of GIGO inG d is also given in the Appendix: it is obtained by concatenating Algorithms 1 and 7. 13 4 GIGO in G d

Obtaining a first order differential equation for the geodesics of G d
In the case where both the covariance matrix and the mean can vary freely, the equations of the geodesics have been computed in [CO91] and [Eri87]. However, these articles start with the equations of the geodesics obtained with the Christoffel symbols, then partially integrate them, obtaining equations (38) and (39) of Theorem 19. These equations are in fact a consequence of Noether's theorem, and can be found directly.
Theorem 18. Let γ : t → N (µ t , Σ t ) be a geodesic of G d . Then, the following quantities do not depend on t: Proof. It is a direct application of Noether's theorem, with suitable groups of diffeomorphisms. By Proposition 2, the Lagrangian associated with the geodesics of G d is Its derivative is Let us show that this Lagrangian is invariant under affine changes of basis (thus illustrating Definition 9).
We have and sinceȦµ = Aμ andȦΣA T = AΣA T , we find easily that or in other words: In order to use Noether's theorem, we also need one-parameter groups of transformations. We choose the following:

Translations of the mean vector. For any
remains constant for all i. The fact that J µ is an invariant immediately follows.

Linear base changes. For any
, where E ij is the matrix with a 1 at position (i, j), and zeros elsewhere.
So by Noether's theorem, we then obtain the following invariants: and the coefficients of J Σ in (29) are the (J ij /2).
This leads us to first order equations satisfied by the geodesics mentionned in [Eri87], [CO91] and [ITW11]. where Proof. It is an immediate consequence of Proposition 18.
These equations can be solved analytically (see [CO91]), but usually, that is not the case, and they have to be solved numerically, for example with the Euler method (the corresponding algorithm, which we call GIGO-Σ, is described in the Appendix). The goal of the remainder of the subsection is to show that having to use the Euler method is fine.
To avoid confusion, we will call the step size of the GIGO algorithm (δt in Proposition 5) "GIGO step size", and the step size of the Euler method (inside a step of the GIGO algorithm) "Euler step size".
Having to solve our equations numerically brings two problems: The first one is a theoretical problem: the main reason to study GIGO is its invariance under reparametrization of θ, and we lose this invariance property when we use the Euler method. However, GIGO can get arbitrarily close to invariance by decreasing the Euler step size. In other words, the difference between two different IGO alorithms is O(δt 2 ), and the difference between two different implementations of the GIGO algorithm is O(h 2 ), where h is the Euler step size, and it is easier to reduce the latter. Still, without a closed form for the geodesics of G d , the GIGO update is rather expensive to compute, but it can be argued that most of the computation time will still be the computation of the objective function f .
The second problem is purely numerical: we cannot guarantee that the covariance matrix remains positive definite along the Euler method. Here, apart from finding a closed form for the geodesics, we have two solutions.
We can enforce this a posteriori : if the covariance matrix we find is not positive definite after a GIGO step, we repeat the failed GIGO step with a reduced Euler step size (in our implementation, we divided it by 4, see Algorithm 2 in the Appendix.).
The other solution is to obtain differential equations on a square root of the covariance matrix 14 .
Theorem 20 (GIGO-A). If µ : t → µ t and A : t → A t satisfy the equationṡ where Proof. This is a simple rewriting of Theorem 19: if we write Σ := AA T , we find that J µ and J Σ are the same as in Theorem 19, and we havė By Theorem 19, Σ(J Σ − J µ µ T ) is symmetric (sinceΣ has to be symmetric). Therefore, we haveΣ = Σ(J Σ − J µ µ T ), and the result follows.
Notice that Theorem 20 gives an equivalence, whereas Theorem 19 does not. The reason is that the square root of a symmetric positive definite matrix is not unique. Still, it is canonical, see discussion in Section 5.1.1.
As for Theorem 19, we can solve Equations 40 and 41 numerically, and we obtain another algorithm (Algorithm 4 in the Appendix, we will call it GIGO-A), with a behavior similar to the previous one (with equations 38 and 39). For both of them, numerical problems can arise when the covariance matrix is almost singular.
We have not managed to find any example where one of these two algorithms converged to the minimum of the objective function whereas the other did not, and their behavior is almost the same.
More interestingly, the performances of these two algorithms are also the same as the performances of the exact GIGO algorithm, using the equations of Section 4.2.
Notice that even though GIGO-A directly maintains a square root of the covariance matrix, which makes sampling new points easier 15 , both GIGO-Σ and GIGO-A still have to invert the covariance matrix (or its square root) at each step, which is as costly as the decomposition, so one of these algorithms is roughly as expensive to compute as the other.

Explicit form of the geodesics of G d
We now give the exact geodesics of G d : the following results are a rewriting of Theorem 3.1 and its first corollary in [CO91].
where exp is the Riemannian exponential of G d , G is any matrix satisfying and G − is a pseudo-inverse of G. 16 15 To sample a point from N (µ, Σ), a square root of Σ is needed. 16 In [CO91], the existence of G (as a square root ofΣ 2 0 + 2μ 0μ T 0 ) is proved. Notice that anyway, in the expansions of (42) and (44), only even powers of G appear.
And since for all A ∈ GL d (R), for all µ 0 ∈ R d , the application preserves the geodesics, we find the general expression for the geodesics of G d . with where exp is the Riemannian exponential of G d , G is any matrix satisfying and G − is a pseudo-inverse of G.
It should be noted that the final values for mean and covariance do not depend on the choice of G. 17 The reason for this is that ch(G) is a Taylor series in G 2 , and so are sh(G)G − and G − sh(G).
For our practical implementation, we actually used these Taylor series instead of the expression of the corollary.

Definitions
In this section, we recall the xNES and pure rank-µ CMA-ES, and we describe them in the IGO framework, thus allowing a reasonable comparison with the GIGO algorithms.

xNES
We recall a restriction 18 of the xNES algorithm, introduced in [GSY + 10].
17 As a square root of Definition 23 (xNES algorithm). The xNES algorithm with sample size N , weights w i , and learning rates η µ and η Σ updates the parameters µ ∈ R d , A ∈ M d (R) with the following rule: At each step, N points x 1 , ..., x N are sampled from the distribution N (µ, AA T ). Without loss of generality, we assume f (x 1 ) < ... < f (x N ). The parameter is updated according to: The more general version decomposes the matrix A as σB, where det B = 1, and uses two different learning rates for σ and for B. We gave the version where these two learning rates are equal 19 . This restriction of the xNES algorithm can be described in the IGO framework, provided all the learning rates are equal (most of the elements of the proof can be found in [GSY + 10] 20 , or in [OAAH11]): Proposition 24 (xNES as IGO). The xNES algorithm with sample size N , weights w i , and learning rates η µ = η Σ = δt coincides with the IGO algorithm with sample size N , weights w i , step size δt, and in which, given the current position (µ t , A t ), the set of Gaussians is parametrized by with δ ∈ R m and M ∈ Sym(R m ). The parameters maintained by the algorithm are (µ, A), and the x i are sampled from N (µ, AA T ).
Proof. Let us compute the IGO update in the parametrization φ µt,At : we have δ t = 0, M t = 0, and by using Proposition 2, we can see that for this parametrization, the Fisher information matrix at (0, 0) is the identity matrix. The IGO update is therefore, Therefore, the IGO update is: or, in terms of mean and covariance matrix: This is the xNES update.
Using a square root of the covariance matrix Firstly, we recall that the IGO framework (on G d , for example) emphasizes the Riemannian manifold structure on G d . All the algorithms studied here (including GIGO, which is not strictly speaking an IGO algorithm) define a trajectory in G d (a new point for each step), and to go from a point θ to the next one (θ ), we follow some curve γ : To be compatible with this point of view, an algorithm giving an update rule for a square root of the covariance matrix A has to satisfy the following condition: For a given initial speed, the covariance matrix Σ t+δt after one step must depend only on Σ t , and not on the square root A t chosen for Σ t .
The xNES algorithm does satisfy this condition: consider two xNES algorithms, with the same learning rates, respectively at (µ, A t 1 ) and (µ, A t 2 ), with 21 Using tr(M ) = log(det(exp(M ))) A t 1 (A t 1 ) T = A t 2 (A t 2 ) T (i.e.: they define the same Σ t ), using the same samples x i to compute the natural gradient update 22 , then we will have Σ t+δt Using the definitions of Section 5.4, we have just shown that what we will call the "xNES trajectory" is well-defined.
It is also important to notice that, in order to be well defined, a natural gradient algorithm updating a square root of the covariance matrix has to specify more conditions than simply following the natural gradient.
The reason for this is that the natural gradient is a vector tangent to G d : it lives in a space of dimension d(d+3)/2 (the dimension of G d ) whereas the vector (µ, A) lives in a space of dimension d(d + 1), which is too large: we will have no uniqueness for the solutions of the equations expressed with A (this is why Theorem 20 is simply an implication, whereas Theorem 19 is an equivalence).
More precisely, let us consider A in GL d (R), and v A , v A two infinitesimal updates of A. Since Σ = AA T , the infinitesimal update of Σ corresponding to For any A ∈ M d (R), let us denote by T A the space of the matrices M such that A −1 M is antisymmetric, or in other words T A := {u ∈ M d (R), Au T + uA T = 0}. Having a subspace S A in direct sum with T A for all A is sufficient (but not necessary) to have a well-defined update rule. Namely, consider the (linear) application sending an infinitesimal update of A to the corresponding update of Σ. It is not bijective, but as we have seen before, Ker φ A = T A , and therefore, if we have, then φ A | U A is an isomorphism. Let v Σ be an infinitesimal update of Σ. We choose the following update of A corresponding to v Σ : Any and consequently, it can be defined without referring to the parametrization, which makes it a canonical choice.
Let us now show that this is the choice made by xNES and GIGO-A (which are well-defined algorithms updating a square root of the covariance matrix).

Pure rank-µ CMA-ES
We now recall the pure rank-µ CMA-ES algorithm. The general CMA-ES algorithm is described in [Han11].
Definition 26 (pure rank-µ CMA-ES algorithm). The pure rank-µ CMA-ES algorithm with sample size N , weights w i , and learning rates η µ and η Σ is defined by the following update rule: At each step, N points x 1 , ..., x N are sampled from the distribution N (µ, Σ). Without loss of generality, we assume f (x 1 ) < ... < f (x N ). The parameter is updated according to: The pure rank-µ CMA-ES can also be described in the IGO framework, see for example [ANOK10].
Proposition 27 (pure rank-µ CMA-ES as IGO). The pure rank-µ CMA-ES algorithm with sample size N , weights w i , and learning rates η µ = η Σ = δt coincides with the IGO algorithm with sample size N , weights w i , step size δt, and the parametrization (µ, Σ).

Twisting the metric
As we can see, the IGO framework does not allow to recover the learning rates for xNES and pure rank-µ CMA-ES, which is a problem, since usually, the covariance learning rate is set much smaller than the mean learning rate (see either [GSY + 10] or [Han11]).
A way to recover these learning rates is to incorporate them directly into the metric (see also Blockwise GIGO, in Section 5.5). More precisely: Definition 28 (Twisted Fisher metric). Let η µ , η Σ ∈ R, and let (P θ ) θ∈Θ be a family of normal probability distributions: P θ = N (µ(θ), Σ(θ)), with µ and Σ C 1 . We call "(η µ , η Σ )-twisted Fisher metric" the metric defined by: All the remainder of this section is simply a rewriting of the work in Section 1 with the twisted Fisher metric instead of the regular Fisher metric.
This approach seems to be somewhat arbitrary. Arguably, the mean and the covariance play a "different role" in the definition of a Gaussian (only the covariance can affect diversity, for example), but we lack a resasonable intrinsic characterization that would make this choice of twisting more natural. Moreover, this construction can be slightly generalized (see Appendix).
And the IGO flow and the IGO algorithms can be modified to take into account the twisting of the metric: the (η µ , η Σ )-twisted IGO flow reads This leads us to the twisted IGO algorithms.
Definition 30. The (η µ , η Σ )-twisted IGO algorithm associated with parametrisation θ, sample size N , step size δt and selection scheme w is given by the following update rule: Definition 31. The (η µ , η Σ )-twisted geodesic IGO algorithm associated with sample size N , step size δt and selection scheme w is given by the following update rule: By definition, the twisted geodesic IGO algorithm does not depend on the parametrization 25 .

Algorithms after twisting
We now give all the algorithms mentioned earlier, but with the twisted Fisher metric. Except for Proposition 34, which is a simple calculation, the proofs of the following statements are an easy rewriting of their non-twisted counterparts: one can return to the non-twisted metric (up to a η Σ factor) by changing µ to √ ησ √ ηµ µ.
Theorem 35. Let γ : t → N (µ t , Σ t ) be a geodesic of G d (η µ , η Σ ). Then, the following quantities are invariant: Theorem 36. If µ : t → µ t and Σ : t → Σ t satisfy the equationṡ Theorem 37. If µ : t → µ t and A : t → A t satisfy the equationṡ where . Notice that the equations found by twisting the metric are exactly the equations without twisting, except that we have "forced" the learning rates η µ , η Σ to appear by multiplying the increments of µ and Σ by η µ and η Σ .
Proposition 38 (xNES as IGO). The xNES algorithm with sample size N , weights w i , and learning rates η µ , η σ = η B = η Σ coincides with the ηµ δt , ηΣ δttwisted IGO algorithm with sample size N , weights w i , step size δt, and in which, given the current position (µ t , A t ), the set of Gaussians is parametrized by with δ ∈ R m and M ∈ Sym(R m ). The parameters maintained by the algorithm are (µ, A), and the x i are sampled from N (µ, AA T ).

Trajectories of different IGO steps
As we have seen, two different IGO algorithms (or an IGO algorithm and the GIGO algorithm) coincide at first order in δt when δt → 0. In this section, we study the differences between pure rank-µ CMA-ES, xNES, and GIGO by looking at the second order in δt, and in particular, we show that xNES and GIGO do not coincide in the general case. We view the updates done by one step of the algorithms as paths on the manifold G d , from (µ(t), Σ(t)) to (µ(t + δt), Σ(t + δt)), where δt is the time step of our algorithms, seen as IGO algorithms. More formally: Definition 41.
1. We call GIGO update trajectory the application (exp is the exponential of the Riemannian manifold G d (η µ , η Σ )) 2. We call xNES update trajectory the application with AA T = Σ. The application above does not depend on the choice of a square root A.
3. We call CMA update trajectory the application These appliations map the set of tangent vectors to G d (T G d ) to the curves in G d (η µ , η Σ ).
We will also use the following notation: For instance, T GIGO (µ, Σ, v µ , v Σ )(δt) gives the position (mean and covariance matrix) of the GIGO algorithm after a step of size δt, while µ GIGO and Σ GIGO give respectively the mean component and the covariance component of this position.
This formulation ensures that the trajectories we are comparing had the same initial position and the same initial speed, which is the case provided the sampled values 26 are the same.
Different IGO algorithms coincide at first order in δt. The following proposition gives the second order expansion of the trajectories of the algorithms.
Proposition 42 (second derivatives of the trajectories). We have: Proof. We can immediately see that the second derivatives of µ xNES , µ CMA , and Σ CMA are 0. Next, we have The expression of Σ xNES (µ, Σ, v µ , v Σ ) (0) follows. Now, for GIGO, let us consider the geodesic starting at (µ 0 , Σ 0 ) with initial speed (η µ v µ , η Σ v Σ ). By writing J µ (0) = J µ (t), we findμ(t) = Σ(t)Σ −1 0μ 0 . We then easily haveμ(0) =Σ 0 Σ −1 0μ 0 . In other words Finally, by using Theorem 19, and differentiating, we find In order to interpret these results, we will look at what happens in dimension 1: 27 • In [GSY + 10], it has been noted that xNES converges to quadratic minimums more slowly than CMA, and that it is less subject to premature convergence. That fact can be explained by observing that the mean update is exactly the same for CMA and xNES whereas xNES tends to have a higher variance (Proposition 42 shows this at order 2, and it is easy to see that in dimension 1, for any µ, • At order 2, GIGO moves the mean faster than xNES and CMA if the standard deviation is increasing, and more slowly if it is decreasing. This seems to be a reasonable behavior (if the covariance is decreasing, then the algorithm is presumably close to a minimum, and it should not leave the area too quickly). This remark holds only for isolated steps, because we do not take into account the evolution of the variance.
• The geodesics of G 1 are half-circles (see Figure 2 below -we recall G 1 is the Poincaré half-plane). Consequently, if the mean is supposed to move (which always happens), then σ → 0 when δt → ∞. For example, a step whose initial speed has no component on the standard deviation will always decrease it. See also Proposition 6.2, about the optimization of a linear function.
• For the same reason, for a given initial speed, the update of µ always stays bounded as a function of δt: it is not possible to make one step of the GIGO algorithm go further than a fixed point by increasing δt. Still, the geodesic followed by GIGO changes at each step, so the mean of the overall algorithm is not bounded. We now show that xNES follows the geodesics of G d if the mean is fixed but that xNES and GIGO do not coincide otherwise.
Proposition 43 (xNES is not GIGO in the general case).
Then the GIGO and xNES updates starting at N (µ, Σ) with initial speeds v µ and v Σ follow the same trajectory if and only if the mean remains constant. In other words: Proof. If v µ = 0, then we can compute the GIGO update by using Theorem 19: since J µ = 0,μ = 0, and µ remains constant. Now, we have J Σ = Σ −1Σ : this is enough information to compute the update. Since this quantity is also preserved by the xNES algorithm (see for example the proof of Proposition 47), the two updates coincide.

Blockwise GIGO
Although xNES is not GIGO, it is possible to define a family of algorithms extending GIGO and including xNES, by decomposing our family of probability distributions as a product, and by following the restricted geodesics simultaneously.
• We denote by Θ θ,i the Riemannian manifold with the metric induced from Θ. There is a canonical isomorphism of vector spaces T θ Θ = ⊕ n i=1 T Θ θ,i . Moreover, if the splitting is compatible, it is an isomorphism of Euclidean spaces.
• We denote by Φ θ,i the exponential at θ of the manifold Θ θ,i .
Definition 46 (Blockwise GIGO update). Let Θ 1 , ..., Θ n be a compatible splitting. 29 The blockwise GIGO algorithm in Θ with splitting Θ 1 , ..., Θ n associated with sample size N , step sizes δt 1 , ..., δt n and selection scheme w is given by the following update rule: with Y i the T Θ θ,i -component of Y . This update only depends on the splitting (and not on the parametrization inside each Θ i ).
Since blockwise GIGO only depends on the splitting (since we make no other choice 30 ), it can be thought as almost parametrization-invariant.
Notice that blockwise GIGO updates and twisted GIGO updates are two different things: firstly, blockwise GIGO can be defined on any manifold with a compatible splitting, whereas twisted GIGO (and twisted IGO) are only defined for Gaussians 31 . But even in G d (η µ , η Σ ), with the splitting (µ, Σ), these two algorithms are different: for instance, if η µ = η Σ , and δt = 1 then the twisted GIGO is the regular GIGO algorithm, whereas blockwise GIGO is not (actually, we will prove that it is the xNES algorithm). The only thing blockwise GIGO and twisted GIGO have in common is that they are compatible 32 with the (η µ , η Σ )-twisted IGO flow (56).
We now have a new description of the xNES algorithm: 28 If the Riemannian manifold is not ambiguous, we will simply write a compatible splitting. 29 If the splitting is not compatible, it is possible to define exactly the same algorithm, but it does not seem relevant.
30 Except the tunable parameters: sample size, step sizes, and selection scheme. 31 Maybe a more general form of twisted IGO could be defined. 32 As defined at the end of Section 1.3: A parameter θ t following these updates with δt → 0 and N → ∞ is a solution of Equation 56.
Proof. Firstly, notice that the splitting (µ, Σ) is compatible, by Proposition 2. Now, let us compute the Blockwise GIGO update: we have The induced metric on Θ θ t ,1 is the Euclidian metric, so we have Since we have already shown (using the notation in Definition 23) that Y µ = AG µ (in the proof of Proposition 24), we find On Θ θ t ,2 , we have the following Lagrangian for the geodesics: By applying Noether's theorem, we find that is invariant along the geodesics of Θ θ t ,2 , so they are defined by the equatioṅ Σ = ΣJ Σ = ΣΣ −1 0Σ 0 (and therefore, any update preserving the invariant J Σ will satisfy this first-order differential equation and follow the geodesics of Θ θ t ,2 ). The xNES update for the covariance matrix is given by A(t) = A 0 exp(tG M /2). Therefore Although blockwise GIGO is somewhat "less natural" than GIGO, it can be easier to compute for some splittings (as we have just seen), and in the case of the Gaussian distributions, the mean-covariance splitting seems natural.
Another advantage of blockwise GIGO is that it allows us to recover the learning rates without having to twist the metric, which felt like an ad-hoc solution. In particular, blockwise GIGO can be defined for families of probability distributions that are not Gaussian.

Numerical Experiments
We conclude this article with some numerical experiments to compare the behavior of GIGO, xNES and CMA-ES (we give the pseudocodes for these articles in the Appendix). We made two series of tests. The first one is a performance test, using classical benchmark functions, and the settings from [GSY + 10]. The goal of the second series of tests is to illustrate the computations in Section 5.4 by plotting the trajectories (standard deviation versus mean) of these three algorithms in dimension 1.

Benchmarking
For the first series of experiments, presented in Figure 3, we used the following parameters, taken from [GSY + 10]: • Varying dimension.
• Sample size: 4 + 3 log(d) • IGO step size and learning rates: • Initial position: θ 0 = N (x 0 , I), where x 0 is a random point of the circle with center 0, and radius 10.
• Euler method for GIGO: Number of steps: 100. We used the GIGO-A variant of the algorithm. 34 • We chose not to use the exact expression of the geodesics for this benchmarking to show that having to use the Euler method is fine. However, we did run the tests, and the results are basically the same as GIGO-A.
We plot the median number of runs to achieve target fitness (10 −8 ). Each algorithm has been tested in dimension 2, 4, 8, 16, 32 and 64: a missing point means that all runs converged prematurely.

Failed runs
In Figure 3, a point is plotted even if only one run was successful. Below is the list of the settings for which some runs converged prematurely.
• Only one run reached the optimum for the cigar-tablet function with CMA in dimension 8 • Seven runs (out of 24) reached the optimum for the Rosenbrock function with CMA in dimension 16 • About half the runs reached the optimum for the sphere function with CMA in dimension 4.
For the following settings, all runs converged prematurely.
• GIGO did not find the optimum of the Rosenbrock function in any dimension.
• CMA did not find the optimum of the Rosenbrock function in dimension 2, 4, 32 and 64.
• All the runs converged prematurely for the Cigar-tablet function in dimension 2 with CMA, for the Sphere function in dimension 2 for all algorithms, and for the Rosenbrock function in dimension 2 and 4 for all algorithms.
As the last item of the list above shows, all the algorithms converge prematurely in low dimension, probably because the covariance learning rate has been set too high (or because the sample size is too small). This is different from the results in [GSY + 10].
This remark aside, as noted in [GSY + 10], the xNES algorithm shows more robustness than CMA and GIGO: it is the only algorithm able to find the minimum of the Rosenbrock function in high dimensions. However, its convergence is consistently slower.
In terms of performance, when both of them work, CMA and GIGO are extremely close (GIGO is usually a bit better) 35 . An advantage of GIGO is that it is theoretically defined for any δt, η Σ , whereas the covariance matrix maintained by CMA (not only pure rank-µ CMA) can stop being positive definite if η Σ δt > 1. However, in that case, the GIGO algorithm is prone to premature convergence (remember Figure 2, and see Proposition 6.2 below), and learning rates that high are rarely used in practice.  , with x 0 uniformly distributed on the circle of center 0 and radius 10. We recall that the "CMA-ES" algorithm here is using the so-called pure rank-µ CMA update.

Plotting trajectories in G 1
We want the second series of experiments to illustrate the remarks about the trajectories of the algorithms in Section 5.4, so we decided to take a large sample size to limit randomness, and we chose a fixed starting point for the same reason. We use the weights below because of the property of quantile improvement proved in [AO13]: the 1/4-quantile will improve at each step. The parameters we used were the following: • Sample size: λ = 5000 • Dimension 1 only.
• Weights: w = 4.1 q 1/4 (w i = 4.1 i 1250 ) • IGO step size and learning rates: • Initial position: θ 0 = N (10, 1) • Dots are placed at t = 0, 1, 2 . . . (except for the graph δt = 1.5, for which there is a dot for each step).  Figures 7, 8 and 11 show that when δt 1, GIGO reduces the covariance even at the first step. More generally, when using the GIGO algorithm inG d for the optimization of a linear function, there exists a critical step size δt cr (depending on the learning rates η µ , η σ , and on the weights w i ), above which GIGO will converge, and we can compute its value when the weights are of the form 1 q q0 36 .
Proposition 48. Let d ∈ N, k, η µ , η σ ∈ R * + let w = k.1 q q0 , and let Let µ n be the first coordinate of the mean, and let σ 2 n be the variance (at step n) maintained by the (η µ , η σ )-twisted geodesic IGO algorithm inG d associated with selection scheme w, sample size ∞ 37 , and step size δt, when optimizing g.
There exists δt cr such that: • if δt > δt cr , (σ n ) converges to 0 with exponential speed, and (µ n ) converges. 36 Notice that for q 0 0.5, the discussion is not relevant because in that case, even the IGO flow converges prematurely. Also compare with the critical δt of smoothed cross entropy method, and IGO ML, in [OAAH11]. 37 It has been proved in [OAAH11] that IGO algorithms are consistent with the IGO flow, i.e. they get closer and closer to the IGO flow as sample size tends to infinity. In other words Sample size ∞ means we replace the IGO speed in the GIGO update by its limit for large N .
In particular, it is not random.
The proof and the expression of δt cr can be found in the Appendix.
In the case corresponding to k = 4, n = 1, q 0 = 1/4, η µ = 1, η σ = 1.8, we find: δt cr ≈ 0.84. x → x 2 in dimension 1 with δt = 0.1, sample size 5000, weights w i = 4.1 i 1250 , and learning rates η µ = 1, η Σ = 1.8. One dot every 10 steps. All algorithms exhibit a similar behavior, differences start to appear. It cannot be seen on the graph, but the algorithm closest to zero after 400 steps is CMA (∼ 1.10 −16 , followed by xNES (∼ 6.10 −16 ) and GIGO (∼ 2.10 −15 ). Figure 6: Trajectories of GIGO, CMA and xNES optimizing x → x 2 in dimension 1 with δt = 0.5, sample size 5000, weights w i = 4.1 i 1250 , and learning rates η µ = 1, η Σ = 1.8. One dot every 2 steps. Stronger differences. Notice that after one step, the lowest mean is still GIGO (∼ 8.5, whereas xNES is around 8.75), but from the second step, GIGO has the highest mean because of the lower variance. x → x 2 in dimension 1 with δt = 1, sample size 5000, weights w i = 4.1 i 1250 , and learning rates η µ = 1, η Σ = 1.8. One dot per step. The CMA-ES algorithm fails here, because at the fourth step, the covariance matrix is not positive definite anymore (It is easy to see that the CMA-ES update is always defined if δtη Σ < 1, but this is not the case here). Also notice (see also Proposition 6.2) that at the first step, GIGO decreases the variance, whereas the σ-component of the IGO speed is positive. 35 Figure 8: Trajectories of GIGO, CMA and xNES optimizing x → x 2 in dimension 1 with δt = 1.5, sample size 5000, weights w i = 4.1 i 1250 , and learning rates η µ = 1, η Σ = 1.8. One dot per step. Same as δt = 1 for CMA. GIGO converges prematurely.

Conclusion
We introduced the geodesic IGO algorithm and, we showed that in the case of Gaussian distributions, Noether's theorem directly gives a first order equation satisfied by the geodesics. In terms of performance, the GIGO algorithm is similar to pure rank-µ CMA-ES, which is rather encouraging : it would be interesting to test GIGO on real problems. Moreover, GIGO is a reasonable and totally parametrization invariant 38 algorithm, and as such, it should be studied for other families of probability distributions 39 . Noether's theorem could be a crucial tool for this. We also showed that xNES and GIGO are not the same algorithm, and we defined Blockwise GIGO, a simple extension of the GIGO algorithm showing that xNES has a special status as it admits a definition which is "almost" parametrization-invariant. 38 Provided we can compute the solution of the equations of the geodesics. 39 Like Bernoulli distributions. However, in that case, the length of the geodescis is finite, and other problems arise.

Algorithm 1 For all algorithms
µ ← µ 0 if The algorithm updates Σ directly then Σ ← Σ 0 Find some A such that Σ = AA T else {The algorithm updates a square root A of Σ} A ← A 0 Σ = AA T end if while NOT (Termination criterion) do for i = 1 to λ do z i ∼ N (0, I) x i = Az i + µ end for Compute the IGO initial speed (included in the algorithms below) Update the mean and the covariance (the updates are Algorithms 2 to 6). end while