A dimensionality reduction technique for unconstrained global optimization of functions with low effective dimensionality

We investigate the unconstrained global optimization of functions with low effective dimensionality, which are constant along certain (unknown) linear subspaces. Extending the technique of random subspace embeddings in [Wang et al., Bayesian optimization in a billion dimensions via random embeddings. JAIR, 55(1): 361--387, 2016], we study a generic Random Embeddings for Global Optimization (REGO) framework that is compatible with any global minimization algorithm. Within REGO, instead of the original, potentially large-scale optimization problem, a Gaussian random, low-dimensional problem with bound constraints is formulated and solved in a reduced space. We provide novel probabilistic bounds for the success of REGO in solving the original, low effective-dimensionality problem, which show its independence of the (potentially large) ambient dimension and its precise dependence on the dimensions of the effective and randomly embedding subspaces. These results significantly improve existing theoretical analyses, both by providing the exact distribution of a reduced minimizer and of its Euclidean norm, and by relying on milder assumptions on the problem. We validate our theoretical findings by extensive numerical testing of REGO with three types of global optimization solvers, illustrating the improved scalability of REGO compared to the full-dimensional application of the respective solvers.


Introduction
In this paper, we address the unconstrained global optimization problem

  min_{x ∈ R^D} f(x),    (P)

where f : R^D → R is a real-valued, continuous, possibly non-convex, deterministic function defined on the whole of R^D. We assume that there exists x* ∈ R^D such that min_{x ∈ R^D} f(x) = f(x*) = f*. This implies that f is bounded below, namely, f* > −∞, and that the minimum in (P) is attained (not all minimizers are at infinity).
To alleviate the curse of dimensionality, we further restrict ourselves to a particular class of functions whose true (intrinsic) dimension is much less than the ambient problem dimension. These functions are constant along certain linear subspaces, which may not necessarily be aligned with the standard axes. In the literature, these functions are known under different names: functions with 'low effective dimensionality' [43], functions with 'active subspaces' [8] and 'multi-ridge' functions [17,41]. They have been found in a number of applications, mainly related to parameter studies. In hyper-parameter optimization for neural networks [2] and heuristic algorithms for combinatorial optimization problems [23], studies have shown that the respective objective functions are affected by only a few hyper-parameters, while the many other input hyper-parameters are redundant. Similarly, in complex engineering and physical simulation problems [8], such as in climate modelling [26], systems are modelled by several input parameters, with only a small number of the parameters, or a combination of them, having a true effect on the system's behaviour. To clarify this concept, we give a simple example of a function with low effective dimensionality.

[Figure 1: The function in (1.1) and its domain are plotted on the left- and right-hand sides, respectively. The red line is the effective subspace x = (1 −1)^T y and it intersects the blue lines of global minimizers at x*_k defined in Example 1.1 for k = −1, 0, 1; these points also correspond to optimal solutions in the reduced space.]

Example 1.1. Consider the function

  f(x) = f(x_1, x_2) = sin^2(x_1 − x_2 − 0.5).    (1.1)

By solving f(x) = 0, we find that the set of global minimizers is given by {(1 1)^T t − (0  0.5 + πk)^T : t ∈ R, k ∈ Z}. For each fixed value of k, the set corresponds to a distinct line of global minimizers along which the function is constant (see Figure 1). The effective subspace of f is (x_1 x_2)^T = (1 −1)^T y for y ∈ R. We substitute this into (1.1) to obtain the reduced, lower-dimensional optimization problem min_{y ∈ R} sin^2(2y − 0.5), which has the same global minimum as (1.1), with global minimizers y*_k = πk/2 + 0.25, k ∈ Z. We recover the corresponding solutions to (1.1) by setting x*_k = (1 −1)^T y*_k for k ∈ Z.
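The reconstruction of (1.1) above can be checked numerically; the following sketch (assuming f(x) = sin^2(x_1 − x_2 − 0.5)) verifies both the constant subspace and the reduced problem of Example 1.1.

```python
import numpy as np

# Example 1.1: f is constant along the direction (1, 1) and varies along (1, -1).
def f(x):
    return np.sin(x[0] - x[1] - 0.5) ** 2

rng = np.random.default_rng(0)
x = rng.standard_normal(2)
t = rng.standard_normal()

# Moving along the constant subspace span{(1, 1)} leaves f unchanged ...
assert np.isclose(f(x), f(x + t * np.array([1.0, 1.0])))

# ... while the reduced problem g(y) = f((y, -y)) = sin^2(2y - 0.5) attains the
# same global minimum f* = 0 at y*_k = pi*k/2 + 0.25, k integer.
g = lambda y: f(np.array([y, -y]))
print(g(0.25), g(np.pi / 2 + 0.25))  # both ~ 0
```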
As Example 1.1 illustrates, it is possible to cast (P) into a lower-dimensional problem which has the same global minimum f*. This is straightforward when the effective subspace is known, but far less so in applications where f is potentially black-box. When the effective subspace is unknown, it was proposed (in the context of Bayesian Optimization) in Wang et al. [43] to use random embeddings. The proposed technique solves the following lower-dimensional optimization problem instead of directly tackling (P):

  min_{y ∈ Y} f(Ay),    (RP)

where A is a D × d Gaussian random matrix (see Definition A.1) and Y = [−δ, δ]^d for some carefully chosen δ > 0 and d ≪ D. Note that, unlike (P), (RP) has (box) constraints, which are typically imposed to make the approach practical (i.e. to avoid unrealistic searches over infinite domains).

Definition 1.2. We say that (RP) is successful if there exists y* ∈ Y such that f(Ay*) = f*.
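To illustrate the formulation (RP), the sketch below builds the reduced objective g(y) = f(Ay) for a toy objective with low effective dimensionality; all names and parameter values (D, d, δ and the objective f) are illustrative assumptions, not part of the framework above.

```python
import numpy as np

D, d, delta = 100, 2, 5.0               # ambient dim, embedding dim, box half-width (assumed)
rng = np.random.default_rng(1)

# Toy objective with effective dimension 2: f sees x only through two fixed directions.
W = rng.standard_normal((2, D))
f = lambda x: float(np.sum((W @ x - 1.0) ** 2))

A = rng.standard_normal((D, d))         # D x d Gaussian random embedding
g = lambda y: f(A @ y)                  # reduced objective over Y = [-delta, delta]^d

y = rng.uniform(-delta, delta, size=d)  # any candidate point in Y
print(g(y))                             # f evaluated through the embedding
```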
Related work. The scalability challenges of Bayesian Optimization (BO) algorithms for generic black-box functions have prompted research into improving the efficiency of this class of methods for functions with special structure. Different structural assumptions on the objective have been analysed for BO, such as additivity or (partial) separability, which assumes that the objective function can be represented as the sum of smaller-dimensional functions with non-overlapping variables [44,25,28] or with overlapping ones [35].
Another popular structural assumption is the above-mentioned low effective dimensionality of the objective. In its simplest form, this considers the effective subspace to be aligned with the coordinate axes, which is equivalent to the presence of redundant variables [7,1]. More generally, the optimization of functions that are constant along arbitrary linear subspaces (which, as mentioned above, is also the focus of this paper) has been addressed using BO methods in [11,43,19,13], and extended to other problem and algorithm classes such as derivative-free optimization [34], multi-objective optimization [33] and evolutionary methods [38]. Some proposals learn the effective subspace of the function beforehand (using, for example, a low-rank matrix recovery approach) [41,17] and then optimize in the reduced subspace [11,13]. Alternating learning and optimization steps has also been proposed [19], as has bypassing learning and directly optimizing in randomly-chosen low-dimensional subspaces (provided an estimate of the effective dimension is known) [43,4,5].
For the latter, Wang et al. [43] developed the so-called REMBO algorithm, which is a BO framework for problem (P) with box constraints x ∈ X that uses Gaussian random embeddings (namely, A is a Gaussian random matrix) to generate the reduced problem (RP). They find that the size of Y is the primary factor in determining the success (or failure) of the reduced problem, and quantify the probability of success of (RP) for the case when the embedded dimension d is equal to the effective one and the effective subspace is aligned with the coordinate axes (see [43, Theorem 3]). A challenge of (RP) for BO with box constraints is that, even when (RP) is successful, the high-dimensional image Ay ∈ R^D of a point y ∈ Y may be outside the feasible set X. For this reason, REMBO is equipped with a map p_X : R^D → R^D that projects the image of the reduced solutions that fall outside X to the closest point on the boundary of X. To model a Gaussian Process for the reduced problem, [43] proposes two kernels: a high-dimensional k_X and a low-dimensional k_Y. Kernel k_X suffers from high dimensionality as it constructs a GP in a D-dimensional space. The benefit of k_Y is that it constructs a GP in a d-dimensional subspace, but this kernel over-explores regions in Y whose high-dimensional images outside X are mapped to the same points in X through the non-injective p_X. To remedy these issues, Binois et al. [4] propose a new kernel k_Ψ which has the benefit of being low-dimensional while avoiding the over-exploratory tendency of k_Y. In [5], Binois et al. also propose a new mapping γ (instead of p_X) and define Y and new kernels based on this new mapping.
Sanyang and Kabán [38] develop REMEDA, which uses random embeddings within an estimation-of-distribution evolutionary algorithm (EDA). Their theoretical results quantifying the size of Y and the success of (RP) improve on those in [43] and are applicable for certain choices of d greater than the effective subspace dimension; they also experiment with estimating the effective dimension numerically.
Qian et al. [34] extend the framework and some of the results in Wang et al. [43] to functions with approximate low effective subspaces, proposing the use of multiple random embeddings within any derivative-free solver. They contrast the use of a single versus multiple embeddings on three test problems of varying dimensions and using three different types of derivative-free solvers (evolutionary, Bayesian and model-based).
Recently, in the context of Bayesian optimization, Nayebi et al. [29] use a different random ensemble based on hashing matrices to represent the embedded subspaces and define Y as [−1, 1] d ; this formulation guarantees that the high-dimensional points are always inside X and, thus, their method avoids the feasibility corrections of REMBO.
Our contributions. We investigate a general random embeddings framework for unconstrained global optimization of functions with low effective dimensionality, where we allow the effective subspace of the objective function and its dimension (denoted by d_e) to be arbitrary (not necessarily aligned with the coordinate axes and not limited in dimension by problem constants). This framework also allows the use of any global solver to solve the reduced problem.
We significantly extend and improve the theoretical analyses in [43,38], providing an in-depth investigation of the reduced problem (RP) when Gaussian random embeddings are used. In particular, while [43,38] estimate the Euclidean norm of a (random) reduced minimizer, we derive its exact distribution, using tools from random matrix theory. We show that the squared Euclidean norm of this reduced minimizer, when appropriately scaled, follows the inverse chi-squared distribution with d − d_e + 1 degrees of freedom, where d is the dimension of the random embedding (Theorem 3.7). Moreover, we derive the probability density function of this reduced minimizer (Theorem 3.10) by first proving that it follows a spherical distribution. These results imply that, under certain assumptions, solving (RP) has no dependence on the ambient dimension D. Subsequently, Theorem 4.1 and Corollary 4.2 estimate the probability that (RP) is successful. The latter result extends both [43, Theorem 3] and [38, Theorem 2] to arbitrary effective subspaces and any d ≥ d_e, and establishes a notable and more precise trade-off between the success of (RP), δ (the size of the reduced domain Y) and the embedding dimension d, thus allowing us to choose appropriate values for these parameters in the algorithm. Furthermore, we describe how to extend the main results to affine random embeddings (which draw random subspaces through any chosen reference point in R^D); these indicate that the probability of success of (RP) is higher if the point of reference is closer to the set of global minimizers.
Similarly to the algorithmic frameworks proposed in [43,34], we propose REGO (Random Embeddings for Global Optimization), which solves a single randomly-embedded reduced problem (RP) instead of (P) and is compatible with any generic global optimization solver. We validate our theoretical results by providing extensive numerical testing of REGO with three types of solvers for (RP): DIRECT (Lipschitz optimization), BARON (branch and bound) and KNITRO (multi-start local optimization). We use 19 standard global optimization test problems to generate functions with effective dimensionality structure and of growing ambient dimension D. When comparing REGO with the direct optimization of the ensuing problems without embeddings, we find that REGO's performance is essentially independent of D for all three solvers and that it is successful in recovering the original global minimum in most cases with only one embedding. We also test the robustness of REGO's performance to variations in algorithm parameters such as δ and d.
Paper outline. In Section 2, we formally define and describe functions with low effective dimensionality, emphasizing their geometrical aspects. In Section 3, we characterize the minimizers in the reduced space, focusing on the minimal 2-norm minimizer; for this minimizer, we derive the distribution of its Euclidean norm and its probability density function. We use the former result in Section 4 to derive a probabilistic bound for the success of (RP). In Section 5, we conduct numerical experiments to test the REGO algorithm on functions with low effective dimensionality using three optimization solvers, namely, DIRECT, BARON and KNITRO, while in Section 6 we draw our conclusions and outline future directions.
Notation. We use bold capital letters to denote matrices (A) and bold lowercase letters (a) to denote vectors. In particular, we use I_D to denote the D × D identity matrix and 0_D, 1_D (or simply 0, 1) to denote the D-dimensional vectors of zeros and ones, respectively. For a D × d matrix A, we write range(A) to denote the linear subspace spanned by the columns of A in R^D.
We let ⟨·, ·⟩, ‖·‖ and ‖·‖_∞ denote the usual Euclidean inner product, the Euclidean norm and the infinity norm, respectively. Where emphasis is needed, for the Euclidean norm we also use ‖·‖_2.
Given two random variables (vectors) x and y (x and y), we write x law= y (x law= y) to denote the fact that x and y (x and y) have the same distribution. We reserve the letter A to refer to a D × d Gaussian random matrix (see Definition A.1) and write χ²_n to denote a chi-squared random variable with n degrees of freedom (see Appendix A.2).

Functions with low effective dimensionality
In this section, we formally define functions with low effective dimensionality and describe the geometry of (RP).

Definitions and assumptions
Functions with low effective dimensionality can be defined in at least two ways [17,43]. We will work with a definition given in terms of linear subspaces, provided in [43].

Definition 2.1 (see [43]). A function f : R^D → R has effective dimensionality d_e ≤ D if there exists a linear subspace T ⊆ R^D of dimension d_e such that for all vectors x_⊤ in T and x_⊥ in T^⊥ (the orthogonal complement of T), we have

  f(x_⊤ + x_⊥) = f(x_⊤),    (2.1)

and d_e is the smallest integer satisfying (2.1).
The linear subspace T is called the effective subspace of f and its orthogonal complement T ⊥ , the constant subspace of f . It is convenient to think of T ⊥ as a subspace of no variation of largest dimension (along which the value of f does not change) and T as its orthogonal complement.
Every vector x can be decomposed as x = x_⊤ + x_⊥, where x_⊤ and x_⊥ are the orthogonal projections of x onto T and T^⊥, respectively. In particular, if x* is a global minimizer and f* is the global minimum of f, then x* = x*_⊤ + x*_⊥ and f(x*_⊤) = f(x*_⊤ + x*_⊥) = f(x*) = f*. Moreover, we have

  f(x*_⊤ + x_⊥) = f(x*_⊤) = f*    (2.2)

for every vector x_⊥ in T^⊥. It is important to note that there can be multiple points x*_⊤ in R^D satisfying the above definition such as, for instance, x*_{−1}, x*_0 and x*_1 in Example 1.1. By contrast, the function f = (x_1 − x_2 − 0.5)^2 admits a unique x*_⊤, given by (0.25 −0.25)^T.
We summarize the above discussion in the following assumption.
Assumption 2.2. The function f : R^D → R is continuous and has effective dimensionality d_e ≤ d, with effective subspace T and constant subspace T^⊥ spanned by the columns of the orthonormal matrices U ∈ R^{D×d_e} and V ∈ R^{D×(D−d_e)}, respectively.
Recalling the definition of problem (P) on page 1, let

  G := {x ∈ R^D : f(x) = f*}

be the set of global minimizers of f in R^D. Under Assumption 2.2, the set G can be represented as a union of (possibly infinitely many) affine subspaces, each containing one particular x*_⊤ (see the proof of Theorem 4.1). Each of these affine subspaces is a (D − d_e)-dimensional set {x ∈ R^D : x ∈ x*_⊤ + T^⊥}, namely, a translation of T^⊥ by the vector x*_⊤ that the corresponding affine subspace must contain. In particular, if there is a unique x*_⊤, then G consists of exactly one such affine subspace. Note also that the point(s) x*_⊤ lie in G ∩ T, and are the closest minimizers to the origin in Euclidean norm amongst all the minimizers lying in their respective affine subspaces. Our analysis applies to any minimizer x* with x*_⊤ ≠ 0. If x*_⊤ = 0, then (RP) has a trivial solution: in that case, the origin is a global minimizer, implying that every embedding is successful with the solution y* = 0. Hence, we focus our analysis on finding a(ny) minimizer x* ∈ G with x*_⊤ ≠ 0.

Assumption 2.3. Given Assumption 2.2, let x* ∈ G be such that x*_⊤ = UU^T x* (the unique Euclidean projection of x* onto T) is non-zero. Let G* := x*_⊤ + T^⊥ be the affine subspace of G that contains x*.
The set G contains infinitely many global minimizers, a particularly useful feature of functions with low effective dimensionality; this fact suggests that targeting G numerically may be easier than if G contained only one point.

Geometric description
We now provide a geometric description of (RP), which serves as a basis for our theoretical investigations.
In Figure 2, we illustrate schematically T (the effective subspace of f), T^⊥ (the orthogonal complement of T), G* (a connected component of G) and x*_⊤ (the orthogonal projection of the global minimizers in G* onto T). Since the orientation and position of these geometric objects are solely determined by the (deterministic) objective function, they are fixed, non-random.
By applying the 'random embedding' (RP), we switch from optimizing over R^D to optimizing over Y. The linear mapping y → Ay maps points of the hypercube Y to points of the subspace range(A) in R^D, which means that searching over Y is equivalent to searching over the corresponding feasible set in range(A) in R^D. An example of this mapping is illustrated in Figure 2 with two red line segments: the left segment (from −δ1 to δ1), representing Y, is mapped to the right segment, which lies in range(A). It is important to note that the centre of Y maps to the origin in R^D and, hence, the corresponding search in the original space is also centred at the origin.
The most valuable information that we want to retain while performing dimensionality reduction is the value of f*. We would like min_{y ∈ Y} f(Ay) = f*, which holds if and only if there is at least one y* in Y such that f(Ay*) = f*. This condition has a geometric interpretation and, following from the definition of G, it can equivalently be stated as:

  there exists y* ∈ Y such that Ay* ∈ G.    (2.3)

For (2.3) to hold, we must first ensure that

  there exists y* ∈ R^d such that Ay* ∈ G.    (2.4)

In this regard, Wang et al. [43] proved the following theorem.

Theorem 2.4 (see [43, Theorem 2]). Let Assumption 2.2 hold and let A be a D × d Gaussian random matrix with d ≥ d_e. Then, with probability one, for any fixed x ∈ R^D, there exists a y ∈ R^d such that f(x) = f(Ay). In particular, for a global minimizer x*, with probability one, there exists a y* ∈ R^d such that f* = f(x*) = f(Ay*), and so (2.4) holds.

While satisfaction of (2.4) only depends on d ≥ d_e, that of (2.3) is determined by the values of both d and δ. For larger values of d and/or δ the probability that (2.3) is satisfied is higher. One must, on the other hand, be cognisant of the fact that larger values of d (the dimension of (RP)) and/or δ (the half-length of the domain) demand more computational resources. Therefore, a careful calibration of these two parameters is needed to ensure that (RP) is successful for most embeddings, while remaining able to converge to the solution within the computational budget. In this regard, our analysis will attempt to answer the following question: what are optimal values of d and δ such that (2.3) is satisfied with 'high' probability?

Characterizing minimizers in the reduced space
The analysis of this section focuses on determining the distribution of the random minimizer y * of f (Ay), which satisfies f (Ay * ) = f * . These results will inform us on the effects of the different parameters on the success of (RP) allowing us to estimate the values of δ and d that are likely to increase the chances of successful recovery of f * .
The following theorem provides a useful characterization of y * . The theorem and its proof were inspired by the proofs of Theorems 2 and 3 in [43].
Theorem 3.1. Let Assumption 2.2 hold and let x*_⊤ and G* be defined as in Assumption 2.3. Let A be a D × d Gaussian matrix. Then, y* ∈ R^d satisfies Ay* ∈ G* if and only if

  By* = z*,    (3.1)

where B := U^T A is a d_e × d Gaussian matrix and z* ∈ R^{d_e} is the unique vector satisfying x*_⊤ = Uz*.

Proof. Let y* ∈ R^d be such that Ay* ∈ G*. First, we establish that

  Ay* ∈ G* if and only if x*_⊤ = UU^T Ay*.    (3.2)

Suppose that Ay* ∈ G*. Then, using the definition of G* in Assumption 2.3, we can write Ay* = x*_⊤ + x_⊥ for some x_⊥ ∈ T^⊥. The orthogonal projection of Ay* onto T is given by

  UU^T Ay* = UU^T x*_⊤ + UU^T x_⊥ = x*_⊤.    (3.3)

Conversely, assume that y* satisfies (3.3), namely, x*_⊤ = UU^T Ay*. Writing Ay* = UU^T Ay* + VV^T Ay* = x*_⊤ + VV^T Ay*, note that VV^T Ay* lies in T^⊥ as it is the orthogonal projection of Ay* onto T^⊥, which implies that Ay* ∈ G*. This completes the proof of (3.2).

Now we show that (3.1) and (3.3) are equivalent. We multiply both sides of x*_⊤ = UU^T Ay* by S^T, where S := (U V) is the D × D orthogonal matrix obtained by concatenating U and V, and obtain

  S^T x*_⊤ = S^T UU^T Ay*.    (3.4)

Since x*_⊤ is in the column span of U, it can be written as x*_⊤ = Uz* for some (unique) vector z* ∈ R^{d_e}. By substituting the above into (3.4) we obtain

  (U^T Uz*; V^T Uz*) = (U^T UU^T Ay*; V^T UU^T Ay*).

This reduces to (z*; 0) = (By*; 0), where we have used the identities U^T U = I and V^T U = 0, which follow from Assumption 2.2; this gives (3.1). To obtain (3.3) from (3.1), multiply (3.1) by U.
Remark 3.2. Observe that ‖z*‖ = ‖x*_⊤‖ since Uz* = x*_⊤ and U has orthonormal columns.
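Theorem 3.1 reduces the membership condition Ay* ∈ G* to the d_e × d linear system By* = z*, which is easy to check numerically; the sketch below builds a synthetic instance (the subspace T and the point x*_⊤ are illustrative) and verifies that a solution of By = z* indeed lands in G*.

```python
import numpy as np

rng = np.random.default_rng(2)
D, d_e, d = 50, 3, 5

# Orthonormal basis U of an (illustrative) effective subspace T, and x*_T = U z* in T
U, _ = np.linalg.qr(rng.standard_normal((D, d_e)))
z_star = rng.standard_normal(d_e)
x_top = U @ z_star

A = rng.standard_normal((D, d))   # Gaussian embedding
B = U.T @ A                       # B = U^T A is a d_e x d Gaussian matrix

y_star = np.linalg.lstsq(B, z_star, rcond=None)[0]  # one solution of B y = z*

# Ay* lies in G* = x*_T + T_perp: its orthogonal projection onto T equals x*_T
proj = U @ (U.T @ (A @ y_star))
assert np.allclose(proj, x_top)
```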
Lemma 3.3. Let Assumption 2.2 hold and define the set of reduced minimizers

  S* := {y ∈ R^d : Ay ∈ G*},

where A is a D × d Gaussian matrix and where G* is defined as in Assumption 2.3. Then, the following holds:
• If d = d_e, then S* has exactly one element with probability 1.
• If d > d_e, then S* has infinitely many elements with probability 1.
Proof. It follows from Theorem 3.1 that the set S* and the set of solutions to By = z* coincide. According to Theorem A.3, BB^T is positive definite with probability 1, which implies that rank(BB^T) = d_e with probability 1. Since rank(B) = rank(BB^T), rank(B) = d_e with probability 1. Hence, the linear system By = z* almost surely has a solution. If d = d_e, the linear system has exactly one solution. If d > d_e, the system is underdetermined and, therefore, has infinitely many solutions.

Choosing a suitable minimizer
While S* contains infinitely many solutions if d > d_e, it is sufficient that one of these solutions is contained in Y for (RP) to be successful. We proceed further by choosing one particular solution y* that is easy to analyse and, based on the analysis, we will adjust the parameters δ and d appropriately to ensure that the chosen y* falls inside the feasible set Y with high probability. The solutions that are likely to fall inside the feasible domain must be close to the origin. In this regard, we propose two candidates:

  y*_2 := arg min_{y ∈ S*} ‖y‖_2    (3.5)

and

  y*_∞ := arg min_{y ∈ S*} ‖y‖_∞.    (3.6)

Due to the definition of Y as a box, the minimizer (3.6) with the minimal infinity norm is of particular interest. Since y*_∞ has the smallest infinity norm among all solutions in S*, knowledge of y*_∞ would allow us to choose the smallest possible Y while ensuring that (RP) is successful. Despite this convenient fact, we found that it is more difficult to study y*_∞ and have decided to investigate y*_2 instead.

Lemma 3.5. Let Assumption 2.2 hold and let y*_2 be defined in (3.5). If y*_2 ∈ Y, then (RP) is successful.

Proof. Assume that y*_2 ∈ Y. Then, y*_2 is a feasible solution of (RP). By the definitions of y*_2 and S*, we also have Ay*_2 ∈ G*; this implies that f(Ay*_2) = f*. Hence, (RP) is successful by Definition 1.2.
Corollary 3.6. Let Assumption 2.2 hold. Problem (3.5) has a unique solution given by

  y*_2 = B^T (BB^T)^{−1} z*.    (3.7)

Proof. It follows from Theorem 3.1 that the solution(s) of (3.5) must be equal to the solution(s) of the following problem:

  min ‖y‖²_2  subject to  By = z*,

which has the unique solution (3.7).
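Formula (3.7) is the familiar minimal-norm (pseudoinverse) solution of an underdetermined system; a minimal sketch:

```python
import numpy as np

rng = np.random.default_rng(3)
d_e, d = 3, 6
B = rng.standard_normal((d_e, d))
z_star = rng.standard_normal(d_e)

# y*_2 = B^T (B B^T)^{-1} z*, the minimal 2-norm solution of B y = z*
y2 = B.T @ np.linalg.solve(B @ B.T, z_star)
assert np.allclose(B @ y2, z_star)                  # solves the system ...
assert np.allclose(y2, np.linalg.pinv(B) @ z_star)  # ... and matches the pseudoinverse
```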

Distribution of minimal Euclidean norm minimizer
The present section derives the distribution of y*_2 and of its Euclidean norm.
The distribution of the Euclidean norm of y*_2

Theorem 3.7. Let Assumption 2.2 hold and let x*_⊤ and y*_2 be defined as in Assumption 2.3 and (3.5), respectively. Then, y*_2 satisfies

  ‖y*_2‖²_2 law= ‖x*_⊤‖²_2 / χ²_{d−d_e+1},

where χ²_{d−d_e+1} denotes a chi-squared random variable with d − d_e + 1 degrees of freedom.

Proof. The result almost immediately follows from Corollary 3.6 and Lemma A.15. These yield

  ‖y*_2‖²_2 = ‖B^T (BB^T)^{−1} z*‖²_2 law= ‖z*‖²_2 / χ²_{d−d_e+1}.

The result is implied by ‖z*‖ = ‖x*_⊤‖ (see Remark 3.2).
The above result is equivalent to saying that ‖y*_2‖²_2 / ‖x*_⊤‖²_2 follows the inverse chi-squared distribution with d − d_e + 1 degrees of freedom (see Definition A.7). The theorem reveals a linear dependence of ‖y*_2‖ on ‖x*_⊤‖; larger values of ‖x*_⊤‖ increase the likelihood of y*_2 lying further away from the origin. The theorem also suggests that the distribution of y*_2 is independent of D as long as ‖x*_⊤‖ is fixed.
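Theorem 3.7 lends itself to a direct Monte Carlo check: over many Gaussian draws of B, the ratio ‖z*‖²/‖y*_2‖² should behave like a χ²_{d−d_e+1} variable (the dimensions below are illustrative).

```python
import numpy as np

rng = np.random.default_rng(4)
d_e, d, trials = 3, 6, 20000
z = rng.standard_normal(d_e)            # stands in for z*, so ||z|| plays the role of ||x*_T||

samples = np.empty(trials)
for i in range(trials):
    B = rng.standard_normal((d_e, d))
    y2 = B.T @ np.linalg.solve(B @ B.T, z)
    samples[i] = (z @ z) / (y2 @ y2)    # should follow chi^2 with d - d_e + 1 = 4 dof

chi2_draws = rng.chisquare(d - d_e + 1, size=trials)
print(np.mean(samples), np.mean(chi2_draws))      # both ~ 4
print(np.median(samples), np.median(chi2_draws))  # medians agree as well
```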
Corollary 3.8. Let Assumption 2.2 hold and let x*_⊤ and y*_2 be defined as in Assumption 2.3 and (3.5), respectively. Then, for any δ > 0,

  P[‖y*_2‖_2 ≤ δ] = P[χ²_{d−d_e+1} ≥ ‖x*_⊤‖²_2 / δ²].

Proof. For any ε > 0, we have

  P[‖y*_2‖²_2 ≤ ‖x*_⊤‖²_2/ε²] = P[‖x*_⊤‖²_2 / χ²_{d−d_e+1} ≤ ‖x*_⊤‖²_2/ε²] = P[χ²_{d−d_e+1} ≥ ε²],

where the second equality follows from Theorem 3.7. By letting ε = ‖x*_⊤‖_2/δ, we obtain the result.

Remark 3.9. Using the mean of the inverse chi-squared distribution (see Definition A.7), Theorem 3.7 implies that, for d − d_e + 1 > 2,

  E[‖y*_2‖²_2] = ‖x*_⊤‖²_2 / (d − d_e − 1).    (3.8)

The expected value in (3.8) is inversely proportional to d − d_e − 1. In other words, for a fixed d_e, larger values of the dimension of the embedded subspace bring y*_2 closer to the origin. This observation indicates that increasing d allows us to decrease δ whilst the probability of y*_2 ∈ Y is kept constant.

The probability density function
The following theorem derives the probability density function of y*_2.

Theorem 3.10. Let Assumption 2.2 hold and let x*_⊤ and y*_2 be defined as in Assumption 2.3 and (3.5), respectively. Then, the probability density function of y*_2 is given by

  p(y) = 2^{−n/2} π^{−d/2} (Γ(d/2)/Γ(n/2)) ‖x*_⊤‖^n (y^T y)^{−(n+d)/2} e^{−‖x*_⊤‖²/(2 y^T y)},  where n = d − d_e + 1.

Proof. Corollary 3.6 and Lemma A.17 imply that the p.d.f. of y*_2 is given by the formula in Lemma A.17 with z replaced by z*. By using the equality ‖z*‖ = ‖x*_⊤‖ (see Remark 3.2), we obtain the desired result.

Figure 3 illustrates the p.d.f. of a two-dimensional y*_2. The shape of the p.d.f. resembles a volcano, with the mass concentrated at a certain distance from the origin, suggesting that y*_2 is unlikely to be either too close to or too distant from the origin. We also note that the p.d.f. is independent of D.

Bounding the success of the reduced problem
This section is the culmination of this paper's analysis. Based on the results established earlier we derive a bound for the probability of success of (RP).
The following theorem presents a notable connection between the success of (RP) and the chi-squared distribution.
Theorem 4.1. Let Assumption 2.2 hold. Then, for any δ > 0, we have

  P[(RP) is successful] ≥ P[χ²_{d−d_e+1} ≥ μ²/δ²],    (4.1)

where μ := min_{x* ∈ G} ‖x*‖_2.

Proof. Note the following relationship between the probabilities:

  P[(RP) is successful] ≥ P[y*_2 ∈ Y] = P[‖y*_2‖_∞ ≤ δ] ≥ P[‖y*_2‖_2 ≤ δ],    (4.2)

where the first inequality follows from Lemma 3.5 and where the second inequality is implied by ‖y*_2‖_∞ ≤ ‖y*_2‖_2. By applying Corollary 3.8 to the last probability in (4.2) and using the definition of x*_⊤ given in Assumption 2.3, we obtain

  P[(RP) is successful] ≥ P[χ²_{d−d_e+1} ≥ ‖UU^T x*‖²_2 / δ²]    (4.3)

for any δ > 0 and any x* ∈ G such that ‖UU^T x*‖_2 ≠ 0. Note that (4.3) also holds for ‖UU^T x*‖_2 = 0 since, in this case, (RP) is successful with probability 1 (see the discussion preceding Assumption 2.3). Hence, (4.3) holds for any x* ∈ G, which then implies

  P[(RP) is successful] ≥ max_{x* ∈ G} P[χ²_{d−d_e+1} ≥ ‖UU^T x*‖²_2/δ²] = P[χ²_{d−d_e+1} ≥ min_{x* ∈ G} ‖UU^T x*‖²_2/δ²],

where the equality follows from the fact that the tail distribution P[X > x] of any random variable X is a monotonically decreasing function of x.
In what follows, we show that min_{x* ∈ G} ‖UU^T x*‖²_2 = min_{x* ∈ G} ‖x*‖²_2. Define the sets Z := {z ∈ R^{d_e} : Uz = UU^T x*, x* ∈ G} and S := {Uz + Vc : z ∈ Z, c ∈ R^{D−d_e}}, where V is defined in Assumption 2.2. First, we establish that G = S by showing that G ⊆ S and that S ⊆ G.
Let x* ∈ G. We can write x* = UU^T x* + VV^T x* since UU^T + VV^T = I. Let z = U^T x* and c = V^T x* and note that z ∈ Z and c ∈ R^{D−d_e}. Hence, x* ∈ S, which proves that G ⊆ S. Let x ∈ S. Then, x = Uz + Vc for some z ∈ Z and c ∈ R^{D−d_e}, where Uz = UU^T x̃* for some x̃* ∈ G. We have

  f(x) = f(Uz + Vc) = f(Uz) = f(UU^T x̃*) = f(x̃*_⊤) = f*,

where the second equality follows from the assumption that f has low effective dimensionality and the fact that Vc ∈ T^⊥, the fourth equality follows from the definition of x*_⊤ (given in Assumption 2.3) and the last equality follows from (2.2). Hence, by definition of G, x ∈ G. This proves that S ⊆ G. Finally, we have (by definition of Z)

  min_{x* ∈ G} ‖x*‖²_2 = min_{z ∈ Z, c} ‖Uz + Vc‖²_2 = min_{z ∈ Z, c} (‖Uz‖²_2 + ‖Vc‖²_2) = min_{z ∈ Z} ‖Uz‖²_2 = min_{x* ∈ G} ‖UU^T x*‖²_2.

Using Theorem 4.1, one can now bound the success of (RP) by applying any tail bound on the chi-squared distribution. We use the bound derived in Lemma A.6, which yields the following corollary.

Corollary 4.2. Let Assumption 2.2 hold, let n := d − d_e + 1 and let μ := min_{x* ∈ G} ‖x*‖_2. Then, for any δ > 0,

  P[(RP) is successful] ≥ 1 − (1/Γ(n/2 + 1)) (μ²/(2δ²))^{n/2}.    (4.4)
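Both the chi-squared tail in Theorem 4.1 and the closed form of (4.4) are directly computable; the sketch below (using scipy, with illustrative values of μ, δ, d and d_e, and assuming the form of (4.4) as reconstructed above) shows how the guarantee improves with the gap d − d_e.

```python
from math import gamma
from scipy.stats import chi2

def success_lb_exact(mu, delta, d, d_e):
    """Theorem 4.1: P[(RP) successful] >= P[chi^2_{d-d_e+1} >= mu^2/delta^2]."""
    return chi2.sf((mu / delta) ** 2, df=d - d_e + 1)

def success_lb_closed_form(mu, delta, d, d_e):
    """Corollary 4.2 (as reconstructed): 1 - (mu^2/(2 delta^2))^{n/2} / Gamma(n/2 + 1)."""
    n = d - d_e + 1
    return 1.0 - (mu ** 2 / (2.0 * delta ** 2)) ** (n / 2) / gamma(n / 2 + 1)

for gap in range(4):  # d - d_e = 0, 1, 2, 3 with illustrative mu = delta = 1
    print(gap, success_lb_exact(1.0, 1.0, 2 + gap, 2),
          success_lb_closed_form(1.0, 1.0, 2 + gap, 2))
```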
Let R* denote the right-hand side of (4.4). First, we note that R* is a function of μ/δ and d − d_e alone. The bound reveals a linear relationship between μ and δ: scaling μ and δ by the same factor does not affect the value of R*. Furthermore, observe that, for smaller values of μ or larger values of δ, R* is closer to 1. Numerical experiments show that for large values of n and/or μ/δ the bound (4.4) is less tight; this is also signified by the asymptotic behaviour of R*: R* → −∞ monotonically as μ/δ → ∞, making the bound vacuous for large enough μ/δ.
It is remarkable that R* has no dependence on D, the dimension of the original optimization problem. This implies that larger D does not diminish the success of the reduced problem as long as μ and d − d_e remain unchanged.

Previous bounds. One can derive similar bounds for the success of (RP) by bounding P[‖y*_2‖_2 ≤ δ] in (4.2) using the Cauchy-Schwarz inequality. Since

  ‖y*_2‖_2 = ‖B^T (BB^T)^{−1} z*‖_2 ≤ ‖z*‖_2 / s_min(B^T),

where s_min(B^T) denotes the smallest singular value of B^T, we have P[‖y*_2‖_2 ≤ δ] ≥ P[s_min(B^T) ≥ ‖z*‖_2/δ]. We can now use any suitable tail bound for the smallest singular value of a Gaussian matrix to bound the latter probability. Wang et al. [43], by applying the above technique and the result in [12] to bound the singular value, derived the bound

  P[(RP) is successful] ≥ 1 − √(d_e) μ/δ.

Their derivation is predicated on the assumptions that d = d_e and that T is spanned by the standard basis vectors. In [38], Sanyang and Kabán extended Wang et al.'s bound to any d ≥ d_e. Using the bound in [9] for s_min(B^T), they showed a bound of the form

  P[(RP) is successful] ≥ 1 − C_{d,d_e} (μ/δ)^{d−d_e+1},

with an explicit constant C_{d,d_e}. One can also use Rudelson and Vershynin's bound in [36, Theorem 1.1] to obtain

  P[(RP) is successful] ≥ 1 − (Cμ/δ)^{d−d_e+1} − e^{−cd},

where C, c > 0 are absolute constants. This bound shows dependence of the probability on the difference d − d_e, which is also manifest in our bound. Rudelson and Vershynin's bound cannot be used for practical purposes due to the unknown constants C and c; we require explicit bounds to define the size of Y. Unlike the bounds of Wang et al. [43] and Sanyang and Kabán [38], Corollary 3.8 is applicable to any d ≥ d_e and an arbitrary subspace T. Moreover, using the exact distribution of ‖y*_2‖ given in Corollary 3.8, we circumvent the application of the intermediate Cauchy-Schwarz inequality and bound the distribution of ‖y*_2‖ directly.
Affine random embeddings. It is not difficult to extend (RP) to affine random subspace embeddings. In the affine case, we replace x by Ay + p, where p ∈ R^D is a fixed point. The reduced optimization problem is then given by

  min_{y ∈ Y} f(Ay + p).

The results that apply to linear embeddings also apply to affine embeddings after minor adjustments. Theorem 2.4, for example, can be easily extended to the affine case to show that the intersection between p + range(A) and G takes place with probability 1 if d ≥ d_e.
The affine versions of the results are provable under the same assumptions except for a minor alteration in Assumption 2.3: the condition x*_⊤ ≠ 0 changes to x*_⊤ ≠ p_⊤. To obtain the affine versions of Theorem 3.7 and Theorem 3.10, replace ‖x*_⊤‖ with ‖x*_⊤ − p_⊤‖, where p_⊤ := UU^T p is the orthogonal projection of p onto T. For the affine versions of Theorem 4.1 and Corollary 4.2, replace min_{x* ∈ G} ‖x*‖ with min_{x* ∈ G} ‖x* − p‖.

Choices of (RP) parameters
The present section aims to test numerically the quality of the bound (4.4). We will also use the results of this section to select suitable pairs of parameters d and δ for (RP) in the numerical experiments later. Suppose that we are given a function f satisfying Assumption 2.2 with the set of global minimizers G consisting of only one connected component. Let x*_⊤ for f be defined as in Assumption 2.3 and let z be defined by the equation Uz = x*_⊤. We also define μ := min_{x* ∈ G} ‖x*‖ and note that μ = ‖x*_⊤‖.
We test (4.4) for f by contrasting the left-hand side of (4.4) (denoted by L*) with its right-hand side (denoted by R*). We compare L* and R* for four different values of d − d_e, namely, 0, 1, 2 and 3. For each value of d − d_e, we express R* as a function of δ̄ := δ/μ and, using its closed form, we plot R* for δ̄ ∈ [0.02, 10]. We do not have a closed-form expression for L*, but we can approximate it numerically. In what follows, we describe how this can be done. We start by writing

  L* = P[(RP) is successful] = P[∃y ∈ Y : f(Ay) = f*] = P[∃y : ‖y‖_∞ ≤ δ, By = z] = P[∃y : ‖y‖_∞ ≤ δ̄, B̄y = z̄],    (5.1)

where B̄ denotes a d_e × d Gaussian matrix and z̄ := z/μ. Here, the second equality follows from Definition 1.2, the third equality follows from Theorem 3.1 and the fact that G has only one connected component, and the last equality follows from dividing the system By = z through by μ. Note that ‖z̄‖ = 1 since ‖z‖ = ‖x*_⊤‖ = μ (see Remark 3.2). We assign z̄ to a randomly generated vector with unit norm and keep z̄ fixed throughout the experiment. For each δ̄ ∈ [0.02, 10], we generate a thousand Gaussian matrices B̄ and estimate the last probability in (5.1) as the proportion of instances for which the statement under the probability is true. Unlike R*, L* depends on the individual values of d and d_e. We plot the estimates of L* for the following values of d − d_e: 0, 1, 2, 3 and, in each plot, we repeat the experiment for d_e = 2, 3, 4, 5, 6. The plots are presented in Figure 4.
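The Monte Carlo estimation of L* just described may be sketched as follows: for each δ̄ we test the feasibility of the system B̄y = z̄ within the box ‖y‖_∞ ≤ δ̄ by solving a small linear program with a zero objective (the trial counts and dimensions below are illustrative).

```python
import numpy as np
from scipy.optimize import linprog

def estimate_L_star(d_e, d, delta_bar, trials=1000, seed=5):
    """Estimate L* = P[ exists y with ||y||_inf <= delta_bar and B y = z_bar ]."""
    rng = np.random.default_rng(seed)
    z_bar = rng.standard_normal(d_e)
    z_bar /= np.linalg.norm(z_bar)          # fixed unit-norm z_bar, as in the experiment
    hits = 0
    for _ in range(trials):
        B = rng.standard_normal((d_e, d))
        # Feasibility of {y : B y = z_bar, -delta_bar <= y_i <= delta_bar} via an LP
        res = linprog(c=np.zeros(d), A_eq=B, b_eq=z_bar,
                      bounds=[(-delta_bar, delta_bar)] * d, method="highs")
        hits += (res.status == 0)           # status 0: a feasible point was found
    return hits / trials

for gap in range(4):                        # d - d_e = 0, 1, 2, 3
    print(gap, estimate_L_star(d_e=3, d=3 + gap, delta_bar=2.2))
```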
Numerical findings. The plots in Figure 4, confirming the conclusions of Corollary 4.2, suggest that the variation in the success of (RP) is mainly determined by the value of d − d_e; the larger the difference, the higher the probability of success of (RP) for a given δ̄. These curves, being independent of μ, can be used to find a suitable δ̄ for any problem for the corresponding values of d − d_e; the size of the box Y, namely δ, can then be set to Mδ̄ if an upper bound M on μ is known.
Choosing d and δ in practice. When it comes to the numerical application of (RP) in practice, the initialization of the parameters d and δ might be problematic. From the theoretical discussions above we learned that the parameters d and δ must be defined based on d_e and μ, the values of which are typically unknown in practice, for example, for black-box functions. We circumvent this issue by estimating d_e and μ rather than trying to calculate their exact values; note that all we need is an upper bound d on d_e. The parameter d_e, or an upper bound on it, may be known from prior studies or can be found with active subspace identification methods (see, e.g., [8]); these use gradients of f to estimate d_e.

Algorithm 1 Random Embeddings for Global Optimization (REGO) applied to (P)
1: Initialize d and δ, and generate a D × d Gaussian random matrix A.
2: Formulate the reduced problem (RP) with Y = [−δ, δ]^d.
3: Apply a global optimization solver (e.g. BARON, DIRECT, KNITRO) to (RP) until a termination criterion is satisfied, and define y_min to be the generated (approximate) solution of (RP).
4: Reconstruct x_min = Ay_min.
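A minimal sketch of Algorithm 1 follows, with scipy's dual_annealing standing in for the global solver in step 3 (the paper's experiments use DIRECT, BARON and KNITRO; the objective and parameter values below are illustrative assumptions):

```python
import numpy as np
from scipy.optimize import dual_annealing

def rego(f, D, d, delta, seed=0):
    """One run of Algorithm 1 (REGO) on f: R^D -> R."""
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((D, d))              # step 1: Gaussian embedding
    g = lambda y: f(A @ y)                       # step 2: reduced objective on Y
    res = dual_annealing(g, bounds=[(-delta, delta)] * d, seed=seed)  # step 3
    return A @ res.x, res.fun                    # step 4: x_min = A y_min

# Toy objective with d_e = 2: f depends on x only through two orthonormal directions.
rng = np.random.default_rng(42)
D = 100
Q, _ = np.linalg.qr(rng.standard_normal((D, D)))
f = lambda x: (Q[0] @ x - 1.0) ** 2 + np.sin(Q[1] @ x) ** 2

x_min, f_min = rego(f, D=D, d=4, delta=5.0, seed=1)
print(f_min)  # close to f* = 0 for most embeddings
```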
Estimating μ can be a harder task. A rough estimate for μ can be obtained if the search in the original space is restricted to a certain domain; a trivial upper bound in this case is given by the maximum distance between the origin and the boundary of the domain. The search domain that is commonly imposed to solve unconstrained optimization problems in practice is a box, such as X = [−1, 1]^D, for which μ ≤ √D. In Appendix C, we test REGO assuming that √D is the best known bound for μ. To compensate for an unknown μ, one could also try increasing δ or d gradually to explore larger regions of R^D.

Testing REGO with state-of-the-art global solvers
Algorithms. The algorithm for the random embeddings method named REGO (Random Embeddings for Global Optimization) is outlined in Algorithm 1. Below, we give the descriptions of the three state-of-the-art solvers we use to test REGO.
DIRECT ([18,24,16]) version 4.0 (DIviding RECTangles) is a deterministic global optimization solver (here, we refer to the predictable behaviour of the solver given a fixed set of parameters), first introduced in [24] as an extension of Lipschitzian optimization. DIRECT does not require information about the gradient nor about the Lipschitz constant and, hence, can be used for black-box functions. DIRECT divides the search domain into rectangles and evaluates the function at the centre of each rectangle. Based on the previously sampled points, DIRECT carefully decides what rectangle to divide next, balancing between local and global searches. Jones et al. [24] showed that DIRECT is guaranteed to converge to the global minimum, but convergence may sometimes be slow.
BARON ([37,40]) version 17.10.10 (Branch-And-Reduce Optimization Navigator) is a branch-and-bound type global optimization solver for non-linear and mixed-integer non-linear programs. To provide lower and upper bounds for each branch, BARON utilizes the algebraic structure of the objective function. It also includes a preprocessing step in which it performs a multi-start local search to obtain a tight global upper bound. In comparison to other existing global solvers, BARON was demonstrated to be the most robust and fastest (see [30]). However, BARON accepts only a few (general) classes of functions, including polynomial, exponential and logarithmic ones; for instance, it cannot be applied to problems that include trigonometric functions and, unlike DIRECT, it is unable to optimize black-box functions.
KNITRO ([6]) version 10.3.0 is a large-scale non-linear local optimization solver capable of handling problems with hundreds of thousands of variables. KNITRO allows one to solve problems using one of four algorithms: two interior-point methods (direct and conjugate gradient) and two active-set methods (active set and sequential quadratic programming). In contrast to BARON and DIRECT, which specialize in finding global minima, KNITRO focuses on finding local solutions. Nonetheless, KNITRO has multi-start capabilities, i.e., it solves a problem locally multiple times, each time starting from a different point in the feasible domain. It is this feature that we make use of in the experiments.
Generating the test set. Our test set of functions with low effective dimensionality will be derived from 19 global optimization problems (of dimensions 2-6) with known global minima [20,14,39], some of which are from the Dixon-Szego set [10]. The list of the problems is given in Table 3, Appendix B.
Below we describe the method, adopted from Wang et al. [43], to generate high-dimensional functions with low effective dimensionality. Let ḡ(x̄) be any function from Table 3; let d_e be its dimension and let the given domain be scaled to [−1, 1]^{d_e}. We create a D-dimensional function g(x) by adding D − d_e fake dimensions to ḡ(x̄),

  g(x) = ḡ(x̄) + 0·x_{d_e+1} + 0·x_{d_e+2} + ··· + 0·x_D.

We further rotate the function by applying a random orthogonal matrix Q to x to obtain a non-trivial constant subspace. The final form of the function we test is given as

  f(x) = g(Qx).    (5.2)

Note that the first d_e rows of Q now span the effective subspace T of f(x). Furthermore,

  G_1 = {x ∈ R^D : (Qx)_{1:d_e} ∈ G_2},

where G_1 and G_2 are the sets of global minimizers of f and ḡ, respectively. For each problem in the test set, we generate three functions f as defined in (5.2), one for each D = 10, 100, 1000. We will tackle (P) for each f both directly (we call this 'no embedding') and applying REGO as outlined in Algorithm 1.
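A sketch of the construction (5.2), with Himmelblau's function as a stand-in for a d_e-dimensional entry of Table 3:

```python
import numpy as np

def make_test_function(g_bar, d_e, D, seed=0):
    """Build f(x) = g_bar of the first d_e entries of Qx, as in (5.2)."""
    rng = np.random.default_rng(seed)
    Q, _ = np.linalg.qr(rng.standard_normal((D, D)))  # random orthogonal matrix
    def f(x):
        return g_bar((Q @ x)[:d_e])  # the remaining D - d_e rotated coordinates are ignored
    return f, Q

# Stand-in low-dimensional test function (Himmelblau, d_e = 2)
g_bar = lambda u: (u[0] ** 2 + u[1] - 11) ** 2 + (u[0] + u[1] ** 2 - 7) ** 2

f, Q = make_test_function(g_bar, d_e=2, D=1000, seed=7)
x = np.random.default_rng(8).standard_normal(1000)

# f is constant along any direction orthogonal to the first two rows of Q
v = Q[5]                                  # a direction inside the constant subspace
print(np.isclose(f(x), f(x + 3.0 * v)))   # True
```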
Experimental setup (REGO). We compare 'no embedding' and REGO using the three solvers above. Let g_i, s_j, n_j and D_k denote the ith function in the problem set (g_1 = Beale, etc., see Table 3), the jth solver (s_1 = DIRECT, s_2 = BARON, s_3 = KNITRO), the total number of problems in the problem set solvable by the jth solver (n_1 = 19, n_2 = 15, n_3 = 18) and the kth ambient dimension (D_1 = 10, D_2 = 100, D_3 = 1000), respectively. Let f_ik denote the D_k-dimensional function with low effective dimensionality constructed from g_i as described previously.
Within the 'no embedding' framework, for each pair (s_j, D_k), we solve f_ik for i = 1, 2, ..., n_j with solver s_j and record the proportion of the problems that attain convergence (see the definition in Table 1).
For each f_ik (1 ≤ i ≤ n_j, 1 ≤ k ≤ 3), we apply REGO 100 times, each time with a different realization of the Gaussian matrix. Thus, in total, for each pair (s_j, D_k) we solve n_j × 100 problems. We record the proportion of problems that attain convergence (see Table 1) out of these n_j × 100 problems.

We also record the number of function evaluations (for DIRECT and KNITRO) and the CPU time (for all three solvers) spent before termination within the two frameworks. For each (s_j, D_k), function evaluations and time are averaged over the n_j × 100 problems within REGO and over the n_j problems within 'no embedding'.
We conduct the above experiment for REGO with four pairs of parameters (d, δ), one for each of the gaps d − d_e = 0, 1, 2, 3 considered in Figure 4; in each case, the value for δ was chosen as the smallest δ that gives at least a 90% chance of success based on the corresponding curve of R* in Figure 4.
Experimental setup (solvers). Due to the difference in algorithmic procedures of the solvers, they allow different budget constraints and have different convergence and termination criteria; we present these in Table 1.

Numerical results
(RP) successful. We record the proportion of instances for which (RP) is successful. Table 2 presents these percentages for each particular choice of d and D, averaged over the 19 problems in the test set. We observe that the percentages are very high and appear to be independent of D, supporting the conclusions of Corollary 4.2.

DIRECT (Figure 5). For all four initialisations of REGO, we observe that the average proportions of problems that attained convergence (see the definition in Table 1) are invariant with respect to the ambient dimension. This frequency of convergence is higher within 'no embedding' for D = 10, 100, but exhibits a significant drop for D = 1000. The average function evaluation count is maintained within REGO, but doubles within 'no embedding' for a tenfold increase in D. Growth in CPU time takes place within both frameworks, being highest for 'no embedding'.

BARON (Figure 6). In comparison with 'no embedding', the frequency of convergence_opt is higher within REGO in most cases. We note that both of BARON's measures, convergence and convergence_opt, exhibit invariance with respect to D within REGO. As for 'no embedding', we observe a decrease in the frequencies of both convergence and convergence_opt as D grows. In addition, we observe an increase in the CPU time spent within 'no embedding', whilst the time is almost constant within REGO.
KNITRO (Figure 7). We see that the proportion of solved problems is invariant with respect to the ambient dimension within REGO and, surprisingly, within 'no embedding' as well. However, the average number of function evaluations and the time spent differ significantly between the two frameworks. With REGO, the average number of function evaluations remains at the same level for all D. The average time grows within both frameworks, but at a higher rate for 'no embedding'; for D = 1000 it differs by a factor of 70 in favour of REGO. We attribute the growth in time within REGO to the more costly function and derivative evaluations for larger D.

Conclusions and future work
We study a general algorithmic framework for functions with low effective dimensionality that solves the reduced problem (RP) using a single Gaussian random embedding and a(ny) general global optimization solver. Our precise theoretical findings, backed by the numerical experiments, show that the success of (RP) is essentially independent of D and mainly depends on the gap between the embedding dimension d and the dimension d_e of the effective subspace, and on the ratio between the size of Y (namely δ) and μ (the Euclidean distance from the origin to the closest affine subspace of minimizers). REGO with three standard global solvers produced high frequencies of convergence, generally outperforming the respective solver's performance when applied directly to the problems (without the dimensionality reduction) in terms of the proportion of problems solved and/or computational cost.
Our in-depth investigations are conceptual in nature, and there is clearly more work to be done to make this framework practically applicable to global optimization problems with special structure. In particular, as outlined on page 16, our REGO approach depends on knowing (an upper bound d on) the effective dimension d_e. Future work may include estimating d_e prior to optimizing, noting that REGO does not need to learn the entire effective subspace, only its dimension. One could also estimate d or d_e numerically, as proposed in [38], where d is gradually increased until no significant changes in the best function value found are observed. Our theoretical choices for δ also depend on μ, which again needs estimating. In this case, choosing a box domain for f would provide a rough estimate for μ (as discussed on page 16), with the remark that REGO cannot (yet) guarantee feasibility with respect to given bounds. To achieve the latter, one needs either to add projection operators as in [43,4,5], or to include the problem constraints in the formulation of (RP) and allow multiple random embeddings as in [34]. Alternative potential directions include investigating other random or deterministic matrix choices for the embeddings, as considered for example in [29]. Real-life problems are often only approximately low-dimensional, and so their optimization requires further extensions and analysis of the random embedding framework.
Theorem A.2 (see [21, Theorem 2.3.10]). Let A be a D × d Gaussian random matrix. If U ∈ R^{D×p}, D ≥ p, and V ∈ R^{d×q}, d ≥ q, have orthonormal columns, then U^T AV is a Gaussian random matrix.
A related notion that plays an important role in the study of Gaussian matrices is the Wishart distribution, represented by the matrix A^T A (or AA^T), where A is an overdetermined (respectively, underdetermined) Gaussian matrix. A Wishart matrix is positive definite with probability 1:

Theorem A.3. Let A be a D × d Gaussian random matrix with D ≥ d. Then, the Wishart matrix A^T A is positive definite with probability 1.
The immediate consequence of the result is that the Wishart matrix is nonsingular with probability 1.
A.2 Chi-squared random variable

Definition A.4 (Chi-squared random variable). Given a collection Z_1, Z_2, ..., Z_n of n independent standard normal variables, a random variable X = Z_1² + Z_2² + ··· + Z_n² is said to follow the chi-squared distribution with n degrees of freedom. We denote this by X ∼ χ²_n.
The following lemma provides a notable relationship between the inverse of the Wishart matrix and the chi-squared random variable.

Lemma A.5. Let W be a p × n Gaussian random matrix with p ≤ n and let z ∈ R^p be a fixed nonzero vector. Then

  (z^T z) / (z^T (WW^T)^{−1} z) ∼ χ²_{n−p+1}.
In the following lemma we derive an upper bound for the cumulative distribution function (c.d.f.) of the chi-squared random variable.

Lemma A.6. Let X ∼ χ²_n. Then, for any x > 0,

  P[X ≤ x] ≤ (x/2)^{n/2} / Γ(n/2 + 1).

Proof. Bounding e^{−t/2} ≤ 1 in the density of X gives

  P[X ≤ x] = ∫₀ˣ t^{n/2−1} e^{−t/2} / (2^{n/2} Γ(n/2)) dt ≤ ∫₀ˣ t^{n/2−1} / (2^{n/2} Γ(n/2)) dt = (x/2)^{n/2} / Γ(n/2 + 1).
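A quick numerical sanity check of the bound (as reconstructed above) against the exact c.d.f.:

```python
from math import gamma
from scipy.stats import chi2

n = 4
for x in (0.1, 0.5, 1.0, 2.0, 4.0):
    exact = chi2.cdf(x, df=n)
    bound = (x / 2) ** (n / 2) / gamma(n / 2 + 1)  # Lemma A.6
    assert exact <= bound
    print(x, exact, bound)
```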

A.3 The inverse chi-squared random variable
Definition A.7 (Inverse chi-squared random variable). Given X ∼ χ 2 n , a random variable Y = 1/X is said to follow the inverse chi-squared distribution with n degrees of freedom. We denote this by Y ∼ 1/χ 2 n (see [27, A5]).
A.4 Spherical distributions

Definition A.10 (Spherical distribution). A D × 1 random vector x is said to follow a spherical distribution if Sx law= x for every D × D orthogonal matrix S.

Lemma A.11 (see [15, p. 13]). Let x and y be random vectors such that x law= y and let f_i(·), i = 1, 2, ..., m, be measurable functions. Then,

  (f_1(x), f_2(x), ..., f_m(x)) law= (f_1(y), f_2(y), ..., f_m(y)).

Lemma A.12. A D × 1 random vector x follows a spherical distribution if and only if

  x law= r u,

where u is distributed uniformly on the unit sphere S_D and r is a univariate random variable independent of u.

Theorem A.13. Let x law= ru be a spherically distributed D × 1 random vector with P[x = 0] = 0, where r ≥ 0 is independent of u. Then, r law= ‖x‖.

Theorem A.14 (see [22, Theorem 2.1]). Let x law= ru be a spherically distributed D × 1 random vector with P[x = 0] = 0, where r is independent of u with p.d.f. h(·). Then, the p.d.f. g(x) of x is given by

  g(x) = (Γ(D/2) / (2 π^{D/2})) ‖x‖^{1−D} h(‖x‖).

For more details regarding spherical distributions refer to [15,21,3].
A.5 The least Euclidean norm solution to the random linear system

The present section establishes key properties of the least Euclidean norm solution to the underdetermined random linear system.
(C) Let B̄ be a d_e × d Gaussian matrix, where d_e ≤ d, and let z ∈ R^{d_e} be a fixed nonzero vector. Denote by y_2 the least 2-norm solution to B̄y = z.
Lemma A.15. Given (C), y_2 satisfies

  ‖y_2‖²_2 law= ‖z‖²_2 / χ²_{d−d_e+1}.

Proof. The least Euclidean norm solution y_2 to B̄y = z is given by

  y_2 = B̄^T (B̄B̄^T)^{−1} z.

For its Euclidean norm, we have

  ‖y_2‖²_2 = z^T (B̄B̄^T)^{−1} B̄ B̄^T (B̄B̄^T)^{−1} z = z^T (B̄B̄^T)^{−1} z.

Using Lemma A.5 we obtain the desired result:

  ‖y_2‖²_2 = ‖z‖²_2 · (z^T (B̄B̄^T)^{−1} z) / (z^T z) law= ‖z‖²_2 / χ²_{d−d_e+1}.

Lemma A.16. Given (C), y_2 follows a spherical distribution.
Proof. Let S be any d × d orthogonal matrix and let f : R^{d_e × d} → R^{d × 1} be the vector-valued function defined as

  f(M) := M^T (MM^T)^{−1} z,

so that y_2 = f(B̄). By Cramer's rule, each component of f(M) can be written as a ratio p_i/q of polynomials in the entries of M, where q := det(MM^T). We first would like to prove that f is a measurable function. Recall that a function is measurable if and only if each of its components is measurable. It is enough to show that p_1/q is measurable; the same argument applies to the rest of the components. First, we note that (i) p_1 and q are measurable; (ii) q is non-zero almost everywhere.
To prove (i), observe that the polynomials p_1 and q are sums of scalar multiples of products of standard normal random variables, which by definition are measurable. Sums, scalar multiples and products of measurable functions are measurable; hence, p_1 and q must be measurable. To prove (ii), we refer to Theorem A.3, which says that the matrix B̄B̄^T is positive definite with probability 1, implying that all of its eigenvalues are strictly positive with probability 1. Then, (ii) follows from the fact that the determinant of a symmetric square matrix is equal to the product of its eigenvalues. Now, we can apply [45, Theorem 4.10] to deduce that p_1/q is measurable; this completes the proof that f is measurable. For y_2 = B̄^T (B̄B̄^T)^{−1} z, we have

  S y_2 = S B̄^T (B̄B̄^T)^{−1} z = f(B̄S^T) law= f(B̄) = y_2,

where the distributional equality follows from Lemma A.11, applied with the measurable function f, and the fact that B̄S^T law= B̄ (by Theorem A.2). Hence, Sy_2 law= y_2 for every orthogonal S, and y_2 follows a spherical distribution by Definition A.10.
Lemma A.17. Given (C), the probability density function of y_2 is given by

  ḡ(ȳ) = 2^{−n/2} π^{−d/2} (Γ(d/2)/Γ(n/2)) ‖z‖^n (ȳ^T ȳ)^{−(n+d)/2} e^{−‖z‖²/(2 ȳ^T ȳ)},

where n = d − d_e + 1.
Proof. The fact that y_2 has a spherical distribution is a key ingredient in the proof. To simplify the derivations, let us assume for now that ‖z‖ = 1. By Lemma A.12,

  y_2 law= r u,

where r is a univariate random variable and where u is a random vector distributed uniformly on S_d; moreover, r and u are independent. Our first goal is to show that r and ‖y_2‖ have the same distribution. This fact follows immediately from Theorem A.13 if we show that P[y_2 = 0] = 0. Let W ∼ 1/χ²_{d−d_e+1}. According to Lemma A.15, we have ‖y_2‖²_2 law= W, so that P[y_2 = 0] = P[W = 0] = 0 and r law= ‖y_2‖ law= √W. The p.d.f. of W is f_W(w) = (2^{−n/2}/Γ(n/2)) w^{−n/2−1} e^{−1/(2w)} for w > 0, and hence the p.d.f. of r is

  h(r) = 2 r f_W(r²) = (2^{1−n/2}/Γ(n/2)) r^{−(n+1)} e^{−1/(2r²)}.    (A.4)

By using (A.4) for h(·) in Theorem A.14, we obtain

  g(ŷ) = 2^{−n/2} π^{−d/2} (Γ(d/2)/Γ(n/2)) (ŷ^T ŷ)^{−(n+d)/2} e^{−1/(2 ŷ^T ŷ)}.    (A.5)

To derive the p.d.f. for an arbitrary non-zero z, we consider the linear transformation ȳ = ‖z‖ ŷ. The Jacobian of the transformation is equal to 1/‖z‖^d. Thus, the p.d.f. ḡ(ȳ) of ȳ satisfies

  ḡ(ȳ) = g(ȳ/‖z‖) / ‖z‖^d,

which together with (A.5) yields the desired result.

B Test problems

Table 3 contains the explicit formula, domain and global minimum of the functions used to generate the high-dimensional test set. The problem set contains 19 problems taken from [20,14,39]. Problems that cannot be solved by BARON are marked with '*'. Problems that will not be solved by KNITRO are marked with '•'.

[Table 3: The problem set listed in alphabetical order.]

C Additional experiments
We conducted three more experiments to test REGO's robustness to changes in the parameters. In all three experiments, the same budget and termination criteria as in the main experiment are used.
(A) In this experiment, we assume that no good estimate for μ is known and that μ can be as large as √D (for example, when the constraint X = [−1, 1]^D is imposed). We test REGO with δ chosen as in the main experiment, but with the upper bound μ ≤ √D used in place of μ ≤ √d_e (so that δ is now D-dependent). In the figures we also include curves for δ_opt = 2.2√d_e taken from the main experiment. Results are presented in Figure 10 in Appendix C.

Conclusions
(A) We test the robustness of REGO assuming that μ is equal to √D (which makes δ relatively large and dependent on D). Despite this dependence, the frequency of convergence for BARON and KNITRO remains high, showing only a mild dependence on D.
(B) The purpose of this experiment is to see how different values of d affect the performance of REGO while δ is kept constant. For larger d, we expect (RP) to be successful with a higher chance. Nonetheless, the results show that sometimes, for larger d, REGO's performance may be compromised; this is, for example, true for BARON's convergence_opt. Since δ is set to a relatively large value, (RP) is successful with high probability even for the smallest d. This suggests that, as long as d and δ produce a relatively high chance of success of (RP), one should stop increasing their values lest convergence to the global minimum require larger computational resources.
(C) In this experiment, we apply REGO with different values of δ while keeping d constant.
The results display no significant differences between the performances with the different parameters. Even the results with the optimal δ (used in the main experiment) do not differ considerably from those with the largest δ, except for BARON, where the former wins in terms of convergence_opt and CPU time. The results of this experiment, together with the results in (B), indicate that it is better to increase δ and keep d constant if one wants to increase the success of (RP) with minimal increase in computational cost.