SURVEY OF TRUST-REGION DERIVATIVE FREE OPTIMIZATION METHODS

In this survey article we give a basic description of interpolation based derivative free optimization methods and their variants. We review recent contributions dealing with maintaining the geometry of the interpolation set, the management of the trust-region radius and the stopping criteria. Derivative free algorithms developed for problems with structure, such as partially separable functions, are discussed. Two different derivative free algorithms are applied to the optimization of the geometric configuration of a stirrer. Numerical results are presented to show the applicability of the algorithms to practical problems.

1. Introduction. Derivative free optimization (DFO) methods are designed for solving nonlinear optimization problems without constraints where the derivatives of the objective function are not available. We consider formally the problem

min_{x ∈ R^n} f(x),

where f is a smooth nonlinear objective function from R^n into R and is bounded below. We assume that the gradient ∇f(x) and the Hessian ∇²f(x) cannot be computed for any x.
There is a high demand from practitioners for such algorithms because this kind of problem occurs relatively frequently in industry. In applications, either the evaluation of the objective function f(x) is very difficult or expensive, or the derivatives of f are not available. The latter situation occurs when the computation of f(x) at a given point x results from some physical, chemical or econometrical experiment or measurement, or is the result of a large and expensive computer simulation whose source code is either not available or unmodifiable, so that it can be considered a black box. In practice the value of f(x) may be contaminated with noise or may be non-smooth; we do not consider these cases in our presentation.
There are mainly four classes of derivative-free optimization methods. The first class of DFO algorithms are the direct search or pattern search methods, which are based on the exploration of the variable space using sample points from a predefined geometric pattern and use either the Nelder-Mead simplex algorithm or a parallel direct search algorithm. They do not exploit the inherent smoothness of the objective function and therefore require a very large number of function evaluations; they can be useful for non-smooth problems. A comprehensive survey of these methods can be found in [16]. The second class of DFO methods are line search methods, which consist of a sequence of n + 1 one-dimensional searches, introduced by Powell [21]. The combination of finite difference techniques coupled with quasi-Newton methods constitutes the third class of algorithms [22]. The last class of methods is based on modeling the objective function by multivariate interpolation in combination with trust-region techniques. These methods were introduced by Winfield [30], [31] and by Powell [23], [25]. In this survey we consider this class of DFO algorithms. The main idea of these methods is to build a polynomial model interpolating the objective function at all points at which its value is known. The model is then minimized over the trust region and a new point is computed. The objective function is evaluated at this new point, possibly enlarging the interpolation set. The newly computed point is checked as to whether the objective function is improved, and the whole process is repeated until convergence is achieved.
The main criterion for assessment and comparison of DFO algorithms is the number of function evaluations they require for minimization. The use of all DFO algorithms mentioned above is limited to a moderate number of variables, e.g., 20 variables. In the direct search methods, the number of grid points required to fill the whole sample space can be very large. Similarly, for interpolation methods a total of (n + 1)(n + 2)/2 known function values would be necessary for a full quadratic model.
The quadratic model of the objective function around the current iterate x_k within the trust region B_k = {x ∈ R^n : ||x − x_k|| ≤ ∆_k} with the trust-region radius ∆_k may be given as

m_k(x_k + s) = f(x_k) + ⟨g, s⟩ + (1/2)⟨s, H s⟩,   (1)

for some g ∈ R^n and some symmetric n × n matrix H, where ⟨·, ·⟩ denotes the inner product. The vector g and the matrix H do not necessarily correspond to the first and second derivatives of the objective function f. They are determined by requiring that the model (1) interpolates the function f at a set Y = {y_i} of points containing the current iterate:

m_k(y_i) = f(y_i) for all y_i ∈ Y.   (2)

Here, Y denotes the set of interpolation points, which is a subset of the set of points at which the value of f is known, including the current iterate. Building the full quadratic model (1) requires the determination of f(x_k), the components of the vector g and the entries of the matrix H, so that the cardinality of Y must be equal to p = (n + 1)(n + 2)/2. However, if n > 1, condition (2) is not sufficient for the existence and uniqueness of the interpolant, nor to guarantee the good quality of the model. Geometric conditions (known as poisedness) on the set Y are required to ensure the existence and uniqueness of the interpolant, i.e., that the points of Y do not collapse into a lower dimensional subset or lie on a quadratic curve. If {φ_i(·)}_{i=1}^p denotes a basis of the linear space of n-dimensional quadratics, the model (1) and the interpolation conditions (2) can be written as the linear system

Σ_{j=1}^p α_j φ_j(y_i) = f(y_i),  i = 1, ..., p,   (3)

with coefficient matrix M(φ, Y) = (φ_j(y_i)). If the ratio

ρ_k = (f(x_k) − f(x_k + s_k)) / (m_k(x_k) − m_k(x_k + s_k))

is sufficiently positive, the iteration is successful: x_{k+1} = x_k + s_k is taken as the next iterate and the trust-region radius ∆_k is enlarged. If ρ_k is not sufficiently positive, the iteration was not successful: the current x_k is kept and the trust-region radius is reduced. The general form of the DFO algorithm, with given constants 0 < η_1 ≤ η_2 < 1, can be stated as follows [8]. At each iteration, compute f(x_k^+) at the trial point x_k^+ = x_k + s_k and the ratio ρ_k.
• Step 5: Update the current iterate. Determine the best point x̂_k ∈ Y with f(x̂_k) = min_{y_i ∈ Y} f(y_i); if it improves on f(x_k), take it as the next iterate. Increment k by one and go to Step 1. There are three particular issues with the implementation of this method. The first one is the acceptance of a new iterate. In the standard trust-region method, the new point produced by the algorithm is accepted as soon as f(x_k^+) is sufficiently smaller than f(x_k), the current objective value. In the case of DFO algorithms the situation is more complex, because to include x_k^+ in the set of interpolation points Y we have to remove another point from Y. Only in the early stages of the iteration, where the models are incomplete (less than full quadratic or linear models), are all new points x_k^+ included. Otherwise, one has to take care of the geometry of Y: the new iterate should not deteriorate the geometry of Y, and the new interpolation set should remain "as poised as possible". There is no guarantee that the quality of the geometry and poisedness remains acceptable after the inclusion of the new point.
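To make the construction concrete, here is a minimal numerical sketch of the interpolation conditions (2): for p = (n + 1)(n + 2)/2 poised points, the model coefficients follow from a single linear solve of system (3). The function names and the condition-number threshold are illustrative choices of ours, not taken from any of the packages discussed in this survey.

```python
import numpy as np

def quadratic_basis(x):
    """Monomial basis of n-variate quadratics: 1, x_i, x_i * x_j (i <= j)."""
    n = len(x)
    terms = [1.0] + list(x)
    for i in range(n):
        for j in range(i, n):
            terms.append(x[i] * x[j])
    return np.array(terms)

def build_model(Y, fvals):
    """Solve the interpolation system (3): M alpha = f, with M_ij = phi_j(y_i).
    Y is a p x n array of interpolation points, p = (n + 1)(n + 2) / 2."""
    M = np.array([quadratic_basis(y) for y in Y])
    # A huge condition number signals a badly poised interpolation set Y.
    if np.linalg.cond(M) > 1e12:
        raise ValueError("interpolation set Y is (nearly) non-poised")
    return np.linalg.solve(M, fvals)

def eval_model(alpha, x):
    """Evaluate the interpolating quadratic model at x."""
    return float(alpha @ quadratic_basis(x))
```

Since a quadratic interpolated at (n + 1)(n + 2)/2 poised points is reproduced exactly, the model matches any quadratic objective everywhere, which is a convenient sanity check for such an implementation.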
The interpolation set must be maintained at each iteration, either by including a new iterate into the interpolation set or by improving the model. In order to keep the interpolation set poised at each iteration, error estimates for the function and its derivatives are needed. In [10], using Taylor-type error bounds between the model m(x) and the objective function f(x) for multivariate interpolation, the authors derive practical measures defining poisedness. The error in the gradient can be estimated by a bound of the form

||∇f(x) − ∇m(x)|| ≤ G Λ_Y max_{y_i ∈ Y} ||y_i − x||^d

for x in a region containing Y. Here, d denotes the degree of the interpolating polynomial, G depends only on the function f, and Λ_Y is a constant depending on the interpolation set Y. For the convergence of the DFO methods, the error in the gradient should go to zero when the distances between all y_i and x go to zero, and Λ_Y has to remain uniformly bounded for all interpolation sets. For Lagrange interpolation polynomials L_i(x), Λ_Y describes an upper bound of the form |L_i(x)| ≤ Λ_Y, where i = 1, ..., p and x varies in the convex hull of Y. The set Y can be maintained as poised by choosing the points which are to leave or to enter Y so that this bound is reduced [24]. In Powell's algorithm a point is replaced by a new one at which the absolute value of the corresponding Lagrange polynomial is maximal.
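Powell's replacement rule sketched above can be mimicked numerically: the Lagrange polynomial values l_i(x⁺) follow from one linear solve against the transposed interpolation matrix, and the point with the largest |l_i(x⁺)| is the candidate to leave Y. This is an illustrative sketch (function names are ours), not the actual code of any package discussed here.

```python
import numpy as np

def quadratic_basis(x):
    """Monomial basis 1, x_i, x_i * x_j (i <= j) of n-variate quadratics."""
    n = len(x)
    terms = [1.0] + list(x)
    for i in range(n):
        for j in range(i, n):
            terms.append(x[i] * x[j])
    return np.array(terms)

def lagrange_values(Y, x):
    """Values l_i(x) of the Lagrange polynomials of the set Y at x,
    obtained from M^T l = phi(x), where M_ij = phi_j(y_i)."""
    M = np.array([quadratic_basis(y) for y in Y])
    return np.linalg.solve(M.T, quadratic_basis(x))

def replacement_index(Y, x_plus, keep):
    """Powell-style rule (sketch): drop the point maximizing |l_i(x_plus)|,
    never dropping index `keep` (the current iterate)."""
    l = np.abs(lagrange_values(Y, x_plus))
    l[keep] = -np.inf
    return int(np.argmax(l))
```

By construction l_i(y_j) = δ_ij, so at an interpolation point the vector of Lagrange values is a unit vector, which makes the routine easy to verify.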
For Newton polynomials a similar approach is followed. When Newton polynomials are chosen, from a theoretical point of view the pivots of the matrix (3), i.e., the values |N_i(y_i)| of the non-normalized Newton fundamental polynomials at their associated interpolation points, must be non-zero to ensure poisedness. In practice, however, it may be sufficient to require |N_i(y_i)| ≥ θ for some small threshold θ > 0. If this condition is satisfied, the interpolation problem is well poised. If it is not satisfied, the geometry of the interpolation set has to be improved: a point y⁻ = y_i ≠ x_k is replaced by another point y⁺ chosen so that the corresponding pivot value is as large as possible. In [10] both approaches are criticized because they do not guarantee uniform bounds on Λ_Y, and algorithms are proposed for maintaining a well-poised interpolation set. In these algorithms the set of polynomials has to be recomputed at each update of Y, but not the interpolation set itself.
Because the condition number of the matrix (3) depends on the choice of the basis functions φ_i(x), it cannot directly be used as an indicator of the well-poisedness of the interpolation set. However, the condition number of a scaled version of the matrix (3) can be used with the Lagrange polynomials, as shown in [10], for deriving Taylor-like bounds for the function and its derivatives. This approach is named Λ-poisedness.
The second issue concerns the management of the trust-region radius. In classical trust-region algorithms, the radius ∆_k is decreased when no significant reduction of the objective function is achieved. It can happen that a very small trust-region radius leads to stagnation of the iteration. In the case of DFO algorithms, one has to check first whether the interpolation set Y is poised or not, because an inadequate geometry may be the reason for the insufficient reduction of the objective function. If Y is not poised, its geometry should be improved before the trust-region radius is reduced. It is also important not to modify ∆_k in the early stages of the iteration, because an improvement can often be obtained by including a new point y⁺ with ||y⁺ − x_k|| ≤ ∆_k and replacing some past point y⁻ ∈ Y \ {x_k} by y⁺.
The third issue deals with the stopping criteria. At each iteration we have to apply a convergence test. In classical trust-region algorithms it is sufficient to check that the gradient of f is small enough at the current iterate x_k. In the case of DFO methods the gradient ∇f(x) is not available. If the model is sufficiently accurate, one can assume that ||g_k|| ≈ ||∇f(x_k)||. Whenever the quadratic model m_k has been updated, ||g_k|| can be computed and checked against a tolerance (||g_k|| ≤ ε_g for some ε_g > 0). If this is the case, it remains to verify that the model is sufficiently accurate in a ball centered at x_k whose radius is proportional to ||g_k||, with a proportionality constant κ independent of x [4]. This condition can be verified in a finite number of steps.
3. Derivative-free methods for partially separable functions. The use of DFO methods is limited in practice to moderate size problems with about 20 variables. When the number of variables grows, the DFO methods become impractical. But many problems arising in industry have a structure which appears in the form of blocks or parts. Each block has its own variables, which are linked with the variables of other blocks. A collection of chemical reactor tanks, each with its own temperature and pressure, can be considered as such a system. Discretizations of PDEs (partial differential equations) with domain decomposition techniques are another example of this type: each single variable is only connected to a few neighbors. In order to make DFO methods work on such problems, the underlying structure of the objective function should be exploited to minimize the number of function evaluations. The structure in many of the problems mentioned above can be captured by the notion of partial separability introduced in [13], [15], which may be expressed as

f(x) = Σ_{i=1}^q f_i(U_i x),   (7)

where f : R^n → R and each individual function f_i depends on the internal variables U_i x, with U_i an n_i × n matrix (n_i << n). In many cases the matrix U_i is a row subset of the identity matrix and the internal variables form a small subset of the variables of the original objective function; the size of the ith subset is given by n_i. The concept of partial separability in connection with optimization problems was introduced by Griewank and Toint in [12], [14] and studied in the framework of the development of the LANCELOT package [5]. It was shown in [12] that every twice continuously differentiable function with a sparse Hessian is partially separable. This is only an existence result, however, and specifies the decomposition (7) only indirectly; for problems with a sparse Hessian, explicit partially separable decompositions are discussed in [4], [28]. A special class of partially separable functions which occurs in unconstrained optimization is that with a banded and sparse Hessian of the objective function [2], [3].
In the trust-region algorithm for partially separable functions, a quadratic model is built for each individual function f_i:

m_{i,k}(x_k + s) = f_i(x_k) + ⟨g_i, U_i s⟩ + (1/2)⟨U_i s, H_i U_i s⟩,   (6)

where each g_i ∈ R^{n_i} and H_i is an n_i × n_i matrix. Each model m_{i,k} is an approximation of f_i around x_k and depends only on the n_i internal variables. After computing the models (6) by interpolating the individual functions, the quadratic model for the global objective function f at x_k is obtained as

m_k(x_k + s) = Σ_{i=1}^q m_{i,k}(x_k + s).

This model can then be minimized within the current trust region B_k. Building it requires at most (n_max + 1)(n_max + 2)/2 function evaluations, where n_max = max_{i=1,...,q} n_i. This indicates that the number of function evaluations is far smaller than for the unstructured problem, because n_max << n and n_max is often independent of n. The next problem is the management of the q interpolation sets Y_i, i = 1, ..., q. There are two possibilities. The objective function f can be evaluated as a whole by computing the collection of individual function values f_i(x) and the models m_{i,k}(x) for a given x; this would be very expensive in terms of function evaluations. The second strategy is the grouping of the necessary function evaluations, since an individual function f_i(x) can be computed independently of the others. When the geometry of the jth interpolation set Y_j is improved, this yields a vector with n_j 'useful' components, and the question arises which values should be assigned to the remaining n − n_j components. This is done in [4] by applying the CPR procedure of [11] for estimating sparse Jacobian matrices to the n × q occurrence matrix of the variables x_j in the elements f_i.
The reduction in function evaluations can be illustrated for an objective function with a tridiagonal Hessian: unstructured DFO requires (n + 1)(n + 2)/2 function evaluations, the sparse version of DFO needs 6n, and partially separable DFO only 6. In a series of papers, Colson and Toint [2, 3, 28] compared the unstructured algorithm UDFO with its structured versions for partially separable functions: PSDFO(I), which refers to the case where each individual function f_i(x) can be evaluated individually, and PSDFO(G), in which the individual functions are obtained by calling a single routine, so that the complete collection {f_i(x)}_{i=1}^q must be computed for every x. Numerical experiments on medium sized (n = 25) and large sized (n = 200) test problems show the benefits of exploiting the partial separability of the objective function; for many test problems PSDFO(I) requires less than half of the function evaluations required by PSDFO(G). For problems with sparse and banded Hessians, the model m_k can be viewed as a linear combination of 1 + n + n_H monomials, where n_H is the number of nonzeros in the lower triangular part of H(x) and is a small fraction of n; in this case the size of the interpolation set Y grows linearly rather than quadratically with n. Numerical results for such problems are reported in [2], where a comparison is made with Powell's sparse algorithms UOBSQA and UBDQA and with the sparse version of the structured PSDFO(G) of Colson and Toint [2, 28].
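The counts quoted above are easy to reproduce: a full quadratic model in n variables needs p = (n + 1)(n + 2)/2 interpolation points, so an unstructured method at n = 200 needs over twenty thousand evaluations, while a chain-structured problem with n_max = 2 gets by with the 6 points of a two-variable quadratic.

```python
def full_quadratic_points(n):
    """Number of interpolation points for a full quadratic model in n variables:
    one constant term, n linear terms, and n(n + 1)/2 second-order terms."""
    return (n + 1) * (n + 2) // 2
```

For the medium and large test sizes used by Colson and Toint this gives 351 points for n = 25 and 20301 points for n = 200, against 6 points per two-variable element.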

4. Implementation of derivative-free algorithms.
There are several implementations of interpolation based derivative free methods, which differ in detail. The oldest one is the DFO (Derivative Free Optimization) package developed by Conn and coworkers [6, 7, 8]. The more recently developed CONDOR (Constrained, Non-linear, Direct, parallel Optimization using trust Region method for high-computing load functions) package by Berghen [1] is based on the UOBYQA algorithm of Powell [23]. The packages UDFO and PSDFO of Colson and Toint [2], [4] were mentioned above. Although there are some extensions for constrained problems, these packages are developed mainly for unconstrained optimization problems and for simple (box) constraints.
The differences between the DFO package and CONDOR, which are used for the stirrer optimization in the next Section, can be summarized as follows:
• The DFO package uses Newton polynomials, while CONDOR uses Lagrange polynomials as the basis for the space of quadratic polynomials. The choice of Newton polynomials (in DFO) causes the matrix (3) to be lower triangular with a special block diagonal structure, while the use of Lagrange polynomials (in CONDOR) results in an identity matrix.
• The minimization of the (possibly incomplete) quadratic model in the DFO package is done by applying a standard optimization procedure, e.g., sequential quadratic programming (SQP) or IPOPT [29]. CONDOR uses the Moré and Sorensen algorithm [20] for the computation of the trust-region step and the minimization of the quadratic model, which is very stable.
• Furthermore, checking the validity of the model within a radius δ around the point x_k in CONDOR is based on whether one of the following conditions holds [1]: either

||x_j − x_k|| ≤ 2δ for all j,   (8)

or an error bound of the form (M/6) ||x_j − x_k||³ max_{||x − x_k|| ≤ δ} |P_j(x)| ≤ ε holds for the Lagrange interpolating basis {P_j},   (9)

where M is a bound on the third derivative of f, i.e., |d³/dα³ f(x + αv)| ≤ M for all v with ||v|| = 1 and α ∈ R. Condition (8) prevents the algorithm from sampling the model at (n + 1)(n + 2)/2 new points. The check of the validity of the model in DFO, in contrast, is mainly based on condition (9), which is not very often satisfied by the trust-region radius; hence, in some cases more function evaluations are needed in order to rebuild the interpolation polynomial.
• Another essential difference between the DFO and CONDOR algorithms is that the former uses the smallest Frobenius norm of the Hessian matrix when building the local model, which may cause numerical instability, whereas CONDOR uses the Euclidean norm, which is more robust.
• The DFO package uses interpolating polynomials which vary between linear and full quadratic; CONDOR always uses full quadratic polynomials.
• There is a parallel extension of CONDOR.
Another variant of the DFO algorithms is the so-called wedge trust-region method for linear and quadratic models [18]. This method differs from the others in that it minimizes the model within the trust region subject to an additional constraint whose aim is to improve the geometry of the interpolation set. It can be illustrated for the linear model: the step is computed from a problem of the form

min_s ⟨g_k, s⟩ subject to ||s|| ≤ ∆_k, |⟨b_k, s⟩| ≥ γ ||b_k|| ||s||,

where b_k is a suitable vector and γ ∈ (0, 1). The last inequality is called the wedge constraint; it keeps the new point away from the region that would make the interpolation set non-poised. This formulation is also given for the quadratic model, and the method is compared on several test examples in [18].

5. Application to stirrer optimization. The DFO package was successfully applied as a black-box optimization routine in optimizing energy systems [17] and in helicopter rotor blade design [9], where some practical aspects of the DFO algorithm are described. Numerical tests in both papers show that the DFO method works faster and more accurately than derivative based methods such as quasi-Newton. CONDOR was also compared with the DFO package on several test problems, and its applicability to some CFD simulations has been shown in [1].
Applications of numerical optimization techniques for fluid flow problems are found mostly in the field of aerodynamics. An overview of the subject is given by Mohammadi and Pironneau [19]. For general fluid flow applications the resulting nonlinear optimization problems are typically solved by employing descent algorithms based on gradient information. However, the computation of the gradient of the objective function with respect to the design variables is usually very costly and contaminated by noise. Usually the objective function f(x) is the result of very expensive computer simulations, and the derivatives may not be available at all.
The computer simulation of the fluid flow inside a stirrer by the CFD package FASTEST-3D is a prototype of such a problem. In the simulation of the complex discrete Navier-Stokes system, the local gradients of the objective function f(x) with respect to the design variables x are not provided by the flow solver and are not directly available. In such cases we may easily apply derivative-free optimization methods. In [26] we used the DFO package to optimize the geometric parameters of the stirrer, whereas in [27] the numerical results obtained with the CONDOR and DFO packages are compared. In this Section we present these results in order to show the applicability of both algorithms.
We consider a Rushton turbine as a representative test case for optimizing a practical stirrer configuration. The geometric parameters defining the standard configuration are given in Table 1. The numerical grid consists of 22 blocks with a total number of 238 996 control volumes. There are 17 blocks defined as rotating with the stirrer, while the remaining 5 blocks are defined as stationary with the vessel. A sketch of the considered configuration and the corresponding surface grid of the stirrer are shown in Figure 1. One time step of the integration needs approximately 20 seconds of computing time on an eight-processor Redstone cluster machine. This results in about 8 hours of computing time to reach a steady state flow in the sense of a frozen rotor computation, a criterion that we adopt for all cases.

The dimensionless Newton number Ne = P/(ρ N³ D⁵), which relates the resistance force to the inertia force, is considered as the characteristic reference quantity to be minimized. Here, ρ denotes the fluid density, D the impeller diameter, N the rotational speed of the impeller, and P the power computed from the flow quantities (pressure p, velocity u_i, stress τ_ij) over the surface S of the impeller.
The design variables are the disk thickness x, the bottom clearance C, and the baffle length W , for which the inequality constraints 0.001 ≤ x ≤ 0.005, 0.02 ≤ C ≤ 0.075, and 0.005 ≤ W ≤ 0.03 are prescribed.All other parameters are kept fixed according to the standard configuration.
Because gradient information is not available, for both algorithms the minimum trust-region radius δ_min is used as the stopping criterion. In DFO, δ_min = 0.0001 was chosen for Re = 100 and δ_min = 0.001 for Re = 1000. For CONDOR, we used δ_min = 0.0001 for both Reynolds numbers.
As a first example we consider the minimization of the Newton number for a Reynolds number of Re = 1000. Figure 2 shows the Newton number versus the number of cycles for the two optimization algorithms. Both algorithms attain the same optimal Newton number; CONDOR terminates earlier than DFO, while the latter oscillates around the optimum. Figure 3 depicts the corresponding changes of the design variables during the function evaluations. The three design parameters attain almost the same optimal values with both tools. We remark that the two optimization tools differ in how they build the starting model: DFO starts building a model of the objective function from the very beginning, so its initial polynomial approximation is not fully quadratic, whereas CONDOR starts with a fully quadratic model of the objective function. Despite this long initialization, it turns out that CONDOR needs fewer function evaluations than DFO to reach an optimal point after completing the initial quadratic approximation. DFO approaches the minimum very quickly at the beginning but later oscillates around it; CONDOR waits for a full quadratic polynomial to build its model and then gradually approaches the minimum using a complete interpolating basis. In any case, both algorithms reach the same optimized Newton number, which is significantly (about 37%) lower than that of the standard tank configuration given in Table 1.
6. Conclusions and perspectives. The main drawback of DFO methods is that they can be applied only to moderately sized problems, i.e., to problems with around 20 variables. In order to apply these methods to large-scale problems in applications, the underlying structure of the objective function should be exploited. There has been some progress for sparse problems and for partially separable functions, as discussed in this paper. Problems arising in CFD simulations and other similar large-scale optimization problems are potential application areas of DFO algorithms. Another theoretically and practically interesting

2. Description of the DFO algorithm. DFO algorithms belong to the class of trust-region methods, which are iterative and build around the current iterate a cheaper model of the objective function that is easier to minimize than the objective function itself. At the kth step a quadratic model is formed within the trust-region B_k.

The geometrical parameters of the standard stirrer configuration (Table 1) are: impeller diameter D = T/3 = 0.05 m; bottom clearance C = H/2 = 0.075 m; height of the liquid H = T = 0.15 m; length of the baffles W = 3D/10 = 0.015 m; length of the blade = D/4 = 0.0125 m; height of the blade w = D/5 = 0.01 m; disc thickness x = D/5 = 0.00175 m; diameter of the disk d = 3D/4 = 0.0375 m. The working Newtonian fluid is a glucose solution with density ρ = 1330 kg/m³ and viscosity µ = 0.105 Pa s.

Figure 2. Newton number versus number of the loops.
Figure 3. Changes of the design variables during the function evaluations.
After building the model m_k(x_k + s) around the current iterate x_k, the step s_k is computed by minimizing the model m_k over the trust region. The ratio of the achieved reduction in the objective function to the reduction predicted by the model decides whether the step is accepted. Step 1: Initialization. For a given x_s and f(x_s), choose an initial interpolation set Y. Determine x_0 ∈ Y such that f(x_0) = min_{y_i ∈ Y} f(y_i). Choose an initial trust-region radius ∆_0 > 0. Set k = 0.
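The acceptance ratio ρ_k and the radius update described above can be sketched as follows. The thresholds η₁, η₂ and the scaling factors are illustrative textbook values, not the actual defaults of the DFO or CONDOR packages; a real DFO implementation would also first check the geometry of Y before shrinking, as discussed earlier.

```python
def rho(f_old, f_new, m_old, m_new):
    """Achieved vs. predicted reduction: a ratio near 1 means the model
    predicted the objective decrease well over this step."""
    return (f_old - f_new) / (m_old - m_new)

def update_radius(delta, r, eta1=0.1, eta2=0.7, gamma1=0.5, gamma2=2.0):
    """Standard trust-region radius update driven by the ratio r = rho_k."""
    if r >= eta2:
        return gamma2 * delta   # very successful step: enlarge the region
    if r >= eta1:
        return delta            # successful step: keep the radius
    return gamma1 * delta       # unsuccessful step: shrink the region
```

With these constants, a step that realizes the full predicted decrease (ρ_k = 1) doubles the radius, while a step that fails the η₁ test halves it.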

Table 1. Geometrical parameters of the standard stirrer configuration.