Stochastic derivative-free optimization using a trust region framework

Abstract

This paper presents a trust region algorithm to minimize a function f when one has access only to noise-corrupted function values \(\bar{f}\). The model-based algorithm dynamically adjusts its step length, taking larger steps when the model and function agree and smaller steps when the model is less accurate. The method does not require the user to specify a fixed pattern of points used to build local models and does not repeatedly sample points. If f is sufficiently smooth and the noise is independent and identically distributed with mean zero and finite variance, we prove that our algorithm produces iterates such that the corresponding function gradients converge in probability to zero. We present a prototype of our algorithm that, while simplistic in its management of previously evaluated points, solves benchmark problems in fewer function evaluations than do existing stochastic approximation methods.

References

  1. Amos, B.D., Easterling, D.R., Watson, L.T., Castle, B.S., Trosset, M.W., Thacker, W.I.: Fortran 95 implementation of QNSTOP for global and stochastic optimization. In: Proceedings of the High Performance Computing Symposium, pp. 15:1–15:8. Society for Computer Simulation International, San Diego, CA, USA (2014)

  2. Bandeira, A.S., Scheinberg, K., Vicente, L.N.: Convergence of trust-region methods based on probabilistic models. SIAM J. Optim. 24(3), 1238–1264 (2014)

  3. Bastin, F., Cirillo, C., Toint, P.L.: An adaptive Monte Carlo algorithm for computing mixed logit estimators. CMS 3(1), 55–79 (2006)

  4. Billups, S.C., Larson, J., Graf, P.: Derivative-free optimization of expensive functions with computational error using weighted regression. SIAM J. Optim. 23(1), 27–53 (2013)

  5. Box, G., Wilson, K.: On the experimental attainment of optimum conditions. J. R. Stat. Soc. Ser. B (Methodological) 13(1), 1–45 (1951)

  6. Chang, K.H., Hong, L.J., Wan, H.: Stochastic trust-region response-surface method (STRONG)—a new response-surface framework for simulation optimization. INFORMS J. Comput. 25(2), 230–243 (2012)

  7. Chen, R., Menickelly, M., Scheinberg, K.: Stochastic Optimization Using a Trust-Region Method and Random Models (2015). Preprint http://arxiv.org/abs/1504.04231

  8. Conn, A.R., Scheinberg, K., Vicente, L.N.: Introduction to Derivative-Free Optimization. Society for Industrial and Applied Mathematics (2009)

  9. Deng, G., Ferris, M.C.: Adaptation of the UOBYQA algorithm for noisy functions. In: Proceedings of the 2006 Winter Simulation Conference, pp. 312–319. IEEE (2006)

  10. Deng, G., Ferris, M.C.: Extension of the DIRECT optimization algorithm for noisy functions. In: Proceedings of the 2007 Winter Simulation Conference, pp. 497–504. IEEE (2007)

  11. Dolan, E.D., Moré, J.J.: Benchmarking optimization software with performance profiles. Math. Program. 91(2), 201–213 (2002)

  12. Dupuis, P., Simha, R.: On sampling controlled stochastic approximation. IEEE Trans. Autom. Control 36(8), 915–924 (1991)

  13. Jones, D.R., Perttunen, C.D., Stuckman, B.E.: Lipschitzian optimization without the Lipschitz constant. J. Optim. Theory Appl. 79(1), 157–181 (1993)

  14. Kiefer, J., Wolfowitz, J.: Stochastic estimation of the maximum of a regression function. Ann. Math. Stat. 23(3), 462–466 (1952)

  15. Larson, J.: Derivative-free optimization of noisy functions. PhD dissertation, University of Colorado Denver (2012)

  16. Moré, J.J., Wild, S.M.: Benchmarking derivative-free optimization algorithms. SIAM J. Optim. 20(1), 172–191 (2009)

  17. Moré, J.J., Wild, S.M.: Estimating computational noise. SIAM J. Sci. Comput. 33(3), 1292–1314 (2011)

  18. Moré, J.J., Wild, S.M.: Estimating derivatives of noisy simulations. ACM Trans. Math. Softw. 38(3), 1–21 (2012)

  19. Myers, R.H., Montgomery, D.C., Vining, G.G., Borror, C.M., Kowalski, S.M.: Response surface methodology: a retrospective and literature survey. J. Qual. Technol. 36(1), 53–77 (2004)

  20. Nedić, A., Bertsekas, D.P.: The effect of deterministic noise in subgradient methods. Math. Program. 125(1), 75–99 (2009)

  21. Nelder, J.A., Mead, R.: A simplex method for function minimization. Comput. J. 7(4), 308–313 (1965)

  22. Powell, M.: UOBYQA: Unconstrained optimization by quadratic approximation. Math. Program. 92(3), 555–582 (2002)

  23. Scheinberg, K.: Using random models in derivative-free optimization. Plenary presentation at “The International Symposium on Mathematical Programming” (2012)

  24. Spall, J.C.: Multivariate stochastic approximation using a simultaneous perturbation gradient approximation. IEEE Trans. Autom. Control 37(3), 332–341 (1992)

  25. Spall, J.C.: Introduction to Stochastic Search and Optimization. Wiley, Hoboken (2003)

  26. Tomick, J., Arnold, S., Barton, R.: Sample size selection for improved Nelder–Mead performance. In: Proceedings of the 1995 Winter Simulation Conference, pp. 341–345. IEEE (1995)

Acknowledgments

We thank Alexandre Proutiere for providing critical insights that allowed this work to be completed. We also thank Katya Scheinberg and an anonymous referee for alerting us to errors in earlier drafts of our analysis. We thank Layne Watson for sending us the Fortran code for QNSTOP. This material is based upon work supported by the U.S. Department of Energy, Office of Science, under Contract DE-AC02-06CH11357. We thank Gail Pieper for her useful language editing.

Author information

Corresponding author

Correspondence to Jeffrey Larson.

Appendix

We now prove a sequence of results culminating in Proposition 3, which shows that a model regressing a sufficiently large, strongly \(\varLambda \)-poised set of points will be \(\alpha \)-probabilistically \(\kappa \)-fully linear.

Lemma 6

Let f be continuously differentiable on \(\varOmega \), and let m(x) denote the linear model regressing a set of strongly \(\varLambda \)-poised points \(Y = \left\{ y^0,\ldots ,y^p \right\} \). Then, for any \(x \in \varOmega \) the following identities hold:

  1.

    \(m(x) - f(x) = \sum \limits _{i=0}^p G_i(x)^T(y^i-x) \ell _i(x) + \sum \limits _{i=0}^p \epsilon _i \ell _i(x)\),

  2.

    \(\nabla m(x) - \nabla f(x) = \sum \limits _{i=0}^p G_i(x)^T(y^i-x) \nabla \ell _i(x) + \sum \limits _{i=0}^p \epsilon _i \nabla \ell _i(x)\),

where \(\epsilon _i\) denotes the noise at \(y^i\), \(\ell _i\) is the ith Lagrange polynomial, and \(G_i(x) = \nabla f(v_i(x)) - \nabla f(x)\) for some point \(v_i(x) = \theta x + (1-\theta ) y^i, 0 \le \theta \le 1\) on the line segment connecting x to \(y^i\).

Proof

The proof is nearly identical to that of [4, Lemma 4.5]. \(\square \)
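
For concreteness, the following sketch (ours, not part of the original analysis) builds regression Lagrange polynomials for the linear basis, assuming the standard construction \(\ell (x) = M(M^TM)^{-1}\phi (x)\) that appears below in the proof of Lemma 8, and checks numerically that the model \(\sum _i \bar{f}_i \ell _i(x)\) coincides with the ordinary least-squares linear fit of the noisy values; the test function, radius, and sample size are arbitrary choices for illustration.

```python
# Illustrative sketch only (not from the paper).  Regression Lagrange
# polynomials for the linear basis phi(x) = (1, x_1, ..., x_n), built as
# ell(x) = M (M^T M)^{-1} phi(x), where the rows of M are phi(y^i)^T.
# The model m(x) = sum_i fbar_i ell_i(x) equals the least-squares linear fit.
import numpy as np

rng = np.random.default_rng(0)
n, p = 3, 20                                   # dimension and p (so p+1 points)
y0 = np.zeros(n)
Y = y0 + 0.1 * rng.uniform(-1.0, 1.0, size=(p + 1, n))   # sample points near y0
Y[0] = y0

f = lambda x: np.sin(x[0]) + x[1] ** 2 - 0.5 * x[2]      # smooth test function
eps = 0.01 * rng.standard_normal(p + 1)                   # i.i.d. mean-zero noise
fbar = np.array([f(y) for y in Y]) + eps                  # noisy values fbar_i

phi = lambda x: np.concatenate(([1.0], x))                # linear basis
M = np.array([phi(y) for y in Y])                         # (p+1) x (n+1)

def ell(x):
    """Regression Lagrange polynomials evaluated at x."""
    return M @ np.linalg.solve(M.T @ M, phi(x))

x = 0.05 * np.ones(n)
m_via_lagrange = fbar @ ell(x)                            # sum_i fbar_i ell_i(x)

coef, *_ = np.linalg.lstsq(M, fbar, rcond=None)           # explicit least squares
m_via_lstsq = coef @ phi(x)

print(abs(m_via_lagrange - m_via_lstsq))                  # agreement up to roundoff
```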

The following lemma is similar to [4, Lemma 4.4].

Lemma 7

Given a sample set \(Y=\left\{ y^0, \ldots , y^p \right\} \subset {\mathbb {R}}^n\), let \(\hat{Y} = \left\{ \hat{y}^0, \ldots , \hat{y}^p \right\} \) be the shifted and scaled sample set defined by \(\hat{y}^i = (y^i - \bar{y})/R\) for some \(R > 0\) and \(\bar{y} \in {\mathbb {R}}^n\). Let \(\phi = \left\{ \phi _0(x), \ldots , \phi _q(x)\right\} \) be a basis for \({\mathcal {P}}_n^d\), and define the basis \(\hat{\phi } = \left\{ \hat{\phi }_0(x), \ldots , \hat{\phi }_q(x) \right\} \) by \(\hat{\phi }_i(x) = \phi _i(Rx + \bar{y})\), \(i=0,\ldots ,q\). Let \(\left\{ \ell _0(x), \ldots , \ell _p(x) \right\} \) be the regression Lagrange polynomials for Y, and let \(\left\{ \hat{\ell }_0(x), \ldots , \hat{\ell }_p(x) \right\} \) be the regression Lagrange polynomials for \(\hat{Y}\). Then \(\displaystyle M(\phi ,Y) = M(\hat{\phi },\hat{Y})\). Moreover, if Y is poised, then

$$\begin{aligned} \ell _j(x) = \hat{\ell }_j\left( \frac{x-\bar{y}}{R}\right) , \quad \text{ for } j=0,\ldots ,p. \end{aligned}$$

Proof

The proof is identical to that of [4, Lemma 4.4]. (Note that the wording there gives specific choices for \(R, \bar{y}\) and \(\phi \); however, its proof does not rely on these choices). \(\square \)

Lemma 8

Let \(Y = \{y^0, \ldots , y^p\} \subset B(y^0;\varDelta )\) be a strongly \(\varLambda \)-poised sample set on \(B(y^0;\varDelta )\). Let \(\ell (x) = \left( \ell _0(x), \ldots , \ell _p(x) \right) ^T\), where \(\ell _j(x)\), \(j=0,\ldots ,p\), are the regression Lagrange polynomials of order d for Y. Let \(q+1\) denote the dimension of \({\mathcal {P}}_n^d\). Then for any \(x \in B(y^0;\varDelta )\),

$$\begin{aligned} \left\| { \nabla \ell (x) } \right\| \le \frac{(q+1) \hat{\theta }}{\varDelta \sqrt{p+1}} \varLambda , \end{aligned}$$

where \(\hat{\theta }\) is a constant that depends on n and d but is independent of Y and \(\varLambda \).

Proof

Let \(\hat{Y} = \left\{ \hat{y}^0, \ldots , \hat{y}^p \right\} \) be the shifted and scaled sample set defined by \(\hat{y}^i = (y^i - y^0)/\varDelta \).

Let \(\bar{\phi } = \left\{ \bar{\phi }_0(x), \ldots , \bar{\phi }_q(x)\right\} \) be a basis for \({\mathcal {P}}_n^d\), and define the basis \(\hat{\phi } = \left\{ \hat{\phi }_0(x), \ldots , \hat{\phi }_q(x) \right\} \) by \(\hat{\phi }_i(x) = \bar{\phi }_i(\varDelta x+ y^0)\), \(i=0,\ldots ,q\). Let \(\left\{ \ell _0(x), \ldots , \ell _p(x) \right\} \) be the regression Lagrange polynomials for Y, and let \(\left\{ \hat{\ell }_0(x), \ldots , \hat{\ell }_p(x) \right\} \) be the regression Lagrange polynomials for \(\hat{Y}\).

Let \(M=M(\hat{\phi },\hat{Y})\) (defined in (2)). By the proof of [8, Lemma 4.12], \(\left\| {M(M^T M)^{-1}} \right\| \le \frac{(q+1) \bar{\theta }}{\sqrt{p+1}} \varLambda \) for some \(\bar{\theta } > 0\) that depends on n and d but is independent of Y and \(\varLambda \). (Note that the statement of [8, Lemma 4.12] assumes \(\max _i \left\| {\hat{y}^i} \right\| =1\); however, its proof does not rely on this assumption.) Thus, for any \(x \in B(0;1)\),

$$\begin{aligned} \left\| {\nabla \hat{\ell }(x)} \right\| = \left\| { M(M^TM)^{-1} \nabla \hat{\phi }(x) } \right\| \le \frac{(q+1)\bar{\theta }}{\sqrt{p+1}} \varLambda \left\| {\nabla \hat{\phi }(x)} \right\| \le \frac{(q+1)\hat{\theta }}{\sqrt{p+1}} \varLambda , \end{aligned}$$

where \(\hat{\theta } = \bar{\theta }\max _{x \in B(0;1)} \left\| {\nabla \hat{\phi }(x)} \right\| \). By Lemma 7,

$$\begin{aligned} \left\| {\nabla \ell (x)} \right\| = \frac{1}{\varDelta }\left\| { \nabla \hat{\ell }\left( \frac{x-y^0}{\varDelta }\right) } \right\| \le \frac{(q+1)\hat{\theta }}{\varDelta \sqrt{p+1}} \varLambda . \end{aligned}$$

\(\square \)
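
The \(1/\varDelta \) factor in the bound above can be observed directly for the linear basis: by the change of variables in Lemma 7, shrinking a sample set by a factor \(\varDelta \) scales the (constant) Jacobian of its regression Lagrange polynomials by \(1/\varDelta \). The short sketch below (ours, with arbitrary data) illustrates this scaling numerically.

```python
# Illustrative sketch only.  For the linear basis, the Jacobian of the
# regression Lagrange polynomials is the constant matrix M (M^T M)^{-1} J_phi,
# and scaling the sample set by Delta multiplies its norm by 1/Delta.
import numpy as np

rng = np.random.default_rng(1)
n, p = 3, 20
Yhat = rng.uniform(-1.0, 1.0, size=(p + 1, n))    # reference sample set (unit scale)
Yhat[0] = 0.0

def lagrange_jacobian(Y):
    M = np.hstack([np.ones((len(Y), 1)), Y])               # rows are phi(y^i)^T
    J_phi = np.vstack([np.zeros((1, Y.shape[1])), np.eye(Y.shape[1])])
    return M @ np.linalg.solve(M.T @ M, J_phi)             # rows are grad ell_i(x)^T

for Delta in (1.0, 0.1, 0.01):
    J = lagrange_jacobian(Delta * Yhat)                    # sample set shrunk by Delta
    print(Delta, Delta * np.linalg.norm(J, 2))             # product stays constant
```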

We now use the preceding three results to bound the error between the function and a linear model regressing a strongly \(\varLambda \)-poised set of \(p+1\) points.

Proposition 1

Let \(f:{\mathbb {R}}^n \rightarrow {\mathbb {R}}\) be a continuously differentiable function with Lipschitz continuous gradient with constant \(L_g\). Let \(\bar{x} \in {\mathbb {R}}^n\), \(\varLambda > 0\) and \(\varDelta > 0\). Let \(Y = \{y^0, \ldots , y^p\} \subset B(\bar{x};\varDelta )\) be a strongly \(\varLambda \)-poised set of sample points with \(y^0=\bar{x}\). For \(i \in \{0,\ldots ,p\}\), let \(\bar{f}_i = f(y^i) + \epsilon _i\), where the noise \(\epsilon _i\) is sampled from a random distribution with mean 0 and variance \(\sigma ^2 < \infty \). Let \(m: {\mathbb {R}}^n \rightarrow {\mathbb {R}}\) be the unique linear polynomial approximating the data \(\{ (y^i, \bar{f}_i), i=0,\ldots ,p\}\) by least-squares regression. There exist constants \(\bar{C}\) and \(\bar{a}\) such that for any \(a > \bar{a}\) the following inequalities hold:

  1.

    \({\mathbb {P}}\left[ \max _{z \in B(\bar{x};\varDelta )} \left| m(z)-f(z) \right| > a \varDelta ^2 \right] \le \frac{\bar{C}\sigma ^2 \varLambda ^2}{(a-\bar{a})^2 (p+1) \varDelta ^4}\); and

  2.

    \({\mathbb {P}}\left[ \max _{z \in B(\bar{x};\varDelta )} \left\| \nabla m(z)-\nabla f(z) \right\| > a \varDelta \right] \le \frac{\bar{C}\sigma ^2 \varLambda ^2}{(a-\bar{a})^2 (p+1) \varDelta ^4}\).

Proof

Let \(\left\{ \ell _0(x),\ldots ,\ell _p(x) \right\} \) be the linear regression Lagrange polynomials for the set \(Y = \left\{ y^0,\ldots ,y^p \right\} \). Define \(\ell (z)=\left[ \ell _0(z),\ldots ,\ell _p(z) \right] ^T\). Define \(\epsilon = \left[ \epsilon _0,\ldots , \epsilon _p\right] ^T\), and note that \({\mathbb {E}}\left[ \epsilon _i \epsilon _j\right] =0\) for \(i \ne j\). Define \(Z_0(x) = \epsilon ^T \ell (x)\). Then

$$\begin{aligned} {\mathbb {E}}\left[ Z_0\left( \bar{x}\right) ^2\right]= & {} {\mathbb {E}}\left[ \left( \sum \limits _{i=0}^p \epsilon _i \ell _i\left( \bar{x}\right) \right) ^2\right] = \sum \limits _{i=0}^p {\mathbb {E}}\left[ \epsilon _i^2\right] \ell _i\left( \bar{x}\right) ^2 \nonumber \\= & {} \sigma ^2 \sum \limits _{i=0}^p \ell _i\left( \bar{x}\right) ^2 = \sigma ^2 \left\| {\ell \left( \bar{x}\right) } \right\| ^2 \le \frac{\sigma ^2(n+1)^2 \varLambda ^2}{p+1}, \end{aligned}$$
(22)

where the last inequality follows from Definition 2 and the assumption that Y is strongly \(\varLambda \)-poised on \(B(\bar{x};\varDelta )\).

Because \(Z_0\) is a linear function, its Jacobian matrix is constant. Denote this matrix by \(Z_1 = \nabla Z_0(\bar{x}) = \epsilon ^T \nabla \ell (\bar{x})\). Then,

$$\begin{aligned} {\mathbb {E}}\left[ \left\| {Z_1} \right\| ^2\right]&= \sum \limits _{i=0}^p {\mathbb {E}}\left[ \epsilon _i^2\right] \nabla \ell _i(\bar{x})^T \nabla \ell _i(\bar{x}) = \sigma ^2 \text{ Tr }\left( \nabla \ell (\bar{x}) \nabla \ell (\bar{x})^T\right) \nonumber \\&= \sigma ^2 \left\| {\nabla \ell (\bar{x})} \right\| _F^2 \le \sigma ^2 (n+1) \left\| {\nabla \ell (\bar{x})} \right\| _2^2 \nonumber \\&\le \frac{\sigma ^2(n+1)^3\hat{\theta }^2 \varLambda ^2}{\varDelta ^2 (p+1)}, \quad \text{ by } \text{ Lemma }~8, \end{aligned}$$
(23)

where \(\hat{\theta }\) is a constant that depends on n but is independent of Y and \(\varLambda \).

Next, observe that for any \(z\in B(\bar{x};\varDelta )\), \( \left\| {Z_0(z)} \right\| ^2 = \left\| { Z_0(\bar{x}) + Z_1(z - \bar{x})} \right\| ^2 \le \left( \left\| {Z_0(\bar{x})} \right\| + \varDelta \left\| {Z_1} \right\| \right) ^2 \le 2\left\| {Z_0(\bar{x})} \right\| ^2 + 2 \varDelta ^2 \left\| {Z_1} \right\| ^2\). Combining this with (22) and (23) yields the inequality

$$\begin{aligned} {\mathbb {E}}\left[ \max _{z \in B(\bar{x};\varDelta )} \left\| {Z_0(z)} \right\| ^2\right] \le \frac{2\sigma ^2(n+1)^2\varLambda ^2}{p+1} + \frac{2\sigma ^2(n+1)^3 \hat{\theta }^2 \varLambda ^2}{ p+1} =C_0 \frac{\sigma ^2 \varLambda ^2}{p+1}, \end{aligned}$$

where \(C_0 = 2(n+1)^2 \left( 1+(n+1)\hat{\theta }^2\right) \).

By Markov’s inequality, for any \(a>0\),

$$\begin{aligned} {\mathbb {P}}\left[ \left\| {Z_1} \right\| \ge a \varDelta \right] = {\mathbb {P}}\left[ \left\| {Z_1} \right\| ^2 \ge a^2 \varDelta ^2 \right] \le \frac{{\mathbb {E}}\left[ \left\| {Z_1} \right\| ^2\right] }{a^2 \varDelta ^2} \le \frac{\sigma ^2(n+1)^3\hat{\theta }^2 \varLambda ^2}{a^2 \varDelta ^4 (p+1)}, \end{aligned}$$

and

$$\begin{aligned} {\mathbb {P}}\left[ \max _{z \in B(\bar{x};\varDelta )} \left\| {Z_0(z)} \right\| \ge a \varDelta ^2\right] \le \frac{ {\mathbb {E}}\left[ \max _{z \in B(\bar{x};\varDelta )}\left\| {Z_0(z)} \right\| ^2\right] }{a^2 \varDelta ^4} \le \frac{C_0 \sigma ^2 \varLambda ^2}{a^2 \varDelta ^4 (p+1)}. \end{aligned}$$

Define \(g_i(z)=G_i(z)^T(y^i-z)\), where \(G_i(z)\) is defined in Lemma 6. Let \(g(z)=\left[ g_0(z),\ldots ,g_p(z) \right] ^T\). Because \(\nabla f\) is Lipschitz continuous, \(|g_i(z)| \le 2 L_g \varDelta ^2\) and \(\left\| {g(z)} \right\| \le 2 \sqrt{p+1} L_g \varDelta ^2\). By Lemma 6,

$$\begin{aligned} \left\| {\nabla m(z)-\nabla f(z)} \right\|&= \left\| { \left( g(z)^T + \epsilon ^T \right) \nabla \ell (\bar{x})} \right\| \\&\le 2\sqrt{p+1} L_g \varDelta ^2 \left\| {\nabla \ell (\bar{x})} \right\| + \left\| {Z_1} \right\| \\&\le 2 L_g \varDelta (n+1)\hat{\theta } \varLambda + \left\| {Z_1} \right\| , \quad \text{ by } \text{ Lemma }~8. \end{aligned}$$

Thus,

$$\begin{aligned} {\mathbb {P}}\left[ \max _{z \in B(\bar{x};\varDelta )} \left\| {\nabla m(z)-\nabla f(z)} \right\| \ge a \varDelta \right]&\le {\mathbb {P}}\left[ \left\| {Z_1} \right\| \ge \left( a - 2 L_g (n+1) \hat{\theta } \varLambda \right) \varDelta \right] \\&\le \frac{\sigma ^2(n+1)^3\hat{\theta }^2 \varLambda ^2}{\left( a - 2 L_g (n+1) \hat{\theta } \varLambda \right) ^2 \varDelta ^4 (p+1)} \\&= \frac{C_1 \sigma ^2 \varLambda ^2}{\left( a-\bar{a}_1\right) ^2 (p+1) \varDelta ^4}, \end{aligned}$$

where \(C_1 = (n+1)^3 \hat{\theta }^2\) and \(\bar{a}_1=2L_g(n+1)\hat{\theta } \varLambda \).

By Definition 2, \(\left\| \ell (z) \right\| \le \frac{n+1}{\sqrt{p+1}}\varLambda \). Thus, by Lemma 6,

$$\begin{aligned} \left| m(z)- f(z) \right|&= \left| \left( g(z) + \epsilon \right) ^T \ell (z) \right| \\&\le \left\| {g(z)} \right\| \left\| {\ell (z)} \right\| +\left\| {Z_0(z)} \right\| \\&\le 2 L_g \sqrt{p+1} \varDelta ^2 \frac{(n+1) \varLambda }{\sqrt{p+1}} + \left\| {Z_0(z)} \right\| = 2 L_g \varDelta ^2 (n+1) \varLambda + \left\| {Z_0(z)} \right\| . \end{aligned}$$

Thus,

$$\begin{aligned}&{\mathbb {P}}\left[ \max _{z \in B(\bar{x};\varDelta )} \left| m(z)- f(z) \right| \ge a \varDelta ^2\right] \\&\le {\mathbb {P}}\left[ \max _{z \in B(\bar{x};\varDelta )} \left\| {Z_0(z)} \right\| \ge \left( a - 2 L_g (n+1) \varLambda \right) \varDelta ^2\right] \\&\le C_0 \frac{\sigma ^2 \varLambda ^2}{\left( a - 2 L_g (n+1) \varLambda \right) ^2 \varDelta ^4 (p+1)} \\&= C_0 \frac{\sigma ^2 \varLambda ^2}{\left( a - \bar{a}_2 \right) ^2 \varDelta ^4 (p+1)}, \end{aligned}$$

where \(\bar{a}_2 = 2L_g(n+1)\varLambda \). The result follows with \(\bar{a} = \max \{ \bar{a}_1, \bar{a}_2 \}\) and \(\bar{C} = \max \{C_0,C_1\}\). \(\square \)
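
A key step in the proof is the variance computation (22): the noise contribution \(Z_0(\bar{x}) = \sum _i \epsilon _i \ell _i(\bar{x})\) has variance \(\sigma ^2 \left\| \ell (\bar{x}) \right\| ^2\), which shrinks as the number of regression points grows. The following Monte Carlo sketch (ours; the dimension, radius, and noise level are arbitrary) checks this identity numerically.

```python
# Illustrative Monte Carlo check of the variance identity (22):
# Var[Z_0(xbar)] = sigma^2 * ||ell(xbar)||^2 for i.i.d. mean-zero noise.
import numpy as np

rng = np.random.default_rng(2)
n, sigma, Delta = 3, 0.1, 0.05
xbar = np.zeros(n)
phi = lambda x: np.concatenate(([1.0], x))

for p in (5, 50, 500):
    Y = xbar + Delta * rng.uniform(-1.0, 1.0, size=(p + 1, n))   # regression points
    Y[0] = xbar
    M = np.array([phi(y) for y in Y])
    ell_xbar = M @ np.linalg.solve(M.T @ M, phi(xbar))           # ell(xbar)

    # Empirical variance of Z_0(xbar) over independent noise realizations.
    Z0 = np.array([rng.normal(0.0, sigma, p + 1) @ ell_xbar for _ in range(20000)])
    print(p + 1, Z0.var(), sigma ** 2 * np.linalg.norm(ell_xbar) ** 2)
```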

We now show that the models \(F_k^0\) and \(F_k^s\) can easily be constructed to satisfy Conditions 1 and 2.

Proposition 2

Let \(f:{\mathbb {R}}^n \rightarrow {\mathbb {R}}\) be a Lipschitz continuously differentiable function. Let \(\{c_k\}\), \(\{\varDelta _k\}\), and \(\{a_k\}\) be positive sequences such that \(c_k \rightarrow +\infty \) and \(a_k \downarrow 0\). Let \(\{x^k\}\) and \(\{s^k\}\) be sequences in \({\mathbb {R}}^n\) with \(\left\| {s^k} \right\| \le \varDelta _k\). Let \(\varLambda > 0\). For each k, define \(\delta _k = \min \{\varDelta _k, 1\}\), and let \(Y_k^0 \subset B\left( x^k;a_k \delta _k \right) \) and \(Y_k^s \subset B\left( x^k+s^k; a_k \delta _k \right) \) be strongly \(\varLambda \)-poised sample sets with at least \(c_k/(a_k \delta _k)^4\) points each. Let \(m_k^0\) and \(m_k^s\) be the linear polynomials fitting the computed function values on the sample sets \(Y_k^0\) and \(Y_k^s\), respectively, by least-squares regression. Define \(F_k^0 = m_k^0\left( x^k\right) \) and \(F_k^s = m_k^s \left( x^k+s^k \right) \). Then \(\{F_k^0\}\) and \(\{F_k^s\}\) satisfy Conditions 1 and 2.

Proof

Let \(\bar{C}\) and \(\bar{a}\) be the constants defined in Proposition 1. Since \(a_k \downarrow 0\) and \(c_k \rightarrow +\infty \), for any \(\omega > 0\) there exists \(\bar{k}(\omega )\) such that for all \(k > \bar{k}(\omega )\), \(\displaystyle 2\bar{a}\left( a_k \delta _k\right) ^2 < \frac{\beta \eta }{2} \delta _k^2 \le \frac{\beta \eta }{2} \min \{ \varDelta _k, \varDelta _k^2\}\) and \(\displaystyle \frac{\bar{C}\sigma ^2 \varLambda ^2}{\bar{a}^2 \left| Y_k^0 \right| a_k^4\delta _k^4} \le \omega /2\). Thus,

$$\begin{aligned}&{\mathbb {P}}\left[ \left| F_k^0 - f\left( x^k\right) \right| > \frac{\beta \eta }{2} \delta _k^2 \right] \\&\le {\mathbb {P}}\left[ \left| m_k^0\left( x^k\right) -f\left( x^k\right) \right| > 2 \bar{a}\left( a_k \delta _k\right) ^2\right] \\&\le \frac{\bar{C} \sigma ^2 \varLambda ^2}{\bar{a}^2 \left| Y_k^0 \right| a_k^4 \delta _k^4}&\text{ by } \text{ Proposition }~1 \\&\le \frac{\omega }{2}. \end{aligned}$$

Using the same argument, we can show that

$$\begin{aligned} {\mathbb {P}}\left[ \left| F_k^s - f\left( x^k+s^k\right) \right| > \frac{\beta \eta }{2} \delta _k^2 \right] \le \frac{\omega }{2}. \end{aligned}$$

Thus, for \(k > \bar{k}(\omega )\),

$$\begin{aligned}&{\mathbb {P}}\left[ \left| F_k^0-f\left( x^k\right) -F_k^s+f\left( x^k+s^k\right) \right| > \beta \eta \min \left\{ \varDelta _k, \varDelta _k^2\right\} \right] \\&\quad \le {\mathbb {P}}\left[ \left| F_k^0 - f\left( x^k\right) \right| > \frac{\beta \eta }{2} \delta _k^2\right] + {\mathbb {P}}\left[ \left| F_k^s - f\left( x^k+s^k\right) \right| > \frac{\beta \eta }{2} \delta _k^2\right] \le \omega , \end{aligned}$$

so Condition 1 is satisfied.

To prove that Condition 2 holds, observe that for \(k > \bar{k}(1)\), \(\left( \beta \eta + \xi \right) \delta _k^2 > \left( 2 \bar{a} + \frac{\xi }{a_k^2} \right) a_k^2 \delta _k^2\). Thus,

$$\begin{aligned}&{\mathbb {P}}\left[ \left| F_k^0 - f\left( x^k\right) \right| > \frac{\beta \eta +\xi }{2} \delta _k^2\right] \\&\le {\mathbb {P}}\left[ \left| m_k^0\left( x^k\right) -f\left( x^k\right) \right| > \left( 2 \bar{a}+\frac{\xi }{a_k^2}\right) \left( a_k \delta _k\right) ^2\right] \\&\le \frac{\bar{C} \sigma ^2 \varLambda ^2}{\left( \bar{a}+\xi /a_k^2\right) ^2 \left| Y_k^0 \right| a_k^4 \delta _k^4}&\text{ by } \text{ Proposition }~1\\&\le \frac{\bar{a}^2}{2(\bar{a}+\xi /a_k^2)^2} \le \frac{\bar{a} a_k^2}{4 \xi } \le \frac{\theta }{2 \xi }, \end{aligned}$$

for the constant \(\theta = \max _{k} \bar{a} a_k^2/2\). Similarly, we can show that \(\displaystyle {\mathbb {P}}\left[ \left| F_k^s - f(x^k+s^k) \right| > \frac{\beta \eta +\xi }{2} \delta _k^2\right] \le \frac{\theta }{2 \xi }\). It follows that

$$\begin{aligned}&{\mathbb {P}}\left[ F_k^0 - f\left( x^k\right) +f\left( x^k+s^k\right) -F_k^s > \left( \beta \eta +\xi \right) \min \left\{ \varDelta _k, \varDelta _k^2\right\} \right] \\&\quad \le {\mathbb {P}}\left[ \left| F_k^0 - f\left( x^k\right) \right| > \frac{\beta \eta +\xi }{2} \delta _k^2\right] \\&\quad \quad + {\mathbb {P}}\left[ \left| F_k^s - f\left( x^k+s^k\right) \right| > \frac{\beta \eta +\xi }{2} \delta _k^2\right] < \frac{\theta }{\xi }. \end{aligned}$$

This proves that Condition 2 is satisfied. \(\square \)
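
To give a feel for the sampling requirement \(\left| Y_k^0 \right| \ge c_k/(a_k \delta _k)^4\), the brief sketch below evaluates this lower bound for hypothetical sequences \(c_k\) and \(a_k\) and trust-region radii \(\varDelta _k\) (all chosen here purely for illustration; the proposition does not prescribe these particular values).

```python
# Illustrative only: per-iteration sample sizes required by Proposition 2,
# |Y_k| >= c_k / (a_k * delta_k)^4, for hypothetical parameter choices.
import math

Delta = [0.5, 0.25, 0.1, 0.05, 0.01]            # hypothetical trust-region radii
for k, Dk in enumerate(Delta, start=1):
    c_k = 10.0 * k                               # c_k -> infinity (hypothetical)
    a_k = k ** -0.25                             # a_k -> 0 (hypothetical)
    delta_k = min(Dk, 1.0)
    min_points = math.ceil(c_k / (a_k * delta_k) ** 4)
    print(k, Dk, min_points)
```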

We next show that models built using our proposed method are \(\alpha \)-probabilistically \(\kappa \)-fully linear.

Proposition 3

Let \(f:{\mathbb {R}}^n \rightarrow {\mathbb {R}}\) be a continuously differentiable function with Lipschitz continuous gradient with constant \(L_g\). Let \(\bar{x} \in {\mathbb {R}}^n\), \(\varLambda > 0\) and \(\varDelta > 0\). Let \(Y = \{y^0, \ldots , y^p\} \subset B(\bar{x};\varDelta )\) be a strongly \(\varLambda \)-poised set of sample points with \(y^0=\bar{x}\). For \(i \in \{0,\ldots ,p\}\), let \(\bar{f}_i = f(y^i) + \epsilon _i\), where the noise \(\epsilon _i\) is sampled from a random distribution with mean 0 and variance \(\sigma ^2 < \infty \). Let \(m: {\mathbb {R}}^n \rightarrow {\mathbb {R}}\) be the unique linear polynomial approximating the data \(\{ (y^i, \bar{f}_i), i=0,\ldots ,p\}\) by least-squares regression. Let \(\alpha \in (0,1)\) be given. There exist a positive constant \(\hat{C}\) depending only on \(\varLambda \) and \(L_g\), and constants \(\kappa = (\kappa _{\mathrm{ef}},\kappa _{\mathrm{eg}})\) depending only on \(\varLambda \), \(\alpha \), and \(L_g\), such that if \(p \ge \frac{\hat{C}}{\varDelta ^4}\), then m is \(\alpha \)-probabilistically \(\kappa \)-fully linear.

Proof

Let \(\bar{C}\) and \(\bar{a}\) be defined as in Proposition 1. Defining \(\kappa _{\mathrm{ef}}= \kappa _{\mathrm{eg}}= a = \bar{a}+1\) and \(\hat{C} = \frac{2\bar{C} \sigma ^2 \varLambda ^2}{1 - \alpha }\), if \(p \ge \frac{\hat{C}}{\varDelta ^4} - 1\), then by Proposition 1, we have

$$\begin{aligned} {\mathbb {P}}\left[ \max _{z \in B(\bar{x};\varDelta )} \left| m(z)-f(z) \right| > \kappa _{\mathrm{ef}}\varDelta ^2 \right] \le \frac{\bar{C}\sigma ^2 \varLambda ^2}{(p+1) \varDelta ^4} \le \frac{1-\alpha }{2} \end{aligned}$$

and

$$\begin{aligned} {\mathbb {P}}\left[ \max _{z \in B(\bar{x};\varDelta )} \left\| \nabla m(z)-\nabla f(z) \right\| > \kappa _{\mathrm{eg}}\varDelta \right] \le \frac{\bar{C}\sigma ^2 \varLambda ^2}{(p+1) \varDelta ^4} \le \frac{1-\alpha }{2}. \end{aligned}$$

Therefore, by the union bound, for any \(y \in B(\bar{x};\varDelta )\),

$$\begin{aligned} {\mathbb {P}}\left[ \left\| {\nabla f(y) - \nabla m(y)} \right\| \le \kappa _{\mathrm{eg}}\varDelta \text{ and } \left| f(y) - m(y) \right| \le \kappa _{\mathrm{ef}}\varDelta ^2\right] \ge \alpha , \end{aligned}$$

and the model is \(\alpha \)-probabilistically \(\kappa \)-fully linear. \(\square \)

One can easily test whether a set \(Y_k\) with \(\left| Y_k \right| > \frac{C}{\varDelta _k}\) is strongly \(\varLambda \)-poised by using Theorem 4.12 in [8]. A partial set of points that are not strongly \(\varLambda \)-poised can be completed to a strongly \(\varLambda \)-poised set by using Algorithm 6.7 in [8].
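
As a rough numerical companion (ours; it is not the test from [8]), one can estimate the smallest \(\varLambda \) compatible with the bound \(\left\| \ell (x) \right\| \le \frac{n+1}{\sqrt{p+1}}\varLambda \) used in the proofs above by maximizing \(\left\| \ell (x) \right\| \) over sampled points in the ball; the sampling scheme and problem sizes below are arbitrary.

```python
# Illustrative sketch only: estimate the poisedness constant of a sample set
# for the linear basis by maximizing ||ell(x)|| over points drawn from the
# ball B(xbar; Delta) and rescaling by sqrt(p+1)/(n+1).  Finite sampling gives
# a lower estimate of the true maximum (and hence of Lambda).
import numpy as np

rng = np.random.default_rng(3)
n, p, Delta = 3, 30, 0.1
xbar = np.zeros(n)

def ball_points(center, radius, count):
    """Uniform points in the Euclidean ball B(center; radius)."""
    d = rng.standard_normal((count, len(center)))
    d /= np.linalg.norm(d, axis=1, keepdims=True)
    r = radius * rng.uniform(0.0, 1.0, count) ** (1.0 / len(center))
    return center + r[:, None] * d

Y = ball_points(xbar, Delta, p + 1)
Y[0] = xbar
phi = lambda x: np.concatenate(([1.0], x))
M = np.array([phi(y) for y in Y])
ell = lambda x: M @ np.linalg.solve(M.T @ M, phi(x))

max_norm = max(np.linalg.norm(ell(x)) for x in ball_points(xbar, Delta, 5000))
print(np.sqrt(p + 1) / (n + 1) * max_norm)       # estimated (lower bound on) Lambda
```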

Cite this article

Larson, J., Billups, S.C. Stochastic derivative-free optimization using a trust region framework. Comput Optim Appl 64, 619–645 (2016). https://doi.org/10.1007/s10589-016-9827-z
