Stochastic derivative-free optimization using a trust region framework

Abstract

This paper presents a trust region algorithm to minimize a function f when one has access only to noise-corrupted function values \(\bar{f}\). The model-based algorithm dynamically adjusts its step length, taking larger steps when the model and function agree and smaller steps when the model is less accurate. The method does not require the user to specify a fixed pattern of points used to build local models and does not repeatedly sample points. If f is sufficiently smooth and the noise is independent and identically distributed with mean zero and finite variance, we prove that our algorithm produces iterates such that the corresponding function gradients converge in probability to zero. We present a prototype of our algorithm that, while simplistic in its management of previously evaluated points, solves benchmark problems in fewer function evaluations than do existing stochastic approximation methods.

References

  1. Amos, B.D., Easterling, D.R., Watson, L.T., Castle, B.S., Trosset, M.W., Thacker, W.I.: Fortran 95 implementation of QNSTOP for global and stochastic optimization. In: Proceedings of the High Performance Computing Symposium, pp. 15:1–15:8. Society for Computer Simulation International, San Diego, CA, USA (2014)

  2. Bandeira, A.S., Scheinberg, K., Vicente, L.N.: Convergence of trust-region methods based on probabilistic models. SIAM J. Optim. 24(3), 1238–1264 (2014)

  3. Bastin, F., Cirillo, C., Toint, P.L.: An adaptive Monte Carlo algorithm for computing mixed logit estimators. CMS 3(1), 55–79 (2006)

  4. Billups, S.C., Larson, J., Graf, P.: Derivative-free optimization of expensive functions with computational error using weighted regression. SIAM J. Optim. 23(1), 27–53 (2013)

  5. Box, G., Wilson, K.: On the experimental attainment of optimum conditions. J. R. Stat. Soc. Ser. B (Methodological) 13(1), 1–45 (1951)

  6. Chang, K.H., Hong, L.J., Wan, H.: Stochastic trust-region response-surface method (STRONG)—a new response-surface framework for simulation optimization. INFORMS J. Comput. 25(2), 230–243 (2012)

  7. Chen, R., Menickelly, M., Scheinberg, K.: Stochastic Optimization Using a Trust-Region Method and Random Models (2015). Preprint http://arxiv.org/abs/1504.04231

  8. Conn, A.R., Scheinberg, K., Vicente, L.N.: Introduction to Derivative-Free Optimization. Society for Industrial and Applied Mathematics (2009)

  9. Deng, G., Ferris, M.C.: Adaptation of the UOBYQA algorithm for noisy functions. In: Proceedings of the 2006 Winter Simulation Conference, pp. 312–319. IEEE (2006)

  10. Deng, G., Ferris, M.C.: Extension of the DIRECT optimization algorithm for noisy functions. In: Proceedings of the 2007 Winter Simulation Conference, pp. 497–504. IEEE (2007)

  11. Dolan, E.D., Moré, J.J.: Benchmarking optimization software with performance profiles. Math. Program. 91(2), 201–213 (2002)

  12. Dupuis, P., Simha, R.: On sampling controlled stochastic approximation. IEEE Trans. Autom. Control 36(8), 915–924 (1991)

  13. Jones, D.R., Perttunen, C.D., Stuckman, B.E.: Lipschitzian optimization without the Lipschitz constant. J. Optim. Theory Appl. 79(1), 157–181 (1993)

  14. Kiefer, J., Wolfowitz, J.: Stochastic estimation of the maximum of a regression function. Ann. Math. Stat. 23(3), 462–466 (1952)

  15. Larson, J.: Derivative-free optimization of noisy functions. PhD dissertation, University of Colorado Denver (2012)

  16. Moré, J.J., Wild, S.M.: Benchmarking derivative-free optimization algorithms. SIAM J. Optim. 20(1), 172–191 (2009)

  17. Moré, J.J., Wild, S.M.: Estimating computational noise. SIAM J. Sci. Comput. 33(3), 1292–1314 (2011)

  18. Moré, J.J., Wild, S.M.: Estimating derivatives of noisy simulations. ACM Trans. Math. Softw. 38(3), 1–21 (2012)

  19. Myers, R.H., Montgomery, D.C., Vining, G.G., Borror, C.M., Kowalski, S.M.: Response surface methodology: a retrospective and literature survey. J. Qual. Technol. 36(1), 53–77 (2004)

  20. Nedić, A., Bertsekas, D.P.: The effect of deterministic noise in subgradient methods. Math. Program. 125(1), 75–99 (2009)

  21. Nelder, J.A., Mead, R.: A simplex method for function minimization. Comput. J. 7(4), 308–313 (1965)

  22. Powell, M.: UOBYQA: Unconstrained optimization by quadratic approximation. Math. Program. 92(3), 555–582 (2002)

  23. Scheinberg, K.: Using random models in derivative-free optimization. Plenary presentation at “The International Symposium on Mathematical Programming” (2012)

  24. Spall, J.C.: Multivariate stochastic approximation using a simultaneous perturbation gradient approximation. IEEE Trans. Autom. Control 37(3), 332–341 (1992)

  25. Spall, J.C.: Introduction to Stochastic Search and Optimization. Wiley, Hoboken (2003)

  26. Tomick, J., Arnold, S., Barton, R.: Sample size selection for improved Nelder–Mead performance. In: Proceedings of the 1995 Winter Simulation Conference, pp. 341–345. IEEE (1995)

Acknowledgments

We thank Alexandre Proutiere for providing critical insights that allowed this work to be completed. We also thank Katya Scheinberg and an anonymous referee for alerting us to errors in earlier drafts of our analysis. We thank Layne Watson for sending us the Fortran code for QNSTOP. This material is based upon work supported by the U.S. Department of Energy, Office of Science, under Contract DE-AC02-06CH11357. We thank Gail Pieper for her useful language editing.

Author information

Corresponding author

Correspondence to Jeffrey Larson.

Appendix

We now prove a sequence of results culminating in Proposition 3, which shows that a model regressing a sufficiently large, strongly \(\varLambda \)-poised set of points will be \(\alpha \)-probabilistically \(\kappa \)-fully linear.

Lemma 6

Let f be continuously differentiable on \(\varOmega \), and let m(x) denote the linear model regressing a set of strongly \(\varLambda \)-poised points \(Y = \left\{ y^0,\ldots ,y^p \right\} \). Then, for any \(x \in \varOmega \) the following identities hold:

  1.

    \(m(x) - f(x) = \sum \limits _{i=0}^p G_i(x)^T(y^i-x) \ell _i(x) + \sum \limits _{i=0}^p \epsilon _i \ell _i(x)\),

  2.

    \(\nabla m(x) - \nabla f(x) = \sum \limits _{i=0}^p G_i(x)^T(y^i-x) \nabla \ell _i(x) + \sum \limits _{i=0}^p \epsilon _i \nabla \ell _i(x)\),

where \(\epsilon _i\) denotes the noise at \(y^i\), \(\ell _i\) is the ith Lagrange polynomial, and \(G_i(x) = \nabla f(v_i(x)) - \nabla f(x)\) for some point \(v_i(x) = \theta x + (1-\theta ) y^i, 0 \le \theta \le 1\) on the line segment connecting x to \(y^i\).

Proof

The proof is nearly identical to that of [4, Lemma 4.5]. \(\square \)
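
For concreteness, the following sketch (ours, not part of the original analysis) builds regression Lagrange polynomials for the linear basis, assuming the standard construction \(\ell (x) = M(M^TM)^{-1}\phi (x)\) that appears below in the proof of Lemma 8, and checks numerically that the model \(\sum _i \bar{f}_i \ell _i(x)\) coincides with the ordinary least-squares linear fit of the noisy values; the test function, radius, and sample size are arbitrary choices for illustration.

```python
# Illustrative sketch only (not from the paper).  Regression Lagrange
# polynomials for the linear basis phi(x) = (1, x_1, ..., x_n), built as
# ell(x) = M (M^T M)^{-1} phi(x), where the rows of M are phi(y^i)^T.
# The model m(x) = sum_i fbar_i ell_i(x) equals the least-squares linear fit.
import numpy as np

rng = np.random.default_rng(0)
n, p = 3, 20                                   # dimension and p (so p+1 points)
y0 = np.zeros(n)
Y = y0 + 0.1 * rng.uniform(-1.0, 1.0, size=(p + 1, n))   # sample points near y0
Y[0] = y0

f = lambda x: np.sin(x[0]) + x[1] ** 2 - 0.5 * x[2]      # smooth test function
eps = 0.01 * rng.standard_normal(p + 1)                   # i.i.d. mean-zero noise
fbar = np.array([f(y) for y in Y]) + eps                  # noisy values fbar_i

phi = lambda x: np.concatenate(([1.0], x))                # linear basis
M = np.array([phi(y) for y in Y])                         # (p+1) x (n+1)

def ell(x):
    """Regression Lagrange polynomials evaluated at x."""
    return M @ np.linalg.solve(M.T @ M, phi(x))

x = 0.05 * np.ones(n)
m_via_lagrange = fbar @ ell(x)                            # sum_i fbar_i ell_i(x)

coef, *_ = np.linalg.lstsq(M, fbar, rcond=None)           # explicit least squares
m_via_lstsq = coef @ phi(x)

print(abs(m_via_lagrange - m_via_lstsq))                  # agreement up to roundoff
```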

The following lemma is similar to [4, Lemma 4.4].

Lemma 7

Given a sample set \(Y=\left\{ y^0, \ldots , y^p \right\} \subset {\mathbb {R}}^n\), let \(\hat{Y} = \left\{ \hat{y}^0, \ldots , \hat{y}^p \right\} \) be the shifted and scaled sample set defined by \(\hat{y}^i = (y^i - \bar{y})/R\) for some \(R > 0\) and \(\bar{y} \in {\mathbb {R}}^n\). Let \(\phi = \left\{ \phi _0(x), \ldots , \phi _q(x)\right\} \) be a basis for \({\mathcal {P}}_n^d\), and define the basis \(\hat{\phi } = \left\{ \hat{\phi }_0(x), \ldots , \hat{\phi }_q(x) \right\} \) by \(\hat{\phi }_i(x) = \phi _i(Rx + \bar{y})\), \(i=0,\ldots ,q\). Let \(\left\{ \ell _0(x), \ldots , \ell _p(x) \right\} \) be the regression Lagrange polynomials for Y, and let \(\left\{ \hat{\ell }_0(x), \ldots , \hat{\ell }_p(x) \right\} \) be the regression Lagrange polynomials for \(\hat{Y}\). Then \(\displaystyle M(\phi ,Y) = M(\hat{\phi },\hat{Y})\). Moreover, if Y is poised, then

$$\begin{aligned} \ell _j(x) = \hat{\ell }_j\left( \frac{x-\bar{y}}{R}\right) , \quad \text{ for } j=0,\ldots ,p. \end{aligned}$$

Proof

The proof is identical to that of [4, Lemma 4.4]. (Note that the wording there gives specific choices for \(R, \bar{y}\) and \(\phi \); however, its proof does not rely on these choices). \(\square \)

Lemma 8

Let \(Y = \{y^0, \ldots , y^p\} \subset B(y^0;\varDelta )\) be a strongly \(\varLambda \)-poised sample set on \(B(y^0;\varDelta )\). Let \(\ell (x) = \left( \ell _0(x), \ldots , \ell _p(x) \right) ^T\), where \(\ell _j(x)\), \(j=0,\ldots ,p\), are the regression Lagrange polynomials of order d for Y. Let \(q+1\) denote the dimension of \({\mathcal {P}}_n^d\). Then for any \(x \in B(y^0;\varDelta )\),

$$\begin{aligned} \left\| { \nabla \ell (x) } \right\| \le \frac{(q+1) \hat{\theta }}{\varDelta \sqrt{p+1}} \varLambda , \end{aligned}$$

where \(\hat{\theta }\) is a constant that depends on n and d but is independent of Y and \(\varLambda \).

Proof

Let \(\hat{Y} = \left\{ \hat{y}^0, \ldots , \hat{y}^p \right\} \) be the shifted and scaled sample set defined by \(\hat{y}^i = (y^i - y^0)/\varDelta \).

Let \(\bar{\phi } = \left\{ \bar{\phi }_0(x), \ldots , \bar{\phi }_q(x)\right\} \) be a basis for \({\mathcal {P}}_n^d\), and define the basis \(\hat{\phi } = \left\{ \hat{\phi }_0(x), \ldots , \hat{\phi }_q(x) \right\} \) by \(\hat{\phi }_i(x) = \bar{\phi }_i(\varDelta x+ y^0)\), \(i=0,\ldots ,q\). Let \(\left\{ \ell _0(x), \ldots , \ell _p(x) \right\} \) be the regression Lagrange polynomials for Y, and let \(\left\{ \hat{\ell }_0(x), \ldots , \hat{\ell }_p(x) \right\} \) be the regression Lagrange polynomials for \(\hat{Y}\).

Let \(M=M(\hat{\phi },\hat{Y})\) (defined in (2)). By the proof of [8, Lemma 4.12], \(\left\| {M(M^T M)^{-1}} \right\| \le \frac{(q+1) \bar{\theta }}{\sqrt{p+1}} \varLambda \) for some \(\bar{\theta } > 0\) that depends on n and d but is independent of Y and \(\varLambda \). (Note that the statement of [8, Lemma 4.12] assumes \(\max _i \left\| {\hat{y}^i} \right\| =1\); however, its proof does not rely on this assumption.) Thus, for any \(x \in B(0;1)\),

$$\begin{aligned} \left\| {\nabla \hat{\ell }(x)} \right\| = \left\| { M(M^TM)^{-1} \nabla \hat{\phi }(x) } \right\| \le \frac{(q+1)\bar{\theta }}{\sqrt{p+1}} \varLambda \left\| {\nabla \hat{\phi }(x)} \right\| \le \frac{(q+1)\hat{\theta }}{\sqrt{p+1}} \varLambda , \end{aligned}$$

where \(\hat{\theta } = \bar{\theta }\max _{x \in B(0;1)} \left\| {\nabla \hat{\phi }(x)} \right\| \). By Lemma 7,

$$\begin{aligned} \left\| {\nabla \ell (x)} \right\| = \frac{1}{\varDelta }\left\| { \nabla \hat{\ell }\left( \frac{x-y^0}{\varDelta }\right) } \right\| \le \frac{(q+1)\hat{\theta }}{\varDelta \sqrt{p+1}} \varLambda . \end{aligned}$$

\(\square \)
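
The \(1/\varDelta \) factor in the bound above can be observed directly for the linear basis: by the change of variables in Lemma 7, shrinking a sample set by a factor \(\varDelta \) scales the (constant) Jacobian of its regression Lagrange polynomials by \(1/\varDelta \). The short sketch below (ours, with arbitrary data) illustrates this scaling numerically.

```python
# Illustrative sketch only.  For the linear basis, the Jacobian of the
# regression Lagrange polynomials is the constant matrix M (M^T M)^{-1} J_phi,
# and scaling the sample set by Delta multiplies its norm by 1/Delta.
import numpy as np

rng = np.random.default_rng(1)
n, p = 3, 20
Yhat = rng.uniform(-1.0, 1.0, size=(p + 1, n))    # reference sample set (unit scale)
Yhat[0] = 0.0

def lagrange_jacobian(Y):
    M = np.hstack([np.ones((len(Y), 1)), Y])               # rows are phi(y^i)^T
    J_phi = np.vstack([np.zeros((1, Y.shape[1])), np.eye(Y.shape[1])])
    return M @ np.linalg.solve(M.T @ M, J_phi)             # rows are grad ell_i(x)^T

for Delta in (1.0, 0.1, 0.01):
    J = lagrange_jacobian(Delta * Yhat)                    # sample set shrunk by Delta
    print(Delta, Delta * np.linalg.norm(J, 2))             # product stays constant
```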

We now use the preceding three results to bound the error between the function and a linear model regressing a strongly \(\varLambda \)-poised set of \(p+1\) points.

Proposition 1

Let \(f:{\mathbb {R}}^n \rightarrow {\mathbb {R}}\) be a continuously differentiable function with Lipschitz continuous gradient with constant \(L_g\). Let \(\bar{x} \in {\mathbb {R}}^n\), \(\varLambda > 0\) and \(\varDelta > 0\). Let \(Y = \{y^0, \ldots , y^p\} \subset B(\bar{x};\varDelta )\) be a strongly \(\varLambda \)-poised set of sample points with \(y^0=\bar{x}\). For \(i \in \{0,\ldots ,p\}\), let \(\bar{f}_i = f(y^i) + \epsilon _i\), where the noise \(\epsilon _i\) is sampled from a random distribution with mean 0 and variance \(\sigma ^2 < \infty \). Let \(m: {\mathbb {R}}^n \rightarrow {\mathbb {R}}\) be the unique linear polynomial approximating the data \(\{ (y^i, \bar{f}_i), i=0,\ldots ,p\}\) by least-squares regression. There exist constants \(\bar{C}\) and \(\bar{a}\) such that for any \(a > \bar{a}\) the following inequalities hold:

  1.

    \({\mathbb {P}}\left[ \max _{z \in B(\bar{x};\varDelta )} \left| m(z)-f(z) \right| > a \varDelta ^2 \right] \le \frac{\bar{C}\sigma ^2 \varLambda ^2}{(a-\bar{a})^2 (p+1) \varDelta ^4}\); and

  2.

    \({\mathbb {P}}\left[ \max _{z \in B(\bar{x};\varDelta )} \left\| \nabla m(z)-\nabla f(z) \right\| > a \varDelta \right] \le \frac{\bar{C}\sigma ^2 \varLambda ^2}{(a-\bar{a})^2 (p+1) \varDelta ^4}\).

Proof

Let \(\left\{ \ell _0(x),\ldots ,\ell _p(x) \right\} \) be the linear regression Lagrange polynomials for the set \(Y = \left\{ y^0,\ldots ,y^p \right\} \). Define \(\ell (z)=\left[ \ell _0(z),\ldots ,\ell _p(z) \right] ^T\). Define \(\epsilon = \left[ \epsilon _0,\ldots , \epsilon _p\right] ^T\), and note that \({\mathbb {E}}\left[ \epsilon _i \epsilon _j\right] =0\) for \(i \ne j\). Define \(Z_0(x) = \epsilon ^T \ell (x)\). Then

$$\begin{aligned} {\mathbb {E}}\left[ Z_0\left( \bar{x}\right) ^2\right]= & {} {\mathbb {E}}\left[ \left( \sum \limits _{i=0}^p \epsilon _i \ell _i\left( \bar{x}\right) \right) ^2\right] = \sum \limits _{i=0}^p {\mathbb {E}}\left[ \epsilon _i^2\right] \ell _i\left( \bar{x}\right) ^2 \nonumber \\= & {} \sigma ^2 \sum \limits _{i=0}^p \ell _i\left( \bar{x}\right) ^2 = \sigma ^2 \left\| {\ell \left( \bar{x}\right) } \right\| ^2 \le \frac{\sigma ^2(n+1)^2 \varLambda ^2}{p+1}, \end{aligned}$$
(22)

where the last inequality follows from Definition 2 and the assumption that Y is strongly \(\varLambda \)-poised on \(B(\bar{x};\varDelta )\).

Because \(Z_0\) is a linear function, its Jacobian matrix is constant. Denote this matrix by \(Z_1 = \nabla Z_0(\bar{x}) = \epsilon ^T \nabla \ell (\bar{x})\). Then,

$$\begin{aligned} {\mathbb {E}}\left[ \left\| {Z_1} \right\| ^2\right]&= \sum \limits _{i=0}^p {\mathbb {E}}\left[ \epsilon _i^2\right] \nabla \ell _i(\bar{x})^T \nabla \ell _i(\bar{x}) = \sigma ^2 \text{ Tr }\left( \nabla \ell (\bar{x}) \nabla \ell (\bar{x})^T\right) \nonumber \\&= \sigma ^2 \left\| {\nabla \ell (\bar{x})} \right\| _F^2 \le \sigma ^2 (n+1) \left\| {\nabla \ell (\bar{x})} \right\| _2^2 \nonumber \\&\le \frac{\sigma ^2(n+1)^3\hat{\theta }^2 \varLambda ^2}{\varDelta ^2 (p+1)}, \quad \text{ by } \text{ Lemma }~8, \end{aligned}$$
(23)

where \(\hat{\theta }\) is a constant that depends on n but is independent of Y and \(\varLambda \).

Next, observe that for any \(z\in B(\bar{x};\varDelta )\), \( \left\| {Z_0(z)} \right\| ^2 = \left\| { Z_0(\bar{x}) + Z_1(z - \bar{x})} \right\| ^2 \le \left( \left\| {Z_0(\bar{x})} \right\| + \varDelta \left\| {Z_1} \right\| \right) ^2 \le 2\left\| {Z_0(\bar{x})} \right\| ^2 + 2 \varDelta ^2 \left\| {Z_1} \right\| ^2\). Combining this with (22) and (23) yields the inequality

$$\begin{aligned} {\mathbb {E}}\left[ \max _{z \in B(\bar{x};\varDelta )} \left\| {Z_0(z)} \right\| ^2\right] \le \frac{2\sigma ^2(n+1)^2\varLambda ^2}{p+1} + \frac{2\sigma ^2(n+1)^3 \hat{\theta }^2 \varLambda ^2}{ p+1} =C_0 \frac{\sigma ^2 \varLambda ^2}{p+1}, \end{aligned}$$

where \(C_0 = 2(n+1)^2 \left( 1+(n+1)\hat{\theta }^2\right) \).

By Markov’s inequality, for any \(a>0\),

$$\begin{aligned} {\mathbb {P}}\left[ \left\| {Z_1} \right\| \ge a \varDelta \right] = {\mathbb {P}}\left[ \left\| {Z_1} \right\| ^2 \ge a^2 \varDelta ^2 \right] \le \frac{{\mathbb {E}}\left[ \left\| {Z_1} \right\| ^2\right] }{a^2 \varDelta ^2} \le \frac{\sigma ^2(n+1)^3\hat{\theta }^2 \varLambda ^2}{a^2 \varDelta ^4 (p+1)}, \end{aligned}$$

and

$$\begin{aligned} {\mathbb {P}}\left[ \max _{z \in B(\bar{x};\varDelta )} \left\| {Z_0(z)} \right\| \ge a \varDelta ^2\right] \le \frac{ {\mathbb {E}}\left[ \max _{z \in B(\bar{x};\varDelta )}\left\| {Z_0(z)} \right\| ^2\right] }{a^2 \varDelta ^4} \le \frac{C_0 \sigma ^2 \varLambda ^2}{a^2 \varDelta ^4 (p+1)}. \end{aligned}$$

Define \(g_i(z)=G_i(z)^T(y^i-z)\), where \(G_i(z)\) is defined in Lemma 6. Let \(g(z)=\left[ g_0(z),\ldots ,g_p(z) \right] ^T\). Because \(\nabla f\) is Lipschitz continuous, \(|g_i(z)| \le 2 L_g \varDelta ^2\) and \(\left\| {g(z)} \right\| \le 2 \sqrt{p+1} L_g \varDelta ^2\). By Lemma 6,

$$\begin{aligned} \left\| {\nabla m(z)-\nabla f(z)} \right\|&= \left\| { \left( g(z)^T + \epsilon ^T \right) \nabla \ell (\bar{x})} \right\| \\&\le 2\sqrt{p+1} L_g \varDelta ^2 \left\| {\nabla \ell (\bar{x})} \right\| + \left\| {Z_1} \right\| \\&\le 2 L_g \varDelta (n+1)\hat{\theta } \varLambda + \left\| {Z_1} \right\| , \quad \text{ by } \text{ Lemma }~8. \end{aligned}$$

Thus,

$$\begin{aligned} {\mathbb {P}}\left[ \max _{z \in B(\bar{x};\varDelta )} \left\| {\nabla m(z)-\nabla f(z)} \right\| \ge a \varDelta \right]&\le {\mathbb {P}}\left[ \left\| {Z_1} \right\| \ge \left( a - 2 L_g (n+1) \hat{\theta } \varLambda \right) \varDelta \right] \\&\le \frac{\sigma ^2(n+1)^3\hat{\theta }^2 \varLambda ^2}{\left( a - 2 L_g (n+1) \hat{\theta } \varLambda \right) ^2 \varDelta ^4 (p+1)} \\&= \frac{C_1 \sigma ^2 \varLambda ^2}{\left( a-\bar{a}_1\right) ^2 (p+1) \varDelta ^4}, \end{aligned}$$

where \(C_1 = (n+1)^3 \hat{\theta }^2\) and \(\bar{a}_1=2L_g(n+1)\hat{\theta } \varLambda \).

By Definition 2, \(\left\| \ell (z) \right\| \le \frac{n+1}{\sqrt{p+1}}\varLambda \). Thus, by Lemma 6,

$$\begin{aligned} \left| m(z)- f(z) \right|&= \left| \left( g(z) + \epsilon \right) ^T \ell (z) \right| \\&\le \left\| {g(z)} \right\| \left\| {\ell (z)} \right\| +\left\| {Z_0(z)} \right\| \\&\le 2 L_g \sqrt{p+1} \varDelta ^2 \frac{(n+1) \varLambda }{\sqrt{p+1}} + \left\| {Z_0(z)} \right\| = 2 L_g \varDelta ^2 (n+1) \varLambda + \left\| {Z_0(z)} \right\| . \end{aligned}$$

Thus,

$$\begin{aligned}&{\mathbb {P}}\left[ \max _{z \in B(\bar{x};\varDelta )} \left| m(z)- f(z) \right| \ge a \varDelta ^2\right] \\&\le {\mathbb {P}}\left[ \max _{z \in B(\bar{x};\varDelta )} \left\| {Z_0(z)} \right\| \ge \left( a - 2 L_g (n+1) \varLambda \right) \varDelta ^2\right] \\&\le C_0 \frac{\sigma ^2 \varLambda ^2}{\left( a - 2 L_g (n+1) \varLambda \right) ^2 \varDelta ^4 (p+1)} \\&= C_0 \frac{\sigma ^2 \varLambda ^2}{\left( a - \bar{a}_2 \right) ^2 \varDelta ^4 (p+1)}, \end{aligned}$$

where \(\bar{a}_2 = 2L_g(n+1)\varLambda \). The result follows with \(\bar{a} = \max \{ \bar{a}_1, \bar{a}_2 \}\) and \(\bar{C} = \max \{C_0,C_1\}\). \(\square \)
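
A key step in the proof is the variance computation (22): the noise contribution \(Z_0(\bar{x}) = \sum _i \epsilon _i \ell _i(\bar{x})\) has variance \(\sigma ^2 \left\| \ell (\bar{x}) \right\| ^2\), which shrinks as the number of regression points grows. The following Monte Carlo sketch (ours; the dimension, radius, and noise level are arbitrary) checks this identity numerically.

```python
# Illustrative Monte Carlo check of the variance identity (22):
# Var[Z_0(xbar)] = sigma^2 * ||ell(xbar)||^2 for i.i.d. mean-zero noise.
import numpy as np

rng = np.random.default_rng(2)
n, sigma, Delta = 3, 0.1, 0.05
xbar = np.zeros(n)
phi = lambda x: np.concatenate(([1.0], x))

for p in (5, 50, 500):
    Y = xbar + Delta * rng.uniform(-1.0, 1.0, size=(p + 1, n))   # regression points
    Y[0] = xbar
    M = np.array([phi(y) for y in Y])
    ell_xbar = M @ np.linalg.solve(M.T @ M, phi(xbar))           # ell(xbar)

    # Empirical variance of Z_0(xbar) over independent noise realizations.
    Z0 = np.array([rng.normal(0.0, sigma, p + 1) @ ell_xbar for _ in range(20000)])
    print(p + 1, Z0.var(), sigma ** 2 * np.linalg.norm(ell_xbar) ** 2)
```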

We now show that the models \(F_k^0\) and \(F_k^s\) can easily be constructed to satisfy Conditions 1 and 2.

Proposition 2

Let \(f:{\mathbb {R}}^n \rightarrow {\mathbb {R}}\) be a Lipschitz continuously differentiable function. Let \(\{c_k\}\), \(\{\varDelta _k\}\), and \(\{a_k\}\) be positive sequences such that \(c_k \rightarrow +\infty \) and \(a_k \downarrow 0\). Let \(\{x^k\}\) and \(\{s^k\}\) be sequences in \({\mathbb {R}}^n\) with \(\left\| {s^k} \right\| \le \varDelta _k\). Let \(\varLambda > 0\). For each k, define \(\delta _k = \min \{\varDelta _k, 1\}\), and let \(Y_k^0 \subset B\left( x^k;a_k \delta _k \right) \) and \(Y_k^s \subset B\left( x^k+s^k; a_k \delta _k \right) \) be strongly \(\varLambda \)-poised sample sets with at least \(c_k/(a_k \delta _k)^4\) points each. Let \(m_k^0\) and \(m_k^s\) be the linear polynomials fitting the computed function values on the sample sets \(Y_k^0\) and \(Y_k^s\), respectively, by least-squares regression. Define \(F_k^0 = m_k^0\left( x^k\right) \) and \(F_k^s = m_k^s \left( x^k+s^k \right) \). Then \(\{F_k^0\}\) and \(\{F_k^s\}\) satisfy Conditions 1 and 2.

Proof

Let \(\bar{C}\) and \(\bar{a}\) be the constants defined in Proposition 1. Since \(a_k \downarrow 0\) and \(c_k \rightarrow +\infty \), for any \(\omega > 0\) there exists \(\bar{k}(\omega )\) such that for all \(k > \bar{k}(\omega )\), \(\displaystyle 2\bar{a}\left( a_k \delta _k\right) ^2 < \frac{\beta \eta }{2} \delta _k^2 \le \frac{\beta \eta }{2} \min \{ \varDelta _k, \varDelta _k^2\}\) and \(\displaystyle \frac{\bar{C}\sigma ^2 \varLambda ^2}{\bar{a}^2 \left| Y_k^0 \right| a_k^4\delta _k^4} \le \omega /2\). Thus,

$$\begin{aligned}&{\mathbb {P}}\left[ \left| F_k^0 - f\left( x^k\right) \right| > \frac{\beta \eta }{2} \delta _k^2 \right] \\&\le {\mathbb {P}}\left[ \left| m_k^0\left( x^k\right) -f\left( x^k\right) \right| > 2 \bar{a}\left( a_k \delta _k\right) ^2\right] \\&\le \frac{\bar{C} \sigma ^2 \varLambda ^2}{\bar{a}^2 \left| Y_k^0 \right| a_k^4 \delta _k^4}&\text{ by } \text{ Proposition }~1 \\&\le \frac{\omega }{2}. \end{aligned}$$

Using the same argument, we can show that

$$\begin{aligned} {\mathbb {P}}\left[ \left| F_k^s - f\left( x^k+s^k\right) \right| > \frac{\beta \eta }{2} \delta _k^2 \right] \le \frac{\omega }{2}. \end{aligned}$$

Thus, for \(k > \bar{k}(\omega )\),

$$\begin{aligned}&{\mathbb {P}}\left[ \left| F_k^0-f\left( x^k\right) -F_k^s+f\left( x^k+s^k\right) \right| > \beta \eta \min \left\{ \varDelta _k, \varDelta _k^2\right\} \right] \\&\quad \le {\mathbb {P}}\left[ \left| F_k^0 - f\left( x^k\right) \right| > \frac{\beta \eta }{2} \delta _k^2\right] + {\mathbb {P}}\left[ \left| F_k^s - f\left( x^k+s^k\right) \right| > \frac{\beta \eta }{2} \delta _k^2\right] \le \omega , \end{aligned}$$

so Condition 1 is satisfied.

To prove that Condition 2 holds, observe that for \(k > \bar{k}(1)\), \(\left( \beta \eta + \xi \right) \delta _k^2 > \left( 2 \bar{a} + \frac{\xi }{a_k^2} \right) a_k^2 \delta _k^2\). Thus,

$$\begin{aligned}&{\mathbb {P}}\left[ \left| F_k^0 - f\left( x^k\right) \right| > \frac{\beta \eta +\xi }{2} \delta _k^2\right] \\&\le {\mathbb {P}}\left[ \left| m_k^0\left( x^k\right) -f\left( x^k\right) \right| > \left( 2 \bar{a}+\frac{\xi }{a_k^2}\right) \left( a_k \delta _k\right) ^2\right] \\&\le \frac{\bar{C} \sigma ^2 \varLambda ^2}{\left( \bar{a}+\xi /a_k^2\right) ^2 \left| Y_k^0 \right| a_k^4 \delta _k^4}&\text{ by } \text{ Proposition }~1\\&\le \frac{\bar{a}^2}{2(\bar{a}+\xi /a_k^2)^2} \le \frac{\bar{a} a_k^2}{4 \xi } \le \frac{\theta }{2 \xi }, \end{aligned}$$

for the constant \(\theta = \max _{k} \bar{a} a_k^2/2\). Similarly, we can show that \(\displaystyle {\mathbb {P}}\left[ \left| F_k^s - f(x^k+s^k) \right| > \frac{\beta \eta +\xi }{2} \delta _k^2\right] \le \frac{\theta }{2 \xi }\). It follows that

$$\begin{aligned}&{\mathbb {P}}\left[ F_k^0 - f\left( x^k\right) +f\left( x^k+s^k\right) -F_k^s > \left( \beta \eta +\xi \right) \min \left\{ \varDelta _k, \varDelta _k^2\right\} \right] \\&\quad \le {\mathbb {P}}\left[ \left| F_k^0 - f\left( x^k\right) \right| > \frac{\beta \eta +\xi }{2} \delta _k^2\right] \\&\quad \quad + {\mathbb {P}}\left[ \left| F_k^s - f\left( x^k+s^k\right) \right| > \frac{\beta \eta +\xi }{2} \delta _k^2\right] < \frac{\theta }{\xi }. \end{aligned}$$

This proves that Condition 2 is satisfied. \(\square \)
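
To give a feel for the sampling requirement \(\left| Y_k^0 \right| \ge c_k/(a_k \delta _k)^4\), the brief sketch below evaluates this lower bound for hypothetical sequences \(c_k\) and \(a_k\) and trust-region radii \(\varDelta _k\) (all chosen here purely for illustration; the proposition does not prescribe these particular values).

```python
# Illustrative only: per-iteration sample sizes required by Proposition 2,
# |Y_k| >= c_k / (a_k * delta_k)^4, for hypothetical parameter choices.
import math

Delta = [0.5, 0.25, 0.1, 0.05, 0.01]            # hypothetical trust-region radii
for k, Dk in enumerate(Delta, start=1):
    c_k = 10.0 * k                               # c_k -> infinity (hypothetical)
    a_k = k ** -0.25                             # a_k -> 0 (hypothetical)
    delta_k = min(Dk, 1.0)
    min_points = math.ceil(c_k / (a_k * delta_k) ** 4)
    print(k, Dk, min_points)
```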

We next show that models built using our proposed method are \(\alpha \)-probabilistically \(\kappa \)-fully linear.

Proposition 3

Let \(f:{\mathbb {R}}^n \rightarrow {\mathbb {R}}\) be a continuously differentiable function with Lipschitz continuous gradient with constant \(L_g\). Let \(\bar{x} \in {\mathbb {R}}^n\), \(\varLambda > 0\) and \(\varDelta > 0\). Let \(Y = \{y^0, \ldots , y^p\} \subset B(\bar{x};\varDelta )\) be a strongly \(\varLambda \)-poised set of sample points with \(y^0=\bar{x}\). For \(i \in \{0,\ldots ,p\}\), let \(\bar{f}_i = f(y^i) + \epsilon _i\), where the noise \(\epsilon _i\) is sampled from a random distribution with mean 0 and variance \(\sigma ^2 < \infty \). Let \(m: {\mathbb {R}}^n \rightarrow {\mathbb {R}}\) be the unique linear polynomial approximating the data \(\{ (y^i, \bar{f}_i), i=0,\ldots ,p\}\) by least-squares regression. Let \(\alpha \in (0,1)\) be given. There exist a positive constant \(\hat{C}\) depending only on \(\varLambda \) and \(L_g\), and constants \(\kappa = (\kappa _{\mathrm{ef}},\kappa _{\mathrm{eg}})\) depending only on \(\varLambda \), \(\alpha \), and \(L_g\), such that if \(p \ge \frac{\hat{C}}{\varDelta ^4}\), then m is \(\alpha \)-probabilistically \(\kappa \)-fully linear.

Proof

Let \(\bar{C}\) and \(\bar{a}\) be defined as in Proposition 1. Defining \(\kappa _{\mathrm{ef}}= \kappa _{\mathrm{eg}}= a = \bar{a}+1\) and \(\hat{C} = \frac{2\bar{C} \sigma ^2 \varLambda ^2}{1 - \alpha }\), if \(p \ge \frac{\hat{C}}{\varDelta ^4} - 1\), then by Proposition 1, we have

$$\begin{aligned} {\mathbb {P}}\left[ \max _{z \in B(\bar{x};\varDelta )} \left| m(z)-f(z) \right| > \kappa _{\mathrm{ef}}\varDelta ^2 \right] \le \frac{\bar{C}\sigma ^2 \varLambda ^2}{(p+1) \varDelta ^4} \le \frac{1-\alpha }{2} \end{aligned}$$

and

$$\begin{aligned} {\mathbb {P}}\left[ \max _{z \in B(\bar{x};\varDelta )} \left\| \nabla m(z)-\nabla f(z) \right\| > \kappa _{\mathrm{eg}}\varDelta \right] \le \frac{\bar{C}\sigma ^2 \varLambda ^2}{(p+1) \varDelta ^4} \le \frac{1-\alpha }{2}. \end{aligned}$$

Therefore, by the union bound, for any \(y \in B(\bar{x};\varDelta )\),

$$\begin{aligned} {\mathbb {P}}\left[ \left\| {\nabla f(y) - \nabla m(y)} \right\| \le \kappa _{\mathrm{eg}}\varDelta \text{ and } \left| f(y) - m(y) \right| \le \kappa _{\mathrm{ef}}\varDelta ^2\right] \ge \alpha , \end{aligned}$$

and the model is \(\alpha \)-probabilistically \(\kappa \)-fully linear. \(\square \)

One can easily test whether a set \(Y_k\) with \(\left| Y_k \right| > \frac{C}{\varDelta _k}\) is strongly \(\varLambda \)-poised by using Theorem 4.12 in [8]. A partial set of points that are not strongly \(\varLambda \)-poised can be completed to a strongly \(\varLambda \)-poised set by using Algorithm 6.7 in [8].
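
As a rough numerical companion (ours; it is not the test from [8]), one can estimate the smallest \(\varLambda \) compatible with the bound \(\left\| \ell (x) \right\| \le \frac{n+1}{\sqrt{p+1}}\varLambda \) used in the proofs above by maximizing \(\left\| \ell (x) \right\| \) over sampled points in the ball; the sampling scheme and problem sizes below are arbitrary.

```python
# Illustrative sketch only: estimate the poisedness constant of a sample set
# for the linear basis by maximizing ||ell(x)|| over points drawn from the
# ball B(xbar; Delta) and rescaling by sqrt(p+1)/(n+1).  Finite sampling gives
# a lower estimate of the true maximum (and hence of Lambda).
import numpy as np

rng = np.random.default_rng(3)
n, p, Delta = 3, 30, 0.1
xbar = np.zeros(n)

def ball_points(center, radius, count):
    """Uniform points in the Euclidean ball B(center; radius)."""
    d = rng.standard_normal((count, len(center)))
    d /= np.linalg.norm(d, axis=1, keepdims=True)
    r = radius * rng.uniform(0.0, 1.0, count) ** (1.0 / len(center))
    return center + r[:, None] * d

Y = ball_points(xbar, Delta, p + 1)
Y[0] = xbar
phi = lambda x: np.concatenate(([1.0], x))
M = np.array([phi(y) for y in Y])
ell = lambda x: M @ np.linalg.solve(M.T @ M, phi(x))

max_norm = max(np.linalg.norm(ell(x)) for x in ball_points(xbar, Delta, 5000))
print(np.sqrt(p + 1) / (n + 1) * max_norm)       # estimated (lower bound on) Lambda
```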

Cite this article

Larson, J., Billups, S.C. Stochastic derivative-free optimization using a trust region framework. Comput Optim Appl 64, 619–645 (2016). https://doi.org/10.1007/s10589-016-9827-z
