trustOptim: An R Package for Trust Region Optimization with Sparse Hessians

Trust region algorithms are nonlinear optimization tools that tend to be stable and reliable when the objective function is non-concave, ill-conditioned, or exhibits regions that are nearly flat. Additionally, most freely-available optimization routines do not exploit the sparsity of the Hessian when such sparsity exists, as in log posterior densities of Bayesian hierarchical models. The trustOptim package for the R programming language addresses both of these issues. It is intended to be robust, scalable and efficient for a large class of nonlinear optimization problems that are often encountered in statistics, such as finding posterior modes. The user must supply the objective function, gradient and Hessian. However, when used in conjunction with the sparseHessianFD package, the user does not need to supply the exact sparse Hessian, as long as the sparsity structure is known in advance. For models with a large number of parameters, but for which most of the cross-partial derivatives are zero (i.e., the Hessian is sparse), trustOptim offers dramatic performance improvements over existing options, in terms of computational time and memory footprint.


Introduction
The need to optimize continuous nonlinear functions occurs frequently in statistics, most notably in maximum likelihood and maximum a posteriori (MAP) estimation.Users of R (R Development Core Team, 2012) have a choice of dozens of optimization algorithms.The most readily available algorithms are those that are accessed from the optim function in the base R distribution.These algorithms include conjugate gradient (CG), quasi-Newton using BFGS updates, limited-memory BFGS (L-BFGS), derivative-free heuristic search (Nelder-Mead) and simulated annealing (SANN).In addition, users can install contributed packages that implement algorithms that are either not available through optim, or improve those that are.Many are described in the CRAN Task View for Optimization and Mathematical Programming (Theussel, 2013).For example, the nloptwrap (Borchers, 2013) and optimx packages (Nash and Varadhan, 2011) each implement many different algorithms, each with its own characteristics and limitations.Other packages, like Rcgmin (Nash, 2013) and trust (Geyer, 2009), are purpose-built for a single algorithm (conjugate gradient and trust region, respectively).Any particular algorithm may be more appropriate for some problems than for others, and having such a large number of alternatives lets the informed R user choose the best tool for the task at hand.
One limitation of most of these algorithms is that they can be difficult to use when there is a large number of decision variables.Search methods like Nelder-Mead are inefficient with a massive number of parameters because the search space is large, and they do not exploit information about slope and curvature to speed up the time to convergence.CG and BFGS do use gradient information, with BFGS tracing out the curvature of the function by using successive gradients to approximate the inverse Hessian.However, because BFGS stores the entire dense inverse Hessian, its use is resource-intensive when the number of parameters is large.For example, the Hessian for a 50,000 parameter model requires 20GB of RAM to store it as a standard, dense base R matrix.Conjugate gradient methods, and the limited-memory methods in nloptwrap (e.g., L-BFGS, truncated Newton and variable metric) do not store the full Hessian (or its inverse), so they can be more suited for largescale problems.However, like BFGS, they are not certain to approximate the curvature of the objective function accurately at any particular iteration, especially if the function is not convex.
Conjugate gradient, BFGS, L-BFGS, truncated Newton and variable metric methods fall into the "line search" class of nonlinear optimization algorithms.In short, line search methods choose a direction along which to move from x t to x t+1 , and then find the distance along that direction that yields the greatest improvement in the objective function.A simple example of a line search method is "steepest descent," which follows the direction of the gradient at x t , and searches for the "best" point along that line.Steepest descent in known to be inefficient, which is why one might use these other methods to find a better direction in which to advance (Nocedal and Wright, 2006).However, if the objective function is ill-conditioned, non-convex, or has long ridges or plateaus, the optimizer may try to search far away from x t , only to select an x t+1 that is closer to x t , but offers only small improvement in the objective function.At worst, the line search step will try to evaluate the objective function so far away from x t that the objective function is not finite, and the algorithm will fail.
In contrast, the trust package (Geyer, 2009), as well as the package that is presented in this paper, take a "trust region" approach.In our experience, trust region algorithms tend to be more robust and stable than line search algorithms, and may succeed for certain kinds of large-scale problems that line search methods cannot solve.Like many other nonlinear optimizers, they are iterative, and use gradient and Hessian estimates at each step to decide where to move next.Trust region methods work by choosing a maximum distance for the move from x t to x t+1 , defining a "trust region" around x t that has a radius of that maximum distance, and then letting a candidate for x t+1 be the minimum, within the trust region, of a quadratic approximation of the objective function.We call this constrained quadratic program the "trust region subproblem" or TRS.Because we do not consider points outside of the trust region, the algorithm never runs too far, too fast, from the current iterate.If we try to move to a point in the trust region that is worse than, or insufficiently better than, the current point, we adaptively shrink the trust region.This step excludes other points that are too far away from x t to be reasonable candidates for x t+1 .We then solve the new TRS with the smaller trust region.If we accept a point close to the border of the trust region, and that point gives as a large enough improvement in the objective function, we can expand the trust region for the next iteration.By adaptively adjusting the size of the trust region, we try to prevent the algorithm from jumping over the local optimum, while allowing for steps that are large enough for the algorithm to converge quickly.
Like line search methods, trust region methods are guaranteed to converge to a point where the norm of the gradient is nearly zero and the Hessian is positive definite, if such a point exists.The primary advantage of trust region methods is stability.If a point along a line search path causes the objective function to be undefined or indeterminate, most implementations of line search methods will fail.It is not immediately clear how the search should proceed in that event; user intervention is usually required.In contrast, the search for x t+1 in a trust region algorithm is always a solution to the TRS, which should always be finite, even when the Hessian is indefinite.If the objective function, at the solution to the TRS, is not finite (or just not much better than at x t ), we reject that proposal, shrink the trust region, and try again.Furthermore, a line search requires repeated estimation of the objective function, while trust region methods evaluate the objective function only after solving the TRS.Thus, trust region methods can run a lot faster when the objective function is expensive to compute.Although there is no guarantee that trust region algorithms will always converge faster than other alternatives, they may work better for difficult optimization problems that other algorithms cannot solve.
To use the trust package, the user must provide a function that returns the Hessian of the objective function as a standard, dense R matrix.Because trust computes the eigenvalues of the Hessian to solve the TRS, it tends to work well on functions with no more than a few hundred variables.The computation time and memory utilization is too high to make it practical for larger problems.In this paper, we present the trustOptim package as an alternative trust region optimizer for R. Unlike trust, trustOptim is optimized for problems for which the Hessian is sparse.Sparse Hessians occur when a large number of the cross-partial derivatives of the objective function are zero.For example, suppose we want to find the mode of a log posterior density for a Bayesian hierarchical model.
If we assume that individual-level parameter vectors β i and β j are conditionally independent, the cross-partial derivatives between all elements of β i and β j are zero.If the model includes a large number of heterogeneous units, and a relatively small number of population-level parameters, the proportion of non-zero entries in the Hessian will be small.Since we know up front which elements of the Hessian are non-zero, we need to compute, store, and operate on only those non-zero elements.By storing sparse Hessians in a compressed format, and using a library of numerical algorithms that are efficient for sparse matrices, we can run the optimization algorithms faster, with a smaller memory footprint, than algorithms that operate on dense Hessians. 1 In this paper, we will show that trustOptim can perform better than Hessian-free algorithms as well.
In the next section, we discuss the specifics of the trust region implementation in trustOptim.We then introduce the trust.optimfunction, describe how to use it, and demonstrate its performance in a hierarchical binary regression example.As part of this demonstration, we compare its performance to that of some other gradient-based nonlinear optimizers that are available for R. 2

Algorithmic details
Consider f (x), an objective function over a p-dimensional vector that we want to minimize.Let g be the gradient, and let B be the Hessian.The goal is to find a local minimum of f (x), with no constraints on x, within some window of numerical precision (i.e., where ||g||/ √ n < for small > 0).We will assume that B is positive definite at the local optimum, but not necessarily at other values of x.Iterations are indexed by t.

Trust region methods for nonlinear optimization
The details of trust region methods are described in both Nocedal and Wright (2006) and Conn et al. (2000), and the following exposition borrows heavily from both sources.At each iteration of a trust region algorithm, we construct a quadratic approximation to the objective function at x t , and minimize that approximation, subject to a constraint that the solution falls within a trust region with radius d.More formally, each iteration of the trust region algorithm involves solving the "trust region subproblem," or TRS.
The norm • M is a Mahalanobis norm with respect to some positive definite matrix M.
Let s t be the solution to the TRS for iteration t, and consider the ratio This ratio is the improvement in the objective function that we would get from a move from x t to x t+1 , where x t+1 = x t + s t , relative to the improvement that is predicted by the quadratic approximation.Let η 1 be the minimum value of ρ t for which we deem it "worthwhile" to move from x t to x t+1 , and let η 2 be the maximum ρ t that would trigger a shrinkage in the trust region.If ρ t < η 2 , or if f (x t + s t ) is not finite, we shrink the trust region by reducing d t by some predetermined factor, and compute a new s t by solving the TRS again.If ρ t > η 1 , we move to x t+1 = x t + s t .Also, if we do accept the move, and s t is on the border of the trust region, we expand the trust region by increasing d, again by some predetermined factor.The idea is to not move to a new x if f (x t+1 ) would be worse than f (x t ).By expanding the trust region, we can propose larger jumps, and potentially reach the optimum more quickly.We want to propose only moves that are among those that we "trust" to give reasonable values of f (x).If it turns out that a move leads to a large improvement in the objective function, and that the proposed move was constrained by the radius of the trust region, we want to expand the trust region so we can take larger steps.If the proposed move is bad, we should then reduce the size of the region we trust, and try to find another step that is closer to the current iterate.Of course, there is no reason that the trust region needs to change at after at a particular iteration, especially if the solution to the TRS is at an internal point.
There are a number of different ways to solve the TRS; Conn et al. (2000) is authoritative and encyclopedic in this area.The trustOptim package uses the method described in Steihaug (1983).The Steihaug algorithm is, essentially, a conjugate gradient solver for a constrained quadratic program.If B t is positive definite, the Steihaug solution to the TRS will be exact, up to some level of numerical precision.However, if B t is indefinite, the algorithm could try to move in a direction of negative curvature.If the algorithm happens to stumble on such a direction, it goes back to the last direction that it moved, runs in that direction to the border of the trust region, and returns that point of intersection with the trust region border as the "solution" to the TRS.This solution is not necessarily the true minimizer of the TRS, but it still might provide sufficient improvement in the objective function such that ρ t > η 1 .If not, we shrink the trust region and try again.As an alternative to the Steihaug algorithm for solving the TRS, Conn et al. (2000) suggest using the Lanczos algorithm.The Lanczos approach may be more likely to find a better solution to the TRS when B k is indefinite, but at some additional computational cost.We include only the Steihaug algorithm for now, because it still seems to work well, especially for sparse problems.
As with other conjugate gradient methods, one way to speed up the Steihaug algorithm is to rescale the trust region subproblem with a preconditioner M. Note that the constraint in TRS is expressed as an M-norm, rather than a Euclidean norm.The positive definite matrix M should be close enough to the Hessian that M −1 B t ≈ I, but still cheap enough to compute that the cost of using the preconditioner does not exceed the benefits.Of course, the ideal preconditioner would be B t itself, but B t is not necessarily positive definite, and we may not be able to estimate it fast enough for preconditioning to be worthwhile.In this case, one could use a modified Cholesky decomposition, as described in Nocedal and Wright (2006),

Computing Hessians
The trustOptim package provides three trust region "methods" that differ only in how the Hessian matrix B is computed and stored.The Sparse method is the main method in trustOptim, and is optimized for objective functions with sparse Hessians.This method requires the user to supply a function that returns the Hessian as a dgCMatrix matrix, as defined in the Matrix package (Bates and Maechler, 2012b).It is preferred if an analytical expression for the Hessian is readily available, or if the user can compute the Hessian using algorithmic, or automatic, differentiation (AD) software, such as the CppAD library for C++, (Bell, 2012), or AD Model Builder (Fournier et al., 2012) with the R2admb package (Bolker and Skaug, 2012).However, in conjunction with the sparseHessianFD package (Braun, 2012), trustOptim can still be used even if the Hessian is not available, as long as the sparsity structure is known in advance.The routines in sparseHessianFD take as input the row and column indices of the non-zero elements of the lower triangle of the Hessian, and return functions that compute the Hessian through finite differencing of the gradient.These routines exploit the sparsity structure using the algorithms published in Coleman et al. (1985) and can often be faster than computing the Hessian directly.
trustOptim also includes two quasi-Newton methods, BFGS and SR1 for estimating inverse Hessians when the exact Hessian is not available.These methods approximate the Hessian by tracing the curvature of the objective function through repeated estimates of the gradient, and differ only in the formula they use to for the Hessian updates.These Hessians are stored as dense matrices, so they are not appropriate for large problems.In fact, many of the algorithms in nloptwrap will perform better.We include BFGS and SR1 in the package for convenience and completeness.

Using the package
To run the algorithms in trustOptim, the user will call the trust.optimfunction.Its signature is: trust.optim(x, fn, gr, hs=NULL, method=c("SR1","\method{BFGS}","Sparse"), control=list(), ...) The user must supply a function fn that returns f (x), the value of the objective function to be minimized, and a function gr that returns the gradient.For the Sparse method, the function hs returns the Hessian as a sparse matrix of class dgCMatrix, which is defined in the Matrix package.The functions fn, gr, and hs all take a parameter vector as the first argument.Additional named arguments can be passed to fn, gr or hs through the . . .argument.If only the sparsity structure of the Hessian is known, one can use the sparseHessianFD package to construct a function that can be used as the argument to hs.
The quasi-Newton methods SR1 and BFGS do not require the user to provide any Hessian information.For those methods, hs should be, and will default to, NULL.
Although it is true that the CG and BFGS methods in optim do not require a user-supplied gradient, those methods will otherwise estimate the gradient using finite differencing.
In general, we never recommend finite-differenced gradients for any problem other than those with a very small number of variables, even when using optim.Finite differencing takes a long time to run when there is a large number of variables, and is subject to numerical error, especially near the optimum when elements of the gradient are close to zero.Using sparseHessianFD with finite-differenced gradients means that the Hessian is "doubly differenced," and the resulting lack of numerical precision renders those Hessians next to worthless.

Control parameters
The control argument takes a list of options, all of which are described in the package manual.Most of these arguments are related to the internal workings of the trust region algorithm, such as how close a step needs to be to the border of the trust region before the region expands.However, there are a few arguments that deserve some special attention.

Stopping criteria
The trust.optim function will stop when g / √ p < for a sufficiently small , where g is the gradient, p is the number of parameters, and the norm is Euclidean.The parameter is the prec parameter in the control list.It defaults to √ .Machine$double.eps, which is the square root of the computer's floating point precision.However, sometimes the algorithm cannot get the gradient to be that flat.When that occurs, the trust region will shrink, until its radius is less than the value of the cg.tol parameter.The algorithm will then stop with the message "Radius of trust region is less than stop.trust.radius."This event is not necessarily a problem if the norm of the gradient is still small enough that the gradient is flat for all practical purposes.For example, suppose we set prec to be 10 −7 and that, for numerical reasons, the norm of the gradient simply cannot get below 10 −6 .If the norm of the gradient were the only stopping criterion, the algorithm would continue to run, even though it has probably hit the local optimum.With the alternative stopping criterion, the algorithm will also stop when it is clear that the algorithm can no longer take a step that leads to an improvement in the objective function.
There is, of course, a third stopping criterion.The maxit is the maximum number of iterations the algorithm should run before stopping.However, keep in mind that if the algorithm stops at maxit, it is almost certainly not at a local optimum.
Note that many other nonlinear optimizers, including optim, do not use the norm of the gradient as a stopping criterion.Instead, apart from the SANN method, optim stops when the absolute or relative changes in the objective function are less that abstol or reltol, respectively.This often causes optim to stop prematurely, when the estimates of the gradient and/or Hessian are not precise, or if there are some regions of the domain where the objective function is nearly flat.In theory, this should never happen, but in reality, it happens all the time.For an unconstrained optimization problem, there is no reason why the norm of the gradient should not be zero (within numerical precision) before the algorithm stops.
The cg.tol parameter specifies the desired accuracy for each solution of the trust region subproblem.If it is set too high, there is a loss of accuracy at each step, but if set too low, the algorithm may take too long at each trust region iteration.In general, each TRS solution does not need to be particularly precise.Similarly, the trust.iterparameter controls the maximum number of conjugate gradient iterations for each attempted solution of the trust region subproblem.To minimize the loss of accuracy that occurs when the conjugate gradient step stops prematurely, this number should be set high.

Preconditioners
Currently, the package offers two preconditioners: an identity preconditioner (no preconditioning), and an inexact modified Cholesky preconditioner (Nocedal and Wright, 2006, Algorithm 7.3).The identity and diagonal preconditioners are available for all of the methods.For the Sparse method, the modified Cholesky preconditioner will use a positive definite matrix that is close to the potentially indefinite Hessian (trust.optimdoes not require that the objective function be positive definite).For BFGS, the Cholesky preconditioner is available because BFGS updates are always positive definite.If the user selects a Cholesky preconditioner for SR1, the algorithm will use the identity preconditioner instead.
There is no general rule for selecting preconditioners.There will be a tradeoff between the number of iterations needs to solve the problem and the time it takes to compute any particular preconditioner.In some cases, the identity preconditioner may even solve the problem in fewer iterations than a modified Cholesky preconditioner.

Example: Binary choice
In this section, we present two related examples that demonstrate that trustOptim performs better than many other R optimizers when the problem is large and the Hessian is sparse, but does not do as well for small problems with dense Hessians.We start with an example of the first case: a hierarchical binary choice model with heterogeneous coefficients.After that, we present an example of the second case, in which the coefficients are homogeneous.

Hierarchical binary choice
Suppose we have a dataset of N households, each with T opportunities to purchase a particular product.Let y i be the number of times household i purchases the product, out of the T purchase opportunities.Furthermore, let p i be the probability of purchase; p i is the same for all T opportunities, so we can treat y i as a binomial random variable.The purchase probability p i is heterogeneous, and depends on both k continuous covariates x i , and a heterogeneous coefficient vector β i , such that The coefficients are distributed across the population of households following a multivariate normal distribution with mean µ and covariance Σ.We assume that we know Σ, but we do not know µ.Instead, we place a multivariate normal prior on µ, with mean 0 and covariance Ω 0 , which is determined in advance.Thus, each β i , and µ are k−dimensional vectors, and the total number of unknown variables in the model is (N + 1)k.
The log posterior density, ignoring any normalization constants, is Since the β i are drawn iid from a multivariate normal, ∂ 2 log π ∂β i β j = 0 for all i = j.We also know that all of the β i are correlated with µ.Therefore, the Hessian will be sparse with a "block-arrow" structure.For example, if N = 6 and k = 2, then p = 14 and the Hessian will have the pattern as illustrated in Figure 1.An additional advantage of using new.sparse.hessian.obj is that when we pass additional arguments to the objective function here, they are stored in obj, and we do not need to include them again in the call to the optimizer.
Next, we run the optimizer.Definitions of the control parameters are described in detail in the package documentation.The control parameters to which a user might want to pay the most attention are those related to convergence of the main algorithm (stop.trust.radius,prec and maxit), verbosity of the reporting of the status of the algoritm (report.freq,report.leveland report.freq),and the selection of the preconditioner (0 for no preconditioner, and 1 for a modified Cholesky preconditioner).The output of the algorithm supplies the iteration number, the value of the objective function and norm of the gradient, whether the trust region is expanding, contracting, or staying the same size, and the current radius of the trust region.It will also report the number of iterations it took for the Steihaug algorithm to solve the trust region subproblem, and the reason the Steihaug algorithm stopped.In this example, for the first two iterations, the solution to the TRS was reached after only eight conjugate gradient steps, and this solution was at the boundary of the trust region.Since the improvement in the objective function was substantial, we expand the trust region and try again.By the third iteration, the trust region is sufficiently large that the TRS solution is found in the interior through subsequent conjugate gradient steps.Once the interior solution of the TRS is found, the trust region algorithm moves to the TRS solution, recomputes the gradient and Hessian of the objective function, and repeats until the first-order conditions of the objective function are met.
This problem has 2,008 parameters, and converged in less than two seconds.
The return value of the trust.optimfunction returns all of the important values, such as the solution to the problem, the value, gradient and Hessian of the objective function, the number of iterations, the final trust radius, the number of non-zeros in the Hessian, and the method used.
Next, we compare the performance of trust.optim to some alternative nonlinear optimizers in R. The methods are summarized in Table 1.The three methods from the nloptwrap package are "limited memory," in the sense that they do not compute or store a complete, exact Hessian (or inverse of it).The conjugate gradient method in the Rcgmin falls into this category as well.The only method that is called from the base R package is BFGS, which is identified as bfgs-optim in the subsequent text.I exclude the others because the conjugate gradient and L-BFGS methods are largely superseded by those in Rcgmin and nloptwrap, and because Nelder-Mead and SANN are of a completely different class of optimizer than the one considered in this paper.As described in the introduction, trust (Geyer, 2009) is another stable and robust implementation of a trust region optimizer, and we found that it works well for modestly-sized problems of no more than a few hundred parameters.Unlike trustOptim, it requires the user to provide a complete Hessian as a dense matrix, so it cannot exploit sparsity when that sparsity exists.It also uses eigenvalue decompositions to solve the TRS, as opposed to the Steihaug conjugate gradient approach.Finally, stopping criterion in for the algorithm in trust is based on the change in the value of the objective function, and not the norm of the gradient.
Naturally, there are many other optimization tools available for R users.These are described in the R Task View on Optimization and Mathematical Programming.
We compare the algorithms by simulating datasets from the hierarchical binary choice model, and using the optimization algorithms to find the mode of the log posterior density.The Rfunc and RfuncHess classes are defined in the files inst/Rfunc.h and inst/RfuncHess.h,respectively.These classes contain functions that return the value of the objective function, the gradient, and the Hessian.Rfunc is used for the quasi-Newton methods, RfuncHess is used for Sparse.Both classes contain references to Rcpp::Function objects that, in turn, are references to the R functions that compute the objective function and gradient.Thus, a call to the get f() function will return the result of a call to the corresponding R function.
The RfuncHess class returns the Hessian, as an Eigen sparse matrix, in a similar way.

Discussion
The motivation behind trustOptim was the practical difficulty in finding modes of posterior densities of hierarchical models.Existing optimization tools in the both the base R distribution and other contributed packages were either too cumbersome to use when there are a large number of parameters, too imprecise when encountering ridges, plateaus or saddle points in the objective function, or too lenient in determining when the optimization algorithm should stop.The product of the effort behind addressing these problems is a package that can be more robust, efficient and precise than existing options.This is not to say that trustOptim will outperform other nonlinear optimizers in all cases.But at least for hierarchical models, or other models with sparse Hessians, trustOptim is a useful tool in the statisticians toolbox.

Figure 1 :
Figure 1: Sparsity pattern for hierarchical binary choice example.

The
src/Rinterface.cppfile defines the C++ functions that collect data from R, passes them to the optimization routines, and return the results.There is one function for Sparse and another for the quasi-Newton methods SR1 and BFGS.Each function constructs an optimizer object of the class that is appropriate for that method.The class Trust CG Optimizer, for the quasi-Newton methods is defined in the file inst/include/CG-quasi.h,and the class Trust CG Sparse, is defined in the fileinst/CG-sparse.h.Both of these classes inherit from the Trust CG Base class, which is defined in inst/CG-base.h.All of the optimization is done by member functions in Trust CG Base; Trust CG Optimizer and Trust CG Sparse differ only in how they handle the Hessian and the preconditioners.

Table 2 :
There are six test conditions, determined by crossing the number of heterogeneous Convergence times and gradient norms for hierarchical binary choice example.See Table1for descriptions of the methods.Because of time and memory constraints, we did not run the trust method for the N = 5000 cases.

Table 3 :
Convergence times and gradient norms for binary choice example with homogeneous coefficients.See Table1for descriptions of the methods.The Rcgmin method is excluded because it would not converge reliably after multiple attempts.