BB : An R Package for Solving a Large System of Nonlinear Equations and for Optimizing a High-Dimensional Nonlinear Objective Function

We discuss the R package BB, in particular its capabilities for solving a nonlinear system of equations. The function BBsolve in BB can be used for this purpose. We demonstrate the utility of these functions for solving: (a) large systems of nonlinear equations, (b) smooth, nonlinear estimating equations in statistical modeling, and (c) non-smooth estimating equations arising in rank-based regression modeling of censored failure time data. The function BBoptim can be used to solve smooth, box-constrained optimization problems. A main strength of BB is that, due to its low memory and storage requirements, it is ideally suited for solving high-dimensional problems with thousands of variables.


Introduction
R (R Development Core Team 2009) package BB provides functions for solving large-scale nonlinear problems. BB (version 2009.6-1) comprises six functions, which are nested at three levels. At the bottom level are three functions: sane and dfsane for solving a nonlinear system of equations, and spg for optimizing a nonlinear objective function with box constraints. These functions, especially dfsane and spg, are the workhorses of BB. The functions BBsolve and BBoptim are at the next higher level. BBsolve is a wrapper for dfsane. It takes a single parameter vector as starting value and calls dfsane repeatedly with different algorithm control parameters to try to achieve successful convergence to the solution. Similarly, BBoptim, which is a wrapper for spg, takes a single parameter vector as starting value and calls spg repeatedly with different algorithm control parameters. At the top-most level is the function multiStart. This takes a matrix of parameters as multiple starting values and, depending on the value of the argument action specified by the user, calls either BBsolve or BBoptim for each starting value.
The main purposes of this article are: (1) to introduce BB to R users, (2) to present the background necessary for appropriate use of the algorithms, and (3) to demonstrate the utility of the algorithms by presenting results on a variety of test problems. In addition to BB, the R package nleqslv (Hasselman 2009) has recently been added to the Comprehensive R Archive Network (CRAN) at http://CRAN.R-project.org/. However, nleqslv uses Newton-type methods, and hence it may not be suitable for solving large systems of equations. We confine this article to the problem of finding a root of simultaneous nonlinear equations, and do not discuss spg for nonlinear optimization, since that problem can be addressed using existing R functions including optim, nlminb, and nlm. Other optimization packages are summarized in the CRAN task view (Zeileis 2005) on optimization at http://CRAN.R-project.org/view=Optimization. However, we point out that the optimization function spg in BB differs from the existing functions in that it is well suited to large-scale optimization, since it does not require the Hessian matrix of the objective function. It is based on the Barzilai-Borwein gradient method developed by Raydan (1997). Our R implementation (which is based on the Fortran code of Birgin, Martínez, and Raydan 2001) is competitive with the limited-memory BFGS algorithm (method = "L-BFGS-B") in optim for large-scale, box-constrained optimization, and is superior to the conjugate gradient method (method = "CG") in optim for large-scale, unconstrained optimization. Results presented here were obtained with BB version 2009.6-1. The most recent version of package BB is available from CRAN at http://CRAN.R-project.org/package=BB. A tutorial on BB is available and can be viewed from an R session by typing:

R> vignette("BB", package = "BB")

Also, a version of this paper, augmented with results from the system on which the vignette is compiled, can be viewed by typing:

R> vignette("BBvignetteJSS", package = "BB")

This paper was compiled with settings for the number of simulations and bootstrap samples as

R> nsim <- 1000
R> nboot <- 500

but, in order to maintain a reasonable build time for the package, these values are set very much smaller in the vignette (nsim = 10, nboot = 50).

Solving a nonlinear system of equations
We are interested in solving the nonlinear system of equations

F(x) = 0,    (1)

where F : R^p -> R^p is a nonlinear function with continuous partial derivatives. We are interested in situations where p is large, and where the Jacobian of F is either unavailable or requires a prohibitive amount of storage, although the algorithms in BB are also applicable when p is small. The best known methods for solving (1) are Newton's method and the quasi-Newton methods (Ortega and Rheinboldt 1970; Dennis and Schnabel 1983). Newton's method employs a working linear approximation to F(x) around an estimate of the solution, and improves it in an iterative manner:

x_{k+1} = x_k - J(x_k)^{-1} F(x_k),

where J(x_k) is the Jacobian of F evaluated at x_k. Quasi-Newton methods use an approximation B_k of the Jacobian, which, along with the solution vector, is updated at each iteration. For example, the classical Broyden ("good") method is given by the equations:

x_{k+1} = x_k - B_k^{-1} F(x_k),
B_{k+1} = B_k + ((y_k - B_k s_k) s_k^T) / (s_k^T s_k),

where s_k = x_{k+1} - x_k, y_k = F(x_{k+1}) - F(x_k), and B_0 is usually the identity matrix. These methods are attractive because they converge rapidly from any good starting value. However, they need to solve a linear system of equations using the Jacobian, or an approximation of it, at each iteration, which can be prohibitively expensive for large-p problems.
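To make the per-iteration cost concrete, Broyden's "good" method can be sketched in a few lines of R. This is an illustrative sketch, not code from BB; the function and variable names are ours.

```r
## Minimal sketch of Broyden's "good" method for F(x) = 0.
broyden <- function(F, x0, tol = 1e-8, maxit = 100) {
  x <- x0
  B <- diag(length(x0))          # B_0 = identity
  Fx <- F(x)
  for (k in seq_len(maxit)) {
    if (sqrt(sum(Fx^2)) < tol) break
    s <- -solve(B, Fx)           # O(p^3) linear solve at every iteration
    x <- x + s
    Fnew <- F(x)
    y <- Fnew - Fx
    ## Rank-one secant update: B <- B + (y - B s) s' / (s's)
    B <- B + tcrossprod(y - B %*% s, s) / sum(s * s)
    Fx <- Fnew
  }
  list(par = x, residual = sqrt(sum(Fx^2)), iter = k)
}
```

The solve() call costs O(p^3) operations and the p x p matrix B must be stored, which is exactly what makes such methods impractical for the large-p problems that BB targets.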
An indirect approach to solving Equation 1 is to transform it into a standard optimization problem by defining a merit function φ(u), where φ : R^p -> R is a functional with a unique global minimum at u = 0. Now, any solution of F(x) = 0 is also a minimizer of φ(F(x)), but the converse does not always hold. A sufficient condition for the converse to hold is that the Jacobian of F be non-singular at the minimizer of φ(u) (Ortega and Rheinboldt 1970). A commonly used merit function is the L2-norm of F, φ(F(x)) = ||F(x)||, which is also known as the "residual". This approach generally does not work well in practice, and hence is little used as a stand-alone method for solving nonlinear systems. However, as discussed later, we shall use this approach for generating good starting values for the spectral algorithms.

A different approach is taken by the spectral algorithms SANE (La Cruz and Raydan 2003) and DF-SANE (La Cruz, Martínez, and Raydan 2006). These methods are an extension of the Barzilai-Borwein method for finding a local minimum (Barzilai and Borwein 1988; Raydan 1997). They use ±F(x) as search directions in a systematic way, with one of the spectral coefficients as steplength, and a non-monotone line search technique for global convergence. This provides a robust scheme for solving nonlinear systems. The simplicity of the search direction and steplength results in a low cost per iteration.

Spectral method for nonlinear systems
The spectral approach for nonlinear systems is defined by the following iteration:

x_{k+1} = x_k + α_k d_k,    (2)

where α_k is the spectral steplength, and d_k = ±F(x_k) is the search direction, which is defined as follows.
For the SANE algorithm, the sign associated with F(x_k) is that which yields a descent direction with respect to the merit function ||F(x_k)||^2. The only spectral steplength considered in La Cruz and Raydan (2003) and La Cruz et al. (2006) is:

α_k = (s_{k-1}^T s_{k-1}) / (s_{k-1}^T y_{k-1}),    (3)

where s_{k-1} = x_k - x_{k-1} and y_{k-1} = F(x_k) - F(x_{k-1}). Below, and in the tables, we denote the SANE and DF-SANE algorithms that use this steplength as sane-1 and dfsane-1, respectively. In addition to Equation 3, Barzilai and Borwein (1988) proposed a second spectral steplength:

α_k = (s_{k-1}^T y_{k-1}) / (y_{k-1}^T y_{k-1}).    (4)

We denote the SANE and DF-SANE algorithms that use this steplength as sane-2 and dfsane-2, respectively. We also consider a third spectral steplength, first proposed in Varadhan and Roland (2008) for the acceleration of EM algorithms:

α_k = sgn(s_{k-1}^T y_{k-1}) ||s_{k-1}|| / ||y_{k-1}||,    (5)

where sgn(x) = x/|x| when x ≠ 0, and is zero when x = 0. For all three steplengths, we define α_0 = min(1, 1/||F(x_0)||). The general effectiveness of spectral steplengths is due to the fact that they can be viewed as a Rayleigh quotient with respect to a secant approximation of the Jacobian. The scalar |α_k| is closely related to the condition number of the Jacobian J_k (Fletcher 2001).
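The three steplengths can be written compactly in R. This is an illustrative sketch (the function name is ours); s and y denote s_{k-1} and y_{k-1} from the text.

```r
## Spectral steplengths, Equations 3-5, given
## s = x_k - x_{k-1} and y = F(x_k) - F(x_{k-1}).
spectral.steplength <- function(s, y, method = 2) {
  switch(method,
    sum(s * s) / sum(s * y),                           # Equation 3
    sum(s * y) / sum(y * y),                           # Equation 4
    sign(sum(s * y)) * sqrt(sum(s * s) / sum(y * y)))  # Equation 5
}
```

In BB, the choice among these is made through the method argument of sane and dfsane, as described below.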

Globalization using non-monotone line search
To achieve global convergence, the spectral iterative scheme (2) must be combined with a suitable line search technique. For SANE, La Cruz and Raydan (2003) consider a non-monotone line search technique (Grippo, Lampariello, and Lucidi 1986), which can be written as

f(x_k + α_k d_k) ≤ max_{0 ≤ j ≤ min(k, M)} f(x_{k-j}) + γ α_k ∇f(x_k)^T d_k,    (6)

where the merit function f(x) = F(x)^T F(x), and γ is a small positive number (we choose γ = 10^{-4}). In the above condition, denoted here as the GLL condition, M is a positive integer that plays an important role in dictating the allowable degree of non-monotonicity in the value of the merit function, with M = 0 yielding a strictly monotone scheme. As pointed out by Fletcher (2001), the Barzilai-Borwein schemes perform poorly when strict monotonicity is imposed, especially in ill-conditioned problems. They perform better when some amount of non-monotonicity is allowed; hence globalization uses the GLL condition, with values of M between 5 and 20. The term ∇f(x_k)^T d_k = 2 F(x_k)^T J_k d_k, where J_k is the Jacobian of F at x_k. This can be evaluated without computing the Jacobian as follows:

∇f(x_k)^T d_k ≈ 2 F(x_k)^T [F(x_k + h d_k) - F(x_k)] / h,

where h = 10^{-7}.
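The derivative-free approximation of the directional derivative can be sketched as follows. This is an illustrative helper (names are ours), not the internal BB code.

```r
## Approximate grad f(x)' d for the merit function f(x) = F(x)' F(x),
## using grad f(x)' d = 2 F(x)' J d and a forward difference for J d.
dir.deriv <- function(F, x, d, h = 1e-7) {
  Fx <- F(x)
  2 * sum(Fx * (F(x + h * d) - Fx) / h)
}
```

Note that this costs one extra evaluation of F per line search step, which is the cost that the DF-SANE condition below avoids.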
For DF-SANE (which stands for "derivative-free SANE"), La Cruz et al. (2006) propose a new and different globalization line search technique:

f(x_k + α_k d_k) ≤ max_{0 ≤ j ≤ min(k, M)} f(x_{k-j}) + η_k - γ α_k^2 f(x_k),    (7)

where γ = 10^{-4}, and η_k > 0 decreases with k such that Σ_{k=0}^∞ η_k = η < ∞. Note that this strategy does not involve any Jacobian computations; hence the phrase "derivative-free". This strategy maintains the non-monotonicity of GLL, while avoiding the quadratic product involving the Jacobian, which would entail an additional function evaluation at each iteration. Consequently, DF-SANE is generally more economical than SANE in terms of the number of evaluations of F. The presence of η_k > 0 ensures that all the iterations are well defined, and the forcing term -γ α_k^2 f(x_k) provides the theoretical condition sufficient for establishing global convergence (La Cruz et al. 2006).

Implementations of SANE and DF-SANE in BB
For detailed algorithmic implementation of the iterations and non-monotone line searches for SANE and DF-SANE, the reader is directed to La Cruz and Raydan (2003) and La Cruz et al. (2006), respectively. Also see the documentation for the R functions sane and dfsane in BB for more details. Here we only discuss the salient features of our R implementation of the SANE and DF-SANE algorithms in the package BB that differ from the original Fortran codes (which can be obtained from Raydan 2009). These are:

1. We provide an option for three spectral steplengths, Equations 3, 4 and 5. The method argument in the sane and dfsane functions can be used to select among these steplengths. The original SANE and DF-SANE algorithms only allowed one steplength, Equation 3, which can be selected with method = 1. We have set method = 2, which corresponds to Equation 4, as the default. In our numerical experiments, this generally outperformed the other two methods (see Table 1, discussed in the next section, for results).

2. We re-scale the initial steplength as α_0 = min(1, 1/||F(x_0)||), whereas in the original implementation α_0 = 1.
3. We provide an option for improving on starting values when the user is unable to generate good starting values. We do this by calling the Nelder-Mead nonlinear simplex algorithm (Nelder and Mead 1965), as implemented in the R function optim, with the merit function f(x) as the objective function.

4. We provide an option for improving upon convergence when sane or dfsane terminates unsuccessfully in some particular manner, i.e., when convergence = 4 or 5 for sane, and when convergence = 2 or 5 for dfsane. We do this by calling the limited-memory BFGS algorithm (method = "L-BFGS-B") in optim with the merit function f(x) as the objective function.

5. When we are close to the solution, i.e., when f(x_k) < 10^{-4}, we use the dynamic retard strategy proposed in Luengo and Raydan (2003): i.e., we use the spectral steplength from two iterations before the current one. This retarded spectral scheme was never worse than the unretarded spectral method (Equation 2) in our experiments, and in many cases it actually exhibited faster convergence (results not shown).

6. We implement an additional stopping criterion in our R functions. The iterations are terminated when there is no decrease in the merit function f(x) over noimp iterations, where we choose a default value of noimp = 100. This is particularly essential when a large M, say M ≥ 100, is used.
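A minimal call to dfsane looks like the following. The three-equation system here is purely illustrative (it is not one of the test problems in this paper), and the starting value is arbitrary.

```r
library(BB)

## Illustrative 3-equation system F(x) = 0
expo <- function(x) {
  c(exp(x[1]) - 1,               # forces x1 = 0
    x[2]^2 + x[1] - 2,           # then x2 = +/- sqrt(2)
    x[3] - cos(x[3]) - x[1])     # then x3 ~ 0.739
}

set.seed(123)
ans <- dfsane(par = rnorm(3), fn = expo, method = 2,
              control = list(M = 10, trace = FALSE))
ans$par          # approximate root
ans$convergence  # 0 indicates successful convergence
```

The method and control arguments correspond to the implementation features described above (spectral steplength, non-monotonicity parameter M, and so on).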

What to do when the algorithm fails? -Function BBsolve
Algorithm dfsane (or sane) is said to have failed when a non-zero convergence type is obtained, i.e., when convergence > 0. In this case, we have found that the following sequential strategy generally works quite well:

1. Try a different non-monotonicity parameter M. Since the default is M = 10, try M = 50.

2. Try a different method. Since the default is method = 2, try methods 1 and 3.

3. Try Nelder-Mead initialization. Since the default is NM = FALSE, the user should try NM = TRUE.
We have written an R wrapper function called BBsolve to automatically implement this strategy. We have found this function to be successful in problems where dfsane and sane had failed to converge. Here we give a simple example to illustrate this using the Freudenstein-Roth function.
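The example can be coded as follows. This is reconstructed along the lines of the BB vignette; the starting value is illustrative.

```r
library(BB)

## Freudenstein-Roth test function
froth <- function(p) {
  r <- rep(NA, length(p))
  r[1] <- -13 + p[1] + (p[2] * (5 - p[2]) - 2) * p[2]
  r[2] <- -29 + p[1] + (p[2] * (1 + p[2]) - 14) * p[2]
  r
}

p0 <- rep(0, 2)
dfsane(par = p0, fn = froth)$convergence   # non-zero: failure
sane(par = p0, fn = froth)$convergence     # non-zero: failure
BBsolve(par = p0, fn = froth)$par          # should converge to the root (5, 4)
```

Checking by hand, (5, 4) is indeed a root: both components of froth(c(5, 4)) are zero.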
Functions dfsane and sane fail to converge, while BBsolve converges successfully. A similar wrapper function called BBoptim can be used to solve optimization problems when spg fails to converge.

Standard test problems
We have tested our algorithms extensively on a number of nonlinear systems considered in La Cruz and Raydan (2003), La Cruz et al. (2006), and Luksan and Vlcek (2003). Here we report the results for six problems, whose statements are given in Appendix A. We tested four methods, sane and dfsane, each with the two steplengths of Equations 3 and 4, on 1000 randomly generated initial values for each problem, which are also provided in Appendix A. This approach of using random starting values is uncommon in the numerical analysis literature when testing new methods and when comparing methods; rather, a single, reasonably good starting value is used for each test problem. Hence, our tests are much more stringent than those commonly seen in the numerical analysis literature (e.g., La Cruz and Raydan 2003; La Cruz et al. 2006). Therefore, it should not come as a surprise that there is a substantial number of convergence failures in some problems. We used ||F(x_n)|| / √p ≤ 10^{-7}, where p is the dimensionality of the problem, as the stopping criterion. We have successful convergence (i.e., convergence = 0) when this criterion is satisfied. The algorithm (sane or dfsane) is said to have failed when convergence > 0.
We chose p = 500 for all six test problems. Unless otherwise stated explicitly, we used the default settings for all the control parameters of dfsane and sane. The numerical experiments presented here were performed using R version 2.9.1 running on a Microsoft Windows Vista operating system, with a 2.2 GHz Intel dual-core Pentium processor and 4 GB of RAM. The results are presented in Table 1.
In order to reproduce the random numbers used in this paper, the seed and RNG types are set to known values:

R> require("setRNG")
R> test.rng <- list(kind = "Mersenne-Twister", normal.kind = "Inversion",
+    seed = 1234)
R> old.seed <- setRNG(test.rng)

Iterative numerical procedures can be sensitive to system math libraries and even to hardware floating-point computation, since a very small difference in a search step will result in a slightly different path to the solution. This can result in a different number of iterations and/or a slightly different answer. The difference may be especially aggravated in problems where the objective function is nearly "flat" near the solution. We have run the examples here with different versions of R and on different hardware and operating systems, and the results are relatively similar, but users replicating the results may see small differences.
We define a "failure" as the inability of an algorithm to achieve the default tolerance of 1.e−07 under default values for all the control parameters.It might be possible that a different control setting enables successful convergence.In fact, this is one of the main motivations for creating the BBsolve function that can automatically try different control settings to achieve successful convergence.
Algorithms sane-2 and dfsane-2 (using steplength Equation 4) performed better than sane-1 and dfsane-1 (using steplength Equation 3), except for the extended Rosenbrock function. dfsane-2 was the best method overall. We re-ran the tests with BBsolve for the two problems, exponential function 3 and extended Rosenbrock, where even the best-performing method had a substantial number of convergence failures. BBsolve converged successfully in the extended Rosenbrock problem for all 1000 starting values, and had only one failure in the exponential function 3 problem. This demonstrates that BBsolve is a reliable tool for solving a nonlinear system of equations.

Finding multiple roots or multiple local optima -Function multiStart
It is not uncommon for a nonlinear system of equations to have multiple roots, or for a nonlinear objective function to have multiple local minima (or maxima). In this case, it may be of interest to identify as many solutions as possible, if not all of them. To this end, we have provided a function called multiStart, which can accept a matrix of parameter values, where each row of the matrix is a starting value. The user needs to define this matrix and pass it as an input to multiStart. Two widely used approaches for generating starting values are: (1) random sampling according to some probability distribution, and (2) a regular grid search. The function has an argument called action, which indicates whether the user wants to solve a nonlinear system of equations or to optimize a nonlinear objective function. For each starting value, multiStart calls either BBsolve or BBoptim.
We now illustrate how to use multiStart to find multiple roots. We consider a system of high-degree polynomial equations (Kearfott 1987), comprising 3 equations in 3 variables. It has 12 real roots and 126 complex roots. Here we find all 12 real roots.

We generate 300 random starting values, each a vector of length 3. The system is then solved 300 times, and the unique solutions are picked out.
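A sketch of this computation follows. The polynomial system is reconstructed following Kearfott (1987) as used in the BB vignette; the component names of the multiStart return value (par, converged) follow the BB documentation.

```r
library(BB)

## High-degree polynomial system with 12 real roots (Kearfott 1987)
hdp <- function(x) {
  r <- rep(NA, length(x))
  r[1] <- 5 * x[1]^9 - 6 * x[1]^5 * x[2]^2 + x[1] * x[2]^4 + 2 * x[1] * x[3]
  r[2] <- -2 * x[1]^6 * x[2] + 2 * x[1]^2 * x[2]^3 + 2 * x[2] * x[3]
  r[3] <- x[1]^2 + x[2]^2 - 0.265625
  r
}

set.seed(123)
p0 <- matrix(runif(900), 300, 3)   # 300 starting values, each of length 3
ans <- multiStart(par = p0, fn = hdp, action = "solve")
sol <- ans$par[ans$converged, ]    # keep the successful runs
unique(round(sol, 4))              # the distinct real roots found
```

Rounding before calling unique() collapses solutions that differ only by numerical noise.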

Solving nonlinear estimating equations in statistics
Nonlinear systems of equations arise commonly in statistics. In some cases, there is a naturally associated scalar function of the parameters, which can be optimized to obtain parameter estimates. For example, maximum likelihood estimates can be obtained by solving the score equations, even though in general it is better to obtain parameter estimates by directly maximizing the log-likelihood. In other cases, there may not be a natural scalar function associated with the nonlinear system, and we need to solve the system of equations to obtain parameter estimates. This includes a broad class of statistical estimation problems under the heading of estimating functions or estimating equations, where a probability distribution for the data-generating process is not explicitly postulated, but only weaker conditions, such as unbiasedness and information unbiasedness, are imposed on the estimating function (Small and Wang 2003). Well known examples are: generalized least squares (Carroll and Ruppert 1988), generalized estimating equations (Diggle, Heagerty, Liang, and Zeger 2002), and semi-parametric accelerated failure time models in survival analysis (Kalbfleisch and Prentice 2002). Here we consider two examples with simulated data and one with real data. Our goal is to show the utility of BB for solving nonlinear estimating equations.

Poisson regression with offset
Poisson regression is commonly used to model data in the form of counts, i.e., the number of times a particular event occurred over some known period of time. We consider data of the form (Y_i, t_i, X_i), i = 1, ..., n, where the Y_i are counts over an observation period t_i, and the X_i are the corresponding covariates. The estimating equations for Poisson regression of count data, with offset, can be written as:

Σ_{i=1}^n X_i (Y_i - t_i exp(X_i^T β)) = 0.    (8)

We consider a simulation problem with n = 500 and p = 8. We set β = (-5, 0.04, 0.3, 0.05, 0.3, -0.005, 0.1, -0.4), and generate data from a Poisson distribution:

Y_i ~ Poisson(λ_i),  λ_i = t_i exp(X_i^T β),

where t_i ~ N(µ = 100, σ = 30), and the covariates X_i are generated according to the following R code. This problem can be readily solved using the glm function in R, by specifying the offset option. However, we show that it can also be solved directly via the estimating equations (8), which are nothing but the score equations of the Poisson likelihood.
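The estimating equations (8) can be coded and handed to dfsane as follows. This is a sketch assuming a design matrix X whose first column is the intercept; Y, t and X stand for the simulated data described above.

```r
library(BB)

## U(beta) = sum_i X_i (Y_i - t_i exp(X_i' beta))  -- Equation 8
poisson.eqn <- function(beta, Y, t, X) {
  mu <- t * exp(drop(X %*% beta))
  drop(crossprod(X, Y - mu))     # X' (Y - mu)
}

## Solve U(beta) = 0 and compare with glm (requires Y, t, X in scope):
## ans <- dfsane(par = rep(0, ncol(X)), fn = poisson.eqn, Y = Y, t = t, X = X)
## fit <- glm(Y ~ X - 1 + offset(log(t)), family = poisson)
## cbind(dfsane = ans$par, glm = coef(fit))
```

Extra arguments such as Y, t and X are passed through dfsane's ... mechanism to the estimating function.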
Parameter estimates from dfsane are identical to those from glm.

Rank-based regression using accelerated failure time model
The accelerated failure time (AFT) model is a useful alternative to the popular Cox relative risk model for the analysis of failure time data subject to censoring. The AFT model relates the logarithm of the failure time to a linear function of the covariates, and hence the model has a direct physical interpretation in terms of the failure time. Let T_i be the failure time, and X_i ∈ R^p be the corresponding covariates for the ith individual (i = 1, ..., n). The semi-parametric AFT model may be written as:

log T_i = X_i^T β + ε_i,

where β ∈ R^p is a vector of regression parameters to be estimated from the data, and the ε_i are independent errors with a common, but unspecified, probability distribution. Let C_i be the censoring time for the ith individual. It is usually assumed that C_i is independent of T_i given X_i. We observe T*_i = min(T_i, C_i) and δ_i = I(T_i ≤ C_i), where I(.) is the usual indicator function. The data then comprise (T*_i, δ_i, X_i), i = 1, ..., n. The regression parameters β are estimated by solving the weighted log-rank estimating function (Jin, Lin, Wei, and Ying 2003):

U(β) = Σ_{i=1}^n δ_i φ_i(β) [ X_i - Σ_j X_j I(e_j(β) ≥ e_i(β)) / Σ_j I(e_j(β) ≥ e_i(β)) ] = 0,    (9)

where e_i(β) = log T*_i - X_i^T β, and φ_i is a possibly data-dependent weight function. The choice of φ_i = 1 yields the log-rank estimator, and φ_i(β) = n^{-1} Σ_j I(e_j(β) ≥ e_i(β)) yields the Gehan estimator. In spite of the theoretical advances, semiparametric methods for the AFT model have seldom been used in practice, mainly because of the lack of efficient and reliable computational methods (Jin et al. 2003). One main difficulty is that the system of semiparametric estimating functions (9) involves step functions of the regression parameters. Therefore, conventional numerical techniques, which depend essentially on the smoothness of the functions, cannot be used. Lin and Geyer (1992) proposed simulated annealing, but their algorithm is not guaranteed to find the true minimum. Jin et al. (2003) proposed an iterative estimator that converts the solution of (9) into a sequence of minimization problems, which can be solved using linear programming techniques. Here we take a more direct approach by solving (9) directly using the DF-SANE algorithm, which does not involve any derivatives.
We first consider a simulation problem with n = 1000 and p = 8. We randomly generated a 1000 × 8 matrix of binary and continuous covariates (see the code below for details of the simulation). We set β = (0.5, -0.4, 0.3, -0.2, -0.1, 0.4, 0.1, -0.6). We generated independent errors ε_i from a log-normal distribution with mean 1 and variance 1. Censoring times C_i were generated from a uniform distribution so as to obtain close to 20% censoring. We ran 1000 simulations, with a fixed covariate matrix X, but generating new T* and δ in each simulation. For each simulated data set, we used the same starting value β_0 = rep(0, 8) in dfsane to find a root of (9). The function aft.eqn computes (9).
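A function in the spirit of aft.eqn, for the Gehan weight, might look like the following. This is an illustrative sketch, not the exact code used to produce the tables; logT stands for log T*, delta for the censoring indicator, and X for the n × p covariate matrix.

```r
## Gehan-weighted log-rank estimating function, Equation 9; with the Gehan
## weight, U(beta) = n^{-1} sum_i delta_i sum_j (X_i - X_j) I(e_j >= e_i).
gehan.eqn <- function(beta, logT, delta, X) {
  e <- logT - drop(X %*% beta)              # residuals e_i(beta)
  n <- length(e)
  U <- numeric(ncol(X))
  for (i in which(delta == 1)) {
    R <- which(e >= e[i])                   # risk set at e_i(beta)
    U <- U + length(R) * X[i, ] - colSums(X[R, , drop = FALSE])
  }
  U / n
}

## Root-finding with dfsane, using M = 100 as in the text:
## ans <- dfsane(par = rep(0, ncol(X)), fn = gehan.eqn,
##               logT = logT, delta = delta, X = X,
##               control = list(M = 100))
```

Because (9) is a step function of β, an exact zero of U(β) may not exist; dfsane is then used to drive the residual norm as close to zero as the data permit.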

We conducted another test of the ability of dfsane to solve the semi-parametric AFT equations (9) on a real data set that has been widely used in survival analysis: the Mayo Clinic primary biliary cirrhosis (PBC) data (see the appendix of Dickson, Grambsch, Fleming, Fisher, and Langworthy 1989). A corrected version of these data is available at the Mayo Clinic website (http://mayoresearch.mayo.edu/mayo/research/biostat/upload/therneau_upload/pbc.dat) and in the R package survival (Therneau and Lumley 2009). We computed the regression coefficients for an AFT model with 5 covariates, age, log(albumin), log(bilirubin), edema, and log(protime), with log-rank and Gehan weights. We also estimated their standard errors using 500 bootstrap samples. Results are provided in Table 3. We also estimated the semiparametric AFT model using the algorithm of Jin et al. (2003). (The R code was obtained from Dr. Jin's website http://www.columbia.edu/~zj7/index_files/Page382.htm.) Comparing our results with theirs (see Table 3), we observe some differences in both the point estimates and standard errors. The point estimates for the Gehan estimator agree reasonably well. For the log-rank estimator, the point estimates of log(albumin) and log(protime) are considerably smaller (in absolute magnitude) for dfsane than those obtained using the method of Jin et al. (2003). The residual norm from dfsane is 2 to 4 times smaller than that of Jin et al. (2003), indicating that our solutions to (9) are better than theirs. More accurate solutions for the log-rank estimator can be obtained from Jin's algorithm by using a larger number of iterations. For example, when we used 6 iterations (the default is 3), the residual error was almost as small as that from dfsane, and the point estimates were in better agreement. Another noteworthy difference, especially for the log-rank estimator, is that our bootstrapped standard error estimates are higher than those of Jin et al. (2003), which were obtained using a perturbed estimating equation approach.
We also note that our AFT model estimation using dfsane is substantially faster than the algorithm proposed in Jin et al. (2003). For example, for the PBC data, the total CPU time for the Gehan and log-rank estimates using dfsane is around 6.5 seconds, whereas it is around 99 seconds for Jin's algorithm (with 3 iterations). For standard error estimation with 500 Monte Carlo samples, the dfsane algorithm took 1 hour, while Jin's algorithm took 6 hours. A major limitation of Jin's R function is that it can only handle small data sets. It runs into memory limits for even moderate-size data sets; for example, it crashed when we tried it on one of the simulated data sets discussed previously, with n = 1000 and 8 covariates.
It should also be noted that some problems are intrinsically hard and cannot be solved to within a small error tolerance (e.g., the default tolerance of 1.0e-07). The AFT model problem is an example of this. It is a non-smooth problem, and we cannot always achieve a tolerance of 1.0e-07 in such problems. With the PBC data, there may not even be an "exact" solution that yields a residual of 1.0e-07. However, we can obtain a solution that is accurate enough. It might be possible to improve upon the solution given by dfsane by changing the control parameters (e.g., M, noimp, maxit) or by using BBsolve, but it may not be worth the added effort for this problem.

Conclusions
The package BB provides functions which improve the capabilities of R for solving nonlinear systems of equations and for optimizing smooth, nonlinear functions in the following ways:

1. The function BBsolve offers a reliable, low-cost method for solving large-scale nonlinear systems of equations.

2. The function BBoptim offers a reliable, low-cost method for optimizing smooth, large-scale nonlinear problems.
3. The function multiStart can be used to find multiple roots or multiple local optima.
4. dfsane appears to be promising for solving non-smooth estimating equations, since it does not involve any derivatives (see the line search condition (7)).
5. Rank-based regression estimation in the accelerated failure time models can be performed effectively in R using the dfsane function in BB.

Table 1 :
Results of numerical experiments for the 6 standard test problems. 1000 randomly generated starting values were used for each problem. Means and inter-quartile ranges (in parentheses) are shown. Default control parameters were used in all the algorithms.

Table 2 :
Simulation results for rank-based regression in the accelerated failure time model (1000 simulations). Estimates were obtained using the dfsane algorithm with M = 100.

Table 3 :
Rank-based regression of the accelerated failure time (AFT) model for the primary biliary cirrhosis (PBC) data set. Point estimates and standard errors (in parentheses) are provided. Standard errors for dfsane were obtained from 500 bootstrap samples.