Long range search for maximum likelihood in exponential families

: Exponential families are often used to model data sets with complex dependence. Maximum likelihood estimators (MLE) can be diﬃcult to estimate when the likelihood is expensive to compute. Markov chain Monte Carlo (MCMC) methods based on the MCMC-MLE algorithm in [17] are guaranteed to converge in theory under certain conditions when starting from any value, but in practice such an algorithm may labor to converge when given a poor starting value. We present a simple line search algorithm to ﬁnd the MLE of a regular exponential family when the MLE exists and is unique. The algorithm can be started from any initial value and avoids the trial and error experimentation associated with calibrating algorithms like stochastic approximation. Unlike many optimization algo- rithms, this approach utilizes ﬁrst derivative information only, evaluating neither the likelihood function itself nor derivatives of higher order than ﬁrst. We show convergence of the algorithm for the case where the gradient can be calculated exactly. When it cannot, it has a particularly convenient form that is easily estimable with MCMC, making the algorithm still useful to a practitioner.


Introduction
Exponential families are commonly used to model phenomena with dependence structure, where the outcomes of the response variable of interest are in fact dependent on one another. For example, the Ising model [24,37] is an exponential family model that has been used to model ferromagnetism and other spatial lattice processes [10]. A realized sample from this model is depicted in Figure 1, where neighboring pixels are more likely to have the same color. We explore this model further in Section 5.2. Other examples of phenomena with dependence structure modeled with exponential families include plant ecology [3,4], friendship networks [18,19,51], protein-protein interaction networks [43], and the lifetime fitness of plants [44]. The appeal of exponential families in these settings stems from their simplicity and maximum entropy property [17,25]. By choosing statistics of interest on the data, one fully specifies a model that gives the most reasonable inference possible derived solely from those statistics. Furthermore, exponential families have been well-studied [2,5] and utilized over the decades and have desirable properties such as a strictly concave likelihood function.

Parameter estimation methods in exponential families
Calculating the maximum likelihood estimators (MLE) for exponential families when dependence is complex, however, remains a challenging problem because the likelihood function may be computationally infeasible. In particular, the form of the likelihood is determined by the chosen statistics up to a normalizing constant, but this normalizing constant may involve a summation over an astronomical number of terms. Evaluating the likelihood function-let alone maximizing it-presents a significant challenge.
Three commonly used parameter estimation methods to circumvent this issue in exponential families are the pseudo-likelihood approach [4,35,46], which finds parameter values that maximize the pseudo-likelihood function, the Markov chain Monte Carlo maximum likelihood estimate (MCMC-MLE) approach [11,17], which uses MCMC to approximate the log likelihood so that it can subsequently be maximized, and stochastic approximation (SA) [7,27], which utilizes simple iterated updates of parameter estimates. The pseudolikelihood approach is computationally expedient, but has been shown to produce unreliable results when dependence is strong [17,49].
The MCMC-MLE approach is theoretically guaranteed to converge to the MLE if it exists and is the default algorithm in software packages such as statnet [21] in the R platform for network models. However, this approach has been shown in practice to be sensitive to initial parameter values when used without the trust region methodology recommended in [17], and the algorithm may require many iterations and enormous (sometimes infeasibly large) Monte Carlo sample sizes when the starting value is far from the MLE [23]. Improvement to the MCMC-MLE approach is an active area of research [22].
Variations on the Robbins-Monro stochastic approximation algorithm [39] have been applied to find the MLE similar contexts: [20,31,52,53] applied MCMC stochastic approximation to spatial models and [45] to social network models (exponential random graph models). SA procedures for finding the MLE of a parameter η generate iterated estimates η k to find the root of a gradient function h(η): where α k is a step size and is typically a member of a decreasing sequence of positive numbers, and U k is a random variable from the distribution specified by η k that noisily estimates the gradient function h(η k ).
Restrictive conditions are required of α k and U k to establish convergence of the sequence η k . In Robbins-Monro SA [39], the step size α k must be a sequence of positive constants that satisfies for which the choice of is commonly used, where A and B are constants that must be specified by the user. This specification requires experimentation and care: there can be significant variation in performance depending on choice of these constants. More recent research [27,Chapter 11] show that α k sequences that go to 0 slower than 1/k can result in an improved rate of convergence, where rate of convergence is measured by the asymptotic covariance of the normalized estimates about their limit point. The conditions on U k are more restrictive. Popular approaches include constraining the sequence of estimators η k to a compact set specified a priori, or assuming that the noise component of U k be a martingale difference sequence. As commonly observed [1,7,30], these may be difficult to satisfy in practice. See [1,30] for recent developments that impose less restrictive conditions using truncated updates.
An issue for any recursive search algorithm is the choice of starting point. It is often the case that algorithms are good at finding the MLE when the starting point is close to it, but of course the location of the MLE is unknown. Methods which rely on the Fisher information matrix may fail when the starting point for η is far from the MLE [20,53]; for any exponential family with bounded support, Fisher information becomes singular as the natural parameter η goes to ∞ [38]. Of course, one may try different starting points until a "good" one is found, but this can be cumbersome in practice and demands patience and sophistication of the practitioner.

Algorithm overview
In this article, we propose a simple and practical line search algorithm that converges to the MLE of any regular exponential family when the MLE exists and the first derivative of the log likelihood can be calculated exactly. When it cannot, the first derivative has a particularly convenient form that is easily estimable with MCMC, making the algorithm still useful in application. The first derivative with respect to the canonical parameter vector η has the form . . , Y m are simulated data sets having parameter vector η. The log likelihood itself is much harder to compute [17]. The second derivative with respect to the canonical parameter vector has the form − Var η g(Y ), and Monte Carlo estimate minus the empirical variance of the g(Y i ). The second derivative is less stably estimated than the first derivative, especially when η is far from the MLE so this matrix is nearly singular. We also show how to construct and apply confidence intervals in such a setting to increase the probability of convergence.
The appeal of this algorithm is its ease of use: no trial and error is needed. The computer can find the solution with no help from the user, thus making it suitable for use by naive users. Experimentation with multiple starting points or tuning parameters is not necessary and no unrealistic a priori information about the problem need be specified. It is currently used in the aster2 contributed R package [14] as the safeguard for steepest ascent and Newton-Raphson iterations in finding the MLE for aster models.
Our algorithm generates iterated estimates η k of the MLEη with the update where α k is a step size and p k is a search direction and is restricted to be an ascent direction of the log likelihood. Despite the visual similarity between (1) and (3), our line search approach treats the search direction p k in (3) as constant in the inner loop of our algorithm whereas in SA the corresponding U k in (1) is random. Furthermore, line search algorithms have more restrictions on the step size α k . The step size conditions in the classical gradient ascent algorithm, which is the basis for our algorithm, force a sufficiently large increase in the objective function at every step, guaranteeing convergence to a local maximum (which is the global maximum in an exponential family because of strict concavity of the log likelihood), if it exists. Theorem 3.2 in [32] implies the global convergence of the steepest ascent algorithm for a continuously differentiable function, ℓ(η). It requires the step length α k to satisfy the Wolfe conditions for sufficient increase and curvature: where ∇ is the gradient operator and 0 < c 1 < c 2 < 1. Variations of these conditions exist in the numerical optimization literature [32,47], but all require evaluating the objective function. Exponential families we consider are an unusual case in optimization in that the objective function is harder to compute than its derivatives and hence not previously considered by optimization theorists. In our algorithm, we replace (4) with a single modified curvature condition: for some 0 < c < 1. This replacement is possible while still guaranteeing sufficient increase and convergence to the MLE (if it exists) because we have the additional property that the exponential family log likelihood function we consider is strictly concave. The restrictions on the step size α k along a particular direction p k and the resulting values for ℓ(η k+1 ) are depicted in Figure 2. The desire to avoid calculation of higher order derivatives is motivated not just by computational considerations, but also by how much useful information can be extracted from them. As noted in Section 1.1, if η is far from The acceptable region for step size α k along a particular search direction p k according to the modified curvature condition (5). The step sizes αc and αmax correspond to values of ∇ℓ(η k + αp k ) T p k equaling c∇ℓ(η k ) T p k and 0, respectively. The condition ensures sufficient increase in the log likelihood along the search direction p k .
the MLE, the Fisher information matrix may be near-singular and algorithms like (unsafeguarded) Newton-Raphson algorithm may fail. For this reason, the best use of our algorithm may be from "long range," filling a gap in the MLE estimation toolbox. It may be expedient to switch to another algorithm like Newton-Raphson after significant progress is made and the Fisher information matrix becomes useful. Our line search algorithm with p k the Newton direction provides a safeguard for Newton-Raphson that makes it safe for use from any range. The aster2 contributed R package [14] switches p k from the steepest ascent direction to the Newton direction after a fixed number of steps (d/2 where d is the dimension η) but always finds a step length α k satisfying (5), iterating until the unsafeguarded Newton step satisfies (5).
Our algorithm can be outlined as follows. Let · denote the Euclidean norm function, and ǫ a small value greater than 0.
while ∇ℓ(η k ) > ǫ Find a step size α k that satisfies the modified curvature condition Find the new search direction p k+1 , which must be an ascent direction. k = k + 1. end while

Background exponential family theory
An exponential family of distributions [2,12] on a sample space Y has log likelihood where g(y) is a d-dimensional vector of natural statistics calculated from the observed data y, η a d-dimensional vector of natural parameters, and · , · denotes the bilinear form So that the probability function integrates to 1, the cumulant function c must have the form where µ is a measure on Y. Define the natural parameter space Ξ as the set of points η = (η 1 , . . . , η d ) that are parameter values indexing distributions in the model. An exponential family is full if the natural parameter space is and regular if, in addition, Ξ is an open set. We say an exponential family is minimal if g(y) is not concentrated on a hyperplane. Minimality guarantees that if an MLE exists, it is unique [12].
In finite state space models with complicated dependence like an Ising model or exponential random graph model, (7) is a sum which may have no simple expression and can only be evaluated by explicitly doing the sum. When the sample space Y is even moderately large, this can be prohibitively expensive. For example, the sample space Y for an Ising model defined on a 32 × 32 square lattice where each entry takes values of 0 or 1 has 2 1024 ≈ 10 300 elements. A loop with this many iterations takes too long no matter how programmed.
A useful property of all exponential families [28, p. 27] when η is in the interior of Ξ is that Thus we can express first and second derivatives of the log likelihood (6) and Fisher information, I(η), as and thereby avoid evaluation of the problematic cumulant function c so long as we make do with first and second derivatives of the log likelihood avoiding evaluation of the log likelihood itself.

Long range search algorithm for MLE
We now present our line search algorithm, which will converge to the MLE for any regular exponential family if the MLE exists. The theory is divided into two theorems, Theorem 3.1 and 3.2: the first presents the requirements for the algorithm and guarantees that the log likelihood gradient, when it can be calculated exactly, converges to zero. The second shows that when the MLE exists, this is equivalent to finding the MLE. Proofs are in Appendix A. Theorem 3.1 also can be interpreted as assuring convergence to the MLE in the Barndorff-Nielsen completion [12] even when the MLE does not exist in the conventional sense. Convergence of the gradient of the log likelihood to zero is the same as convergence of the mean value parameter µ = E η g(Y ) to g(y), which is the MLE of mean value parameter in the Barndorff-Nielsen completion. This is not an efficient method of finding the MLE in the Barndorff-Nielsen completion [33, Chapters 4 and 5], but it is interesting that our algorithm behaves well even when the MLE does not exist in the conventional sense.
used to maximize the log likelihood function ℓ( · ) of a regular exponential family on a finite sample space, where the search direction p k is a non-zero ascent direction such that the angle θ k between the search direction p k and steepest ascent direction ∇ℓ(η k ) is restricted to be less than 90 degrees by for some fixed δ > 0.
Then, unless ∇ℓ(η k ) = 0, in which case η k is already the solution and the search is complete, it is possible to find a step length α k that satisfies the curvature condition for some fixed 0 < c < 1. Furthermore, repeated iterations of (12) satisfying (13) and (14) will produce a sequence, η 1 , η 2 , . . . such that lim k→∞ ||∇ℓ(η k )|| = 0. Theorem 3.1 can be adapted to a more general setting to optimize any bounded, proper, upper semi-continuous, and strictly concave function assuming there are bounded level sets of this function, as detailed in [34]. However, by assuming here that the objective function is the log likelihood of an exponential family, the statement of the theorem is much simplified.
We apply Theorem 3.1 to find the MLE when it is known to exist: For a regular exponential family with minimal representation where the MLE exists, the line search described in Theorem 3.1 can be applied to the log likelihood function ℓ(η) so that a search starting at any η 0 ∈ Ξ will converge to the MLE of η.
The issue of MLE existence is a problem in computational geometry, not an optimization problem, so we do not address it here. See [12,33,38] for further discussion of this issue.

Refinements of algorithm
In Theorem 3.1, we restricted our search direction p k to be an ascent direction, so that ∇ℓ(η k ) T p k > 0 or, alternatively, the angle θ k between the search direction p k and steepest ascent direction ∇ℓ(η k ) is less than 90 degrees. However, this still leaves many possibilities for the choice of p k other than steepest ascent. In addition, we have specified restrictions on the step size α k in the curvature condition (14) with 0 < c < 1, but it would be useful to know if certain values of c are better than others.

Search directions
In our examples in Section 5, by default we use steepest ascent directions in our implementation for simplicity. Although often effective in early steps, steepest ascent directions can result in a zigzagging trajectory of the sequence η k [47, Section 3.1]. Conjugate gradient methods address this phenomena and cover the sample space more efficiently [32,Chapter 5]. It is easy to implement a variant of the Polak-Ribière method [32, pp. 120-122] here, requiring little more in terms of calculation or storage. The search direction p k would update with an extra intermediate step as follows: Note that when γ P R k+1 = 0, p k+1 will be just ∇ℓ(η k+1 ), the direction of steepest ascent, and thus serves as a "reset". The curvature condition (14) guarantees that this method always yields a ascent direction for p k+1 and thus Theorem 3.1 still holds.

Step size
We now turn our attention to the optimal step size α k . Taking the derivative of ℓ(η k + α k p k ) with respect to α k shows that the log likelihood is maximized as a function of α k along the direction p k when By choosing c to be small, say 0.2, we ensure that the step taken is close to maximizing the log likelihood along the search direction. This is also apparent in Figure 2.
Making c too small, however, may make it difficult to find an α k that meets the curvature condition (14) since this search must be done numerically. In fact, as the line search nears the MLE and ∇ℓ(η k ) gets smaller, the rightmost term in (14) gets smaller in magnitude (it equals c ∇ℓ(η k ) 2 if using steepest ascent directions), making a numerical search for α k more challenging.

MCMC approximations
Our algorithm requires us to be able to calculate ∇ℓ(η) using (9). When this can be done exactly, our algorithm is straightforward to apply, as illustrated in the logistic regression example in Section 5.1. However, for many applications, we will need to approximate E η g(Y ) using MCMC. That is, where Y 1 , . . . , Y m are MCMC draws from the distribution with parameter η.
There are many MCMC algorithms such as Metropolis-Hastings [15] or Swensen-Wang [48,50], used for the Ising model example in Section 5.2.
The accuracy of the approximation in (15) increases with Monte Carlo sample size m. When the current estimate is far away from the MLE, we can use smaller m to save time and work with a fairly noisy approximation of the gradient. However, when the current estimate approaches the MLE, larger m are necessary.
Our algorithm relies on the computed values of ∇ℓ(η) in the curvature condition (14), as well as the stop condition for the algorithm, ∇ℓ(η k ) < ǫ. Given that we may only have approximations of ∇ℓ(η), we cannot know for certain if either of these conditions are truly met. We can ameliorate this by constructing confidence intervals for each of the inequalities.
For the inequalities in (14), we can estimate asymptotic standard errors of ∇ℓ(η k + α k p k ) T p k and c∇ℓ(η k ) T p k − ∇ℓ(η k + α k p k ) T p k by appealing to the Markov chain Central limit theorem [6,26,40,41]. The initseq function from the R package mcmc [13] can be used to estimate asymptotic standard errors for univariate functionals of reversible Markov chains: given an MCMC sample for a univariate quantity, initseq returns a value (divided by sample size) that is an estimate of the asymptotic variance in the Markov chain central limit theorem. Both of the quantities in (14) are univariate. In the second expression, c∇ℓ(η k ) T p k − ∇ℓ(η k + α k p k ) T p k , the MCMC sample generated for ∇ℓ(η k + α k p k ) T p k is independent of the sample generated for c∇ℓ(η k ) T p k . Thus initseq can be applied to each sample separately and the results summed for an estimated variance. We can then be approximately 95% confident (nonsimultaneously) that α k satisfies (14) if where se 1 and se 2 are the asymptotic standard errors for ∇ℓ(η k + α k p k ) T p k and c∇ℓ(η k ) T p k − ∇ℓ(η k + α k p k ) T p k , respectively, calculated as described.
The delta method can be applied to estimate a standard error for ∇ℓ(η k ) . The asymptotic variance is calculated by where Σ is the variance matrix of ∇ℓ(η k ) and can be estimated by the sample variance matrix of the batch mean vectors of g(Y 1 ), . . . , g(Y n ) divided by the number of batches (the initseq function requires a univariate vector and so cannot be used here). We can be approximately 95% confident that ∇ℓ(η k ) > ǫ if In practice, however, use of confidence intervals does not appear necessary with Monte Carlo sample sizes that are set large enough so that these standard errors are initially small relative to the point estimates. The ratio of point estimate to standard error of course decreases as the algorithm progresses and the estimate of the parameter nears the MLE, reflected in ∇ℓ(η k ) nearing 0. Thus these confidence intervals are most useful as a guide for when to increase the MCMC sample size, or when to switch methods, or when to terminate the algorithm.

Combining with other algorithms
We believe the best use of this algorithm is in combination with other faster methods like MCMC-MLE or Newton-Raphson safeguarded by our line search algorithm. Our algorithm with steepest ascent or conjugate gradient search direction should be used initially from long range, when one has no good intuition for an initial value. It is well known that when the objective function is quadratic, the conjugate gradient method with exact arithmetic converges to the solution in at most d steps, where d is the dimension of the problem [32, Chapter 5]. As a rule of thumb, we think using our algorithm for 2d steps before switching seems reasonable when using conjugate gradient directions. Determining a more precise criteria for when we are inside the "radius of convergence" for algorithms like Newton-Raphson or MCMC-MLE is an area for further research.
In the case of MCMC-MLE, [22] show that getting a distribution so that g(y) is within the convex hull of the MCMC samples is a necessary condition for this algorithm to converge (in fact, this appears to be the impetus for their steplength MCMC-MLE approach). However, this is not sufficient. An effective approach may be to examine the importance sampling weights used in the log likelihood approximation in the previous iteration, and look for these to stabilize. This "rearview" approach should inform us if the previous value for η k was close enough to apply MCMC-MLE; applying MCMC-MLE to the current distribution should then converge.

Examples
We illustrate the application of our algorithm in two settings for regular exponential families, where the MLE is known to exist: 1. Logistic regression. The gradient of the log likelihood, ∇ℓ(η), can be calculated exactly. In this setting, Theorem 3.2 guarantees converges to the MLE from any starting point. Of course, logistic regression is done more efficiently through iterated reweighted least-squares, as in the glm function in the R platform. Our purpose here is to show the application of our algorithm in a familiar setting. We choose a poor starting point, where an algorithm like Newton-Raphson would fail. Other optimization algorithms such as those based on (4) would also work here. 2. Ising model. The gradient of the log likelihood, ∇ℓ(η), can only be approximated via MCMC. In this setting, Theorem 3.2 does not guarantee convergence, but our algorithm is still effective in practice, even for poor starting values. In this case, optimization algorithms based on (4) cannot be applied since they depend on evaluating the log likelihood.
In both cases, we compare our algorithm with stochastic approximation to clarify the distinctions made in Section 1.1 between the two approaches.

Example: Logistic regression
We apply our algorithm to the case of a logistic regression with a starting point far from the solution. In such a case, the Hessian matrix is often near-singular and algorithms such as Newton-Raphson which rely on it will fail. For classical SA with step size 1/k, the magnitudes of the updates diminish too quickly for the parameter estimates to approach the MLE in a reasonable amount of time.
The response vector Y has components that are Bernoulli trials with mean vector p. The natural parameter is θ i = log pi 1−pi , which is modeled componentwise as a linear function of the predictors 1, x 1 , . . . , x q , so that Defining the model matrix M to be the n × (q + 1) matrix with the x i as rows, we can express θ = M β. This in turn allows us to reparameterize the exponential family as one with β as the natural parameter vector and M T y the vector of statistics with log likelihood where y is the vector of observed Bernoulli responses. By (9), the gradient is with component values generated from a Uniform(−1, 1) distribution. Then, using 1000 independent draws from a correlated multivariate normal distribution centered at 0 as our predictors, we generated data for this model (the data vector has length 1000 like the predictor vectors). Fitting these data using the R function glm, we find the MLE of β to bê We measure the performance of our algorithm in terms of the total number of iterations used, where each iteration requires evaluation of the gradient, ∇ℓ(β k + α k p k ). Typically, several iterations take place in an inner loop to find a step size α k that meets the curvature condition (14), a process that grows increasingly difficult as the estimates near the MLE since the rightmost term in (14) gets smaller in magnitude. Once an acceptable step size is found, the parameter estimate β k is updated and a new search direction is determined, requiring another evaluation of the gradient. Our algorithm took 538 iterations over 204 different search directions to get ∇ℓ(β k ) < 0.01 and arrive at an estimate for the MLE that differs from the glm result by 3.309 in Euclidean distance (See Table 1). Using the Polak-Ribière conjugate gradient method described in the previous section resulted in comparably sharp MLE estimates (see Table 1) in fewer iterations-292 over 116 search directions-a noticeable improvement.
We also applied SA with step size 1/k (setting A = 1, B = 0 in (2)) from the same starting point β 0 . The choice of constants A and B in the step size is of course not likely to be optimal; however, we want to apply SA without trial and error experimentation. After 100,000 iterations, the parameter estimates look nothing at all like the MLE (See Table 1). The initial step sizes are far too large, then diminish rapidly so that the algorithm does not converge in a reasonable amount of time. Table 2 shows the first 20 step sizes used by SA and our line search. Our line search continues to use step sizes of relatively stable magnitude even well into the process. It should be noted that these 20 step sizes correspond to the first 20 iterations of SA but 50 iterations of our line search algorithm using steepest ascent, since we spends several iterations finding an acceptable step size for each update, and 45 iterations using conjugate gradient directions.

Example: Ising model
In this example, we apply our gradient-based line search algorithm to an Ising model [24] on a toroidal square lattice. Here the gradient of the log likelihood cannot be calculated exactly as in the logistic example and so Theorem 3.2 cannot be applied directly. However, as discussed in Section 4.3, the gradient can be approximated using MCMC, allowing our algorithm to still be effective in finding the MLE.
Ising models are exponential families where each entry in the square lattice takes the value of either zero or one. A realized sample is shown in Figure 1. The Table 2 The first 20 step sizes used by SA (with step size 1/k) and our algorithm for Example 1.
The step sizes used by our algorithm do not diminish like 1/k sufficient statistic vector is two-dimensional, comprising the number of entries with value one and the number of "neighbor" entries with the same value. Entries are considered "neighbors" if they are adjacent to one another horizontally or vertically (but not diagonally).
Here we describe the toroidal square lattice as an n × n matrix Y and each entry as Y ij , where i and j take values in 1, . . . , n considered as a cyclical set (addition is done modulo n). The sufficient statistic, g(y), has components: where I(·) denotes the indicator function taking logical expressions to the numbers zero and one, false expressions to zero and true expressions to one. Because of the dependence of neighboring entries in the lattice, there is no closed form expressing ∇ℓ(η). Instead, we need to approximate ∇ℓ(η) using MCMC as described by (15). The MCMC draws are performed here using the Swendsen-Wang algorithm [48,50], available in the contributed R package potts [16].
We choose η = 0, log(1 + √ 2) T to generate a 32 × 32 lattice, which we use as our observed data (Figure 1). This value for η is of particular interest  Table 4 The first 17 step sizes used by SA (with step size 1/k) and our algorithm for Example 2.
The step sizes used by our algorithm are initially much smaller than 1/k because it corresponds to the phase transition point [37] and has been shown to be difficult to estimate [9]. In order to get a good estimate of the MLE to which we can compare our algorithm's results, we use 10 iterations of MCMC Newton-Raphson [36] starting at the true value of η so that it will converge. We apply our line search algorithm to this data using a far off initial value of η (0) = (2, 0.001) and a fixed MCMC sample size of 10,000. Our algorithm used 62 iterations (gradient evaluations) over 17 search directions to get ∇ℓ(η k ) < 0.005 and arrive at an estimate of the MLE that differs from Newton-Raphson by 0.0037 (see Table 3). Using the Polak-Ribière conjugate gradient method resulted in comparably sharp MLE estimates using 45 iterations over 7 search directions. The total MCMC sample sizes used were 62 × 10, 0000 = 620, 000 and 45 × 10, 0000 = 450, 000, respectively.
We also applied MCMC SA, again with step size 1/k from the same starting point η (0) , and used a MCMC sample size of 1,000 for gradient calculation. Here SA converged in 1368 iterations or 1,368,000 MC samples, comparable to our algorithm (see Table 3). Table 4 shows the first 17 step sizes used by SA and our line search. The step sizes used by our line search are initially very small compared to 1/k, but stay in a range of about 1/300 to 1/3000. So, the 1/k step size used by SA in fact occasionally satisfies our curvature condition when k is large.

Discussion
We have presented a simple line search algorithm for finding the MLE of a regular exponential family when the MLE exists (or the MLE in the Barndorff-Nielsen completion when the MLE does not exist in the conventional sense). The algorithm avoids any trial-and-error experimentation involving tuning parameters or starting points commonly associated with optimization routines not invented by optimization specialists. Our algorithm is modeled after algorithms discussed in optimization textbooks [8,32,47], all of which are safeguarded to ensure rapid automatic convergence.
Convergence is guaranteed when the gradient can be calculated exactly. Even when the gradient cannot be calculated exactly and is only estimable via MCMC, the algorithm is still useful in practice, as demonstrated by the Ising model example. We have also described a way to construct and use confidence intervals to make convergence highly probable.
The algorithm can be computationally demanding. When the current iteration approaches the solution, the curvature condition for step size becomes more difficult to satisfy and the method may require several iterations of MCMC sampling and perhaps an increase in MCMC sample size. Eventual increase in MCMC sample size is unavoidable, because the achievable accuracy is inversely proportional to the square root of the MCMC sample size, as in all Monte Carlo. Thus we believe the best use of this algorithm is in combination with other faster methods like MCMC-MLE or Newton-Raphson safeguarded by our line search algorithm. Our algorithm should be used from "long range", when one has no good intuition for an initial value and is concerned about picking one that is far from the MLE. The switch between types of search direction (steepest ascent, conjugate gradient, or Newton) within our algorithm or the switch to another algorithm (such as MCMC-MLE) need not require manual intervention. When used in combination, we do not think the confidence intervals are necessary as the curvature condition is quite easily satisfied when the current iteration is far from the MLE.
One way to improve performance is to use conjugate gradient search directions rather than steepest ascent. In our examples, this reduced the number of iterations by over 25%. However, in other problems we tried with different dimensionality, this performance varied significantly and it appears that no guarantee can be made about quantity of improvement in performance, though in all cases we examined, it never did worse. This is no surprise, because the necessity of "preconditioning" for good performance of the conjugate gradient algorithm is well known (but no good "preconditioner" is available for maximum likelihood in exponential families).
There are several outstanding issues. Most notably, we have not showed convergence of the algorithm when the gradient is approximated via MCMC. This is a more difficult theoretical problem and is the motivation for stochastic approximation research. Further work is necessary to determine if one can adapt our restrictive curvature condition (14) to the approach of [1,30] in MCMC stochastic approximation.
Another remaining issue is the stopping criterion: what value should be chosen for ǫ in the exit condition ∇ℓ(η k ) < ǫ? Because the value of ∇ℓ(η k ) can only be approximated via MCMC, one cannot be certain if this condition is actually satisfied. Here again, the switch to another methodology may be appropriate, though at least in our Ising model example, our use of 10,000 for the MCMC sample size and 0.005 for ǫ were successful in obtaining a reasonable parameter estimate.
A final remaining issue is estimation of Monte Carlo error of the estimates. Here too we recommend switching to another algorithm at the end. The MCMC-MLE procedure gives accurate error estimates [11]. For very small steps these are essentially the same as the Monte Carlo error of a single unsafeguarded Newton-Raphson step, so the method in [11] can be used for either.

Appendix A: Proofs
Proof of Theorem 3.1. Let f (·) represent the negative log likelihood −ℓ(·), the objective function to be minimized. We proceed from the perspective of a minimization of a function f (·) since this is the convention in the optimization literature [32,42].
The negative log likelihood function −ℓ(·) is strictly convex by (10), and continuous since it is infinitely differentiable by Theorem 5.8 in [28]. It is bounded below by the negative log likelihood of the limiting conditional model of this exponential family described in Theorem 6 of [12], which is guaranteed to have a global minimum.
Then, unless ∇f (x k ) = 0 in which case x k is already the solution, for each k, we can uniquely define α c k as follows: The point α c k is uniquely defined because it is the minimizer of the function α → f (x k + αp k ) − αc∇f (x k ) T p k . We may also define α min k as follows: These values appear on the α-axis in Figure 3 for the case where a minimizer exists for α → f (x k + αp k ).
By the strict convexity of f and Theorem 2.14(b) in [42], Applying (16) to the right hand side of the above gives (See points a and b in Figure 3).
The subproblem α → f (x k + αp k ) is strictly convex and hence monotonically decreasing at α k such that α c k ≤ α k < α min k (in Figure 3, see points b and c).
That is, where the left-hand side denotes inf α∈R f (x k + αp k ) when α min = ∞.
Combining the second inequality of (19) with (18), we have We now turn our attention to (16). Define x c k = x k + α c k p k . Then Subtracting ∇f (x k ) T p k from both sides gives (∇f (x c k ) − ∇f (x k )) T p k = (c − 1)∇f (x k ) T p k .
By (10), ∇ 2 ℓ(η) is bounded for finite state space g(Y), which is true by assumption. Thus |∇ 2 f (x)| ≤ K for some constant K for all x. Then by Theorems 9.2 and 9.7 in [42], ∇f (x) is Lipschitz continuous relative to the convex set R d .
Thus there exists a constant L < ∞ such that Applying this relation to x c k and x k , we have The rest of this proof is nearly identical to the proof for Theorem 3.2 in [32, pp. 43-44]. Multiplying both sides of (22) by p k and then applying Cauchy-Schwartz gives Substituting (21) into the left-hand side of (23) gives Substituting (24) into (20), we obtain The angle θ j between the search direction p k and steepest descent direction −∇f (x k ) can be expressed by cos θ j = −∇f (xj ) T pj ∇f (xj ) · pj . Substituting this relation into (25) gives By summing this expression over all indices less than or equal to k, ∇f (x j ) 2 cos 2 θ j .
Because f (x) is bounded below, there exists some M < ∞ such that f (x 0 ) − f (x k+1 ) < M for all k, so that ∇f (x j ) 2 cos 2 θ j < M < ∞.
With the additional restriction on the search direction p k such that cos θ k ≥ δ > 0 for some choice of δ, for all choices of k, the convergent series in (26) implies that lim k→∞ ∇f (x k ) = 0.
The inequality (26) has been referred to as Zoutendijk's condition [32], though we arrive at this result via different assumptions.
Theorem 3.1 shows that the gradient of the objective function converges to 0. The proof for Theorem 3.2 is concerned with the conditions for mapping this convergence to the convergence of the iterated parameter estimates η k to the unique MLE. In particular, the mapping from η k to the gradient must be globally invertible.
Proof of Theorem 3.2. The Fisher information for an exponential family with minimal representation is non-singular by (11) and thus invertible. If we consider the map defined by where c is the cumulant function (7), its first derivative matrix is ∇h(η) = ∇ 2 c(η) = I(η) (27) which is again non-singular. Since this is true for any η, by the inverse function theorem, h is everywhere locally invertible. In fact, h is globally invertible. For any µ in the range of h, consider the function q(η) = µ T η − c(η).
Since ∇ 2 q(η) = −I(η) by (27), q is strictly concave. There exists an η such that ∇q(η) = µ − h(η) = 0 because of the assumption that µ is in the range of h. Thus this η is a stationary point of q, which is a global maximizer of q by concavity and the unique global maximizer by strict concavity. This says that for every µ in the range of h, there exists a unique η such that µ = h(η), which is global invertibility of h.
Since c is infinitely differentiable by Theorem 2.7.1 in [29], so is h, and by the inverse function theorem, so is h −1 (even if we do not know the form of h −1 ). The first derivative of h −1 can be expressed as and is thus non-singular everywhere, including at the MLE of η,η MLE .