A Fast and Efficient Estimation of the Parameters of a Model of Accident Frequencies via an MM Algorithm



Introduction
A large majority of the problems encountered in applied statistics (maximum likelihood estimation, least squares, data fitting, machine learning, data analysis, experimental design, clustering, and classification) involve the numerical optimization of a real-valued function L(θ) depending on a parameter vector θ ∈ ℝ^d, where d is a positive integer. Of all the numerical optimization algorithms developed over the years, the Newton-Raphson (NR) algorithm is the most famous and the most widely used. This algorithm starts from an initial guess θ^(0) and iterates according to the scheme

θ^(m+1) = θ^(m) − [∇²L(θ^(m))]^(−1) ∇L(θ^(m)),

where ∇L(θ^(m)) and ∇²L(θ^(m)) are, respectively, the gradient vector and the Hessian matrix of L evaluated at θ^(m). The popularity of this algorithm is explained by its fast convergence when the starting guess θ^(0) is chosen close enough to the maximum, which is unknown in practice [1]. Unfortunately, the source of the NR algorithm's popularity is also that of its first flaw: its success strongly depends on the appropriate choice of θ^(0) in a neighbourhood of the unknown solution.
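
As an illustration, here is a minimal R sketch of the NR iteration for a generic smooth objective; grad_L and hess_L are placeholder names of ours for user-supplied functions returning ∇L and ∇²L.

```r
# Minimal sketch of the Newton-Raphson iteration (not the paper's code).
# grad_L and hess_L are user-supplied functions returning the gradient
# vector and the Hessian matrix of L at theta.
newton_raphson <- function(theta0, grad_L, hess_L,
                           tol = 1e-8, max_iter = 100) {
  theta <- theta0
  for (m in seq_len(max_iter)) {
    g <- grad_L(theta)
    H <- hess_L(theta)
    # solve(H, g) computes H^{-1} g; it fails if H is singular,
    # which is precisely one of the drawbacks discussed below.
    step <- solve(H, g)
    theta <- theta - step
    if (sqrt(sum(step^2)) < tol) break
  }
  theta
}
```
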
The second major drawback of the NR algorithm, inherited from its mathematical formulation, is that it requires the numerical inversion of the Hessian matrix ∇²L(θ^(m)) (a square matrix of order d) at each iteration, which can be very tricky if, at iteration m, the matrix ∇²L(θ^(m)) is singular or ill-conditioned. When the NR algorithm is unsuccessful, alternatives may be considered. Among them, we can mention quasi-Newton algorithms [2] (which compute approximations of [∇²L(θ^(m))]^(−1) at each iteration), block-relaxation algorithms (which divide the parameters into disjoint blocks and proceed to optimization by cycling through these blocks) [3], and derivative-free optimization (DFO) algorithms [4, 5].
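
In R, a quasi-Newton (BFGS) run can be obtained from the base function optim, which maintains an internal approximation of the inverse Hessian; the toy objective f below is an arbitrary placeholder of ours.

```r
# Quasi-Newton (BFGS) via base R's optim(); optim() minimizes, so one
# would pass the negated log-likelihood to perform maximization.
f      <- function(theta) sum((theta - c(1, 2))^2)  # toy objective
grad_f <- function(theta) 2 * (theta - c(1, 2))     # its gradient
fit <- optim(par = c(0, 0), fn = f, gr = grad_f, method = "BFGS")
fit$par  # approximate minimizer, here close to c(1, 2)
```
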
In statistics, the last decades have seen the development and the fast breakthrough of the minorization-maximization (MM) principle for constructing maximization algorithms [6-8], and of the expectation-maximization (EM) algorithm [9], which is considered as a special case of MM algorithm for statistical estimation with incomplete data [6]. Designing an MM algorithm for maximizing L(θ) consists in constructing a special surrogate function whose maximization is simpler than, but equivalent to, that of L(θ) and then maximizing that surrogate function. Since they do not need any matrix inversion, MM algorithms generally outperform the NR algorithm and are considered relevant for high-dimensional problems and for some discrete multivariate distributions [10].
In this paper, we are interested in an important issue in statistics applied to road safety: the statistical evaluation of road safety measures. We consider a discrete multivariate statistical model coupling the distributions of accidents classified by severity level on the sites where the measure has been implemented (called treated sites) and on the sites where it has not (called control sites). This model has a constrained parameter vector θ of 1 + sr components, where s is the number of treated sites and r is the number of accident severity levels. Our main purpose is to design an MM algorithm for the maximum likelihood estimation of the parameter vector θ. Since MM algorithms may be slow to converge, we also consider the acceleration of our proposed MM algorithm.
The rest of this paper is structured as follows. In Section 2, we describe the model and present the maximum likelihood estimation problem. In Section 3, we devise our new MM algorithm for estimating the parameter vector θ; afterwards, we apply an acceleration scheme in order to get the accelerated version. In Section 4, we present the results of the comparison of our proposed MM algorithm and its accelerated version to the NR and quasi-Newton algorithms.

Statistical Model and Parameter Estimation
Let us consider s (s > 0) geographical sites (hereafter referred to as treated sites) where a road safety measure (maximum speed reduction, installation of roundabouts, and so on) has been implemented for a certain period of time. The accidents occurring on these sites are assumed to be classified by severity into r levels (r > 0). Suppose that, in order to avoid confusing the effect of the measure with those of other factors likely to influence the number of crashes, each treated site is paired with another geographical site (hereafter called control site) with the same characteristics (traffic flow, weather conditions, and so on) as the treated site but where the measure has not been implemented. For k = 1, ⋯, s, let

X_k = (X_11k, ⋯, X_1rk, X_21k, ⋯, X_2rk)^T

be a vector composed of 2r random variables, where for all j = 1, ⋯, r, X_1jk (respectively, X_2jk) is a random variable representing the number of crashes of severity level j occurring on treated site k in the period before (respectively, after) the implementation of the measure. Also consider the non-random vector z_k = (z_1k, ⋯, z_rk)^T, where for all j = 1, ⋯, r, z_jk is a non-random variable representing the ratio of the number of accidents of severity level j in the "after" period to the number of accidents of the same severity level in the "before" period on the control site paired with treated site k. Different models combining accident frequencies from the treated sites and the control sites have been proposed in order to estimate mainly the average effect α of the measure and, secondarily, the accident risks [11-13]. Let S_{r−1} = {(p_1, ⋯, p_r)^T ∈ [0, 1]^r, ∑_{j=1}^r p_j = 1}, ⟨·,·⟩ be the classical inner product on ℝ^r, and n_k (k = 1, ⋯, s) be the total number of accidents observed at treated site k. In this paper, we consider the statistical model proposed in [13] under the following assumptions:

(A1) For all k = 1, ⋯, s, X_k follows the multinomial distribution M(n_k; π_k(θ)), where the vector π_k(θ) of cell probabilities is the function of θ = (α, β_11, ⋯, β_rs)^T specified in [13]; α is the mean effect of the measure and, for all j = 1, ⋯, r, k = 1, ⋯, s, β_jk is the probability that an accident occurring on treated site k is of severity level j.

(A2) The random vectors X_1, ⋯, X_s are mutually independent.

(A3) For all k = 1, ⋯, s, the vector β_k = (β_1k, ⋯, β_rk)^T belongs to S_{r−1}.

In this paper, we are interested in estimating the unknown parameter vector θ.
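
To make the model concrete, here is a small R sketch that draws the 2r counts of one treated site under (A1). The explicit form of π_k(θ) used below (before-period cells proportional to β_jk, after-period cells proportional to α z_jk β_jk, both normalized by 1 + α⟨z_k, β_k⟩) is our reading of the model of [13] and is an assumption of this sketch, not a restatement of the paper's formula.

```r
# Hedged sketch: simulate the 2r counts of one treated site k under (A1).
# The cell probabilities below are an ASSUMED form of pi_k(theta),
# taken from our reading of the cited model [13].
simulate_site <- function(n_k, alpha, beta_k, z_k) {
  denom    <- 1 + alpha * sum(z_k * beta_k)   # 1 + alpha * <z_k, beta_k>
  p_before <- beta_k / denom                  # severity j, "before" period
  p_after  <- alpha * z_k * beta_k / denom    # severity j, "after" period
  pi_k <- c(p_before, p_after)                # length 2r, sums to 1
  as.vector(rmultinom(1, size = n_k, prob = pi_k))
}

set.seed(1)
simulate_site(n_k = 100, alpha = 0.8,
              beta_k = c(0.2, 0.3, 0.5), z_k = c(1.1, 0.9, 1.0))
```
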
The likelihood of the observed data x_1, ⋯, x_s, where for all k = 1, ⋯, s, x_k = (x_11k, ⋯, x_1rk, x_21k, ⋯, x_2rk)^T denotes the observed value of X_k, yields the log-likelihood L(θ) given in Equation (5). The maximum likelihood estimate (MLE) θ̂ is then the solution of the constrained optimization problem

maximize L(θ) (6a)
subject to α > 0 and β_jk > 0 for all j = 1, ⋯, r, k = 1, ⋯, s, (6b)
∑_{j=1}^r β_jk = 1 for all k = 1, ⋯, s. (6c)

In the case s = 1, there exists a closed-form expression of θ̂ [14]. However, in the general case s ≥ 1, the optimization problem (6a), (6b), and (6c) seems impossible to solve in closed form and therefore calls for numerical optimization. As stated earlier, different iterative algorithms may be used to solve (6a), (6b), and (6c), each with strengths and weaknesses. The minorization-maximization (MM) strategy for constructing optimization algorithms has been shown to yield algorithms (then called MM algorithms) that outperform the classical NR and quasi-Newton algorithms [6, 8, 10]. Moreover, it can handle constraints easily. In problems such as (6a), (6b), and (6c), it consists in defining a special surrogate function and afterwards maximizing the latter rather than the log-likelihood. MM algorithms are considered relevant for high-dimensional problems and for discrete multivariate distributions [10]. In the next section, we explain the MM principle; afterwards, we devise an MM algorithm to solve the problem (6a), (6b), and (6c).
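
Under (A1) and (A2), the log-likelihood is, up to an additive constant, L(θ) = ∑_{k=1}^s ∑_{i=1}^2 ∑_{j=1}^r x_ijk log π_ijk(θ). The R sketch below evaluates it using the same assumed form of π_k(θ) as the previous sketch (again a hypothetical reading of [13], not the paper's code).

```r
# Hedged sketch of the log-likelihood L(theta) up to an additive constant.
# x is a list of s vectors of length 2r (before counts, then after counts);
# z is a list of s vectors of length r. The pi_k formula is ASSUMED.
log_lik <- function(alpha, beta, x, z) {
  s <- length(x)
  total <- 0
  for (k in seq_len(s)) {
    beta_k <- beta[[k]]
    denom  <- 1 + alpha * sum(z[[k]] * beta_k)
    pi_k   <- c(beta_k, alpha * z[[k]] * beta_k) / denom
    total  <- total + sum(x[[k]] * log(pi_k))
  }
  total
}
```
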

An MM Algorithm for Computing the MLE
3.1. Brief Reminder of the MM Principle for Constructing Maximization Algorithms. The MM principle for constructing an iterative maximization algorithm consists of two steps. Let θ^(m) be the iterate after m iterations. For the maximization problem (6a), (6b), and (6c), the first M step consists in defining a minorizing function g(θ|θ^(m)), and the second M step consists in maximizing the surrogate function g(θ|θ^(m)) with respect to θ rather than the log-likelihood L(θ); the next iterate θ^(m+1) is the value at which g(θ|θ^(m)) attains its maximum. Before going further, let us recall the definition of a minorizing function [6]: a function g(θ|θ^(m)) is said to minorize L(θ) at θ^(m) if g(θ|θ^(m)) ≤ L(θ) for all θ and g(θ^(m)|θ^(m)) = L(θ^(m)).
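
The two M steps translate into a very short generic loop. In the sketch below, maximize_surrogate is a placeholder name of ours for the second M step (the maximizer of g(·|θ^(m))).

```r
# Generic MM loop skeleton. maximize_surrogate(theta_m) must return
# argmax_theta g(theta | theta_m); by the minorization property,
# L(theta_new) >= g(theta_new|theta) >= g(theta|theta) = L(theta),
# so the log-likelihood increases at every iteration.
mm_algorithm <- function(theta0, maximize_surrogate,
                         tol = 1e-8, max_iter = 1000) {
  theta <- theta0
  for (m in seq_len(max_iter)) {
    theta_new <- maximize_surrogate(theta)
    if (sqrt(sum((theta_new - theta)^2)) < tol) break
    theta <- theta_new
  }
  theta_new
}
```
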

3.2. Design and Maximization of the Minorizing Function.
To develop our MM algorithm, we must define and maximize a minorizing function for L(θ). To this purpose, we use some mathematical inequalities related to convex functions presented in [6]. The following lemma gives the expression of the minorizing function.
Lemma 1. Let θ^(m) be a value of the parameter vector θ and let g be the real-valued function of θ defined by Equation (9), whose additive constant is independent of θ. Then, g(θ|θ^(m)) minorizes L(θ) at θ^(m).
Proof. By convexity of −log, we know that, for all a, b > 0, −log(a/b) ≥ −(a/b) + 1. Applying this inequality for each k = 1, ⋯, s and summing over the index k yields inequality (14). By convexity of −log and by Equation (10) of [6], we also obtain inequality (17). From (14), (17), and the definition of L(θ) (see (5)), we can write L(θ) ≥ g(θ|θ^(m)), where g(θ|θ^(m)) is defined by (9). Moreover, we have g(θ^(m)|θ^(m)) = L(θ^(m)). We can then conclude that g(θ|θ^(m)) minorizes L(θ) at θ^(m).
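
The elementary inequality opening the proof is easy to verify numerically; the snippet below (a quick sanity check of ours, not part of the paper) confirms both the bound and the tangency at a = b.

```r
# Check that -log(a/b) >= -(a/b) + 1 for a, b > 0, with equality at a = b.
a <- seq(0.1, 5, by = 0.1)
b <- 2
lhs <- -log(a / b)
rhs <- -(a / b) + 1
all(lhs >= rhs - 1e-12)   # TRUE: the minorization holds everywhere
i <- which.min(abs(a - b))
c(lhs[i], rhs[i])         # both ~0: tangency at a = b
```
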

To finalize the design of our MM algorithm, we need to deal with the constraints (6b) and (6c). In [6], the authors recommend to extend the function g(θ|θ^(m)) into a new form that takes the inequality constraints into account; the equality constraints are then enforced during the optimization of the extended form of g(θ|θ^(m)). Their method is based on the logarithmic barrier method for handling inequality constraints. For q inequality constraints of the form v_i(θ) ≥ 0, i = 1, ⋯, q, the extension of g(θ|θ^(m)) presented in Equation (23) of [6] is an additive term composed of linear combinations of log v_i(θ). In this paper, the form of the inequality constraints (6b) implies that the extension of g(θ|θ^(m)) would be composed of linear combinations of log α and log β_jk. So the log-likelihood L(θ) defined in Equation (5) already contains the logarithmic barrier (it diverges to negative infinity if any of the parameters α or β_jk tends to zero). Thus, as the authors of [6] themselves recognize, if the initial point θ^(0) satisfies the inequality constraints (6b), then the presence of the terms log α and log β_jk in the expression of L(θ) prevents α^(m+1) ≤ 0 and β_jk^(m+1) ≤ 0 from occurring. Now, knowing θ^(m), we just have to maximize g(θ|θ^(m)) under the equality constraints (6c) to obtain the next iterate θ^(m+1).

Theorem 2. Let θ^(m) be the estimate of the parameter vector θ after m steps of our proposed MM algorithm. Then, the components of the next iterate θ^(m+1) are given by Formula (19) for α^(m+1) and, for all k = 1, ⋯, s, j = 1, ⋯, r, by Formula (20) for β_jk^(m+1).

Proof. If θ^(m) is the estimate of θ after m steps, then the next iterate θ^(m+1) is obtained as the maximizer (21) of g(θ|θ^(m)) under the equality constraints (6c). Solving problem (21) is equivalent to looking for the stationary points of the Lagrangian

L(θ, λ) = g(θ|θ^(m)) + ∑_{k=1}^s λ_k (1 − ∑_{j=1}^r β_jk),

where λ = (λ_1, ⋯, λ_s) is a vector of s Lagrange multipliers and g is defined by Equation (9). Setting ∂L(θ, λ)/∂α and ∂L(θ, λ)/∂β_jk to zero gives Equations (23a) and (23b). Multiplying (23a) by α and (23b) by β_jk, one gets (24a) and (24b). For all k = 1, ⋯, s, summing over the index j in Equation (24b) and noting that ∑_{j=1}^r β_jk = 1 and ∑_{j=1}^r x_•jk = n_k lead to (25) and hence to (26). Combining (24b) and (26) leads to (27). Thus, the solution θ of (21) satisfies Equation (28) for α and, for all k = 1, ⋯, s, j = 1, ⋯, r, Equation (29) for β_jk.

The updates α^(m+1) and β_jk^(m+1) of Formulas (19) and (20) then follow from (28) and (29) together with Remark 3. At the (m+1)-th iteration, the update α^(m+1) is computed from α^(m) and β^(m) using Formula (28) and Remark 3; afterwards, β^(m+1) is updated from α^(m+1), α^(m), and β^(m) using Formula (29) and Remark 3. This process is repeated until a convergence criterion is satisfied. Since the β_jk^(m+1) are computed from α^(m+1), our MM algorithm (Algorithm 1) is thus a cyclic MM algorithm [6].
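
The cyclic structure described above can be sketched as follows. The functions update_alpha and update_beta are placeholders of ours standing for Formulas (19) and (20) combined with Remark 3; their closed forms are given in the paper's equations and are not reproduced here.

```r
# Skeleton of the cyclic MM algorithm (Algorithm 1). The update functions
# are PLACEHOLDERS for the paper's Formulas (19)-(20) with Remark 3:
#   update_alpha(alpha_m, beta_m, data) -> alpha_{m+1}
#   update_beta(alpha_mp1, alpha_m, beta_m, data) -> beta_{m+1}
cyclic_mm <- function(alpha0, beta0, data, update_alpha, update_beta,
                      tol = 1e-8, max_iter = 5000) {
  alpha <- alpha0; beta <- beta0
  for (m in seq_len(max_iter)) {
    alpha_new <- update_alpha(alpha, beta, data)
    beta_new  <- update_beta(alpha_new, alpha, beta, data)  # uses alpha_new
    if (abs(alpha_new - alpha) + sum(abs(beta_new - beta)) < tol) break
    alpha <- alpha_new; beta <- beta_new
  }
  list(alpha = alpha_new, beta = beta_new)
}
```
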
3.4. Acceleration of the MM Algorithm. In many statistical estimation problems, MM algorithms may need a great number of iterations and converge very slowly. In consequence, different acceleration schemes have been developed [15-17].
To the best of our knowledge, the acceleration schemes developed in [16] are among the most popular, and they will also be considered in this paper. The authors of [16] presented a new class of iterative methods, called SQUAREM (squared iterative methods for EM acceleration), for accelerating the expectation-maximization (EM) algorithm and state that SQUAREM may also be used to accelerate MM algorithms.
Let F be the MM map of Algorithm 1, i.e., the function from the parameter space to itself such that θ^(m+1) = F(θ^(m)). Knowing the iterate θ^(m), SQUAREM computes

r^(m) = F(θ^(m)) − θ^(m),
v^(m) = F(F(θ^(m))) − F(θ^(m)) − r^(m),

and the next iterate

θ^(m+1) = θ^(m) − 2γ r^(m) + γ² v^(m),

where γ is a scalar steplength. Varadhan and Roland [16] described three choices for the steplength but, in many numerical experiments (see, for example, [16, 17]), the steplength

γ = −‖r^(m)‖ / ‖v^(m)‖ (34)

gives a faster convergence. As in [16, 17], the accelerated MM algorithm with steplength (34) (the third choice of steplength proposed by [16]) will be denoted SqS3. The SqS3 algorithm is given hereafter (see Algorithm 2). Note that lines 7 to 9 of Algorithm 2 correct the new iterate θ^(m+1) by performing a simple step of the nonaccelerated MM algorithm whenever θ^(m+1) does not belong to the constrained parameter space.
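
A minimal R sketch of one SqS3 step is given below, assuming mm_map implements the MM map F of Algorithm 1 and in_space tests membership in the constrained parameter space (both names are ours); the fallback mirrors our reading of lines 7 to 9 of Algorithm 2.

```r
# One SqS3 acceleration step (third steplength of Varadhan & Roland [16]).
# mm_map: the MM map F; in_space: feasibility test for the constraints.
sqs3_step <- function(theta, mm_map, in_space) {
  theta1 <- mm_map(theta)
  theta2 <- mm_map(theta1)
  r <- theta1 - theta                         # first difference
  v <- (theta2 - theta1) - r                  # second difference
  if (sqrt(sum(v^2)) < 1e-12) return(theta1)  # near a fixed point
  gamma <- -sqrt(sum(r^2)) / sqrt(sum(v^2))   # steplength (34)
  theta_new <- theta - 2 * gamma * r + gamma^2 * v
  # Fall back to a plain (nonaccelerated) MM step if the extrapolated
  # iterate leaves the constrained parameter space (our reading of
  # lines 7-9 of Algorithm 2).
  if (!in_space(theta_new)) theta_new <- theta1
  theta_new
}
```
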

Simulation Study
We compare, in the R software [18], our MM algorithm and its accelerated version SqS3 to the NR algorithm (package nleqslv [19]) and the quasi-Newton BFGS algorithm (package alabama [20]). The design of the simulation study is inspired from [21, 22].
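
For reference, the two competitors can be invoked roughly as sketched below on a toy problem; the objective, gradient, and constraint functions are placeholders of ours, and the exact calls used in the paper's experiments may differ.

```r
# Hedged sketch of the competitor calls (not the paper's exact setup).
library(nleqslv)   # NR: solve the score equations
library(alabama)   # augmented Lagrangian with a BFGS inner optimizer

# Toy stand-ins (the model's L(theta) would replace these):
neg_log_lik <- function(theta) sum((theta - c(0.2, 0.8))^2)
score       <- function(theta) 2 * (theta - c(0.2, 0.8))  # grad of neg_log_lik
heq         <- function(theta) sum(theta) - 1             # equality constraint

theta0 <- c(0.5, 0.5)
fit_nr   <- nleqslv(x = theta0, fn = score, method = "Newton")  # solves score = 0
fit_bfgs <- auglag(par = theta0, fn = neg_log_lik, heq = heq)   # BFGS by default
```
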
4.1. Data Generation. We have generated the data under assumptions (A1), (A2), and (A3), where the true parameter vector has taken five values, one per scenario; in each scenario, for all j = 1, ⋯, r, u_j is a random observation from the uniform distribution on [0.05, 0.95].
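
A data-generation sketch consistent with (A1)-(A3) follows; it reuses the simulate_site function sketched in Section 2 (and thus inherits its assumed form of π_k(θ)), and the values of s, r, α, and z below are illustrative, not the paper's scenarios.

```r
# Hedged data-generation sketch (parameter values are illustrative).
set.seed(123)
s <- 4; r <- 3; n_k <- 500
u     <- runif(r, 0.05, 0.95)                  # uniform draws as in the paper
alpha <- 0.9                                   # illustrative true mean effect
beta  <- lapply(1:s, function(k) u / sum(u))   # (A3): each beta_k sums to 1
z     <- lapply(1:s, function(k) runif(r, 0.5, 1.5))  # illustrative ratios
x     <- lapply(1:s, function(k)
           simulate_site(n_k, alpha, beta[[k]], z[[k]]))  # (A1)-(A2)
```
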

The algorithms are compared in terms of the mean squared error MSE(θ̂, θ) between the estimate θ̂ and the true parameter vector θ.
To avoid overloading the tables, the estimate θ̂ has been included for Scenario 1 only (see Table 1).
In Tables 1-5, we can notice that the estimated values, the standard deviations, the log-likelihoods, and the MSE are globally the same for all algorithms except BFGS when n = 5000. Regarding the convergence proportions, our proposed MM algorithm and its accelerated version SqS3 always have a 100% convergence proportion, while the convergence proportion of NR is close to but strictly lower than 100%. The convergence proportion of BFGS is far lower, dropping to 4.1% in some settings. For all the algorithms, we notice a decrease of the MSE when n increases from 50 to 5000, which suggests a good fit between the true and estimated values as n increases. When n = 5000, the MSE of BFGS is greater than those of the other algorithms, which suggests a convergence of BFGS to bad values.
As far as the CPU times are concerned, we notice that, except for Scenario 1, the computation time of the NR algorithm is greater than that of the MM algorithm (more than 2 times for n = 50 and more than 1.8 times for n = 5000 in Table 2, more than 3 times in Table 3, more than 6 times in Table 4, and more than 19 times in Table 5). The proposed MM algorithm is 23 to 67 times faster than BFGS.

Conclusion
In this paper, we built a minorization-maximization (MM) algorithm to compute, under box constraints and linear equality constraints, the maximum likelihood estimates of the parameters of a multivariate statistical model used in the analysis of accident frequencies. This statistical model is composed of several multinomial distributions whose parameters are dependent. Since MM algorithms are generally considered slow to converge, we have also proposed an accelerated version of our MM algorithm using a squared iterative acceleration scheme developed in [16]. Using simulated data, we have shown that our proposed MM algorithm and its accelerated version outperform the Newton-Raphson (NR) and quasi-Newton BFGS algorithms in terms of convergence proportion and computation time.

Figure 1
Figure 1 gives an example of the representation of L(θ) and g(θ|θ^(m)) where s = 1, r = 2, and θ = (α, β_11, β_21)^T. Since β_21 = 1 − β_11, L(θ) and g(θ|θ^(m)) are considered as functions of the two variables α > 0 and β_11 with 0 < β_11 < 1.

Remark 3.
The updates in Formulas (19) and (20) are ideal but impossible to apply in practice since (a) the computation of α^(m+1) depends on the unknown β_jk^(m+1) and vice versa and (b) in Equation (20), the computation of each β_jk^(m+1) depends on β_k^(m+1) and vice versa. To circumvent these difficulties, we can replace β_k^(m+1) by β_k^(m) on the right side of Equations (19) and (20).