Maximum Likelihood Estimation of Nonnegative Trigonometric Sum Models Using a Newton-like Algorithm on Manifolds

In Fernández-Durán (2004), a new family of circular distributions based on nonnegative trigonometric sums (NNTS models) is developed. Because the parameter space of this family is the surface of a hypersphere, an efficient Newton-like algorithm on manifolds is developed to obtain the maximum likelihood estimates of the parameters.


Introduction
The probability density function, f(θ; c), of a circular random variable Θ ∈ (0, 2π] must be nonnegative and periodic (f(θ + 2kπ; c) = f(θ; c) for any integer k), where c is the vector of parameters. Practical examples of circular random variables include the wind directions at different monitoring stations, the directions taken by an animal, the times at which a person conducts a daily activity, the times of occurrence of different events, and many others. Based on the results of Fejér (1915), Fernández-Durán (2004) derived a family of circular distributions based on nonnegative trigonometric sums (see also Fernández-Durán, 2007). In short, the nonnegative trigonometric sum is expressed as the squared modulus of a complex sum. The circular density function based on nonnegative trigonometric sums (NNTS density) is expressed as

f(θ; c) = |Σ_{k=0}^{M} c_k e^{ikθ}|² = Σ_{k=0}^{M} Σ_{m=0}^{M} c_k c̄_m e^{i(k−m)θ}.

Note that i = √−1, c_k = c_{rk} + i c_{ck} are complex numbers for k = 0, . . . , M, and c̄_k = c_{rk} − i c_{ck} is the conjugate of c_k. For the density to integrate to one, it is necessary to impose the following constraint on the c parameters:

Σ_{k=0}^{M} |c_k|² = 1/(2π).
Note that c_{c0} = 0 and c_{r0} ≥ 0; i.e., c_0 is a nonnegative real number. Thus, the c parameter space corresponds to the surface of a (2M+1)-dimensional hypersphere. This family of circular distributions has the advantage of being able to fit datasets that present multimodality and/or skewness because the density function can be expressed as a mixture of multimodal circular distributions. The total number of free c parameters is equal to 2M.
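As a concrete check of these definitions, the following sketch (with a hypothetical M = 2 coefficient vector scaled onto the hypersphere; all names are illustrative) evaluates the NNTS density and verifies numerically that it integrates to one:

```python
import numpy as np

def nnts_density(theta, c):
    """NNTS density f(theta; c) = |sum_{k=0}^M c_k e^{i k theta}|^2.

    The complex vector c = (c_0, ..., c_M) is assumed to satisfy
    sum_k |c_k|^2 = 1/(2*pi), so the density integrates to one.
    """
    k = np.arange(len(c))
    s = np.exp(1j * np.outer(np.atleast_1d(theta), k)) @ c
    return np.abs(s) ** 2

# Hypothetical M = 2 coefficients projected onto the hypersphere
c = np.array([1.0 + 0.0j, 0.4 - 0.3j, 0.2 + 0.1j])
c /= np.linalg.norm(c) * np.sqrt(2.0 * np.pi)   # enforce the constraint

# Uniform Riemann sum over one period (exact for trigonometric polynomials)
theta = np.linspace(0.0, 2.0 * np.pi, 4096, endpoint=False)
integral = nnts_density(theta, c).mean() * 2.0 * np.pi
print(round(integral, 8))  # -> 1.0
```

Because the density is a trigonometric polynomial, the uniform Riemann sum over one full period recovers the integral essentially exactly.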
The main objective of this paper is to develop an efficient Newton-like optimization algorithm on the surface of a hypersphere, which is a Riemannian manifold, in order to obtain the maximum likelihood estimates of the c parameters.
The paper is divided into five sections, including this introduction. The second section presents a convenient, alternative way to express likelihood functions for continuous and grouped data in the univariate case. Given these convenient expressions for likelihood functions, in the third section an efficient Newton-like algorithm is developed for maximizing the log-likelihood function on the surface of the hypersphere. The proposed algorithm is a particular case of a Newton-like algorithm for scalar functions on Stiefel manifolds (Absil et al., 2008; Manton, 2002; Balogh, 2004). In the fourth section, some applications of the proposed algorithm to real continuous and grouped datasets are presented. Finally, the conclusions of the present work are presented in the fifth section.

Continuous Data
Let θ_1, θ_2, . . . , θ_n be a random sample of univariate circular random variables from a population with density function f(θ; M, c), which is a member of the NNTS family with parameters c and M. The density function of θ_j is given by

f(θ_j; M, c) = |Σ_{k=0}^{M} c_k e^{ikθ_j}|²,

which can be written in the following quadratic form:

f(θ_j; M, c) = c^H e_j e_j^H c.

Note that c = (c_0, c_1, . . . , c_M)^T, e_j = (1, e^{−iθ_j}, e^{−2iθ_j}, e^{−3iθ_j}, . . . , e^{−Miθ_j})^T, c^H indicates the Hermitian transpose of the vector c (the transpose of the conjugate of the vector of complex numbers c), and c^T is the transpose of c. Then, the likelihood for a random sample θ_1, . . . , θ_n, denoted by L(M, c | θ_1, . . . , θ_n), is calculated as

L(M, c | θ_1, . . . , θ_n) = Π_{j=1}^{n} c^H e_j e_j^H c,

and the corresponding log-likelihood function is

l(M, c) = Σ_{j=1}^{n} ln(c^H e_j e_j^H c).
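The log-likelihood can be evaluated directly through the quadratic forms c^H e_j e_j^H c = |e_j^H c|². A minimal sketch, with a hypothetical M = 1 parameter vector on the hypersphere (all names are illustrative):

```python
import numpy as np

def nnts_loglik(theta, c):
    """Log-likelihood l(M, c) = sum_j ln(c^H e_j e_j^H c).

    With e_j = (1, e^{-i theta_j}, ..., e^{-i M theta_j})^T, the quadratic
    form c^H e_j e_j^H c equals |e_j^H c|^2 = f(theta_j; M, c).
    """
    k = np.arange(len(c))
    g = np.exp(1j * np.outer(theta, k)) @ c   # g_j = e_j^H c
    return float(np.sum(np.log(np.abs(g) ** 2)))

# Hypothetical M = 1 parameter vector scaled onto the hypersphere
c = np.array([1.0 + 0.0j, 0.3 + 0.4j])
c /= np.linalg.norm(c) * np.sqrt(2.0 * np.pi)

theta = np.array([0.5, 1.2, 4.0])
print(nnts_loglik(theta, c))
```

Evaluating |e_j^H c|² for all observations at once through a single matrix product avoids forming the rank-one matrices e_j e_j^H explicitly.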

Grouped Data
Let (a_j, b_j] for j = 1, . . . , Q be a partition of the interval (0, 2π], i.e., (0, 2π] = ∪_{j=1}^{Q} (a_j, b_j] with (a_j, b_j] ∩ (a_k, b_k] = ∅ for j ≠ k, and let N_k be the total number of observations in the interval (a_k, b_k] for k = 1, . . . , Q. This type of data is called grouped or incidence data. The likelihood function is

L(M, c | N_1, . . . , N_Q) = Π_{k=1}^{Q} (F(b_k; M, c) − F(a_k; M, c))^{N_k},

where F(b_k; M, c) is the cumulative distribution function of the NNTS density at b_k. Note that the cumulative distribution function is obtained by integrating the NNTS density term by term:

F(θ; M, c) = ∫_0^θ f(u; M, c) du = θ Σ_{k=0}^{M} |c_k|² + Σ_{k≠m} c_k c̄_m (e^{i(k−m)θ} − 1)/(i(k − m)).
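The term-by-term integral of the NNTS density has a closed form, since each off-diagonal exponential integrates explicitly. The sketch below (with hypothetical coefficients and counts; all names are illustrative) evaluates this closed-form distribution function and the resulting grouped-data log-likelihood:

```python
import numpy as np

def nnts_cdf(theta, c):
    """Cumulative distribution function of an NNTS density at theta.

    Integrates f(u; c) = sum_k sum_m c_k conj(c_m) e^{i(k-m)u}
    term by term from 0 to theta.
    """
    M = len(c) - 1
    F = theta * np.sum(np.abs(c) ** 2)            # diagonal (k = m) terms
    for k in range(M + 1):
        for m in range(M + 1):
            if k != m:
                d = k - m
                F = F + (c[k] * np.conj(c[m])
                         * (np.exp(1j * d * theta) - 1.0) / (1j * d))
    return float(np.real(F))

# Hypothetical M = 2 parameter vector satisfying the norm constraint
c = np.array([1.0 + 0.0j, 0.5 + 0.5j, 0.0 + 0.25j])
c /= np.linalg.norm(c) * np.sqrt(2.0 * np.pi)

# Grouped data: Q = 4 equal intervals with hypothetical counts N_k
edges = np.linspace(0.0, 2.0 * np.pi, 5)
counts = np.array([10, 25, 40, 25])
probs = np.diff([nnts_cdf(b, c) for b in edges])
loglik = float(np.sum(counts * np.log(probs)))
print(round(nnts_cdf(2.0 * np.pi, c), 8))  # -> 1.0 (total probability)
```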

The Newton-like Algorithm
Because the c parameter space corresponds to the surface of the hypersphere, the maximum likelihood estimates can be obtained by applying a Newton-like optimization algorithm on manifolds (Absil et al., 2008). Basically, a smooth manifold is a surface that can be approximated locally by a hyperplane. For a point on the manifold, the approximating hyperplane is known as the tangent space. A real function on a manifold can then be maximized by searching for optima along directions of movement in the tangent space and reprojecting onto the manifold. In differential geometry, the reprojection operation is called a retraction. In this paper, the optimization problem of obtaining the maximum likelihood estimates is equivalent to maximizing a real function (i.e., the log-likelihood) on a manifold (that is, the surface of the hypersphere). The goal of the Newton-like algorithm on manifolds is to obtain the solutions of grad l(c) = 0, where grad l(c) represents the gradient of the log-likelihood function l at the point c. The algorithm consists of the following steps:
1. Select an initial point c_0 on the manifold.
2. For k = 1, 2, . . ., solve the Newton equation, Hess l(c_k)[η_k] = −grad l(c_k), for the unknown η_k in the tangent space at c_k.
3. Set c_{k+1} = R_{c_k}(η_k), where R_{c_k} is the retraction from the tangent space onto the manifold at c_k.
The algorithm terminates when the norm of the gradient or the norm of the difference grad l(c_k) − grad l(c_{k−1}) is less than a prespecified error. For the case considered in this paper, we use differentiation rules for real functions of a complex vector to derive

grad l(c) = P_c Σ_{j=1}^{n} (e_j e_j^H c)/(c^H e_j e_j^H c),

where P_c is the projection onto the tangent space. For the case of the hypersphere, P_c = (1/(2π)) I − c c^H can be used. Note that the expected value of the gradient is equal to zero, and the Hessian matrix is obtained by differentiating the gradient once more. Fisher's information matrix, ı, which corresponds to the negative of the expected value of the Hessian, is equal to ı = −E(Hess l(c)) = n P_c.
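The projection P_c = (1/(2π)) I − c c^H can be sanity-checked numerically: for a point c with ||c||² = 1/(2π) it annihilates c itself, and it maps an arbitrary vector into the tangent space orthogonal to c. A small sketch (the dimension and random vectors are illustrative):

```python
import numpy as np

# Numerical check of the tangent-space projection P_c = (1/(2*pi)) I - c c^H
# for a point c on the hypersphere with ||c||^2 = 1/(2*pi).
rng = np.random.default_rng(3)
c = rng.normal(size=4) + 1j * rng.normal(size=4)
c /= np.linalg.norm(c) * np.sqrt(2.0 * np.pi)     # ||c||^2 = 1/(2*pi)

P = np.eye(4) / (2.0 * np.pi) - np.outer(c, np.conj(c))
v = rng.normal(size=4) + 1j * rng.normal(size=4)  # arbitrary direction

print(np.linalg.norm(P @ c))           # ~ 0: P_c annihilates c itself
print(abs(np.conj(c) @ (P @ v)))       # ~ 0: P_c v lies in the tangent space
```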
Instead of using the Hessian in the Newton algorithm, we prefer to use Fisher's information matrix, in the same way as in Fisher's method of scoring (Shao, 2003; Lange, 2004; Thisted, 1988). The modified Newton algorithm consists of the following steps:
1. Select an initial point c_0.
2. For k = 1, 2, . . ., solve Fisher's scoring equation, ı η_k = grad l(c_k). Because ı η_k = n P_c η_k = n η_k for η_k in the tangent space, Fisher's scoring equation has the following solution for η_k:

η_k = (1/n) grad l(c_k).

3. Set c_{k+1} = R_{c_k}(η_k), where R_{c_k} is a retraction from the tangent space onto the manifold at c_k. In particular, we use the retraction that rescales c_k + η_k back onto the hypersphere,

R_{c_k}(η_k) = (1/√(2π)) (c_k + η_k)/||c_k + η_k||.

We terminate the algorithm when the difference ||c_{k+1} − c_k|| is less than a prespecified error.
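The scoring iteration can be sketched end to end. For simplicity, the sketch works with unit-norm vectors c, which absorbs the 1/(2π) factor into the density so that the projection becomes P_c = I − c c^H and the retraction is plain renormalization; it is an illustrative reconstruction under these assumptions, not the authors' exact implementation. A final global phase rotation makes c_0 real and nonnegative, which is harmless because the density is invariant under a common phase of c.

```python
import numpy as np

def fisher_scoring_nnts(theta, M, tol=1e-10, max_iter=500):
    """Sketch of Fisher scoring on the hypersphere for NNTS models.

    Works with a unit-norm c (the 1/(2*pi) factor is absorbed into the
    density), so the projection is P_c = I - c c^H, the scoring step is
    eta = grad l(c)/n, and the retraction renormalizes c + eta.
    """
    n = len(theta)
    E = np.exp(1j * np.outer(theta, np.arange(M + 1)))   # row j is conj(e_j)
    c = np.conj(E).mean(axis=0)                          # initial point: normalized
    c /= np.linalg.norm(c)                               # average of the e_j statistics
    for _ in range(max_iter):
        g = E @ c                                        # g_j = e_j^H c
        grad = (np.conj(E) / np.conj(g)[:, None]).sum(axis=0)  # sum_j e_j g_j/|g_j|^2
        grad -= c * (np.conj(c) @ grad)                  # project onto tangent space
        c_new = c + grad / n                             # scoring step eta = grad/n
        c_new /= np.linalg.norm(c_new)                   # retraction onto the sphere
        if np.linalg.norm(c_new - c) < tol:
            c = c_new
            break
        c = c_new
    return c * np.exp(-1j * np.angle(c[0]))              # make c_0 real, nonnegative

rng = np.random.default_rng(7)
theta = rng.uniform(0.0, 2.0 * np.pi, size=200)
c_hat = fisher_scoring_nnts(theta, M=2)
print(np.round(np.abs(c_hat), 3))
```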
The modifications required to apply the proposed algorithm to grouped data are straightforward because the log-likelihood function to be maximized has the same basic form as the one treated above.
For the practical application of the algorithm, it is possible to work with vectors c of unit norm in order to avoid the correction terms related to the factor 1/(2π); this is equivalent to maximizing a modified likelihood obtained by multiplying the original likelihood by a constant factor. For the initial point c_0, one can use a random point or the normalized average of the e_j statistics. In our experience, using the normalized average of the e_j statistics as the initial point has worked very well for different datasets.
The proposed algorithm has been compared with sequential quadratic programming (SQP) and the Nelder-Mead optimization algorithm on many different datasets; the proposed algorithm shows much faster convergence for many different random initial points and, in contrast to SQP and the Nelder-Mead algorithm, usually converges to the same point. The optimality and convergence properties of Newton algorithms on manifolds are presented in Absil et al. (2008), Manton (2002) and Balogh (2004). Of course, as in any iterative optimization algorithm, it is important to run the algorithm many times from different random initial points to try to find the global maximum of the log-likelihood function.

Examples
The first example refers to univariate continuous circular data and consists of the directions taken by 76 turtles after treatment. This dataset is taken from Fisher (1993, p. 241), who in turn took it from Stephens (1969). The second example consists of the accumulated monthly number of deaths by suicide in England and Wales during the period 1982-1996, as an example of grouped data (Yip et al., 2000).

Figure 1 presents the raw data histogram for the turtle data and the best-fitting NNTS models. Table 1 presents the values of the log-likelihood, Akaike's Information Criterion (AIC), and the Bayesian Information Criterion (BIC) for NNTS models with M = 0, 1, . . . , 10. This dataset was analyzed previously by Fernández-Durán (2004) using SQP to obtain the maximum likelihood estimates. In contrast to SQP and the Nelder-Mead optimization method, the proposed Newton-like algorithm presents much faster and more stable convergence.

Table 2 presents the raw monthly suicide data by sex taken from Yip et al. (2000). For grouped data, instead of using the AIC or BIC criteria to select the best among the considered models with M = 0, 1, . . . , 6, we apply likelihood ratio tests that compare the maximized log-likelihood values of the models with M = 0, 1, . . . , 5, l_M, with the maximized log-likelihood of the saturated model, l_S, which corresponds, in this case, to an NNTS model with M = 6. In this strategy for model selection, −2(l_M − l_S) is asymptotically distributed as a chi-squared random variable with 12 − 2M degrees of freedom. The most parsimonious model was selected. Table 3 presents the values of the log-likelihood and likelihood ratio test statistics for NNTS models with M = 0, 1, . . . , 5.

Table 3: Suicide data: Log-likelihood and likelihood ratio test statistic values for the NNTS models fitted by the proposed Newton-like algorithm on the surface of a hypersphere.
An asterisk (*) marks the most parsimonious models according to likelihood ratio tests using a 1% significance level.

Conclusions
A Newton-like algorithm on manifolds is developed to obtain the maximum likelihood estimates of the parameters of the NNTS family of distributions. Because the parameter space corresponds to the surface of a hypersphere, other optimization methods such as sequential quadratic programming (SQP) and the Nelder-Mead algorithm must handle the norm constraint explicitly, which makes them very slow. By working with optimization algorithms on manifolds, it is possible to avoid the use of constraints, making the algorithm proposed in this paper much faster and more efficient than these other methods. This is possible because the likelihood function of NNTS models can be conveniently expressed in terms of quadratic forms of the relevant parameters. The usefulness of the proposed Newton algorithm has been demonstrated on several datasets consisting of continuous and grouped observations.