movMF: An R Package for Fitting Mixtures of von Mises-Fisher Distributions

This article is a (slightly) modiﬁed and shortened version of Hornik and Gr¨un (2014b), published in the Journal of Statistical Software . Finite mixtures of von Mises-Fisher distributions allow to apply model-based clustering methods to data which is of standardized length, i.e., all data points lie on the unit sphere. The R package movMF contains functionality to draw samples from ﬁnite mixtures of von Mises-Fisher distributions and to ﬁt these models using the expectation-maximization al-gorithm for maximum likelihood estimation. Special features are the possibility to use sparse matrix representations for the input data, diﬀerent variants of the expectation-maximization algorithm, diﬀerent methods for determining the concentration parameters in the M-step and to impose constraints on the concentration parameters over the components. Inthis paper we describe the main ﬁtting function of the package and illustrate its application. We also discuss he resolution of several numerical issues which occur for estimating the concentration parameters and for determining the normalizing constant of the von Mises-Fisher distribution.


Introduction
Finite mixture models allow to cluster observations by assuming that for each component a suitable parametric distribution can be specified and that the mixture distribution is derived by convex combination of the component distributions.McLachlan and Peel (2000) and Frühwirth-Schnatter (2006) present overviews on the estimation of these models in a maximum likelihood as well as a Bayesian setting together with different applications of finite mixture models.
If the support of the data is given by the unit sphere, a natural choice for the component distributions is the von Mises-Fisher (vMF) distribution.The special case where data is in R 2 , i.e., the observations lie on the unit circle, is referred to as von Mises distribution.The vMF distribution has concentric contour lines similar to the multivariate normal distribution with the variance-covariance matrix being a multiple of the identity matrix.In fact Mardia and Jupp (1999, page 173) point out that if X follows a multivariate normal distribution with mean parameter µ which has length one and variance-covariance matrix κ −1 I, then the conditional distribution of X given that it has length one is a vMF distribution with mean

The vMF distribution
A random unit length vector in R d has a von Mises-Fisher (or Langevin) distribution with parameter θ ∈ R d if its density with respect to the uniform distribution on the unit sphere S d−1 = {x ∈ R d : ∥x∥ = 1} is given by f (x|θ) = e θ ⊤ x / 0 F 1 (; d/2; ∥θ∥ 2 /4), where is the confluent hypergeometric limit function and related to the modified Bessel function of the first kind I ν via (e.g., Mardia and Jupp 1999, page 168).

Simulating vMF distributions
The following algorithm provides a rejecting sampling scheme for drawing a sample from the vMF distribution with modal direction (0, . . ., 0, 1) and concentration parameter κ ≥ 0 (see Algorithm VM * in Wood 1994).The extension for simulating from the matrix Bingham vMF distribution is described in Hoff (2009) and also available in R through package rstiefel (Hoff 2012).
Step 4. Generate a uniform (d − 1)-dimensional unit vector V and return The uniform (d − 1)-dimensional unit vector V can be generated by simulating independent standard normal random variables and normalizing them (see for example Ulrich 1984).To get samples from a vMF distribution with arbitrary mean direction parameter µ, X is multiplied from the left with a matrix where the first (d − 1) columns consist of unitary basis vectors of the subspace orthogonal to µ and the last column is equal to µ.

Estimating the parameters of the vMF distribution
Using the common parametrization by κ and µ, the log-likelihood of a sample x 1 , . . ., x n from the vMF distribution is given by where r = n i=1 x i is the resultant vector (sum) of the x i .The maximum likelihood estimates (MLEs) are obtained by solving the likelihood equations for the logarithmic derivative of 1/C d (κ) and ρ = ∥r∥/n for the average resultant length, the equation for the MLE of κ becomes Using recursions for the modified Bessel function (e.g., Watson 1995, page 71), one can establish that .
It can be shown (see for example Schou 1978) that A d is a strictly increasing function which maps the interval [0, ∞) onto the interval [0, 1) and satisfies the Riccati equation As A d and hence its derivatives can efficiently be computed using continued fractions (see Section 5 for details), the equation A d (κ) = ρ can efficiently be solved by standard iterative techniques provided that good starting approximations are employed.Dhillon and Sra (2003) and subsequently Banerjee et al. (2005) suggest the approximation obtained by truncating the Gauss continued fraction representation of A d and adding a correction term "determined empirically".The former reference also suggests using this approximation as the starting point of a Newton-Raphson iteration, using the above expression for Tanabe, Fukumizu, Oba, Takenouchi, and Ishii (2007) show that for some suitable 0 ≤ c ≤ 2. The approximation in Equation 2 corresponds to c ≈ ρ 2 .They also suggest to determine κ via the fixed point iteration with a starting value in the solution range, e.g., using c = 1 or c = ρ 2 .Sra (2012) introduces a "truncated Newton approximation" based on performing exactly two Newton iterations with κ 0 determined via the c = ρ 2 approximation.Song, Liu, and Wang (2012) suggest to use a "truncated Halley approximation" based on performing exactly two Halley iterations with κ 0 determined via the c = ρ 2 approximation and using that the second derivative can be given as a function of A d (κ t ), i.e., The results in Hornik and Grün (2014a) yield the substantially improved bounds valid for 0 ≤ ρ < 1, where The difference between the upper and lower bound is at most 3ρ/2 for all 0 ≤ ρ < 1, and the difference between the lower bound and κ tends to 0 as ρ → 1−.
Convex combinations of the lower and the upper bounds can be employed as starting values for the above iteration schemes (which require a single starting point).In addition, as these bounds actually give an interval known to contain the unique root κ of the function κ → A d (κ)−ρ, one can use them as starting values for root finding methods which iteratively refine intervals containing the solution, such as simple bisection (as provided by uniroot() in R), hybrid algorithms combining derivative-based (Newton or Halley) and bisection steps (e.g.Press, Teukolsky, Vetterling, and Flannery 2002, page 366), or the Newton-Fourier method (e.g.Atkinson 1989, pages 62-64).One can show that A d is concave (e.g., using Theorem 11 in Hornik and Grün 2013, which establishes that A d = R d/2−1 is the pointwise minimum of concave functions, and hence concave): hence, employing the above bounds and a variant of the Newton-Fourier method for strictly increasing concave functions yields a quadratically convergent iterative scheme for determining κ.

Illustrative example: Household expenses
To illustrate the use of the vMF distribution to model data on the sphere we use the household data set from package HSAUR3 (Everitt and Hothorn 2018).The data are part of a data set collected from a survey on household expenditures and give the expenses of 20 single men and 20 single women on four commodity groups.In the following we will focus only on three of those commodity groups (housing, food and service) to have 3-dimensional data which is easier to visualize.The data points are projected onto the sphere by normalizing them to have length one.Thus, in the following analysis we are interested in finding groups of households which lie in a similar direction, i.e., the angle between the observations is small and we are not interested in differences in their total absolute expenses.The data points on the sphere are visualized in Figure 1 on the top left.Using the gender information, vMF distributions are fitted to the male and the female observations separately.The fitted distributions are visualized together with the mean direction indicated by a cross and with confidence circles of probability 50% (full lines) and 95% (dashed lines).Clearly the females have a smaller dispersion as indicated by the estimated κ which is equal to 96.4, as compared to the κ of the males which is given by 20.3.

Finite mixtures of vMF distributions
The mixture model with K components is given by where h(•|•) is the mixture density, Θ the vector with all α and θ parameters, and f (y|θ k ) the density of the vMF distribution with parameter θ k .Furthermore, the component weights α k are positive for all k and sum to one.

Simulating mixtures of vMF distributions
This is straightforwardly achieved by first sampling class ids z ∈ {1, . . ., K} with the mixture class probabilities α 1 , . . ., α K , and then sampling the data from the respective vMF distributions with parameter θ k .

Estimating the parameters of mixtures of vMF distributions
EM algorithms for ML estimation of the parameters of mixtures of vMF distributions are given in Dhillon and Sra (2003) and Banerjee et al. (2005).The EM algorithm exploits the fact that the complete-data log-likelihood where the component memberships of the observations are known is easier to maximize than the observed-data log-likelihood.
The EM algorithm for fitting mixtures of vMF distributions consists of the following steps: 1. Initialization: Either of the following two: (b) Assign (probabilities of) component memberships to each of the n observations.E.g., the output from spherical k-means (Hornik, Feinerer, Kober, and Buchta 2012) can be used.
Start the EM algorithm with an M-step.
2. Repeat the following steps until the maximum number of iterations is reached or the convergence criterion is met.

E-step:
Because the complete-data log-likelihood is linear in the missing data which correspond to the component memberships, the E-step only consists of calculating the a-posteriori probabilities, the probabilities of belonging to a component conditional on the observed values, using

M-step:
Maximize the expected complete-data log-likelihood by determining separately for each k: .
κk can be determined via the approximation of Equation 2, or one of the improved methods discussed in Section 2.2.
Convergence check: Assess convergence by checking (either or both) (a) if the relative absolute change in the log-likelihood values is smaller than a threshold ϵ 1 ; (b) if the relative absolute change in parameters is smaller than a threshold ϵ 2 .
If converged, stop the algorithm.
This corresponds to the soft-movMF algorithm on page 1357 in Banerjee et al. (2005).In addition they propose the hard-movMF algorithm on page 1358.The algorithm above can be modified to the hard-movMF algorithm by adding a hardening step between E-and M-step: H-step: Replace the a-posteriori probabilities by assigning each observation with probability 1 to one of the components where its a-posteriori probability is maximum.Assuming the maximum is unique, this corresponds to If the maximum is not unique, assignment is randomly with equal probability to one of the k ∈ argmax h p(h|x i , Θ), i.e., ties are broken at random.
This algorithm is also referred to as classification EM algorithm in the literature (Celeux and Govaert 1992).A further variant of the EM algorithm also considered for example in Celeux and Govaert (1992) would be the stochastic EM where instead of an hardening step a stochastic step is added between E-and M-step:

S-step:
Assign at random each observation to one component with probability equal to its a-posteriori probability.
The algorithm above determines the parameter estimates if the concentration parameters are allowed to vary freely over components.An alternative model specification could be to impose the restriction that the concentration parameters are the same for all components.This has the advantage that the clusters will be of comparable compactness and that spurious solutions containing small components with very large concentration parameters are eliminated.In the following we derive how the M-step needs to be modified if the concentration parameters are restricted to be the same over components.
From Appendix A.2 of Banerjee et al. (2005) the optimal unconstrained κ k can be obtained by solving .
If the κ k are constrained to be equal (but are not given), the optimal common κ can be obtained as follows.Using Equation A.12 in Banerjee et al. (2005), the modified Lagrangian becomes Setting partials with respect to µ k and λ k to zero as in the reference gives (again) and with these µ k we obtain for κ that i.e., κ needs to solve the equation

Illustrative example: Household expenses
In the following the gender information is not used and it is investigated if finite mixtures allow to unravel a distinction between male and female respondents in their household expenses.Assuming that it is known that there are two underlying unobserved groups, a mixture with two components is fitted.The results are visualized in Figure 1 at the bottom left.The colors are according to assignments to components using the maximum a-posteriori probabilities.
These assignments lead to one misclassification, i.e., one female is assigned to the component with the higher dispersion.The estimated parameters and the BIC value of the model are given in Table 1.
If the number of components is assumed to be unknown, the Bayesian information criterion (BIC) can be used to select a suitable number of components (see for example McLachlan and Peel 2000, Chapter 6).In this case the minimum BIC is obtained for three components if models in the range of K = 1, . . ., 5 are considered.The results of the mixture with K = 3 components are visualized in Figure 1 at the bottom right and the estimated parameters and the BIC value are given in Table 1.In this case the male respondents are split into two groups with less dispersion each and different mean directions.The R code for reproducing these results is provided in Section 4.3 after introducing package movMF.

Main fitting function movMF()
The main function in package movMF for fitting mixtures of vMF distributions is movMF(), with synopsis R> movMF(x, k, control = list(), ...) The arguments for this function are as follows.
x: A numeric data matrix, with rows corresponding to observations.If necessary the data is standardized to unit row lengths.Furthermore, the matrix can be either stored as a dense matrix, a simple triplet matrix (defined in package slam), or a general sparse triplet matrix of class 'dgtMatrix' (from package Matrix).
k: An integer indicating the number of components.
control: A list of control parameters consisting of E: Specifies the variant of the EM algorithm used with possible values "softmax" (default), "hardmax" (classification EM), and "stochmax" (stochastic EM).
kappa: This argument allows to specify how to determine the concentration parameters.
• If numbers are given, the concentration parameters are assumed to be fixed and are not estimated in the EM algorithm.
• For common concentration parameters a list with elements common = TRUE and a character string giving the estimation method needs to be provided.verbose: Logical indicating if information on the progress of the fitting process shall be printed during the estimation.
ids: Indicates either the class memberships of the observations or if equal to TRUE the class memberships are obtained from the attributes of the data.In this way the class memberships are for example stored if data is generated using function rmovMF().If this argument is specified, the EM algorithm is stopped after one iteration, i.e., the parameter estimates are determined conditional on the known true class memberships.
start: Allows to specify the starting values used for initializing the EM algorithm which then starts with an M-step.It can either be a list of matrices where each matrix contains the a-posteriori probabilities of the observations or a list of vectors containing component assignments for the observations.Alternatively it can be a character vector with entries "i", "p", "S" or "s".The length of the vector specifies how many different initializations are made."i" indicates to randomly assign component memberships to the observations.The latter three draw observations as prototypes and determine a-posteriori probabilities by taking the implied cosine dissimilarities between observations and prototypes."p" randomly picks observations as prototypes, "S" takes the first prototype to minimize the total cosine dissimilarity to the observations, and then successively picks observations farthest away from the already picked prototypes.For "s" one takes a randomly chosen observation as the first prototype, and then proceeds as for "S".For more details on these initialization methods see package skmeans (Hornik et al. 2012) which uses the same initialization schemes.
nruns: An integer indicating the number of repeated runs of the EM algorithm with random initialization.This argument is ignored if either ids or start are specified.
minalpha: Components with size below the threshold indicated by minalpha are omitted during the estimation with the EM algorithm.This avoids estimation problems in the M-step if only very few observations are assigned to one component.The disadvantage is that the initial number of components is not necessarily equal to the number of components of the returned model.

Additional functionality in movMF
The object returned by movMF() has an S3 class called 'movMF' with methods print(), coef(), logLik() and predict() (yields either the component assignments or the matrix of aposteriori probabilities).Additional functionality available in the package includes rmovMF() for drawing from a mixture of vMF distributions and dmovMF() for evaluating the mixture density.

Illustrative example: Household expenses
In the following the code for reproducing the results presented in Section 3.3 is provided.
After loading the dataset the three columns of interest are selected from the expenses and stored in variable x.Then the classification variable gender is also extracted.First movMF() is used to fit a single vMF distribution to the two separate sub-samples and then mixtures are fitted with number of components varying from 1 to 5. To avoid reporting sub-optimal solutions where the EM algorithm was trapped in a local optimum, the best result from 20 random initializations is returned.

Numerical issues
In what follows it will be convenient to write As shown in Section 2, computing log-likelihoods for (mixtures of) vMF distributions on R d requires the computation of log(H d/2−1 ), and ML estimation of concentration parameters amounts to solving equations of the form A d (κ) = ρ, where is the logarithmic derivative of H d/2−1 .
As H ν (κ) → ∞ as κ → ∞ (in fact, quite rapidly, see below), it clearly is a bad idea to try to compute log(H) as the logarithm of H, or A as the ratio of H functions.Similar considerations apply for using logarithms or ratios of incomplete modified Bessel functions I.One might wonder whether one could successfully take advantage of the fact that the Bessel functions provided by R are based on the SPECFUN package of Cody (1993) and hence also provide the exponentially scaled modified Bessel function e −κ I ν (κ) (the scaling is motivated by the asymptotic expansion I ν (κ) ∼ e κ (2πκ) −1/2 m a m (ν)/κ m for κ → ∞).However, exponential scaling does not help in situations where 1 ≪ κ ≪ ν: e.g., for κ = 6000 and ν = 10000, H overflows whereas I underflows (even though R gives Inf and 0 for the cases without and with exponential scaling, respectively).

Computing A d
One can use the Gauss continued fraction for the ratio of modified Bessel functions (https://dlmf.nist.gov/10.33.E1) to compute A as Banerjee et al. 2005), using, e.g., Steed's method (e.g., https://dlmf.nist.gov/3.10)for evaluation.However, as pointed out by Gautschi and Slavik (1978) and Tretter and Walster (1980) (and, quite recently, re-iterated by Song et al. 2012), the Perron continued fraction I ν (z) is numerically more stable (as computing it by forward recursion only accumulates positive terms, whereas for the Gauss continued fraction the terms alternate in sign), and converges substantially faster for positive z ≫ ν.Hence, by default we compute A via the Perron continued fraction (with implementation based on Equation 3.3 ′ in Gautschi and Slavik 1978), and additionally provide computation via the Gauss continued fraction (with implementation based on Equation 3.2 ′ in Gautschi and Slavik 1978) and using exponentiation of log(H) differences as alternatives.
For κ close to zero it is better (and necessary for κ = 0) to use the approximation 1978, Equation 5).The O(κ 5 ) can be made more precise by using the series representation from which the coefficient of κ 5 can straightforwardly be determined as From this approximations for A ′ and A ′′ can be obtained by term-wise differentiation and also used for κ close to 0.
As log(H ν ) ′ = R ν (and H ν (0) = 1), integration gives Thus, the bounds for the Bessel function ratio R ν can be used to obtain bounds for H ν .Writing it is easily verified that S ′ α,β = G α,β and S α,β (0) = 0. Using the Amos-type bounds from Equation 4, we thus obtain that for ν ≥ 0 and κ ≥ 0, (5) Where a single approximating value is sought, we prefer to use the upper bound which is based on the combination of upper Amos-type bounds which are second order exact at zero and infinity.Again, the bounds in Equation 5 are surprisingly tight.

Result.
Let Proof.For simplicity, write α = ν + 1/2, β L = ν + 3/2 and β U = β SS (ν), omitting the dependence on ν.We have is non-decreasing in κ, and attains its supremum for κ → ∞.Now, As The value of s( 0) is obtained by insertion, and lim ν→∞ s(ν) can be obtained as follows.We have as ν → ∞, and To show that s is non-increasing, note that we have and hence For non-negative t and τ , establishing that s is non-increasing.
and hence e −Lν (κ)/2 H ν (κ) does not over-or underflow, write and compute log(H ν (κ)) as L ν (κ)/2 plus the logarithm of the sum of the scaled series.Otherwise, use the approximation L ν (κ) (note that with θ = 700, this has a relative approximation error of about at most 1/2 • 1/1400 ≤ 0.0004).This is the approach for computing log(H ν ) currently implemented in package movMF.
From the above, we can see that when κ ≫ ν, I ν (κ) overflows quite rapidly; on the other hand, for κ = o(ν) and ν → ∞, I ν (κ) underflows quite rapidly.For computing logarithms, overflow can be avoided by employing the exponentially scaled modified Bessel function e −κ I ν (κ) (as commonly available in codes for computing Bessel functions, such as the SPECFUN package Cody 1993, used by R) and computing log(I ν (κ)) = κ + log(e −κ I ν (κ)).However, this clearly does not help avoiding underflow.
This motivates the following strategy for computing log(I ν (κ)) for a wide range of values.Start by computing a quick approximation T ν (κ) to log(I ν ), either using the above L ν (κ) + ν log(κ/2) − log(Γ(ν + 1)) or the leading term of the large ν uniform asymptotic approximation (in R, the latter is available via function besselI.nuAsym() in package Bessel; Maechler 2012).If this is "sufficiently small" in absolute value (so that I ν (κ) will neither over-nor underflow), compute log(I ν (κ)) directly as the logarithm of I ν (κ).Otherwise, if the approximation is "too large", but T ν (κ)−κ is sufficiently small in absolute value so that the exponentially scaled e −κ I ν (κ) can be computed without over-or underflow, compute log(I ν (κ)) as κ + log(e −κ I ν (κ)).Otherwise, use the quick approximation, or the large ν uniform asymptotic approximation with additional terms (package Bessel allows up to 4 additional terms).This strategy can readily be translated into a strategy for computing log(H ν ).

Application: useR! 2008 abstracts
In 2008 the "useR!2008", the 3rd international R user conference, took place in Dortmund, Germany.In total 177 abstracts were submitted and accepted for presentation at the conference.The abstracts with additional information such as title, author, session, and keywords were originally made available in the R data package corpus.useR.2008.abstractsavailable from the repository at https://datacube.wu.ac.at/.The dataset is also included in package movMF.
The following code loads the data contained in package movMF.
R> data("useR_2008_abstracts", package = "movMF") Using the tm package (Feinerer, Hornik, and Meyer 2008;Feinerer 2012) the data can be pre-processed by (1) generating a corpus from the vector of abstracts and (2) building a document-term matrix from the corpus.For constructing the document-term matrix each abstract needs to be tokenized (i.e., split into words, e.g., by using white space characters as separators) and transformed to lower case.Punctuation as well as numbers can be removed and the words can be stemmed (i.e., inflected words are reduced to a base form).In addition a minimum length can be imposed on the words as well as a minimum and maximum frequency within an abstract required.
The vector of abstracts is transformed to an object which has an extended class of 'Source' using VectorSource().This object is used as input for Corpus() to generate the corpus.The map from the corpus to the document-term matrix is performed using DocumentTermMatrix().The control argument of DocumentTermMatrix() specifies which pre-processing steps are applied to determine the frequency vectors of term occurrences in each abstract.We use the titles and the abstracts together to construct the document-term matrix.

R> useR_2008_abstracts_DTM <-weightTfIdf(useR_2008_abstracts_DTM)
In the following different mixtures of vMF distributions are fitted to training data using 10fold cross-validation and are compared based on the predictive log-likelihoods on the hold-out data to select a suitable model.The numbers of components are varied and the mixtures are fitted with concentration parameters constrained to be the same over components as well as where the concentration parameters are allowed to freely vary over components.In Figure 2 the fitted models are compared using the cross-validated predictive log-likelihoods.
The results for the models with free concentration parameters are on the left, for the models with common concentration parameters on the right.The predictive log-likelihoods on the hold-out data indicate that the best solutions have between 1 and 5 components and that the models with more components have rather bad predictive log-likelihoods.
Following the conclusions of the comparison of the predictive log-likelihoods values we further investigate the model where the concentration parameters are constrained to be equal over components and the number of components is equal to 2. The clustering obtained from the a-posteriori probabilities is analyzed by comparing the cluster membership with the keywords assigned to an abstract.Because each abstract might have several keywords assigned, the abstracts and their cluster assignments are suitably repeated.

R> (tab <-tab[rowSums(tab) > 8, ])
Component Keyword 1 2 bioinformatics 3 8 biostatistics 1 11 connectivity 11 0 environmetrics 4 7 high performance computing 13 0 modeling 1 13 user interfaces 15 0 The table is also visualized in Figure 3. Abstracts where the keywords assigned relate to infrastructure or implementational issues such as connectivity, high performance computing and user interfaces are associated with one component, whereas abstracts which are related to statistical modeling issues such as bioinformatics, biostatistics and modeling are more likely to be assigned to the other component.Figure 3: Cross-tabulation of the keywords in each session and the cluster assignment for keywords which were assigned to more than 8 abstracts.

Summary
An R package for fitting finite mixtures of vMF distributions is presented.Special focus has been given on numerical issues when evaluating the log-likelihood as well as on methods for ML estimation of the concentration parameter.
A possible extension for package movMF would be to allow for more flexible distributions in the components which also have as support the unit sphere.The vMF distribution is the analogue of the isotropic multivariate normal distribution, i.e., where the variance-covariance matrix is a multiple of the identity matrix.In R 3 the generalization of the vMF distribution which is the analogue to the general multivariate normal distribution on the two-dimensional unit sphere is the Fisher-Bingham (or Kent) distribution (Kent 1982).Finite mixtures of Fisher-Bingham distributions were used in Peel, Whiten, and McLachlan (2001) to identify joint sets present in rock mass.They also allowed for an additional component which followed the uniform distribution on the unit sphere.However, no generalization for higher dimensions are available and Dortet-Bernadet and Wicker (2008) propose to use inverse stereographic projections of multivariate normal distributions to cluster gene expression profiles.

Figure 1 :
Figure1: Household expenses data with gender indicated by color after projection to the sphere at the top left, estimated vMF distributions with confidence circles for each gender group separately at the top right and the estimated mixtures of vMF distributions with K = 2 and K = 3 with confidence circles at the bottom.

Figure 2 :
Figure 2: Predictive log-likelihoods for different and common concentration parameters κ for the fitted mixtures of vMF distributions to the "useR!2008" abstracts.

Table 1 :
Results of fitting mixtures of vMF distributions to the household expenses example.
Logical indicating if convergence of the algorithm should be checked and if in this case the algorithm should be stopped before the maximum number of iterations is reached.For E equal to "softmax" this argument is set by default to FALSE.Note that only condition (a) of the convergence check (see Section 3.2) is assessed, i.e., the relative change in the log-likelihood values.Integer indicating the maximum number of iterations of the EM algorithm.(Default: 100.) reltol: If the relative change in the log-likelihood falls below this threshold the EM algorithm is stopped if converge is TRUE.(Default: sqrt(.Machine$double.eps).) converge:maxiter: To reduce the dimension of the problem and omit terms which occur too frequently or too infrequently in the corpus to be of use when clustering the abstracts, we omit all terms which occur in less than 5 abstracts and more than 90 abstracts.
The resulting document-term matrix has 177 rows and 3853 columns.The ten most frequent terms (occurring in different abstracts) are the following.
For each training data set the EM algorithm is repeated 20 times with different random initializations.