Model-Based Clustering, Classification, and Discriminant Analysis Using the Generalized Hyperbolic Distribution: MixGHD R package

The MixGHD package for R performs model-based clustering, classification, and discriminant analysis using the generalized hyperbolic distribution (GHD). This approach is suitable for data that can be considered a realization of a (multivariate) continuous random variable. The GHD has the advantage of being flexible due to skewness, concentration, and index parameters; as such, clustering methods that use this distribution are capable of estimating clusters characterized by different shapes. The package provides five different models all based on the GHD, an efficient routine for discriminant analysis, and a function to measure cluster agreement. This paper is split into three parts: the first is devoted to the formulation of each method, extending them for classification and discriminant analysis applications, the second focuses on the algorithms, and the third shows the use of the package on real datasets.


Introduction
Broadly, classification refers to the process of assigning labels to sets of observations. In general, classification is unsupervised (also known as clustering), semi-supervised, or (fully) supervised. Generally speaking, the goal is the same: to group observations based on shared characteristics. Classification is, in fact, a key instrument in data mining and data analysis.
Classification can serve the twofold aim of highlighting discriminating factors and grouping homogeneous collections of units in datasets. The latter point is extremely useful in many fields such as medicine, e.g., for identifying homogeneous groups of patients, or marketing, e.g., for identifying homogeneous groups of customers. The main focus of this paper is cluster analysis, but the described methods can be used for semi-supervised and supervised learning as well. Many cluster analysis techniques exist in the statistical and machine learning literature; in this paper, we focus on a non-hierarchical clustering technique known as model-based clustering (McNicholas 2016).
Of course, not all non-hierarchical clustering techniques are model-based; the remainder are distinguished by not making any explicit assumptions about the distribution of the clusters. Typically, they group statistical units into k clusters with respect to a distance measure. The most common method in this context is k-means clustering (MacQueen 1967). Several extensions of k-means for high-dimensional data clustering exist (e.g., Bock 1987; De Sarbo and Manrai 1992; Arabie and Hubert 1994; De Soete and Carroll 1994; Stute and Zhu 1995; Vichi and Kiers 2001; Vichi and Saporta 2009; Yamamoto and Hwang 2014). An alternative distance-based method is probabilistic distance (PD) clustering (Ben-Israel and Iyigun 2008), which assigns units to a cluster according to their probability of membership, under the constraint that the product of the probability and the distance of each point to any cluster center is a constant. Tortora, Gettler Summa, Marino, and Palumbo (2016a) propose a transformation of the method for high-dimensional data sets, while Rainey, Tortora, and Palumbo (2019) and Tortora, McNicholas, and Palumbo (2020a) propose new distance measures.
Model-based methods assume that a population is a convex linear combination of a finite number of (component) probability densities. Until recently, the component densities have typically been Gaussian distributed, and several parsimonious extensions of Gaussian mixtures for high-dimensional data have been proposed (e.g., Ghahramani and Hinton 1997; McLachlan, Peel, and Bean 2003; Bouveyron, Girard, and Schmid 2007; McNicholas and Murphy 2008, 2010; Baek, McLachlan, and Flack 2010; Montanari and Viroli 2011). Recently, the focus of the literature has been on mixtures of non-Gaussian distributions for high-dimensional datasets (e.g., Andrews and McNicholas 2011a,b; Steane, McNicholas, and Yada 2012; Lin, McNicholas, and Hsiu 2014; Murray, McNicholas, and Browne 2014b; Murray, Browne, and McNicholas 2014a; Lin, McLachlan, and Lee 2016; McNicholas, McNicholas, and Browne 2017; Tang, Browne, and McNicholas 2018; Kim and Browne 2019; Murray, Browne, and McNicholas 2020; Punzo, Blostein, and McNicholas 2020). Of particular interest is the generalized hyperbolic distribution (GHD), which can detect clusters with non-elliptical form because it contains skewness, concentration, and index parameters. These parameters allow the GHD to be much more flexible than most other distributions. Browne and McNicholas (2015) examine different representations of the GHD and outline a mixture of GHDs for clustering. Each component scale matrix has a number of free parameters that increases quadratically in the number of variables p. Tortora, McNicholas, and Browne (2016b) propose a parsimonious version of the model, the mixture of generalized hyperbolic factor analyzers, to extend the method to higher dimensional data sets. A multiple scaled extension of the method was proposed by Tortora, Franczak, Browne, and McNicholas (2019), where the authors added even more flexibility to the models by letting the concentration and index parameters vary per dimension.
The volume of work on clustering and classification methodology has led to the release of new clustering software. A commonly used statistical software is R (R Core Team 2021), and many of the previously cited methods have a corresponding R package. For example, k-means clustering is directly implemented in R through the stats package (R Core Team 2021), specifically with the kmeans function. Two packages that are worth mentioning, because they implement several techniques useful for cluster visualization and for the choice of the number of clusters together with some basic clustering methods, are cluster (Maechler, Rousseeuw, Struyf, Hubert, and Hornik 2021) and fpc (Hennig 2020). Some of the extensions of k-means for high-dimensional datasets can be found in the clustrd package (Markos, Iodice D'Enza, and Van de Velden 2019). PD-clustering and its extensions are implemented in the package FPDcluster (Tortora, Vidales, Palumbo, and McNicholas 2020b). Among a large variety of packages available for model-based clustering are the widely used mclust package (Scrucca, Fop, Murphy, and Raftery 2016) and its parallel analogue pmclust (Chen and Ostrouchov 2021). These two packages implement model-based clustering, classification, and density estimation using the Gaussian distribution. An alternative for model-based clustering using the Gaussian distribution is the Rmixmod package (Lebret, Iovleff, Langrognet, Biernacki, Celeux, and Govaert 2015), an R interface for the MixMod software (Biernacki, Celeux, Govaert, and Langrognet 2006). A third alternative is the package mixture (Pocuca, Browne, and McNicholas 2021), which carries out model-based clustering and classification using the 14 parsimonious Gaussian clustering models from Celeux and Govaert (1995). Several existing packages for clustering high-dimensional datasets use the Gaussian distribution, each implementing a different model.
The pgmm package (McNicholas, ElSherbiny, McDaid, and Murphy 2019) implements the 12 parsimonious Gaussian mixture models for cluster analysis from McNicholas and Murphy (2008, 2010) and an associated classification model (see McNicholas 2010). HDclassif (Bergé, Bouveyron, and Girard 2012) and FisherEM (Bouveyron and Brunet 2020) implement the models described in Bouveyron et al. (2007) and Bouveyron and Brunet (2012), respectively.
The EMMIXskew package (Wang, Ng, and McLachlan 2018) implements model-based clustering using the normal, the Student-t, the skew normal, and the skew-t distributions, while the EMMIXuskew package (Lee and McLachlan 2014b) implements model-based clustering using the unrestricted skew-t distribution given in Lee and McLachlan (2014a). The package uskewFactors (Murray, Browne, and McNicholas 2016) implements mixtures of unrestricted skew-t factor analyzers. An alternative to the common paradigms is proposed by Azzalini and Torelli (2007), who use a clustering method based on nonparametric density estimation; the corresponding package is pdfCluster (Menardi and Azzalini 2014). For large and sparse data sets, mixtures of von Mises-Fisher distributions can be fit using the package movMF (Hornik and Grün 2014). The two packages flexmix (Leisch 2004; Grün and Leisch 2008) and mixtools (Benaglia, Chauveau, Hunter, and Young 2009) allow the user to choose among different distributions. Specifically, flexmix is extremely flexible, letting the user input the chosen distribution. For a list of R packages on cluster analysis and finite mixture models, see Leisch and Grün (2021).
The aim of this paper is to describe the MixGHD package (Tortora, El-Sherbiny, Browne, Franczak, and McNicholas 2021), which implements five different methods based on the GHD. The package is available from the Comprehensive R Archive Network (CRAN) at https://CRAN.R-project.org/package=MixGHD. As mentioned before, the GHD is a very flexible distribution that has many other distributions as special or limiting cases. For these reasons, this package fills a gap in the existing package landscape. Moreover, in this paper, the three methods proposed in Tortora et al. (2019) are extended for use in discriminant analysis and model-based classification. This paper has the following structure. In Section 2, we introduce model-based classification. Sections 3 to 5 describe the five methods implemented in the MixGHD package, with some implementation details described in Section 6. Section 7 describes the MixGHD package with real data examples.

Model-based classification
The basic idea of model-based clustering is that a random vector X follows a (parametric) finite mixture distribution if, for all x ⊂ X, its density can be written as

f(x \mid \vartheta) = \sum_{g=1}^{G} \pi_g f_g(x \mid \theta_g),

where G is the number of clusters, π_g > 0 is the gth mixing proportion such that \sum_{g=1}^{G} \pi_g = 1, and f_g(x | θ_g) is the gth component density, which we assume to be of the same type for all components, i.e., f_g(x | θ_g) = f(x | θ_g). Therefore, the model-based clustering likelihood function, for x_1, ..., x_n, can be written as

L(\vartheta) = \prod_{i=1}^{n} \sum_{g=1}^{G} \pi_g f(x_i \mid \theta_g). \quad (1)

In model-based classification, given n p-dimensional vectors x_1, ..., x_n, k of them have known labels and the model can be used to predict the other n − k labels. Following McNicholas (2010), order the n observations so that the first k are labeled; this can be done without loss of generality. Let G be the number of classes, H ≥ G be the number of fitted components, and z_ig the component membership labels so that z_ig = 1 if x_i is in component g, and z_ig = 0 otherwise, for i = 1, ..., k and g = 1, ..., G. The model-based classification likelihood is

L(\vartheta) = \prod_{i=1}^{k} \prod_{g=1}^{G} \left[\pi_g f(x_i \mid \theta_g)\right]^{z_{ig}} \prod_{j=k+1}^{n} \sum_{h=1}^{H} \pi_h f(x_j \mid \theta_h). \quad (2)

Note that H ≥ G in general, but it is typically assumed that H = G.
Discriminant analysis is a special case of classification in which k = n and, therefore, we only use the first part of (2). Cluster analysis in (1) can be obtained by setting k = 0, in which case we use only the second part of Equation (2); see McNicholas (2010) for details. In the following, we will consider the GHD density function. Extensive details on model-based clustering, classification, and discriminant analysis are given by McNicholas (2016).

Mixture of generalized hyperbolic distributions
A random p-dimensional variable X is distributed according to a GHD if its density can be represented as

f(x \mid \vartheta) = \int_0^\infty \phi_p(x \mid \mu + w\alpha,\, w\Sigma)\, h(w \mid \omega, \eta, \lambda)\, dw, \quad (3)

where φ_p is a multivariate p-dimensional Gaussian density and h(w | ω, η, λ), called the weight function, is the density of a univariate generalized inverse Gaussian (GIG) distribution. Formally, the density of the GIG distribution is given by

h(w \mid \omega, \eta, \lambda) = \frac{(w/\eta)^{\lambda - 1}}{2\eta K_\lambda(\omega)} \exp\left\{-\frac{\omega}{2}\left(\frac{w}{\eta} + \frac{\eta}{w}\right)\right\}, \quad (4)

where η > 0 is a scale parameter, ω > 0 is a concentration parameter, λ ∈ R is an index parameter, and K_λ is the modified Bessel function of the third kind with index λ. Browne and McNicholas (2015) propose an identifiable representation of the GHD by setting η = 1, which gives

f(x \mid \vartheta) = \left[\frac{\omega + \delta(x, \mu \mid \Sigma)}{\omega + \alpha^{\top}\Sigma^{-1}\alpha}\right]^{(\lambda - p/2)/2} \frac{K_{\lambda - p/2}\left(\sqrt{[\omega + \alpha^{\top}\Sigma^{-1}\alpha][\omega + \delta(x, \mu \mid \Sigma)]}\right)}{(2\pi)^{p/2} \lvert\Sigma\rvert^{1/2} K_\lambda(\omega) \exp\{-(x - \mu)^{\top}\Sigma^{-1}\alpha\}}, \quad (5)

where δ(x, μ | Σ) = (x − μ)^⊤ Σ^{−1} (x − μ) is the squared Mahalanobis distance between x and the location parameter µ ∈ R^p, α ∈ R^p is a skewness parameter, Σ is a p × p positive definite scale matrix, and K_λ, ω, and λ are as defined for (4).
The random variable X can be generated via the relationship

X = \mu + V\alpha + \sqrt{V}\,N, \quad (6)

where N ∼ N_p(0, Σ) and V ∼ GIG(ω, 1, λ), i.e., V follows a GIG distribution with density as in (4). It follows that

X \mid V = v \sim N_p(\mu + v\alpha,\, v\Sigma).

A finite mixture of GHDs (MGHD) has density

f(x \mid \vartheta) = \sum_{g=1}^{G} \pi_g f_{\mathrm{GH}}(x \mid \mu_g, \Sigma_g, \alpha_g, \omega_g, \lambda_g),

where f_GH is the density of the GHD given in (5) and, as before, π_g is the gth mixing proportion.

Mixture of generalized hyperbolic factor analyzers
In the MGHD, the scale matrices Σ_1, ..., Σ_G contain Gp(p + 1)/2 free parameters, i.e., a number that is quadratic in p. When p is large, the number of parameters to estimate becomes too big; to overcome this issue, Tortora et al. (2016b) propose the mixture of generalized hyperbolic factor analyzers (MGHFA). In a factor analyzers model (Ghahramani and Hinton 1997; McLachlan and Peel 2000), the random variable X_i can be represented as

X_i = \mu_g + \Lambda_g U_{ig} + \epsilon_{ig}

with probability π_g, for i = 1, ..., n and g = 1, ..., G. The matrix Λ_g is a p × q matrix of factor loadings, and the factors U_ig are independently distributed U_ig ∼ N_q(0, I_q) with q < p. Considering the representation in (6), note that N can be decomposed as

N = \Lambda U + \epsilon, \quad (7)

where U ∼ N_q(0, I_q) and ε ∼ N_p(0, Ψ), and Λ and Ψ are a p × q factor loading matrix and a p × p diagonal matrix with positive entries, respectively. From (7), it follows that

X = \mu + V\alpha + \sqrt{V}(\Lambda U + \epsilon).

This leads to a mixture of generalized hyperbolic factor analyzers model with density

f(x \mid \vartheta) = \sum_{g=1}^{G} \pi_g f_{\mathrm{GH}}(x \mid \mu_g,\, \Lambda_g\Lambda_g^{\top} + \Psi_g,\, \alpha_g, \omega_g, \lambda_g),

where f_GH is the density of the GHD given in (5) and π_g are the mixing proportions.

Extensions of the generalized hyperbolic distribution
The multiple scaled distributions are an extension of distributions of the type in (3), where the weight function is the product of p univariate weight functions (Forbes and Wraith 2014). This transformation can be obtained by letting Σ = ΓΦΓ^⊤, where Γ is a p × p orthogonal matrix and Φ is a p × p diagonal matrix, and replacing the univariate weight function in (3) with the product

\prod_{j=1}^{p} h(w_j \mid \omega_j, 1, \lambda_j),

so that the concentration and index parameters can vary per dimension. A finite mixture of multiple scaled GHDs (MMSGHDs) has density

f(x \mid \vartheta) = \sum_{g=1}^{G} \pi_g f_{\mathrm{MSGH}}(x \mid \mu_g, \Gamma_g, \Phi_g, \alpha_g, \omega_g, \lambda_g).

The MSGHD is neither convex nor quasi-convex, and consequently there are situations in which the contour plots are not convex. In some situations, see Figures 1a and 1b, a convex contour plot can be more suitable. For this reason, Tortora et al. (2019) propose the convex MMSGHD (cMMSGHD). A convex contour plot can be ensured by adding a constraint to the index parameter λ, i.e., λ ≥ 1; see Tortora et al. (2019) for details. The GHD cannot be obtained as a special or limiting case of the MSGHD, and vice versa. For this reason, Tortora et al. (2019) propose the mixture of coalesced GHDs (MCGHD). A random variable X follows a CGHD if it can be modeled as S with probability ϖ and R with probability 1 − ϖ, where ϖ ∈ (0, 1), S is distributed according to an MSGHD with density f_MSGH(µ, Γ, Φ, α, ω, λ), and R = ΓY, where Y is distributed according to a GHD with density f_GH(µ, Γ, Φ, α, ω_0, λ_0) and Σ = ΓΦΓ^⊤. This implies that the density of an MCGHD is

f(x \mid \vartheta) = \sum_{g=1}^{G} \pi_g \left[\varpi_g f_{\mathrm{MSGH}}(x \mid \mu_g, \Gamma_g, \Phi_g, \alpha_g, \omega_g, \lambda_g) + (1 - \varpi_g) f_{\mathrm{GH}}(x \mid \mu_g, \Sigma_g, \alpha_g, \omega_{0g}, \lambda_{0g})\right].

It is worth noting that the main difference from the GHD is that the MSGHD parameters ω_g and λ_g are p-dimensional vectors, whereas the GHD parameters ω_0g and λ_0g are unidimensional. The advantage of the MCGHD is that it includes both the GHD and the MSGHD as special cases.

Overview
The most widely used technique for parameter estimation in a finite mixture model context is the expectation-maximization (EM) algorithm (Dempster, Laird, and Rubin 1977), which is an iterative technique for finding maximum likelihood estimates when data are incomplete or treated as incomplete. The EM algorithm iterates between two steps, an E-step and an M-step. On the E-step, the (conditional) expected value of the complete-data log-likelihood, Q say, is calculated. On the M-step, Q is maximized with respect to the parameter estimates.
The E-step and the M-step are iterated until convergence is reached; see Section 6.2 for details about stopping rules.
For the MGHFA, the parameters are estimated using an extension of the EM algorithm called the alternating expectation-conditional maximization (AECM) algorithm (Meng and Van Dyk 1997). Similar to the EM algorithm, it is based on the complete-data log-likelihood, but it allows for the specification of different complete-data at each stage of the algorithm and the M-step is replaced by a number of conditional maximization (CM) steps. For details on parameter estimation in the MGHFA, refer to Tortora et al. (2016b). For the MMSGHD, cMMSGHD and MCGHD, Γ cannot be found in closed form and an optimization routine is used. The result is that, in each M-step, the likelihood increases with respect to Γ but it is not maximized; accordingly, the algorithm is formally a generalized EM (GEM) algorithm. For details on parameter estimation, refer to Tortora et al. (2019).

Model selection, convergence, and evaluation
Model selection can be performed using the BIC, AIC, AIC3, or ICL; the ICL penalizes the BIC by the estimated mean entropy, \sum_{i=1}^{n}\sum_{g=1}^{G} \mathrm{MAP}\{\hat{z}_{ig}\} \log \hat{z}_{ig}, where MAP{ẑ_ig} denotes the maximum a posteriori classification based on ẑ_ig. The EM algorithm, the AECM algorithm, and the GEM algorithm used for the parameter estimation of the models are iterated until convergence is reached. Convergence is determined using a stopping rule based on the Aitken acceleration (Aitken 1926). Let l^{(k)} be the value of the log-likelihood after k iterations. The Aitken acceleration at iteration k is

a^{(k)} = \frac{l^{(k+1)} - l^{(k)}}{l^{(k)} - l^{(k-1)}}.

An asymptotic estimate of the log-likelihood at iteration k + 1 is

l_\infty^{(k+1)} = l^{(k)} + \frac{l^{(k+1)} - l^{(k)}}{1 - a^{(k)}},

and we consider the algorithm to have converged if

l_\infty^{(k+1)} - l^{(k)} < \epsilon,

where ε is small (McNicholas, Murphy, McDaid, and Frost 2010).
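The Aitken stopping rule described above can be sketched in a few lines of R. The function below is purely illustrative (it is not part of MixGHD); it assumes the log-likelihood trace is stored in a numeric vector, oldest value first.

```r
# Illustrative sketch of the Aitken-based stopping rule; not part of MixGHD.
# `ll` holds the log-likelihood values l^(1), l^(2), ... observed so far.
aitken_converged <- function(ll, eps = 0.01) {
  k <- length(ll)
  if (k < 3) return(FALSE)  # three values are needed to compute a^(k)
  a <- (ll[k] - ll[k - 1]) / (ll[k - 1] - ll[k - 2])   # Aitken acceleration
  l_inf <- ll[k - 1] + (ll[k] - ll[k - 1]) / (1 - a)   # asymptotic estimate
  isTRUE(l_inf - ll[k - 1] < eps)                      # stopping criterion
}

aitken_converged(c(-310.2, -295.6, -290.1, -289.300, -289.295))  # TRUE
```

With the illustrative trace above, the asymptotic estimate is within ε = 0.01 of the current log-likelihood, so the rule declares convergence.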
The adjusted Rand index (ARI; Hubert and Arabie 1985), which compares predicted classifications with true classifications, can be used to evaluate the results. The ARI corrects the Rand index (Rand 1971) for chance; its expected value under random classification is 0, and it takes the value 1 when there is perfect class agreement. Steinley (2004) gives guidelines for interpreting ARI values. For more pairwise agreement indices, see the cl_agreement function in the clue package (Hornik 2005).
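As a small illustration of the ARI function provided by MixGHD (assuming here that it accepts two classification vectors as its first two arguments; see ?ARI):

```r
library("MixGHD")

# Two partitions of the same six observations; the second is the first
# with cluster labels swapped, so agreement is perfect up to relabeling.
true <- c(1, 1, 1, 2, 2, 2)
pred <- c(2, 2, 2, 1, 1, 1)
ARI(true, pred)   # 1: the ARI is invariant to label switching
```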

MixGHD R package
MixGHD is an R package developed in an object-oriented design using the standard S4 paradigm and the C programming language. The package contains five functions for model-based clustering and classification: MGHD, MGHFA, MSGHD, cMSGHD, and MCGHD. The DA function is a routine for discriminant analysis, the ARI function computes the adjusted Rand index, and the contourpl function produces a contour plot. The package also contains the functions rGHD, rMSGHD, and rMCGHD to pseudo-randomly generate data from the corresponding distributions, and the functions dGHD, dMSGHD, and dMCGHD to compute the density of the corresponding distributions. Table 1 shows the input arguments for the MGHD, MGHFA, MSGHD, cMSGHD, and MCGHD functions with a brief description.

Cluster analysis
To illustrate the use of the package, we use the bankruptcy dataset (Altman 1968) from the MixGHD package. The dataset contains the ratio of retained earnings (RE) to total assets, as well as the ratio of earnings before interest and taxes (EBIT) to total assets, of 66 American firms. Half of the selected firms had filed for bankruptcy.

R> library("MixGHD")
R> data("bankruptcy", package = "MixGHD")
R> res <- MCGHD(data = bankruptcy[, 2:3], G = 2:3, method = "kmedoids",
+   max.iter = 1000, modelSel = "BIC")

The best model (BIC) for the range of components used is G = 2.
The BIC for this model is -288.7835.

R> summary(res)
The number of components used for the model is G = 2. The variables RE and EBIT are considered for cluster analysis. The BIC criterion is used to select between G = 2 or G = 3, the maximum number of iterations is 1000, and k-medoids is used as the starting criterion.

Arguments and their descriptions:

data: An n × p matrix or data frame such that rows correspond to observations and columns correspond to variables.
gpar0: An optional list containing the initial parameters of the mixture model. If specified, it must have a list structure containing as many elements as the number of components G. Each element must include all the parameters for the selected model.
method: An optional string indicating the initialization criterion; if not specified, k-means clustering is used. Alternative methods are hierarchical ("hierarchical"), k-medoids ("kmedoids"), random ("random"), and model-based ("modelBased") clustering.
nr: An optional number indicating the number of starting values when random initialization is used; 10 by default.
scale: An optional logical value indicating whether or not the data should be scaled; TRUE by default.
modelSel: An optional string indicating the model selection criterion; if not specified, the AIC is used. Alternative criteria are the BIC, ICL, and AIC3.
q: Only when MGHFA is used, a numerical vector specifying the number of latent factors; q = 2 by default.

The function summary shows the value of the BIC, AIC, AIC3, and ICL and the number of elements in each cluster. The output is an S4 object of class 'MixGHD' containing the following parameters:

• index: Value of the index used for model selection for each model, BIC in this case.
• gpar: A list of the model parameters in the rotated space.
• map: A vector of integers indicating the maximum a posteriori classifications for the best model.
• par: A list of the model parameters.
• z: A matrix giving the raw values upon which map is based.
For each component, the estimated parameters are stored in the list gpar.
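The S4 slots listed above are accessed with the @ operator; the sketch below refits the bankruptcy example and inspects the output (slot names taken from the list above):

```r
library("MixGHD")
data("bankruptcy", package = "MixGHD")
res <- MCGHD(data = bankruptcy[, 2:3], G = 2, method = "kmedoids",
             max.iter = 1000)

res@index                     # value of the model selection index
res@map[1:5]                  # MAP classifications for the first five firms
head(res@z)                   # values upon which map is based
str(res@gpar, max.level = 1)  # model parameters in the rotated space
```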

The MCGHD has good performance on the bankruptcy dataset, with an ARI of 0.824 and only three misclassifications. Figure 2a shows the obtained partition. The cluster represented by o is characterized by skewness in both directions, which makes it hard to be identified by less flexible clustering methods. Figure 2b shows the value of the log-likelihood at each iteration of the EM algorithm. For comparison, the MGHD is applied on the same dataset.
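The comparison mentioned above can be reproduced with a call of the following form (a sketch; the options mirror the MCGHD example, and the first column of bankruptcy is assumed to hold the bankruptcy indicator, consistent with the use of columns 2:3 as data above):

```r
library("MixGHD")
data("bankruptcy", package = "MixGHD")

# Fit an MGHD with the same initialization as the MCGHD example.
res2 <- MGHD(data = bankruptcy[, 2:3], G = 2, method = "kmedoids",
             max.iter = 1000)
table(res2@map)                 # cluster sizes under the MGHD fit
ARI(res2@map, bankruptcy[, 1])  # agreement with the bankruptcy indicator
```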

Data generation and density estimation
For the three density functions GHD, MSGHD, and MCGHD, the package also contains functions to pseudo-randomly generate data (rGHD, rMSGHD, and rMCGHD) and to compute density estimates (dGHD, dMSGHD, and dMCGHD). The input arguments of these functions are described in Table 2.
The output of each random generation function is a pseudo-randomly generated n × p dataset; the output of each density function is a numerical vector containing the density values.
The following examples show the use of the rMCGHD and dMCGHD functions; the use of the other functions is analogous.
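A sketch of such a call follows, illustrated with the simpler rGHD/dGHD pair; rMCGHD and dMCGHD additionally take the multiple scaled parameters. The argument names used here (n, p, mu, alpha, omega, lambda) are assumptions based on the GHD parameters in Section 3, not confirmed signatures, and should be checked against ?rGHD and ?dGHD.

```r
library("MixGHD")

# Hypothetical argument names; verify against ?rGHD and ?dGHD before use.
set.seed(123)
x <- rGHD(n = 200, p = 2, mu = c(0, 0), alpha = c(2, -1),
          omega = 1, lambda = 0.5)      # 200 x 2 pseudo-random sample
d <- dGHD(x, p = 2, mu = c(0, 0), alpha = c(2, -1),
          omega = 1, lambda = 0.5)      # density at each generated point
head(d)
```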

Discriminant analysis
To easily perform discriminant analysis, the package contains a routine called DA. The DA function requires the input arguments in Table 1, with the exception of the data and labels, which are replaced by the input parameters in Table 3.
Discriminant analysis requires the dataset to be divided into a training set and a test set; let n_1 and n_2 be the number of units in the training and test sets, respectively. The input parameters change according to the chosen method. The outputs are: • model: A list with the model parameters.
• testMembership: A vector of integers indicating the membership of the units in the test set.
• ARItest : A value indicating the adjusted Rand index for the test set.
• ARItrain : A value indicating the adjusted Rand index for the training set.
We applied the DA routine to the sonar dataset from the MixGHD R package. The data report the patterns obtained by bouncing sonar signals at various angles and under various conditions. There are 208 patterns in all: 111 obtained by bouncing sonar signals off a metal cylinder and 97 off rocks. Among the DA-specific arguments in Table 3 are an optional string indicating the model, with "GHD" used if not specified and "MGHFA", "MSGHD", "cMSGHD", and "MCGHD" as alternatives, and an optional numerical parameter giving the weight for the CGHD. The command lab <- as.numeric(factor(sonar[, 61])) transforms the labels into a numerical vector. The data are divided into training and test sets, with 30% of the data in each cluster belonging to the test set.
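A call of the kind described above can be sketched as follows. The split construction is illustrative, and the argument names train, trainL, test, and testL are assumptions to be verified against Table 3 and ?DA.

```r
library("MixGHD")
data("sonar", package = "MixGHD")
lab <- as.numeric(factor(sonar[, 61]))

# Put 30% of each class in the test set (illustrative split).
set.seed(1)
test <- unlist(lapply(split(seq_along(lab), lab),
                      function(i) sample(i, round(0.3 * length(i)))))

# Hypothetical argument names; check ?DA for the actual interface.
modelDA <- DA(train = sonar[-test, 1:60], trainL = lab[-test],
              test = sonar[test, 1:60], testL = lab[test],
              max.iter = 1000)
modelDA$ARItest   # adjusted Rand index on the test set
```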
The AIC for this model is -12460.25.

R> ls(modelDA)
[1] "ARItest" "ARItrain" "model" "testMembership" R> modelDA$ARItest [1] 0.6605439 R> modelDA$ARItrain As result of the DA routine, we obtain the ARI for the test and on the training sets, as well as the model and the membership for the test set. Model is an S4 object of class 'MixGHD'. For the test set of the sonar data, the ARI is 0.660 and, because no model was specified, the routine used the default model, i.e., MGHD.

Classification
The wine dataset from the pgmm package contains data on 27 chemical and physical properties of wine from the Piedmont region of Italy. There are three different types of wine: Barolo, Grignolino, and Barbera. To perform classification, we assume that 25% of the memberships are unknown; the value 0 indicates unknown membership.
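A classification run of the kind described can be sketched as follows. The 25% split is constructed per the description above, and the label argument name is an assumption based on Table 1's description of the labels input; verify against ?MGHD.

```r
library("MixGHD")
library("pgmm")
data("wine", package = "pgmm")

X   <- wine[, -1]   # 27 chemical and physical properties
lab <- wine[, 1]    # wine type: Barolo, Grignolino, Barbera

# Mark 25% of the memberships as unknown (0 = unknown).
set.seed(1)
unknown <- sample(seq_along(lab), round(0.25 * length(lab)))
lab[unknown] <- 0

# Hypothetical label argument; check ?MGHD for the actual interface.
res <- MGHD(data = X, G = 3, label = lab, max.iter = 1000)
table(res@map[unknown], wine[unknown, 1])  # predicted vs. true, unknowns only
```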
The AIC for this model is -10121.01.
The AIC for this model is -11429.97.

Computational details
The package uses several R packages and functions. The Bessel function is implemented using the package Bessel (Maechler 2019), with exponentially scaled results to avoid underflow. The gradient is calculated using the function grad from the package numDeriv (Gilbert and Varadhan 2019). To generate data, the functions rgig and rmvnorm from the packages ghyp (Weibel, Luethi, and Breymann 2020) and mvtnorm (Genz, Bretz, Miwa, Mi, Leisch, Scheipl, and Hothorn 2020), respectively, are used. To reduce the computational time, the expectation step and the parameter updates of the functions MGHD, MGHFA, MSGHD, cMSGHD, and MCGHD are coded in C. The parameter initialization is done in R using the following functions: kmeans for k-means, gpcm for model-based (package mixture, Pocuca et al. 2021), pam for k-medoids (package cluster, Maechler et al. 2021), and hclust for hierarchical. The starting parameters are then passed to C, where the appropriate algorithm for each function is used; see Section 6.
The outputs from C are passed back to R where the indices discussed in Section 6.2 are computed. All the other functions are implemented entirely in R.

Conclusion
This paper illustrates the use of the MixGHD package for R. The package contains five main functions for model-based clustering, classification, and discriminant analysis based on the generalized hyperbolic distribution (GHD). The GHD is a very flexible distribution; other well-known distributions are special or limiting cases thereof. It can detect clusters characterized by a variety of shapes because it has skewness, concentration, and index parameters. The MGHD function performs clustering and classification using the GHD, while the MGHFA function uses the mixture of generalized hyperbolic factor analyzers, which is useful for high-dimensional data. The other three functions, MSGHD, cMSGHD, and MCGHD, implement three recently proposed and more flexible variations of the MGHD. All of the models can be used with different starting techniques and several other options. The package also contains a DA routine for discriminant analysis, an ARI function that computes the adjusted Rand index, a contourpl function for contour plots, and several functions for pseudo-random number generation and density estimation using the GHD, MSGHD, and MCGHD. The paper shows how to use the functions and how to interpret the outputs on real datasets.
The current version of the package includes only one model for high-dimensional data, i.e., the MGHFA. Future research will focus on the extension of the MSGHD and MCGHD for high dimensional data. Moreover, the GHD could also be used for model-based regression, in which the random response variables follow a generalized hyperbolic regression model given a set of explanatory variables.