Abstract
Families of mixtures of multivariate power exponential (MPE) distributions have already been introduced and shown to be competitive for cluster analysis in comparison to other mixtures of elliptical distributions, including mixtures of Gaussian distributions. A family of mixtures of multivariate skewed power exponential distributions is proposed that combines the flexibility of the MPE distribution with the ability to model skewness. These mixtures are more robust to deviations from normality and can account for skewness, varying tail weight, and peakedness of data. A generalized expectation-maximization approach, combining minorization-maximization and optimization based on accelerated line search algorithms on the Stiefel manifold, is used for parameter estimation. These mixtures are implemented in both the unsupervised and semi-supervised classification frameworks. Both simulated and real data are used for illustration and comparison to other mixture families.
References
Absil, P.-A., Mahony, R., & Sepulchre, R. (2009). Optimization algorithms on matrix manifolds. Princeton University Press.
Aitken, A. C. (1926). On Bernoulli’s numerical solution of algebraic equations. In Proceedings of the Royal Society of Edinburgh (pp. 289–305).
Andrews, J. L., & McNicholas, P. D. (2012). Model-based clustering, classification, and discriminant analysis via mixtures of multivariate t-distributions. Statistics and Computing, 22(5), 1021–1029.
Azzalini, A. (1986). Further results on a class of distributions which includes the normal ones. Statistica, 46(2), 199–208.
Azzalini, A., & Valle, A. D. (1996). The multivariate skew-normal distribution. Biometrika, 83, 715–726.
Banfield, J. D., & Raftery, A. E. (1993). Model-based Gaussian and non-Gaussian clustering. Biometrics, 49(3), 803–821.
Basford, K., Greenway, D., McLachlan, G., & Peel, D. (1997). Standard errors of fitted component means of normal mixtures. Computational Statistics, 12(1), 1–18.
Biernacki, C., Celeux, G., & Govaert, G. (2000). Assessing a mixture model for clustering with the integrated completed likelihood. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(7), 719–725.
Biernacki, C., Celeux, G., & Govaert, G. (2003). Choosing starting values for the EM algorithm for getting the highest likelihood in multivariate Gaussian mixture models. Computational Statistics & Data Analysis, 41(3–4), 561–575.
Böhning, D., & Lindsay, B. G. (1988). Monotonicity of quadratic-approximation algorithms. Annals of the Institute of Statistical Mathematics, 40(4), 641–663.
Bouveyron, C., & Brunet-Saumard, C. (2014). Model-based clustering of high-dimensional data: A review. Computational Statistics and Data Analysis, 71, 52–78.
Branco, M. D., & Dey, D. K. (2001). A general class of multivariate skew-elliptical distributions. Journal of Multivariate Analysis, 79(1), 99–113.
Browne, R. P., & McNicholas, P. D. (2014). Orthogonal Stiefel manifold optimization for eigen-decomposed covariance parameter estimation in mixture models. Statistics and Computing, 24(2), 203–210.
Browne, R. P., & McNicholas, P. D. (2015). A mixture of generalized hyperbolic distributions. Canadian Journal of Statistics, 43(2), 176–198.
Celeux, G., & Govaert, G. (1995). Gaussian parsimonious clustering models. Pattern Recognition, 28(5), 781–793.
Cho, D., & Bui, T. D. (2005). Multivariate statistical modeling for image denoising using wavelet transforms. Signal Processing: Image Communication, 20(1), 77–89.
da Silva Ferreira, C., Bolfarine, H., & Lachos, V. H. (2011). Skew scale mixtures of normal distributions: Properties and estimation. Statistical Methodology, 8(2), 154–171.
Dang, U. J., Browne, R. P., & McNicholas, P. D. (2015). Mixtures of multivariate power exponential distributions. Biometrics, 71(4), 1081–1089.
Dang, U. J., Browne, R. P., Gallaugher, M. P., & McNicholas, P. D. (2021). mixSPE: Mixtures of power exponential and skew power exponential distributions for use in model-based clustering and classification. R package version 0.9.1.
Day, N. E. (1969). Estimating the components of a mixture of normal distributions. Biometrika, 56(3), 463–474.
Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society. Series B, 39(1), 1–38.
DiCiccio, T. J., & Monti, A. C. (2004). Inferential aspects of the skew exponential power distribution. Journal of the American Statistical Association, 99(466), 439–450.
Flury, B. (2012). Flury: Data sets from Flury, 1997. R package version 0.1–3.
Forina, M., & Tiscornia, E. (1982). Pattern-recognition methods in the prediction of Italian olive oil origin by their fatty-acid content. Annali di Chimica, 72(3–4), 143–155.
Fraley, C., Raftery, A. E., Murphy, T. B., & Scrucca, L. (2012). mclust version 4 for R: Normal mixture modeling for model-based clustering, classification, and density estimation. Technical Report 597, Department of Statistics, University of Washington, Seattle, Washington.
Franczak, B. C., Browne, R. P., & McNicholas, P. D. (2014). Mixtures of shifted asymmetric Laplace distributions. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(6), 1149–1157.
Franczak, B. C., Browne, R. P., McNicholas, P. D., & Burak, K. L. (2018). MixSAL: Mixtures of multivariate shifted asymmetric Laplace (SAL) distributions. R package version 1.0.
Gallaugher, M. P. B., & McNicholas, P. D. (2018). Finite mixtures of skewed matrix variate distributions. Pattern Recognition, 80, 83–93.
Gallaugher, M. P. B., & McNicholas, P. D. (2019). On fractionally-supervised classification: Weight selection and extension to the multivariate t-distribution. Journal of Classification, 36(2), 232–265.
Gómez, E., Gómez-Villegas, M. A., & Marin, J. M. (1998). A multivariate generalization of the power exponential family of distributions. Communications in Statistics - Theory and Methods, 27(3), 589–600.
Hartigan, J. A., & Wong, M. A. (1979). A k-means clustering algorithm. Journal of the Royal Statistical Society: Series C, 28(1), 100–108.
Hasselblad, V. (1966). Estimation of parameters for a mixture of normal distributions. Technometrics, 8(3), 431–444.
Horst, A. M., Hill, A. P., & Gorman, K. B. (2020). palmerpenguins: Palmer Archipelago (Antarctica) penguin data. R package version 0.1.0.
Hubert, L., & Arabie, P. (1985). Comparing partitions. Journal of Classification, 2(1), 193–218.
Hunter, D. R., & Lange, K. (2000). Rejoinder to discussion of optimization transfer using surrogate objective functions. Journal of Computational and Graphical Statistics, 9(1), 52–59.
Hunter, D. R., & Lange, K. (2004). A tutorial on MM algorithms. The American Statistician, 58(1), 30–37.
Hurley, C. (2012). gclus: Clustering graphics. R package version 1.3.1.
Karlis, D., & Santourian, A. (2009). Model-based clustering with non-elliptically contoured distributions. Statistics and Computing, 19(1), 73–83.
Lee, S., & McLachlan, G. J. (2014). Finite mixtures of multivariate skew t-distributions: Some recent and new results. Statistics and Computing, 24(2), 181–202.
Lee, S. X., & McLachlan, G. J. (2016). Finite mixtures of canonical fundamental skew t-distributions: The unification of the restricted and unrestricted skew t-mixture models. Statistics and Computing, 26(3), 573–589.
Lin, T.-I. (2010). Robust mixture modeling using multivariate skew t distributions. Statistics and Computing, 20(3), 343–356.
Lin, T.-I., Ho, H. J., & Lee, C.-R. (2014). Flexible mixture modelling using the multivariate skew-t-normal distribution. Statistics and Computing, 24(4), 531–546.
Lindsay, B. G. (1995). Mixture models: Theory, geometry and applications. In NSF-CBMS Regional Conference Series in Probability and Statistics (pp. 1–163).
Lindsey, J. K. (1999). Multivariate elliptically contoured distributions for repeated measurements. Biometrics, 55(4), 1277–1280.
McNicholas, P. D. (2010). Model-based classification using latent Gaussian mixture models. Journal of Statistical Planning and Inference, 140(5), 1175–1181.
McNicholas, P. D. (2016a). Mixture model-based classification. Boca Raton: Chapman & Hall/CRC Press.
McNicholas, P. D. (2016b). Model-based clustering. Journal of Classification, 33(3), 331–373.
McNicholas, P. D., & Murphy, T. B. (2008). Parsimonious Gaussian mixture models. Statistics and Computing, 18(3), 285–296.
McNicholas, P. D., Murphy, T. B., McDaid, A. F., & Frost, D. (2010). Serial and parallel implementations of model-based clustering via parsimonious Gaussian mixture models. Computational Statistics & Data Analysis, 54(3), 711–723.
McNicholas, P. D., ElSherbiny, A., McDaid, A. F., & Murphy, T. B. (2022). pgmm: Parsimonious Gaussian mixture models. R package version 1.2.6-2.
McNicholas, S. M., McNicholas, P. D., & Browne, R. P. (2017). A mixture of variance-gamma factor analyzers. In Big and complex data analysis (pp. 369–385). Cham: Springer International Publishing.
Morris, K., & McNicholas, P. D. (2013). Dimension reduction for model-based clustering via mixtures of shifted asymmetric Laplace distributions. Statistics and Probability Letters, 83(9), 2088–2093.
Murray, P. M., Browne, R. B., & McNicholas, P. D. (2014). Mixtures of skew-t factor analyzers. Computational Statistics and Data Analysis, 77, 326–335.
Murray, P. M., Browne, R. B., & McNicholas, P. D. (2017). Hidden truncation hyperbolic distributions, finite mixtures thereof, and their application for clustering. Journal of Multivariate Analysis, 161, 141–156.
Nakai, K., & Kanehisa, M. (1991). Expert system for predicting protein localization sites in gram-negative bacteria. Proteins: Structure, Function, and Bioinformatics, 11(2), 95–110.
Nakai, K., & Kanehisa, M. (1992). A knowledge base for predicting protein localization sites in eukaryotic cells. Genomics, 14(4), 897–911.
O’Hagan, A., Murphy, T. B., Gormley, I. C., McNicholas, P. D., & Karlis, D. (2016). Clustering with the multivariate normal inverse Gaussian distribution. Computational Statistics and Data Analysis, 93, 18–30.
Peel, D., & McLachlan, G. J. (2000). Robust mixture modelling using the t distribution. Statistics and Computing, 10(4), 339–348.
Pocuca, N., Browne, R. P., & McNicholas, P. D. (2022). mixture: Mixture models for clustering and classification. R package version 2.0.5.
Schwarz, G. (1978). Estimating the dimension of a model. The Annals of Statistics, 6(2), 461–464.
Steinley, D. (2004). Properties of the Hubert-Arabie adjusted Rand index. Psychological Methods, 9, 386–396.
Streuli, H. (1973). Der heutige Stand der Kaffeechemie. In 6th International Colloquium on Coffee Chemistry (pp. 61–72).
Subedi, S., & McNicholas, P. D. (2014). Variational Bayes approximations for clustering via mixtures of normal inverse Gaussian distributions. Advances in Data Analysis and Classification, 8(2), 167–193.
Tipping, M. E., & Bishop, C. M. (1999). Mixtures of probabilistic principal component analysers. Neural Computation, 11(2), 443–482.
Tortora, C., ElSherbiny, A., Browne, R. P., Franczak, B. C., McNicholas, P. D., & Amos, D. D. (2018). MixGHD: Model-based clustering, classification and discriminant analysis using the mixture of generalized hyperbolic distributions. R package version 2.2.
Venables, W. N., & Ripley, B. D. (2002). Modern applied statistics with S (4th ed.). New York: Springer. ISBN 0-387-95457-0.
Verdoolaege, G., De Backer, S., & Scheunders, P. (2008). Multiscale colour texture retrieval using the geodesic distance between multivariate generalized Gaussian models. In 2008 15th IEEE International Conference on Image Processing (pp. 169–172).
Vrbik, I., & McNicholas, P. D. (2014). Parsimonious skew mixture models for model-based clustering and classification. Computational Statistics and Data Analysis, 71, 196–210.
Vrbik, I., & McNicholas, P. D. (2015). Fractionally-supervised classification. Journal of Classification, 32(3), 359–381.
Wolfe, J. H. (1965). A computer program for the maximum likelihood analysis of types. U.S. Naval Personnel Research Activity, Technical Bulletin 65-15.
Zhu, X., Sarkar, S., & Melnykov, V. (2022). MatTransMix: An R package for matrix model-based clustering and parsimonious mixture modeling. Journal of Classification, 39(1), 147–170.
Acknowledgements
This research was supported by the Natural Sciences and Engineering Research Council of Canada through their Discovery Grants program for Dang, Browne, and McNicholas, the Banting Postdoctoral Fellowship for Gallaugher, as well as the Canada Research Chairs program for McNicholas.
Ethics declarations
Conflict of Interest
The authors declare no competing interests.
Additional information
Code and Data Accessibility
All data used here are publicly available; references have been provided within the bibliography. An implementation of mixtures of skewed power exponential distributions and mixtures of power exponential distributions is available as the R package mixSPE (Dang et al., 2021).
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendices
Appendix. Supporting Information
Parameter Recovery
A three-component mixture is simulated with 500 observations in total. Group sample sizes are sampled from a multinomial distribution with mixing proportions \((0.2, 0.34, 0.46)^{\prime }\). The first component is simulated from a heavy-tailed three-dimensional MSPE distribution with \(\boldsymbol {\mu }_{1} = (12, 14, 0)^{\prime }\), β1 = 0.85, and \(\boldsymbol {\psi }_{1}=(-5, -10, 0)^{\prime }\). The second component is simulated with \(\boldsymbol {\mu }_{2} = (-3, -10, 0)^{\prime }\), β2 = 0.9, and \(\boldsymbol {\psi }_{2}=(15, 10, 0)^{\prime }\). The third component is simulated with light tails with \(\boldsymbol {\mu }_{3} = (3, 1, 0)^{\prime }\), β3 = 2, and ψ3 = ψ2. Lastly, the scale matrices were common to all three components with diag \((\boldsymbol {\Delta }_{g}) = (4, 3, 1)^{\prime }\) and
for g = 1, 2, 3. The simulated components were well separated to show parameter recovery. The MSPE model was fitted to 100 such datasets, and perfect classification was obtained on these well-separated data each time. Note that there is an identifiability issue with the individual parameter estimates (different combinations of individual parameter estimates yield the same fit), and closed-form expressions for the overall mean and variance are not available. Hence, we demonstrate parameter recovery of the overall cluster-specific means and covariances in Table 6 by comparing estimates from data simulated (via a Metropolis-Hastings rule) using the individual parameter estimates from the GEM fit. Overall, the estimates are close to the generating values.
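As an illustration of the first step of the simulation scheme above (the paper's implementation is the R package mixSPE; this standalone Python sketch reproduces only the group-size draw, with an arbitrary seed), the component sample sizes can be drawn from a multinomial distribution with the stated mixing proportions:

```python
import numpy as np

rng = np.random.default_rng(2023)  # seed chosen arbitrarily for reproducibility

n = 500                                # total number of observations
pi = np.array([0.20, 0.34, 0.46])      # mixing proportions from the appendix

# Component sample sizes (n_1, n_2, n_3) drawn once from Multinomial(n, pi)
sizes = rng.multinomial(n, pi)
```

Each dataset's component sizes then vary around the expected counts (100, 170, 230) while always summing to 500.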
Performance on Additional Datasets
In addition to those considered in the main body of the manuscript, we also considered the following data available through various R packages:
Wine Dataset
The expanded wine dataset, obtained from pgmm, contains 27 measurements of chemical aspects of 178 Italian wines of three types.
Iris Dataset
The iris dataset (included with R) consists of 150 observations, 50 from each of three species of iris. Four variables were measured: petal length, petal width, sepal length, and sepal width.
Swiss Banknote Dataset
The Swiss banknote dataset, obtained from MixGHD (Tortora et al., 2018), comprises six measurements from 100 genuine and 100 counterfeit banknotes: length, length of the diagonal, widths of the right and left edges, and the top and bottom margin widths.
Crabs Dataset
The crabs dataset, obtained from MASS (Venables & Ripley, 2002), contains 200 observations on five variables measuring characteristics of crabs. There were 100 males and 100 females from two species, orange and blue, yielding four groups based on sex/species combinations.
Bankruptcy Dataset
The bankruptcy dataset, obtained from MixGHD, contains the ratio of retained earnings to total assets, and the ratio of earnings before interest and taxes to total assets, for 33 financially sound and 33 bankrupt American firms.
Yeast Dataset
A subset of the yeast dataset from Nakai and Kanehisa (1991, 1992), sourced through the MixSAL package (Franczak et al., 2018), is also used. There are measurements on three variables: McGeoch’s method for signal sequence recognition, the score of the ALOM membrane-spanning region prediction program, and the score of discriminant analysis of the amino acid content of vacuolar and extracellular proteins. The two possible cellular localization sites for the proteins are CYT (cytosolic or cytoskeletal) and ME3 (membrane protein, no N-terminal signal).
Diabetes Dataset
The diabetes dataset, obtained from mclust (Fraley et al., 2012), considered 145 non-obese adult patients whose diabetes was classified as normal, overt, or chemical. There were three measurements: the area under the plasma glucose curve, the area under the plasma insulin curve, and the steady-state plasma glucose.
Unsupervised Classification
Unsupervised classification, i.e., clustering, is performed on the (scaled) datasets mentioned above using the same comparison distributions as on the simulated data. The ARI and the number of groups chosen by the BIC are shown in Table 7.
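The ARI reported in Table 7 (Hubert & Arabie, 1985) can be computed from pair counts over the contingency table of the two partitions. A minimal Python sketch of the standard pair-counting form of the index (an illustration, not the paper's R code) is:

```python
from math import comb
from collections import Counter

def adjusted_rand_index(true_labels, pred_labels):
    """Adjusted Rand index between two partitions of the same observations."""
    # Pair counts from the contingency table of the two partitions
    pairs = Counter(zip(true_labels, pred_labels))
    rows = Counter(true_labels)
    cols = Counter(pred_labels)
    n = len(true_labels)

    sum_cells = sum(comb(c, 2) for c in pairs.values())
    sum_rows = sum(comb(c, 2) for c in rows.values())
    sum_cols = sum(comb(c, 2) for c in cols.values())

    expected = sum_rows * sum_cols / comb(n, 2)   # chance-expected index
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:   # degenerate case, e.g. one cluster in each partition
        return 1.0
    return (sum_cells - expected) / (max_index - expected)
```

An ARI of 1 indicates perfect agreement with the known labels, while a value near 0 is what one expects from random assignment.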
The banknote data is interesting: while the two elliptical mixtures, the MPE and Gaussian mixture models, split the counterfeit and genuine banknotes into four groups with the same overall classification (Table 8), the selected MSPE model fits three components, splitting the counterfeit banknotes into a larger and a smaller component, while the ghpcm model splits the observations into only two groups. For the crabs, the MPE distribution exhibits the best performance (with eight “blue males” misclassified into a different component), while the other three methods choose three components; however, the clusters found differ slightly (Table 8). For example, the MSPE model perfectly separates one species of crab from the other; however, for the “blue” species, it does not differentiate between the sexes. For the second species, there are only four misclassifications in differentiating the sexes. The ghpcm model has a similar fit to the MSPE model (species separated nicely but not sexes), but with two fewer misclassifications. The gpcm model had a poorer fit compared to the known labels, clustering sex better than species. The bankruptcy data show some interesting results. The ghpcm model fits three clusters to the data, two of them small with two and three observations, respectively, and performs poorly compared to the known labels (Table 8). The other three comparators performed somewhat similarly, with MSPE fitting four components (including one with eight tightly clustered points). For the diabetes data, the two selected skewed mixtures under-fit the number of components (they appear to combine the normal and chemical classes but are able to differentiate them from the overt class) compared to the elliptical mixtures, which fare better and similarly. Moreover, the estimates of skewness were not trivial, and both groups had heavier tails in the MSPE fit. On the other hand, for the MPE fit, the tails were approximately Gaussian (common βg ≈ 1).
For the yeast dataset, apart from the ghpcm mixtures, the other three mixtures overfit the number of components; however, the ghpcm mixture clustering was not meaningful compared to the known labels. For the iris data, only the MPE mixtures’ selected model had three components. For the expanded 27-dimensional wine dataset, the gpcm mixtures perform best, with perfect classification. Interestingly, the gpcm mixtures have relatively poorer performance in the semi-supervised fits on these data. Note that this phenomenon, whereby cluster analysis can obtain better results than semi-supervised classification, has been noted before, e.g., by Vrbik and McNicholas (2015) and Gallaugher and McNicholas (2019).
The relative performance of the MPE versus MSPE mixtures in Table 9 suggests that there are cases in which using these skewed mixtures might not be ideal and could be a possible subject of future work.
Semi-supervised Classification
For each dataset, we take 25 labelled/unlabelled splits with 25% supervision. In Table 9, we display the median ARI values along with the first and third quartiles over the 25 splits. For the most part, as found in the main text of this manuscript, performance in the semi-supervised scenarios was better than in the fully unsupervised scenarios. Performance across the four comparators was also quite comparable, with few exceptions. For the yeast data, both in the unsupervised and semi-supervised contexts, the MSPE mixtures performed the best. Similarly, for the diabetes data, the MSPE mixtures perform the best, with the gpcm mixtures having the poorest classification performance. On the 27-dimensional wine data, the MPE mixtures performed well, with the MSPE mixtures having more variability in ARI across the runs. For the iris dataset, the MPE and ghpcm models showed the best overall performance while, for the banknote dataset, all algorithms exhibit similar performance. For the bankruptcy data, the gpcm algorithm performed the poorest while, for crabs, the gpcm algorithm performed the best along with the ghpcm and MPE mixtures, which had similar performance. For the crabs dataset, although the MSPE models had poorer classification compared to the other three mixtures, the performance was still close to that of the other mixture distributions.
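One way such a 25%-supervision split might be generated is sketched below in Python. This is an assumption about the splitting scheme, not the paper's code (`supervised_mask` is a hypothetical helper), and it draws the labelled set uniformly at random:

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed for reproducibility

def supervised_mask(n, supervision=0.25):
    """Boolean mask: True where an observation's label is retained as known."""
    n_labelled = int(round(supervision * n))
    mask = np.zeros(n, dtype=bool)
    # Choose which observations keep their labels, without replacement
    mask[rng.choice(n, size=n_labelled, replace=False)] = True
    return mask

# One of 25 random labelled/unlabelled splits for, e.g., the 200 crabs
mask = supervised_mask(200)
```

In practice one might instead stratify the labelled set by class so that every component is represented among the labelled observations.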
A reviewer noted that, using mixtures of canonical fundamental skew t (CFUST) distributions (Lee & McLachlan, 2016), one can obtain an ARI close to 1 with only one misclassification. However, we were unable to obtain this solution from CFUST, perhaps due to a different initialization. For the semi-supervised runs, this solution could be obtained for all four comparator distributions (for the best out of 25 runs).
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Dang, U.J., Gallaugher, M.P., Browne, R.P. et al. Model-Based Clustering and Classification Using Mixtures of Multivariate Skewed Power Exponential Distributions. J Classif 40, 145–167 (2023). https://doi.org/10.1007/s00357-022-09427-7