Skew t mixture of experts
Introduction
Mixture of experts (MoE) [31] is a popular framework in statistics and machine learning for modeling heterogeneity in data for regression, classification and clustering. An MoE is a fully conditional mixture model in which both the mixing proportions, known as the gating functions, and the component densities, known as the experts, are conditional on some input covariates. MoE have been investigated, in their simple form as well as in their hierarchical form [34] (e.g., Section 5.12 of [44]), for regression and for model-based cluster and discriminant analyses, and in different application domains. MoE have also been investigated for rank data [20] and network data [21] with social science applications. A survey on the topic can be found in [22], and a complete review of MoE models in [65]. MoE for continuous data are usually based on the normal distribution; throughout this paper, we call the MoE using the normal distribution the normal mixture of experts, abbreviated as NMoE. However, it is well known that the normal distribution is sensitive to outliers. Moreover, for a set of data containing a group or groups of observations with heavy tails or asymmetric behavior, the use of normal experts may be unsuitable and can unduly affect the fit of the MoE model. In this paper, we attempt to overcome these limitations by proposing a more adapted and robust mixture of experts model that can deal with possibly skewed, heavy-tailed data and with outliers.
Recently, the problem of sensitivity of NMoE to outliers has been considered by [48], where the authors proposed a Laplace mixture of linear experts (LMoLE) for robust modeling of non-linear regression data. The model parameters are estimated by maximizing the observed-data likelihood via a minorization-maximization (MM) algorithm. Here, we propose an alternative MoE model by relying on another non-normal distribution that generalizes the normal distribution, namely the skew-t distribution introduced quite recently by [4]. We call the proposed MoE model the skew-t mixture of experts (STMoE). One may instead use the t distribution, as in the t mixture of experts (TMoE) proposed by [9], [10], which provides a natural robust extension of the normal distribution for modeling data with heavier tails and dealing with possible outliers. The robustness of the t distribution may, however, not be sufficient in the presence of asymmetric observations. In mixture modeling, to deal with this issue for skewed data, [41] proposed the univariate skew-t mixture model, which accommodates both skewness and thick tails in the data by relying on the skew-t distribution [4]. For the general multivariate case using skew-t mixtures, one can refer to [36], [37], [38], [40], [51], and, recently, to the unifying framework for previous restricted and unrestricted skew-t mixtures using the CFUST distribution [39]. We note that the STMoE model presented in this paper is more general and more robust than the TMoE model of [9], [10]. As discussed in Section 3.2, the TMoE model can be seen as a particular case of the STMoE model when the skewness parameter goes to zero. The presented model is therefore more robust, as it is able to accommodate more complex data distributions where the data, or a group of the data, are skewed and affected by atypical observations; the TMoE model may fail in this context.
The inference in the previously described approaches is performed by maximum likelihood estimation via the expectation-maximization (EM) algorithm or its extensions [15], [43], in particular the expectation conditional maximization (ECM) algorithm [46]. [18] have also considered the Bayesian inference framework, notably for skew-t mixtures.
In the regression context, the robust modeling of regression data has been studied notably by [5], [29], [63], who considered a mixture of linear regressions using the t distribution. In the same context, [58] proposed the mixture of Laplace regressions, which was then extended by [48] to the mixture of experts setting by introducing the Laplace mixture of linear experts (LMoLE). Recently, [66] introduced the scale mixtures of skew-normal distributions for robust mixture regressions. However, unlike our proposed STMoE model, the regression mixture models of [5], [29], [58], [63], [66] do not consider conditional mixing proportions, that is, mixing proportions depending on some input variables, as in the case of mixture of experts, which we investigate here. In addition, the approaches of [5], [29], [58], [63] do not address robustness to outliers together with the handling of possibly asymmetric data.
Here, we consider the mixture of experts framework for non-linear regression problems and for model-based clustering of regression data, and we attempt to overcome the limitations of the NMoE model in dealing with asymmetric, heavy-tailed data that may contain outliers. We investigate the use of the skew-t distribution for the experts, rather than the commonly used normal distribution. We propose the skew-t mixture of experts (STMoE) model, which accommodates both skewness and heavy tails in the data and is robust to outliers. This model is an extension of the unconditional skew-t mixture model [41] to the mixture of experts (MoE) framework, where the mixture means are regression functions and the mixing proportions are also covariate-varying.
For the model inference, we develop a dedicated expectation conditional maximization (ECM) algorithm to estimate the model parameters by monotonically maximizing the observed-data log-likelihood. The expectation-maximization algorithm and its extensions [15], [43] are indeed very popular and successful estimation algorithms for mixture models in general and for mixtures of experts in particular. Moreover, the EM algorithm for MoE has been shown by [47] to monotonically maximize the MoE likelihood. The authors showed that the EM algorithm (with Iteratively Reweighted Least Squares (IRLS) in this case) has stable convergence and that the log-likelihood increases monotonically when a learning rate smaller than one is adopted for the IRLS procedure within the M-step of EM. They further proposed an expectation conditional maximization (ECM) algorithm to train MoE, which also has desirable numerical properties. The MoE has also been considered in the Bayesian framework; for example, one can cite the Bayesian MoE [61], [62] and the Bayesian hierarchical MoE [7]. A related MoE considering the asymmetric t distribution in a Bayesian framework is proposed by [67]. Beyond the Bayesian parametric framework, MoE models have also been investigated within the Bayesian non-parametric framework; we cite for example the Bayesian non-parametric MoE model [54] and the Bayesian non-parametric hierarchical MoE approach of [30] using Gaussian process experts for regression. For further models on mixtures of experts for regression, the reader is referred, for example, to the book of [57]. In this paper, we investigate semi-parametric models under the maximum likelihood estimation framework.
The remainder of this paper is organized as follows. Section 2 briefly recalls the normal MoE framework. Then, in Section 3, we present the STMoE model, and in Section 4 the parameter estimation technique using the ECM algorithm. We then investigate in Section 5 the use of the proposed model for non-linear regression and for prediction. We also show in Section 6 how the model can be used from a model-based clustering perspective. In Section 7, we discuss model selection. Section 8 is dedicated to the experimental study assessing the proposed model. Finally, in Section 9, conclusions are drawn and future work is discussed.
Section snippets
Mixture of experts for continuous data
Mixtures of experts [31], [34] are used in a variety of contexts including regression, classification and clustering. Here, we consider the MoE framework for fitting (non-linear) regression functions and for clustering univariate continuous data. The aim of regression is to explore the relationship of an observed random variable Y given a covariate vector x via conditional density functions of the form f(y|x), rather than only exploring the unconditional distribution of Y.
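As an illustration of the conditional mixture just described, the following sketch evaluates the NMoE conditional density f(y|x) with a softmax gating network and normal experts with polynomial means. All names and shapes (`W`, `beta`, `sigma`) are illustrative and do not follow the paper's notation.

```python
import numpy as np
from scipy.stats import norm

def softmax(a):
    # numerically stable softmax over the last axis
    a = a - a.max(axis=-1, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=-1, keepdims=True)

def nmoe_density(y, x, W, beta, sigma):
    """Conditional density f(y | x) of a K-component normal MoE.

    W     : (K, q+1) gating (logistic) coefficients
    beta  : (K, p+1) expert polynomial regression coefficients
    sigma : (K,)     expert standard deviations
    Names and shapes are illustrative, not the paper's notation.
    """
    r = np.array([x**j for j in range(W.shape[1])])      # gating covariates
    u = np.array([x**j for j in range(beta.shape[1])])   # expert covariates
    gates = softmax(W @ r)                               # mixing proportions pi_k(x)
    means = beta @ u                                     # expert means mu_k(x)
    return float(np.sum(gates * norm.pdf(y, loc=means, scale=sigma)))
```

Because the gates sum to one for every x, the function integrates to one in y for any fixed x, as a mixture density must.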
The skew t mixture of experts (STMoE) model
The proposed skew-t mixture of experts (STMoE) model is an MoE model in which the expert components have a skew-t density, rather than the standard normal one as in the NMoE model. The skew-t distribution [4], which is a robust generalization of the skew-normal distribution [2], [3], as well as its stochastic and hierarchical representations, which will be used to define the proposed STMoE model, are recalled in the following section.
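For reference, the univariate skew-t density of [4] can be written directly in terms of the t density and cdf: f(y) = (2/σ) t_ν(d) T_{ν+1}(λ d √((ν+1)/(d²+ν))) with d = (y−μ)/σ. The sketch below uses hypothetical parameter names (`lam` for the skewness parameter λ, `nu` for the degrees of freedom ν) and reduces to the t density when `lam = 0`.

```python
import numpy as np
from scipy.stats import t as student_t

def skew_t_pdf(y, mu=0.0, sigma=1.0, lam=0.0, nu=10.0):
    # Azzalini-Capitanio skew-t density at y:
    # (2/sigma) * t_nu(d) * T_{nu+1}(lam * d * sqrt((nu+1)/(d^2+nu)))
    d = (y - mu) / sigma
    return (2.0 / sigma) * student_t.pdf(d, nu) \
        * student_t.cdf(lam * d * np.sqrt((nu + 1.0) / (d**2 + nu)), nu + 1.0)
```

When `lam = 0` the cdf factor equals 1/2 and the expression collapses to the symmetric t density; large positive `lam` skews the density to the right.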
Maximum likelihood estimation of the STMoE model
The unknown parameter vector of the STMoE model is estimated by maximizing the observed-data log-likelihood given an observed i.i.d. sample of n responses and their corresponding predictors. We perform this maximization iteratively via a dedicated ECM algorithm. The complete data consist of the observations as well as the latent component labels and the latent variables of the hierarchical representation of the skew-t distribution.
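A minimal sketch of the observed-data log-likelihood computation is given below, using a logsumexp over components for numerical stability. The design matrices `X` (experts) and `R` (gating) and the parameter layout are illustrative assumptions, not the paper's notation.

```python
import numpy as np
from scipy.stats import t as student_t
from scipy.special import logsumexp

def skew_t_logpdf(y, mu, sigma, lam, nu):
    # log-density of the skew-t of [4]: location mu, scale sigma,
    # skewness lam, degrees of freedom nu
    d = (y - mu) / sigma
    return (np.log(2.0) - np.log(sigma)
            + student_t.logpdf(d, nu)
            + student_t.logcdf(lam * d * np.sqrt((nu + 1.0) / (d**2 + nu)), nu + 1.0))

def stmoe_loglik(y, X, R, W, Beta, sigma, lam, nu):
    """Observed-data log-likelihood sum_i log sum_k pi_k(r_i) ST(y_i; mu_k(x_i), ...).

    X : (n, p+1) expert design matrix, R : (n, q+1) gating design matrix,
    W : (K, q+1), Beta : (K, p+1); shapes are illustrative assumptions.
    """
    A = R @ W.T                                        # (n, K) gating scores
    log_gates = A - logsumexp(A, axis=1, keepdims=True)
    mu = X @ Beta.T                                    # (n, K) expert means
    log_experts = np.column_stack(
        [skew_t_logpdf(y, mu[:, k], sigma[k], lam[k], nu[k])
         for k in range(len(sigma))])
    return float(logsumexp(log_gates + log_experts, axis=1).sum())
```

An ECM implementation would monotonically increase this quantity across iterations, which also makes it a useful convergence check.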
Prediction using the STMoE
The goal in regression is to make predictions for the response variable(s) given some new value of the predictor variable(s), on the basis of a model trained on a set of training data. In regression analysis using mixtures of experts, the aim is therefore to predict the response y given new values of the predictors (x, r), on the basis of an MoE model characterized by a parameter vector inferred from a set of training data, here by maximum likelihood via EM. These predictions can be
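One natural point prediction is the conditional mean ŷ = Σ_k π_k(r) E[Y | x, k], where for a skew-t expert the mean adds a skewness-dependent shift δ √(ν/π) Γ((ν−1)/2)/Γ(ν/2), with δ = λ/√(1+λ²), to the location μ_k(x) (finite for ν_k > 1). A sketch under these assumptions, with illustrative names:

```python
import numpy as np
from scipy.special import gammaln

def skew_t_mean_shift(lam, nu):
    # mean of a standard skew-t ST(0, 1, lam, nu); finite only for nu > 1
    delta = lam / np.sqrt(1.0 + lam**2)
    return delta * np.sqrt(nu / np.pi) * np.exp(gammaln((nu - 1.0) / 2.0)
                                                - gammaln(nu / 2.0))

def stmoe_predict(x_row, r_row, W, Beta, sigma, lam, nu):
    # E[Y | x] = sum_k pi_k(r) * (mu_k(x) + sigma_k * shift_k)
    a = W @ r_row
    gates = np.exp(a - a.max())
    gates /= gates.sum()
    shifts = np.array([skew_t_mean_shift(l, v) for l, v in zip(lam, nu)])
    means = Beta @ x_row + sigma * shifts
    return float(gates @ means)
```

With zero skewness the shift vanishes and the prediction reduces to the familiar gated mixture of expert regression means.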
Model-based clustering using the STMoE
The MoE models can also be used from a model-based clustering perspective, to provide a partition of the regression data into K clusters. Model-based clustering using the proposed STMoE consists in assuming that the observed data are generated from a K-component mixture of skew-t experts, with the STMoE components interpreted as clusters. The problem of clustering therefore becomes the one of estimating the MoE parameters.
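The partition follows from the posterior component probabilities τ_ik ∝ π_k(r_i) f(y_i | x_i; θ_k), each observation being assigned to the component maximizing τ_ik (the MAP rule). The sketch below uses normal experts as a stand-in for the skew-t experts, purely for brevity; names and shapes are illustrative.

```python
import numpy as np
from scipy.stats import norm

def softmax(A):
    A = A - A.max(axis=1, keepdims=True)
    E = np.exp(A)
    return E / E.sum(axis=1, keepdims=True)

def moe_map_partition(y, X, R, W, Beta, sigma):
    """Posterior probabilities tau_ik and the MAP partition argmax_k tau_ik.

    Normal experts stand in for the skew-t experts of the STMoE here;
    only the expert density would change in the STMoE case.
    """
    gates = softmax(R @ W.T)                                # (n, K)
    mu = X @ Beta.T                                         # (n, K)
    dens = norm.pdf(y[:, None], loc=mu, scale=sigma[None, :])
    tau = gates * dens
    tau /= tau.sum(axis=1, keepdims=True)
    return tau, tau.argmax(axis=1)
```

The same posterior probabilities are the E-step quantities of the estimation algorithm, so clustering comes essentially for free once the model is fitted.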
Model selection for the STMoE
One of the issues in mixture model-based clustering is model selection. The problem of model selection for the STMoE model presented here in its general form is equivalent to that of choosing the optimal number of experts K, the degree p of the polynomial regression and the degree q of the logistic regression. The optimal value of the triplet (K, p, q) can be computed by using model selection criteria such as the Akaike Information Criterion (AIC) [1] and the Bayesian Information Criterion (BIC).
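A BIC-based selection over candidate triplets (K, p, q) might look as follows. The parameter count is an assumption consistent with the model described here: (K−1)(q+1) gating coefficients, plus, per expert, p+1 regression coefficients and a scale, a skewness and a degrees-of-freedom parameter; the `select_model` interface is hypothetical.

```python
import numpy as np

def stmoe_nparams(K, p, q):
    # assumed free-parameter count of a K-component STMoE with
    # polynomial degrees p (experts) and q (gating)
    return (K - 1) * (q + 1) + K * ((p + 1) + 3)

def bic(loglik, K, p, q, n):
    # BIC = log L - (nu/2) log n, in the convention where it is maximized
    return loglik - 0.5 * stmoe_nparams(K, p, q) * np.log(n)

def select_model(logliks, n):
    """logliks: dict mapping (K, p, q) -> maximized log-likelihood on n points.
    Returns the candidate triplet with the highest BIC. Hypothetical interface."""
    return max(logliks, key=lambda kpq: bic(logliks[kpq], *kpq, n))
```

The penalty term trades off fit against complexity, so a larger K is retained only when its log-likelihood gain exceeds the cost of its extra parameters.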
Experimental study
This section is dedicated to the evaluation of the proposed approach on simulated and real-world data. We evaluate the performance of the proposed ECM algorithm for the STMoE model in terms of modeling, robustness to outliers and clustering.
Concluding remarks and future work
In this paper, we proposed a new non-normal MoE model that generalizes the normal MoE model and attempts to simultaneously accommodate heavy-tailed data, possible outliers and asymmetric distributions. The proposed STMoE is based on the flexible skew-t distribution, which is suited to possibly non-symmetric, heavy-tailed and noisy data. We developed an ECM algorithm for model inference and described the use of the model in non-linear regression and prediction as well as in model-based clustering.
References (67)
- Robust fitting of mixture regression models, Comput. Stat. Data Anal. (2012)
- Robust mixture of experts modeling using the t distribution, Neural Netw. (2016)
- Time series modeling by a regression approach based on a latent process, Neural Netw. (2009)
- Improved learning algorithms for mixture of experts in multiclass classification, Neural Netw. (1999)
- A mixture of experts latent position cluster model for social network data, Stat. Methodol. (2010)
- Convergence results for the EM approach to mixtures of experts architectures, Neural Netw. (1995)
- Laplace mixture of linear experts, Comput. Stat. Data Anal. (2016)
- Clustering with the multivariate normal inverse Gaussian distribution, Comput. Stat. Data Anal. (2016)
- Robust mixture regression model fitting by Laplace distribution, Comput. Stat. Data Anal. (2014)
- Mixtures of regressions with predictor-dependent mixing proportions, Comput. Stat. Data Anal. (2010)
- Flexible modeling of conditional distributions using smooth mixtures of asymmetric Student t densities, J. Stat. Plan. Inference
- A new look at the statistical model identification, IEEE Trans. Autom. Control
- A class of distributions which includes the normal ones, Scand. J. Stat.
- Further results on a class of distributions which includes the normal ones, Scand. J. Stat.
- Distributions generated by perturbation of symmetry with emphasis on a multivariate skew t distribution, J. R. Stat. Soc. Ser. B
- Assessing a mixture model for clustering with the integrated completed likelihood, IEEE Trans. Pattern Anal. Mach. Intell.
- Bayesian hierarchical mixtures of experts, Proceedings of UAI'03, the Nineteenth Conference on Uncertainty in Artificial Intelligence
- Algorithms for Minimization Without Derivatives, Chapter 5
- A regression model with a hidden logistic process for feature extraction from time series, Proceedings of the International Joint Conference on Neural Networks (IJCNN)
- Some effects of inharmonic partials on interval perception, Music Percept.
- Maximum likelihood from incomplete data via the EM algorithm, J. R. Stat. Soc. B
- Fitting mixtures of linear regressions, J. Stat. Comput. Simul.
- Finite Mixture and Markov Switching Models, Springer Series in Statistics
- Bayesian inference for finite mixtures of univariate and multivariate skew-normal and skew-t distributions, Biostatistics
- Trajectory clustering with mixtures of regression models, Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
- A mixture of experts model for rank data with applications in election studies, Ann. Appl. Stat.
- Mixture of experts modelling with social science applications, J. Comput. Gr. Stat.
- Iteratively reweighted least squares for maximum likelihood estimation, and some robust and resistant alternatives, J. R. Stat. Soc. B
- GISS analysis of surface temperature change, J. Geophys. Res.
- A closer look at United States and global surface temperature change, J. Geophys. Res.
- A probabilistic representation of the skew-normal distribution, Scand. J. Stat.
- Semiparametric mixtures of regressions, J. Nonparametric Stat.
Faicel Chamroukhi has been, since 2016, a Full Professor of Statistics and Data Science at the University of Caen, Department of Mathematics and Computer Science, and the Lab of Mathematics Nicolas Oresme (LMNO) UMR CNRS 6139. From 2011 to 2015, he was an Associate Professor at the University of Toulon and the Information Sciences and Systems Lab (LSIS) UMR CNRS 7296. In 2015–2016 he was awarded a CNRS research leave at the Lab of Mathematics Paul Painlevé, UMR CNRS 8524 in Lille, where he was also an Invited Scientist at INRIA. He received his Master's degree in Engineering Sciences from Pierre & Marie Curie (Paris 6) University in 2007 and his Ph.D. degree in applied mathematics and computer science, in the area of statistical learning and data analysis, from Compiègne University of Technology in 2010. In 2011, he was qualified for the position of Associate Professor in applied mathematics (CNU 26), computer science (CNU 27), and signal processing (CNU 61). In 2015, he received his Accreditation to Supervise Research (HDR) in applied mathematics and computer science, in the area of statistical learning and data science, from Toulon University. In 2016, he was qualified for the position of Professor in the area of applied mathematics, computer science, and signal processing (CNU 26, 27, 61). His multidisciplinary research is in the area of Data Science and includes statistics, machine learning and statistical signal processing, with a particular focus on the statistical methodology and inference of latent data models for complex heterogeneous high-dimensional and massive data, temporal data, functional data, and their application to real-world problems including dynamical systems, acoustic/speech processing, life sciences (medicine, biology), information retrieval, and social networks.