Skew t mixture of experts
Introduction
Mixture of experts (MoE) [31] is a popular framework in statistics and machine learning for modeling heterogeneity in data for regression, classification and clustering. An MoE is a fully conditional mixture model in which both the mixing proportions, known as the gating functions, and the component densities, known as the experts, are conditional on some input covariates. MoE have been investigated, in their simple form as well as in their hierarchical form [34] (e.g., Section 5.12 of [44]), for regression and for model-based cluster and discriminant analyses, and in different application domains. MoE have also been investigated for rank data [20] and network data [21] with social science applications. A survey on the topic can be found in [22], and a complete review of MoE models in [65]. MoE for continuous data are usually based on the normal distribution; throughout this paper, we call the MoE using the normal distribution the normal mixture of experts, abbreviated as NMoE. However, it is well known that the normal distribution is sensitive to outliers. Moreover, for a set of data containing a group or groups of observations with heavy tails or asymmetric behavior, the use of normal experts may be unsuitable and can unduly affect the fit of the MoE model. In this paper, we attempt to overcome these limitations by proposing a more adapted and robust mixture of experts model that can deal with possibly skewed, heavy-tailed data and with outliers.
Recently, the problem of sensitivity of NMoE to outliers has been considered by [48], where the authors proposed a Laplace mixture of linear experts (LMoLE) for robust modeling of non-linear regression data. The model parameters are estimated by maximizing the observed-data likelihood via a minorization-maximization (MM) algorithm. Here, we propose an alternative MoE model by relying on another non-normal distribution that generalizes the normal distribution, namely the skew-t distribution introduced quite recently by [4]. We call the proposed MoE model the skew-t mixture of experts (STMoE). One may instead use the t distribution, as in the t mixture of experts (TMoE) proposed by [9], [10], which provides a natural robust extension of the normal distribution for modeling data with heavier tails and dealing with possible outliers. The robustness of the t distribution may, however, not be sufficient in the presence of asymmetric observations. In mixture modeling, to deal with this issue for skewed data, [41] proposed the univariate skew-t mixture model, which accommodates both skewness and thick tails in the data by relying on the skew-t distribution [4]. For the general multivariate case using skew-t mixtures, one can refer to [36], [37], [38], [40], [51], and, recently, to the unifying framework for previous restricted and unrestricted skew-t mixtures using the CFUST distribution [39]. We note that the STMoE model presented in this paper is more general and more robust than the TMoE model of [9], [10]. As discussed in Section 3.2, the TMoE model can be seen as a particular case of the STMoE model when the skewness parameter goes to zero. The presented model is therefore more robust, as it is able to accommodate more complex data distributions where the data, or a group of the data, are skewed and affected by atypical observations; the TMoE model may fail in this context.
The inference in the previously described approaches is performed by maximum likelihood estimation via the expectation-maximization (EM) algorithm or its extensions [15], [43], in particular the expectation conditional maximization (ECM) algorithm [46]. [18] have also considered the Bayesian inference framework, notably for skew-t mixtures.
In the regression context, the robust modeling of regression data has been studied notably by [5], [29], [63], who considered a mixture of linear regressions using the t distribution. In the same context, [58] proposed the mixture of Laplace regressions, which was then extended by [48] to the mixture of experts setting by introducing the Laplace mixture of linear experts (LMoLE). Recently, [66] introduced the scale mixtures of skew-normal distributions for robust mixture regressions. However, unlike our proposed STMoE model, the regression mixture models of [5], [29], [58], [63], [66] do not consider conditional mixing proportions, that is, mixing proportions depending on some input variables, as in the case of mixture of experts, which we investigate here. In addition, the approaches of [5], [29], [58], [63] do not address robustness to outliers together with the handling of possibly asymmetric data.
Here, we consider the mixture of experts framework for non-linear regression problems and for model-based clustering of regression data, and we attempt to overcome the limitations of the NMoE model in dealing with asymmetric, heavy-tailed data that may contain outliers. We investigate the use of the skew-t distribution for the experts, rather than the commonly used normal distribution. We propose the skew-t mixture of experts (STMoE) model, which accommodates both skewness and heavy tails in the data and is robust to outliers. This model is an extension of the unconditional skew-t mixture model [41] to the mixture of experts (MoE) framework, where the mixture means are regression functions and the mixing proportions are also covariate-varying.
For the model inference, we develop a dedicated expectation conditional maximization (ECM) algorithm to estimate the model parameters by monotonically maximizing the observed-data log-likelihood. The expectation-maximization algorithm and its extensions [15], [43] are indeed very popular and successful estimation algorithms for mixture models in general and for mixtures of experts in particular. Moreover, the EM algorithm for MoE has been shown by [47] to monotonically maximize the MoE likelihood. The authors showed that the EM algorithm (with Iteratively Reweighted Least Squares (IRLS) in this case) has stable convergence and that the log-likelihood increases monotonically when a learning rate smaller than one is adopted for the IRLS procedure within the M-step of EM. They further proposed an expectation conditional maximization (ECM) algorithm to train MoE, which also has desirable numerical properties. The MoE has also been considered in the Bayesian framework; for example, one can cite the Bayesian MoE [61], [62] and the Bayesian hierarchical MoE [7]. A related MoE considering the asymmetric t distribution in a Bayesian framework is proposed by [67]. Beyond the Bayesian parametric framework, MoE models have also been investigated within the Bayesian non-parametric framework; we cite for example the Bayesian non-parametric MoE model [54] and the Bayesian non-parametric hierarchical MoE approach of [30] using Gaussian process experts for regression. For further models on mixtures of experts for regression, the reader is referred, for example, to the book of [57]. In this paper, we investigate semi-parametric models under the maximum likelihood estimation framework.
The remainder of this paper is organized as follows. Section 2 briefly recalls the normal MoE framework. Then, in Section 3, we present the STMoE model, and in Section 4 the parameter estimation technique using the ECM algorithm. We then investigate in Section 5 the use of the proposed model for non-linear regression and for prediction. We also show in Section 6 how the model can be used from a model-based clustering perspective. In Section 7, we discuss model selection. Section 8 is dedicated to the experimental study assessing the proposed model. Finally, in Section 9, conclusions are drawn and future work is discussed.
Section snippets
Mixture of experts for continuous data
Mixtures of experts [31], [34] are used in a variety of contexts including regression, classification and clustering. Here, we consider the MoE framework for fitting (non-linear) regression functions and for clustering univariate continuous data. The aim of regression is to explore the relationship of an observed random variable Y given a covariate vector x via conditional density functions of the form f(y|x), rather than only exploring the unconditional distribution of Y.
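As an illustration of the conditional mixture just described, the following sketch evaluates the NMoE conditional density f(y|x) with a softmax gating network and normal experts with polynomial means. All names and shapes (`W`, `beta`, `sigma`) are illustrative and do not follow the paper's notation.

```python
import numpy as np
from scipy.stats import norm

def softmax(a):
    # numerically stable softmax over the last axis
    a = a - a.max(axis=-1, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=-1, keepdims=True)

def nmoe_density(y, x, W, beta, sigma):
    """Conditional density f(y | x) of a K-component normal MoE.

    W     : (K, q+1) gating (logistic) coefficients
    beta  : (K, p+1) expert polynomial regression coefficients
    sigma : (K,)     expert standard deviations
    Names and shapes are illustrative, not the paper's notation.
    """
    r = np.array([x**j for j in range(W.shape[1])])      # gating covariates
    u = np.array([x**j for j in range(beta.shape[1])])   # expert covariates
    gates = softmax(W @ r)                               # mixing proportions pi_k(x)
    means = beta @ u                                     # expert means mu_k(x)
    return float(np.sum(gates * norm.pdf(y, loc=means, scale=sigma)))
```

Because the gates sum to one for every x, the function integrates to one in y for any fixed x, as a mixture density must.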
The skew t mixture of experts (STMoE) model
The proposed skew-t mixture of experts (STMoE) model is an MoE model in which the expert components have a skew-t density, rather than the standard normal one as in the NMoE model. The skew-t distribution [4], which is a robust generalization of the skew-normal distribution [2], [3], as well as its stochastic and hierarchical representations, which will be used to define the proposed STMoE model, are recalled in the following section.
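For reference, the univariate skew-t density of [4] can be written directly in terms of the t density and cdf: f(y) = (2/σ) t_ν(d) T_{ν+1}(λ d √((ν+1)/(d²+ν))) with d = (y−μ)/σ. The sketch below uses hypothetical parameter names (`lam` for the skewness parameter λ, `nu` for the degrees of freedom ν) and reduces to the t density when `lam = 0`.

```python
import numpy as np
from scipy.stats import t as student_t

def skew_t_pdf(y, mu=0.0, sigma=1.0, lam=0.0, nu=10.0):
    # Azzalini-Capitanio skew-t density at y:
    # (2/sigma) * t_nu(d) * T_{nu+1}(lam * d * sqrt((nu+1)/(d^2+nu)))
    d = (y - mu) / sigma
    return (2.0 / sigma) * student_t.pdf(d, nu) \
        * student_t.cdf(lam * d * np.sqrt((nu + 1.0) / (d**2 + nu)), nu + 1.0)
```

When `lam = 0` the cdf factor equals 1/2 and the expression collapses to the symmetric t density; large positive `lam` skews the density to the right.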
Maximum likelihood estimation of the STMoE model
The unknown parameter vector of the STMoE model is estimated by maximizing the observed-data log-likelihood given an observed i.i.d. sample of n responses and their corresponding predictors. We perform this maximization iteratively via a dedicated ECM algorithm. The complete data consist of the observations as well as the latent component labels and the latent variables of the hierarchical representation of the skew-t distribution.
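A minimal sketch of the observed-data log-likelihood computation is given below, using a logsumexp over components for numerical stability. The design matrices `X` (experts) and `R` (gating) and the parameter layout are illustrative assumptions, not the paper's notation.

```python
import numpy as np
from scipy.stats import t as student_t
from scipy.special import logsumexp

def skew_t_logpdf(y, mu, sigma, lam, nu):
    # log-density of the skew-t of [4]: location mu, scale sigma,
    # skewness lam, degrees of freedom nu
    d = (y - mu) / sigma
    return (np.log(2.0) - np.log(sigma)
            + student_t.logpdf(d, nu)
            + student_t.logcdf(lam * d * np.sqrt((nu + 1.0) / (d**2 + nu)), nu + 1.0))

def stmoe_loglik(y, X, R, W, Beta, sigma, lam, nu):
    """Observed-data log-likelihood sum_i log sum_k pi_k(r_i) ST(y_i; mu_k(x_i), ...).

    X : (n, p+1) expert design matrix, R : (n, q+1) gating design matrix,
    W : (K, q+1), Beta : (K, p+1); shapes are illustrative assumptions.
    """
    A = R @ W.T                                        # (n, K) gating scores
    log_gates = A - logsumexp(A, axis=1, keepdims=True)
    mu = X @ Beta.T                                    # (n, K) expert means
    log_experts = np.column_stack(
        [skew_t_logpdf(y, mu[:, k], sigma[k], lam[k], nu[k])
         for k in range(len(sigma))])
    return float(logsumexp(log_gates + log_experts, axis=1).sum())
```

An ECM implementation would monotonically increase this quantity across iterations, which also makes it a useful convergence check.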
Prediction using the STMoE
The goal in regression is to make predictions for the response variable(s) given some new value of the predictor variable(s), on the basis of a model trained on a set of training data. In regression analysis using mixtures of experts, the aim is therefore to predict the response y given new values of the predictors (x, r), on the basis of an MoE model characterized by a parameter vector inferred from a set of training data, here by maximum likelihood via EM. These predictions can be
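One natural point prediction is the conditional mean ŷ = Σ_k π_k(r) E[Y | x, k], where for a skew-t expert the mean adds a skewness-dependent shift δ √(ν/π) Γ((ν−1)/2)/Γ(ν/2), with δ = λ/√(1+λ²), to the location μ_k(x) (finite for ν_k > 1). A sketch under these assumptions, with illustrative names:

```python
import numpy as np
from scipy.special import gammaln

def skew_t_mean_shift(lam, nu):
    # mean of a standard skew-t ST(0, 1, lam, nu); finite only for nu > 1
    delta = lam / np.sqrt(1.0 + lam**2)
    return delta * np.sqrt(nu / np.pi) * np.exp(gammaln((nu - 1.0) / 2.0)
                                                - gammaln(nu / 2.0))

def stmoe_predict(x_row, r_row, W, Beta, sigma, lam, nu):
    # E[Y | x] = sum_k pi_k(r) * (mu_k(x) + sigma_k * shift_k)
    a = W @ r_row
    gates = np.exp(a - a.max())
    gates /= gates.sum()
    shifts = np.array([skew_t_mean_shift(l, v) for l, v in zip(lam, nu)])
    means = Beta @ x_row + sigma * shifts
    return float(gates @ means)
```

With zero skewness the shift vanishes and the prediction reduces to the familiar gated mixture of expert regression means.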
Model-based clustering using the STMoE
The MoE models can also be used from a model-based clustering perspective, to provide a partition of the regression data into K clusters. Model-based clustering using the proposed STMoE consists in assuming that the observed data are generated from a K-component mixture of skew-t experts, with the STMoE components interpreted as clusters. The problem of clustering therefore becomes the one of estimating the MoE parameters.
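The partition follows from the posterior component probabilities τ_ik ∝ π_k(r_i) f(y_i | x_i; θ_k), each observation being assigned to the component maximizing τ_ik (the MAP rule). The sketch below uses normal experts as a stand-in for the skew-t experts, purely for brevity; names and shapes are illustrative.

```python
import numpy as np
from scipy.stats import norm

def softmax(A):
    A = A - A.max(axis=1, keepdims=True)
    E = np.exp(A)
    return E / E.sum(axis=1, keepdims=True)

def moe_map_partition(y, X, R, W, Beta, sigma):
    """Posterior probabilities tau_ik and the MAP partition argmax_k tau_ik.

    Normal experts stand in for the skew-t experts of the STMoE here;
    only the expert density would change in the STMoE case.
    """
    gates = softmax(R @ W.T)                                # (n, K)
    mu = X @ Beta.T                                         # (n, K)
    dens = norm.pdf(y[:, None], loc=mu, scale=sigma[None, :])
    tau = gates * dens
    tau /= tau.sum(axis=1, keepdims=True)
    return tau, tau.argmax(axis=1)
```

The same posterior probabilities are the E-step quantities of the estimation algorithm, so clustering comes essentially for free once the model is fitted.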
Model selection for the STMoE
One of the issues in mixture model-based clustering is model selection. The problem of model selection for the STMoE model presented here in its general form is equivalent to that of choosing the optimal number of experts K, the degree p of the polynomial regression and the degree q of the logistic regression. The optimal value of the triplet (K, p, q) can be computed by using model selection criteria such as the Akaike Information Criterion (AIC) [1] and the Bayesian Information Criterion (BIC).
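A BIC-based selection over candidate triplets (K, p, q) might look as follows. The parameter count is an assumption consistent with the model described here: (K−1)(q+1) gating coefficients, plus, per expert, p+1 regression coefficients and a scale, a skewness and a degrees-of-freedom parameter; the `select_model` interface is hypothetical.

```python
import numpy as np

def stmoe_nparams(K, p, q):
    # assumed free-parameter count of a K-component STMoE with
    # polynomial degrees p (experts) and q (gating)
    return (K - 1) * (q + 1) + K * ((p + 1) + 3)

def bic(loglik, K, p, q, n):
    # BIC = log L - (nu/2) log n, in the convention where it is maximized
    return loglik - 0.5 * stmoe_nparams(K, p, q) * np.log(n)

def select_model(logliks, n):
    """logliks: dict mapping (K, p, q) -> maximized log-likelihood on n points.
    Returns the candidate triplet with the highest BIC. Hypothetical interface."""
    return max(logliks, key=lambda kpq: bic(logliks[kpq], *kpq, n))
```

The penalty term trades off fit against complexity, so a larger K is retained only when its log-likelihood gain exceeds the cost of its extra parameters.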
Experimental study
This section is dedicated to the evaluation of the proposed approach on simulated and real-world data. We evaluate the performance of the proposed ECM algorithm for the STMoE model in terms of modeling, robustness to outliers and clustering.
Concluding remarks and future work
In this paper, we proposed a new non-normal MoE model that generalizes the normal MoE model and attempts to simultaneously accommodate heavy-tailed data, possible outliers and asymmetric distributions. The proposed STMoE is based on the flexible skew-t distribution, which is suited to possibly non-symmetric, heavy-tailed and noisy data. We developed an ECM algorithm for model inference and described the use of the model in non-linear regression and prediction as well as in model-based clustering.
References (67)
- Robust fitting of mixture regression models, Comput. Stat. Data Anal. (2012)
- Robust mixture of experts modeling using the t distribution, Neural Netw. (2016)
- Time series modeling by a regression approach based on a latent process, Neural Netw. (2009)
- Improved learning algorithms for mixture of experts in multiclass classification, Neural Netw. (1999)
- A mixture of experts latent position cluster model for social network data, Stat. Methodol. (2010)
- Convergence results for the EM approach to mixtures of experts architectures, Neural Netw. (1995)
- Laplace mixture of linear experts, Comput. Stat. Data Anal. (2016)
- Clustering with the multivariate normal inverse Gaussian distribution, Comput. Stat. Data Anal. (2016)
- Robust mixture regression model fitting by Laplace distribution, Comput. Stat. Data Anal. (2014)
- Mixtures of regressions with predictor-dependent mixing proportions, Comput. Stat. Data Anal. (2010)
- Flexible modeling of conditional distributions using smooth mixtures of asymmetric Student t densities, J. Stat. Plan. Inference
- A new look at the statistical model identification, IEEE Trans. Autom. Control
- A class of distributions which includes the normal ones, Scand. J. Stat.
- Further results on a class of distributions which includes the normal ones, Scand. J. Stat.
- Distributions generated by perturbation of symmetry with emphasis on a multivariate skew t distribution, J. R. Stat. Soc. Ser. B
- Assessing a mixture model for clustering with the integrated completed likelihood, IEEE Trans. Pattern Anal. Mach. Intell.
- Bayesian hierarchical mixtures of experts, Proceedings of UAI'03, the Nineteenth Conference on Uncertainty in Artificial Intelligence
- Algorithms for Minimization Without Derivatives, Chapter 5
- A regression model with a hidden logistic process for feature extraction from time series, Proceedings of the International Joint Conference on Neural Networks (IJCNN)
- Some effects of inharmonic partials on interval perception, Music Percept.
- Maximum likelihood from incomplete data via the EM algorithm, J. R. Stat. Soc. B
- Fitting mixtures of linear regressions, J. Stat. Comput. Simul.
- Finite Mixture and Markov Switching Models, Springer Series in Statistics
- Bayesian inference for finite mixtures of univariate and multivariate skew-normal and skew-t distributions, Biostatistics
- Trajectory clustering with mixtures of regression models, Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
- A mixture of experts model for rank data with applications in election studies, Ann. Appl. Stat.
- Mixture of experts modelling with social science applications, J. Comput. Gr. Stat.
- Iteratively reweighted least squares for maximum likelihood estimation, and some robust and resistant alternatives, J. R. Stat. Soc. B
- GISS analysis of surface temperature change, J. Geophys. Res.
- A closer look at United States and global surface temperature change, J. Geophys. Res.
- A probabilistic representation of the skew-normal distribution, Scand. J. Stat.
- Semiparametric mixtures of regressions, J. Nonparametric Stat.
Faicel Chamroukhi has been, since 2016, a Full Professor of Statistics and Data Science at the University of Caen, Department of Mathematics and Computer Science, and the Lab of Mathematics Nicolas Oresme (LMNO) UMR CNRS 6139. From 2011 to 2015, he was an Associate Professor at the University of Toulon and the Information Sciences and Systems Lab (LSIS) UMR CNRS 7296. In 2015–2016 he was awarded a CNRS research leave at the Lab of Mathematics Paul Painlevé, UMR CNRS 8524 in Lille, where he was also an Invited Scientist at INRIA. He received his Master's degree in Engineering Sciences from Pierre & Marie Curie (Paris 6) University in 2007 and his Ph.D. degree in applied mathematics and computer science, in the area of statistical learning and data analysis, from Compiègne University of Technology in 2010. In 2011, he was qualified for the position of Associate Professor in applied mathematics (CNU 26), computer science (CNU 27), and signal processing (CNU 61). In 2015, he received his Accreditation to Supervise Research (HDR) in applied mathematics and computer science, in the area of statistical learning and data science, from Toulon University. In 2016, he was qualified for the position of Professor in the area of applied mathematics, computer science, and signal processing (CNU 26, 27, 61). His multidisciplinary research is in the area of Data Science and includes statistics, machine learning and statistical signal processing, with a particular focus on the statistical methodology and inference of latent data models for complex heterogeneous high-dimensional and massive data, temporal data, functional data, and their application to real-world problems including dynamical systems, acoustic/speech processing, life sciences (medicine, biology), information retrieval, and social networks.