1 Introduction

The finite mixture (FM) approach (McLachlan & Peel, 2004) assumes that the population comprises \(G\) homogeneous subgroups. For each subgroup, a probability density function describes the data generating process, and each cluster is characterized by a group-specific set of regression parameters. The data are assumed to be drawn from a population divided into clusters that share functionally similar patterns. Homogeneous groups within the sample are identified, and the regression coefficients are computed within each group. The impact of the predictor variables differs between classes, and it is these differences that determine class membership. The class-specific coefficients are computed without any prior knowledge or assumption about the clustering. It is a model-based clustering approach that accounts for between-group heterogeneity (Compiani & Kitamura, 2016). Conversely, neglecting group heterogeneity leads to misleading results: the estimated coefficients would average the true values of each group without mirroring any of them, and the estimated error variance would be inflated by the group heterogeneity, thus causing flawed inference (Van Horn et al., 2013).

In the linear regression model, \(y_i=\boldsymbol{x}_i'\boldsymbol{\beta}+\varepsilon_i\), the standard finite mixture estimator relates the dependent variable \(y_i\) to the set of \(p\) explanatory variables \(\boldsymbol{x}_i\), while a non-observable latent variable \(z_i\) denotes the subgroup each unit belongs to. For a unit in the \(k\)th group, with \(k=1,\ldots,G\), the regression model is

$$y_i=\boldsymbol{x}_i'\boldsymbol{\beta}_k+\varepsilon_i.$$
(1)

The probability that a generic unit belongs to the \(k\)th component is denoted by \(\pi_k\). The group composition in the model is endogenously determined, and \(z_i\) is a non-observable latent variable assigning each observation to one group or another. Since \(z_i\) is unobservable, it is usually approximated by closely related observable variables. For each observation, the probability of inclusion in a group, \(\pi_k\), is a function of the latent indicator \(z_{ik}\), which assumes value 1 or 0 to define the inclusion or exclusion of the \(i\)th observation in the specific \(k\)th group:

$$\pi_k=\pi(z_{ik}=1)=\frac{\exp(z_{ik})}{\sum_{k=1,G}\exp(z_{ik})}.$$
(2)

The likelihood combines the conditional likelihood of each class weighted by the associated latent-class probability.

$$\prod_{i=1,n}\sum_{k=1,G}\pi_k\,f(y_i|\boldsymbol{x}_i,\boldsymbol{\beta}_k).$$
(3)

Unfortunately, we do not know from which group an observation is drawn: the observed sample is incomplete, since the latent variable \(z_i\) is unobservable. However, \(z_i\) can be approximated by a subset of group-specific explanatory variables \(\boldsymbol{x}_{iq}\), with \(q\le p\), and this subset is used to model the prior probability of component membership. A multinomial logit model computes \(\pi_k\) as

$$\pi_k=\frac{\exp(g_k(\boldsymbol{x}_{iq}))}{\sum_{k=1,G}\exp(g_k(\boldsymbol{x}_{iq}))},$$
(4)

where \(g_k(\boldsymbol{x}_{iq})\) is the function relating the probability of being in the \(k\)th class to the characteristics in the \(\boldsymbol{x}_{iq}\) subset. It is a model-based clustering approach where the mixture proportions follow a multinomial logistic model and the mixture components are linear models. The outcome variable distribution depends on both the covariates in \(\boldsymbol{x}_i\) and the latent cluster membership variable \(z_i\). Further details can be found in Bouveyron et al. (2019).
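As a worked special case, with \(G=2\) the multinomial logit in (4) reduces to the familiar binary logistic form,

$$\pi_1=\frac{\exp(g_1(\boldsymbol{x}_{iq}))}{\exp(g_1(\boldsymbol{x}_{iq}))+\exp(g_2(\boldsymbol{x}_{iq}))}=\frac{1}{1+\exp\{-(g_1(\boldsymbol{x}_{iq})-g_2(\boldsymbol{x}_{iq}))\}},\qquad \pi_2=1-\pi_1,$$

so only the contrast \(g_1-g_2\) is identified, which is why one component is conventionally taken as the baseline class.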

The estimation is an iterative process. The expectation–maximization (EM) algorithm (Dempster et al., 1977) computes the probability of belonging to a given group by (4), yielding the probability weights. Next, the weighted regression parameters are estimated in (3), and the iterations update weights and regression coefficients until convergence is reached.
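In Stata 15 these EM-type iterations are carried out internally by the fmm prefix. A minimal sketch, with illustrative variable names y, x1, x2:

```stata
* 2-component finite mixture of linear regressions; the EM-type
* iterations between (3) and (4) run internally until convergence
fmm 2: regress y x1 x2

* posterior probabilities of class membership (one variable per class)
predict pr*, classposteriorpr

* estimated class proportions, i.e., the pi_k
estat lcprob
```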

Alfò et al. (2017) and, independently, Furno and Caracciolo (2020) extend the finite mixture approach to the tails of the \({y}_{i}\) conditional distribution, at the quantiles.

The normality-based MLE is sensitive to outliers and heavy-tailed error distributions. Alfò et al. (2017) assume an asymmetric Laplace distribution to compute quantile regressions within the likelihood framework, since quantile regressions ensure robust estimates. To model group heterogeneity, they introduce a set of individual-specific random effects \(\boldsymbol{b}_k\) in the linear model (1):

$$\mathrm{E}(y_i|\boldsymbol{x}_i,\boldsymbol{b}_k)=\boldsymbol{x}_i'\boldsymbol{\beta}+\boldsymbol{w}_i'\boldsymbol{b}_k,$$
(5)

where \(\boldsymbol{w}_i\) is a \(q\)-dimensional subset of the design vector \(\boldsymbol{x}_i\), \(\boldsymbol{w}_i=\boldsymbol{x}_{iq}\), with \(q\le p\). The \(\boldsymbol{b}_k\) vary across groups, and it is usually assumed that \(\mathrm{E}(\boldsymbol{b}_k)=0\) to ensure identifiability of the \(\boldsymbol{\beta}\) vector. The \(\boldsymbol{b}_k\) are the group coefficients, meant to model unobserved group-specific heterogeneity and to account for dependence between responses recorded within the same cluster. The \(\boldsymbol{\beta}\) vector provides the global set of parameters shared by the entire sample. Equation (5) defines a random effects model, with \(\boldsymbol{b}_k\) providing the random coefficient effect.

In addition, Alfò et al. (2017), by introducing specific weighting systems, extend the model to compute expectiles and M-quantiles. Expectiles move the regression computed at the conditional mean toward the tails and, just like OLS, their estimates are affected by outliers. The M estimator, which replaces the least squares criterion with a robust criterion, is one of the most widely used robust regression estimators, and the M-quantiles combine the outlier control of the M estimator with the shifting location of the quantile regression.

Furno and Caracciolo (2020) consider a different version of the finite mixture estimator. The focus is on dividing the data set into homogeneous clusters, excluding the \(\boldsymbol{b}_k\) term, to allow between-group comparisons at the conditional mean, i.e., computing the standard FM estimator. Then the finite mixture model is moved toward the tails by implementing quantile regressions on the observations modified by the probability weights \(\pi_k\) as estimated at the conditional mean.

The difference between the two approaches lies in the modelling of heterogeneity. Both approaches implement the finite mixture not only on average, at the conditional mean, but also in the tails of the conditional distribution of the dependent variable. In Alfò et al. (2017) the individual heterogeneity is defined by the random effect term \(\boldsymbol{b}_k\), and the grouping varies with the quantiles. In Furno and Caracciolo (2020) the heterogeneity is modelled by splitting the sample into homogeneous sub-groups at the conditional mean. Once the clusters are defined, the regression model is estimated in the tails within each class, and the grouping does not vary across quantiles.

The simulations look at relative bias, dispersion, expected shortfall, and Akaike values. In addition, contaminated error distributions − a standard approach to generate outliers − are considered in the simulations. Bartolucci and Scaccia (2005) specifically suggest implementing finite mixture regressions to model mixtures of normal errors. This set of experiments is particularly suitable for the Furno and Caracciolo estimator, with one group describing the central/good data and the other group gathering the outlying/tail observations. With heavy contamination, the relative bias increases. Bai et al. (2012) show how maximum likelihood is sensitive to non-normality, and they suggest robustifying finite mixture regressions by computing a robust regression in the EM steps of the iterations. We embed this idea into our framework by initializing the FM estimation process with a robust regression, i.e., pre-multiplying the original data by robust weights to curb outliers before implementing the analyzed estimators. This approach sizably improves the relative bias of the Furno and Caracciolo estimator, particularly for the random slope coefficient, and reduces the expected shortfall. Finally, increasing the number of groups \(G\) improves the average Akaike values of the Furno and Caracciolo estimator.

The case study analyzes Italian students' scores on the international OECD-PISA math test in year 2009 as influenced by school characteristics. Students' scores are split into 2 and, in turn, 3 groups sharing similar characteristics, like school size and curricula. The academic and technical tracks are highly beneficial with respect to vocational schools, and their benefit declines across quartiles, i.e., for the best students, where the quartiles represent different proficiency levels. The gender gap is quite sizable at all quartiles, for low/medium/high scoring students. The regional gap and school size effects are negative but improve across quartiles. Increasing the number of groups and of explanatory variables improves the Akaike value.

2 The Tail Estimators

By assuming an asymmetric Laplace distribution, Alfò et al. (2017) estimate model (5) at any chosen location, the center or the tails of the conditional distribution. Formally, their model considers the likelihood function for a finite mixture of distributions in the exponential family, defining the distribution of the response variable for the \(i\)th measurement in the \(k\)th cluster of the finite mixture model:

$$\prod_{i=1,n}\left[\sum_{k=1,G}\pi_k\,f(y_i|\boldsymbol{x}_i,\boldsymbol{b}_k)\right].$$
(6)

Equation (6) computes for each group/component of the finite mixture the group probability \({\pi }_{k}\) and the regression coefficients \({\varvec{\beta}}\). In (6) while the \({{\varvec{b}}}_{k}\) differ across groups, the \({\varvec{\beta}}\) vector is unique for the entire sample.

Equation (6) can be estimated in the tails within each group by implementing the quantile regression estimator (Koenker, 2005). The quantile estimator introduces the asymmetric weights \(\theta\) and \(1-\theta\), with \(0\le \theta \le 1,\) to move the regression model away from the conditional mean. The finite mixture quantile regression estimator can be defined by

$$\begin{array}{l}\sum_{i=1,n}\sum_{k=1,G}\pi_k\,\rho'(\theta)\,(y_i-\boldsymbol{x}_i'\boldsymbol{\beta}(\theta)-\boldsymbol{w}_i'\boldsymbol{b}_k(\theta))\,\boldsymbol{x}_i=0\\ \sum_{i=1,n}\sum_{k=1,G}\pi_k\,\rho'(\theta)\,(y_i-\boldsymbol{x}_i'\boldsymbol{\beta}(\theta)-\boldsymbol{w}_i'\boldsymbol{b}_k(\theta))\,\boldsymbol{w}_i=0\\ \rho'(\theta)=\theta I(\varepsilon_{ik}>0)+(1-\theta)I(\varepsilon_{ik}\le 0),\end{array}$$
(7)

where \(\varepsilon_{ik}=y_i-\boldsymbol{x}_i'\boldsymbol{\beta}-\boldsymbol{w}_i'\boldsymbol{b}_k\). For instance, to estimate the 75th quantile regression, the estimated equation is characterized by 75% of the observations lying below it and 25% of the observations above it. It represents the regression passing through the third quartile of the conditional distribution of the dependent variable, given the selected covariates. The asymmetric weights assign the value \(\theta=0.75\) to the larger observations to attract the estimated equation upward, and \(1-\theta=0.25\) to the remaining data. The \(\pi_k\) weights are free to change, so that the grouping can differ across quantiles.
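In Stata, for instance, the location shift is obtained through the quantile option of qreg; a minimal sketch with illustrative variable names:

```stata
* 75th percentile regression: by construction roughly 75% of the
* in-sample residuals are negative (observations below the fitted line)
qreg y x1 x2, quantile(.75)
predict e, residuals
count if e < 0        // close to 0.75*_N
```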

By introducing an additional weighting system to control for outliers, \(\varphi(\varepsilon_{ik})\), Alfò et al. (2017) compute a finite mixture M-quantile regression estimator that, besides the probability \(\pi_k\), combines i) the asymmetric weights defined to move the regression toward the tails with ii) a bound constraining outliers within each group. Bai et al. (2012) suggested replacing the least squares criterion with a robust criterion. Based on a Monte Carlo study, they show the robustness and efficiency gains of their approach when there are outliers in the data, while their results remain comparable to the traditional MLE under normality. The quantile regression estimator, and the quantile FM regression estimator in (7), provide results that are robust to anomalous values. However, when the model is estimated at very low/high quantiles, outliers can affect the estimated coefficients. The M-quantile estimator, by bounding exceedingly large errors, grants robustness even at very low/high quantiles, thus improving upon the quantile estimator in (7).

The finite mixture M-quantile estimator, at the selected location θ, is defined by the following asymmetrically weighted first order conditions

$$\begin{array}{l}\sum_{i=1,n}\sum_{k=1,G}\pi_k\,\varphi(\theta)\,(y_i-\boldsymbol{x}_i'\boldsymbol{\beta}(\theta)-\boldsymbol{w}_i'\boldsymbol{b}_k(\theta))\,\boldsymbol{x}_i=0\\ \sum_{i=1,n}\sum_{k=1,G}\pi_k\,\varphi(\theta)\,(y_i-\boldsymbol{x}_i'\boldsymbol{\beta}(\theta)-\boldsymbol{w}_i'\boldsymbol{b}_k(\theta))\,\boldsymbol{w}_i=0\\ \varphi(\theta)=\left\{\begin{array}{ll}\theta I(\varepsilon_{ik}>0)+(1-\theta)I(\varepsilon_{ik}\le 0)&\mathrm{if}\ -c<\varepsilon_{ik}<c\\ \theta&\mathrm{if}\ \varepsilon_{ik}>c\\ 1-\theta&\mathrm{if}\ \varepsilon_{ik}<-c\end{array}\right..\end{array}$$
(8)

The above \(\varphi\) function is the Huber (1981) loss function, and the tuning constant generally selected, \(c=1.345\), is the bound constraining large errors. Its value is determined by bounding the influence of residuals in the Huber estimator, or the influence of both residuals and explanatory variables in other robust estimators, as in Hampel et al. (1986).
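A minimal sketch of the bounding scheme in (8), assuming a residual variable eps already standardized by a robust scale estimate (all names are illustrative):

```stata
* phi(theta) weights of Eq. (8) with the usual Huber bound c = 1.345
scalar c     = 1.345
scalar theta = 0.75
gen phi = cond(eps >  c, theta,       ///
          cond(eps < -c, 1 - theta,   ///
          cond(eps >  0, theta, 1 - theta)))
```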

Finally, an additional finite mixture regression estimator can be considered. The standard expectile estimator moves the OLS regression toward the tails by simply introducing asymmetric weights to change location. It maintains the OLS characteristics: ease of computation and non-robustness to outliers. The finite mixture expectile regression estimator combines the asymmetric weights defining the location with the clustering weights identifying the groups. It is defined as follows:

$$\begin{array}{l}\sum_{i=1,n}\sum_{k=1,G}\pi_k\,\lambda(\theta)\,(y_i-\boldsymbol{x}_i'\boldsymbol{\beta}(\theta)-\boldsymbol{w}_i'\boldsymbol{b}_k(\theta))^2\\ \lambda(\theta)=\theta I(\varepsilon_{ik}>0)+(1-\theta)I(\varepsilon_{ik}\le 0).\end{array}$$
(9)

Summarizing, the quantile regression estimator in (7) provides results that are robust to anomalous values. However, when the model is estimated at very low/high quantiles, outliers can affect the estimated coefficients. The M-quantile estimator in (8), by bounding exceedingly large errors, grants robustness even at very low/high quantiles, thus improving upon the quantile estimator. The expectile estimator in (9), vice versa, is not robust at all. The \(\lambda(\theta)\) weight is simply a shifting term that moves the regression model toward the tails. Equation (9) computes the standard OLS estimator when \(\theta=0.5\).
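Since expectiles are asymmetrically weighted least squares, they can be computed by iteratively reweighted OLS; a sketch under illustrative names, not the authors' Appendix 2 code:

```stata
* expectile regression at theta = 0.75 via iteratively reweighted
* least squares (fixed number of passes in place of a convergence test)
scalar theta = 0.75
regress y x1 x2                       // OLS start, i.e., theta = 0.5
forvalues it = 1/30 {
    capture drop e lam
    predict e, residuals
    gen lam = cond(e > 0, theta, 1 - theta)
    regress y x1 x2 [aweight = lam]   // lambda(theta)-weighted LS
}
```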

Furno and Caracciolo (2020), in the clustering approach of model (1), follow a slightly different method. They first compute the finite mixture model at the center of the conditional distribution, iterating between (3) and (4). Once the observations are grouped into homogeneous clusters, an asymmetric weighting system computes (1) in the tails of the conditional distribution within each group. The standard finite mixture approach provides the group probabilities \(\pi_k\) and the estimated model at the conditional mean. Next, for units in group \(k\) as defined by the FM class probability weights \(\pi_k\), the \(\theta\) quantile regression is computed. This approach does not allow the definition of the groups to change across quantiles. On the other hand, it allows computing as many sets of \(\boldsymbol{\beta}\) parameters as the number of classes \(G\) and comparing groups at each quantile. It does not consider a general \(\boldsymbol{\beta}\) vector together with group-specific \(\boldsymbol{b}_k\), but many \(\boldsymbol{\beta}\) vectors, one for each homogeneous class, thus allowing between-group comparison with respect to each explanatory variable.

Equations (7), (8), and (9) are implemented within each of the groups already estimated at the conditional mean, and become

$$\sum_{i=1,n}\rho'(\theta)\,(\pi_k y_i-\pi_k\boldsymbol{x}_i'\boldsymbol{\beta}_k(\theta))\,\boldsymbol{x}_i=0$$
(10)
$$\sum_{i=1,n}\varphi(\theta)\,(\pi_k y_i-\pi_k\boldsymbol{x}_i'\boldsymbol{\beta}_k(\theta))\,\boldsymbol{x}_i=0$$
(11)
$$\sum_{i=1,n}\lambda(\theta)\,(\pi_k y_i-\pi_k\boldsymbol{x}_i'\boldsymbol{\beta}_k(\theta))^2,$$
(12)

with \(\rho'(\theta)\) being the usual quantile regression check function weight

$$\rho'(\theta)=\theta I(\nu_{ik}>0)+(1-\theta)I(\nu_{ik}\le 0),$$
(13)

while the M-quantile bounding function is

$$\varphi(\theta)=\left\{\begin{array}{lll}\theta I(\nu_{ik}>0)+(1-\theta)I(\nu_{ik}\leq0)&\mathrm{if}&-c<\nu_{ik}<c\\\theta&&\mathrm{if}\;\nu_{ik}>c\\\left(1-\theta\right)&&\mathrm{if}\;\nu_{ik}<-c\end{array}\right.$$
(14)

and the expectile function \(\lambda\) is

$$\lambda(\theta)=\theta I(\nu_{ik}>0)+(1-\theta)I(\nu_{ik}\le 0),$$
(15)

with \(\nu_{ik}=\pi_k y_i-\pi_k\boldsymbol{x}_i'\boldsymbol{\beta}_k(\theta)\).

In Furno and Caracciolo (2020) the individual-specific parameters \(\boldsymbol{b}_k\) are not considered, since the classes are computed once and for all at the conditional mean and do not change when moving to the tails. This simplified definition of the estimator allows analyzing between-class differences with respect to each explanatory variable, since there are as many estimates of the \(\boldsymbol{\beta}\) vector as the number of classes.

3 Simulations

To compare results, the same simulation scheme as in Alfò et al. (2017) is implemented here. The simulations are computed in Stata, version 15. The Stata codes are in Appendix 2.

The model is a random coefficients equation defined as

$$y_{ik}=(\beta_1+b_{1i})+(\beta_2+b_{2i})x_{2ik}+\beta_3 x_{3ik}+\varepsilon_{ik},\qquad i=1,\ldots,n,\quad k=1,\ldots,G,$$
(16)

where the error term is in turn a standard normal or a Student t with 3 degrees of freedom, \(t_3\); the \(\boldsymbol{\beta}\) vector assumes values \((\beta_1,\beta_2,\beta_3)'=(100, 2, 1)'\); \(b_{1i}\) and \(b_{2i}\) are individual-specific random parameters distributed in turn as a standard normal or a Student t with 3 degrees of freedom; \(x_{2ik}\) is defined as the sum of two independent standard normals; \(x_{3ik}\) is a binomial B(100, 0.5); the sample size is \(n=100\), and there are 1000 iterations for each experiment.

The first set of experiments considers the standard normal distribution for \({b}_{1i}\), \({b}_{2i}\), and the error term. The second set of experiments considers Student t distributions.
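A sketch of the data generating process in (16) for the standard normal design (the seed and variable names are illustrative; rt(3) replaces rnormal() in the Student t experiments):

```stata
* simulate one sample from Eq. (16), standard normal design
clear
set obs 100
set seed 12345
gen b1  = rnormal()                  // random intercept effect
gen b2  = rnormal()                  // random slope effect
gen x2  = rnormal() + rnormal()      // sum of two independent N(0,1)
gen x3  = rbinomial(100, 0.5)        // binomial B(100, 0.5)
gen eps = rnormal()                  // error term
gen y   = (100 + b1) + (2 + b2)*x2 + 1*x3 + eps
```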

The analyzed estimators are the finite mixture model at the quantiles, together with its variants defining M-quantiles and expectiles. The estimators are computed at the locations \(\theta=0.25, 0.50, 0.75\), i.e., the quantile, M-quantile, and expectile regressions at the first, second, and third quartiles. The last two estimators are computed only for (10) to (12).

As mentioned, the tail finite mixture estimators can be implemented in two different ways. Both estimators are computed in Stata, and the steps are the following:

  i) estimate FM at the conditional mean to compute the group probability weights \({\pi }_{k}\);

  ii) pre-multiply the observations by the group probability weight \({\pi }_{k}\);

  iii) compute quantile, expectile, and M-quantile regressions on the probability-weighted data (a Stata sketch follows).
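A minimal sketch of steps i) to iii), using the variables of Eq. (16) and a 75th percentile target (names illustrative, not the Appendix 2 code):

```stata
* i) finite mixture at the conditional mean, G = 2, and the pi_k weights
fmm 2: regress y x2 x3
predict pr*, classposteriorpr        // pr1, pr2: class probabilities

* ii) pre-multiply the observations by the group 1 weights
foreach v of varlist y x2 x3 {
    gen w1_`v' = pr1*`v'
}

* iii) tail estimator on the probability-weighted data
qreg w1_y w1_x2 w1_x3, quantile(.75)
```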

For both, the assumption of constant clusters across quantiles is considered. This assumption helps quantify the pay-off of the changes in group composition across quantiles allowed by the Alfò et al. (2017) estimator. The random effect results are presented in Tables 1 and 3, fmquantile in the tables. A unique \(\boldsymbol{\beta}\) vector − representing the fixed effect model − is estimated for the entire sample. These results are computed by constraining the coefficients of the finite mixture to be the same across groups, with the sole exception of the constant term, which changes to capture the individual randomness \(\boldsymbol{b}_k\). Then the data are modified by the probabilities of belonging to each group, as in ii). Next, the group 2 data are appended to the group 1 data (with 3 groups, the group 3 data are appended to the group 1 and group 2 data). A single quantile regression on the twice-long data set (three times as long when \(G=3\)) is computed to obtain a single \(\boldsymbol{\beta}\) vector for all groups. In case the constant term of each group is of interest, a dummy variable is added to the estimated regression (two dummy variables if \(G=3\) and the group 3 intercept is relevant). These codes are in Appendix 2.
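The appending step described above might be coded along the following lines (a sketch with illustrative names, assuming pr1 and pr2 hold the class probabilities):

```stata
* stack the probability-weighted group 2 data under the group 1 data and
* fit one quantile regression; d2 recovers the group 2 intercept shift
gen w  = pr1                         // group 1 weights
gen d2 = 0
tempfile grp1
save `grp1'
replace w  = pr2                     // switch the copy to group 2
replace d2 = 1
append using `grp1'
foreach v of varlist y x2 x3 {
    gen wv_`v' = w*`v'
}
qreg wv_y wv_x2 wv_x3 d2, quantile(.75)
```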

To compute the (10) to (12) estimators, the data set is divided into \(G=2\) sub-groups; there is an estimated \(\boldsymbol{\beta}\) vector for each sub-group, and the results are reported in Tables 2, 4, and 6, fmq in the tables. Finite mixture expectiles and M-quantiles are computed only for the Furno and Caracciolo (2020) estimator, respectively fm expectile and fm M-quantile in the tables, and their behavior is summarized below the quantile fmq results.

For each estimator the relative bias and its dispersion are reported in the tables. The average expected shortfall in 1000 iterations, ES, and the average Akaike statistics are also collected, together with their dispersion across the iterations. To compute the Akaike criterion in the quantile regression, the estimated likelihood is replaced by the estimated quantile regression objective function, as defined in Behl et al. (2014). In detail, the Akaike values for the quantile regression are computed as AIC \(=n\,\log_{10}\left[\sum_{e_i<0}(1-\theta)|e_i|+\sum_{e_i\ge 0}\theta|e_i|+p\right]\), where \(e_i\) are the quantile regression residuals, \(p\) is the number of parameters, and \(n\) is the sample size. However, the use of the L1 norm for the quantile regressions, in place of the L2 norm of OLS and maximum likelihood, does not allow comparison of the Akaike values of different estimators, like quantile and expectile. A comparison is possible only within the same estimator as computed in differing models.
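A sketch of this computation in Stata, with illustrative names and the number of parameters set by hand:

```stata
* Akaike value for a quantile regression at theta = 0.75,
* following the formula above (Behl et al., 2014)
scalar theta = 0.75
qreg y x1 x2, quantile(.75)
predict e, residuals
gen aterm = cond(e < 0, (1 - theta)*abs(e), theta*abs(e))
quietly summarize aterm
scalar p   = 3                       // intercept plus two slopes
scalar AIC = _N*log10(r(sum) + p)
display AIC
```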

4 Results

Table 1 reports the average relative bias of the estimated coefficients, \({\beta }_{1}\), \({\beta }_{2}\), \({\beta }_{3}\), over 1000 iterations and the standard deviation for the random effect model (5). The errors and the random effects \({{\varvec{b}}}_{k}\) are normally distributed, in the top section of this table, while in the bottom section of the table the errors follow a Student t with 3 degrees of freedom. These results can be compared with the results for FMQR of Table 3 (for the standard normal experiments) and Table 4 (for the Student t distributions) in Alfò et al. (2017) for their quantile finite mixture estimator with \({r}_{i}=1\), i.e., with one measurement for each class. The relative bias is larger in Table 1 than in Alfò et al. (2017) tables, but the dispersion of Table 1 is sizably smaller than in the Alfò et al. (2017) tables. Both effects, a greater bias coupled with a smaller dispersion, can be related to the constraint here imposed: besides constant coefficients across groups, except for the intercept that is free to change to account for the quantile and the individual heterogeneity, we add constant clustering across quartiles. This seems to cause larger bias but smaller dispersion. Focusing on Table 1 below, the dispersion is generally larger for the \({\beta }_{3}\) slope, that is not a random coefficient; the bias increases at the upper quartile; the Akaike values are nearly stable, while the expected shortfall increases with \(\theta\).

Table 1 Standard normal and Student t distributions, random effect model, \(n=\) 100, 1000 replicates

Table 2 presents the estimates of (10) to (12) computed in two homogeneous sub-groups. In this table there are two estimated \(\boldsymbol{\beta}\) vectors, one for each group. Therefore, the estimated coefficients change both across quartiles and between groups. Once the class probabilities \(\pi_k\) are computed, the quantile, the M-quantile, and the expectile regressions are computed using the weighted observations \(\pi_k y_i=\) weighty, \(\pi_k x_{1i}=\) weightx1, \(\pi_k x_{2i}=\) weightx2. Stata provides the code qreg weighty weightx1 weightx2 to compute the finite mixture regression at the chosen quantile, while for the expectile and the M-quantile estimators an additional shifting weight is introduced to move the OLS and the Huber robust regression upward or downward, away from the conditional mean (these codes are reported in Appendix 2). The relative biases of the quantile, expectile, and M-quantile FM regression estimators are in Table 2. Comparing the fmq with the fmquantile results in the previous table: the constant term is generally under-estimated in both; the \(\beta_3\) relative bias is the largest in both tables, although \(\beta_3\) is not random; the dispersion of fmq is smaller than in Table 1, except for \(\beta_2\); the fmq Akaike values are smaller than those of fmquantile in Table 1. In Table 2 the fm M-quantile empirical distributions are generally more dispersed than the others, but provide the smallest expected shortfall, ES, in most experiments, since this estimator bounds outliers. Looking at the Akaike criterion, as mentioned, the fm expectile regression yields the largest values, since its objective function is in terms of squared residuals; it is followed by the finite mixture M-quantile regression, where the squared residuals are bounded and cannot grow beyond the chosen bound, and by fmq, which is characterized by the smallest Akaike values, since the quantile regression objective function considers the absolute values and not the squares of the residuals.

Table 2 Standard normal and Student t distributions, 2 groups, \(n=\) 100, 1000 replicates

Next the standard normal error distribution is replaced by a heavily contaminated normal, 30%, with a contaminating distribution N(50,100), while the Student t error distribution is replaced by a 10% contaminated Student t with a \({t}_{6}\) central distribution and a \({t}_{3}\) contaminating distribution. This set of experiments looks at the performance of the finite mixture estimators when there are two distributions generating the sample. In this set of experiments the \({{\varvec{b}}}_{k}\) in model (5) − or the \({\pi }_{k}\) grouping probability in (10) to (12) − mirror the individual characteristics, or the between group heterogeneity, together with this more complex data generating process. This set of experiments is particularly suitable for the (10) to (12) estimators, with the central distribution describing one group and the contaminating distribution describing the other group of tail observations. Bartolucci and Scaccia (2005) specifically suggest implementing finite mixture regressions to model mixtures of normal errors. Of the two contaminating schemes selected, the contaminated normal defines a bimodal distribution with very differing variances, while the Student t contamination defines a unimodal distribution.

Table 3 reports the results for the Alfò et al. estimator, while Table 4 considers the estimated coefficients in (10) to (12). When comparing Table 3 with Table 1, the relative bias increases slightly, and the dispersion increases with the contaminated normal; the contamination increases the Akaike values as well. In the Student t contamination case, the results of the two tables are comparable.

Table 3 Contaminated normal and Student t contaminated errors, random effect model, \({\varvec{n}}=\) 100, 1000 replicates; N(0, 1) and N(50, 100); Student \({{\varvec{t}}}_{6}\) and \({{\varvec{t}}}_{3}\)
Table 4 Contaminated error distributions: normal contamination, N(0, 1) and N(50, 100); Student t contamination, \({t}_{6}\) and \({t}_{3}\); two groups; \(n=\) 100; 1000 replicates

In Table 4 the sample is divided into two sub-groups to better account for between-group heterogeneity, yielding two vectors of estimated coefficients for the (10) to (12) estimators − one vector for the group generated by the central distribution and the other for the data generated by the contaminating distribution. The results can be compared with those in Table 2. With heavily contaminated normal errors, bias and dispersion increase in the upper section of the table, except for \(\beta_1\), together with the Akaike results. The increase in bias and dispersion is quite limited in the bottom section of the table with Student t contamination, as found for fmquantile. This is due to the different contamination scheme: in the normal contamination the data are bimodal, with a contaminating distribution characterized by a very large mean and variance − \(\mu=50\) and \(\sigma^2=100\) for the contaminating normal versus \(\mu=0\) and \(\sigma^2=1\) at the center. This is not the case in the Student t error contamination, where the contaminated error distribution is unimodal, centered on \(\mu=0\), and the variances of the two Student t distributions are not too far apart, \(\sigma^2(t_6)=1.5\) versus \(\sigma^2(t_3)=3\) for the contaminating distribution.

In Table 4 the relative bias is smaller for the finite mixture M-quantile regression estimator. The group 2 estimates are more dispersed than the group 1 estimates, possibly because of the larger variance of the contaminating distribution, \(\sigma^2=100\) and, in turn, \(\sigma^2(t_3)=3\), since group 2 seems to gather the observations generated by the contaminating distributions. In the experiments with contaminated distributions, it seems preferable to compute the coefficient vector separately in each class, thus untying the impact of the observations generated by the central distribution from the effect of those generated by the contaminating one.

Tables 5 and 6 report a different set of experiments. Following Bai et al. (2012), before implementing the finite mixture regression model, a robust regression is estimated. They compute a robust regression in the EM steps of the iterations, but this is not doable in Stata, which provides a single command to compute FM. Therefore, we implement a robust regression before running steps i) to iii) of our approach. The robust regression Stata code is rreg y x1 x2, genwt(name). Then we pre-multiply the data by these weights, yielding namey = name*y, namex1 = name*x1, and namex2 = name*x2, and we implement steps i) to iii) on this modified data set. This approach could be particularly useful with the contaminated error distributions, since the anomalous values in the tails are automatically bounded by the robust weights. It adds an initial step, provided by a robust regression, that yields weights to curb outliers. These weights modify/bound the observations, and the transformed data are then analyzed by the finite mixture regression estimators at the center and in the tails for the contaminated distribution experiments. The comparison of this set of results with the values in Tables 3 and 4 shows that the robust regression initializing procedure provides a sizable reduction in the relative bias in all the experiments with contaminated normal errors, particularly for \(\beta_2\), and in fmq even the standard deviation decreases. The expected shortfall decreases in many experiments. The introduction of an initializing robust step provides very good results and confirms the validity of the Bai et al. (2012) approach.
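The initializing step might be sketched as follows (names illustrative; the remaining steps then run on the weighted variables):

```stata
* robust initialization in the spirit of Bai et al. (2012)
rreg y x1 x2, genwt(rw)              // robust regression weights
foreach v of varlist y x1 x2 {
    gen r_`v' = rw*`v'               // curb outliers before the FM steps
}
* steps i) to iii) are then applied to r_y r_x1 r_x2
fmm 2: regress r_y r_x1 r_x2
```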

Table 5 Contaminated normal and Student t contaminated errors, random effect model; robust initialization; \(n=\) 100; 1000 replicates; N(0, 1) and N(50, 100); Student \({t}_{6}\) and \({t}_{3}\)
Table 6 Contaminated normal and contaminated Student t errors, robust initialization \(n=\) 100, 1000 replicates; N(0, 1) and N(50, 100); Student \({t}_{6}\) and \({t}_{3}\)

Finally, Tables 7 and 8 consider a 3-group finite mixture regression model estimated in the tails, at the 25th and 75th percentiles, in the case of contaminated error distributions. The 3-group model shows a reduction in bias, dispersion, and expected shortfall, while the Akaike values of Table 7 are slightly larger than in Tables 3 and 5. In the constant clustering estimator, fmq, characterized by the smaller Akaike values due to its definition, it is not easy to choose between the 2- and the 3-group models, since there is not a unique synthetic statistic: the Akaike criterion is estimated within each group. An average Akaike value across groups can be computed for comparability's sake. In the fmq estimator, the average value in the 2-group model is 265.5 in Table 4 and 261.6 in Table 6 at the 25th percentile, and respectively 268 and 271.7 at the 75th percentile, to be compared with 238 at the 25th percentile and 241 at the upper quartile in the 3-group case. Thus the 3-group model yields the smaller average Akaike values for fmq.

Table 7 Contaminated normal and Student t contaminated errors, random effect model, 3 groups, \(n=\) 100, 1000 replicates; N(0, 1) and N(50, 100); Student \({t}_{6}\) and \({t}_{3}\)  
Table 8 Contaminated normal and contaminated Student t errors, 3 groups \(n=\) 100, 1000 replicates; N(0, 1) and N(50, 100); Student \(t_6\) and \(t_3\)  

Overall, the simulations show that:

  • the constant term is systematically under-estimated in each table; the relative bias increases in fmq with respect to fmquantile, while its dispersion is smaller, except for the \(\beta_2\) random coefficient slope.

  • with heavy contamination, as in the contaminated normal experiments, the relative bias increases; less so in the Student t experiments, which model a milder contamination. The fmq estimator would be preferable with contamination, since one group can model the central distribution while the other group gathers the observations generated by the contaminating one.

  • under error contamination, initializing the estimation process with a robust regression sizably improves the relative bias and dispersion of the fmq estimator, particularly for the random slope \(\beta_2\), and reduces the expected shortfall.

  • increasing the number of groups \(G\) with contaminated errors does not improve the Akaike values in fmquantile, while in the fmq estimator \(G=3\) improves the average Akaike values.

5 Case Study

The math scores of 15-year-old Italian students taking the OECD-PISA test on student proficiency in year 2009 are considered. The math score, \(y_i\), is explained by school characteristics to assess the link between performance and school structure. However, we select only a small number of explanatory variables − field, school size, and student–teacher ratio − to streamline the discussion of the results. The field is defined by the dummy variables academic and technical, distinguishing these students from those enrolled in vocational fields; school size is the number of students; the student–teacher ratio is the ratio between the number of students and the number of teachers. The sample comprises \(n=\) 26,497 observations, and Table 9 reports the summary statistics.

Table 9 Student score data, summary statistics

Table 10 reports the results of the finite mixture model as computed at the mean and at the quartiles. The fmq estimator splits the sample into two data-driven sub-groups. Figure 1 shows the probability of belonging to each group. These probabilities define the two homogeneous groups of high and low scoring students in the math test, with the probability of belonging to group 1 centered on 0.64 and the group 2 probability centered on 0.36. In the top section of the table, school size has a negative impact for group 2 at the top quartile − the highly scoring students − while it is not significant at the median and quite small at the first quartile − the low scoring ones. The academic and the technical tracks improve upon vocational schools, particularly for group 2. The group 2 advantage due to field declines at the 75th quantile, for the top students. At the mean and at the first quartile, the between-group discrepancy due to curricula is quite sizable. At the higher quartile the group differences are less marked, although the dissimilarity between the estimated coefficients of the two groups does not disappear. The best students of each group, although performing differently, are closer to one another than the low scoring students of each group, and the low scoring group 2 students profit most from the academic and technical fields. The bottom section of this table reports the estimates of fmquantile in (5), computing a single \(\boldsymbol{\beta}\) vector of coefficients and gathering heterogeneity in the group component \(\boldsymbol{b}_k\). These results show the declining impact of curricula across quartiles as well, but they are decidedly less informative than the clustering model estimates, since they do not allow group comparisons. The last line reports the Akaike values as computed at the conditional mean and in the tails. As mentioned, the Akaike values can only compare the same estimator as computed in differing models.

Table 10 Math score at PISA test in 2009, finite mixture regression estimated at the quantiles, \(n=\) 26,497, fixed clustering model fmq
Fig. 1

Math scores. Probability to belong to a given group as computed at the conditional mean by the standard finite mixture regression estimator, \(G=2\). Group 1 has the highest probability, centered on 0.64, while group 2 has an average probability of 0.36

Table 11 considers a 3-group model, and Fig. 2 depicts the probabilities of each group. Comparing the estimates across groups and at the various quantiles becomes more cumbersome. The third group has the largest average probability, 0.398, and shows the greatest beneficial impact of curricula, increasing across quartiles, a negative impact of the student–teacher ratio, and a small, mostly non-significant impact of school size. The impact of the academic curriculum is negative in group 1 and more so in group 2, worsening across the quartiles, i.e., for low, medium, and high scoring students. For the technical track the negative sign occurs only in group 2 at the top quantile: the highly scoring students of group 2 benefit less from the technical curriculum with respect to group 1, which provides the baseline. The discrepancy among groups increases in the 3-group model at all quartiles. The Akaike values are smaller than in the previous table. The bottom section of the table reports the results for the random coefficient model (5). The estimates are all positive and significant, slightly larger than their analogues in the bottom section of Table 10. The results show a declining impact of curricula and a slightly increasing impact of the other coefficients across quartiles. The Akaike statistics are smaller than in the previous Table 10.

Table 11 Math score at PISA test in 2009, finite mixture estimates at the quantiles, \(n=\) 26,497, 3 groups
Fig. 2

Math scores. Probability to belong to a given group as computed at the conditional mean by the standard finite mixture regression estimator with \(G\) = 3. Group 1 has an average probability of 0.287, group 2 presents the smallest average probability equal to 0.178, while group 3 has the highest average probability of 0.398

Next, we introduce gender and region of residence among the explanatory variables of the model to better characterize the groups and further reduce between-group heterogeneity. Gender captures the good performance of boys in math, although they do worse than girls in reading proficiency. Region of residence accounts for regional discrepancies, since southern schools, and in general southern regions, lag behind the rest of the country. These additional variables turn out to be statistically significant. They help explain part of the between-group gap. However, there are many other factors at play in diversifying the groups, and our analysis is not intended to be exhaustive. A 2-group analysis eases the interpretation of the results in a model with a greater number of explanatory variables. These estimates are reported in Table 13 and Fig. 4 in Appendix 1.

Table 12 reports the results for the 3-group model comprising these two additional explanatory variables. Group 1 has probability centered on 0.383, while group 2 has an average probability of 0.403. These two groups have a similar pattern, as can be seen in the left graph of Fig. 3. The average probability of group 3 is the lowest, equal to 0.213; its histogram peaks at the lower side of the graph, and it seems to take the place of the former group 2 of Fig. 2. In this table the reference class is group 1, and the negative sign of the academic coefficient in groups 2 and 3 can be interpreted as a lower impact of this curriculum in group 2 and group 3 when compared to the group 1 baseline. The sign of the regional variable and of school size is negative in group 1 and group 3 at all quartiles and positive in group 2. This can be interpreted as a generally negative effect of these two variables across quantiles, whose impact, however, sizably improves for group 2 with respect to the group 1 baseline at each \(\theta\). The student–teacher ratio is negative in group 1 at all quartiles, and its impact is positive in groups 2 and 3 with respect to the baseline. Across quartiles, looking at low/medium/high scoring students, the regional impact improves while the gender gap increases in each group. The academic track is increasingly negative across quartiles, i.e., for the best students, in groups 2 and 3. Overall, group 1 collects the students formerly gathered in group 3 of Table 11. The Akaike values are smaller than in Table 11 for groups 1 and 3. Moving to the bottom section of the table, the curricula show a decreasing impact across \(\theta\); the regional coefficient is negative but improves over the quartiles; the gender gap is quite sizable; the school size impact is negative at and above the median. Surprisingly, the Akaike results increase with respect to the previous tables.

Table 12 Math score at PISA test in 2009, finite mixture regression estimated at the quantiles, \(n=\) 26,497, 3-group model with additional explanatory variables
Fig. 3

Math scores. Probability to belong to a given group as computed at the conditional mean by the standard finite mixture regression estimator, \(G=3\). Region of residence and gender among the explanatory variables. Group 1 has probability centered on 0.383, while group 2 has the highest average probability of 0.403. These two groups have a similar pattern. The average value of group 3 is the lowest and is equal to 0.213. It looks like it replaces former group 2 of Fig. 2

Summarizing: the curricula are beneficial mostly for one group of students, and their impact declines for the best students; the gender gap is quite sizable at all quartiles, for low/medium/high scoring students; the regional gap and school size effects are negative but improve in one group with respect to the baseline.

6 Conclusion

The analysis focuses on the behavior of the finite mixture estimators in the tails. The finite mixture regression approach assumes that the population of interest is generated by homogeneous subpopulations, to account for between-group heterogeneity. It identifies homogeneous groups within the sample and computes the coefficients of a linear regression model within each group. The clustering process is simultaneous with the estimation: class-specific coefficients are computed without any prior knowledge or assumption about the definition of the clusters.

Two different approaches have been independently defined in the literature: one where, assuming an asymmetric Laplace distribution, the grouping is allowed to change in the tails (Alfò et al., 2017) while the coefficients are kept constant across groups; the other defines the groups once and for all at the conditional mean, before estimating the model in the tails for each group (Furno & Caracciolo, 2020). The former computes a random coefficient model, while the latter splits the sample into homogeneous sub-groups at the conditional mean and computes the model in each subgroup at the selected quantile, without changing the cluster composition across quantiles. The Furno and Caracciolo (2020) approach yields: i) stability of the grouping across quantiles and ii) the estimation of a coefficient vector for each group. This second feature allows a very detailed between-group comparison at the quantiles.

The simulations consider normal and Student t errors and, in turn, contaminated normal and contaminated Student t errors. The results show a good performance of these estimators, particularly in the contaminated error experiments.

A case study on students' math scores in the OECD-PISA international proficiency test provides an example of the performance of the tail estimators with real data. Students are grouped into two and then into three classes, according to homogeneous characteristics like curricula and school size, and the impact of the covariates at the various score quartiles represents their influence for low/median/high proficiency students. The between-group divergence due to curricula is quite sizable and shrinks at the upper quartile, for the best students; the negative impact of school size and the inefficiency of southern schools penalize two out of three groups at all quartiles. The general pattern across quantiles of the cluster model estimates is confirmed by the random coefficient model results, but a between-group comparison of each estimated coefficient is not available in the latter model due to its definition. Vice versa, the Furno and Caracciolo estimator allows a detailed between-group comparison of the estimated coefficients and a deeper analysis of their pattern across quartiles as well.