Finite Mixture of Compositional Regression With Gaussian Errors

In this paper, we evaluate the efficiency of volleyball players according to their performance in attack, block and serve, taking into account the compositional structure of the data related to these fundamentals of the sport. To this end, we consider a finite mixture of regression models for compositional data. Maximum likelihood estimates for this model are obtained via an EM algorithm. A simulation study shows that the parameters are correctly recovered and that the estimators are asymptotically unbiased. Using a real dataset from a Brazilian volleyball competition, we show that the proposed model provides a better fit than the usual regression model.


Introduction
The performance of high-level volleyball teams is fundamental to success in championships. Such performance may be related to the efficiency of the players during the game. Knowledge of the main factors that affect the result of a game (for instance, the efficiency of the players) supports the decision-making of coaches and helps improve the skills of the teams. Hence, this is an important issue that must be analysed to contribute to the development of tactical and technical strategies. Some studies on the efficiency of volleyball players have been developed recently. For example, Bozhkova (2013) analyzed the efficiency of the best volleyball players based on point-winning and assisting actions, concluding that the attack is the skill that wins the most points among the best volleyball players in the world. Pena, Guerra, Busca & Serra (2013) used logistic regression to evaluate the skills and factors that best predicted the outcomes of regular-season volleyball matches.
The points scored by the players through attack, block and serve have the structure of compositional data, which consist of positive components representing proportions of a whole. Compositional parts can be expressed in any scale without loss of information; accordingly, the sample space of compositional data with a constant-sum constraint is the simplex defined by $S^D = \{(x_1, \ldots, x_D) : x_j > 0 \text{ for } j = 1, \ldots, D \text{ and } \sum_{j=1}^{D} x_j = k\}$, where $k > 0$ and $D$ is the number of variables (components).
Three essential principles of compositional data analysis are scale invariance, permutation invariance and subcompositional coherence (Aitchison 1986, Pawlowsky-Glahn, Egozcue & Tolosana-Delgado 2015). Scale invariance means that a composition carries information only about relative values. According to Aitchison & Egozcue (2005), this concept is easily formalized into the statement that all meaningful functions of a composition can be expressed in terms of a set of component ratios. Permutation invariance means that an analysis provides the same results when the order of the components in the composition is changed. Finally, subcompositional coherence can be summarized as follows: given a full composition and a subcomposition of it, inferences about the relations within the common parts should be the same (Aitchison & Egozcue 2005). Aitchison & Shen (1980) and Aitchison (1986) introduced an appropriate theory for compositional data. The methodology involves transformations from the restricted sample space, the simplex, to a well-defined real sample space. The general idea is that, once the constraints are removed, standard statistical methods can be applied to the transformed observations. These transformations are the additive logratio transformation (alr) and the centered logratio transformation (clr). Indeed, both the alr and clr transformations represent coordinates with respect to the Aitchison geometry (Pawlowsky-Glahn et al. 2015).
The use of the multivariate normal distribution for compositional data can be found in Hron, Filzmoser & Thompson (2012) and Egozcue, Daunis-I-Estadella, Pawlowsky-Glahn, Hron & Filzmoser (2011), among others. Our main motivation is to study the efficiency of volleyball players through their performance in attack, block and serve that results in points scored in a game. The compositional data methodology was applied to the points scored by the players over the whole League: attack ($x_1$), block ($x_2$) and serve ($x_3$). In addition, a compositional regression model was considered to study the relation between the fundamentals and the associated covariates: $z_1$ is the team's reception efficiency (in percent) and $z_2$ is the ratio of sets won to sets lost, i.e., the higher this ratio, the more sets the team won relative to those it lost.
As a preliminary analysis, a bivariate normal regression model was fitted, treating $y_1$ and $y_2$ as independent random variables. Figure 1 shows the QQ-plots of the fit. Moreover, the Shapiro-Wilk (SW), Kolmogorov-Smirnov (KS) and Anderson-Darling (AD) tests were computed to check the normality assumption. For $y_1$, all three tests gave p-values reported as 0.000, rejecting the null hypothesis that the sample came from a univariate normal distribution. On the other hand, for $y_2$ the SW, KS and AD tests gave p-values of 0.855, 0.902 and 0.678, respectively, which is consistent with a univariate normal distribution.
Given the results of the normality tests for $y_1$ and $y_2$, a new approach must be considered for these data, mainly for $y_1$. In this case, a mixture analysis is conducted to search for a better fit for the efficiency in point scoring by volleyball players.
The aim of the paper is to introduce a Gaussian mixture regression model for compositional data with alr transformation, considering the multivariate structure of the data.
The methodology of finite mixture models has been widely discussed in the literature. Quandt & Ramsey (1978) proposed this methodology in the general form of switching regressions. One of its advantages is the ability to identify and relate populations composed of two or more subpopulations. According to Miljkovic, Shaik & Miljkovic (2016), the variability of a variable may be better explained by a mixture of two or more distributions than by a single distribution.
The paper is organized as follows. Section 2 introduces some preliminaries on compositional data and the methodology of the Gaussian mixture regression model applied through the alr-coordinates. Sections 3 and 4 provide the results of the simulation study and of the application to a real data set from the Brazilian Men's Volleyball Super League 2014/2015. Section 5 ends the paper with some final remarks.

Methodology
First, we give the definition of compositional data. Consider a compositional vector $x = (x_1, x_2, \ldots, x_D)$, with $x_i$ a positive value for $i = 1, \ldots, D$, subject to the constant-sum constraint of the simplex.
The closure operation assigns a constant-sum representative to a composition. It divides each component of a vector by the sum of the components, rescaling the initial vector to the constant sum 1. In mathematical terms, the definition is as follows.
Definition 1 (Closure). For any vector $x = (x_1, \ldots, x_D)$ of $D$ strictly positive real components, the closure of $x$ is
$$\mathcal{C}(x) = \left( \frac{x_1}{\sum_{j=1}^{D} x_j}, \ldots, \frac{x_D}{\sum_{j=1}^{D} x_j} \right).$$
Logratio coordinates offer a way to deal with the constraints of compositional data: the transformation is applied before the statistical analysis. One family of such coordinates, introduced by Aitchison (1986), is the alr-coordinates, defined as
$$\zeta = \operatorname{alr}(x) = \left( \ln\frac{x_1}{x_D}, \ldots, \ln\frac{x_{D-1}}{x_D} \right).$$
The inverse of the alr-coordinates is given by
$$x = \operatorname{alr}^{-1}(\zeta) = \mathcal{C}\left( e^{\zeta_1}, \ldots, e^{\zeta_{D-1}}, 1 \right).$$
The alr-coordinates are not symmetric in the components, because the part $x_D$ appears in the denominator of every component logratio. The coordinates $\zeta_i = \ln(x_i / x_D)$ are simple logratios and easily interpretable (Pawlowsky-Glahn et al. 2015).
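The closure and alr operations can be illustrated with a minimal sketch in plain Python. The function names and the point counts below are hypothetical and chosen only for illustration; the last part is used as the alr denominator, as in the text.

```python
import math

def closure(x):
    """Rescale a vector of strictly positive parts so that it sums to 1."""
    if any(v <= 0 for v in x):
        raise ValueError("all parts must be strictly positive")
    total = sum(x)
    return [v / total for v in x]

def alr(x):
    """Additive logratio coordinates, with the last part as denominator."""
    return [math.log(v / x[-1]) for v in x[:-1]]

def alr_inv(zeta):
    """Map alr-coordinates back to a composition with unit sum."""
    return closure([math.exp(v) for v in zeta] + [1.0])

# Hypothetical point counts for one player: attack, block, serve.
points = [120.0, 30.0, 10.0]
composition = closure(points)   # proportions of the player's total points
coords = alr(composition)       # two real-valued coordinates
recovered = alr_inv(coords)     # back on the simplex
```

Because only ratios enter the alr-coordinates, `alr(points)` and `alr(composition)` agree, which is the scale invariance principle in action.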
The regression model assuming alr-coordinates for the response variable is given by
$$y_i = \beta_0 + z_i \beta_1 + \epsilon_i,$$
where $y_i = (y_{i1}, \ldots, y_{id})$ is a $(1 \times d)$ vector of response variables, with $d = D - 1$ and $D$ the number of components; $z_i$ is a $(1 \times p)$ vector of covariates associated with the $i$-th sample; $\beta_0$ is a $(1 \times d)$ vector of intercepts; $\beta_1$ is a $(p \times d)$ matrix of regression coefficients; and $\epsilon_i = (\epsilon_{i1}, \ldots, \epsilon_{id})$ is a vector of random errors with $\epsilon_{ij} \sim N(0, \sigma^2)$, for $j = 1, \ldots, D - 1$ and $i = 1, \ldots, n$.
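As a rough illustration of regression on alr-coordinates, the sketch below fits each coordinate separately by ordinary least squares with a single covariate. This assumes the coordinates are modelled independently and that the covariate enters with a positive sign; the data are noise-free and hypothetical, so the fit can be checked against the generating coefficients.

```python
import math

def alr(x):
    """Additive logratio coordinates, with the last part as denominator."""
    return [math.log(v / x[-1]) for v in x[:-1]]

def alr_inv(zeta):
    """Map alr-coordinates back to a composition with unit sum."""
    parts = [math.exp(v) for v in zeta] + [1.0]
    total = sum(parts)
    return [p / total for p in parts]

def fit_simple_ols(z, y):
    """Least-squares intercept and slope for y = b0 + b1 * z + error."""
    n = len(z)
    z_bar = sum(z) / n
    y_bar = sum(y) / n
    szz = sum((zi - z_bar) ** 2 for zi in z)
    szy = sum((zi - z_bar) * (yi - y_bar) for zi, yi in zip(z, y))
    b1 = szy / szz
    return y_bar - b1 * z_bar, b1

# Hypothetical noise-free example with D = 3 parts and one covariate.
z = [0.0, 0.5, 1.0, 1.5, 2.0]
true_coefs = [(1.0, 2.0), (-0.5, 0.5)]  # (intercept, slope) per alr-coordinate
compositions = [alr_inv([b0 + b1 * zi for b0, b1 in true_coefs]) for zi in z]

# Transform the compositions and fit one regression per coordinate.
responses = [alr(x) for x in compositions]
fitted = [fit_simple_ols(z, [y[j] for y in responses]) for j in range(2)]
```

With real data the fitted coefficients are interpreted on the logratio scale, for example as the effect of the covariate on the log of attack points relative to serve points.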
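In order to obtain the mixture structure for $y_1$ and a univariate normal regression for $y_2$, the likelihood is built from the $K$-component mixture density $\sum_{k=1}^{K} \pi_k \phi(y_{1i} \mid \mu_k, \sigma_k^2)$ for $y_1$ and a single normal density for $y_2$, where $\pi_k$ are the mixing proportions and $\phi(y \mid \mu_k, \sigma_k^2)$ is the normal density with mean $\mu_k = \beta_{0k} - \beta_{1k} z_i$ and variance $\sigma_k^2$, for $k = 1, \ldots, K$ and $i = 1, \ldots, n$.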
The standard tool for estimating the parameters of mixture models is the EM algorithm, known for its applications in clustering and classification models (McLachlan & Peel 2000). The simulation studies and the statistical analysis of the application were performed in R (R Development Core Team 2013) using the packages mixtools, maxLik and compositions.

EM Algorithm for Regression Model
Standard methods for finding the maximum likelihood solution fail in the present setup. A powerful alternative is the EM algorithm proposed by Dempster, Laird & Rubin (1977).
The EM algorithm is an iterative method in which each iteration consists of two steps, E (for expectation) and M (for maximization).
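Following Faria & Soromenho (2010), the E-step calculates the Q-function, which is the expected value of the log-likelihood function conditional on the current parameter estimates $\theta^{(t)}$ and the observed data. On the $(t + 1)$th iteration,
$$Q(\theta \mid \theta^{(t)}) = \sum_{i=1}^{n} \sum_{k=1}^{K} w_{ik}^{(t)} \log\left\{ \pi_k \, \phi(y_i \mid \mu_k, \sigma_k^2) \right\}, \tag{3}$$
where, for $i = 1, \ldots, n$ and $k = 1, \ldots, K$,
$$w_{ik}^{(t)} = \frac{\pi_k^{(t)} \, \phi(y_i \mid \mu_k^{(t)}, \sigma_k^{2(t)})}{\sum_{h=1}^{K} \pi_h^{(t)} \, \phi(y_i \mid \mu_h^{(t)}, \sigma_h^{2(t)})}$$
represents the posterior probability that the $i$th observation belongs to the $k$th component of the mixture.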
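The posterior probabilities of the E-step can be sketched as follows for a single response and a single covariate. This is a minimal illustration and not the code used in the paper: the function names and numeric values are hypothetical, and the component means are taken as `b0 + b1 * z`, a sign convention assumed for the sketch.

```python
import math

def normal_pdf(y, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-0.5 * (y - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def e_step(y, z, pis, betas, variances):
    """Posterior probabilities w[i][k] that observation i belongs to component k."""
    weights = []
    for yi, zi in zip(y, z):
        # Mixing proportion times component density, for each component.
        joint = [
            pi_k * normal_pdf(yi, b0 + b1 * zi, var_k)
            for pi_k, (b0, b1), var_k in zip(pis, betas, variances)
        ]
        total = sum(joint)
        weights.append([j / total for j in joint])
    return weights

# Hypothetical current estimates for a two-component mixture.
y = [0.1, 5.2]
z = [0.0, 1.0]
pis = [0.5, 0.5]
betas = [(0.0, 0.0), (3.0, 2.0)]   # (intercept, slope) per component
variances = [1.0, 1.0]

weights = e_step(y, z, pis, betas, variances)
```

Each row of `weights` sums to 1. In this example the first observation is attributed almost entirely to the first component and the second observation to the second component.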
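In the M-step, the Q-function (3) is maximized to obtain the updated estimates $\theta^{(t+1)}$. The M-step involves solving the following explicit equations:
$$\pi_k^{(t+1)} = \frac{1}{n} \sum_{i=1}^{n} w_{ik}^{(t)}, \qquad \beta_k^{(t+1)} = (Z^\top W_k Z)^{-1} Z^\top W_k Y, \qquad \sigma_k^{2(t+1)} = \frac{\lVert W_k^{1/2} (Y - Z \beta_k^{(t+1)}) \rVert^2}{\sum_{i=1}^{n} w_{ik}^{(t)}},$$
where $Z$ is an $n \times (d + 1)$ matrix of predictors, $W_k$ is an $n \times n$ diagonal matrix with diagonal entries $w_{ik}$ and $Y$ is an $n \times 1$ vector of the response variable, for $k = 1, \ldots, K$ (Faria & Soromenho 2010).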
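A minimal sketch of these updates for an intercept and a single covariate is given below. The weighted least squares solution is written in closed form instead of the matrix expression. The function name and the toy data are hypothetical; the weights are hard 0/1 assignments only so that the result can be checked by hand.

```python
def m_step(y, z, weights):
    """Update mixing proportions, regression coefficients and variances."""
    n = len(y)
    n_components = len(weights[0])
    pis, betas, variances = [], [], []
    for k in range(n_components):
        w = [row[k] for row in weights]
        sw = sum(w)
        # Weighted means of covariate and response.
        z_bar = sum(wi * zi for wi, zi in zip(w, z)) / sw
        y_bar = sum(wi * yi for wi, yi in zip(w, y)) / sw
        # Weighted least squares for intercept b0 and slope b1.
        szz = sum(wi * (zi - z_bar) ** 2 for wi, zi in zip(w, z))
        szy = sum(wi * (zi - z_bar) * (yi - y_bar) for wi, zi, yi in zip(w, z, y))
        b1 = szy / szz
        b0 = y_bar - b1 * z_bar
        # Weighted residual sum of squares gives the variance update.
        rss = sum(wi * (yi - b0 - b1 * zi) ** 2 for wi, zi, yi in zip(w, z, y))
        pis.append(sw / n)
        betas.append((b0, b1))
        variances.append(rss / sw)
    return pis, betas, variances

# Toy data: three points on y = 1 + 2z and three points on y = 10 - z.
z = [0.0, 1.0, 2.0, 0.0, 1.0, 2.0]
y = [1.0, 3.0, 5.0, 10.0, 9.0, 8.0]
weights = [[1.0, 0.0]] * 3 + [[0.0, 1.0]] * 3

pis, betas, variances = m_step(y, z, weights)
```

Alternating this update with a computation of the posterior weights, until the log-likelihood stabilizes, gives the EM iteration. With real data the weights are fractional and the variance estimates are strictly positive.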
We used a discrimination criterion based on the log-likelihood function evaluated at the MLEs. Let $m$ be the number of fitted parameters and $\hat{\theta}$ the MLE of $\theta$; the criterion is the Akaike information criterion (AIC), computed as $\mathrm{AIC} = -2\,l(\hat{\theta}; x) + 2m$.
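The criterion is simple to compute once the maximized log-likelihood is available. The sketch below uses hypothetical log-likelihood values and parameter counts (three parameters for a single regression with one covariate, seven for a two-component mixture); they are not results from the paper.

```python
def aic(log_likelihood, n_params):
    """Akaike information criterion: -2 * log-likelihood + 2 * parameters."""
    return -2.0 * log_likelihood + 2.0 * n_params

# Hypothetical (log-likelihood, number of parameters) for two candidate models.
candidates = {
    "single normal regression": (-150.0, 3),
    "2-component mixture": (-140.0, 7),
}

scores = {name: aic(ll, m) for name, (ll, m) in candidates.items()}
best = min(scores, key=scores.get)  # the smallest AIC indicates the preferred model
```

Here the mixture is preferred because its gain in log-likelihood outweighs the penalty for its four extra parameters.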

Simulation Study
The data were generated randomly according to the following scheme. A uniform random number $u \in (0, 1)$ was generated and its value was used to select a specific component $k$ from the mixture of regression models. Next, the associated covariate was generated as $z_1 \sim \text{Bernoulli}(0.5)$, together with a normal random error $\epsilon_{ik}$ with mean 0 and variance $\sigma_k^2$, for $k = 1, 2$. Lastly, the value $y_{1i}$ was calculated from the values of $z_1$ and $\epsilon_{ik}$.
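The generation scheme can be sketched as below. The function name, the regression coefficients and the variances are hypothetical values chosen for illustration, and the component mean is taken as `b0 + b1 * z1`, which is an assumption of the sketch.

```python
import random

def simulate_mixture_regression(n, pis, betas, variances, seed=0):
    """Draw n observations (y1, z1, k) from a mixture of regressions."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        # Select the component: first k whose cumulative proportion exceeds u.
        u = rng.random()
        k, cumulative = 0, pis[0]
        while u >= cumulative and k < len(pis) - 1:
            k += 1
            cumulative += pis[k]
        # Bernoulli(0.5) covariate and normal error with variance sigma_k^2.
        z1 = 1 if rng.random() < 0.5 else 0
        eps = rng.gauss(0.0, variances[k] ** 0.5)
        b0, b1 = betas[k]
        data.append((b0 + b1 * z1 + eps, z1, k))
    return data

# Hypothetical settings with mixing proportions (0.2, 0.8).
sample = simulate_mixture_regression(
    n=2000,
    pis=[0.2, 0.8],
    betas=[(0.0, 1.0), (4.0, -1.0)],
    variances=[0.25, 0.25],
    seed=1,
)
share_first = sum(1 for _, _, k in sample if k == 0) / len(sample)
```

The component label `k` is kept in the output only to check the simulation; in the estimation step it is treated as unobserved.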
The criteria used to assess the performance of the algorithm were the bias, the standard deviation (SD), the mean square error (MSE) and the coverage probability (CP). The coverage probability of the confidence intervals was computed by bootstrapping, since standard errors are not provided by the EM algorithm used in parameter estimation.
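These summary measures can be computed from the replicated estimates as in the following sketch. The function name and the numbers are hypothetical; the intervals are assumed to be supplied by a separate step, for example a bootstrap.

```python
import math

def summarize(estimates, intervals, true_value):
    """Mean, bias, SD, MSE and coverage probability over simulation replicates."""
    n = len(estimates)
    mean = sum(estimates) / n
    bias = mean - true_value
    sd = math.sqrt(sum((e - mean) ** 2 for e in estimates) / (n - 1))
    mse = sum((e - true_value) ** 2 for e in estimates) / n
    # Share of intervals that contain the true parameter value.
    cp = sum(1 for lo, hi in intervals if lo <= true_value <= hi) / n
    return {"mean": mean, "bias": bias, "sd": sd, "mse": mse, "cp": cp}

# Hypothetical estimates and confidence intervals from four replicates.
estimates = [1.9, 2.1, 2.0, 2.2]
intervals = [(1.7, 2.1), (1.9, 2.3), (1.8, 2.2), (2.05, 2.35)]
summary = summarize(estimates, intervals, true_value=2.0)
```

In this toy example three of the four intervals contain the true value, so the coverage is 0.75; a simulation study would use many more replicates.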
Tables 1 and 2 display the averages of the maximum likelihood estimates (Mean), the standard deviation (SD), bias, mean square error (MSE) and coverage probability (CP) of the asymptotic 95% confidence intervals for the parameters, considering two cases for the mixing proportions: $\pi_A = (0.5, 0.5)$ and $\pi_B = (0.2, 0.8)$. The estimates are close to the true values, and the results indicate that the estimators are asymptotically unbiased. As the sample size increases, the MSE values decrease. Moreover, the coverage probabilities were stable.

Application
We applied the proposed methodology to a real data set of 127 players extracted from Brazilian Volleyball Confederation (CBV) (2016). The data are proportions for the volleyball players who participated in the Brazilian Men's Volleyball Super League 2014/2015. The compositional data methodology was applied to the points scored by the players over the whole League, which are taken as the components: attack ($x_1$), block ($x_2$) and serve ($x_3$). The covariates associated with the model are: $z_1$, the team's reception efficiency (in percent), and $z_2$, the ratio of sets won to sets lost, i.e., the higher this ratio, the more sets the team won relative to those it lost.
The main goal is to verify, for each fundamental individually (attack, block and serve), whether it is related to the associated covariates.
The ternary diagram (Figure 2) presents the three fundamentals: attack, block and serve. This type of graphic represents a 3-part composition using a 2-dimensional plot (Van Den Boogaart & Tolosana-Delgado 2013). The points are concentrated towards the attack component; only a few points lie towards the block and serve components.
For the sake of comparison, the models were compared using the discrimination criterion based on the log-likelihood function evaluated at the MLEs. Table 3 presents the maximum likelihood estimates and the AIC values for the fitted models. The Gaussian mixture model with 2 components (2-GM) has the smallest AIC value, indicating the best fit among the models considered. The mixing proportions of the 2-GM model are 0.118 and 0.882, reflecting how the data are distributed between the two subpopulations. The 2-GM model fitted $y_1$ better than the other regressions and, based on the preliminary tests of normality, the fit of the linear regression is adequate for $y_2$ (Figure 1). These conclusions are corroborated by the behaviour of the residuals of the 2-GM model in Figure 3.

Final Remarks
This study provides a mixture compositional regression model to study the efficiency of volleyball players. In the preliminary results, one of the variables, namely $y_1$, did not show a good fit for a regression model with normal errors, according to the tests of normality and Figure 1. The Gaussian mixture compositional regression model was then developed and fitted to our dataset, corroborating the preliminary results. Two approaches were considered, mixture regressions with two and with three components, for the data on the efficiency of volleyball players according to their performance in the fundamentals: attack, block and serve. Furthermore, the estimates in the simulation study and in the application to the real dataset were obtained via an EM algorithm. The results indicate that the fundamentals of volleyball players are better described by the compositional mixture model with two components, according to the discrimination criterion. Such an approach accounts for the heterogeneous characteristics of the data.
Finally, the study's conclusions identified points scored in attack as fundamental to distinguishing the effective teams through the estimates of proportions. As future work, following Egozcue & Pawlowsky-Glahn (2005) and Egozcue, Pawlowsky-Glahn, Mateu-Figueras & Barceló-Vidal (2003), orthonormal coordinates (isometric logratio) can be incorporated in the finite mixture compositional model instead of alr-coordinates, probably leading to some improvement.

Figure 2: Ternary diagram for the components: attack, block and serve.

Figure 3: QQ-plot for the residuals of the 2-GM model.

Table 1: Simulated data. Mean, SD, bias, MSE and CP for estimates based on 1,000 generated samples of the two-component mixture regression models.

Table 2: Simulated data. Mean, SD, bias, MSE and CP for estimates based on 1,000 generated samples of the two-component mixture regression models.

Table 3: Summary of the maximum likelihood estimates for the parameters and comparison through the discrimination criterion of the bivariate normal (BN),