Profile Monitoring for Compositional Data

In a growing number of quality control applications, the quality of a product or process is best characterized and summarized by a functional relationship between a response variable and one or more explanatory variables. Profile monitoring is used to understand and to check the stability of this relationship over time. In some applications with compositional data, the relationship can be characterized by a Dirichlet regression model. We evaluate five T 2 control charts for monitoring these profiles in Phase I. A real example from production of concrete is given.


Resumen
En un gran número de aplicaciones la calidad de un producto o proceso está mejor representada por una relación funcional entre una variable de respuesta y una o más variables explicatorias.El monitoreo de perfiles permite entender y chequear la estabilidad de esta relación funcional a través del tiempo.En algunas aplicaciones con datos composicionales, la relación puede ser representada por un modelo de regresión Dirichlet.En este artículo nosotros evaluamos cinco cartas de control T 2 para monitorear estos perfiles en Fase I. Un ejemplo real asociado a la producción de concreto es presentado.
The linear regression model is commonly used for monitoring profiles.However, it is not appropriate for situations where the response is restricted to the interval (0, 1) since it may yield fitted values in the variable of interest that exceed its lower and upper bounds.Ferrari & Cribari-Neto (2004) proposed a regression model that is tailored for situations where the dependent variable Y is measured continuously on the standard unit interval, i.e. 0 < Y < 1.The proposed model is based on the assumption that the response is Beta distributed.The Beta distribution is very flexible for modeling proportions since its density can have quite different shapes depending on the values of the two parameters that index the distribution.Vasconcellos & Cribari-Neto (2005) proposed a class of regression models where the response is Beta distributed and the two parameters that index this distribution are related to covariates and regression parameters.However, the proposed regression models are restricted to the univariate case and cannot be applied in many practical situations where data consist of multivariate positive observations summing to one, that is, the study of compositional data, see Aitchison (1986) and Aitchison (2003).Melo, Vasconcellos & Lemonte (2009) proposed a particular structure for compositional data regression, based on the Dirichlet distribution, which is a generalization of the Beta distribution for the simplex sample space.A profile application in a concrete manufacturing plant, which after a preliminary study was found to fit appropriately this structure motivated this paper.
Compositional data are frequently encountered in industries such as the chemical, pharmaceutical, textil, plastic, concrete, steel, asphalt, among other.Sev-Revista Colombiana de Estadística 37 (2014) 157-179 eral statistical methods for monitoring processes characterized by compositional data have been studied.See for example, Sullivan & Woodall (1996), Boyles (1997), Yang, Cline, Lytton & Little (2004) and Vives-Mestres, Daunis-i Estadella & Martín-Fernández (2013).However, there are not methods for monitoring these processes when the random vectors associated to the compositional data present a functional relationship with a set of explanatory variables.
In this paper, the control charting mechanisms discussed by Williams, Woodall & Birch (2007) and Yeh, Huwang & Li (2009) are extended for monitoring functional relationships in Phase I characterized by a Dirichlet regression model using a regression structure that allows the modeling of relationships between random vectors with Dirichlet distribution and a set of explanatory variables.
The structure of this paper is outlined as follows: In Section 2, we show the Dirichlet regression model for compositional data and the estimation of the model parameters.Five T 2 control charts approaches used for monitoring linear profiles in Phase I with compositional data are presented in Section 3. In Section 4, the performance of the proposed approaches is evaluated through simulation studies.
A real example is given in Section 5.In the last section we conclude the paper.

Dirichlet Regression
Compositional data are used to indicate how parts contribute to the whole.In most cases they are recorded as closed data, i.e. data summing to a constant, such as 100%.Compositional data occupy a restricted space where variables can vary only from 0 to 100, or any other given constant.Such a restricted space is known formally as a simplex, see Pawlowsky-Glahn & Egozcue (2006).
Let c be a positive number.The p-dimensional closed simplex in R n and (p − 1)-dimensional open simplex in R p−1 are defined by respectively, where the superscript t means the function transpose.Furthermore, let T p = T p (1) and where a = (a 1 , . . ., a p ) t and a j > 0, j = 1 . . .p.We will write Y ∼ Dirichlet p (a 1 , . . ., a p ), see Ng, Tian & Tang (2011).When all a j → 0, the distribution becomes noninformative.When p = 2, the Dirichlet distribution Dirichlet 2 (a 1 , a 2 ) reduces to the Beta(a 1 , a 2 ) distribution.The marginal distributions of the components of Y, Y j , j = 1, 2, . . ., p, are distributed as Beta(a j , φ − a j ), where φ = p j=1 a j .In this sense, the Dirichlet distribution can be seen as a multivariate extension of the Beta distribution.Therefore, we have V ar The Dirichlet distribution is widely used to model data in the form of proportions, where each observation is a vector of positive numbers summing to one.It allows great flexibility of modeling, provided by the appropriate choice of its parameters.See Ng et al. (2011) and Melo et al. (2009).Gueorguieva, Rosenheck & Zelterman (2008) described a Dirichlet multivariate regression method which is useful for modeling data representing components as a percentage of a total.They described each log(a j ) as a separate linear function of covariates and regression coefficients.That is, for each component j = 1, . . ., p they used a log-link with log a ij = β t j X i (5) for covariates X i recorded on the ith individual (i = 1, . . ., n) and regression coefficients β j to be estimated using maximum likelihood.These estimates are denoted β j .The estimates a j = { a ij } of a j = {a ij } are defined by Gueorguieva et al. (2008) refer to the {a j } as meta − parameters because they combine the effects of the covariates X i using regression parameters {β j }.Melo et al. (2009) proposed a generalization of this model.The proposed model is defined by establishing relationships between the parameters that index the Dirichlet distribution and linear predictors on the explanatory variables.They assume a set of independent vector observations Y 1 , . . ., Y n , where where each function g j : R → (0, ∞) is three times differentiable, injective and known, x i1 , . . ., x ik are the values corresponding to the ith observation for k explanatory variables and β 1j , . . ., β kj are k unknown parameters corresponding to the jth component.The model, therefore, has kp unknown parameters, which can be estimated through maximum likelihood (See Melo et al. 2009).
The covariates of this regression model affect the vector mean, the variance covariance structure of the distribution of the observations and the higher-order moments.The functions g j play a similar role to the link functions of generalized linear models, in the sense that they specifically define how the parameters of the distribution of interest are linked to linear combinations of the covariates.The coefficients of this linear combination are unknown.The regression parameters are identifiable if the link functions are injective and the covariates are linearly independent (See Melo et al. 2009).
The regression coefficients can be estimated using maximum likelihood.Let B the k × p matrix with the β hj 's, h = 1, 2 . . ., k and j = 1, 2, . . ., p.The loglikelihood function is given by where If B is the maximum likelihood estimator for B, under some regularity conditions, ), when n is large, with a ∼ denoting asymptotically distributed, N kp representing a kp-variate normal distribution and K(B) representing the kp × kp information matrix for the vector version of B (See Melo et al. 2009).The matrix K(B) can be obtained as where represents the Kronecker product, L is an np × np matrix defined in partitioned form as where each L cd , with c = 1, 2, . . ., p and d = 1, 2, . . ., p, is a diagonal matrix having ith element in the diagonal given by where . ., n, j = 1, 2, . . ., p, g is the first-order derivative of g with respect to its argument and ψ (digamma function) is the first derivative of the log of the gamma function.

Profile Monitoring Control Charts
In this section, we propose and study five Hotelling T 2 control charts for monitoring linear profiles with compositional data using the Dirichlet regression model described in the previous section.The study is limited for Phase I. We consider the same charts analyzed by Yeh et al. (2009) in their study for profile monitoring in binary responses: T 2 based on the sample mean vector and covariance matrix (T 2 U sual ), T 2 based on the sample average and successive differences estimator (T 2 SD ) proposed by Sullivan & Woodall (1996), T 2 based on the sample average and intra−profile pooling (T 2 Int ) Williams et al. (2007), T 2 based on the Minimum Volume Ellipsoid (T 2 M V E ) and T 2 based on the Minimum Covariance Determinant (T 2 M CD ) studied by Vargas (2003) and Jensen, Birch & Woodall (2007).
We assume that when the process is in control, the matrix of model parameters is B 0 .In Phase I control, m independent samples are taken.In each sample r, r = 1, . . ., m, there are a set of n r independent vector observations Y 1r , . . ., Y nr r , where We suppose that Y ir ∼ Dirichlet p (a i1 , . . ., a ip ).We assume that the relationship between the parameters that index the Dirichlet distribution and k explanatory variables (X 1 , . . ., X k ) given in equation ( 6) is g j = exp(•).
For any given sample r, r = 1, 2, . . ., m, B r is the maximum likelihood estimator of B. Let β r = vec( B r ) = ( β 11r , β 21r , . . ., β k1r , β 12r , β 22r , . . ., β k2r , . . ., β 1p r , β 2p r , . . ., β kp r ) β r is a multivariate random vector, where each β sj r represents the estimator of the parameter corresponding to the explanatory variable X s , s = 1, . . ., k, applied on the j components of Y ir .The Hotelling's T 2 statistic measures the Mahalanobis distance of the corresponding vector from the sample mean vector.The general form of the statistic is where β 0 is the expected value of β r when the process is in control, and Σ 0 is the in-control covariance matrix of β r .
In Phase I control, β 0 and Σ 0 both need to be estimated and the performance of the control chart depends on the types of estimates being used.The T 2 statistics for the proposed control charts are calculated by: where where where S Int = 1 m m r=1 var( β r ), which is calculated using the observed information matrix, where β M V E and S M V E are estimates of β 0 and Σ 0 , respectively, based on the MVE method (See Rousseeuw & Van Zomeren 1990), and where β M CD and S M CD are estimates of β 0 and Σ 0 , respectively, based on the MCD method (See Rousseeuw & Van Zomeren 1990).
Although β r is distributed asymptotically normal we do not know its sampling distribution.Therefore, we used simulations to approximate the upper control limit (UCL).For simplicity we consider that the number of components is p = 2, 3, 4, 5 and 8.For a chart given we generated m independent samples.For each sample we generated a set of n independent vector observations Y 1 , . . ., Y n , where Y i ∼ Dirichlet p (a i1 , a i2 , . . ., a ip ), i = 1 . . ., n.The parameters of the Dirichlet distribution, a ij with j = 1, 2, . . ., p, are described by The values of the regressor variable (X) can be random but we have assumed that X takes fixed values, X = 0.1, 0.2, . . ., 0.9.For the m samples generated we calculate the maximum T 2 , denoted by T 2 max .This process was then repeated 10,000 times which resulted in 10,000 T 2 max values.The 95th quantile of these T 2 max values was then taken as an estimate of the UCL for that chart.
For each of the proposed charts, we ran the simulations for m = 30, 60 and 90 samples with a prespecified type I error probability α = 0.05.The UCLs obtained are shown in Table 1.We used the R language to run the simulations, in particular we used the DirichletReg-package written by Maier (2011) to calculate the estimates of β r .We also used the functions cov.mve and cov.mcd from the MASS-package to calculate β M V E and S M V E in equation ( 14), and β M CD and S M CD in equation ( 15).If a process is modeled using the multivariate normal regression, the response variables can take any real value, (Noorossana, Eyvazian, Amiri & Mahmoud 2010).However, this assumption is not met here, because the response variables for compositional data are always positive and range only from 0 to 100, or any other constant.Therefore, the use of the multivariate normal regression in this kind of processes can produce invalid results.For more details see Aitchison (2003) and Pawlowsky-Glahn & Egozcue (2006).

The Performance Evaluation
In this section we compare the performance of the proposed methods for Phase I, monitoring of compositional data, through linear regression profiles in terms of the overall probability of a signal under step and drift shift and outliers.The signal probability is defined as the probability that at least one sample, of a total of m samples, is considered to be out of control.When the process is out of control, a large signal probability indicates better ability of a control chart to detect the out-of-control process.However, when the process is in control, a large signal probability actually works against a control chart since it gives a higher false alarm rate (See Yeh et al. 2009).
The out-of-control signal probabilities were calculated based on 5,000 simulations, as the percentage of times the T 2 max exceeds the corresponding UCL.For step shift or sustained shift, we generate a shift in the vector of parameters of the regressions, β, from β 0 to β 1 .The shift starts from the sample l, for l = [m * k] + 1, where [x] denotes the largest integer which is less or equal than x and k = 1 4 , 1 2 and 3 4 .So, for k = 1 2 the first half of the samples is in-control, while the second half is in the out-of-control state.The signal probabilities for each control chart, when m = 30, were calculated through simulation.
Table 2: Signal probabilities when the intercept and the slope of the profile corresponding to the second component have not changed.Table 2 shows the signal probabilities of the five control charts considered for a step shift occuring in l = [m/2] + 1 when the intercept and the slope of the profile corresponding to the second component have not changed.When ncp = 0 the signal probabilities for the T 2 U sual , T 2 SD , T 2 Int , T 2 M V E and T 2 M CD control charts are close to 0.05.For other values of ncp, the signal probabilities of the T 2 Int control chart are 1 o very near 1, showing an excellent performance to detect step shifts.

Figures 1 to 5 describe the signal probabilities of the
and T 2 M CD control charts for a step shift occurring in three scenarios: the last three quarters, the second half, and the last quarter of the 30 samples considered.

With exception of the T 2
Int control chart, the location of the step shift affects the performance of the T 2 control charts considered.For example, the signal probabilities decrease considerably when the shift stars in the half of the samples.The effect is greater in the T 2 M V E and T 2 M CD control charts.These charts are more powerful when the shifts occur at k = 1 4 and k = 3 4 .For drift shifts, the first sample generated was in control, β 1 = β 0 , and the process parameter vector started to change from the second sample to β 1 , where , with r = 2 . . ., m and m = 30.Figure 6 shows the signal probabilities found by simulation for the 256 possible values of ncp.We observe that the T 2 SD , T 2 Int control charts have a good performance for detecting shifts with trend.
In the scenario considering the presence of outliers, 5 of them were inserted randomly in the m samples, with m = 30.They were generated from β 1 , where β 1 = β 0 +(∆).Figure 7 presents the signal probabilities calculated by simulations.T 2 Int , T 2 M V E and T 2 M CD control charts have the best performance for detecting outliers.

Example of Application
The concrete is a composite material that essentially consists of a mixture of cement, water and aggregates, which is regularly used in infrastructure and buildings construction (Li 2011).The aggregates are rock fragments named coarse aggregate and sand particles called fine aggregate, which be derived from land-or sea-based deposits, from gravel pits or hard-rock quarries, from sand dunes or river courses.The aggregate occupies between 70% and 75% of the concrete volume and affect its strength, durability, workability and cohesiveness.One aspect of interest in the quality of the aggregate is the particle size distribution known as gradation (Alexander & Mindess 2005).
In order to obtain the gradation of the aggregate, a series of standard sieves are nested or stacked, one on top of another, with increasing aperture size from bottom to top, and through which a aggregate sample is passed from top, usually  aided by shaking or vibrating the sieves (Alexander & Mindess 2005, Lyons 2008).
Figure 8 shows a kind of machine used in the gradation process.The sieves labeled as 200, 100, 50, 30, 16, 8, 4, and P3, have the hole sizes of 0.075 mm, 0.149 mm, 0.297 mm, 0.595 mm, 1.19 mm, 2.38 mm, 4.75 mm and 9.5 mm, respectively.The gradation results are the percent of aggregate retained on each sieve and the fineness modulus, which measures the average particle size.This dimensionless parameter is equal to sum of the percent of aggregate retained on each of sieve divided by 100.A smaller fineness modulus indicates a finer aggregate and a higher value represents a courser aggregate.In this work, 217 aggregate samples from a concrete manufacturing plant were studied.The aggregate samples were daily tested during 31 weeks.The set of daily observations obtained in a week is named weekly sample, therefore, 31 weekly samples were considered.The proportion passing through of each sieve and the fineness modulus were measured in each aggregate sample.
The components j = 1, 2, . . ., 8 are defined by the aggregate size retained by each sieve.The proportion passing through the sieve j, j = 1, . . ., 8, is the variable Y j .Each component corresponds to an aggregate with constant size and the proportion of aggregate passing trough them is identified by the vector Y = (Y 1 , Y 2 , . . ., Y 8 ). Figure 9 shows plots of the marginal frequencies of each component for the aggregate samples.We observe that Y 4 , Y 5 and Y 7 are skewed, which implies that Y does not have a multivariate normal distribution.

Conclusions
In this paper the control charting mechanisms discussed by Williams et al. (2007) and Yeh et al. (2009) have been extended for monitoring compositional data profiles in Phase I processes, whose response variable follows a Dirichlet distribution.This methodology allows us monitoring the linear relationship between the parameters of a Dirichlet distribution and a set of explanatory variables, and assess the stability of the parameters that characterize the studied regression model.
We used five Hotelling's type T 2 control charts and compared their performance for detecting step and drift shifts in the process parameters and outliers in the studied profiles.Simulation procedures suggest that the T 2 control chart called intra-profile pooling is an excellent tool in order to detect outliers in the profiles and step and drift shifts in the parameters of the compositional data profile.The intraprofile pooling T 2 control chart is based on the average of the sample covariance matrices of the estimates of the parameters β characterizing the profile.The T 2 control chart based on the vector of successive differences of parameter estimates is a good alternative for detecting step and drift shifts; while the T 2 control charts based on robust estimates for the mean and covariance matrix, minimum volume ellipsoid (MVE) and minimum covariance determinant (MCD) methods, are a good option for detecting outliers.
We presented an example of application with real data of the proposed methodology, in order to control the quality of the aggregate gradation in the concrete.The T 2 control chart based on successive differences suggests that the process is incontrol and does not present step and drift shifts, the control chart based on MVE detects the presence of some outliers, and the intra-profile and usual T 2 control charts show that the process is out-control.This methodology can be extended to other processes with compositional data.This paper constitutes an initial solution for monitoring compositional data profiles.It would be worthwhile to study and compare the performance of other control charts like the change point approach.Since the performance of the T 2 control charts deteriorates when number of parameters increases, it is needed more research when the number of components in the response variable increase and/or when the number of covariates increases.Reduction methods for multivariate data or high dimensional methods need also future research.
When the process is out-of-control is important to identify the causes of the anomaly in order to apply appropriate remedial measures.A future work can implement diagnostic aids such as determining the parameters responsible for outof-control signal.
Finally, some methods for monitoring Dirichlet regression profiles in Phase II can be developed.

Figure 1 :
Figure 1: Signal probabilities of the T 2 U sual control chart for a step shift occurring in l = [m * k] + 1, with k = 1 4 (on the left), k = 2 4 (on the middle), and k = 3 4

Figure 7 :
Figure 7: Signal probabilities of outliers for five control charts: usual, successive differences, intra-profile, MVE and MCD.

Figure 8 :
Figure 8: Series of sieves placed one above the other in order of size with the largest sieve at the top.

Figure 10 :
Figure 10: Linear relationship between fineness modulus and the proportion of sand passing through the sieves No 200, 100, 50 and 30.

Figure 12 :
Figure 12: T 2 Control chart based on successive differences for the process of grading of sand in a mine of a concrete manufacturing plant.

Figure 13 :Figure 14
Figure 13: T 2 Control chart based on Minimum Volume Ellipsoid (MVE) for the process of grading of sand in a mine of a concrete manufacturing plant.

Figure 15 :Figure 16 :Figure 17 :
Figure 15: T 2 Control chart based on intra-profile pooling for the process of grading of sand in a mine of a concrete manufacturing plant.

Table 1 :
Values of simulated UCL for the proposed control charts with α = 0.05 Observed marginal frequencies Y1, Y2, . . ., Y8 of the individual components of the aggregate gradings.