Abstract

The classical multivariate CUSUM and EWMA charts are commonly used to detect small shifts in the mean vectors. It is now evident that those charts are easily affected by outliers which may be due to small or moderate changes in the mean vector. In this paper, we propose a robust multivariate CUSUM and Robust multivariate EWMA charts to remedy the problem of small changed in scatter outliers. Both the empirical and simulation results indicate that the proposed robust multivariate CUSUM and EWMA charts offer substantial improvement over other multivariate CUSUM and EWMA charts. This article also discussed the robustness of the proposed charts, when there is a small or moderate sustained shift in the data set.

1. Introduction

The overall quality of a process is often measured by the joint level of several correlated characteristics.

Although, in practice, separate univariate control chart for each characteristic is often used to detect changes in the inherent variability of a process, it can be very misleading.

One of the major drawback of using separate univariate control charts is that the overall probability of false alarm (or overall type I error) becomes large, particularly when the number of the quality characteristics increases. Moreover, using separate univariate control charts for each characteristic would not have detected out of control conditions, because it ignores the correlation between the quality characteristics. As an alternative, multivariate control charts are recommended to simultaneously monitor more than one quality characteristics [1, 2]. Generally, two phases, namely Phase I and Phase II, are considered in any multivariate statistical process control application.

Phase I of monitoring scheme of a process consists of collecting a sufficient number of data to ascertain whether or not this Historical Data Set (HDS) indicates a stable (or in-control) process whereas, in Phase II, future observations are monitored based on the control limits calculated from Phase I to determine if the process continues to be stable or not. If the process in phase I is out of control, the deviation sources must be detected and after removing out of control points, the control charts should be re-established. In other words, phase I is to identify multivariate outliers, recognize the root cause, and finally remedy the sources. Consequently, the estimated control limits of phase II would become more accurate [1, 3].

In statistical quality control, a process changes into an out-control situation when multiple outliers exist in the data set. An outlier is defined as an observation that deviates so much from other observations as to arouse suspicion that it was generated by a different mechanism [4]. Multiple outliers can appear either as random distributed observations within the HDS set (scatter outliers) or as sequential observations in a certain period of the HDS (sustained shift).

Therefore, it is imperative to identify outliers in phase I that may otherwise lead to model misspecification and incorrect results in phase II. phase I and phase II of monitoring scheme of a process are usually called retrospective and prospective analysis, respectively (see [1]).

The Hotelling’s statistic was first introduced by Hotelling [5]. It is analogous to the Shewhart control chart in the univariate cases and widely used in the multivariate control charts for both retrospective and prospective stages of the monitoring schemes [610].

However, the Hotelling’s multivariate types of the Shewhart control chart are insensitive to small and moderate shifts in the mean vector due to the fact that these charts only use the information of current sample. To rectify this problem, the MCUSUM or MEWMA charts were introduced [11, 12]. These charts use the information from both the current and past observations. Since the MCUSUM and the MEWMA charts are more sensitive to the small and moderate shifts in the mean vector, they may be effective alternative to charts when detection of small or moderate shift in a process are the main concerned [1, 2].

Typically, many researchers have suggested applying the MCUSUM and MEWMA in the prospective stage of the monitoring scheme [1]. However, due to the fact that in many real cases, the presence of observation with a small (moderate) changes or special cause in the HDS can create problem for our estimation in phase II, it would be desirable if those methods can also be used in the retrospective stage of the process monitoring scheme.

Sullivan and Woodall [13] suggested to employ the MCUSUM and MEWMA charts in phase I. Ryan [2] mentioned that it is reasonable if the MCUSUM and MEWMA procedures are used in the retrospective analysis of the monitoring scheme. Koning and Does [14] also employed a CUSUM chart in phase I and stated that it can be extendable to multivariate cases. Bersimis et al. [15] also pointed out to the application of the MCUSUM and MEWMA charts in phase I by giving an overview of the multivariate statistical quality control methods [2, 1315].

Although the use of the MCUSUM and MEWMA charts can be helpful in phase I for small or moderate shifts, little work has been focused in this area. Hence, in this article, the ideas of Sullivan and Woodall [13] and other researchers already mentioned have motivated us to investigate the performance of MCUSUM and MEWMA charts in the presence of scatter outliers. Since the MCUSUM and MEWMA charts are based on classical location and dispersion estimates, they are easily affected by multiple outliers. To rectify this problem, we propose to incorporate robust location and scale estimates of the minimum volume ellipsoid (MVE) or the minimum covariance determinant (MCD) [16] in the formulation of the robust MCUSUM and MEWMA charts.

The remaining parts of this paper are organized in the following structures. In Section 2, some backgrounds of the Hotelling’s , MCUSUM and MEWMA charts, will be reviewed. A brief introduction of the location and dispersion robust estimators, which are based on the MVE and the MCD, is presented in Section 3. The proposed robust MCUSUM and MEWMA control charts are introduced in Section 4. In Section 5, Monte Carlo simulation study is carried out in order to evaluate the performance of the proposed robust MCUSUM and MEWMA charts in both outlier situations, namely, the scatter outliers and the sustained shift. Numerical examples are provided in Section 6. Finally, discussions and conclusions are presented in Section 7.

2. Background of Multivariate Control Charts

2.1. Hotelling’s Charts

The Hotelling’s is a very common approach in multivariate control charts which is based on mahalanobis distance. Consider a HDS in phase I of monitoring scheme, which consists of independent time-ordered observation vectors of dimension, where is the number of quality characteristics that are being measured (). The mahalanobis distance is defined as where is a random vector containing elements for the th time period and are the location and scale estimates, respectively. If follows the variate normal distribution with as a poulation mean vector and as a population covariance matrix, then has chi-square distribution with degrees of freedom [17]. The general form of statistic are obtained by replacing and in (2.1) with classical sample mean vector () and the classical sample covariance matrix (). The classical Hotelling’s is then defined as where As already been mentioned, the chart is used both for retrospective and prospective analysis [1]. In phase I, the parameters are estimated retrospectively based on the current HDS, so the vector is not independent of the estimators and . In this situation, according to Mason and Young [7] and Tracy et al. [8], the statistical distribution of is given as where represents a beta distribution with parameters and . The upper control limit (UCL) of the in (2.4) is , where, is the upper quantile of the beta distribution, which is the probability of false alarm for each point plotted on the control chart. The lower control limit LCL is often set to zero [1, 7].

2.2. Multivariate Cumulative Sum Charts

The Cumulative Sum Chart (CUSUM) was first introduced by Page in 1954 as a powerful method in the monitoring small or moderate shifts in the mean process. The CUSUM charts directly combine all information from a dataset in the sequence of sample values by computing the cumulative sums of deviations of a sample values from a target value, that is, the CUSUM charts are based on using a sequential probability ratio tests. An excellent overview of CUSUM methods are given by Hawkins and Olwell [18] and Montgomery [1].

An inclusive review of the generalization approaches of the univariate CUSUM charts to multivariate CUSUM can be found in Bersimis et al. [15].

Another generalization approaches was proposed by Pignatiello and Runger in 1990. Let be a random vector containing elements for the th observation which derives from the variate normal distribution with an in-control poulation mean vector and an in-control common covariance matrix, . Pignatiello and Runger [19] defined as where and is the reference value based on the out-control value of which is also generally used in the univariate CUSUM schemes. A good choice of is often chosen about halfway between and. Additionally, is obtained by And is the number of subgroups (if any). It is formally defined as In this paper, we only focus on the individual observations, that is, ().

Although most approaches of the MCUSUM procedures are designed for phase II, the approach given in (2.5) can be viewed as a logical basis of phase I. This is due to the fact that it is based on the multivariate distance of related observation from in-control mean vector, somehow similar to charts. Sullivan and Woodall [13] suggested setting equals to zero for phase I when employing the CUSUM in (2.5) [13, 19].

2.3. Multivariate Exponentially Weighted Moving Average

The Exponentially Weighted Moving Average (EWMA) is another alternative to the Shewhart chart when small or moderate shifts occur in the process mean [1]. This chart is due to Roberts [12], who introduced a geometric moving average chart.

It is worth mentioning that the CUSUM and EWMA charts differ from each other in how we use the obtained information from observations in the HDS, that is, EWMA statistic assigns less and less weight to the past observations than the current observation, whereas all observations in the CUSUM procedure are weighted equally. Lowry et al. [20] introduced a Multivariate EWMA as a natural direct extension of the univariate case [20].

Let be a random vector from the variate normal distribution with known in-control as a poulation mean vector and as a known in-control common covariance matrix. Lowry et al. [20] defined MEWMA denoted as where is a diagonal matrix with diagonal elements and I is the identity matrix. Often all set to be equal if there is no any reason to weight past observation differently. The MWEMA chart signals if exceeds a related UCL, Where is the covariance matrix of and defined as When subgroup size is large enough, the MEWMA control chart can be calculated as Choosing the values of determines the sensitivity of the charts to different magnitudes of shifts. A smaller weight makes the MEWMA chart more sensitive to small shifts and larger leads the chart to be more sensitive for larger shifts. If , then the MEWMA is equivalent to control charts.

Montgomery [1] suggested choosing from the range. In practice, a popular choice are . As a good compromise, Croux et al. [21] and Gelper et al. [22] recommended .

Similar to MCUSUM in Section 2.2, the MEWMA chart defined in (2.10) can be used in phase I with some modifications. Sullivan and Woodall [13] suggested choosing because smaller value is more sensitive to a small shift. They also recommended centering the observation by modifying (2.8) to In phase I, the in-control mean vector and covariance matrix are not known and should be estimated. Sullivan and Woodall [13] recommended to apply the MWEMA chart for both reverse time order and regular time order of data to achieve equal detection probability for the same magnitudes of shift which start form the same number of observations from either end of the HDS.

3. High Breakdown Estimators based on the MVE and MCD

It is now evident that the classical sample mean or classical sample covariance (like ) are easily affected by abnormal observations [9, 23]. One simple way to overcome this problem in univariate cases is to use the sample median and MAD (median absolute deviation) instead of the sample mean and variance, respectively [24].

Although this strategy cannot be applied to multivariate data, the idea can be used in this case, by replacing the location and the scale estimators with measures of central tendency and dispersion which are more resistant to atypical data and make the multivariate approaches more robust. Minimum volume ellipsoid (MVE) and minimum covariance determinant (MCD) are frequently employed in the detection of multivariate outliers in regression contexts due to their high breakdown point and affine equivariant properties [16, 23, 2528]. These two high breakdown estimators were initiated first by Rousseeuw [16]. The finite sample breakdown point of an estimator shows that how many outlier, need to ruin the estimation of the estimator completely.

The basic strategy for obtaining the MVE estimator is to search ellipsoid from all ellipsoids which covers at least h points of dataset, where h can be taken equals to . Subsets of size h are called halfsets. After finding that ellipsoid, the usual sample mean and sample covariance of the corresponding ellipsoid are computed and declared as the MVE estimators. Rousseeuw and Leroy [23] proved that the finite sample breakdown point of these estimators is 0.5.

Due to the difficulties of computing the MVE, there are some proposed algorithms to calculate their values in the literatures. The interested reader can refer to Jensen et al. [29] to see the complete reference list of these algorithms. One of these algorithms is a subsampling method which is based on taking a fixed number of subsets randomly from the data set each of size . These randomly chosen subsets, determine the shape of an ellipsoid. Finally, the size of this ellipsoid inflates by multiplying it by a constant until covers subsets of data set [23, 27]. In this study, the subsampling algorithm is utilized because of its wide availability in statistical software packages such as SAS, S-Plus, and R. Another popular approach for finding robust estimates for mean vector and the covariance matrix is the MCD. The main idea is the same as MVE, but, here, MCD is obtained by finding the half set that gives the minimum value of the determinant of the covariance matrix. Its breakdown point is similar to the MVE. Analogous to the MVE, due to the difficulties in the calculations of MCD, there are some proposed algorithms to calculate its values [27, 30, 31]. Rousseeuw and Van Driessen [31] proposed a Fast-MCD algorithm which is easily available in the software packages and is used in this article. It is worth mentioning here that the reweighted MVE and MCD were employed in this study to increase the efficiency of both the MVE and MCD estimators [23, 32, 33].

4. Robust Multivariate Control Charts

As already been mentioned, the Hotelling’s statistic based on classical sample mean and sample covariance matrix is badly affected by outliers [7, 9, 34, 35].

Sullivan and Woodall [9] recommended replacing in (2.2) with a covariance matrix which is estimated by using the vector difference between successive observations of the HDS. This covariance matrix is defined as where is the difference between successive observations. Henceforth, the statistic which is calculated based on (4.1) is referred, as . Another approach to make chart more robust is to use the MVE or the MCD estimators instead of location and dispersion estimators of the Hotelling’s statistic which is applied first by Vargas in 2003. The application of robust control charts for individual observations based on the MVE and MCD have also been discussed extensively by Jensen et al. [29]. According to them, after obtaining the robust multivariate location and scale estimates given by either the MVE or the MCD, the robust Hotelling statistics are defined as follows: where . It is important to note that there is no guarantee that the assumptions of the matematical distribution of (2.2) preserved when replacing the location and scale estimators with robust versions. Although Wilimas et al. [35] were successful in the determination of the mathematical distribuion of the Hotelling’s based on the , still there is no exact known mathematical distribution for the Hotelling’s based on the MVE or the MCD. Hence, to overcome this problem, emprical distribution of the robust statistic is used for calculating the emprical control limits. Hereinafter, the robust version of Hotelling’s statistic based on the MVE or the MCD is indicated by . For applying the MCUSUM and MEWMA charts in the retrospective analysis, due to the fact that usualy in phase I of the monitoring scheme, the and are unknown, the classical sample mean vector and covariance matrix are utilizied to estimate them [13]. Since the MCUSUM and MEWMA procedures in (2.5), (2.6), (2.10), and (2.11) utilize the classical covariance matrix, those procedures may suffer from outliers. Hence, we propose using the robust covariance matrix from the MVE or the MCD estimators to formulate the robust version of the MCUSUM chart for individual observations, which are defined as follows: for . Likewise, the proposed robust version of the MEWMA chart for individual observations are defined as for and .

In the next section, we investigate the performance of our proposed robust MCUSUM and MEWMA charts for small and moderate shifts in the mean vector in the presence of both scatter and sustained shift outliers.

5. Simulation Study

In this section, a Monte Carlo simulation study is carried out to assess the performance of the control charts already discussed. The simulation is designed based on three subgroups each of size with the number of characteristics , and 10. Let us assume that the in-control process is a -variate Normal distribution with the mean vector and the variance-covariance matrix . The out-control process is a -variate Normal distribution with the same variance-covariance matrix but with shifted mean vector as . The amount of the shift is measured by a scalar that is defined as This measure is called noncentrality parameter and, hereafter, is referred as ncp. It is clear from (5.1), that the severity of the shift only depends on the values of . Additionally, it can be assumed that is a zero vector and is a identity matrix.

Due to the unknown distribution of the proposed robust statistics when replacements of estimators were made, the approximated empirical control limits are obtained by simulation. The control limits are exact when applied to a single point in phase I, despite this phase is a retrospective analysis of all observations. In this situation, is the probability of false alarm for each point plotted on the control chart. Therefore, if all of the statistics were distributed independently, the probability of false alarm would be, where denotes the number of subgroups. In this case, the probability of false alarm is called the overall false alarm.

Practically, it is preferred to determine the control limits by simulation to give a specified overall false alarm even though the distributions of statistics are known [2, 13, 35]. Hereafter, we refer to as the overall false alarm. Since, the lower control limits for all control charts which are discussed earlier are often set to zero [1, 7, 13, 29], the simulated empirical upper control limits (UCLs) are obtained by generating 5000 in-control datasets for each combination of .

Due to the affine equivariant property of all preceding statistics, these limits are applicable to any values of and . In this article, the overall false alarm is given by .

The empirical simulated UCLs were determined by calculating all the related statistics for each observation in the generated datasets and record the maximum value for the corresponding dataset. In order to retain, the 95th percentile of the recorded maximum values is declared as the simulated UCLs.

The obtained simulated UCLs for different types of , MCUSUM and MEWMA charts, are exhibited in Tables 1 and 2 for different sizes of and . Following the idea of Sullivan and Woodall [13], the value of is chosen to be 0.05.

The estimated value, of in all tables refer to the MCUSUM, MEWMA and Hotelling’s charts, respectively which are computed by using the classical covariance matrix. Likewise, the denote the MCUSUM, MEWMA, and Hotelling’s charts respectively which are based on the successive covariance matrix, respectively.

Traditionally, the performance of control charts is evaluated by using the Average Run Length (ARL) index. However, in the retrospective analysis, the ARL is not meaningful because a signal is generated for all observations in the HDS. Instead, for comparing control charts in phase I, the signal probability is used to achieve a specified overall false alarm. For estimating the probability of signals in the existence of scatter outliers, four percentage of outliers, denoted as , are generated for each dataset. Then a contaminated dataset of size in the dimensions is generated for different values of ncp. Each generated dataset is compared with the corresponding empirical UCL in Tables 1 or 2.

This process is repeated for 5000 times, and eventually the proportion of datasets that have at least one point greater than the UCLs is reported as the probability of signal for each respective chart. Due to space constraint, only the results for and , and are exhibited in Tables 3 and 4. Other results are consistent.

As shown in Tables 3 and 4, both proposed Robust MCUSUM and MEWMA charts outperform the based on the MVE or the MCD estimators except for. At 5% outliers, the have higher probability of signals than MCUSUM and MEWMA charts for all ncp. However, as the value of increases, the probability of detecting shift for the Robust MCUSUM and MEWMA increases. It is interesting to note that by increasing the number of subgroup size both MCUSUM and MEWMA charts based on MCD is slightly better than that when based on MVE. The values for are even worse than those for based on the MVE or the MCD in all cases (the results are not shown here).

The results for other levels of and are consistent but are not reported here due to space limitations. However, when the dimension increases, the subgroup size should also be increased to keep the consistency due to the fact that as rises for a fixed value of , the breakdown point of the MVE and MCD decreases. For example, Figures 1 and 2 illustrate the probability of signals in the case of scatter outliers for MCUSUM and MEWMA procedures, for and . It is apparent from these figures that although m is large, for fixed values of ncp and , the performance of the respective control charts for is better than when .

The situation in which the mean vector after observations changes from in-control value to out-control value is called sustained shifts and is modeled as where and are in-control and out-control mean vector, respectively. In this paper, to assess the performance of our proposed control charts in comparison with other existing control charts, two situations are considered for the sustained shift case: first, when a small or moderate change occurs midway in the HDS and, secondly, when a small or moderate change happens near the end or at the beginning of data. It is important to note that for phase I of monitoring scheme of a process, the probability of detection shift, which appears either at the beginning or at the end of the HDS, should be the same due to the fact that the is unknown in this phase. That is, in both cases some observations may come from one statistical distribution and some other observations may come from another distribution [13].

For the shift in the middle of the HDS, the required datasets are obtained by generating 50% of observations from in-control process and the other 50%, from the shifted observations for different values of the ncp. The shifted generated observations are compared with the corresponding empirical UCL from Tables 1 or 2 and the signal is recorded for the first shifted observation that exceeds the respective UCL. This process is repeated 5000 times for each combination of and , and eventually the proportion of signals is declared as the signal probability.

For the case of existing shift at the end or at the beginning of a data, for simplicity, we assume shift occurs at the end of the HDS. Then, the data sets are simulated consequently in the form of generating 80% of observations from the in-control process and 20% from the shifted process. Subsequently, the signal probability is calculated for each control chart. Due to space limitation, only the results of , 50 and 100 for are presented in Tables 5, 6, and 7.

When the MEWMA control chart is applied, the time order of observations in the HDS should be taken into consideration. This is because of the weight effect of the previous observations. This effect causes two sustained shift situations with the similar magnitude in which one located near the first observation in the HDS and the other is at the same distance from the last observation, which have different probability of signal. Nonetheless, it is desirable to have the same probability of signal for the same shift magnitude, in which the number of observations should be the same, starting from either end of the data.

In order to achieve this property, the MEWMA procedure is used with the respective simulated UCL for both original time order and reverse time order of the generated observations in the HDS.

It is worth mentioning that this idea is not helpful for the scatter outlier situation since the distribution for an outlier is different immediately before and in the succeeding observations. Thus, including successive observations contribute no useful information, whereas with the sustained shifts, there are several observations in a sequence from the same distribution.

The results obtained from Tables 57 reveal that the performance of the proposed MCUSUM and MEWMA charts, based on the MVE and MCD, are not effective in both sustained shift situations in comparison with the MCUSUM and MEWMA procedures based on the successive covariance matrix. Under such situation, it appears that the MCUSUM and MEWMA procedures, based on the classical covariance matrix, are even better than the proposed charts in detecting small and moderate shifts. It may happen due to the fact that the design of both MCUSUM and MEWMA charts are tuned to be the best for a specific sustained shift.

It can also be seen from the Tables 57 that for both shifts, that is, in the midway and at the last 20% observations of data, the MCUSUM and MEWMA charts based on the successive covariance matrix ( outperform the classical covariance matrix .

For the midway shift, the MCUSUM charts are slightly better than the MEWMA and the MWEMA is somewhat better in the case of a shift at the last 20% observations of the data. Both the MCUSUM and MEWMA charts have less sensitivity when a shift occurs at the last 20% observations of the generated data compared with the same value of ncp at the midpoint of the data except for the maximum severity of shift (ncp = 10). However, by increasing m, their performances are fairly closed for smaller severity of shift. Although the performance of the is better than the statistic based on the MVE or MCD, it is completely apparent from these tables that the performances of all s are significantly less efficient than the MCUSUM and MEWMA charts. Moreover, it is interesting to note that by increasing m, the performance of the is also increased. These results are consistent for other levels of p and m which are not presented here due to space limitations.

6. Numerical Example

In this section, a real dataset taken from Quesenberry [36] is used to evaluate the performance of the proposed charts. The original data set consists of measurements of eleven quality characteristics of 30 products, but here we only consider the first two characteristics. The data is presented in Table 8.

The MCUSUM, MEWMA procedures, and control charts are shown in Figures 3, 4, and 5, respectively. It is worth mentioning that from the plot of Figure 5, there is a sustained shift in the data set. By referring to the control limits of Table 1, it can be seen from Figure 3 that the MCUSUM chart based on successive covariance matrix, identified observations 21 to 26 as outliers. However, the MCUSUM chart based on the MVE or the MCD, failed to detect any outlier.

The MEWMA procedures shown in Figure 4 identify observations 22–27 as outliers, based on the respective UCLs in Table 2, for the . The results based on time reverse order of data, also exhibited on the right-hand side of Figure 4. In this case, for example, observations number 1 and 2 describe observations number 30 and 29 in the original time order of the dataset. It is clear from this figure that nothing is gained by the MVE or the MCD estimators. According to the UCLs for of Table 1, Figure 5 shows that only one observation (observation number 2) is detected as outlier by statistics based on the MVE and the successive covariance estimators while the based on the MCD estimator unable to detect any point as outlier.

7. Conclusion

In the retrospective analysis of a process, detecting small or moderate shifts in the mean vector is important in order to correctly estimate the parameters of a process. Since the CUSUM or the EWMA procedures are more sensitive to small and moderate changes in the dataset than charts, this study is designed to investigate the application of these control procedures in multivariate data for the retrospective analysis of a process monitoring scheme, in two situations, namely, in the presence of scatter outliers and the sustained shift in the historical dataset.

We proposed robust MCUSUM and robust MEWMA procedures to remedy the problems of outliers by incorporating the MVE and the MCD in the development of such robust charts.

The results of this study indicate that the robust MCUSUM and MEWMA procedures, based on the MVE or the MCD estimators, improve the detection probability of scatter outliers with a small shift in the mean vector as compared to the robust charts except at 5% percentage of contaminated data. Moreover, the results indicate that the MCD-based estimators are more superior than the MVE-based estimators when the subgroup size of observation is increased.

Nonetheless, the proposed MCUSUM and MEWMA charts which are based on the MVE or the MCD are not effective in detecting sustained shift in the dataset.

When the shift occurs midway in the dataset, the performance of the MCUSUM are somewhat superior to the MEWMA. On the other hand, the MEWMA is slightly better than the MCUSUM, in the situation when the shifts happen either near the beginning or near the end of the dataset. However, by increasing the subgroup size, the performance of both charts is fairly closed.

Both the MCUSUM and MEWMA control charts are more sensitive to a small or moderate shift in the sustained shift situation, when it occurs in the middle of the dataset compared with the case that a shift occurs near the beginning or the end of data.