Gaussian mixture modelling by exploiting a competitive stop EM algorithm

To improve the robustness of order selection and parameter learning for the Gaussian mixture model (GMM), this paper proposes a competitive stop expectation-maximization (EM) algorithm based on two stop conditions. The first condition is a Lilliefors-test-based multivariate (MV) normality criterion, which is used to decide whether to split a component into two new components; the EM algorithm stops splitting when all components pass the MV normality test. The second condition uses the minimum description length (MDL) criterion, which competes with the first condition to prevent the EM algorithm from over-splitting. Simulation experiments verify the effectiveness of the proposed algorithm.


Introduction
In univariate and multivariate statistical processing, the GMM is a flexible and powerful probabilistic model. It is widely used for statistical modeling and analysis of observed data [1], [2], in fields such as signal processing, image processing, machine learning, pattern recognition, and computer vision.
The classic GMM fitting approach is the standard EM algorithm, which obtains a maximum likelihood (ML) estimate of the parameters of the mixture model. However, when applying the standard EM algorithm to fit a GMM, we face two problems [3], [4]: first, the order of the GMM must be known in advance; second, the algorithm is very sensitive to the initial model parameters.
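For reference, the standard EM iteration for a fixed-order GMM can be sketched in a minimal one-dimensional, two-component form. This is an illustrative pure-Python sketch (the initialization scheme and names are our own, not from any cited work):

```python
import math
import random

def normal_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def em_gmm_1d(x, iters=500, tol=1e-6):
    """Standard EM for a 1-D two-component GMM (the order K is fixed in advance)."""
    n = len(x)
    mean = sum(x) / n
    var0 = sum((xi - mean) ** 2 for xi in x) / n
    mus = [min(x), max(x)]          # crude spread-out initialization
    vars_ = [var0, var0]
    ws = [0.5, 0.5]
    prev_ll = -float("inf")
    for _ in range(iters):
        # E-step: responsibilities and log-likelihood
        resp, ll = [], 0.0
        for xi in x:
            ps = [w * normal_pdf(xi, m, v) for w, m, v in zip(ws, mus, vars_)]
            s = sum(ps)
            ll += math.log(s)
            resp.append([p / s for p in ps])
        # M-step: reestimate weights, means, variances
        for k in range(2):
            nk = sum(r[k] for r in resp)
            mus[k] = sum(r[k] * xi for r, xi in zip(resp, x)) / nk
            vars_[k] = sum(r[k] * (xi - mus[k]) ** 2 for r, xi in zip(resp, x)) / nk
            ws[k] = nk / n
        if abs(ll - prev_ll) < tol:  # stop on small log-likelihood change
            break
        prev_ll = ll
    return ws, mus, vars_
```

Note that both problems mentioned above are visible here: the order is hard-coded, and the result depends on the `min(x)`/`max(x)` initialization.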
To solve these problems, several EM variants have been proposed. In [5], new components are added using a Bayesian splitting procedure. In [6], an EM variant is proposed that starts with a single component and uses a Mahalanobis distance-based normality test as the split criterion; it stops when no component deviates from MV normality. The algorithm proposed in [7] also starts with a single component and employs an entropy-based Gaussian deficiency as the split criterion.
To improve the robustness of order selection and parameter learning for the GMM, a novel competitive stop EM (CSEM) algorithm is proposed. The algorithm is based on two stop conditions, starts with a single component, and finds the ML solution. The first condition is a Lilliefors-test-based multivariate (MV) normality criterion, which is used to decide whether to split a component into two new components and stops the EM algorithm when no component deviates from MV normality. The second condition uses the MDL criterion, which prevents the EM algorithm from over-splitting. This paper is organized as follows. In Section 2, the proposed CSEM algorithm is described. In Section 3, the effectiveness of the algorithm is verified by simulation experiments. Section 4 concludes the paper.

CSEM algorithm
Let the observation sample set X be used to fit a GMM with K components, k = 1, ..., K, where each component k obeys a Gaussian distribution with weight w_k, mean mu_k, and covariance Σ_k. The goal of the CSEM algorithm is to find the maximum likelihood estimate of the full parameter set Θ, i.e., the Θ that maximizes ln p(X | Θ). Fig. 1 shows the processing flow of the CSEM algorithm.
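The order-selection logic of this flow can be sketched as a small control loop. This is a minimal sketch, assuming the two stop-condition tests are supplied as callables; `fails_normality` and `mdl` are illustrative names, not from the paper:

```python
def csem_order(fails_normality, mdl, k=1, k_max=50):
    """Competitive-stop loop over the model order k.

    fails_normality(k): True if some component of the refitted k-component
        model deviates from MV normality (stop condition 1 not yet met).
    mdl(k): MDL score of the refitted k-component model (stop condition 2).
    """
    while k < k_max:
        if not fails_normality(k):
            break              # all components pass the normality test
        if mdl(k + 1) >= mdl(k):
            break              # splitting would raise MDL: competitive stop
        k += 1                 # accept the split and refit
    return k
```

In the full algorithm each step also selects which component to split and reruns EM; only the interaction of the two stop conditions is shown here.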

Search the component to be split
Let Ω be an index set that contains the indices of all components that do not obey the MV normal distribution.
For any component k, its entropy can be expressed as H_k = (1/2) ln((2πe)^d |Σ_k|), where d is the data dimension and Σ_k is the covariance matrix of component k. Therefore, the index k* of the component to be split can be obtained by k* = arg max_{k ∈ Ω} H_k.
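The selection rule can be sketched as follows, assuming the determinant of each component's covariance matrix is available (`cov_dets` is an illustrative name):

```python
import math

def gaussian_entropy(cov_det, d):
    """Differential entropy of a d-dimensional Gaussian:
    H = 0.5 * ln((2*pi*e)^d * |Sigma|) -- monotone in |Sigma|."""
    return 0.5 * math.log(((2.0 * math.pi * math.e) ** d) * cov_det)

def component_to_split(omega, cov_dets, d):
    """k* = argmax over k in Omega of H_k: the most spread-out
    non-normal component is split first."""
    return max(omega, key=lambda k: gaussian_entropy(cov_dets[k], d))
```

Because the entropy is monotone in |Σ_k| for fixed d, the rule effectively picks the non-normal component with the largest covariance determinant.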

Splitting process
For the cluster selected by the criterion above, the component is split into two new components, whose mixing weights p are determined from the parameters of the original component.
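Since the paper's exact expressions for the new components did not survive extraction, the following shows only a common splitting scheme (an assumption, not the paper's formula): the weight is halved, the means are moved apart along the standard deviation, and the variance is reduced.

```python
import math

def split_component_1d(w, mu, var):
    """Split one 1-D Gaussian component (w, mu, var) into two.

    Common heuristic (assumed, not the paper's expressions): halve the
    weight, offset the means by half a standard deviation, halve the
    variance."""
    s = math.sqrt(var)
    return [(w / 2.0, mu - 0.5 * s, var / 2.0),
            (w / 2.0, mu + 0.5 * s, var / 2.0)]
```

In the multivariate case, the analogous heuristic offsets the means along the principal eigenvector of the covariance matrix.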

Standard EM algorithm
The standard EM algorithm iteratively fits the GMM, with initial parameters taken from the splitting step. The iteration stops when the increase of the log-likelihood between two successive iterations falls below a threshold ε, where ε > 0; in practical applications, ε is usually set to 10^-5.

This paper proposes to use the MDL criterion as a competitive stop condition. When this condition is satisfied, the standard EM algorithm stops iteratively fitting the GMM, which prevents over-splitting. The MDL criterion can be expressed as MDL(Θ) = -ln p(X | Θ) + (L/2) ln N, where p(X | Θ) is the PDF of the GMM, L is the number of free parameters of the model, and N is the number of samples. If the MDL value of the split model is not smaller than that of the current model, the competitive stop condition is satisfied and splitting stops.

Simulation experiments
Example 1: Fig. 2 shows the iterative process (evolution) of the proposed CSEM algorithm. The algorithm starts by treating all samples as one component and computes the corresponding mean and covariance matrix. After several iterations, the correct model order is estimated. Over 100 repeated experiments, the clustering correctness of the proposed algorithm is 100% and the average number of EM iterations is 34.47.

Example 2: To compare with other EM variants, we select the two algorithms presented in [8] and [7]. The first is referred to as the Backward-MDL2-EM (BMEM) algorithm; its initialization in this experiment is the same as that discussed in [8]. The second is referred to as the Mahalanobis distance-based EM (MDEM) algorithm. This example uses the three sets of test samples given in [7], each of which includes 500 test samples. Each algorithm is repeated 1000 times, and the correctness of model order selection and the average and standard deviation of the EM iterations are computed.
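To make the MDL competitive stop condition concrete, the two-part MDL score and the stop test can be sketched as follows. The free-parameter count assumes full covariance matrices; the exact coding used in the paper may differ:

```python
import math

def gmm_free_params(K, d):
    """Free parameters of a K-component, d-dimensional GMM with full
    covariances: (K-1) weights + K*d means + K*d*(d+1)/2 covariance terms."""
    return (K - 1) + K * d + K * d * (d + 1) // 2

def mdl(log_likelihood, K, d, n):
    """Two-part MDL: -ln p(X|Theta) + (L/2) * ln N."""
    return -log_likelihood + 0.5 * gmm_free_params(K, d) * math.log(n)

def competitive_stop(ll_current, ll_split, K, d, n):
    """Stop splitting when the (K+1)-component fit does not lower MDL."""
    return mdl(ll_split, K + 1, d, n) >= mdl(ll_current, K, d, n)
```

The penalty term grows with the order K, so a split is accepted only when the likelihood gain outweighs the extra parameters it introduces.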

Experimental results
The correctness of model order selection and the average and standard deviation (STD) of the EM iterations are compared. The results for the above three data sets are presented in Table I. The number of EM iterations of the proposed algorithm is very close to that of the MDEM algorithm and is relatively low compared with the BMEM algorithm. For data sets A and C, whose mixture components do not overlap, the correctness of model order selection is 100%. For data set B, the clustering correctness of the proposed CSEM algorithm is 84.8%.

Example 3: Mixtures of factor analyzers (MFA), proposed in [9], are basically mixtures of Gaussians. To verify the effectiveness of fitting an MFA, this example applies the proposed algorithm to noisy shrinking-spiral data. Fig. 3 shows that the proposed algorithm successfully extracts a one-dimensional manifold from the data. The selected model order is K = 12. We repeat the fitting algorithm 30 times and record the selected model order.

Conclusion
To make the algorithm insensitive to the initial parameters and to prevent it from converging to a local optimum, this paper combines the Lilliefors test, the MDL criterion, and the standard EM algorithm into a competitive stop EM algorithm for fitting a GMM. The algorithm is initialized as a single Gaussian component, and the Lilliefors test is used to check the normality of each component in the GMM. Among the components that fail the normality test, the algorithm selects the one with the largest entropy for splitting. The MDL criterion acts as a competitive stop condition to prevent the algorithm from over-splitting. It is worth noting that this paper only uses noisy shrinking-spiral data to verify the effectiveness of the proposed algorithm in fitting an MFA; applying the proposed algorithm to other practical areas is the focus of further research.