On Two Mixture-Based Clustering Approaches Used in Modeling an Insurance Portfolio

Tatjana Miljkovic 1,* and Daniel Fernández 2,3 1 Department of Statistics, Miami University, Oxford, OH 45056, USA 2 Research and Development Unit, Parc Sanitari Sant Joan de Déu, Fundació Sant Joan de Déu, CIBERSAM, Sant Boi de Llobregat, Barcelona 08830, Spain; df.martinez@pssjd.org or dfdez23@outlook.com 3 School of Mathematics and Statistics, Victoria University of Wellington, Wellington 6140, New Zealand * Correspondence: miljkot@miamioh.edu; Tel.: +513-529-3299


Introduction
The multivariate classification ratemaking has made significant advances in the past decade with the introduction of different types of statistical methods for pricing individual risks. These new techniques bring important benefits to insurers such as more equatable pricing of the individual risks, better competitive advantage, and protection against adverse selection and support informed decisions about the type of risk that the insurer is willing to write. Werner and Modlin (2016) introduced many of these techniques with a primary focus on generalized linear models (GLMs), which are used in pricing and reserving.
According to the Actuarial Review (Baribeau 2016) "the predictive modeling is advancing far beyond its general linear model (GLM)-based roots due to the explosion of new data sources, technological innovation and advanced analytics." Recent literature on extensions of GLMs with mixture modeling has been a part of this innovation process, and this is proposed by several researchers. A finite mixture of Poisson regression models with an application to insurance ratemaking was studied by Bermúdez (2012) for count data. The authors recognized that unobserved heterogeneity in this type of data requires more structure in the modeling techniques and that the current model fitting can be improved by using finite mixture of bivariate Poisson regressions. The unobserved heterogeneity in the zero-inflated type of data is attributed to the fact that variance is often larger than the mean. In 2015, another Poisson mixture model for count data was considered by Brown and Buckley (2015) for modeling heterogeneity in a Group Life insurance portfolio. The authors showed violation of heterogeneity across groups and suggested that putting similar groups together is necessary for further analysis. Later, a non-parametric Bayesian approach was considered in modeling this mixture distribution by incorporating a Dirichlet process prior and using reversible-jump Markov chain Monte Carlo (Green 1995) to estimate the number of components. Garrido et al. (2016) and Shi et al. (2015) consider relaxing the GLM assumption of independence between the number and the size of claims. The multi-modality of univariate insurance losses was recently modeled using mixtures by Miljkovic and Grün (2016), Verbelen et al. (2014), Lee and Lin (2010), and Klugman and Rioux (2006). The findings of these studies indicate a need to explore the modeling of unobserved heterogeneity in a regression setting where the amount of claims is linked to several other covariates and the goal is to find a finite number of sub-populations of policyholders in an insurance portfolio.
In predictive modeling with a GLM setting, unobserved heterogeneity may occur when important covariates have been omitted and their influence is not accounted for in the analysis (Grün and Leisch 2008). The unobserved heterogeneity may not be fully captured when a single component GLM is used to model the data set. While these techniques have been explored in other fields, they have not been fully adopted in the actuarial field. Actuaries may consider relaxing the assumption that the regression coefficients and dispersion parameters are the same for all observations. In this case, a goal is to find groups of policy holders with similar regression coefficients.
Multiple techniques have been developed which deal with the grouping of heterogeneous data such as hierarchical clustering (Johnson 1967;Kaufman and Rousseeuw 1990), association analysis (Manly 2005), and k-means clustering algorithm (Jobson 1992;Lewis et al. 2003;McCune and Grace 2002). There are a number of clustering methods based on mathematical techniques such as association indices (Chen et al. (2011); Wu et al. (2008)), distance metrics (Everitt et al. (2001)), and matrix decomposition (Manly (2005); Quinn and Keough (2002); Wu et al. (2007)). However, these algorithms do not have a likelihood-based formulation and therefore do not provide a reliable method of model selection or assessment. A particularly powerful likelihood-based approach to one-dimensional clustering based on finite mixtures, with the variables in the columns being utilized to group the objects in the rows, is provided by McLachlan and Basford (1988), McLachlan and Peel (2004), Everitt et al. (2001), Böhning et al. (2007), Wu et al. (2008), and Melnykov and Maitra (2010).
The objective of this article it to review two recently proposed mixture-based clustering approaches for modeling unobserved heterogeneity in an insurance portfolio, which, to the best of our knowledge, have not yet been exploited by the practitioner in the actuarial science area. Moreover, we focus on modeling severity of losses as a function of several covariates arising from different sub-populations. These sub-populations are grouped into clusters based on a similar historical experience or well-defined similarity rules and the results of each group can be considered in underwriting and ratemaking. We particularly consider two different data modeling frameworks: mixture modeling of ordinal variables (for classified data based on the level of risk) and the mixture of GLMs for a mixed type of covariates (for individual data). As an illustration, we show how these two mixture models can be used to model an insurance portfolio and have complementary properties. Similarities and differences of both approaches are discussed in the context of the data structure available for the selected insurance portfolio.
In an insurance context, ordinal variables are often defined based on a risk classification with intrinsic order. For example, driver age can be treated as an ordinal outcome with the youngest drivers being associated with the highest risk propensity. Similarly, losses can be treated on an ordinal scale based on the intensity of claims. There are a variety of approaches to the modeling of ordinal data that properly respect the ordinal nature of the data. Liu and Agresti (2005) and Agresti (2010) described various proportional odds version models using adjacent-categories logits, cumulative logits McCullagh (1980), and continuation-ratio logits McCullagh and Nelder (1989). Our article focuses on the ordered stereotype model (OSM) introduced by Anderson (1984), which is more flexible than the most common models as a result of adding additional score parameters associated with the distance among ordinal categories.
The generalized linear mixed cluster-weighted model (known as CWM) was recently proposed by Ingrassia et al. (2015) as a flexible family of mixture models for fitting the joint distribution of a random vector composed of a response variable and a set of mixed-type covariates with the assumption that continuous covariates come from Gaussian distribution. The CWM method does not consider ordinal data; thus, we are also interested in the mixture-based clustering for OSM built to handle ordinal data previously proposed by Fernández et al. (2016). The ordinal data in insurance setting can be considered if the policyholders have been previously classified based on the level of risk (e.g., 1 is the lowest level of risk and 5 is the highest). This classification is often obtained for underwriting and risk management purposes. Both methods assume that an inherent clustering effect is present in the data; therefore, each sub-population of the variable of interest, such as claims, can be modeled separately. One of the main advantages of these approaches over other non model-based clustering techniques is their likelihood-based foundation because maximum likelihood theory provides estimators and model selection. Another advantage of these two methods is that they complement each other, i.e., they allow for flexibility in terms of data collection and categorization. If the collected data is all organized based on the ordinal levels corresponding to several risk classifications, then a mixture-based clustering method for ordered stereotype model would be a suitable tool to further test the heterogeneity in the data. For insurance data sets that are currently analyzed using existing GLMs, the CWM approach would allow for detecting an unobserved heterogeneity in the data by testing if more than a single component GLM fits the data better. If the CWM model provides a better fit to the data, then this model should replace a single component GLM, where applicable.
The remainder of this paper is organized as follows. Section 2 introduces the formulation and model estimation for both mixture-based methods as well as the model selection criteria. Section 3 is devoted to the presentation of the data set used in this study and the results obtained using these two approaches. Conclusions are summarized in Section 4. The appendix provides additional results with respect to the analysis conducted as part of Section 3.

Methodology
2.1. Mixture-Based Clustering for the Ordered Stereotype Model Fernández et al. (2016) proposed an extension of the model-based approach proposed in Pledger and Arnold (2014) for ordinal responses. This approach considered finite mixture models to define a clustering structure and used the OSM introduced by Anderson (1984) with the aim of formulating the ordinal procedure. The stereotype model has the advantage that it requires a smaller number parameters to be estimated than the more commonly used baseline-category logit model or the multinomial logistic regression model (Agresti 2010, Section 4.3.1). Moreover, this model estimates from the data possibly unequal spacing among the levels of the ordinal responses in the form of a set of ordered score parameters. These scores are directly interpretable as measures of similarity between neighboring levels of the variable. This ease of interpretation is an advantage over other ordinal-based models such as the proportional odds model and the continuation-ratio model.
We illustrate here the model formulation for the row clustering version. The analysis for the column clustering version is basically the same, but exchanging the parameters related to rows by their equivalent column parameters. For the row clustering version, the probability that the ordinal response response {y ij } (i = 1, . . . , n and j = 1, . . . , m) takes the category k (k = 1, . . . , q) is represented by the following log odds: where the inclusion of the monotone increasing constraint 0 = φ 1 ≤ φ 2 ≤ · · · ≤ φ q = 1 ensures that the variable response Y = (Y 1 , . . . , Y n ) = {y ij } is ordinal (see Anderson (1984)). For simplicity, we assume that the ordinal responses all have the same number of categories q so that y ij ∈ {1, . . . , q}, and they correspond to the policy holders groups that are already classified based on some underwriting criteria. The parameters {µ 2 , . . . , µ q } are the cut points, and {φ 2 , . . . , φ q } are the parameters which can be interpreted as the "scores" for the categories of the response variable Y ij . G ≤ n is the number of row groups, and i ∈ g means row i is classified in the row cluster g. The set of parameters {β 1 , . . . , β m } quantify the main effects of the m. We restrict µ 1 = φ 1 = 0, φ q = 1, ∑ G g=1 α g = ∑ m j=1 β j = 0 to ensure identifiability. It is important to note that the actual membership of the rows among the G row-clusters is unknown; therefore, it is considered missing information. Further, we define {τ 1 , . . . , τ G } as the (unknown) proportions of rows in each row group, with ∑ G g=1 τ g = 1. We can view {τ g } as the a priori row membership probabilities.
The probability of the data response y ij being equal to the category k conditional on a given clustering is where Ω is the parameter vector {{µ k }, {φ k }{α g }, {β j }}.
Likelihood functions: The (incomplete) likelihood of the data is where θ gjk is the probability of the data response defined in Equation (2).
We define the unknown row group memberships through the following indicator latent variables, where i ∈ r indicates that row i is in row group g. It follows that ∑ G g=1 Z ig = 1 (i = 1, . . . , n), and, since their a priori row membership probabilities are {τ g }, Consequently, the complete data log-likelihood of this model using the known data {y ij } and the unknown data {z ig } is as follows: Parameter estimation: The parameter estimation for a fixed number of components G is performed using the maximum likelihood estimation approach fulfilled by means of the expectation-maximization (EM) algorithm proposed by Dempster et al. (1977) and used in most finite mixture problems discussed by McLachlan and Peel (2004).
The EM algorithm consists of two steps: expectation (E-step) and maximization (M-step). As part of the E-step, a conditional expectation of the complete data log-likelihood function is obtained given the observed data and current parameter estimates. In the finite mixture model, the latent data corresponds to the component identifiers. As part of the E-step, the expectation taken with respect to the conditional posterior distribution of the latent data, given the observed data and the current parameter estimates, is referred to as the posterior probability that response y ij comes from the gth mixture component, computed at each iteration of the EM algorithm. The remaining part of the M-step requires finding component-specific parameter estimates Ω by solving numerically the maximum likelihood estimation problem for each of the different component distributions.
The E-step and M-step alternate until the relative increase in the log-likelihood function is no bigger than a small pre-specified tolerance value, when the convergence of the EM algorithm is achieved. In order to find an optimal number of components, maximum likelihood estimation is obtained for each number of groups G, and the model is selected based on a chosen model selection criterion.
In this model, the EM algorithm performs a fuzzy assignment of rows to clusters based on the posterior probabilities. The EM algorithm is initialized with an estimate g }} of the parameters and proceeds by alternation of the E-step and M-step to estimate the missing data { Z ig } and to update the parameter estimates. In this section, we develop the E-step and M-step for row clustering. This development follows closely Fernández et al. (2016) (Section 3).

E-Step:
In the tth iteration of the EM algorithm, the E-Step evaluates the expected values Z ig of the unknown classifications Z ig conditional on the data {y ij } and the previous estimates of the The conditional expectation of the complete data log-likelihood at iteration t is given by , and, applying Bayes' rule, we obtain Finally, we complete the E-step by substituting the previous expression in the complete data log-likelihood at iteration t expressed in Equation (3), (4)

M-step:
The M-step of the EM algorithm is the global maximization of the log-likelihood (4) obtained in the E-step, now conditional on the complete data {{y ij }, { Z ig }}. For the case of finite mixture models, the updated estimations of the term containing the row-cluster proportions {τ 1 , . . . τ G } and the one containing the rest of the parameters Ω are computed independently. Thus, the M-step has two separate parts.
The maximum-likelihood estimator for the parameter τ g where the data Z ig are unobserved is To estimate the remaining parameters Ω, we must numerically maximize the conditional expectation of the complete data log-likelihood in Equation (3). In the case of row clustering, where the maximization is conditional on the parameter constraints following Equation (1).

The General Linear Cluster-Weighted Model
In this section, we summarize the CWM model proposed by Ingrassia et al. (2015) starting with the relevant background. Suppose that Y = (Y 1 , . . . , Y N ) is a vector of independent random variables with the density function of a distribution from the exponential family given by where θ i is a natural parameter and φ is a scale parameter. Consider that Y is related to a set of variables x = (x 1 , . . . , x p ) through the following linear relationship where β is p × 1 vector of parameters and X is an N × p design matrix, η is the linear predictor and g[E(y|x)] is a link function, and it is considered as a simple mathematical function in the original formulation of GLMs by Nelder and Wedderburn (1972). In this model, Y is referred to as a response variable and x is a vector of continuous and discrete type explanatory variables (e.g., covariates). Let (X , Y) be a vector defined on some space with values in X × Y. Further, assume that there exist G partitions of , defined as 1 , . . . , G . Gershenfeld (1997) has introduced CWM based on Gaussian mixtures with the joint distribution, f (x, y) of (X , Y) , expressed as follows where f (y|x, g ) and f (x, g ) are conditional and marginal distributions of (X , Y) and τ g represents the weight of the g-th component s.t. ∑ G g=1 τ g = 1, τ g > 0. Ingrassia et al. (2015) introduced a broader family of CWMs that allows for the component conditional distributions to belong to the exponential family and for the mixed ype of covariates. The joint distribution of a random vector (X , Y) is obtained by splitting the covariates into continuous and discrete as X = (V , W ) under the assumption of independence. In this setting, the model in Equation (6) can be expressed as where f (y|x, ϑ g ) is a conditional density of y|x, g with parameter ϑ g , f (v, θ g ) is the marginal distribution of v with parameter θ g , f (w, θ g ) is the marginal distribution of w with parameter θ g , and v and w are the vectors of continuous and discrete covariates respectively. Further, all model parameters are defined as Φ = (θ, τ, ϑ). The conditional distribution f (y|x, ϑ g ) is assumed to belong to the exponential family.
Modeling for f (y|x, ϑ g ) and f (x, θ g ): The CWM model is based on the assumption that f (y|x, ϑ g ) belongs to the exponential family of distributions that are strictly related to GLMs. The link function in Equation (5) relates the expected value g(µ g ) = β 0g + β 1g x 1 , . . . , +β pg x p . We are interested in estimation of the vector β g , so the distribution of y|x, g is denoted by f (y|x, β g , λ g ), where λ g denotes an additional parameter associated with a two-parameter exponential family.
The marginal distribution f (x, θ g ) has the following components: f (v, θ g ) and f (w, θ g ). The first component is modeled as p-variate Gaussian density with mean µ g and covariance matrix Σ g as The marginal density f (w, θ g ) assumes that each finite discrete covariate W is represented as a vector w r = (w r1 , . . . , w rc r ) , where w rs = 1 is w r , which has the value s, s.t. s ∈ {1, . . . , c r }, and w rs = 0 otherwise.
where γ g = (γ g1 , . . . , γ gq ) , γ gr = (γ gr1 , . . . , γ grc q ) , γ grs > 0, and ∑ c r s=1 γ grs , r = 1, . . . , q. The density f (w, γ g ) represents the product of q conditionally independent multinomial distributions with parameters γ gr , r = 1, . . . , q. Considering these assumptions, the model in Equation (7) can be stated as If the CWM models allow for the conditional distribution f (y|.) to be Binomial or Poisson, then they are referred to as the Binomial CWM or the Poisson CWM, respectively. The CWM are also built to handle Gaussian, log-normal, and gamma distributions of f (y|.). In the next subsection, we will discuss the parameter estimation of the model in Equation (9).

Parameter Estimation:
The EM algorithm discussed in the previous section is used to estimate parameters of this model. Let (x 1 , y 1 ) , . . . , (x n , y n ) be a sample of n independent pairs observations drawn from the model in Equation (9). For this sample, the complete data likelihood function, L(Φ), is given by where z ig is the latent indicator variable with z ig = 1, indicating that observation (x i , y i ) originated from the j-th mixture component, and z ig = 0 otherwise.
By taking the logarithm of Equation (10), the complete data log-likelihood function, c (Φ), is expressed as It follows that, at the t-th iteration, the conditional expectation of Equation (11) on the observed data and the estimates from the (t − 1)-th iteration results in The idea behind the EM algorithm is to generate a sequence of the estimates from the maximum likelihood estimation starting from an initial solutionΦ (1) and iterating it with the following steps until convergence: E-step: The posterior probability that (x i , y i ) comes from the g-th mixture component is calculated at the t-th iteration of the EM algorithm as (12)

M-step:
The Q-function is maximized with respect to Φ, which is done separately for each term on the right hand side in Equation (9). As a result, the parameter estimatesτ g ,μ g ,Σ g , andγ g , are obtained on the (t + 1)-th iteration of the EM algorithm: , while the estimates of β are computed by maximizing each of the G terms Maximization of Equation (13) is performed by numerical optimization in the R language (R Core Team 2016) in a similar framework as the mixture of generalized linear models are implemented. For additional details about this implementation, the reader is referred to Wedel and De Sabro (1995) and Wedel (2002).

Model Selection Criterion
In mixture-based clustering, the model selection is often made based on the Akaike information criterion (Akaike 1974) and the Bayesian information criterion (Schwarz 1978) using the following formulas where l represents the value of the log-likelihood function, k represents the number of estimated parameters in the model, and n is the sample size. Calculation of AIC and BIC is completed for each selected number of mixture components, G. The best model is then selected based on the lowest value of AIC and BIC with the corresponding value of G. We compute both AIC and BIC for the models discussed in this section. However, we make the final decision based on BIC only since the previous literature suggested that BIC should be preferred over AIC in mixture modeling (for discussion, refer to Fraley and Raftery 2002).

Data
The French motor claims data set, on 413,169 motor third-party liability policies, was accessed from the CASdatasets ((Dutang and Charpentier 2016); Miljkovic (2017)) in R, for the purpose of our analysis. Both claim number and corresponding losses are available from the same portfolio of policyholders. Charpentier (2014) used the same data set to illustrate the modeling of frequency and severity of claims using various single component GLMs. In order to illustrate how OSM and CWM models are applied, we focus our analysis on Region 24 (R24), engine power f, and car brand category "Renault, Nissan, or Citroen" with the sample size of 1269 claims. The R24 has about 39% of the total French policies written, with engine power type f and "Renault, Nissan, or Citroen" being the most popular cars.
The variables of interest for our analysis are loss amount, driver age, car age, density, and exposure. When the CWM method is employed, loss amount, density, and exposure are treated as numerical continuous variables, while driver age and car age are modeled as categorical variables. The driver age and car age, coded as 99 (unknown), are excluded from the analysis. Table 1 provides the summary of all variables used in both models.
We categorized the numerical variables into ordinal variables with the aim of applying the OSM model over the same data set. The ordinal variables correspond to level of risk, where 1 is the lowest level of risk and 5 is the highest. We assume that this classification has already been determined by the underwriting practices and our goal is to test if there is a need for further classification due to unobserved heterogeneity in the data. The OSM method is sufficiently flexible that it can be adopted to the different variables and a different number of ordinal levels, depending on the data. In the following subsections, we present the analysis and the results based on the two modeling approaches presented in this paper.

OSM Results
An array of row (claim) clustering models (1) with different numbers of clusters G = 1:5 were fitted, and the information criteria measures AIC and BIC were computed. The results are summarized in Table 2. Both the AIC and the BIC indicate that the best model is the OSM version, including row (claim) clustering with G = 3 clusters with AIC = 23,584 and BIC = 23,685. Each row is allocated to the group to which the claim belongs with the highest posterior probability, where the mixing probabilities are τ = (0.60, 0.29, 0.11). Figure 1 displays the resultant G = 3 clustering structures and their profiles. The scatter plot (left) displays the average fitted scores over the 5 variables, using a weighted average which accounts for the fitted spacings. Appendix B explains the details of the calculation of these average scores.   Figure 1. Scatter plot depicting the clustering composition for R = 3 (left) claim clusters. Different color and shape symbols represent the clusters: Cluster 1 (square), Cluster 2 (circle), and Cluster 3 (triangle). The bar plot (right) displays the profile of the claims in each cluster. The percentage represents the probability θ gjk in each category (Equation (2)). Different color and shape points and color bars represent the resultant G = 3 claim clustering settings. Three groups are clearly distinguished in the scatter plot. This graph illustrates the conclusion from AIC/BIC that, among the row clustering models, the model with three claim groups is the best for our data. The bar plot (right) displays the profiles of the different clusters according to the estimated probability of the data response y ij being equal to the category k in Equation (2). We might conclude that the claims classified in the first group correspond to those with the lowest risk regarding the five categories, the ones in the second group have a more low-to-moderate risk, and the claims in the third group are those with more risk. An attractive feature of the stereotype regression model is that it allows one to test whether two adjacent categories are distinguishable. Since each ordinal response category k (k = 1, . . . , 5) is associated with a score parameter φ k , the spacing between adjacent φ k values shows us how similar or different the categories are in terms of the effect of claims (see Agresti (2010) (Section 4.3.5) and Fernández et al. (2016) (Section 1.2.2)). Table A1 in Appendix A shows the parameter estimates and their uncertainty for the model with G = 3 clusters. For this data set, the fitted score parameters were φ 1 = 1, φ 2 = 3.636, φ 3 = 4.855, φ 4 = 4.990, and φ 5 = 5. Therefore, the distance between ordinal categories "1" and "2" (2.636) is greater than that between other adjacent categories. However, there is almost no spacing between categories "4" and "5". This implies that the relative frequencies in these two categories are independent of the clustering structure. Therefore, retaining the distinction between "4" and "5" is not informative about the clustering structure. In that case, the model still holds with the same scores if the ordinal scale is collapsed by combining those two adjacent categories into one single response category. This spacing setting is illustrated in the spaced mosaic plot (Fernández et al. 2014)  proportional to the number of claims in each claim cluster; the width is proportional to the numbers of each ordinal response within each cluster. The area represents the frequency of each combination, also shown numerically in each block. The relative spacing between ordinal categories (e.g., 2.636 between 0 and 1, shown by the yellow, red, and green bars) has been determined by the data.
Common location measures in continuous response variables such as the mean or the median are not directly applicable in ordinal responses. However, the estimates of the scores parameters { φ k }, k = 1, . . . , q provide a continuous scale in [0, 1]. Re-scaling { φ k } conveniently as v 1 = 1, v q = q, and v k = 1 + (q − 1) ×φ k gives us a continuous scale in the range [1, q], which we can use as a new numerical value in our data set, replacing the original ordinal response values {y ij } with the new adjusted spacing {v k }. For instance, if initially y ij = k, the replacement would be y ij = v k . We can then calculate average over this new data {y ij } to determine the different profiles of each cluster from our original heterogeneous data. Table 3 shows the mean in the data set. We can see that the cluster G = 1 is composed by the claims with the largest losses, which corresponds to the youngest drivers, oldest cars, and largest density. The clusters G = 2 and G = 3 have very similar values apart from the car age, which makes them different.

CWM Results
Tools for generalized linear CWM are available in the statistical R package flexCWM, developed by Mazza et al. (2017). The log-normal CWM was fitted to the following covariates: driver age, car age, density, and exposure. The model selection procedure based on the AIC and the BIC found three mixture components with their corresponding mixing probabilities as follows: 0.52, 0.43, and 0.05. Table 4 shows the summary results for log-likelihood, AIC, and BIC. The CWM function selects the best model based on the minimum value of BIC. In our analysis, the best model is detected when G = 3 and these results are shown in bold in Table 4. The number of selected components is consistent with the OSM approach. This analysis is done for Region 24, engine power f, and car brand category "Renault, Nissan, or Citroen" with the sample size of 1269 claims. Within these settings, we can observe that the effect of clustering is present in the portfolio and three subpopulation of drivers are found to have similar characteristics. As a result 3-component GLM seems to provide better fit than a single component GLM based on BIC criteria.
The left display of Figure 3 shows the distribution of losses by group and driver age while the right display shows the distribution of losses by group and car age. It is apparent that the highest losses are generated by the youngest drivers (age group 18-22), classified in Group 1. Additionally, the second highest losses are reported by drivers aged 43-74 in Group 1. Old cars, 10-15 years old, are associated with the largest losses in Group 1. A combination of young drivers (less than 23 years old) and old cars (10-15 years) are indicators for the largest losses in this portfolio. A classification of drivers in R24 can be done based on a level of risk to which they are exposed. Table A2 in Appendix A shows the results for the estimated coefficients in each component of the CWM model. Driver age and car age are coded as factors with Level 1 being omitted. Thus, the coefficients shown in the output for driver's age by category are estimated relative to the youngest drivers (less than 23 years old). We can observe that the significance of the coefficients in each cluster vary greatly across these clusters. The significance is coded as ≈ 0 (***), 0.001 (**), 0.01 (*), and 0.05 (.), corresponding to the p-value of the specific coefficient. Coefficients in Cluster 3 are all significant, while most of the coefficients associated with driver age and car age are not significant in the first two clusters. Coefficients associated with driver age in Cluster 3 are all positive and increase with age showing that losses increase with age. The 75+ age group generate on average more losses compared to the <23 age group. In Clusters 1 & 2, we observe the negative coefficients for drivers age, indicating a reverse trend in losses for these two groups. Here, we can see that, in these groups, younger drivers generate the highest losses, and, as age increases, losses tend to decrease on average, holding other variables constant. Similarly, the coefficients for car's age by category are estimated relative to the newest cars (less than 1 year old). These coefficients are all significant in Cluster 3 and their values are positive. We observe that older cars generate more losses on average, compared to newer cars in Cluster 3. In Cluster 1, the car age coefficients are mostly negative while in Cluster 2 they are positive. The sign and magnitude of the coefficients change depending on the relative center of each cluster. Density variable is highly significant in all three clusters. Exposure variable is significant in Cluster 1 only. Here, the finite mixture modeling allows us to model a natural representation of heterogeneity in three latent groups of policyholders in this insurance portfolio. Based on the BIC, the model fit is improved by using three component GLM rather than one component. The interpretation of the coefficients can be done for each cluster separately to account for the "observed" heterogeneity within each subpopulation of the corresponding cluster.

Conclusions
This article reviews two recently developed mixture-based clustering approaches applied to an insurance portfolio in order to find the optimal number of clusters and test the assumption of heterogeneity. While these approaches have been used in different fields, to the best of our knowledge, they have so far not been applied in the actuarial field for testing the unobserved heterogeneity in the insurance portfolio. These two methods allow for flexibility in terms of data collection and categorization and allow for further assessment of the underwriting risk. If all the collected data are organized based on the ordinal levels corresponding to several risk classifications, then a mixture-based clustering method for OSM would be a suitable tool to further test heterogeneity in the data. For the insurance data sets that are currently analyzed using existing GLMs, the CWM approach would allow for finding multiple GLM components when modeling frequency and severity of claims. The following family of distributions are supported by CWM with their link functions: binomial, poisson, t-distribution, log-normal, gamma, and inverse gaussian. If the multi-component CWM model provides a better fit to the data then a single component GLM, we recommend that this clustering effect is considered in practice. The OSM and CWM approaches can be used by practitioners and academics in the actuarial area. We have shown that they have complementary properties.
In a mixture-based clustering method for OSM, when the data is organized based on ordinal levels, the effect of a few extreme losses needs to be tempered down. Allocating these extreme losses in an ordinal, "high-risk" category makes them less influential in the model fitting, giving broad categories which enable us to detect major overall patterns. This enables the inclusion of all of the different levels of dispersion across ordinal categories. Additionally, assigning scores to ordinal categories provides an easy way of showing the descriptive statistics. If practitioners have knowledge about the score for each of the ordered categories, assigning scores might be the best way to analyze the data, because ordinary linear models can be applied. However, if practitioners do not have any predetermined idea about the spacing between adjacent categories, the use of an ordinal stereotype model is convenient, as the data dictates non-equally spaced scores among ordinal outcomes. The estimation of the spacing among ordinal responses is an improvement over other ordinal data models, such as the proportional odds model and continuation-ratio model, although more research in performance comparison with other equivalent methods is needed.
For illustration of these methods, we analyzed a small data set of French motor insurance claims in Region 24 using the methods discussed in this paper. We found three clusters in the data using both the CWM and OSM approaches and therefore show the presence of unobserved heterogeneity. Due to the nature of an insurance portfolio, unobserved heterogeneity is most likely to exist in almost all insurance portfolios. Thus, current modeling techniques should be extended to account for mixture-based clustering that will allow practitioners to detect additional sub-populations of the policy holders based on the same set of characteristics within an existing portfolio, and make the appropriate pricing and risk management decisions for each group. These two methods can also assist practitioners with the underwriting selection process as they have good complementary properties.
Finally, the future research may focus on comparing clustering structures resulting from different methodologies. Over the same data set, one can apply three measures in common that are used to compare clusterings: the Adjusted Rand Index (ARI), the Variation of Information (VI), and the Normalized Information Distance (NID) proposed by Hubert and Arabie (1985), Meila (2005), and Kraskov et al. (2005), respectively. Table A1 shows the parameter estimates and their uncertainty for the model with G = 3 clusters as a result of fitting the OSM (1) for the French motor claims by policy data set. φ k P[y ij = k | i ∈ r], i = 1, . . . , n, j = 1, . . . , m, g = 1, . . . , G.

Appendix A. Model Fitting
From here, we can calculate the mean response level of claim i to variable j, conditional on its (fuzzy) allocation to the row clusters: z ig φ gj , i = 1, . . . , n, j = 1, . . . , m.
This is a numerical measure of the typical response to variable j for claims of row cluster g, appropriately adjusting for the uneven spacing of the levels of the ordinal response. Finally, we determine the mean of the previous weighted averages over the m columns in order to obtain the average fitted scores of claim i across all of the variables φ (i.) = 1 m m ∑ j=1 φ (ij) , i = 1, . . . , n.
Author Contributions: Both authors equality contributed in developing this manuscript.