INSURANCE ANALYTICS WITH CLUSTERING TECHNIQUES

The k-means algorithm and its variants are popular clustering techniques. Their purpose is to uncover group structures in a dataset. In actuarial applications, these partitioning methods detect clusters of policies with similar features and allow one to draw up a map of dominant risks. The main challenge lies in defining a distance between two observations exclusively characterised by categorical variables. This research paper starts with a review of the k-means algorithm and develops an extension based on Burt's framework to manage categorical rating factors. We then focus on a mini-batch version that keeps computation time under control when analysing a large-scale dataset. We next broaden the scope of application of the fuzzy k-means to fully categorised datasets. Lastly, we conclude with a thorough introduction to spectral clustering and work around the dimensionality issue by reducing the size of the initial dataset with k-means.


Introduction
Cluster analysis is a popular technique in statistical data analysis and machine learning, helping to uncover group structures in data. Objects are grouped so that the resulting clusters are as heterogeneous as possible between each other, while being as homogeneous as possible with regard to the observations classified within them.
In actuarial applications, clustering methods can detect dominant groups of policies, and the analysis of their claims allows one to draw a posteriori a map of insured risks. The resulting clusters can be further used to construct unsupervised pricing grids. The starting point of our work is the k-means algorithm, one of the most popular unsupervised learning algorithms (see, e.g., MacQueen (1967), Hastie et al. (2009) and Kaufman and Rousseeuw (2009)). Known as a centroid-based partitioning method, it segregates observations into a pre-specified number of clusters by optimising a measure of similarity. There exist several extensions of the k-means algorithm, including a mini-batch version to handle large-scale datasets. For a thorough overview, we refer to Jain (2010).
Despite their popularity in other fields such as image processing, clustering techniques are still underexploited in actuarial science, and the literature is scarce. Nevertheless, Williams and Huang (1997) may be referenced for their use of the k-means to identify policyholders with a high claims ratio in a portfolio of motor vehicle insurance. Hainaut (2019) further compared k-means to self-organizing maps (SOM) to discriminate motorcycle insurance policies based on the analysis of joint frequencies of categorical variables. We can also mention Hsu et al. (1999), who present SOM in a framework through which they perform change of representation in knowledge discovery problems using insurance policy data. Clustering techniques are not, however, limited to the aforementioned centroid-based methods. In 1999, Weiss had already stressed the attractiveness of methods based on the eigenvectors of the affinity matrix in the context of segmentation. The stability of the simple eigendecomposition algorithms is often put forward in favour of spectral decomposition. Interested readers may refer for details to Shi and Malik (2000), Ng, Jordan and Weiss (2001) and Belkin and Niyogi (2001). Von Luxburg (2007) also sets up a concise step-by-step tutorial on the graph theory used by spectral clustering. Despite the overwhelming amount of literature on spectral clustering, no application to an insurance portfolio has hitherto been carried out.

The main reason behind the under-exploration of clustering techniques in actuarial science is that partitioning methods, in their standard form, only manage quantitative variables. In such cases, algorithms rely on the Euclidean distance between data points to measure their similarity. Some extensions aiming to take qualitative variables into account already exist. For instance, Zhexue Huang (1998) extends the k-means algorithm to handle datasets with observations characterised both by quantitative and categorical variables. To this end, Zhexue Huang (1998) measures the distance between two observations by relying on a mixture of the Euclidean distance between quantitative features and Hamming's distance between (binary coded) categorical features. Similarly, Mbuga and Tortora (2021) recently extended spectral clustering to heterogeneous datasets, with both continuous and categorical features, by defining a global dissimilarity measure computed as a weighted sum. In their paper on feature transformation, Wei, Chow and Chan (2015) outline the challenges of developing a dissimilarity measure as a linear combination of the Euclidean and Hamming distances to handle mixed-type datasets. Xiong et al. (2012) further proposed a novel divisive hierarchical clustering algorithm for categorical objects. Starting with an initialization based on multiple correspondence analysis, they exploit the chi-square distance between a single object and a set of objects to define the objective function of their clustering algorithm.

The contributions of this research paper are multiple. The main premise of this work is the data processing through Burt's space, which allows an investigation into different methods of data partitioning for actuarial applications. In Section 2, we introduce the centroid-based k-means algorithm and review its k-means++ heuristic. We next extend it to fully categorised datasets, based on Burt's distance. Section 3 reviews a mini-batch version to perform clustering on large-scale datasets with a limited loss of accuracy. In Section 4, we broaden the scope of application of the fuzzy k-means by means of our Burt framework. We further test its efficiency as a method in which policies can belong to multiple clusters.
We end this work in Section 5 with a thorough review of spectral clustering, as an unprecedented means to identify non-convex clusters in an insurance portfolio. The need for the eigendecomposition makes spectral clustering more computationally expensive than standard k-means for large-scale datasets. Reducing the size of the initial dataset with the k-means algorithm and applying spectral clustering to the resulting centroids shows itself to be an effective solution to work around this dimensionality issue.

The k-means algorithm and Burt's distance

Let us consider a set of n numeric objects X = {x_1, ..., x_n}, x_i ∈ R^p, and an integer K ≤ n. The k-means algorithm searches for a partition of the dataset X into a pre-specified number K of disjoint clusters such that the within-groups sum of squared errors (WGSS), or intraclass inertia defined in equation (2), is minimised. The k-means clustering method is based on the concept of a centroid, which may be interpreted as the centre of gravity of a cluster of objects. As it is defined as the geometric centre of the corresponding data cluster, a centroid is not necessarily one of the data points. The coordinates of the u-th centroid are contained in a vector c_u = (c_u^1, ..., c_u^p) for u = 1, ..., K. For a given distance d(.,.) and a set of K centroids, we define the clusters or groups of data S_u for u = 1, ..., K as follows:

$$S_u = \left\{ x_i \in X : d(x_i, c_u) \le d(x_i, c_v), \; v = 1, \ldots, K \right\}. \qquad (1)$$

Here, the dissimilarity measure is the squared Euclidean distance, $d(x_i, c_u) = \sum_{j=1}^{p} (x_i^j - c_u^j)^2$. The centre of gravity of S_u is a p-vector g_u = (g_u^1, ..., g_u^p) such that

$$g_u^j = \frac{1}{|S_u|} \sum_{x_i \in S_u} x_i^j, \quad j = 1, \ldots, p.$$

The centre of gravity of the full dataset is denoted by $g = \frac{1}{n} \sum_{i=1}^{n} x_i$. We define the inertia I_u of a cluster S_u as the sum of the distances of all the points within the cluster from the centroid of that cluster:

$$I_u = \sum_{x_i \in S_u} d(x_i, g_u).$$

The intraclass inertia I_a is the sum of the clusters' inertia,

$$I_a = \sum_{u=1}^{K} I_u, \qquad (2)$$

whereas the interclass inertia I_c is the inertia of the cloud of centres of gravity, each centre being weighted by the size of its cluster:

$$I_c = \sum_{u=1}^{K} |S_u| \, d(g_u, g).$$

According to the König-Huygens theorem, the global inertia is the sum of the intraclass and interclass inertia:

$$I = \sum_{i=1}^{n} d(x_i, g) = I_a + I_c.$$

In order to have clusters that are homogeneous on average (compact clusters), a usual classification criterion consists in identifying a partition of X that minimises the intraclass inertia I_a. Since the global inertia is fixed, this amounts to finding the partition that maximises the interclass inertia I_c, which ensures the heterogeneity between clusters. Optimising such objective functions is known to be computationally difficult (NP-hard). However, there exist efficient heuristic procedures that quickly converge to a local optimum. The most common method uses an iterative refinement technique called k-means, which is detailed in Algorithm 2.
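As a quick numerical illustration of this decomposition, the following Python sketch (a toy example of ours, assuming only NumPy) checks that the global inertia splits into intraclass and interclass parts for an arbitrary partition:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))           # 500 observations in R^2
labels = rng.integers(0, 3, size=500)   # an arbitrary partition into K = 3 clusters

g = X.mean(axis=0)                      # centre of gravity of the full dataset
I_global = ((X - g) ** 2).sum()         # global inertia
I_a = sum(((X[labels == u] - X[labels == u].mean(axis=0)) ** 2).sum()
          for u in range(3))            # intraclass inertia
I_c = sum((labels == u).sum() * ((X[labels == u].mean(axis=0) - g) ** 2).sum()
          for u in range(3))            # interclass inertia

print(np.isclose(I_global, I_a + I_c))  # True: Koenig-Huygens decomposition

Any relabelling of the points changes the split between I_a and I_c, but never their sum.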
Given an initial set of K random centroids c_1(0), ..., c_K(0), we partition the dataset into {S_1(0), ..., S_K(0)} according to the rule in equation (1). As an alternative to this random centroids initialization, the k-means++ algorithm of Vassilvitskii and Arthur (2006) hinges on a heuristic to find the centroid seeds c_1(0), ..., c_K(0) and construct the initial partition of the dataset. This alternative procedure to initialise the k-means heuristic is detailed in Algorithm 1. When compared to the random centroids initialization, it improves the running time of the algorithm and the quality of the final solution. Keep in mind that the k-means clustering is based on a heuristic. The resulting partition is therefore not guaranteed to be a global solution, and changes in the seed value may affect the outcomes.
Regardless of the procedure chosen for the initial centroids, the resulting partition of the dataset is a set of convex polyhedra delimited by median hyperplanes between centroids. Once the K centroids are initialised, we proceed by alternating between two steps. The assignment step of the e-th iteration associates each observation x_i with the cluster S_u(e) whose centroid c_u(e) is the nearest. In the update step, the K centroids (c_u(e+1))_{u=1:K} are replaced by the K new cluster means (g_u(e))_{u=1:K}. At each iteration, we can prove that the intraclass inertia is reduced. We iterate until convergence, which is reached when the assignments no longer change.
Algorithm 1 Initialization of centroids for the k-means algorithm (k-means++).
Initialization: Select an observation uniformly at random from the data set X. The chosen observation is the first centroid, denoted c_1(0).
Main procedure:
For j = 2 to K
   For i = 1 to n
      1) Compute the distance d(x_i, c(x_i)) between the observation x_i and c(x_i), the closest centroid already chosen.
   End loop on dataset, i.
   2) Select the next centroid c_j(0) at random from X with probability
      $$P(x_i) = \frac{d(x_i, c(x_i))}{\sum_{l=1}^{n} d(x_l, c(x_l))}.$$
End loop on K.

Algorithm 2 Algorithm for k-means clustering.
Initialization: Randomly set up initial positions of centroids c_1(0), ..., c_K(0).

Main procedure:
For e = 0 to maximum epoch, e_max

   Assignment step:
   For i = 1 to n
      1) Assign the observation x_i to the cluster S_u(e) whose centroid c_u(e) is the nearest.
   End loop on data set, i.

   Update step:
   For u = 1 to K
      2) Calculate the new centroid c_u(e+1) of S_u(e) as follows:
         $$c_u(e+1) = \frac{1}{|S_u(e)|} \sum_{x_i \in S_u(e)} x_i.$$
   End loop on centroids, u.
   3) Calculate the total distance d_total between observations and closest centroids:
      $$d_{total} = \sum_{i=1}^{n} \min_{u=1,\ldots,K} d(x_i, c_u(e+1)).$$
   If the assignments no longer change, stop.
End loop on epochs e.
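For concreteness, here is a minimal NumPy sketch of Algorithms 1 and 2; the function and variable names are ours and the code is an illustrative re-implementation rather than the paper's own:

import numpy as np

def kmeans_pp_init(X, K, rng):
    """k-means++ seeding (Algorithm 1): pick seeds with probability
    proportional to the squared distance to the closest centroid chosen so far."""
    centroids = [X[rng.integers(len(X))]]
    for _ in range(1, K):
        C = np.array(centroids)
        d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(-1).min(axis=1)
        centroids.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centroids)

def kmeans(X, K, e_max=100, seed=0):
    """Plain k-means (Algorithm 2) with k-means++ initialization."""
    rng = np.random.default_rng(seed)
    c = kmeans_pp_init(X, K, rng)
    labels = np.full(len(X), -1)
    for e in range(e_max):
        # Assignment step: attach each observation to its nearest centroid.
        new_labels = ((X[:, None, :] - c[None, :, :]) ** 2).sum(-1).argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break                                   # assignments no longer change
        labels = new_labels
        # Update step: each centroid becomes the mean of its cluster.
        for u in range(K):
            if np.any(labels == u):
                c[u] = X[labels == u].mean(axis=0)
    return c, labels

On well-separated data, the k-means++ seeding typically reduces both the number of epochs and the risk of converging to a poor local optimum.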
To illustrate this section, we apply the k-means algorithm to data from the Swedish insurance company Wasa in 1999. The data set is available on the companion website of the book by Ohlsson and Johansson (2010) and contains information about motorcycle insurance over the period 1994-1998. Each policy is described by quantitative and categorical variables. The quantitative variables are the insured's age and the age of his vehicle. The categorical variables are the policyholder's gender, the geographic zone, and the vehicle's class. The vehicle class is determined by the EV ratio, defined as (power in kW × 100) / (vehicle weight in kg + 75), rounded to the nearest integer. The database also reports for each policy the number of claims, the total claim costs, and the contract period. After removing contracts with null duration, the database counts n = 62 436 insurance policies. Table 1 summarises the information provided by the categorical variables.

The p numerical variables are stored in vectors x_i ∈ R^p for i = 1, ..., n, and the q categorical variables have m_k binary modalities for k = 1, ..., q. The total number of modalities is the sum of the m_k: $m = \sum_{k=1}^{q} m_k$. In further developments, we enumerate the modalities from 1 to m. The information about the portfolio may be summarised by an n × m super-indicator matrix $D = (d_{i,j})_{i=1,\ldots,n,\; j=1,\ldots,m}$: if the i-th policy has the j-th modality then d_{i,j} = 1, otherwise d_{i,j} = 0.
For example, assume policies are exclusively described by the policyholder's gender (M = male or F = female) and by his education level (H = high school, C = college or U = university). The numbers of variables and modalities are q = 2, m_1 = 2 and m_2 = 3 respectively. If the first and second policyholders are respectively a man with an undergraduate degree and a woman with a graduate degree, the first two lines of the matrix D have the structure of the disjunctive Table 2.
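To make the construction of D concrete, the following pandas sketch (with illustrative column and category names of our own) builds the super-indicator matrix for this example:

import pandas as pd

# Two example policies: a man with a university degree, a woman with a college degree.
policies = pd.DataFrame({
    "gender": pd.Categorical(["M", "F"], categories=["M", "F"]),
    "education": pd.Categorical(["U", "C"], categories=["H", "C", "U"]),
})

# One-hot (disjunctive) encoding: one binary column per modality, m = m_1 + m_2 = 5.
D = pd.get_dummies(policies, dtype=int)
print(D)
#    gender_M  gender_F  education_H  education_C  education_U
# 0         1         0            0            0            1
# 1         0         1            0            1            0

Declaring the variables as pd.Categorical with an explicit category list guarantees that unobserved modalities (here, the high-school level) still receive a column.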
In this paper, we focus on features encoded as categorical variables. In such a case, the Euclidean distance is no longer appropriate. When dealing with one-hot encoded categorical features, Hamming's distance is usually the first substitute that comes to mind. It measures the dissimilarity between features as the number of differing bit positions in two bit strings (here, the u-th centroid and the i-th policy bit strings). The disappointing results achieved with Hamming's distance lead us to another measure, based on the weighted Burt matrix.

Burt's distance
Hamming's distance is a simple measure of discordance between observations. It therefore fails to discriminate observations that are far apart from those that are close to each other. This issue is addressed by Burt's distance, which is based on the study of joint frequencies of modalities. In order to study the dependence between modalities, the numbers n_{i,j} of policies sharing modalities i and j, for i, j = 1, ..., m, are collected in a contingency table known as the m × m Burt matrix B = (n_{i,j})_{i,j=1,...,m}. The Burt matrix is directly related to the disjunctive table D through the following relationship:

$$B = D^\top D.$$

This symmetric matrix is composed of q × q blocks B^{k,k'} for k, k' = 1, ..., q. A block B^{k,k'} is the contingency table that crosses the variables k and k'. Table 3 displays the Burt matrix for the matrix D presented in Table 2. By construction, the sum of the elements of a block B^{k,k'} is equal to the total number of policies, n. The sum of the n_{i,j} over row i is equal to

$$n_{i,.} = \sum_{j=1}^{m} n_{i,j} = q \, n_{i,i}.$$

Because the Burt matrix is symmetric, we infer that $n_{.,j} = \sum_{i=1}^{m} n_{i,j} = q \, n_{j,j}$.
Furthermore, the blocks B^{k,k} for k = 1, ..., q are diagonal matrices, whose diagonal entries are the numbers of policies that respectively present the modalities 1, ..., m_k of the k-th variable. In our example, we have n_{1,1} + n_{2,2} = n and n_{3,3} + n_{4,4} + n_{5,5} = n. Here, n_{1,1} and n_{2,2} count the total numbers of men and women in the portfolio, whereas n_{3,3}, n_{4,4} and n_{5,5} count the numbers of policyholders with, respectively, a high school, college, or university degree.
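Continuing the toy example, the Burt matrix and its block and row-sum properties can be verified numerically (NumPy sketch, our notation):

import numpy as np

# Disjunctive table D (columns: M, F, H, C, U) for three example policies.
D = np.array([
    [1, 0, 0, 0, 1],   # man with a university degree
    [0, 1, 0, 1, 0],   # woman with a college degree
    [0, 1, 1, 0, 0],   # woman with a high-school degree
])

B = D.T @ D            # Burt matrix: joint frequencies of the m = 5 modalities
q = 2                  # number of categorical variables

# Row sums equal q times the diagonal: n_{i,.} = q * n_{i,i}.
print(np.array_equal(B.sum(axis=1), q * np.diag(B)))   # True

# The gender-by-gender block is diagonal: each policy has exactly one gender modality.
print(B[:2, :2])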

Table 3: Burt matrix for the disjunctive Table 2.
In the same manner as Hainaut (2019), we define the chi-square distance between rows i and i' of the Burt matrix as follows:

$$d_{\chi}^2(i, i') = \sum_{j=1}^{m} \frac{n}{n_{.,j}} \left( \frac{n_{i,j}}{n_{i,.}} - \frac{n_{i',j}}{n_{i',.}} \right)^2, \quad i, i' \in \{1, \ldots, m\}. \qquad (3)$$

Intuitively, the distance between two modalities is measured by the sum of weighted gaps between joint frequencies with respect to all modalities. Similarly, the chi-square distance between columns j and j' of the Burt matrix is defined by

$$d_{\chi}^2(j, j') = \sum_{i=1}^{m} \frac{n}{n_{i,.}} \left( \frac{n_{i,j}}{n_{.,j}} - \frac{n_{i,j'}}{n_{.,j'}} \right)^2, \quad j, j' \in \{1, \ldots, m\}.$$
We recall that k-means is a centroid-based clustering algorithm. Bregman divergences are the only distortion functions for which such centroid-based clustering schemes are possible. The simplest such divergence is the squared Euclidean distance. As we prefer to evaluate distances with the Euclidean distance, the elements n_{i,j} of the Burt matrix are replaced by weighted values n^W_{i,j}:

$$n^W_{i,j} = \sqrt{\frac{n}{n_{.,j}}} \; \frac{n_{i,j}}{n_{i,.}}. \qquad (4)$$

Given that n_{i,.} = q n_{i,i} and n_{.,j} = q n_{j,j}, the weights only involve the diagonal entries of B:

$$n^W_{i,j} = \sqrt{\frac{n}{q \, n_{j,j}}} \; \frac{n_{i,j}}{q \, n_{i,i}},$$

and the weighted Burt matrix is denoted by B^W = (n^W_{i,j})_{i,j=1,...,m}. The distances between rows (i, i') of the Burt matrix (and, by the symmetry of B, between its columns) then become plain Euclidean distances between the rows of B^W:

$$d^2(i, i') = \sum_{j=1}^{m} \left( n^W_{i,j} - n^W_{i',j} \right)^2,$$

which coincides with the chi-square distance (3). The j-th modality then corresponds to the j-th line of B^W, a vector in R^m. The i-th policy, which presents q modalities, is identified with the centre of gravity D_{i,.} B^W / q of the points whose coordinates are stored in the corresponding lines of the weighted Burt matrix. Since a policy may fall into a single one of the m_k categories of the k-th variable, a policy characterised by two features is also defined by a subset of q = 2 modalities. In the illustration with three modalities, each policy is represented in R^3 as the midpoint between the corresponding lines of B^W. The Burt distance between the i-th and j-th policies is then given by:

$$d_B(i, j) = \left\| \frac{D_{i,.} B^W}{q} - \frac{D_{j,.} B^W}{q} \right\|_2. \qquad (5)$$

This is illustrated in Figure 1 in the case of three modalities.
Figure 1: Illustration of two policies in Burt's space with three modalities.
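The weighted Burt matrix and the policy coordinates can be assembled as follows (Python sketch; it assumes the weighting of equation (4) as reconstructed above, and all names are ours):

import numpy as np

def weighted_burt(D, q):
    """Weighted Burt matrix B^W of equation (4), computed from the disjunctive table D."""
    n = D.shape[0]
    B = D.T @ D
    row_sums = B.sum(axis=1)                    # n_{i,.} = q * n_{i,i}
    col_sums = B.sum(axis=0)                    # n_{.,j} = q * n_{j,j}
    return np.sqrt(n / col_sums)[None, :] * (B / row_sums[:, None])

def policy_coordinates(D, BW, q):
    """Centre-of-gravity representation of each policy: x_i = D_{i,.} B^W / q."""
    return D @ BW / q

# Toy disjunctive table with three policies and five modalities, as above.
D = np.array([[1, 0, 0, 0, 1],
              [0, 1, 0, 1, 0],
              [0, 1, 1, 0, 0]])
BW = weighted_burt(D, q=2)
X = policy_coordinates(D, BW, q=2)
print(np.linalg.norm(X[0] - X[1]))   # Burt's distance (5) between policies 1 and 2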
Table 4 reports the claim frequency as well as the dominant modalities associated with each cluster. The dominant modality $j^{dominant}_k$ is identified either as the most represented modality in the u-th cluster or as the one that minimises the ℓ2-norm distance between the cluster's centroid c_u and the line of B^W corresponding to the j-th modality of the k-th variable. A quick analysis reveals the most and least risky driver profiles. The riskiest category is young male drivers of vehicles of less than 4 years. The lowest claim frequency is characterised by older vehicles (12+). The under-representation of female drivers in the portfolio is also reflected in the dominant gender (mostly men) associated with each cluster.
Goodness of fit

If we use the claim frequency of each cluster as a predictor λ_i, we can estimate the goodness of fit of the partition with the Poisson deviance. The deviance is the difference between the log-likelihood of the saturated model and that of the partitioned model. If N_i and ν_i are respectively the number of claims and the duration (exposure) of the i-th contract, the deviance is defined as:

$$\mathcal{D} = 2 \sum_{i=1}^{n} \left[ N_i \ln\left( \frac{N_i}{\lambda_i \nu_i} \right) - \left( N_i - \lambda_i \nu_i \right) \right],$$

with the convention that the logarithmic term vanishes when N_i = 0. The deviance, AIC, and BIC are reported in Table 5. Both the AIC and BIC are computed with a number of degrees of freedom equal to 18 × 27 (number of clusters × number of modalities). In comparison, the deviance of a GLM model fitted to the same dataset is equal to 5732.
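A possible implementation of this deviance (NumPy sketch; array names are ours), with each policy assigned its cluster's empirical claim frequency as predictor:

import numpy as np

def poisson_deviance(N, nu, lam):
    """Poisson deviance with exposure: N = claim counts, nu = durations,
    lam = predicted claim frequencies."""
    mu = lam * nu
    # Convention: N * log(N / mu) = 0 when N = 0.
    log_term = np.where(N > 0, N * np.log(np.where(N > 0, N, 1.0) / mu), 0.0)
    return 2.0 * np.sum(log_term - (N - mu))

def cluster_frequencies(N, nu, labels, K):
    """Cluster-level predictor: total claims over total exposure within each cluster."""
    lam_u = np.array([N[labels == u].sum() / nu[labels == u].sum() for u in range(K)])
    return lam_u[labels]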

Mini-batch k-means
The k-means algorithm stores the entire dataset in main memory. When dealing with large amounts of data, this characteristic can turn into a major computation time constraint.
The mini-batch k-means algorithm is a popular alternative that reduces the temporal and spatial costs. The underlying idea hinges on saving memory space by using small random batches of data of a fixed size. At each iteration, a new random sample from the dataset is used to update the clusters, with previous coordinates deprecated according to a learning speed. This step is repeated until convergence. The details of this approach for categorical variables combined with Burt's distance are set forth in Algorithm 3.
Algorithm 3 Mini-batch k-means clustering with Burt's distance.
Initialization: Randomly set up initial positions of centroids c_1(0), ..., c_K(0).
Main procedure:
For e = 0 to maximum epoch, e_max
   Draw a random batch of b policies from the dataset.
   Assignment step:
   For i = 1 to b
      1) Assign the i-th policy of the batch to the cluster S_u^new(e) whose centroid is the nearest for Burt's distance (5).
   End loop on batch dataset, i.
   Update step:
   For u = 1 to K
      2) Calculate the centroid of the batch policies assigned to S_u^new(e) as the average of their coordinates D_{i,.} B^W / q, and update c_u(e+1) as a weighted average of c_u(e) and this batch centroid, according to the learning speed.
   End loop on centroids, u.
   3) Calculation of the total distance d_total between observations and closest centroids.
End loop on epochs e.

Empirical results in the literature suggest a substantial saving in computational time at the expense of partition quality. Figure 4 compares the standard k-means algorithm with the mini-batch version, in terms of partition quality and computational time, with respect to the number of clusters.
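The sketch below illustrates the per-centroid learning-rate update that characterises mini-batch k-means; it is a generic NumPy version of the idea, not the paper's exact Algorithm 3, which works in Burt's space:

import numpy as np

def minibatch_kmeans(X, K, batch_size=5000, e_max=100, seed=0):
    """Mini-batch k-means: each centroid is updated with a decreasing
    learning rate eta = 1 / (number of points it has absorbed so far)."""
    rng = np.random.default_rng(seed)
    c = X[rng.choice(len(X), size=K, replace=False)].astype(float)
    counts = np.zeros(K)
    for _ in range(e_max):
        batch = X[rng.choice(len(X), size=batch_size, replace=False)]
        labels = ((batch[:, None, :] - c[None, :, :]) ** 2).sum(-1).argmin(axis=1)
        for x, u in zip(batch, labels):
            counts[u] += 1
            eta = 1.0 / counts[u]       # learning speed deprecating previous coordinates
            c[u] = (1.0 - eta) * c[u] + eta * x
    return c

scikit-learn's MiniBatchKMeans implements a refined version of this update and can serve as a drop-in alternative for purely numeric data.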
We run the mini-batch algorithm on the Wasa dataset with Burt's distance (5) and batches of 5000 policies. Table 7 provides the results for 18 centroids. The mini-batch k-means is faster and gives approximately the same results. We retrieve all the categories discovered in Sub-section 2.1. The cluster with the second highest frequency is also characterised by one of the highest power ratios (class 6). The deviance, reported in Table 6, is slightly higher than that of a segmentation based on the k-means algorithm. Keep in mind that the small loss in cluster quality may be the product of a greater presence of randomness in the mini-batch algorithm.
In our case study, the gain in computation time with respect to k-means is limited (both algorithms run in a few seconds). To observe significant gains in computation time, the mini-batch k-means should be tested on a larger dataset than the Wasa one.

Fuzzy k-means
The k-means algorithm, also known as `hard clustering', attempts to partition the data set X into K distinct clusters with respect to a cost function. Each data point is assigned to the cluster whose centre is the nearest, and may only belong to one cluster. Fuzzy k-means, also referred to as `soft clustering', distinguishes itself by allowing data points to potentially belong to more than one cluster S_u for u = 1, ..., K. Given a finite set of data, the algorithm returns a list of K cluster centres {c_1, ..., c_K} and a partition matrix W of memberships w_{i,j} for i = 1, ..., n and j = 1, ..., K. The w_{i,j} gives the degree to which the i-th policy belongs to cluster S_j. Much like the k-means algorithm, the fuzzy k-means algorithm aims to minimise the intra-cluster variance:

$$J = \sum_{i=1}^{n} \sum_{j=1}^{K} w_{i,j}^{m} \, d(x_i, c_j),$$

where the memberships are given by

$$w_{i,j} = \frac{1}{\sum_{l=1}^{K} \left( \frac{d(x_i, c_j)}{d(x_i, c_l)} \right)^{\frac{1}{m-1}}}.$$

The hyper-parameter m ∈ R+ with m ≥ 1 is called the fuzzifier. It determines the level of cluster fuzziness. A large m results in smaller membership grades w_{i,j}, and hence fuzzier clusters. In the limit m = 1, the memberships w_{i,j} converge to 0 or 1, which amounts to crisp partitioning.
The arbitrary centroid initialization and the pre-specified number of clusters both retain their significant influence over the final partition, which is not guaranteed to be a global optimum. The fuzzy k-means is detailed in Algorithm 4 for categorical variables combined with Burt's distance.

Algorithm 4 Fuzzy k-means clustering.
Initialization: Randomly set up initial positions of centroids c_1(0), ..., c_K(0).
Main procedure:
For e = 0 to maximum epoch, e_max

   Assignment step:
   For i = 1 to n
      1) Calculate the membership of the i-th policy in cluster S_j(e), j ∈ {1, ..., K}:
         $$w_{i,j} = \frac{1}{\sum_{l=1}^{K} \left( \frac{d(x_i, c_j(e))}{d(x_i, c_l(e))} \right)^{\frac{1}{m-1}}}.$$
   End loop on data set, i.

   Update step:
   For j = 1 to K
      2) Calculate the new centroids as the membership-weighted means:
         $$c_j(e+1) = \frac{\sum_{i=1}^{n} w_{i,j}^{m} \, x_i}{\sum_{i=1}^{n} w_{i,j}^{m}}.$$
   End loop on centroids, j.
End loop on epochs e.
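A compact NumPy sketch of the fuzzy k-means iteration (names are ours; the small eps guards against zero distances):

import numpy as np

def fuzzy_kmeans(X, K, m=1.10, e_max=100, seed=0, eps=1e-12):
    """Fuzzy k-means: alternate the membership and centroid updates of Algorithm 4."""
    rng = np.random.default_rng(seed)
    c = X[rng.choice(len(X), size=K, replace=False)].astype(float)
    W = None
    for _ in range(e_max):
        d = ((X[:, None, :] - c[None, :, :]) ** 2).sum(-1) + eps    # squared distances
        ratio = (d[:, :, None] / d[:, None, :]) ** (1.0 / (m - 1.0))
        W = 1.0 / ratio.sum(axis=2)                                 # memberships w_{i,j}
        Wm = W ** m
        c = (Wm.T @ X) / Wm.sum(axis=0)[:, None]                    # weighted means
    return c, W

# Hard assignment: each policy goes to its most likely cluster.
# labels = W.argmax(axis=1)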
We apply fuzzy k-means to the Wasa insurance dataset fully converted into categorical variables. We convert the driver's age and vehicle age into categorical variables as in Sub-section 2.1. We run the algorithm with a fuzziness parameter of m = 1.10.
The policies are assigned to the most likely cluster (highest w_{i,j} for j = 1, ..., K). Table 8 reports the most frequent features and the claim frequency in each cluster. We find similar profiles for the most and least risky drivers to those found in Sub-section 2.1, but with a much higher claim frequency for the riskiest group. The deviance (Table 9) is better than that of the (mini-batch) k-means. Notice that if we set m = 2, which is a standard level of fuzziness in the literature, several clusters are not assigned any policies, although some policies have a non-null probability of belonging to them.

Spectral clustering

Spectral clustering first represents the dataset as a similarity graph and then partitions the vertices of this graph. There are several common ways to build such a graph from the data:

a. The ε-neighbourhood graph: we connect all pairs of points whose pairwise distances are smaller than ε.

b. The (mutual) k-nearest neighbour graph: the vertex v_i is connected with the vertex v_j if v_j is among the k-nearest neighbours of v_i. Since the neighbourhood relationship is not symmetric and the graph must be symmetric, we need to enforce the symmetry.

To do so, we may simply choose to ignore the directions of the edges: we connect v_i and v_j if v_i is among the k-nearest neighbours of v_j or if v_j is among the k-nearest neighbours of v_i. The resulting graph is usually called the k-nearest neighbour graph. Another option consists in connecting v_i and v_j if v_i is among the k-nearest neighbours of v_j and v_j is among the k-nearest neighbours of v_i. The resulting graph is usually called the mutual k-nearest neighbour graph.

c. The fully connected graph: we connect all the points available in the dataset.
Keep in mind that the arbitrarily chosen rule will set the edge matrix's entries, and may influence the resulting clusters.
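For illustration, here is a NumPy sketch of the mutual k-nearest neighbour construction; the Gaussian similarity weight exp(-alpha d^2) is an assumption of ours that matches the similarity parameter alpha used later in this section:

import numpy as np

def mutual_knn_graph(X, k, alpha=1.0):
    """Weighted adjacency matrix of the mutual k-nearest neighbour graph."""
    n = len(X)
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)                 # no self-loops
    knn = np.zeros((n, n), dtype=bool)
    nearest = np.argsort(d2, axis=1)[:, :k]
    np.put_along_axis(knn, nearest, True, axis=1)
    # Mutual rule: keep an edge only if each vertex is a neighbour of the other.
    return np.where(knn & knn.T, np.exp(-alpha * d2), 0.0)

Replacing the logical AND by a logical OR yields the (non-mutual) k-nearest neighbour graph instead.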
In order to work with the graph representation so obtained, we use a graph Fourier transform based on its Laplacian representation. The Laplacian representation of a graph G is derived from the difference between the degree matrix D and the adjacency matrix A:

$$L = D - A.$$

We can define a function on the graph's vertices, f : V → R, which assigns a value f(v_i) to each vertex. Let us consider a discrete periodic function that takes N values, at times 1, 2, ..., N. The loop on periods may be represented by a ring graph, as shown in Figure 8. In this particular case, the matrix of edges and weights is the circulant matrix

$$E = \begin{pmatrix} 0 & 1 & 0 & \cdots & 0 & 1 \\ 1 & 0 & 1 & \cdots & 0 & 0 \\ 0 & 1 & 0 & \ddots & & \vdots \\ \vdots & & \ddots & \ddots & \ddots & 0 \\ 0 & 0 & & \ddots & 0 & 1 \\ 1 & 0 & \cdots & 0 & 1 & 0 \end{pmatrix},$$

whose entries indicate the absence or presence of a (weighted) edge between the vertices. Because none of the vertices have edges to themselves, its diagonal elements are 0. Because our graph is undirected, E_{ij} = E_{ji}.
In order to find the Laplacian representation, we subtract the adjacency matrix A set out above from the diagonal degree matrix D. The matrix D contains on its diagonal the degree of each vertex, $D_{ii} = \sum_j w_{i,j}$. The degree of a vertex v_i represents the weighted number of edges connected to it, and is equal to the sum of row i of the adjacency matrix.
If we denote by f = (f(v_j))_{j=1,...,N} the vector of values of f(.) at the vertices, the product Lf corresponds precisely to the negative of the second finite-difference derivative of the function f(.).
Both the adjacency and degree matrices are encased in the resulting Laplacian matrix: the diagonal entries are the degrees of the vertices, and the off-diagonal elements are the negative edge weights.
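A short numerical check of this interpretation on a ring graph with N = 8 vertices (NumPy):

import numpy as np

N = 8
idx = np.arange(N)
A = np.zeros((N, N))
A[idx, (idx + 1) % N] = 1.0           # ring graph: each vertex linked to its two neighbours
A[idx, (idx - 1) % N] = 1.0
L = np.diag(A.sum(axis=1)) - A        # graph Laplacian L = D - A

f = np.sin(2 * np.pi * idx / N)       # a periodic function on the vertices
second_diff = np.roll(f, -1) - 2 * f + np.roll(f, 1)
print(np.allclose(L @ f, -second_diff))   # True: Lf is minus the second finite difference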
The partition of the graph, including the number of clusters, is then inferred from the eigenvalues and eigenvectors of the graph Laplacian matrix (spectral analysis). Since L is symmetric, we can rewrite the Laplacian as L = UΣU⊤, where U is a matrix containing the eigenvectors and Σ a diagonal matrix containing the eigenvalues. As L is a positive semi-definite matrix, it can easily be shown that the smallest eigenvalue of L is always 0.
The analysis of the eigenvalues gives useful insights into the graph's structure. The eigenspace of 0 gives a way to partition the graph. For instance, if all the vertices of a graph are completely disconnected, all the eigenvalues are zero. As we add edges, some of the eigenvalues become non-null. The number of null eigenvalues corresponds to the number of groups of connected vertices in our graph. As an illustration, let us consider the graph plotted in Figure 9, made of K different groups of vertices that are not connected to each other. In such a case, the null eigenvalue has multiplicity K, i.e., the number of null eigenvalues is equal to the number of connected components K. Looking at the eigenvalues of the graph Laplacian provides information not just about whether a graph is connected, but also about how well it is connected. The smallest non-zero eigenvalue, called the spectral gap, informs us about the density of the graph. If a graph is densely connected (all pairs of vertices have an edge with unit weight), then the spectral gap is equal to the number of vertices. Conversely, the smaller the spectral gap, the closer the graph is to being disconnected. The first positive eigenvalue thus gives a sort of continuous measure of how well the graph is connected.
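The multiplicity property is easy to check numerically (NumPy/SciPy sketch on a toy graph of ours):

import numpy as np
from scipy.linalg import block_diag

# Two complete sub-graphs of sizes 3 and 4, with no edge between them.
A1 = np.ones((3, 3)) - np.eye(3)
A2 = np.ones((4, 4)) - np.eye(4)
A = block_diag(A1, A2)
L = np.diag(A.sum(axis=1)) - A

eigenvalues = np.linalg.eigvalsh(L)        # ascending order
print(np.sum(np.isclose(eigenvalues, 0)))  # 2: one null eigenvalue per component
print(eigenvalues)                         # the non-zero eigenvalues equal 3 and 4

Each complete component of size s contributes non-zero eigenvalues equal to s, consistent with the spectral gap interpretation above.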
The second smallest eigenvalue, called the Fiedler value (or algebraic connectivity of the graph), approximates the minimum graph cut needed to separate the graph into two connected components. Let us assume K = 2 and that the two groups of vertices V_1 and V_2 are linked by one additional edge, as illustrated in Figure 10. In such a case, simply looking at the sign of each entry of the Fiedler vector gives an indication about which side of the cut (V_1 or V_2) the corresponding vertex belongs to. Finally, if a graph is made of K disconnected sub-graphs, we can prove that the elements of the K eigenvectors with null eigenvalues are roughly constant over each cluster. In other words, the eigenvectors' coordinates of points belonging to the same cluster are identical. Intuitively, since the eigenvectors of the zero eigenvalue tell us how to partition the graph, the first K columns of U (the eigenvectors' coordinates) consist of the cluster indicator vectors. Let us assume two well-separated sub-graphs. The eigenvectors' coordinates (U_{i,1}, U_{i,2})_{i=1,...,n} associated with a null eigenvalue are then constant for all vertices v_i belonging to the same sub-graph V_k. The proof is detailed in the Appendix. If the K clusters are not directly identified in this way, we can run the k-means algorithm with the rows of the first K eigenvectors as inputs representative of the vertices. The full procedure is detailed in Algorithm 5.
Algorithm 5 Spectral clustering.
Initialization: Represent the dataset X as a graph G = (V, E, W).
Main procedure:
1) Compute the n × n Laplacian matrix L = D − A.
2) Extract the eigenvector matrix U and the diagonal matrix of eigenvalues Σ from L = UΣU⊤.
3) Fix k and build the n × k matrix U^(k) of eigenvectors associated with the k eigenvalues closest to zero.
4) Run the k-means Algorithm 2 on the dataset of rows U^(k)_{i,.}, for i = 1, ..., n.
5) The i-th data point is associated with the cluster of U^(k)_{i,.}.
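A minimal Python sketch of Algorithm 5, assuming scikit-learn's KMeans for step 4 (the function names are ours):

import numpy as np
from sklearn.cluster import KMeans

def spectral_clustering(W, K):
    """Unnormalised spectral clustering from a similarity (adjacency) matrix W."""
    L = np.diag(W.sum(axis=1)) - W               # graph Laplacian L = D - A
    eigenvalues, U = np.linalg.eigh(L)           # eigh sorts eigenvalues in ascending order
    Uk = U[:, :K]                                # the K eigenvectors closest to zero
    return KMeans(n_clusters=K, n_init=10).fit_predict(Uk)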
As explained by Von Luxburg (2007), spectral clustering outperforms other popular clustering algorithms due to its ability to handle non-convex clusters. To illustrate this, we apply the spectral analysis to the circular dataset used in the introduction of this section (1200 points, 800 in the outer ring and 400 in the inner circle). The graph is built with the mutual k-nearest neighbours for k = 20. The similarity parameter is α = 1.
The left plot of Figure 11 confirms that the inner and outer rings are well identified by spectral clustering. The right plot displays all the pairs of eigenvectors' coordinates (U_{i,1}, U_{i,2})_{i=1,...,1200}. Pursuant to the property above, we observe in the right plot of Figure 11 that the eigenvectors' coordinates of points belonging to the same cluster are identical. To conclude this section, we apply the spectral clustering algorithm to the Wasa insurance dataset. In order to work with a homogeneous distance, we once again use the conversion of the driver's age and vehicle age into categorical variables from Sub-section 2.1.
The disjunctive table D and the weighted Burt matrix B^W are then computed, and Burt's distance is used. As in Sub-section 2.1, the i-th contract with multiple modalities is identified by the centre of gravity x_i = D_{i,.} B^W / q. There are 62 436 contracts in the dataset. In order to graphically represent the dataset, we reduce its dimension by applying the k-means algorithm with 1000 centroids. The graph is then built using the method of mutual k-nearest neighbours (with k = 50) and a similarity parameter α = 1 on this reduced dataset made up of the 1000 cluster centres. We run the spectral clustering algorithm with K = 18 clusters. Table 11 reports the average owner and vehicle ages, as well as the dominant features and the observed claim frequency for each cluster. We observe that the spectral clustering applied to a dataset preliminarily reduced with the k-means algorithm is able to discriminate between drivers with different risk profiles. Table 10 confirms that the method achieves a reasonable goodness of fit in terms of deviance.
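The two-stage strategy can be sketched as follows; the helper functions mutual_knn_graph and spectral_clustering are the hypothetical ones from the earlier sketches, and scikit-learn's KMeans performs the preliminary reduction:

import numpy as np
from sklearn.cluster import KMeans

def reduce_then_spectral(X, n_centroids=1000, K=18, k_neighbours=50, alpha=1.0):
    """Compress the portfolio with k-means, then run spectral clustering on the centroids."""
    km = KMeans(n_clusters=n_centroids, n_init=3).fit(X)
    W = mutual_knn_graph(km.cluster_centers_, k=k_neighbours, alpha=alpha)
    centre_labels = spectral_clustering(W, K)
    return centre_labels[km.labels_]   # map each policy to its centroid's cluster

Only the 1000 × 1000 Laplacian is eigendecomposed, instead of the full 62 436 × 62 436 one.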
Appendix

Without loss of generality, assume that the vertices are ordered according to the connected components they belong to; this ordering does not affect the structure of the algorithm. Each sub-graph then corresponds to a block L_i, and the Laplacian matrix L takes the form of a block diagonal matrix:

$$L = \begin{pmatrix} L_1 & & \\ & \ddots & \\ & & L_K \end{pmatrix}.$$

Each block L_i is a graph Laplacian of its own, namely the Laplacian corresponding to the sub-graph of the i-th connected component. We deduce from the above that every L_i has the eigenvalue 0 with multiplicity 1, and the corresponding eigenvector is the constant one vector on the i-th connected component. Since the eigenspace of 0 gives a way to partition the graph, the blocks of positive values roughly correspond to the clusters. Given that such a matrix is constituted by a weighted sum of the outer products of its eigenvectors, these eigenvectors should exhibit the property of being roughly piecewise constant (Paccanaro, Casbon and Saqi 2006).
For problems with well-separated clusters, components corresponding to the elements in the same cluster should therefore have approximately the same value. By `well-separated clusters', we mean that the vertices in each cluster/sub-graph should be connected with high affinity (high similarity), while different clusters/sub-graphs are either not connected or are connected only by a few edges with low affinity.


Figure 2: The 18 clusters displayed in a two-dimensional space, with the owner's and vehicle's ages on the axes.

Figure 3: Left plot: evolution of the total intraclass inertia. Right plot: evolution of the deviance.

Figure 4: Partition quality (measured by the deviance, solid lines) and running time (dotted lines) with respect to the number of clusters.


Figure 5: Illustration of the partitioning of the dataset with the fuzzy k-means algorithm based on Burt's distance; the 18 clusters are depicted in the space delimited by the owner and vehicle ages.

Figure 6: Illustration of the partition of non-convex data with the (mini-batch) k-means and spectral algorithms. Each cluster is identified by a colour. The centroids are represented with a diamond shape.

Figure 8: Ring representation of a period with N steps.


Table 4: K-means algorithm on Burt's distance. Dominant features and average claim frequencies per cluster.

Table 5: Statistics of goodness of fit obtained by partitioning the dataset into 18 clusters with the k-means algorithm based on Burt's distance.

Table 6: Statistics of goodness of fit obtained by partitioning the dataset into 18 clusters with the mini-batch k-means.

Table 7: Coordinates of centroids and average claim frequencies for 18 clusters, mini-batch k-means.

Table 8: Fuzzy clustering. Dominant features and average claim frequencies per cluster.

Table 9: Fuzzy clustering. Statistics of goodness of fit obtained by partitioning the full categorical dataset into 18 clusters.

Table 10: Statistics of goodness of fit obtained by partitioning the dataset into 18 clusters with the spectral algorithm.