Bayesian CART models for insurance claims frequency

Accuracy and interpretability of a (non-life) insurance pricing model are essential qualities for ensuring fair and transparent premiums for policy-holders that reflect their risk. In recent years, classification and regression trees (CARTs) and their ensembles have gained popularity in the actuarial literature, since they offer good prediction performance and are relatively easy to interpret. In this paper, we introduce Bayesian CART models for insurance pricing, with a particular focus on claims frequency modelling. In addition to the common Poisson and negative binomial (NB) distributions used for claims frequency, we implement Bayesian CART for the zero-inflated Poisson (ZIP) distribution to address the difficulty arising from imbalanced insurance claims data. To this end, we introduce a general MCMC algorithm using data augmentation methods for posterior tree exploration. We also introduce the deviance information criterion (DIC) for tree model selection. The proposed models are able to identify trees which can better classify the policy-holders into risk groups. Simulations and real insurance data are discussed to illustrate the applicability of these models.


Introduction
An insurance policy is an agreement between an insurance company (the insurer) and a policy-holder (the insured), in which the insurer charges the insured a certain fee in return for covering unpredictable losses of the customer within a period of time, usually one year. The charged fee is called a premium, which includes a pure premium and other loadings such as operational costs.
For each policy, the pure premium is determined by multiple explanatory variables (such as properties of the policy-holders, properties of the insured objects, properties of the geographical region, etc.), also called risk factors [1]. The premium charged reflects the customer's degree of risk; a higher premium suggests a potentially higher risk, and vice versa. Therefore, it is necessary to use risk factors to classify policy-holders with similar risk profiles into the same tariff class. The insureds in the same group, all having similar risk characteristics, will then pay the same, reasonable premium.
The process of constructing these tariff classes is also known as risk classification; see, e.g., [2,3].
In the basic formula of non-life insurance pricing, the pure premium is obtained by multiplying the expected claims frequency with the conditional expectation of the claim severity, assuming independence between frequency and severity; see, e.g., [4]. Hence, claims frequency modelling represents an essential first step in non-life insurance pricing. In this paper, we aim to propose efficacious approaches (namely, Bayesian CART or BCART models) to analyze imbalanced insurance claims frequency data.
Due to their flexibility in modelling a large number of distributions in the exponential family, generalized linear models (GLMs), developed in [5], have been the industry-standard predictive models for insurance pricing [2,6]. Explanatory variables enter a GLM through a linear predictor, leading to interpretable effects of the risk factors on the response. Extensions of GLMs to generalized additive models (GAMs), which capture nonlinear effects of risk factors, sometimes offer more flexible models. However, both GLMs and GAMs often fail to identify complex interactions among risk factors. Another popular classical method based on Bayesian statistics, the credibility method, was introduced to deal with multi-level factors and lack-of-data issues; see, e.g., [1,7]. Due to the limitations of these classical statistical methods, and equipped with continually developing technologies, research has turned to machine learning techniques in recent years. Several machine learning methods such as neural networks, regression trees, bagging techniques, random forests and boosting machines have been introduced in the insurance context by adopting actuarial loss distributions in these models to capture the characteristics of insurance claims. We refer to [8] for a recent literature review on this topic and to [9][10][11] for more detailed discussions.
Insurance pricing models are heavily regulated and must meet specific requirements before being deployed in practice, which poses some challenges for machine learning methods; see [4].
Therein, it is stressed that pricing models must be transparent and easy to communicate to all stakeholders, and that the insurer has the social role of creating solidarity among policy-holders, so that the use of machine learning for pricing should in no way lead to an extreme penalization of risk or to discrimination. The latter has also been noted recently in, e.g., [12,13], where it is argued that prediction accuracy at an individual level should not be the ultimate goal in insurance pricing; one also needs to ensure the balance property. Bearing these points in mind, researchers have concluded that tree-based models are good candidates for insurance pricing [4,14-17]. More precisely, CART, first introduced in [18], partitions a portfolio of policy-holders into smaller groups of homogeneous risk profiles based on some risk factors, and a constant prediction is used for each sub-group. This results in a highly transparent model and automatically induces solidarity among the policy-holders in a sub-group. Although a large number of scholars have carried out empirical and theoretical studies on the effectiveness of CART, limitations of the forward-search recursive partitioning method used in CART have been identified. In particular, its predictive performance tends to be low, and it is known to be unstable: small variations in the training set can result in different trees and different predictions for the same test examples. Due to these limitations, more complex tree-based models that combine multiple trees in an ensemble have become popular in insurance prediction and pricing, but these ensemble techniques usually introduce additional difficulties in model transparency. In this paper and its sequel, we propose BCART models for insurance claims prediction. Instead of building an ensemble of trees, we aim to seek one good tree, which can improve prediction ability whilst ensuring model transparency, by adopting a Bayesian approach applied to CART.
BCART models were first introduced by Chipman et al. [19] and Denison et al. [20], independently. The method has two basic components, prior specification (for the tree and its terminal node parameters) and stochastic search, and the idea is to obtain a posterior distribution given the prior, thus guiding the stochastic search towards more promising tree models. Compared with the tree that CART generates by its greedy forward-search recursive partitioning method, a BCART model generates a much better tree by an effective Bayesian-motivated stochastic search algorithm. For data such as the chessboard-type example shown in Figure 1, CART will not be able to find the correct partition of the data due to its greedy forward-search nature.
In contrast, the proposed Poisson BCART can retrieve the optimal tree structure since it has the ability to explore the tree space in a global way (for example, it can correct incorrectly chosen splits from previous steps).
Figure 1: Covariate partition for a Poisson-distributed simulation. The response variable, simulated at the plotted points, has Poisson intensity equal to 1 (circles) and 7 (triangles).
Since BCART models and their ensemble version, the Bayesian additive regression trees (BART) models, generally outperform other machine learning models, they have been extensively studied in the literature; see, e.g., [21][22][23][24][25] and references therein. In particular, their excellent empirical performance has also motivated work on their theoretical foundations; see [26,27]. However, most of these studies focus on Gaussian distributed data, with some exceptions such as [24,28]. It turns out that a data augmentation approach is needed when dealing with general non-Gaussian data. The existing algorithms do not seem to be directly applicable to insurance data for prediction and pricing. To fill this gap, as a first step we propose BCART models for claims frequency that take into account the particularities of insurance data, such as the high number of zeros and the involvement of exposures. We refer to [29,30] for a review of claims frequency modelling, which also includes some nice analyses of exposures.
The main contributions of the paper are as follows:
• We give a general MCMC algorithm for BCART models applied to data from any distribution, where data augmentation may be needed. In doing so, we follow some ideas in [31,32].
• We introduce a novel model selection method for BCART models based on the deviance information criterion (DIC). Note that DIC was introduced in [33], which appeared a few years after the introduction of BCART [19]. The effectiveness of this approach is illustrated by several designed simulation examples and real insurance data.
• We implement BCART for the Poisson, NB and ZIP distributions, which are not currently available in any existing R package. In particular, we introduce two different ways of incorporating exposure in the NB and ZIP models, following the lines of study in [29,30]. The simulation examples and real insurance data analysis show the applicability of these proposed BCART models.
• To date, Bayesian tree-based models have not attracted enough attention in the actuarial community compared to other machine learning methods. This first step of applying BCART to claims frequency modelling will open the door to more sophisticated tree-based models that meet the needs of insurers.
Outline of the rest of the paper: In Section 2, we review the BCART framework, followed by an extension with data augmentation and a model selection method using DIC. Section 3 introduces the notation for insurance claims frequency data and five BCART models, including a Poisson model, two NB models and two ZIP models. In Section 4, we discuss the applicability of the proposed BCART models using three simulation examples and a real insurance claims dataset. Section 5 concludes the paper.

Bayesian CART
We shall briefly review the BCART framework of the seminal paper [19]. We begin with the general structure of a CART model. Consider a data set $(X, y) = \{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$ with $n$ observations. For the $i$-th observation, $x_i = (x_{i1}, x_{i2}, \ldots, x_{ip})$ is a vector of $p$ explanatory variables (or covariates) sampled from a space $\mathcal{X}$, while $y_i$ is a response variable sampled from a space $\mathcal{Y}$. For our purpose of claims frequency modelling, $\mathcal{Y}$ will be a set of non-negative integers.
A CART has two main components: a binary tree $T$ with $b$ terminal nodes, which induces a partition of the covariate space $\mathcal{X}$, denoted by $\{A_1, \ldots, A_b\}$, and a parameter $\theta = (\theta_1, \theta_2, \ldots, \theta_b)$ which associates the parameter value $\theta_t$ with the $t$-th terminal node. Note that here we do not specify the dimension and range of the parameter $\theta_t$, which should be clear in the contexts considered below. If $x_i$ is located in the $t$-th terminal node (i.e., $x_i \in A_t$), then $y_i$ has distribution $f(y_i \mid \theta_t)$, where $f$ represents a parametric family indexed by $\theta_t$.
By associating observations with the $b$ terminal nodes in the tree $T$, we can represent the data set as $(X_t, y_t)$, $t = 1, \ldots, b$, where $y_t = (y_{t1}, \ldots, y_{tn_t})$, with $n_t$ denoting the number of observations and $y_{tj}$ the $j$-th observation in the $t$-th terminal node, and $X_t$ is an analogously defined $n_t \times p$ design matrix.
For a CART model, it is typically assumed that, conditionally on $(\theta, T)$, the response variables within a terminal node are independent and identically distributed (IID), and that they are also independent across terminal nodes. In this case, the CART model likelihood factorizes over terminal nodes and observations (see the display below). Since a CART model is identified by $(\theta, T)$, a Bayesian analysis proceeds by specifying a prior distribution $p(\theta, T)$. Because $\theta$ indexes a parametric model whose dimension depends on the number of terminal nodes of the tree, it is usually convenient to use the relationship $p(\theta, T) = p(\theta \mid T)\, p(T)$ and to specify the tree prior distribution $p(T)$ and the terminal node parameter prior distribution $p(\theta \mid T)$ separately. This strategy offers the advantage that the choice of prior for $T$ does not depend on the form of the parametric family indexed by $\theta$, and thus the same specification for $T$ can be used for any type of response variable.
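For concreteness, with the notation above the conditional independence assumptions and the prior decomposition described in this paragraph can be written as follows (a reconstruction of the displayed equations from the surrounding text):
\[
p(y \mid X, \theta, T) \;=\; \prod_{t=1}^{b} p(y_t \mid X_t, \theta_t) \;=\; \prod_{t=1}^{b} \prod_{j=1}^{n_t} f(y_{tj} \mid \theta_t),
\qquad
p(\theta, T) \;=\; p(\theta \mid T)\, p(T).
\]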

Specification of the tree prior p(T)
The prior for T has two components: a tree topology, and a decision rule for each of the internal nodes. Chipman et al. [19] propose a branching-process prior for the topology of T. A draw from this prior is obtained by generating, for each node at depth d (with d = 0 for the root node), two child nodes with probability p(d), where γ > 0 and ρ ≥ 0 are parameters that control the structure of the tree. This process iterates for d = 0, 1, . . ., until a depth is reached at which all nodes stop growing. Note that p(d) is not a probability mass function, but rather the probability of a given node at depth d being converted to a branch node. A sufficient condition for the termination of this branching process is ρ > 0, and the case ρ = 0 corresponds to the Galton-Watson process; see, e.g., [34]. We refer to [19] for some simulation-based discussion of the expected number of terminal nodes associated with the values of the pair (γ, ρ). After the tree topology is generated, each internal/branch node is associated with a decision rule of the form $x_l < c_l$ or $x_l \in C_l$, according to whether $x_l$ is a continuous or a categorical explanatory variable, where $x_l$ is selected independently and uniformly among the explanatory variables available for that node, and the split value $c_l$ or split category subset $C_l$ is selected uniformly among those available for the selected variable $x_l$. In practice, we consider only a finite overall set of possible split values; if the $l$-th variable is continuous, the grid for that variable is either uniformly spaced or given by a collection of observed quantiles of $\{x_{il}, i = 1, 2, \ldots, n\}$. Although the uniform choice of variable is natural, simple and the most popular, a non-uniform choice of variable may be beneficial for certain purposes, for example variable selection (see [35]).
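As a minimal illustration, the sketch below draws a tree topology from a branching-process prior of this type. The split probability p(d) = γ(1 + d)^(−ρ) is the form used by Chipman et al. [19] and is assumed here; attaching decision rules to the internal nodes (the second component of the prior) is omitted.

```python
import random

def sample_tree_topology(gamma, rho, depth=0, max_depth=10):
    """Draw a tree topology from the assumed branching-process prior.

    A node at depth d becomes an internal (branch) node with probability
    p(d) = gamma * (1 + d) ** (-rho); otherwise it remains terminal.
    Trees are returned as nested dicts; decision rules would be attached
    to internal nodes in a second step (not shown).
    """
    p_split = gamma * (1.0 + depth) ** (-rho)
    if depth < max_depth and random.random() < p_split:
        return {
            "depth": depth,
            "left": sample_tree_topology(gamma, rho, depth + 1, max_depth),
            "right": sample_tree_topology(gamma, rho, depth + 1, max_depth),
        }
    return {"depth": depth, "terminal": True}

def n_terminal(tree):
    """Count terminal nodes of a sampled topology."""
    if tree.get("terminal"):
        return 1
    return n_terminal(tree["left"]) + n_terminal(tree["right"])

# e.g. average number of terminal nodes under (gamma, rho) = (0.95, 1.5)
sizes = [n_terminal(sample_tree_topology(0.95, 1.5)) for _ in range(1000)]
print(sum(sizes) / len(sizes))
```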

Specification of the terminal node parameter prior p(θ | T)
When choosing p(θ | T), it is important to realize that using priors which allow for analytical simplification can greatly reduce the computational burden of posterior calculation and exploration. This is especially true for prior forms p(θ | T) for which it is possible to analytically integrate out θ to obtain the integrated likelihood p(y | X, T), where, conditionally on a tree T with b terminal nodes as above, the parameters $\theta_t$, $t = 1, 2, \ldots, b$, are assumed to have IID priors $p(\theta_t)$, which is a common assumption. Examples where this integration has a closed-form expression can be found in, e.g., [19,21], particularly for Gaussian distributed data y. When no such priors can be found, we have to resort to the technique of data augmentation (see, e.g., [24,28,36]), which will be discussed later. Combining the integrated likelihood p(y | X, T) with the tree prior p(T) allows us to calculate the posterior of T. When using MCMC to conduct Bayesian inference, T can be updated using a Metropolis-Hastings (MH) algorithm, with the right-hand side of (5) used to compute the acceptance ratio. These MH simulations can be used to stochastically search the posterior space of trees to determine the high posterior probability trees, from which we can choose the best one. An additional Gibbs sampler is then used to obtain a posterior sequence for θ. It is worth noting that by integrating out θ in (4) we avoid the possible complexities associated with reversible jumps between continuous spaces of varying dimensions [22,37].
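In the notation above, the integrated likelihood (referred to as (4) in the text), the tree posterior (referred to as (5)) and the resulting MH acceptance ratio take the following standard forms; this display is a hedged reconstruction consistent with the description, with $q(\cdot, \cdot)$ denoting the tree proposal distribution:
\[
p(y \mid X, T) = \int p(y \mid X, \theta, T)\, p(\theta \mid T)\, d\theta
= \prod_{t=1}^{b} \int p(y_t \mid X_t, \theta_t)\, p(\theta_t)\, d\theta_t,
\qquad
p(T \mid X, y) \propto p(y \mid X, T)\, p(T),
\]
\[
\alpha\big(T^{(m)}, T^*\big) = \min\left\{1,\;
\frac{q\big(T^*, T^{(m)}\big)\, p(y \mid X, T^*)\, p(T^*)}
     {q\big(T^{(m)}, T^*\big)\, p(y \mid X, T^{(m)})\, p(T^{(m)})}\right\}.
\]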
In our implementation, we consider the following four proposals:
• Grow: Randomly select a terminal node among those satisfying a minimum-observations requirement. Split it into two new child nodes and randomly assign it a decision rule.
• Prune: Randomly select a terminal node. The chosen node and its sibling are pruned back into their direct parent node, which then becomes a new terminal node.
• Change: Randomly select an internal node and randomly assign it a new decision rule.
• Swap: Randomly pick a parent-child pair which are both internal nodes and swap their decision rules.
One appealing feature of these four proposals is that the Grow and Prune steps are reversible counterparts of one another, while the Change and Swap steps are reversible counterparts of themselves. As noticed in [19], this is very attractive for the calculation of $\alpha(T^{(m)}, T^*)$ in Algorithm 1, since there are substantial cancellations in the ratios (see also [40] for detailed calculations). Other proposals have also been suggested to improve the mixing of simulated trees, but these are usually not easy to implement; see, e.g., [38,39].
Remark 1 Note that in step 4 of Algorithm 1, sampling of $\theta^{(m+1)}$ is needed only for those nodes that were involved in the proposed move from $T^{(m)}$ to $T^*$, and only when this move was accepted.

MCMC algorithm with data augmentation
In this section, we discuss the case where there is no obvious prior distribution $p(\theta_t)$ such that the integration in (4) has a closed form, particularly for non-Gaussian data y. In this case, we shall use a data augmentation method in implementing the MCMC algorithm. Some special cases have been discussed in [22,24,28,36].
The term data augmentation originates from Tanner and Wong's data augmentation algorithm [41]. It is introduced purely for computational purposes: a latent variable is introduced so that the original distribution is the marginal distribution of the augmented one. We refer to [32] for an overview of data augmentation and the relevant theory. For our purposes, we augment the data y by introducing a latent variable $z = (z_1, z_2, \ldots, z_n)$ so that the integration in (7) below is computable with the augmented data (y, z). To this end, we follow the idea of marginal augmentation introduced in [31] (see also [32]). In their framework, our parameter θ can be interpreted as a working parameter, and the integrated likelihood is then given by (7), where $z_t = (z_{t1}, z_{t2}, \ldots, z_{tn_t})$ is defined according to the partition of $\mathcal{X}$ and the obvious independence is assumed. Following Scheme 3 of [31] (see also Section 3 of [32]), we propose the following Algorithm 2 to simulate a Markov chain sequence of pairs $(\theta^{(1)}, T^{(1)}), (\theta^{(2)}, T^{(2)}), \ldots$, starting from the root node.
Algorithm 2 One step of the MCMC algorithm for updating the BCART parameterized by (θ, T) using data augmentation
Input: Data (X, y) and current values $\theta^{(m)}, T^{(m)}, z^{(m)}$
1: Generate a candidate value $T^*$ with probability distribution $q(T^{(m)}, T^*)$
2: Sample $z^{(m+1)} \sim p(z \mid X, y, \theta^{(m)}, T^{(m)})$
3: Set the acceptance ratio $\alpha(T^{(m)}, T^*)$, computed analogously to the non-augmented case with the integrated augmented likelihood in (7) in place of (4)
Output: New values $\theta^{(m+1)}, T^{(m+1)}, z^{(m+1)}$
Note that in some cases introducing one latent variable z is not enough to obtain a closed form for the integration in (7); we might need more latent variables. In that case, we can easily extend Algorithm 2 to include multivariate latent variables and use a Gibbs sampler in step 2. Clearly, the more latent variables used, the slower the convergence of the Markov chain sequence. As discussed in [32], it is an "art" to search for efficient data augmentation schemes. We discuss this point later for the claims frequency models.
Remark 2 Similarly to Algorithm 1, in step 2 and step 5 of Algorithm 2 the sampling is needed only for those nodes that were involved in the proposed move from $T^{(m)}$ to $T^*$, and step 5 is needed only when this move was accepted.
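To make the flow of Algorithm 2 concrete, the sketch below organises one data-augmentation MCMC step in code. All of the model-specific functions (propose_tree, sample_latent, log_aug_marginal, log_tree_prior, sample_theta) are hypothetical placeholders standing in for the quantities described above, not routines from the paper, and the placement of the θ-update follows the description in Remark 2 rather than a verbatim transcription of the algorithm.

```python
import math
import random

def augmented_mcmc_step(tree, theta, z, data, model):
    """One data-augmentation MCMC step for (theta, T); a sketch only.

    `model` is assumed to bundle the problem-specific pieces:
      propose_tree(tree)     -> candidate tree, log q(T -> T*), log q(T* -> T)
      sample_latent(...)     -> draw z from p(z | X, y, theta, T)
      log_aug_marginal(...)  -> log integrated augmented likelihood p(y, z | X, T)
      log_tree_prior(tree)   -> log p(T)
      sample_theta(...)      -> draw theta from p(theta | X, y, z, T)
    """
    cand, log_q_fwd, log_q_rev = model.propose_tree(tree)            # step 1
    z_new = model.sample_latent(data, theta, tree)                   # step 2
    log_alpha = (
        model.log_aug_marginal(cand, data, z_new) + model.log_tree_prior(cand) + log_q_rev
        - model.log_aug_marginal(tree, data, z_new) - model.log_tree_prior(tree) - log_q_fwd
    )
    if random.random() < math.exp(min(0.0, log_alpha)):              # accept / reject
        tree = cand
    # Gibbs update of theta; in practice only the nodes affected by an
    # accepted move need updating (cf. Remark 2).
    theta_new = model.sample_theta(data, z_new, tree)
    return tree, theta_new, z_new
```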

Posterior tree selection and prediction
The MCMC algorithms described in the previous section can be used to search for desirable trees.
However, as discussed in [19] and illustrated below in our analysis, the algorithms converge quickly and then move locally in that region for a long time, because the proposals make local moves over a sharply peaked multimodal posterior. Instead of making long runs to move from one mode to a better one, we follow the idea of [19] and repeatedly restart the algorithm.
As many trees are visited by each run of the algorithm, we need a method to identify those trees which are of most interest. Moreover, the structure of the trees in the convergence regions is largely determined by the hyper-parameters γ, ρ, which also need to be chosen appropriately. In [19], the integrated likelihood p(y | X, T) is used as a measure to choose good trees from one run of the algorithm, although other measures, such as the residual sum of squares, could also be used. However, there is no discussion of how the tree prior hyper-parameters γ, ρ should be determined optimally.
A natural way to deal with this is to use cross-validation, which, however, requires repeated model fits and is computationally very expensive. In this paper, we propose to use the DIC for choosing appropriate γ, ρ, and thus introduce a three-step approach for selecting an "optimal" tree among those visited. To this end, we first give a definition of DIC for a Bayesian CART. We refer to [33,42-44] for detailed discussions of DIC and its extensions.
Consider the tree T with b terminal nodes and parameters $\theta_t$, $t = 1, 2, \ldots, b$, as previously defined.
We first introduce the DIC for each node using the standard definition; the DIC for the tree is then defined as the sum of the DICs of all terminal nodes in the tree, due to the independence assumption.
For node t, we call $D(\theta_t) = -2 \log f(y_t \mid \theta_t)$ the deviance. Analogously to the Akaike information criterion (AIC), Spiegelhalter et al. [33] proposed the DIC based on the principle DIC = "goodness of fit" + "complexity", which is defined as $\mathrm{DIC}_t = D(\bar\theta_t) + 2 p_{D_t}$, where $\bar\theta_t = E_{\mathrm{post}}(\theta_t)$ is the posterior mean (with $E_{\mathrm{post}}$ denoting expectation over the posterior distribution of θ given data y), and $p_{D_t} = E_{\mathrm{post}}\big(D(\theta_t)\big) - D(\bar\theta_t)$ is the effective number of parameters. The DIC of the tree T with b terminal nodes is then defined as $\mathrm{DIC} = \sum_{t=1}^{b} \mathrm{DIC}_t = D(\bar\theta) + 2 p_D$, where $D(\bar\theta) = \sum_{t=1}^{b} D(\bar\theta_t)$ and $p_D = \sum_{t=1}^{b} p_{D_t}$. Next, we introduce the DIC for tree models with data augmentation. Depending on whether the latent variable z is treated as a parameter or not, there are three types of likelihoods, leading to eight versions of DIC as discussed in [42]. Due to the complexity of implementing any of those eight, and motivated by the idea that DIC = "goodness of fit" + "complexity", we introduce a new DIC for node t in the tree as $\mathrm{DIC}_t = D(\bar\theta_t) + 2 q_{D_t}$, where $D(\bar\theta_t)$ is the deviance defined through the data $y_t$ (as in (8)), which represents the goodness of fit, and $q_{D_t}$ is the effective number of parameters defined through the augmented data $(y_t, z_t)$, with $\bar\theta_t = E_{\mathrm{post}}(\theta_t)$ and, in this case, $E_{\mathrm{post}}$ denoting expectation over the posterior distribution of θ given the augmented data (y, z). As we will see below, for the frequency models $q_{D_t}$ is approximately the dimension of $\theta_t$ as the sample size $n_t$ in node t tends to infinity. Similarly, the DIC of the tree T with b terminal nodes is defined as the sum of the node-level DICs (cf. (12)). More recently, a new criterion called WAIC was introduced by Watanabe [45] (see also [43,44]), in whose definition the plug-in prediction density is replaced by the full prediction density $E_{\mathrm{post}}(f(y_{ti} \mid \theta_t))$. When an explicit expression is not available, this posterior expectation is usually computed by a Monte Carlo algorithm as $S^{-1}\sum_{k=1}^{S} f(y_{ti} \mid \theta_k)$, where $\theta_k$ is simulated from the posterior distribution of $\theta_t$. In the following section, we will see that this posterior expectation can be obtained explicitly for the Poisson model, but not for the other models. It turns out that using WAIC gives the same selected model as DIC in our simulation examples. Additionally, since it involves a Monte Carlo algorithm and as such can be considerably more computationally expensive, we suggest using DIC. Now, we are ready to introduce the three-step approach for selecting an "optimal" tree from the MCMC algorithms.
Table 1: Three-step approach for "optimal" tree selection
Step 1: Set a sequence of hyper-parameters $(\gamma_j, \rho_j)$, $j = m_s, \ldots, m_e$, such that for $(\gamma_j, \rho_j)$ the MCMC algorithm converges to a region of trees with j terminal nodes.
Step 2: For each j in Step 1, select the tree with maximum likelihood p(y | X, θ, T ) from the convergence region.
Step 3: From the trees obtained in Step 2, select the optimal one using DIC.
Let $m_s < m_e$ be two user-input integers which represent the belief of where the optimal number of terminal nodes of the tree might fall. In practice, these can first be estimated using some other method, e.g., a CART model. The three-step approach is given in Table 1. In what follows, the tree selected by the three-step approach will be called an "optimal" tree. A minimal sketch of this selection loop is given below.
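The following sketch outlines the three-step approach in code; run_bcart, data_log_likelihood and tree_dic are hypothetical placeholders for one (restarted) MCMC run, the data likelihood p(y | X, θ, T) and the tree DIC described above.

```python
def select_optimal_tree(data, hyper_grid, run_bcart, data_log_likelihood, tree_dic):
    """Three-step 'optimal' tree selection (a sketch under the stated assumptions).

    hyper_grid: list of (gamma_j, rho_j) pairs, tuned so that the MCMC run with
                (gamma_j, rho_j) converges to trees with j terminal nodes (Step 1).
    run_bcart:  runs the MCMC (with restarts) and returns the accepted trees,
                each carrying its fitted terminal-node parameters, after burn-in.
    """
    best_per_size = []
    for gamma_j, rho_j in hyper_grid:                                       # Step 1
        visited = run_bcart(data, gamma=gamma_j, rho=rho_j)
        best = max(visited, key=lambda tr: data_log_likelihood(tr, data))   # Step 2
        best_per_size.append(best)
    return min(best_per_size, key=lambda tr: tree_dic(tr, data))            # Step 3
```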
Remark 4 (a). The relation between the hyper-parameters $(\gamma_j, \rho_j)$ and the distribution of the number of terminal nodes of the tree has been illustrated in [19]. It does not seem hard to set values for $(\gamma_j, \rho_j)$ so that the MCMC algorithms converge to a region of trees with the required j terminal nodes. However, it is worth noting that this distribution of the number of terminal nodes is also affected by the data at hand, as can be seen from the calculation of the acceptance ratio in the MCMC algorithms. In our simulations and real data analysis below, we had to select a relatively large ρ in order to achieve our goals.
(b). In Step 2, the so-called data likelihood p(y | X, θ, T), rather than the integrated likelihood p(y | X, T), is used, due to our interest in the fit of the parametric model to the data. The simulations and real data in Section 4 indicate that these two types of likelihood are consistent in the ordering of their values, and thus we suspect there is no big difference in using either of them.
Suppose T, with b terminal nodes and parameter θ, is the optimal tree obtained from the above three-step approach. For a given new x, the predicted ŷ using this tree model is defined in (14), where $I(\cdot)$ denotes the indicator function and $\{A_t\}_{t=1}^{b}$ is the partition of $\mathcal{X}$ from T.
Remark 5 An alternative prediction given x can be defined using the full predictive density. However, for the frequency models an explicit expression can be found only in the Poisson case, and for the other models a Monte Carlo method is needed to estimate the posterior expectation. Thus, we shall use (14) for simplicity.

Bayesian CART claims frequency models
In this section, we introduce BCART for insurance claims frequency by specifying the response distribution in the general framework introduced in Section 2. We shall discuss three distributions commonly used in the literature to model claim numbers, namely the Poisson, NB and ZIP distributions; see, e.g., [10,29,30]. To this end, we first introduce the claims data. A claims data set with n policy-holders can be described by $(X, v, N) = \{(x_1, v_1, N_1), \ldots, (x_n, v_n, N_n)\}$, where $x_i = (x_{i1}, \ldots, x_{ip}) \in \mathcal{X}$ consists of rating variables (e.g., area, driver age, car brand in car insurance); $N_i$ is the number of claims reported; and $v_i \in (0, 1]$ is the exposure in yearly units, which quantifies how long the policy-holder is exposed to risk. The goal is to explain and predict the claims information $N_i$ based on the rating variables $x_i$ and the exposure $v_i$ for each individual policy i, which leads to the claims frequency, i.e., the number of claims filed per unit year of exposure to risk.
We will discuss below how this can be done with BCART models.

Poisson model
Consider a tree T with b terminal nodes as discussed in Section 2. In a Poisson model, we assume for the i-th observation that $N_i$ follows a Poisson distribution with mean $v_i \lambda_t$ whenever $x_i \in A_t$, where $\{A_t\}$ is the partition of $\mathcal{X}$. Here we use the standard notation $\lambda_t$ for the claims frequency rather than the generic notation $\theta_t$ for the parameter in terminal node t. Essentially, we have specified the distribution $f(y_i \mid \theta_t)$ for terminal node t (see Section 2) as the Poisson density $f_P(N_i \mid v_i \lambda_t)$ for the i-th observation such that $x_i \in A_t$. Note that, for simplicity, here and hereafter the exposure $v_i$ and $x_i$ will be compressed in some notation. Based on the discussion in Section 2.2, we choose a Gamma prior for $\lambda_t$ with hyper-parameters α, β > 0, as in (17), with Γ(·) denoting the Gamma function. As in Section 2, for terminal node t we define the associated data $(X_t, v_t, N_t)$. With the above Gamma prior, the integrated likelihood for terminal node t can be obtained in closed form (18). Clearly, from (18), we see that the posterior distribution of $\lambda_t$, conditional on $N_t$, is again a Gamma distribution, given in (19). The integrated likelihood for the tree T is then given by the product over terminal nodes in (20). Next, we discuss the DIC for this tree, focusing on $\mathrm{DIC}_t$ for terminal node t. First, we compute the deviance $D(\lambda_t)$, and by (19) we obtain the posterior mean for $\lambda_t$ as in (22). Furthermore, we derive the posterior expectation of the deviance using the fact that $E_{\mathrm{post}}(\log \lambda_t) = \psi\big(\alpha + \sum_i N_{ti}\big) - \log\big(\beta + \sum_i v_{ti}\big)$, with $\psi(x) = \Gamma'(x)/\Gamma(x)$ being the digamma function. Using (21)-(23), we obtain the effective number of parameters $p_{D_t}$, and hence $\mathrm{DIC}_t$, for terminal node t. Then the DIC of the tree T is obtained by using (10).
Remark 6 Since $\psi(x) = \log(x) - \frac{1}{2x}(1 + o(1))$ as $x \to \infty$, we immediately see that $p_{D_t} \to 1$ as $n_t \to \infty$. This explains the name "effective number of parameters" in the Bayesian framework, since 1 is the number of parameters in terminal node t for the Poisson model if a flat prior is assumed for $\lambda_t$.
With (19)-(20) and the DIC obtained above, we can use the three-step approach proposed in Section 2.5 to search for an optimal tree, where (19) and (20) should be used in Step 4 and Step 2 of Algorithm 1, respectively. Given an optimal tree, the estimated claims frequency $\lambda_t$ in terminal node t is given by the posterior mean in (22), using (14). It is worth noting that we obtain the same estimate by using (15) instead.
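As an illustration of the closed-form Poisson-Gamma computations described above, the sketch below evaluates, for a single terminal node, the log integrated likelihood, the posterior mean of λ_t and the DIC contribution. The formulas follow from standard Poisson-Gamma conjugacy under the stated prior; they are a hedged reconstruction of the corresponding displays rather than a verbatim transcription.

```python
import math
from scipy.special import gammaln, digamma

def poisson_node_summaries(N, v, alpha, beta):
    """Closed-form quantities for one terminal node of the Poisson BCART.

    N, v : lists of claim counts and exposures for the observations in the node.
    alpha, beta : Gamma prior hyper-parameters for the node frequency lambda_t.
    Returns the log integrated likelihood, the posterior mean of lambda_t,
    the effective number of parameters p_Dt, and DIC_t.
    """
    S_N, S_v = sum(N), sum(v)
    # log integrated likelihood: Poisson likelihood with Gamma(alpha, beta) prior
    log_marg = (sum(n * math.log(vi) - gammaln(n + 1) for n, vi in zip(N, v))
                + alpha * math.log(beta) - gammaln(alpha)
                + gammaln(alpha + S_N) - (alpha + S_N) * math.log(beta + S_v))
    # posterior is Gamma(alpha + S_N, beta + S_v); its mean is the node estimate
    lam_bar = (alpha + S_N) / (beta + S_v)
    # deviance evaluated at the posterior mean
    dev_bar = -2.0 * sum(n * math.log(vi * lam_bar) - vi * lam_bar - gammaln(n + 1)
                         for n, vi in zip(N, v))
    # effective number of parameters; tends to 1 as the node sample size grows
    p_Dt = 2.0 * S_N * (math.log(alpha + S_N) - digamma(alpha + S_N))
    return log_marg, lam_bar, p_Dt, dev_bar + 2.0 * p_Dt
```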

Negative binomial models
The NB distribution, a member of the mixed Poisson family, offers an effective way to handle over-dispersed insurance claims frequency data, where excessive zeros are common.
Consider the tree T with b terminal nodes as before. In the NB model, we assume that $N_{ti} \mid X_{ti}, v_{ti}$ follows an NB distribution for all terminal nodes $t = 1, \ldots, b$. There are different ways to parameterize the NB distribution, particularly with the exposure; see, e.g., [10,29]. We shall discuss two models below.

Negative binomial model 1 (NB1)
We first adopt the most common parameterization of the NB distribution (see, e.g., [24]): for terminal node t, $N_{ti}$ follows the NB distribution in (24), where $\kappa_t, \lambda_t > 0$. It is easy to show that the mean and variance of $N_{ti}$ are given by (25). The degree of over-dispersion relative to the Poisson is controlled by the additional parameter $\kappa_t$ in the NB model, which converges to the Poisson model as $\kappa_t \to \infty$.
In NB regression, the lack of simple and efficient algorithms for posterior computation has seriously limited routine application of Bayesian approaches. Recent studies make Bayesian approaches appealing by introducing data augmentation techniques; see, e.g., [24,46]. In order to save on the total computational time of the algorithm, and to avoid the difficulty of finding an appropriate prior for $\kappa_t$ with a corresponding data augmentation, we shall treat the parameter $\kappa_t$ as known in the Bayesian framework; it can be estimated upfront by using, e.g., the moment matching method. However, in line with the Poisson model, we shall treat $\lambda_t$ as uncertain and use a Gamma prior with a corresponding data augmentation. Based on the formulas given in (25), we can estimate the parameter $\kappa_t$ using the moment matching method (see, e.g., Chapter 2 of [6]) as in (26), where the empirical moments are given in (27). Next, introducing a latent variable $\xi_t = (\xi_{t1}, \xi_{t2}, \ldots, \xi_{tn_t}) \in (0, \infty)^{n_t}$, we can define a data-augmented likelihood for the i-th data instance in terminal node t as in (28). It is easily checked that integrating over $\xi_{ti} \in (0, \infty)$ in (28) yields the marginal distribution (24).
Further, we see that $\xi_{ti}$, given the data $N_{ti}$ and the parameters, is Gamma distributed. Given the data-augmented likelihood in (28), the estimated parameter $\kappa_t$ from (26), and a conjugate Gamma prior for $\lambda_t$ with hyper-parameters α, β > 0 (cf. (17)), we can derive the integrated augmented likelihood for terminal node t in closed form. Moreover, from the above we see that the posterior distribution of $\lambda_t$, given the augmented data $(N_t, \xi_t)$, is again a Gamma distribution. The integrated augmented likelihood for the tree T is then given by the product over terminal nodes. Now, we discuss the DIC for this tree. Since we only consider uncertainty for λ but not for κ, the DIC defined in (13) cannot be adopted directly. Thus, using the idea that DIC = "goodness of fit" + "complexity", we introduce a new $\mathrm{DIC}_t$ for terminal node t, in which the goodness of fit is the deviance and the effective number of parameters $r_{D_t}$ consists of 1 (accounting for $\kappa_t$) plus a second part accounting for $\lambda_t$. A direct calculation then gives the effective number of parameters for terminal node t, and thus $\mathrm{DIC}_t$.
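To illustrate the kind of data augmentation used here, the sketch below gives one Gibbs sweep for a single terminal node under a standard Poisson-Gamma mixture representation of the NB model, namely N_ti | ξ_ti ~ Poisson(v_ti λ_t ξ_ti) with ξ_ti ~ Gamma(κ_t, κ_t). This representation is an assumption made for illustration; the exact augmented likelihood in (28) may be parameterized differently.

```python
import numpy as np

rng = np.random.default_rng(0)

def nb1_node_gibbs_sweep(N, v, lam, kappa, alpha, beta):
    """One Gibbs sweep for a terminal node under the assumed NB mixture:
       N_i | xi_i ~ Poisson(v_i * lam * xi_i),  xi_i ~ Gamma(kappa, kappa),
    so that marginally N_i is negative binomial with mean v_i * lam and
    dispersion kappa (treated as known, e.g. estimated by moment matching).
    """
    N, v = np.asarray(N, float), np.asarray(v, float)
    # update latent variables: xi_i | N_i, lam ~ Gamma(kappa + N_i, kappa + v_i * lam)
    xi = rng.gamma(shape=kappa + N, scale=1.0 / (kappa + v * lam))
    # update node frequency: lam | N, xi ~ Gamma(alpha + sum N_i, beta + sum v_i * xi_i)
    lam_new = rng.gamma(shape=alpha + N.sum(), scale=1.0 / (beta + (v * xi).sum()))
    return lam_new, xi
```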

Negative binomial model 2 (NB2)
We now consider another parameterization of the NB distribution; see, e.g., [10,29]. For terminal node t, it is easy to show that the mean of $N_{ti}$ is the same as in (25), but the variance differs: this formulation yields a fixed over-dispersion of size $\lambda_t/\kappa_t$ which does not depend on the exposure $v_{ti}$, and it is thus sometimes preferred (see [10]) and has been judged more effective for real insurance data analysis (see [29]).
We deal with $\kappa_t$ and $\lambda_t$ in the same way as in the previous subsection. Using the same approach as in Chapter 2 of [6], we can estimate the parameter $\kappa_t$ as in (36), where $V_t^2$ and $\lambda_t$ are given in (27). Note that this parameterization offers a simpler estimation for $\kappa_t$, and that $\lambda_t$ is a minimal-variance estimator; see [6].
Similarly to before, we can define a data-augmented likelihood for the i-th data instance in terminal node t as in (37). Further, we see that $\xi_{ti}$, given the data $N_{ti}$ and the parameters, is Gamma distributed. Given the data-augmented likelihood in (37), the estimated parameter $\kappa_t$ from (36), and a conjugate Gamma prior for $\lambda_t$ with hyper-parameters α, β > 0, we can derive the integrated augmented likelihood for terminal node t in closed form. From this we see that the posterior distribution of $\lambda_t$, given the augmented data $(N_t, \xi_t)$, is again a Gamma distribution, and the integrated augmented likelihood for the tree T is given by the product over terminal nodes. Now, we discuss $\mathrm{DIC}_t$ for terminal node t of this tree, which can be checked easily, similarly to the previous subsection. For the above two NB models, the DIC of the tree T is obtained by using (10).
With the formulas derived in the above two subsections for the NB models, we can use the three-step approach proposed in Section 2.5, together with Algorithm 3, to search for an optimal tree and then obtain predictions for new data.
(b). It is worth noting that our way of dealing with the parameter κ is different from that in [24], where a single κ is sampled from a distribution and used for all terminal nodes. It turns out that that way of dealing with κ does not give good estimates in our simulation examples, whereas our way of first estimating κ for each node using the moment matching method does.
(c). There are other ways to parameterize the NB distribution; see, e.g., [46]. However, these are normally discussed when no exposure is involved, so we do not cover them here.

Zero-Inflated Poisson models
Insurance claims data normally involve a large proportion of zeros. Many policy-holders incur no claims, which does not necessarily mean that they were involved in no accidents, but rather that they are probably less risky. In this section, depending on how the exposure is embedded in the model, we discuss two ZIP models to better reflect the excessive zeros; see, e.g., [30].

Zero-Inflated Poisson model 1 (ZIP1)
For terminal node t, we use the ZIP distribution obtained by embedding the exposure into the Poisson part (see [24]), i.e., a mixture of a point mass at zero and a Poisson component, where $f_P(m \mid \lambda_t)$ is given as in (16) and $\frac{1}{1+\mu_t} \in (0, 1)$ is the probability that a zero is due to the point mass component. Note that for computational simplicity we consider a model with two parameters rather than three as in [24].
Similarly to the NB model, a data augmentation scheme is needed for the ZIP model. To this end, we introduce two latent variables $\phi_t = (\phi_{t1}, \ldots, \phi_{tn_t}) \in (0, \infty)^{n_t}$ and $\delta_t = (\delta_{t1}, \ldots, \delta_{tn_t}) \in \{0, 1\}^{n_t}$, and define the data-augmented likelihood for the i-th data instance in terminal node t as in (42), where the support of the function $f_{\mathrm{ZIP1}}$ is $(\{0\} \times \{0, 1\} \times (0, \infty)) \cup (\mathbb{N} \times \{1\} \times (0, \infty))$. This means that we impose $\delta_{ti} = 1$ when $N_{ti} \in \mathbb{N}$ (i.e., $N_{ti} \neq 0$). It can be shown that (41) is the marginal distribution of the above augmented distribution; see [24] for more details. By conditional arguments, we can also check that $\delta_{ti}$, given $\phi_{ti}$, $N_{ti} = 0$ and the parameters, is Bernoulli distributed, and that $\delta_{ti} = 1$ given $N_{ti} > 0$. Furthermore, $\phi_{ti}$, given $\delta_{ti}$, $N_{ti}$ and the parameters, is exponentially distributed. We assume independent conjugate Gamma priors for $\mu_t$ and $\lambda_t$ with hyper-parameters $\alpha_i, \beta_i > 0$, $i = 1, 2$ (cf. (17)). With the data-augmented likelihood in (42) and the above Gamma priors, we can derive the integrated augmented likelihood for terminal node t in closed form. Moreover, from the above we see that the posterior distributions of $\mu_t, \lambda_t$, given the augmented data $(N_t, \delta_t, \phi_t)$, are available in closed form, and the integrated augmented likelihood for the tree T is given by the product over terminal nodes. Now, we discuss the DIC for this tree, which can be derived as a special case of (11) with $\theta_t = (\mu_t, \lambda_t)$.
To this end, we first focus on $\mathrm{DIC}_t$ for terminal node t. Writing out the deviance and the effective number of parameters as in (47) and (48), $\mathrm{DIC}_t$ can be obtained as $\mathrm{DIC}_t = D(\bar{\mu}_t, \bar{\lambda}_t) + 2 q_{D_t}$.
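For illustration, the sketch below gives one Gibbs sweep for a single terminal node of a ZIP model of this type, assuming the mixing weight of the point mass is 1/(1+μ_t), δ_ti = 1 indicates the Poisson component, and the latent φ_ti arises from the identity 1/(1+μ_t) = ∫_0^∞ e^{-(1+μ_t)φ} dφ. These assumptions follow the description above, but the exact conditional distributions used in the paper (whose displays are not reproduced here) may differ in detail.

```python
import numpy as np

rng = np.random.default_rng(1)

def zip1_node_gibbs_sweep(N, v, mu, lam, a1, b1, a2, b2):
    """One Gibbs sweep for a terminal node of the assumed ZIP model.

    Mixture: with probability 1/(1+mu) the count is a structural zero,
    otherwise N_i ~ Poisson(v_i * lam).  delta_i = 1 flags the Poisson
    component (forced when N_i > 0); phi_i is the exponential latent
    variable linearising the factor 1/(1+mu).  (a1,b1), (a2,b2) are the
    Gamma hyper-parameters of the priors for mu and lam.
    """
    N, v = np.asarray(N, float), np.asarray(v, float)
    # delta | N, mu, lam: Bernoulli for zero counts, equal to 1 otherwise
    p1 = mu * np.exp(-v * lam) / (1.0 + mu * np.exp(-v * lam))
    delta = np.where(N > 0, 1.0, rng.random(N.size) < p1)
    # phi | mu: exponential with rate (1 + mu)
    phi = rng.exponential(scale=1.0 / (1.0 + mu), size=N.size)
    # conjugate Gamma updates for mu and lam
    mu_new = rng.gamma(shape=a1 + delta.sum(), scale=1.0 / (b1 + phi.sum()))
    lam_new = rng.gamma(shape=a2 + (delta * N).sum(),
                        scale=1.0 / (b2 + (delta * v).sum()))
    return mu_new, lam_new, delta, phi
```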

Zero-Inflated Poisson model 2 (ZIP2)
For terminal node t, we use the ZIP distribution obtained by embedding the exposure into the zero mass part (see [30]), where $\frac{1}{1+\mu_t v_{ti}} \in (0, 1)$ is the probability that a zero is due to the point mass component. This formulation stems from an intuitive inverse relationship between the exposure and the probability of a zero mass. This way of embedding exposure has been shown to be more effective in [30].
Similarly to before, we introduce two latent variables $\phi_t = (\phi_{t1}, \ldots, \phi_{tn_t}) \in (0, \infty)^{n_t}$ and $\delta_t = (\delta_{t1}, \ldots, \delta_{tn_t}) \in \{0, 1\}^{n_t}$, and define the data-augmented likelihood for the i-th data instance in terminal node t as in (50), with a support defined analogously to the ZIP1 case. By conditional arguments, we can again check that $\delta_{ti}$, given $\phi_{ti}$, $N_{ti} = 0$ and the parameters, is Bernoulli distributed, with $\delta_{ti} = 1$ given $N_{ti} > 0$; furthermore, $\phi_{ti}$, given $\delta_{ti}$, $N_{ti}$ and the parameters, is exponentially distributed. As previously, we assume independent conjugate Gamma priors for $\mu_t$ and $\lambda_t$ with hyper-parameters $\alpha_i, \beta_i > 0$, $i = 1, 2$. Given the data-augmented likelihood in (50) and the above Gamma priors, we can obtain the integrated augmented likelihood for terminal node t in closed form. Moreover, from the above we see that the posterior distributions of $\mu_t, \lambda_t$, given the augmented data $(N_t, \delta_t, \phi_t)$, are available in closed form, and the integrated augmented likelihood for the tree T is given by the product over terminal nodes. Now, we discuss $\mathrm{DIC}_t$ for terminal node t: we can derive the same expression for $q_{D_t}$ as in (48), and therefore obtain $\mathrm{DIC}_t$ from (55) and (48). For the above two ZIP models, the DIC of the tree T is obtained by using (10).
With the formulas derived in the above two subsections for the ZIP models, we can use the three-step approach proposed in Section 2.5, together with Algorithm 2, to search for an optimal tree and then obtain predictions for new data.
Remark 8 There are other ways to handle data augmentation for ZIP models; see, e.g., [41,47,48], where only one latent variable is introduced. The models discussed therein with one latent variable should work more efficiently, but their constructions do not consider exposure. Since the involvement of exposure is one of the key features of insurance claims frequency analysis, we had to introduce two latent variables for data augmentation to facilitate the calculations.

Simulation and real data analysis
In this section, we illustrate the efficiency of the BCART models introduced in Section 3 by applying them to some simulated data and a real insurance claims dataset. In the sequel, we use the abbreviation P-CART to denote CART for the Poisson model; the other abbreviations can be understood similarly (e.g., NB1-BCART denotes BCART for NB model 1).
We first introduce some performance measures that will be used for prediction comparisons.
Suppose we have obtained a tree with b terminal nodes and the corresponding parameter estimates for θ, which we use to obtain the prediction $\hat{N}_i$ given $(x_i, v_i)$ for a test data set (X, v, N) with m observations. The number of test observations in terminal node t is denoted by $m_t$, $t = 1, \ldots, b$. The performance measures used are as follows: M1: The residual sum of squares (RSS), given by the sum of squared differences between observed and predicted values. This measure is commonly used for Gaussian distributed data, but here we also use it for non-Gaussian data for comparison.
M2: RSS at a sub-portfolio level (i.e., over instances in the same terminal node), where $y_t$ is the estimated frequency for terminal node t, estimated by (14) assuming unit exposure. More specifically, $y_t = \lambda_t$ for the Poisson and NB models, and $y_t = \frac{\mu_t \lambda_t}{1+\mu_t}$ for the ZIP models. This measure is preferred here as it takes into account accuracy at a (sub-)portfolio level (i.e., the balance property) rather than only at an individual level. We refer to [12,13,49] for more details and discussions of the balance property required for insurance pricing.
M3: Negative log-likelihood (NLL). This is calculated using the assumed response distribution in the terminal node with the estimated parameters. It represents the ex-ante belief about the underlying distribution of the data, and is thus a good measure for model comparison; see, e.g., [30].
M4: Discrepancy statistic (DS) (cf. [50]), defined as a weighted version of RSS(N/v), where $y_t$ is the same as in M2 and $\sigma_t^2$ is the estimated variance of the frequency for terminal node t. More specifically, $\sigma_t^2 = \lambda_t$ for the Poisson model, $\sigma_t^2 = \lambda_t(1 + \lambda_t/\kappa_t)$ for the NB models, and an analogous expression holds for the ZIP models.
M5: Lift. Model lift measures the ability to differentiate between low and high claims frequency policy-holders; it is sometimes called the "economic value" of the model. A higher lift indicates that the model is more capable of separating the extreme values from the average. We refer to [4,29,30] and the references therein for further discussion of lift. We propose to calculate lift for a tree model in the following steps (a sketch implementing them is given after the list).
Step 1: Retrieve the predicted frequencies for the terminal nodes, $y_t$, $t = 1, \ldots, b$, of the optimal tree obtained from the training procedure.
Step 2: Set $y_{\min} = \min_{t=1,\ldots,b} y_t$ and $y_{\max} = \max_{t=1,\ldots,b} y_t$, which identify the least and most risky groups of policy-holders, respectively.
Step 3: Use the test data in the least and most risky groups/nodes to obtain their total sums of exposures, say $v_{\min}$ and $v_{\max}$.
Step 4: If $v_{\min} \le v_{\max}$, sort the data in the most risky group by exposure in descending order. Accumulate the sorted exposures until the cumulative sum equals or exceeds $v_{\min}$, and then calculate the corresponding empirical frequency (i.e., the ratio of the sum of claim numbers to the sum of exposures) of the data involved so far. Similarly, if $v_{\min} > v_{\max}$, sort the data in the least risky group by exposure in ascending order, accumulate the sorted exposures until the cumulative sum equals or exceeds $v_{\max}$, and then calculate the corresponding empirical frequency of the data involved so far. The lift is then obtained by comparing these exposure-matched empirical frequencies of the most and least risky groups. We remark that more performance measures and diagnostic approaches could be introduced following ideas in, e.g., [4,29,30]. However, this is not the main focus of the present paper, so these are explored elsewhere.
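The following sketch implements Steps 1-4 above. The final ratio of the exposure-matched empirical frequencies is an assumption about how the comparison in Step 4 is summarised, since the closing formula is not reproduced here.

```python
import numpy as np

def tree_lift(freq_by_node, node_id, N, v):
    """Exposure-matched lift for a fitted tree (a sketch of Steps 1-4).

    freq_by_node : dict mapping terminal node id -> predicted frequency y_t
    node_id, N, v: arrays giving, for each test observation, its terminal
                   node, claim count and exposure.
    """
    node_id, N, v = map(np.asarray, (node_id, N, v))
    t_min = min(freq_by_node, key=freq_by_node.get)   # least risky node
    t_max = max(freq_by_node, key=freq_by_node.get)   # most risky node
    v_min, v_max = v[node_id == t_min].sum(), v[node_id == t_max].sum()

    def matched_freq(mask, target, descending):
        # empirical frequency over the first observations (sorted by exposure)
        # whose cumulative exposure reaches the target
        order = np.argsort(v[mask])[::-1] if descending else np.argsort(v[mask])
        Ns, vs = N[mask][order], v[mask][order]
        k = min(np.searchsorted(np.cumsum(vs), target) + 1, len(vs))
        return Ns[:k].sum() / vs[:k].sum()

    if v_min <= v_max:
        f_max = matched_freq(node_id == t_max, v_min, descending=True)
        f_min = N[node_id == t_min].sum() / v_min
    else:
        f_min = matched_freq(node_id == t_min, v_max, descending=False)
        f_max = N[node_id == t_max].sum() / v_max
    return f_max / f_min        # assumed summary: ratio of matched frequencies
```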

Simulation examples
We will discuss three simulation examples, namely Scenarios 1-3 below. Scenario 1 aims to illustrate that BCART can perform very well on chessboard-type data similar to Figure 1, for which CART cannot do anything reasonable. In addition, this simulation study also shows that BCART performs well at variable selection. In Scenario 2, we examine how the different BCART models capture over-dispersion in the data. In Scenario 3, we illustrate the effectiveness of the ZIP-BCART models for data with exposures.

Scenario 1: Poisson data with noise variables
We simulate a data set $\{(x_i, v_i, N_i)\}_{i=1}^{n}$ with n = 5,000 independent observations. Here $v_i \sim U(0, 1)$. Obviously, the variables $x_{ik}$, $k = 3, \ldots, 8$, are independent of the response N. We include both categorical and continuous variables as noise variables and as significant variables, which is a bit more general than the data shown in Figure 1. We apply P-BCART and P-CART to the above simulated data, where $x_{ik}$, $k = 1, 7, 8$, are treated as categorical. Note that the same conclusion can be drawn when $x_{ik}$, $k = 1, 7, 8$, are numeric, but to better illustrate the effectiveness of P-BCART we choose to treat them as characters (to increase the splitting possibilities of these variables).
We first apply the P-CART from the R package rpart [51] to the data. It is not surprising that P-CART is not able to give any reasonable tree that characterizes the data, due to its greedy search nature. The smallest tree (other than the one with only a root node) that P-CART generated has 25 terminal nodes, and the tree found by cross-validation has 31 terminal nodes. Obviously, both of them are much more complicated than the true model. Furthermore, in these two trees all the noise variables are used, which indicates that P-CART is sensitive to noise.
Now we discuss the P-BCART applied to the data. We use P(Grow) = 0.2, P(Prune) = 0.2, P(Change1) = 0.2, P(Change2) = 0.2 and P(Swap) = 0.2 for the tree proposals, where Change1 means that in the selected internal node we only change the splitting value/category, and Change2 means that in the selected internal node we change both the splitting variable and the corresponding splitting value/category. As discussed in [20], we considered different pairs of (α, β) while keeping their ratio fixed, and the results are not sensitive to this choice. We observe the same in the other simulation examples, so in the following we do not dwell on their selection.
In Table 2 we list, in the first two columns, the tuned hyper-parameters γ, ρ for which the MCMC algorithms converge to a region of trees with the number of terminal nodes listed (see Step 1 of Table 1). For each fixed pair of hyper-parameters γ and ρ, we run 10,000 iterations of the MCMC algorithm and take results after an initial burn-in period of 2,000 iterations, by which point the posterior probabilities of the tree structures have settled for some time. This procedure is done with 3 restarts. The fourth column gives the total number of accepted trees after the burn-in period in the MCMC algorithms. The last columns of Table 2 give the total number of times each variable is used in the accepted trees. An interesting finding from these columns is that the noise variables $x_3$ and $x_4$ have a much lower selection rate than the other noise variables, which might simply be because $x_3$ and $x_4$ are simulated from a distribution completely different from those of the significant variables.
Besides, we see that the other noise variables also have a very low selection rate and that, as expected, the variables $x_1$, $x_2$ are dominating.
In Figure 2, we illustrate this procedure with plots of the number of terminal nodes, the integrated likelihood $p_P(N \mid X, v, T)$ and the data likelihood $p_P(N \mid X, v, \lambda, T)$ of the accepted trees for j = 4.
The observations are in line with those in [19]; we see from the likelihood plots that MCMC convergence is obtained relatively quickly. Interestingly, the optimal tree is not found in the first round of MCMC, which got stuck in a local mode, but the restarts help: in the second and third rounds optimal trees can be found. Moreover, we see that there is no big difference between the plots of the integrated likelihood and the data likelihood.
Following Step 2 of Table 1, for each j = 2, . . ., 8, we select the optimal tree with maximum data likelihood $p_P(N \mid X, v, \lambda, T)$ from the convergence region. The variables used in these optimal trees are listed in Table 3, where we can see that none of these trees involves the noise variables. The values of the effective number of parameters $p_D$ reflect the number of parameters in the tree if a flat prior for $\lambda_t$ is used. Furthermore, we list the DIC of these trees in the last column of Table 3.
Following Step 3 of Table 1, we conclude that the selected optimal tree is the one with 4 terminal nodes, which is illustrated in Figure 3. We see that this tree is close to the true optimal one, with an almost correct topology and accurate parameter estimates.
We also ran several other similar but more complex simulation examples to check the performance of the P-BCART, NB-BCART and ZIP-BCART models. Our conclusions from these simulations are: (1) BCART models can retrieve the tree structure (including both topology and parameters) used to simulate the data, and (2) BCART models are able to avoid choosing noise variables.

Scenario 2: ZIP data with varying probability of zero mass component
We simulate a data set $\{(x_i, v_i, N_i)\}_{i=1}^{n}$ with n = 5,000 independent observations. Here $x_i = (x_{i1}, x_{i2})$, with $x_{ik} \sim N(0, 1)$ for k = 1, 2. We assume $v_i \equiv 1$ for simplicity, since exposure is not a key feature in this scenario. Moreover, $N_i \sim \mathrm{ZIP}(p_0, \lambda(x_{i1}, x_{i2}))$, where $p_0 \in (0, 1)$ is the probability of a zero due to the point mass component, whose value is to be specified. The data is split into two subsets: a training set with n − m = 4,000 observations and a test set with m = 1,000 observations.
For this scenario, we aim to examine how P-BCART, NB-BCART and ZIP-BCART perform as $p_0$ is varied. Note that since $v_i \equiv 1$, NB1 and NB2 (respectively ZIP1 and ZIP2) are essentially the same. Intuition suggests that when $p_0$ is small, NB-BCART should be good enough to capture the over-dispersion introduced by a small proportion of zeros, but when $p_0$ becomes large, ZIP-BCART should perform better for the highly over-dispersed data. This intuition is confirmed by this study. For simplicity, we present two sets of results, one with $p_0 = 0.05$ and the other with $p_0 = 0.95$.
We first discuss the simulation with a small probability of zero mass (i.e., $p_0 = 0.05$). In Table 4 we present the hyper-parameters γ, ρ used to obtain MCMC convergence to the region of trees with a certain number of terminal nodes (indicated after the abbreviation of the model, e.g., the 2 in ZIP-BCART(2)). The last two columns give the effective number of parameters and the DIC of the optimal trees for each model, respectively. We can conclude from the DIC that, using Step 3 in Table 1, we select the optimal tree with the true 4 terminal nodes for each of ZIP-BCART, P-BCART and NB-BCART, and among those, NB-BCART (with DIC = 11191.7) is the best one.
This looks a bit surprising at first glance, because our data are simulated from a ZIP model. We suspect the reason for this is two-fold. First, the NB model is enough to capture the small over-dispersion. Second, we have used data augmentation in the algorithms, and thus it is understandable that NB-BCART, with one latent variable, can achieve better performance than the "true" ZIP-BCART with two latent variables. Moreover, we see that even P-BCART performs better than ZIP-BCART, for similar reasons. Now, let us look at the performance of these models on the test data in Table 5. First, we see that for each type of model (ZIP, Poisson and NB) the optimal tree with 4 terminal nodes achieves the best RSS(N/v) (0.001618, 0.001082 and 0.000701, respectively) and DS(N/v) (0.000116, 0.000072 and 0.000056, respectively) on the test data, which is not surprising since those models retrieve almost the true tree structures. Second, we see from RSS(N) that for each type of model the performance improves as the desired number of terminal nodes increases; however, the improvement becomes smaller once the optimal trees with 4 terminal nodes have been obtained. We observe the same for the negative log-likelihood and the lift. It is worth noting that when calculating and comparing lift for different trees, instead of simply following the four steps in M5, in Step 4 we first choose the minimum total sum of exposures among the least and most risky groups across all the trees to be compared, and then calculate the other quantities accordingly using this minimum total sum of exposures as the basis. Third, we see that among the three trees with 4 terminal nodes, the one obtained from NB-BCART gives the best performance on test data according to all the performance measures, which is consistent with the conclusion from the training data.
Next, we consider the simulation with a large probability of zero mass (i.e., $p_0 = 0.95$). The results are displayed in Table 6 and Table 7. A similar discussion applies to this case. In particular, we find that the performance ordering based on DIC is ZIP-BCART > NB-BCART > P-BCART, which is also consistent with their performance on the test data.
We also ran several other similar simulation examples to check the performance of P-BCART, NB-BCART and ZIP-BCART for different values of $p_0$. Our conclusion from these simulations is that when the proportion of zeros in the data is small (reflected by a small $p_0$), the NB-BCART or P-BCART is preferred over the ZIP-BCART, whereas when the proportion of zeros in the data is large, the ZIP-BCART is preferred over the NB-BCART and P-BCART. This is also confirmed by the real insurance data discussed below.

Scenario 3: Different ways to incorporate exposure in ZIP models
The purpose of Scenario 3 is to compare the two different ways of dealing with exposure, namely ZIP1-BCART and ZIP2-BCART. To this end, we simulate a data set $\{(x_i, v_i, N_i)\}_{i=1}^{n}$ with n = 5,000 independent observations. Here $v_i \sim U(0, 1)$, with $\mu(x_{i1}, x_{i2}) \equiv 0.5$, and some $\tau \ge 0$ to be specified below. The data is split into two subsets, namely a training set with n − m = 4,000 observations and a test set with m = 1,000 observations.
In the above simulation setup, we include the exposure in both the Poisson component and the zero mass component. In this way, it is not clear which of ZIP1-BCART and ZIP2-BCART will outperform the other. That being said, we can vary the value of τ to control the effect of the exposure on the zero mass component. We consider two extreme cases, one with a very small τ and the other with a very large τ. More precisely, for a large τ we choose τ = 100. In this case, since many $v_i^\tau$ will be small, $p_i^{(\tau)}$ will be close to one, which implies that the Poisson component plays a minor role in the exposure modelling, and thus we would expect ZIP2-BCART to have a better ability to capture this. On the other hand, for the small value τ = 0.0001, since many $v_i^\tau$ will be close to 1, $p_i^{(\tau)}$ will be almost independent of $v_i$, which implies that the zero mass component plays a minor role in the exposure modelling, and thus we would expect ZIP1-BCART to have a better ability to capture this. We report the DIC for these two cases in Table 8. The model performances on test data are listed in Table 9 for τ = 100 and in Table 10 for τ = 0.0001. From these tables, we can confirm the above intuition that ZIP1-BCART performs better for small τ and worse for large τ (compared to ZIP2-BCART). We conclude from this simulation study that ZIP2-BCART works better in capturing a potentially stronger effect of the exposure on the zero mass component, which is also illustrated in the real insurance data discussed below.

Real data analysis
We illustrate our methodology with a real insurance dataset, named dataCar, available from the insuranceData package in R; see [52] for details. This dataset is based on one-year vehicle insurance policies. We first apply P-CART to the training data, where we use cross-validation to choose a tree which best fits the data. The resulting tree, which has 5 terminal nodes, is shown in Figure 4. Then, we apply P-BCART, NB1-BCART, NB2-BCART, ZIP1-BCART and ZIP2-BCART to the same data.
Based on the knowledge learnt from P-CART, we can tune the hyper-parameters γ, ρ so that the algorithm converges to a region of trees with around 5 terminal nodes. Some of these, together with the effective number of parameters and the DIC, are shown in Table 12. We see from this table that all the effective numbers of parameters are reasonable for the models used to fit the data.
We conclude from the DIC that all of these BCART models select an optimal tree with 5 terminal nodes using the three-step approach, and among these the one from ZIP2-BCART, with the smallest DIC(=25632.5), should be chosen as the global optimal tree to characterize the data.
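For completeness, the sketch below shows a generic Spiegelhalter-style DIC computation for a single Poisson terminal node, given posterior draws of its intensity; summing the node contributions over the b terminal nodes gives the tree-level DIC and effective number of parameters. This is a standard illustration under the stated assumptions, not the exact implementation behind Table 12.

```python
import numpy as np
from scipy.stats import poisson

def node_dic(lambda_samples, N, v):
    """DIC contribution of one terminal node.

    lambda_samples : posterior draws of the node intensity lambda_t
    N, v           : claim counts and exposures of observations in the node
    Deviance is taken as -2 * log-likelihood of N_i ~ Poisson(lambda * v_i).
    """
    dev = np.array([-2.0 * poisson.logpmf(N, lam * v).sum() for lam in lambda_samples])
    dev_at_mean = -2.0 * poisson.logpmf(N, lambda_samples.mean() * v).sum()
    p_d = dev.mean() - dev_at_mean            # effective number of parameters p_{D_t}
    return dev.mean() + p_d, p_d              # (DIC_t, p_{D_t})

# Tree-level DIC: sum the node contributions over terminal nodes t = 1, ..., b, e.g.
# dic_tree = sum(node_dic(samples[t], N_node[t], v_node[t])[0] for t in range(b))
```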
It is interesting to check whether there are similarities in the trees obtained from the different models, including the P-CART, particularly as they all have 5 terminal nodes. For the tree from P-CART illustrated in Figure 4, the variable "agecat" is used first and then "veh_value", followed by "agecat" again. The tree from P-BCART (not shown here) also uses "agecat" first, but in the following steps it uses "veh_value" and "veh_body". The trees from NB1-BCART and NB2-BCART look very similar, and both of them use "gender" first and then "agecat", "veh_value" and "veh_body".
Further, the trees from ZIP1-BCART and ZIP2-BCART have the same tree structure and select the same splitting variables as the tree from P-BCART, while the splitting values/categories are slightly different. The optimal tree from ZIP2-BCART is displayed in Figure 5, where the estimated frequency (i.e., the first figure in each node) is calculated through (14) for the ZIP2 model with unit exposure. Comparing the two trees in Figure 4 and Figure 5, we see that the ZIP2-BCART model can identify a riskier group (i.e., the one with estimated frequency equal to 0.2674). Moreover, for comparison we also use a GLM to fit the data; we find that only the variables "agecat" and "veh_body" are significant, and we also include the interaction between these two variables. In conclusion, although the variables used by the different models can differ slightly, there seems to be a consensus that "agecat", "veh_value" and "veh_body" are relatively significant variables, while "gender", "veh_age" and "area" are less significant.

Now, we apply the trees obtained from training to the test data. The performances are given in Table 13. We also include the commonly used GLM, whose performance is not as good as that of the tree models. From the table, we can conclude that for each of the BCART models the tree with 5 terminal nodes selected by DIC performs better, in terms of RSS(N/v) and DS(N/v), than the trees with either a smaller or a larger number of terminal nodes. This confirms that the proposed three-step approach for tree model selection based on DIC works well on real data. Moreover, all the performance measures give the same ranking of models: ZIP2-BCART > ZIP1-BCART > NB2-BCART > NB1-BCART > P-BCART > P-CART > GLM.

• In the MCMC algorithms we have only used four common proposals, namely, Grow, Prune, Change and Swap, which have made the algorithm converge quickly to a locally optimal region. To better explore the tree space, other proposals such as those in [38, 39] could be adopted to improve the mixing of simulated trees. However, we suspect this would significantly increase the computational time, particularly for high-dimensional, large data sets and for models requiring data augmentation. To mitigate this effect, we might need to use a non-uniform choice of splitting variables in the tree prior so as to achieve better variable selection (see [35]).
• We have proposed to use a single (optimal) tree induced from the BCART models for claims frequency prediction. The main reason for this choice is, as discussed in the introduction, ease of interpretation. Since stakeholders and regulators may not be statisticians able to understand very complex statistical models, a single tree offers them intuitive and visual results. Although we have proposed an approach to find one single optimal tree, some sub-optimal trees (in the convergence region of the MCMC), which may possess similar or different tree structures, can be almost as informative as the single optimal tree and should not simply be ignored. Further research can be done in this direction to make better use of the posterior trees by clustering or merging them; see, e.g., [54, 55].
• To further improve the accuracy of these Bayesian tree-based models we could explore BART for claims frequency modelling. BART models are tree ensembles; each tree in BART accounts for only a small part of the overall fit, potentially improving performance, but model interpretability needs to be explored before BART can be used for insurance pricing. To this end, we believe some insights from the contribution [4] would be helpful.
In this paper, we have focused on insurance claims frequency. A natural next step is to construct a full insurance pricing BCART model, including both claims frequency and severity.
The effectiveness of the BCART approach has been justified by simulation examples (with Gaussian distributed data) in the aforementioned papers. Here, we show another simulation example with Poisson distributed data to illustrate the effectiveness of BCART. Specifically, we simulate 5,000 Poisson-distributed observations where the Poisson intensity depends on two explanatory variables (or covariates) x_1 and x_2 as illustrated in Figure 1. We refer to Subsection 4.1.1 for a slightly more general simulation example. It is clear from the figure that the optimal partition of the covariate space consists of four regions, where the data in each region should follow a homogeneous Poisson distribution. Clearly, the Poisson CART should be able to identify this partition.

where $D(\theta) = \sum_{t=1}^{b} D(\theta_t)$ and $p_D = \sum_{t=1}^{b} p_{D_t}$ are the deviance and effective number of parameters of the tree.
The lift is defined as $L = \lambda^{(e)}_{\max} / \lambda^{(e)}_{\min}$, where $\lambda^{(e)}_{\max}$ and $\lambda^{(e)}_{\min}$ are the empirical frequencies of the most and the least risky groups, respectively. The introduction of these two types of change is helpful since it improves the mixing of posterior trees. For the Gamma prior of the Poisson intensities $\lambda_t$ we use $\alpha = 3.2096$ and $\beta = 0.8$, which are selected by keeping the relationship $\alpha/\beta = \sum_{i=1}^{n} N_i / \sum_{i=1}^{n} v_i$. It is worth mentioning that the performance of the algorithm does not change much when choosing other values of $\alpha$ and $\beta$.

Figure 3: Optimal P-BCART. Numbers at each node give the estimated value of the frequency parameter λ_t and the percentage of observations.

Figure 4: Tree from P-CART. Numbers at each node give the estimated frequency and the percentage of observations.

Figure 5: Optimal tree from ZIP2-BCART. Numbers at each node give the estimated frequency and the percentage of observations.

Table 2: Total number of times each variable is used in all accepted trees from the P-BCART MCMC algorithms (after burn-in period).

Table 3: Number of times each variable (x_1, ..., x_8) is used in each chosen optimal tree, by number of terminal nodes, together with the corresponding p_D and DIC (after burn-in period). Bold font indicates the DIC-selected model.

Table 4: Hyper-parameters, p_D (or q_D, r_D) and DIC on training data (p_0 = 0.05). Bold font indicates the DIC-selected model.

Table 5: Model performance on test data (p_0 = 0.05), with bold entries determined by DIC (see Table 4).

Table 6: Hyper-parameters, p_D (or q_D, r_D) and DIC on training data (p_0 = 0.95). Bold font indicates the DIC-selected model.

Table 7: Model performance on test data (p_0 = 0.95), with bold entries determined by DIC (see Table 6).

Table 8: DIC for ZIP-BCART with different values of τ on training data. Bold font indicates the DIC-selected model.

Table 12: Hyper-parameters, p_D (or q_D, r_D) and DIC on training data (dataCar). Bold font indicates the DIC-selected model.

Table 13: Model performance on test data (dataCar), with bold entries determined by DIC (see Table 12).