Robust claim frequency modeling through phase-type mixture-of-experts regression

This paper addresses the problem of modeling loss frequency using regression when the counts have a non-standard distribution. We propose a novel approach based on mixture-of-experts specifications on discrete phase-type distributions. Compared to their continuous phase-type counterparts, our approach offers fast estimation via expectation-maximization, making it more feasible for use in real-life scenarios. Our model is both robust and interpretable in terms of risk classes, and can be naturally extended to the multivariate case through two different constructions. This avoids the need for ad-hoc multivariate claim count modeling. Overall, our approach provides a more effective solution for modeling loss frequency in non-standard situations. © 2023 The Author(s). Published by Elsevier B.V. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).


Introduction
Modeling loss frequency using regression is a common task in insurance and actuarial science. Accurately modeling the frequency and severity of losses is essential for effective risk management and pricing of insurance policies. However, this can be a difficult task when the counts have a non-standard distribution, with zero-inflated-like models (such as in Yip and Yau (2005); Lee (2021); Chen et al. (2019); Zhang et al. (2022)) and machine-learning methods (see Gabrielli (2020); Gao et al. (2022) and Wüthrich and Merz (2022) for a broader overview) being predominant in the recent literature. This is because many commonly used models, such as Poisson or Negative Binomial regression, are not particularly well-suited for non-standard distributions, and ad-hoc methods are often adopted, though with no formal justification. In this paper, we propose a novel approach for addressing this problem by using mixture-of-experts specifications on discrete phase-type distributions. This approach has gained popularity in the last decade, as it is inspired by machine learning and allows for fast and effective modeling of complex data.
Mixture-of-experts models are widely used in machine learning and statistics (Yuksel et al. (2012)), and they are known for their ability to accurately model complex data distributions. These models are particularly useful in situations where the data contains multiple modes or clusters, as they allow each mode to be modeled separately by a different expert component. This flexibility allows mixture-of-experts models to capture the underlying structure of the data more accurately than other types of probabilistic models. The applications of mixture-of-experts models are numerous and diverse and include classification and density estimation. Recently, they have been introduced to actuarial science in Fung et al. (2019).
In this paper, we introduce a series of models which, in the simplest univariate case, fall into the classical mixture-of-experts definition, inheriting their favorable properties. However, a key difference is that our proposal can additionally be viewed through the lens of absorption times of Markov processes, which statistically allows for a much more concise (in terms of model components) and effective tailor-made estimation procedure. Furthermore, our model is both robust and interpretable in terms of risk classes, which is important for effective risk management. In addition, the model can be easily extended to the multivariate case through two different constructions, which avoids the need for ad-hoc multivariate claim count modeling. This is important because ad-hoc methods can be difficult to implement and may not accurately capture the underlying data. It is worth mentioning that Fung et al. (2019) considered the multivariate case as well, though for other, much simpler component distributions.
Similarly to continuous phase-type counterparts (which for the multivariate case have been estimated in general in Albrecher et al. (2022) and for a subclass in Bladt (2023)), our approach offers estimation via a (generalized) expectation-maximization (EM) algorithm, though a key difference is that we deal with matrix powers instead of matrix exponentiation, dramatically reducing computational complexity and thus making it applicable for use in real-life scenarios with large datasets. The EM algorithm is a powerful general-purpose tool for estimating the parameters of probabilistic models that contain latent variables (Dempster et al. (1977)). In the univariate setting, this algorithm works by iteratively updating the estimated values of the latent variables and the model parameters based on the observed data (though some adaptations are necessary for the mixture-of-experts models, which are particularly tricky to fit, see also Chen et al. (1999)). This approach can be extended to the multivariate setting, where it estimates the parameters of our mixture-of-experts specification of discrete phase-type distributions. This allows us to use existing methods for estimating the parameters of multinomial distributions (zero-layer neural networks) to quickly and effectively estimate the parameters of our models. It is also important to note that gradient-based methods can be problematic when applied to these models because of the large number of variables and the non-linearity of the likelihood. Therefore, the EM algorithm provides a more robust and reliable approach for estimating the parameters in this setting. In the absence of regressors, the general class of multivariate discrete phase-type distributions was introduced in Navarro (2018), while in actuarial science, multivariate counts were studied using a subclass of discrete phase-type distributions in He and Ren (2016a) in terms of risk measures (see also Ren and Zitikis (2017) for a different construction) and in He and Ren (2016b) in terms of estimation through an EM algorithm.
We apply the proposed methodology to a real-life data set, the Wisconsin Local Government Property Insurance Fund (LGPIF) data set (previously analyzed in Frees et al. (2016) to compare several standard regression models), to demonstrate its effectiveness in practice. We will compare our approach to existing methods and show that it provides more accurate estimates, with the code to estimate our models being open-source and freely available online.1 This is mainly because our approach is able to capture the subtleties in the distribution of the data, whereas existing methods are not well-suited for this type of data. Such subtleties are worth modeling when datasets are large.
In the following sections, we will first revisit some of the properties of classical univariate discrete phase-type distributions (Section 2), which are the building blocks of our proposed model. We will then present the multivariate extensions of these distributions and derive estimation procedures for two of the most tractable sub-classes (Section 3). These procedures allow us to estimate the parameters of our model using expectation-maximization. We then introduce estimation of regression models via the mixture-of-experts specification for all models in the paper (Section 4). Finally, we show its practical applicability using real-life data (Section 5) and conclude (Section 6).

Discrete phase-type distributions
Discrete distributions are an important tool in statistics and actuarial science, as they provide a means of modeling the count data that is often encountered in these fields. Claim counts, in particular, are an important type of data that can be modeled using discrete distributions. These distributions provide a flexible and powerful means of modeling the complex relationships between the factors that influence claim counts, and can be used to make accurate predictions about present and future claim frequency.
Discrete phase-type (DPH) distributions are particularly well suited for flexibly modeling a wide range of data within the same framework. These distributions are a generalization of the traditional discrete distributions and allow for the modeling of complex behavior of discrete variables. This makes them well-suited for use in actuarial science, where there is high monetary value associated with accurately capturing the behavior of claim frequency.
In this section, we revisit some of the key properties of the classical univariate discrete phase-type distributions, which are the building blocks of every model presented in the remainder of this paper.
Hence, consider a time-homogeneous Markov chain Z = (Z_n)_{n∈ℕ_0} on a state space E = {1, . . ., p, p + 1}, where states 1, . . ., p are transient and p + 1 is absorbing. In essence, this guarantees that the hitting time of the process to state p + 1 is almost surely finite. The transition matrix P of Z is then, in block notation, of the form

P = ( T  t ; 0  1 ),

where T = (t_{kl})_{k,l=1,...,p} is a p × p matrix, called a sub-transition matrix, containing the transition probabilities among the transient states, and t = (t_1, . . ., t_p)^⊤ is the vector of transition probabilities from the transient states to the absorbing state, also known as the exit probability vector. Note that since the rows of P sum to one, we have the relationship t = e − Te, where e denotes the p-dimensional column vector of ones. Naturally, the row sums of T may be below one. Let π_k = P(Z_0 = k), k = 1, . . ., p, p + 1, and let (π, π_{p+1}) be the initial probabilities of the Markov chain, where π = (π_1, . . ., π_p). In what follows, we assume that the chain cannot start in the absorbing state, so that π_{p+1} = 0 and hence πe = 1, unless stated otherwise.

Definition 2.1 (DPH). The time until absorption

Y = inf{n ≥ 1 : Z_n = p + 1}

is said to follow a discrete phase-type distribution with representation (π, T), and we write Y ∼ DPH(π, T). Its probability mass function and survival function are given, respectively, by

f(y) = P(Y = y) = π T^{y−1} t,  y ∈ ℕ,
P(Y > y) = π T^{y} e,  y ∈ ℕ_0.
These formulas are reminiscent of the geometric distribution if we replace the matrix with a scalar. Indeed, if p = 1 we recover the geometric distribution. Perhaps less obvious is the fact that the Negative Binomial distribution can also be obtained as a special case, with identical phases in sequence. The class is closed under mixtures.
Concerning expected values, we have that the first moment of Y is given by

E[Y] = π (I − T)^{−1} e.

More generally, the κ-th factorial moment of Y, κ ∈ ℕ, is explicit and given by

E[Y(Y − 1) · · · (Y − κ + 1)] = κ! π T^{κ−1} (I − T)^{−κ} e.

Additionally, the probability-generating function G_Y of Y possesses the closed-form expression

G_Y(z) = E[z^Y] = z π (I − zT)^{−1} t,

and it exists (at least) for |z| ≤ 1.
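Writing t = e − Te, the pmf P(Y = y) = π T^{y−1} t and the mean E[Y] = π (I − T)^{−1} e can be evaluated with plain matrix powers. The following sketch (our own illustration, not the paper's companion code) also verifies the geometric special case p = 1:

```python
import numpy as np

# Minimal sketch of DPH pmf and mean via matrix powers (our own code).

def dph_pmf(y, pi, T):
    """P(Y = y) = pi T^(y-1) t, with exit vector t = e - T e."""
    t = 1.0 - T.sum(axis=1)           # exit probabilities
    return pi @ np.linalg.matrix_power(T, y - 1) @ t

def dph_mean(pi, T):
    """E[Y] = pi (I - T)^(-1) e."""
    p = len(pi)
    return pi @ np.linalg.solve(np.eye(p) - T, np.ones(p))

# p = 1 recovers the geometric distribution: T = (1 - q) gives
# P(Y = y) = (1 - q)^(y-1) q.
pi = np.array([1.0])
T = np.array([[0.7]])                 # success probability q = 0.3
print(dph_pmf(3, pi, T))              # 0.7^2 * 0.3 = 0.147
print(dph_mean(pi, T))                # 1 / 0.3 ≈ 3.333
```

For p > 1 the same two functions apply unchanged, which is what makes DPH distributions easy to implement once a representation (π, T) is given.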
These properties make DPH distributions easy to implement, given a representation, that is, given (π, T). Most of the applications of these distributions thus concentrate on finding representations efficiently. As we will see below, using the underlying jump process is very useful in this regard, and thus expectation-maximization (EM) algorithms are of statistical importance. It is worth noting, however, that for small dimensions, direct gradient-based methods are also effective.
Remark 2.1. Note that in the definition of Y ∼ DPH(π, T) above, we do not allow for Y = 0. However, if we are interested in having a discrete distribution that can take the value 0, we can simply consider π_{p+1} > 0, so that P(Y = 0) = π_{p+1}. In such a case, we can think of the law of Y as a mixture distribution with density function

f̃(y) = π_{p+1} 1{y = 0} + (1 − π_{p+1}) f(y) 1{y ≥ 1},

where f(y) is the density function of a DPH(π, T) distribution as before (with π rescaled to sum to one).
Equally effective, in terms of interpretation and statistical accuracy, is simply modeling Y + 1 as a DPH distributed random variable.
The following result from Neuts (1975) shows the robust modeling capabilities of DPH distributions: they can be used to approximate a wide range of other distributions with arbitrary precision. A proof is provided here for future reference, although the resulting representation is not particularly concise.
Theorem 2.2. Any probability mass function g with finite support on ℕ is a discrete phase-type distribution.
Proof. Consider a probability mass function g with support on {1, . . ., n} for some n ∈ ℕ. We denote the corresponding probabilities by g(i). Then, a DPH representation of dimension n for g is the following:

π = (g(1), . . ., g(n)),  t_{k,k−1} = 1 for k = 2, . . ., n,  t_1 = 1,

with all other entries of T equal to zero. In other words, the chain starts in state i with probability g(i), moves deterministically i → i − 1 → · · · → 1, and then exits to the absorbing state, so that absorption occurs precisely at time i.
In particular, we also have that the set of DPH distributions is dense in the set of discrete distributions with support on the entire ℕ.
Example 2.3. Consider the function x ↦ √x (1 + sin(x/2)) sampled at the points 1, 2, . . ., 35, and then normalized by the total sum. By specifying a DPH construction as in the above theorem, we obtain a 35-dimensional representation exactly matching the constructed density. Fig. 2.1 shows the construction.
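The construction in the proof of Theorem 2.2 can be coded directly. The sketch below (our own code, using the example's normalized density) checks that the resulting representation reproduces g exactly:

```python
import numpy as np

# Sketch of the denseness construction in Theorem 2.2: start in state i
# with probability g(i) and walk deterministically i -> i-1 -> ... -> 1
# before exiting, so absorption happens exactly at time i.

def neuts_representation(g):
    """g: probabilities g[i-1] = g(i) on support {1, ..., n}."""
    n = len(g)
    pi = np.asarray(g, dtype=float)
    T = np.zeros((n, n))
    for k in range(1, n):             # state k+1 moves to state k w.p. 1
        T[k, k - 1] = 1.0
    return pi, T                      # exit vector is e - T e = e_1

# Normalized samples of x -> sqrt(x) * (1 + sin(x/2)) as in Example 2.3
x = np.arange(1, 36)
g = np.sqrt(x) * (1 + np.sin(x / 2))
g = g / g.sum()

pi, T = neuts_representation(g)
t = 1.0 - T.sum(axis=1)
# The representation reproduces g exactly: P(Y = y) = pi T^(y-1) t
recovered = [pi @ np.linalg.matrix_power(T, y - 1) @ t for y in x]
print(np.allclose(recovered, g))      # True
```

As Remark 2.2 notes, this construction is not parsimonious: the dimension equals the size of the support, with no state interactions exploited.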

Remark 2.2 (On the dimension of an approximation).
Generally speaking, when no regression is present, the dimension p of the required DPH distribution should be taken to be smaller than the maximum value of the data.
In the worst-case scenario, DPH distributions work as mixtures, which will target one specific histogram value at a time. However, by using matrix parameters, we allow for jumping back and forth between the states of the underlying Markov chain, whereby increased flexibility is attained for fixed p. In other words, increasing p usually results in a better fit not only for one value in the histogram but for entire sections of the distribution. Furthermore, since the number of parameters (we prefer to think of them as weak learners in this context) grows with p^2 (which can be a drawback but also a strength, as is the case when talking about denseness), it is virtually never the case that a very large dimension is required to obtain a good fit, and often a good model will have a single-digit p.
Also, note that the theoretical denseness construction is not very parsimonious and should not be used for model building. Such a construction does not take advantage of the full discrete phase-type potential (namely, the parsimony induced by state interactions).

Multivariate discrete phase-type distributions
The goal of this section is twofold. First, it provides an overview of existing notation and results related to multivariate extensions of DPH distributions, which will be useful in the following sections. Second, it presents the first known estimation procedures for two of the most tractable multivariate discrete phase-type distributions. These procedures will serve as the building blocks for the multivariate mixture-of-experts regression models that will be discussed later on.

The multivariate discrete phase-type class
In this section, we begin by reviewing the MDPH* class of multivariate DPH distributions introduced in Navarro (2018). This general class is mainly of theoretical importance, since its members need not have explicit probability mass functions and distribution functions; only certain subclasses are tractable in this sense. Consequently, we focus on these tractable subclasses in subsequent sections.
Let τ ∼ DPH(π, T) be a discrete phase-type distributed random variable of dimension p with underlying Markov chain (Z_n)_{n∈ℕ_0}. Let r_j = (r_j(1), . . ., r_j(p))^⊤ be p-dimensional column vectors taking values in ℕ_0^p, j = 1, . . ., d, and let R = (r_1, r_2, . . ., r_d) be a p × d matrix, called a reward matrix. Define

Y^{(j)} = Σ_{n=0}^{τ−1} r_j(Z_n)

for all j = 1, . . ., d. We then say that the random vector Y = (Y^{(1)}, . . ., Y^{(d)}) has a multivariate discrete phase-type distribution of the MDPH* type, and we write Y ∼ MDPH*(π, T, R).
In the above definition, we can interpret r_j(k) as the reward obtained while Z is in state k. Thus, Y^{(j)} can be seen as the total reward for component j obtained prior to absorption. In particular, we have that Y^{(j)} is DPH distributed for all j = 1, . . ., d. Navarro (2018) showed that elements of the MDPH* class have explicit expressions for the joint probability-generating function, joint moment-generating function, and joint moments. Moreover, some closure properties of the class were derived, and denseness in the class of distributions with support in ℕ^d was shown. However, as previously mentioned, and as is the case for the continuous counterparts (the MPH* class), there are no general explicit expressions for the density and distribution functions. In addition, and in contrast to the continuous case,2 the estimation for this general class of distributions is still an open statistical challenge, which limits their applicability to describe real-life data.
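Although the general MDPH* class lacks explicit densities, the reward construction is straightforward to simulate: run the absorbing chain and accumulate the reward r_j(Z_n) for each component at every transient step. A sketch with illustrative parameter values of our own choosing:

```python
import numpy as np

rng = np.random.default_rng(1)

# Sketch (our own illustration): simulate Y ~ MDPH*(pi, T, R) by running
# the absorbing chain and collecting the reward r_j(Z_n) for each
# component j at every transient step before absorption.

def simulate_mdph(pi, T, R, size):
    p, d = R.shape
    t = 1.0 - T.sum(axis=1)                      # exit probabilities
    P = np.hstack([T, t[:, None]])               # full transition rows
    out = np.zeros((size, d), dtype=int)
    for i in range(size):
        k = rng.choice(p, p=pi)                  # initial transient state
        while True:
            out[i] += R[k]                       # reward while in state k
            k = rng.choice(p + 1, p=P[k])
            if k == p:                           # absorbing state reached
                break
    return out

pi = np.array([0.5, 0.5])
T = np.array([[0.2, 0.5],
              [0.3, 0.4]])
R = np.array([[1, 0],                            # state 1 rewards comp. 1
              [0, 1]])                           # state 2 rewards comp. 2
Y = simulate_mdph(pi, T, R, size=2000)
print(Y.mean(axis=0))                            # sample means of Y(1), Y(2)
```

With this particular R, each transient step contributes exactly one count to one of the two components, so Y^{(1)} + Y^{(2)} equals the absorption time of the chain.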
Next, we present two sub-classes of the MDPH* class which have explicit expressions for the joint density and distribution functions, and which preserve the denseness property of the MDPH* class. As an important consequence, these sub-classes allow for estimation via EM algorithms, making them attractive tools for modeling real-life data. We derive the main parts of the algorithms below, delegating the full details to Appendix A. In Section 4, we show how these models can be adapted to work with the regression setting presented there.

The feed-forward class of multivariate discrete phase-type distributions
In this section, we present our results for the bivariate case in order to improve the clarity of the presentation. It is possible to extend these results to higher dimensions, although doing so can be challenging from an implementation perspective. Consider an underlying Markov chain Z as before, and partition its transient states into two disjoint sets E_1 = {1, . . ., p_1} and E_2 = {p_1 + 1, . . ., p_1 + p_2}.
We denote by T_{11} and T_{22} the sub-transition matrices describing the transitions among the states in E_1 and E_2, respectively. Additionally, we denote by T_{12} the transition matrix describing movement from E_1 to E_2. Finally, we make the following assumption: the process cannot reach the absorbing state from a state in E_1 and cannot return to a state in E_1 once it has entered E_2.
Thus, the sub-transition matrix of Z is, in block notation, of the form

T = ( T_{11}  T_{12} ; 0  T_{22} ).

Note that under this setup, T_{11} e + T_{12} e = e, and the vector of exit probabilities is t = (0, t_2)^⊤, where t_2 = e − T_{22} e.
Next, we assume that the Markov chain can only start in a state of E_1 but not E_2, which imposes the following structure on its initial distribution: π = (η, 0), where η = (η_1, . . ., η_{p_1}) with ηe = 1. Now let

Y^{(1)} = inf{n ≥ 1 : Z_n ∈ E_2}

be the time at which the chain enters E_2, and let Y^{(2)} = inf{n ≥ 1 : Z_n = p_1 + p_2 + 1} − Y^{(1)} be the remaining time from entry into E_2 until absorption. We then say that the random vector Y = (Y^{(1)}, Y^{(2)}) follows a bivariate discrete phase-type distribution of the feed-forward type (fMDPH).

Remark 3.1 (Interpretation of the fMDPH class). Suppose that two claim frequencies follow the fMDPH distribution. Then we assume that there is a latent underlying risk process that generates the claim numbers for the first component while in a certain environment (the state space E_1). When there is a change of environment (entering the state space E_2), the generation of the first component terminates, and that of the second component starts, until the process gets absorbed. The dependence is generated by the fact that the manner (or state) in which the first component terminates directly affects how the second one is started.

Standard probabilistic considerations yield the following expressions for the joint density f_Y and joint survival function S_Y of Y:

f_Y(y) = η T_{11}^{y^{(1)}−1} T_{12} T_{22}^{y^{(2)}−1} t_2,  y ∈ ℕ^2,
S_Y(y) = η T_{11}^{y^{(1)}} (I − T_{11})^{−1} T_{12} T_{22}^{y^{(2)}} e,  y ∈ ℕ_0^2.
We can then use the equations above to derive other functionals of Y. First, we have that the joint probability-generating function of Y is

G_Y(z_1, z_2) = E[z_1^{Y^{(1)}} z_2^{Y^{(2)}}] = z_1 z_2 η (I − z_1 T_{11})^{−1} T_{12} (I − z_2 T_{22})^{−1} t_2.

Next, the joint factorial moments of Y are given by

E[(Y^{(1)})_{(κ_1)} (Y^{(2)})_{(κ_2)}] = κ_1! κ_2! η T_{11}^{κ_1−1} (I − T_{11})^{−κ_1−1} T_{12} T_{22}^{κ_2−1} (I − T_{22})^{−κ_2} e,

where κ_1, κ_2 ∈ ℕ and (x)_{(κ)} = x(x − 1) · · · (x − κ + 1). In particular, we can use this last expression to obtain

E[Y^{(1)} Y^{(2)}] = η (I − T_{11})^{−2} T_{12} (I − T_{22})^{−1} e,

which is required to compute the correlation matrix of Y.
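Since the joint density factorizes as f(y) = η T_{11}^{y^{(1)}−1} T_{12} T_{22}^{y^{(2)}−1} t_2, it can be evaluated with two matrix powers. The sketch below (illustrative parameters of our choosing, not a fitted model) checks that the pmf sums to one over a truncated grid:

```python
import numpy as np

# Sketch of the fMDPH joint density
#   f(y1, y2) = eta T11^(y1-1) T12 T22^(y2-1) t2,  t2 = e - T22 e,
# using only matrix powers (our own code, not the paper's implementation).

def fmdph_pmf(y1, y2, eta, T11, T12, T22):
    t2 = 1.0 - T22.sum(axis=1)
    return (eta
            @ np.linalg.matrix_power(T11, y1 - 1)
            @ T12
            @ np.linalg.matrix_power(T22, y2 - 1)
            @ t2)

eta = np.array([1.0, 0.0])
T11 = np.array([[0.3, 0.4],
                [0.2, 0.5]])
T12 = np.array([[0.2, 0.1],          # rows of (T11, T12) jointly sum to one
                [0.1, 0.2]])
T22 = np.array([[0.6, 0.1],
                [0.2, 0.3]])

# The pmf sums to one over a large enough grid (up to truncation error).
total = sum(fmdph_pmf(a, b, eta, T11, T12, T22)
            for a in range(1, 80) for b in range(1, 80))
print(round(total, 6))               # close to 1.0
```

The matrix powers here replace the matrix exponentials of the continuous case, which is the source of the computational savings mentioned in the introduction.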
Finally, we have that the marginals are DPH distributed, with Y^{(1)} ∼ DPH(η, T_{11}) and Y^{(2)} ∼ DPH(η (I − T_{11})^{−1} T_{12}, T_{22}). Independence. A representation of a vector with independent components Y^{(1)} ∼ DPH(π, T) and Y^{(2)} ∼ DPH(γ, S) is given by η = π, T_{11} = T, T_{22} = S, and T_{12} = tγ. Perfect positive dependence. Intuitively, positive dependence in an fMDPH distribution is obtained when highly-visited states in E_1 are connected, via the matrix T_{12}, with states in E_2 having the same property. For example, the following fMDPH distribution yields an extreme case.
η = (0, 1),  T_{11} = ( 0  0 ; 0.5  0 ), with T_{12} and T_{22} sparse so that there is a unique path through the chain. Note that in the example above, the perfect positive dependence is obtained because there is a unique way to move through the Markov chain, due to the sparse structure of the matrices, with some probabilities equal to 1. This construction easily generalizes to higher dimensions.
Perfect negative dependence. On the other hand, negative dependence is obtained when highly-visited states in E_1 are connected with states in E_2 with low visit probabilities. The following example exhibits perfect negative dependence.
The construction principle of fMDPH distributions in terms of a single Markov chain allows us to obtain the following result, analogous to Theorem 2.2.

Theorem 3.4. Any joint probability mass function g with finite support on ℕ^2 is a bivariate discrete phase-type of the feed-forward type.
Proof. Notice that we can always construct an absorbing Markov chain with paths absorbing precisely at the time points given by the support of g and with probabilities matching those of g. The details are omitted. We present a simple example illustrating the above statement.
We now show that the definition of fMDPH distributions falls into the MDPH* specification of Navarro (2018).

Estimation in two dimensions
Consider a dataset consisting of N two-dimensional i.i.d. observations y_i = (y_i^{(1)}, y_i^{(2)}), i = 1, . . ., N. In what follows, we denote this sample as y = {y_1, . . ., y_N}.
We first note that we only observe the times to entry to E_2 and to absorption of the underlying Markov chain. Hence, we are in an incomplete-data setup, and the expectation-maximization (EM) algorithm shall be employed. To apply such a method, we first require the complete likelihood, that is, the likelihood assuming that the entire path of the underlying Markov chain is observed.
Let B_k be the number of Markov chains starting in state k, N_{kl} the number of transitions from state k to state l, and N_k the number of Markov chains that exit from state k to the absorbing state. Then, the complete likelihood function for this sample is given by

L_c = Π_{k∈E_1} η_k^{B_k} × Π_{k,l} t_{kl}^{N_{kl}} × Π_{k∈E_2} t_k^{N_k},

where t_k denotes the exit probability from state k. This yields the following expression for the complete log-likelihood:

log L_c = Σ_{k∈E_1} B_k log η_k + Σ_{k,l} N_{kl} log t_{kl} + Σ_{k∈E_2} N_k log t_k.

Then, we need to alternate between an expectation (E) step, consisting of calculating the expected value of the complete log-likelihood given the observed sample, and a maximization (M) step, in which we maximize the complete likelihood with the conditional expectations in place of the actual sufficient statistics.
We first proceed to compute the conditional expectations of the sufficient statistics given the observed data. For such a purpose, we consider a generic data point y = (y^{(1)}, y^{(2)}) to perform the calculations. We start with B_k:

E[B_k | Y = y] = η_k e_k^⊤ T_{11}^{y^{(1)}−1} T_{12} T_{22}^{y^{(2)}−1} t_2 / P(Y = y).

Here, e_k denotes the k-th canonical basis vector in ℝ^{p_1}. In what follows, we use the same notation for canonical basis vectors in real vector spaces of different dimensions.
For N_{kl}, given the structure of the underlying Markov chain, we need to split the computations into three different cases.
Case 1: k, l ∈ E_1. Note that a transition between two states in E_1 is not possible if the chain moved to E_2 before time 2, that is, if y^{(1)} < 2. Thus, we have that

E[N_{kl} | Y = y] = Σ_{m=0}^{y^{(1)}−2} η T_{11}^m e_k t_{kl} e_l^⊤ T_{11}^{y^{(1)}−2−m} T_{12} T_{22}^{y^{(2)}−1} t_2 / P(Y = y).

Case 2: k, l ∈ E_2. Similarly to the previous case, we have that

E[N_{kl} | Y = y] = Σ_{m=0}^{y^{(2)}−2} η T_{11}^{y^{(1)}−1} T_{12} T_{22}^m e_{k−p_1} t_{kl} e_{l−p_1}^⊤ T_{22}^{y^{(2)}−2−m} t_2 / P(Y = y).

Case 3: k ∈ E_1, l ∈ E_2. We note that the corresponding conditional expectation can be interpreted as the probability of going from E_1 to E_2, at time y^{(1)}, due to a transition from k to l. Thus,

E[N_{kl} | Y = y] = η T_{11}^{y^{(1)}−1} e_k t_{kl} e_{l−p_1}^⊤ T_{22}^{y^{(2)}−1} t_2 / P(Y = y).

For the statistics N_k, k ∈ E_2, we simply have

E[N_k | Y = y] = η T_{11}^{y^{(1)}−1} T_{12} T_{22}^{y^{(2)}−1} e_{k−p_1} t_k / P(Y = y).

Finally, for a sample of size N, we add the formulas above evaluated at all data points y_i, i = 1, . . ., N.
The M-step follows straightforwardly by applying a Lagrange multiplier argument and is thus omitted for brevity. The complete EM algorithm is summarized in Algorithm 1.
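To illustrate the E-step, the conditional expectation of B_k is a ratio of matrix-power expressions, E[B_k | Y = y] = η_k e_k^⊤ T_{11}^{y^{(1)}−1} T_{12} T_{22}^{y^{(2)}−1} t_2 / f(y). The following sketch (our own toy parameters, not the paper's implementation) verifies that these expectations sum to one over k, as they must:

```python
import numpy as np

# Sketch of one E-step quantity: the expected number of starts in state
# k given an observation (y1, y2). Illustrative matrices of our choosing.

def fmdph_pmf(y1, y2, eta, T11, T12, T22):
    t2 = 1.0 - T22.sum(axis=1)
    return (eta @ np.linalg.matrix_power(T11, y1 - 1) @ T12
            @ np.linalg.matrix_power(T22, y2 - 1) @ t2)

def expected_B(y1, y2, eta, T11, T12, T22):
    """E[B_k | Y = y] for k = 1, ..., p1, returned as a vector."""
    t2 = 1.0 - T22.sum(axis=1)
    tail = (np.linalg.matrix_power(T11, y1 - 1) @ T12
            @ np.linalg.matrix_power(T22, y2 - 1) @ t2)
    return eta * tail / fmdph_pmf(y1, y2, eta, T11, T12, T22)

eta = np.array([0.6, 0.4])
T11 = np.array([[0.3, 0.4], [0.2, 0.5]])
T12 = np.array([[0.2, 0.1], [0.1, 0.2]])
T22 = np.array([[0.6, 0.1], [0.2, 0.3]])

B = expected_B(3, 2, eta, T11, T12, T22)
print(B, B.sum())                    # the expectations sum to one
```

Exactly one chain start occurs per observation, which is why Σ_k E[B_k | Y = y] = 1; the same identity makes these quantities directly usable as updated initial probabilities in the M-step.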

Extension to higher dimensions
We can extend the construction principle of fMDPH distributions to higher dimensions by simply splitting the set of transient states into d disjoint subsets E_1, . . ., E_d, which the chain must traverse in order before absorption. As in the bivariate case, several functionals of Y are explicit in terms of matrix functions. For instance, denoting by C_j the sub-transition matrix within E_j and by D_j the matrix of transition probabilities from E_j to E_{j+1}, the joint density is given by

f_Y(y) = η C_1^{y^{(1)}−1} D_1 C_2^{y^{(2)}−1} D_2 · · · D_{d−1} C_d^{y^{(d)}−1} c_d,

with c_d = e − C_d e. However, as we can see from the derivations of Algorithm 1, the complexity of an estimation procedure for higher dimensions increases non-linearly, and each dimension needs to be treated individually, which makes it challenging from an implementation point of view. Hence, we now introduce another subclass of the MDPH* class, which turns out to also be a subclass of the fMDPH class, with similar desirable properties, but with an estimation procedure that allows for simple scaling into higher dimensions.

The mDPH class
Consider d time-homogeneous Markov chains Z^{(j)} = (Z_n^{(j)})_{n∈ℕ_0}, j = 1, . . ., d, on a common state space E = {1, . . ., p, p + 1}, where states 1, . . ., p are transient and p + 1 is absorbing. We assume that the dependence structure of these processes is as follows:

Z_0^{(1)} = Z_0^{(2)} = · · · = Z_0^{(d)} almost surely,

and Z^{(1)}, . . ., Z^{(d)} are conditionally independent given Z_0^{(1)}. In other words, all the Markov chains start at the same state and evolve independently after that. In what follows, we denote Z_0 = Z_0^{(1)}, π_k = P(Z_0 = k), k = 1, . . ., p, and π = (π_1, . . ., π_p). Moreover, the sub-transition matrix and vector of exit probabilities associated with Z^{(j)} are denoted by T_j = (t_{kl}^{(j)})_{k,l=1,...,p} and t_j = e − T_j e = (t_1^{(j)}, . . ., t_p^{(j)})^⊤, respectively, for all j = 1, . . ., d. Letting Y^{(j)} = inf{n ≥ 1 : Z_n^{(j)} = p + 1} be the absorption time of the j-th chain, we say that the random vector Y = (Y^{(1)}, . . ., Y^{(d)}) is mDPH distributed. Note in particular that Y^{(j)} ∼ DPH(π, T_j) for all j = 1, . . ., d.

Remark 3.3 (Interpretation of the mDPH class).
This class is peculiar in the sense that dependence between claim count variables following an mDPH law is completely determined by the initial state, which is shared by otherwise fully independently evolving sample paths.Though we will see below that this class belongs to the fMDPH class, the way of thinking of the mDPH class is closer to pure mixture models, where the mixture classes partition the distribution into natural risk classes.Within each risk class, the claim counts are independent.
As previously mentioned, an advantage of this class is that it possesses closed-form formulas for several functionals of interest, and they can easily be obtained by conditioning on the starting value of the Markov chains. For instance, the joint distribution function is

F_Y(y) = Σ_{k=1}^p π_k Π_{j=1}^d (1 − e_k^⊤ T_j^{y^{(j)}} e),  y ∈ ℕ^d.

Similarly, we have that the joint density is given by

f_Y(y) = Σ_{k=1}^p π_k Π_{j=1}^d e_k^⊤ T_j^{y^{(j)}−1} t_j,  y ∈ ℕ^d,

the joint probability-generating function is

G_Y(z_1, . . ., z_d) = Σ_{k=1}^p π_k Π_{j=1}^d z_j e_k^⊤ (I − z_j T_j)^{−1} t_j,

and the joint factorial moments can be computed as

E[Π_{j=1}^d (Y^{(j)})_{(κ_j)}] = Σ_{k=1}^p π_k Π_{j=1}^d κ_j! e_k^⊤ T_j^{κ_j−1} (I − T_j)^{−κ_j} e.

In particular, we also have that, for j ≠ q,

E[Y^{(j)} Y^{(q)}] = Σ_{k=1}^p π_k (e_k^⊤ (I − T_j)^{−1} e)(e_k^⊤ (I − T_q)^{−1} e),

which is needed to compute the correlation matrix of Y.
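The joint pmf obtained by conditioning on the shared initial state, P(Y = y) = Σ_k π_k Π_j e_k^⊤ T_j^{y^{(j)}−1} t_j, is cheap to evaluate. A sketch with illustrative parameters of our own choosing:

```python
import numpy as np

# Sketch of the mDPH joint density obtained by conditioning on the
# shared initial state (our own code; matrices are arbitrary examples).

def mdph_pmf(y, pi, Ts):
    p = len(pi)
    out = 0.0
    for k in range(p):                           # condition on Z_0 = k
        term = pi[k]
        for yj, Tj in zip(y, Ts):
            tj = 1.0 - Tj.sum(axis=1)            # exit vector of chain j
            term *= (np.linalg.matrix_power(Tj, yj - 1) @ tj)[k]
        out += term
    return out

pi = np.array([0.5, 0.5])
T1 = np.array([[0.4, 0.3], [0.1, 0.6]])
T2 = np.array([[0.7, 0.1], [0.3, 0.4]])

# The pmf sums to one over a large enough grid (up to truncation error).
total = sum(mdph_pmf((a, b), pi, (T1, T2))
            for a in range(1, 120) for b in range(1, 120))
print(round(total, 6))               # close to 1.0
```

The same loop works unchanged for any dimension d, which reflects the simple scaling of this subclass compared to the feed-forward construction.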
The following result shows that any other multivariate discrete distribution can be approximated by an mDPH model, which makes them a robust class from a statistical perspective.

Theorem 3.10 (Denseness). The class of mDPH distributions is dense in the set of distributions with support on ℕ^d.
Proof. Consider DPH distributions of the form given in the proof of Theorem 2.2. Observe that these distributions satisfy Condition 3 of Proposition 3.1 in Fung et al. (2019). Finally, note that multivariate finite mixture models with DPH components of this form are particular specifications of mDPH distributions. Then, the result follows by Condition 1 of Proposition 3.1 in Fung et al. (2019).
Finally, we establish the relationship of this class with respect to the other two multivariate models previously introduced.

Proposition 3.11. The class of mDPH distributions is contained in the class of fMDPH distributions.

Proof. The proof is analogous to the one of Proposition 2.2 in Bladt (2023).

Estimation
Maximum likelihood estimation of the mDPH class can be carried out via an explicit EM algorithm for arbitrary dimensions, which we derive next. Consider i.i.d. realizations y = {y_1, . . ., y_N}, y_i = (y_i^{(1)}, . . ., y_i^{(d)}). In order to write down the complete likelihood associated with this sample, we require the following definitions. As before, we let B_k be the total number of Markov chains starting in state k. For each one of the Markov chains Z^{(j)}, j = 1, . . ., d, we let N_{kl}^{(j)} be the number of transitions from state k to state l, and N_k^{(j)} the number of exits from state k to the absorbing state. With these definitions at hand, we have that the complete likelihood is given by

L_c = Π_{k=1}^p π_k^{B_k} Π_{j=1}^d Π_{k,l} (t_{kl}^{(j)})^{N_{kl}^{(j)}} Π_{k=1}^p (t_k^{(j)})^{N_k^{(j)}}.

We proceed to compute the required conditional expectations, considering a generic data point y = (y^{(1)}, . . ., y^{(d)}). First,

E[B_k | Y = y] = π_k Π_{j=1}^d e_k^⊤ T_j^{y^{(j)}−1} t_j / P(Y = y).

For N_{kl}^{(j)}, we first note that a transition between two states is only possible if y^{(j)} ≥ 2. Thus,

E[N_{kl}^{(j)} | Y = y] = t_{kl}^{(j)} Σ_{s=1}^p π_s (Π_{q≠j} e_s^⊤ T_q^{y^{(q)}−1} t_q) e_l^⊤ K(y^{(j)}; e_s, T_j) e_k / P(Y = y),

where, for any vector of initial probabilities π and sub-transition matrix T with exit vector t, K(y; π, T) is defined as

K(y; π, T) = Σ_{m=0}^{y−2} T^{y−2−m} t π T^m.

Finally, for N_k^{(j)} we have

E[N_k^{(j)} | Y = y] = Σ_{s=1}^p π_s (Π_{q≠j} e_s^⊤ T_q^{y^{(q)}−1} t_q) e_s^⊤ T_j^{y^{(j)}−1} e_k t_k^{(j)} / P(Y = y).

For a sample of size N, we simply sum the formulas above over the different sample values y_i, i = 1, . . ., N.
The M-step follows easily by applying a Lagrange multiplier argument. We provide the detailed routine in Algorithm 2.
Example 3.12 (Synthetic data). We generated an i.i.d. sample of size 5000 of a bivariate discrete random vector with Negative Binomial margins and a Gaussian copula. More specifically, the first margin has a dispersion parameter of 3 and a success probability of 0.3, the second margin has a dispersion parameter of 4 and a success probability of 0.7, and the copula has a correlation parameter of 0.5. Then we fitted an fMDPH and an mDPH model to this data set. The aim is to show that the two classes of multivariate DPH distributions introduced before can successfully recover this type of dependence.
For the fMDPH case, we selected a model with p_1 = p_2 = 6, and for the mDPH one, we chose a distribution with p = 6. We ran both algorithms with 2000 iterations, which was chosen such that changes in the log-likelihoods of both models became negligible, leading to computational times3 of 8.60 mins for the mDPH model and 19.10 mins for the fMDPH one. Fig. 3.1 shows that both fitted joint densities successfully recover the shape of the histogram of the simulated sample. This is further supported by Fig. 3.2, where we observe that the marginal behavior is described adequately. Moreover, we have that the sample correlation of the generated sample is 0.4789, which is well approximated by the fMDPH and mDPH models with values of 0.4353 and 0.4379, respectively. Finally, we note that the log-likelihoods of both models (−24,630.56 for the fMDPH model and −24,625.87 for the mDPH model) are above the log-likelihood computed employing the original model (−24,708.38).
Interestingly, the simpler model (recall that mDPH ⊂ fMDPH) performs slightly better in log-likelihood, and with fewer overall parameters (114 for the fMDPH model and 78 for the mDPH one). This is because their theoretical equivalences are shown through dimension-augmentation arguments, effectively making the fitted mDPH have a much larger fMDPH equivalent. Moreover, note that the AIC (Akaike's Information Criterion) of the mDPH model is 49,407.74, which outperforms the original model with a corresponding value of 49,426.76. Nevertheless, this is not the case for the fMDPH model, with an AIC of 49,489.12. This is an example of simpler models behaving parsimoniously if encoded correctly.
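The mechanism behind mDPH dependence is easy to reproduce by simulation: sampling the shared initial state first and then running independent chains yields correlated counts. A toy sketch (parameters are our own illustration, not the fitted values of Example 3.12):

```python
import numpy as np

rng = np.random.default_rng(7)

# Sketch of how dependence arises in an mDPH model: both coordinates
# share the initial state and then evolve independently.

def dph_sample_from_state(k, T, rng):
    """Absorption time of a DPH chain started in transient state k."""
    p = T.shape[0]
    t = 1.0 - T.sum(axis=1)
    P = np.hstack([T, t[:, None]])    # rows over states 0..p (p = absorbing)
    n = 0
    while k < p:
        k = rng.choice(p + 1, p=P[k])
        n += 1
    return n

pi = np.array([0.5, 0.5])
# State 1 implies long sojourns, state 2 short ones, in both coordinates.
T1 = np.array([[0.9, 0.0], [0.0, 0.1]])
T2 = np.array([[0.9, 0.0], [0.0, 0.1]])

samples = []
for _ in range(4000):
    k = rng.choice(2, p=pi)           # shared initial state (risk class)
    samples.append((dph_sample_from_state(k, T1, rng),
                    dph_sample_from_state(k, T2, rng)))
samples = np.array(samples)
corr = np.corrcoef(samples[:, 0], samples[:, 1])[0, 1]
print(round(corr, 2))                 # clearly positive
```

Within each risk class (initial state) the two counts are independent; the positive correlation comes entirely from the shared class membership, matching the interpretation in Remark 3.3.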

Discrete phase-type mixture-of-experts models
The mixture-of-experts approach is a popular method for creating more flexible and powerful regression models. This approach involves combining multiple individual regression models, or "experts", in a way that allows the model to make more accurate predictions. In this section, we propose a new variant of the mixture-of-experts approach that incorporates discrete univariate and multivariate phase-type distributions as defined in the previous sections. This allows the model to more accurately capture the complex, non-linear relationships between the input and output variables, leading to a more flexible regression estimation.

Univariate model
In the discrete case, we may follow the construction approach from Bladt and Yslas (2022b) to define the DPH-MoE class of regression models. More specifically, assume that we have a vector X ∈ ℝ^h corresponding to covariate information of the random variable Y; we then endow the underlying Markov chain Z with initial probabilities depending on the covariate information. Here, we consider the softmax parametrization, which neural networks have popularized.

Definition 4.1. We say that Y follows the DPH-MoE model if Y | X = x ∼ DPH(π(x; α), T), with initial probabilities π(X; α) = (π_k(X; α))_{k=1,...,p} given by

π_k(X; α) = exp(α_k^⊤ X) / Σ_{l=1}^p exp(α_l^⊤ X),  k = 1, . . ., p,

where α = (α_1, . . ., α_p)^⊤ ∈ ℝ^{p×h}.
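The softmax link can be sketched in a few lines; the coefficient matrix α and covariate vector x below are arbitrary illustrations of our own, not fitted values:

```python
import numpy as np

# Sketch of the softmax parametrization of the covariate-dependent
# initial vector in a DPH-MoE model:
#   pi_k(x; alpha) = exp(alpha_k' x) / sum_l exp(alpha_l' x).

def softmax_init(x, alpha):
    z = alpha @ x                     # one score per phase
    z = z - z.max()                   # numerical stabilization
    w = np.exp(z)
    return w / w.sum()

alpha = np.array([[0.5, -1.0],
                  [0.0,  0.2],
                  [-0.3, 0.8]])       # p = 3 phases, h = 2 covariates
x = np.array([1.0, 2.0])              # covariate vector
pi_x = softmax_init(x, alpha)
print(pi_x, pi_x.sum())               # a probability vector summing to one

# Conditional on X = x, the response is DPH(pi(x; alpha), T):
# P(Y = y | X = x) = pi(x; alpha) T^(y-1) t.
```

This is exactly the form of a multinomial regression (a zero-layer neural network), which is what allows the EM algorithm to reuse standard multinomial fitting routines in the M-step.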
It is worth remarking that, in many cases, univariate loss modeling can be more effective than multivariate loss modeling because it allows for a more focused and detailed analysis of the variable in question. This can be particularly useful when the goal is to understand the underlying factors contributing to the loss of a particular variable, and when the relationships between different variables are not well understood.
Univariate loss modeling may also be more computationally efficient than multivariate loss modeling, since it involves dealing with a smaller amount of data. This can be important when the data is very large or when real-time analysis is required.

Multivariate model
While univariate loss modeling can be useful for understanding the loss of a single variable, multivariate modeling offers several advantages. For example, multivariate modeling allows for a more comprehensive analysis of the relationships between different variables, which can provide valuable insights into the underlying factors that drive the system. This can be particularly useful in situations where the relationships between variables are not well understood, or where the effects of changes in one variable on the others are unclear.
In addition, multivariate modeling can be more effective at capturing the complex, non-linear relationships that often exist between input and output variables by "learning from the other dimensions"; see for instance Frees et al. (2010) (and also Frees (2003) for a credibility approach). This can lead to more accurate estimation and better decision-making, which can be critical in various applications.
The definitions of the fMDPH-MoE and mDPH-MoE models are similar to the univariate case: they consist of considering vectors of initial probabilities depending on the covariate information. In the present case, we again only consider the softmax parametrization.

Remark 4.1 (Interpretation of the multivariate phase-type MoE models).
The natural interpretation that follows directly from combining the mixture-of-experts and phase-type concepts is as follows. Risk classes are chosen according to a mixture specification governed by the initial vector, and thereafter the number of claims follows a multi-state model. Each dimension of the claim count vector collects rewards on specific states of the process until absorption. Thus, the claim evolution is highly dependent on the multinomial parameters, which for high matrix dimensions can become somewhat ambiguous, creating risk classes that are, in a sense, arbitrarily constructed. This problem, although still less pronounced than for Neural Networks, is present in all mixture-of-experts specifications, and a statistical workaround (which sadly does not make the model any more interpretable) would be to introduce regularization.
However, we want to stress that by adding matrix parameters, we aim to make our models much more flexible and competitive for small p, compared to what pure mixture-of-experts approaches provide. See also Tables 3 and 4 in Bladt and Yslas (2022b) for empirical evidence of such dimensionality reduction in the case of continuous responses.

Denseness
The denseness of probability distributions is an important property that translates into flexible and powerful statistical modeling. A class of distributions is, loosely speaking, said to be mathematically dense if it can approximate other distributions arbitrarily closely. This property allows for a wide range of modeling techniques, including the use of mixture models, convolution, and other advanced techniques.
Mathematically dense distributions are particularly useful in situations where the data is complex and non-linear, as they allow for the creation of more flexible and accurate models. They can also be used when the data is noisy or incomplete, as there is less focus on the particular structure of any given distribution in the class.
We now proceed to state some definitions and regularity conditions that allow our models to be framed within a correct mathematical concept of denseness, which is applicable not only in the response domain but also uniformly over regression models, that is, taking the feature space into account.

Definition 4.2. Let A be the set of possible values of the covariates X. A class of regression models C(A) in A is the set of conditional distributions given the covariates; that is, each element of C(A) is a set of probability distributions {F(· | X = x); x ∈ A}.

Definition 4.3. Given a class of regression models C_1(A), we say that the class of regression models C_2(A) is dense (uniformly dense) in C_1(A) if, for each element of C_1(A), there exists a sequence of regression models in C_2(A) such that all the associated conditional distributions converge weakly (uniformly weakly) for each x ∈ A.
Definition 4.4. We say that a feature space A is regular if it is of the form A = {1} × [a, b]^{h−1}, a, b ∈ R. In other words, the covariates contain an intercept and are otherwise contained in a hypercube.

Condition 4.5. A class of regression models C(A) is said to satisfy the tightness and Lipschitz conditions on a regular A if {F(· | X = x); x ∈ A} is a tight family of distributions and, for each y, the function x ↦ F(y | X = x) is Lipschitz continuous on A.

Proposition 4.6 (Denseness). Let C(A) be a class of multivariate frequency regression models satisfying the tightness and Lipschitz conditions on a regular A. Then, the class of mDPH-MoE regression models is uniformly dense in C(A).
Proof. The result is a direct consequence of Theorem 3.3 in Fung et al. (2019), by noticing that multivariate LRMoE models with DPH experts of the form given in the proof of Theorem 2.2 are particular instances of mDPH-MoE models.
Remark 4.2. Given that mDPH ⊂ fMDPH, the fMDPH-MoE class also satisfies the denseness property above. Moreover, observe that for d = 1 the mDPH class retrieves the DPH class. Thus, the set of DPH-MoE regression models satisfies the denseness property for univariate frequency regression models.

Estimation
Deriving an estimation procedure for the DPH-MoE model via a modified EM algorithm is an important step in developing this powerful and flexible regression technique. The EM algorithm is a widely used iterative method for estimating the parameters of a statistical model, and has been successfully applied to a variety of different types of models.
By developing a modified EM algorithm specifically for the DPH-MoE model, we may take advantage of existing statistical methods for the estimation of multinomial models (effectively, a zero-layer neural network) and combine them with the phase-type-specific part of the algorithm. The result is a fast and robust procedure that can be easily implemented in high-level programming languages (though all our examples in this paper were coded in C++).
The derivation of the multivariate classes' algorithms is omitted in this paper for brevity, as they are similar to the estimation procedures previously derived without covariate information. However, the full details of these algorithms can be found in the appendix of this paper. Specifically, refer to Algorithms 4 and 5 for the fMDPH-MoE and mDPH-MoE classes, respectively.

Consider an i.i.d. sample of size N from a DPH(π(X), T) model with absorption times y = (y_1, ..., y_N) and corresponding paired covariate information x = (x_1, ..., x_N). Now, let B_k(x) be the number of Markov chains with initial distribution π(x) starting in state k, N_{kl}(x) the number of transitions from state k to state l conditional on π(x), and N_k(x) the number of chains that exit from state k to the absorbing state given π(x). Then, the complete likelihood function for this sample is given by

L_c(α, T; y, x) = ∏_{i=1}^N ∏_{k=1}^p π_k(x_i; α)^{B_k(x_i)} · ∏_{k=1}^p ( ∏_{l=1}^p t_{kl}^{N_{kl}} ) t_k^{N_k},

where t_{kl} denote the entries of T and t_k the entries of the exit vector t = e − T e. Direct maximization is not tractable in closed form; however, making use of the complete likelihood above, we can employ an expectation-maximization (EM) algorithm to compute the MLE iteratively. The derivation of the E- and M-steps is similar to the case without covariate information, and the exact formulas are the following:

1) E-step, conditional expectations:

E[B_k(x_i) | Y = y_i, X = x_i] = π_k(x_i) e_k^⊤ T^{y_i−1} t / ( π(x_i) T^{y_i−1} t ), i = 1, ..., N,

E[N_{kl} | Y = y, x] = ∑_{i=1}^N 1{y_i ≥ 2} t_{kl} e_l^⊤ K(y_i; π(x_i), T) e_k / ( π(x_i) T^{y_i−1} t ),

E[N_k | Y = y, x] = ∑_{i=1}^N t_k π(x_i) T^{y_i−1} e_k / ( π(x_i) T^{y_i−1} t ).

2) M-step, explicit maximum likelihood estimators:

t̂_{kl} = E[N_{kl} | Y = y, x] / ( ∑_{s=1}^p E[N_{ks} | Y = y, x] + E[N_k | Y = y, x] ), k, l = 1, ..., p,

t̂_k = E[N_k | Y = y, x] / ( ∑_{s=1}^p E[N_{ks} | Y = y, x] + E[N_k | Y = y, x] ), k = 1, ..., p.

3) R-step, weighted multinomial regression estimation:

α̂ = argmax_α ∑_{i=1}^N ∑_{k=1}^p E[B_k(x_i) | Y = y_i, X = x_i] log π_k(x_i; α),

so that the updated initial probabilities (π_1(x_i; α̂), ..., π_p(x_i; α̂)) lie in Δ^{p−1}, the standard (p − 1)-simplex.
The detailed routine can be found in Algorithm 3.
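The E-step posterior for the starting state admits a short implementation. The following is an illustrative sketch under hypothetical parameter values, not the paper's C++ code; it computes the DPH probability mass function and the conditional expectation E[B_k | Y = y, X = x] for a single observation:

```python
import numpy as np

def dph_density(y, pi, T):
    """Density of Y ~ DPH(pi, T): f(y) = pi T^(y-1) t, for y = 1, 2, ..."""
    t = 1.0 - T.sum(axis=1)                  # exit vector t = e - T e
    return pi @ np.linalg.matrix_power(T, y - 1) @ t

def e_step_start_state(y, pi_x, T):
    """E[B_k | Y = y, X = x]: posterior probability that the chain
    started in state k, given absorption at time y."""
    t = 1.0 - T.sum(axis=1)
    Tp = np.linalg.matrix_power(T, y - 1)
    num = pi_x * (Tp @ t)                    # pi_k(x) * e_k' T^(y-1) t
    return num / num.sum()                   # denominator: pi(x) T^(y-1) t

# Hypothetical 2-state example; pi_x would come from the softmax layer.
T = np.array([[0.5, 0.3],
              [0.1, 0.6]])
pi_x = np.array([0.7, 0.3])
B = e_step_start_state(4, pi_x, T)
```

The posterior weights B then enter the R-step as observation weights in the multinomial logistic regression.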

Remark 4.3 (Data with zeroes).
When dealing with data that contains zeroes, there are at least three possible approaches to using the proposed regression model. The first approach is to simply exclude any data points that contain zeroes, as these points may not provide useful information for the model. This can be particularly useful if the proportion of zero-valued data points is relatively small and the remaining data is sufficient to accurately model the relationships of interest. Alternatively, we may keep the zeroes and estimate π_{p+1} separately, without regression, by simply taking the number of zero observations divided by the total number of observations, and then fit a DPH-MoE model to the remaining positive observations. A similar approach is sometimes used for regression models based on zero-inflated distributions fitted in two steps.
The second approach involves translating the data by one unit, so that all data starts at 1, and then fitting a DPH-MoE model. The interpretability of this approach is not diminished because, by design, multivariate DPH distributions represent the progression of states through time, so the translation simply means starting the clock at a later time.
A third approach is to use a specialized variant of the regression model that is specifically designed to handle data with zeroes. For example, the model could incorporate a zero-inflated component, which allows for the modeling of excess zeros in the data. For instance, assume that π_{p+1}(x) is of the form π_{p+1}(x) = (1 + exp(x^⊤ γ))^{−1}. Then the procedure splits into fitting a logistic regression to model the probability of obtaining a zero or not, and fitting a DPH-MoE to the positive data.
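The third approach amounts to a two-part mixture: a logistic component for the zero mass and a DPH component on the positive support. A minimal sketch of the resulting probability mass function (all parameter values are hypothetical, chosen for illustration only):

```python
import numpy as np

def zi_dph_pmf(y, x, gamma, pi_x, T):
    """Zero-inflated DPH pmf:
    P(Y = 0 | x) = pi_{p+1}(x) = (1 + exp(x' gamma))^{-1},
    P(Y = y | x) = (1 - pi_{p+1}(x)) * f_DPH(y) for y >= 1."""
    p0 = 1.0 / (1.0 + np.exp(x @ gamma))     # zero-inflation probability
    if y == 0:
        return p0
    t = 1.0 - T.sum(axis=1)                  # DPH exit vector
    f = pi_x @ np.linalg.matrix_power(T, y - 1) @ t
    return (1.0 - p0) * f

# Hypothetical parameters for a 2-state DPH component.
gamma = np.array([0.5, -0.2])
x = np.array([1.0, 1.5])
T = np.array([[0.4, 0.2],
              [0.0, 0.5]])
pi_x = np.array([0.6, 0.4])
probs = [zi_dph_pmf(y, x, gamma, pi_x, T) for y in range(0, 200)]
```

Because the DPH part integrates to one over the positive integers, the two-part construction yields a valid distribution on {0, 1, 2, ...}.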
Remark 4.4 (Data with exposures). The classic approach to dealing with exposure for claim counts is through a log-offset in the regression formula. This results in a multiplicative effect of exposure on the claim frequency whenever a log-link function is used on a Poisson distribution. The multiplicative structure is appealing for practitioners, and it holds quite generally, that is, irrespective of the form of the predictive model: linear, trees, Neural Networks, among others.
In our setting, however, we have moved away from the Poisson specification, so we deal with exposure in a different manner: as a covariate. This approach can improve the accuracy of the predictions, as it allows the model to take into account the effects of the exposure on the outcome of interest. However, this approach does have a small downside, in that it can make the model less interpretable. This can make it more difficult to communicate the results to decision-makers.
Despite this downside, incorporating exposure as a covariate is still a valuable approach, as it does not compromise the statistical performance of the model. The improved accuracy of the predictions can be valuable in a wide range of applications, not limited to claim counts, and can help in making better-informed decisions. In many cases, the benefits of improved accuracy will outweigh the slight loss of interpretability, making this an effective approach for improving the performance of the regression model.
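The two ways of handling exposure can be contrasted in a small sketch: with a log-offset in a Poisson model, the exposure w acts multiplicatively on the expected count, whereas treating log w as an ordinary covariate lets its coefficient be learned from the data. All coefficient values below are hypothetical:

```python
import numpy as np

beta = np.array([0.1, -0.3])   # hypothetical regression coefficients
x = np.array([1.0, 2.0])       # intercept plus one feature
w = 0.5                        # exposure (e.g. half a policy year)

# (a) Classic log-offset: log E[N] = x' beta + log(w),
#     so exposure enters multiplicatively with a fixed coefficient of 1.
mu_offset = np.exp(x @ beta + np.log(w))

# (b) Exposure as covariate: log(w) gets its own coefficient,
#     which is estimated rather than fixed at 1.
beta_w = 0.8                   # hypothetical fitted coefficient
mu_covariate = np.exp(x @ beta + beta_w * np.log(w))

# The offset form is exactly w times the exposure-free mean.
assert np.isclose(mu_offset, w * np.exp(x @ beta))
```

Freeing the exposure coefficient trades the clean multiplicative interpretation for additional flexibility, which is the compromise discussed above.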

Case study: LGPIF data set
This section considers the Wisconsin Local Government Property Insurance Fund (LGPIF) data set, which was previously described in detail and analyzed in Frees et al. (2016), and also recently considered in Yang and Shi (2019) and Jeong et al. (2023). The fund was established to provide property insurance for local government entities and is administered by a government office. Properties covered under this fund include government buildings, vehicles, and equipment. These data provide a good example of a typical multi-line insurance company encountered in practice. The data span the period 2006–2010 as a training set (the part used in the in-sample analysis below), and the year 2011 is also available as a testing set (which we use for prediction).
Our analysis focuses on the univariate and multivariate settings, allowing us to investigate the data from different perspectives. We showcase the two multivariate regression models that we have introduced in the paper, each of which offers insights into the data and the modeling methodology. Through the analysis, we use the LGPIF data set to illustrate the favorable performance of discrete phase-type regression modeling. Since the data is large and significantly zero-inflated, we decide to use the second method (translation) of Remark 4.3 to deal with zeros.
Claims in the dataset have six different codes, depending on the applicable coverage type. The corresponding codes are BC, IM, C, P, N, and O, and they are all standard codes used in the insurance industry to represent different types of coverage. BC stands for buildings and contents coverage, which provides insurance for buildings and the properties within. IM is an abbreviation for "inland marine" and is used as the coverage code for equipment coverage, which originally belonged to contractors. C represents coverage for the impact of a vehicle with an object, the impact of a vehicle with an attached vehicle, or overturn of a vehicle. P stands for coverage for direct and accidental loss or damage to a motor vehicle, including breakage of glass, loss caused by missiles, falling objects, fire, theft, explosion, earthquake, windstorm, hail, water, flood, malicious mischief or vandalism, riot or civil commotion, or colliding with a bird or animal. N is used as an indication that the coverage is for vehicles of the current model year or 1–2 years prior to the current model year. O is used to indicate that the coverage is for vehicles three or more years prior to the current model year.
We focus our analysis on BC and IM, since they account for the majority of claims in the dataset, roughly 65%. The number of observations within these two coverages is then 10,282. Regarding explanatory variables, for each coverage we choose the following: coverage amount (in log-scale, with respective means 2.12 and −1.34, and variances 4 and 3.51), an indicator for no claims in the previous year (respectively, 3796 and 2239 zero values, and 1864 and 2383 unit values), entity type (City, County, Misc, School, Town), and deductible level (in log-scale, with respective means 7.15 and 6.54, and variances 1.38 and 0.42). See Table 5.1 for the number of claims per entity type. In Frees et al. (2016), BC was also specially highlighted above the rest of the coverages, and the same explanatory variables were used.

Building and content and contractor's equipment frequency modeling
We initially focus on the univariate task of modeling the frequency of buildings and contents (BC) claims and compare it with the reference models presented in Frees et al. (2016). Table 5.2 shows the results for the expected versus empirical counts for varying claim numbers. We observe that the DPH-MoE models behave favorably across the entire distribution, which is also supported by Table 5.3, using a Chi-square goodness-of-fit statistic, and by Table 5.4, in terms of log-likelihood and AIC. In particular, we see that despite the high number of model parameters, the additional information extracted from the covariates is still worthwhile, with a matrix dimension of 7 being preferred. Moreover, to assess the predictive performance of the fitted models, we used the testing set and compared the mean predictions with the observed frequency via the root-mean-square error (RMSE), which can also be found in Table 5.4. We observe that the DPH-MoE specifications outperform the other models here as well.
An analogous analysis for IM is provided in Tables 5.5, 5.6 and 5.7, where the AIC prefers a dimension of 3. Note that the AIC is a measure of the relative quality of statistical models for a given set of data. It balances the fit of the model (how well the model explains the data) against its complexity (the number of parameters the model uses). It penalizes models that use a large number of parameters, as these may overfit the data and fail to generalize well to new data. The AIC is used here to compare different types of models, despite the fact that it is known to penalize matrix methods more harshly than other models, since it is a standard criterion for comparing the other models and because the data include covariate information.

Joint modeling of BC and IM frequencies
In Frees et al. (2016), joint modeling of claim counts is done using a zero-one-inflated NB for BC and an NB for IM, coupled together with a Gaussian copula. This procedure is standard and widespread in the industry for both discrete (counts) and continuous (severity) response variables. Here, we apply an fMDPH-MoE with dimensions 7 and 3 (motivated by the univariate modeling above) and an mDPH-MoE of dimension 7 (the best-performing among similar dimensions). The results of our models are presented in Table 5.8 in terms of expected versus observed counts, and in Table 5.9 in terms of likelihood and AIC. Our models show favorable performance compared to the standard method. Interestingly, the smaller-dimensional mDPH-MoE performs better than the fMDPH-MoE, indicating that parsimonious solutions may be found with simpler models. Therefore, we recommend trying both methodologies before deciding on one of them.

Conclusion
In this paper, we proposed a novel approach for modeling loss frequency using mixture-of-experts specifications on discrete phase-type distributions. Our approach is inspired by machine learning and allows for fast and effective modeling of complex data. We applied the proposed methodology to a real-life data set, the Wisconsin Local Government Property Insurance Fund (LGPIF) data set, and compared it to existing methods, showing that it provides more accurate estimates. We demonstrated that our approach is a more effective solution (both in-sample and out-of-sample) for modeling loss frequency in non-standard situations, and is a versatile and practical tool for analyzing a wide range of data.

Declaration of competing interest
There is no competing interest.

Algorithm 3 (EM algorithm for DPH-MoE, softmax parametrization). Input: positive integer data points y = (y_1, ..., y_N), covariates x = (x_1, ..., x_N), and initial parameters (α, T).

Algorithm 4 takes as input data points y = (y_1, ..., y_N) in N², with y_i = (y_i^(1), y_i^(2)), i = 1, ..., N, covariates x = (x_1, ..., x_N), and initial parameters (η, T_11, T_12, T_22).
follows a discrete phase-type distribution with initial distribution π and sub-transition matrix T, and we write Y ∼ DPH(π, T). The density f_Y and cumulative distribution function F_Y of Y ∼ DPH(π, T) are explicit and given in terms of matrix functions as

f_Y(y) = π T^{y−1} t, F_Y(y) = 1 − π T^y e, y ∈ N.
Proposition 3.6 (fMDPH ⊂ MDPH*). The fMDPH class is contained in the MDPH* class.

Proof. It suffices to see that an MDPH* representation of an element in fMDPH is given by

Fig. 3.2. Histograms of margins of the simulated sample versus marginal densities of the fitted fMDPH and mDPH models.

2) E-step: Compute the statistics

E[B_k(x_i) | Y = y_i, X = x_i] = π_k(x_i) e_k^⊤ T^{y_i−1} t / ( π(x_i) T^{y_i−1} t ), i = 1, ..., N,

E[N_kl | Y = y, x] = ∑_{i=1}^N 1{y_i ≥ 2} t_kl e_l^⊤ K(y_i; π(x_i), T) e_k / ( π(x_i) T^{y_i−1} t ),

E[N_k | Y = y, x] = ∑_{i=1}^N t_k π(x_i) T^{y_i−1} e_k / ( π(x_i) T^{y_i−1} t ).

3) M-step: Let

t̂_kl = E[N_kl | Y = y, x] / ( ∑_{s=1}^p E[N_ks | Y = y, x] + E[N_k | Y = y, x] ), k, l = 1, ..., p,

t̂_k = E[N_k | Y = y, x] / ( ∑_{s=1}^p E[N_ks | Y = y, x] + E[N_k | Y = y, x] ), k = 1, ..., p.

4) R-step: Maximize the weighted multinomial logistic regression ∑_{i=1}^N ∑_{k=1}^p E[B_k(x_i) | Y = y_i, X = x_i] log π_k(x_i; α) over α, and set

π̂_k(x_i) = π_k(x_i; α̂) = exp(x_i^⊤ α̂_k) / ∑_{s=1}^p exp(x_i^⊤ α̂_s), i = 1, ..., N, k = 1, ..., p.

5) Update the current parameters to (α, T) = (α̂, T̂). Return to step 1 unless a stopping rule is satisfied.

Output: Fitted representation (α̂, T̂).

Algorithm 4 EM algorithm for fMDPH-MoE (softmax parametrization).
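The M-step estimators normalize each row so that transition and exit probabilities again sum to one. A quick sketch with hypothetical expected counts (standing in for the E-step output):

```python
import numpy as np

# Hypothetical E-step output for p = 2 states:
# expected transition counts E[N_kl] and expected exit counts E[N_k].
EN = np.array([[10.0, 4.0],
               [2.0, 12.0]])
ENk = np.array([6.0, 6.0])

row_tot = EN.sum(axis=1) + ENk     # M-step denominator per state k
T_hat = EN / row_tot[:, None]      # t_hat_kl
t_hat = ENk / row_tot              # t_hat_k

# Each row of [T_hat | t_hat] is a probability vector.
rows = T_hat.sum(axis=1) + t_hat
```

This guarantees that the updated (T̂, t̂) is a valid sub-transition matrix and exit vector at every iteration.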
{1, ..., p} into more subsets, say E_1, ..., E_d for some d > 1. We denote by C_1, ..., C_d the sub-transition matrices describing the movement of the Markov chain within the states in E_1, ..., E_d, respectively. Furthermore, we assume that the Markov chain starts in a state in E_1 and can only move to a state in E_{j+1} if it is currently in a state in E_j. Here, we understand E_{d+1} as {p + 1}. These movements are described by the non-negative matrices D_1, ..., D_{d−1} satisfying C_j e + D_j e = e, j = 1, ..., d − 1. Thus, the initial probabilities and sub-transition matrix of the underlying Markov chain are of the form π = (π_1, 0, ..., 0) and block bidiagonal T, with C_1, ..., C_d on the diagonal and D_1, ..., D_{d−1} on the superdiagonal. In this way, we obtain a random vector Y = (Y^(1), ..., Y^(d)), where Y^(j) is the time spent in E_j, with corresponding joint probability mass function P(Y = y), y = (y^(1), ..., y^(d)). Then, for B_k we have that

Table 5.1. Summary of LGPIF data per entity type.

Table 5.2. Comparison between empirical values and expected values for buildings and contents frequency.