Generalized Poisson Difference Autoregressive Processes

This paper introduces a new stochastic process with values in the set Z of integers with sign. The increments of the process follow a Poisson difference distribution and the dynamics has an autoregressive structure. We study the properties of the process and exploit the thinning representation to derive stationarity conditions and the stationary distribution of the process. We provide a Bayesian inference method and an efficient posterior approximation procedure based on Monte Carlo. Numerical illustrations on both simulated and real data show the effectiveness of the proposed inference.


Introduction
In many real-world applications, time series of counts are commonly observed given the discrete nature of the variables of interest. Integer-valued variables appear very frequently in many fields, such as medicine (see Cardinal et al. (1999)), epidemiology (see Zeger (1988) and Davis et al. (1999)), finance (see Liesenfeld et al. (2006) and Rydberg and Shephard (2003)), economics (see Freeland (1998) and Freeland and McCabe (2004)), social sciences (see Pedeli and Karlis (2011)), sports (see Shahtahmassebi and Moyeed (2016)) and oceanography (see Cunha et al. (2018)). In this paper, we build on Poisson models, one of the most widely used classes of models for count data, and propose a new model for integer-valued data with sign based on the generalized Poisson difference (GPD) distribution. An advantage of using this distribution lies in the possibility of accounting for overdispersed data with more flexibility than the standard Poisson difference distribution, a.k.a. the Skellam distribution. Despite its flexibility, the GPD model has not yet been investigated and applied in many fields. Shahtahmassebi and Moyeed (2014) proposed a GPD distribution obtained as the difference of two underlying generalized Poisson (GP) distributions with different intensity parameters. They showed that this distribution is a special case of the GPD of Consul (1986) and studied its properties. They provided a Bayesian framework for inference on the GPD and a zero-inflated version of the distribution to deal with the excess of zeros in the data. Shahtahmassebi and Moyeed (2016) showed empirically that the GPD can perform better than the Skellam model.
Regarding the construction method, two main classes of models can be identified in the literature: parameter-driven and observation-driven. In parameter-driven models the parameters are functions of an unobserved stochastic process, and the observations are independent conditionally on the latent variable. In observation-driven models the parameter dynamics is a function of the past observations. Since this paper focuses on the observation-driven approach, we refer the reader to MacDonald and Zucchini (1997) for a review of parameter-driven models.
Thinning operators are a key ingredient in the analysis of observation-driven models. The most widely used thinning operator is the binomial thinning, introduced by Steutel and van Harn (1979) for the definition of self-decomposable distributions for positive integer-valued random variables. In mathematical biology, binomial thinning can be interpreted as natural selection or reproduction, and in probability it is widely applied to study integer-valued processes. Binomial thinning has been generalized along different directions. Latour (1998) proposed a generalized binomial thinning where individuals can reproduce more than once. Kim and Park (2008) introduced the signed binomial thinning, in order to allow for negative values. Joe (1996) and Zheng et al. (2007) introduced the random coefficient thinning to account for external factors that may affect the coefficient of the thinning operation, such as unobservable environmental factors or states of the economy. When the coefficient follows a beta distribution, one obtains the beta-binomial thinning (McKenzie (1985), McKenzie (1986) and Joe (1996)). Al-Osh and Aly (1992) proposed the iterated thinning, which can be used when the process has negative-binomial marginals. Alzaid and Al-Osh (1993) introduced the quasi-binomial thinning, which is more suitable for generalized Poisson processes. Zhang et al. (2010) introduced the signed generalized power series thinning operator as a generalization of Kim and Park (2008), and Chen et al. (2016) proposed a Bayesian autoregressive conditional negative binomial model. In this paper, we develop a Bayesian inference procedure for the proposed GPD-INGARCH process and a Markov chain Monte Carlo (MCMC) procedure for posterior approximation. One of the advantages of the Bayesian approach is that extra-sample information on the parameter values can be included in the estimation process through the prior distributions.
Moreover, it can easily be combined with a data augmentation strategy to make the likelihood function more tractable. We apply our model to a cyber-threat dataset and contribute to the cyber-risk literature by providing evidence of temporal patterns in the mean and variance of the threats, which can be used to predict threat arrivals. Cyber threats are increasingly considered a top global risk for the financial and insurance sectors and for the economy as a whole (e.g. EIOPA, 2019). As pointed out in Hassanien et al. (2016), the frequency of cyber events has increased substantially in the past few years, and cyberattacks occur on a daily basis. Understanding cyber-threat dynamics and their impact is critical to ensure effective controls and risk mitigation tools. Despite this evidence and the relevance of the topic, research on the analysis of cyber threats is scarce and scattered across different research areas, such as cyber security (Agrafiotis et al., 2018), criminology (Brenner, 2004), economics (Anderson and Moore, 2006) and sociology. In statistics there are a few works on modelling and forecasting cyber-attacks. Xu et al. (2017) introduced a copula model to predict the effectiveness of cyber-security. Werner et al. (2017) used an autoregressive integrated moving average model to forecast the number of daily cyber-attacks. Edwards et al. (2015) applied Bayesian Poisson and negative binomial models to analyse data breaches and found evidence of over-dispersion and absence of time trends in the number of breaches. See Husák et al. (2018) for a review of the modelling of cyber threats.
The paper is organized as follows. Section 2 introduces the parametrization used for the GPD and defines the GPD-INGARCH process. Section 3 studies the properties of the process. Section 4 presents a Bayesian inference procedure. Sections 5 and 6 provide illustrations on simulated and real data, respectively. Section 7 concludes.

Generalized Poisson Difference INGARCH
A random variable X follows a Generalized Poisson (GP) distribution if and only if its probability mass function (pmf) is

P(X = x) = θ(θ + xλ)^(x−1) e^(−θ−xλ) / x!, x = 0, 1, 2, . . . ,

with parameters θ > 0 and 0 ≤ λ < 1 (see Consul, 1986). We denote this distribution with GP(θ, λ). Let X ∼ GP(θ_1, λ) and Y ∼ GP(θ_2, λ) be two independent GP random variables. Consul (1986) showed that Z = X − Y follows a Generalized Poisson Difference (GPD) distribution with pmf:

P(Z = z) = e^(−θ_1−θ_2−zλ) Σ_{y=max(0,−z)}^∞ θ_1 θ_2 (θ_1 + (z + y)λ)^(z+y−1) (θ_2 + yλ)^(y−1) e^(−2yλ) / ((z + y)! y!), (2)

where z takes integer values in the interval (−∞, +∞), 0 < λ < 1 and θ_1, θ_2 > 0 are the parameters of the distribution. See Appendix A.3 for a more general definition of the GPD with possibly negative λ.
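A GPD variate can be simulated directly from this construction, as the difference of two independent GP draws. Below is a minimal sketch (the helper names are ours); the GP sampler uses the standard branching representation of the GP as the total progeny of a process with Poisson(θ) ancestors and Poisson(λ) offspring per individual:

```python
import math
import random

def rpois(mu, rng):
    """Poisson draw via Knuth's multiplication method (fine for small mu)."""
    L = math.exp(-mu)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= L:
            return k
        k += 1

def rgp(theta, lam, rng):
    """GP(theta, lam) draw, 0 <= lam < 1, via the branching representation:
    Poisson(theta) ancestors, each individual has Poisson(lam) offspring;
    the total progeny follows GP(theta, lam)."""
    total = new = rpois(theta, rng)
    while new > 0:
        new = sum(rpois(lam, rng) for _ in range(new))
        total += new
    return total

def rgpd(theta1, theta2, lam, rng):
    """GPD draw as the difference of two independent GP variates."""
    return rgp(theta1, lam, rng) - rgp(theta2, lam, rng)

# sanity check: the mean of Z = X - Y is (theta1 - theta2)/(1 - lam)
rng = random.Random(42)
draws = [rgpd(6.0, 4.0, 0.3, rng) for _ in range(20000)]
print(sum(draws) / len(draws))  # ≈ 2/0.7 ≈ 2.857
```

The branching representation is the basis of the sampling method of Famoye (1997) used later for the simulation experiments.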
In the following lemma we state the convolution property of the GPD distribution, which will be used throughout this paper. Appendix B.1 provides an original proof of this result.
We use an equivalent pmf and a re-parametrization of the GPD, which are better suited for the definition of an INGARCH model. A random variable Z follows a GPD if and only if its pmf is the one in Eq. 2 with θ_1 = (σ² + µ)/2 and θ_2 = (σ² − µ)/2 (Eq. 3). We denote this distribution with GPD(µ, σ², λ).
Remark 1. The probability distribution in Eq. 3 is equivalent to the one in Eq. 2 up to the reparametrization µ = θ 1 − θ 2 and σ 2 = θ 1 + θ 2 . See Appendix B for a proof.
The mean, variance, skewness and kurtosis of a GPD random variable can be obtained in closed form by exploiting the representation of the GPD as a difference between independent GP random variables.
Remark 2. Let Z ∼ GPD(µ, σ², λ); then the mean and variance are

E[Z] = µ/(1 − λ), V(Z) = σ²/(1 − λ)³,

and the Pearson skewness and kurtosis are

S(Z) = µ(1 + 2λ) / ((σ²)^(3/2) (1 − λ)^(1/2)), K(Z) = 3 + (1 + 8λ + 6λ²)/(σ²(1 − λ)).

See Appendix B for a proof.

[Figure 1: (a) the GPD(µ, σ², λ) distribution for σ² = 10; (b) the GPD(µ, σ², λ) distribution for µ = 2.]

Figure 1 shows the sensitivity of the probability distribution with respect to the location parameter µ (panel a), the scale parameter σ² (panel b) and the skewness parameter λ (different lines in each plot). For given values of λ and µ, when σ² increases the dispersion of the GPD increases (dotted and dashed lines in the right plot). For given values of λ and σ², the distribution is right-skewed for µ = 8, which corresponds to S(Z) = 0.7155, and left-skewed for µ = −4, which corresponds to S(Z) = −0.3578 (dotted and dashed lines in the left plot). See Appendix A.3 for further numerical illustrations.
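The GP cumulants κ₂ = θ/(1−λ)³ and κ₃ = θ(1+2λ)/(1−λ)⁵ imply, for Z = X − Y, the skewness S(Z) = µ(1+2λ)/((σ²)^(3/2)√(1−λ)). The snippet below reproduces the two skewness values quoted for Figure 1; the value λ = 0.5 is our assumption about the figure's setting, since λ is not stated in this excerpt:

```python
import math

def gpd_skewness(mu, sigma2, lam):
    # S(Z) = mu * (1 + 2*lam) / (sigma2**1.5 * sqrt(1 - lam)),
    # obtained from the GP cumulants of X and Y with Z = X - Y
    return mu * (1 + 2 * lam) / (sigma2 ** 1.5 * math.sqrt(1 - lam))

s_right = gpd_skewness(8, 10, 0.5)   # mu = 8,  sigma^2 = 10, assumed lam = 0.5
s_left = gpd_skewness(-4, 10, 0.5)   # mu = -4, sigma^2 = 10, assumed lam = 0.5
print(round(s_right, 4), round(s_left, 4))  # 0.7155 -0.3578
```

That the formula reproduces both quoted values exactly supports the stated closed form.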
Differently from the usual GARCH(p, q) process (e.g., see Francq and Zakoian (2019)), the INGARCH(p, q) is defined as an integer-valued process {Z_t}_{t∈Z}. Let F_{t−1} be the σ-field generated by {Z_{t−j}}_{j≥1}; then the GPD-INGARCH(p, q) is defined by

Z_t | F_{t−1} ∼ GPD(µ_t, σ²_t, λ), µ_t = α_0 + Σ_{i=1}^p α_i Z_{t−i} + Σ_{j=1}^q β_j µ_{t−j},

with α_i ≥ 0, β_j ≥ 0, i = 1, . . . , p, p ≥ 1, j = 1, . . . , q, q ≥ 0. For q = 0 the model reduces to a GPD-INARCH(p) and for λ = 0 one obtains a Skellam INGARCH(p, q), which extends to Poisson differences the Poisson INGARCH(p, q) of Ferland et al. (2006). From the properties of the GPD, the conditional mean µ̄_t = E(Z_t|F_{t−1}) and variance σ̄²_t = V(Z_t|F_{t−1}) of the process are

µ̄_t = µ_t/(1 − λ) and σ̄²_t = σ²_t/(1 − λ)³,

respectively.
Following the parametrization defined in Remark 1, we need to impose the constraint φ > (1 − λ)^(−2) in order to have a well-defined GPD distribution. In Fig. 2, we provide some simulated examples of the GPD-INGARCH(1, 1) process for different values of the parameters α_0, α_1 and β_1.
Simulations from a GPD-INGARCH are obtained as differences of GP sequences. Each random sequence is generated by the branching method of Famoye (1997), which performs faster than the inversion method for large values of θ_1t and θ_2t. We considered two parameter settings: low persistence, that is α_1 + β_1 much less than 1 (first column in Fig. 2), and high persistence, that is α_1 + β_1 close to 1 (second column in Fig. 2). The first and second rows show paths for positive and negative values of the intercept α_0, respectively. The last row illustrates the effect of λ on the trajectories with respect to the baselines in Panels (a) and (b). Comparing (I.a) and (I.b) in Fig. 3, one can see that increasing β_1 increases the serial correlation and the kurtosis level (compare (II.a) and (II.b)).
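A simulation of a GPD-INGARCH(1,1) path can be sketched along these lines. The sketch ties the conditional variance to the conditional mean through σ²_t = max(φ|µ_t|, 0.5); this specification and the 0.5 floor (which avoids degenerate GP parameters when µ_t ≈ 0) are simplifying assumptions of ours, made only to keep the example self-contained:

```python
import math
import random

def rpois(mu, rng):
    # Knuth's multiplication method
    L = math.exp(-mu)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= L:
            return k
        k += 1

def rgp(theta, lam, rng):
    # branching sampler for GP(theta, lam), in the spirit of Famoye (1997)
    total = new = rpois(theta, rng)
    while new > 0:
        new = sum(rpois(lam, rng) for _ in range(new))
        total += new
    return total

def simulate_gpd_ingarch11(a0, a1, b1, lam, phi, T, seed=1):
    """GPD-INGARCH(1,1): mu_t = a0 + a1*Z_{t-1} + b1*mu_{t-1};
    Z_t is a difference of GP draws with theta_1t - theta_2t = mu_t."""
    rng = random.Random(seed)
    z, mu, path = 0, a0, []
    for _ in range(T):
        mu = a0 + a1 * z + b1 * mu
        sigma2 = max(phi * abs(mu), 0.5)          # assumed variance specification
        th1, th2 = (sigma2 + mu) / 2, (sigma2 - mu) / 2
        z = rgp(th1, lam, rng) - rgp(th2, lam, rng)
        path.append(z)
    return path

# low-persistence setting: a1/(1-lam) + b1 = 0.25/0.6 + 0.23 < 1
path = simulate_gpd_ingarch11(a0=1.0, a1=0.25, b1=0.23, lam=0.4, phi=2.0, T=3000)
```

Setting α_1 + β_1 close to the stability boundary reproduces the high-persistence trajectories described above.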
We provide a necessary condition on the parameters α_i and β_j that ensures that a second-order stationary process has an INGARCH representation. First define the two following polynomials: D(B) = 1 − β_1 B − · · · − β_q B^q and G(B) = α_1 B + · · · + α_p B^p, where B is the backshift operator. Assume the roots of D(z) lie outside the unit circle. For non-negative β_j this is equivalent to assuming D(1) = 1 − Σ_{j=1}^q β_j > 0, that is Σ_{j=1}^q β_j < 1. Then the operator D(B) has inverse D^(−1)(B) and it is possible to write

µ_t = α_0/D(1) + H(B)Z_t,

where H(B) = G(B)D^(−1)(B) = Σ_{j=1}^∞ ψ_j B^j and the ψ_j are given by the power expansion of the rational function G(z)/D(z) in a neighbourhood of zero. If we denote K(B) = D(B) − G(B), we can write the necessary condition as in the following proposition.
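The coefficients ψ_j can be computed from the recursion implied by matching powers of B in D(B)H(B) = G(B); a small helper of our own making:

```python
def psi_coeffs(alpha, beta, n):
    """First n coefficients psi_1..psi_n of H(B) = G(B)/D(B), where
    G(B) = alpha[0]*B + ... + alpha[p-1]*B^p and
    D(B) = 1 - beta[0]*B - ... - beta[q-1]*B^q.
    Matching powers of B in D(B)*H(B) = G(B) gives
    psi_j = alpha_j + sum_k beta_k * psi_{j-k}."""
    psi = [0.0] * (n + 1)          # psi[0] = 0 by convention
    for j in range(1, n + 1):
        a_j = alpha[j - 1] if j <= len(alpha) else 0.0
        psi[j] = a_j + sum(beta[k - 1] * psi[j - k]
                           for k in range(1, min(j, len(beta)) + 1))
    return psi[1:]

# INGARCH(1,1): the expansion gives the geometric weights psi_j = alpha_1 * beta_1**(j-1)
print(psi_coeffs([0.3], [0.5], 4))  # [0.3, 0.15, 0.075, 0.0375]
```

For the INGARCH(1,1) case the output matches the known geometric form of the weights, which is a convenient check of the recursion.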

Properties of the GPD-INGARCH
We study the properties of the process by exploiting a suitable thinning representation, following the strategy in Ferland et al. (2006) and Zhu (2012) for the Poisson and Generalized Poisson INGARCH, respectively. We use the quasi-binomial thinning operator as defined in Weiß (2008) and the thinning difference operator of Cunha et al. (2018).
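Before turning to the representation, it may help to recall the simplest thinning operator, the binomial thinning of Steutel and van Harn (1979): α ∘ X = Σ_{i=1}^X B_i with B_i i.i.d. Bernoulli(α), so that E[α ∘ X] = αE[X]. A minimal sketch of the plain binomial thinning, for intuition only (the quasi-binomial thinning used below assigns non-identical survival probabilities):

```python
import random

def binomial_thinning(alpha, x, rng):
    """alpha ∘ x = number of 'survivors' among x individuals,
    each kept independently with probability alpha."""
    return sum(rng.random() < alpha for _ in range(x))

rng = random.Random(7)
# E[alpha ∘ X] = alpha * X: thinning 100 with alpha = 0.5 averages about 50
vals = [binomial_thinning(0.5, 100, rng) for _ in range(2000)]
print(sum(vals) / len(vals))  # ≈ 50
```

The "natural selection" reading from the introduction is visible here: each of the x individuals survives independently with probability α.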

Thinning representation
We show that the INGARCH process can be obtained as a limit of successive approximations. Let us define the sequences {X_t^(n)} and {Y_t^(n)}, where {U_1t}_{t∈Z} and {U_2t}_{t∈Z} are sequences of independent GP random variables and, for each t ∈ Z and i ∈ N, {V_1t,i,j}_{j∈N} and {V_2t,i,j}_{j∈N} represent two sequences of independent integer-valued random variables. Moreover, assume that all the variables U_s and V_t,i,j, with s ∈ Z, t ∈ Z, i ∈ N and j ∈ N, are mutually independent. It is possible to show that X_t^(n) and Y_t^(n) have a thinning representation. We define a suitable thinning operation, first used by Alzaid and Al-Osh (1993) and, following the notation in Weiß (2008), let ρ_θ,λ • be the quasi-binomial thinning operator, such that ρ_θ,λ • x follows a QB(ρ, θ/λ, x) distribution.
The processes defined in Eq. 10 and 11 admit the representation given below, where ϕ • X is the quasi-binomial thinning operation. See Appendix A for a definition.
In the following we introduce the thinning difference operator and show that Z_t^(n) has a thinning representation.
See Cunha et al. (2018) for an application of the thinning operation to GPD-INAR processes and Appendix A.4 for further details. Using the new operator as defined in Eq. 14, we can represent Z_t^(n) as follows.
The process Z_t^(n) has the representation given below, where {V_τ,i} indicates the sequence of random variables with mean ψ_i/(1 − λ) involved in the thinning operator at time τ, and {U_t}_{t∈Z} is a sequence of independent GPD random variables with mean ψ_0/(1 − λ), where ψ_0 = α_0/D(1).

Proof. See Appendix B.
The proposition above shows that Z_t^(n) is obtained through a cascade of thinning operations along the sequence {U_t}_{t∈Z}. Since Z_t^(n) is a finite weighted sum of independent GPD random variables, the expected value and the variance of Z_t^(n) are well defined. Moreover, it can be seen that E[Z_t^(n)] does not depend on t but only on n; hence it can be denoted by µ_n. Using Proposition 3 and µ_k = 0 for k < 0, it is possible to write µ_n recursively. From the last equation it can be seen that the sequence {µ_n} satisfies a finite difference equation with constant coefficients. The characteristic polynomial is K(z) and all its roots lie outside the unit circle if K(1) > 0. Under the assumption K(1) > 0, the following holds true.
Proposition. The sequence {Z_t^(n)}_{n∈N} has an almost sure limit.
Proof. See Appendix B.
Proposition. The sequence {Z_t^(n)}_{n∈N} has a mean-square limit.
Proof. See Appendix B.

Stationarity
Given Proposition 4, if we can show that {Z_t^(n)} is a strictly stationary process for any given n, then its almost sure limit {Z_t}_{t∈Z} will also be strictly stationary. In order to show stationarity for {Z_t^(n)}, we follow a procedure similar to the one in Ferland et al. (2006). Let us define the probability generating function of a k-dimensional random vector W as

g_W(t) = Σ_w p(w) t_1^(w_1) · · · t_k^(w_k),

where p(w) = Pr(W = (w_1, . . . , w_k)′) and t = (t_1, . . . , t_k) ∈ C^k. The probability generating function has the following properties.
Proof. See Appendix B. Using the probability generating function, in the following we show the stationarity of the process.
Proposition. The process {Z_t^(n)}_{t∈Z} is a strictly stationary process, for any fixed value of n.
Proof. Let k and h be two positive integers. As pointed out by Ferland et al. (2006), Brockwell et al. (1991) show that to prove strict stationarity we only need to show that the vectors (Z_(1+h)^(n), . . . , Z_(k+h)^(n)) and (Z_1^(n), . . . , Z_k^(n)) have the same joint distribution, and we can rewrite both vectors in Eq. 19 in terms of the underlying sequences. To show that the two vectors have the same probability generating function, we first write the pgfs of X, Y and Z as shown above.
The conditional distributions of the two vectors, given U_(t−n,...,t+k) and U_(t−n+h,...,t+k+h) respectively, are the same. Accordingly, it is possible to equate the pgfs and claim that Z_(1+h,...,k+h)^(n) and Z_(1,...,k)^(n) have the same joint distribution.

Proposition 8. The process {Z_t}_{t∈Z} is a strictly stationary process.
Proposition 9. The first two moments of {Z t } t∈Z are finite.
Proof. See Appendix B.

Conditional law of {Z_t^(n)}
To verify that the distributional properties of the sequence are satisfied, we follow the same arguments as in Ferland et al. (2006), adjusted for our sequence. Recalling that Z_t = X_t − Y_t, for a fixed t we can consider three sequences, including {r_t^(n)}. As claimed by Ferland et al. (2006), there is a subsequence {n_k} such that r_t^(n_k) converges almost surely to Z_t. We know that the first term in both Eq. 29 and 30 goes to zero. Therefore, we can write the decomposition in Eq. 31 and, as before, (Z_t − Z_t^(n)) goes to zero since we have proven almost sure convergence.
We now have to show that the second term in the last line of Eq. 31 goes to zero; for this purpose we need to find a sequence that converges almost surely to zero, and it is more convenient to rewrite the previous sequence in the form given above. Ferland et al. (2006) show the convergence of the corresponding terms, so we can conclude that the same holds here. Equation 34 implies that W_t^(n) converges to zero in L¹; therefore there exists a subsequence W_t^(n_k) converging almost surely to the same limit. From this it follows directly that the distributional properties of X_t are satisfied.
Since r_t^(n_k) converges almost surely, a similar argument applies to Y_t, and we conclude that the conditional law of Z_t is the stated one.

Moments of the GPD-INGARCH
The conditional mean and variance of the process Z_t are

E(Z_t|F_(t−1)) = φ µ_t and V(Z_t|F_(t−1)) = φ³ σ²_t,

where φ = 1/(1 − λ). The unconditional mean and variance of the process follow by taking expectations of these expressions. From Th. 1 in Weiß (2009) we know a set of equations from which the variance and the autocorrelation function of the process can be obtained. In order to have explicit expressions for the variances of µ_t and Z_t and for the autocorrelations, we consider two special cases, as in Zhu (2012) and Weiß (2009). For a proof of the results in these examples, see Section B.3.

Example 1 (INARCH(1)). Consider the INARCH(1) model; the linear equations in Eq. 39 then become a small linear system, whose second equation comes from Example 2 in Weiß (2009), and from it we derive the autocovariances.

Example 2 (INGARCH(1,1)). Consider the INGARCH(1,1) model. From Eq. 39 we can determine V(µ_t): the second equation in Eq. 49 is equal to V(µ_t), and from this latter equation we can derive the expression for V(µ_t). Combining Eq. 38 and 50, we can derive a closed-form expression for the variance of Z_t, where φ = 1/(1 − λ). The autocorrelations are given by the corresponding expressions in Section B.3.
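These moment formulas can be sanity-checked by simulation in the Poisson special case of Ferland et al. (2006), for which the INARCH(1) model Z_t|F_(t−1) ∼ Poisson(λ_t), λ_t = α_0 + α_1 Z_(t−1) has unconditional mean α_0/(1 − α_1) and autocorrelation ρ(h) = α_1^h; a quick check (helper names are ours):

```python
import math
import random

def rpois(mu, rng):
    # Knuth's multiplication method
    L = math.exp(-mu)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= L:
            return k
        k += 1

def simulate_inarch1(a0, a1, T, seed=3):
    """Poisson INARCH(1): Z_t | F_{t-1} ~ Poisson(a0 + a1 * Z_{t-1})."""
    rng = random.Random(seed)
    z, path = 0, []
    for _ in range(T):
        z = rpois(a0 + a1 * z, rng)
        path.append(z)
    return path

def acf(x, h):
    n, m = len(x), sum(x) / len(x)
    num = sum((x[i] - m) * (x[i + h] - m) for i in range(n - h))
    den = sum((xi - m) ** 2 for xi in x)
    return num / den

path = simulate_inarch1(a0=2.0, a1=0.5, T=50000)
print(sum(path) / len(path))  # ≈ 2.0/(1 - 0.5) = 4.0
print(acf(path, 1))           # ≈ 0.5
```

The GPD examples above follow the same logic, with the Poisson conditional law replaced by the GPD.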

Bayesian Inference
We propose a Bayesian approach to inference for the GPD-INGARCH, which allows the researcher to include extra-sample information through the prior choice, and allows us to exploit the stochastic representation of the GPD and latent variables to make the likelihood function more tractable.

Data augmentation
Denote the probability distribution of Z_t by f(Z_t|θ). Since the posterior distribution is not analytically tractable, we apply Markov chain Monte Carlo (MCMC) for posterior approximation, in combination with a data-augmentation approach (Tanner and Wong, 1987). See Robert and Casella (2013) for an introduction to MCMC. As in Karlis and Ntzoufras (2006), we exploit the stochastic representation in Eq. 8 and introduce two GP latent variables X_t and Y_t with pmfs f(X_t|θ_1t, λ) and f(Y_t|θ_2t, λ). Let Z_(1:T) = (Z_1, . . . , Z_T), X_(1:T) = (X_1, . . . , X_T) and Y_(1:T) = (Y_1, . . . , Y_T). The complete-data likelihood becomes

f(Z_(1:T), X_(1:T), Y_(1:T)|θ) = Π_{t=1}^T δ(Z_t − (X_t − Y_t)) f(X_t|θ_1t, λ) f(Y_t|θ_2t, λ),

where δ(z − c) is the Dirac function which takes value 1 if z = c and 0 otherwise. The joint posterior distribution of the parameters θ and the two collections of latent variables X_(1:T) and Y_(1:T) is π(X_(1:T), Y_(1:T), θ|Z_(1:T)) ∝ f(Z_(1:T), X_(1:T), Y_(1:T)|θ)π(θ).
The full conditional distribution of the latent variables is proportional to the terms of the complete-data likelihood involving X_t. We draw from the full conditional distribution by Metropolis-Hastings (MH). Differently from Karlis and Ntzoufras (2006), we use a mixture proposal distribution, which allows for a better mixing of the MCMC chain. At the j-th iteration, we generate a candidate X*_t from GP(θ_1t, λ) with probability ν and (X*_t − Z_t) from GP(θ_2t, λ) with probability 1 − ν, and accept with probability

min{1, [π(X*_t) q(X_t^(j−1))] / [π(X_t^(j−1)) q(X*_t)]},

where π(X_t) ∝ f(X_t|θ_1t, λ)f(X_t − Z_t|θ_2t, λ) is the full conditional, q(X_t) = νf(X_t|θ_1t, λ) + (1 − ν)f(X_t − Z_t|θ_2t, λ), and X_t^(j−1) is the value of the latent variable X_t at the (j − 1)-th iteration. The method extends to the GPD the technique proposed in Karlis and Ntzoufras (2006) for Poisson differences.
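One MH update for a latent X_t can be sketched as follows; the target is the full conditional π(x) ∝ f(x|θ_1t, λ) f(x − Z_t|θ_2t, λ) implied by the data augmentation, and the proposal is the mixture described above (function names, the inversion sampler and its truncation point are our own choices):

```python
import math
import random

def gp_pmf(x, theta, lam):
    """GP(theta, lam) pmf; zero outside the support."""
    if x < 0:
        return 0.0
    return theta * (theta + lam * x) ** (x - 1) * math.exp(-theta - lam * x) / math.factorial(x)

def mh_step_latent_x(x_old, z, th1, th2, lam, nu, rng):
    """Mixture proposal: with prob nu draw X* ~ GP(th1, lam),
    else X* = z + W with W ~ GP(th2, lam); accept with the usual MH ratio."""
    def rgp_inv(theta):
        # inversion sampling over a truncated support (our simplification)
        u, c, x = rng.random(), 0.0, 0
        while True:
            c += gp_pmf(x, theta, lam)
            if u <= c or x >= 100:
                return x
            x += 1
    x_new = rgp_inv(th1) if rng.random() < nu else z + rgp_inv(th2)
    target = lambda x: gp_pmf(x, th1, lam) * gp_pmf(x - z, th2, lam)
    q = lambda x: nu * gp_pmf(x, th1, lam) + (1 - nu) * gp_pmf(x - z, th2, lam)
    num, den = target(x_new) * q(x_old), target(x_old) * q(x_new)
    if den == 0.0 or rng.random() < min(1.0, num / den):
        return x_new
    return x_old

rng = random.Random(11)
z = 2                 # observed difference Z_t
x = max(0, z)         # initialize X_t in its support
chain = []
for _ in range(2000):
    x = mh_step_latent_x(x, z, th1=6.0, th2=4.0, lam=0.3, nu=0.5, rng=rng)
    chain.append(x)
```

Because the proposal mixes the two natural candidate distributions, the chain moves easily between the modes induced by X_t and X_t − Z_t, which is the motivation given above for the mixture proposal.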

Simulation study
The purpose of our simulation exercises is to study the efficiency of the MCMC algorithm presented in Section 4. We evaluated the Geweke (1992) convergence diagnostic (CD), the inefficiency factor (INEFF)¹ and the effective sample size (ESS). We simulated 50 independent data series of 400 observations each. We ran the Gibbs sampler for 1,010,000 iterations on each dataset, discarded the first 10,000 draws to remove dependence on the initial conditions, and finally applied a thinning procedure with a factor of 250, to reduce the dependence between consecutive draws. As is common in the GARCH and stochastic volatility literature (e.g., see Chib et al., 2002; Casarin et al., 2009; Billio et al., 2016; Bormetti et al., 2019, and references therein), we test the efficiency of the algorithm in two different settings: low persistence and high persistence. The true values of the parameters are α = 0.25, β = 0.23, λ = 0.4 in the low persistence setting, and α = 0.53, β = 0.25, λ = 0.6 in the high persistence setting. Table 1 shows, for the parameters α, β and λ, the INEFF, ESS and ACF averaged over the 50 replications, before (BT subscript) and after thinning (AT subscript).

¹The inefficiency factor is defined as INEFF = 1 + 2 Σ_k ρ(k), where ρ(k) is the sample autocorrelation at lag k for the parameter of interest; it is computed to measure how well the MCMC chain mixes. An INEFF equal to n tells us that we need to draw n times as many MCMC samples as uncorrelated samples.
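The two mixing diagnostics can be computed from a chain as follows (the cutoff K for the autocorrelation sum is our choice; in practice a windowed or tapered estimator is often preferred):

```python
import random

def autocorr(x, k):
    """Sample autocorrelation of the chain x at lag k."""
    n, m = len(x), sum(x) / len(x)
    num = sum((x[i] - m) * (x[i + k] - m) for i in range(n - k))
    den = sum((xi - m) ** 2 for xi in x)
    return num / den

def ineff(x, K=50):
    """Inefficiency factor 1 + 2 * sum_k rho(k); ESS = n / INEFF."""
    return 1.0 + 2.0 * sum(autocorr(x, k) for k in range(1, K + 1))

rng = random.Random(5)
iid = [rng.random() for _ in range(5000)]        # well-mixed "chain"
ar = [0.0]
for _ in range(4999):
    ar.append(0.9 * ar[-1] + rng.gauss(0, 1))    # persistent AR(1) chain

print(ineff(iid))   # ≈ 1 for an i.i.d. chain
print(ineff(ar))    # large for a sticky chain, so ESS = n/INEFF is small
```

Thinning a sticky chain lowers the lagged autocorrelations and hence the INEFF, which is exactly the effect reported in Table 1.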
The thinning procedure is effective in reducing the autocorrelation levels and in increasing the ESS, especially in the high persistence setting. The p-values of the CD statistics indicate that the null hypothesis that two sub-samples of the MCMC draws have the same distribution is accepted. The efficiency of the MCMC thus improves after thinning.

We then apply the model to a time series of accident counts (Fig. 5), previously considered in Brijs et al. (2008) and Andersson and Karlis (2014). The time series of accident counts is non-stationary and should be differenced (Kim and Park, 2008). We applied our Bayesian estimation procedure, as described in Section 4. Fig. 6 presents the histograms of the Gibbs draws for each parameter, and Table 2 reports the estimates. For these data, we found evidence of high persistence in the expected accident arrivals, i.e. α̂ + β̂ = 0.8673, and of heteroskedastic effects, i.e. β̂ = 0.4753. Also, there is evidence in favour of overdispersion, λ̂ = 0.5892, and of overdispersion persistence, φ̂ = 179.7905. We study the contribution of heteroskedasticity and persistence by testing some restrictions of the INGARCH(1,1) (models M2 to M4 in Tab. 2). Bayesian inference compares models via the so-called Bayes factor, which is the ratio of the normalizing constants of the posterior distributions of two different models (see Cameron et al. (2014) for a review). MCMC methods allow for generating samples from the posterior distributions, which can be used to estimate the ratio of normalizing constants.
In this paper we use the method proposed by Geyer (1994), which derives the normalizing constants by reverse logistic regression. The idea behind this method is to treat the draws as if they were sampled from a mixture of two distributions, with a given probability of being generated from the j-th component of the mixture. Geyer (1994) proposed to estimate the log-Bayes factor κ = η_2 − η_1 by maximizing a quasi-likelihood function, where n is the number of MCMC draws for each model and X_ij denotes the complete-data log-likelihood of the i-th draw under the j-th model.
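The mechanics of the reverse logistic regression estimator can be illustrated on a toy case where the true ratio of normalizing constants is known. Below, h1 and h2 are unnormalized densities with c2 = 2c1, so the maximizer of Geyer's quasi-likelihood should recover κ = log(c1/c2) = −log 2. This is a sketch with a crude grid search, not the paper's implementation:

```python
import math
import random

def h1(x):
    # unnormalized N(0,1): normalizing constant c1 = sqrt(2*pi)
    return math.exp(-0.5 * x * x)

def h2(x):
    # same shape scaled by 2: c2 = 2*sqrt(2*pi), so c1/c2 = 1/2
    return 2.0 * math.exp(-0.5 * x * x)

def quasi_loglik(kappa, s1, s2):
    """Geyer's reverse logistic regression objective for two equal-size samples:
    each draw is classified to component 1 or 2 with probabilities
    proportional to h1(x) and h2(x)*exp(kappa)."""
    ll = 0.0
    for x in s1:
        ll += math.log(h1(x) / (h1(x) + h2(x) * math.exp(kappa)))
    for x in s2:
        ll += math.log(h2(x) * math.exp(kappa) / (h1(x) + h2(x) * math.exp(kappa)))
    return ll

rng = random.Random(9)
s1 = [rng.gauss(0, 1) for _ in range(200)]   # draws from f1
s2 = [rng.gauss(0, 1) for _ in range(200)]   # draws from f2 (same density here)
grid = [i / 200 - 5.0 for i in range(2001)]  # kappa in [-5, 5]
kappa_hat = max(grid, key=lambda k: quasi_loglik(k, s1, s2))
print(kappa_hat)  # ≈ log(c1/c2) = -log 2 ≈ -0.693
```

In practice the objective is concave, so the grid search here would be replaced by a Newton or quasi-Newton step.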

Cyber threats data
According to the Financial Stability Board (FSB, 2018, pp. 8-9), a cyber incident is any observable occurrence in an information system that jeopardizes the cyber security of the system, or violates the security policies and procedures or the use policies. Over the past years there have been several discussions on the taxonomy of incident classification (see, e.g. ENISA, 2018); in this paper we use the classification provided in the Hackmageddon dataset. Hackmageddon is a well-known cyber-incident website that collects public reports and provides the number of cyber incidents for different categories of threats: crimes, espionage and warfare. Figure 7 shows the total and category-specific number of cyber attacks at a daily frequency from January 2017 to December 2018. Albeit limited in the variety of cyber attacks, the dataset covers some relevant cyber events and is one of the few publicly available datasets (Agrafiotis et al., 2018). The daily threat frequencies are between 0 and 12, which motivates the use of a discrete distribution. We remove the upward trend by taking first differences and fit the GPD-INGARCH model proposed in Section 2.
We applied our estimation procedure, as described in Section 4. As in the previous application, we fix α_0 = 1.05, which is coherent with the conditional mean of the time series. We ran the Gibbs sampler for 110,000 iterations and discarded the first 10,000 iterations as burn-in. Fig. 8 presents the histograms of the Gibbs draws for each parameter. It shows that, as before, it is reasonable to fit a GPD-INGARCH process to the differenced series of cyber attacks, since both the autoregressive parameter α and the parameter β, which represents the heteroskedastic feature of the data, are different from zero. Additionally, the value of λ suggests the presence of over-dispersion in the data.
Given the importance of forecasting cyber-attacks, in this section we present the results of a one-step-ahead forecasting exercise over a period of 120 days. We follow an approach based on predictive distributions, which quantifies all the uncertainty associated with the future number of attacks and is used in a wide range of applications (see, e.g. McCabe and Martin, 2005; McCabe et al., 2011, and references therein). We account for parameter uncertainty and approximate the predictive distribution by MCMC. At the j-th MCMC iteration we draw Z_(t+h)^(j) from the model, where j = 1, . . . , J denotes the MCMC draw and h = 1, . . . , H the forecasting horizon.
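The predictive simulation can be sketched as follows. For brevity the sketch uses the Poisson INARCH(1) special case and stand-in "posterior draws" (both our simplifications); the full GPD version only changes the conditional law from which the future count is drawn:

```python
import math
import random

def rpois(mu, rng):
    # Knuth's multiplication method
    L = math.exp(-mu)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= L:
            return k
        k += 1

rng = random.Random(21)
# stand-in posterior draws of (alpha0, alpha1): jitter around central values
posterior = [(2.0 + 0.1 * rng.gauss(0, 1), 0.5 + 0.02 * rng.gauss(0, 1))
             for _ in range(4000)]

z_last = 5                        # last observed count
predictive = []
for a0, a1 in posterior:          # one future draw per posterior draw
    lam_next = a0 + a1 * z_last   # one-step-ahead conditional mean
    predictive.append(rpois(lam_next, rng))

# predictive mean ≈ E[a0] + E[a1] * z_last = 2.0 + 0.5 * 5 = 4.5
print(sum(predictive) / len(predictive))
```

Drawing one future observation per posterior draw is what propagates the parameter uncertainty into the predictive distribution.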

Conclusions
We introduce a new family of stochastic processes with values in the set of integers with sign. The increments of the process follow a generalized Poisson difference distribution with time-varying parameters. We assume a GARCH-type dynamics, provide a thinning representation and study the properties of the process. We provide a Bayesian inference procedure and an efficient Markov chain Monte Carlo sampler for posterior approximation. Inference and forecasting exercises on accidents and cyber-threats data show that the proposed GPD-INGARCH model is well suited for capturing persistence in the conditional moments and in the overdispersion feature of the data.

A.1 Poisson Difference distribution
The Poisson difference distribution, a.k.a. the Skellam distribution, is a discrete distribution defined as the difference of two independent Poisson random variables N_1 − N_2, with parameters λ_1 and λ_2. It was introduced by Irwin (1937) and Skellam (1946). The pmf of the Skellam distribution for the difference Z = N_1 − N_2 is

P(Z = z) = e^(−λ_1−λ_2) (λ_1/λ_2)^(z/2) I_(|z|)(2√(λ_1λ_2)), z ∈ Z,

where Z = {. . . , −1, 0, 1, . . .} is the set of integers, and I_k(x) is the modified Bessel function of the first kind, defined as (Abramowitz and Stegun, 1965)

I_k(x) = (x/2)^k Σ_{m=0}^∞ (x²/4)^m / (m!(m + k)!).

It can be used, for example, to model the difference in the number of events, like accidents, between two different cities or years. Moreover, it can be used to model the point spread between teams in sports where all scored points are independent single units. Another application can be found in imaging, since it can describe the statistics of the difference between two images affected by simple shot noise, usually modelled as a Poisson process.

A.2 Generalized Poisson distribution
The Generalized Poisson distribution (GP) was introduced by Consul and Jain (1973) in order to overcome the equality of mean and variance that characterizes the Poisson distribution. In some cases the occurrence of an event, in a population that should be Poissonian, changes with time or depends on previous occurrences; therefore, mean and variance are unequal in the data. A variety of mixture and compound distributions has been considered in different fields; Consul and Jain introduced the GP distribution, with one additional parameter, in order to obtain a single distribution to be used in the cases above. See Consul and Famoye (2006) for some applications of the Generalized Poisson distribution. Applications of the GP distribution can also be found in economics and finance. Consul (1989) showed that the number of units of different commodities purchased by consumers in a fixed period of time follows a Generalized Poisson distribution. He gave an interpretation of both parameters of the distribution: θ denotes the basic sales potential for the commodity, while λ is the average rate of liking generated by the product among consumers. Tripathi et al. (1986) provide an application of the GP distribution to the textile manufacturing industry. In particular, given the established use of the Poisson distribution in the field, they compare the Poisson and the GP distributions when firms want to increase their profit, and find that the Generalized Poisson, for different values of the parameters, always yields larger profits. Moreover, the Generalized Poisson distribution, as studied by Consul (1989), can be used to describe accidents of various kinds, such as shunting accidents, home injuries and strikes in industries. Another application to accidents has been carried out by Famoye and Consul (1995), who introduced a bivariate extension of the GP distribution and studied two different estimation methods, i.e.
method of moments and MLE, and the goodness of fit of the distribution for accident statistics. Hubert Jr et al. (2009) test for the value of the extra parameter of the GP distribution by means of a Bayesian hypothesis test procedure, namely the Full Bayesian Significance Test. Famoye (1997) and Demirtas (2017) provided different methods and algorithms for sampling from the Generalized Poisson distribution. As regards processes, the GP distribution has been used in different models. For example, Consul and Famoye (1992) introduced the GP regression model, while Famoye (1993) studied the restricted generalized Poisson regression. Wang and Famoye (1997) applied the GP regression model to households' fertility decisions and Famoye et al. (2004) carried out an application of the GP regression model to accident data. Zamani and Ismail (2012) developed a functional form of the GP regression model, and Zamani et al. (2016) introduced several forms of the bivariate GP regression model, with applications to healthcare datasets, in particular the Australian health survey and the US National Medical Expenditure survey. Famoye (2015) provides a multivariate GP regression model, based on a multivariate version of the GP distribution, and two applications: to healthcare utilization and to the number of sexual partners.
The Generalized Poisson distribution of a random variable X with parameters θ and λ is given by

P_x(θ, λ) = θ(θ + xλ)^(x−1) e^(−θ−xλ) / x!, x = 0, 1, 2, . . .

The GP belongs to the class of general Lagrangian distributions. The pgf of the GP is derived by Consul and Jain (1973) by means of the Lagrange expansion, which involves the derivative operator D^(x−1) = d^(x−1)/dz^(x−1). In particular, for the GP distribution we have (Consul and Famoye, 2006): f(z) = e^(θ(z−1)) and g(z) = e^(λ(z−1)) (A.6). Now, by setting G(u) = f(z), we obtain the expression above for the pgf (see the proof in Consul and Jain (1973)).

Properties. Consul and Jain (1973), Consul (1989), Consul and Famoye (1986) and Consul and Famoye (2006) derived some interesting properties of the Generalized Poisson distribution.
For a proof of Th. 1 see Consul and Jain (1973).
Theorem 2 (Unimodality). The GP distribution models are unimodal for all values of θ and λ: the mode is at x = 0 if θe^(−λ) < 1, at the dual points x = 0 and x = 1 when θe^(−λ) = 1, and, for θe^(−λ) > 1, at some point x = M with a ≤ M, where a is the smallest value of M satisfying the corresponding inequality. For a proof of Th. 2 see Consul and Famoye (1986). Consul and Shenton (1975) and Consul (1989) derived some recurrence relations between the noncentral moments µ_k and the cumulants K_k. Moreover, a recurrence relation between the central moments of the GP distribution has been derived, where µ_k(t) denotes the central moment µ_k with θt and λt in place of θ and λ.

A.3 Generalized Poisson Difference distribution
The random variable X follows a Generalized Poisson distribution (GP), P_x(θ, λ), if and only if it has the GP probability mass function, with θ > 0 and 0 ≤ λ < 1. Let X ∼ P_x(θ_1, λ) and Y ∼ P_y(θ_2, λ) be two independent random variables. Consul (1986) showed that the probability distribution of D = X − Y is given in (A.13), where d takes all integer values in the interval (−∞, +∞). As for the GP distribution, we need to set lower limits for λ in order to ensure that there are at least five classes with non-zero probability when λ is negative. Hence, we bound λ from below by requiring θ_1 + m_1λ > 0 and θ_2 + m_2λ > 0, where m_1, m_2 ≥ 4 are the largest positive integers for which these inequalities hold.
Proposition 10. The probability distribution in (A.13) can be written as
$$P(D = d) = e^{-\theta_1 - \theta_2 - d\lambda} \sum_{y=\max(0,-d)}^{\infty} \frac{\theta_1\theta_2\,(\theta_2 + y\lambda)^{y-1}(\theta_1 + (y+d)\lambda)^{y+d-1}}{y!\,(y+d)!}\, e^{-2y\lambda} = e^{-\theta_1 - \theta_2 - d\lambda}\, C_d(\theta_1, \theta_2, \lambda). \tag{A.14}$$
Equation (A.14) is the probability mass function of the difference of two GP variates, from which it is possible to obtain some particular cases. In particular, for λ = 0,
$$P(D = d) = e^{-\theta_1 - \theta_2} \left( \frac{\theta_1}{\theta_2} \right)^{d/2} I_d\!\left( 2\sqrt{\theta_1 \theta_2} \right), \tag{A.15}$$
where d is any integer (positive, zero or negative) and I_d(z) is the modified Bessel function of the first kind, of order d and argument z. The result in equation (A.15) is the Skellam distribution (Skellam, 1946). Therefore, the Skellam distribution is a particular case of the difference of two GP variates.
Therefore, given that X and Y are independent with pgfs G_1(u) and G_2(u), the probability generating function (pgf) of the random variable D = X − Y is G_D(u) = G_1(u)\,G_2(u^{-1}). From the cumulant generating function, where T_1 = β + λ(e^{T_1} − 1) and T_2 = −β + λ(e^{T_2} − 1), it is possible to derive the mean, variance, skewness and kurtosis of the distribution.
In particular,
$$\mu = \frac{\theta_1 - \theta_2}{1 - \lambda} \tag{A.17}$$
is the mean,
$$\sigma^2 = \frac{\theta_1 + \theta_2}{(1 - \lambda)^3} \tag{A.18}$$
is the variance,
$$\gamma_1 = \frac{(\theta_1 - \theta_2)(1 + 2\lambda)}{(1 - \lambda)^5\, \sigma^3} \tag{A.19}$$
is the skewness and
$$\gamma_2 = \frac{(\theta_1 + \theta_2)(1 + 8\lambda + 6\lambda^2)}{(1 - \lambda)^7\, \sigma^4} \tag{A.20}$$
is the kurtosis. From panels (c) and (d) we can see that, for fixed values of µ, when λ decreases the GPD becomes more skewed: to the right for µ > 0 (θ_1 > θ_2) and to the left for µ < 0 (θ_1 < θ_2). Therefore, the sign of µ determines the direction of the skewness of the GPD.
From Figure A.2 we can see again that for positive values of µ the distribution becomes more right-skewed (panels (a) and (b)) and more left-skewed for negative values of µ (panels (c) and (d)). Moreover, it can be seen more clearly that, as θ_1 increases, the distribution has longer tails.
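The link between the sign of µ and the direction of skewness can also be checked numerically. A small sketch (truncated sums; helper names are ours) computes the third central moment of the GPD for θ_1 > θ_2 and for the swapped parameters:

```python
import math

def gp_pmf(x, theta, lam):
    """Generalized Poisson pmf of Consul and Jain (1973)."""
    if x < 0:
        return 0.0
    return theta * (theta + x * lam) ** (x - 1) * math.exp(-theta - x * lam) / math.factorial(x)

def gpd_third_central_moment(theta1, theta2, lam, trunc=100, half=50):
    """Third central moment of D = X - Y over a truncated support."""
    pmf = {d: sum(gp_pmf(y + d, theta1, lam) * gp_pmf(y, theta2, lam)
                  for y in range(max(0, -d), trunc))
           for d in range(-half, half + 1)}
    mean = sum(d * p for d, p in pmf.items())
    return sum((d - mean) ** 3 * p for d, p in pmf.items())

# mu > 0 (theta1 > theta2): right skew; mu < 0 (theta1 < theta2): left skew
m3_pos = gpd_third_central_moment(3.0, 1.0, 0.2)
m3_neg = gpd_third_central_moment(1.0, 3.0, 0.2)
print(m3_pos > 0, m3_neg < 0)
```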

A.4 Quasi-Binomial distribution
A first version of the Quasi-Binomial distribution, denoted QB-I by Consul and Famoye (2006), was investigated by Consul (1990) as an urn model. In their definition of the QB thinning, however, Alzaid and Al-Osh (1993) used the QB distribution introduced in the literature by Consul and Mittal (1975) and denoted by Consul and Famoye (2006) as QB-II.
for x = 0, 1, 2, . . . , n and zero otherwise, where a > 0, b > 0 and θ > −a/n. However, Alzaid and Al-Osh (1993), when defining the QB thinning operator, used a particular case of the QB-II distribution in which a = p and b = q, with 0 < q = 1 − p < 1 and θ such that nθ < min(p, q). We denote this QB-II distribution by QB(p, θ, n). For large n, such that np → λ, the QB distribution tends to the Generalized Poisson distribution.
The quasi-binomial (QB) thinning was introduced by Alzaid and Al-Osh (1993) as a generalization of the binomial thinning to model processes with GP marginals. Unlike the binomial thinning, the QB thinning makes it possible to obtain the distribution of the corresponding innovation. In particular, the authors argued that in many counting processes it is more suitable to assume that the probability of retaining an element may depend on time and/or on the number of already retained elements. They assumed that, at time t, the number of retained elements follows a QB distribution. Using the notation in Weiß (2008), the QB thinning is defined as follows.

Proposition 11 (Quasi-Binomial Thinning). Let ρ_{θ,λ} ∘ be the quasi-binomial thinning operator such that, conditionally on X = x, ρ_{θ,λ} ∘ X follows a QB(ρ, θ/λ, x) distribution. If X follows a GP(λ, θ) distribution and the quasi-binomial thinning is performed independently on X, then ρ_{θ,λ} ∘ X has a GP(ρλ, θ) distribution.
The thinned variable, ρ θ,λ • X, can be interpreted as the number of survivors from a population described by X.
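The survivor interpretation can be sketched in code. The snippet below is only illustrative: it assumes the Consul–Mittal QB-II weight shape with a numerically computed normalizing constant, draws X through the known branching-process (total-progeny) representation of the GP, and checks only the basic survivor property 0 ≤ ρ_{θ,λ} ∘ X ≤ X; all function names are ours, not from the paper.

```python
import math
import random

def poisson(mu, rng):
    # Knuth's multiplication method (adequate for small mu)
    L, k, p = math.exp(-mu), 0, rng.random()
    while p > L:
        k += 1
        p *= rng.random()
    return k

def gp_sample(lam, theta, rng):
    """Draw from GP(lam, theta) as the total progeny of a branching process
    with Poisson(lam) ancestors and Poisson(theta) offspring per individual."""
    total = pending = poisson(lam, rng)
    while pending > 0:
        pending = sum(poisson(theta, rng) for _ in range(pending))
        total += pending
    return total

def qb_thin(x, rho, phi, rng):
    """Draw from QB(rho, phi, x), assuming the Consul-Mittal QB-II weights
    (normalized numerically, so no closed-form constant is needed)."""
    if x == 0:
        return 0
    q = 1.0 - rho
    w = [math.comb(x, y) * (rho + y * phi) ** (y - 1) * (q + (x - y) * phi) ** (x - y - 1)
         for y in range(x + 1)]
    return rng.choices(range(x + 1), weights=w)[0]

rng = random.Random(1)
lam, theta, rho = 1.5, 0.2, 0.5
draws = [gp_sample(lam, theta, rng) for _ in range(1000)]
thinned = [qb_thin(x, rho, theta / lam, rng) for x in draws]
print(sum(thinned) / len(thinned))  # should be near rho*lam/(1-theta) if the property holds
```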

B.2 Proofs of the results in Section 2
Proof of Remark 1. Let X ∼ GP(θ_1, λ) and Y ∼ GP(θ_2, λ); then Z = X − Y follows a GPD(θ_1, θ_2, λ). By independence, P(Z = z) is the sum over the joint probabilities P(X = x)·P(Y = y) with x − y = z; therefore we can name S = Y and substitute S in Z = X − Y, obtaining X = S + Z. Now we can write (B.6), which is the probability distribution of a GPD(θ_1, θ_2, λ). We can now introduce the new parametrization of the probability density function of the GPD. Define µ and σ²; thus, we can rewrite both parameters θ_i, i = 1, 2, in terms of µ and σ² as in (B.8). By substituting (B.8) into equation (B.6) we obtain (B.9), and carrying out the operations in (B.9) we obtain (B.7).
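The reparametrization is a small invertible map. A sketch (assuming, as in Remark 2, mean (θ_1 − θ_2)/(1 − λ) and variance (θ_1 + θ_2)/(1 − λ)³; function names are ours) with a round-trip check:

```python
def theta_from_mu_sigma2(mu, sigma2, lam):
    """Invert mu = (t1 - t2)/(1 - lam) and sigma2 = (t1 + t2)/(1 - lam)**3."""
    s = sigma2 * (1.0 - lam) ** 3   # t1 + t2
    d = mu * (1.0 - lam)            # t1 - t2
    assert s > abs(d), "need sigma2*(1-lam)**2 > |mu| for positive thetas"
    return (s + d) / 2.0, (s - d) / 2.0

def mu_sigma2_from_theta(t1, t2, lam):
    return (t1 - t2) / (1.0 - lam), (t1 + t2) / (1.0 - lam) ** 3

t1, t2 = theta_from_mu_sigma2(0.5, 4.0, 0.3)
mu, s2 = mu_sigma2_from_theta(t1, t2, lam=0.3)
print(t1 > 0 and t2 > 0, round(mu, 6), round(s2, 6))
```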
Proof of Remark 2. If Z ∼ GPD(θ_1, θ_2, λ) in the parametrization of Consul (1986), the moments are given in equations (A.17)–(A.20). By using our reparametrization of the GPD in equations (A.17)–(A.20), we obtain the stated expressions, where κ_i, i = 2, 3, 4, are respectively the second, third and fourth cumulants.
Proof of Proposition 1. Let ψ_j be the coefficient of z^j in the Taylor expansion of G(z)D(z)^{-1}. We have (B.15), where the passage from line two to line three of (B.15) follows from (B.17) and, following Ferland et al. (2006), the passage from line three to line four follows from (B.18). From (B.16), a necessary condition for the second-order stationarity of the integer-valued process {Z_t} is that the condition holds for all a, b ∈ R. Hence, for a = 1 and b = −1, it holds in particular for the difference Z_t.

Proof of Proposition 5. We use again the fact that Z_t^{(n)} is the difference of two sequences, together with the following lemma.
Lemma 2. If X_n(ω) and Y_n(ω) have a mean-square limit, then their sum also has a mean-square limit.
Hence, by setting a = 1 and b = −1, Lemma 2 is valid also for the difference of two sequences, and we can conclude that Z_t^{(n)} converges to Z_t in L²(Ω, F, P).
Proof of Proposition 6.
Proof of Proposition 9. As said before, X_t^{(n)} and Y_t^{(n)} are finite sums of independent Generalized Poisson variables, and it follows that Z_t^{(n)} is a finite sum of Generalized Poisson difference variables. As shown by Zhu (2012), the first two moments of X_t and Y_t are finite: the first moment is finite and the second moment is also finite, where Cov(X_t, Y_t) = 0 given that X_t and Y_t are independent, and where C_i and C_i', with i = 1, 2, are constants.

C Numerical Illustration
We consider 400 samples from two GPD-INGARCH(1,1) simulation settings: one with low persistence and the other with high persistence. The first setting has parameters λ = 0.4, α_1 = 0.25, β_1 = 0.23, α_0 = −0.2 and φ = 22.78, while the second has λ = 0.6, α_1 = 0.53, β_1 = 0.25, α_0 = −0.2 and φ = 26.25. We run the Gibbs sampler for 1,010,000 iterations, discard the first 10,000 draws to avoid dependence on the initial conditions, and finally apply a thinning procedure with a factor of 100 to reduce the dependence between consecutive draws.
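The post-processing of the Gibbs output amounts to a burn-in step followed by thinning; a minimal sketch (the slicing scheme is our assumption about the implementation):

```python
def burn_and_thin(draws, burn_in=10_000, thin=100):
    """Drop the first `burn_in` draws, then keep every `thin`-th draw."""
    return draws[burn_in::thin]

chain = list(range(1_010_000))       # stand-in for the 1,010,000 Gibbs draws
kept = burn_and_thin(chain)
print(len(kept), kept[0], kept[-1])  # 10,000 retained draws
```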
The following figures show the posterior approximations of α_1, β_1 and λ. For illustrative purposes we report in