1 Introduction

Statistical models for temporal point patterns describe the occurrence of events of interest over a given interval of the real line, which is commonly interpreted as time. The most widely used point process model is the Poisson process (PP), in which the number of events in any interval has a Poisson distribution and is independent across disjoint intervals. The dynamics of a Poisson process is mainly determined by its intensity function (IF), and the process is called a homogeneous Poisson process when this function is constant. Cox processes are a statistically appealing generalization of the Poisson process that allows the intensity function to vary stochastically across the interval under consideration.

A variety of classes of Cox process models can be defined in terms of the stochastic dynamics of the intensity function. Several of those models have appeared in the literature, including non-parametric models in which such dynamics is described by a Gaussian process. The most popular of those is the log Gaussian Cox process (Møller et al. 1998), which defines the IF as the exponential of a Gaussian process. Link functions other than the exponential are also considered in the literature, for example, functions proportional to: the square function (Walder and Bishop 2017); the logistic function (Adams et al. 2009); and the standard normal c.d.f. (Gonçalves and Gamerman 2018). Gonçalves et al. (2020) considers the case in which the IF is a diffusion process, allowing for non-Gaussian dynamics.

Non-parametric Cox process models such as the ones above are appealing due to their flexibility to model the IF dynamics. However, they pose complex statistical challenges. The intractability of their likelihood function often forces existing methodologies to rely on approximations, which face a trade-off between precision and computational cost. Although approximation-free methodologies have already been introduced (Adams et al. 2009; Gonçalves and Gamerman 2018), their associated computational cost prevents them from being applied to moderately large datasets. This motivates the development of novel methodologies that offer modeling flexibility at a reasonable computational cost and, preferably, do not rely on approximations.

A simple yet appealing model is the Markov modulated Poisson process (MMPP), in which the IF follows a continuous time Markov chain (CTMC). This means that the IF is piecewise constant with jumps that have Markovian dynamics. This class of models has been explored from a statistical perspective by different authors. In particular, Fearnhead and Sherlock (2006) and Rao and Teh (2013) propose exact Markov chain Monte Carlo (MCMC) methodologies to perform Bayesian inference. Both methodologies consider MCMC algorithms that iterate between two main sampling steps: i) sampling from a collection of CTMC bridges (the former) or a collection of homogeneous PPs (the latter); ii) performing a forward-backward algorithm over a collection of time points. On the one hand, the algorithm from Fearnhead and Sherlock (2006) does not introduce extra latent variables, as is done in the algorithm of Rao and Teh (2013), which slows down the mixing of the MCMC chain. On the other hand, the two main sampling steps described above are performed considering the observation times in Fearnhead and Sherlock (2006) and the transition times of an (augmented) CTMC in Rao and Teh (2013). This, together with the fact that the former algorithm requires the computation of matrix exponentials whereas the latter does not, implies a higher cost for the two steps of the former algorithm. In a typical example, the number of events is much larger than the number of changes in the underlying CTMC, therefore favoring the algorithm from Rao and Teh (2013).

This paper introduces a generalization of Markov modulated Poisson processes, called the generalized MMPP (GMMPP), that allows the intensity function to jump among different, pre-specified functional forms. The jumps are determined by a continuous-time Markov chain in such a way that each state of the chain is associated with one functional form. One special case of the GMMPP, for example, has a piecewise continuous, linear intensity function, which starts at a fresh random value when the underlying Markov chain enters state 1 and increases with slope \(\beta \) while the CTMC remains in state 1, then jumps to a new random value and decreases with slope \(\gamma \) when the CTMC enters state 2. Parameters indexing the functional forms, like \(\beta \) and \(\gamma \) in the aforementioned example, may be assumed unknown and estimated in the inference process. This construction offers a quite flexible yet parametric solution to model the IF of unidimensional Cox processes.

The proposed class of Cox processes is expected to fill a gap between Markov modulated Poisson processes and fully non-parametric Cox processes. Compared to the former, GMMPPs provide much more flexibility to model the IF dynamics. An MMPP would typically require CTMCs with large state spaces and many state changes (short visits), which would seriously compromise model parsimony and, consequently, increase the implied computational cost. On the other hand, when compared to fully non-parametric approaches, for example, those in which the IF dynamics depends on a Gaussian process, GMMPPs are expected to provide a similarly good fit in many cases but with huge gains in terms of computational cost.

An MCMC algorithm is developed to perform exact Bayesian inference for GMMPPs. It is exact in the sense that the devised Markov chain converges to the exact posterior distribution of all the unknown quantities of the model, including the IF. The algorithm builds upon the ideas introduced in Rao and Teh (2013), so that it scales with the number of Markov jumps and does not deteriorate substantially as the number of observations increases. Further non-trivial developments are presented to circumvent the fact that a forward-filtering-backward-sampling (FFBS) algorithm cannot be used to sample one of the blocks of the Gibbs sampler, as is done in Rao and Teh (2013). In fact, one of the ideas developed here can be used to further improve the algorithm of Rao and Teh (2013). The proposed MCMC is computationally efficient in terms of computing time and, therefore, feasible for very large data sets. This way, the main contributions of this paper are twofold: first, a novel class of parametric unidimensional Cox processes which is flexible yet parsimonious is introduced and, second, a computationally efficient MCMC algorithm is devised to perform exact Bayesian inference. The methodology offers an appealing (and much cheaper) alternative to non-parametric Cox processes in a variety of problems in which the latter would otherwise be a suitable choice.

To motivate the use of GMMPPs, we consider two real data sets regarding coal mining disasters and the exchange rate between the Brazilian Real (BRL) and the US Dollar (USD). For the former, each event represents an explosion that killed ten or more men in Britain. For the latter, each event represents a day on which the variation w.r.t. the previous day was greater than 1%. A kernel method (see Diggle 1985) is used to estimate the IF, as shown in Fig. 1, and suggests that, in both examples, the IF ought to be well described by a GMMPP with an increasing and a decreasing line with varying starting values. Both examples are revisited in Sect. 5.

Fig. 1 Two real data sets and the respective empirical IF. Left: coal mining disasters (time unit is year). Right: daily variation over 1% in the BRLxUSD exchange rate (time unit is month)

Finally, we propose a specific model to fit epidemic curves, allowing for asymmetry between the growth and decay behaviors. The model allows for simplifications in the proposed MCMC algorithm, which lead to reasonable computational times even for very large data sets—with more than 300 thousand observations. The model is applied to data sets regarding Dengue Fever epidemics in Brazil and the COVID-19 pandemic. This is presented in “Appendix E”.

Other examples of point process models that do not assume the stochastic independence property of Poisson processes are the self-exciting Hawkes process (see, for example, Rasmussen 2013) and the Matérn repulsive process (see, for example, Rao et al. 2017).

This paper is organized as follows. Sections 2 and 3 present the GMMPP and the MCMC algorithm, respectively. Section 4 explores simulated examples to discuss the efficiency of the methodology and compare it to a non-parametric Cox process approach, in terms of inference and computational cost. Finally, Sect. 5 applies the methodology to real datasets. Two of them are the ones presented in Fig. 1 regarding coal mining disasters and the BRLxUSD exchange rate.

2 Generalized Markov modulated Poisson processes

Let \(Y := \{Y(s)\}_{s \in \mathbb {R}^+}\) be a non-homogeneous Poisson process (NHPP) with intensity function \(\lambda := \{\lambda (s)\}_{s \in \mathbb {R}^+}\) and consider K functional forms \(g_k\), \(k=1,\ldots ,K\). These may be, for example, constants, increasing or decreasing lines, exponential functions, etc. The IF \(\lambda \) switches among the different functional forms according to the transitions of a continuous-time Markov chain \(X := \{X(s)\}_{s \in \mathbb {R}^+}\), with Q-matrix \(Q_{\theta }\), initial distribution \(\pi _0\) and state space \(\{1:E\}:=\{1, 2, \ldots , E\}\), for \(E\ge K\), where \(\theta \) is the vector of parameters indexing \(Q_{\theta }\). Furthermore, in its most general form, the model allows the IF to switch from a functional form to itself and to have a different starting value every time a functional form is revisited. We call the resulting process Y a generalized Markov modulated Poisson process, with mathematical representation given as follows.

Fig. 2 Three examples of GMMPPs. Left: CTMC trajectory; right: IF trajectory and generated PP events (circles). Each example considers, respectively, one of the three possible priors on R presented in Sect. 3. Self-jumps are allowed in the third example

Define \(T_0=0\), \(T = (T_1, T_2, \ldots )\) as the jump times of X and \(Z = (Z_1,Z_2, \ldots )\) as the corresponding sequence of visited states, i.e. \(Z_i = X(T_i)\), \(i \in \mathbb {N}\), and \(Z_0 = X(0)\). The surjective function \(h: \{1:E\} \rightarrow \{1:K\}\) assigns a functional form \(g_k\) to each state of X and, for each s, we write \(h_s:=h(X(s))\) and \(\zeta _s:=\zeta _s(X) = \max \limits _{i \ge 0}\{ T_i: T_i \le s\}\), the time of the last jump up to s. Additionally, \(R=(R_0,R_1,\ldots )\), with \(R_i := R_i(X)=\lambda (T_i)\), is the sequence of values of the IF at the jump times and \(\psi \) is a vector of parameters indexing the \(g_k\)'s. The generalized Markov modulated Poisson process is formally defined as follows.

$$\begin{aligned}&Y\sim \text{ NHPP }(\lambda ), \end{aligned}$$
(1)
$$\begin{aligned}&\lambda (s)= g_{h_s}(s,\zeta _s, R,\psi ), \end{aligned}$$
(2)
$$\begin{aligned}&X\sim \text{ CTMC } (\pi _0, Q_{\theta }, E), \end{aligned}$$
(3)
$$\begin{aligned}&\pi (R|T,Z_0,Z,\psi )=\pi (R_0|Z_0,\psi ) \nonumber \\&\quad \times \prod _{i\in \mathbb {N}}\pi (R_i|R_{i-1},T_{i-1},T_i,Z_{i-1},Z_i,\psi ). \end{aligned}$$
(4)

We can set any subset of the functional forms (from none to all of them) to allow for self-jumps; the ith jump is a self-jump if \(h(X(T_{i-1}))=h(X(T_{i}))\). This way, the size E of the state space can be anything from K to 2K. Some choices for the prior on R are presented in Sect. 3. For example, suppose that \(g_1\) is a straight line with slope \(\beta \); then, for \(s\in [T_{i},T_{i+1})\) such that the IF assumes the functional form \(g_1\) in this interval, we have \(\lambda (s)=R_i+\beta (s-\zeta _s)\).

The IF can be interpreted in two ways: i) as a piecewise deterministic function constructed from a discrete time Markov chain, a sequence of exponentially distributed waiting times \(T_{i+1}-T_i\) and the starting values R; ii) as a piecewise deterministic Markov process conditioned on the values of a discrete time Markov chain. Naturally, the IF is required to be non-negative. Formally, we deal with this issue by assigning zero to the density of Y conditional on any trajectory of the IF that assumes negative values.

In the simpler case in which the IF is not allowed to switch from each \(g_k\) to itself, we set \(h(k)=k\), \(k=1,\ldots ,K\). On the other hand, if that feature is allowed, we set \(h(k)=h(K+k)=k\), \(k=1,\ldots ,K\). In order to favor model identifiability in a statistical context, some entries of the \(Q_{\theta }\) matrix are set to be zero so that a jump to the kth functional form that is not a self-jump can only happen through the kth state of X and never through the \((K+k)\)th one. Figure 2 illustrates the proposed class of models by presenting a realization for three examples of GMMPPs.
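To make the construction concrete, the sketch below simulates a GMMPP with two functional forms (an increasing and a decreasing line), fresh starting values drawn at every jump (prior 3 of Sect. 3) and no self-jumps, and then generates the observed events by thinning. All names, parameter values and starting-value distributions are illustrative assumptions, not the paper's (Ox) implementation; negative excursions of the IF are simply truncated at zero here.

```python
# A minimal simulation sketch of a GMMPP with two linear functional forms,
# prior 3 for the starting values and no self-jumps (E = K = 2).
import numpy as np

rng = np.random.default_rng(1)

S = 100.0                                   # observation window [0, S]
theta = np.array([0.05, 0.05])              # waiting-time rates of the CTMC
slopes = np.array([0.3, -0.3])              # psi: slopes of g_1 and g_2
start_dists = [lambda: rng.uniform(0.0, 2.0),    # starting values of g_1
               lambda: rng.uniform(5.0, 10.0)]   # starting values of g_2

def simulate_gmmpp():
    # CTMC trajectory: jump times T, visited states Z, starting values R
    t, state = 0.0, rng.integers(2)
    T, Z, R = [0.0], [state], [start_dists[state]()]
    while True:
        t += rng.exponential(1.0 / theta[state])
        if t >= S:
            break
        state = 1 - state                   # two states, no self-jumps
        T.append(t); Z.append(state); R.append(start_dists[state]())

    def lam(s):                             # piecewise-linear IF, as in (2)
        i = np.searchsorted(T, s, side="right") - 1
        return max(R[i] + slopes[Z[i]] * (s - T[i]), 0.0)

    # exact sup of the piecewise-linear IF (attained at a segment endpoint)
    seg_end = np.r_[T[1:], S]
    lam_sup = max(max(R[i], R[i] + slopes[Z[i]] * (seg_end[i] - T[i]))
                  for i in range(len(T))) + 1e-9

    # observed events of Y by thinning a homogeneous PP with rate lam_sup
    events, t = [], 0.0
    while True:
        t += rng.exponential(1.0 / lam_sup)
        if t >= S:
            break
        if rng.uniform() < lam(t) / lam_sup:
            events.append(t)
    return np.array(T), np.array(Z), np.array(R), np.array(events)

T, Z, R, y = simulate_gmmpp()
print(f"{len(T) - 1} jumps in the IF, {len(y)} observed events")
```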

3 Bayesian inference

We aim at performing inference for GMMPPs based on the observation of the process over a finite time interval [0, S]. The inference process consists in obtaining the posterior distribution of the intensity function and unknown parameters. Given the structure of the proposed class of models, this is equivalent to the distribution of \((Z_0,Z,T,R,\theta ,\psi |y)\), where y represents a realization of the process Y in [0, S]. In order to fully specify the model under the Bayesian approach we need to assign a prior distribution to parameters \(\theta \) and \(\psi \). We define \(\psi =(\psi _1,\ldots ,\psi _K)\), where \(\psi _k\) is the set of parameters indexing the kth functional form, \(\theta _{\cdot } = \{ \theta _k: k = 1, \ldots , K\}\) as the rates of the waiting times of X and \(\theta _{k \cdot } = \{ \theta _{kj}: j = 1, \ldots , K \text{ and } j \ne k \} \) as the transition probabilities from the states corresponding to the kth functional form. The parameter vectors \(\theta \) and \(\psi \) are assumed to be independent a priori. Independence among all the \(\psi _k\)’s and among the components of \(\theta _{\cdot }\) and vectors \(\theta _{k \cdot }\) is also assumed. In the case that no self-jumps are allowed, the full prior specification of \(\theta \) is completed by setting, for \(k=1,\dots ,K\),

$$\begin{aligned} \theta _k&\sim Gamma(\alpha _k,\beta _k), \nonumber \\ \theta _{k \cdot }&\sim Dirichlet(\gamma _{k1},\ldots ,\gamma _{k,k-1},\gamma _{k,k+1},\ldots ,\gamma _{kK}). \end{aligned}$$
(5)

If self-jumps are allowed, each row of \(Q_{\theta }\) has K non-zero transition probabilities due to the restrictions imposed to ensure model identifiability. Moreover, the transition probabilities between the two states corresponding to the same functional form are the same, and the transition probabilities between the \(k_1\)th and \(k_2\)th functional forms are the same when moving from either of the two states corresponding to the \(k_1\)th form to state \(k_2\). This means that there are K transition probabilities associated with each functional form and the vector of these probabilities is assumed to follow a Dirichlet distribution, i.e.,

$$\begin{aligned} \theta _{k \cdot }\sim Dirichlet(\gamma _{k1},\ldots ,\gamma _{kK}), \end{aligned}$$
(6)

for \(k=1,\dots ,K\).

The rate parameters of the waiting times are set to be the same for the two states corresponding to the same functional form when self-jumps are allowed for that form. In order to illustrate all the restrictions imposed on the \(Q_{\theta }\) matrix, consider an example with three functional forms, all allowed to self-jump. The resulting \(Q_{\theta }\) matrix is then given by

$$\begin{aligned} \left( \begin{array}{cccccc} -\theta _1 & \theta _1\theta _{12} & \theta _1\theta _{13} & \theta _1\theta _{11} & 0 & 0 \\ \theta _2\theta _{21} & -\theta _2 & \theta _2\theta _{23} & 0 & \theta _2\theta _{22} & 0 \\ \theta _3\theta _{31} & \theta _3\theta _{32} & -\theta _3 & 0 & 0 & \theta _3\theta _{33} \\ \theta _1\theta _{11} & \theta _1\theta _{12} & \theta _1\theta _{13} & -\theta _1 & 0 & 0 \\ \theta _2\theta _{21} & \theta _2\theta _{22} & \theta _2\theta _{23} & 0 & -\theta _2 & 0 \\ \theta _3\theta _{31} & \theta _3\theta _{32} & \theta _3\theta _{33} & 0 & 0 & -\theta _3 \\ \end{array} \right) . \end{aligned}$$
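As an illustration of the restrictions above, the sketch below builds the \(2K \times 2K\) Q-matrix for K functional forms that are all allowed to self-jump, reproducing the structure of the displayed matrix; the helper name and the input values are assumptions made for this example only.

```python
# A sketch of the restricted Q-matrix for K forms, all allowed to self-jump.
import numpy as np

def build_Q(theta, P):
    """theta: length-K waiting-time rates; P: K x K matrix whose row k holds
    the transition probabilities (theta_k1, ..., theta_kK), summing to 1."""
    K = len(theta)
    Q = np.zeros((2 * K, 2 * K))
    for k in range(K):
        for row in (k, K + k):                   # the two states of form k
            for j in range(K):
                if j == k:
                    # self-jump of form k: alternate between states k and K+k
                    Q[row, K + k if row == k else k] = theta[k] * P[k, k]
                else:
                    # non-self-jumps always enter form j through state j
                    Q[row, j] = theta[k] * P[k, j]
            Q[row, row] = -theta[k]              # diagonal: minus the rate
    return Q

Q = build_Q(np.array([0.1, 0.2, 0.05]), np.full((3, 3), 1 / 3))
assert np.allclose(Q.sum(axis=1), 0.0)           # rows of a Q-matrix sum to 0
```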

We consider a few options for the prior on the set R of starting values. Note that the (random) dimension of R is \((|T|+1)\), where |T| is the number of jump times in [0, S]. One possible choice is to fix the starting value of each functional form to be the same on every visit to that form. This implies that the initial value of each functional form \(g_k\) is a parameter indexing that form and, if it is unknown, it is added to the parameter vector \(\psi _k\). We call this prior 1 for R.

Another option for the prior on \(R_i\) (prior 2) is a point-mass distribution depending on \((R_{i-1},T_{i-1},T_i,Z_{i-1},\psi )\) that leads to a continuous IF trajectory, which is therefore closer in form to the traditional log Gaussian Cox process. This prior is defined by

$$\begin{aligned} \displaystyle P(R_i=g_{h_{T_{i-1}}}(T_i,T_{i-1},R_{i-1},\psi ))=1. \end{aligned}$$
(7)

A third option (prior 3) assumes

$$\begin{aligned} \pi (R_i|R_{i-1},T,Z_0,Z,\psi )&= \pi (R_i|Z_i), \nonumber \\ \pi (R_i|Z_i=k)&= \pi _k, \end{aligned}$$
(8)

for some distribution \(\pi _k\), \(k=1,\ldots ,K\). This means that all the starting values referring to the same functional form are independent and identically distributed. Furthermore, in order to have a feasible MCMC algorithm, the constant functional form is the only one for which a continuous prior can be adopted, in particular, a gamma distribution. For all the other forms, a discrete prior must be adopted and, unless useful information is available, we shall assume uniform discrete priors on supports chosen according to the scale of the IF. A more flexible approach is to fix the size of the support of this discrete distribution but set the actual values to be unknown and assume a joint continuous prior.

3.1 Model elicitation and identifiability

The proposed class of GMMPPs offers a considerably flexible structure to model a variety of point process phenomena. This flexibility, however, gives rise to complex and important issues that have a great influence on the quality of the statistical analysis. More specifically, model and prior elicitation should be carefully performed to avoid identifiability problems and to favor a reasonable model fit. Reliable prior information about the phenomenon under study should always be used to elicit informative priors, and a preliminary analysis of the data may also be considered for that purpose. A non-parametric kernel estimate of the IF can be used to guide the choice of the functional forms and other features of the model, such as self-jumps and the prior structure of the starting values.

The use of informative priors should be considered based on the (prior) information available. For example, if few transitions are expected, the data will provide little information about the parameters indexing the \(Q_{\theta }\) matrix. In this case, the knowledge that few transitions are expected could be used to elicit informative priors for the waiting time parameters \(\theta _k\), in terms of the scale of the model (magnitude of the waiting times). Moreover, scenarios in which the IF is estimated to have few transitions will be weakly informative about the parameters indexing the Q-matrix, yielding a low predictive power. On the other hand, if the IF is estimated to have a considerable number of transitions, reasonable information about the Q-matrix will be available and, in turn, reasonably precise predictions can be made. Finally, prior information may also be used to elicit informative priors for the parameters \(\psi \).

Generally speaking, the proposed class of models does not aim at emulating non-parametric structures, which would imply the need for many functional forms with short visits to each one. This would compromise model identifiability and computational cost. The actual aim is to provide good model fitting and prediction with high gains in terms of computational cost in situations in which a non-parametric structure for the IF is expendable.

All of the issues discussed above are explored in the simulated and real examples presented in Sects. 4 and 5.

3.2 Model augmentation and the MCMC algorithm

The target posterior distribution of \((Z_0,Z,T,R,\theta ,\psi |y)\) is highly complex, which suggests the use of MCMC methods as the most reasonable choice to perform inference. Developing an efficient algorithm, however, is not straightforward and requires non-trivial techniques. We consider a model augmentation approach similar to the one used in Rao and Teh (2013) but with some adaptations to gain computational efficiency. The model augmentation is based on the augmented representation of a CTMC introduced in Hobolth and Stone (2009) and referred to as uniformization. The CTMC is represented as a discrete time Markov chain (DTMC) subordinated to a Poisson process. This means that the times of the DTMC, which has the same state space as the CTMC X, are defined by a Poisson process. The augmented component comes from the fact that the DTMC may have transitions from a state to itself. We shall refer to those transitions as virtual jumps.

The difference between our approach and the one in Rao and Teh (2013) is that we consider a non-homogeneous Poisson process instead of a homogeneous one. The gain in efficiency due to the use of a non-homogeneous PP will be made clear further ahead in the text. The representation using non-homogeneous PP is also used in Rao and Teh (2012) in the context of inference for semi-Markov jump processes. Nevertheless, the authors do not provide a proof for the result.

Let us start by defining K constants \(\Omega _1,\ldots ,\Omega _K\) such that \(\Omega _k>|Q_k|\), where \(Q_k\) is the kth diagonal element of \(Q_{\theta }\). Now let \(V_0,V_1,\ldots \) be a sequence of discrete random variables on \(\{1:E\}\) and \(W=(W_1,W_2,\ldots )\) a sequence of random times on \(\mathbb {R}^+\). For \(W_0=0\), we define the following stochastic process:

$$\begin{aligned}&V_0 \sim \pi _0, \end{aligned}$$
(9)
$$\begin{aligned}&(W_{\ell }- W_{\ell -1}|V_{\ell -1} = j, W_{\ell -1})\sim \text{ Exp }(\Omega _{h(j)}),\end{aligned}$$
(10)
$$\begin{aligned}&(V_{\ell }|V_{\ell -1} = j) \sim P(E,B_{j \cdot }), \end{aligned}$$
(11)
$$\begin{aligned}&B_{j \cdot } = \mathbf{1} _j + \frac{1}{\Omega _{h(j)}}Q_{j \cdot }, \end{aligned}$$
(12)

for \(\ell =1,\dots ,|W|\), where \(P(E,B_{j \cdot })\) is a discrete distribution on \(\{1:E\}\) with probability vector \(B_{j \cdot }\), \(Q_{j \cdot }\) is the jth row of \(Q_{\theta }\) and \(\mathbf{1} _j\) is a row vector of size E with the jth element being 1 and all the others being 0.

As stated in Proposition 1 below, the process \((V_0,V)\), where \(V=(V_1,V_2,\ldots )\), subordinated to the times W is an alternative representation of the CTMC X. We shall refer to this process as the augmented CTMC. Note that the virtual jumps are an extra component that does not appear in the original definition of a CTMC. Finally, note that the result is valid for \(\Omega _k\ge |Q_k|\) but, in order to use this representation in our MCMC context, the strict inequality is required. This is because equality implies that virtual jumps almost surely do not occur, which, in turn, precludes the validity of the MCMC algorithm. This argument will be made clear further ahead in the text.

Proposition 1

For any \(\Omega _k \ge |Q_k|\), the process \((V_0, V, W)\) defines the same continuous time Markov chain X with initial distribution \(\pi _0\) and Q-matrix \(Q_{\theta }\).

Proof

See “Appendix A”. \(\square \)
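A minimal sketch of the uniformized representation (9)–(12) is given below for a toy two-state chain; all rates, names and the initial distribution are illustrative assumptions. Dropping the self-transitions of V (the virtual jumps) recovers a trajectory of the CTMC with the given Q-matrix, as stated in Proposition 1.

```python
# A sketch of the uniformized (augmented) CTMC representation (9)-(12).
import numpy as np

rng = np.random.default_rng(2)

def simulate_uniformized(Q, Omega, h, pi0, S):
    """Q: E x E Q-matrix; Omega: dominating rates per functional form, with
    Omega[h[j]] > |Q[j, j]|; h: map from states to forms; pi0: initial law."""
    E = Q.shape[0]
    V = [rng.choice(E, p=pi0)]                            # (9)
    W = [0.0]
    while True:
        j = V[-1]
        w = W[-1] + rng.exponential(1.0 / Omega[h[j]])    # (10)
        if w >= S:
            break
        B_j = np.eye(E)[j] + Q[j] / Omega[h[j]]           # (12)
        V.append(rng.choice(E, p=B_j))                    # (11)
        W.append(w)
    return np.array(V), np.array(W)

def drop_virtual(V, W):
    # keep time 0 and the times at which V actually changes state
    keep = np.r_[True, V[1:] != V[:-1]]
    return V[keep], W[keep]

Q = np.array([[-0.2, 0.2],
              [0.1, -0.1]])
h = np.array([0, 1])                        # no self-jumps: h is the identity
Omega = 2.0 * np.abs(np.diag(Q))            # Omega_k = 2|Q_k| (see Sect. 3.3)
V, W = simulate_uniformized(Q, Omega, h, pi0=np.array([0.5, 0.5]), S=200.0)
Z, T = drop_virtual(V, W)
print(f"{len(W)} uniformized times, {len(T) - 1} non-virtual jumps")
```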

Consider now the augmented model that replaces the CTMC X in the original model in (1)–(4) by the augmented CTMC defined in (9)–(12); its graphical representation is presented in Fig. 3 and Table 5 in “Appendix B” presents an overview of the model. We define U and T as the virtual and non-virtual jump times of the augmented CTMC, respectively. The vector of all the unknown quantities in the augmented model is \(\varphi =(W,U,T,V_0,V,V^{(T)},R,\theta ,\psi )\), where \(V^{(T)}\) is V at the times T. This means that the aim of the inference procedure is to obtain the posterior distribution of \((\varphi |y)\). Note that there are redundancies in the definition of \(\varphi \), since \(W=U\cup T\) and \(V^{(T)}\subset V\); nevertheless, they are required by the particular sampling scheme adopted in the MCMC.

Fig. 3 Graphical representation of the proposed augmented GMMPP model

We design a Gibbs sampling algorithm, with a Metropolis-Hastings (MH) step for one of the blocks. Each block is iteratively sampled conditioned on all the other components appearing in \(\varphi \). The blocking scheme and sampling algorithms to be adopted aim at simultaneously optimizing the convergence properties and computational cost of the algorithm. We consider the following blocks: \((U, W, V)\); \((V_0,V,V^{(T)},U,T,R)\); \(\theta \); \(\psi \).

We present the joint density of \((Y,\varphi )\) which is useful to derive the algorithms to sample from each block. We have that

$$\begin{aligned} \pi (Y,\varphi )&= \pi (Y|V_0, V , W, R, \psi )\pi (V_0, V, W | \theta ) \nonumber \\&\quad \times \pi (R|V_0,V,W,\psi ) \pi (\theta ) \pi (\psi ), \end{aligned}$$
(13)

where each density above is obtained w.r.t. some suitable dominating measure. The likelihood \(\pi (Y|\cdot ):=\pi (Y|V_0, V , W, R, \psi )\) is written w.r.t. the probability measure of a Poisson process with constant positive rate such that

$$\begin{aligned} \pi (Y|\cdot ){\mathop {\propto }\limits ^{\varphi }}\exp \left\{ - \int _{0}^S \lambda (s) ds \right\} \prod _{n=1}^{N_S}\lambda (t_n), \end{aligned}$$
(14)

where \(N_S\) is the number of events from y in [0, S] and \(t_n\) is the time of the nth event. The densities \(\pi (V_0, V, W | \theta )\), \(\pi (R|V_0,V,W,\psi )\) and \(\pi (\theta )\) can be obtained from (9)–(12), (7)–(8) and (5)–(6), respectively. Finally, \(\pi (\psi )\) is some suitable continuous density.

3.3 Sampling \((U, W, V)\)

First note that, conditional on \((T,V^{(T)},\theta )\), the block \((U, W, V)\) is independent of the data and is determined by the virtual jumps. These are sampled directly from their full conditional distribution, which is given by the following proposition.

Proposition 2

Defining \(U^{(i)}\) as the virtual jumps in \((T_i,T_{i+1})\), with \(T_0=0\) and \(T_{|T|+1}=S\), the full conditional distribution of the virtual jumps is such that:

  i. the \(U^{(i)}\)'s are mutually independent;

  ii. for \(i=0,\ldots ,|T|\), \(U^{(i)}\) is a homogeneous Poisson process with rate \(\Omega _{h(V_{(i)})}+Q_{h(V_{(i)})}\), where \(V_{(i)}\) is the state of V at \(T_i\).

Proof

See “Appendix A”. \(\square \)

Note that, if \(\Omega _k = |Q_k|\), the number of virtual jumps is almost surely zero and, as a consequence, the MCMC chain is reducible, since the non-virtual jumps would be restricted to the set defined by their initial values. In fact, the values of the \(\Omega _k\)'s have a great impact on the efficiency of the algorithm. If these are increased, the mean number of virtual jumps also increases, which, in turn, improves the mixing of the chain. On the other hand, an increase in the number of virtual jumps leads to an increase in the computational cost of each iteration of the algorithm, in particular, in the step where \((V_0,V,V^{(T)},U,T,R)\) is sampled. Rao and Teh (2013) suggests the use of \(\Omega = 2\max \limits _k|Q_k|\) in the context of inference for MMPPs, based on empirical results. Note, however, that the authors consider a single \(\Omega \) for all k, as in the original augmented CTMC representation of Hobolth and Stone (2009). This leads to different local mixing properties of the MCMC with respect to different states (in our case, different functional forms) of the CTMC. Moreover, an optimal local choice w.r.t. the state (functional form) with the largest \(|Q_k|\) ought to penalize the local computational cost associated with the other states. That issue is the main motivation for us to propose the alternative augmented CTMC with distinct \(\Omega _k\)'s. It allows for a finer optimization of the chain's properties, in the sense of globally optimizing the mixing without penalizing the computational cost. Finally, based on the results of Rao and Teh (2013), we set \(\Omega _k = 2|Q_k|\) for all k.

The original algorithm from Rao and Teh (2013) may also benefit from the use of distinct dominating rates \(\Omega _k\), for the same reasons discussed in the previous paragraph. The step to sample the virtual jumps would be performed according to the result in Proposition 2 and, therefore, be identical to the same step in our algorithm. The other step, in which the states of the subordinated Markov chain are sampled, would be performed by the FFBS algorithm proposed in the original paper but considering the jth row of the transition matrix of the subordinated Markov chain to be \(\left( \mathbf{1} _j + \frac{Q_{j \cdot }}{\Omega _{h(j)}}\right) \).
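A sketch of the virtual-jump update implied by Proposition 2 is shown below for a toy two-state chain: between consecutive non-virtual jump times (including \(T_0=0\) and the endpoint S), the virtual times are drawn as a homogeneous Poisson process with rate \(\Omega _{h(V_{(i)})}+Q_{h(V_{(i)})}\). The non-virtual times T, the states at those times and the remaining inputs are illustrative assumptions that, in the sampler, would come from the current state of the MCMC.

```python
# A sketch of the virtual-jump resampling step of Proposition 2.
import numpy as np

rng = np.random.default_rng(3)

def resample_virtual_jumps(T, Vt, Q, Omega, h, S):
    bounds = np.r_[T, S]                      # T_0 = 0, T_1, ..., T_|T|, S
    U = []
    for i, state in enumerate(Vt):
        a, b = bounds[i], bounds[i + 1]
        rate = Omega[h[state]] + Q[state, state]     # = Omega - |Q_kk| > 0
        n = rng.poisson(rate * (b - a))
        U.append(np.sort(rng.uniform(a, b, size=n)))
    return np.concatenate(U)                  # new virtual jump times in [0, S]

Q = np.array([[-0.2, 0.2], [0.1, -0.1]])
U = resample_virtual_jumps(T=np.array([0.0, 30.0, 120.0]),
                           Vt=np.array([0, 1, 0]),
                           Q=Q, Omega=2.0 * np.abs(np.diag(Q)),
                           h=np.array([0, 1]), S=200.0)
```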

3.4 Sampling \((V_0,V,V^{(T)},U,T,R)\)

The full conditional density of \((V_0,V,V^{(T)},U,T,R)\) is proportional to (13), hence

$$\begin{aligned} \pi (V_0,V,U,T,R|y,\theta ,\psi ,W)&\propto L_{0}(V_0,R_0)\pi _0(V_0) \\&\quad \times \prod _{l=1}^{|W|}L_{l}(V_{0:l},R_{0:(l)})\pi (V_l|V_{l-1}, \theta )\\&\quad \times \prod _{i=0}^{(|W|)}\pi (R_{i}|V_{(i)}), \end{aligned}$$

where \(V_{0:l}=(V_0,\ldots ,V_l)\), (l) is the number of non-virtual jumps up to \(W_l\) and \(R_{0:(l)}\) are all the starting values up to \(W_l\) (which are not necessarily l values). Also,

$$\begin{aligned} L_l(V_{0:l},R_{0:(l)})&= \exp \left\{ - \int \limits _{W_l}^{W_{l+1}} \lambda (s)ds \right\} \\&\quad \times \prod \limits _{t_n \in [W_l, W_{l+1})} \lambda (t_n). \end{aligned}$$

Direct sampling from the full conditional distribution of \((V_0,V,V^{(T)},U,T,R)\) requires the computation of its probability mass function, which is an (at least) \(E^{|W|}\)-dimensional vector. Therefore, in the majority of cases, the computational cost associated with this algorithm is impractical. Furthermore, note that, for GMMPPs, each likelihood term \(L_l(V_{0:l},R_{0:(l)})\) depends on V and R up to time \(W_l\), because of the dependence on the last non-virtual jump up to \(W_l\). For that reason, unlike in the case of inference for MMPPs (see Rao and Teh 2013), an FFBS scheme cannot be devised to sample from the full conditional distribution of \((V_0,V,V^{(T)},U,T,R)\). Instead, we propose an independent Metropolis-Hastings step with a proposal distribution \(q(V_0,V,R)\) that aims at approximating the target full conditional by adding suitable normalizing constant terms to each of the |W| terms \(L_{l}(V_{0:l},R_{0:(l)})\pi (V_l|V_{l-1}, \theta )\) or \(L_{l}(V_{0:l},R_{0:(l)})\pi (V_l|V_{l-1}, \theta )\pi (R_{(l)}|V_{(l)})\), accordingly. More specifically,

$$\begin{aligned} q(V_0,V,R)&= c_0 \frac{\pi _0(V_0)}{c_0(V_0)} c_0(V_0) L_0(V_0,R_0)\nonumber \\&\quad \times \pi (R_0 | V_0, \tau ) \prod _{l=1}^{|W|} \left\{ c_l \frac{\pi (V_l | V_{l-1}, \theta )}{c_l(V_l)} c_l(V_l)\right. \nonumber \\&\quad \times L_l(V_{0:l},R_{0:(l)})[\pi (R_{(l)}|V_l)\mathbb {I}(V_{l-1} \ne V_l) \nonumber \\&\quad \left. + \mathbb {I}(V_{l-1} = V_l)]\right\} , \end{aligned}$$
(15)

where \(c_l(V_l)\) is the normalizing constant of \(L_l(V_{0:l},R_{0:(l)})\pi (R_{(l)}|V_l)\) and \(c_l\) is the normalizing constant of \(\displaystyle \frac{\pi (V_l|V_{l-1},\theta )}{c_l(V_l)}\). Note that when \(V_l\) corresponds to a virtual jump, \(c_l\) is the constant that normalizes \(\pi (V_l|V_{l-1},\theta )L_l(V_{0:l},R_{0:(l)})\).


The acceptance probability of the MH step is given by

$$\begin{aligned} \alpha = 1 \wedge \prod _{l=1}^{|W|} \frac{c_l}{c_{l}^*}, \end{aligned}$$
(16)

where the \(c_l\)’s and \(c_{l}^*\)’s refer to the current and proposal values, respectively. Note that any trajectory that leads to negative values for the IF is rejected with probability 1.

As described before, the constant functional form is the only one for which we assume a continuous prior for its starting values—a gamma prior. For all the other forms, the required normalizing constants above would typically be intractable for continuous priors. The detailed algorithm to perform the MH step described above is presented in Algorithm 1 of “Appendix B”. The algorithm for the simpler case with no varying starting value is obtained by applying straightforward simplifications.

We use the general result from Mengersen and Tweedie (1996) to establish the uniform ergodicity of the proposed MH sub-chain.

Proposition 3

The Metropolis-Hastings sub-chain defined by (15) and (16) is uniformly ergodic.

Proof

See “Appendix A”. \(\square \)

Since this is an independent MH algorithm, its efficiency relies heavily on its acceptance rate: the higher, the better. Note that this rate is expected to decrease as the number of non-virtual jumps increases. For that reason, we propose an adaptation of the algorithm that partitions the interval [0, S] and separately samples \((V_0,V,V^{(T)},U,T,R)\) in each of these time intervals. In order to have a robustly efficient algorithm, we propose an adapting strategy that starts by updating \((V_0,V,V^{(T)},U,T,R)\) in one block and then partitions this into more blocks if required. The adaptation is carried out only up to a certain iteration of the Markov chain so as to guarantee its convergence. Finally, the algorithm to sample \((V_0,V,V^{(T)},U,T,R)\) in each sub-interval of time is a direct and straightforward adaptation of Algorithm 1, which is presented in “Appendix B”.

The partitioning strategy ought to be executed with care in order to guarantee that the respective full conditional distribution depends on the likelihood only inside the respective time interval. This means that the limits of the intervals have to be times \(T_i\)’s of non-virtual jumps and the proposal distribution is truncated so that the upper limit of the respective time interval remains as a non-virtual jump time and does not change the initial functional form of the next time interval. For a partition \((0=s_0,s_1,\ldots ,s_B=S)\), the blocks are defined by the intervals \([0,s_1]\), \([s_1,s_2]\), \(\ldots \), \([s_{B-2},s_{B-1}]\), \([s_{B-1},S]\).

The adapting partition strategy goes as follows. Set a number of iterations M large enough to obtain a reliable estimate of the acceptance rate and a reasonable threshold r for that rate. The algorithm starts with one block \((B=1)\). Then, after every M iterations, the acceptance rate over those last M iterations is evaluated. If this rate is smaller than r, we set \(B=B+1\). The adaptation carries on until the computed rate is larger than r. Following the results from Lee and Neal (2018), we suggest \(r\approx 0.25\). A partition with B blocks is defined by setting the intervals' limits to be the \(T_i\)'s which are the closest to the times \(|W|/B,\;2|W|/B,\ldots ,\;(B-1)|W|/B\).
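The adapting loop can be summarized by the sketch below. The MH update itself is replaced by a toy acceptance indicator whose rate decays with the number of non-virtual jumps per block, only so that the snippet runs; all names and values are illustrative assumptions, not the actual implementation.

```python
# A runnable sketch of the adapting partition strategy described above.
import numpy as np

rng = np.random.default_rng(4)

def toy_mh_accept(jumps_per_block):
    # stand-in for the independent MH move on one block of the trajectory
    return rng.uniform() < 0.9 ** jumps_per_block

def adapt_partition(n_jumps, M=500, r=0.25, B_max=50):
    B = 1
    while True:
        acc_rate = np.mean([toy_mh_accept(n_jumps / B) for _ in range(M)])
        if acc_rate >= r or B == B_max:
            return B                  # adaptation stops; B is fixed from here on
        B += 1                        # acceptance too low: refine the partition

print(adapt_partition(n_jumps=40))    # typically returns 3 or 4 blocks here
```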

Finally, the continuous time Markov chain trajectory may be highly correlated with some parameters in \(\psi \), which may compromise the mixing of those parameters. A simple and efficient way to mitigate this problem is to perform multiple updates of \((U, W, V)\) and \((V_0,V,V^{(T)},U,T,R)\) on each iteration of the Gibbs sampling. This issue is illustrated in the simulated examples.

3.5 Sampling \(\theta \) and \(\psi \)

It is straightforward to simulate from the full conditional distribution of \(\theta \) given the conditional independence structure and the conjugacy of its prior. The full conditional distribution of \(\theta _k\) is a \(Gamma(\alpha _k+m_k(V_0,V),\beta _k+\tau _k(V_0,V,T))\), where \(m_k(V_0,V)\) and \(\tau _k(V_0,V,T)\) are the total number of visits to and the total time spent at the kth functional form in [0, S], respectively. Moreover, \(\theta _{k \cdot }\) has a Dirichlet full conditional distribution with parameter vector \((\gamma _{k1}+m_{k1}(V_0,V),\ldots ,\gamma _{kK}+m_{kK}(V_0,V))\), if self-jumps are allowed, and \((\ldots ,\gamma _{k,k-1}+m_{k,k-1}(V_0,V),\gamma _{k,k+1}+m_{k,k+1}(V_0,V),\ldots )\), if no self-jumps are allowed, where \(m_{k_1k_2}(V_0,V)\) is the total number of transitions from functional form \(k_1\) to functional form \(k_2\) in [0, S].
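A sketch of this conjugate update, for the case with no self-jumps, is given below; the sufficient statistics (transition counts and occupation times) are passed in as illustrative inputs that, in the actual sampler, would be computed from the current trajectory \((V_0,V,W)\). All names are assumptions made for this example.

```python
# A sketch of the Gibbs update of theta (no self-jumps).
import numpy as np

rng = np.random.default_rng(5)

def sample_theta(m, tau, alpha, beta, gamma):
    """m: K x K transition counts between functional forms; tau: length-K
    occupation times; alpha, beta: Gamma hyperparameters; gamma: K x K
    Dirichlet hyperparameters."""
    K = len(tau)
    theta_diag, theta_rows = np.empty(K), []
    for k in range(K):
        visits = m[k].sum()                                   # m_k(V_0, V)
        # Gamma(alpha_k + m_k, beta_k + tau_k); numpy uses shape and scale
        theta_diag[k] = rng.gamma(alpha[k] + visits, 1.0 / (beta[k] + tau[k]))
        idx = [j for j in range(K) if j != k]                 # no self-jumps
        theta_rows.append(rng.dirichlet(gamma[k, idx] + m[k, idx]))
    return theta_diag, theta_rows

theta_diag, theta_rows = sample_theta(
    m=np.array([[0, 4, 3], [5, 0, 2], [2, 4, 0]]),
    tau=np.array([50.0, 30.0, 20.0]),
    alpha=np.ones(3), beta=np.ones(3), gamma=np.ones((3, 3)))
```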

Concerning parameters \(\psi \), we have from (13) that its full conditional density is

$$\begin{aligned} \pi (\psi |\cdot )\propto \exp \left\{ -\int _{0}^S\lambda (s)ds\right\} \prod _{n=1}^{N_S}\lambda (t_n)\pi (\psi ). \end{aligned}$$

The prior independence among the \(\psi _k\)'s implies the conditional independence of the respective K full conditionals. Moreover, for a constant \(g_k\) with fixed starting value, a \(Gamma(\eta _k,\nu _k)\) prior leads to the full conditional

$$\begin{aligned} Gamma(\eta _k+n_k(y),\nu _k+\tau _k(V_0,V,T)), \end{aligned}$$

where \(n_k(y)\) is the number of events from y occurring during the time that the IF assumes the functional form \(g_k\). For all the other functional forms we perform MH steps with an adapted random walk proposal (see Roberts and Rosenthal 2009, Section 2) for each vector \(\psi _k\). The acceptance probability for each \(\psi _k\) is given by

$$\begin{aligned} 1\wedge \frac{\pi (\psi _{k}^*|\cdot )}{\pi (\psi _k|\cdot )}, \end{aligned}$$

where \(\psi _k\) and \(\psi _{k}^*\) are the current and proposal values, respectively, and \(\pi (\psi _k|\cdot )\) is proportional to the product of the likelihood in (14) and the prior density of \(\psi _k\).
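For concreteness, the sketch below runs one adapted random-walk MH update loop for a block \(\psi _k\), in the spirit of the batch adaptation of the proposal scale in Roberts and Rosenthal (2009). The log target is a toy placeholder standing in for the log of the likelihood (14) times the prior of \(\psi _k\); the target acceptance rate and all names are illustrative assumptions rather than the paper's implementation.

```python
# A sketch of an adapted random-walk MH update for a block psi_k.
import numpy as np

rng = np.random.default_rng(6)

def log_target(psi_k):
    # toy stand-in for log pi(psi_k | .)
    return -0.5 * np.sum((psi_k - np.array([1.0, 0.3])) ** 2)

def adaptive_rw_mh(psi_k, n_iter=2000, batch=50, target=0.234):
    log_sigma, acc = 0.0, 0
    for it in range(n_iter):
        prop = psi_k + np.exp(log_sigma) * rng.standard_normal(psi_k.shape)
        if np.log(rng.uniform()) < log_target(prop) - log_target(psi_k):
            psi_k, acc = prop, acc + 1
        if (it + 1) % batch == 0:
            # adjust the proposal scale by a vanishing amount after each batch
            delta = min(0.05, 1.0 / np.sqrt((it + 1) // batch))
            log_sigma += delta if acc / batch > target else -delta
            acc = 0
    return psi_k, log_sigma

psi_k_draw, final_log_scale = adaptive_rw_mh(np.zeros(2))
```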

3.6 Prediction

Prediction is a common procedure associated with the statistical analysis of stochastic processes. In the context of unidimensional Poisson processes, prediction consists of estimating the future behavior of the process, in particular, its intensity function and/or events. The Bayesian approach allows prediction to be made under a probabilistic approach through the predictive distribution. Consider the full Bayesian model of a GMMPP Y in \([0,\infty )\) and let y be a realization of the process in [0, S]. Now define \(g(Y,V,T,\psi )\) to be some measurable function, in the probability space of the full Bayesian model, that depends on (Y, V, T) only in \((S,\infty )\). Then, prediction about \(g(Y,V,T,\psi )\) is made through the predictive distribution of \(g(Y,V,T,\psi )|y\).

In an MCMC context, it is straightforward to obtain a Monte Carlo (MC) sample from the predictive distribution as long as it is feasible to simulate from the full model. An MC sample is obtained by simulating \(g(Y,V,T,\psi )\) conditional on each value simulated along the MCMC (after a burn-in period) due to the fact that

$$\begin{aligned} \pi (g(Y,V,T,\psi )|y)\!=\!\int \pi (g(Y,V,T,\psi )|\varphi ,y)\pi (\varphi |y)d\varphi . \end{aligned}$$

Appealing examples of \(g(Y,V,T,\psi )\) include:

  i. \(\displaystyle \dot{\lambda } := \{\lambda (s)\}_{s \in (S,S+\dot{S})}\), for \(\dot{S}>0\);

  ii. \(\displaystyle \Lambda _{\dot{S}}=\int _{S}^{S+\dot{S}}\lambda (s)ds\);

  iii. \(N_{S+\dot{S}}-N_S\).

For examples i. and ii., it is enough to simulate the CTMC X conditional on each sample of \((X(S),\theta )\) and compute g for the respective sampled value of \(\psi \). For example iii., an extra step is required to simulate from a \(Poisson(\Lambda _{\dot{S}})\) distribution, conditional on each simulated value of \(\Lambda _{\dot{S}}\).
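The sketch below carries out this prediction for a single posterior draw, assuming the piecewise-linear two-form GMMPP of the earlier simulation sketch (prior 3, no self-jumps): the CTMC is simulated forward from the state at time S, \(\Lambda _{\dot{S}}\) is approximated numerically and the event count is drawn from a Poisson distribution (examples ii. and iii.). The inputs are illustrative and would come from one MCMC iteration.

```python
# A sketch of Monte Carlo prediction of Lambda and of the event count in
# (S, S + Sdot], for one posterior draw of the two-form linear GMMPP.
import numpy as np

rng = np.random.default_rng(7)

def predict_one_draw(S, Sdot, state_S, lam_S, theta, slopes, start_dists):
    # forward-simulate the CTMC and the IF on (S, S + Sdot]
    times, states, values = [S], [state_S], [lam_S]
    t = S
    while True:
        t += rng.exponential(1.0 / theta[states[-1]])
        if t >= S + Sdot:
            break
        states.append(1 - states[-1])          # two forms, no self-jumps
        times.append(t)
        values.append(start_dists[states[-1]]())
    # integrate the (truncated-at-zero) piecewise-linear IF segment by segment
    ends = np.r_[times[1:], S + Sdot]
    Lambda = 0.0
    for a, b, k, r in zip(times, ends, states, values):
        grid = np.linspace(a, b, 200)          # simple Riemann-sum quadrature
        Lambda += np.maximum(r + slopes[k] * (grid - a), 0.0).mean() * (b - a)
    return Lambda, rng.poisson(Lambda)          # examples ii. and iii.

Lam, N = predict_one_draw(S=100.0, Sdot=20.0, state_S=0, lam_S=3.0,
                          theta=np.array([0.05, 0.05]),
                          slopes=np.array([0.3, -0.3]),
                          start_dists=[lambda: rng.uniform(0.0, 2.0),
                                       lambda: rng.uniform(5.0, 10.0)])
```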

Table 1 The cost as a function of the number of observations

4 Simulated examples

This section presents a collection of simulated examples to explore important issues related to the methodology proposed in this paper. In particular, we explore: 1. the impact of the number of observations and the number of jumps in the IF on the computational cost of the MCMC algorithm; 2. a sensitivity analysis for the priors of \(\psi \) and \(\theta \); 3. the efficiency in estimation and prediction (with replications).

Convergence diagnostics are obtained based on the MCMC chain for the parameters, for some functions of the CTMC and for the log-posterior density (logarithm of the non-normalized posterior density). Computational cost is evaluated in terms of the average time (in seconds) to obtain 100 effective samples of the log-posterior density. We consider the running time including the calibration of the MH proposal for \(\psi \) and the adapting partition of [0, S] to sample \((V_0,V,V^{(T)},U,T,R)\). The effective sample size of an MCMC sample of size n is defined as \(n_{ess}=\frac{n}{1+2\sum _{j=1}^{\infty }\rho _j}\), where \(\rho _j\) is the autocorrelation of order j of the chain. It is such that the variance of the ergodic average of the n values from the chain is the same as the variance of the ergodic average of an independent sample (from the target distribution) of size \(n_{ess}\). The sum in the denominator is truncated at the last lag for which the autocorrelation is significantly different from zero (based on an asymptotic confidence interval). For the four examples in which the GMMPP is compared to a non-parametric IF model, we consider the effective samples of the log-likelihood instead of the log-posterior density.
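A sketch of the effective sample size computation described above is given below, with the autocorrelation sum truncated at the last lag falling outside an approximate 95% confidence band around zero; it is illustrative code, not the paper's (Ox) implementation.

```python
# A sketch of the effective sample size measure used in this section.
import numpy as np

def effective_sample_size(x):
    x = np.asarray(x, dtype=float)
    n = len(x)
    xc = x - x.mean()
    acf = np.correlate(xc, xc, mode="full")[n - 1:] / (xc @ xc)  # rho_0, rho_1, ...
    band = 1.96 / np.sqrt(n)               # asymptotic CI for zero autocorrelation
    signif = np.nonzero(np.abs(acf[1:]) > band)[0]
    last = signif[-1] + 1 if signif.size else 0
    return n / (1.0 + 2.0 * acf[1:last + 1].sum())

# e.g. an AR(1) toy chain with autocorrelation 0.9 has ESS of roughly n/19
rng = np.random.default_rng(9)
x = np.zeros(5000)
for t in range(1, 5000):
    x[t] = 0.9 * x[t - 1] + rng.standard_normal()
print(round(effective_sample_size(x)))
```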

All the examples are implemented in Ox (Doornik 2007) and run on an i7 3.4GHz processor with 16GB RAM. Codes are available at https://github.com/liviadutra/bayesian-inference-for-gmmpp.

4.1 Investigating the computational cost

In this section, we investigate the computational cost associated with the proposed methodology. In particular, we investigate the impact of the number of observations and of the number of changes in the IF trajectory. As emphasized before, the low computational cost is at the core of the main contributions of this paper.

The cost as a function of the number of observations

We simulate five scenarios with the same behavior for the IF (functional forms and changes) in the same time interval but with different levels of magnitude. We consider three functional forms—increasing and decreasing lines and a constant, with fixed starting values and no self-jumps allowed. Table 1 presents the specific functional forms, length of stay, mean number and actual number of observations. We fix the Q-matrix so that all the states have a mean staying time of 20 units and uniformly distributed transition probabilities. Multiple updates of blocks \((U, W, V)\) and \((V_0,V,V^{(T)},U,T,R)\) are performed to control the high autocorrelation of the parameters of the increasing line functional form—5 updates for scenarios A1 and A2, 15 for A3 and A4 and 25 for A5. Moderately informative priors are adopted for \(\psi \) in scenario A1, namely \(\psi _{11}\sim N(1,2^2)\) (intercept of the increasing line), \(\psi _{12}\sim N(0.5,0.6^2)\) (slope of the increasing line), \(\psi _{21}\sim N(5,2^2)\) (intercept of the decreasing line), \(\psi _{22}\sim N(-0.5,0.6^2)\) (slope of the decreasing line), \(\psi _{31}\sim Gamma(1,1)\) (constant). For all the other scenarios, independent uniform improper priors are used for all parameters but \(\psi _{31}\), for which a Gamma(1, 1) is also used.

Results regarding the estimation of the IF and \(\psi \) are presented in Fig. 13 and Tables 6 and 7 in “Appendix C”. They already show a reasonably good recovery of the IF and parameters for the dataset with only 103 observations, with the estimation improving substantially with the size of the dataset. MCMC diagnostics are presented in Figs. 16 and 17 in “Appendix D”.

The relation between the computational cost and the number of observations is shown in Fig. 4. We highlight the computational efficiency of the proposed MCMC algorithm, as shown by the running times. The methodology has proven efficient enough to be applied to very large datasets. For example, the total running time to obtain an effective sample size of 100 for the log-posterior density is around 2.3 min for the dataset with 10 thousand observations and 18 min for the dataset with 30 thousand observations.

Fig. 4 Computational cost, measured in terms of the time per 100 effective samples of the log-posterior density, versus the number of observations

Table 2 The cost as a function of the number of changes in the IF

The cost as a function of the number of changes in the IF

We simulate three scenarios with the same average number of observations and three functional forms—increasing and decreasing lines and a constant, with fixed starting values and self-jumps allowed. The IF is simulated from the same CTMC prior but considering different total observation times in order to have considerably different numbers of changes in the IF. The number of observations is approximately 2 thousand for all the scenarios. Table 2 presents the specific functional forms, average length of stay per visit and number of changes in the IF. The priors on the Q-matrix diagonal parameters are \(\theta _1\sim Gamma(1,10)\), \(\theta _2\sim Gamma(1,10)\) and \(\theta _3\sim Gamma(1,5)\) and, for the transition probability vectors, we adopt a uniform prior on the respective simplex. Finally, uniform improper priors are adopted for all the \(\psi \) parameters. Blocks \((U, W, V)\) and \((V_0,V,V^{(T)},U,T,R)\) are updated 5 times in each iteration of the Gibbs sampling.

Results regarding the estimation of the IF and of the \(\psi \) and \(\theta \) parameters are presented in Fig. 14 and Tables 6 and 7 in “Appendix C”. They show a very good recovery of the IF and parameters. MCMC diagnostics are presented in Figs. 18 and 19 in “Appendix D”.

The relation between the computational cost and the real number of changes in the IF is shown in Fig. 5. Again, we highlight the computational efficiency of the proposed MCMC algorithm shown by the running times. The total running time to obtain an effective sample size of 100 for the log-posterior density is around 105 s for the dataset with 40 changes in the IF and approximately 2 thousand observations.

Fig. 5 Computational cost, measured in terms of the time per 100 effective samples of the log-posterior density, versus the number of changes in the IF

Table 3 Informative priors for the sensitivity analysis of \(\psi \)

4.2 Prior sensitivity analysis

We perform a prior sensitivity analysis for the parameters \(\psi \) in scenarios A1, A3, A5 and B2. Those examples are run with non-informative and moderately informative priors, the latter set based on the scale of each example. A prior sensitivity analysis for the parameters \(\theta \) in the diagonal of the Q-matrix is also performed for scenarios B1 and B3. Again, non-informative and moderately informative priors are used.

In the first analysis, the Q-matrix is fixed for all the A scenarios at the same values as in Sect. 4.1. For scenario B2, the same non-informative priors from Sect. 4.1 are adopted. The prior on the constant IF parameter \(\psi _{31}\) is set to be Gamma(1, 1) in all the cases. For the parameters indexing the other two functional forms, we compare the results for improper uniform priors and for the moderately informative priors shown in Table 3. Results for the parameter estimation are presented in Table 8 in “Appendix C” and show that greater differences are observed only for the parameters of the increasing line. As expected, the variances of those parameters are greater for the non-informative priors, for which the posterior density is also more asymmetric. Results for the IF go in the same direction, with significant differences observed only for scenario A1 in the time period associated with the increasing line. It can be noticed that, under the non-informative prior, the posterior distribution of the IF is more influenced by the data, as expected (see Fig. 15 in “Appendix C”).

The second sensitivity analysis concerns the prior distribution of the parameters \(\theta \) in the diagonal of the Q-matrix. Improper uniform priors are adopted for all the \(\psi \) parameters except for the constant value \(\psi _{31}\), which has a Gamma(1, 1) non-informative prior. Uniform priors on the simplex are adopted for all the transition probability vectors in the Q-matrix. The non-informative priors for the Q-matrix diagonal parameters are improper uniform distributions and the informative ones are \(\theta _1\sim Gamma(1,10)\), \(\theta _2\sim Gamma(1,10)\) and \(\theta _3\sim Gamma(1,5)\). Results (omitted here) are virtually the same for the two prior specifications w.r.t. the estimated IF, the \(\psi \) parameters and the transition probability vectors of the Q-matrix. As for the parameters in the diagonal of the Q-matrix, small yet non-negligible differences are observed, with slightly larger variances for the case with non-informative priors.

4.3 Efficiency in estimation and prediction

4.3.1 Examples with replications

We now investigate the efficiency of the proposed methodology in terms of estimation and prediction by considering replications of the same model. We consider the IF from scenarios A1, A3 and B2 and generate 50 independent datasets for each one. Prediction for the integrated IF in the interval [400, 800] is performed for scenario B2 by sampling from its predictive distribution.

In order to summarize the performance of the methodology we consider the posterior distribution of the following measure of fit:

$$\begin{aligned} \frac{1}{S} \int _0^S|\lambda (s) -\lambda _R(s)|ds, \end{aligned}$$
(17)

where \(\lambda _R\) is the real intensity function.
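The measure of fit (17) can be approximated on a grid as in the sketch below, where lam_draw and lam_true are placeholders for vectorized functions returning \(\lambda (s)\) for a posterior IF draw and for the true IF, respectively; the posterior distribution of (17) is obtained by evaluating the measure at each IF trajectory sampled along the MCMC.

```python
# A sketch of the L1 measure of fit (17), approximated on a regular grid.
import numpy as np

def l1_fit_measure(lam_draw, lam_true, S, n_grid=5000):
    s = np.linspace(0.0, S, n_grid)
    return np.mean(np.abs(lam_draw(s) - lam_true(s)))   # approximates (1/S) * integral

lam_true = lambda s: np.where(s < 50.0, 1.0 + 0.3 * s, 16.0 - 0.3 * (s - 50.0))
lam_draw = lambda s: lam_true(s) + 0.5                  # a hypothetical posterior draw
print(l1_fit_measure(lam_draw, lam_true, S=100.0))      # approx. 0.5
```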

Results are shown in Figs. 6, 7 and 8 and reveal a very good performance to estimate and predict the IF.

Fig. 6 Results for 50 replications of scenario A1. Top left: true (black line) and posterior mean of the IF for each replication. Top right: posterior density of the measure of fit in (17) for each replication. Bottom: mean and 95% CI for the integrated IF for each replication; the horizontal line represents the real value

Fig. 7 Results for 50 replications of scenario A3. Top left: true (black line) and posterior mean of the IF for each replication. Top right: posterior density of the measure of fit in (17) for each replication. Bottom: mean and 95% CI for the integrated IF for each replication; the horizontal line represents the real value

Fig. 8 Results for 50 replications of scenario B2. Top left: true (black line) and posterior mean of the IF for each replication. Top right: posterior density of the measure of fit in (17) for each replication. Bottom left: mean and 95% CI for the integrated IF for each replication; the horizontal line represents the real value. Bottom right: real (black line) and posterior predictive density of the integrated IF in [400, 800]

Fig. 9 Comparison between the estimated IF (posterior mean and 95% pointwise CI) for the GMMPP (red) and the non-parametric IF (blue) models for scenarios C1, C2 and C3. The black dashed line is the true IF and circles are the observations. (Color figure online)

4.3.2 Comparison to non-parametric Cox process model

We compare our methodology to that of Gonçalves and Gamerman (2018), in which the IF is assumed to be a continuous positive function of a latent Gaussian process. As demonstrated in that paper, this is the most computationally efficient exact methodology for Cox processes with a Gaussian process-driven IF. The computational cost associated with the MCMC algorithm from Gonçalves and Gamerman (2018) is \(\mathcal {O}((\lambda _{sup}S)^3)\), where \(\lambda _{sup}\) is the supremum of the IF in [0, S]. This cost is determined by the generation of multivariate normal distributions, which is required due to the use of a latent Gaussian process. This implies not only that the cost is larger and grows much faster than that of our methodology, but also that it is not feasible to apply that methodology to very large datasets.

We consider three scenarios—C1, C2 and C3, with deterministic IFs. Scenario C1 considers a dataset of size 302 generated from the IF \(\lambda (s)=20\exp \{-s/5\}+1.5\exp \{-(s-25)^2/50\}\) in [0, 50] and excluding the 8 observations generated in [9, 14] (so, there is no “true” IF). The GMMPP is fitted with decreasing and increasing lines and a constant, all with non-varying intercepts (prior 1 for R). Uniform improper priors are adopted for all the parameters indexing the increasing and decreasing lines and a Gamma(1, 1) for the constant one. A Gamma(1, 20) prior is adopted for all the three diagonal parameters of the Q-matrix. The priors on the probability vectors of the Q-matrix are uniform Dirichlet. The estimated IFs for both models are shown in Fig. 9.

Scenarios C2 and C3 consider the IFs shown in Fig. 9, along with the estimated IFs for both models. C2 has 452 observations and C3 has 519. The GMMPP fitted in C2 considers increasing and decreasing lines with varying starting values (prior 3 for R) and a non-varying constant line (prior 1 for R). C3 considers increasing and decreasing lines, with a varying starting value only for the former. We set priors N(1.3, 1) and \(N(-1.3,1)\) for the slopes of the increasing and decreasing lines in C2 and uniform discrete priors on \((3,3.2,\ldots ,11)\) and \((8,8.2,\ldots ,16)\) for their intercepts, respectively. A Gamma(1, 10) prior is adopted for all the three diagonal parameters of the Q-matrix. All of those priors basically set the scale of the IF to favor model identifiability. For C3, we set N(15, 4) for the intercept of the decreasing line, uniform improper priors for the slope parameters and a uniform discrete prior on \((3,3.2,\ldots ,13)\) for the intercept of the increasing line. Priors Gamma(1, 5) and Gamma(1, 10) are adopted for the two diagonal parameters of the Q-matrix, respectively. The priors on the probability vectors of the Q-matrix are uniform Dirichlet in both scenarios.

The three scenarios show that the GMMPP and the non-parametric models produce similar results. We also compare those models in terms of computational efficiency (see Table 4), which reinforces the computational advantages of the proposed methodology.

Table 4 Time (in min) per 100 effective samples of the log-likelihood function

5 Applications

Here, we present two applications to illustrate our methodology. Furthermore, in “Appendix E”, we propose a specific model to fit epidemic curves and analyze data from Dengue Fever in Brazil and COVID-19 in Switzerland and Romania. This is a data-driven statistical approach that allows for uncertainty quantification in estimation and prediction and is computationally feasible to be applied to large datasets.

Fig. 10 Top: posterior probability of each functional form: constant (black), increasing line (green) and decreasing line (orange). Bottom: estimated IF for the coal mining data for the GMMPP and non-parametric IF models. The red line and pale-red shadow are the posterior mean and 95% CI for the GMMPP; the blue line and pale-blue shadow are the same statistics for the non-parametric model. Circles are the observations. (Color figure online)

5.1 Coal mining disasters

We apply the proposed methodology to the classic coal mining disasters data of Jarrett (1979), consisting of the dates of 191 explosions in coal mines that killed ten or more men in Britain between 15th March 1851 and 22nd March 1962 (re-scaled to [0, 111], year unit). We also analyze these data with the non-parametric IF model of Gonçalves and Gamerman (2018).

Based on an empirical analysis of the data (see Fig. 1), we set three functional forms—decreasing and increasing lines and a constant, all with varying starting values (prior 3 for R). We adopt the following priors: \(\text{ Uniform }(0.1,0.3,\ldots ,4.7)\) for the varying starting values of the increasing and decreasing lines and a Gamma(1, 1) for the constant; \(N(0.1,0.2^2)\) and \(N(-0.1,0.2^2)\) for the slopes; the three diagonal parameters of the Q-matrix are fixed at 1/60 and the transition distributions are fixed to be uniform. Fixing the diagonal parameters at 1/60 reflects the time scale of the data and penalizes scenarios with expendable changes in the intensity function, without compromising its estimation.

The computational time per 100 effective samples of the log-likelihood is 3.8 min for the GMMPP and 8.6 min for the non-parametric IF model (with no approximations to simulate from the Gaussian process). The estimated IF for both models is shown in Fig. 10, along with the posterior probability of each functional form. The posterior mean of the integrated IF is 201.7 (standard deviation = 15.5) for the GMMPP and 193.4 (14.5) for the non-parametric IF model. The posterior means of the slopes are -0.0811 (0.0966) for the negative one and 0.0904 (0.1009) for the positive one.

Fig. 11 Top: posterior probability of each functional form: increasing line (green) and decreasing line (orange). Bottom: estimated IF (posterior mean and 95% CI) for the exchange rate data. Circles are the observations

5.2 BRLxUSD exchange rate

We consider the exchange rate between the US Dollar and the Brazilian Real. The dataset consists of the 1163 days, between Jan 2000 and Dec 2017 (re-scaled to [0, 216], month unit), on which the daily variation was higher than 1%. Prediction is performed for the period from Jan 2018 to Apr 2020.

Based on an empirical analysis of the data (see Fig. 1), we set two functional forms—decreasing and increasing lines, both with varying starting values. We adopt the following priors: \(\text{ Uniform }(5,5.2,\ldots ,12)\) and \(\text{ Uniform }(0,0.2,\ldots ,6)\) for the varying starting values of the decreasing and increasing lines, respectively, and uniform improper priors for both the slopes. A Gamma(1, 40) is assumed for the diagonal parameters of the Q-matrix and uniform Dirichlet priors are used for the transition probabilities. The observed time interval is divided into 6 blocks to update the CTMC component.

Fig. 12 Posterior predictive density of the integrated IF for the exchange rate example. The vertical line represents the real observed number of events in the predicted period

The MCMC algorithm takes around 4.6 min to draw 100 effective samples. The estimated IF and the posterior probability of each functional form are shown in Fig. 11. The posterior mean and standard deviation of the integrated IF are 1143.1 and 34.3, respectively. The posterior means of the slopes are -0.3102 (standard deviation = 0.0651) for the negative one and 0.1176 (0.0355) for the positive one. The same statistics for the mean waiting times are 11.70 (4.18) for the decreasing line and 9.36 (3.16) for the increasing line.

The predictive distribution of the integrated intensity between Jan 2018 and Apr 2020 is shown in Fig. 12 and has mean 151.4, standard deviation 35.80 and 95% CI (78.44, 218.29). The real observed number of events is 151.

6 Conclusions

This paper proposed a novel class of unidimensional Cox processes in which the intensity function assumes predefined functional forms and alternates among these according to the jumps of a continuous time Markov chain. This novel class aims at providing an efficient way to perform useful statistical analysis of unidimensional point processes at a very reasonable computational cost, especially when compared to non-parametric approaches based on latent Gaussian processes.

Important issues regarding model elicitation and identifiability, as well as some aspects of the MCMC algorithm, are discussed and explored in simulated studies. Model elicitation should be based on prior knowledge and/or an empirical analysis of the data. Whilst non-informative priors work well for the parameters indexing the functional forms, prior elicitation for the parameters in the Q-matrix requires special attention. If not many changes are expected, the parameters in the diagonal should be fixed at values coherent with the time scale of the data, in a way that avoids very short visits. For the transition probabilities, uniform priors are suitable in any case.

The proposed MCMC algorithm performs exact Bayesian inference for GMMPPs so that only Monte Carlo error is involved. The algorithm is carefully devised to efficiently sample from the posterior distribution of all the unknown quantities in the model. In particular, the blocking scheme to sample from the CTMC trajectory has been shown to be crucial to obtain a computationally efficient algorithm. Simulated studies illustrated the computational and statistical efficiency of the proposed methodology under different circumstances. In particular, efficient solutions for large datasets are obtained at a reasonable cost.

Two real datasets are also analyzed and prediction is performed for one of them, providing good results. Finally, results indicate that, typically, models with only straight lines (increasing, decreasing and constant) are enough to provide a good fit.