PtProcess : An R Package for Modelling Marked Point Processes Indexed by Time

This paper describes the package PtProcess which uses the R statistical language. The package provides a uniﬁed approach to ﬁtting and simulating a wide variety of temporal point process or temporal marked point process models. The models are speciﬁed by an intensity function which is conditional on the history of the process. The user needs to provide routines for calculating the conditional intensity function. Then the package enables one to carry out maximum likelihood ﬁtting, goodness of ﬁt testing, simulation and comparison of models. The package includes the routines for the conditional intensity functions for a variety of standard point process models. The package is intended to simplify the ﬁtting of point process models indexed by time in much the same way as generalized linear model programs have simpliﬁed the ﬁtting of various linear models. The primary examples used in this paper are earthquake sequences but the package is intended to have a much wider applicability.


Introduction
This paper describes a unified approach to fitting and simulating a wide variety of temporal point process or temporal marked point process models using the R statistical language (R Development Core Team 2010).These models are defined in terms of a conditional intensity function, i.e., conditional on the history of the process over time.
Point processes are used for modelling a series of events occurring at points in time.A typical and important example is the series formed by the times that earthquakes occur in a given region.Point processes often show a lot of structure.For example, a large earthquake is often followed by a sequence of aftershocks.Some of the aftershocks may, themselves, have aftershocks.These characteristics can be observed in the upper panel of Figure 1.It shows the Phuket boxing day event on 2004-12-26, and its aftershock sequence, that caused the devastating boxing day tsunami in the Indian Ocean.The aftershock sequence of events is also very characteristic of those associated with large earthquakes.Initially, the aftershocks occur very frequently, and some have large magnitudes.Both the frequency and magnitude size diminish over time.One can also observe that the larger aftershocks also have their own associated aftershock sequence.An accessible review of these models is provided by Vere-Jones (2009), who also discusses these models in the context of forest-fires.
An important concept in temporal point process theory is the conditional intensity function (conditional on the history of the process over time).This describes the probability of a new event in the next instant of time given knowledge of the past up until the present.The conditional intensity function of a marked point process can be considered as having two parts: the ground intensity function together with the distribution of the marks, both conditional on the past, where the ground intensity function simply describes the non-homogeneous Poisson rate of the process over time.Examples of possible marks in the earthquake context are: the earthquake location, magnitude, and rupture direction.Rigorous definitions are given in Section 2 of this paper.
In the development of this software, the ground intensity function provides the unifying principle.Given the ability to calculate values of the ground intensity function, some integrals of the ground intensity function, and the conditional distribution of the marks, one can write generic routines to calculate likelihoods, fit parameters, carry out goodness of fit tests and do simulations.The R statistical language has the required structure for doing this and the package PtProcess (Harte 2010), described here, is an implementation of this approach within the R framework.
The concept of using a general framework for fitting a wide variety of models is well known, particularly for analysing both designed experiments (Nelder 1977) and generalized linear models (Nelder and Wedderburn 1972).This unified approach resulted in an enormous simplification in the analyses of linear models carried out by applied statisticians.It is the author's aim that the package PtProcess (Harte 2010) provides an appropriate unified approach for people analysing a range of temporal point process models.
The PtProcess package was originally written (Harte 1998) in S-PLUS to model processes indexed by time and conditional on their history.It has recently been revised to include marked point processes, and has been added to the Comprehensive R Archive Network (CRAN) at http://CRAN.R-project.org/package=PtProcess.The R open source software (R Development Core Team 2010) provides a sophisticated programming environment for the development of such models and is also available for people world-wide to freely use.While functions for some specific models have been provided in the PtProcess package, the purpose is mainly to provide a structure and environment for users to define and analyse their own marked point process models.
The development of the PtProcess package has followed a direction of including the models that we are currently applying in our earthquake modelling.However, we have endeavored to structure the package with sufficient generality to allow for the inclusion of a wider class of temporal point process model: both those of non marked point process type, and those with a non seismological application.
A package available in CRAN for modelling spatial point processes is spatstat (Baddeley andTurner 2005, 2010).The models provided within spatstat have no natural definition of past history and are based on a Papangelou conditional intensity (i.e., conditional on the spatial locations of other points, with no time ordering), and hence are quite different to those models discussed in this paper.The ptproc package by Peng (2003) is based on an earlier version of PtProcess, extending it into multiple dimensions, but deals more directly with the conditional intensity function.
In Section 2 we provide a very brief theoretical background to marked point processes, with an emphasis on those parts that influence the structure of the package.In Section 3 we describe the main structural features of the package and how the underpinning theory has influenced that structure.In Section 4 we provide some example code for various analyses.

Mathematical background
In this section a very brief summary of marked point processes is provided.The two most important components of such a model are the ground intensity function and the mark distribution.The ground intensity function describes the rate at which events occur over time, and is not only influenced by the current time, but also the events that have occurred before the current time (i.e., process history).This is described in Section 2.1.The mark distribution describes other variables associated with the event, referred to as marks, and can also be dependent on the history.The mark distribution is described in Section 2.3.Daley and Vere-Jones (2003, Chapter 7.3) give an extensive theoretical background to this class of models and also many other point process models.

Ground intensity function
The conditional intensity function describes the instantaneous Poisson rate.For example, consider events that are ordered by time occurring in 2D space.The history up to but not including time t is denoted by H t , and in this case is where t i is the time of the ith event and (x i , y i ) is its spatial location.The conditional intensity function is where The ground intensity function describes the instantaneous Poisson rate as a function of time only.Let N δ (t) be the number of events in [t, t + δ).Then the ground intensity function is defined as where (θ 1 , • • • , θ m ) ∈ Θ m are the parameters.Often these are excluded, and so λ g (t|H t ) is interpreted as An example of a ground intensity function is that of the simple (non-spatial) ETAS model, i.e., epidemic type aftershock sequence model used in modelling earthquake counts, see Ogata (1988Ogata ( , 1998Ogata ( , 1999)).It assumes that certain earthquake aftershock sequences can be modelled like an epidemic, i.e., large earthquakes inducing more aftershocks (higher infection rate) in a given interval of time and also, the aftershock sequence extends for a longer time after the mainshock event.The ground intensity function is The parameters (µ, A, α, c, p) are all positive, t i is the time of the ith event with magnitude M i , and M 0 is a threshold magnitude, generally a value below which the dataset is incomplete.
The history of the process up to time t includes all event times and magnitudes up to but not including t, i.e., The ETAS model says that events are generated as a Poisson process with rate µ.The first term under the summation (i.e., exp{α(M i − M 0 )}) says that larger events raise the intensity more, and the last term determines the length (time) of the aftershock sequence.
Obviously, the parameters require certain constraints, else the aftershock sequence (epidemic) could explode and never die out.See Ogata (1988Ogata ( , 1998Ogata ( , 1999) ) for more details of this model.
Figure 1 contains a plot of the ETAS ground intensity function that has been fitted to the aftershock sequence of the Phuket boxing day earthquake.The mainshock occurred on 2004-12-25 at 00:58:53.45GMT.The aftershock sequence extends for a considerable time after the mainshock and also includes additional mainshock-aftershock sequences.The data can be found in Harte (2010) and are discussed further in Section 4.

Likelihood function of simple model
Here we derive the likelihood function for the simple model where the conditional intensity function is only a function of time, i.e., takes the form of the ground intensity function in Equation 2.
Let τ be the time of the last event before time t.Also let ∅ (τ,t) be the null outcome, i.e., no events in the interval (τ, t).Denote the conditional distribution of the time of the next event as and h(t|H τ ∩ ∅ (τ,t) ) as the corresponding conditional density function.Then .
Solving the differential equation gives where τ is the time of the last event occurring before t.Rearranging gives the density function as  (0.054, 3.15, 1.34, 0.021, 1.12).The spatial boundaries of the analysed region are 89 • E, 105 • E, 5 • S, and 16 • N. Events with magnitude ≥ 5 between midnight on 2004-01-01 and 2009-01-01 have been selected, giving 1248 events.An epicentral plot of the same events can be seen in Figure 6.
are event times, and [T 1 , T 2 ] represents the interval within which events are explicitly included into the likelihood.Those events prior to T 1 , if available, are included in the history of the process.In the model fitting context this distinction allows one to use the events prior to T 1 to enable the process to "reach equilibrium" or a steady state.Then

Marked point process
Consider again the situation where the conditional intensity function is given by Equation 1.By a similar argument as in Section 2.2, its log-likelihood is where T ⊆ R + is a time interval, and X and Y are the domains of x and y, respectively.Obviously this can be extended to include many additional variables.For brevity, we rewrite it as log L = i : where y now represents a multivariate variable, referred to below as the marks.
A marked point process is a temporal point process {(t i , y i )} on R + × Y with the additional property that the conditional intensity function can be written as the following product The general form of the log-likelihood of a marked point process then follows directly from Equation 5as Note that the ground intensity and mark distribution could share some common parameters, hence Θ m (Equation 2) and Ψ p are not necessarily disjoint (i.e., Θ m ∪ Ψ p ⊆ R q where q ≤ m+p).While a trivial concept here, it is an important consideration in the software structure in Section 3.
Consider again the ETAS model applied to the Phuket earthquake sequence.Often the ETAS model is fitted under the assumption that the event magnitudes (i.e., M i − M 0 ) have an exponential distribution that is independent of the history of the process.An alternative assumption could be that during the initial part of the aftershock sequence when λ g (t|H t ) is relatively higher, we might expect the magnitude distribution to have fewer small events, hence a greater mean, and this mean would return to a lower level as the ground intensity returned to the relatively low background level.In this situation, the mark distribution (i.e., of the magnitude) is conditional on the history of the process, which will determine the current value of λ g (t|H t ).

Residual process
Let t i , for i = 1, • • • , n, be the times of the observed events.Then consider the transformed times where λ g (t|H t ) is the fitted ground intensity function.If the data are sampled from a process with ground intensity λ g (t|H t ), then the transformed time points will form a stationary Poisson process with rate parameter one.This sequence of transformed times is referred to as the residual process.Further details can be found in Ogata (1988) and Aalen and Hoem (1978).
A simple diagnostic graphical test of the model goodness of fit is to plot the event number i (i.e., i = 1, • • • , n, horizontal axis) versus the transformed time τ i (vertical axis).Note that the horizontal scale is not linear in time.The points should roughly follow the straight line y = x.Significant departures from this line indicate a weakness in the model.Further, if in a given sub-interval the line has a slope less than one then the transformed times τ i are too small, which indicates that the fitted ground intensity function λ g (t|H t ) is too small in this sub-interval.Conversely, if the slope is greater than one, the fitted ground intensity function λ g (t|H t ) is too large in this sub-interval.
An alternative is to plot τ i − i on the vertical scale.Such a plot is referred to as a cusum in the quality control literature and was devised by Page (1954), see also Lucas (1985).In the present context, the "cumulative sum" is that of the inter-event residual times, and the subtraction of i removes the mean.Hence, when the process is "under control", the cusum will have a zero slope.A positive or negative slope indicates that λ g (t|H t ) is either too large or too small, respectively.A disadvantage of the first graphical representation is that the vertical scale of the plot increases linearly with the length of the series, and hence deviations from the line y = x will become visually smaller and hence not as noticeable with increasing n.The cusum technique effectively reduces the vertical scale.In the same way as Brownian motion will deviate further from the origin with increasing time, so will the cusum, but it will be at a much slower rate than in the first plot, i.e., O( Zhuang (2006) and Schoenberg (2003) discuss the residual process in more general settings.

Model simulation
The PtProcess package uses the thinning method (Lewis and Shedler 1979;Ogata 1981) to simulate event times.Essentially one takes a small sub-interval at the beginning of the required simulation period.Then one finds the maximum of λ g (t|H t ) on the sub-interval, and simulates an inter-event time according to this maximum rate λ max .Given this has been done using λ max , the inter-event time will, on average, be too small.Hence, events will be generated too frequently; thus we generate a potential event (too many on average), and then "thin".One then moves along the simulation period and considers the next sub-interval.

Thinning algorithm
1. Let τ be the start of a small simulation interval.
3. Calculate the maximum of λ g (t|H τ ) in the interval as 4. Simulate an exponential random number ξ with rate λ max .

If
then go to step 6.

If
then a new "event" occurs at t i = τ + ξ.Simulate the associated marks for this new event.
8. Increment τ for the next event simulation: 9. Return to step 2.
If λ g (t|H t ) is monotonically decreasing (except at event times), as in the ETAS model, then the selection of δ has no effect because λ max = λ g (τ |H τ ).When λ g (t|H t ) is monotonically increasing (except at event times), there are two extreme situations that could cause the simulation method to be inefficient.If δ is too small, λ max will be relatively small, hence ξ quite large, possibly greater than τ + δ.Here many small intervals will be considered, but each with a very low likelihood of including an event.If δ is too large, λ max will be relatively large, hence ξ will be quite small.This could be inefficient as many potential "events" will be "thinned".Hence, for best efficiency, one requires δ to be not too small, but also not too large.Our algorithm sets δ as the 70th percentile of an exponential distribution with rate equal to λ g (τ |H τ ).
The thinning algorithm is not necessarily the most efficient or intuitive method of simulation.However, it does provide a method that is sufficiently general for a reasonably wide class of models.An alternative method for simulating the ETAS model, which also gives more intuition into the underlying model, is to initially simulate the ancestor (mainshock).Then one would simulate the children of this ancestor (1st generation), then their children (2nd generation), and continue until the given family line dies out.At this point the aftershock sequence (family tree) associated with the given mainshock event (ancestor) has expired and the process returns to the underlying rate of seismicity.In fact, the ETAS model assumes that all events are potential ancestors, though many are of sufficiently small magnitude that they have no offspring.

Overview of structure
The essence of the package structure hinges around the concept of a model object.In the present paper, we are particularly interested in marked point process models.Hence, we want to define the structure for an R object which embodies all of the information that is required to define such a model.We want to do this in such a way that we can perform generic like operations (functions) on this object, for example, a model summary, calculate the residuals, calculate the log-likelihood, possibly an appropriate plot, simulate data, and so on.The R programming language has a number of generic functions (e.g., plot, summary, residuals, logLik, simulate, etc).We want to provide methods for some of the R generic functions for when such a generic function is applied to a marked point process model object.
The model object needs to contain the dataset and an appropriate specification of the mathematical structure of the model.We will need to calculate the log-likelihood, particularly if we want to maximize the likelihood with respect to the model parameters.Generally, optimisation functions require all estimable parameters to be represented as a single vector.We also want the users to be able to define their own conditional intensity functions and build an appropriate model object in such a way that the provided methods for marked point process objects will work on their user defined objects.
In the preceding section (Section 2), it was shown that the marked point process contains two important "building blocks", the ground intensity function and the mark density function.In fact, it can be seen that if the R representations of the ground intensity function and mark density function can both be evaluated at arbitrary values of time, and the integral of the ground intensity function can be calculated over arbitrary intervals, then we can calculate the log-likelihood, simulate a process, calculate the residual process, and so on.In Section 3.2 and Section 3.3 we define a general structure for the R representations of the ground intensity function and the mark density function, respectively.
As already noted, we require the model parameters to be represented to external functions as a single vector.This enables the log-likelihood as given by Equation 7to be maximized over the complete model parameter space.The model parameter space is Θ m ∪Ψ p where Θ m is the parameter space of λ g (t|H t ), and Ψ p is the parameter space of f (y|H t ).Further, Θ m ∩ Ψ p is not necessarily ∅.Therefore, we need two mappings that map the complete parameter space into Θ m , and another mapping it into Ψ p .The structure for these mappings is described in Section 3.4.
In Section 3.5 we define the structure of the R marked point process model object, and in Section 3.6 we briefly describe generic functions in R that currently have methods defined for the marked point process model object.The functions that provide these methods are relatively simple because they can be written in a very general manner; everything that is specific to the model of interest is contained within the model object, namely the ground intensity function and the mark density function.This is why these two functions must conform to a general specification (Sections 3.2 and 3.3, respectively); they will be executed directly by the methods functions.
The last subsection (Section 3.7) describes how parameters can be estimated by maximising the likelihood function.

Ground intensity function
Ground intensity functions, defined in Section 2.1, are programmed as R functions with the following structure.In example code below, we generically refer to individual functions as gif.Those provided in the PtProcess package have a naming convention name_gif.Currently they include jump processes that are conditional on their history (e.g., etas_gif, srm_gif, linksrm_gif), and non-homogeneous Poisson processes that are not conditional on their history (e.g., simple_gif, expfourier_gif, exppoly_gif, fourier_gif, poly_gif).
The R gif functions not only evaluate values of λ g (t|H t ), but also its integral, which is required to calculate the log-likelihood (Equation 7) and the residual process (Equation 8).
In this sense, the R gif function does more than that specified by the mathematical definition given in Section 2.1.

Forms of usage
An R gif function has two usages, one to evaluate λ g (t|H t ) at a specified vector of time points, and another to evaluate the integral of λ g (t|H t ) on a specified interval.The usage is determined by the mix of supplied arguments.The two usages are, respectively, gif(data, evalpts, params, tplus = FALSE) and gif(data, NULL, params, TT) where the arguments are defined as follows.
data is a data frame containing the history of the process (H t ).It should contain all variables that are required to evaluate the gif function, though can contain others too.No history is represented as NULL.
evalpts is a vector of times at which the gif function is to be evaluated.
params is vector containing values of the parameters required by the gif function.
TT is vector of length 2, being the time interval over which the integral of the ground intensity function is to be evaluated.
All gif functions must have the same arguments even if they are redundant for the given point process under study (e.g., a homogeneous Poisson process is not dependent on its history).

Function attributes
Functions have been given an attribute "rate", taking the values of "bounded", "decreasing" or "increasing".This is used within the simulate method for 'mpp' objects which uses the thinning method (see Section 2.5).This method requires a knowledge of the maximum of λ g (t|H t ) in a given interval.The argument tplus is also used by the simulation routine, where it is necessary to determine the value of the intensity immediately after a simulated event.

Example function
Consider the following simple "conditional" intensity function where α, β and t are all assumed to be positive.In fact the function is not dependent on the history of the process at all, but the essential structure of the R function below is still the same; the history argument (data) is simply not used within the function.It can be seen that the function has two parts, depending on whether the argument TT is missing.If an interval TT = (T 1 , T 2 ) has been specified, then the integral ) is evaluated at each value of t specified within evalpts.Checking for positive values could also be included within the function, but this would be at the expense of execution time.
The other unused function argument in this example is tplus.In many processes that are dependent on the history, a jump occurs in λ g (t|H t ) at the time of an event, more specifically, the process is assumed to be left continuous.If tplus == TRUE, then λ g (t + |H t ) is returned, i.e., the value of the function immediately after the jump.This value is required by the thinning simulation algorithm.Note also the attribute that has been attached at the bottom of the function code.This is also required by the thinning simulation algorithm to determine λ max (see Section 2.5).
Code for other functions that are conditional on the process history can be scrutinized by printing out the function.The essential structure of the R function is the same as in the example above.

Mark distributions
In order to fit a marked point process, we need to define the appropriate R function to calculate the mark densities.If we want to simulate a marked point process, then we need to define the appropriate R function to simulate the mark variates.Consider a mark distribution, which could be multivariate, called "xyz ".We will refer to the corresponding R mark density and mark simulation functions as dxyz _mark and rxyz _mark, respectively.The prefixes "d" and "r" are chosen to be consistent with the nomenclature used in the standard R probability functions.R functions representing mark density and random number generators have the follow generic argument structure: dxyz _mark(x, data, params) and rxyz _mark(ti, data, params) respectively, where: ti is the time of an event (scalar), x is a data.frame of mark values and times of specific events, generally a subset of the history, data is a data.framecontaining the history of the process, denoted below as H t , and params is a numeric vector of parameters.
Mark density functions must return a vector with length being equal to the number of rows in x.Each element contains the logarithm of the joint density of the marks corresponding to each event time (row) in x.
The random number generator simulates each mark for a single value of ti.It must return a list of simulated marks corresponding to the specified time ti, and so each component in the list will be of length one.A list is used (rather than a numeric vector) because it allows marks to have a more complex structure.
Consider again the example of the Phuket earthquake sequence.Assume that after very large events, there is a deficit of smaller magnitude events and more larger magnitude events (i.e., relative to the exponential distribution)1 ; in particular a gamma distribution with shape parameter that is greater than one.When the shape parameter is equal to one, we have the exponential distribution.We want the distribution to return to roughly exponential after the ground intensity has returned to a low level that is characteristic of the background level of seismicity.Hence, define an R mark density function with seven parameters.Parameters p 1 , • • • , p 5 will be required to evaluate the R ground intensity function etas_gif.Assume that the magnitude distribution is gamma with scale parameter p 6 , and with a shape parameter given by shape where p 7 (p 7 > 0) is a free estimable parameter.When λ g (t|H t ) is small, the magnitude distribution roughly returns to the exponential distribution with an approximate rate of p 6 .The square root transformation was chosen pragmatically, to have a similar effect as a log transform, but ensuring that the values all remain positive.The density function could be written as follows: In the above example, we have modified the exponential distribution to restrict the number of very small events just above the threshold magnitude M 0 .However, another problem with the exponential distribution is that its tail is too long, and hence if very large earthquake sequences are simulated, earthquake events with unrealistically large magnitudes2 will occasionally be generated.The gamma distribution above actually compounds this problem.Ideally, one also wants to truncate the tail of the exponential distribution, see Harte (2010, topic dpareto).

Model parameter space
The parameter spaces of the ground intensity function and mark distribution (i.e., Θ m and Ψ p ) are not necessarily disjoint.In order to provide sufficient flexibility, mappings need to be defined from the full model parameter space to both Θ m and Ψ p .These are specified by means of R expressions.
The mapping expressions can contain any legitimate R arithmetic expression.The complete set of model parameters are assumed to be contained in a vector called params.Here is an example of a five parameter model, where the gif has four parameters (mapping gmap), and the mark distribution has three parameters (mapping mmap), with mappings specified as follows: Note that params[4] is exclusive to mmap, params[3] is exclusive to gmap, and all other parameters are shared.Note the inclusion of the R combine (c) function, because the expression must create a vector of parameters.These expressions are embedded directly into the code of various functions.

Marked point process model object
A marked point process is defined within a list object of class 'mpp'.By giving the object a class, we can use the object orientated nature of the R programming language and provide appropriate methods for some generic functions that are already defined within R.
The 'mpp' object must contain all of the information required to define the marked point process model.It can be constructed using the mpp function which has the following usage mpp(data, gif, marks, params, gmap, mmap, TT) where the arguments are as follows.
data is a data.framecontaining the history of the process, denoted below as H t .It should contain all variables that are required to evaluate the gif function and the mark distribution, though can contain others too.No history is represented as NULL.
gif is the ground intensity function as described in Section 3.2.
marks is a list containing the mark distribution as described in Section 3.3.The first component of the list contains the mark density function and the second the random number generating function, e.g., list(dxyz _mark, rxyz _mark).Undefined components are represented as NULL, e.g., list(dxyz_mark, NULL).
params is a numeric vector of all model parameters.
gmap is an expression that maps the model parameters (params) into the parameter sub-space Θ m of the ground intensity function; see Section 3.4.
mmap is an expression that maps the model parameters (params) into the parameter sub-space Ψ p of the mark distribution; see Section 3.4.
TT is a vector of length 2, being a time interval.Its use depends on whether the log-likelihood is being calculated or events are being simulated; see Section 3.6 for further details.

Generic functions
The R generic functions logLik, plot, residuals, simulate and summary currently have methods for the 'mpp' model object.The algorithms to calculate the log-likelihood, residual process, and simulate data were given in Section 2.3, Section 2.4 and Section 2.5, respectively.The R code can be scrutinized by printing the logLik, residuals, or simulate methods, respectively.Similarly, the methods for plot and summary can be scrutinized by printing plot or summary methods, respectively.Let x be an object with class 'mpp'.We now briefly describe these functions.

logLik(x) returns the log-likelihood. If x$marks[[1]
] is NULL, then log L as in Equation 4 is returned, else log L as in Equation 7is returned.Events included in the interval x$TT explicitly enter the likelihood, however, all events (even before x$TT[1]) are included in the history.This allows one to initially wait until x$TT[1] for the process to reach a steady state.
plot(x) plots the ground intensity function as in the lower panel of Figure 1.
residuals(x) returns a vector ("ts" object) of event times of the residual process, see Equation 8.
simulate(x) returns an object of class 'mpp'.By default, the function simulates events between x$TT[1] and x$TT[2].This can be overridden by a more sophisticated stopping condition.The new object will contain all events in the object x prior to x$TT[1] and then the newly simulated events.
summary(x) returns a list object containing a summary of the model.

Model fitting
Maximising the likelihood function is a non-trivial computational exercise.There are a number of possible problems: some optimisation routines are very sensitive to poor initial starting values of the parameters, hard boundaries on the parameter space, and different parameters may have very different scales or ranges of appropriate values.We will discuss these problems and how they are related at the end of this subsection.
Initially we want to set up a fairly general framework that gives the user flexibility to use different optimizers, utilize provided parameter scaling within the optimizer or implement their own, and also implementation of parameter constraints.This is achieved with a genericlike function called neglogLik.It returns the negative log-likelihood of the model.The function code is essentially as follows: R> neglogLik <-function(params, object, pmap) { + object <-pmap(object, params) + return(-logLik(object)) + } where: params is a vector of parameter values that could belong to a subspace of the full parameter space, further details below, object is a model object, and pmap is a user provided function mapping the revised parameter values params into the appropriate locations in object.The default is that the params vector is mapped one to one (untransformed) onto the model vector.
Initially new parameter values contained in params are mapped into the appropriate locations in the model object.The log-likelihood function is then calculated, the method being determined by the class of the object.It is essentially a "wrapper" function for the generic logLik function.We now show that it serves a number of useful purposes: as well as outputting the negative value of the log-likelihood, it also allows one to restrict the parameter space to a subspace of the full model, and also allows one to transform the parameters to eliminate boundary problems.

Parameter transformation: Case 1
Consider a model defined in an object x.Assume that the full model has five parameters, and we want to estimate the parameters by maximising the likelihood.We can achieve this by using the R function nlm which is in fact a minimizer (can also use optim), hence the calculation of the negative log-likelihood.
We initially provide starting values for nlm.Say the following values are reasonable: 10,20,30,40,50) We want nlm to start with these initial values, and iteratively calculate new versions until convergence is achieved.The logLik function call within the function neglogLik will use those parameter values that are contained within the current model object, so we need a function to update these values with the new estimates at each iteration: R> allmap <-function(y, p) { + y$params <-p + return(y) + } Now the neglogLik function can be minimized using nlm, where the behaviour of the function neglogLik is determined by the class of the object x: Parameter transformation: Case 2 The iterative process can get into trouble if the domain of some parameters have boundaries and new estimates fall outside of this domain.Consider the above five parameter example again, but where we only want to maximize the likelihood over parameters 1, 3, 4, and 5 while fixing parameter 2 at 20. Further, assume that parameter 1 can only take values between 9 and 11, parameter 3 must be positive, and parameters 4 and 5 can take any real values.
This would be achieved with the following code.The modified logit transformation allows nlm to select values for the first parameter from the entire real line (eliminating boundary problems), which then get mapped into the required interval.Similarly, the logarithmic transformation ensures that parameter 3 will always be positive.

Hessian, standard errors, AIC and parameter scaling
Routines that find the maximum by essentially taking the "steepest ascent" will generally be quite efficient.However, to take this route, numerical derivatives (possibly 2nd too if a quadratic approximation is used) will need to be calculated (we have not included formulae for derivatives in the model object).If the provided initial values are too poor, then this strategy also performs poorly or may not even work.Ideally one would like a robust initial method, particularly if one is uncertain of the parameter values.The R function optim provides a number of possibilities.Its default method uses the Nelder-Mead simplex algorithm (Nelder and Mead 1965) which appears to be more robust than methods based on derivatives.
There are a few possible ways to calculate the standard errors of parameters.One way is to calculate the Hessian matrix.Again, depending on the method that the optimizer uses, the Hessian may well have been calculated, and an argument simply needs to be set to request the optimisation function to output this matrix.The covariance matrix of the parameter estimates can then be calculated by inverting this matrix, and the standard errors are the square roots of the diagonal elements of this inverted matrix.One needs to take some care here though too.For example, the estimated Hessian matrix will be dependent on how many iterations the optimizer has performed prior to the final iteration.At each iteration, the optimizer will generally revise the parameter scaling factors (see below), which in turn will affect the calculated numerical derivatives.The estimated matrix will be rather poor if few iterations have been done because the scaling factors will not have had sufficient time to "settle down".If the model variables have been transformed, then the Hessian provided by the optimisation routine will be of the transformed variables.One may want to revert to untransformed variables close to the solution if standard errors really are required.Further, use of such standard errors appeals to asymptotic theory of maximum likelihood parameter estimates (normality) which may provide a poor approximation if the sample dataset is too small.
An alternative to Hessian-based standard errors is to simulate data of the same size as the observed dataset using the estimated parameter values for the observed dataset.One would then estimate the parameters for the simulated dataset.By repeating this simulation and model fitting many times, one can derive empirical distributions for each of the parameters.Given that some of these models can take a considerable amount of computing time to fit, this would only be a viable option in the case of relatively small datasets, where the asymptotic theory is less likely to hold.At least in the case of the larger datasets, the asymptotic theory is more likely to hold.
In some respects, the problems associated with calculating the standard errors are not really such a problem.Often one wants to compare a sequence of different fitted models.However, the difference between this sequence of fitted models to a sequence of fitted regression models in the linear model context is that the point process models are generally not nested.Hence, it is not simply a matter of testing whether a particular parameter is equal to zero or not.In the point process application areas, greater use seems to be made of measures like the AIC to compare competing models.
Scaling of the parameters may also be required if the different parameters are on very different scales.Both of the R functions nlm and optim have an argument where the parameter scales can be specified.Alternatively, scaling can be jointly achieved in many situations by using the log-based transformations already discussed, in much the same way where one uses a log transformation in classical regression models.

Example
In this section we fit two models to the Phuket aftershock sequence (Figure 1).The analyses here are very preliminary, and to provide a satisfactory model description of the process is a very complex task.For example, one complication is the temporal-spatial-magnitude boundaries imposed on the dataset.The data sequence is part of a geophysical process that has and will continue ad infinitum, and the spatial region selected is part of a larger tectonic environment.Further, all earthquake catalogues are incomplete for small magnitude events, and the catalogue completeness magnitude is not necessarily constant in time and space3 .While the models fitted below describe certain features of the given dataset, they probably have a very poor forecasting ability of future events.
The dataset is contained in the PtProcess package (Harte 2010).The data were originally extracted from the PDE (preliminary determination of epicentres) catalogue provided by the US Geological Survey (ftp://hazards.cr.usgs.gov/pde/).The catalogue is incomplete for events with magnitude below about 4.5.The magnitudes are rounded to one decimal place.To further restrict the size of the dataset, we have only included events with magnitude ≥ 5, so that the effective magnitude cut-off is 4.95.The selected spatial region has boundaries 89 • E, 105 • E, 5 • S and 16 • N. It is arbitrary what unit of time is used in the ground intensity function, though some of the parameter values will be peculiar to the chosen unit.For studies over hundreds of years, one would probably use years.In this example the time variable is the number of days since the beginning of the dataset, i.e., midnight on 2004-01-01 is time zero.All 1248 recorded events up until midnight on 2009-01-01 have been included, i.e., 1827 days later.
In Section 4.1 an ETAS ground intensity function with an exponential magnitude distribution (independent of the history) is fitted to the data.In Section 4.2 a model with a gamma density for the event magnitudes (Equation 9) is fitted to the data.This mark distribution is dependent on the history of the process.The residual process for each model is calculated and plotted in Section 4.3, and a forecast based on simulation is performed in Section 4.4.

Null model
In this subsection we fit a model to the Phuket earthquake data (Figure 1) with an ETAS ground intensity function (Equation 3) and an exponential distribution that is independent of the process history for the event magnitudes.
We initially set up the model object in a way that it can be used more generally in the following subsections.This is done by using the gamma density for the event magnitudes (mark distribution) as in Equation 9.In this subsection, we then impose restrictions on the parameters to ensure that the exponential distribution is actually fitted.The gamma distribution reduces to the exponential distribution when the shape parameter is one (i.e., params[7]=0, the default value within the model object).In this situation the magnitudes are independent of the history and hence the rate is easily estimated as the inverse of the mean.This has also been set as the default value within the model object.Since the rate parameter of the exponential distribution has already been estimated and is contained within the model object, we only need to estimate the five parameters associated with the ground intensity function.The function expmap below restricts the estimation to only these five parameters.Further, the model parameters are all positive.By allowing the minimizers to work with the logarithms of the estimated parameters, values on the entire real line can be selected (hence no boundary problems), and these get mapped back onto the positive real line within the function expmap.

R> library("PtProcess
Initially the function optim is used to estimate the parameters.Estimates from this are then used as starting values for nlm.The function optim uses, by default, the Nelder-Mead simplex algorithm (Nelder and Mead 1965) which is generally more robust to poor starting values.
The function nlm uses a steepest ascent method, based on numerical derivatives.It is much more sensitive to poor initial values, but convergence is relatively fast with good starting values.

Full model
In this subsection we fit a model to the Phuket earthquake data (Figure 1) with an ETAS ground intensity function (Equation 3) and a gamma density for the event magnitudes (mark distribution) as in Equation 9. We use the model object x that has already been defined in Section 4.1.

Goodness of fit
The log-likelihood of both fitted models be calculated as follows.3) fitted to the Phuket earthquake of 2004-12-26 (00:58:53.45GMT) and aftershock sequence.The "full model" refers to that model using a gamma distribution which is dependent on the history for the event magnitudes, and the "null model" refers to that model where the magnitudes have an exponential distribution which is independent of the history.
R> print(logLik(x0)) R> print(logLik( x1)) The log-likelihood and parameter estimates of the (null) model with exponential magnitudes and the (full) model with gamma magnitudes are listed in Table 1.Using only one extra parameter, the difference between the log-likelihoods is quite significant.The difference between the parameters associated with the ETAS ground intensity function are quite small.The ground intensity function can be plotted as plot(x1, log = TRUE) (lower panel of Figure 1) and a summary of the model object as summary(x1).
A graphical representation of the model goodness of fit is to plot the residual process (Equation 8) as described in Section 2.4.The code is written below and the resultant graph is shown in Figure 2.
Plots of the residual process are essentially checking the goodness of fit of the ground intensity function.Figures 2 and 3 indicate that there is very little difference between these functions.This is consistent with the very similar parameter estimates for the two models as listed in Table 1.
A useful diagnostic to evaluate the goodness of fit of the mark distribution is to plot their cusums over time.Under the null model, the magnitudes have an exponential distribution with mean 1/p 6 .Similarly, in the full model the expected value of the magnitude of the event  3) with gamma magnitudes is initially fitted to the Phuket earthquake of 2004-12-26 (00:58:53.45GMT) and aftershock sequence.The example dataset runs until midnight on 2009-01-01.2000 simulations were commenced from this date until the first magnitude 6.5 event in each occurred.The histogram represents the empirical distribution of the times to this event.The dash lines represent the 0.5, 0.8, 0.9, 0.95 and 0.99 quantiles.

Current development
There are two problems where the current development of the package is being focussed.The first term in the log-likelihood function generally has a double sum.When upwards of 10,000 events are in the dataset, evaluation of the likelihood function is very slow.The second problem involves the product structure of the conditional intensity function given by Equation 6.In some models this condition can be too restrictive.In this section we briefly discuss these two problems, and the solutions being implemented.

Computational complexity
The main computational work in calculating the likelihood function, and hence fitting a model, is the calculation of the term in Equation 7.This is because it is generally conditional on the history of the process.Hence, given a value of t, we must effectively look through the historical events to see how they affect the ground intensity at t, which will occur as a summation like operation.In some cases, some sufficient statistics can be calculated (e.g., see function linksrm1_gif) and stored for subsequent function calls, and in others all calculations are repeated every time the function is called (e.g., see function etas_gif).
One way to speed up the calculation of this term is to use parallel processing.We have recently implemented this using the R package called snow.One question with parallel processing is at what level in the program structure should it be implemented, for example, at a relatively fine level like each time we multiply matrices, or at a much coarser level?If one is using a cluster of nodes (machines) each with multiple processors, the efficiency is very dependent on the speed of communication between processors.In clusters this communication is very slow compared to a super computer.Hence, it is best to allocate large amounts of work to the nodes, and minimize communication.We have implemented the parallel processing at a relatively coarse level, being done within the logLik method.This ensures that some benefits can be had when using any ground intensity function.Further, the most computationally intensive aspects of these models is the model fitting.Implementation within the likelihood function also ensures benefits during estimation too.
If there are n events, then the double sum in the first term of Equation 7 will contain approximately n(n + 1)/2 terms.There may be more if there are further historical events that occur before those that occur within the interval T .These terms are then split equally between the available processors.When we use two processors on the same node (machine) with a dataset of about 8,300 events, we get an increase in speed of about 1.8 over a single processor.When we use 6 processors (all roughly of comparable speed), 2 on each of 3 nodes, we get an increase in speed of about about 4 over a single processor.This reflects the slow connection between nodes compared to the much faster connection between processors on the same node.The speed factors reduce for datasets of fewer events.The test model was that discussed in the following section (Section 5.2).No parallel processing was applied to other terms in the likelihood function.

Spatial distribution as a mark
In Section 2, it was shown that the main aspects of modelling a marked point process involve the ability to calculate values of the ground intensity function λ g (t|H t ), calculate the integral of the ground intensity function, and calculate values of the mark density f (y|H t ).It also requires the condition that the conditional intensity function can be expressed as a product as in Equation 6.
In the case of earthquake modelling, we could, therefore, include the spatial location of the event, the rupture length, rupture direction, and other measured information about a given event.The catch is that the conditional intensity function must satisfy Equation 6.This can put a considerable constraint on the form of the model.To see the problem, consider again the ETAS model.Let then the ground intensity function of the non-spatial ETAS model given by Equation 3 can be written as λ g (t|H t ) = µ + A i:t i <t ξ i (t).( 10) If a spatial version of this model was to be a marked point process, then we would require a density function to describe the spatial distribution of the events, say f (x, y), such that the product with λ g (t|H t ), as in Equation 6, gives the conditional intensity function λ(t, x, y | H t ).
Unfortunately this model has the wrong structural form.
Consider the following argument.Given Equation 10, it is natural to think of the non-spatial ETAS model as a composite model, i.e., a composition of many processes.It contains a Poisson part with rate parameter µ, and the term Aξ i (t) can be thought of as the ground intensity function of the aftershock sequence associated with the ith event.We require ξ i (t) to be integrable so that the ith family line dies out, i.e., the aftershocks associated with the ith event eventually stop.Now assume that the ith event has spatial location (x i , y i ) (in 2D, disregarding depth) and magnitude M i .Then we would expect the aftershock sequence to be located relatively close to (x i , y i ), and the spatial spread would be expected to be larger if the magnitude M i was greater.This suggests that a possible very simplistic mark density to describe the spatial distribution of the aftershock sequence associated with the ith event could be something like where σ 2 = de β(M i −M 0 ) , and d and β are estimable parameters (see Ogata and Zhuang (2006) for other possibilities).Then a possible form of the conditional intensity function of a spatial ETAS model is λ(t, x, y|H t ) = ζ(x, y) + A i:t i <t ξ i (t)f i (x, y) , where, given that X × Y is the observation region, The function B(x i , y i ) is effectively a boundary correction accounting for the finite spatial observation region (i.e., X × Y).It can be seen that the summation in Equation 11 is being taken over many marked point process models.This composite marked point process model still has one important structural characteristic of the more simple marked point process model.If we integrate over the mark space, the marginal intensity is similar to the overall ground intensity function (Equation 10), except for the inclusion of the boundary correction, i.e., Y X λ(t, x, y|H t ) dx dy = µ + A i:t i <t B(x i , y i ) ξ i (t) , denoted by λ g (t|H t ), say.Because of this, a number of the generic methods written for the simple marked point process work also in this situation with only minor modifications.For example, we could simulate a spatial ETAS point process by generating the time to the next event based on the current value of λ g (t|H t ) as in the simple case4 .Once a new event time has been simulated (say t j ), the event would be allocated to one of the existing clusters or the background process, depending on the relative values of µ and B(x i , y i )ξ i (t j ) for all t i < t j .
Other marks, including magnitude, would be allocated to the new event at this stage as well.
Estimation of the process is reasonably straightforward too.The PtProcess package currently has a prototype function for this situation.
The isotropic mark density function defined above is obviously unsatisfactory to model the Phuket example dataset.In Figure 6 we see that the spatial distributions of the aftershocks associated with the largest mainshocks are not isotropic, and further, they are not even centred about the main shocks.The aftershocks of the 2004 Boxing Day event tend to occur to the north.This often occurs, i.e., the mainshock can be on the periphery of what will develop as the aftershock region.There is some current argument (McGuire, Zhao, and Jordan 2002) that various second moment properties of the mainshock provide information of whether the aftershock region will be offset in this manner.If this is true, these measured properties should be built into the spatial density function.Further, note the alignment of the events within a long thin approximately elliptical region whose major axis is roughly parallel to the island of Sumatra.One may also want to consider functions with a powerlaw decay rather than the exponential decay of the bivariate normal density function, for example, see Ogata and Zhuang (2006).

Figure 2 :Figure 3 :
Figure 2: Residual process times for the ETAS model (Equation3) fitted to the Phuket earthquake of 2004-12-26 (00:58:53.45GMT) and aftershock sequence.The solid black line represents the model using a gamma distribution that is dependent on the history for the event magnitudes, and the dashed red line represents the model where the magnitudes have an exponential distribution that is independent of the history.The dashed line cannot be distinguished from the solid line in this plot.The dates of the larger events are marked on the top axis.

Figure 4 :Figure 5 :
Figure 4: Cusum of event magnitudes for the ETAS model (Equation3) fitted to the Phuket earthquake of2004-12-26 (00:58:53.45GMT) and aftershock sequence.The solid black line represents the model using a gamma distribution that is dependent on the history for the event magnitudes, and the dashed red line represents the model where the magnitudes have an exponential distribution that is independent of the history.

Table 1 :
Parameter estimates of the ETAS model (Equation