Formulating State Space Models in R with Focus on Longitudinal Regression Models

We provide a language for formulating a range of state space models with response densities within the exponential family. The described methodology is implemented in the R-package sspir. A state space model is specified similarly to a generalized linear model in R, and then the time-varying terms are marked in the formula. Special functions for specifying polynomial time trends, harmonic seasonal patterns, unstructured seasonal patterns and time-varying covariates can be used in the formula. The model is fitted to data using iterated extended Kalman filtering, but the formulation of models does not depend on the implemented method of inference. The package is demonstrated on three datasets.


Introduction
Generalized linear models, see McCullagh and Nelder (1989), are used when analyzing data where response-densities are assumed to belong to the exponential family.Time series of counts may adequately be described by such models.However, if serial correlation is present or if the observations are overdispersed, these models may not be adequate, and several approaches can be taken.The book by Diggle, Heagerty, Liang, and Zeger (2002) gives an excellent review of many approaches incorporating serial correlation and overdispersion in generalized linear models.Dynamic generalized linear models (DGLM), often called state space models, also address those problems and are treated in a paper by West, Harrison, and Migon (1985) in a conjugate Bayesian setting.They have been subject to further research by e.g.Zeger (1988) using generalized estimating equations (GEE), Gamerman (1998) using Markov chain Monte Carlo (MCMC) methods and Durbin and Koopman (1997) using iterated extended Kalman filtering and importance sampling.
Standard statistical software does not not include procedures for DGLMs and only sparse support for Gaussian state space models.There is a need for a simple, yet flexible way of specifying complicated non-Gaussian state space models.Often, one need to tailor make software for each specific application in mind.A function, StructTS, has been developed for analysis of a subclass of Gaussian state space models, see Ripley (2002).The binary library SsfPack for Ox may be used freely for academic research and provides a tool set for analysis of Gaussian state space models with some support for non-Gaussian models, see Koopman, Shephard, and Doornik (1999).The interface is very flexible, but not as easy to use as a glm call in R.
Section 2 describes Gaussian state space models and shows how generalized linear models can naturally be extended to allow the parameters to evolve over time.We define components (e.g.trend and seasonal components) that separate the time series into parts that may be inspected individually after analysis.In Section 3 the syntax for defining objects describing the proposed state space models are described as a simple, yet powerful, extension to the glmcall in R (R Development Core Team 2006).The techniques are illustrated on three examples in Section 4.

State space models
The Gaussian state space model for univariate observations involves two processes, namely the state process (or latent process), {θ k }, and the observation process, {y k }.The random variation in the state space model is specified through descriptions of the sampling distribution, the evolution of the state vector, and the initialization of the state vector.
Let {y k } be measured at timepoints t k for k = 1, . . ., n.The state space model is defined by We assume that the disturbances {ν k } and {ω k } are both serially independent and also independent of each other.The possible time-dependent quantities F k , G k , V k and W k may depend on a parameter vector, but this is suppressed in the notation.
We now consider the case where the state process is Gaussian and the sampling distribution belongs to the exponential family, The density (4) contains the Gaussian, Poisson, gamma and the binomial distributions as special cases.The natural parameter η k is related to the linear predictor λ k by the equation The linear predictor in a generalized linear model is of the form λ k = Z k β, where Z k is a row vector of explanatory variables and β is the vector of regression parameters.The link function, g, relates the mean, E(y k ) = µ k , and the linear predictor, λ k , as g(µ k ) = λ k .The inverse link function, h, is defined as where τ is the mean value mapping.The following relations hold , where u is the inverse of v.The link function is said to be canonical if

Dynamic extension
The static generalized linear model is extended by adding a dynamic term, X k β k , to the linear predictor, where β k is varying randomly over time according to a first order Markov process.Hence, where β is the coefficient of the static component and {β k } are the time-varying coefficients of the dynamic component.
For notational convenience, we will use the notation The evolution through time of the state vector, θ k , is modelled by the relation for an evolution matrix G k , determined by the model.The error terms, {ω k }, are assumed to be independent Gaussian variables with zero mean and variance VAR(ω k ), with non-zero entries corresponding to the entries of the time-varying coefficients, β k , and zero elsewhere.
The model is fully specified by the initializing parameters m 0 and C 0 , the matrices F k , G k , and the variance parameters V k and VAR(ω k ).The variances may be parametrized as e.g.

Inferential procedures
For a Gaussian state space model, we write For exponential family sampling distributions, the iterated extended Kalman filter yields an approximation to the conditional distribution of the state vector given D n , see e.g.Durbin and Koopman (2000).By Taylor expansion, the sample distribution (4) is approximated with a Gaussian density, giving an approximating Gaussian state space model.The conditional distribution of the state vector given D n in the exact model and in the Gaussian approximation have the same mode.The iterated extended Kalman filter is used as filter and smoother method in sspir.

Decomposition
The variation in the linear predictor, random or not, may be decomposed into four components: a time trend (T k ), harmonic seasonal patterns (H k ), unstructured seasonal patterns (S k ), and a regression with possibly time-varying covariates (R k ).
Each component may contain static and/or dynamic components, which is specified by zero and non-zero diagonal elements in VAR(ω k ), respectively, as described in the following.

Formulating State Space Models in R
The block-diagonal evolution matrix takes the form k is defined in (9), and G (3) k in (12).The components are only present if the model includes the corresponding terms.
The linear predictor, will be detailed in the following.

Time trend
The long term trend is usually modelled by a sufficiently smooth function.In static regression models, this can be done by e.g. a high degree polynomial, a spline, or a generalized additive model.In the dynamic setting, however, a low degree polynomial with time-varying coefficient may suffice.
By stacking a polynomial, and the first p derivatives, the transition from t k−1 to t k obeys the relation where ∆t k = t k − t k−1 , and the upper triangular transition matrix is given by Using θ (1) k for the left hand side of ( 8), a polynomial growth model with time-varying coefficients can be written as θ k .The error term has variance VAR(ω 1) , where W (1) is diagonal in the case with independent random perturbations in each of the derivatives.
The trend component is the first element in θ Alternatively, the time trend may be modelled as a random function, q(t), for which the increments over time are described by a random walk, resulting in a cubic spline, see Kitagawa and Gersch (1984).The transition is the same as in ( 8) with p = 2, but only one variance parameter is necessary as,

Harmonic seasonal pattern
Seasonal patterns with a given period, m, can be described by the following dth degree trigonometric polynomial This component can be used to describe seasonal effects showing cyclic patterns.Further seasonal components may be added for each period of interest.

The random fluctuations in θ
(2) k is modelled by a random walk, θ

Unstructured seasonal component
For equidistant observations, a commonly used parameterization for the seasonal component is to let the effects, γ k , for each period sum to zero in the static case, or to a white noise error sequence in the time-varying case, see Kitagawa and Gersch (1984).For an integer period, m, the sum-to-zero constraint can be expressed as m−1 i=0 γ k−i = 0 in the static case, and in the dynamic case, . This is expressed in matrix form by letting θ Then, θ w , 0, . . ., 0) defines the evolution of the seasonal component.The corresponding term in the linear predictor is extracted by

Regression component
Observed time-varying covariates, R k , enter the model through the usual regression term k and VAR(ω 4) .The structure of W (4) is specified by the modeller and depends on the context.

Specification of state space objects
The package sspir can be downloaded and installed from http://CRAN.R-project.org/ and is then activated in R by library("sspir").Assuming that the data are available either in a dataframe or in the current environment, then a state space model is setup using glm-style formula and family arguments.Terms are considered static unless embraced by the special function tvar(), described further in Section 3.2.

State space model objects
In sspir, a state space model is defined as an object from the class ssm.The object defines the model and contains the slots that are needed for the subsequent statistical analysis.The definition of a state space model object has the following syntax ssm(formula, family=gaussian, data, subset, fit=TRUE, phi, m0, C0, Fmat, Gmat, Vmat, Wmat) The call is designed to be similar to the glm call.The elements in the call are formula a specification of the linear predictor (5) of the model.The syntax is defined in Section 3.2.
family a specification of the observation error distribution and link function to be used in the model, as in a glm-call.This can be a character string naming a family function, a family function or the result of a call to a family function.Currently, only Poisson with loglink, binomial with logit-link, and Gaussian with identity-link have been implemented.It is possible to expand with further combinations within the exponential family.
data an optional data frame containing the variables in the model.By default the variables are taken from 'environment(formula)', typically the environment from which 'ssm' is called.The response has to be of class ts.
subset an optional vector specifying a subset of observations to be used in the fitting process.
fit a logical defaulting to TRUE which means that the iterated extended Kalman smoother is used to fit the model.If FALSE, the model is only defined and no inferential calculations are made.
phi a vector of hyper parameters that are passed directly to Fmat, Gmat, Vmat, and Wmat.If phi is not provided, it is default set to a vector of ones with the length determined by the number of hyper parameters needed on the basis of the formula provided.
m0 a vector with the initial state vector.Defaults to a vector of zeros.
C0 a matrix with the variance matrix of the initial state.Defaults to a diagonal matrix with diagonal entries set to 10 6 .
Fmat a function giving the regression matrix at a given timepoint.If not supplied, this is constructed from the formula.
Gmat a function giving the evolution matrix at a given timepoint.If not supplied, this is constructed from the formula.
Wmat a function giving the evolution variance matrix at a given timepoint.If not supplied, this is constructed from the formula.
Vmat a function giving the observation variance matrix at a given timepoint.If not supplied, this is constructed from the formula.
The call creates an object defining the system matrices F t , G t , W t , and V t in terms of functions, returning the matrix in question at a given time point.For example, the Wmat function could be defined as Here, Wmat has value phi[3] at all time-points except time-point tt==10, where the value is phi [2].This provides a mechanism of incorporating interventions and change-points at any given time.Note, that the call to ssm creates the functions which can be re-used.In the following example, the Wmat function is first created in the call to ssm, and then manually changed so that the second diagonal entry is larger at timepoint tt==10.Finally, the model is fitted using the function kfs.

Model formulas
A model formula is built up as in a glm call in R. The response appears on the left hand side of a tilde (~) and on the right hand side the explanatory variables, factors and continuous variables, appear.However, to specify time-varying regression coefficients, we have defined a special notation, tvar(), in which these are enclosed.
For example, the formula y ~z + tvar(x) will correspond to covariates, z and x, of which z has a static parameter and x has a dynamic parameter.An implicit intercept is also included in the model, unless the term -1 appears in the formula.When tvar enters a formula and -1 is not included, the intercept will always be time-varying, i.e. a random walk is added to the linear predictor.Thus, this model corresponds to the state space model with the linear predictor specified as λ k = Z k β + X k β k , Z k being the kth row in the n × 1 matrix Z = [z] and X k the kth row of the n × 2 matrix X = [1 x].The R command model.matrixapplied to the formula y ~z + x yields the n × 3 matrix [1 z x], in which the rows are F k .The polynomial time trend, (9), is specified using the function, polytime(time,degree=p) Note that polytime is different than the built-in R-function poly since the latter produces a design matrix with orthonormal columns.
The harmonic seasonal pattern, (11), is specified using the function, polytrig(time,period=m,degree=d) whereas the unstructured seasonal pattern, (12), is specified using the function, sumseason(time,period=m) Regression components are specified using the usual Wilkinson-Rogers formula notation in R.
The model matrix does not contain information about which variables are time-varying.This distinction is implemented by specifying the variance matrix, VAR(ω k ), with zeros in entries corresponding to static parameters and non-zero entries otherwise.

Inference
When a model has been defined using ssm and the option fit has been set to TRUE, the iterated extended Kalman smoother has been applied.The output is stored with the ssm object and can be extracted by the function getFit.This is a list that contains the estimated mean (as the component m), and variance matrices (as the component C) of the state vector, θ k , as well as the approximate log-likelihood (as the component loglik) based on the Gaussian approximation to the state space model.
If ssm has been called with the option fit set to FALSE, the function kfs returns the output object.
The following example defines a state space model, runs the iterated extended Kalman smoother, and finally extracts the fitted information into the variable vd.fit.

Examples
In this section, three examples of specification and application of state space models will be presented.The examples include Gaussian and Poisson observation densities.The time series are decomposed into components of trend and seasonality and also inclusion of external covariates is illustrated.The main focus will be on formulation of the state space object, how a relevant data analysis can be performed, and how to present the output from the analysis, based on this object.

Example 4.1 (Gas consumption)
A dataset provided with R is the quarterly UK gas consumption from 1960 to 1986, in millions of therms (Durbin and Koopman 2001, p. 233).As response, we use the (base 10) logarithm of the UK gas consumption (displayed in Figure 1), which we assume is normal distributed.We fit a model with a first order polynomial trend with time-varying coefficients and an unstructured seasonal component, also varying over time.
with the linear trend component being and the seasonal component being where ω (1) k and ω (1) k are independent noise components.The corresponding vector notation is by blocking the evolution matrices of ( 9) and ( 12) we get The model is specified, fitted, and plotted in sspir by phistart <-c(3.7e-4,0,1.7e-5,7.1e-4)gasmodel <-ssm(log10(UKgas) ~-1 + tvar(polytime(time,1)) + tvar(sumseason(time,4)), phi=phistart) fit <-getFit(gasmodel) plot(fit$m[,1:3]) Here, the estimated variances are taken from an external maximum likelihood algorithm provided by the function StructTS, Ripley (2002), which is standard in R. The decomposition in trend, slope and season components is displayed in Figure 2. In 1971, the slope increases from approximately 0.005 to approximately 0.014 and returns to this level in 1979.At the end of the observation period, the slope increases again.Similarly, it is seen, that the amplitude of the seasonal component is fairly constant from 1960-1971, after which it increases in the period 1971-1979 and then it stabilizes.The analysis can be reproduced in sspir by demo("gas").
Example 4.2 (Vandrivers) Let y t be the monthly numbers of light goods van drivers killed in road accidents, from January 1969 to December 1984 (192 observations).On January 31st, 1983, a seat belt law was introduced.The interest is to quantify the effect of the seat belt legislation law.For further information about the data set consult Harvey and Durbin (1986).
Here we use a state space model for Poisson data with a 13-dimensional latent process, consisting of an intervention parameter, seatbelt, changing value from zero to one in February 1983, a constant monthly seasonal, and a trend modelled as a random walk.
This corresponds to a model for Poisson counts with the linear predictor given by a trend, an intervension and a seasonal component and Hence the linear predictor is with evolution matrix and noise term given by .
The model can be defined and plotted in sspir by The plot displayed in Figure 3 shows the observations together with the estimated trend and intervention.The superimposed confidence limits are ± 2 standard deviations, based on the conditional variance of the latent process, C t , see Section 2.2.The trend over time is generally decreasing and the intervention effect corresponds approximately to a 25% reduction of casualties.The analysis can be reproduced in sspir by demo("vandrivers"), which show how to add confidence limits to the plot.

Example 4.3 (Mumps)
Monthly registered cases of mumps in New York City, January 1928 through June 1972 has been studied by Hipel and McLeod (1994).The incidence of mumps are known to show seasonal behavior.In the study period the incidence also show variation in trend.The monthly sample variance grow with the monthly average, although substantial overdispersion is clearly present.q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q 1970 1975 1980  , 1927 -1972.The upper frame shows the observed number of cases with the de-seasonalized trend superimposed.The middle frame shows the location of the peak of the seasonal pattern.The lower frame depicts the variation in the peak-to-trough ratio over the period.
Fitting a Poisson generalized linear model with a quadratic trend and an monthly seasonal pattern, yields an overdispersion of 89.7, a significant trend and a significant seasonal variation.Changing the seasonal pattern to a harmonic pattern is in accordance with the data but does not substantially change the overdispersion.
We model the mumps incidence with a first order polynomial trend with time-varying coefficients and a time-varying harmonic seasonal component.
with a linear trend and a harmonic seasonal, yielding the linear predictor where the linear trend component is defined as and harmonic seasonal pattern as k , ω k , ω k and ω (s) k ] we get the following matrix notation . This is formulated in sspir by the call data("mumps") index <-1:length(mumps) mumps.m<-ssm( mumps ~-1 + tvar(polytime(index,1)) + tvar(polytrig(index,12,1)), family=poisson(link=log), phi=c(0,0,0.0005,0.0001),C0 = diag(4)) mumps.fit<-getFit(mumps.m)plot(mumps) lines( exp(mumps.fit$m[,1]),lwd=2) The choice of a first order sinusoid gives the possibility to express the seasonal variation via the peak-to-trough ratio (yearly max/min) and the location of the peak (code not shown, but available in demo("mumps")).The output in Figure 4 shows a graduately changing seasonal pattern with a decreasing peak-to-trough ratio and a peak location slowly changing.The location of the peak is changing from late April in 1928April in -1936, where after the location of the peak stabilizes around May 1st until 1964, when the peak wanders off to late May, see Figure 4.It is also seen that the peak-to-trough ratio is varying between 6 to 9 until around 1948, when the ratio graduately decreases to about 4 in 1971.Epidemic episodes are seen irregularly each 3 to 5 years.The analysis can be reproduced in sspir by demo("mumps").

Discussion
The main contribution of the sspir package is to give a formula language for specifying dynamic generalized linear models.That is, an extension of glm formulae by marking terms with tvar to specify that the corresponding coefficients are time-varying.The package also provides (extended) Kalman filter and Kalman smoother for models within the Gaussian, Poisson and binomial families.The output from the Kalman smoother leaves many possibilities for designing a suitable presentation of features of the latent process.
The Kalman filter is initialized by the values of m 0 and C 0 , see (3).The modeller can set entries in C 0 to accommodate prior knowledge.In cases where the prior information about θ 0 is sparse, a diffuse initialization may be adequate, see Durbin and Koopman (2001).This feature has not yet been implemented.The present framework does not allow the modeller to estimate the unknown variance parameters automatically.The modeller can, though, combine numerical maximization algorithms with the output of the iterated extended Kalman smoother.Hence, the formulation in sspir does not rely on any specific implementation of an inferential procedure.

Figure 2 :
Figure 2: Time-varying trend, slope, and seasonal components in UK gas consumption.

Figure 3 :Figure 4 :
Figure 3: Estimated trend and intervention (solid line) for the vandrivers data.The dashed lines are ± 2 standard deviations.
where D k is all information available at time t k .The Kalman filter recursively yields m k and C k with the recursion starting in θ 0 ∼ N p (m 0 , C o ).Assessment of the state vector, θ k , using all available information, D n , is called Kalman smoothing and we write θ n |D n ∼ N p ( m k , C k ).Starting with m n = m n and C n = C n , the Kalman smoother is a backwards recursion in time, k = n − 1, . . ., 1.
This variable is a list where, vd.fit$m[t,] contains the conditional mean E[θ t |D n ] and vd.fit$C[[t]] contains the corresponding variance matrix, VAR[θ t |D n ].