Hidden Semi Markov Models for Multiple Observation Sequences: The mhsmm Package for R

This paper describes the R package mhsmm which implements estimation and prediction methods for hidden Markov and semi-Markov models for multiple observation sequences. Such techniques are of interest when observed data is thought to be dependent on some unobserved (or hidden) state. Hidden Markov models only allow a geometrically distributed sojourn time in a given state, while hidden semi-Markov models extend this by allowing an arbitrary sojourn distribution. We demonstrate the software with simulation examples and an application involving the modelling of the ovarian cycle of dairy cows.


Introduction
The package mhsmm for the R system for statistical computing (R Development Core Team 2010) performs inference for hidden Markov models and hidden semi-Markov models fitted to multiple observation sequences. Such techniques are of interest when observed data is thought to depend on an unobserved (or hidden) state. A good overview of these models is given by Rabiner (1989). Efficient algorithms for parameter estimation are described by Guédon (2003). The models (and the mhsmm package) have been applied to oestrus detection in dairy cows (O'Connell, Tøgersen, Friggens, Løvendahl, and Højsgaard 2011).
The main features of the mhsmm package are as follows: Observations are allowed to be multivariate. Missing values are allowed. Observations must be recorded at equidistant time points. The package is designed to allow the specification of custom emission distributions. It is possible to have multiple sequences of data. Parameter estimation is performed using EM algorithms. Crucial parts of the code are written in C, which makes estimation fast. The package is available from the Comprehensive R Archive Network at http://CRAN.R-project.org/package=mhsmm.
To our knowledge, two other software packages are available for hidden semi-Markov models. The first is the AMAPmod software (Godin and Guédon 2007), which is aimed specifically at the exploration of plant architecture. The second is the hsmm package for R (Bulla, Bulla, and Nenadic 2010). The mhsmm package is distinguished from hsmm in two main respects: (1) mhsmm can estimate parameters from multiple observation sequences; (2) mhsmm is extensible, because the user can create custom emission distributions.
The paper is organized as follows: Section 2 presents an example of a hidden Markov model based on simulated data. Section 3 describes the theory of the models in more detail, and Section 4 contains various simulation examples, each illustrating a different aspect of the package. In Section 5 a real application on modelling the reproductive status of dairy cows is presented. Section 6 illustrates how to make user-defined extensions. Finally, Section 7 contains a discussion.

An introductory example
This example is based on simulation and illustrates hidden Markov models. A hidden Markov model can be described as follows (see Section 3 for more details). We consider a process evolving over discrete time points. Let S = (S_t, t = 0, ..., T) denote a sequence of unobserved random variables, each with finite state space {1, ..., J}, and let X = (X_t, t = 1, ..., T) denote a corresponding set of observed random vectors. A hidden Markov model has the functional form

P(S, X) = P(S_0) ∏_{t=1}^{T} P(S_t | S_{t-1}) P(X_t | S_t).    (1)

From (1) it follows that (1) the observables X are all conditionally independent given the latent variables S, and (2) X_t depends on the latent variables S only through S_t. The term P(S_0) is called the initial distribution, P(S_t | S_{t-1}) the transition distribution and P(X_t | S_t) the emission distribution. In practice P(S_0) is given as a vector π and P(S_t | S_{t-1}) as a transition matrix P (the model is homogeneous because the same transition matrix is used for all t), while the emission distribution P(X_t | S_t) (generically denoted b) can be given in various forms; see the examples below. Hence, a triple θ = (π, P, b) specifies a hidden Markov model.

The function dnorm.hsmm provides the density function for the emission distribution. The function rnorm.hsmm (supplied via the rand.emis argument) is essentially a wrapper for the rnorm() function and takes the necessary specifications from the model object. The specification of the emission distribution states that X_t | S_t = s ∼ N(μ_s, σ²_s). Notice that the elements of sigma are variances, not standard deviations. Section 6.2 shows an example of specifying a multivariate normal emission distribution.
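The generative process behind the triple θ = (π, P, b) can be sketched outside R as well. The following Python sketch (a hypothetical illustration, not the package's implementation) draws a state sequence from (π, P) and then emits normally distributed observations; as in mhsmm, per-state variances (not standard deviations) are supplied.

```python
import math
import random

def simulate_hmm(pi, P, mu, sigma2, T, rng=random.Random(1)):
    """Draw (states, observations) from an HMM with normal emissions.

    pi: initial distribution, P: transition matrix (rows sum to 1),
    mu/sigma2: per-state emission mean and *variance*.
    """
    def draw(probs):
        # sample an index from a discrete distribution
        u, acc = rng.random(), 0.0
        for j, p in enumerate(probs):
            acc += p
            if u < acc:
                return j
        return len(probs) - 1

    s = [draw(pi)]                       # S_0 ~ pi
    for _ in range(1, T):
        s.append(draw(P[s[-1]]))         # S_t | S_{t-1} ~ P row
    x = [rng.gauss(mu[j], math.sqrt(sigma2[j])) for j in s]
    return s, x

states, obs = simulate_hmm([0.5, 0.5],
                           [[0.9, 0.1], [0.1, 0.9]],
                           mu=[0.0, 5.0], sigma2=[1.0, 1.0], T=100)
```

This mirrors what simulate() does for an hmmspec object with rand.emis = rnorm.hsmm, at the level of the model equations only.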
The predict() method returns a list in which the component named s contains the jointly most likely configuration of the states, found using the Viterbi algorithm (Forney Jr 1973).
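The Viterbi recursion used for this prediction can be sketched as follows (a generic Python illustration in log space, not the package's C implementation; it assumes strictly positive π and P).

```python
import math

def viterbi(pi, P, logb, T):
    """Most likely state path for an HMM; logb[t][j] = log b_j(x_t)."""
    J = len(pi)
    delta = [[0.0] * J for _ in range(T)]   # best log-score ending in j at t
    psi = [[0] * J for _ in range(T)]       # backpointers
    for j in range(J):
        delta[0][j] = math.log(pi[j]) + logb[0][j]
    for t in range(1, T):
        for j in range(J):
            best = max(range(J), key=lambda i: delta[t-1][i] + math.log(P[i][j]))
            psi[t][j] = best
            delta[t][j] = delta[t-1][best] + math.log(P[best][j]) + logb[t][j]
    path = [max(range(J), key=lambda j: delta[T-1][j])]
    for t in range(T - 1, 0, -1):           # trace back the best path
        path.append(psi[t][path[-1]])
    return path[::-1]

path = viterbi([0.5, 0.5], [[0.9, 0.1], [0.1, 0.9]],
               [[0.0, -10.0], [0.0, -10.0], [-10.0, 0.0], [-10.0, 0.0]], T=4)
# jointly most likely configuration: [0, 0, 1, 1]
```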
In some practical applications data consist of multiple sequences of observations. For example, in Section 5 we have multivariate data measured over time on several individual cows. The mhsmm package provides estimation and simulation routines for such data. For illustration, we generate three sequences of data and fit the model with:

R> train = simulate(model, c(100, 20, 30), rand.emis = rnorm.hsmm)
R> h2 = hmmfit(train, startval, mstep = mstep.norm)

Theory of hidden Markov and semi-Markov models
This section contains a brief summary of Markov chains, hidden Markov and hidden semi-Markov models, or HMMs and HSMMs respectively. For a comprehensive introduction we refer to Rabiner (1989).

Discrete Markov chains
A discrete Markov chain is a random process (in discrete time) taking discrete values (states) from the state space S, that is, S_t ∈ S = {1, ..., J} for t = 1, 2, ..., T. The process S_t is a Markov chain if it has the Markov property

P(S_{t+1} = s_{t+1} | S_t = s_t, ..., S_0 = s_0) = P(S_{t+1} = s_{t+1} | S_t = s_t)

for any s_0, s_1, ..., s_{t+1} ∈ {1, ..., J}. Hence the state at any given time t + 1 depends on the previous states only through the state at time t.
Let p_ij = P(S_{t+1} = j | S_t = i), with p_ij ≥ 0 and ∑_{j=1}^{J} p_ij = 1, denote the probability of jumping from state i at time t to state j at time t + 1. The matrix P = (p_ij) is then the transition matrix of the Markov chain. To fully specify the model we also require the distribution of the initial state, π_i = P(S_0 = i).
The number of time steps spent in a given state is called the sojourn time. The probability of spending u consecutive time steps in state i under this model is

d_i(u) = p_ii^{u-1} (1 - p_ii),  u = 1, 2, ...    (2)

We call d_i(u) the sojourn density. Hence the sojourn time is geometrically distributed for any Markov chain.
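The geometric form of (2) is easy to confirm by simulation. The following Python sketch (illustrative only) evaluates d_i(u) for a self-transition probability p_ii and checks it against the empirical distribution of simulated sojourns.

```python
import random

def sojourn_pmf_geometric(p_ii, u):
    """d_i(u) = p_ii^(u-1) * (1 - p_ii): stay u-1 more steps, then leave."""
    return p_ii ** (u - 1) * (1.0 - p_ii)

# Empirical check: count how long a chain that remains in the state
# with probability p_ii stays before leaving.
rng = random.Random(0)
p_ii = 0.8
n = 20000
counts = {}
for _ in range(n):
    u = 1
    while rng.random() < p_ii:
        u += 1
    counts[u] = counts.get(u, 0) + 1

# e.g. the empirical frequency of a one-step sojourn should be close
# to d_i(1) = 1 - p_ii = 0.2
```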

Hidden Markov models
Suppose we can only observe a variable X_t which is related to the state S_t but not the state itself. This situation is visualized in Figure 3. The conditional distribution of the observed variable X_t given the unobserved (or hidden) state S_t is referred to as the emission distribution. We refer to the parameters defining such a process as a hidden Markov model, henceforth referred to as an HMM. These models have been used for a variety of applications, such as speech recognition (Rabiner 1989), weather modeling (Hughes, Guttorp, and Charles 1999) and DNA sequence analysis (Krogh, Mian, and Haussler 1994).
In addition to the parameters π and P which define a Markov chain, a HMM also requires an emission distribution to be defined, that is,

b_i(x) = P(X_t = x | S_t = i).

Figure 3: Visual representation of a hidden Markov process. X_t are observed variables and S_t is the unobserved, hidden state.
For example, b i (x) may be a multivariate Gaussian distribution. As stated in Section 2, a HMM is hence specified by a triple θ = (π, P, b).
The Baum-Welch algorithm is the original procedure for estimating the parameters of a HMM (Baum, Petrie, Soules, and Weiss 1970). This technique was later recognized as a member of a more general class of algorithms for incomplete data problems, known as the expectation-maximization (EM) algorithm (Dempster, Laird, and Rubin 1977). We again point to Rabiner (1989) for a very clear overview.

Hidden semi-Markov models
In standard HMMs, the sojourn time is geometrically distributed (as shown by Equation (2)). In some real-world problems (see for example Section 5) this is an unrealistic and severe limitation, because in reality the probability of a state change may depend on the time already spent in the current state.
A possible solution to this issue is to explicitly estimate the sojourn density d(u), producing what is referred to as a hidden semi-Markov model, henceforth called an HSMM. Thus rather than having d(u) determined by P as in (2), we model d(u) explicitly. A HSMM is therefore specified by a quadruple θ = (π, P, b, d). Ferguson (1980) was the first to propose such models, along with an algorithm to fit them, as Rabiner (1989) summarizes. Guédon (2003) developed a more efficient algorithm and a method for dealing with right censoring, which we have implemented.
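The quadruple θ = (π, P, b, d) generates data differently from an HMM: the process enters a state, holds it for a sojourn drawn from d, and then jumps according to P (with no self-transitions). A hypothetical Python sketch of this generative scheme, with normal emissions and an arbitrary sojourn sampler:

```python
import math
import random

def simulate_hsmm(pi, P, mu, sigma2, sojourn, T, rng=random.Random(2)):
    """Draw T observations from an HSMM: enter a state, stay for a
    sojourn drawn from d, then jump (P has zero diagonal)."""
    def draw(probs):
        u, acc = rng.random(), 0.0
        for j, p in enumerate(probs):
            acc += p
            if u < acc:
                return j
        return len(probs) - 1

    s = []
    j = draw(pi)
    while len(s) < T:
        u = sojourn(j, rng)          # sojourn time in state j
        s.extend([j] * u)
        j = draw(P[j])
    s = s[:T]                        # the last sojourn is right censored
    x = [rng.gauss(mu[k], math.sqrt(sigma2[k])) for k in s]
    return s, x

# two alternating states; uniform sojourns on 5..15 chosen for brevity
states, obs = simulate_hsmm([1.0, 0.0], [[0.0, 1.0], [1.0, 0.0]],
                            mu=[0.0, 3.0], sigma2=[1.0, 1.0],
                            sojourn=lambda j, r: r.randint(5, 15), T=60)
```

Note that truncating the final state run is exactly the right-censoring issue that Guédon's survivor-function treatment, discussed below, addresses.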
The complete data likelihood of a HSMM is

L(θ) = π_{s*_1} d_{s*_1}(u_1) ∏_{r=2}^{R} p_{s*_{r-1} s*_r} d_{s*_r}(u_r) ∏_{t=1}^{T} b_{s_t}(x_t),    (3)

where s*_r is the r-th visited state and u_r is the time spent in that state. Guédon proposed replacing the sojourn density for the last visited state by the corresponding survivor function, so that we need not assume the process leaves its state immediately after time T. Using this survivor function has two advantages: it improves parameter estimation and, perhaps more importantly, it provides a more accurate prediction of the last state visited. The latter is important for online applications where we wish to estimate the most recent state when monitoring a process.
As we have not observed the state sequence, maximizing this likelihood constitutes an incomplete data problem. A local maximum can be found using the EM algorithm, which we briefly outline below. The EM algorithm iterates over two steps until convergence. In the E-step, we calculate the expected complete data log-likelihood given the observed data and the parameter value θ^(k) at iteration k,

Q(θ | θ^(k)) = E[log L(θ) | X, θ^(k)].

This term is typically broken down into a sum of terms each involving a subset of the parameters. The M-step then chooses θ^(k+1) as the value that maximizes Q(θ | θ^(k)). These steps are repeated until convergence.

The EM algorithm for hidden Markov models
A local maximum of the HMM likelihood (1) can be found via the EM algorithm through the following steps.

E-step: The E-step involves estimating two quantities: the probability of being in state i at time t given the observed sequence,

γ_t(i) = P(S_t = i | X),    (4)

and the probability that the process left state i at time t and entered state j at time t + 1 given the observed sequence,

ξ_t(i, j) = P(S_t = i, S_{t+1} = j | X).    (5)

These values can be calculated via a dynamic programming method known as the forward-backward algorithm, which has complexity O(J²T), as Rabiner (1989) discusses.
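The forward-backward pass can be sketched as follows. This is a generic Python illustration of the standard scaled recursions (Rabiner 1989), not the package's C code; b[t][j] denotes the emission density b_j(x_t) evaluated at the observation.

```python
def forward_backward(pi, P, b):
    """Scaled forward-backward pass for an HMM.

    Returns gamma[t][j] = P(S_t = j | X); overall cost is O(J^2 T).
    """
    T, J = len(b), len(pi)
    alpha = [[0.0] * J for _ in range(T)]
    beta = [[0.0] * J for _ in range(T)]
    c = [0.0] * T                      # scaling factors against underflow
    for j in range(J):
        alpha[0][j] = pi[j] * b[0][j]
    c[0] = sum(alpha[0])
    alpha[0] = [a / c[0] for a in alpha[0]]
    for t in range(1, T):              # forward recursion
        for j in range(J):
            alpha[t][j] = b[t][j] * sum(alpha[t-1][i] * P[i][j] for i in range(J))
        c[t] = sum(alpha[t])
        alpha[t] = [a / c[t] for a in alpha[t]]
    beta[T-1] = [1.0] * J
    for t in range(T - 2, -1, -1):     # backward recursion, same scaling
        for i in range(J):
            beta[t][i] = sum(P[i][j] * b[t+1][j] * beta[t+1][j]
                             for j in range(J)) / c[t+1]
    return [[alpha[t][j] * beta[t][j] for j in range(J)] for t in range(T)]

gamma = forward_backward([0.5, 0.5], [[0.9, 0.1], [0.1, 0.9]],
                         [[0.9, 0.1], [0.9, 0.1], [0.1, 0.9]])
```

Each row of gamma sums to one, and the smoothed probabilities track the emissions: the first two observations favour state 0, the last favours state 1.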
M-step: Based on (4) and (5), the initial and transition probabilities are estimated as

π̂_i = γ_0(i),    p̂_ij = ∑_t ξ_t(i, j) / ∑_t γ_t(i).

Estimates for the parameters of the emission distribution depend, of course, on the choice of distribution. If we assume the X_t are normally distributed given S_t = i, that is X_t | S_t = i ∼ N(μ_i, σ²_i), then the parameters μ_i and σ²_i can be estimated as

μ̂_i = ∑_t γ_t(i) x_t / ∑_t γ_t(i),    (6)

σ̂²_i = ∑_t γ_t(i) (x_t − μ̂_i)² / ∑_t γ_t(i).    (7)

Equations (6) and (7) are implemented in the mstep.norm() function in the mhsmm package.
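The weighted-mean form of these emission updates can be sketched directly (a Python analogue of the idea behind mstep.norm(), not its R source):

```python
def mstep_normal(x, gamma):
    """Weighted mean/variance updates for normal emissions:
    mu_i = sum_t gamma_t(i) x_t / sum_t gamma_t(i), and similarly
    sigma2_i with squared deviations from mu_i."""
    T, J = len(x), len(gamma[0])
    mu, sigma2 = [0.0] * J, [0.0] * J
    for i in range(J):
        w = sum(gamma[t][i] for t in range(T))
        mu[i] = sum(gamma[t][i] * x[t] for t in range(T)) / w
        sigma2[i] = sum(gamma[t][i] * (x[t] - mu[i]) ** 2 for t in range(T)) / w
    return mu, sigma2

# with hard 0/1 weights this reduces to ordinary per-state means/variances
mu, sigma2 = mstep_normal([1.0, 2.0, 10.0, 12.0],
                          [[1, 0], [1, 0], [0, 1], [0, 1]])
# → mu = [1.5, 11.0], sigma2 = [0.25, 1.0]
```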
The mhsmm package is extensible in that users can specify custom distributions. See Section 6 for examples.

The EM algorithm for hidden semi-Markov models
Parameter estimation for HSMMs is more complicated than for HMMs, both in terms of the mathematical description and in terms of the computational effort required. The EM algorithm for HSMMs is as follows.

E-step: Calculate the E-step quantities for HMMs as given in (4) and (5). In addition, we need the expected number of times the process spends u time steps in state j,

η_ju = E[number of sojourns of length u in state j | X].    (8)

Guédon (2003) provides a version of the forward-backward algorithm for estimating (8), which is implemented in the mhsmm package. The algorithm has worst-case complexity O(JT(J + T)). However, if we restrict the maximum possible sojourn time to a moderate value M, this is reduced to O(JT(J + M)). For example, in one of the simulation examples of Section 4 we know that sojourns of length greater than 500 are impossible for all practical purposes, so we set M = 500.
M-step: Calculate the M-step estimates for HMMs as given in (6) and (7). In addition, we need to estimate the sojourn density. Guédon provides a derivation of d_i(u) as a non-parametric probability mass function using (8),

d̂_i(u) = η_iu / ∑_v η_iv,

and then proposes an ad-hoc procedure for fitting parametric distributions to the η_iu, which we have followed in mhsmm. One possibility is to use common discrete distributions with an additional shift parameter d that sets the minimum sojourn time (d ≥ 1). For example, we may use the shifted Poisson distribution with density

d_i(u) = e^{−λ_i} λ_i^{u−d} / (u − d)!,  u = d, d + 1, ...

We estimate λ̂_i = ∑_v (v − d) η_iv / ∑_v η_iv for each possible shift parameter d = 1, ..., min{u : η_iu > 0}, choosing the d which gives the maximum likelihood. Guédon states that this ad-hoc procedure works well in practice and we have found this to be the case in simulations. The same approach is possible for other common discrete distributions.
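The search over shift parameters can be sketched as follows (a hypothetical Python illustration of the ad-hoc procedure, not the package's code; eta plays the role of the weights η_iu for one state).

```python
import math

def fit_shifted_poisson(eta):
    """Fit a shifted Poisson sojourn density
        d(u) = exp(-lam) * lam**(u - d) / (u - d)!   for u >= d,
    trying every admissible shift d and keeping the one with the
    largest weighted log-likelihood.

    eta[k] plays the role of eta_u for u = k + 1 (expected number of
    sojourns of length u in the state).
    """
    total = sum(eta)
    # the shift cannot exceed the smallest sojourn with positive weight
    max_d = next(k + 1 for k, e in enumerate(eta) if e > 0)
    best = None
    for d in range(1, max_d + 1):
        lam = sum(e * (k + 1 - d) for k, e in enumerate(eta)) / total
        if lam <= 0:
            continue
        ll = sum(e * (-lam + (k + 1 - d) * math.log(lam) - math.lgamma(k + 2 - d))
                 for k, e in enumerate(eta) if e > 0)
        if best is None or ll > best[0]:
            best = (ll, d, lam)
    return best[1], best[2]

# sojourns of length 4..8 with weights 10, 20, 30, 20, 10 (mean 6)
d_hat, lam_hat = fit_shifted_poisson([0, 0, 0, 10, 20, 30, 20, 10])
```

For this weight profile the search selects the largest admissible shift, d = 4, with λ̂ equal to the weighted mean sojourn minus the shift.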
Another possibility is to assume that the sojourn times are Gamma distributed, that is, U_r | S_r = i ∼ Γ(a_i, b_i). In this case we estimate the parameters as follows. The likelihood for the Gamma distribution is maximized with respect to the shape parameter by solving

log(â_i) − ψ(â_i) = log(ū_i) − L̄_i,

where ψ() is the digamma function. We use the weighted mean sojourn and the weighted mean log-sojourn,

ū_i = ∑_u u η_iu / ∑_u η_iu  and  L̄_i = ∑_u η_iu log(u) / ∑_u η_iu,

and solve the equation using Newton's method (Choi and Wette 1969). This methodology is implemented in the gammafit() function. The scale parameter is then estimated as b̂_i = ū_i / â_i.
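The shape equation is one-dimensional and monotone, so any root finder works. The following Python sketch (illustrative only; it uses bisection for robustness rather than the Newton step used by gammafit(), and ships its own digamma helper) solves it for weighted sojourn data.

```python
import math

def digamma(x):
    """psi(x) via the recurrence psi(x) = psi(x+1) - 1/x and an
    asymptotic series for large x (self-contained helper)."""
    r = 0.0
    while x < 6.0:
        r -= 1.0 / x
        x += 1.0
    f = 1.0 / (x * x)
    return r + math.log(x) - 0.5 / x - f * (1.0/12 - f * (1.0/120 - f/252))

def gamma_mle(u, w):
    """Weighted Gamma fit: solve log(a) - psi(a) = log(ubar) - logbar
    for the shape a, then scale = ubar / a."""
    W = sum(w)
    ubar = sum(wi * ui for ui, wi in zip(u, w)) / W
    logbar = sum(wi * math.log(ui) for ui, wi in zip(u, w)) / W
    c = math.log(ubar) - logbar          # >= 0, zero only if all u equal
    lo, hi = 1e-8, 1e8                   # f(a) = log(a) - psi(a) is decreasing
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if math.log(mid) - digamma(mid) > c:
            lo = mid
        else:
            hi = mid
    a = 0.5 * (lo + hi)
    return a, ubar / a                   # shape, scale

# sojourn lengths 1..5 with equal weight
shape, scale = gamma_mle([1, 2, 3, 4, 5], [1, 1, 1, 1, 1])
```

By construction shape × scale equals the weighted mean sojourn ū.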

Further simulation examples
This section contains several simulation examples, each illustrating features of the package.

Nonparametric sojourn distribution
Cases may arise where we do not know the sojourn distribution. We can then estimate a non-parametric sojourn distribution, perhaps as an initial step before deciding on a parametric distribution, as in (3.5). The parameters for such a distribution are given as an M × J matrix, with entry (u, j) corresponding to d_j(u). A good starting value is a uniform distribution covering the range of reasonable values for the sojourns. The estimated non-parametric sojourn densities are shown in Figure 5, right.

Modelling the reproductive status of dairy cows

The ovarian cycle in cattle takes approximately 21 days (but with large variation) and can be divided into four stages, which can be further grouped into two longer stages (see Table 1). Days are an insufficiently granular timescale for the oestrus stage, as oestrus lasts between 6 and 30 hours (Ball and Peters 2004). Ovulation occurs on the day following oestrus, so identifying oestrus lets a farm manager know when to artificially inseminate a cow.
Following ovulation a structure called the corpus luteum forms in the ovary. The corpus luteum produces the hormone progesterone and remains in the ovary until a few days prior to the next ovulation (the luteal phase), at which time it degenerates rapidly. That is, progesterone is high during the luteal phase and low during the follicular phase. Progesterone directly reflects the biological processes occurring in the ovary and is a very useful indicator of reproductive status. Progesterone can be measured automatically from milk samples, with milking occurring every 6-20 hours (in robotic milking systems). A typical progesterone profile after calving can be viewed in Figure 6 (bottom).
In the period leading up to ovulation, the cow will try to attract the attention of a bull by standing to be mounted, mounting other cows and being mounted by other cows. This is sometimes referred to as standing heat and is traditionally how a stockman would identify a cow that is about to ovulate, allowing them to proceed with artificial insemination or bring the cow to a bull. This behaviour leads to an increase in the number of counts on a pedometer the cow is wearing, and so can be exploited for automated detection of oestrus.

Analysis
We provide a simplified version of an analysis performed in O'Connell et al. (2011) using HSMMs to model reproductive data from dairy cows. Since the period of the ovarian cycle is irregular, a more conventional time series approach such as ARIMA is not suitable. We may hypothesize that the stages given in Table 1 are suitable states for a hidden Markov model, but we know the states will not have geometrically distributed sojourn times: the reproductive states of cows are not a memoryless process, since, for example, a transition to the follicular stage becomes likely after about 18 days in the luteal stage. Hence, a HSMM may be a suitable model for these data.
The dataset reprocows contains time-series data from seven cows with two measured variables, progesterone and the activity index derived in O'Connell et al. (2011). In addition, the dataset reproai contains the days on which artificial insemination occurred for each cow, and the dataset reproppa contains post-partum anoestrus lengths (in days) for 73 cows. We can use these auxiliary data sets for model validation and for calculating start values, respectively.
R> data("reproai")
R> data("reprocows")
R> data("reproppa")

We fit a HSMM to the activity data, using the states S_Activity = {post-partum anoestrus, standing heat, not standing heat} (Figure 7). Validation for such a model is difficult, but we have two indicators: (1) progesterone must be low for oestrus (and hence standing heat) to have occurred; (2) the artificial insemination at the end of each series was known to result in pregnancy. We begin by defining the model in Figure 7 and setting reasonable starting values for the emission distribution.
R> N <- as.numeric(table(reprocows$id))
R> train <- list(x = reprocows$activity, N = N)
R> class(train) <- "hsmm.data"

We need starting values for the sojourn distribution. We have found that the Gamma distribution works well for these data. We can use the reproppa data set to estimate parameters for the initial states.
R> tmp <- gammafit(reproppa * 24)

As we are unsure of the parameters for the other two states, we simply use a uniform distribution with a reasonable range. The software will use this d_j(u) for the initial E-step and then calculate Gamma parameters in the M-step. This ad-hoc procedure has been found to work well both in simulation and in this practical application. The predicted states for the first cow are shown in Figure 8; they are consistent with the biological processes of the cow. As more formal validation, we compare the artificial insemination times with the timings of predicted standing heat (and therefore oestrus). All artificial inseminations except one occurred within a day after the last predicted standing heat period (Figure 10, left), which is consistent with the information in cattle reproduction texts, e.g., Ball and Peters (2004). If we view the "missed" insemination (Figure 9), we can see that two inseminations occurred around this oestrus, and that the first one was likely premature. The estimated sojourn distributions are shown in Figure 10, right.
A more rigorous analysis and validation of a larger version of this data set is presented in O'Connell et al. (2011).

User-defined extensions
Cases may arise where we wish to simulate or estimate a model whose emission distribution is not provided in the mhsmm package. The package is designed so that users can supply such distributions themselves, by providing an M-step function for estimation together with functions for evaluating densities and simulating observations.

Poisson emission distribution
We assume the X_t are Poisson distributed given S_t = i, that is, X_t | S_t = i ∼ Pois(λ_i). Then λ_i can be re-estimated via the equation

λ̂_i = ∑_t γ_t(i) x_t / ∑_t γ_t(i).
We implement this in the function mstep.pois. Functions for the emission M-step take two arguments: x, the vector or data frame of observed data, and wt, a T × J matrix containing the values γ_t(j) from (4). The return value is a list corresponding to the $emission slot of a hsmmspec or hmmspec object.
R> mstep.pois <- function(x, wt)
+    list(lambda = apply(wt, 2, function(w) weighted.mean(x, w)))

Functions to simulate from the distribution and to calculate densities can be defined analogously. In this example, we will simulate from a two-state HSMM with Gamma distributed sojourn times and the multivariate Gaussian distribution we have just defined. We will generate multiple observation sequences.
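The extension pattern, then, always consists of the same three ingredients: an M-step, a density, and a sampler. A hypothetical Python analogue of this trio for the Poisson emission (illustrating the pattern only, not mhsmm's R interface):

```python
import math
import random

def mstep_pois(x, wt):
    """Weighted-mean update: lambda_j = sum_t wt[t][j] x_t / sum_t wt[t][j],
    the Python counterpart of the weighted.mean call above."""
    J = len(wt[0])
    return [sum(w[j] * xi for xi, w in zip(x, wt)) / sum(w[j] for w in wt)
            for j in range(J)]

def dpois(x, j, lam):
    """Poisson emission density b_j(x), computed on the log scale."""
    return math.exp(-lam[j] + x * math.log(lam[j]) - math.lgamma(x + 1))

def rpois(j, lam, rng=random.Random(3)):
    """Draw one Poisson(lambda_j) count by sequential inversion."""
    u, k, p = rng.random(), 0, math.exp(-lam[j])
    acc = p
    while u > acc:
        k += 1
        p *= lam[j] / k
        acc += p
    return k
```

With hard 0/1 weights the M-step reduces to per-state sample means, which makes the update easy to sanity-check by hand.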

Summary and perspectives
In this paper we have presented the mhsmm package for R through several examples. We have also outlined the theory behind the hidden Markov and hidden semi-Markov models and we have described the estimation algorithms in some detail. In particular, we have shown that mhsmm is extensible as we have tried to facilitate the design and use of custom emission distributions. These features come perhaps at the cost of some simplicity and ease of use.
The creation of the mhsmm package was motivated by the work on detecting the reproductive status of dairy cows (O'Connell et al. 2011), where two indicators of oestrus were used for estimating the reproductive status of cows (see also Section 5). Problems of this type are common in, for example, modern highly efficient farming, because modern sensor technology allows for frequent online measurement of many indicators on a large group of animals. As an additional example, Højsgaard and Friggens (2010) consider estimating the degree of mastitis of dairy cows from a panel of three indicators (measured with different intensities). Applying hidden semi-Markov models to the monitoring of the mastitis status of a cow is an ongoing activity. The mhsmm package allows for missing values among the observables, and this facility allows different sampling intensities to be handled in a natural way.