Bayesian spline-based hidden Markov models with applications to actimetry data and sleep analysis

B-spline-based hidden Markov models employ B-splines to specify the emission distributions, offering a more flexible approach to modelling data than conventional parametric HMMs. We introduce a Bayesian framework for inference, enabling the simultaneous estimation of all unknown model parameters, including the number of states. A parsimonious knot configuration of the B-splines is identified by means of a trans-dimensional Markov chain sampling algorithm, while model selection with respect to the number of states can be performed based on the marginal likelihood within a parallel sampling framework. Using extensive simulation studies, we demonstrate the superiority of our methodology over alternative approaches as well as its robustness and scalability. We illustrate the explorative use of our methods on animal activity data from an oceanic whitetip shark. The flexibility of our Bayesian approach also facilitates the incorporation of more realistic assumptions, and we demonstrate this by developing a novel hierarchical conditional HMM to analyse human activity for circadian and sleep modelling.


Introduction
The class of hidden Markov models (HMMs) offers a powerful approach for extracting information from sequential data (Rabiner, 1989). A basic N-state HMM consists of a discrete-time stochastic process (x_t, y_t) where x_t is an unobserved N-state time-homogeneous Markov chain, and y_t | x_t ∼ f_{x_t}(y_t) with the emission distributions f_{x_t} belonging to some parametric family such as normal or gamma. However, a parametric HMM is often too restrictive for complex real data (Zucchini et al., 2016). It is recognized that simple parametric choices for the emission distributions are not always justified, and moreover, their misspecification can lead to seriously erroneous inference on the number and classification of hidden states (Yau et al., 2011; Pohle et al., 2017). Semi- and nonparametric modelling of emission distributions offers more flexibility and/or may serve as an exploratory tool to investigate the suitability of a parametric family; see Piccardi and Pérez (2007) for activity recognition in videos, Yau et al. (2011) for the analysis of genomic copy number variation, Langrock et al. (2015, 2018) for modelling animal movement data and Kang et al. (2019) for delineating the pathology of Alzheimer's disease, among many others. Theoretical guarantees for inference in such models have been studied in a number of recent papers. Notably, Alexandrovich et al. (2016) proved that the model parameters and the order of the Markov chain are identifiable (up to permutations of the hidden-state labels) if the transition probability matrix of {x_t} has full rank and is ergodic, and if the emission distributions are distinct. These conditions are fairly generic and in practice will usually be satisfied.
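To fix ideas, the generative process (x_t, y_t) described above can be sketched in a few lines; here Gaussian emissions and all parameter values are purely illustrative, not part of the model class discussed in the text:

```python
import numpy as np

# Minimal sketch of a basic N-state HMM generative process:
# x_t is a time-homogeneous Markov chain, y_t | x_t ~ f_{x_t}.
# Gaussian emissions and all parameter values are illustrative only.
def simulate_hmm(n, delta, Gamma, means, sds, rng):
    """Simulate (x_t, y_t), t = 1, ..., n."""
    N = len(delta)
    x = np.empty(n, dtype=int)
    x[0] = rng.choice(N, p=delta)                  # x_1 ~ delta
    for t in range(1, n):
        x[t] = rng.choice(N, p=Gamma[x[t - 1]])    # x_t | x_{t-1} ~ Gamma row
    y = rng.normal(means[x], sds[x])               # y_t | x_t ~ f_{x_t}
    return x, y

rng = np.random.default_rng(1)
Gamma = np.array([[0.95, 0.05], [0.10, 0.90]])
x, y = simulate_hmm(500, np.array([0.5, 0.5]), Gamma,
                    np.array([0.0, 5.0]), np.array([1.0, 1.0]), rng)
```

The persistence of the chain (diagonal-dominant Γ) produces the runs of similar observations that make HMMs useful for segmentation.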
We also refer to Gassiat et al. (2016a,b) for further identifiability results, and to Vernet et al. and De Castro et al. (2017), and references therein, for further theoretical results on inference under nonparametric settings. The increased flexibility and modelling accuracy obtained with nonparametric emission distributions come at a higher computational cost. For instance, the cost of the standard HMM algorithms (e.g. the forward-backward algorithm of Rabiner (1989)) for kernel-based HMMs (Piccardi and Pérez, 2007) grows quadratically with the data size n and can thus be prohibitive for long time series. Bayesian nonparametric HMMs (Yau et al., 2011) built on Dirichlet process mixture models pose challenges to existing sampling methods due to the increased complexity of the model space (Hastie et al., 2015).
Splines have good approximation properties for a rich class of functions (De Boor et al., 1978;Schumaker, 2007). A spline function of order O is a piecewise polynomial function of degree O − 1 where the polynomial pieces are connected at the so-called knot points.
Provided the knots are distinct, a spline of order O is (O − 2)-times continuously differentiable at the knots. B-splines (short for basis splines) of order O provide basis functions for representing spline functions of the same order defined over the same set of knots (De Boor et al., 1978). The great flexibility and nice computational properties of B-splines make them a popular tool in semi-/nonparametric statistical modelling, especially in nonlinear regression analysis (Denison et al., 2002; Zanini et al., 2020) and density estimation (Koo, 1996; Edwards et al., 2019). Incorporating B-splines into HMMs is attractive for real applications as two powerful aspects can be exploited: the forward-backward algorithm for efficient HMM inference, and the flexibility for estimating the emission densities. A frequentist estimation approach for HMMs based on penalized B-splines (P-splines) was introduced by Langrock et al. (2015, 2018). It requires prespecifying the number and positions of knots where, in practice, a large number of knots are needed to ensure flexibility, leading to computational challenges (e.g. convergence to suboptimal local extrema of the likelihood) and cost. Also, to date, the selection of the state-specific smoothing parameters and the quantification of parameter uncertainty remain challenging inferential tasks in the frequentist framework. Current methods rely on cross-validation and parametric bootstrap techniques (Langrock et al., 2015), which are extremely computationally intensive and can be numerically unstable, especially for increasing cardinality N. Hence their approach is so far only feasible for models with a small N, which may severely limit its applicability.
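A density built from B-splines can be sketched as follows: a convex combination of cubic (order O = 4) basis functions, each normalized to integrate to one. The knot placement and weights below are illustrative, and `scipy.interpolate.BSpline` is used for the basis evaluation:

```python
import numpy as np
from scipy.interpolate import BSpline

# Sketch: evaluate f(y) = sum_k w_k B_k(y), a mixture of cubic B-spline
# basis functions, each normalized to integrate to one on [a, b].
def trapz(f, x):
    """Simple trapezoidal rule (avoids NumPy version differences)."""
    return float(np.sum((f[1:] + f[:-1]) * np.diff(x) / 2.0))

def bspline_density(a, b, interior_knots, weights, grid):
    K = len(interior_knots)
    # Clamped knot vector: boundary knots repeated O = 4 times.
    t = np.r_[[a] * 4, np.sort(interior_knots), [b] * 4]
    dens = np.zeros_like(grid)
    for k in range(K + 4):
        c = np.zeros(K + 4)
        c[k] = 1.0
        Bk = BSpline(t, c, 3)(grid)
        dens += weights[k] * Bk / trapz(Bk, grid)   # normalize each basis
    return dens

grid = np.linspace(0.0, 1.0, 2001)
dens = bspline_density(0.0, 1.0, [0.3, 0.6], np.full(6, 1.0 / 6.0), grid)
```

Because the weights are non-negative and sum to one, the result is itself a density, which is the construction exploited for the emission distributions below.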
As far as we are aware, spline-based methods have not yet been considered for emission density estimation in a Bayesian formulation of HMMs. The aim of this article is to propose and develop a methodology that achieves exactly this by means of an almost "tuning-free" reversible jump Markov chain Monte Carlo (RJMCMC) algorithm (Green, 1995), which exploits (i) the forward filtering backward sampling (FFBS) procedure for efficient simulation of the hidden state process, (ii) a stochastic approximation based adaptive MCMC scheme for automatic tuning, (iii) a reparameterization scheme for enhancing the sampling efficiency and (iv) an adaptive knot selection scheme that modifies and extends ideas considered in other scenarios, such as DiMatteo et al. (2001) and Sharef et al. (2010), for flexible emission modelling. We report results demonstrating significant advantages of our proposed adaptive spline based algorithm. Compared to current alternative spline-based approaches, namely the frequentist P-spline approach of Langrock et al. (2015) and a Bayesian adaptive P-spline (Lang and Brezger, 2004) approach which is newly adapted here for density estimation in HMMs, our method generally achieves higher estimation accuracy and efficiency while maintaining a much lower model complexity. It also performs favourably over the Gaussian mixture based HMM in more challenging data-generating scenarios.
Estimating N is often a question of scientific interest in itself and introduces an additional level of complexity to HMM inference. While order estimation has been extensively studied for parametric HMMs, such as in Celeux and Durand (2008), Pohle et al. (2017) and Frühwirth-Schnatter (2006), few theoretical or practical results have been obtained for the semi- or nonparametric case, which often requires the number of states to be known or fixed in advance (Piccardi and Pérez, 2007; Yau et al., 2011; Lehéricy, 2018). Recently, Lehéricy et al. (2019) proposed two estimators for N, which are theoretically attractive but suffer from implementation difficulties such as non-convex optimization problems and heuristic tuning. In this paper, we address this issue with a fully Bayesian approach to the selection of N through a parallel sampling scheme that is both easy to implement and computationally efficient. Quantities such as the marginal likelihood for each model can be easily estimated.
Among the numerous application fields of HMMs, there is a growing interest within the context of e-Health in gaining insight into an individual's health status based on relevant biomarker data. Physical activity (PA) is receiving much attention as an important biomarker of the sleep-wake cycle and circadian timing system, which is closely associated with our physical and mental health (Roenneberg and Merrow, 2016). PA can be easily and objectively measured in a non-obtrusive way under normal living conditions using accelerometry or actigraphy through wearable sensing devices. Huang et al. (2018) investigated PA using an HMM with circadian-clock driven transition probabilities and, amongst various circadian parameters of interest, they proposed a novel model-derived circadian parameter for monitoring and quantifying a subject's circadian rhythm. A further advantage of a Bayesian formulation is that the modularity of its components allows inference to be performed rigorously even in a more complex hierarchical HMM. As a further contribution based on our proposed Ansatz, we develop a hierarchical conditional HMM that may be applied to (i) characterize the sleep-wake patterns in the overall PA data of an individual and (ii) analyze sleep patterns of an individual in a refined way through a "sub-HMM" that is conditional on the "rest" state inferred from (i).
The manuscript is structured as follows: Section 2 provides details of a Bayesian formulation of the spline-based HMM; Section 3 details the structure of our proposed inference algorithms, including model selection on the number of states, and summarizes the performance of our methods in comparison to other related methods under various simulation settings; Section 4 illustrates our methods on animal activity data and introduces the conditional HMM approach that is applied to human PA data from the Multi-Ethnic Study of Atherosclerosis (MESA)(1); and Section 5 provides a discussion and possible directions of further work.

A Bayesian HMM with spline-based emissions
We approximate the emission densities f_1, . . . , f_N, focusing on univariate emissions, using mixtures of standardized cubic B-spline basis functions of order O = 4 (Langrock et al., 2015). The knots are located between boundary knots a and b (assumed fixed), and we use R_K = (r_1, . . . , r_K) to denote the interior knot configuration shared across states, with the left and right external knots set to a and b, respectively (Friedman et al., 2001). Note that K = k corresponds to the case of k + 4 B-spline basis functions, and we assume K ≥ 2 for identifiability. Under these settings, f_i is formulated as

f_i(y) = \sum_{k=1}^{K+4} a_{i,k} B_k(y),   (1)

where B_k(y), k = 1, . . . , K + 4, denotes the k-th normalized (such that it integrates to one) B-spline basis function of degree 3, and the a_{i,k} are the corresponding coefficients such that \sum_{k=1}^{K+4} a_{i,k} = 1 and a_{i,k} ≥ 0 for all k = 1, . . . , K + 4. In the time-homogeneous case, i.e.
where the transition probabilities of the Markov chain are constant over time, the resulting class of HMMs is fully specified by the initial state distribution, δ = (δ_1, . . . , δ_N), with δ_i = P(x_1 = i), the transition probability matrix, Γ = (γ_{i,j})_{i,j=1,...,N}, with γ_{i,j} = P(x_t = j | x_{t−1} = i), and the emission densities defined in (1). The joint (complete) likelihood of the observations y^{(n)} = (y_1, . . . , y_n) and the hidden states x^{(n)} = (x_1, . . . , x_n) is

f(y^{(n)}, x^{(n)} | K, R_K, δ, A_K, Γ) = δ_{x_1} \prod_{t=2}^{n} f(x_t | x_{t−1}, Γ) \prod_{t=1}^{n} f_{x_t}(y_t),   (2)

where here, and throughout this paper, we use f(·|·) as a generic notation to represent conditional densities as specified by their arguments, and A_K denotes the set of spline coefficients a_{i,k}, i = 1, . . . , N, k = 1, . . . , K + 4. Integrating out the hidden states, the marginal likelihood can be evaluated in O(N^2 n) steps using the forward algorithm (in the form of Zucchini et al. (2016)), via the matrix product expression

f(y^{(n)} | K, R_K, δ, A_K, Γ) = δ P(y_1) Γ P(y_2) · · · Γ P(y_n) 1,   (3)

where P(y_t) is a diagonal matrix with i-th diagonal entry given by f_i(y_t) and 1 is a column vector of ones of dimension N.

(1) We refer to Chen et al. (2015) and Zhang et al. (2018) for more background information on the MESA dataset.
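The matrix-product form of the marginal likelihood translates directly into a scaled forward recursion; a minimal sketch follows, where the emission matrix `E` (with `E[t, i]` playing the role of f_i(y_t)) is a placeholder for the spline densities:

```python
import numpy as np

# Sketch of the forward algorithm in matrix-product form, with per-step
# scaling to avoid numerical underflow on long series. Cost is O(N^2 n).
def hmm_log_likelihood(delta, Gamma, E):
    """Return log f(y^(n)); E[t, i] stands in for f_i(y_t)."""
    n, _ = E.shape
    phi = delta * E[0]                 # delta %*% P(y_1)
    ll = np.log(phi.sum())
    phi /= phi.sum()
    for t in range(1, n):
        phi = (phi @ Gamma) * E[t]     # phi %*% Gamma %*% P(y_t)
        c = phi.sum()                  # scale factor
        ll += np.log(c)
        phi /= c
    return ll
```

For very short series the result can be checked against brute-force summation over all state paths, which confirms that the recursion computes the same quantity at a fraction of the cost.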
To complete the Bayesian formulation of the model, we assume the following factorization of the complete joint density:

f(K, R_K, A_K, Γ, ζ, x^{(n)}, y^{(n)}) = f(K) f(R_K | K) f(ζ) f(A_K | ζ, K) f(Γ) f(y^{(n)}, x^{(n)} | K, R_K, A_K, Γ).

The assumption that the parameters associated with the observed and hidden processes are a priori independent is commonly adopted in Bayesian HMMs. We use a uniform prior on {2, . . . , K_max} for K, with K_max fixed to 50 in our examples(2), where a preliminary study suggested that this was large enough to cover the support of K. For the knot positions, we propose that the r_k are taken to be the k-th order statistics of K independent uniform random variables on (a, b). The state-specific spline coefficients (a_{i,1}, . . . , a_{i,K+4}), i = 1, . . . , N, are reparameterized as a_{i,j} = exp(ã_{i,j}) / \sum_{l=1}^{K+4} exp(ã_{i,l}), with ã_{i,j} ∈ R, so that the positivity and unit-sum constraints will not hinder the design of our RJ moves. The fact that the ã_{i,j} are not identifiable is not a concern, as we are only interested in the a_{i,j}, which are identifiable, and in this way the mixing of the MCMC may be improved (Cappé et al., 2003). We choose to use a log-gamma prior with shape parameter ζ and rate parameter 1 on the ã_{i,j}, i.e. exp(ã_{i,j}) ∼ Gamma(ζ, 1), giving a symmetric Dirichlet, i.e. Dir(ζ, . . . , ζ), distribution on the corresponding (a_{i,1}, . . . , a_{i,K+4}). We choose a vague Gamma(1, 1) hyperprior(3) on ζ to reflect our uncertainty about ζ and our prior belief in sparse distributions on the spline coefficients (when ζ < 1). For the transition probability matrix we follow the literature (see, e.g. Rydén et al. (2008)) in assuming that the rows are a priori independent, each with a vague Dirichlet prior (γ_{i,1}, . . . , γ_{i,N}) ∼ Dir(1, . . . , 1), i = 1, . . . , N, and we assume that the initial distribution(4) is fixed and uniform on {1, . . . , N}. Thus the complete joint density incorporating the reparameterization can be rewritten as

f(K, R_K, Ã_K, Γ, ζ, x^{(n)}, y^{(n)}) = f(K) f(R_K | K) f(ζ) f(Ã_K | ζ, K) f(Γ) f(y^{(n)}, x^{(n)} | K, R_K, Ã_K, Γ),   (4)

where Ã_K represents the set of ã_{i,k} (i = 1, . . . , N, k = 1, . . . , K + 4). We remark that in cases where N is large, using state-specific knot configurations for the emissions may be preferred over a shared knot configuration across states, and our Bayesian model can be readily adapted; see supplementary Section ?? for more details.

(2) Clearly larger default values, including the sample size n, may be used instead and the estimation results do not appear to be sensitive to its choice as long as K_max is large enough.
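The coefficient reparameterization and its Gamma-to-Dirichlet interpretation can be sketched as follows (the values of ζ and the dimension are illustrative):

```python
import numpy as np

# Sketch of the reparameterization a_{i,j} = exp(ã_{i,j}) / sum_l exp(ã_{i,l}).
# With exp(ã_j) ~ Gamma(zeta, 1) independently, the normalized vector follows
# Dir(zeta, ..., zeta) by the classical Gamma-to-Dirichlet construction.
def softmax(a_tilde):
    z = np.exp(a_tilde - np.max(a_tilde))   # subtract max for stability
    return z / z.sum()

rng = np.random.default_rng(0)
zeta, m = 0.5, 6                            # K = 2 interior knots -> K + 4 = 6
a_tilde = np.log(rng.gamma(zeta, 1.0, size=m))
a = softmax(a_tilde)
```

Adding any constant to every ã_{i,j} leaves the a_{i,j} unchanged, which is exactly the non-identifiability of the ã_{i,j} mentioned above; only the normalized coefficients matter.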

Inference
Our aim is to obtain realisations from the posterior distribution of (K, R_K, A_K, Γ, ζ, x^{(n)} | y^{(n)}), which can be achieved by simulating from the joint posterior density defined via (4). To allow for model searches between parameter subspaces of different dimensionality, we develop a RJMCMC algorithm which combines a Metropolis-within-Gibbs sampler with trans-dimensional moves generated by births and deaths of knot points. The structure of our algorithm is listed in Algorithm 1, where b_K = I(K = 2) + 0.5 × I(3 ≤ K < K_max) and I(·) is the indicator function (therefore b_K = 0.5 for 3 ≤ K < K_max). The RJMCMC algorithm is conditioned on the cardinality N, noting that model selection will be addressed in Section 3.3. Steps (a)-(e) propose moves within a dimension, while the last step proposes a birth or death of a knot point, which changes the model dimension. We now outline the rules for each of the updating steps; further details of this algorithm, together with an extension assuming state-specific knots, are provided in the supplementary Section ??.

(3) In our experiments we did not find the values of these hyperparameters to be very influential, and other reasonable values may be used.
(4) Note that it is not possible to estimate it consistently as there is only one unobserved variable associated with it.
Algorithm 1: Reversible jump MCMC algorithm for spline-based HMMs
(a) update the hidden state sequence x^{(n)};
(b) update the transition probability matrix Γ;
(c) update the knot location vector R_K;
(d) update the set of B-spline coefficients A_K (via Ã_K);
(e) update the hyperparameter ζ;
(f) with probability b_K, consider the birth of a knot point in the B-spline representation in (1); otherwise, with probability d_K = 1 − b_K, consider the death of a knot point in the B-spline representation in (1).

Within-model moves
The moves in steps (a) and (b) are of Gibbs type, whereas those in steps (c) to (e) are of Metropolis-Hastings (MH) type; all are conditioned on the current number of knot points K. In step (a), x^{(n)} can be simulated exactly and efficiently from its full conditional distribution, f(x^{(n)} | y^{(n)}, K, R_K, A_K, Γ), via a standard FFBS procedure with transition matrix Γ and emission densities f_i(y_t) given in (1) (see e.g. Cappé et al. (2005)). In step (b), the rows of Γ are conditionally independent and are updated from their conjugate Dirichlet posterior, (γ_{i,1}, . . . , γ_{i,N}) | x^{(n)} ∼ Dir(1 + n_{i,1}, . . . , 1 + n_{i,N}), where n_{i,j} denotes the number of transitions from state i to j in x^{(n)}. In step (c), a knot r_{k*} is chosen uniformly from the set of existing knots {r_1, . . . , r_K} and proposed to be moved to a candidate point, r_c, generated from a normal distribution with mean r_{k*} and standard deviation τ_1, truncated to [a, b] (DiMatteo et al., 2001). The proposal in step (d) is generated by a random walk on the reparameterized spline coefficients ã_{i,j} (i = 1, . . . , N; j = 1, . . . , K + 4), i.e. ã′_{i,j} = ã_{i,j} + η_{i,j}, where η_{i,j} ∼ N(0, τ_2^2). In step (e), we update ζ via a log-normal random walk log(ζ′) = log(ζ) + ν, where ν ∼ N(0, τ_3^2). To allow for automatic tuning of the variance parameters τ_1, τ_2 and τ_3, we adopt a simple stochastic approximation based adaptive MCMC scheme.
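Step (a) can be sketched as below: a minimal FFBS implementation in which the likelihood matrix `L` (with `L[t, i]` standing in for f_i(y_t)) is a placeholder for the spline emission densities:

```python
import numpy as np

# Minimal sketch of forward filtering backward sampling (FFBS):
# the forward pass computes normalized filtering probabilities alpha[t],
# and the backward pass draws x^(n) exactly from its full conditional.
def ffbs(delta, Gamma, L, rng):
    n, N = L.shape
    alpha = np.empty((n, N))
    alpha[0] = delta * L[0]
    alpha[0] /= alpha[0].sum()
    for t in range(1, n):                    # forward filtering
        alpha[t] = (alpha[t - 1] @ Gamma) * L[t]
        alpha[t] /= alpha[t].sum()
    x = np.empty(n, dtype=int)
    x[n - 1] = rng.choice(N, p=alpha[n - 1])
    for t in range(n - 2, -1, -1):           # backward sampling
        p = alpha[t] * Gamma[:, x[t + 1]]
        x[t] = rng.choice(N, p=p / p.sum())
    return x
```

With strongly informative emissions the sampled sequence concentrates on the true segmentation, which gives a simple correctness check.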

Birth and death moves
The birth and death moves allow for increasing or decreasing the number of knots, or equivalently, the number of B-spline basis elements. Our design extends the ideas of DiMatteo et al. (2001) and Sharef et al. (2010) to the framework of HMMs. Suppose that the current model has knot configuration (K, R_K); we make a random choice between birth and death with probabilities b_K and d_K = 1 − b_K, respectively. In the birth move, we select a knot, r_{b*}, at random from the existing knots and create a candidate new knot, r_c, by drawing from a normal distribution (truncated to [a, b]) with mean r_{b*} and standard deviation τ, where τ is chosen as a function having the form (r_{b*+1} − r_{b*−1})^α and α is a positive real constant. The intuition here is that a new knot is more likely to be needed in locations where existing knots are relatively "dense". To complete the birth step we update the corresponding spline coefficients, which now have dimension K + 5 for each state. Here, our design is guided by the deterministic knot-insertion rule described in De Boor (2001), which allows a new knot to be inserted without changing the shape of the overall B-spline curve, noting that this exact relationship becomes approximate in our context as we are working with normalized basis functions. We extend the scheme by adding more degrees of freedom in order to meet the dimension matching condition required for the validity of the RJMCMC algorithm. More specifically, for the birth of a candidate knot point r_c ∈ (r_{n*}, r_{n*+1}), the associated spline parameters ã′_{i,j}, for i = 1, . . . , N, are created as

ã′_{i,j} = log{ c_j exp(ã_{i,j}) + (1 − c_j) exp(ã_{i,j−1}) } + u_i   (5)

for the basis functions affected by the insertion, with ã′_{i,j} = ã_{i,j} (respectively ã_{i,j−1}) for the unaffected indices below (above) the insertion point, where c_j = (r_c − r_{j−4})/(r_{j−1} − r_{j−4}) and u_i ∼ iid U(0, 1). Here the u_i provide the additional degrees of freedom needed for dimension matching. Next, consider the death of a knot point from the current knot configuration (K, R_K).
A knot, r_{d*}, is chosen at random from the set of existing knots {r_1, . . . , r_K} and then deleted. The spline parameters associated with this move are updated according to the inverse transformation of (5). The parameters for the state process remain unaltered in either the birth or the death move(5).
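The deterministic knot-insertion rule of De Boor (2001) that guides the birth move can be sketched as follows. This is the standard (unnormalized) cubic B-spline version of the rule; the random perturbations used in the actual RJ proposal are omitted, and all knot values are illustrative:

```python
import numpy as np

# Sketch of deterministic cubic B-spline knot insertion (Boehm's algorithm):
# inserting a knot rc leaves the spline curve unchanged while adding one
# basis function, and hence one coefficient.
def insert_knot(t, coef, rc, p=3):
    """Insert rc into knot vector t; return updated (knots, coefficients)."""
    j = int(np.searchsorted(t, rc, side="right")) - 1   # rc in [t[j], t[j+1])
    new_coef = []
    for k in range(len(coef) + 1):
        if k <= j - p:
            new_coef.append(coef[k])                    # unaffected (left)
        elif k > j:
            new_coef.append(coef[k - 1])                # unaffected (right)
        else:                                           # affected range
            c = (rc - t[k]) / (t[k + p] - t[k])
            new_coef.append(c * coef[k] + (1 - c) * coef[k - 1])
    return np.insert(t, j + 1, rc), np.array(new_coef)

t = np.array([0.0] * 4 + [0.5] + [1.0] * 4)   # clamped cubic knot vector
coef = np.ones(5)                              # all-ones coefs: constant spline
t2, coef2 = insert_knot(t, coef, 0.3)
```

Since the constant function has all B-spline coefficients equal to one, the inserted-knot coefficients remaining exactly one confirms that the curve is unchanged by the insertion.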

Bayesian model selection: Estimating the cardinality N
While it is theoretically possible to extend Algorithm 1 by introducing an additional reversible jump step on the number of states, or by using a product space search algorithm to sample from the joint posterior of parameters from all competing models (e.g. Carlin and Chib (1995)), it is challenging to design computationally practical trans-dimensional algorithms in the HMM setting due to the large and complex parameter space. Another potential strategy is to use Dirichlet process based priors for the transition matrix, which allows for a potentially infinitely large state space (see e.g. Fox et al. (2011)). However, combining such a framework with the spline-based emission model would pose significant computational challenges. Instead, we propose to perform model selection based on the marginal likelihood, also known as "evidence",

f(y^{(n)} | N = j) = \int f(y^{(n)} | θ_j, N = j) f(θ_j | N = j) dθ_j,   j = 1, . . . , M,   (6)

where θ_j is the parameter set (excluding x^{(n)}) associated with the j-state model, f(y^{(n)} | θ_j, N = j) is the observed likelihood given in (3) and M denotes some maximum number of states that we want to consider. Given prior model probabilities P(N = j) and evidences, the posterior model probabilities can be computed using Bayes' theorem. Following Bayesian decision theory, we can pick the model that gives the highest posterior probability, i.e. N* = argmax_{k=1,...,M} P(N = k | y^{(n)}). For most models of interest (including HMMs), however, the integral in (6) has no closed-form expression and needs to be approximated.
Various Monte Carlo based approximation schemes have been proposed, see Friel and Wyse (2012) and Llorente et al. (2020) for recent reviews.
We propose to approximate the evidence of a spline-based HMM by using a harmonic mean estimator (Gelfand and Dey, 1994), which allows direct estimation of the evidence using the simulation output and thus is straightforward to implement. The estimator relies on the simple fact that for any proper density function h, we have the expectation

E[ h(θ_j) / ( f(θ_j) f(y^{(n)} | θ_j) ) | y^{(n)} ] = 1 / f(y^{(n)} | N = j),

which suggests the estimator

\hat f(y^{(n)} | N = j) = [ (1/S) \sum_{i=1}^{S} h(θ_j^{(i)}) / ( f(θ_j^{(i)}) f(y^{(n)} | θ_j^{(i)}) ) ]^{-1},

where θ_j^{(i)} is the i-th sample simulated from the posterior f(θ_j | y^{(n)}). This estimator enjoys a finite variance if \int h^2(θ) / ( f(θ) f(y^{(n)} | θ) ) dθ < ∞, i.e. h(θ) must have lighter tails than f(θ) f(y^{(n)} | θ) (DiCiccio et al., 1997). To ensure this we follow Robert and Wraith (2009) and Marin and Robert (2009) and construct an appropriate density h based on truncated highest posterior density (HPD) regions derived from the MCMC samples. The resulting estimator is known as a truncated harmonic mean estimator and has been successfully used in various other model settings; see for instance Durmus et al. (2018) and Acerbi et al. (2018). More specifically, we define a sample-based 100β% HPD region as (omitting the dependence on the index of state j for clarity)

H_β = { θ^{(i)} : f(θ^{(i)}) f(y^{(n)} | θ^{(i)}) ≥ q̂_β },

where q̂_β is the empirical upper β quantile of the values f(θ^{(i)}) f(y^{(n)} | θ^{(i)}) produced in the output of the MCMC. Here we propose to construct the density h as

h(θ) = (1 / |H_β|) \sum_{θ^{(i)} ∈ H_β} I{ d(θ, θ^{(i)}) ≤ ξ, dim(θ) = dim(θ^{(i)}) } / V(ξ),

where V(ξ) is the volume of a ball centred at θ^{(i)} with (small) radius ξ, dim(·) is the dimensionality of the argument and d(·, ·) is a suitable distance measure. It is easy to check that h is a proper density function and has finite support, noting that in our context the parameter space of θ = (K, ζ, R_K, A_K, Γ) is a union of subspaces of varying dimension. Our proposal h can be viewed as a histogram-like nonparametric estimator of the posterior f(θ | y^{(n)}) based solely on samples in the HPD regions. Note that V(ξ) does not need to be computed as it cancels out when computing posterior model probabilities, provided that ξ is fixed across models.
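The truncation idea can be checked on a toy conjugate-normal model where the evidence is known in closed form. The sketch below is ours and simplifies the construction above: the instrumental density h is taken uniform on an empirical HPD-style interval built from posterior samples (function and variable names are hypothetical):

```python
import numpy as np

# Toy check of a truncated harmonic mean estimator: y ~ N(theta, 1) with
# prior theta ~ N(0, 1), so the evidence is f(y) = N(y; 0, 2) in closed form.
def log_evidence_thm(theta, log_joint, beta=0.5):
    """theta: posterior samples; log_joint[i] = log f(theta_i) f(y | theta_i)."""
    q = np.quantile(log_joint, 1 - beta)       # keep top 100*beta% of samples
    keep = log_joint >= q                      # empirical HPD-style region
    lo, hi = theta[keep].min(), theta[keep].max()
    log_h = -np.log(hi - lo)                   # uniform h on retained interval
    # mean of h / (prior * likelihood) over ALL samples estimates 1 / f(y);
    # terms outside the region are zero and are simply dropped.
    m = np.logaddexp.reduce(log_h - log_joint[keep]) - np.log(len(theta))
    return -m

rng = np.random.default_rng(2)
y = 1.3
post = rng.normal(y / 2, np.sqrt(0.5), size=20000)   # exact posterior draws
log_joint = -0.5 * post**2 - 0.5 * (y - post)**2 - np.log(2 * np.pi)
est = log_evidence_thm(post, log_joint)
true = -0.5 * np.log(4 * np.pi) - y**2 / 4           # log N(y; 0, 2)
```

Restricting the sum to the high-posterior region bounds the ratios h/(prior × likelihood), which is what gives the estimator its finite variance.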

Performance in Simulations
To thoroughly evaluate the empirical performance of our proposed adaptive spline (adSP) Bayesian methodology, we conducted simulation studies in various hypothetical and realistic settings. Here we briefly summarize the design and results, with additional details provided in supplementary sections ?? and ??.
We first compared our adSP with three other relevant candidate methods: (i) the frequentist P-spline (fpSP) approach of Langrock et al. (2015), (ii) a Bayesian adaptive P-spline approach (bpSP) that, to the best of our knowledge, represents the first implementation within HMMs, and (iii) a frequentist Gaussian mixture-based HMM (GMM) motivated by Volant et al. (2014). Our comparison is based on estimation accuracy, using two criteria, namely the average Kullback-Leibler divergence (KLD) from the true emission distributions and the decoding accuracy/error, quantified via the proportion of correctly/incorrectly classified states, as well as on computational cost. We generated artificial data from four simulation models with emissions exhibiting features such as multi-modality, skewness, heavy-tailedness and excess kurtosis. Model 1 is a 2-state HMM with a normal and a normal mixture emission as considered in Langrock et al. (2015). Model 2 is a 3-state HMM with a unimodal positively skewed emission distribution in state 1, a bimodal distribution in state 2, and a unimodal negatively skewed distribution in state 3. First, we verified the performance of our model selection method, which is based on the marginal likelihood, for the two Bayesian approaches, adSP and bpSP. We found that the adSP method identified the correct number of states in all 60 replicates for each of the four simulation models, with the averaged posterior probability of the correct model equal to one.
In contrast, bpSP had a lower accuracy and underestimated the number of states in some repetitions. Furthermore, comparing the performance of the four methods across the simulation models for fixed cardinality N, we conclude that the proposed adSP method is the only method that performs consistently well in all scenarios, with a particular advantage in decoding. Within the spline-based methods, adSP and bpSP had roughly comparable accuracy in the easier settings (Models 1 and 2), while the advantages of using adSP became apparent in the more challenging scenarios (Models 3 and 4), where bpSP suffered from poor mixing or convergence issues. Although fpSP was found to have better convergence than bpSP, it yielded lower accuracy than adSP in almost all cases. The GMM method performed well when the true emissions were close to Gaussian or Gaussian mixtures, but its performance was weak when the emissions possessed non-Gaussian properties (e.g. skewness or heavy tails). The choice of its associated hyperparameters strongly influences the results (overfitting or over-penalization), which is in agreement with previous findings (Baudry and Celeux, 2015; Fan et al., 2021). It also suffered from serious convergence issues in challenging scenarios like Model 4. In terms of computational efficiency, adSP was generally the second most efficient approach after GMM, while bpSP was the most computationally costly. It is important to note that when calculating the computational time for the frequentist approaches, i.e. fpSP and GMM, the additional time required for performing uncertainty quantification of the parameters was not included. In practice this poses a significant computational burden, especially for fpSP, while it is provided by the Bayesian methods at no extra cost.
We considered two additional simulation scenarios to further assess the applicability of our proposed methods (see supplementary Section ??). We examined the robustness of adSP in the presence of misspecified transition dynamics, where data were generated using either a semi-Markov or non-homogeneous state process. We also tested the feasibility and scalability of the algorithm based on state-specific knot configurations in handling systems with a larger cardinality N . In both cases, our proposed methods showed strong performance in estimation accuracy and efficiency.

Analysis of oceanic whitetip shark acceleration data
HMMs provide a useful tool for modelling animal movement metrics to study the dynamic patterns of an animal's behavioural states (e.g., resting, foraging or migrating) in ecology (Langrock et al., 2012, 2018). Here we consider a time series of the overall dynamic body acceleration (ODBA) collected from an oceanic whitetip shark at a rate of 16 Hz over a time span of 24 hours. A larger replicate data set was analyzed in Langrock et al. (2018).
For our analysis, the raw ODBA values are averaged over non-overlapping windows of length 15 seconds and log-transformed (lODBA), resulting in a total of 5760 observations. The marginal distribution of the transformed data is shown in Figure 1. We modelled the lODBA values using our adSP with cardinality set to N = 3 as in Langrock et al. (2018), who present biological reasons for this assumption. Implementational details of the MCMC algorithm are provided in supplementary Section ??. Figure 1 (left panel) shows the estimated emission densities (obtained as in supplementary Section B.1) along with pointwise 95% credible intervals. The posterior modal number of knots is 13, with P̂(K = 13 | data) = 0.689. For comparison, we also fitted a B-spline HMM using the method of Langrock et al. (2018), where we set K = 39 to ensure enough flexibility and selected λ = (300, 1, 400) for the smoothing parameters based on our experiments. While the resulting transition probability estimates and the density fits seem to be comparable between the two approaches (see Figure 1, middle panel), their method uses approximately three times as many parameters as ours for estimating the emissions, for which we experienced numerical stability issues during estimation(6). Therefore, the bootstrap-based uncertainty quantification approach would be challenging and costly to implement, whereas we obtained posterior uncertainties for the parameters at no extra cost.
The computational feasibility and stability of the proposed algorithm allowed us to increase the value of N and perform model selection. Preliminary runs indicated that N > 5 was likely to be favoured, so we used adSP with state-specific knots along with the marginal likelihood approach described above. Using a discrete uniform prior over {2, . . . , 10}, the posterior modal number of states was estimated to be N = 9, with a posterior probability of 1; thus the data strongly support a considerably larger number of states than was originally assumed in Langrock et al. (2018). In addition, the individual emissions of the 9-state model now appear unimodal (see right panel of Figure 1). This is interesting: apart from investigating the biological interpretation of these states, our results suggest that one might consider fitting a fully parametric HMM with standard unimodal forms of emission densities for N = 9. Further detailed results of the application to the shark data can be found in supplementary Section ??.

(6) This finding is consistent with our experience in the simulation studies.

A conditional HMM for analysing circadian and sleep patterns
The use of PA data obtained from wearable sensors for monitoring circadian rhythm and sleep patterns outside laboratory settings is well justified (Ancoli-Israel et al., 2015; Quante et al., 2018). We next use our proposed Ansatz to introduce a conditional hidden Markov model in its general form and illustrate the method using publicly available human PA data from MESA. The conditional HMM consists of a "main model" to characterize the general pattern of the overall time series, and a "sub-HMM" that is invoked based on a specific state i of the main model with N_S possible sub-states. Without loss of generality, we set i = 1 and for simplicity omit this subscript in what follows. The posterior distribution of the parameters in the sub-HMM can be expressed as

f(θ_S | y^{(n)}) = \sum_{x^{(n)}} f(θ_S | x^{(n)}, y^{(n)}) f(x^{(n)} | y^{(n)}),   (7)

where θ_S is the parameter set for an N_S-state sub-HMM, x^{(n)} is the hidden state sequence associated with the main-HMM, and we assume that

f(θ_S | x^{(n)}, y^{(n)}) ∝ f(θ_S) f(y^{(n)} | x^{(n)}, θ_S).   (8)

We refer to the second term in (8) as the "conditional likelihood" for the sub-HMM. The key idea here is that, by conditioning on state 1 of the main-HMM, we want to restrict the observations that contribute to the likelihood to only those associated with the time points where x_t = 1, while also maintaining the temporal dependence of these observations.
To achieve this, we could simply treat observations {y t : x t ̸ = 1} as "missing data".
This strategy offers the advantage that the resulting conditional likelihood can be easily evaluated within the HMM framework. More specifically, let (t_1, . . . , t_{T_1}) be the collection of time points in ascending order such that x_{t_j} = 1, j = 1, . . . , T_1. Using the notation of Section 2, we have

p({y_t : x_t = 1} | x^(n), θ_S) = δ_S P(y_1) Γ_S P(y_2) · · · Γ_S P(y_T) 1^⊤,   (9)

where P(y_t) = diag(f_1(y_t), . . . , f_{N_S}(y_t)) for t = t_1, . . . , t_{T_1}, and P(y_t) = I_{N_S}, the identity matrix of dimension N_S, for t ≠ t_1, . . . , t_{T_1}. Expression (9) takes the same form as the marginal likelihood of a standard HMM, and therefore the standard forward algorithm can be used to efficiently evaluate the conditional likelihood.
Note that the uncertainty regarding the state classification is properly taken into account, as the state sequence will be integrated out to obtain the marginal posterior as defined in (7), on which our inference for the sub-HMM will be based.
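The "missing data" trick above can be illustrated in a few lines: time points with x_t ≠ 1 contribute an identity matrix, so a standard scaled forward recursion evaluates the conditional likelihood. The function below is a minimal sketch in Python; the function name, arguments and densities are our own illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def conditional_likelihood(y, main_states, delta, Gamma, emission_pdfs):
    """Forward-algorithm evaluation of the sub-HMM conditional likelihood.

    y            : full observation sequence
    main_states  : decoded states of the main HMM (state 1 hosts the sub-HMM)
    delta, Gamma : initial distribution and transition matrix of the sub-HMM
    emission_pdfs: list of sub-state emission densities f_k(y)
    """
    phi = delta.copy()                  # forward vector
    log_lik = 0.0
    for t, yt in enumerate(y):
        if t > 0:
            phi = phi @ Gamma
        if main_states[t] == 1:         # this time point enters the likelihood
            phi = phi * np.array([f(yt) for f in emission_pdfs])
        # else: P(y_t) = identity, so phi is left unchanged
        c = phi.sum()                   # scale to avoid numerical underflow
        log_lik += np.log(c)
        phi /= c
    return log_lik
```

Note how observations outside state 1 leave the forward vector untouched, which is exactly the identity-matrix convention in (9).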
It should be pointed out that our conditional HMM approach is different from what is called the hierarchical HMM (see e.g. Adam et al. (2019)) in that for the latter, a joint model is formulated for multiple observed processes at different temporal resolutions, each of which is modelled via a hidden Markov process and the process at the coarser level determines the onset of a specific finer level process for each epoch. In contrast, our method may operate on a single time scale and allows us to refine our analysis of a chosen state.
It is also important to note that fitting the main HMM with N + N S − 1 states will not necessarily split state 1 of the original model into N S sub-states as desired, whereas in our framework we control this directly through the conditioning.
In our application we are interested in studying sleep from accelerometer data, where state 1 corresponds to the lowest-activity state that contains the sleep bouts. However, accelerometer data often contain a large number of zeros during a rest or sleep state (Ae Lee and Gill, 2018), which could cause issues for the spline-based model. To address this, we assume a "zero inflation" of the emissions at both HMM levels,

f_{x_t}(y_t) = w_{x_t,1} δ_0(y_t) + w_{x_t,2} f^B_{x_t}(y_t),

where x_t indicates the underlying state at time t, w_{x_t,1} represents the state-specific zero weight such that 0 ≤ w_{x_t,1} ≤ 1 and w_{x_t,1} + w_{x_t,2} = 1, δ_0 is the Dirac delta distribution and f^B_{x_t}(y_t) is a spline-based emission density as defined in Section 2. Following Gassiat et al.
(2016a) we can establish identifiability of the resulting HMM provided that at most one w xt,1 is equal to one and that {δ 0 , f B 1 , . . . , f B N } are linearly independent. In our analysis these conditions are always satisfied.
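The zero-inflated emission can be sketched as follows: exact zeros are absorbed by the discrete mass while positive values are handled by the continuous density. This is a minimal illustration under assumed names; the spline density f_B is a stand-in for the paper's f^B_{x_t}.

```python
import math

def zero_inflated_lik(y, w1, f_B):
    """Likelihood contribution of one observation under zero inflation.

    w1  : state-specific weight of the Dirac mass at 0 (0 <= w1 <= 1)
    f_B : continuous (e.g. spline-based) emission density
    """
    if y == 0:
        return w1                   # delta_0 picks up exact zeros
    return (1.0 - w1) * f_B(y)      # continuous part with weight w2 = 1 - w1
```

In an HMM forward recursion, this quantity simply replaces the usual density evaluation f_{x_t}(y_t) for each state.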
The model can be easily adapted to applications where further discrete mass points for low observations are needed.

Application to the MESA dataset
To illustrate our proposed method, we consider two example subjects, A and B. (Here we present only results for these two subjects; such an analysis could be extended to include all MESA subjects, but this is beyond the remit of this article.) For the sub-HMM we assume 2 sub-states, 1.1 and 1.2, to potentially capture the ultradian oscillations between higher and lower intensity of movement during sleep. Such oscillations were found in accelerometer data by Winnebeck et al. (2018), who concluded that they correspond to the circa 120-min periodic transitions between the non-REM and REM stages of sleep. Comparing the decoded sub-states with the PSG stages (Table 1), we can see that while both sub-states contain a mix of all the PSG stages, only state 1.1 is predominantly associated with the deeper sleep stages, with stage-specific proportions of (…) for subject A and (26%, 67.4%, 77%, 100%, 67.9%) for subject B for (wake, N1, N2, N3, REM).
In contrast, State 1.2 tends to be associated with lighter sleep stages as well as disruptions into wake which were not identified by the main-HMM. We therefore anticipate that state 1.2 provides additional useful information regarding the sleep quality of a subject.
The estimated transition probabilities of the fitted sub-HMM provide a systematic quantification of each subject's sleep composition; see Table 2, which shows that subject B spent a larger proportion of sleep time in the wake and N1 stages while having a lower proportion of time in the deeper N2 and N3 stages. The decoding and state probabilities of the sub-HMM also allow us to investigate the dynamic variation within and between bouts of sleep. For instance, the fragmentation of the blue region in the state probability plots of Figure 2 suggests that subject A seems to experience more interruptions and lighter sleep during the earlier phase of the sleep bout, whereas subject B suffers from sleep interruptions and transitions to lighter sleep throughout the entire bout. These observations are consistent with the subjects' own reports in the sleep questionnaire and with the PSG recordings during the single night.

Summary and further Discussion
In this paper, we propose and develop a Bayesian methodology for inference in spline-based HMMs, which offer attractive properties compared to alternative nonparametric HMMs in terms of simplicity in model interpretation and flexibility in modelling. Our method allows for the number of states, N , to be unknown along with all other model parameters including the spline knot configuration(s). Compared with a P-spline-based construction, we achieve a parsimonious and efficient positioning of the spline knots via a RJMCMC algorithm, where the knots can either be shared across states or be state-specific. Model selection on N is based on the marginal likelihood, which can be effectively estimated via a truncated harmonic mean estimator under an easy-to-implement parallel sampling scheme.
Through extensive simulation studies, we demonstrated the effectiveness and superiority of our proposed methods over alternative comparators, including the Gaussian mixture based HMM, the frequentist P-spline-based approach of Langrock et al. (2015), and a Bayesian adaptive P-spline approach which is investigated here for the first time. Importantly, the computational efficiency and flexibility of our algorithm allows us to deal with more states, which is a challenging problem even for parametric approaches due to convergence problems with increasing N . We highlight this advantage in the application to the animal movement data and illustrate the use of our method as a nonparametric approach for explorative data analysis.
The application to human PA data highlights the flexibility of our Bayesian modelling approach, which can be extended in a relatively straightforward way to hierarchical scenarios such as the conditional HMM. The extension to a hierarchical framework of a sub-HMM within an overall HMM allows us to estimate many important parameters that characterize an individual's circadian rhythm, and to model the individual stochastic dynamics of the rest-state activity, where the sub-states may be associated with deeper and lighter or interrupted sleep stages. Another feature of our method is that the algorithm operates in an unsupervised manner, i.e. it does not require PSG labels for learning the model, which is desirable in applied settings as these labels are very costly or even impossible to acquire. The method developed here is thus of imminent interest to sleep and circadian biology researchers using data from wearable sensors.
Our modelling framework opens up several possible extensions in further research. For instance, the homogeneity assumption on the hidden Markov chain can be relaxed by reparameterizing Γ in terms of covariates via multinomial logistic link functions (Zucchini et al., 2016). Efficient MCMC inference can be achieved by incorporating the Polya-Gamma data augmentation scheme of Polson et al. (2013), which was successfully applied to parametric nonhomogeneous HMMs in Holsclaw et al. (2017), into the present modelling framework. Our methodology can also be extended in a relatively straightforward manner to Markov-switching (generalized) additive models as studied in Langrock et al. (2017, 2018) using frequentist approaches, where the splines are used to model the functional effects of the covariates instead of the emissions. Without the density constraints on the spline parameters, the design of the RJMCMC algorithm can be simplified, and the efficiency of the resulting algorithm may be further improved. We believe that the advantages of using a Bayesian approach over a frequentist penalized approach, as observed in this paper, would carry over to this context. Additionally, it would be interesting to explore the combination of modern deep-learning-based methods, which excel at handling highly complex temporal dependence in the series and utilizing historical information sets, with the conventional HMM probabilistic framework for achieving better predictive ability while maintaining model interpretability.

A.1 Adaptive MCMC set-up
The scaling parameter τ_i is adapted from iteration t − 1 to t as

log τ_i^(t) = log τ_i^(t−1) + sgn( (1/T_a) Σ_{j=t−T_a}^{t−1} ρ_i^(j) − ρ*_i ) ϵ(t),

where ϵ(t) = min(0.01, 1/√t) following suggestions in Roberts and Rosenthal (2009) and Rosenthal (2007), sgn(·) is the sign function, ρ_i^(j) is the MH acceptance rate in iteration j, ρ*_i is the targeted acceptance rate and ϵ_i is a sufficiently small positive lower bound for τ_i. That is, we adjust the scaling parameter at every iteration by adding or subtracting a factor ϵ(t) (whose magnitude is diminishing) according to whether the averaged acceptance rate over the past T_a iterations is above or below the target ρ*_i; see Green et al. (2015) and references therein for more details on the related theory and methods. In our context we set ρ*_1 = ρ*_3 = 0.4 and ρ*_2 = 0.24 based on the optimal scaling results for MH algorithms (see e.g. Gelman et al. (1997) and Roberts and Rosenthal (2001)). Furthermore, we set T_a = 10 (inspired by results in Marshall and Roberts (2012)) and ϵ_i = 10^−6. In our examples, the lower bound ϵ_i is usually not reached as the algorithm usually stabilizes well.
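One adaptation step of this scheme can be sketched as below. The function name and signature are our own, and the sign convention (grow τ when acceptance is above target, shrink otherwise) is an assumption consistent with the description above.

```python
import math

def adapt_scale(tau, recent_accept_rates, t, rho_star, eps_floor=1e-6):
    """One adaptive update of a random-walk scaling parameter tau.

    recent_accept_rates : acceptance rates over the past T_a iterations
    t                   : current iteration index (for the diminishing step)
    rho_star            : target acceptance rate
    eps_floor           : lower bound eps_i for tau
    """
    eps_t = min(0.01, 1.0 / math.sqrt(t))               # diminishing adaptation
    mean_rate = sum(recent_accept_rates) / len(recent_accept_rates)
    sign = 1.0 if mean_rate > rho_star else -1.0        # shrink tau if accepting too rarely
    log_tau = math.log(tau) + sign * eps_t              # move on the log scale
    return max(math.exp(log_tau), eps_floor)            # respect the lower bound
```

Because ϵ(t) → 0, the adaptation diminishes over time, which is the standard requirement for valid adaptive MCMC.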

A.2 Acceptance probabilities for the Metropolis-Hastings moves
Moves (c) and (d) in Algorithm 1 are standard Metropolis-Hastings updates. For (c), the acceptance probability for relocating the knot r_{k*} to the candidate point r_c is the usual MH ratio, where R′_K differs from R_K only in the replacement of r_{k*} by r_c, and f_{N,[a,b]}(· | µ, σ²) denotes the density of the truncated normal distribution with mean µ, standard deviation σ and bounded on [a, b]. For move (d), since the proposal is symmetric, the acceptance probability reduces to the ratio of posterior densities, where the set of ã′_{i,j} is denoted by Ã′_K. In step (e) we use a log-normal random walk to update the parameter due to the positivity constraint, and the corresponding acceptance probability is adjusted for the log-transformation.

A.3 Acceptance probabilities for the birth and death moves

Using the notation of Green (1995), the birth move for the spline parameters is accepted with probability min(1, A), where A can be expressed in the form likelihood ratio × prior ratio × proposal ratio × Jacobian.
In our context the likelihood ratio is the ratio of the marginal likelihoods under the proposed and current knot configurations, where the marginal likelihood is given by equation (??) of the main paper. The prior ratio is given by the product of the ratios of the priors on each block of parameters involved in this update, where f_U(·) denotes the probability mass function of a uniformly distributed random variable on {2, . . . , K_max}, and f_LG(· | ζ, 1) stands for the log-gamma density with shape parameter ζ and rate parameter 1. The proposal ratio involves f_{N,[a,b]}(· | µ, σ²), the density of a truncated normal distribution with mean µ, standard deviation σ and bounded on [a, b]. Lastly, the Jacobian corresponds to the change of variables for the affected spline coefficients, where n* is defined as in Section 3.2 of the main paper such that r_c ∈ (r_{n*}, r_{n*+1}). After simplification A is thus obtained. Since the birth and death moves are defined in a symmetric way, the acceptance probability for the death move is min(1, A^−1), where A is computed by considering the reverse birth move starting with the K − 1 interior knots and with n* = d* − 1.

A.4 Tackling label switching
A practical consequence of the properties of the model and its prior is that samples generated by the reversible jump MCMC algorithm are subject to the label switching problem, i.e. the state labels can permute during the MCMC iterations without changing the posterior density (Scott, 2002). As a result, the MCMC output cannot be directly used for inference about the state-specific parameters. To tackle this issue we use the Kullback-Leibler relabelling algorithm developed in Stephens (2000), which has been successfully applied to both parametric HMMs (Rodríguez and Walker, 2014) and nonparametric HMMs (Hadj-Amar et al., 2021). The basic idea is as follows. For each of the T collected MCMC samples (discarding those collected during the burn-in), we first construct an n × N dimensional "classification probability matrix" whose (i, k)-th entry in our context is

P^(t)_{i,k} = P(x_i = k | y^(n), θ^(t)),

the marginal probability that observation i belongs to state k under the parameters of draw t. The algorithm then involves iteratively searching for a permutation of the state labels for each draw that minimizes the KLD between the classification probabilities averaged over the MCMC iterations, P̄_{i,k} = Σ_t P^(t)_{i,k} / T, and the classification probabilities obtained in each MCMC iteration.
In other words, we make the state labels associated with each MCMC draw agree on the classification probabilities P^(t)_{i,k}. The "optimized" permutation found for each MCMC sample can then be used to relabel the samples to achieve a consistent ordering of the labels.
We refer to Stephens (2000), and Rodríguez and Walker (2014), for more details of the algorithm. In our implementations, we use the R package Label.switching of Papastamoulis (2016) to perform this minimization procedure.
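A toy sketch of this KL relabelling idea is given below (in practice we use the Label.switching package, as stated above): for each draw, we pick the state permutation minimising the KL divergence between the averaged classification matrix Q and the permuted per-draw matrix, and iterate a few alternating passes. Function and variable names are illustrative assumptions.

```python
import itertools
import numpy as np

def relabel(P):
    """Toy Stephens-style relabelling.

    P : array of shape (T, n, N) holding classification probabilities
        for T MCMC draws, n observations and N states (all entries > 0).
    Relabels each draw in place and returns P.
    """
    T, n, N = P.shape
    perms = list(itertools.permutations(range(N)))
    for _ in range(10):                 # a few alternating passes usually suffice
        Q = P.mean(axis=0)              # current averaged classification matrix
        for t in range(T):
            # KL(Q || permuted P_t) for each candidate permutation
            kls = [np.sum(Q * (np.log(Q) - np.log(P[t][:, p]))) for p in perms]
            P[t] = P[t][:, perms[int(np.argmin(kls))]]
    return P
```

Enumerating all N! permutations is only feasible for small N; the package implementation uses more efficient assignment algorithms.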

A.5 Validity of the algorithm
When the adaptive tuning scheme ceases after a period of burn-in, the validity of the proposed reversible jump MCMC algorithm can be established following standard Markov chain theory as presented in Tierney (1994) and Robert and Casella (2013); in particular, convergence to the posterior distribution holds for almost all initial states (except for a set of posterior probability zero). To replace "almost all" by "all" we require a stronger condition called Harris recurrence, which is generally difficult to verify in the trans-dimensional MCMC set-up (Roberts and Rosenthal, 2006; Hastie and Green, 2012). However, in practice we can tackle this issue by drawing the initial state from a continuous distribution centered around the posterior mode (or other approximations thereof, such as the maximum likelihood estimate). This strategy is employed in our initialization process. To accelerate the convergence of the chain we initialize the knot points at the empirical quantiles of the data so that more knots are initially placed in data-rich regions. For the remaining parameters the initial values are drawn from appropriate truncated normal distributions centered at their respective maximum likelihood estimates computed given the initial knot configuration.
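The quantile-based knot initialization can be sketched in two lines; the function name is an assumption, but the idea matches the description above: equally spaced quantile levels place more knots where the data are dense.

```python
import numpy as np

def initial_knots(y, K):
    """Place K interior knots at empirical quantiles of the data y."""
    probs = np.linspace(0, 1, K + 2)[1:-1]   # interior quantile levels
    return np.quantile(y, probs)             # knots follow the data density
```

For skewed data this concentrates initial knots in the bulk of the distribution, which tends to shorten the burn-in of the trans-dimensional sampler.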
A.6 State-specific knot configurations for emission estimation for larger N

A shared knot configuration provides a parsimonious representation of the emission densities, but this approach can become inefficient as the number of states, N, grows large. In such scenarios, the usage of the basis functions by a particular state can be very low, and a knot point becomes more difficult to add or delete due to the shared nature of the configuration (in general, a new knot is more likely to be accepted if it contributes to the fit of all the emission densities), resulting in slower mixing of the chain. To address this issue, we propose an extension of the basic algorithm allowing a different knot configuration for each state, where Ã_{i,K_i} represents the spline coefficient vector for state i (with dimensionality dependent on K_i) and ζ_i denotes the state-specific shape parameter for the log-gamma prior on Ã_{i,K_i}.
Slightly different from the shared knot setting, we use a Gamma(1, 1) hyperprior for the shape parameter ζ i , but with an additional truncation at 0.01, to improve numerical stability. Posterior sampling for the extended model can be performed in a similar manner as described in Algorithm 1. The state sequence and transition matrix are updated in exactly the same way as before. For R i,K i ,Ã i,K i and ζ i , the same types of Metropolis-Hastings updates (i.e. steps (c), (d) and (e) in Algorithm 1) can be used, now scanning through each i. Birth or death of a knot point is considered separately for the knot configuration associated with each state, via the reversible jump move as described in Section ?? of the main paper. The resulting algorithm is thus expected to be more computationally intensive than Algorithm 1 due to the increased number of updating steps. However, note that this increase is partially counterbalanced by the potentially reduced number of basis functions and spline coefficients needed for each state, as well as the improved mixing of the Markov chain. Experimental results indicate that, say for N ≥ 5, the algorithm with state-specific knots is more efficient overall than the shared knot version, and its advantage becomes more pronounced as the value of N increases. Furthermore, the algorithm demonstrates scalability, remaining computationally feasible for up to 10 states.

B Details for the main simulation study
In this section, we provide details on the setups and results of our main simulation study, where we evaluate the performance of our proposed adSP method and compare it against fpSP, bpSP and GMM methods.

B.1 Evaluation metrics for estimation methods
To compare the estimation accuracy of different approaches, we consider two primary metrics.

1. Density estimation accuracy: we use the average Kullback-Leibler divergence (KLD) between the true and estimated emission densities. In our implementations we used the KLD function in the R package LaplacesDemon (Statisticat and LLC., 2021) for approximating the average KLD.
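A KLD between a true density f and an estimate g can also be approximated by simple Monte Carlo, E_f[log f/g]; the sketch below (names and setup assumed, not the package's implementation) illustrates the quantity being reported.

```python
import math
import random

def mc_kld(f, g, sampler, m=10000, seed=1):
    """Monte Carlo approximation of KL(f || g) = E_f[log f(Y)/g(Y)].

    f, g    : density functions
    sampler : function rng -> one draw from f
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(m):
        y = sampler(rng)                  # draw y ~ f
        total += math.log(f(y) / g(y))
    return total / m                      # >= 0 up to Monte Carlo error
```

For example, with f an Exp(1) density and g an Exp(2) density, the true KLD is 1 − log 2 ≈ 0.31, which the estimator recovers to within Monte Carlo error.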
2. Decoding accuracy: decoding infers the hidden state process x^(n) from the observed y^(n) and is one of the key inference tasks in HMMs. We assess the agreement/disagreement between the estimated and true state sequences by calculating the proportion of correctly/incorrectly classified states, where the latter is also known as the normalized Hamming distance (Fox et al., 2011). For the adSP and bpSP methods we estimate the states by first estimating the marginal state probabilities based on the T MCMC samples of x^(n) as P̂(x_t = k | y^(n)) = (1/T) Σ_{j=1}^T I(x_t^(j) = k), and then assigning each time point to the state with the highest marginal probability.
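The decoding metric can be sketched as follows: estimate the marginal state probabilities from the MCMC draws, take the pointwise most probable state, and compute the normalised Hamming distance against the truth. This is an illustrative sketch with 0-based state labels (an assumption; the paper indexes states from 1).

```python
import numpy as np

def decode_and_score(x_draws, x_true, N):
    """Marginal MAP decoding and accuracy from MCMC state draws.

    x_draws : (T_mcmc, n) array of sampled state sequences, states in {0,...,N-1}
    x_true  : length-n true state sequence
    """
    # empirical marginal probability P-hat(x_t = k | y) for each k
    probs = np.stack([(x_draws == k).mean(axis=0) for k in range(N)], axis=1)
    x_hat = probs.argmax(axis=1)                     # pointwise most probable state
    hamming = np.mean(x_hat != np.asarray(x_true))   # normalised Hamming distance
    return x_hat, 1.0 - hamming                      # decoded states, accuracy
```

Note that the reported accuracy is only meaningful after relabelling, since label switching would otherwise scramble the state identities across draws.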

B.3 Setup and estimation for the Bayesian P-spline-based HMM
Motivated by Lang and Brezger (2004), we set up the spatially adaptive Bayesian P-spline model for the emissions as follows. We consider a second-order random walk prior for the reparameterized spline coefficients for state i, ã_i = (ã_{i,1}, . . . , ã_{i,K+4}):

ã_{i,j} | ã_{i,j−1}, ã_{i,j−2} ∼ N(2ã_{i,j−1} − ã_{i,j−2}, τ̃_i / λ_{i,j}), j = 3, . . . , K + 4,   (1)

where τ̃_i and λ_{i,j} may be understood as state-specific "global" and "local" smoothing parameters, respectively. We additionally assign noninformative normal priors N(0, 100²) to ã_{i,1} and ã_{i,2} to ensure that the joint prior for ã_i is proper. For λ_{i,j}, we follow Lang and Brezger (2004) in using independent Gamma priors, i.e. λ_{i,j} ∼ Gamma(α_λ/2, α_λ/2). For the global smoothing parameter, to address its impact on the smoothness of the spline fit, we adopt a robust specification via the hyperpriors of Jullion and Lambert (2007). In our implementation we use α_λ = 1 as in Lang and Brezger (2004) and we choose α_τ̃′ = β_τ̃′ = 10^−3 following the suggestion in Jullion and Lambert (2007). The choice of α_τ̃ is not influential and we set α_τ̃ = 1 as in Bremhorst and Lambert (2016) and Maturana-Russel and Meyer (2021). Details of the associated MCMC algorithm are given below.
Note that other versions of spatially adaptive Bayesian P-spline models may exist, but our goal here is to compare with the state-of-the-art approach commonly used in practice.
For step (e), note that the full conditional distribution of τ̃_i is available in closed form, and we therefore update τ̃_i with a Gibbs draw from it. Step (f) is also a Gibbs step: the full conditional distribution of τ̃′ is likewise available in closed form and we draw τ̃′ directly from it. For the same reason as stated before, MCMC inference is here subject to the label switching problem, which can be tackled using the Kullback-Leibler relabelling algorithm described above.

B.4 Setup and estimation for the GMM-based HMM
For the GMM-based HMM, the emission distributions are given by finite Gaussian mixtures,

f_i(y) = Σ_{j=1}^{K_i} w_{i,j} f_N(y | µ_{i,j}, σ²_{i,j}),

where f_N(· | µ, σ²) denotes the Gaussian density with mean µ and variance σ², w_{i,j} > 0 for j = 1, . . . , K_i with Σ_j w_{i,j} = 1 are the mixing weights for state i, and K = (K_1, . . . , K_N) specifies the number of Gaussian mixture components for each state. As for GMMs, estimation of the resulting GMM-based HMM can be performed in a maximum likelihood framework, where two key issues must be tackled. The first arises from the fact that the ordinary maximum likelihood estimate of the GMM is not well-defined, as the likelihood is unbounded when σ_{i,j} → 0 (Kiefer and Wolfowitz, 1956); a naive implementation may therefore lead to degenerate solutions. To overcome this issue we follow popular suggestions in the literature (see e.g. Ciuperca et al. (2003)) and introduce a penalty term into the log-likelihood function, which can be regarded as a Bayesian regularization approach, leading to a penalized log-likelihood, where f(y^(n) | ·) is given by (??) and p_λ(·) is a suitable penalty function with hyperparameters λ. Here, inspired by Ciuperca et al. (2003) and Baudry and Celeux (2015), we choose p_λ(·) to be based on the density of an inverse gamma distribution, and we set both the shape and scale parameters to 0.001, which coincides with the commonly used noninformative prior distribution for a Gaussian variance. Note that we used "non-informative" penalization parameters for the variance parameters to reflect the fact that, in reality, the true nature of the emissions would usually be unknown. The second key issue is to select the number of mixture components K_i, a long-standing problem for GMMs. When a GMM is used for density estimation, the BIC has been shown to be theoretically and practically adequate (Roeder and Wasserman, 1997). We thus follow Sun and Tony Cai (2009) and Shokoohi et al. (2019) in using the BIC to choose K from a pre-defined grid such that each K_i varies from 1 to 7; the "optimal" values selected by this procedure are reported in Table B.1. We note that this may not be the best approach but we found it worked well for our purpose.
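The inverse-gamma penalty term added to the log-likelihood can be sketched as below; the function name is ours, but the shape and scale of 0.001 follow the text. The penalty diverges to −∞ as any variance approaches zero, which is what rules out the degenerate solutions.

```python
import math

def ig_log_penalty(sigma2s, shape=0.001, scale=0.001):
    """Log inverse-gamma penalty, summed over all mixture variances sigma2s."""
    lognorm = shape * math.log(scale) - math.lgamma(shape)   # IG normalising constant
    return sum(lognorm - (shape + 1) * math.log(s2) - scale / s2
               for s2 in sigma2s)
```

In use, this term is simply added to the GMM-HMM log-likelihood before optimization; as σ² → 0 the −scale/σ² term dominates and heavily penalizes collapsing components.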

B.5 Results
To account for the variability in the simulated data sets, for each of the four simulation models 60 random replicates of the data were generated and the four methods were applied to each set of replicates. For our proposed algorithm the unspecified constants are set to a = min(y^(n)) − 10, b = max(y^(n)) + 10 and α = 0.65 in all scenarios (experiments suggest that our results are not very sensitive to these specific choices). For the fpSP method we select the number of equidistant knots K and the state-specific smoothing parameters as reported in Table B.2. Settings for the bpSP and GMM methods are as described in the previous sections. For the Bayesian MCMC methods (adSP and bpSP), we ran the corresponding MCMC samplers for a total of 120k iterations and discarded the first 60k iterations as burn-in, which was found to be sufficient to obtain reliable results in all four scenarios. Our results are then based on the retained post-burn-in samples. All computations were carried out on a High Performance Computing (HPC) system using Ice Lake CPUs.

Comparison with fixed N
Suppose that N is fixed at the true value. We compare the performance of the four competing methods across the four simulation models in Figure B.1; Figures B.2, B.3, B.4 and B.5 show the estimated emission densities obtained with each method for simulation models 1, 2, 3 and 4, respectively. For Model 1, the adSP method indicated 6 or 7 as the posterior mode for K with frequencies of 38.3% and 50%, respectively. In contrast, the bpSP and fpSP methods used considerably more parameters, with 31 basis elements for the estimation, yet the adSP method achieved comparable density fits in terms of the average KLD. The GMM approach achieved slightly better density fits, which is to be expected given that the true emissions are indeed Gaussian mixtures. In terms of decoding, adSP and bpSP demonstrate slightly better performance than the fpSP and GMM methods, with mean accuracies of 93.6%, 93.7%, 92.9% and 93.5%, respectively, averaged over the 60 repetitions. For Model 2, the adSP method suggested K = 9 as the posterior mode in 36/60 of the simulation runs, while, due to inefficient knot placement, the two P-spline-based approaches use a much larger K = 47. Despite this, the fpSP method gave generally poorer fits. The GMM method performed the worst in this scenario, struggling to estimate the emissions of states 1 and 3, which possess excessive skewness. For the remaining models, a parsimonious knot configuration was again suggested by the adSP method. The bpSP and GMM methods, however, failed to converge (or converged to sub-optimal solutions) in 25 and 28 of the repetitions, respectively. When considering only the cases where convergence was achieved, the bpSP, fpSP and GMM methods obtained slightly better density estimates than the adSP method. This result is understandable given the smooth nature of the true emissions, the much larger number of basis elements (K = 51) employed in the P-spline-based approaches, and the fact that the ground truth is a Gaussian mixture.
However, when looking at decoding accuracy, our proposed adSP method outperformed the frequentist counterparts with an average accuracy of 85.6%, compared to 83.7% and 84% for the fpSP and GMM methods, respectively.
The results again demonstrate the better decoding ability of adSP over its frequentist counterparts, despite potentially slightly worse density fits in some cases. Our adSP method also compares favourably with the Bayesian nonparametric approach of Yau et al. (2011) (cf. their Table 2, case T = 5000). However, we note that, in contrast to Yau et al. (2011), we did not need to assume a known translation structure of the emissions, as our adSP method was able to identify this from the data.
In the bottom panel of Figure B.1, we compare the computational time required for each method in each of the 4 scenarios. For MCMC-based methods, this includes the total sampling time, including the burn-in period, while for frequentist approaches, this includes the total time for performing the penalized likelihood optimization based on the 50 different initializations. We can see that within the spline-based methods, our adSP is the most efficient method (except for Model 1), whereas bpSP is the most computationally intensive.
Among the 4 methods, GMM required the least computing time, which is reflective of its simpler model structure. However, it is important to note that for fpSP and GMM, the time required for selecting the smoothing parameters or the number of mixture components is not taken into account here. Additionally, significant additional computational effort is still needed for uncertainty quantification of the parameters, especially for fpSP.

B.6 Performance details of the proposed RJMCMC algorithm
For the adaptive MCMC steps (steps (c)-(e) of Algorithm 1), the empirical acceptance rates are close to the pre-specified desired levels. We examined the trace plots and running averages of selected parameters, including the tuning variance parameters, across MCMC iterations and found acceptable mixing patterns in most cases. However, step (c) may mix more slowly than the other MH steps due to the relatively high dimensionality of the spline coefficient vector. For the dimension-changing moves, the averaged acceptance rates for models 1-4 are 0.27%, 0.15%, 0.55% and 0.41%, respectively. While these rates are lower than desired in our simulation cases, we did not detect any apparent convergence issues from our diagnostic trace plots (see Figure B.6). To further assess the convergence of the Markov chain, we performed 4 independent MCMC runs with different initializations for each of the 60 replicates generated for each of the 4 simulation models considered. We employed the Gelman-Rubin convergence diagnostic (Gelman and Rubin, 1992), with a focus on the parameters of the fixed-dimensional transition matrix. The obtained potential scale reduction factors (PSRF), summarized in Figure B.7, are generally very close to 1, supporting our visual diagnostics on the trace plots.
We investigated various modifications of the proposed algorithm. One possible modification is to replace the random walk MH with a Metropolis-adjusted Langevin algorithm (MALA) (Roberts and Tweedie, 1996) for updating the spline coefficients. MALA is a class of MH algorithms which exploits local gradient information of the target distribution to propose new states, making it highly effective for sampling complex distributions. In our context, the proposal takes the form

θ′ = θ + (h²/2) Σ ∇ log π(θ) + h Σ^{1/2} ξ, ξ ∼ N(0, I),

where h is a positive real number and Σ is a symmetric positive definite matrix. Both h and Σ are tuning parameters that need to be selected using pilot runs or tuned on-the-fly using adaptive MALA techniques; see, e.g., Atchadé (2006) and Christensen et al. (2005). However, the potential gain in sampling efficiency from using MALA comes at the cost of a much higher computational cost per iteration due to the need to evaluate gradients. In our simulation scenarios, we found the random walk MH to be sufficient.
However, in more challenging cases, one may consider combining MALA and random walk MH updates. For instance, MALA could be used for the first few thousand iterations to speed up convergence to stationarity, and random walk MH can then be used for the remaining iterations.
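One preconditioned MALA proposal of the standard form (drift along the gradient plus Gaussian noise, with h a step size and Σ a preconditioner) can be sketched as follows; the function signature and names are assumptions for illustration only.

```python
import numpy as np

def mala_propose(theta, grad_log_pi, h, L, rng):
    """One preconditioned MALA proposal.

    theta       : current parameter vector
    grad_log_pi : function returning the gradient of log pi at theta
    h           : positive step size
    L           : matrix square root of Sigma (e.g. Cholesky factor), Sigma = L L^T
    """
    Sigma = L @ L.T
    xi = rng.standard_normal(theta.shape)          # xi ~ N(0, I)
    # theta' = theta + (h^2/2) Sigma grad + h L xi
    return theta + 0.5 * h**2 * (Sigma @ grad_log_pi(theta)) + h * (L @ xi)
```

In a full sampler this proposal would be accepted or rejected with the usual MH correction involving the (asymmetric) proposal densities in both directions.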
For the reversible jump move, the acceptance rate may be mildly affected by the standard deviation τ of the truncated normal distribution used to generate the new knot or the proposal distribution for u i used in the birth move. In our experiments, with the functional form of τ fixed, the results are not very sensitive to the value of α, provided that it is chosen in a reasonable way. For instance we prefer to set 0 < α < 1 when the averaged distance between knots is much larger than 1 and vice versa. The use of other potential proposal distributions for u i within the Beta family was also investigated, but no clear evidence was found in terms of superiority of different choices over the noninformative choice U (0, 1). A potentially promising strategy is to generate the random variables u i used in equation (??) from a more informative proposal distribution such as a truncated normal distribution centred at its maximum likelihood estimate (approximated via some numerical optimization routines). However, the computational cost of the algorithm will increase dramatically, which may prevent a successful application of the algorithm in some settings. We also compared our algorithm with the one that integrates out the latent state sequence x (n) , using the forward algorithm in (??), and subsequently draws x (n) at each iteration using the FFBS algorithm, conditional on the simulated model parameters. Although this strategy yielded a slightly higher acceptance rate, which is not surprising given the reduced dimensionality of the parameter space, there is no noticeable gain in terms of overall computational cost or estimation accuracy. 
C Additional simulation studies

C.1 Systems with more complex transition dynamics

In reply to an anonymous referee, we modify Model 2 of Section B to further assess the robustness of our proposed method against misspecification of the transition model, i.e., when the true generating process of the latent state sequence {x_t} does not follow a homogeneous Markov chain as assumed before. We considered two possible deviations in this context, using the same set of emission distributions as in Model 2. In the first case, we allow x_t to follow a semi-Markov process in which the sojourn time in each state is explicitly modelled via a separate distribution, rather than the geometric distribution implied by the Markov assumption (Yu, 2010). More specifically, let x̃ = {x̃_k} denote the "super-states" defined on {1, . . . , N}, following a Markov process with transition matrix Ω = (Ω_{i,j})_{i,j=1,...,N}, where Ω_{i,j} = P(x̃_k = j | x̃_{k−1} = i) and Ω_{i,i} = 0, i = 1, . . . , N; therefore no consecutive realizations of x̃_k can be equal. Conditional on x̃_k, a dwell time d_k is generated according to a distribution g_{x̃_k} parametrized by some state-specific hyperparameters, indicating how long we stay in the current super-state x̃_k. Such runs of d_k repetitions of x̃_k, stacked together, form a semi-Markov chain {x_t}. The generative procedure of {x_t} may thus be summarized as: draw x̃_k from the distribution given by Ω_{x̃_{k−1}}, the row of Ω corresponding to the previous super-state; draw d_k ∼ g_{x̃_k}; and set the next d_k values of x_t equal to x̃_k. Here N = 3 and we set Ω_{1,2} = Ω_{2,1} = Ω_{1,3} = Ω_{3,1} = Ω_{2,3} = Ω_{3,2} = 0.5 and T to be the smallest value such that the sample size passes n = 1500.
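The semi-Markov generative scheme just described can be sketched as below. The shifted-Poisson dwell distribution and all names are our own stand-in assumptions (the paper only specifies that g is a state-specific dwell distribution); the zero-diagonal Ω and the stacking of dwell runs follow the text.

```python
import math
import random

def poisson_draw(rng, lam):
    """Poisson sample via Knuth's method (fine for small lam)."""
    L, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= L:
            return k
        k += 1

def simulate_semi_markov(Omega, dwell_means, n, seed=0):
    """Generate a length-n semi-Markov state sequence.

    Omega       : super-state transition matrix with zero diagonal
    dwell_means : mean dwell time per state (dwell ~ 1 + Poisson(mean - 1))
    """
    rng = random.Random(seed)
    N = len(Omega)
    x, super_state = [], rng.randrange(N)
    while len(x) < n:
        d = 1 + poisson_draw(rng, dwell_means[super_state] - 1)  # dwell time d_k >= 1
        x.extend([super_state] * d)                              # stack the run
        u, cum = rng.random(), 0.0                               # next super-state:
        for j in range(N):                                       # no self-transitions
            cum += Omega[super_state][j]
            if u < cum:
                super_state = j
                break
    return x[:n]
```

Because Ω has a zero diagonal, consecutive runs always belong to different states, and the dwell distribution rather than a geometric law governs the run lengths.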

C.2 Systems with a larger number of regimes
To show the feasibility of the extended algorithm with state-specific knots (as described in Section A.6) in situations with a relatively large number of states, as requested by an anonymous referee, we considered two additional simulation models with 5 and 6 hidden states. For the 5-state model, the emissions are specified as y_t | x_t = i ∼ N(µ_i, σ_i²) for i = 1, 3, 4, 5, and y_t | x_t = 2 ∼ 0.5 N(µ_21, σ_21²) + 0.5 N(µ_22, σ_22²). The parameter values are set as follows: µ_1 = −3, µ_21 = 0.5, µ_22 = 2, µ_3 = 4, µ_4 = 8, µ_5 = 11, σ_1 = 2, σ_21 = σ_22 = 0.5, σ_3 = 1.5, σ_4 = 3 and σ_5 = 1. The states were generated using a uniform initial distribution and the transition matrix. The 6-state model is derived from the 5-state model by splitting state 2, which has a Gaussian mixture emission, into two separate states; its transition matrix is specified analogously. For each model, we examined two different sample sizes, n = 3000 and n = 6000, and for each sample size we generated 60 random replicates of the data. Note that in both cases, respectively. This could be due to the simpler structure of the 6-state model compared to the 5-state model. Overall, the algorithm is computationally feasible in all scenarios.
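A data-generating sketch for the 5-state model is given below, using the emission parameters stated above. The transition matrix display is not reproduced in this extract, so a generic persistent matrix (stay with probability 0.9, otherwise move uniformly) is assumed purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(7)

# emission parameters of the 5-state model described in the text
mu = {1: -3.0, 3: 4.0, 4: 8.0, 5: 11.0}
sigma = {1: 2.0, 3: 1.5, 4: 3.0, 5: 1.0}
mix = dict(mu=(0.5, 2.0), sigma=(0.5, 0.5), w=(0.5, 0.5))  # state 2

def emit(state):
    """Draw one observation y_t given x_t = state."""
    if state == 2:  # two-component Gaussian mixture emission
        k = rng.choice(2, p=mix["w"])
        return rng.normal(mix["mu"][k], mix["sigma"][k])
    return rng.normal(mu[state], sigma[state])

# illustrative persistent transition matrix (the paper's matrix is not
# reproduced here): stay with prob. 0.9, else uniform over the others
N = 5
Gamma = np.full((N, N), 0.1 / 4) + (0.9 - 0.1 / 4) * np.eye(N)

# uniform initial distribution over the 5 states, then simulate n = 3000
x = [rng.integers(1, N + 1)]
for _ in range(2999):
    x.append(rng.choice(np.arange(1, N + 1), p=Gamma[x[-1] - 1]))
y = np.array([emit(s) for s in x])
```

The 6-state variant would simply replace the mixture emission of state 2 by two unimodal Gaussian states with the corresponding component parameters.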
Though not presented here, we also investigated the estimation performance of the proposed algorithm (using shared knots or state-specific knots) under other simulation settings in which the transition model is correctly specified or misspecified. Our preliminary findings indicate that the performance generally improves as the serial correlation and/or the sample size increase, while it declines as the number of states and/or the overlap of the emission distributions increase. These patterns are commonly observed in parametric HMM inference. It is also important to note that, as with any estimator, the performance of our method depends on the settings of the true simulation model. One can imagine that it would be very challenging, or even impossible, to accurately estimate the number of states and/or achieve highly precise parameter estimates when the data size is too "small" and/or the temporal dependency in the signal is weak. Obtaining an exact quantification of such dependency and estimation errors for the proposed method requires further theoretical research.

D Additional details for the shark activity data

Here we provide additional details for the 3- and 9-state HMMs for the shark activity data.
Posterior inference was performed based on 50k iterations of Algorithm 1 after a burn-in of 70k iterations, with a thinning interval of 10.
where the point estimates are the posterior means and the associated standard deviations are shown in brackets. In particular, the diagonal entries of Γ̂ are all close to 1, highlighting the serial dependence. The estimated state sequence (indicated by colours) obtained via local decoding is shown in the left panel of Figure D.1. Note that, as pointed out in Langrock et al. (2018), there can be a potential lack of fit if one models this type of data with common parametric HMMs. To illustrate this, we fitted a Gaussian HMM with N = 3 to the lODBA data using the standard MLE approach; the resulting estimates of the emissions are shown in the right panel of Figure D.1 and exhibit a certain level of underfitting (e.g. they fail to accurately capture the spikiness and the slight right tail of the emission of the middle state). The diagonal entries of the transition matrix were estimated as γ̂_{1,1} = 0.941, γ̂_{2,2} = 0.952 and γ̂_{3,3} = 0.915, suggesting that the state persistence was underestimated in all three states, especially for state 3. Using our model selection method, we found that the data support a model with N = 9 states, which points to the rich information contained in this high-resolution signal and may not be surprising given the expected complexity of shark behaviour in reality (Bres, 1993). We obtained posterior estimates for this model by generating 50k samples after discarding the initial 200k iterations as burn-in, with a thinning interval of 10. To assess convergence, we estimated the PSRF for the parameters in the transition matrix by executing 4 independent MCMC runs with different initializations. Among the 81 parameters, the PSRF ranges from 1 to 1.02, indicating satisfactory convergence. Figure D.2 displays the estimated emission densities (left panel) and the corresponding decoded time series (right panel).
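The PSRF diagnostic used above can be computed with a few lines of code. This is the standard Gelman-Rubin statistic for one scalar parameter across independent chains; it is a generic sketch, not the paper's implementation.

```python
import numpy as np

def psrf(chains):
    """Potential scale reduction factor (Gelman-Rubin) for one scalar
    parameter, given an (m, n) array of m independent chains of length n."""
    chains = np.asarray(chains, dtype=float)
    m, n = chains.shape
    means = chains.mean(axis=1)
    B = n * means.var(ddof=1)               # between-chain variance
    W = chains.var(axis=1, ddof=1).mean()   # within-chain variance
    var_plus = (n - 1) / n * W + B / n      # pooled variance estimate
    return np.sqrt(var_plus / W)

rng = np.random.default_rng(3)
# four well-mixed chains targeting the same distribution -> PSRF near 1
good = rng.normal(0.0, 1.0, size=(4, 5000))
r_good = psrf(good)
# chains stuck at different levels -> PSRF far above 1
bad = good + np.arange(4)[:, None] * 3.0
r_bad = psrf(bad)
```

Values close to 1 (as in the range 1 to 1.02 reported above) indicate that the between-chain variability is negligible relative to the within-chain variability.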
We can see that the estimated hidden states roughly correspond to 9 different levels of activity, which, interestingly, resolves the multimodality seen in the 3-state model (e.g. state 1) into a mixture of unimodal emission densities. The posterior means of the transition probabilities, Γ̂_9, show that almost all of the estimated states are persistent, in the sense that there is usually a large probability of staying in the current state (see the diagonal entries of Γ̂_9). In comparison to the 3-state model, the diagonal entries are naturally lower, as the shark's movement is now subdivided into more states. The fitted model with 9 states indicates, for example, that states 1-3, 4-7 and 8-9 (or 7-9) tend to have a higher probability of transitioning within their group of states than to other states. Furthermore, these three "clusters" roughly correspond to the lower, middle and higher activity modes of the 3-state model. To assign biologically meaningful interpretations to these states, however, additional ecological information and investigation would be required, and we have not pursued this further here. Nevertheless, our analysis demonstrates the ability of our method to deal with model selection in an HMM with a relatively large number of states, while Langrock et al.'s approach can be expected to be severely challenged with increasing N. Our flexible HMM approach could also be used in an explorative way to identify a suitable parametric model. We can see from Figure D.2 that most of the emission densities of the 9-state model have a relatively regular shape.
An interesting further question would be whether the 3-state spline model can be replaced by a 9-state parametric approach, assuming for instance a mixture of Gaussian emission densities, which can capture a higher degree of multimodality in the marginal distribution.

E Additional details for the conditional HMM analysis

In this section we present additional implementation details and estimation results for applying the proposed conditional HMM approach to the example subjects.
E.1 Further details of the prior setting and the MCMC algorithm

The priors for parameters in the sub-HMM are specified as follows. We used the same priors for the number of knot points and the spline coefficients as described in Section ??
of the main paper. For the knot location vector, R_K, we used the same uniform prior as in Section ??, but with an additional constraint ensuring that the minimum distance between two adjacent knots is greater than a threshold, set to 0.6 here. This constraint is introduced for the sub-HMM because it directly models the 30-s PA counts data, helping to prevent numerical issues that may arise from the data's high level of discreteness. For the transition matrix of the sub-HMM, no conjugate prior is available, as the associated hidden state process is not simulated, and an MH update is required instead. Following the reparametrization scheme used in Section ??, we reparameterize each row of the transition probability matrix as γ_{i,j} = γ̃_{i,j} / Σ_{l=1}^{N} γ̃_{i,l}, with γ̃_{i,j} > 0, and place a vague gamma prior on the γ̃_{i,j}, i.e. f(γ̃_{i,j}) = Gamma(1, 1), resulting in a Dir(1, . . . , 1) distribution on (γ_{i,1}, . . . , γ_{i,N}).
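The normalised-gamma construction above can be verified empirically: independent Gamma(1, 1) draws, normalised row-wise, follow a Dirichlet(1, . . . , 1) distribution, i.e. the uniform distribution on the simplex. A quick Monte Carlo check:

```python
import numpy as np

rng = np.random.default_rng(11)
N = 3          # number of states (row length)
M = 200_000    # number of simulated rows

# gamma_tilde_{i,j} ~ Gamma(1, 1) independently, then normalise each
# row: this induces a Dirichlet(1, ..., 1) distribution on the row
g_tilde = rng.gamma(shape=1.0, scale=1.0, size=(M, N))
rows = g_tilde / g_tilde.sum(axis=1, keepdims=True)

# Dirichlet(1,...,1) moments: E[gamma_{i,j}] = 1/N and
# Var[gamma_{i,j}] = (N - 1) / (N^2 * (N + 1))
emp_mean = rows.mean(axis=0)
emp_var = rows.var(axis=0)
```

The empirical moments match 1/N and (N − 1)/(N²(N + 1)) closely, confirming that the vague gamma prior on the unnormalised parameters is equivalent to the flat Dirichlet prior on each row of the transition matrix.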
For the main-HMM, we use the same prior settings for the transition matrix and spline parameters as in Section ??, and the same prior specification for the weight parameters involved in the emissions as for the sub-HMM.
Posterior inference for the main- and sub-HMMs can be achieved by sequentially sampling from f(θ, x^(n) | y^(n)) and f(θ_S | y^(n)), where θ represents the parameter set of the main-HMM with the additional weight parameters included. The MCMC methodology described in Section ?? of the main paper (Algorithm 1) can be used to simulate from f(θ, x^(n) | y^(n)), with an additional MH step to update the state-specific weight parameters; the details of this update are given below. For the sub-HMM, we simulate from f(θ_S | y^(n)) according to (??) in Section ?? of the main paper. Conditional on each realisation of x^(n) (obtained from the posterior simulation for the main-HMM), we simulate from f(θ_S | x^(n), y^(n)) by drawing a sample for θ_S from the joint density defined in (2), using essentially the same sampling scheme as for the main-HMM, where the RJMCMC updates are run for several iterations and the last sample is kept. Tuning of the MH scaling parameters was achieved via a separate pilot run using the same adaptive procedure as described in Section A, where the conditioning variable x^(n) may be fixed at a specific realisation from its posterior distribution or at the local decoding result obtained from the simulation output for the main-HMM. To perform posterior simulation of the hidden state process associated with the sub-HMM for a given segment (or segments) of the time series (e.g. data points that were assigned to state 1 by the local decoding result of the main-HMM), a standard FFBS algorithm can be used, conditioning on the sampled θ_S. During the forward procedure, data points that were assigned to states 2 or 3 by the main-HMM are treated as "missing".
Mathematically, this is done by replacing the emission densities associated with those time points by the constant one. In the backward simulation procedure, the whole sub-state sequence associated with the given series is simulated; however, only the sub-states corresponding to the "non-missing" data points (those assigned to state 1 of the main-HMM) are of interest.
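The missing-data device above amounts to setting the log-emission to zero at the "missing" time points in the forward pass, so those points contribute no information about the sub-state while the transition structure is preserved. A minimal sketch, with illustrative names:

```python
import numpy as np

def forward_with_missing(log_emission, assigned, Gamma, delta):
    """Forward pass in which time points with assigned == False are
    treated as missing: their emission density is set to the constant
    one (log-density zero), so they carry no information on the state."""
    n, N = log_emission.shape
    log_em = np.where(assigned[:, None], log_emission, 0.0)
    log_alpha = np.zeros((n, N))
    log_alpha[0] = np.log(delta) + log_em[0]
    for t in range(1, n):
        m = log_alpha[t - 1].max()
        log_alpha[t] = m + np.log(
            np.exp(log_alpha[t - 1] - m) @ Gamma) + log_em[t]
    return log_alpha

rng = np.random.default_rng(5)
n, N = 200, 2
Gamma = np.array([[0.9, 0.1], [0.1, 0.9]])
log_em = rng.normal(size=(n, N))  # placeholder log-emission values
# pretend the middle of the series was decoded to another main-state
assigned = np.ones(n, dtype=bool)
assigned[80:120] = False
la = forward_with_missing(log_em, assigned, Gamma, np.array([0.5, 0.5]))
```

Backward simulation then proceeds exactly as in the standard FFBS, and the sub-states drawn at the masked positions are simply discarded.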
To update the state-specific weight parameters in the zero-inflated emission distributions, we use a log-normal random walk for the reparameterized weights w̃_{i,j}, i = 1, . . . , N, j = 1, 2: we propose w̃′_{i,j} = w̃_{i,j} exp(ϕ_{i,j}), where ϕ_{i,j} ∼ N(0, τ_w²). The acceptance probabilities of this move for the main- and sub-HMM take the form min{1, ·}, where W̃′ denotes the vector of proposed w̃′_{i,j}, and θ′ and θ′_S denote the correspondingly updated parameter sets for the main- and sub-HMM, respectively. This scheme can easily be adjusted for cases where multiple point masses are used, as in our MESA application.
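A single log-normal random-walk MH update can be sketched as follows. Because the proposal is asymmetric, the Hastings correction q(w | w′)/q(w′ | w) = w′/w must enter the acceptance ratio; the Gamma(2, 1) target below is purely illustrative, standing in for the conditional posterior of a weight parameter.

```python
import numpy as np

def lognormal_rw_step(w, log_target, tau_w, rng):
    """One MH update of a positive parameter w via a log-normal random
    walk: w' = w * exp(phi), phi ~ N(0, tau_w^2).  The asymmetric
    proposal contributes the Jacobian/Hastings factor w'/w."""
    phi = rng.normal(0.0, tau_w)
    w_new = w * np.exp(phi)
    log_ratio = log_target(w_new) - log_target(w) + np.log(w_new / w)
    if np.log(rng.uniform()) < log_ratio:
        return w_new
    return w

# illustrative target: unnormalised log-density of Gamma(2, 1) on w > 0
log_target = lambda w: np.log(w) - w

rng = np.random.default_rng(2)
w, draws = 1.0, []
for _ in range(20000):
    w = lognormal_rw_step(w, log_target, 0.5, rng)
    draws.append(w)
draws = np.array(draws)
```

The chain stays on the positive half-line by construction, which is exactly why the log-normal walk is convenient for weight parameters.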

E.2 Further details of the MESA application
Here we present additional implementation details and estimation results for applying the proposed conditional HMM approach to the example subjects. For the main-HMM, we pre-processed the data by averaging the PA counts over 5-min non-overlapping intervals, followed by the log-transformation log(1 + PA), to handle the high variability observed in the MESA data. For the emission model, we included two point masses, one at 0 and another at log(1.1) (the second smallest possible value for the 5-min averaged PA after the log-transformation), to handle the high occurrence of both values.
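The pre-processing step can be sketched as below (the helper name is illustrative). Each 5-min window contains ten 30-s counts, so a window with a single count of 1 yields an average of 1/10 and hence the value log(1.1) after the transform, matching the second point mass above.

```python
import numpy as np

def preprocess_pa(counts_30s):
    """Average 30-s PA counts over non-overlapping 5-min windows
    (10 values each) and apply the log(1 + PA) transform."""
    counts_30s = np.asarray(counts_30s, dtype=float)
    n = len(counts_30s) - len(counts_30s) % 10   # drop incomplete window
    pa_5min = counts_30s[:n].reshape(-1, 10).mean(axis=1)
    return np.log1p(pa_5min)

rng = np.random.default_rng(4)
# synthetic integer counts with many exact zeros, mimicking sedentary
# periods (illustrative data, not MESA)
counts = rng.poisson(30, size=1000) * rng.binomial(1, 0.6, size=1000)
y = preprocess_pa(counts)

# point masses of the emission model: an all-zero window maps to 0,
# and a window with a single count of 1 maps to log(1 + 1/10) = log(1.1)
```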
For the algorithm's constants, we chose a = 0.1, b = max(log(1 + PA)) + 3 and α = 2. The main-HMM results are based on 25k iterations of the proposed algorithm, after discarding the first 50k as burn-in. To assess MCMC convergence, we estimated the PSRF by running 4 independent chains, focusing on the transition matrix as before.
The obtained PSRF values were close to one for both subjects, ranging within (1, 1.04) for subject 921 (subject A) and within (1, 1.02) for subject 3439 (subject B). Table E.1 shows the posterior means for the state-specific weights of the point masses at 0 (w_{i,1}) and log(1.1) (w_{i,2}).
Inference for the sub-HMM is based on the raw 30-s PA data. To specify the emissions for the sub-states, we introduced point masses at the four lowest values, including 0, which exhibit a high frequency of occurrence. We set α = 0.65, a = 4.5 and b equal to 10 plus the maximum of the 30-s PA data corresponding to state 1 of the main-HMM. We ran our proposed algorithm for 25k updates (based on the 25k posterior samples of x^(n) obtained from the main-HMM analysis), discarding the first 15k as burn-in. The obtained PSRF values were 1.11 (for γ_{1.1,1.1}) and 1.03 (for γ_{1.2,1.2}) for subject A, and 1.04 (for γ_{1.1,1.1}) and 1.004 (for γ_{1.2,1.2}) for subject B. The estimated emission densities (for the continuous part) for the two subjects are displayed in the right panel of Figure