Bayesian Wavelet Shrinkage of the Haar-Fisz Transformed Wavelet Periodogram

It is increasingly being realised that many real world time series are not stationary and exhibit evolving second-order autocovariance or spectral structure. This article introduces a Bayesian approach for modelling the evolving wavelet spectrum of a locally stationary wavelet time series. Our new method works by combining the advantages of a Haar-Fisz transformed spectrum with a simple, but powerful, Bayesian wavelet shrinkage method. Our new method produces excellent and stable spectral estimates and this is demonstrated via simulated data and on differenced infant electrocardiogram data. A major additional benefit of the Bayesian paradigm is that we obtain rigorous and useful credible intervals of the evolving spectral structure. We show how the Bayesian credible intervals provide extra insight into the infant electrocardiogram data.


Introduction
For a real-life time series it is sometimes difficult to determine whether the underlying process is really stationary using only observations from a section of the process. Often, the spectral behaviour of a real-life time series can change from one time point to another and nonstationarity may only become apparent with continued observation. If we disregard the stationarity assumption, there are an abundance of different models that can be considered. One class of nonstationary models, which we consider here, are the locally stationary processes with slowly evolving second-order structure. Two prominent sub-classes are the locally stationary (Fourier) processes due to [1] and the locally stationary wavelet (LSW) processes due to [2]. However, nonstationary Fourier processes have a long history see, e.g. [3][4][5]. Reviews can be found in [6] and [7]. The second-order structure of a time series can be assessed via the (auto-)covariance or spectrum, and accurate specification and estimation of these quantities is particularly important to improve our understanding of the data.
This article assumes that a time series can be modelled by a LSW process and considers the estimation of the associated evolutionary wavelet spectrum (EWS). As is the case for stationary spectral estimation obvious 'raw' estimators are not statistically consistent and require smoothing. For example, [2] introduced a kind of 'method of moments' spectral estimator and used wavelet shrinkage to smooth it and [8] used kernel smoothing to produce estimates for forecasting. See also [9] who introduce a pointwise estimator. [10] introduced a powerful new approach based on Haar-Fisz transformation of the raw wavelet periodogram and essentially using universal thresholding [11] on the Haar-Fisz coefficients.
This article builds on the [10] work by using Bayesian wavelet shrinkage to bear on the Haar-Fisz coefficients and does so for two reasons. First, recent Bayesian wavelet shrinkage techniques based on the Berger-Müller prior and empirical marginal maximum likelihood determination, such as [12], show dramatic performance improvements over earlier concepts such as universal thresholding. The Bayesian approach uses priors well-adapted to the known mathematical theory underlying wavelet coefficients of a wide class of functions from Besov scales. Secondly, the coherent Bayesian approach permits rational and effective quantification of credible intervals for the EWS. Our simulation results and results on real data show good performance and new insights.
Section 2 reviews the locally stationary wavelet model and the associated evolutionary wavelet spectrum and the wavelet periodogram. Section 3 briefly reviews the Haar-Fisz transformation at establishes notation for subsequent Bayesian wavelet shrinkage. Section 4.2 first reviews wavelet shrinkage and Bayesian wavelet shrinkage and then describes each of the components of our Bayesian wavelet shrinkage method adapted for the Haar-Fisz-transformed spectral coefficients. Section 5 outlines some implementation issues, presents a simulation and analyses an infant electrocardiogram (ECG) data set and compares it to earlier analyses. Finally, section 6 concludes and provides some ideas for further developments.

Locally Stationary Wavelet Processes
Locally stationary wavelet (LSW) processes were introduced by [2], and extended to encompass a larger range of processes in [9] which we use here. As in [2] assume that the wavelets used are [13] compactly supported, and that the length of the support for any wavelet ψ j,0 is equal to L j : = jsupp(ψ j,0 )j. Therefore, if we have J scales, where 1 is the finest scale, then jsupp(ψ j, k )j = L j = (2 j − 1)(L 1 − 1) + 1 8 j ! 1. Here N is the set of natural numbers {1, 2, 3, . . .}.
Definition 1 (The Locally Stationary Wavelet Process) A LSW process is a sequence of doubly indexed stochastic processes, {X t, T } t = 0, . . ., T−1 , where T = 2 J for some J 2 N. This process has the representation X t;T ¼ where c ðsÞ j;kÀt is a discrete non-decimated family of wavelets for scale j 2 N, location k 2 Z based on a mother wavelet, ψ(t), of compact support, which we shall refer to as the synthesis wavelet; and ξ j, k is a Gaussian random zero mean orthonormal increments sequence. The component w j, k:T ξ j, k can be thought of as a random amplitude of the oscillation c ðsÞ j;kÀt . The quantities in Eq (1) possesses the following properties: c. For each j 2 N there exists a function W j (z) for z 2 (0,1), that possesses the following properties i.
jW j ðzÞj 2 < C uniformly in z 2 ð0; 1Þ: ii. There exists a sequence of constants C j such that for each T sup k jw j;k;T À W j ðzÞj C j T : iii. The total variation (TV) of W 2 j ðzÞ is bounded by a constant L j , that is iv. The constants C j and L j satisfy The time evolution of LSW processes is governed by the time-scale varying evolutionary wavelet spectrum which we define next.

Evolutionary Wavelet Spectrum and its Estimation
The evolutionary wavelet spectrum (EWS) measures the 'contribution to the variance' of X t, T at scale level j 2 N and location z 2 (0,1) and is defined as follows.
Definition 2 (Evolutionary Wavelet Spectrum) The EWS is defined by S j ðzÞ ¼ jW j ðzÞj 2 8 j 2 N and z 2 ð0; 1Þ: Estimation of the EWS can be achieved by first computing the raw wavelet periodogram, defined as follows.
In theory, the analysis wavelet from Eq (3) is the same as the synthesis wavelet in Eq (1). However, often in practice the synthesis wavelet is unknown. For the purposes of our analysis we shall assume the synthesis wavelet is known and equivalent to the analysis wavelet. The raw wavelet periodogram, I j, k , is a biased estimator of the EWS, but can be made asymptotically unbiased after simple correction which we will explain next. To proceed with this, the autocorrelation wavelet (ACW) is defined as follows.
Definition 4 (Discrete Autocorrelation Wavelet) The ACW at scale j 2 N at lag τ 2 Z is defined by The discrete ACW determines the autocorrelation of a wavelet at a particular scale, j and different lags, τ. The discrete ACW provides a family of symmetric, compactly supported, positive semi-definite functions on τ 2 Z. Further theoretical details can be found in [2] and [14]. To form an asymptotically unbiased estimator of the spectrum we require the inner product matrix of the ACW defined as follows.
Definition 5 (The Inner Product Matrix) The operator A = (A j, l ) j, l ! 0 is defined by and the J-dimensional matrix is A J = (A j, l ) j, l = 1, . . ., J . Then using definitions 1 and 5, proposition 3.3 of [2] shows that for j 2 N, k 2 Z, where A is calculated using the chosen analysis wavelet and the variance of the wavelet periodogram is given by This result implies that as the sample size increases (T ! 1) the variance does not vanish. [2] show that the obvious asymptotically unbiased estimator A À1 J I k for {S j (k/T)} where I k = (I 1,k , . . ., I J, k ) is not statistically consistent. As is typical in spectral analysis in time series the periodogram needs to be smoothed to obtain consistency.

Wavelet Periodogram Smoothing
Various techniques have already been developed to smooth the wavelet periodogram, such as those by [2,9,10]. [9] is theoretically attractive but practically challenging.
In [2] each level, j, of the raw wavelet periodogram is smoothed as a function of z using translation-invariant (TI) de-noising [15]. Non-linear wavelet shrinkage is performed on the approximately w 2 1 distributed raw wavelet periodogram then bias corrected by the inner product matrix (A −1 ). An appropriate threshold for the shrinkage was determined in [2]. The technique raises a number of questions, such as what is an appropriate wavelet? [2] believe that smoother wavelets, such as Daubechies extremal phase with 10 vanishing moments, help to avoid 'leakage' of power into the surrounding scales because of their short support in the Fourier domain. They also produce less spiky and variable estimates in their example.
[10] suggested applying the soft shrinkage rule upon the Haar-Fisz coefficients of the raw wavelet periodogram, using a scale dependent threshold. The methodology produced an estimator which was mean-square consistent, rapidly computable, easy to implement and performs well in practice. However, the theoretical validation of this technique was restricted to locally stationary processes with a time-varying, but piecewise constant form.
The Haar-Fisz transform in [10] is very attractive producing transformed periodogram ordinates that are very close to being uncorrelated and Gaussian. We apply Bayesian wavelet shrinkage to this enticing situation and not having to worry about first order estimation error in the variance.

Spectral Normalisation using the Haar-Fisz Transform
The Haar-Fisz transformation works by normalising the wavelet coefficients of a signal to obtain elements that are close to Gaussian and have near-constant variance. We adapt the definition from [10], Section 6, which applies the Haar-Fisz transform to the raw wavelet periodogram I j, k as follows. To prevent unnecessary notational overload we will temporarily drop the j subscript and write I m for I j, m . The next algorithm is applied to each scale j of the periodogram separately. In other words, we have transformed the input vector fI m g 2 J m¼1 into the Haar-Fisz output vector fH m g 2 J m¼1 . Now we re-introduce the j subscript as this Haar-Fisz processing is replicated at each scale, and let F denote the non-linear invertible Haar-Fisz operator, hence H j, k = FI j, k .

Brief Review of Wavelet Shrinkage
Wavelet shrinkage is a form of nonparametric regression introduced in a series of seminal articles such as [11,16]. See [17] or [18] for more details and further references. Suppose we have a set of noisy observations, y = (y 1 , . . ., y n ) of an unknown function f(x), taken at regularly spaced locations, denoted by x = (x 1 , . . ., x n ). In our context, we can use the well-known additive signal-plus-noise model for each scale-level, j, in Eq (8): where e = (e 1 , . . ., e n ) are random variables which are usually assumed to be iid with zero mean and some variance σ 2 . The aim is to devise an estimatorf ðxÞ to recover the signal f (also known as B) from the noisy observations y i (H). Wavelet shrinkage is very simple and the estimator can be obtained by the following three steps.
1. Apply the discrete wavelet transformation (DWT) to noisy data y, giving where d = W y, β = Wf(x), ε = W e and W is the orthogonal DWT matrix for a particular smoothing wavelet (SW). The vector β are considered to be the 'true' wavelet coefficients, d are the noisy empirical wavelet coefficients.
2. Apply a shrinkage method and threshold (such as hard shrinkage and the universal threshold) to the noisy coefficients, d, to obtain estimates,β, of the wavelet coefficients β.
3. Apply the inverse DWT to the estimated coefficientsβ to obtain an estimate,f ðxÞ, of the underlying function f(x) at the data points x.
To enable us to obtain good estimates with a sound basis for obtaining credible intervals we adopt a Bayesian wavelet shrinkage approach as described next.

Bayesian Wavelet Shrinkage
Bayesian statistical methods start with existing prior knowledge of model parameters (β), which are updated using the data (y) to give posterior knowledge. The resulting posterior knowledge can be used to interpret these parameters. The model commonly used for Bayesian inference is where p(yjβ) is the likelihood, p(β) is the prior density function and p(βjy) is the posterior density function of β given y. Credible intervals can be obtained from the upper and lower tail quantiles of the posterior distribution. Adopting a Bayesian approach for wavelet shrinkage has become increasingly popular for wavelet denoising due to its excellent theoretical and practical properties, see [19,20,21,22,23] and [12], for example. Bayesian wavelet shrinkage has also been used for stationary spectral estimation in [24] and for credible intervals for regression by [25,26] and [27]. The usual procedure is to place a prior distribution on the wavelet coefficients, use the Bayesian paradigm specified by Eq (9) with the necessary components specified as follows to enable us to derive a closed-form expression for the posterior means and variance. For parts of our specification below we shall use the empirical Bayes approach from [12].

Regression Model
We shall apply Bayesian wavelet shrinkage to the Haar-Fisz transformed wavelet periodogram, H. Taking the DWT of Eq (8), for a particular scale j, we obtain where h l, m = (WH j ) l, m , β l, m = (WB j ) l, m , ε l, m = (We j ) l, m for scales l = 0, . . ., J − 1 and locations m = 1, . . ., 2 l , and W is the T × T orthogonal DWT matrix associated with some [13] compactly supported wavelet. Due to the orthogonality of the wavelet transformation and the approximate error structure of the e j, k noted above, the distribution of the wavelet-transformed error is approximately ε l;m $ N ð0; n 2 l Þ, where n 2 l ¼ 2 JÀl s 2 j . For notational clarity we shall cease mention of the scale index j. However, it should be remembered we are applying Bayesian wavelet shrinkage scale-by-scale j to Eq (8).

Prior
We propose using the Berger-Müller mixture prior for β ℓ,m where ξ τ (β) = τξ(τβ), δ 0 (x) is the Dirac-delta function at zero, α l is the prior probability that the wavelet coefficient is zero, τ l is the prior precision and ξ is the distribution of a non-zero wavelet coefficient. [12] recommended using a heavy-tailed distribution, such as the Laplace distribution, to model this parameter and we use this here. Therefore pðb l;m Þ ¼ a l d 0 ðb l;m Þ þ 1 2 ð1 À a l Þ t l exp fÀt l jb l;m jg; ð12Þ where τ l is the prior precision and 2t À2 l is the prior variance for scale l = 1, . . ., J.

Hyperparameter Determination
As in [12] we use marginal maximum likelihood estimation (MMLE) to determine the hyperparameters: prior probability and precision (α l , τ l ), and error variance ν l . To do this, we maximize the hyperparameters over the log-likelihood of the error distribution multiplied by the prior, Lða l ; t l ; n l ; jh l Þ ¼ X 2 l À1 m¼0 log fa l n l ðh l;m Þ þ ð1 À a l Þgðh l;m jn l ; t l Þg; ð13Þ where gðyjn l ; t l Þ ¼ Z 1 À1 n l ðy À xÞ x t l ðxÞdx: The maximum log-likelihood can not be obtained analytically and required numerical maximisation.

Likelihood
Due to Property 6.1 (4) the Haar-Fisz transformation bestows approximate/asymptotic Gaussianity upon the data. Hence, we assume a likelihood of the form pðh l;m jb l;m Þ ¼ n l ðb l;m À h l;m Þ ¼ n À1 l ð2pÞ À1=2 exp À 1 2n 2 where ϕ ν l (Á) is the the probability density function of the Gaussian distribution with variance n 2 l , which we shall assume is equal to the error variance.

Posterior Distribution
For credible intervals we require the posterior variance which can be calculated via the integral Var½b l;m jh l;m ¼ E½b 2 l;m jh l;m À ðE½b l;m jh l;m Þ 2 ¼ R x 2 x t l ðxÞ n l ðx À h l;m Þdx y l n l ðh l;m Þ þ R x t l ðyÞ n l ðy À h l;m Þdy Àb 2 l;m : To simplify notation define x i x t l ðxÞ n l ðx À h l;m Þdx; Lemma 1 The quantities Q i (h) for the Laplace mixture prior in Eq (12) are given by Proof. in the appendix.
Proof. Substitute the formula (19) into Eqs (17) and (18). The next result gives us the necessary log-likelihood function of our Bayesian model from Eq (13) for the Laplace mixture prior.
Lemma 2 The log-likelihood function for the Laplace mixture prior is where ϕ ν (Á) is the zero mean Gaussian pdf with variance ν 2 , F(Á) is the Gaussian cdf, m 3 ¼ y þ n 2 l t l and m 4 ¼ y À n 2 l t l . Proof. The proof uses the same methods as for the proof of Lemma 1.

Implementation Issues
We determine the hyperparameters via MMLE of Eq (13) using the function optim in R which uses the L-BFGS-B method from [28]. Empirical investigations revealed that with four coarsest scales, l = 0,1,2,3, as they consist of 1,2,4 and 8 wavelet coefficients (respectively), numerically maximising the log-likelihood for each scale resulted in strongly biased hyperparameter estimates. Therefore, instead of maximising the log-likelihood for the four coarsest scales separately, the coefficients were grouped together and maximisation was performed over all the four scales. To distinguish between scales, the hyperparameter estimates were scaled appropriately, such that as the scale decreased α l decreased and τ l increased by a factor of two.
Ultimately, we are seeking an estimate of the posterior (mean and) variance of B(z). Formula (21) gives us an estimate of the posterior variance of β l, m the wavelet coefficients of B. We could use the approximate method of [25] to obtain the posterior variance of B(z). This works well for Haar wavelets (where the square of the wavelet ψ 2 (z) is equal to the father wavelet) but less accurate for non-Haar wavelets. Hence, we adopt the following simple sampling strategy to obtain posterior credible intervals for B(z).
We simulate S realisations for a complete set of wavelet coefficients {β l, m } from the posterior distributions given by Eq (16). Each realisation of wavelet coefficients is then subjected to the inverse wavelet transform which provides a posterior realisation of the B = {B(z 1 ), . . ., B(z n )}. We then use the sample mean and variance of the B(z i ) to provide the 'estimate' and credible intervals.

Simulation
To test the performance of our method we simulated 200 realisations, fX t g 1023 t¼0 , from the EWS in Fig 2 with Gaussian innovations as shown in Fig 3. The EWS was designed to encapsulate a time series with slowly varying power at a middle scale along with a burst of power at the finest scale.
These simulations were executed using the aforementioned wavethresh [29] package. First, we create a blank spectral object using the cns() function and then using the inserter function putD() we installed the sinusoidal spectral energy at level four and the small block at the finest scale. The realizations can then be generated by executing the LSWsim() function with the specified spectrum as an argument.
For each realization we produced a Bayesian Haar-Fisz and translation-invariant (TI) denoised estimator using the Daubechies extremal phase (EP) with 1 − 10 vanishing moments, and Daubechies least asymmetric (LA) with 4 − 10 vanishing moments smoothing wavelets. The TI estimator was described in Section 2.2. The average mean squared error (AMSE) were calculated using Haar-Fisz estimator with twenty cycle spins to remove any features of the wavelet alignment which might unduly influence our estimator. See [15] for further details on cycle spinning.
We calculated the mean EP smoothing wavelet estimate for each of the 200 processes, then calculated AMSE for both methods. The overall AMSE for the TI De-noising estimators was 0.192 and for Bayesian Haar-Fisz estimators 0.131. Table 1 shows the AMSE for each estimator and choice of smoothing wavelet. The EP 1 corresponds to the Haar wavelet, which gives the poorest estimator for Haar-Fisz and second poorest for TI-D, this is only the best wavelet to use if the underlying structure of the EWS for each scale is known to be piecewise constant. We found that both methods seemed fairly robust to the choice of wavelet, as the difference between the AMSE appeared to be fairly small. Although we noticed the AMSE of the TI de-noising estimator decreased as the support of the wavelet increased, which was not the case for the Bayesian Haar-Fisz estimator. However, the Bayesian Haar-Fisz estimator consistently out performed the TI de-noising estimator and also with a much smaller variability (as indicated by the mean absolute deviation figures).
We compared the best TI de-noising estimator [2, SW = EP 10 ], as shown in Fig 4 to our best estimator using Bayesian modelling of the Haar-Fisz periodogram (SW = LA 6 ), see Fig 5, as determined by the results in Table 1.
Comparing the plots in Figs 4 and 5, we can see that the Bayesian Haar-Fisz estimator is less susceptible to Gibbs-type phenomena, but the leakage of power in neighbouring scales appeared to be fairly comparable for both estimators. Some of the power from scale j = 6 has leaked into j = 5,7, which has made recovery of the true underlying signal difficult.     judge our method to be comparable to the TI-denoising away from z = 0.6 and considerably better near to z = 0.6. (This is because the red and blue lines both roughly match the solid line truth, but the blue is much better near to z = 0.6 where the TI-D (red) suffers from extreme variability).
A key advantage of our new methodology is the ability to easily generate credible intervals which are shown by grey-scale in Figs 6-9. For example, even though the estimator for S 3 (z) appears to be non-zero in Fig 8, the 50% credible intervals completely contain zero which

ECG Example
To test our methods further, we consider the study of infant sleep [30]. Five mothers and their healthy first-born infants slept in a sleep laboratory designed to be similar to a normal domestic bedroom once a month for the first five months. The rooms were thermally controlled and all infants slept supine in a cot besides their mother, who were free to care for their infants as they would at home (e.g. feed, change nappy, etc). Most studies commenced around 8-9pm and finished around 8-9am the next morning.  Amongst the measurements taken of each infant was their heart rate via ECG (electro-cardiogram) monitors, their brain waves via a EEG (electro-encephalogram) sensor and eye movements using a EOG (electro-oculogram) sensor. The infant's sleep state was then determined through manual analysis where a trained observer visually interprets the EEG and EOG at predetermined time periods, which can be time consuming and laborious. Four sleep states were recorded: AWAKE, ACTIVE SLEEP, BETWEEN and QUIET SLEEP. For simplicity, we have combined the latter three states into ASLEEP.
This data is freely available as part of the wavethresh [29] package for R [31] in the data sets BabyECG and BabySS.   indicates that when the infant is awake there is a larger variance in the infant's heart rate compared to the two different sleep stages, for which quiet sleep appears to possess the smallest variance. We have produced an estimate of the EWS for the differenced ECG data to establish whether we could use the second order structure of the data to determine the infant's sleep state. The plot in Fig 11 implies the majority of the power of the spectrum is present at the finest scale. There appears to be some difficulty in discerning the infant's sleep state when it changes quickly (such as between location z 2 [0.2,0.4]). As with earlier analyses, such as that in [18], there appears to be a link between active sleep and higher power at the finest scale.  However, our new analysis reveals much more: that there is more uncertainty associated with the higher power estimates and more certainty when the power is lower. The arrangement of the posterior mean estimate relative to the 50/90% credible intervals indicates skew in the posterior distribution which is especially noticeable around the peak near 0.65. A blow-up of the finest scale power is shown in Fig 12.

Conclusion and Further Work
This article combines the Haar-Fisz transform with Bayesian wavelet shrinkage to obtain a new method for modelling the evolutionary wavelet spectrum of a locally stationary wavelet process. Bayesian wavelet shrinkage is known and powerful technique and well-established for noisy data contaminated by uncorrelated Gaussian noise which the Haar-Fisz transform  approximately, but effectively, provides. Although there are competing methods for spectral estimation there are, as far as we know, no methods for generating credible intervals for evolving spectra certainly in the wavelet case. Our Bayesian wavelet shrinkage gives a rational method for assessing uncertainty in this case providing us with approximate credible intervals.
Further work to improve our method would be to improve our method of determining hyperparameters and also investigate its application to irregularly spaced time series. Another interesting possibility is to apply Bayesian wavelet shrinkage to Haar-Fisz transformed spectra in the stationary or locally stationary Fourier case.  where ϕ ν (Á) is the zero mean Gaussian pdf with variance ν 2 , m 1 ¼ h l;m þ n 2 l t l and m 2 ¼ h l;m À n 2 l t l . Formula (23) is obtained by substituting in the formula for the Laplace density in Eq (19) and splitting the integral into two parts on the negative and positive domains. Then, on each integral, the exp(−τ l jxj) term is merged with the exponential in the normal density and then the square completed for each term.
Finally, to obtain the quoted formulae in Lemma (1) use the following properties of the Gaussian distribution: