Data-driven wavelet-Fisz methodology for nonparametric function estimation

We propose a wavelet-based technique for the nonparametric estimation of functions contaminated with noise whose mean and variance are linked via a possibly unknown variance function. Our method, termed the data-driven wavelet-Fisz technique, consists of estimating the variance function via a Nadaraya-Watson estimator, and then performing a wavelet thresholding procedure which uses the estimated variance function and local means of the data to set the thresholds at a suitable level. We demonstrate the mean-square near-optimality of our wavelet estimator over the usual range of Besov classes. To achieve this, we establish an exponential inequality for the Nadaraya-Watson variance function estimator. We discuss various implementation issues concerning our wavelet estimator, and demonstrate its good practical performance. We also show how it leads to a new wavelet-domain data-driven variance-stabilising transform. Our estimator can be applied to a variety of problems, including the estimation of volatilities, spectral densities and Poisson intensities, as well as to a range of problems in which the distribution of the noise is unknown.


Introduction
A paradigmatic problem in nonparametric regression is the estimation of a one-dimensional function α : [0, 1] → R from noisy observations X_t taken on an equispaced grid:

X_t = α(t/n) + ε_t, t = 1, . . . , n, (1)

where the ε_t's are random variables with E(ε_t) = 0. Various subclasses of the problem can be identified, depending on the smoothness properties of α and the joint distribution of (ε_t)_{t=1}^n. Since the seminal work of Donoho and Johnstone (1994), estimation techniques based on non-linear wavelet shrinkage have become a commonly used tool in nonparametric function estimation, extensively studied in the statistical literature. Many of them combine excellent finite-sample performance, linear computational complexity, and optimal (or near-optimal) asymptotic Mean-Square Error behaviour over a variety of function smoothness classes, including functions of a low degree of regularity. This often puts them at an advantage compared to linear estimation techniques (such as kernel smoothing) in setups where the function α has discontinuities or exhibits an otherwise irregular behaviour. A comprehensive overview of wavelet methods in statistics can be found, for example, in Vidakovic (1999).
The main idea underlying most wavelet techniques is that upon transforming the original regression problem (1) via a "multiscale" orthonormal linear transform W called the Discrete Wavelet Transform (DWT), the following regression formulation is obtained:

Y_{j,k} = µ_{j,k} + ε_{j,k}, j = 0, . . . , log₂ n − 1, k = 1, . . . , 2^j (and k = 1 for j = −1),

where j and k are (respectively) scale and location parameters, Y_{j,k} are the empirical wavelet coefficients of X_t, µ_{j,k} are the true wavelet coefficients of α(t/n) which need to be estimated, and ε_{j,k} are the wavelet coefficients of ε_t. The sequence µ_{j,k} is often sparse, with most µ_{j,k}'s being equal or close to zero, which motivates the use of simple thresholding techniques that estimate µ_{j,k} by zero unless the corresponding Y_{j,k} exceeds a certain threshold in absolute value. This ensures that a large proportion of the noise ε_{j,k} gets removed. The inverse DWT then yields an estimate of the original function α.
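To make the thresholding paradigm concrete, here is a minimal NumPy sketch of the orthonormal Haar DWT together with Donoho-Johnstone universal hard thresholding for i.i.d. Gaussian noise with known standard deviation σ. The function names, and the restriction to Haar wavelets, are our illustrative choices and not part of the paper's specification:

```python
import numpy as np

def haar_dwt(x):
    """Orthonormal Haar DWT of a length-2^J signal; returns the detail
    coefficients per scale plus the final smooth coefficient."""
    details = []
    s = np.asarray(x, dtype=float)
    while len(s) > 1:
        details.append((s[0::2] - s[1::2]) / np.sqrt(2.0))  # detail coefficients
        s = (s[0::2] + s[1::2]) / np.sqrt(2.0)              # smooth coefficients
    return details, s  # details[0] is the finest scale

def haar_idwt(details, smooth):
    """Inverse of haar_dwt."""
    s = smooth.copy()
    for d in reversed(details):
        out = np.empty(2 * len(s))
        out[0::2] = (s + d) / np.sqrt(2.0)
        out[1::2] = (s - d) / np.sqrt(2.0)
        s = out
    return s

def universal_hard_threshold(x, sigma):
    """Universal hard thresholding: kill every detail coefficient whose
    magnitude is below sigma * sqrt(2 log n)."""
    lam = sigma * np.sqrt(2.0 * np.log(len(x)))
    details, smooth = haar_dwt(x)
    details = [d * (np.abs(d) > lam) for d in details]
    return haar_idwt(details, smooth)
```

Since the transform is orthonormal, the round trip `haar_idwt(*haar_dwt(x))` recovers `x` exactly, and thresholding a noise-free constant signal leaves it unchanged.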
The overwhelming majority of wavelet-based estimation techniques, such as those proposed by Donoho and Johnstone (1995), Johnstone and Silverman (2005a) or Abramovich et al. (2006), to name but a few, were devised under the assumption that the errors (ε_t)_{t=1}^n formed an independent, identically distributed Gaussian sequence. This was partly due to the fact that, in view of the orthonormality of W, the "wavelet noise" ε_{j,k} was then also i.i.d. Gaussian, which facilitated both the choice of thresholds and the theoretical analysis of the resulting estimators. Johnstone and Silverman (1997) proposed an extension of the wavelet thresholding paradigm to stationary Gaussian noise.
In practice, the assumption of Gaussianity is violated in many important estimation problems. We list and discuss a selection of them below.
• Poisson intensity estimation. In Poisson intensity estimation, X_t are modelled as independent Pois{α(t/n)} variables, which implies that ε_t are centered Poisson. The mean and variance of X_t are linked via the relationship Var(X_t) = h{E(X_t)} with h(u) = u. This is in contrast to the i.i.d. Gaussian model, in which h(u) = const. Recent examples of wavelet-based (or otherwise multiscale) Poisson intensity estimation techniques include the Bayesian methods of Kolaczyk (1999) and Timmermann and Nowak (1999), the multiscale likelihood technique of Kolaczyk and Nowak (2004), the Haar-Fisz method of Fryzlewicz and Nason (2004) and the extension of the latter proposed by Jansen (2006). Some of the above, as well as some other, techniques are reviewed in Besbeas et al. (2004). The work by Sardy et al. (2004), amongst other contributions, proposes an automatic smoothing procedure for Poisson data.
• Nonparametric volatility estimation. Nonparametric volatility estimation techniques are widely used in the finance industry (for example by RiskMetrics, as detailed in their "Technical Document" available from http://www.riskmetrics.com/pdf/td4e.pdf). In this set-up, the X_t's represent squared log-returns on a financial instrument and are modelled as independent and distributed as X_t = α(t/n) Z²_t, where E(Z²_t) = 1. Note that ε_t = α(t/n)(Z²_t − 1). Thus, the model is multiplicative and the variance function h(u) is proportional to u². A multiscale Haar-Fisz technique for nonparametric volatility estimation was proposed by .
• Spectral density estimation. In spectral density estimation based on the periodogram, the X_t's represent periodogram ordinates and are assumed to be asymptotically independent and asymptotically distributed as α(t/n) Z²_t, where α(t/n) represents the spectral density at frequency t/n, and the Z²_t are Exp(1) random variables. This again renders the set-up multiplicative and, asymptotically, the variance function takes the form h(u) = u². Recent wavelet and multiscale approaches to periodogram smoothing include Moulin (1994), Neumann (1996), Gao (1997a), Pensky et al. (2007) and Fryzlewicz et al. (2008).
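As a concrete illustration of these mean-variance links, the following short simulation (with a constant mean level chosen purely for illustration, not taken from the paper) checks h(u) = u for Poisson data and h(u) = u² for the multiplicative exponential model:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000
alpha = 5.0  # a constant mean level, chosen only for illustration

# Poisson model: Var(X_t) = E(X_t), i.e. h(u) = u
x_pois = rng.poisson(alpha, size=n)

# Multiplicative model X_t = alpha * Z_t^2 with Z_t^2 ~ Exp(1), as in the
# spectral density example: E(X_t) = alpha, Var(X_t) = alpha^2, so h(u) = u^2
x_mult = alpha * rng.exponential(1.0, size=n)

# empirically, the sample variance tracks h(sample mean) in both models
```

With n this large, the sample variance of `x_pois` is close to 5 (= its mean) while that of `x_mult` is close to 25 (= its mean squared), reproducing the two variance functions named above.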
In the above examples, the variance function h(u) is assumed to be known (as is the case in the works cited above), and all of the multiscale approaches listed above, in one way or another, make use of its exact form in order to set the threshold at the "right" level: the threshold value usually depends on Var(ε_{j,k}), which involves h. However, in many estimation problems modelled by (1), it is clear that there exists a non-trivial mean-variance relationship, but its exact form is unknown. Thus, it is not a priori clear what threshold values to use. For example, in gene expression data observed in microarray experiments, Rocke and Durbin (2001) identified that the variance of raw pixel intensities increased with their mean. An interesting mean-variance relationship arising in solar irradiance data was described in Fryzlewicz et al. (2007). Also, even if the variance function is "assumed to be known", the model which implies its particular form might have been chosen incorrectly. Thus, even in such a case it may often be safer to infer the form of the variance function from the data in order to set the thresholds at a suitable level.
A seemingly attractive solution to the problem of the unknown variance is to apply a data-driven variance stabilisation technique prior to smoothing the transformed data by means of a wavelet-based technique suitable for homoscedastic noise. Examples of data-driven variance-stabilising transforms include the ACE method of Breiman and Friedman (1985), the AVAS technique of Tibshirani (1988) as well as the method of Linton et al. (1997). In the context of one-colour microarray data, we mention the generalised log transformation proposed by Rocke and Durbin (2001) and the Spread-Versus-Level technique of Archer (2004). As these transforms operate on the original data, we refer to them as "time-domain" transforms below.
Despite its appealing modularity, a three-stage estimation procedure which involves a time-domain data-driven variance-stabilising transform, followed by wavelet smoothing of the transformed data and finally the inverse transform, is not always advisable. Firstly, the performance of time-domain data-driven variance stabilisation procedures is often less than satisfactory, as illustrated in Fryzlewicz et al. (2007). Secondly, due to the random and often highly nonlinear character of such transforms, theoretical properties of the resulting three-stage estimator may not be easy to establish.
Another possible route to follow is to treat the estimation problem (1) as an instance of the problem of estimating a function contaminated with locally stationary noise. Solutions to the latter were proposed, amongst others, by Gao (1997b) and von Sachs and MacGibbon (2000) (see also the references therein). However, adapted to our setting, these approaches would mean ignoring the fact that the local mean and variance were linked via a variance function h, and simply pre-estimating the evolution of the variance of X t over time. Thus, these methods can potentially be suboptimal in our context, as they do not take advantage of all available information.
In this paper, we propose an alternative approach to the problem of estimating α when h exists but is unknown, which consists of first estimating the variance function h, and then constructing a wavelet thresholding estimator of α which makes use of the estimate of h. Our method, termed the data-driven wavelet-Fisz estimation technique, overcomes the drawbacks mentioned above in that (a) it performs well in practice, (b) it only requires that the variance function h, and not the target function α, be "smooth", and (c) the theoretical performance of the final estimator of α can be quantified and is near-optimal. In addition, our estimator of α is rapidly computable and easy to implement. A simple modification of it can also be used in cases in which the variance function h is known. Thus, it is applicable to a wide range of problems, including all of the examples mentioned above. An added benefit of our approach is that, as well as estimating α, it also returns an estimate of h, which may be of interest to the analyst.
Our estimator of h is a simple Nadaraya-Watson estimator, and is inspired by (but simpler than) the variance function estimator used by Chiou and Müller (1999) in a different context. The reason for preferring simplicity here is that in order to derive theoretical properties of the resulting estimator of α, we need to establish an exponential inequality for the estimator of h. The latter piece of theory forms a large part of this work, and we hope that it may be of independent interest. We note that a large deviation theory for a class of Nadaraya-Watson estimators was obtained by Louani (1999) and Joutard (2006). However, it was done in a simple nonparametric regression set-up with independent errors, which is not applicable in the context of variance function estimation. We also note that the theoretical part of our work does not address the automatic choice of the smoothing parameter occurring in the estimator of h. However, the simulations section provides detailed practical recommendations as to the choice of this parameter. This selection is also performed automatically in the software package which accompanies this paper. Finally, we note that what we mean by the term "variance function estimation" is different from the use of the same term by some other authors, for example Cai and Wang (2007) and Wang et al. (2008), who consider the estimation of the variance function as a function of time, not as a function of the mean. However, a similar element in our work and the above-cited articles is the fact that in both estimation problems, a preliminary estimator of the mean is used to estimate the variance function.
It is interesting to note that the algorithm for computing our estimator of α can be decomposed into three separate stages, the first and last of which is a particular data-driven variance-stabilising transform, and its inverse, respectively. In the rare cases when the exact standard deviation of the noise is known, the simplest time-domain variance stabilisation procedure consists of using this quantity to pre-divide the original regression set-up coordinate-wise, as discussed by Gao (1997b). Unlike this and other existing transforms, some of which are listed above, our variance-stabilising transform is performed in the wavelet domain, as opposed to the time domain. Roughly speaking, the transform consists of dividing each empirical wavelet coefficient Y j,k by an estimate of its own standard deviation, the latter involving the estimate of h. In a nonwavelet context, similar ratio statistics (for a known function h) were studied by Fisz (1955), which justifies the name of our procedure.
We also mention that this paper was inspired by our earlier work Motakis et al. (2006) and Fryzlewicz et al. (2007), where we proposed computational procedures related to that described here. However, they were not accompanied by any theoretical analysis, partly because any such analysis appeared challenging due to the level of complexity of the proposed algorithms. Indeed, one of the aims of this paper is to provide a procedure which is simple enough to be theoretically tractable, but also performs well in practice.
The paper is organised as follows. In Section 2, we describe our model and estimation problem. In Section 3, we introduce and analyse our wavelet-Fisz technique in the case when the variance function h is known. Section 4 describes the Nadaraya-Watson estimator of h and considers its theoretical properties. In Section 5, we show how the estimator of h from Section 4 is used in our data-driven wavelet-Fisz estimator of α, and establish the mean-square near-optimality of the latter. In Sections 6 and 7, we discuss various implementation aspects of our estimators of h and α, respectively. Section 8 demonstrates how our wavelet estimator leads to a new data-driven variance-stabilising transform performed in the wavelet domain. Section 9 concludes, and the proofs of our results are in three appendices. R code implementing our method has been made available on http://www.maths.bris.ac.uk/∼mapzf/ddwf/ddwf.html

Model and preliminaries
We consider the regression model (1), where X_1, . . . , X_n are assumed to be nonnegative and independent, with E(X_t) = α(t/n) and Var(X_t) = Var(ε_t) = h{α(t/n)}. Our task is to estimate α nonparametrically via nonlinear wavelet shrinkage, assuming that the function h is not necessarily known.
We place the following assumption on the function α.
Assumption 1. For the function α(z) : [0, 1] → R, we denote α̲ = inf_z α(z) and ᾱ = sup_z α(z). We assume that (i) α is of finite total variation over [0, 1], and (ii) α̲ > 0.
Assumption 1(i) is a mild smoothness assumption on α. Since the class of bounded-variation functions also includes functions of a low degree of regularity, the choice of nonlinear wavelet shrinkage as the preferred estimation method appears natural in this context.
As mentioned earlier, our estimator of α can be viewed as using the principle of variance stabilisation (in the wavelet domain). Many time-domain variance-stabilising transforms, such as the square-root transform for Poisson data or the log transform for scaled χ² data, would require that the function α be bounded from below, as specified in Assumption 1(ii). Therefore, it is not surprising that we also require this assumption to hold.
Further, we impose the following assumption on h.
The class of distributions with a variance function h satisfying Assumption 2 includes, amongst others, the Poisson distribution, for which h(u) = u, and multiplicative distributions of the form X_t = α(t/n) Z²_t discussed in the Introduction, for which h(u) ∝ u². Finally, we make the following assumption about the central moments of ε_t.
Assumption 3. We assume that there exists a positive constant K and a nonnegative constant γ such that for l = 3, 4, . . . and all t.
Assumption 3 is natural and common in the context of wavelet estimation in non-Gaussian noise, see for example Neumann (1996). It is satisfied by many standard distributions, including, amongst others, Poisson and gamma. Roughly speaking, it ensures that local sums of X t are asymptotically normal in a certain asymptotic regime and in a certain required "strong" sense.

Wavelet-Fisz estimation for h known
In this section, we aim to estimate α using nonlinear wavelet shrinkage assuming that the variance function h is known. Throughout the paper, we assume basic familiarity with the Discrete Wavelet Transform (DWT). We refer the reader to Mallat (1989) for a description of the DWT, and to Vidakovic (1999) for an excellent overview of wavelet methods in statistics.
We now describe, step by step, our algorithm for computing the wavelet-Fisz estimator of α if the function h is known.
1. The starting point is the DWT of the observed data {X_t}_{t=1}^n with respect to an orthonormal basis of compactly supported wavelets. Later, in Assumption 4, we will specify additional technical conditions on the wavelets. The DWT converts the regression problem (1) into a regression problem in the wavelet domain,

Y_{j,k} = µ_{j,k} + ε_{j,k}, j = 0, . . . , J − 1, k = 1, . . . , 2^j,

where J = log₂ n, with the only "smooth" coefficient indexed by (j, k) = (−1, 1). The variables Y_{j,k} are the empirical wavelet coefficients of X_t, the constants µ_{j,k} are the wavelet coefficients of α(t/n) which need to be estimated, and the "wavelet noise" variables ε_{j,k} are the wavelet coefficients of ε_t. The sequence µ_{j,k} will often be sparse, with most µ_{j,k}'s being equal or close to zero.

2. We then separate the indices (j, k) into two groups: those corresponding to the coarser scales 0 ≤ j ≤ J* − 1, for which ε_{j,k} will be asymptotically normal, and those corresponding to the finer scales J* ≤ j ≤ J − 1, which will be "ignored" in the estimation procedure. To be more precise, we define J* and a set I_n by 2^{J*} = n^{1−ǫ} and I_n = {(j, k) : 0 ≤ j ≤ J* − 1, 1 ≤ k ≤ 2^j}, for a fixed ǫ ∈ (0, 1/3]. The choice of 1/3 as the upper bound for ǫ is linked to the postulated smoothness of α. The reader is referred to Section 3.3 of Fryzlewicz et al. (2008) for a more detailed discussion of this issue.

3. As the sequence µ_{j,k} is likely to be sparse, with most µ_{j,k}'s being equal or close to zero, we use a simple thresholding technique, which estimates µ_{j,k} by zero unless the corresponding Y_{j,k} exceeds a certain threshold in absolute value. This ensures that a large proportion of the noise ε_{j,k} gets removed.

In wavelet function estimation with Gaussian errors, possibly the simplest ("universal") threshold, advocated by Donoho and Johnstone (1994), takes the form λ = σ{2 log(#A)}^{1/2}, where σ is the noise standard deviation and #A is the cardinality of the set A of coefficients being thresholded. Since in the set I_n our wavelet coefficients ε_{j,k} are asymptotically Gaussian, we wish to explore the possibility of applying an analogous threshold in our set-up. To effect this idea, we need to determine Var(ε_{j,k}). Denoting by {ψ_{j,τ}}_τ the elements of the discrete wavelet vector at scale j, we find

Var(ε_{j,k}) = Σ_t ψ²_{j,k−t} h{α(t/n)}.    (4)

Obviously, α(t/n) is unknown and needs to be pre-estimated. For simplicity and speed of computation, we use the same pre-estimate for each t in supp(ψ_{j,k−·}), namely the following localised mean of X_t:

Σ_t κ_{j,k−t} X_t,    (5)

where the constants κ_{j,τ} satisfy Assumption 5 below. As the discrete wavelet vectors are normalised so that Σ_t ψ²_{j,k−t} = 1, we obtain our "estimated" thresholds as

λ̂_{j,k} = {2 log(#I_n)}^{1/2} h^{1/2}(Σ_t κ_{j,k−t} X_t).    (6)

(As a side remark, we note that in the absence of an assumed mean-variance relationship, Gao (1997b) estimates thresholds as proportional to the "running median absolute deviation" estimate of the standard deviation of the heteroscedastic noise.) We use our estimated thresholds to estimate µ_{j,k} in the set I_n via either the soft or the hard thresholding rule:

µ̂^(s)_{j,k} = sgn(Y_{j,k}) (|Y_{j,k}| − λ̂_{j,k})_+,    (7)
µ̂^(h)_{j,k} = Y_{j,k} I(|Y_{j,k}| > λ̂_{j,k}),    (8)

where I(·) is the indicator function. Outside the set I_n, we simply estimate µ_{j,k} by zero, that is, µ̂^(e)_{j,k} = 0 for (j, k) ∉ I_n.

4. Let (e) denote either one of: (s) or (h). The inverse DWT of the sequence µ̂^(e)_{j,k} yields our wavelet-Fisz estimator α̂^(e).

A few remarks are in order.
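The known-h algorithm of this section can be sketched as follows for Haar wavelets. The function name `wavelet_fisz_haar`, the default choice J* = J − 2 and the use of the running Haar means as the local pre-estimates are our illustrative choices; the variance function h is passed in as a known callable:

```python
import numpy as np

def wavelet_fisz_haar(x, h, j_star=None):
    """Hard-thresholding wavelet-Fisz sketch with a KNOWN variance
    function h: each Haar detail coefficient is compared against a
    threshold proportional to h^{1/2} of the local mean of the data."""
    n = len(x)
    J = int(round(np.log2(n)))
    if j_star is None:
        j_star = J - 2                  # keep scales 0..J*-1, zero the rest
    details, local_means = [], []
    s = np.asarray(x, dtype=float)
    while len(s) > 1:
        details.append((s[0::2] - s[1::2]) / np.sqrt(2.0))
        s = (s[0::2] + s[1::2]) / np.sqrt(2.0)
        # local mean of x over the support of each Haar coefficient
        local_means.append(s / np.sqrt(2.0) ** len(details))
    lam0 = np.sqrt(2.0 * np.log(2 ** j_star))   # {2 log(#I_n)}^{1/2}
    for i, d in enumerate(details):
        j = J - 1 - i                   # scale index of details[i]
        if j >= j_star:
            details[i] = np.zeros_like(d)        # fine scales are ignored
        else:
            lam = lam0 * np.sqrt(h(local_means[i]))
            details[i] = d * (np.abs(d) > lam)   # hard thresholding
    for i in range(len(details) - 1, -1, -1):    # inverse Haar transform
        out = np.empty(2 * len(s))
        out[0::2] = (s + details[i]) / np.sqrt(2.0)
        out[1::2] = (s - details[i]) / np.sqrt(2.0)
        s = out
    return s
```

For Poisson data one passes `h = lambda u: u`; applied to a constant-intensity Poisson sample, the estimate collapses to (almost) a constant, as nearly all detail coefficients fall below their local thresholds.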
Stability. Let stdev(X) denote the standard deviation of a random variable X. Looking back at the derivation in formula (4), another "obvious" way of estimating stdev(ε_{j,k}) would be to set it equal to {Σ_t ψ²_{j,k−t} h(X_t)}^{1/2}. However, comparing this to our estimator h^{1/2}(Σ_t κ_{j,k−t} X_t) from formula (6), it is easily seen that the latter typically involves lower powers of X_t, and is thus potentially a more "stable" statistic. As an example, consider h(u) = u².
In this case, the former estimator is a localised ℓ₂ norm of X_t, whereas ours is a localised ℓ₁ norm. A similar comment applies whenever h(u) ∼ u^β with β > 1.
Link to maximum likelihood estimation. If α(t/n) is constant over the support of κ j,k−· , then, by the invariance principle of maximum likelihood estimators, our estimator stdev(ε j,k ) is precisely the maximum likelihood estimator of stdev(ε j,k ), provided that κ j,k−t = const for t ∈ supp(κ j,k−· ).
The name "wavelet-Fisz". Note that the argument of the indicator function in (8) can be rewritten as In a non-wavelet context, ratio transformations similar to that on the left-hand side of (9) and the asymptotic normality of the resulting random variables were studied by Fisz (1955), which justifies the name of our procedure. The division in (9) provides a degree of "variance stabilisation": note that the threshold {2 log(#I n )} 1/2 is suitable for standard homoscedastic normal noise. In this sense, our procedure can be viewed as being based on the principle of variance stabilisation in the wavelet domain. We expand on this issue later in Section 8.
Link to Fryzlewicz et al. (2008). We note that in the special case h(u) = u 2 , our estimation algorithm is equivalent to the method proposed by Fryzlewicz et al. (2008) in the context of spectral density estimation.
We now establish the mean-square convergence rate of our wavelet-Fisz estimatorα (e) . In order to do so, we specify assumptions on the wavelets ψ j,k and the constants κ j,k .
Note that each of the vectors κ j can be interpreted as a linear filter which computes the local mean of X t over the support of the vector ψ j in (5).
As has now become standard in the wavelet literature, we assume that the unknown function α is in a Besov ball of radius C > 0 on [0, 1], F^m = F^m_{p,q}(C), where m > 0 and 0 < p, q ≤ ∞. Roughly speaking, the not necessarily integer parameter m indicates the number of derivatives of α, where their existence is required in the L_p sense, and thus p can be viewed as a measure of the inhomogeneity of α. The additional parameter q provides a further, finer gradation. Besov classes include the traditional Hölder and Sobolev classes (p = q = ∞ and p = q = 2, respectively). If the regularity r of the wavelet basis is greater than m and if the other conditions of Assumption 4 hold, then the membership of α in F^m can be characterised in terms of the wavelet coefficients µ′_{j,k} = µ_{j,k} n^{−1/2} of the function α, via a Besov sequence ball of radius C. The reader is referred to Meyer (1992) for rigorous definitions and a detailed study of Besov spaces. We are now ready to state a result on the mean-square rate of convergence of our wavelet-Fisz estimator α̂^(e).
Theorem 1. Let (e) denote either one of: (s) or (h). Suppose that Assumptions 1, 2, 3, 4 and 5 hold. Then the mean-square error of α̂^(e) over F^m is of order O(n^{−2m/(2m+1)}), up to a logarithmic factor.

The rate O(n^{−2m/(2m+1)}) is the best possible mean-square error rate for Besov spaces, and our wavelet-Fisz estimator achieves it up to a logarithmic term, attaining the same rate as the universal thresholding estimator in the case of i.i.d. Gaussian noise. We mention that linear estimators, such as kernel estimators, cannot attain the optimal mean-square error rate (not even up to a logarithmic factor) for p < 2.

Estimation of the variance function h
In this section, we assume that the function h is unknown, and we propose to estimate it by means of a Nadaraya-Watson estimatorĥ. Later, in Section 5,ĥ will be used in the data-driven wavelet-Fisz estimator of α. In order to establish the mean-square convergence of the latter estimator, we need to be able to determine large deviation properties ofĥ. Indeed, the main aim of this section is to demonstrate an exponential inequality forĥ (which will be sufficient for our purposes).
The estimator ĥ is constructed as follows. We start with a preliminary estimator of α(t/n), denoted by α̂_t and defined as a local average of the observations over a window of M points around t, where the choice of the span M will be discussed in the paragraph underneath Theorem 2, as well as in Section 6. We define empirical residuals from this fit by ε̂_t = X_t − α̂_t. Our Nadaraya-Watson estimator ĥ performs kernel smoothing of the squared empirical residuals ε̂²_t. More specifically, we use a kernel function K which satisfies the following assumption.
We define the kernel weights

Ŵ_{nt}(u) = K{(α̂_t − u)/b},    (11)

where the choice of the bandwidth b will also be discussed in the paragraph underneath Theorem 2, as well as in Section 6. The Nadaraya-Watson estimator ĥ is defined by

ĥ(u) = Σ_{t=1}^n Ŵ_{nt}(u) ε̂²_t / Σ_{t=1}^n Ŵ_{nt}(u).

We now list and clarify a number of assumptions which will be used in proving the main result of this section.
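The construction just described can be sketched as follows. The Gaussian kernel, the running-mean pre-estimate of α, the finite-sample correction factor M/(M − 1) (which makes the squared residuals unbiased for the variance when α is locally constant) and the rule-of-thumb bandwidth are our illustrative choices, not the paper's exact specification:

```python
import numpy as np

def estimate_h(x, M=3, b=None, grid=None):
    """Nadaraya-Watson sketch of the variance function estimator:
    (1) pre-estimate the mean by a running mean over ~M points,
    (2) form (bias-corrected) squared residuals,
    (3) kernel-smooth the squared residuals against the pre-estimated
        mean.  Requires M >= 2 because of the correction factor."""
    x = np.asarray(x, dtype=float)
    # step 1: running-mean pre-estimate of alpha(t/n)
    alpha_hat = np.convolve(x, np.ones(M) / M, mode="same")
    # step 2: squared residuals; M/(M-1) undoes the variance absorbed
    # by including x_t in its own running mean (exact for constant alpha)
    res2 = (x - alpha_hat) ** 2 * M / (M - 1)
    # step 3: Nadaraya-Watson smoothing of res2 against alpha_hat,
    # here with a Gaussian kernel (an illustrative choice)
    if b is None:
        b = 0.5 * np.std(alpha_hat)     # crude rule-of-thumb bandwidth
    if grid is None:
        grid = np.linspace(alpha_hat.min(), alpha_hat.max(), 50)
    w = np.exp(-0.5 * ((alpha_hat[None, :] - grid[:, None]) / b) ** 2)
    h_hat = (w * res2[None, :]).sum(axis=1) / w.sum(axis=1)
    return grid, h_hat
```

On piecewise-constant Poisson data the estimate recovers h(u) ≈ u: evaluating it near the two intensity levels of a two-level signal returns values close to those levels.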
Assumption 7. We have Var(ε²_t) ≥ c > 0, uniformly over t.
Assumption 8. Denote Z_t = |α̂_t − E(α̂_t)|. We assume that there exists a positive constant C₂ such that Var(α̂_t) ≤ C₂ Var(Z_t), uniformly over t.
Assumption 10. There exists a function c(u) such that uniformly over n and b, for all u ∈ range{α(z)}.
Assumption 7 ensures that ε²_t is a non-degenerate random variable. For any random variable Y, it is easy to see that Var|Y| ≤ Var(Y). Assumption 8 guarantees that the converse is true, up to a constant. This is not a restrictive assumption: we expect that α̂_t − E(α̂_t) will be close to N(0, σ²), and for Y ∼ N(0, σ²) we have Var(Y) = {π/(π − 2)} Var|Y|. Assumption 9 is satisfied by all distributions whose tails decay exponentially. We now demonstrate that Assumption 10 is satisfied for functions α(z) which are piecewise Lipschitz-continuous of order 1 with a finite number of breakpoints. For clarity, we only show it for the triangular kernel K(z) = (−4|z| + 2) I(|z| ≤ 1/2), but the proof remains almost unchanged for other kernels.
We note that Assumption 10 can be relaxed to include functions which are piecewise Hölder continuous, at the expense of worse rates of tail decay in Theorem 2. As an interesting example, we note that Donoho and Johnstone's (1994) benchmark signals blocks, bumps, doppler and heavisine are all piecewise Lipschitz-continuous of order 1.
Assumption 11 ensures that "local averages" of the function α lie within the range of α.
We now state an exponential inequality for the estimatorĥ(u), which is the main result of this section.
Theorem 2. Suppose that Assumptions 1-3 and 6-11 hold, and that the constants κ_{j,τ} satisfy Assumption 5. Let b_n, d_n and i_n be any fixed sequences such that b_n = o{(n/M)^{1/(6+4γ)}}, d_n = o{min_j 2^{(J−j)/(2(1+max(γ,1)))}} and i_n = o{(nb²)^{1/(10+12γ)}}, where M, b and γ are defined in formulae (10) and (11) and Assumption 3, respectively; 2^J = n, and the range of j is 0 ≤ j ≤ J* − 1. Let δ₁ be any positive quantity such that δ₁ < c₁, where c₁ is defined in Assumption 10, and define δ′ = δ(c₁ − δ₁)/2, where δ appears in the exponential inequality below. Assume conditions (12)-(14) as M, n/M, nb → ∞ and b → 0, where d is defined in Assumption 9 and H is the Lipschitz constant for h(u) over u ∈ range{α(z)}. In the asymptotic limit, as M, n/M, nb → ∞ and b → 0, we have, uniformly over (j, k) ∈ I_n, the exponential inequality for ĥ stated above, where C₃, C₄ and C₅ are positive constants and Φ is the cdf of the standard normal.

Explanation of the rates. We take M = O(n^ϑ) for ϑ ∈ (0, 1) and b = O(n^{−ζ}) for ζ ∈ (0, 1), and investigate conditions on ϑ and ζ which ensure that the assumptions of Theorem 2 are satisfied. Clearly, if ζ < 1/2, then b_n, d_n and i_n can be chosen such that they are all of order O(n^ς) for some ς > 0. Fixing δ₁ to be constant, and assuming that δ and δ′ either are constant or tend to zero no faster than logarithmically in n, we have that conditions (12)-(14) are satisfied if both 1 > 2ζ + ϑ and ϑ/4 > ζ. Finally, to ensure that the sequences a_n, c_n, e_n, f_n, g_n and h_n are all of order O(n^ς) for some ς > 0, we additionally require that ǫ/4 > ζ, where 2^{J*} = n^{1−ǫ}. Thus, in the (ǫ, ϑ, ζ) space, the set A of parameter configurations which are "admissible" in the above sense has the form A = {(ǫ, ϑ, ζ) : 1 > 2ζ + ϑ, ϑ/4 > ζ, ǫ/4 > ζ}. In view of the above discussion, the following corollary can be formulated.

Data-driven wavelet-Fisz estimation (for h unknown)
Our algorithm for computing the wavelet-Fisz estimator of α if the function h is unknown (which we also call the data-driven wavelet-Fisz estimator of α), proceeds in the same way as the wavelet-Fisz algorithm for h known, described in detail in Section 3, the only exception being that we use our estimateĥ from Section 4, instead of the true h.
To be more precise, we replace our thresholds λ̂_{j,k}, defined in formula (6) and subsequently used in the threshold estimators µ̂^(s)_{j,k} and µ̂^(h)_{j,k} (see formulae (7) and (8)), with the thresholds

λ̄_{j,k} = {2 log(#I_n)}^{1/2} ĥ^{1/2}(Σ_t κ_{j,k−t} X_t),

in which the estimate ĥ from Section 4 replaces the true h. We denote the thus constructed data-driven wavelet-Fisz estimator of α by ᾱ^(e), where (e) is one of: (s) (soft thresholding) and (h) (hard thresholding).
The following theorem quantifies the mean-square rate of convergence of ᾱ^(e). Comparing this result with Theorem 1, we note that the estimator ᾱ^(e) does not exhibit any loss of asymptotic efficiency compared to α̂^(e), despite the fact that it uses an estimate of h rather than the true h.

Estimation of h -implementation issues
This section briefly describes the outcome of an extensive simulation study aimed at assessing the performance of the estimator ĥ(u) for various parameter configurations. Recall that h(u) is assumed to be non-decreasing (see Assumption 2). As ĥ(u) is not guaranteed to be non-decreasing, in practice we used the following computational "correction" to ĥ(u). Having obtained ĥ(u), we input it into the (automatic) "pool-adjacent-violators" algorithm for least-squares isotone regression, described in detail in Johnstone and Silverman (2005b), Section 6.3. The resulting estimate, denoted hereafter by ḣ(u), is a non-decreasing, piecewise constant function of u, which is as close as possible to ĥ(u) in the least-squares sense. Empirically, we found that the use of ḣ(u), rather than ĥ(u), in the estimator ᾱ^(e) results in a slightly superior performance of the latter. (We note that the literature on monotone estimation of variance functions is sparse; Dette and Pilz (2007) consider estimation of a monotone conditional variance in nonparametric regression.) Having investigated various choices of the span M and the bandwidth b for a range of test functions and noise distributions, we found that ḣ(u) performed particularly well for "small" values of M. Any value of M ≤ 3 consistently resulted in good estimates. The examples later in this section use the value M = 3. The estimator seems to be less sensitive to the choice of b: this is due to the computational correction described above, which "smooths out" any remaining wiggles in ĥ(u). We recommend an "automatic" choice of b, such as that described in Gasser et al. (1991) and conveniently implemented in the routine glkerns from the R package lokern. We also use the default kernel function K(·) from that routine.
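The pool-adjacent-violators correction can be sketched as follows; this is a generic implementation of the algorithm, not the Johnstone and Silverman (2005b) code:

```python
import numpy as np

def pava(y, w=None):
    """Pool-adjacent-violators algorithm: weighted least-squares
    isotone (non-decreasing) regression.  Adjacent values that violate
    monotonicity are repeatedly merged into blocks carrying their
    weighted mean."""
    y = np.asarray(y, dtype=float)
    w = np.ones_like(y) if w is None else np.asarray(w, dtype=float)
    blocks = []  # each block: [weighted mean, total weight, n points]
    for yi, wi in zip(y, w):
        blocks.append([yi, wi, 1])
        # merge while the last two blocks violate monotonicity
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, w2, n2 = blocks.pop()
            m1, w1, n1 = blocks.pop()
            tot = w1 + w2
            blocks.append([(m1 * w1 + m2 * w2) / tot, tot, n1 + n2])
    return np.concatenate([np.full(n_, m) for m, _, n_ in blocks])
```

For instance, `pava([1, 3, 2, 4])` pools the violating pair (3, 2) into their mean 2.5, yielding the non-decreasing fit [1, 2.5, 2.5, 4]; applied to the raw ĥ(u) evaluated on an ordered grid of u values, it produces the corrected ḣ(u).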
We briefly illustrate the performance of ḣ(u) on 4 simulated datasets. The models are: the Poisson model, whereby X_t ∼ Pois{α(t/n)}, and the exponential model, in which X_t ∼ α(t/n) Exp(1). With each model, we use two functions α(z): the blocks function, scaled and shifted to have the minimum (maximum) value of 1 (22.6), and the bumps function, with the minimum (maximum) value of 3 (23.21). Both functions are sampled at 2048 equispaced points. Figure 1 shows sample paths from each model, together with the corresponding estimates ḣ^{1/2}(u). The estimator performs well in all cases. The good performance is not incidental: indeed, we found that for the parameter choices described above, the estimator ḣ(u) performed well across all simulated examples.

Estimation of α: implementation issues
This section discusses the choice of the various parameters for our data-driven wavelet-Fisz estimator ᾱ^(e). The examples in this section use Haar wavelets: this choice is motivated in Section 8 below. As a default option, we use translation-invariant (see Nason and Silverman (1995)) hard thresholding with J* = J − 2, as this parameter configuration seems to offer the best empirical performance. We use the variance estimator ḣ(u) described in Section 6 with M = 1. For each j, k, we choose the parameters κ_{j,k−t} to be constant for t ∈ supp(ψ_{j,k−·}). This is a natural choice for Haar wavelets, as the coefficients Σ_q κ_{j,k−q} X_q are then already available to us as "by-products" of the discrete Haar transform. Figure 2 shows the outcome of the estimation procedure described above for the sample paths from Figure 1. It is clear that our procedure performs very well for Poisson noise. Performance for exponential noise is also satisfactory given how noisy the original signals are: indeed, it is extremely hard to identify some of the features of the clean bumps and blocks signals from a visual inspection of the corresponding exponential datasets. We mention again that our estimation procedure "does not know" any characteristics of the noise, and estimates the variance function h(u) from the data. We end this section with a brief comparison of our estimator with Gao's (1997) estimator for general heteroscedastic noise; Table 1 reports the mean-square errors over 100 simulations for the four models. The better performance of our method is unsurprising, as Gao's method does not take into account the mean-variance relationship and thus uses less information.

Data-driven wavelet-Fisz transform
In this section, we describe a wavelet-domain variance-stabilising transform implied by our data-driven wavelet-Fisz estimation procedure. Note that the computation of the data-driven wavelet-Fisz estimate ᾱ^(e) can be performed in the following three steps.

2. Smooth X̃_t by means of a standard nonlinear wavelet thresholding procedure suitable for i.i.d. Gaussian noise, using the same wavelet family as in Step 1. To be more precise, we apply the threshold {2 log(#I_n)}^{1/2} for (j, k) ∈ I_n, and set the empirical coefficients to zero for j ∈ [J*, J − 1]. Either soft or hard thresholding can be used.
3. Take the inverse transform to that described in Step 1.
We call the transform in Step 1 the data-driven wavelet-Fisz variance-stabilising transform. Empirically, it stabilises the variance of X t and brings its distribution closer to Gaussianity. The mechanism of the transform was already explained in the discussion underneath formula (9).
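The Step-1 standardisation can be sketched for Haar wavelets as follows. This is our illustrative reading, under the assumption that each Haar detail coefficient is divided by ĥ^{1/2} evaluated at the corresponding local mean (cf. the discussion underneath formula (9)); the names haar_fisz and h_hat are ours, and the paper's exact standardisation may differ in details.

```python
import numpy as np

def haar_fisz(x, h_hat):
    """Sketch of a Haar-domain variance-stabilising transform: at each
    scale, divide the detail coefficient by the square root of the
    estimated variance function h_hat evaluated at the local mean,
    then invert the Haar transform.  Assumes len(x) is a power of two."""
    x = np.asarray(x, dtype=float)
    J = int(np.log2(x.size))
    details, smooth = [], x.copy()
    for _ in range(J):
        even, odd = smooth[0::2], smooth[1::2]
        d = (even - odd) / 2.0          # Haar detail (unnormalised)
        s = (even + odd) / 2.0          # local mean at this scale
        sigma = np.sqrt(np.maximum(h_hat(s), 1e-12))
        details.append(d / sigma)       # standardise by estimated local sd
        smooth = s
    # Inverse Haar steps applied to the standardised details.
    y = smooth
    for d in reversed(details):
        out = np.empty(2 * y.size)
        out[0::2] = y + d
        out[1::2] = y - d
        y = out
    return y
```

When h_hat is constant the transform reduces to the identity; with h_hat set to the estimated variance function ḣ, the output plays the role of the transformed vector X̃_t in Step 1.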
In our simulations, we found that the distribution of the "noise" in the transformed vector X̃_t was the most symmetric when Haar wavelets were used. This was due to the fact that Haar wavelets are symmetric in the sense that the positive part of each Haar wavelet vector is an exact shifted version of its negative part. Figure 3 compares the data-driven wavelet-Fisz transform (with parameters as in Section 7) of the exponential bumps dataset, with its logarithmic transform. Note that the latter acts as an exact variance-stabiliser, due to the multiplicative structure of the model. However, it is clear from the plot that not only does the data-driven wavelet-Fisz transform also stabilise the variance of the noise very well, but in addition it brings the distribution of the noise much closer to Gaussianity than the log transform (and does this without knowing the original noise distribution or the structure of the model). It also seems to bring out the shape of the underlying signal more clearly.
From a computational point of view, in view of the Gaussianising and variance-stabilising action of the data-driven wavelet-Fisz transform, the analyst wishing to find out more about the shape of the underlying signal may choose to apply any smoother suitable for i.i.d. Gaussian noise to the wavelet-Fisz-transformed dataset.

Discussion
We conclude with three final remarks.
Applications. Readers interested in the applications of the data-driven wavelet-Fisz methodology are referred to our earlier work Motakis et al. (2006) and Fryzlewicz et al. (2007), where we proposed computational procedures related to that described here and applied them to gene expression data, and solar irradiance data, respectively. Those heuristic algorithms were not accompanied by any theoretical analysis, leaving a gap which is filled by the present work.
Density estimation. In practice, our data-driven wavelet-Fisz estimator can also be applied to the problem of density estimation from binned data. Although this problem does not exactly fall into the class of models described by formula (2), it can be approximated, in a certain asymptotic regime, by the Poisson model, which is a sub-case of (2). We mention that Brown et al. (2007) propose a wavelet-based method for density estimation from binned data which includes a time-domain variance-stabilising transform as an initial step.
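The reduction of binned density estimation to the Poisson model can be illustrated in a few lines. This is a generic sketch of the binning step, not the procedure of Brown et al. (2007); the bin count of 256 and the Beta sample are arbitrary choices of ours.

```python
import numpy as np

rng = np.random.default_rng(1)
sample = rng.beta(2, 5, size=10000)   # i.i.d. draws from an "unknown" density f

# Bin the sample on an equispaced grid over [0, 1].
counts, edges = np.histogram(sample, bins=256, range=(0.0, 1.0))

# Heuristically, counts_k is approximately Pois{N * f(midpoint_k) * bin_width}
# for a smooth f, so estimating f from 'counts' is approximately a Poisson
# intensity-estimation problem, a sub-case of model (2).
```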
Non-equispaced design. Jansen et al. (2009) mention the possible use of the Haar-Fisz transform for Poisson data on graphs and in other "irregular multidimensional situations". We believe that similarly, the data-driven wavelet-Fisz methodology could also be used in such set-ups.
where φ is the standard normal density.
For the purposes of this section, we extend the notation ᾱ^(e) to mean any estimator constructed as in Section 3, using an arbitrary set of thresholds λ_{j,k}. To emphasise the dependence of ᾱ^(e) on λ_{j,k}, we will write ᾱ^(e)(λ_{j,k}).
Theorem 4. Let λ_{j,k} be any non-random thresholds satisfying Assumption 12. Further, suppose that Assumptions 1, 3 and 4 hold. We have
The proof of Theorem 4 proceeds exactly like the proof of Theorem 5.2(ii) in Neumann (1996). We omit the details.
We now define what we call "lower" and "upper" deterministic thresholds λ^{(d,l)}_{j,k} and λ^{(d,u)}_{j,k}, where γ_n is a sequence approaching one from below, C is a generic positive constant (see Lemma 1 for the permitted range of C) and the κ_{j,τ} are as in formula (6). Proving that the lower and upper thresholds satisfy Assumption 12 (and thus that Theorem 4 holds for ᾱ^(e)(λ^{(d,l)}_{j,k}) and ᾱ^(e)(λ^{(d,u)}_{j,k})) is a step on the way to proving Theorem 1.
Proof. It is easy to check that if C ≥ (2h)^{1/2}, then λ^{(d,l)}_{j,k} ≤ λ^{(d,u)}_{j,k} for all j, k. We first check (17) for λ^{(d,l)}_{j,k}. The factor λ^{(d,l)}_{j,k}/σ_{j,k} + 1 only contributes a logarithmic term, so we skip it. Denote α̲_{j,k} = inf{α(t/n) : t ∈ supp(ψ_{j,k})} and ᾱ_{j,k} = sup{α(t/n) : t ∈ supp(ψ_{j,k})}. Further, let TV(f)|_A denote the total variation of the function f measured on the set A. Using Assumption 2, we bound λ^{(d,l)}_{j,k}/σ_{j,k} from below, where the last inequality follows from the fact that v(x) = x/(x + a²) is increasing on [0, ∞). As in the proof of Lemma 6.1(ii) of Neumann (1996), we have Σ_k TV(α)|_{supp(ψ_{j,k})} ≤ O(1) TV(α)|_{[0,1]}, and thus (21) holds for a sequence ω_n → 0 at each scale j. Denote D_n = I_n ∩ {(j, k) : TV(α)|_{supp(ψ_{j,k})} > ω_n}; note that by (21), at each scale j at most O(ω_n^{−1}) coefficients are in D_n. Denote further E_n = I_n \ D_n. We have, for any m > 0, the required bound; the last equality follows from the fact that 1 − γ_n^2 h/(h + Hω_n) → 0.
Choosing ω_n = log^{−1} n (say), we have that (17) is satisfied irrespective of the smoothness parameter m. Because the thresholds λ^{(d,u)}_{j,k} are higher than λ^{(d,l)}_{j,k}, (17) also holds for λ^{(d,u)}_{j,k}. Obviously, (18) holds for λ^{(d,u)}_{j,k}, which implies that it also holds for λ^{(d,l)}_{j,k}, since the λ^{(d,l)}_{j,k} are lower than the λ^{(d,u)}_{j,k}. We now state another auxiliary result. We first specify an assumption on an arbitrary set of random thresholds λ̂_{j,k}, involving some γ_n → 1− (see the definition of λ^{(d,l)}_{j,k} in formula (19)), some ν < 1/(2m + 1) (with m given in Theorem 1), and some C ≥ (2h)^{1/2} (see the definition of λ^{(d,u)}_{j,k} in formula (20)).

Theorem 5. Let λ̂^{(r)}_{j,k} be any random thresholds satisfying Assumption 13. Further, suppose that Assumptions 1, 2, 3, 4 and 5 hold. We have
The proof of Theorem 5 proceeds exactly like the proof of Theorem 6.1 in Neumann (1996). We omit the details.
In view of Theorem 5, in order to prove Theorem 1, it suffices to show that our random thresholds λ̂_{j,k}, defined in formula (6), satisfy Assumption 13.
Proof. We start with (22). To shorten notation, denote û = Σ_q κ_{j,k−q} X_q and u = Σ_q κ_{j,k−q} α(q/n), and define ν_n accordingly; note that ν_n → 0. Suppose ν_n tends to zero logarithmically fast in n (which is easy to ensure by placing an appropriate assumption on the speed of convergence of γ_n to one). Then, by Lemma 8, there exists ǫ > 0 such that the required exponential bound holds. Summing over j and k, we obtain (22) for any ν. The technique for showing (23) is exactly the same.
We omit the details.
The proof of Theorem 1 is complete.
Because α(z) is piecewise Lipschitz-continuous of order 1, it is clear that the cardinality of B^z_n is uniformly bounded from below by cn, where c ∈ (0, 1), and that B^z_n contains those t for which t/n is arbitrarily close to z from either the left- or the right-hand side (or both). We have the claimed bound, which completes the proof.
Noting that p(l) ≤ const^l, we observe that (25) can comfortably be bounded for a constant K̃ > 0, which completes the proof.
Lemma 4. For a nonnegative random variable X and l ≥ 1, we have the following bound. Proof. Noting that, for x, c ≥ 0, we have |x − c|^l ≤ x^l + c^l, we obtain the claim, where the last step uses Jensen's inequality.
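The display lost in extraction can plausibly be reconstructed from the two facts the proof quotes. A sketch, under the assumption that the lemma bounds the l-th centred moment of X:

```latex
\mathbb{E}\,|X-\mathbb{E}X|^{l}
  \;\le\; \mathbb{E}\,X^{l} + (\mathbb{E}X)^{l}
  \;\le\; 2\,\mathbb{E}\,X^{l},
```

where the first inequality applies $|x-c|^{l}\le x^{l}+c^{l}$ (valid for $x,c\ge 0$, since $|x-c|\le\max(x,c)$) with $c=\mathbb{E}X$, and the second uses Jensen's inequality, $(\mathbb{E}X)^{l}\le\mathbb{E}\,X^{l}$ for $l\ge 1$, matching the proof's closing remark.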
Let b_n be any fixed sequence such that b_n = o{(n/M)^{1/(6+4γ)}}. In the asymptotic limit, as n, M, n/M → ∞, we have the following, where C_3 is a positive constant, Φ is the cdf of the standard normal, and the sequence a_n satisfies a_n = O(δn^{−1/2} − Mn^{−1/2} − n^{1/2}M^{−1/2}).
Proof. We first note that if Assumption 1(i) holds, then (27) holds uniformly over i and M, where C_1 is a positive constant. Summing both sides of (27) over i, we have Σ_{t=1}^n |α(t/n) − E(α_t)| ≤ C_1(2M + 1). To avoid the complications which arise in deriving exponential inequalities for dependent sequences, we split the sum into independent subsequences, as follows. Rewriting the left-hand side of (28) as (29), and using the fact that a_1 + · · · + a_m ≥ δ implies a_i ≥ δ/m for some i, as well as the Bonferroni inequality, we bound (29) accordingly. We drop the range of t to shorten notation, and assume without loss of generality that there are exactly n/(2M + 1) terms in the above sum. Denoting Z_t = |α_t − E(α_t)|, we bound the above, and first assess Σ_t E(Z_t). Note that by Assumption 8, we have δ″ = O(δn^{−1/2} − Mn^{−1/2} − n^{1/2}M^{−1/2}) and, by the assumptions of the lemma, δ″ → ∞. We now apply Theorem 1 and its Corollary from Rudzkis et al. (1978) to the standardised sum ξ. By Lemma 5, in our case the quantity Δ_n from that Theorem takes a form which, by Assumption 8, is of order O{(n/M)^{1/2}}. Recall that b_n = o{(n/M)^{1/(6+4γ)}}. Using the above Theorem, we bound (30), which completes the proof.
Using the above Theorem, we obtain the required bound, which completes the proof.
Lemmas 7 and 8 yield the result.
Proof of Corollary 1. Denoting by φ(x) the standard normal pdf, and recalling that for large x we have 1 − Φ(x) ≤ φ(x), Corollary 1 is a direct consequence of Theorem 2 and the discussion directly underneath it.
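The tail bound quoted here is the standard Mills-ratio estimate; a one-line derivation, for x ≥ 1:

```latex
1-\Phi(x) \;=\; \int_{x}^{\infty}\varphi(t)\,dt
  \;\le\; \int_{x}^{\infty}\frac{t}{x}\,\varphi(t)\,dt
  \;=\; \frac{\varphi(x)}{x}
  \;\le\; \varphi(x),
```

where the first inequality uses $t/x\ge 1$ on the integration range and the middle equality uses $\varphi'(t)=-t\,\varphi(t)$.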
Appendix C: Proof of Theorem 3
In view of Theorem 5, it suffices to show that our thresholds λ̂_{j,k} satisfy Assumption 13. The following lemma holds.
Suppose γ_n^2 converges to one logarithmically fast in n. Then, by Corollary 1, the above probability can be bounded, uniformly, by O(n^{−2}). Summing over j, k, we obtain Σ_{(j,k)∈I_n} P(λ̂_{j,k} < λ^{(d,l)}_{j,k}) ≤ O(n^{−1}) ≤ O(n^{ν}), which shows (22). The technique for showing (23) is exactly the same. We omit the details.