Jump estimation in inverse regression

We consider estimation of a step function $f$ from noisy observations of the convolution $\phi*f$, where $\phi$ is some bounded $L_1$-function. We use a penalized least squares estimator to reconstruct the signal $f$ from the observations, with penalty equal to the number of jumps of the reconstruction. Asymptotically, the number of jumps is estimated correctly with probability one. Given that the number of jumps is correctly estimated, we show that the corresponding parameter estimates of the jump locations and jump heights are $n^{-1/2}$ consistent and converge to a joint normal distribution with covariance structure depending on $\phi$, and that this rate is minimax for bounded continuous kernels $\phi$. As a special case we obtain the asymptotic distribution of the least squares estimator in multiphase regression and generalisations thereof. In contrast to the results obtained for bounded $\phi$, we show that for kernels with a singularity of order $O(|x|^{-\alpha})$, $1/2<\alpha<1$, a jump location can be estimated at a rate of $n^{-1/(3-2\alpha)}$, which is again the minimax rate. We find that these rates do not depend on the spectral information of the operator but rather on its localization properties in the time domain. Finally, it turns out that adaptive sampling does not improve the rate of convergence, in strict contrast to the case of direct regression.


Introduction
Assume we have observations from a regression model given by
\[
Y_i = \Phi f(x_i) + \varepsilon_i, \qquad i = 1, \dots, n, \tag{1}
\]
where $\Phi f = \phi * f$ denotes convolution of some $L_1$-functions $\phi$ and $f$, and $\varepsilon_1, \varepsilon_2, \dots$ are i.i.d. mean zero random variables with finite second moment. In the following we refer to model (1) as the inverse (deconvolution) regression model, and we assume throughout that $\phi$ is known. Suppose the objective function $f : [0,1] \to \mathbb{R}$ is in $L_1$ and moreover locally constant, i.e. a piecewise constant function with $k$ jumps given by
\[
f = \sum_{i=1}^{k+1} b_i\, 1_{[\tau_{i-1}, \tau_i)}, \tag{2}
\]
s.t. $-\infty = \tau_0 \le 0 < \tau_1 < \dots < \tau_k < 1 \le \tau_{k+1} = \infty$ and $k \in \mathbb{N}$ possibly unknown (see Figure 1). From Figure 1 the difficulty of estimating jumps in inverse regression becomes visible: due to the smoothing by $\phi$, jumps only appear as small changes in $\Phi f$.
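To fix ideas, the following small Python sketch (our illustration only; the kernel, jump positions and noise level are arbitrary choices, and all names are ours) simulates observations from model (1) with a Gaussian kernel, for which $\Phi f$ is available in closed form through the Gaussian distribution function:

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(0)

    # step function f = sum_i b_i 1_{[tau_{i-1}, tau_i)} with jumps at 0.3 and 0.7
    tau = np.array([-np.inf, 0.3, 0.7, np.inf])   # tau_0, tau_1, tau_2, tau_3
    b = np.array([0.0, 1.0, -0.5])                # levels b_1, b_2, b_3

    def Phi_f(x, bw=0.05):
        # (phi * f)(x) for phi the N(0, bw^2) density, using the closed form
        # int_s^t phi(x - y) dy = G(x - s) - G(x - t), G = Gaussian cdf with sd bw
        return sum(b[i] * (norm.cdf((x - tau[i]) / bw) - norm.cdf((x - tau[i + 1]) / bw))
                   for i in range(len(b)))

    n = 500
    x = np.sort(rng.uniform(0.0, 1.0, n))         # design points
    y = Phi_f(x) + 0.1 * rng.standard_normal(n)   # observations from model (1)

Plotting y against x shows the phenomenon of Figure 1: the jumps of $f$ appear only as smooth transitions spread over a neighbourhood whose width is comparable to the kernel bandwidth.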
In this paper we show that the joint least squares estimator $\hat\theta_n$ of jumps and heights $\theta = (b_1, \tau_1, b_2, \tau_2, \dots, b_k, \tau_k, b_{k+1})$ is $n^{-1/2}$ consistent and follows a multivariate normal limit law. This is in strict contrast to the case of direct regression (where $\Phi$ in (1) is the identity). In the latter case it is known that the LSE converges at the (minimax) $n^{-1}$ rate and its distribution (after recentering and rescaling with $n$) is given as the minimizer of a certain random walk process. Further, jump heights and locations are asymptotically independent (see van de Geer (1988); Yao and Au (1989); Müller (1992); Müller and Stadtmüller (1999); Yakir et al. (1999); Birgé and Massart (2006) for some references on jump estimation in direct regression). Finally, by an adaptive choice of the design points it is possible to speed up the $n^{-1}$ rate to any polynomial rate of convergence (Lan et al. (2007)). We will see that in inverse regression the situation is completely different with respect to all of these issues: in general, all components of $n^{1/2}(\hat\theta_n - \theta)$ will be asymptotically dependent (depending on the kernel $\phi$). Further, rather surprisingly, the $n^{-1/2}$ rate does not depend on the decay of the Fourier transform of $\phi$, which usually determines the rate of convergence in more common function spaces, such as Sobolev spaces (cf. Cavalier and Tsybakov (2002) among others). Indeed, we will show that the $n^{-1/2}$ rate is minimax if $\phi$ is a bounded, continuous function. Because our minimax lower bound is independent of the design points, we obtain the surprising finding that adaptive sampling cannot improve the rate of convergence in the inverse case. In fact, a main motivation to consider the space of locally constant functions as in (2) stems from the observation that in general deconvolution is a difficult problem, which is reflected by rates of convergence that can be arbitrarily slow, e.g. $(\log n)^{-\beta}$ rates for supersmooth (e.g. Gaussian) deconvolution. Two-phase regression has been studied by Hinkley (1969) and, more recently, by van de Geer (1988), Yakir et al. (1999) and Koul et al. (2003), among others. If the objective function is assumed to be continuous, two-phase regression can be modeled by an inverse regression model with a polynomial kernel with $p = 0$, i.e. $\phi(x) = 1_{[0,1)}(x)$. In this setting the $n^{-1/2}$ rate and the asymptotic distribution were derived by Hinkley (1969) and, for more general segmented regression models, by Feder (1975). From the perspective of a statistical inverse problem their results are quite natural to understand: multiphase regression corresponds to estimation of a jump function in a noisy Volterra equation, where the locations of the jumps correspond to the kinks of the multiphase regression function.
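To make the correspondence concrete, consider as a worked illustration the Volterra kernel $\phi = 1_{[0,\infty)}$ and a step function $f$ as in (2) with $b_1 = 0$, so that $f$ vanishes to the left of $\tau_1$. Then
\[
\Phi f(x) = \int_{\mathbb{R}} 1_{[0,\infty)}(x - y)\, f(y)\,dy = \int_{-\infty}^{x} f(y)\,dy,
\]
so $\Phi f$ is continuous and piecewise linear with slope $b_i$ on $[\tau_{i-1}, \tau_i)$: the kinks of the multiphase regression function $\Phi f$ sit exactly at the jump locations $\tau_i$ of $f$.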
Our results generalize the known results on the estimation of the intersection in two-phase regression to the case where the objective function has an arbitrary number of phases and is piecewise polynomial of order $p+1$, with $p$ continuous derivatives and a $(p+1)$-th derivative which is a step function. For piecewise linear regression ($p = 1$) in a deconvolution context this problem occurs in rheology, where the relaxation time spectrum has to be estimated from measurements of the dynamic moduli of materials (cf. Roths et al., 2000). Other applications stem from biophysics, where the ion-channel activity of lipid membranes is measured by impedance spectroscopy and the jump locations indicate different opening states (cf. Schmitt et al., 2006; Römer et al., 2004). We obtain the somewhat surprising result that the rate for estimating the change-point does not depend on $p$, whereas in general nonparametric regression settings the convergence rates for estimating a jump in the $p$-th derivative become slower as $p$ grows (see Raimondo, 1998).
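The reduction behind this generalization can be made explicit with Taylor's formula with integral remainder (a standard computation; absorbing the polynomial part into the model is implicit in the text): if the regression function $m$ has $p$ continuous derivatives on $[0,1]$ and $m^{(p+1)} = f$ is a step function, then
\[
m(x) = \sum_{j=0}^{p} \frac{m^{(j)}(0)}{j!}\, x^j + \int_0^x \frac{(x - y)^p}{p!}\, f(y)\,dy,
\]
so, up to a polynomial, estimating the jumps of $m^{(p+1)}$ is exactly the inverse regression problem (1) with the polynomial kernel $\phi(x) = x^p\, 1_{[0,\infty)}(x)/p!$.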
The first one to investigate the change-point problem in the framework of a statistical inverse problem was Neumann (1997), who considered the estimation of a change-point in a density deconvolution model $Y = X + \xi$ with known error density $f_\xi$. He treated the case that the density of $X$ is bounded, has one jump at $\tau$ and is Lipschitz continuous elsewhere. In this setting $\tau$ can be estimated at a rate of $\min(n^{-1/(2\beta+1)}, n^{-1/(\beta+3/2)})$, provided the tails of the Fourier transform of $f_\xi$ decrease at a rate of $|x|^{-\beta}$. Moreover, he proved that these rates are optimal in a minimax sense. This result was extended by Goldenshluger et al. (2006c) (in a white noise model) to classes of functions $f$ which can be written as a sum of a step function and a function with smooth $m$-th derivative. They showed that in this case the minimax rates are of order $\min(n^{-1/(2\beta+1)}, n^{-(m+1)/(2\beta+2m+1)})$. If the smooth part of the function of interest belongs to a Paley-Wiener class, they show that a rate of $\min(n^{-1/2}, n^{-1/(2\beta+1)})$ can be obtained up to a logarithmic factor. Their recent work (Goldenshluger et al., 2006a,b) generalizes these results to a unifying framework of sequence space models covering delay and amplitude estimation, estimation of change-points in derivatives, and change-point estimation in a convolution white noise model. We remark that the specific choice of jump functions in (2) used in this work comes close to the super-smooth case for $\beta \ge 1/2$, but we can get rid of the additional logarithmic factor. Moreover, we will see that similar rates hold in the case of $\beta < 1/2$ if the assumption on the boundedness of the kernel is dropped (see Remark 3).
This work is structured as follows. Section 2 gives some basic notation and the main assumptions. The estimate and its asymptotic properties are given in section 3, and the proof of the main result can be found in section 4. In section 5 we derive the required results from the theory of radial basis functions, which yield sufficient conditions on $\phi$ for the asymptotic normality of the LSE. Finally, in section 6 we derive the minimax rate for estimating the jump location.

Notation
Define the set of possible jump locations of $f$ in (1), and denote the corresponding function space of locally constant functions with at most $k$ jumps by $T_k$. Write $T_\infty := \bigcup_{k=1}^\infty T_k$ for the set of all step functions on $\mathbb{R}$ with a finite but arbitrary number of jumps, where we exclude an isolated jump at the end points of the interval $[0, 1]$. Note that outside of $[0,1]$ these functions are constant. Let $T_{k,R} := \{g \in T_k : \|g\|_\infty < R\}$ as well as $T_{\infty,R} := \bigcup_{k=1}^\infty T_{k,R}$ denote the corresponding spaces of uniformly bounded functions for some $R > 0$. If not mentioned otherwise, the restrictions of these spaces to $[0,1]$ are considered to be subspaces of $L_2([0,1])$. Define the empirical norm $\|\cdot\|_n$ and the empirical inner product $\langle\cdot,\cdot\rangle_n$ by $\|g\|_n^2 := n^{-1}\sum_{i=1}^n g(x_i)^2$ and $\langle g, h\rangle_n := n^{-1}\sum_{i=1}^n g(x_i) h(x_i)$, where $x_1, \dots, x_n$ are the design points. Similarly, set $\|y\|_n^2 := n^{-1}\sum_{i=1}^n y_i^2$ as well as $\langle y, z\rangle_n := n^{-1}\sum_{i=1}^n y_i z_i$ for $y, z \in \mathbb{R}^n$.
Write $g(t^+) := \lim_{x \searrow t} g(x)$ for the right limit of $g$ in $t$ and $g(t^-) := \lim_{x \nearrow t} g(x)$ for the corresponding left limit. For some proper function $g : \mathbb{R} \to \mathbb{R}$ define the set of jump points of $g$ as $J(g) := \{x \in \mathbb{R} : g(x^-) \neq g(x^+)\}$ and the number of jumps as $J^\#(g) := \#J(g)$. Finally, for ease of notation, for any $a, b \in \mathbb{R}$, $[a, b]$ and $(a, b)$ always denote the intervals $[\min(a,b), \max(a,b)]$ and $(\min(a,b), \max(a,b))$, respectively.

Assumptions
Assumptions on the error

If the number of jumps is known, the following basic assumption is sufficient to deduce the $n^{-1/2}$ rates of convergence for the least squares estimates.
Assumption A. The array $(\varepsilon_1, \dots, \varepsilon_n)$ consists of independent identically distributed random variables with mean zero for every $n$. Additionally, assume $E\varepsilon_1^2 = \sigma^2 \in (0, \infty)$. If the number of jumps of the objective function is unknown, we will additionally need that the error satisfies the following subgaussian condition.

Assumption (A1). There exist constants $K$ and $\sigma_0$ such that $K^2\big(E\exp(\varepsilon_1^2/K^2) - 1\big) \le \sigma_0^2$.
Assumptions on the kernel

Throughout the following we require a slightly stronger condition, the linear independence of the functions in (6) together with their derivatives.
Assumption B. Assume that $\phi \in L_1(\mathbb{R}) \cap L_2(\mathbb{R}) \cap L_\infty(\mathbb{R})$ is piecewise continuous with finitely many jumps. Additionally, the functions $\Delta_\phi(\cdot, \tau_0, \tau_1), \dots, \Delta_\phi(\cdot, \tau_k, \tau_{k+1})$, with $\Delta_\phi(x, s, t) := \int_s^t \phi(x - y)\,dy$ as in (7), are linearly independent for every choice of $k \in \mathbb{N}$ and $-\infty = \tau_0 \le 0 < \tau_1 \le \dots \le \tau_k < 1 \le \tau_{k+1} = \infty$, where only two subsequent $\tau_i$ are allowed to be equal.
The following theorem gives some general conditions, which are sufficient for φ to satisfy Assumption B.
Theorem 2.1. The function φ satisfies Assumption B if one of the following conditions is satisfied.
(i) $\phi$ is a symmetric real-valued function with Fourier transform $\hat\phi(x) \ge 0$, such that there exist $n_0 \in \mathbb{N}$ and $C > 0$ with $\hat\phi(x) \ge C(1 + x^2)^{-n_0}$.
(ii) $\phi$ is extended sign regular of order $k + 2$ on $\mathbb{R}$, with $0 < \int_{\mathbb{R}} \phi(x)\,dx < \infty$.
(iii) The function $\phi$ is given by
The proof of part (i) is given in section 5; the proofs of parts (ii) and (iii) are straightforward and can be found in Boysen (2006). Note that part (ii) covers the Gauss kernel $\phi(x) = (2\pi)^{-1/2}\exp(-x^2/2)$ (see Section 3, Example 5 in Karlin and Studden, 1966).
Assumptions on the design points

We make the following assumption on the design points.

Assumption C. There exists a strictly positive density $h$ on $[0,1]$ such that
\[
\max_{1 \le i \le n} \big| x_{(i)} - H^{-1}(i/n) \big| = O_P(n^{-1/2}), \qquad \text{where } H(x) := \int_0^x h(y)\,dy.
\]
Moreover, the design points $x_1, \dots, x_n$ are independent of the error terms $\varepsilon_1, \dots, \varepsilon_n$. Here $x_{(i)}$ denotes the $i$-th order statistic of $x_1, \dots, x_n$.
Note that the above assumption covers random designs as well as fixed designs generated by a regular density in the sense of Sacks and Ylvisaker (1970). If the design points $x_1, \dots, x_n$ are nonrandom, the $O_P(n^{-1/2})$ term above is to be understood as $O(n^{-1/2})$. In this case the design points have to be understood as a triangular scheme (see also Dümbgen and Johns, 2004).

Estimate and asymptotic results
Estimate

Define the restricted least squares estimate $\hat f_n$ as an approximate minimizer of the empirical $L_2$ distance to the data in the space $T_{k,R}$. More precisely, $\hat f_n \in T_{k,R}$ satisfies
\[
\|Y - \Phi \hat f_n\|_n^2 \le \inf_{g \in T_{k,R}} \|Y - \Phi g\|_n^2 + o_P(n^{-1}). \tag{9}
\]
The minimizer of the functional on the right-hand side always exists (compare Lemma 4.6). Note that we do not assume that the minimum is attained, but only that the functional above can be minimized up to some term of order $o_P(n^{-1})$; moreover, the minimizer need not be unique. This assumption allows for numerical approximation of the minimizer and gives an intuition of the precision needed for the asymptotic results to be valid. The restriction to functions with $\|f\|_\infty < R$ is a technical assumption, which requires that some upper bound on the supremum norm of the objective function is known beforehand.
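As a purely illustrative numerical sketch (the brute-force grid search and all names are ours, not the paper's algorithm), the restricted estimate can be approximated by searching candidate jump locations on a finite grid and fitting the levels by linear least squares; the last function anticipates the penalized estimate defined in the next paragraph:

    import itertools
    import numpy as np
    from scipy.stats import norm

    def kernel_int(x, s, t, bw=0.05):
        # Delta_phi(x, s, t) = int_s^t phi(x - y) dy for a Gaussian kernel phi
        return norm.cdf((x - s) / bw) - norm.cdf((x - t) / bw)

    def design_matrix(x, tau):
        # columns Delta_phi(x, tau_{i-1}, tau_i), with tau_0 = -inf, tau_{k+1} = +inf
        knots = np.concatenate(([-np.inf], np.asarray(tau, dtype=float), [np.inf]))
        return np.column_stack([kernel_int(x, knots[i], knots[i + 1])
                                for i in range(len(knots) - 1)])

    def fit_k(x, y, k, grid):
        # restricted LSE over T_{k,R}: search k jump locations on a grid,
        # fit the k+1 levels b by linear least squares given the locations
        best_rss, best_tau, best_b = np.inf, None, None
        for tau in itertools.combinations(grid, k):
            X = design_matrix(x, tau)
            b, *_ = np.linalg.lstsq(X, y, rcond=None)
            rss = float(np.sum((y - X @ b) ** 2))
            if rss < best_rss:
                best_rss, best_tau, best_b = rss, tau, b
        return best_rss, best_tau, best_b

    def fit_penalized(x, y, lam, grid, k_max=5):
        # penalized LSE: minimize ||Y - Phi g||_n^2 + lam * (number of jumps)
        n = len(y)
        best_crit, best_fit = np.inf, None
        for k in range(k_max + 1):
            rss, tau, b = fit_k(x, y, k, grid)
            crit = rss / n + lam * k
            if crit < best_crit:
                best_crit, best_fit = crit, (tau, b)
        return best_fit

Applied to data simulated as in the introduction, fit_penalized(x, y, lam=0.01, grid=np.linspace(0.05, 0.95, 19)) should recover jump locations close to the true $\tau_i$; the grid search is exponential in $k$ and only meant to make the definitions concrete.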
If the number of jumps is unknown, a different estimate is needed. In this case, assume that the penalized least squares estimate $\hat f_{\lambda_n}$ satisfies $\hat f_{\lambda_n} \in T_{\infty,R}$ and is defined as any solution of
\[
\hat f_{\lambda_n} = \operatorname*{argmin}_{g \in T_{\infty,R}} \Big( \|Y - \Phi g\|_n^2 + \lambda_n J^\#(g) \Big),
\]
where $\lambda_n > 0$ is some smoothing parameter such that $\lambda_n \to 0$ as $n \to \infty$.

Asymptotic results

Before we state the main result, we first define the map and the $(2k+1) \times (2k+1)$ matrix $V$ by its entries. Here $h$ is the design density given by Assumption C. Now we are able to formulate the asymptotic result for the least squares estimator.
Theorem 3.1. Suppose Assumptions A, B and C are met. Let $\hat f_n$ and $V$ be given by (10) and (13), respectively. Set $\theta$ as the parameter vector of $f$ given in (3), and $\hat\theta_n$ as the corresponding vector of estimates defined by (10). Given (9) and model (1), then
\[
n^{1/2}(\hat\theta_n - \theta) \xrightarrow{\;d\;} N\big(0, \sigma^2 V^{-1}\big).
\]
The following theorem implies that the penalized and the restricted least squares estimates asymptotically coincide, i.e. the number of jumps in $T_\infty$ is asymptotically correctly estimated with probability one. In this sense the results of Theorem 3.1 can be applied to the penalized estimate $\hat f_{\lambda_n}$.
Theorem 3.2. Suppose condition (A1), (11) and the assumptions of Theorem 3.1 are satisfied. If $\lambda_n \to 0$ and $\lambda_n n^{1/(1+\epsilon)} \to \infty$ for some $\epsilon > 0$ as $n \to \infty$, then
\[
\lim_{n \to \infty} P\big( J^\#(\hat f_{\lambda_n}) = J^\#(f) \big) = 1.
\]
The proofs of Theorems 3.1 and 3.2 can be outlined as follows. For a known number of jumps, an entropy argument yields consistency of the least squares estimator. It is possible to represent the estimator as the minimizer of a stochastic process, which allows for a local stochastic expansion. This can be used to derive asymptotic normality. If the number of jumps is unknown, an imitation of techniques from empirical process theory shows that for a suitable choice of the smoothing parameter the case of an unknown number of jumps can asymptotically be reduced to the case where this number is known.
The details of the proofs are given in several steps in section 4.
The next theorem states that the rate given above is optimal in a minimax sense.
Theorem 3.3. Suppose Assumption B is met and $\varepsilon_1, \dots, \varepsilon_n$ are independent identically distributed normal random variables with zero mean and positive variance. Set $f_\theta$ as the step function with a single jump of height $b$ at location $\tau$, $\theta = (b, \tau)$. For arbitrary fixed design points $x_1, \dots, x_n \in [0,1]$ denote by $P^n_\theta$ the probability measure associated with the observations
\[
Y_i = \Phi f_\theta(x_i) + \varepsilon_i, \qquad i = 1, \dots, n.
\]
Then there exists some $c_0 > 0$, independent of $n$ and $x_1, \dots, x_n$, such that
\[
\liminf_{n \to \infty}\; \inf_{\hat\tau_n} \sup_{\theta} P^n_\theta\big( |\hat\tau_n - \tau| > c_0 n^{-1/2} \big) > 0,
\]
where the infimum runs over all estimators $\hat\tau_n$ of the jump location. The proof is given in section 6.

Remarks and Extensions
Remark 1. (Adaptive sampling). Theorem 3.3 states, for any fixed bounded kernel $\phi$ and any choice of design points, that faster rates of convergence than $n^{-1/2}$ are not possible. This is intuitively clear, as the convolution "spreads" the information about the jump location over the whole interval. As a consequence, adaptive sampling schemes (where the sampling point $x_i$ may depend on the data $Y_1, \dots, Y_{i-1}$) cannot lead to a faster rate of convergence than $n^{-1/2}$. This is in strict contrast to the case of direct regression ($\Phi = \mathrm{Id}$), where any polynomial rate of convergence can be achieved by an adaptive scheme (Lan et al. (2007)).

Remark 2. (Noisy Fredholm equations). All results of this paper can also be shown for more general integral operators of the type $\Phi f = \int K(x, y) f(y)\,dy$ with an appropriate kernel $K$. In this case, $\phi(x - y)$ in definition (7) has to be replaced by $K(x, y)$, and Assumption B can be formulated in the same way.
Remark 3. (Singular kernels). If the assumption of boundedness of the integral kernel is dropped, faster rates than $O_P(n^{-1/2})$ for estimating the jump locations are possible. If $\phi$ has a singularity of order $O(|x|^{-\alpha})$ for $\alpha \in (0, 1)$, then a jump can be recovered at a rate of $O_P(n^{-1/\min(2, 3-2\alpha)})$.
Given a uniform design, these rates are minimax. For details see Boysen (2006).
This corresponds to findings of Neumann (1997) and Goldenshluger et al. (2006c), who also observe an elbow at $\beta = 1/2$ in the rate of convergence for recovering a change point in an inverse problem, if the Fourier transform of $\phi$ decreases at a rate of $|x|^{-\beta}$. Since a kernel with a singularity of order $|x|^{-\alpha}$ has a Fourier transform decaying at rate $|x|^{\alpha - 1}$, i.e. $\beta = 1 - \alpha$, it follows that the "elbow" for $\beta = 1/2$ can be identified with the elbow for $\alpha = 1/2$.
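The correspondence between the singularity order $\alpha$ and the Fourier decay $\beta$ is a standard computation (included here for convenience): for $0 < \alpha < 1$,
\[
\phi(x) = |x|^{-\alpha} \quad\Longrightarrow\quad \hat\phi(\xi) = 2\,\Gamma(1-\alpha)\sin(\pi\alpha/2)\,|\xi|^{\alpha - 1},
\]
so $\beta = 1 - \alpha$, and the rate $O_P(n^{-1/\min(2, 3-2\alpha)})$ of Remark 3 coincides with $O_P(n^{-1/\min(2, 2\beta+1)})$: the elbows at $\alpha = 1/2$ and $\beta = 1/2$ are one and the same.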

Proofs of Theorems 3.1 and 3.2
We start with some technical lemmata and give some entropy results on the spaces of interest, which are required to apply tools from empirical process theory to prove consistency of the estimates. Afterwards we give a local stochastic expansion of the minimized process and use this to derive asymptotic normality. Finally, we again imitate some techniques from empirical process theory to show that the penalized estimate asymptotically coincides with the restricted least squares estimate. Note that Assumption B is needed to ensure identifiability as well as positive definiteness of the asymptotic covariance matrix $V$.

Some technical lemmata
In order to gain some insight into the model, it is useful to have a closer look at the implications of Assumption B on the mapping Φ restricted to the space of step functions. The following lemma collects some properties of this mapping.
for some $C_0$ depending on $\phi$ and $\epsilon$ only. This proves (i).
Similarly we can show $\|\Phi f\|_2 \le C \|f\|_{L_2([-\epsilon, 1+\epsilon])}$ for $f \in T_k$, which gives continuity and hence (ii). As argued in the part on the assumptions on the kernel in section 2.2, (iii) follows from the linear independence of $\Delta_\phi(\cdot, \tau_i, \tau_{i+1})$.
To prove (iv), note that The following lemma provides a link between the empirical norm and the $L_2$ norm. Then

If additionally Assumption B is met
Proof. The proof is straightforward. For details see Boysen (2006) Lemma 7.2.

Entropy results
To show consistency of the estimates, we wish to apply results from empirical process theory. To this end, let us first introduce some additional notation (cf. van de Geer, 2000).
Given a measure $Q$, a set of $Q$-measurable functions $G$ and a real number $\delta > 0$, define the $\delta$-covering number $N(\delta, G, Q)$ as the smallest value of $N$ for which there exist functions $g_1, \dots, g_N$ such that for every $g \in G$ there is a $j \in \{1, \dots, N\}$ with $\|g - g_j\|_{L_2(Q)} \le \delta$. Moreover, define the $\delta$-entropy $H$ of $G$ as $H(\delta, G, Q) := \log N(\delta, G, Q)$. If $Q$ is the Lebesgue measure we will write $H(\delta, G)$ and $N(\delta, G)$ instead of $H(\delta, G, Q)$ and $N(\delta, G, Q)$. Given design points $x_1, \dots, x_n \in \mathbb{R}$, the empirical measure will be denoted by $Q_n = n^{-1}\sum_{i=1}^n \delta_{x_i}$. Note that $\|\cdot\|_n$ is the norm corresponding to the space $L_2(\mathbb{R}, Q_n)$.
Finally, define the entropy integral $J(\delta, G, Q) := \int_0^\delta H^{1/2}(u, G, Q)\,du$. Note that for our purposes, the relevant quantity is the entropy of the space $G_{k,R}(\Phi) := \{\Phi g : g \in T_{k,R}\}$, equipped with the empirical norm. However, it is convenient to first calculate the entropy of $(T_{k,R}, \|\cdot\|_{L_2([a,b])})$ and then use Lemma 4.1 to infer on the space $G_{k,R}$.
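For orientation, the entropy bound to be established has the form (the derivation sketch is ours; it is consistent with the bound $Ck(1 + \log(Rk/u))$ used in the proof of Corollary 4.18 below):
\[
H\big(\delta, T_{k,R}, \|\cdot\|_{L_2([a,b])}\big) \le C\,k\Big(1 + \log\frac{Rk}{\delta}\Big),
\]
obtained by discretizing the $k$ jump locations on a grid of mesh of order $\delta^2/(kR^2)$ and the $k+1$ levels on a grid of mesh of order $\delta$: the resulting covering number is polynomial in $Rk/\delta$ with degree proportional to $k$.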
Proof. Define the sets, where $c_1, c_2$ will be defined later, and the function class $H(\delta)$. Now for $g_0 \in T_{k,R}$ we can choose $g \in H(\delta)$ such that $d(J(g), J(g_0)) \le c_1\delta^2/2$, and such that for any $x \in [a,b]$ with $d(x, J(g)) > c_1\delta^2/2$ we have $(g_0(x) - g(x))^2 \le c_2^2\delta^2/4$. Since $g_0$ has $k$ jumps between $a$ and $b$, we get that $H(\delta)$ is a $\delta$-covering of $(T_{k,R}, \|\cdot\|_{L_2([a,b])})$. Since the cardinality of $H(\delta)$ is of the required order, the claim is proved.
We will now use the assumptions on the operator $\Phi$ or, to be more precise, Lemma 4.1, to deduce bounds on the entropy of the space $G_{k,R}(\Phi)$. Corollary 4.5. Assume $\Phi$ satisfies Assumption B. There exists a constant $C_2$ independent of $n$, $k$ and $R$ such that for $f, g \in T_k$. Assume $H(\delta)$ is a $\delta$-covering of $(T_{k,R}, \|\cdot\|_{L_2([a,b])})$ for every $\delta > 0$.
Then $H(\delta/C_0)$ is a $\delta$-covering of $G_{k,R}(\Phi)$. Consequently, the claim follows from Lemma 4.3.
Again, this implies that the space $G_{k,R}(\Phi)$ equipped with the empirical norm $\|\cdot\|_n$ is compact. Consequently the functional $\|\cdot - Y\|_n$ has a minimizer in $G_{k,R}(\Phi)$ for every $k$. As $\lambda_n J^\#(\cdot)$ is strictly increasing in the number of jumps for every $\lambda_n > 0$, this implies the following lemma.

Consistency
To deduce consistency of the jump estimates from the $L_2$ consistency of the function estimator, a result on the dependence of $d(J(f), J(g))$ on the $L_2$ distance of $f$ and $g$ is needed. This is given by the following lemma.
In order to show consistency of $\hat f_n$, we first prove the consistency of $\Phi \hat f_n$. To this end we require the following result, which follows directly from the proof of Theorem 4.8, page 56 in van de Geer (2000).
Proof. Use (9) and $Y = \Phi f + \varepsilon$ to obtain since $f - \hat f_n \in T_{2k,2R}$. By Corollary 4.5 is continuous if $\Omega$ is compact. This gives continuity of $\Phi^{-1}$ as a mapping from This allows us to infer the consistency of the parameter estimates. The following corollary is a direct consequence of Lemmas 4.7 and 4.9.

Asymptotic normality
To show asymptotic normality for M-estimators, it is common to assume existence of the derivative of the function which is minimized. However, as φ is allowed to have discontinuities, a less restrictive result is needed.
As discussed in Chapter 5.3 of van der Vaart (1998) it is sufficient to assume existence of a second order Taylor-type expansion. Following this idea, the next theorem gives the asymptotic normality of the minimizer of a process Z n (θ), provided it allows for a certain expansion. It is similar to Theorem 5.23 of van der Vaart (1998), but also covers the case of non i.i.d. random variables, which is required for the fixed design.
Theorem 4.11. Assume $\Theta \subset \mathbb{R}^d$ is open and $\theta_0 \in \Theta$. Let $(Z_n(\theta))_{\theta \in \Theta}$ be a stochastic process. Assume there exist a sequence of random variables $(W_n)_{n \in \mathbb{N}} \subset \mathbb{R}^d$ and a positive definite matrix $V \in \mathbb{R}^{d \times d}$ such that If $\hat\theta_n$ is a consistent estimator of $\theta_0$ and $\hat\theta_n$ is an approximate minimizer of $Z_n$, i.e.
Proof. The proof is straightforward and similar to the case when the second derivatives exist. For details see Boysen (2006) Theorem 7.12.
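The mechanics behind Theorem 4.11 can be summarized in a heuristic sketch (our notation; the precise form of the expansion is condition (16)):
\[
Z_n(\theta) = Z_n(\theta_0) + (\theta - \theta_0)^\top W_n + \tfrac{1}{2}(\theta - \theta_0)^\top V (\theta - \theta_0) + R_n(\theta),
\]
with $R_n$ negligible in the sense of (17). Minimizing the quadratic part gives $\hat\theta_n - \theta_0 \approx -V^{-1} W_n$, so if $n^{1/2} W_n \to_d N(0, \Sigma)$ by a central limit theorem, then $n^{1/2}(\hat\theta_n - \theta_0) \to_d N(0, V^{-1}\Sigma V^{-1})$; Theorem 3.1 is this statement for the particular $W_n$ and $V$ produced by the expansion of Lemma 4.13.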
A second order expansion for the minimized process

To derive an expansion of type (16) for the problem in (9), let us first introduce some notation.
Assume that $f$ and the estimate $\hat f_n$ as defined by (9) are given by respectively. By definition of $Z_n(\hat b, \hat\tau)$ it is clear that To obtain an expansion for $Z_n(\hat b, \hat\tau)$, first examine the difference $g(x, b, \tau) - g(x, \hat b, \hat\tau)$.
First assume that $\hat\tau_j \ge \tau_j$ and $\phi$ is continuous on The same holds for $\hat\tau_j < \tau_j$. Note that Similarly, Remember $\tau_0 = \hat\tau_0$ and $\tau_{k+1} = \hat\tau_{k+1}$, and combine the preceding results to obtain where $R_n$ satisfies condition (17), $\Delta$ is given by (20) and $V$ is the $(2k+1) \times (2k+1)$ matrix defined by (13). Moreover Before we give the proof, we need the following result on the number of design points contained in a sequence of intervals.
Lemma 4.14. If the design points $x_1, \dots, x_n$ satisfy Assumption C, then for any two sequences $a_n, b_n$, $n \in \mathbb{N}$, with $0 \le a_n < b_n \le 1$, we have $n^{-1}\#\{i : x_i \in [a_n, b_n]\} = O_P(|b_n - a_n| + n^{-1/2})$.
Proof. The proof is straightforward, using that $H(x) = \int_0^x h(y)\,dy$ is strictly monotone and that by Assumption C Proof of Lemma 4.13. Expand (18) to obtain Note that the last term equals $Z_n(b, \tau)$. We will first estimate the second term of (21). Denote the points of discontinuity of $\phi$ by $J(\phi) = \{\vartheta_1, \dots, \vartheta_{\#J(\phi)}\}$ with $\vartheta_1 < \vartheta_2 < \dots < \vartheta_{\#J(\phi)}$. This means By Lemma 4.14, the number of design points in each of the relevant intervals is of order $O_P(n|\hat\tau_j - \tau_j| + n^{1/2})$. This gives Use Lemma 4.12 and the results above to obtain where $V$ is given by (13). The remainder terms clearly satisfy condition (17).
Next, examine the first term of (21). Set The second term is clearly $o_P(\|\Delta\|^2)$.
Lemma 4.15. Given the Assumptions C and B, the matrix V defined by (13) is positive definite.

Proof of Theorem 3.1
The proof of the main theorem is now a direct consequence of the results given above. Part (v) follows directly from the proof of Lemma 4.13.
Proof of part (i). Corollary 4.10 implies $\theta - \hat\theta_n = o_P(1)$. By relation (19) and Lemma 4.13 the assumptions of Theorem 4.11 are satisfied. The claim follows by application of this theorem. Proof of part (ii). By Lemma 4.12, since $\nu(x)$ is bounded. This proves the claim.
Proof of part (iv) and part (iii). Note that This proves part (iv). Part (iii) follows by application of Lemma 4.7.

Proof of Theorem 3.2
In this section we analyze the case where the number of jumps is unknown.
In order to reconstruct the number of jumps correctly, it is helpful to use a penalty function which is strictly increasing in the number of jumps. Any penalty term which depends on the number of jumps only is not a pseudonorm on $T_{\infty,R}$, since $\#J(\lambda f) = \#J(f)$ for $\lambda \neq 0$. Hence, the standard results from empirical process theory do not apply. However, it is possible to use similar techniques in the proofs.
The fact that $\hat f_{\lambda_n}$ (approximately) minimizes the penalized $L_2$ functional implies that for any $f \in T_{\infty,R}$, we get that
Theorem 4.16. Suppose Assumption A is met and the error satisfies (A1).
Assume $\sup_{g \in G} \|g\|_n \le R$. There exists a constant $C$ depending only on Assumption (A1), such that for all $\delta > 0$ satisfying Proof. See Lemma 3.2, page 29 in van de Geer (2000).
A bound of this type can be obtained from the following exponential inequality. Corollary 4.18. There exist constants $c_1, c_2 > 0$ such that for all $t \ge c_1 n^{-1/2}$ we have Proof. Set $G_{k,R}(\Phi) = \{\Phi g : g \in T_{k,R}\}$. By Corollary 4.5 there exists a constant $C > 0$, independent of $u$, $k$, $R$ and $n$, such that
\[
H\big(u, G_{k-1,R}(\Phi), Q_n\big) \le Ck\Big(1 + \log\frac{Rk}{u}\Big).
\]
where $C_1$ is some finite constant independent of $k$ and $\delta$. By Theorem 4.16 there exists some constant $C_2$ depending on the subgaussian error condition (A1) only. Consequently, for all $t \ge C_2 C_1 n^{-1/2}$ we have that We arrive at Splitting this sum at $s_R := \lceil (1 + \log R)/\log 2 \rceil$ gives Here $C_3, C_4, C_5, C_6$ are constants depending on $C_1$, $C_2$ and $R$ only. The last inequality holds by $t^2 n \ge C_1^2 C_2^2$. Since the constant $C_6$ does not depend on $k$, the exponential inequality also holds if we additionally take the supremum over all $k$. This proves the claim.
With the help of Lemma 4.7, it follows that $d(J(\hat f_{\lambda_n}), J(f)) = o_P(1)$, which in turn Assume $n_k$ is a subsequence such that $\|\Phi \hat f_{\lambda_{n_k}} - \Phi f\|_{n_k}^{1-\epsilon} \ge c\, n_k^{-1/2}$ for some $c > 0$. Dividing the last equation by $\|\Phi \hat f_{\lambda_{n_k}} - \Phi f\|_{n_k}^{1-\epsilon}$ gives This yields

Moreover, by (27)
Combine the last two equations to obtain Now assume $n_k$ is a subsequence such that $\|\Phi \hat f_{\lambda_{n_k}} - \Phi f\|_{n_k}^{1-\epsilon} < c\, n_k^{-1/2}$ for some $c > 0$. Application of Corollary 4.18 to (24) and the observation that $J^\#(g) \ge 1$ for all $g$ gives As each sequence can be decomposed into a subsequence containing only elements smaller than $cn^{-1/2}$ and a subsequence containing only elements greater than or equal to $cn^{-1/2}$ for some $c > 0$, we have shown that $J^\#(\hat f_{\lambda_n}) \ge J^\#(f)$ implies (28).
To this end, assume there exists some subsequence $n_k$ such that and Together with (28), the assumption $\lambda_{n_k} n_k^{1/(1+\epsilon)} \to \infty$ and (29), this gives for $n \to \infty$. This proves the claim.

Proof of Theorem 2.1
To give the proof of Theorem 2.1, part (i), we will define the native Hilbert space $N_\phi$ of a positive definite function $\phi$ and show that the point evaluations and the integral means $\rho_{x,y}(f) = \int_x^y f(t)\,dt$ are linearly independent as elements of its dual space, if $\phi$ has certain properties. Then we will deduce that the functions $\Delta_\phi(\cdot, \tau_0, \tau_1), \dots, \Delta_\phi(\cdot, \tau_k, \tau_{k+1})$ are linearly independent.
Note that Lemma 5.1 implies that for any interval $(a, b) \subset \Omega$ there exists some test function $\psi \in N_\phi(\Omega)$ satisfying $\mathrm{supp}(\psi) = [a, b]$; one example is given below. This observation can be used to show that point evaluation and integral mean are linearly independent as elements of the dual space of $N_\phi(\Omega)$.
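One concrete such test function (our choice for illustration, not necessarily the example used in the original; under the polynomial lower bound on $\hat\phi$ from Theorem 2.1(i), membership in $N_\phi(\Omega)$ follows because the Fourier transform of this function decays faster than any polynomial) is the classical bump function
\[
\psi_{a,b}(x) = \begin{cases} \exp\Big( -\dfrac{1}{(x - a)(b - x)} \Big), & x \in (a, b), \\[4pt] 0, & \text{otherwise}, \end{cases}
\]
which is $C^\infty$ on $\mathbb{R}$ with $\mathrm{supp}(\psi_{a,b}) = [a, b]$.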

A lower bound for estimating the jump locations
In this section we show that the obtained rate $d(J(\hat f_n), J(f)) = O_P(n^{-1/2})$ is optimal in a minimax sense. To do so, we construct functions $f_0, f_{1,n}, f_{2,n}$ with $d(J(f_0), J(f_{i,n})) = c n^{-1/2}$ for $i = 1, 2$ and some $c > 0$ to be chosen later. Given the observations $Y_i = g(x_i) + \varepsilon_i$, $i = 1, \dots, n$, for $g \in \{\Phi f_0, \Phi f_{1,n}, \Phi f_{2,n}\}$ and $\varepsilon_1, \dots, \varepsilon_n$ independent and identically distributed according to $N(0, \sigma^2)$ with $\sigma^2 > 0$, we show that for any estimator the probability of choosing the true function is strictly smaller than one. Obviously it is sufficient to consider the case of a single jump with a fixed jump height.
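Why a bounded kernel caps the rate at $n^{-1/2}$ can be seen from a short Kullback-Leibler computation (our sketch; the constants are illustrative). If $f_{1,n}$ differs from $f_0$ only in that a jump of height $b$ is moved from $\tau$ to $\tau + cn^{-1/2}$, then for all $x$
\[
|\Phi f_{1,n}(x) - \Phi f_0(x)| = \Big| b \int_{\tau}^{\tau + cn^{-1/2}} \phi(x - y)\,dy \Big| \le |b|\,\|\phi\|_\infty\, c\, n^{-1/2},
\]
so that for Gaussian errors the Kullback-Leibler divergence of the associated product measures satisfies
\[
K\big(P^n_{f_{1,n}}, P^n_{f_0}\big) = \frac{1}{2\sigma^2}\sum_{i=1}^n \big(\Phi f_{1,n}(x_i) - \Phi f_0(x_i)\big)^2 \le \frac{b^2 \|\phi\|_\infty^2 c^2}{2\sigma^2} = O(1),
\]
whatever the design points are. By standard two-point testing arguments the two jump locations can therefore not be distinguished with probability tending to one, which yields the $n^{-1/2}$ lower bound and at the same time explains Remark 1: the bound holds for adaptively chosen designs as well.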
For the proof we need the following theorem.
Note that in the proof we used the absolute integrability and the boundedness in supremum norm of the integral kernel only.
Proof of Theorem 3.3. Lemma 6.1 directly implies the lower bound for estimating the jump locations. If $f$ is a step function with known jump locations and unknown level heights $b_i$, the inverse regression model (1) reduces to a standard linear regression model. It is well known that in this setting the levels $b_i$ cannot be estimated at a rate faster than $O_P(n^{-1/2})$. Consequently, this also holds for the case of unknown jump locations. This proves Theorem 3.3.