Extensions of smoothing via taut strings

Suppose that we observe independent random pairs $(X_1,Y_1)$, $(X_2,Y_2)$, \ldots, $(X_n,Y_n)$. Our goal is to estimate regression functions such as the conditional mean or $\beta$--quantile of $Y$ given $X$, where $0<\beta<1$. In order to achieve this we minimize criteria such as, for instance, $$ \sum_{i=1}^n \rho(f(X_i) - Y_i) + \lambda \cdot \mathop{\rm TV}\nolimits(f) $$ among all candidate functions $f$. Here $\rho$ is some convex function depending on the particular regression function we have in mind, $\mathop{\rm TV}\nolimits(f)$ stands for the total variation of $f$, and $\lambda>0$ is some tuning parameter. This framework is extended further to include binary or Poisson regression, and to include localized total variation penalties. The latter are needed to construct estimators adapting to inhomogeneous smoothness of $f$. For the general framework we develop noniterative algorithms for the solution of the minimization problems which are closely related to the taut string algorithm (cf. Davies and Kovac, 2001). Further we establish a connection between the present setting and monotone regression, extending previous work by Mammen and van de Geer (1997). The algorithmic considerations and numerical examples are complemented by two consistency results.

Suppose that the observation Y_i at the design point x_i has distribution function F(· | x_i) for some unknown family of distribution functions F(· | x), x ∈ R. Often one is interested in certain features of these distribution functions. Examples are the mean function µ with µ(x) = ∫ y F(dy | x) and, for some β ∈ (0, 1), the β-quantile function Q_β, where Q_β(x) is any number z such that F(z− | x) ≤ β ≤ F(z | x). This paper treats estimation of such regression functions utilizing certain roughness penalties. The literature about penalized regression estimators is vast and still growing. As a good starting point we recommend Antoniadis and Fan (2001), van de Geer (2001), Huang (2003), and the references therein. A first possibility is to minimize a functional of the form

T(f) = Σ_{i=1}^n ρ(f(x_i) − Y_i) + λ · TV(f)    (1)

over all functions f on the real line. Here ρ is some convex function measuring the size of the residual Y_i − f(x_i) and depending on the particular feature we have in mind. Moreover, TV(f) denotes the total variation of f, that is, the supremum of Σ_{j=1}^{m−1} |f(z_{j+1}) − f(z_j)| over all integers m > 1 and numbers z_1 < z_2 < ··· < z_m, while λ > 0 is some tuning parameter.
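For a function evaluated at the ordered design points, the total variation reduces to the sum of absolute successive differences, so the criterion (1) is straightforward to evaluate. The following sketch (an illustration of ours; the function names are not from the paper) computes both quantities:

```python
# Total variation of a function represented by its values at ordered
# design points: for piecewise constant interpolation, the supremum in
# the definition is attained at the design points themselves, so TV
# reduces to the sum of absolute successive differences.

def total_variation(f):
    """TV of the vector f = (f(x_1), ..., f(x_n))."""
    return sum(abs(f[i + 1] - f[i]) for i in range(len(f) - 1))

def penalized_criterion(f, y, lam, rho):
    """T(f) = sum_i rho(f_i - Y_i) + lam * TV(f), cf. (1)."""
    return sum(rho(fi - yi) for fi, yi in zip(f, y)) + lam * total_variation(f)

print(total_variation([0.0, 1.0, 0.5, 0.5, 2.0]))  # 3.0
```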
Example I (means). In order to estimate the mean function µ, one can take ρ(z) := z²/2.
This particular case has been treated in detail by Mammen and van de Geer (1997) and Davies and Kovac (2001); see also the remark following Lemma 2.2. In particular, the latter authors describe an algorithm with running time O(n), the taut string method, to minimize the functional T above.
Example II (quantiles). For the estimation of a quantile function Q_β one can take ρ(z) = ρ_β(z) := |z|/2 − (β − 1/2)z, that is, ρ_β(z) = (1 − β)z for z ≥ 0 and ρ_β(z) = −βz for z < 0. Of particular interest is the case β = 1/2. Then ρ(z) = |z|/2, and Q_{1/2} is the conditional median function. This particular functional has also been suggested by Simpson, He and Liu in their discussion of Chu et al. (1998) but, as far as we know, has not been considered in more detail later on. However, similar functionals using a discretisation of the total variation of the first derivative as a penalty have been studied by Koenker, Ng and Portnoy (1994) and, in two dimensions, by Koenker and Mizera (2004). They employ linear programming techniques such as interior point methods to find solutions to the resulting minimisation problems. A primary goal of the present paper is to extend the classical taut string algorithm to other situations such as Example II, or binary and Poisson regression. Compared to the linear programming techniques mentioned above, the generalized taut string method has the advantage of being computationally faster and more stable. In the specific case of Example II it is possible to calculate a solution in time O(n log(n)). Note that the original algorithm yields piecewise constant functions. On each constant interval the function value is equal to the mean of the corresponding observations, except for local extrema of the fit. In their discussion of Davies and Kovac (2001), Mammen and van de Geer mention the possibility of replacing sample means by sample quantiles in order to treat Example II. However, the present authors realized that this extension is not that straightforward.
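The loss ρ_β is the familiar check function of quantile regression up to an affine reparametrization. A small sketch (our own illustration, not the paper's code) verifies that an empirical β-quantile minimizes the corresponding risk; since the risk is piecewise linear in c, it suffices to scan the observations:

```python
def rho_beta(z, beta):
    # rho_beta(z) = |z|/2 - (beta - 1/2) z, i.e. (1 - beta) z for z >= 0
    # and -beta z for z < 0.
    return abs(z) / 2 - (beta - 0.5) * z

def empirical_risk(c, y, beta):
    # Risk of the constant fit c, i.e. sum_i rho_beta(c - Y_i).
    return sum(rho_beta(c - yi, beta) for yi in y)

y = [3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0]
# The risk is piecewise linear in c, so a minimizer is attained at one
# of the observations; scan them to find it.
best = min(y, key=lambda c: empirical_risk(c, y, 0.5))
print(best)  # 3.0, an empirical median of y
```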
The remainder of this paper is organized as follows: In Section 2 we describe an extension of the functional T above which also covers other models such as binary and Poisson regression. In addition we replace the penalty term λ·TV(f) by a more flexible roughness measure which allows local adaptation to varying smoothness of the underlying regression function. Then we derive necessary and sufficient conditions for a function f̂ to minimize our functional. In that context we also establish a connection to monotone regression which is useful for understanding adaptivity properties of the procedure. This generalizes findings of Mammen and van de Geer (1997) for the least squares case. In Sections 3 and 4 we derive generalized taut string algorithms, extending the algorithm described by Davies and Kovac (2001). While Section 3 covers continuously differentiable functions ρ, Section 4 treats general ρ and, in particular, Example II. Section 5 explains how our tuning parameters, e.g. λ in (1), may be chosen. Section 6 presents some numerical examples of our methods. In Section 7 we complement the algorithmic considerations with two consistency results which are of independent interest. One of them entails uniform consistency of monotone regression estimators, while the other applies to arbitrary estimators such that the corresponding residuals satisfy a certain multiscale criterion. Both results are a first step towards a detailed asymptotic analysis of taut string and related methods. All longer proofs are deferred to Section 8.

The general setting
For simplicity we assume throughout this paper that x_1 < x_2 < ··· < x_n. In Section 3.3 we describe briefly possible modifications to deal with potential ties among the x_i.

The target functional
We often identify a function f on {x_1, ..., x_n} with the vector f = (f_i)_{i=1}^n ∈ R^n of its values f_i = f(x_i). Our aim is to minimize functionals of the form

T(f) = T_λ(f) := Σ_{i=1}^n R_i(f_i) + Σ_{j=1}^{n−1} λ_j |f_{j+1} − f_j|

over all vectors f ∈ R^n, where λ ∈ (0, ∞)^{n−1} is a given vector of tuning parameters while the R_i are random functions depending on the data. In general we assume two conditions, (A.1) and (A.2), to be satisfied. Condition (A.1) entails that T is a continuous and convex functional on R^n, so the additional Condition (A.2) guarantees the existence of a minimizer f̂ of T. This will be our estimator for the regression function of interest, evaluated at the design points x_i. The special functional T in (1) corresponds to λ_1 = ··· = λ_{n−1} = λ and R_i(z) = ρ(z − Y_i).

Example III (Poisson regression). Suppose that Y_i ∈ {0, 1, 2, ...} has a Poisson distribution with mean µ(x_i) > 0, and let R_i(z) := exp(z) − Y_i z, i.e. minus the log-likelihood in the parameter z = log µ(x_i), up to an additive constant. These functions are strictly convex with R_i ≥ Y_i log(e/Y_i). Thus T is even strictly convex, and elementary considerations reveal that it is coercive if Y_i > 0 for at least one index i. In that case we end up with a unique penalized maximum likelihood estimator f̂ of log µ.
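For the Poisson case, the negative log-likelihood in the parametrization z = log µ is R_i(z) = exp(z) − Y_i z (up to an additive constant); this is consistent with the stated lower bound R_i ≥ Y_i log(e/Y_i), which is attained at z = log Y_i. A quick numerical check (our own sketch):

```python
import math

def R_poisson(z, y):
    # Negative Poisson log-likelihood, up to the constant log(y!),
    # in the parametrization z = log(mean).
    return math.exp(z) - y * z

# For y > 0, the minimum over z is attained at z = log(y), with value
# y - y*log(y) = y*log(e/y), matching the stated lower bound.
y = 4.0
zmin = math.log(y)
print(abs(R_poisson(zmin, y) - y * math.log(math.e / y)) < 1e-12)  # True
```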
Example IV (Binary regression). Similarly let Y_i ∈ {0, 1} with mean µ(x_i) ∈ (0, 1), and define R_i(z) := log(1 + exp(z)) − Y_i z. Here R_i > 0, and again T is strictly convex. It is coercive if the Y_i are not all identical, and the minimizer f̂ of T may be viewed as a penalized maximum likelihood estimator of logit(µ) := log(µ/(1 − µ)).
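For the binary case, the negative Bernoulli log-likelihood in the logit parametrization is R_i(z) = log(1 + exp(z)) − Y_i z, which is positive for both outcomes. A quick sanity check (our own sketch):

```python
import math

def R_binary(z, y):
    # Negative Bernoulli log-likelihood in the logit parametrization:
    # P(Y = 1) = e^z / (1 + e^z), so -log P(Y = y) = log(1 + e^z) - y*z.
    return math.log1p(math.exp(z)) - y * z

# R_i > 0 for both outcomes, as claimed:
print(all(R_binary(z, y) > 0 for z in (-3.0, 0.0, 3.0) for y in (0, 1)))  # True
```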

Characterizations of the solution
As mentioned before, Conditions (A.1-2) guarantee the existence of a minimizer f̂ of T. In the present subsection we derive various characterizations of such minimizers, assuming only Condition (A.1).
By convexity of T , a vectorf ∈ R n minimizes T if, and only if, all directional derivatives atf are non-negative, i.e.
More specifically, let R′_i(z−) and R′_i(z+) be the left- and right-sided derivatives of R_i at z, i.e.

R′_i(z−) := lim_{h↓0} (R_i(z) − R_i(z − h))/h and R′_i(z+) := lim_{h↓0} (R_i(z + h) − R_i(z))/h.
Lemma 2.1 A vectorf ∈ R n minimizes T if, and only if, (3) holds for all 1 ≤ j ≤ k ≤ n.
In case of differentiable functions R_i there is a simpler characterization of a minimizer of T:

Lemma 2.2 Suppose that all functions R_i are differentiable on R. Then a vector f̂ ∈ R^n minimizes T if, and only if, for 1 ≤ k ≤ n,

Σ_{i=1}^k R′_i(f̂_i) ∈ [−λ_k, λ_k], with Σ_{i=1}^k R′_i(f̂_i) = λ_k sign(f̂_{k+1} − f̂_k) whenever f̂_{k+1} ≠ f̂_k,    (4)

where λ_n := 0. For k = n, it follows from λ_n = 0 that Condition (4) amounts to Σ_{i=1}^n R′_i(f̂_i) = 0. Thus our result entails Mammen and van de Geer's (1997) finding that the solution f̂ may be represented as the derivative of a taut string connecting the points (0, 0) and (n, Σ_{i=1}^n Y_i) and forced to lie within a tube centered at the points (k, Σ_{i=1}^k Y_i), 1 ≤ k < n. In the general setting treated here, there are no longer taut strings, but the solutions can still be characterized by a tube. This is illustrated in the left panels of Figure 1 with a small example. The upper panel shows a data set of size n = 25 (with x_i = i) and the approximation f̂ obtained from the functional T in (1) with ρ(z) := √(0.1² + z²) and λ = 2. This function ρ may be viewed as a smoothed version of |z|, with derivative ρ′(z) = z/√(0.1² + z²) similar to sign(z), so that R′_i(z) = ρ′(z − Y_i) is similar to sign(z − Y_i). The lower panel shows the cumulative sums of the "residuals" R′_i(f̂_i). As predicted by Lemma 2.2, these sums are always between −λ and λ, and they touch these boundaries whenever the value of f̂ changes.
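The smoothed absolute loss used in the Figure 1 example and its derivative can be sketched as follows (our own illustration); the derivative z/√(0.1² + z²) plays the role of a smooth sign function:

```python
import math

def rho_smooth(z, c=0.1):
    # Smoothed absolute value: sqrt(c^2 + z^2) ~ |z| once |z| >> c.
    return math.sqrt(c * c + z * z)

def rho_smooth_deriv(z, c=0.1):
    # Derivative z / sqrt(c^2 + z^2), a smooth surrogate for sign(z).
    return z / math.sqrt(c * c + z * z)

print(round(rho_smooth_deriv(1.0), 3), round(rho_smooth_deriv(-1.0), 3))  # 0.995 -0.995
```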

Bounding the range of the solutions
Sometimes it is helpful to know a priori some bounds for any minimizer f̂ of T. We start with a rather obvious fact: Suppose that there are numbers z_ℓ < z_r such that for i = 1, ..., n, the function R_i is strictly decreasing on (−∞, z_ℓ] and strictly increasing on [z_r, ∞). Then any minimizer of T belongs to [z_ℓ, z_r]^n. For if f ∈ R^n \ [z_ℓ, z_r]^n, one can easily verify that replacing f with (min(max(f_i, z_ℓ), z_r))_{i=1}^n yields a strictly smaller value of T(f). In case of differentiable functions R_i an even stronger statement is true:

Lemma 2.3 Suppose that the functions R_i are differentiable such that for certain numbers z_ℓ < z_r,

R′_i(z_ℓ) ≤ 0 for all i with at least one strict inequality,
R′_i(z_r) ≥ 0 for all i with at least one strict inequality.
Then any minimizer of T is contained in (z ℓ , z r ) n .
In the special case of R_i(z) = ρ(z − Y_i) we reach the following conclusions: If 0 is the unique minimizer of ρ over R, then any minimizer of T belongs to [min(Y_1, ..., Y_n), max(Y_1, ..., Y_n)]^n.
If in addition ρ is differentiable and Y = (Y_i)_{i=1}^n is non-constant, then any minimizer of T belongs even to the open cube (min(Y_1, ..., Y_n), max(Y_1, ..., Y_n))^n.
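The clipping argument behind these bounds is easy to check numerically. The sketch below (squared loss, toy data of our own) confirms that projecting a candidate vector onto the data range strictly decreases T when the candidate leaves that range:

```python
def T(f, y, lam):
    # T(f) = sum_i (f_i - Y_i)^2 / 2 + lam * sum_j |f_{j+1} - f_j|.
    fit = sum((fi - yi) ** 2 / 2 for fi, yi in zip(f, y))
    tv = sum(abs(f[j + 1] - f[j]) for j in range(len(f) - 1))
    return fit + lam * tv

def clip_to_range(f, y):
    # Replace f_i with min(max(f_i, min Y), max Y).
    lo, hi = min(y), max(y)
    return [min(max(fi, lo), hi) for fi in f]

y = [1.0, 2.0, 0.5, 3.0]
f = [4.0, -1.0, 2.0, 5.0]   # leaves the data range [0.5, 3.0]
print(T(clip_to_range(f, y), y, lam=1.0) < T(f, y, lam=1.0))  # True
```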

A link to monotone regression
An interesting alternative to smoothness assumptions and roughness penalties is to assume monotonicity of the underlying regression function f on certain intervals. For instance, if f is assumed to be isotonic, one could determine an estimator f̂ minimizing Σ_{i=1}^n R_i(f_i) under the constraint that f̂_1 ≤ f̂_2 ≤ ··· ≤ f̂_n. The next theorem shows that our penalized estimators f̂ often coincide locally with monotone estimators.
An analogous statement holds for antitonic fits.
Computation in case of regular functions R_i

A general algorithm for strongly coercive functions R_i

In this subsection we present an algorithm for the minimization of the functional T in case of (A.1) together with the additional constraint that

(A.3) each R_i is differentiable with lim_{z→−∞} R′_i(z) = −∞ and lim_{z→∞} R′_i(z) = ∞.

Obviously the latter condition is satisfied in Example I. Note that Conditions (A.1) and (A.3) imply Condition (A.2). Moreover, R′_i : R → R is isotonic (i.e. non-decreasing), continuous and surjective.
The algorithm's principle. The idea of our algorithm is to compute inductively for K = 1, 2, ..., n a vector (f̂_i)_{i=1}^K such that Condition (4) holds for 1 ≤ k ≤ K, where f̂_{K+1} may be defined arbitrarily.
Precisely, inductively for K = 1, 2, ..., n we compute two candidate vectors f = (f_i)_{i=1}^K and g = (g_i)_{i=1}^K in R^K such that at the end of step K three conditions, (C.1_K), (C.2_{f&g,K}) and (C.3_K), are satisfied. Moreover, f_i is antitonic (i.e. non-increasing) and g_i is isotonic in i ∈ {k_o + 1, ..., K}.
Note that Conditions (C.1_K) and (C.2_{f&g,K}) imply the following fact: When the algorithm finishes with K = n, a solution f̂ ∈ R^n may be obtained as follows: If k_o = n, then f̂ := f = g satisfies the conditions of Lemma 2.2. If k_o < n, we define f̂ as a suitable vector between f and g. This definition entails that f_i ≤ f̂_i ≤ g_i for all indices i, while f̂_i is constant in i > k_o. Hence one can easily deduce from Conditions (C.2_{f&g,n}) that f̂ satisfies the conditions of Lemma 2.2. To ensure a certain optimality property described later, in case of 1 ≤ k_o < n this choice is made in a particular way. Conditions (C.1_K) and (C.2_{f&g,K}) are illustrated in the right panels of Figure 1, which show the same example as the left panels discussed in the previous section. Strictly speaking, Condition (A.3) is not satisfied here, but one can enforce it by adding min(z − c_1, 0)² + max(z − c_2, 0)² to R_i(z) = ρ(z − Y_i) with arbitrary constants c_1 ≤ min(Y_1, ..., Y_n) and c_2 ≥ max(Y_1, ..., Y_n). This modification does not alter the solution, which is contained in [c_1, c_2]^n; see Section 2.3.
The solid and grey lines in the upper panel represent f and g, respectively, where K = 20 and k_o = 3. The lower panel shows the corresponding cumulative sums.

Some auxiliary functions and terminology. Later on we shall work with partitions of the set {1, 2, ..., n} into index intervals and functions (vectors) which are constant on these intervals. To define the latter vectors efficiently we set R′_jk(z) := Σ_{i=j}^k R′_i(z) and define

M_jk(t) := min{z ∈ R : R′_jk(z) ≥ t} and M̄_jk(t) := max{z ∈ R : R′_jk(z) ≤ t}.

These quantities M_jk(t) and M̄_jk(t) are isotonic in t, where R′_jk(z) = t for any real number z ∈ [M_jk(t), M̄_jk(t)].
The following lemma summarizes basic properties of the auxiliary functions M_jk which are essential for Algorithm I below. The functions M̄_jk satisfy analogous properties.
Initializing g̃. We set g̃_i := g_i for i ≤ K, together with a suitable new value g̃_{K+1}. Since g̃_i = g_i for all i ≤ K, one can easily verify that the inequality part of (C.2_{g,K+1}) is satisfied.

Modifying g̃. Suppose that g̃_i is not isotonic in i > k_o. By the previous construction of g̃, this means that the two rightmost segments {j, ..., k} and {ℓ, ..., K + 1} of g̃ have to be merged. This step is repeated, if necessary, until g̃_i is isotonic in i > k_o. One can easily deduce from Part (a) of Lemma 3.1 below that throughout these modifications the inequality part of Condition (C.2_{g,K+1}) continues to hold. The equality statements of Condition (C.2_{g,K+1}) are true as well, the only possible exception being k = k_o. This exceptional case may occur only if g̃_{k_o+1} < g_{k_o+1}, and by our construction of g̃, this entails g̃_i being constant in i ∈ {k_o + 1, ..., K + 1}.
Initializing f̃. We set f̃_i := f_i for i ≤ K, together with a suitable new value f̃_{K+1}.

Modifying f̃. Suppose that f̃_i is not antitonic in i > k_o. This means that the two rightmost segments {j, ..., k} and {ℓ, ..., K + 1} of f̃ have to be merged.
Hence the inequality part of Condition (C.2_{f,K+1}) continues to be satisfied. The equality statements of Condition (C.2_{f,K+1}) are true as well, the only possible exception being k = k_o. This exceptional case may occur only if f̃_{k_o+1} > f_{k_o+1}, and this entails that f̃_i is constant in i > k_o.
This step is illustrated in Figure 2, which shows again the example from Figure 1. Here K = 17, and the panels show from left to right the initialisation of f̃, the first modification of f̃, and the second and final modification. The panels in the upper row show the data and the approximations, while the panels in the lower row show the corresponding cumulative sums.

Final modification of f̃ and g̃. Having completed the previous construction, we end up with vectors f̃ and g̃ satisfying the inequality parts of Conditions (C.2_{f&g,K+1}). The equality parts are satisfied as well, with possible exceptions only for k = k_o. Suppose first that f̃_{k_o+1} ≤ g̃_{k_o+1}. Then one can easily deduce from (C.3_K) and the properties of f̃, g̃ just mentioned that Conditions (C.1_{K+1}) and (C.2_{f&g,K+1}) are satisfied.
One particular instance of the previous situation is that both f̃_i and g̃_i are constant in i > k_o. Now suppose that f̃_{k_o+1} > g̃_{k_o+1}. The previous considerations and our construction of f̃ and g̃ show that one of two cases occurs; we discuss only the former case, the latter being handled analogously. Here we redefine g̃ on the indices i > k_o in a suitable fashion. It may happen that Condition (C.1_{K+1}) is still violated, i.e. g̃_{k_1+1} < f̃_{k_1+1}. In that case we repeat the previous update of g̃ with k_1 in place of k_o and iterate this procedure until Condition (C.1_{K+1}) is satisfied as well.
This step is illustrated in Figure 3, which shows once more the example from Figure 1; the panels in the upper row show the data and the approximations, while the panels in the lower row show the corresponding cumulative sums.

An optimality property of Algorithm I. The solution f̂ produced by Algorithm I is as simple as possible in a certain sense. For a vector f ∈ R^n, an index interval {j, ..., k} ⊂ {1, 2, ..., n} with j > 1 or k < n is called a local maximum (resp. local minimum) of f if f is constant on {j, ..., k} with strictly smaller (resp. strictly larger) values at the neighboring indices.

Theorem 3.2 Let f̂ be the vector produced by Algorithm I, and let f be any vector in R^n satisfying the tube condition (5). Then max_{i∈J} f_i ≥ max_{i∈J} f̂_i for any local maximum J of f̂, and min_{i∈J} f_i ≤ min_{i∈J} f̂_i for any local minimum J of f̂.
In particular the theorem shows that every vector f satisfying the tube condition (5) must have at least as many local extreme values as f̂.

Exponential families
Examples III and IV may be generalized as follows: Suppose that Y_j has distribution P_{f(x_j)} for some unknown real parameter f(x_j), where (P_θ)_{θ∈R} is an exponential family with

P_θ(dy) = exp(θy − b(θ)) ν(dy)

for some measure ν on the real line.
Note that in general, b(·) is infinitely often differentiable with b′(θ) and b″(θ) being the mean and variance, respectively, of the distribution P_θ. We assume that b″(θ) is strictly positive for all θ ∈ R. Note also that ν(R \ [y_min, y_max]) = 0, where y_min and y_max are the infimum and supremum, respectively, of the set {b′(θ) : θ ∈ R}.
Now we consider minimization of minus the log-likelihood function plus the localized total variation penalty, i.e. of the functional T with R_i(z) := b(z) − Y_i z.
By assumption, these functions are infinitely often differentiable with first derivative R′_i(z) = b′(z) − Y_i and strictly positive second derivative R″_i(z) = b″(z). Hence they satisfy Condition (A.1). Unfortunately they may fail to be coercive individually, and Condition (A.3) is not satisfied in general. However, one can show that Condition (A.2) is satisfied as soon as Y is non-trivial in the sense that

Y_i > y_min for at least one i and Y_i < y_max for at least one i.    (6)

We will not pursue this here, because there is a simple solution of our estimation problem, at least in case of (6): At first we apply the usual taut string method to the observations Y_i, i.e. we replace R_i(z) with ρ(z − Y_i), where ρ(z) := z²/2. Let f̂^LS be the resulting least squares fit. It follows from Lemma 2.3 that f̂^LS ∈ (y_min, y_max)^n. Thus the vector f̂ with components f̂_i := (b′)^{−1}(f̂^LS_i) is well-defined, and R′_i(f̂_i) = b′(f̂_i) − Y_i = f̂^LS_i − Y_i. Hence applying Lemma 2.2 to f̂^LS in the least squares setting entails the analogous conditions for f̂ in the maximum likelihood setting. This shows that f̂ is indeed the unique minimizer of T.

In Step K + 1 we aim for vectors f̃ and g̃ in R^{i(K+1)}. The initial versions are given by the previous vectors f and g, suitably extended, while the remainder of Step K + 1 remains unchanged.

The case of arbitrary functions R i
In this section we describe how to calculate solutions in the general setting described in Section 2. In particular we investigate how to solve the quantile regression problem presented in Example II. Throughout this section we assume that Conditions (A.1-2) are satisfied, while some of the functions R_i may fail to satisfy the regularity condition (A.3).

Approximating the R i
We start with a general observation: For ε > 0 and i ∈ {1, 2, ..., n} let R_{i,ε} : R → R be a (data-driven) convex function such that

lim_{ε↓0} R_{i,ε}(z) = R_i(z) for any z ∈ R.    (7)
The corresponding approximation T_ε(f) = T_{λ,ε}(f) := Σ_{i=1}^n R_{i,ε}(f_i) + Σ_{j=1}^{n−1} λ_j |f_{j+1} − f_j| of T(f) has the following properties:

Theorem 4.1 For sufficiently small ε > 0, the set F̂_ε := arg min_{f∈R^n} T_ε(f) is nonvoid and compact. Moreover, it approximates the set F̂ := arg min_{f∈R^n} T(f) in the sense that max_{f∈F̂_ε} min_{g∈F̂} |f − g| → 0 as ε ↓ 0.

If we find approximations R_{i,ε} satisfying additionally Condition (A.3), we can minimize the target functional T(·) at least approximately by means of Algorithm I. For a suitable smoothing of the R_i one can easily verify (7) and Condition (A.3) with R_{i,ε} in place of R_i.
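One concrete way to smooth the check loss of Example II (an illustration of ours, not necessarily the paper's choice of R_{i,ε}) replaces |z| with √(ε² + z²); pointwise convergence (7) then holds with explicit rate ε/2:

```python
import math

def rho_beta(z, beta):
    # Check loss of Example II: |z|/2 - (beta - 1/2) z.
    return abs(z) / 2 - (beta - 0.5) * z

def rho_beta_eps(z, beta, eps):
    # Smooth surrogate: |z| replaced by sqrt(eps^2 + z^2), so that
    # 0 <= rho_beta_eps - rho_beta <= eps/2 pointwise.
    return math.sqrt(eps * eps + z * z) / 2 - (beta - 0.5) * z

# Pointwise convergence as eps -> 0, as required in (7):
z, beta = 0.7, 0.25
print(all(abs(rho_beta_eps(z, beta, 10.0 ** -k) - rho_beta(z, beta)) < 10.0 ** -k
          for k in range(1, 6)))  # True
```

Note that this smoothing alone does not enforce (A.3); as in Section 3, one can additionally add a term like min(z − c_1, 0)² + max(z − c_2, 0)² outside the data range.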

A non-iterative solution for Example II
Apparently the preceding considerations lead to an iterative procedure for minimizing T. But such a detour is not always necessary. In this section we derive an explicit combinatorial algorithm for Example II. Recall that here R_i(z) = ρ_β(z − Y_i). This indicates already that for quantile regression mainly the ranks of the vector Y matter. Precisely, we shall work with a permutation Z = (Z_i)_{i=1}^n of (1, 2, ..., n) such that Y_i is the Z_i-th smallest observation. That means, Z is a rank vector of Y, but without the usual modification in case of ties. The usefulness of this will become clear later. Solving the original problem with Z in place of Y would not be much easier. But now we replace the functions R_i with R̃_i(z) := ρ_β(z − Z_i) and denote the resulting functional by T̃.

Theorem 4.2 Suppose that ĝ minimizes T̃ over R^n. Then ĝ ∈ (β, n − 1 + β)^n. Furthermore, let f̂ ∈ R^n be given by f̂_i := Y_(⌈ĝ_i⌉), with the order statistics Y_(1) ≤ Y_(2) ≤ ··· ≤ Y_(n) of Y and ⌈a⌉ denoting the smallest integer not smaller than a ∈ R. Then f̂ minimizes T over R^n.

Algorithm II. Summarizing the preceding findings, we may compute the estimated β-quantile function by means of Algorithm I, applied with the functions R̃′_i in place of R′_i. Let us just comment on the corresponding auxiliary functions

M̃_jk(t) := min{z ∈ R : R̃′_jk(z) ≥ t}, M̃̄_jk(t) := max{z ∈ R : R̃′_jk(z) ≤ t}.
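The rank transform and a mapping back to the data scale via order statistics can be sketched as follows; this is a hypothetical illustration of the recipe in Theorem 4.2 (the precise display is garbled in this excerpt, and we assume a mapping of the form f̂_i = Y_(⌈ĝ_i⌉); the helper names are ours):

```python
import math

def rank_vector(y):
    # Permutation Z of (1, ..., n): Y_i is the Z_i-th smallest value;
    # ties are broken by the original index rather than by the usual
    # midrank modification.
    order = sorted(range(len(y)), key=lambda i: (y[i], i))
    z = [0] * len(y)
    for r, i in enumerate(order, start=1):
        z[i] = r
    return z

def map_back(g, y):
    # f_i = Y_(ceil(g_i)): translate a fit g on the rank scale back to
    # the data scale via the order statistics of Y.
    ys = sorted(y)
    return [ys[math.ceil(gi) - 1] for gi in g]

y = [3.1, 1.4, 1.4, 2.7]
print(rank_vector(y))                      # [4, 1, 2, 3]
print(map_back([3.2, 1.0, 1.6, 2.5], y))   # [3.1, 1.4, 1.4, 2.7]
```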

This entails that the values M̃_jk(t) and M̃̄_jk(t) can be read off from the order statistics of (Z_i)_{i=j}^k.
To implement this algorithm efficiently, one should use two additional vector variables Z^(f) and Z^(g) such that for each segment {j, ..., k} of f (resp. g), the corresponding part of Z^(f) (resp. Z^(g)) contains the order statistics of (Z_i)_{i=j}^k. Whenever two segments of f (resp. g) are merged, the vector Z^(f) (resp. Z^(g)) may be updated by a suitable version of MergeSort (Knuth, 1998).
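The merge step itself is elementary; combining the sorted order statistics of two adjacent segments costs one linear pass (our own sketch):

```python
def merge_sorted(a, b):
    # Merge step of MergeSort: combine the order statistics of two
    # adjacent segments in O(len(a) + len(b)) comparisons.
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] <= b[j]:
            out.append(a[i]); i += 1
        else:
            out.append(b[j]); j += 1
    out.extend(a[i:])
    out.extend(b[j:])
    return out

print(merge_sorted([1, 4, 7], [2, 3, 9]))  # [1, 2, 3, 4, 7, 9]
```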

Constant and fixed λ
Let us first discuss the simple case of a constant value λ > 0 for all λ_j. In Example I, let σ̂ be some consistent estimator of the standard deviation of the variables Y_i, assuming temporarily homoscedastic errors Y_i − µ(x_i). For instance, σ̂ could be the estimator proposed by Rice (1984) or the version based on the MAD by Donoho et al. (1995). Since R′_i(z) = z − Y_i, and since for large n the partial sum process of the standardized errors behaves similarly to a standard Brownian motion by virtue of Donsker's invariance principle, one could use

λ = c σ̂ n^{1/2}

for some constant c > 0. In our experience with simulated data, a value of c within [0.15, 0.25] often yielded satisfying results. In Example II, note that R′_i(z) = 1{Y_i < z} − β for z ≠ Y_i, so that for continuous distribution functions the variables ξ_i := 1{Y_i < Q_β(x_i)} are independent Bernoulli random variables ∈ {0, 1} with mean β, and Σ_{i=j}^k R′_i(Q_β(x_i)) = Σ_{i=j}^k (ξ_i − β) for 1 ≤ j ≤ k ≤ n. Since t ↦ n^{−1/2}(β(1 − β))^{−1/2} Σ_{i≤nt}(ξ_i − β) behaves asymptotically like a standard Brownian motion, too, we propose λ = c n^{1/2}(β(1 − β))^{1/2}.
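A difference-based scale estimator in the spirit of Rice (1984) is easy to sketch; the scaling λ = c σ̂ √n used below follows the Brownian-motion heuristic just described (the exact display is not fully legible in this excerpt, so treat the formula as an assumption):

```python
import math

def rice_sigma(y):
    # Difference-based estimator of the noise standard deviation
    # (Rice, 1984): sigma^2 ~ sum_i (Y_{i+1} - Y_i)^2 / (2 (n - 1)).
    n = len(y)
    s2 = sum((y[i + 1] - y[i]) ** 2 for i in range(n - 1)) / (2 * (n - 1))
    return math.sqrt(s2)

def tuning_lambda(y, c=0.2):
    # Global tuning parameter lambda = c * sigma_hat * sqrt(n); the
    # scaling is our assumption, motivated by the Brownian-motion
    # heuristic, with c in [0.15, 0.25] as suggested in the text.
    return c * rice_sigma(y) * math.sqrt(len(y))

y = [0.0, 1.0, 0.0, 1.0, 0.0]   # pure alternating "noise"
print(round(rice_sigma(y), 4))  # 0.7071
```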

Adaptive choice of the λ j
Let f be the unknown underlying regression function. Our goal is to find a "simple" vector (function) f̂ which is adequate for the data in the sense that the deviations between the data and f̂ satisfy a multiresolution criterion (Davies and Kovac, 2001), where we require the deviations between data and f̂ at different scales and locations to be no larger than we would expect from noise. More precisely, we require for each interval {j, ..., k} from a collection I_n of index intervals in {1, ..., n} that

η(f̂_j, ..., f̂_k) ≤ Σ_{i=j}^k R′_i(f̂_i) ≤ η̄(f̂_j, ..., f̂_k).    (10)

The bounds η(·) < 0 < η̄(·) to be specified later will be chosen such that the inequalities above are satisfied with high probability if f̂ is replaced with the true regression function f. A typical choice for I_n is the family of all n(n + 1)/2 such intervals {j, ..., k}. Computational complexity can be reduced by considering a smaller collection such as the family of all intervals with dyadic endpoints, {2^ℓ m + 1, ..., 2^ℓ(m + 1)}, where 0 ≤ ℓ ≤ ⌊log_2 n⌋ and 0 ≤ m ≤ ⌊2^{−ℓ}(n − 1)⌋. The difference in computational speed between these two choices for I_n is easily noticeable in practice. The effect on the resulting approximation f̂, however, is rather small. If a vector g does not approximate the data well on some interval J which is not part of the scheme with dyadic endpoints, then occasionally the multiscale criterion using the dyadic endpoints will consider g adequate where the criterion using all subintervals will notice the lack of fit. Since this effect is barely noticeable, we prefer smaller collections such as the family with dyadic endpoints, which was also used in the simulation study in Section 6. To obtain a vector f̂ satisfying (10) for all intervals in I_n, we propose an iterative method for the data-driven choice of the tuning parameters λ_j. This approach generalizes the local squeezing technique from Davies and Kovac (2001) to our general setting.
We start with some constant tuning vector λ^(1) = (Λ, Λ, ..., Λ), where Λ is chosen so large that the corresponding fit f̂^(1) is constant. Now suppose that we have already chosen tuning vectors λ^(1), ..., λ^(s), and the corresponding fits are denoted by f̂^(1), ..., f̂^(s). If f̂^(s) is still inadequate for the data, we define J^(s) to be the union of all intervals {j − 1, j, ..., k} such that {j, ..., k} is an interval in I_n violating (10) with f̂ = f̂^(s). Then for some fixed γ ∈ (0, 1), e.g. γ = 0.9, we define

λ^(s+1)_j := γ λ^(s)_j if j ∈ J^(s), and λ^(s+1)_j := λ^(s)_j otherwise.

One can easily derive from (3) that for sufficiently large s the fit f̂ = f̂^(s) does satisfy (10) for all {j, ..., k} ∈ I_n.
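The dyadic interval family and one local-squeezing update can be sketched as follows (the mapping from violating intervals to penalized indices reflects our reading of the definition of J^(s); the helper names are ours):

```python
def dyadic_intervals(n):
    # Family of index intervals {2^l m + 1, ..., 2^l (m + 1)} (1-based),
    # returned as (j, k) pairs with j <= k <= n.
    out = []
    l = 0
    while 2 ** l <= n:
        m = 0
        while 2 ** l * (m + 1) <= n:
            out.append((2 ** l * m + 1, 2 ** l * (m + 1)))
            m += 1
        l += 1
    return out

def squeeze(lam, violating, gamma=0.9):
    # One local-squeezing step: shrink lambda_j by the factor gamma for
    # every index j covered by some {j-1, ..., k} whose inner interval
    # {j, ..., k} violates the multiresolution criterion.
    bad = set()
    for j, k in violating:
        bad.update(range(max(j - 1, 1), k + 1))
    return [g * gamma if (idx + 1) in bad else g for idx, g in enumerate(lam)]

print(len(dyadic_intervals(8)))  # 15, i.e. 8 + 4 + 2 + 1
```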
Example I (continued). In this case the multiresolution criterion (10) inspects the sums of residuals on all intervals I ∈ I_n. If we assume additive and homoscedastic Gaussian white noise, possible choices for η(·) and η̄(·) are

∓ c σ̂ ((k − j + 1) · 2 log n)^{1/2} and ∓ σ̂ ((k − j + 1)(2 log(n/(k − j + 1)) + c))^{1/2}

for some c ≥ 0. The first proposal coincides exactly with the local squeezing technique by Davies and Kovac (2001). The second one is motivated by results of Dümbgen and Spokoiny (2001).
Example II (continued). If we assume that Y_i = f_i + ε_i, where ε_1, ..., ε_n are independent with β-quantile 0 and continuous distribution function, then both Σ_{i=j}^k 1{Y_i < f_i} and Σ_{i=j}^k 1{Y_i ≤ f_i} are binomially distributed with parameters k − j + 1 and β. Let B(x; N, p) be the distribution function of a binomial distribution with parameters N and p. Then we define η(f_j, ..., f_k) = h(k − j + 1) minimal and η̄(f_j, ..., f_k) = h̄(k − j + 1) maximal such that the criterion (10) holds with sufficiently high probability under the model.

Example III (continued). We assume that each Y_i is Poisson distributed with parameter exp(f_i). Then Σ_{i=j}^k Y_i is Poisson distributed with parameter Σ_{i=j}^k exp(f_i), and the bounds may be based on the corresponding Poisson quantiles.

Example IV (continued). Suppose that Y_1, ..., Y_n are binomially distributed with parameters 1 and p_i = exp(f_i)/(1 + exp(f_i)). Then Σ_{i=j}^k Y_i may be written as a sum of independent Bernoulli variables. In view of Hoeffding's (1956) finding that the deviations of Σ_{i=j}^k Y_i from its mean Σ_{i=j}^k p_i tend to be largest in case of equal probabilities p_i, we define η(f_j, ..., f_k) = h(k − j + 1, p̄_jk) and η̄(f_j, ..., f_k) = h̄(k − j + 1, p̄_jk), where p̄_jk denotes the mean of p_j, ..., p_k, while h̄(N, p) is maximal and h(N, p) is minimal such that the corresponding binomial probabilities are sufficiently large.

For the consistency results to follow, it is crucial that the adaptive choice of the λ_j yields a fit f̂ such that for some constant c_o,

|Σ_{i=j}^k R′_i(f̂_i)| ≤ c_o ((k − j + 1) log n)^{1/2} for all {j, ..., k} ∈ I_n.    (11)

For Example I this is obvious, at least if I_n comprises all subintervals of {1, ..., n}. By means of suitable exponential inequalities one can verify the multiscale criterion (11) for Examples II-IV, too.

Numerical examples
A simulation study was carried out to compare the median number of local extreme values for nine different versions of the general taut string method. Figure 5 shows rescaled versions of four standard test signals by Donoho (1993) and Donoho and Johnstone (1994) that have been used to create samples under four different test beds as described in detail below. For each function f and sample size n the following test beds were considered:

• Gaussian: Independent normal observations centered at f.

• Cauchy: Independent Cauchy-distributed observations centered at f. Since the mean of the Cauchy distribution does not exist, only the quantile taut string was applied, with quantiles 0.1, 0.5 and 0.9, to recover f − 1.231, f and f + 1.231.
• Binary: Binary observations were obtained by sampling from a Binomial distribution with parameters 1 and a success probability determined by f.

• Poisson: Count observations were obtained by sampling from a Poisson distribution with an intensity determined by f.

For each of the four signals, three different sample sizes and each of the four test models, 100 samples were generated, and the various taut string methods were applied to the data as described above in the description of the test beds. For each application of one of the methods to a sample, the number of local extreme values in the approximation was determined. Table 1 reports the median number of local extreme values over the simulations. In brackets the mean absolute deviation from the true number of local extreme values is given, apart from the samples derived from the Doppler function, which has an infinite number of local extreme values.
These simulations confirm that the usual taut string method is excellent in fitting the correct number of local extreme values and very reliably attains the correct number already for samples of size 512. However, the robust version performs remarkably well in the Gaussian case and has the additional advantage of depending much less on the distribution of the errors, performing similarly in the Cauchy test bed. In contrast, the approximation of 0.1 and 0.9 quantiles is much more difficult. Even for large sample sizes the fitted models often miss local extreme values, in particular for the 0.1 quantile of the Bumps data set, which is an extremely difficult situation. The binary problem also appears to be considerably difficult, although much of the underlying structure is still recovered from the 0/1 observations. For the Poisson case the detection rate of the correct number of local extreme values is nearly as good as for the robust taut string.
Then there exist constants K_1, K_2 > 0 such that Condition (C.3) below holds for all A ≤ a ≤ b ≤ B and η ≥ 0. Let us comment briefly on these conditions: Condition (C.1) is satisfied if, for instance, all design points x_{in} are contained in [A, B] with x_{i+1,n} − x_{in} = (B − A)/n for 1 ≤ i < n. It also holds true almost surely if (x_{in})_{i=1}^n is the vector of order statistics of (X_i)_{i=1}^n, where X_1, X_2, X_3, ... are i.i.d. with a Lebesgue density g which is bounded away from zero on [A, B].
As to Conditions (C.2-3), consider first the model R_{in}(t) := ρ(t − Y_{in}) with observations Y_{in} = f*(x_{in}) + σ(x_{in}) Z_i for some bounded function σ(·) and i.i.d. random errors Z_i with continuous and strictly positive density. Moreover, it follows from empirical process theory that Condition (C.3) is satisfied with universal constants K_1 and K_2.
In what follows let f̂_n be any estimator of f*. Our first consistency result applies to isotonic regression estimators as well as to taut string estimators with constant tuning vector λ via Theorem 2.4. It also applies to taut string estimators with adaptively chosen tuning vectors λ in case of (11).
n with asymptotic probability one.
Our second consistency result concerns estimation of f* close to its local extrema. It is known that least squares taut string estimators tend to underestimate f* near local maxima and to overestimate f* near local minima. The generalized estimators discussed here have the same property, but this effect can be bounded under a local growth condition (12) on f* with exponent κ > 0.
(a) If f̂_n is a taut string estimator with tuning constants λ_{in} in (0, c_o n^{1/2}] for some constant c_o, then the stated rate bound holds. (b) Let f̂_n be any estimator such that f̂_n = (f̂_n(x_{in}))_{i=1}^n satisfies (11) for all n. Then, for C sufficiently large, the stated bound holds with asymptotic probability one.
To illustrate the latter results, suppose that f* is twice differentiable with bounded second derivative on [A, B]. If x* ∈ (A, B) is a local extremum of f*, then (12) holds with κ = 2. The taut string estimator f̂_n with global tuning parameter λ = λ_n = O(n^{1/2}) then underestimates (resp. overestimates) a local maximum (resp. minimum) by O_p(n^{−1/3}). In case of the adaptively chosen λ_{in} = λ_{in}(data), the latter rate improves to O_p(ρ_n^{2/5}).

Proofs
Proof of Lemma 2.1. As mentioned earlier, the necessity of Condition (3) follows from (2) applied to ±δ^{(jk)}. On the other hand, it will be shown below that an arbitrary vector δ ∈ R^n may be written as δ = Σ_{1≤j≤k≤n} α^{(jk)} δ^{(jk)} with real numbers α^{(jk)} satisfying the following two constraints: (i) For 1 ≤ i ≤ n it follows from δ_i > 0 (or δ_i < 0) that α^{(jk)} ≥ 0 (or α^{(jk)} ≤ 0) whenever i ∈ {j, ..., k}.
Proof of Lemma 2.3. Let J = {j, . . . , k} be a maximal index interval such that f̂_j = · · · = f̂_k = max_i f̂_i. Then Lemma 2.2 yields a bound whose right hand side is strictly negative if j > 1 or k < n, whence f̂_j < z_r. If j = 1 and k = n, then f̂_1 < z_r, because Σ_{i=1}^n R′_i(z_r) > 0. These considerations show that max_i f̂_i < z_r, and analogous arguments reveal that min_i f̂_i > z_ℓ. □

Our proof of Theorem 2.4 relies on a characterization of isotonic fits which is of independent interest.
Proof of Theorem 8.1. For notational convenience let a = 1 and b = n.
Note that the functional f ↦ T_↑(f) := Σ_{i=1}^n R_i(f_i) is convex on R^n, and that the set R^n_↑ of vectors in R^n with non-decreasing components is convex. Thus an isotonic vector f̂ minimizes T_↑ over R^n_↑ if, and only if, the corresponding first-order optimality condition holds for any δ ∈ R^n such that f̂ + tδ is isotonic for some t > 0. The latter requirement is equivalent to δ_i ≤ δ_j whenever i < j and f̂_i = f̂_j.
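For the quadratic special case R_i(t) = w_i(t − y_i)²/2, the minimizer of T_↑ over R^n_↑ is delivered exactly by the classical pool-adjacent-violators algorithm. The following minimal Python sketch is our illustration of that isotonic subproblem, not the paper's algorithm:

```python
def isotonic_ls(y, w=None):
    """Pool Adjacent Violators: minimize sum_i w_i*(f_i - y_i)^2/2
    over non-decreasing vectors f."""
    n = len(y)
    if w is None:
        w = [1.0] * n
    # each block stores [weighted mean, total weight, block length]
    blocks = []
    for yi, wi in zip(y, w):
        blocks.append([yi, wi, 1])
        # merge while the previous block's mean violates monotonicity
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, w2, c2 = blocks.pop()
            m1, w1, c1 = blocks.pop()
            wt = w1 + w2
            blocks.append([(w1 * m1 + w2 * m2) / wt, wt, c1 + c2])
    out = []
    for m, _, c in blocks:
        out.extend([m] * c)
    return out
```

The merged blocks are exactly the maximal index intervals on which the fit is constant, matching the role such intervals play in the proofs above.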
One can deduce from this representation a sum in which each term on the right hand side is non-negative by (13) and (14), with f̌_i = f̂_i for a ≤ i ≤ b. But it follows from (3) and our assumptions on λ that the sum in (13) is not greater than λ_{j−1} sign(f̂_{j−1} − f̂_j) + λ_k sign(f̂_{k+1} − f̂_k) = 0, while the sum in (14) is not smaller than λ_{j−1} sign(f̂_{j−1} − f̂_j) + λ_k sign(f̂_{k+1} − f̂_k) = 0. □

Proof of Lemma 3.1. Suppose first that c ≥ M_ℓm(v). Then the claimed inequality holds for any z > c, which proves Part (a). As for Part (b), suppose that c > M_jm(u + v). Then the analogous inequality holds for c ≥ z > M_jm(u + v). □

Proof of Theorem 3.2. Let J = {j, . . . , k} be a local maximum of f̂. A close inspection of Algorithm I yields an upper bound for max_{i∈J} f̂_i, while our assumption on f yields a matching lower bound for max_{i∈J} f_i. Analogously one can show that min_{i∈K} f_i ≤ min_{i∈K} f̂_i for any local minimum K of f̂. □

Proof of Theorem 4.1. It follows from Assumption (7) that T_ε converges pointwise to T as ε ↓ 0. Since all functions T and T_ε are convex, it is well known from convex analysis that the convergence is even uniform on arbitrary compact sets. Specifically, consider the closed ball B_R(0) around 0 with radius R > 0. It follows from (A.2) that suitable bounds hold for some R > 0, and hence for some ε_o > 0. These inequalities and the convexity of the functions T and T_ε together entail that the sets F̂ and F̂_ε, 0 < ε ≤ ε_o, are nonvoid and compact subsets of B_R(0). Now convergence of F̂_ε to F̂ follows easily from the estimate min_{g∈B_R(0)} T(g) + 2 max_{g∈B_R(0)} |T(g) − T_ε(g)|. □

Proof of Theorem 4.2. That ĝ ∈ (β, n − 1 + β)^n follows from Lemma 2.3 and the fact that R̃′_i(β) ≤ 0 with strict inequality if Z_i > 1, while R̃′_i(n − 1 + β) ≥ 0 with strict inequality if Z_i < n. Thus f̂ is well-defined, and it suffices to verify (3) for indices 1 ≤ j ≤ k ≤ n.
But the definitions of f̂ and R̃′_i entail that, according to Lemma 2.1 applied to T̃ in place of T, the right hand side is not smaller than

λ_{j−1} sign(ĝ_{j−1} − ĝ_j) + λ_k sign(ĝ_{k+1} − ĝ_k)
= λ_{j−1}(1 − 2 · 1{ĝ_{j−1} ≤ ĝ_j}) + λ_k(1 − 2 · 1{ĝ_{k+1} ≤ ĝ_k})
≥ λ_{j−1}(1 − 2 · 1{f̂_{j−1} ≤ f̂_j}) + λ_k(1 − 2 · 1{f̂_{k+1} ≤ f̂_k})
= λ_{j−1} sign(f̂_{j−1} − f̂_j) + λ_k sign(f̂_{k+1} − f̂_k).
This proves the first part of (3), and the second part follows similarly.

Proof of Theorem 7.1. It follows from Condition (C.3) that a bound holds uniformly over A ≤ a ≤ b ≤ B and t ∈ R with probability at least 1 − n² K_1 exp(−K_2 η_o log n) → 1, provided that η_o > 2/K_2.
Suppose that f̂_n(x_n) > f*(x_n) + ε_n for some x_n ∈ [A_n, B_n − δ_n] and ε_n = C ρ_n^{γ/(2γ+1)} = C δ_n^γ, with C to be specified later. Then a corresponding bound holds for x_n ≤ x ≤ x_n + δ_n. If f̂_n minimizes the sum Σ_{i: A_n ≤ x_in ≤ B_n} R_i(f(x_in)) over all isotonic functions f on [A_n, B_n], we may assume without loss of generality that f̂_n(x_in) < f̂_n(x_n) whenever A_n ≤ x_in < x_n. For otherwise we could replace x_n with the smallest design point x_in in [A_n, B_n] such that f̂_n(x_in) = f̂_n(x_n). Then Theorem 8.1 entails the assertion.

Proof of Theorem 7.2. We only prove the assertion about the minimum of f̂_n over a neighborhood of x*, because the other part follows analogously. Suppose that f̂_n > f* + ε_n on [x* ± δ_n], where both δ_n > 0 and ε_n > 0 are fixed numbers tending to zero, while δ_n ≥ m* ρ_n. In case of a taut string estimator with parameters λ_in ∈ (0, c_o n^{−1/2}], the claimed bound follows from (3) and (16). Hence setting δ_n := n^{−1/(2κ+2)} yields the assertion. In case of any estimator satisfying (11), ε_n equals O((log(n)/M_n[x* ± δ_n])^{1/2} + log(n)/M_n[x* ± δ_n]) = O((ρ_n/δ_n)^{1/2} + ρ_n/δ_n).
Comparing this with (17) shows that one should take δ_n = ρ_n^{1/(2κ+1)}, and this yields the assertion about the minimum of f̂_n. □
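The choice of δ_n can be made explicit by balancing the two error sources. This short computation is our own illustration and assumes, as suggested by (12), that the bound in (17) is of order δ_n^κ:

```latex
\delta_n^{\kappa} \asymp (\rho_n/\delta_n)^{1/2}
\;\Longleftrightarrow\;
\delta_n^{2\kappa+1} \asymp \rho_n
\;\Longleftrightarrow\;
\delta_n \asymp \rho_n^{1/(2\kappa+1)},
```

so that ε_n ≍ ρ_n^{κ/(2κ+1)}, which for κ = 2 recovers the rate ρ_n^{2/5} mentioned earlier.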

Software
The generalized taut string algorithm has been implemented in the ftnonpar package for the statistical software R (Ihaka and Gentleman, 1996). This add-on package can be downloaded and installed with R's standard install.packages() command. All examples considered in this paper are available via the genpmreg function, whose method parameter selects among the usual taut string method, the quantile version, and the versions for binomial and Poisson noise.
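The different method choices correspond to different convex losses ρ in the penalized criterion from the introduction. The following Python sketch of that criterion is illustrative only; the function and variable names are ours, not those of the ftnonpar implementation:

```python
def penalized_criterion(f, y, lam, rho):
    """T(f) = sum_i rho(f_i - y_i) + sum_i lam_i * |f_{i+1} - f_i|,
    i.e. goodness of fit plus a localized total-variation penalty
    with tuning vector lam (length n - 1)."""
    fit = sum(rho(fi - yi) for fi, yi in zip(f, y))
    tv = sum(l * abs(b - a) for l, a, b in zip(lam, f, f[1:]))
    return fit + tv

# convex losses for two of the regression features discussed in the paper
def squared(u):            # conditional mean
    return u * u / 2.0

def check(beta):           # conditional beta-quantile (check function)
    return lambda u: u * (beta - (u < 0))
```

Allowing the entries of lam to vary over the design points is what the localized total variation penalties in the text refer to.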