Estimating Piecewise Monotone Signals

We study the problem of estimating piecewise monotone vectors. This problem can be seen as a generalization of isotonic regression that allows a small number of order-violating changepoints. We focus mainly on the performance of the nearly-isotonic regression proposed by Tibshirani et al. (2011). We derive risk bounds for the nearly-isotonic regression estimators that are adaptive to piecewise monotone signals. The estimator achieves a near-minimax convergence rate over certain classes of piecewise monotone signals under a weak assumption. Furthermore, we present an algorithm that can be applied to nearly-isotonic type estimators on general weighted graphs. The simulation results suggest that the nearly-isotonic regression performs as well as the ideal estimator that knows the true positions of the changepoints.


Introduction
Isotonic regression is a popular statistical method based on partial order structures, which has a long history in statistics (Ayer et al. 1955, Brunk 1955, van Eeden 1956). Suppose that θ* ∈ R^n is a monotone vector satisfying θ*_1 ≤ θ*_2 ≤ · · · ≤ θ*_n, and y is a noisy observation of θ*. The goal of isotonic regression is to find a least squares fit under the monotone constraint:

minimize ‖y − θ‖₂² subject to θ_1 ≤ θ_2 ≤ · · · ≤ θ_n. (1)

In other words, the isotonic regression is the least squares estimator θ̂ = θ̂_{K↑_n} over the closed convex cone K↑_n := {θ ∈ R^n : θ_1 ≤ θ_2 ≤ · · · ≤ θ_n}. Broadly speaking, isotonic regression is an example of shape restricted regression. For comprehensive reviews of this field, see Robertson et al. (1988), Groeneboom and Jongbloed (2014), Chatterjee et al. (2015), and references therein.
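As a concrete illustration of the constrained problem (1), the projection onto the monotone cone can be computed by the classical pool adjacent violators algorithm (PAVA). The following is a minimal plain-Python sketch; the function name is ours, not from the paper.

```python
# Minimal PAVA sketch for the isotonic regression (1): project y onto the
# monotone cone K_n^up by pooling adjacent blocks whose means violate the order.

def isotonic_regression(y):
    """Return the least squares monotone (non-decreasing) fit to y."""
    blocks = []  # each block is [sum of values, number of points]
    for v in y:
        blocks.append([v, 1])
        # Merge while the previous block's mean exceeds the last block's mean.
        while len(blocks) > 1 and blocks[-2][0] * blocks[-1][1] > blocks[-1][0] * blocks[-2][1]:
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
    # Expand each pooled block back to a flat fitted vector.
    fit = []
    for total, count in blocks:
        fit.extend([total / count] * count)
    return fit
```

For example, `isotonic_regression([3, 1, 2])` pools all three points into a single block and returns the constant fit at their mean.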
In this paper, we study the problem of estimating piecewise monotone vectors, which can be regarded as a generalization of isotonic regression that allows order-violating changepoints. We formulate the problem precisely as follows. Let us consider the Gaussian sequence model where y = (y 1 , y 2 , . . . , y n ) ∈ R n is the observed vector, θ * = (θ * 1 , θ * 2 , . . . , θ * n ) ∈ R n is the unknown parameter of interest, and ξ = (ξ 1 , ξ 2 , . . . , ξ n ) is the unobserved noise distributed according to the Gaussian distribution N (0, σ 2 I n ). Given the noisy observation y, the problem is to find a good piecewise monotone approximation of θ * . Here we define piecewise monotone vectors as follows.
Definition 1.1. Let Π = {A_1, . . . , A_m} be a connected partition of [n] := {1, 2, . . . , n} into consecutive index sets. We say that θ ∈ R^n is piecewise monotone on Π if the sub-vector θ_{A_i} is monotone non-decreasing for each i = 1, . . . , m. We also say that θ is m-piecewise monotone if θ is piecewise monotone on some partition Π with |Π| = m.
We are particularly interested in the case where the number of pieces m is larger than two but much smaller than n, since the problem reduces to simpler ones otherwise. From Definition 1.1, a monotone vector in K↑_n is m-piecewise monotone for any m ≥ 1. In particular, the least squares estimator over 1-piecewise monotone vectors coincides with the isotonic regression. Moreover, since any vector in R^n is n-piecewise monotone, the least squares estimator over n-piecewise monotone vectors is merely the identity estimator θ̂_id = y.
In real-world applications, many signals can be approximated by piecewise monotone vectors. Here, we provide a few examples. First, in seismology, geological observations such as tide gauge records (Nagao et al. 2013) and GPS records (Rogers and Dragert 2003) often consist of a long-term monotonic trend and discontinuous jumps caused by tectonic activities. In particular, Rogers and Dragert (2003) reported that GPS measurements near a subduction zone in North America can be approximated by a sawtooth function.

[Figure 1. Top: The difference of the east-west component of GPS measurements between Victoria (British Columbia, Canada) and Seattle (United States). The trend factor seems to be approximated by a piecewise monotone signal; a possible reason for this behavior is the seismological phenomenon reported in Rogers and Dragert (2003). See Section 6.3 for a more detailed explanation of this data. Bottom: The numbers of search queries for the two words "Christmas" and "gift" in Google Trends (https://www.google.com/trends).]
The top panel of Figure 1 shows an example of GPS measurements. Second, the numbers of search queries for some words related to seasons (e.g., "Christmas" and "gift") can be seen as periodic piecewise monotone signals (see the bottom panel of Figure 1 for examples). Third, in the ranking systems in online shopping websites, sales ranks of rarely sold items behave like piecewise monotone signals because they suddenly rise every time the items are sold (Hattori and Hattori 2010).
In this paper, we focus on the performance of the nearly-isotonic regression proposed by Tibshirani et al. (2011). Given y ∈ R^n and a tuning parameter λ ≥ 0, the nearly-isotonic regression estimator θ̂_λ is defined as

θ̂_λ := argmin_{θ ∈ R^n} (1/2)‖y − θ‖₂² + λ Σ_{i=1}^{n−1} (θ_i − θ_{i+1})_+, (3)

where (z)_+ := max{z, 0}. Intuitively, the tuning parameter λ controls the degree of monotonicity. The term (θ_i − θ_{i+1})_+ poses a positive penalty if and only if the directed edge (i, i+1) is order violating, i.e., θ_i > θ_{i+1}. Hence, a large value of λ > 0 makes the estimator θ̂_λ close to a monotone vector. In particular, there is a sufficiently large λ such that the solution θ̂_λ becomes exactly the same as the isotonic regression (1). Our goal in this paper is to show that the nearly-isotonic regression can adapt to piecewise monotone vectors.

[Figure 2. Examples of the nearly-isotonic regression estimators with different choices of tuning parameters. The nearly-isotonic regression interpolates between the identity estimator θ̂_id = y and the isotonic regression θ̂_{K↑_n}.]

As suggested in Tibshirani et al. (2011),
the nearly-isotonic regression can fit to a "nearly monotone" vector that is close to K↑_n in the ℓ2 sense. That is, the estimator performs well if θ* has a small ℓ2-misspecification error dist(θ*, K↑_n), defined as

dist(θ*, K↑_n) := min_{θ ∈ K↑_n} ‖θ* − θ‖₂.

Moreover, we can observe that the nearly-isotonic regression can fit to piecewise monotone vectors even if θ* is far from monotone in the ℓ2 sense. Figure 2 shows an example of the nearly-isotonic regression with n = 100. The true parameter θ* (orange line) is 2-piecewise monotone. By varying the value of the tuning parameter λ ≥ 0, the nearly-isotonic regression behaves as follows. If λ = 0, the nearly-isotonic regression is just the identity estimator θ̂_id = y, which clearly overfits to the noisy observation. If λ is set to a sufficiently large value, θ̂_λ coincides with the isotonic regression. In this example, however, the normalized ℓ2-misspecification error dist²(θ*, K↑_n)/n is large compared with the noise variance σ². We can see that the mean squared error (MSE) (1/n)E_{θ*}‖θ̂ − θ*‖₂² of the isotonic regression can be much worse than that of the identity estimator (see Section 3.2). Indeed, we can choose a 2-piecewise monotone vector θ* ∈ K↑_{n/2} × K↑_{n/2} with an arbitrarily large ℓ2-misspecification error. If we choose an intermediate value of λ, the nearly-isotonic regression fits the true parameter well. This suggests an adaptation property to piecewise monotone vectors.
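The penalized problem (3) is a small convex program, so for building intuition it can be solved by introducing slack variables s_i ≥ (θ_i − θ_{i+1})_+ and calling a generic solver. The sketch below uses SciPy's SLSQP; it is only an illustrative toy solver, not the Modified PAVA path algorithm of Tibshirani et al. (2011), and the function name `nearly_isotonic` is ours.

```python
# Illustrative solver for the nearly-isotonic regression (3): rewrite the
# nonsmooth penalty via slacks s_i >= theta_i - theta_{i+1}, s_i >= 0, and
# minimize 0.5*||y - theta||^2 + lam * sum(s) as a smooth constrained program.
import numpy as np
from scipy.optimize import minimize

def nearly_isotonic(y, lam):
    y = np.asarray(y, dtype=float)
    n = len(y)

    def objective(x):
        theta, s = x[:n], x[n:]
        return 0.5 * np.sum((y - theta) ** 2) + lam * np.sum(s)

    constraints = [
        {"type": "ineq", "fun": lambda x: x[n:]},                     # s_i >= 0
        {"type": "ineq", "fun": lambda x: x[n:] - (x[:n - 1] - x[1:n])},  # s_i >= theta_i - theta_{i+1}
    ]
    x0 = np.concatenate([y, np.zeros(n - 1)])
    res = minimize(objective, x0, constraints=constraints, method="SLSQP")
    return res.x[:n]
```

For y = (2, 1), stationarity gives θ̂_λ = (2 − λ, 1 + λ) as long as λ < 1/2, after which the two coordinates fuse at their mean, matching the interpolation between the identity estimator and the isotonic fit described above.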

Summary of theoretical results
In this paper, we investigate the adaptation property of the nearly-isotonic regression estimators defined in (3).
In the monotone regression setting (i.e., m = 1), it is known that the isotonic regression estimator θ̂_{K↑_n} achieves the risk bound

(1/n) E_{θ*} ‖θ̂_{K↑_n} − θ*‖₂² ≤ C { (σ² V(θ*)/n)^{2/3} + (σ²/n) log en },

where V(θ) := θ_n − θ_1 is the total variation of the monotone vector θ. It is also known that the rate O((σ²V/n)^{2/3}) is minimax optimal under the assumption that θ* is monotone and V(θ*) ≤ V (Zhang 2002). Hence, a natural question is whether a similar rate can be achieved in piecewise monotone regression. In Section 3.1, we provide a minimax lower bound over the class of piecewise monotone vectors. Let Θ̄_n(m, V) be the set of m-piecewise monotone vectors whose "upper" total variations are bounded by V (a precise definition is provided in Section 3.1). Then, the minimax risk over Θ̄_n(m, V) is bounded from below by a constant multiple of

max{ (σ²V/n)^{2/3}, (σ²m/n) log(en/m) }.
In Section 5, we construct a concrete (but not computationally efficient) estimator that adaptively achieves this rate, and hence the lower bound is tight in terms of the order in n, m, and V. Intuitively, this suggests that the cost of not knowing the true partition is of order O((σ²m/n) log(en/m)). In Section 4, we provide the following risk bound for the nearly-isotonic regression estimator (3). A precise statement is given in Corollary 4.12.
Then, the estimator (3) with an optimally tuned parameter λ satisfies a risk bound of the form (4). The above claim is obtained as a corollary of a more general risk bound in Section 4. In the above statement, we make somewhat restrictive assumptions: (a) and (b) are introduced merely for notational simplicity, whereas (c) is an essential assumption. If we assume only (a) and (b), the rate appearing in (4) is minimax optimal up to a logarithmic factor. However, we require the extra growth condition (c), which seems to be unavoidable for the estimator (3). We provide a precise definition of the growth condition in Section 4.3.

Organization
The rest of this paper is organized as follows. In Section 2, we give a brief literature review of shape restricted regression and regularization-based estimators and relate our theoretical results to previous work. We provide lower bounds on the risks in the piecewise monotone regression problem in Section 3. In Section 4, we describe our main results on the risk upper bounds for the nearly-isotonic regression estimator and its constrained-form variant; in particular, a precise statement of Claim 1.2 is provided in Section 4.3. In Section 5, we discuss the attainability of the minimax lower bound; there, we provide a concrete example of a model selection-based estimator that achieves the optimal rate. Furthermore, we present some numerical examples in Section 6. Finally, we present our conclusion in Section 7. The appendices contain additional numerical examples on two-dimensional signals, explanations of the algorithms, and all proofs of the theoretical results.

Notation
Throughout this paper, we assume that y = θ * + ξ is distributed according to an isotropic normal distribution N (θ * , σ 2 I n ), where θ * ∈ R n is the true mean parameter of interest and ξ ∼ N (0, σ 2 I n ) is the noise vector. The symbol E θ * denotes the expectation with respect to y.
We sometimes denote by C an absolute positive constant whose value may vary.

Related work
There are two classes of estimators that are closely related to the nearly-isotonic regression (3): the isotonic regression and the fused lasso.
As we mentioned above, the isotonic regression is an instance of shape restricted regression. Many existing estimators in shape restricted regression can be formulated as least squares estimators (denoted by θ̂_K) onto closed convex sets (denoted by K). Examples include, but are not limited to, the isotonic regression, the isotonic regression on two-dimensional grids or more general partial orders (see, e.g., Robertson and Wright (1975) and Kyng et al. (2015)), and convex regression (Hildreth 1954).
Recently, researchers have developed two important techniques for analyzing the risk behavior of least squares estimators. First, Chatterjee (2014) proved that the Euclidean norm ‖θ̂_K − θ*‖₂ is tightly concentrated around a certain quantity defined by the localized Gaussian width. As applications of Chatterjee's method, non-asymptotic upper bounds with rates similar to the minimax risks have been proved for the isotonic regression (Chatterjee 2014, Bellec 2018), the multi-isotonic regression in two or more dimensions (Chatterjee et al. 2018), the multi-dimensional convex regression (Han and Wellner 2016), and the constrained-form trend filtering estimator. See also Section 2.2 in Bellec (2018) for a related result. Second, risk bounds based on the statistical dimension of the tangent cone of K have been developed by Oymak and Hassibi (2016) and Bellec (2018). This technique is useful because it takes into account the facial structure of K, which leads to risk bounds that are adaptive to low-dimensional sub-structures. It has been shown that some least squares estimators are adaptive to piecewise constant vectors: for example, the isotonic regression (Bellec 2018) and the multi-isotonic regression (Chatterjee et al. 2018). In particular, for the one-dimensional isotonic regression, Chatterjee et al. (2015) and Bellec (2018) proved the following oracle inequality:

(1/n) E_{θ*} ‖θ̂_{K↑_n} − θ*‖₂² ≤ min_{θ ∈ K↑_n} { (1/n)‖θ − θ*‖₂² + C (σ² k(θ)/n) log(en/k(θ)) }, (5)

where k(θ) is the number of constant pieces of θ. If θ* is monotone and k(θ*) is small, the right-hand side can be much smaller than the worst-case rate of O((σ²V/n)^{2/3}). However, the first term on the right-hand side can become arbitrarily large if θ* is not contained in K↑_n.

The fused lasso (Tibshirani et al. 2005), also known as total variation regularization (Rudin et al. 1992), is a penalized estimator defined as

θ̂^fused_λ := argmin_{θ ∈ R^n} (1/2)‖y − θ‖₂² + λ Σ_{i=1}^{n−1} |θ_i − θ_{i+1}|, (6)

where λ ≥ 0 is the tuning parameter.
The fused lasso poses a penalty whenever θ_i ≠ θ_{i+1}, whereas the penalty of the nearly-isotonic regression (3) activates only if θ_i > θ_{i+1}. Theoretical risk bounds for the fused lasso have been studied by Mammen and van de Geer (1997), Dalalyan et al. (2017), and Lin et al. (2017), among others. In particular, an oracle inequality of the following form has been shown:

(1/n) E_{θ*} ‖θ̂^fused_{λ*} − θ*‖₂² ≤ min_{θ ∈ R^n} { (1/n)‖θ − θ*‖₂² + C (σ² k(θ)/n) log(en/k(θ)) + ∆_fused(θ) }, (7)

where λ* is an optimally tuned parameter. One can control the quantity ∆_fused(θ) by assuming a mild regularity condition on θ*, so that the inequality (7) recovers the minimax rate for piecewise constant vectors (see, e.g., Gao et al. (2017)). However, even if θ* is a monotone vector, (7) does not recover the rate of the isotonic regression (5) because ∆_fused(θ) is zero if and only if θ is a constant vector.
Our risk bound for the nearly-isotonic regression in Section 4.2 fills the gap between the above risk bounds for the isotonic regression and the fused lasso. We show an oracle inequality of the same form as (7), with ∆_fused(θ) replaced by a quantity ∆_neariso(θ). Like the fused lasso bound (7), this inequality provides a meaningful risk bound even if θ* cannot be approximated by a monotone vector. Furthermore, ∆_neariso(θ) is zero for every monotone vector θ ∈ K↑_n, so our result exactly recovers the rate achieved by the isotonic regression in (5).
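The difference between the two penalties is easy to make concrete: the fused lasso charges every jump, while the nearly-isotonic penalty charges only the order-violating ones. A minimal sketch (helper names ours):

```python
# Side-by-side comparison of the fused lasso penalty (total variation) and
# the nearly-isotonic penalty, which activates only on downward moves.

def fused_penalty(theta):
    """Total variation: sum of |theta_{i+1} - theta_i|."""
    return sum(abs(b - a) for a, b in zip(theta, theta[1:]))

def neariso_penalty(theta):
    """Sum of (theta_i - theta_{i+1})_+ : only order violations are charged."""
    return sum(max(a - b, 0.0) for a, b in zip(theta, theta[1:]))

monotone = [0.0, 1.0, 1.0, 4.0]   # non-decreasing vector
sawtooth = [0.0, 2.0, 0.5, 2.5]   # one order-violating drop (2.0 -> 0.5)
```

For the monotone vector the nearly-isotonic penalty vanishes while the fused lasso penalty equals its total variation, which mirrors why our oracle inequality can recover the isotonic rate while (7) cannot.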

Lower bounds
In this section, we provide lower bounds for the risk in one-dimensional piecewise monotone regression.

Minimax lower bound
We are interested in lower bounds for the minimax risk

R(Θ) := inf_{θ̂} sup_{θ* ∈ Θ} (1/n) E_{θ*} ‖θ̂ − θ*‖₂²,

where Θ ⊂ R^n is a set of piecewise monotone vectors, and the infimum is taken over all (measurable) estimators θ̂ : R^n → R^n. In particular, for 1 ≤ m ≤ n, we consider the following classes of m-piecewise monotone vectors with bounded total variation.
Definition 3.1. Let n ≥ 2 and 1 ≤ m ≤ n. For any V > 0, let Θ̄_n(m, V) denote the set of (at most) m-piecewise monotone vectors whose upper total variation is bounded by V. In other words, a vector θ ∈ R^n is an element of Θ̄_n(m, V) if and only if the following conditions hold: (i) θ is piecewise monotone on a connected partition Π = {A_1, . . . , A_{m*}} of [n] whose cardinality |Π| = m* is not larger than m; (ii) the total variations of the pieces satisfy Σ_{i=1}^{m*} V(θ_{A_i}) ≤ V.
In addition, we define Θ_n(m, V) as the set of m-piecewise monotone vectors such that the total variations of all pieces are uniformly bounded by V/m. That is, Θ_n(m, V) is obtained by replacing (ii) with the following condition: (ii)' V(θ_{A_i}) ≤ V/m for all i = 1, . . . , m*.

First, suppose that θ* is piecewise monotone on a known partition Π* = {A_1, A_2, . . . , A_{m*}} and that the total variation of each sub-vector θ*_{A_i} is bounded as V(θ*_{A_i}) ≤ V_i for i = 1, 2, . . . , m*. Then, the problem decomposes into m* independent subproblems of estimating the monotone vectors θ*_{A_i}. The minimax risk lower bound for monotone vectors has been proved by Zhang (2002) and Chatterjee et al. (2015). For notational simplicity, we assume here that n_i := |A_i| ≥ 2 for all i. Hence, the minimax risk over Θ̄_n(m, V) is bounded from below by a constant multiple of

(1/n) Σ_{i=1}^{m*} n_i (σ² V_i / n_i)^{2/3}. (9)

If the partition Π* is known, this convergence rate can be attained by concatenating the least squares estimators on the pieces. By Jensen's inequality, the quantity (9) is not larger than (σ² Σ_i V_i / n)^{2/3}. In the general setting, we have to deal with unknown partitions. The following proposition gives a lower bound over the classes of piecewise monotone vectors in Definition 3.1.
Proposition 3.2. Let n ≥ 3, 3 ≤ m ≤ n, and V > 0. Suppose that Θ is either Θ̄_n(m, V) or Θ_n(m, V) in Definition 3.1. Then, for any estimator θ̂ : R^n → R^n, we have the lower bound

sup_{θ* ∈ Θ} (1/n) E_{θ*} ‖θ̂ − θ*‖₂² ≥ C max{ (σ²V/n)^{2/3}, (σ²m/n) log(en/m) }, (10)

where C > 0 is a universal constant.
It remains to verify that the lower bound (10) is tight. Thus, in Section 5, we will construct an estimator that adaptively achieves a similar rate.

Lower bound of isotonic regression with misspecified partitions
Suppose that θ* is an m-piecewise monotone vector. As mentioned in the previous subsection, if we know the true partition on which θ* is monotone, the least squares estimator can achieve the rate shown in (9). Here, we consider what happens if we underestimate the true number of pieces. Specifically, we consider the risk behavior of the isotonic regression θ̂_{K↑_n}, which corresponds to the least squares estimator with the underestimated number of pieces m = 1. If the true number of pieces is two or more, θ* may not be contained in K↑_n. Recall that dist(θ*, K↑_n) is the ℓ2-misspecification error against the set of monotone vectors. Bellec (2018) showed that the isotonic regression is robust against a small ℓ2-misspecification; roughly speaking, the oracle inequality (5) continues to hold with an additional term proportional to dist²(θ*, K↑_n)/n, where the number of constant pieces is evaluated at the orthogonal projection of θ* onto K↑_n. Conversely, if the ℓ2-misspecification error is large, the isotonic regression can have an arbitrarily large risk.
Proposition 3.3. There is a positive number t = t_{n,σ²} depending on n and σ² such that if the true parameter θ* satisfies dist(θ*, K↑_n) > t, then the MSE of the isotonic regression is bounded from below as

(1/n) E_{θ*} ‖θ̂_{K↑_n} − θ*‖₂² > σ².

In this case, the isotonic regression has a strictly larger MSE than that of the identity estimator θ̂_id = y.
We can easily check that there is a 2-piecewise monotone vector with an arbitrarily large ℓ2-misspecification error. To see this, let θ* ∈ R^{2n} be the piecewise constant vector defined as θ*_i = M > 0 for i = 1, . . . , n and θ*_i = 0 for i = n + 1, . . . , 2n. The projection of θ* onto K↑_{2n} is the constant vector at level M/2, so dist²(θ*, K↑_{2n}) = nM²/2, which diverges as M → ∞. Figure 2 shows an example of a 2-piecewise monotone vector θ* for which the isotonic regression has a larger squared loss than the identity estimator.
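This computation is easy to verify numerically: projecting the step-down vector onto the monotone cone with a minimal PAVA sketch (function name and the example values n = 5, M = 3 are ours) yields the constant vector at level M/2 and squared distance nM²/2.

```python
# Numerical check that theta* = (M,...,M,0,...,0) in R^{2n} has squared
# distance n*M^2/2 from the monotone cone: the projection pools everything
# into a single block at the overall mean M/2.

def project_monotone(y):
    blocks = []  # [sum, count]; merge while block means violate the order
    for v in y:
        blocks.append([v, 1])
        while len(blocks) > 1 and blocks[-2][0] * blocks[-1][1] > blocks[-1][0] * blocks[-2][1]:
            t, c = blocks.pop()
            blocks[-1][0] += t
            blocks[-1][1] += c
    return [t / c for t, c in blocks for _ in range(c)]

n, M = 5, 3.0
theta = [M] * n + [0.0] * n
proj = project_monotone(theta)   # constant vector at M/2
sq_dist = sum((a - b) ** 2 for a, b in zip(theta, proj))
```

Here `sq_dist` equals n·M²/2 exactly, and it grows without bound as M increases, as claimed.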

Risk bounds for nearly-isotonic regression
In this section, we develop the risk bound for the nearly-isotonic regression estimator (3). Proofs of all the theorems and propositions in this section are presented in Appendix D.

Risk bounds for constrained estimators
Before considering the original version of the nearly-isotonic regression (3), we consider the performance of the constrained-form nearly-isotonic regression θ̂_V defined by the constrained optimization problem

minimize_{θ ∈ R^n} ‖y − θ‖₂² subject to V⁻(θ) := Σ_{i=1}^{n−1} (θ_i − θ_{i+1})_+ ≤ V, (11)

where V ≥ 0 is the tuning parameter. By the fundamental duality theorem of convex optimization, there exists a Lagrange multiplier λ_V ≥ 0 such that the penalized formulation (3) admits the same solution θ̂_{λ_V} = θ̂_V. Hence, the solution path of the penalized estimators {θ̂_λ : λ ≥ 0} and that of the constrained estimators {θ̂_V : V ≥ 0} are equivalent. However, the estimators for fixed values of λ ≥ 0 and V ≥ 0 can behave differently in the following senses: • From a computational perspective, calculating the constrained estimator (11) for a given V ≥ 0 is harder than calculating the penalized estimator (3). For the penalized estimator (3), we can use the Modified Pool Adjacent Violators Algorithm (Modified PAVA) proposed by Tibshirani et al. (2011), which outputs the solution path for every λ ≥ 0; in particular, given λ ≥ 0, we can always obtain an exact solution θ̂_λ. However, to the best of our knowledge, there is no practical algorithm that obtains an exact solution of the constrained problem (11) as fast as the algorithms for the penalized problem (3). We present detailed explanations of the algorithms in Section A.
• From a statistical perspective, the correspondence between tuning parameters λ and V is not deterministic (i.e., it depends on the realization of the data y). For this reason, a risk bound that is obtained for one of (3) or (11) cannot be directly applied to the other.
We now state the main results on the adaptation property to piecewise monotone vectors in terms of a sharp oracle inequality.
Before proceeding, we introduce some notation. Suppose that θ ∈ R^n is piecewise constant on a connected partition Π_const = {A_1, . . . , A_k} of [n]. We denote by k(θ) := |Π_const| the number of pieces on which θ is constant. That is, there are integers 1 = τ_1 < · · · < τ_{k+1} = n + 1 such that (i) A_i = {τ_i, τ_i + 1, . . . , τ_{i+1} − 1} for i = 1, . . . , k and (ii) for any i ∈ [k], there exists t_i ∈ R such that θ_j = t_i for all j ∈ A_i. We define the sign w_i ∈ {0, 1} associated with each knot τ_i (i = 1, . . . , k + 1) as w_1 = w_{k+1} = 0 and

w_i := 1{t_{i−1} > t_i}, i = 2, . . . , k. (12)

In other words, w_i = 1 if and only if the order violation θ_{j−1} > θ_j occurs at j = τ_i. See Figure 3 for a graphical illustration. Then, we define M(θ) as

M(θ) := Σ_{i=1}^{k} max{w_i, w_{i+1}}. (13)

M(θ) quantifies the non-monotonicity of a piecewise constant vector θ by counting the constant pieces adjacent to order-violating knots. If θ is m-piecewise monotone, then it is clear that M(θ) ≤ 2(m − 1). In particular, for any monotone vector θ, we have M(θ) = 0. Based on this notation, we have the following sharp oracle inequality.
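The bookkeeping for a piecewise constant vector — its constant partition, the piece count k(θ), and the signs w_i marking order-violating knots — can be sketched in a few lines. The function names below are ours, and the violation count shown is only illustrative; the paper's exact M(θ) is defined in (13).

```python
# Sketch of the notation for piecewise constant vectors: constant levels
# t_1, ..., t_k, the piece count k(theta), and knot signs w_1, ..., w_{k+1}
# with w_1 = w_{k+1} = 0 and w_i = 1 iff the level drops at knot i.

def constant_pieces(theta):
    """Return the levels t_1, ..., t_k of the constant partition of theta."""
    levels = [theta[0]]
    for v in theta[1:]:
        if v != levels[-1]:
            levels.append(v)
    return levels

def knot_signs(theta):
    """Signs w_1, ..., w_{k+1}: interior w_i = 1 iff t_{i-1} > t_i."""
    t = constant_pieces(theta)
    return [0] + [1 if a > b else 0 for a, b in zip(t, t[1:])] + [0]

theta = [1.0, 1.0, 3.0, 2.0, 2.0, 5.0]  # four constant pieces, one violation (3 -> 2)
k = len(constant_pieces(theta))          # k(theta) = 4
w = knot_signs(theta)                    # [0, 0, 1, 0, 0]
```

In this example only the knot between the levels 3 and 2 is order violating, so exactly two constant pieces are adjacent to a violating knot, consistent with the bound M(θ) ≤ 2(m − 1) for a 2-piecewise monotone vector.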
The following risk bound for the best choice of the tuning parameter V ≥ 0 is an immediate consequence of Theorem 4.1.
Remark 4.3. We briefly comment on the proofs of Theorem 4.1 and Corollary 4.2. A key ingredient is a bound on the statistical dimension (Amelunxen et al. 2014) of the tangent cone of the constraint set {θ ∈ R^n : V⁻(θ) ≤ V}. This methodology was first developed for the isotonic regression and the convex regression by Bellec (2018). In particular, our approach is inspired by analyses of constrained trend filtering estimators. See Appendix D for detailed proofs.
By restricting the region over which the infimum in (16) is taken, we obtain an oracle inequality for monotone vectors that recovers the existing results on the isotonic regression (Chatterjee et al. 2015, Bellec 2018) up to a constant multiplicative factor. To make use of the general upper bound in (16), we have to control the quantity M(θ) defined in (13). To this end, we consider the minimal length condition: we say that θ ∈ R^n satisfies the minimal length condition for a constant c > 0 if

|A_i| ≥ c n / k(θ) for every i ∈ [k] with max{w_i, w_{i+1}} = 1, (18)

where the partition Π_const = {A_1, A_2, . . . , A_k} and the signs w_i (i = 1, . . . , k + 1) are defined as in (12). Intuitively, a signal θ ∈ R^n is well approximated by another signal satisfying the minimal length condition if θ has "moderate slopes" around the order-violating jumps. For further discussion of such growth conditions, see Section 4.3. Based on the minimal length condition, we obtain the following result from Theorem 4.1.
Corollary 4.4. Suppose that θ* ∈ R^n satisfies the minimal length condition (18) for a constant c > 0, and assume that θ* is k(θ*)-piecewise constant and m(θ*)-piecewise monotone. Then, the constrained nearly-isotonic regression (11) satisfies the risk bound (19); in particular, the bound holds with a suitably chosen tuning parameter V and a positive constant C.
Remark 4.5. If θ is k-piecewise constant and m-piecewise monotone, it is always true that k ≥ 2(m − 1). Hence, the inequality (19) can be simplified, where C(c) > 0 is a constant that depends on c alone.
Remark 4.6. We comment on the minimal length condition and its relation to the estimation of piecewise constant vectors. We conjecture that the minimal length condition (18) is essentially unavoidable for the risk bound of the nearly-isotonic regression, by the following analogy to the fused lasso. An analogous minimal length condition has been considered for the fused lasso, and Fan and Guan (2017) showed that it cannot be removed, in the sense that there is a lower bound depending on the minimum length ∆ = min_i |A_i|.

Risk bounds for penalized estimators
In this section, we consider the risk bounds for the nearly-isotonic regression (3) in the original penalized form by Tibshirani et al. (2011).
We comment on some direct consequences of Theorem 4.7. In this theorem, λ*(θ) is defined as a function of θ. To interpret the risk bound (20), we consider a choice of the tuning parameter λ ≥ 0 that depends on the true parameter θ*. Let θ̄ be a vector that minimizes the right-hand side of (20) among all θ ∈ R^n. Then, taking λ** := λ*(θ̄), we obtain an oracle inequality of the same form as (16).
Again, if we assume the minimal length condition (18) on θ*, we obtain a simplified bound of the form (17). We now discuss a precise expression for λ*(θ) in Theorem 4.7. The next proposition provides an upper bound for λ*(θ).
Proposition 4.8. Let Π_const = {A_1, . . . , A_k} be the constant partition of θ, and let w_1, w_2, . . . , w_{k+1} be the associated signs defined in (12). Then, there is a universal constant C > 0 such that λ*(θ) in Theorem 4.7 is bounded from above by an explicit quantity depending on σ, the piece lengths |A_i|, and the signs w_i.

The purpose of the choice of λ* in Proposition 4.8 is to derive the theoretical convergence rate in terms of k(θ) and M(θ). However, different choices are possible if we are interested in other theoretical aspects (e.g., estimation consistency for changepoints). For the fused lasso estimator (6), several authors have studied theoretical choices of tuning parameters that yield risk upper bounds (Dalalyan et al. 2017, Lin et al. 2017).
Remark 4.9 (Example of parameter choice). Here, we provide an example choice of the tuning parameter λ under a simple length condition. Assume that (i) θ* is not globally monotone (i.e., M(θ*) > 0) and (ii) |A_i| is of order n/k, that is,

c_1 n/k ≤ |A_i| ≤ c_2 n/k for all i = 1, . . . , k,

for some 0 < c_1 < c_2. Then, λ*(θ*) is bounded from above by C′ σ √(n log en), where C′ is a constant that depends on C, c_1, and c_2. For the fused lasso, the theoretical choice λ = O(σ √(n log en)) has been suggested by Dalalyan et al. (2017), among others.
Remark 4.10. In general, the risk-minimizing choice of the tuning parameter can differ from the theoretical suggestion. More importantly, we cannot compute the value of λ suggested in Proposition 4.8 because it depends on the unknown true parameter θ* and the noise standard deviation σ. In practice, there are two typical data-dependent choices of λ: • Stein's unbiased risk estimate: If we know σ or have an estimate σ̂, we can reasonably choose λ by minimizing Stein's unbiased risk estimate (SURE)

SURE(λ) := ‖y − θ̂_λ‖₂² + 2σ² d̂f(θ̂_λ) − nσ².

Here, d̂f(θ̂_λ) := k(θ̂_λ) is an unbiased estimate of the degrees of freedom. See Tibshirani et al. (2011) for the derivation.
• Cross-validation: We can also apply cross-validation when the model (2) is interpreted as a discrete observation of a continuous signal. Specifically, suppose that the data are generated according to the nonparametric regression model

y_i = f*(x_i) + ξ_i, i = 1, 2, . . . , n,

where x_1 < x_2 < · · · < x_n are given design points in [0, 1] and f* : [0, 1] → R is an unknown piecewise monotone function. We define the nearly-isotonic regression estimator f̂_λ over the interval [0, 1] as follows. First, we determine the values θ̂_{λ,i} (i = 1, 2, . . . , n) by solving the nearly-isotonic regression problem (3) for the observations y_1, . . . , y_n. Then, we define f̂_λ : [0, 1] → R by interpolation; for instance, one can output a piecewise constant function with f̂_λ(x_i) = θ̂_{λ,i}. In this way, given a new design point x_new, we can predict the value of f*(x_new) by f̂_λ(x_new). Hence, cross-validation applies naturally in this situation.
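The SURE criterion above is straightforward to implement once a fit is available; the sketch below (function name ours) takes d̂f to be the number of constant pieces of the fit, following the degrees-of-freedom result of Tibshirani et al. (2011).

```python
# Sketch of Stein's unbiased risk estimate for a nearly-isotonic fit:
# SURE = ||y - fit||^2 + 2*sigma^2*df_hat - n*sigma^2, with df_hat the
# number of constant pieces of the fitted vector.

def sure(y, fit, sigma2):
    n = len(y)
    df = 1 + sum(1 for a, b in zip(fit, fit[1:]) if a != b)  # constant pieces of the fit
    rss = sum((a - b) ** 2 for a, b in zip(y, fit))
    return rss + 2 * sigma2 * df - n * sigma2
```

In practice one would evaluate `sure(y, fit, sigma2)` for the fits along the solution path in λ and pick the minimizer; note that for the identity fit the criterion reduces to nσ², penalizing the n degrees of freedom.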

Application to piecewise monotone vectors
To gain a deeper understanding of the adaptation property of the nearly-isotonic regression, we study the risk bound under a more specific assumption. We define the following moderate growth condition for piecewise monotone vectors.
Definition 4.11. Let n ≥ 2, and define the linear signal θ^linear by θ^linear_i := θ_1 + (θ_n − θ_1)(i − 1)/(n − 1). We say that a monotone vector θ ∈ K↑_n satisfies the moderate growth condition if

θ_i ≤ θ^linear_i for i = 1, 2, . . . , ⌊n/2⌋, and θ_i ≥ θ^linear_i for i = ⌊n/2⌋, ⌊n/2⌋ + 1, . . . , n.

[Figure 4. Left: The blue circles depict an example of a signal that satisfies the moderate growth condition: it is not larger than the linear signal θ^linear_i for 1 ≤ i ≤ 10, and not less than θ^linear_i for 10 ≤ i ≤ 20. The orange triangles depict a counterexample for this condition. Right: If θ satisfies the moderate growth condition, there is a k-piecewise monotone vector such that the lengths of the segments at both ends are not less than n/k. See Appendix D.5 for a detailed explanation.]

Figure 4 illustrates the moderate growth condition. In words, a signal θ ∈ R^n satisfying the moderate growth condition is not larger than the linear signal on the left half of the domain, and not less than the linear signal on the right half of the domain. Intuitively, the role of the moderate growth condition is to guarantee the minimal length condition (18) for a piecewise constant approximation.
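The condition can be checked mechanically against the linear interpolant. In the sketch below the function name and the exact treatment of the midpoint index are our assumptions; the paper's indexing at the midpoint may differ slightly.

```python
# Sketch of a moderate-growth check: the vector must lie below the linear
# interpolant on (roughly) the left half of its domain and above it on the
# right half, so that it is "S-shaped" around the straight line.

def moderate_growth(theta):
    n = len(theta)
    lin = [theta[0] + (theta[-1] - theta[0]) * i / (n - 1) for i in range(n)]
    left = all(theta[i] <= lin[i] for i in range(n // 2 + 1))   # below on the left half
    right = all(theta[i] >= lin[i] for i in range(n // 2, n))   # above on the right half
    return left and right
```

For instance, the sigmoid-like vector (0, 0.5, 1.5, 2.9, 3.0) satisfies the check, while (0, 2.0, 2.0, 2.0, 3.0) fails it because it overshoots the linear signal on the left half.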
Suppose that the true signal θ * is piecewise monotone and every segment satisfies the moderate growth condition. Then, the nearly-isotonic regression achieves a nearly minimax convergence rate as follows.
Corollary 4.12. Suppose that the following assumptions hold: (a) the partition is equi-spaced, i.e., |A_j| = n/m for each j; (b) the total variation of each piece is bounded as V(θ*_{A_j}) ≤ V/m; and (c) θ*_{A_j} satisfies the moderate growth condition for each j = 1, 2, . . . , m. Then, the estimator (3) with an optimally tuned parameter λ satisfies the risk bound

(1/n) E_{θ*} ‖θ̂_λ − θ*‖₂² ≤ C { (σ²V/n)^{2/3} log^{2/3}(en/m) + (σ²m/n) log(en/m) }. (25)

The risk bound (25) achieves the minimax rate over Θ_n(m, V) in Proposition 3.2 up to a multiplicative factor of log^{2/3}(en/m). We note that the restrictive assumption (a) in Corollary 4.12 is employed merely for the sake of simplicity of the proof; it may be relaxed to |A_j| ≥ c n/m for some c > 0.

Model selection based estimators
Here, we consider estimators obtained by model selection among all partitions Π. The main purpose of this section is to discuss whether the minimax lower bound in Proposition 3.2 can be achieved without additional assumptions such as the moderate growth condition. Given a connected partition Π = (A_1, A_2, . . . , A_m) of [n], we write K↑_Π for the set of piecewise monotone vectors on Π, i.e.,

K↑_Π := {θ ∈ R^n : θ_{A_i} is monotone non-decreasing for each i = 1, . . . , m}.

Let θ̂_Π denote the projection estimator onto K↑_Π. By definition, θ̂_Π is obtained by concatenating the isotonic regression estimators on the individual segments.
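The projection estimator θ̂_Π is simple to realize in code: run an isotonic fit on each segment and concatenate. A minimal sketch (function names ours), with the partition encoded as a list of segment lengths:

```python
# Sketch of the projection onto K_Pi^up: apply PAVA independently on each
# segment of the connected partition and concatenate the fitted pieces.

def pava(y):
    blocks = []  # [sum, count]; merge while block means violate the order
    for v in y:
        blocks.append([v, 1])
        while len(blocks) > 1 and blocks[-2][0] * blocks[-1][1] > blocks[-1][0] * blocks[-2][1]:
            t, c = blocks.pop()
            blocks[-1][0] += t
            blocks[-1][1] += c
    return [t / c for t, c in blocks for _ in range(c)]

def project_onto_partition(y, lengths):
    """lengths: segment sizes of a connected partition of [n]."""
    fit, start = [], 0
    for length in lengths:
        fit.extend(pava(y[start:start + length]))
        start += length
    return fit

y = [3.0, 1.0, 2.0, 5.0, 4.0, 6.0]
fit = project_onto_partition(y, [3, 3])  # isotonic fit on each half
```

Note that the concatenated fit is monotone within each segment but may drop at the segment boundary, which is exactly the piecewise monotone structure.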
If we know the true partition Π* on which θ* is piecewise monotone, then the risk of the projection estimator θ̂_{Π*} is bounded from above by the sum of the isotonic risk bounds over the segments. If the true partition is unknown, a natural idea is to select a data-dependent partition Π̂ by a penalized selection rule:

Π̂ := argmin_Π { ‖y − θ̂_Π‖₂² + pen(Π) }. (26)

Here, pen(Π) is a positive penalty for the partition Π.
The penalized selection rules have been well studied in statistics. In particular, Birgé and Massart (2001) and Massart (2007) developed non-asymptotic risk bounds for generic model selection settings in Gaussian sequence models. Hereafter, we construct a penalized selection estimator in the spirit of Theorem 4.18 in Massart (2007).
Instead of selecting θ̂_Π according to (26), we introduce total variation sieves: in addition to selecting a partition, we also select budgets for the piecewise total variations, as follows. Let Π = (A_1, A_2, . . . , A_m) be a connected partition. For any vector V = (V_1, V_2, . . . , V_m) with V_i ≥ 0 (i = 1, 2, . . . , m), we define the set of piecewise monotone vectors with bounded total variations as

K↑_Π(V) := {θ ∈ K↑_Π : V(θ_{A_i}) ≤ V_i for each i = 1, . . . , m}.

Then, we define θ̂_{Π,V} as the projection estimator onto K↑_Π(V). Next, we define a countable set of budget vectors as

V_Π := {(v(j_1), . . . , v(j_m)) : j_1, . . . , j_m ∈ N}, where v(j) := j^{3/2}.

Finally, we select a pair (Π̂, V̂) as the solution of the minimization problem

(Π̂, V̂) := argmin_{Π, V ∈ V_Π} { ‖y − θ̂_{Π,V}‖₂² + pen(Π, V) }. (27)

With a careful choice of the penalty term pen(Π, V), we have the following result.

Theorem 5.1. There exists an absolute constant C_pen > 0 such that the following statement holds. For any pair (Π, V), define the penalty pen(Π, V) to be sufficiently large (depending on C_pen), and let (Π̂, V̂) be the minimizer in (27).
In particular, if θ * is piecewise monotone on Π = (A 1 , A 2 , . . . , A m ), we obtain the risk bound (28). We emphasize that Theorem 5.1 does not require any additional assumptions on θ * , e.g., the minimum length condition or the moderate growth condition introduced in the previous section. Therefore, it suggests the existence of a penalized model selection estimator that achieves the minimax rate in Proposition 3.2. However, the estimator (27) is not practical for computational reasons because it is obtained through minimization over exponentially many possible partitions Π.
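To make the combinatorial cost concrete, the selection can be emulated by brute force for tiny n: a connected partition of [n] corresponds to a subset of the n − 1 possible breakpoints, so there are 2^(n−1) candidates. The sketch below uses an illustrative penalty pen(Π) = γ|Π| (our stand-in, not the paper's pen(Π, V)) and PAVA for the per-segment projections:

```python
import itertools
import numpy as np

def pava(y):
    """Isotonic regression via the pool adjacent violators algorithm."""
    sums, counts = [], []
    for v in y:
        sums.append(float(v)); counts.append(1)
        while len(sums) > 1 and sums[-2] / counts[-2] > sums[-1] / counts[-1]:
            s, c = sums.pop(), counts.pop()
            sums[-1] += s; counts[-1] += c
    return np.concatenate([np.full(c, s / c) for s, c in zip(sums, counts)])

def connected_partitions(n):
    """All partitions of range(n) into consecutive segments: one per
    subset of the n - 1 possible breakpoints, 2^(n-1) in total."""
    for r in range(n):
        for cuts in itertools.combinations(range(1, n), r):
            bounds = (0,) + cuts + (n,)
            yield [list(range(a, b)) for a, b in zip(bounds, bounds[1:])]

def select_partition(y, gamma):
    """Exhaustive penalized selection with the illustrative penalty
    pen(Pi) = gamma * |Pi| (a stand-in for the paper's pen(Pi, V))."""
    y = np.asarray(y, dtype=float)
    best = None
    for part in connected_partitions(len(y)):
        theta = np.concatenate([pava(y[seg]) for seg in part])
        score = float(np.sum((y - theta) ** 2)) + gamma * len(part)
        if best is None or score < best[0]:
            best = (score, part, theta)
    return best
```

The exponential loop over `connected_partitions` is exactly the computational obstruction noted above.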
The dependence on the total variation of each segment in (28) is (V Ai (θ * Ai ) + 1) 2/3 instead of (V Ai (θ * Ai )) 2/3 . The additional constant 1 is due to the minimal resolution of the sieve. It remains an open problem to establish a non-asymptotic risk bound for the penalized model selection estimator without sieves (i.e., (26)) and to remove the dependence on the sieve resolution.

Simulations
We provide some numerical examples for piecewise monotone regression problems.

Dealing with inconsistency at boundaries
Before presenting the simulation results, we explain a well-known practical issue in the isotonic regression literature and a regularization method to cope with it.
In the study of statistical estimation under monotonicity constraints, it is known that the least squares estimator θ̂ K ↑ n is inconsistent at the boundary points (see e.g., Groeneboom and Jongbloed (2014) and Woodroofe and Sun (1993)). A similar issue arises for the nearly-isotonic regression estimators. Since the penalty term in (3) is not activated when the order is respected at the boundary points (i.e., y 1 < y 2 or y n−1 < y n ), the nearly-isotonic regression is not robust against negative noise at the left boundary or positive noise at the right boundary. To overcome this issue, we consider a boundary correction regularization for the nearly-isotonic regression with an additional tuning parameter µ > 0. It can easily be checked that the solution is equivalent to that of the ordinary nearly-isotonic regression (3) applied to ỹ = (y 1 + µ, y 2 , . . . , y n−1 , y n − µ). Similar regularization methods for isotonic regression have been studied by Chen et al. (2015), Wu et al. (2015) and Luss and Rosset (2017).
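The claimed equivalence can be checked numerically. Assuming the correction enters the objective as an additional term µ(θ n − θ 1 ) (our reading; the displayed objective is omitted above), the corrected objective on y and the ordinary objective (3) on ỹ differ by a constant that does not depend on θ:

```python
import numpy as np

rng = np.random.default_rng(0)
n, lam, mu = 8, 1.0, 0.5
y = rng.normal(size=n)
y_tilde = y.copy()
y_tilde[0] += mu    # shift the left boundary observation up
y_tilde[-1] -= mu   # shift the right boundary observation down

def neariso_objective(theta, data, lam):
    """Ordinary nearly-isotonic objective (3): squared loss plus the
    penalty lam * sum_i max(theta_i - theta_{i+1}, 0)."""
    fit = 0.5 * np.sum((data - theta) ** 2)
    pen = lam * np.sum(np.maximum(theta[:-1] - theta[1:], 0.0))
    return fit + pen

# the gap between the two objectives should not depend on theta
diffs = []
for _ in range(20):
    theta = rng.normal(size=n)
    corrected = neariso_objective(theta, y, lam) + mu * (theta[-1] - theta[0])
    diffs.append(corrected - neariso_objective(theta, y_tilde, lam))
```

A constant gap means both objectives have the same minimizer, which is the stated equivalence.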

Simulation data
Here, we evaluate the performance of the nearly-isotonic regression and related estimators on simulated data. According to the one-dimensional regression model (23), we generated data with equispaced design points x i = (i − 1)/n (i = 1, 2, . . . , n). For the true function f * , we consider m-piecewise monotone functions f (m) obtained by repeating a given monotone function f : [0, 1) → R on each interval I j := [(j − 1)/m, j/m) for j = 1, 2, . . . , m. Following Meyer and Woodroofe (2000), we choose f from two monotone functions: a sigmoidal function f sigmoid and the cubic function f cubic (x) = (2x − 1) 3 + 1. Figure 2 shows an example with f = f sigmoid and m = 2. It is worth noting that the former sigmoidal function f sigmoid satisfies the moderate growth condition (see Definition 4.11), whereas the latter cubic function f cubic does not. Hence, for the case of piecewise sigmoidal functions f (m) sigmoid , the minimax rate of O(n −2/3 ) is achieved by both the nearly-isotonic regression and the fused lasso (see Corollary 4.12 above and Corollary 2.8).
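The construction of f (m) can be sketched as follows; `piecewise_monotone` is our name for the rescaling, and f cubic is the cubic function given above:

```python
import numpy as np

def f_cubic(x):
    # the cubic base function from the text
    return (2.0 * x - 1.0) ** 3 + 1.0

def piecewise_monotone(f, m, x):
    """f^(m)(x): rescale the monotone base f onto each interval
    I_j = [(j-1)/m, j/m), so the signal restarts at every breakpoint."""
    x = np.asarray(x, dtype=float)
    j = np.minimum(np.floor(m * x).astype(int), m - 1)  # segment index
    return f(m * x - j)

n, m = 64, 2
x = np.arange(n) / n
theta_star = piecewise_monotone(f_cubic, m, x)
```

The resulting signal is non-decreasing within each interval I j and drops at every breakpoint j/m.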
In our experiments, the size n of the signal is chosen from {2^6 , 2^7 , . . . , 2^10 }. The noise standard deviation σ is assumed to be known and fixed to 0.25. We evaluated the MSE for the following four estimators: • Neariso: The nearly-isotonic regression (3).
• NearisoBC: The nearly-isotonic regression with the boundary correction described above.
• Fused: The fused lasso.
• PO: The projection estimator with the partition oracle, i.e., the projection estimator onto K ↑ Π provided with the true partition Π.
For Neariso and Fused, the tuning parameter λ is selected by the generalized C p criterion (i.e., minimizing SURE (22)). For NearisoBC, the tuning parameters (λ, µ) are selected by a similar criterion. To estimate the MSE, we generated 500 replications of the data and calculated the average value of the squared loss (1/n) θ̂ − θ * 2 2 . Figure 5 presents the results for m = 2, 4 and f = f sigmoid , f cubic . The upper row shows log-log plots of the MSE versus n. In each setting, the three regularization-based estimators (i.e., Neariso, NearisoBC and Fused) performed as well as the ideal estimator PO, even though the former three estimators do not use the information about the true partition. The risks of PO are well fitted by lines of slope −2/3, which means that the convergence rate is close to the minimax optimal rate of O(n −2/3 ).
Next, we provide more detailed comparisons of the regularization-based estimators. The lower row in Figure 5 shows the differences between the MSEs and that of PO. For piecewise sigmoidal functions, NearisoBC and Fused performed better than Neariso. Notably, in the case of m = 2, the risks of Fused were even smaller than those of PO for large values of n. A possible reason for the better performance of the fused lasso is that the sigmoidal function can be well approximated by a piecewise constant function near the boundaries. On the other hand, for piecewise cubic functions, Neariso performed slightly better than the other two estimators for small values of n.

Geological data
We conducted experiments on GPS data related to a seismological phenomenon reported by Rogers and Dragert (2003). The aim here is to investigate the performance of the nearly-isotonic type estimators on real-world data for which piecewise monotone approximations have already been justified in previous work. For the signal y, we used the difference of the east-west components of GPS measurements between two observatories, which are located in Victoria (British Columbia, Canada) and Seattle (United States). The GPS data is provided by Melbourne et al. (2018). The top panel in Figure 6 shows the plot. The data period starts on January 1, 2010, and ends on December 2, 2017. After removing missing records, the size of the signal is n = 2885. The increasing trend of the signal is considered to be caused by the subduction process at the plate boundary. We can also see periodic reversals in the signal, so the entire signal may be approximated by a piecewise monotone signal. Such reversals may be related to the seismological phenomenon known as episodic tremor and slip. According to Rogers and Dragert (2003), such slip events were observed every 13 to 16 months in their data taken from 1997 to 2003. GPS data contains several anomalous values. For the signal y considered above, most of the values y i are between 20 and 50, except for a single outlier y 2344 = 139.34. The behaviors of the estimators are strongly affected by such outliers. In our situation, we can manually remove the anomalous value (we denote the resulting signal by ỹ). However, it is often difficult to distinguish outliers in practical situations. From this perspective, we also considered the robust M-estimation version of the nearly-isotonic regression defined as (34) with L(θ; y) = Σ i=1 n ℓ δ (θ i − y i ). Here, ℓ δ is the Huber loss ℓ δ (u) = u 2 /2 for |u| ≤ δ and ℓ δ (u) = δ(|u| − δ/2) otherwise, which is commonly used in the robust regression literature. We applied the nearly-isotonic regression (3) and its robust variant to the signals y and ỹ described above.
The tuning parameters λ were determined by 5-fold cross-validation, and the Huber parameter was fixed as δ = 0.01.
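For reference, the Huber loss and the resulting robust objective can be written out as follows (the function names are ours; the penalty term follows the nearly-isotonic form (3)):

```python
import numpy as np

def huber(u, delta):
    """Huber loss: quadratic for |u| <= delta, linear beyond."""
    a = np.abs(np.asarray(u, dtype=float))
    return np.where(a <= delta, 0.5 * a ** 2, delta * (a - 0.5 * delta))

def robust_neariso_objective(theta, y, lam, delta):
    """Huber data fit plus the nearly-isotonic downward-jump penalty."""
    fit = np.sum(huber(theta - y, delta))
    pen = lam * np.sum(np.maximum(theta[:-1] - theta[1:], 0.0))
    return fit + pen
```

Because the Huber loss grows only linearly beyond δ, a single large outlier contributes far less to the fit than under the squared loss, which explains the behavior in the fourth panel of Figure 6.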
First, we consider the case where the outlier is removed manually. The second panel in Figure 6 shows the result for the cross-validated nearly-isotonic regression. The vertical lines denote the locations of downward jumps in the estimator. We can see that the period of the jump clusters is about 12 to 14 months, which is close to that of the seismological slip events suggested by Rogers and Dragert (2003).
Next, we consider the case where the signal contains the outlier. In this case, the value of the squared loss largely depends on the error at the coordinate of the outlier. The cross-validation may then choose a large tuning parameter, and the resulting estimator becomes close to a monotone signal. Indeed, the third panel in Figure 6 shows that the number of downward jumps is considerably smaller than the number expected from the known frequency of the slip events. In contrast, the fourth panel in Figure 6 shows that the robust version of the nearly-isotonic regression outputs clusters of changepoints similar to those in the second panel.

Discussion
In this paper, we studied the problem of estimating piecewise monotone signals. The classical isotonic regression estimator can fail in this setting because of the existence of arbitrarily large downward jumps. We derived the minimax risk lower bound over piecewise monotone signals with bounded upper total variations. The minimax rate is tight up to a multiplicative constant because it can be achieved by a (computationally inefficient) model selection based estimator. Our main results show that the nearly-isotonic regression estimator achieves this rate under an additional growth condition. An advantage of the nearly-isotonic regression is that the estimator can be computed efficiently on arbitrary directed graphs by parametric max-flow algorithms. The simulation results demonstrate that the nearly-isotonic regression converges at nearly the same rate as the ideal estimator that knows the true partition.

Non-Gaussian noises
In this paper, we provided risk bounds for the nearly-isotonic regression under the assumption that the noise distribution is Gaussian. In practice, however, this assumption can be too restrictive. We here briefly discuss risk bounds under non-Gaussian error distributions.
Suppose that ξ 1 , . . . , ξ n are i.i.d. random variables with E[ξ 1 ] = 0 and Var(ξ 1 ) = σ 2 . Then, we can see that the "expectation bound" (20) holds with a different constant C > 0. See Remark D.14 in the appendix for the key ingredients of the derivation. On the other hand, the "high-probability bound" (21) does not hold in general since it requires a stronger concentration property (i.e., the Gaussian concentration).

Future directions
An interesting direction for future work is to investigate the optimal rate of piecewise monotone regression on higher dimensional grids or general graphs. Recently, several researchers have analyzed the risk bounds for isotonic regression estimators on grid graphs of dimension two or higher (Chatterjee et al. 2018). It is natural to ask whether one can construct a computationally efficient estimator that is adaptive to piecewise monotone vectors on a given graph. We believe that the nearly-isotonic type estimator (32) is a candidate. A major difficulty is to determine an appropriate graph topology. Given a partial order on a set V = [n], the corresponding isotonic regression estimator is uniquely determined. However, there are many directed acyclic graphs that correspond to a given partial order. Hence, the graph topology for the nearly-isotonic type estimators is not unique. To control the connectivity, it may be useful to introduce the edge weighting proposed by Fan and Guan (2017).
Another direction is to develop a model selection method for least squares estimators over unbounded cones. We introduced sieves on the total variation in Section 5 to construct an estimator that is adaptive to piecewise monotone vectors. In practice, sieve-based methods can be computationally inefficient. On the other hand, if the true vector θ * is monotone, the isotonic regression automatically achieves the minimax rate with respect to the total variation. We conjecture that it is also possible to select the least squares estimator θ̂ Π without using sieves. In particular, we leave it as an open question whether the adaptive risk bound is achieved by the penalized selection rule of the form (26).

A Algorithms for nearly-isotonic estimators
In this section, we present algorithms for the nearly-isotonic regression and related estimators and discuss their computational complexities. Note that the main purpose of this section is to give a review of existing algorithms, and hence most results presented in this section are not new (except for Proposition A.1).

A.1 Penalized estimators
Here, we introduce two algorithms to solve the penalized form of the nearly-isotonic regression (3). In Section A.1.1, we describe the solution path algorithm developed by Tibshirani et al. (2011). The advantage of the solution path algorithm is that it outputs the solutions θ̂ λ for every λ ≥ 0 simultaneously. However, the solution path algorithm cannot be applied to estimators with general weights and graphs. In Section A.1.2, we provide another algorithm that outputs the exact solution for a single λ. The latter algorithm can be applied to nearly-isotonic type estimators defined on any weighted directed graph.

A.1.1 One-dimensional problem
The modified pool adjacent violators algorithm (modified PAVA, Tibshirani et al. (2011)) calculates the solution path for problem (3). Here, we present a variant of the modified PAVA for a weighted version of the estimator with positive weight parameters c i > 0 (i = 1, 2, . . . , n − 1). Letting c i = (x i+1 − x i ) −1 , this formulation covers the nearly-isotonic regression for general increasing design points (24).
Set θ̂ λi to be the piecewise constant vector whose values on A j are t j + m j δ (j = 1, 2, . . . , k).

Set Π i to be the constant partition of θ̂ λi . end The derivation of Algorithm 1 is straightforward from the original paper of Tibshirani et al. (2011). We should note that the validity of this algorithm crucially depends on the property that the solution path is piecewise linear and "agglomerative". It is well known that the piecewise linearity of the solution path holds for many classes of regularized estimators (Rosset and Zhu 2007).
We say that the solution path {θ̂ λ } λ≥0 is agglomerative if it satisfies the following condition: if θ̂ λ0,i = θ̂ λ0,j holds for some λ 0 ≥ 0, then the same equality holds for any λ ≥ λ 0 . For the constant weights (c i ≡ 1), such an agglomerative property was proved by Tibshirani et al. (2011). However, for general non-unitary edge weights (c i ≢ 1), this need not be true. Here, we provide the following proposition to ensure the agglomerative property for non-unitary edge weights.
Proposition A.1. The solution path of weighted nearly-isotonic regression (30) is piecewise linear and agglomerative if the edge weights satisfy the following concavity condition.
where we defined c 0 := 0. In particular, this condition implies that Algorithm 1 outputs the exact solution path.
The condition (31) demands that c j can be written as c j = f (j) for some concave function f : R ≥0 → R ≥0 with f (0) = 0 and f (x) > 0 for all x > 0. In particular, for any i ≤ j ≤ k, the weights satisfy the corresponding concavity inequality. Proof sketch of Proposition A.1. We can prove the validity of Algorithm 1 by an argument similar to that of Tibshirani et al. (2011) if we assume the piecewise linearity and the agglomerative property. The piecewise linearity is already shown in Rosset and Zhu (2007). Hence, it remains to prove the agglomerative property under the condition (31). To this end, we leverage the "agglomerative clustering condition" defined in Appendix D.6. We defer the details to Remark D.25 as well as Remark D.27.
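The condition can be checked mechanically. The sketch below encodes our reading of (31), namely that with c 0 := 0 the increments c j − c j−1 are nonincreasing (equivalently, c j = f (j) for a concave f with f (0) = 0):

```python
import numpy as np

def satisfies_concavity(c):
    """Our reading of condition (31): with c_0 := 0, the increments
    c_j - c_{j-1} are nonincreasing, i.e. c_j = f(j) for some concave
    f with f(0) = 0."""
    c = np.concatenate([[0.0], np.asarray(c, dtype=float)])
    increments = np.diff(c)
    return bool(np.all(np.diff(increments) <= 1e-12))
```

For example, the weights c j = sqrt(j) and the constant weights c j ≡ 1 (the case of Tibshirani et al. (2011)) pass the check, while the convex weights c j = j^2 do not.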

A.1.2 General graphs
Let G = (V, E) be a directed graph with V := [n]. Suppose that each edge (i, j) ∈ E is equipped with a positive weight c (i,j) > 0. We define the generalized nearly-isotonic regression as the minimizer of the squared loss plus λ V G (θ), where V G is a nearly-isotonic type penalty that charges each edge (i, j) ∈ E for the violation θ i > θ j . For any choices of G and c, V G is a convex function. Clearly, the lower total variation V − is a special case where E = {(i, i + 1) : i = 1, 2, . . . , n − 1} and c (i,i+1) ≡ 1. Thus, (32) can be regarded as a generalization of the nearly-isotonic regression to general directed graphs. Problems of the form (32) have been well studied in the optimization literature. In particular, solving (32) is equivalent to solving a certain parametrized family of minimum-cut problems. For detailed explanations of this equivalence, see Obozinski and Bach (2016) and Chapter 8 in Bach (2013). Hence, (32) can be solved by the parametric max-flow algorithm (Gallo et al. 1989) that runs in O(n|E| log(n 2 /|E|)) time. Moreover, it has been pointed out by Mairal et al. (2011) that, for many practical instances, some simplified variants of the parametric max-flow algorithm output the solution faster than the original algorithm by Gallo et al. (1989). We remark that Hochbaum and Queyranne (2003) also studied the relationship between the isotonic regression and the parametric max-flow algorithm.
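The penalty V G can be evaluated directly from this description; the sketch below assumes the natural form Σ (i,j)∈E c (i,j) max(θ i − θ j , 0), which reduces to the lower total variation V − on the unit-weight chain:

```python
import numpy as np

def V_G(theta, edges, weights):
    """Assumed form of the generalized nearly-isotonic penalty:
    sum over directed edges (i, j) of c_(i,j) * max(theta_i - theta_j, 0)."""
    theta = np.asarray(theta, dtype=float)
    return sum(w * max(theta[i] - theta[j], 0.0)
               for (i, j), w in zip(edges, weights))

# the chain graph with unit weights recovers the lower total variation V^-
theta = np.array([0.0, 2.0, 1.0, 3.0])
chain = [(i, i + 1) for i in range(len(theta) - 1)]
lower_tv = V_G(theta, chain, [1.0] * len(chain))  # single downward jump 2 -> 1
```

Each term is a maximum of two affine functions of θ, so V G is convex for any G and c, as stated above.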
Algorithm 2 shows the Divide-and-Conquer algorithm (Chapter 9 of Bach (2013)) that solves (32). In the inner loop, the algorithm recursively solves max-flow problems by defining smaller networks (Algorithm 3). See Figure 7 for examples of networks used in the first two recursions in the algorithm.
Algorithm 2: Divide-and-Conquer algorithm for the generalized nearly-isotonic regression (32). Input: y ∈ R V , a directed graph G = (V, E) with positive edge weights {c (i,j) }, a tuning parameter λ ≥ 0.
Output: The solution θ̂ λ of (32). 1 Construct a flow network N by adding a source node s and a sink node t to the graph G. 2 Compute θ̂ λ = Prox λF N (y) according to Algorithm 3.

A.1.3 General convex loss functions
In practice, we are often interested in general convex loss functions other than the squared loss. Here, we consider a generalized problem of the form: minimize L(θ; y) + λ V G (θ), where θ → L(θ; y) is a convex loss function for any y ∈ R n . As an example, this formulation contains the M-estimator in the regression setting L(θ; y) = Σ i=1 n ℓ(y i − ⟨x i , θ⟩), where (y i , x i ) ∈ R × R p (i = 1, 2, . . . , n) are the observed data and ℓ : R → R is a convex function. We can also obtain algorithms that output approximate minimizers of (34) as follows. First of all, note that Algorithm 2 outputs the proximal operator of the regularization term V G (θ). Once we have an oracle for the proximal operator, we can apply proximal gradient methods to solve (34). Herein, F N is the s-t cut function of the network N . This step is equivalent to solving the max-flow problem defined by the flow network in Figure 7-(a).
The corresponding network is obtained by shrinking nodes V \ A into the sink node t (Figure 7-(b)).
The corresponding network is obtained by shrinking nodes A into the source node s and adding −F N (A) to the capacity of (s, t) (
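The proximal gradient scheme itself is generic. Since reproducing Algorithm 2 here would be lengthy, the runnable sketch below substitutes a soft-thresholding prox (the prox of the ℓ 1 norm) for the prox of λ V G ; the loop is the method described above, not the paper's exact solver:

```python
import numpy as np

def soft_threshold(z, t):
    # stand-in prox oracle: prox of t * ||.||_1 (NOT Algorithm 2 itself)
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def proximal_gradient(grad_loss, prox, x0, step, n_iter=200):
    """Generic proximal gradient method: a gradient step on the smooth
    loss followed by the prox of the (nonsmooth) regularizer."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iter):
        x = prox(x - step * grad_loss(x), step)
    return x

# sanity check with the squared loss: the iteration reaches prox(y)
y = np.array([3.0, -0.2, 0.7])
lam = 0.5
sol = proximal_gradient(lambda th: th - y,
                        lambda z, s: soft_threshold(z, s * lam),
                        np.zeros_like(y), step=1.0)
```

With the squared loss and unit step, one iteration already lands on the proximal point of y, which matches the role of Algorithm 2 as a prox oracle.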

A.2 Constrained estimators
Consider the following generalized version of the constrained form of the nearly-isotonic regression (11): minimize y − θ 2 2 subject to V G (θ) ≤ V. (35) Unlike the penalized estimators, it is difficult to find an exact solution of (35). However, since problem (35) is an instance of a quadratic programming problem, there are polynomial time algorithms that obtain approximate solutions. Here, we explain the existence of such algorithms. The following result is a direct application of Theorem 1 by Lee et al. (2018), which provides a convergence guarantee for a variant of cutting plane methods. Proposition A.2. Let V > 0. Then, for any ε > 0, there exists a randomized algorithm that outputs θ̂ satisfying the constraint and y − θ̂ 2 ≤ min θ∈R n : V G (θ)≤V y − θ 2 + ε 2 y 2 with probability at least 0.99. The overall complexity of the algorithm is O((n + |E|)n 2 log O(1) (n|E|/ε)).
Remark A.3. In practice, due to computational considerations, we recommend using the penalized estimator (33) instead of the constrained estimator (35). For the penalized estimator, we empirically observed that Algorithm 2 runs sufficiently fast on graphs with several hundred nodes. For the constrained estimator, Proposition A.2 theoretically guarantees polynomial time solvability of the constrained problem (35), but it does not provide a practical algorithm.

B Supplemental experiments
To understand the behavior of the nearly-isotonic regression in more general settings, we present additional simulation results for the nearly-isotonic regression on general graphs (32). Here, we consider the problem of estimating piecewise monotone signals on two-dimensional grids. The true parameter θ * is a 32 × 32 matrix that is monotone on each 16 × 16 segment. The bivariate isotonic regression (LSE) does not capture the piecewise monotone structure, whereas the solution of the nearly-isotonic regression (Neariso2) appears to be close to that of the partition oracle (PO).
We say that an n 1 × n 2 matrix θ is monotone if θ ij ≤ θ kl whenever i ≤ k and j ≤ l. In other words, θ is monotone if it has no order-violating edges in the two-dimensional grid graph G 2 = (V 2 , E 2 ), where V 2 = [n 1 ] × [n 2 ] is the set of all subscripts (i, j) and E 2 consists of the directed edges ((i, j), (i + 1, j)) and ((i, j), (i, j + 1)). We say that θ is piecewise monotone if there is a partition Π of V 2 such that, for each A ∈ Π, A is weakly connected in G 2 and θ A has no order-violating edges in the induced subgraph. For simplicity of the experimental settings, we here only consider "block" type partitions, i.e., we say that Π is of block type if it can be represented as a product of two partitions of the two coordinates. The left panel in Figure 8 is an example of a two-dimensional piecewise monotone signal on a block type partition. We compare the following three estimators: • LSE: The bivariate isotonic regression (see e.g., Robertson et al. (1988)).
• Neariso2: The two-dimensional nearly-isotonic regression with C p -tuned parameter.
• PO: The bivariate isotonic regression applied to the true partition.
For monotone matrices, Chatterjee et al. (2018) proved that LSE is minimax rate optimal with respect to n = n 1 n 2 . Hence, the partition oracle estimator PO can be regarded as an ideal benchmark that is minimax optimal over piecewise monotone matrices. On the other hand, if the true matrix θ * is piecewise monotone, the risk of LSE can be arbitrarily large for the same reason as in Proposition 3.3. Neariso2 is the special case of the generalized nearly-isotonic regression (32) applied to the graph G 2 defined above. Neariso2 was originally discussed in Tibshirani et al. (2011), but no experimental results have been presented. Figure 8 shows examples of the solutions of the three estimators. We construct an n × n matrix θ * as follows: We define a k × k small monotone matrix U , and then we define θ * as an mk × mk block matrix by repeating U m times both in rows and columns (thus n = mk). We choose the small matrix U = (U ij ) from two candidates, where we write x i = (i − 1)/(k − 1) for i = 1, 2, . . . , k. With the former choice, θ * becomes an m 2 -piecewise monotone matrix. With the latter choice, θ * becomes an m-piecewise monotone matrix such that θ * ij does not depend on j. We generated noisy observations y by adding independent Gaussian noises ξ ij ∼ N (0, (0.25) 2 ) to every entry of θ * . To estimate the MSE, we used 500 replications of the data. Figure 9 shows the results. Clearly, the risks of LSE (blue triangles) are much larger than those of the other two estimators. Neariso2 (green circles) has slightly larger risks than PO (magenta squares), while their slopes seem to be close.
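The matrix construction above can be made concrete as follows; the block U here is a simple monotone matrix of our own choosing, standing in for the candidates described above:

```python
import numpy as np

def is_monotone_matrix(theta):
    """theta_ij <= theta_kl whenever i <= k and j <= l, which is
    equivalent to monotonicity along every row and every column."""
    return bool(np.all(np.diff(theta, axis=0) >= 0)
                and np.all(np.diff(theta, axis=1) >= 0))

k, m = 4, 2
x = np.arange(k) / (k - 1)
U = np.add.outer(x, x)           # a simple monotone k x k block (our choice)
theta_star = np.tile(U, (m, m))  # piecewise monotone, not globally monotone
```

Tiling U restarts the signal at every block boundary, so the resulting θ * is m^2-piecewise monotone but fails the global monotonicity check.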

C Proofs in Section 3
C.1 Proof of Proposition 3.2
Let Θ be either Θ̃ n (m, V) or Θ n (m, V), which are defined in Definition 3.1. The minimax lower bound (10) is proved by combining the following two lower bounds: (i) (Lower bound for monotone vectors (Zhang 2002, Chatterjee et al. 2015)) Let K(V) = {θ ∈ K ↑ n : V(θ) ≤ V} be the set of monotone vectors with bounded total variations. There is a universal constant C 1 > 0 such that the corresponding lower bound holds for any estimator θ̂. (ii) (Lower bound for piecewise constant vectors) Let C(m) be the set of m-piecewise constant vectors in R n , i.e., θ ∈ C(m) if |{i : θ i ≠ θ i+1 }| ≤ m − 1. The minimax lower bound over C(m) can be related to sparse estimation as follows. Let X be an n × n matrix whose (i, j) entry is given as 1 {i≥j} . Then, C(m) contains the set {θ = Xβ : β 0 ≤ m}, and the lower bound for the minimax risk over C(m) follows from the well-known results for ℓ 0 balls (e.g., Raskutti et al. (2011), Theorem 3-(b)). In particular, for any m ≥ 3, the following lower bound is presented in Gao et al. (2017), where C 2 > 0 is a universal constant.
It remains to show that Θ contains K(V) and C(m). C(m) ⊆ Θ is obvious because an m-piecewise constant vector is also an m-piecewise monotone vector whose piecewise total variations are zero. From the definition, it is also clear that K(V) ⊆ Θ̃ n (m, V). If θ ∈ K(V), jumps θ i+1 − θ i that strictly exceed V/m cannot occur more than m − 1 times. Hence, we can choose a partition Π with |Π| ≤ m so that each A ∈ Π does not contain such large jumps, which implies that θ ∈ Θ n (m, V).
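The inclusion C(m) ⊇ {Xβ : β 0 ≤ m} used in step (ii) can be verified numerically: Xβ is the vector of cumulative sums of β, so each nonzero coordinate of β beyond the first creates at most one jump. In this sketch we place one nonzero at the first coordinate so that the jump count stays within m − 1:

```python
import numpy as np

n, m = 12, 3
# X_ij = 1{i >= j}, so X @ beta is the vector of cumulative sums of beta
X = (np.arange(n)[:, None] >= np.arange(n)[None, :]).astype(float)

rng = np.random.default_rng(1)
beta = np.zeros(n)
support = np.concatenate([[0], rng.choice(np.arange(1, n), size=m - 1,
                                          replace=False)])
beta[support] = rng.normal(size=m)

theta = X @ beta
n_jumps = int(np.sum(theta[:-1] != theta[1:]))  # changepoints of theta
```

Here θ changes value only where the next coordinate of β is nonzero, which is the mechanism relating ℓ 0 -sparse β to piecewise constant θ.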

C.2 Proof of Proposition 3.3
The following theorem from the seminal paper of Chatterjee (2014) provides useful upper and lower bounds for the risk of the least squares estimator over any closed convex set K.
Theorem C.1 (Chatterjee (2014), Corollary 1.2). Let K ⊆ R n be any closed convex set, and let θ̂ K denote the least squares estimator over K. For any θ * ∈ R n , define the function g θ * (t) := E sup {θ∈K: θ−θ * 2 ≤t} ⟨ξ, θ − θ * ⟩ − t 2 /2. Here, if the set {θ ∈ K : θ − θ * 2 ≤ t} is empty, we define g θ * (t) = −∞. Then, g θ * is strictly concave for t ≥ dist(θ * , K) and has a unique maximizer t θ * . Moreover, there are universal constants C 1 , C 2 > 0 such that the risk is bounded above and below in terms of t θ * as in (36). To prove Proposition 3.3, we use the lower bound in (36). Note that for a sufficiently large t 0 > 0, t → t 2 − Ct 3/2 is strictly increasing on [t 0 , ∞). For any n and σ 2 , choose t ≥ t 0 so that t 2 − Ct 3/2 ≥ nσ 2 . Then, the claim follows for any θ * such that dist(θ * , K) ≥ t. Remark C.2. We should note that the above proof is valid for any closed convex set K. For the specific choice of K = K ↑ n , the lower bound of t n,σ 2 used in the proof can be quite conservative. In practice, the risk of the isotonic regression estimator can be larger than σ 2 under a smaller ℓ 2 -misspecification error.

D Proofs in Section 4 D.1 Preliminaries
To state the results for risk upper bounds, we first introduce some quantities related to Gaussian processes.
Definition D.1. Let C be a closed convex set in R n . Let E denote the expectation with respect to an isotropic Gaussian random variable Z ∼ N (0, I n ).
(i) The Gaussian width of C is defined as w(C) := E sup x∈C ⟨Z, x⟩. (ii) The Gaussian mean squared distance is defined as D(C) := E dist(Z, C) 2 , where dist(z, C) := inf x∈C x − z 2 .
(iii) Suppose that C is a convex cone. The statistical dimension of C is defined as δ(C) := E Π C (Z) 2 2 , where Π C denotes the Euclidean projection onto C. We present some historical remarks on these definitions. The three quantities in Definition D.1 can be interpreted as complexity measures for the subset C of Euclidean space. The Gaussian width has been well studied in convex geometry, signal processing, high-dimensional statistics, and empirical process theory; see e.g., Section 7.8 in Vershynin (2018) for a literature review. The definition of the Gaussian mean squared distance is due to Oymak and Hassibi (2016). As we will see in Lemma D.4 below, the Gaussian mean squared distance is useful for deriving risk bounds for proximal denoising estimators. The statistical dimension was defined in Amelunxen et al. (2014). Recently, Bellec (2018) pointed out that the statistical dimension characterizes the adaptive risk bounds for some shape restricted estimators, including the isotonic regression and the convex regression.
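As a sanity check on Definition D.1-(iii), the statistical dimension of the monotone cone K ↑ n can be estimated by Monte Carlo, since the projection onto K ↑ n is computed by PAVA; the known closed form δ(K ↑ n ) = Σ i=1 n 1/i (Amelunxen et al. 2014) serves as the target:

```python
import numpy as np

def pava(y):
    # pool adjacent violators: Euclidean projection onto the monotone cone
    sums, counts = [], []
    for v in y:
        sums.append(float(v)); counts.append(1)
        while len(sums) > 1 and sums[-2] / counts[-2] > sums[-1] / counts[-1]:
            s, c = sums.pop(), counts.pop()
            sums[-1] += s; counts[-1] += c
    return np.concatenate([np.full(c, s / c) for s, c in zip(sums, counts)])

rng = np.random.default_rng(0)
n, n_mc = 10, 4000
# delta(C) = E ||Pi_C(Z)||^2 with Z ~ N(0, I_n)
est = np.mean([np.sum(pava(rng.normal(size=n)) ** 2) for _ in range(n_mc)])
harmonic = float(np.sum(1.0 / np.arange(1, n + 1)))  # known value of delta
```

With n = 10 the harmonic number is about 2.93, and the Monte Carlo estimate concentrates around it.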
As suggested by the definitions, these three quantities are closely related to each other. In particular, if C is a convex cone, these are comparable as follows.
Proposition D.2. Let C be a closed convex cone.
Then, we have D(C) = δ(C • ). Now, we introduce two general results on risk bounds for projection estimators and proximal denoising estimators.
Let K be a closed convex set in R n , and define the projection estimator onto K as θ̂ K = argmin θ∈K y − θ 2 . Bellec (2018) proved the following oracle inequality that relates the risk of the projection estimator to the statistical dimension of the tangent cone of K. Here, the tangent cone T K (θ) of K at θ ∈ K is defined as the closure of {t(z − θ) : z ∈ K, t ≥ 0}. Lemma D.3 (Bellec (2018), Corollary 2.2). Let θ * ∈ R n be any vector, and suppose that the observation y is drawn according to N (θ * , σ 2 I n ). Then, we have the following risk bound. Moreover, for any η ∈ (0, 1), the corresponding inequality holds with probability at least 1 − η.
Next, we provide a general result for proximal denoising estimators. Let f : R n → R be a convex function, and λ ≥ 0. We define the proximal denoising estimator θ̂ λ as θ̂ λ := argmin θ∈R n (1/2) y − θ 2 2 + σλ f (θ). (37) The class of proximal denoising estimators contains the soft-thresholding estimator (Donoho et al. 1992), the total variation regularization (Rudin et al. 1992), the trend filtering (Kim et al. 2009) and the nearly-isotonic regression (Tibshirani et al. 2011). Oymak and Hassibi (2016) pointed out that the risk bound for proximal denoising estimators can be characterized by the Gaussian mean squared distance of the set λ∂f (θ * ). Remarkably, sharp adaptation results for the trend filtering estimators have been proved based on this technique. The following oracle inequality can be regarded as a generalization of Theorem 2.2 in Oymak and Hassibi (2016). For the sake of completeness, we also provide its proof below.
Lemma D.4. Let θ * ∈ R n be any vector, and suppose that the observation y is drawn according to N (θ * , σ 2 I n ). Let f : R n → R be a convex function, and letθ λ denote the proximal denoising estimator defined as (37). Then, we have Moreover, for any η ∈ (0, 1), the inequality holds with probability at least 1 − η.
From the first order optimality condition of the convex minimization problem (37), we have ⟨θ − θ̂, y − θ̂⟩ ≤ σλ(f (θ) − f (θ̂)) for any θ ∈ R n . See Lemma 6.1 in van de Geer (2015) for a formal proof. Using the elementary fact that 2⟨u, v⟩ = u 2 2 + v 2 2 − u − v 2 2 and substituting y = θ * + σz, we obtain (40). Now, take v ∈ ∂f (θ) arbitrarily. By the definition of the subgradient, the right-hand side of (40) is bounded from above accordingly. Since the choice of v ∈ ∂f (θ) is arbitrary, we arrive at (41). By taking the expectation of both sides, (38) is proved.
To prove the high-probability bound (39), we use the well-known Gaussian concentration inequality (see e.g., Theorem 5.6 in Boucheron et al. (2013)): for any L-Lipschitz function h : R n → R and η ∈ (0, 1), we have P(h(Z) ≥ E[h(Z)] + L √(2 log(1/η))) ≤ η. In fact, the map z → dist(z, λ∂f (θ̂)) is a 2-Lipschitz function because the distances at any z 1 , z 2 ∈ R n can be compared through the orthogonal projection map P onto the set λ∂f (θ̂). Now, we take θ̃ as a minimizer of the corresponding variational problem. Combining (41) and the Gaussian concentration applied for θ = θ̃, we have the desired result.

D.2 Risk bounds for constrained estimators (Proof of Theorem 4.1)
In this subsection, we provide the proof of Theorem 4.1 as an application of Lemma D.3. To this end, we have to evaluate the statistical dimension of the tangent cone of the convex set K − (V) := {θ ∈ R n : V − (θ) ≤ V}. It is not surprising that the analysis of the tangent cone of K − (V) goes very similarly to that of the set with bounded total variation K(V) = {θ ∈ R n : V(θ) ≤ V}. Our goal is to show the following upper bound for the statistical dimension. Proposition D.5. Suppose that θ is a vector with V − (θ) = V. Then, there exists a universal constant C > 0 for which the bound holds, where M (θ) is defined in (13).
We briefly outline the proof of this result, dividing it into four steps. First, we provide some useful characterizations of the tangent cone. Second, we decompose the tangent cone into finitely many pieces whose Gaussian widths are easy to evaluate. Third, we provide concrete upper bounds on the Gaussian widths of these pieces. Lastly, we combine the upper bounds and apply Lemma D.3 to complete the proof.
Step 1: Characterizing the tangent cone
If V − (θ) < V, then θ is contained in the interior of K − (V), and the tangent cone becomes the entire Euclidean space R n . Hereafter, we assume that θ lies on the boundary of K − (V), that is, V − (θ) = V. Let us recall the definition of the signs of jumps w i in (12). Roughly speaking, the tangent cone of K − (V) is characterized by the signs of jumps.
Lemma D.6. Let θ be a vector in R^n such that V−(θ) = V. Let Π = {B₁, B₂, . . . , B_k} be any connected refinement of the constant partition Π_const(θ) of θ. Let 1 = τ₁ < τ₂ < · · · < τ_k < τ_{k+1} = n + 1 be a sequence such that B_i = {τ_i, τ_i + 1, . . . , τ_{i+1} − 1} for any i ∈ {1, 2, . . . , k}. We define the signs w₂, w₃, . . . , w_k ∈ {0, 1} as For any Π and w₂, w₃, . . . , w_k taken as above, we define a convex cone T(Π, w) as where V^{B_i}_−(v_{B_i}) is the lower total variation of the restricted vector v_{B_i}. Then, for the tangent cone T_{K−(V)}(θ), we have the following: (ii) If Π is a connected refinement of Π_const(θ) and w is taken arbitrarily as above, then Proof. First, we show that T_{K−(V)}(θ) ⊆ T(Π, w). By the definition of the tangent cone, it suffices to show that v := z − θ ∈ T(Π, w) holds for any z ∈ K−(V). Note that θ is constant on every B_i ∈ Π since Π is finer than the constant partition of θ. Since the lower total variation is unchanged by adding a constant value to every coordinate, we have V^{B_i} which proves v ∈ T(Π, w) and hence (ii).
From Proposition D.2-(i), we can bound the statistical dimension by the Gaussian width as follows: Here, B n := {v ∈ R n : v 2 ≤ 1} is the unit ball in R n . Hence, it suffices to consider the set T K−(V) (θ) ∩ B n . In analogy to Lemma B.2 in , we obtain the following characterization of this set.
Defining Γ_i(v, ℓ_i) as in (45), we can rewrite (46) as Now, let t_i denote the ℓ₂ norm of v_{B_i} for i = 1, 2, . . . , k. By the assumption, For these choices of ℓ_i, the right-hand side of (47) is bounded from above by , which proves the desired result.
Remark D.8. Note that Γ_i(v, ℓ_i) is always non-negative. This can be checked as follows. First, the lower total variation is always at least the difference of the boundary points; that is, for every v ∈ R^m, we have where w is taken arbitrarily from {0, 1}. The equality holds if and only if v is monotone non-increasing. Then, for any ℓ ∈ [m] and w₁, w₂ ∈ {0, 1}, we have In particular, we obtain Γ_i(v, ℓ_i) ≥ 0. If θ is monotone non-decreasing (i.e., w₀ = w₁ = · · · = w_{k+1} = 0), then the right-hand side of (44) equals 0, and so Γ_i(v, ℓ_i) = 0.
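The first inequality of the remark, V−(v) ≥ v₁ − v_m with equality exactly for monotone non-increasing v, follows since Σᵢ (vᵢ − vᵢ₊₁)₊ ≥ Σᵢ (vᵢ − vᵢ₊₁) = v₁ − v_m. A quick numeric check (helper names are ours):

```python
import numpy as np

rng = np.random.default_rng(1)

def lower_tv(v):
    # lower total variation: V^-(v) = sum_i max(v_i - v_{i+1}, 0)
    d = v[:-1] - v[1:]
    return np.sum(np.maximum(d, 0.0))

# V^-(v) >= v_1 - v_m for arbitrary vectors v ...
for _ in range(200):
    v = rng.normal(size=10)
    assert lower_tv(v) >= v[0] - v[-1] - 1e-12

# ... with equality when v is monotone non-increasing
v = np.sort(rng.normal(size=10))[::-1]
assert np.isclose(lower_tv(v), v[0] - v[-1])
```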
Step 2: Quantizing the tangent cone Now, let Π = {B₁, B₂, . . . , B_k} be a connected refinement of Π_const(θ). Lemma D.7 implies that T_{K−(V)}(θ) ∩ B^n is contained in the set with some ℓ_i ∈ B_i and γ > 0. From this perspective, we consider finitely many allocation patterns of the budgets for ‖v_{B_i}‖₂² and Γ_i(v, ℓ_i).
To be more precise, we construct a cover of the tangent cone in the following way. Consider a triple (t, q, l) such that: (a) t = (t₁, t₂, . . . , t_k) and q = (q₁, q₂, . . . , q_k) are vectors of nonnegative numbers, and (b) l = (ℓ₁, ℓ₂, . . . , ℓ_k) is a tuple of indices such that ℓ_i ∈ B_i for i = 1, 2, . . . , k.
For such a triple, we define a set where γ is taken as the right-hand side of (44): Then, quantizing the allocation vectors t and q, we can cover the set T_{K−(V)}(θ) ∩ B^n with finitely many sets T(t, q, l), as in the following lemma.
We should note that the cardinalities of Q and L are bounded as follows. Proposition D.10. Let Q and L be the sets defined in Lemma D.9. Then, we have: (i) log |Q| ≤ 2k log 2e, and (ii) log |L| ≤ k log(n/k). Proof. For the first part, we observe that |Q| is not larger than the cardinality of Then, we have The proof of inequality (a) above can be found in Proposition 4.3 of Dudley (2014). The second part is obtained by Jensen's inequality as Step 3: Controlling Gaussian widths As mentioned before, our goal is to obtain an upper bound of the Gaussian width where we write E = E_{Z∼N(0,I_n)}. Let (Π, w) be a pair of a partition and a sign vector of knots defined as in Lemma D.7. Using the decomposition in Lemma D.9, we have W̃ Besides, leveraging a general result for Gaussian suprema (see Lemma F.4 below), we have Here, we used Proposition D.10 to bound the cardinality of the set Q² × L.
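The Jensen step in the proof of Proposition D.10-(ii) can be spelled out as follows; since Π partitions [n] into k pieces, Σᵢ |Bᵢ| = n, and concavity of the logarithm gives

```latex
\log |L| \;=\; \sum_{i=1}^{k} \log |B_i|
\;\le\; k \log\!\left( \frac{1}{k} \sum_{i=1}^{k} |B_i| \right)
\;=\; k \log \frac{n}{k}.
```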
Given t, q ∈ Q and l ∈ L, we define Dividing the supremum into k pieces v_{B₁}, v_{B₂}, . . . , v_{B_k}, this quantity is bounded from above as W̃(t, q, l) ≤ Here, we write We now consider the quantity (53). On the set T_i(t_i, q_i, ℓ_i) over which the supremum is taken, the lower total variation of v_{B_i} is bounded from above as As mentioned in Remark D.8, the reverse inequality is always true, and the equality can hold only if the two sub-vectors (v_{τ_i}, v_{τ_i+1}, . . . , v_{ℓ_i}) and (v_{ℓ_i}, v_{ℓ_i+1}, . . . , v_{τ_{i+1}−1}) are each monotone (nondecreasing or nonincreasing). From this point of view, the meaning of the condition (54) is that v_{B_i} is approximated by two nearly monotone pieces. This suggests that the complexity of T_i(t_i, q_i, ℓ_i) can be evaluated by that of the class of monotone functions. Below, we provide an upper bound for Gaussian widths of the form (53). First, the following lemma treats the special case where ℓ_i is taken as the rightmost point in B_i. Lemma D.11. For every n ≥ 1, t > 0, w ∈ {0, 1} and γ ≥ 0, we have Proof. The proof is divided into the two cases w = 1 and w = 0. Case 1 (w = 1): By rescaling, we need only consider the case t = 1. For a vector v ∈ R^n, we define a monotone vector v⁺ as We also define another monotone vector v⁻ as It is easy to check that v = v⁺ − v⁻. Using this notation, we have Denote by W̃ the left-hand side of (55) with t = 1. The argument in the previous paragraph implies that The expectation in the last line is bounded as Here, the first inequality is Jensen's inequality, and the second is a consequence of equation (D.12) in Amelunxen et al. (2014). Combining with (56), we have the desired result. Case 2 (w = 0): We may assume t = 1 without loss of generality. As in Case 1, we write a vector as a difference of monotone vectors: for v ∈ R^n, we define v⁺ and v⁻ as respectively.
Under this notation, the condition V−(v) ≤ γ is equivalent to v⁻_n ≤ γ, and therefore we have Then, an argument similar to Case 1 yields the result.
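The displays defining v⁺ and v⁻ in Case 2 are elided above. One construction consistent with the surrounding claims (v = v⁺ − v⁻ with both parts monotone, and V−(v) = v⁻_n) accumulates the positive and negative increments of v separately; a sketch under that assumption (helper names are ours):

```python
import numpy as np

rng = np.random.default_rng(2)

def monotone_split(v):
    # v_plus accumulates the positive increments of v, v_minus the negative ones;
    # both are monotone nondecreasing and v = v_plus - v_minus
    inc = np.diff(v)
    v_plus = v[0] + np.concatenate([[0.0], np.cumsum(np.maximum(inc, 0.0))])
    v_minus = np.concatenate([[0.0], np.cumsum(np.maximum(-inc, 0.0))])
    return v_plus, v_minus

def lower_tv(v):
    # V^-(v) = sum_i max(v_i - v_{i+1}, 0)
    return np.sum(np.maximum(-np.diff(v), 0.0))

v = rng.normal(size=12)
v_plus, v_minus = monotone_split(v)
assert np.allclose(v_plus - v_minus, v)
assert np.all(np.diff(v_plus) >= 0) and np.all(np.diff(v_minus) >= 0)
assert np.isclose(v_minus[-1], lower_tv(v))  # V^-(v) equals the last entry of v_minus
```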
(57) In particular, we deduce the simpler bound Proof. Let (A₁, A₂) be a pair of subsets of [n] defined as A₁ = {1, 2, . . . , ℓ} and A₂ = {ℓ, ℓ + 1, . . . , n}. If either ℓ = 1 or ℓ = n (i.e., one of A₁ and A₂ is a singleton), the result is a direct consequence of Lemma D.11. Henceforth, we assume that 1 < ℓ < n. Suppose that v ∈ R^n satisfies the assumption Based on these observations, we reduce to in which both terms on the right-hand side can be bounded using Lemma D.11. Before going to the next step, we summarize the results of Step 3 as follows.
Step 4: Applying Lemma D.3 We are now ready to complete the proof of Theorem 4.1.
Recall that our goal is to obtain an upper bound for W̃(θ), which is defined in (53). To this end, we construct a suitable refinement of Π_const(θ) with moderate piece lengths so that we can control the first term in (59). In fact, by an argument parallel to that in , there exists a refinement Π = (B₁, B₂, . . . , B_k) such that |B_i| ≤ 4n/k for i = 1, 2, . . . , k and k(θ) ≤ k ≤ 2k(θ). We also define the signs w₁, w₂, . . . , w_k in a similar way as in Lemma D.6; if the knot τ_i is not contained in the original partition Π_const(θ), the corresponding sign w_i will be specified later. We bound the first term in (59) in the following two steps. First, from the Cauchy–Schwarz inequality and the fact that t ∈ Q, we have Second, by the above construction of Π, we have Therefore, the right-hand side of (59) is bounded from above by Here, to absorb the constant term π/2, we also used the fact that m log(en/m) ≥ 1 for every integer 1 ≤ m ≤ n. Let w⁰₁, w⁰₂, . . . , w⁰_{k(θ)+1} be the signs associated with the constant partition Π_const(θ) = (A₁, A₂, . . . , A_{k(θ)}) (recall the definition (12)). Then, we can choose the values of w_i so that the following inequality holds: In fact, this is possible if we choose w_i as the sign w⁰_j of the nearest knot to the right of τ_i. Combining (61), (60) and Proposition D.2, the statistical dimension of T_{K−(V)}(θ) is bounded from above as where we also used the elementary fact that (a + b)² ≤ 2(a² + b²). Consequently, applying Lemma D.3, we have the desired result.
Remark D.14 (Non-Gaussian noise). In the non-Gaussian noise setting, we can prove a result analogous to Proposition D.5; we sketch the proof of such a generalization. The proof of Proposition D.5 consists of (i) a decomposition argument for the tangent cone and (ii) bounds for some probabilistic quantities (i.e., the statistical dimension and the Gaussian width). The former argument is completely deterministic and independent of the distributional assumption on the noise variables. Regarding the probabilistic bounds, we used the following bound for the (Gaussian) statistical dimension of K↑_n: Hence, if we can obtain a similar bound for non-Gaussian random variables, we can prove an analogous result to Proposition D.5. Let ξ₁, . . . , ξ_n be i.i.d. random variables with E[ξ₁] = 0 and Var(ξ₁) = σ². For a convex cone C, we define the statistical dimension as Here, we write Proj_C(x) = argmin_{z∈C} ‖z − x‖₂, and the last equality holds by a deterministic relation (see Amelunxen et al. (2014) for details). Then, from Theorem 3.1 in Chatterjee et al. (2015), we can check that Therefore, following an argument similar to the proof of Proposition D.5, we conclude that the corresponding bound on the statistical dimension holds with some universal constant C > 0. As a consequence, we can prove an expected risk bound similar to (20) for non-Gaussian noise variables.

D.3 Proof of Corollary 4.4
Let α > 0 be a number to be specified later. Define a vector θ′ ∈ R^n by θ′₁ = θ*₁ and Then, we have V−(θ′) = αV−(θ*). Moreover, the constant partition and the signs of θ′ (defined in (12)) are the same as those of θ*, and therefore k(θ′) = k(θ*) and M(θ′) = M(θ*). Now, we set α = V/V−(θ*) so that V−(θ′) = V. Applying the upper bound (14), we have The first term on the right-hand side is bounded from above as From the minimal length condition (18) and the definition of M(θ), we also have .
Combining the above inequalities, we have the desired result.
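The elided display presumably defines the comparison vector by rescaling every increment of θ* by α, i.e. θ′ = θ*₁ + α(θ* − θ*₁); under that reading, the claims that V− scales by α while the constant partition and jump signs are preserved are easy to check numerically (helper names are ours):

```python
import numpy as np

def lower_tv(v):
    # lower total variation: V^-(v) = sum_i max(v_i - v_{i+1}, 0)
    return np.sum(np.maximum(-np.diff(v), 0.0))

theta = np.array([0.0, 0.0, 1.0, 1.0, 0.5, 2.0])  # a toy piecewise monotone signal
alpha = 0.3
theta_p = theta[0] + alpha * (theta - theta[0])   # rescale all increments by alpha

# V^- scales by alpha ...
assert np.isclose(lower_tv(theta_p), alpha * lower_tv(theta))
# ... and increments keep their signs, so the constant partition
# and the signs of jumps are unchanged
assert np.array_equal(np.sign(np.diff(theta_p)), np.sign(np.diff(theta)))
```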

D.4 Risk bounds for penalized estimators (Proof of Theorem 4.7)
We prove Theorem 4.7 as an application of Lemma D.4. Let ∂V−(θ) denote the set of subgradients (i.e., the subdifferential) of the convex function V−(·) at θ ∈ R^n. The task is to provide a suitable upper bound for the Gaussian mean squared distance of the set λ∂V−(θ). To do this, we use the technique developed in . The idea is roughly as follows. Recall that the Gaussian mean squared distance of a convex cone can be written as the statistical dimension of the polar cone (see ). This motivates us to relate the Gaussian mean squared distance D(λ∂V−(θ)) to that of an associated cone. In particular, we consider the conic hull of the subdifferential: cone(∂V−(θ)) := ⋃_{λ≥0} λ∂V−(θ).
As we explain later, D(cone(∂V−(θ))) can be evaluated by the results in the previous subsection. Then, we can complete the proof if we have an upper bound of the following form: where ∆(θ, λ) is a residual term that depends on θ and λ. First, we show that D(cone(∂V−(θ))) is exactly equal to the statistical dimension of the tangent cone T_{K−(V−(θ))}(θ), for which we have already provided a bound in the previous subsection.
In particular, we have the following upper bound: where C is the same universal constant as in Proposition D.5.
Proof. Let us write T := T_{K−(V−(θ))}(θ). In light of Proposition D.2-(ii), it suffices to show that T is the polar cone of cone(∂V−(θ)). From fundamental results in convex geometry, we always have for any convex function f : R^n → R (see Lemma A.5 in ). For the case f = V−, the set K(θ) above is which implies the desired result.
Then, λ(z) is well-defined and has a finite expectation E_{Z∼N(0,I_n)}[λ(Z)] < ∞. Further, define λ* as Then, for every λ ≥ λ* and v* ∈ ∂f(θ), we have Before proceeding, we introduce an additional piece of terminology: a convex function f : R^n → R is said to be weakly decomposable if we have for every θ ∈ R^n. In other words, we can choose v₀ ≡ v* in (64) if f is weakly decomposable. Under the assumption that f is weakly decomposable, the inequality (64) can be simplified as follows. Corollary D.17. Suppose that f : R^n → R is convex and weakly decomposable. Under the same notation as in Lemma D.16, we have D(λ∂f(θ)) ≤ 3D(cone(∂f(θ))) + 3(λ − λ*)²‖v₀‖₂² + 112. Now, we apply Lemma D.16 to the case f = V−. The following proposition provides the structural information on ∂V−(θ) that we need for evaluating the upper bound (64). The proof is postponed to Appendix D.6.
(ii) For any θ ∈ R^n, let us define v₀ as in (63). Then, we have From Proposition D.18 and Corollary D.17, D(λ∂V−(θ)) is bounded from above by provided that λ ≥ λ*. Here, C > 0 is a universal constant. Combining this bound with Lemma D.4, we obtain the desired risk bound. Lastly, we provide an upper bound for the optimal tuning parameter λ*. This is obtained from the following estimate of E[λ(Z)].
Then, we have where E is the expectation with respect to Z ∼ N (0, I n ).
Proof. Let C := cone(∂V−(θ)) be the conic hull of ∂V−(θ), and let P_C denote the orthogonal projection map onto C. By the definition of λ(z), there exists a vector v(z) ∈ ∂V−(θ) such that λ(z)v(z) = P_C(z). First, we show a partial result As we will see in Appendix D.6, V− is the support function of a certain convex set. Then, by the fundamental fact for support functions that ⟨θ, v⟩ = V−(θ) for all v ∈ ∂V−(θ) (see Corollary 8.25 in Rockafellar and Wets (1998)), we have Here, in the last line, T := T_{K−(V−(θ))}(θ) is the polar cone of C (see Proposition D.15), and we used the Moreau decomposition z = P_C(z) + P_T(z). Taking the expectation of both sides with respect to z ∼ N(0, I_n), we have which implies the desired result. Here, we used the equality between the statistical dimension and the expected squared norm of the projection: δ(T) = E_{Z∼N(0,I_n)}‖P_T(Z)‖₂² (see Proposition 3.1 in Amelunxen et al. (2014)). To prove the other inequality, we use the characterization of aff(∂V−(θ)) given in (72) in Appendix D.6 below. In particular, if we take v* as in (75), then and hence the result follows.

D.5 Proof of Corollary 4.12
First, we explain that a monotone vector satisfying the moderate growth condition can be approximated by a piecewise constant vector whose segments at both ends have sufficient lengths. To this end, we need the following lemma.
Here, the first two statements (i) and (ii) are shown in Lemma 2 in Bellec and Tsybakov (2015). The third statement (iii) ensures that the moderate growth condition implies the minimal length condition (18).
Lemma D.20. Let θ ∈ K ↑ n be a monotone vector satisfying the moderate growth condition and θ n − θ 1 = V. Then, there exists another monotone vector θ ∈ K ↑ n satisfying the following three conditions.
(i) θ is k-piecewise constant with k = max 3, V 2 n σ 2 log(en) . Here, ⌈t⌉ denotes the smallest integer that is not less than t.
(ii) We have and (iii) Let Π = {A 1 , A 2 , . . . , A k } be the partition on which θ is constant. Then, we have |A 1 | ≥ n/k and |A k | ≥ n/k.
It remains to prove (iii) under the moderate growth condition. Below, we only check that the maximal element of A₁ is not less than ⌈n/k⌉, because |A_k| ≥ n/k can be checked in a similar way. Let i* := ⌈n/k⌉. Note that we have i* ≤ n/2 since k ≥ 3. By the moderate growth condition, we have which means i* ∈ A₁ and hence |A₁| ≥ ⌈n/k⌉. Now, we are ready to prove Corollary 4.12. Applying Lemma D.20 to each of the segments A₁, A₂, . . . , A_m, we obtain a k-piecewise constant and m-piecewise monotone vector θ ∈ R^n such that Moreover, θ satisfies the minimum length condition (18) with c = 1. Therefore, we have M(θ) ≤ 2(m − 1)k/n and where we used the obvious inequality m ≤ k. Then, Theorem 4.7 implies that there exists λ such that mσ²/n · log(en/m) for some universal constant C > 0. This is the desired conclusion. Note that an upper bound for such λ is given by Proposition 4.8.

D.6 Subdifferential and weak decomposability
In this subsection, we discuss the structure of the subdifferential of nearly-isotonic type penalties. The main purpose is to establish the weak decomposability (defined in Appendix D.4) of V−.

D.6.1 Characterization of the subdifferential
First, we observe that V−(θ) = Σ_{i=1}^{n−1} (θ_i − θ_{i+1})₊ can be written as the support function of a certain convex set. In fact, by Theorem 8.24 in Rockafellar and Wets (1998), we can see that where B is a closed convex set. Conversely, given the convex function V−, the set B is specified as Many properties of the support function can be understood through the structure of the set B; in particular, we can characterize the subdifferential and the weak decomposability. Below, we investigate the more detailed structure of the set B in terms of submodular functions. Let G = (V, E) be a directed graph equipped with positive edge weights {c_{(i,j)}}. For any θ ∈ R^n, we define a nearly-isotonic type penalty V_G(θ) for the weighted graph G as in (33). For any subset A ⊆ [n], we also define κ_G(A) as the total weight of outgoing edges: The function A ↦ κ_G(A) is called the cut function of the weighted graph G.
It is well known that the cut function is a submodular function. Here, a function F : 2 [n] → R is called submodular if F (∅) = 0 and holds for any subsets A, B ⊆ [n]. We refer the reader to Bach (2013) for fundamental properties of submodular functions. For any submodular function F : 2 [n] → R, we define the base polyhedron B(F ) ⊆ R n as The Lovász extension f : R n → R of F is defined as the support function of B(F ), that is, for any θ ∈ R n , f (θ) := max v∈B(F ) v, θ .
We see that the nearly-isotonic type penalty (33) is actually the Lovász extension of the cut function (71).
Proposition D.21. For any directed graph G and edge weight c (i,j) , the function V G is the Lovász extension of the cut function κ G .
Proof. This is a consequence of the well-known greedy algorithm result; see, e.g., Proposition 3.2 in Bach (2013). In particular, a derivation can be found in Section 6.2 of Bach (2013). Now, we have the following useful characterizations of the subdifferential.
Proposition D.22. Let F : 2^[n] → R be a submodular function and f : R^n → R its Lovász extension. Suppose θ ∈ R^n.
(iii) Let v be any point in the relative interior of ∂f(θ). Then, the normal cone of ∂f(θ) at v is contained in the set of partition-wise constant vectors: Proof. The first statement is a well-known property of support functions (Corollary 8.25 in Rockafellar and Wets (1998)). The second statement follows from the characterization of the faces of the base polyhedron (see Proposition 4.7 in Bach (2013)). The third statement follows from (ii) and the characterization of normal cones of polyhedra (see Theorem 6.46 in Rockafellar and Wets (1998)).

D.6.2 Weak decomposability
Here, we discuss the weak decomposability of the Lovász extension. Before describing the result, we introduce some terminology. Let F : 2 [n] → R be a submodular function. We say that a set A ⊆ [n] is separable for F if there is a non-empty proper subset B of A such that F (A) = F (B) + F (A \ B). We also say that A is inseparable if it is not separable. For example, if F = κ G is the cut function defined in (71), A is inseparable if and only if it is a connected component in the graph G. Furthermore, we define the following agglomerative clustering condition.
Definition D.23. We say that a submodular function F : 2^[n] → R satisfies the agglomerative clustering (AC) condition if it has the following property: let A, B ⊆ [n] be any disjoint pair of subsets such that A ≠ ∅ and A is inseparable for the function F_B^A : 2^A → R defined by F_B^A(C) := F(B ∪ C) − F(B). Then, for any C ⊂ A, we have Recall the definition of weak decomposability (65). The following proposition provides a sufficient condition for the weak decomposability of the Lovász extension.
Proposition D.24. Let F : 2^[n] → R be a submodular function satisfying the AC condition in Definition D.23. Then, the Lovász extension f of F is weakly decomposable.
Proof. Fix θ ∈ R^n. Since f is the support function of the base polyhedron B(F), ∂f(θ) coincides with a face of B(F). Let A₁, A₂, . . . , A_k be a partition of [n] such that aff(∂f(θ)) is represented as in (72). For i = 1, 2, . . . , k, we write S₀ := ∅ and S_i := A₁ ∪ A₂ ∪ · · · ∪ A_i. We should note that the above partition can be chosen so that A_i is inseparable for the function defined as In this case, ∂f(θ) is an (n − k)-dimensional subset. Since holds for any i = 1, . . . , k, we have v* ∈ aff(∂f(θ)). Moreover, v* is also contained in the normal cone of aff(∂f(θ)). Hence, if we prove v* ∈ ∂f(θ), then it follows that v* ∈ argmin_{v∈∂f(θ)} ‖v‖₂². Now, our goal is to prove v* ∈ ∂f(θ) under the AC condition. If k = n, then it is clear from (72) that ∂f(θ) = {v*}. Below, we assume that k < n. Since v* ∈ aff(∂f(θ)), it suffices to show that Σ_{i∈S} v*_i ≤ F(S) holds for any S ⊆ [n] that determines a relative boundary of ∂f(θ). The relative boundary of ∂f(θ) can be written as the union of all (n − k − 1)-dimensional faces of B(F) that have non-empty intersection with ∂f(θ). Such faces can be characterized as follows: let Π = (A₁, A₂, . . . , A_k) be the partition defined above, and choose A_i with |A_i| ≥ 2. Let A′_i be any non-empty proper subset of A_i. We define a new ordered partition of [n] by inserting (A′_i, A_i \ A′_i) in place of A_i: Π′ = (A₁, A₂, . . . , A_{i−1}, A′_i, A_i \ A′_i, A_{i+1}, . . . , A_k).
Then, Π′ defines an (n − k − 1)-dimensional affine subspace via (72), which forms a part of the relative boundary of ∂f(θ). Therefore, we have to show that Σ_{i∈S} v*_i ≤ F(S) for any S that can be written as S = S_{i−1} ∪ A′_i with A′_i ⊂ A_i. From the AC condition, we have This proves that v* ∈ ∂f(θ), and hence f is weakly decomposable.
Remark D.25. The AC condition was originally introduced in Bach (2011). In that paper, the author considers the proximal denoising estimators (37) where f is the Lovász extension of a submodular function F. The name "agglomerative clustering" captures the following property. Consider the solution path of the minimization problem (37) parametrized by λ, that is, the collection {θ̂_λ}_{λ≥0} calculated for all λ ≥ 0. In general, the solution path starts with θ̂_λ = y at λ = 0, and θ̂_λ shrinks toward some piecewise constant vector as λ increases. Proposition 4 of Bach (2011) shows that the solution path is agglomerative if F satisfies the AC condition. We provide some examples of functions satisfying the AC condition: • Let h : R → R be a concave function with h(0) = 0. The submodular function defined as F(A) := h(|A|) satisfies the AC condition. Examples of solution paths for this class can be found in Bach (2011).
• The one-dimensional fused lasso has an agglomerative solution path. The corresponding submodular function is the cut function of the undirected one-dimensional grid graph, which satisfies the AC condition. Hence, by Proposition D.24, the penalty of the one-dimensional fused lasso is weakly decomposable. This provides an alternative proof of Lemma 2.7 in . On the other hand, the fused lasso on the two-dimensional grid does not satisfy this condition. See Bach (2011) for details.
• The nearly-isotonic regression (3) has an agglomerative solution path. A direct proof of this property is provided in Lemma 1 in Tibshirani et al. (2011). Below, we prove that the cut function of the directed one-dimensional grid graph satisfies the AC condition, which provides an alternative proof of this fact.
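The AC condition itself involves all restricted functions F_B^A, but the submodularity underlying the first and third examples above is easy to verify by brute force over a small ground set. A sketch (function names are ours; this checks submodularity only, not the full AC condition):

```python
import itertools
import math

n = 5

def submodular(F):
    # brute-force check of F(A) + F(B) >= F(A | B) + F(A & B) over all pairs
    subsets = [frozenset(s) for r in range(n + 1)
               for s in itertools.combinations(range(n), r)]
    return all(F(A) + F(B) >= F(A | B) + F(A & B) - 1e-12
               for A in subsets for B in subsets)

# cardinality-based function with concave h (here h = sqrt, h(0) = 0)
assert submodular(lambda A: math.sqrt(len(A)))

# cut function of the directed path 0 -> 1 -> ... -> n-1 with unit weights
edges = [(i, i + 1) for i in range(n - 1)]
assert submodular(lambda A: sum(1 for (i, j) in edges if i in A and j not in A))
```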
The following proposition provides a proof for Proposition D.18.
Proposition D.26. The cut function F associated with the nearly-isotonic regression satisfies the AC condition. In particular, the lower total variation V−(θ) is weakly decomposable. Moreover, for any θ ∈ R^n, the minimum value of the ℓ₂-norm over ∂V−(θ) is given by (66).

Proof. We check the AC condition by considering all patterns of adjacency as in Table 1. Here, C represents any proper subset of A, and "None" means that A contains 1 or n. In each case, we can easily check that the inequality (73) is satisfied. Hence, F satisfies the AC condition. The second statement is a consequence of Proposition D.24. The last statement follows from the fact that the minimizer of ‖v‖₂² over ∂f(θ) coincides with that over aff(∂f(θ)), which is given by (74). In this case, we can choose A₁, A₂, . . . , A_k as the constant partition of θ sorted by the values of θ. Thus, we have which proves the desired result.
Remark D.27 (Missing part in the proof of Proposition A.1). With a slight modification of the above argument, we can show the AC condition for the cut function of a weighted graph with edge weights c_j > 0 (j = 1, . . . , n − 1). As mentioned in Proposition A.1, we need this result to prove the validity of the modified PAVA (Algorithm 1). Here, we prove that (31) gives a sufficient condition for the AC condition, and hence the solution path of the weighted nearly-isotonic regression (30) is agglomerative. Let A ⊆ [n] be a non-empty connected subset, B a subset of [n] \ A, and C a proper subset of A. Recall that our goal is to check the inequality (73). For clarity, we write A = {j_L, j_L + 1, . . . , j_R}. As in the proof of Proposition D.26, we consider all adjacency patterns of A, B and C. Then, we can easily check the following case statement: 1. Suppose that either "j_L = 1 and j_R + 1 ∉ B" or "j_L − 1 ∉ B and j_R + 1 ∉ B" holds. Then, we have F(B ∪ A) − F(B) = F(A) = c_{j_R} and F(B ∪ C) − F(B) = F(C). Now, we check (73) under the concavity condition (31). First, (73) trivially holds when j_R ∈ C, because in this case F(C) ≥ c_{j_R} = F(A). Next, we assume j_R ∉ C. Let i be the largest element of C. Then, we have F(C) ≥ c_i and |C| ≤ i − j_L + 1. Under the assumption (31), we have ≤ F(C), which implies (73).

F Auxiliary lemmas
Here, we present several auxiliary lemmas that are used in the proofs in the previous sections.
Lemma F.4 (Lemma D.1 in ). Suppose p, n ≥ 1 and let Θ₁, . . . , Θ_p be subsets of R^n, each containing the origin and each contained in the closed Euclidean ball of radius D centered at the origin. Then, for ξ ∼ N(0, σ²I), we have