On the total variation regularized estimator over a class of tree graphs

We generalize an oracle result obtained for the Fused Lasso over the path graph to tree graphs obtained by connecting path graphs. Moreover, we show that the minimum of the distances between jumps appearing in the oracle inequality can be replaced by their harmonic mean. In doing so we prove a lower bound on the compatibility constant for the total variation penalty. Our analysis leverages insights obtained for the path graph with one branch to understand the case of more general tree graphs. As a side result, we gain insights into the irrepresentable condition for such tree graphs.


Introduction
The aim of this paper is to refine and extend to the more general case of "large enough" tree graphs the approach used by Dalalyan, Hebiri and Lederer (2017) to prove an oracle inequality for the Fused Lasso estimator, also known as total variation regularized estimator. As a side result, we will obtain some insight into the irrepresentable condition for such "large enough" tree graphs.
The main reference of this article is Dalalyan, Hebiri and Lederer (2017), who consider the path graph. We refine and generalize their approach (i.e. their Theorem 3, Proposition 2 and Proposition 3) to the case of more general tree graphs. The main refinements we prove are an oracle theory for total variation regularized estimators over trees when the first coefficient is not penalized, a proof of an (in principle tight) lower bound on the compatibility constant and, as a consequence of this bound, the substitution in the oracle bound of the minimum of the distances between jumps by their harmonic mean. We build the theory up from the particular case of the path graph to the more general case of tree graphs which can be cut into path graphs. The path graph with one branch is in this context the simplest instance of such more complex tree graphs, and it allows us to develop insights into more general cases while keeping the exposition manageable.
The paper is organized as follows. In Section 1 we expose the framework together with a review of the literature on the topic. In Section 2 we refine the proof of Theorem 3 of Dalalyan, Hebiri and Lederer (2017) and adapt it to the case where one coefficient of the Lasso is left unpenalized: this proof will be a working tool for establishing oracle inequalities for total variation penalized estimators. In Section 3 we show how to easily compute objects related to projections, which are needed for finding explicit bounds on weighted compatibility constants and for checking when the irrepresentable condition is satisfied. In Section 4 we present a tight lower bound on the (weighted) compatibility constant for the Fused Lasso and use it, together with the approach of Section 2, to prove an oracle inequality. In Section 5 we generalize Section 4 to the case of the branched path graph. Section 6 presents further extensions to more general tree graphs. Section 7 handles the asymptotic pattern recovery properties of the total variation regularized estimator on the (branched) path graph and exposes an extension to more general tree graphs. Section 8 concludes the paper.

General framework
We study total variation regularized estimators on graphs, their oracle properties and their asymptotic pattern recovery properties.
For a vector v ∈ R^n we write ‖v‖_1 = Σ_{i=1}^n |v_i| and ‖v‖_n^2 = (1/n) Σ_{i=1}^n v_i^2. Let G = (V, E) be a graph, where V is the set of vertices and E is the set of edges. Let n := |V| be its number of vertices and m := |E| its number of edges. Let the elements of E be denoted by e(i, j), where i, j ∈ V are the vertices connected by an edge.
Let D_G ∈ R^{m×n} denote the incidence matrix of a graph G, whose rows are D_e = e_j' − e_i' for the edges e(i, j) ∈ E, where D_e ∈ R^n is the row of D_G corresponding to the edge e(i, j) and e_i denotes the i-th canonical basis vector. Let f ∈ R^n be a function defined at each vertex of the graph. The total variation of f over the graph G is defined as TV(f) := ‖D_G f‖_1 = Σ_{e(i,j)∈E} |f_j − f_i|. Assume we observe the values of a signal f^0 ∈ R^n contaminated with some Gaussian noise ǫ ∼ N_n(0, σ^2 I_n), i.e. Y = f^0 + ǫ. The total variation regularized estimator f̂ of f^0 over the graph G is defined as f̂ := argmin_{f ∈ R^n} { ‖Y − f‖_n^2 + 2λ‖D_G f‖_1 }, where λ > 0 is a tuning parameter. This is a special case of the generalized Lasso with design matrix I_n and penalty matrix D_G. Hereafter we suppress the subscript G in the notation of the incidence matrix of the graph G.
In this article, we restrict our attention to tree graphs, i.e. connected graphs with m = n − 1. For a tree graph we have that D ∈ R^{(n−1)×n} and rank(D) = n − 1. In order to transform the above problem into an (almost) ordinary Lasso problem, we define D̃, the incidence matrix rooted at vertex i, as the matrix obtained by adding to D the row A = e_i'. In the following, we root the incidence matrix at the vertex i = 1, obtaining in this way a lower triangular matrix with ones on the diagonal, and minus ones as nonzero off-diagonal elements. The square matrix D̃ is invertible and we denote its inverse by X := D̃^{−1}.
We now perform a change of variables. Let β := D̃f; then f = Xβ. The above problem can be rewritten as β̂ := argmin_{β ∈ R^n} { ‖Y − Xβ‖_n^2 + 2λ‖β_{−1}‖_1 }, i.e. an ordinary Lasso problem with p = n, where the first coefficient β_1 is not penalized. Note that, in order to perform this transformation, it is necessary to restrict ourselves to tree graphs, since we want D̃ to be invertible.
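The change of variables can be illustrated numerically; the following is a minimal sketch (our own toy example, assuming numpy and a path graph with n = 6; the variable names are ours):

```python
import numpy as np

n = 6
# Rooted incidence matrix of the path graph: first row selects vertex 1
# (the root), row i is the edge difference f_i - f_{i-1}.
D_rooted = np.eye(n)
for i in range(1, n):
    D_rooted[i, i - 1] = -1.0

X = np.linalg.inv(D_rooted)          # X = inverse of the rooted incidence matrix

f = np.array([1.0, 1.0, 3.0, 3.0, 3.0, 2.0])
beta = D_rooted @ f                  # beta_1 = f_1, beta_i = jump across edge i
assert np.allclose(X @ beta, f)      # change of variables: f = X beta
```

Here β collects the first value of the signal and its successive jumps, and X recovers f by cumulative summation, which is exactly why only β_{−1} enters the penalty.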
Let X = (X_1, X_{−1}), where X_1 ∈ R^n denotes the first column of X and X_{−1} ∈ R^{n×(n−1)} the remaining n − 1 columns. Let β_{−1} ∈ R^{n−1} be the vector β with the first entry removed. Since X_1 is the all-ones vector, minimizing over the unpenalized β_1 amounts to centering: denoting by Ȳ and X̄_{−1} the column centered versions of Y and X_{−1}, some easy calculations show that β̂_{−1} = argmin_{β_{−1}} { ‖Ȳ − X̄_{−1}β_{−1}‖_n^2 + 2λ‖β_{−1}‖_1 } and β̂_1 = (1/n) Σ_{i=1}^n (Y − X_{−1}β̂_{−1})_i; both β̂_{−1} and β̂_1 depend on λ.
Note that prediction properties of β, i.e. the properties of X β, will translate into properties of the estimator f , often also called Edge Lasso estimator.
Remark. In the construction of an invertible matrix starting from D, it would be possible to choose the added row A = (1, . . . , 1) as well. Indeed, when we perform the change of variables from f to β, β̂_{−1} estimates the jumps and thus gives information about the relative location of the signal. However, to be able to estimate the absolute location of the signal we either need an estimate of the location of the signal at one point (choice A := (0, . . . , 0, 1, 0, . . . , 0), giving β̂_i = f̂_i; in particular we consider the case i = 1), or of the "mean" location of the signal (choice A = (1, . . . , 1), giving β̂_1 = Σ_{i=1}^n f̂_i).

The path graph and the path graph with one branch
In this article we are interested, besides the more general case of "large enough" tree graphs, in the particular cases of D being the incidence matrix of either the path graph or the path graph with one branch. The choice of A makes it easy to calculate the matrix X and gives a nice interpretation of it. Let P_1 be the path matrix of the graph G with reference root the vertex 1. The matrix P_1 is constructed as follows: (P_1)_{ij} := 1 if the vertex j is on the path from vertex 1 to vertex i, and 0 otherwise.
Theorem 1.1 (Inversion of the rooted incidence matrix). For a tree graph, the rooted incidence matrix D̃ is invertible and D̃^{−1} = P_1.
Proof of Theorem 1.1. For a formal proof we refer to Jacobs et al. (2008) and to Bapat (2014). The intuition behind this theorem is as follows. We have to check that rank(D̃) = n. One can perform Gaussian elimination on the rooted incidence matrix: keep the first row as it is and, for row i, add up the rows indexed by the vertices belonging to the path going from vertex 1 to vertex i. In this way one obtains an identity matrix and thus rank(D̃) = n. By the same argument one can find the inverse, which corresponds to P_1.
Example 1.2 (Incidence matrix and path matrix with reference vertex 1 for the path graph). Let G be the path graph with n = 6 vertices. The incidence matrix is

D = ( -1  1  0  0  0  0
       0 -1  1  0  0  0
       0  0 -1  1  0  0
       0  0  0 -1  1  0
       0  0  0  0 -1  1 )

and the path matrix with reference vertex 1 is

P_1 = ( 1 0 0 0 0 0
        1 1 0 0 0 0
        1 1 1 0 0 0
        1 1 1 1 0 0
        1 1 1 1 1 0
        1 1 1 1 1 1 )

Example 1.3 (Incidence matrix and path matrix with reference vertex 1 for the path graph with a branch). Let G be the path graph with one branch. The graph has in total n = n_1 + n_2 vertices. The main branch consists of n_1 vertices; the side branch consists of n_2 vertices and is attached to the vertex number b < n_1 of the main branch. Take n_1 = 4, n_2 = 2 and b = 2. The incidence matrix is

D = ( -1  1  0  0  0  0
       0 -1  1  0  0  0
       0  0 -1  1  0  0
       0 -1  0  0  1  0
       0  0  0  0 -1  1 )

and the path matrix with reference vertex 1 is

P_1 = ( 1 0 0 0 0 0
        1 1 0 0 0 0
        1 1 1 0 0 0
        1 1 1 1 0 0
        1 1 0 0 1 0
        1 1 0 0 1 1 )
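Theorem 1.1 can be verified numerically on the branched graph of Example 1.3; a sketch (our own code, assuming numpy, with a `parent` map of our choosing encoding the edges oriented away from the root):

```python
import numpy as np

# Branched path graph of Example 1.3: n1 = 4 main-branch vertices,
# n2 = 2 side-branch vertices attached at b = 2, so n = 6.
n = 6
parent = {2: 1, 3: 2, 4: 3, 5: 2, 6: 5}   # child -> parent, root is vertex 1

D_rooted = np.zeros((n, n))
D_rooted[0, 0] = 1.0                      # row selecting the root
for i, p in parent.items():               # row for the edge into vertex i
    D_rooted[i - 1, i - 1] = 1.0
    D_rooted[i - 1, p - 1] = -1.0

# Path matrix P_1: (P_1)_{ij} = 1 iff vertex j lies on the path from 1 to i.
P1 = np.zeros((n, n))
for i in range(1, n + 1):
    v = i
    while True:
        P1[i - 1, v - 1] = 1.0
        if v == 1:
            break
        v = parent[v]

assert np.allclose(np.linalg.inv(D_rooted), P1)   # Theorem 1.1
```

The same check works for any tree: only the `parent` map changes.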

Notation
Here we expose the notational conventions used for handling the (branched) path graph and later branching points with arbitrarily many (K) branches.

• (Branched) path graph
We decide to enumerate the vertices of the (branched) path graph starting from the root 1, continuing up to the end of the main branch n 1 and then continuing from the vertex n 1 + 1 of the side branch attached to vertex b up to the last vertex of the side branch n = n 1 + n 2 . We are going to use two different notations: the one is going to be used for finding explicit expressions for quantities related to the projection of a column of X onto some subsets of the columns of X. The other is going to be used when calculating the compatibility constant and is based on the decomposition of the (branched) path graph into smaller path graphs. In both notations we let the set S ⊆ {2, . . . , n} be a candidate set of active edges.

First notation
We partition S into three mutually disjoint sets S_1, S_2, S_3, where S_1 ⊆ {2, . . . , b}, S_2 ⊆ {b + 1, . . . , n_1}, S_3 ⊆ {n_1 + 1, . . . , n}. We let s_i := |S_i|, i ∈ {1, 2, 3}, and s := |S| = s_1 + s_2 + s_3, and we write S = {ξ_1, . . . , ξ_{s_1+s_2+s_3}}. In the case where we consider the path graph we simply take S = S_1 (i.e. n = n_1).
Second notation (for bounding the compatibility constant). With this second notation we decompose the branched path graph into three smaller path graphs. However, the end of the first one does not necessarily coincide with the point b, and the beginnings of the other two do not necessarily coincide with the points b + 1 and n_1 + 1 respectively. For each of the three path graphs, i ∈ {1, 2, 3}, we denote by d^i_j, j ∈ {1, . . . , s_i + 1}, the distances determined by the elements of S_i, i.e. the number of vertices between two consecutive elements of S_i (respectively before the first and after the last element). The elements d^1_{s_1+1}, d^2_1, d^3_1 are only constrained to be greater than or equal to two; otherwise, for a given S, their choice is left free. Moreover, note that Σ_{i=1}^3 Σ_{j=1}^{s_i+1} d^i_j = n. We thus end up with three sequences of distances. We can relate part of these sequences to the set B defined in the first notation. The only place where there might be some discrepancy between the first and the second notation is at d^1_{s_1+1}, d^2_1, d^3_1, which might differ from b_{s_1+1}, b_{s_1+2}, b_{s_1+s_2+3}.
In the case of the path graph we just consider one single of these path graphs and thus S = S 1 and s = s 1 and we omit the index i.

• Branching point with arbitrarily many branches
In Sections 3 and 6 we are going to consider branching points participating in K + 1 edges. In these cases we are going to denote by b_1 the number of vertices between the ramification point and the last vertex in S in the main branch, with these two extreme vertices included, and by b_2, . . . , b_{K+1} the number of vertices after the ramification point and before the first vertex in S (or the end of the relative branch). In these more complex cases, for the sake of simplicity, we only consider situations where the first and the second notation coincide. We are often going to restrict our attention to "large enough" general tree graphs. These can be seen as tree graphs composed of g path graphs glued together at their extremities with d^i_j ≥ 4, ∀j ∈ {1, . . . , s_i + 1}, ∀i ∈ {1, . . . , g}. The reason for these requirements will become clear in Sections 5 and 6.

Review of the literature
While to our knowledge there is no attempt in the literature to analyze the specific properties of the total variation regularized least squares estimator over general branched tree graphs, there is a lot of work on the so-called Fused Lasso estimator. An early analysis of the Fused Lasso estimator can be found in Mammen and van de Geer (1997). Some other early work is exposed in Tibshirani et al. (2005); Friedman et al. (2007); Tibshirani and Taylor (2011), where computational aspects are considered as well.
In the literature we can find two main currents of research, the one focusing on the pattern recovery properties (which is going to be quickly exposed in Section 7) and the other on the analysis of the mean squared error to prove oracle inequalities.

Minimax rates
In this subsection we expose some results on minimax rates, making use of the notation of Sadhanala, Wang and Tibshirani (2016). In particular, let T(C) := {f ∈ R^n : ‖Df‖_1 ≤ C} be the class of (discrete) functions of bounded total variation on the path graph, where D is its incidence matrix. Assume the linear model with f^0 ∈ T(C) for some C > 0 and with i.i.d. Gaussian noise with variance σ^2 ∈ (0, ∞). It has been shown in Donoho and Johnstone (1998) that the minimax risk over the class of functions with bounded total variation satisfies R(T(C)) ≍ σ^{4/3} C^{2/3} n^{−2/3}. Mammen and van de Geer (1997) prove that, if λ ≍ n^{−2/3} C^{1/3}, then the Fused Lasso estimator achieves the minimax rate within the class T(C). Sadhanala, Wang and Tibshirani (2016) also point out that estimators which are linear in the observations cannot achieve the minimax rate within the class of functions of bounded total variation, since they are not able to adapt to the spatially inhomogeneous smoothness of some elements of this class.

Oracle inequalities
We expose some recent results, appeared in the papers by Hütter and Rigollet (2016); Dalalyan, Hebiri and Lederer (2017); Lin et al. (2017); Guntuboyina et al. (2017). In particular we give the rates of the remainder term in the (sharp) oracle inequalities holding with high probability exposed in these papers.
• Hütter and Rigollet (2016) obtain a quite general result, in the sense that it applies to any graph G with incidence matrix D ∈ R^{m×n}. In particular, for the choice of the tuning parameter λ = σρ√(2 log(em/δ))/n, δ ∈ (0, 1/2), they obtain a remainder term of order σ^2 ρ^2 |S| log(em/δ)/(n κ_D^2(S)), where κ_D(S) is called the compatibility factor and ρ, the largest ℓ_2-norm of a column of the Moore–Penrose pseudoinverse D^+ = (δ_1^+, . . . , δ_m^+) ∈ R^{n×m} of the incidence matrix D, i.e. ρ = max_{j∈[m]} ‖δ_j^+‖_2, is called the inverse scaling factor. For the path graph, we have m = n − 1, ρ ≍ √n and, according to Lemma 3 in Hütter and Rigollet (2016), κ_D(S) = Ω(1) if |S| ≥ 2.
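For the path graph, the growth ρ ≍ √n can be observed numerically; a small sketch (our own illustration with numpy, not code from Hütter and Rigollet (2016)):

```python
import numpy as np

def rho_path(n):
    # Incidence matrix of the path graph: (n-1) x n, rows f_{i+1} - f_i.
    D = np.zeros((n - 1, n))
    for i in range(n - 1):
        D[i, i], D[i, i + 1] = -1.0, 1.0
    Dplus = np.linalg.pinv(D)
    # rho = largest l2-norm of a column of the pseudoinverse D^+.
    return np.linalg.norm(Dplus, axis=0).max()

# If rho grows like sqrt(n), quadrupling n should double rho.
ratios = [rho_path(4 * n) / rho_path(n) for n in (50, 100)]
```

The columns of D^+ here are centered step functions, whose squared norms are j(n − j)/n, maximal near the middle of the path; this is what makes the ratio approach 2.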

Approach for general tree graphs
The approach we follow is very similar to the one presented in the proof of Theorem 3 of Dalalyan, Hebiri and Lederer (2017). However, we refine their proof by not penalizing the first coefficient of β and by adjusting the definition of the compatibility constant accordingly. Note that by not penalizing the first coefficient we allow it to be always active. This is a more natural approach given our problem definition. Let β ∈ R^n be a vector of coefficients and S ⊆ {2, . . . , n} a subset of the indices of β, called the active set, with s := |S| its cardinality.
Definition 2.1 (Compatibility constant). The compatibility constant κ(S) is defined as
Remark. Note that ω_{{1}∪S} = 0 and 0 ≤ ω ≤ 1, since for tree graphs the maximum ℓ_2-norm of a column of X is √n.
Definition 2.3. Take γ > 1. The vector of weights w ∈ R^n is defined as
Remark. Note that 0 ≤ w ≤ 1 and that w_{{1}∪S} = 1.

For two vectors
Definition 2.4 (Weighted compatibility constant). The weighted compatibility constant κ w (S) is defined as Remark. Note that the (weighted) compatibility constant depends on the graph through X, which is the path matrix of the graph rooted at the vertex 1.
Remark. Note that a key point in our approach is the computation of a lower bound for the compatibility constant over the path graph, which is shown to be tight in some special cases. The concept of compatibility constant for total variation estimators over graphs is already presented in Hütter and Rigollet (2016). However, we refer to the (different) definition given in Dalalyan, Hebiri and Lederer (2017), which we slightly modify to adapt it to our problem definition.

Calculation of projection coefficients and lengths of antiprojections, a local approach
In this section we present an easy and intuitive way of calculating (anti-)projections and the related projection coefficients of a column of the path matrix of a tree rooted at vertex 1 onto a subset of the columns of the same matrix. Let this matrix be called X. These calculations are motivated by the necessity of finding explicit expressions for the lengths of the antiprojections (for the weighted compatibility constant) and for the projection coefficients (to check for which signal patterns the irrepresentable condition is satisfied). In particular, consider the task of projecting a column X_j, j ∈ {1} ∪ S, onto the remaining columns X_{({1}∪S)\{j}}. This can be seen as finding an argmin. The direct computation of these quantities can be quite laborious. Here, we show an easier way to compute these projections and we prove that they can be computed "locally", i.e. taking into account only some smaller part of the graph.
We start by considering the path graph. Then we treat the more general situation of "large enough" tree graphs.

Path graph
Let j ∈ {1} ∪ S be the index of a column of X that we want to project onto the remaining columns X_{({1}∪S)\{j}}. Let j^− denote the largest element of {1} ∪ S smaller than j and j^+ the smallest element of S ∪ {n + 1} larger than j, and denote their indices inside {1} ∪ S ∪ {n + 1} = {i_1, . . . , i_{s+2}} by l^− and l^+, i.e. j^− = i_{l^−} and j^+ = i_{l^+}. We use the convention X_{n+1} = 0 ∈ R^n. We are going to show that the projection of X_j onto X_{({1}∪S)\{j}} is the same as its projection onto (X_{j^−}, X_{j^+}). This means that the part of the set {1} ∪ S not bordering with j can be neglected. The intuition behind this insight can be clarified as follows. Projecting X_j amounts to finding the projection coefficients θ̂_j minimizing the length of the antiprojection; the projection is then X_{({1}∪S)\{j}} θ̂_j. Since the columns of X can be seen as indicator functions on [n], this projection problem can be interpreted as the problem of finding the least squares approximation to 1_{{i ≥ j}} by using functions in the class {1_{{i ≥ j*}}, j* ∈ ({1} ∪ S) \ {j}}.
We now apply a linear transformation in order to obtain an orthogonal design. Note that I_{s+1} = D̃^{(s+1)} X^{(s+1)}, where D̃^{(s+1)} is the incidence matrix of a path graph with s + 1 vertices rooted at vertex 1 and X^{(s+1)} is its inverse, i.e. the corresponding rooted path matrix. We obtain a reparametrization with τ_j = X^{(s+1)} θ_j, i.e. the progressively cumulative sums of the components of θ_j, and design matrix X_{{1}∪S} D̃^{(s+1)} ∈ R^{n×(s+1)}, a matrix containing as columns the indicator functions 1_{{i_l ≤ i < i_{l+1}}}, l ∈ {1, . . . , s + 1}, which are pairwise orthogonal. Because of the orthogonality of the design matrix, we can now solve s + 1 separate optimization problems to find the components of τ̂_j, each minimizing the sum of squared residuals (i.e. the length of the antiprojection) on one segment. We see that, to get the projection coefficients, we either need to know j^+ and j^− or the length of the constant segment in which j lies together with its position within this segment. We have thus proved the following Lemma.
Lemma 3.1 (Localizing the projections). Let X be the path matrix rooted at vertex 1 of a path graph with n vertices and S ⊆ {2, . . . , n}. For j ∈ {1} ∪ S define j − and j + as in Equations (1) and (2). Then i.e. the (length of the) (anti-)projections can be computed in a "local" way.
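Lemma 3.1 can be illustrated numerically; the following sketch (our own toy example with numpy, n = 12 and S = {3, 6, 10}) compares the projection of X_6 onto all the remaining active columns with the projection onto the two neighbouring columns only:

```python
import numpy as np

def proj(A, y):
    # Orthogonal projection of y onto the column span of A.
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return A @ coef

n = 12
X = np.tril(np.ones((n, n)))   # path matrix rooted at vertex 1; column j = 1{i >= j}
S = [3, 6, 10]
j = 6                          # active index; its neighbours are j- = 3, j+ = 10

full  = proj(X[:, [0, 2, 9]], X[:, j - 1])   # onto all of X_{({1} u S) \ {j}}
local = proj(X[:, [2, 9]],    X[:, j - 1])   # onto the two neighbours only

assert np.allclose(full, local)   # Lemma 3.1: the projection is "local"
```

On the segment between the two neighbouring jumps the projection is the constant 4/7 (the indicator 1{i ≥ 6} is one on 4 of the 7 vertices of the segment {3, …, 9}), in line with the interpretation of the projection as a least squares approximation of an indicator function.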

Moreover by writing
Furthermore, for j < i_s, j ∈ {1} ∪ S, the sum of the entries of θ̂_j is 1.

General branching point
Using arguments similar to the ones above, we can now focus on a ramification point of a general tree graph. Let us consider K path graphs attached at the end of one path graph (which we assume to contain the root). Let X be the path matrix rooted at the first vertex; we want to find the projections of the columns X_{−1} onto X_1 = (1, . . . , 1)′. Without loss of generality we can consider only one branch i* ∈ {2, . . . , K + 1}. We now consider two cases, l = 1 and l ≠ 1.
Note that in the last region before the end of one branch, the approximation of the indicator function we implicitly calculate does not have to jump up to one, and thus only one coefficient of the respective θ̂_j will be nonzero and this coefficient will be smaller than one. Now we focus on the case where each of the branches (path graphs) involved in a ramification presents at least one jump (i.e. one element of the set S). The length of the antiprojections is calculated in the same way as above. According to the arguments exposed above, we can consider only the jumps surrounding the ramification point. Let us call them j_1, j_2, . . . , j_{K+1}. We have to find the corresponding θ̂'s; the matrices involved in the orthogonalization are respectively the rooted incidence matrix of a star graph with (K + 1) vertices and its inverse.
We now consider two cases: l = 1 and l ≠ 1.

Compatibility constant
In this section we assume G to be the path graph with n vertices. We give two lower bounds for the compatibility constant for the path graph with and without weights. The proofs are postponed to the Appendix B, where we present some elements that allow extension to the branched path graph and to more general tree graphs as well. These bounds are presented in a paper by van de Geer (2018) as well. We use the second notation exposed in Subsection 1.2.1.
Lemma 4.1 (Lower bound on the compatibility constant for the path graph, part of Theorem 6.1 in van de Geer (2018)). For the path graph it holds that
Corollary 4.2 (The bound can be tight, part of Theorem 6.1 in van de Geer (2018)). Assume d_j is even ∀j ∈ {2, . . . , s}. Then we can take u_j = d_j/2, ∀j ∈ {2, . . . , s}. Let β* be defined by f* = Xβ*. Then
Proof of Corollary 4.2. See Appendix B.
Remark. For the compatibility constant we want to find the largest possible lower bound; thus we have to choose the u_j's such that K is minimized. Looking at the first order optimality conditions, we notice that they reduce to finding the extremes of (s − 1) functions of the type g(u_j). Thus, we cannot obtain the optimal value of K as soon as at least one d_j is odd.
Lemma 4.3 (part of Theorem 6.1 in van de Geer (2018)). For the path graph it holds that
where D is the incidence matrix of the path graph.
Proof of Lemma 4.3. See Appendix B.

Oracle inequality
Define the vector ∆ of the distances between consecutive jumps and let ∆_h be its harmonic mean. We now want to translate the result of Theorem 2.5 to the path graph. To do so we need a lower bound for the weighted compatibility constant, i.e. an explicit upper bound for Σ_{i=2}^n (w_i − w_{i−1})^2. In this way we obtain the following Corollary.
Corollary 4.4 (Sharp oracle inequality for the path graph). Assume d i ≥ 4, ∀i ∈ {1, . . . , s + 1}. It holds that If we choose f = f 0 and S = S 0 we obtain that Proof of Corollary 4.4. See Appendix B.
Remark. Since the harmonic mean of ∆ is upper bounded by its arithmetic mean, and this upper bound is attained when all the entries of ∆ are equal, the best attainable order of the remainder term corresponds to equally spaced jumps.
Remark. Our result differs from the one obtained by Dalalyan, Hebiri and Lederer (2017) in two points:
• We have ∆_h, the harmonic mean of the distances between jumps, instead of min_j ∆_j, the minimum distance between jumps;
• We slightly improve the rate by reducing a log(n) factor to log(n/s). This is achieved with a more careful bound on the squares of the consecutive differences of the weights.
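The gain from the harmonic mean over the minimum can be seen on a small hypothetical configuration of jump distances (our own numbers, not from the paper):

```python
import numpy as np

# Distances between consecutive jumps of a hypothetical signal on a path
# graph, including the boundary segments before the first and after the
# last jump.
delta = np.array([10.0, 10.0, 50.0, 30.0])

harmonic_mean = len(delta) / np.sum(1.0 / delta)
minimum = delta.min()

# The remainder term scales like 1/mean: since the harmonic mean is never
# smaller than the minimum, the bound using it is never worse.
assert harmonic_mean >= minimum
```

Here the harmonic mean is roughly 15.8 against a minimum of 10, so the oracle bound with the harmonic mean is strictly smaller whenever the distances are unequal.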

Path graph with one branch
In this section we consider G to be the path graph with one branch and n vertices.

Compatibility constant
Lemma 5.1 (Lower bound for the compatibility constant for the branched path graph). For the branched path graph it holds that
Corollary 5.2 (The bound can be tight). Assume d^i_j is even ∀j ∈ {2, . . . , s_i}, i ∈ {1, 2, 3}. One can then choose u^i_j = d^i_j/2, ∀j ∈ {2, . . . , s_i}, i ∈ {1, 2, 3}. Moreover, let f_i be the restriction of f to the i-th of the three path graphs, of length p_i each. Let us now define f*_i ∈ R^{p_i} by
Proof of Corollary 5.2. See Appendix C.
Consider the decomposition of the branched path graph into three path graphs, implicitly done by using the second notation in Section 1.2.1. Let D* denote the incidence matrix of the branched path graph, where the entries in the rows corresponding to the edges connecting the three above mentioned path graphs have been substituted with zeroes.
Proof of Lemma 5.3. See Appendix C.

Oracle inequality
As in the case of the path graph, to prove an oracle inequality for the branched path graph, we need to find an explicit expression to control the weighted compatibility constant to insert in Theorem 2.5. The resulting bound is similar to the one obtained in the Proof of Corollary 4.4, up to a difference: we now have to handle with care the region around the branching point b.
Remark. As made clear in the second notation in Section 1.2.1, we require that d^1_{s_1+1}, d^2_1, d^3_1 ≥ 2. This means that our approach can handle the case where at most one of the jumps surrounding the bifurcation point occurs directly at the bifurcation point. Note that neither b_{s_1+1} = 0 nor b_{s_1+2} = b_{s_1+s_2+3} = 0 are allowed.
If we choose f = f 0 and S = S 0 we get that

Extension to more general tree graphs
In this section we consider only situations corresponding to Case 1) of Corollary 5.4. This means that we assume that, even when more than one branch is attached at a ramification point, the edge connecting each branch to the ramification point and the consecutive one do not present jumps (i.e. are not elements of the set S).

Oracle inequality for general tree graphs
With the insights gained in Section 3 we can, by availing ourselves of simple means, prove an oracle inequality for a general tree graph, where the jumps in S are far enough from the branching points, in analogy to Case 1) in Corollary 5.4.
Here as well, we utilize the general approach exposed in Theorem 2.5 and we need to handle with care the weighted compatibility constant and find a lower bound for it.
We know that, when we are in (the generalization of) Case 1) of Corollary 5.4, to prove bounds for the compatibility constant the tree graph can be seen as a collection of path graphs glued together at (some of) their extremities. As seen in Section 3, the length of the antiprojections for the vertices around ramification points depends on all the branches attached to the ramification point in question. Here, for the sake of simplicity, we assume that d^i_j ≥ 4, ∀j, ∀i, i.e. between consecutive jumps there are at least four vertices, and there are at least four vertices before the first and after the last jump of each path graph resulting from the decomposition of the tree graph. This is what we call a "large enough" tree graph. Indeed, for d^i_j ≥ 4, we have that log(d^i_j) ≤ 2 log(d^i_j/2). Let G be a tree graph with the properties exposed above. In particular it can be decomposed into g path graphs. For each of these path graphs, by using the second notation in Subsection 1.2.1, we define the vectors ∆_i. Moreover, we write ∆ = (∆_1, . . . , ∆_g) ∈ R^{2s} and |∆| = (|∆|_1, . . . , |∆|_g) ∈ R^{2(s+g)}.
We have that for G, where ∆_h is the harmonic mean of ∆. Moreover, an upper bound for the inverse of the weighted compatibility constant can be computed by upper bounding the squared consecutive pairwise differences of the weights for the g path graphs.
We thus get, in analogy to Corollary 4.4, the following Corollary.
Corollary 6.1 (Oracle inequality for a general tree graph). Let G be a tree graph which can be decomposed into g path graphs. Assume that d^i_j ≥ 4, ∀j ∈ {1, . . . , s_i + 1}, ∀i ∈ {1, . . . , g}. Then
Remark. Notice that it is advantageous to choose a decomposition where the path graphs are as large as possible, so that g is small and fewer requirements on the d^i_j's are imposed.
Remark. This approach is of course not optimal, however it allows us to prove in a simple way a theoretical guarantee for the Edge Lasso estimator if some (not extremely restrictive) requirement on G and S is satisfied.
Asymptotic signal pattern recovery: the irrepresentable condition
Definition 7.2 (Pattern recovery). We say that an estimator f̂ of a signal f^0 on a graph G with incidence matrix D recovers the signal pattern if Df̂ =_s Df^0, i.e. if sgn(Df̂) = sgn(Df^0) entrywise.
In the literature, considerable attention has been given to the question whether or not it is possible to consistently recover the pattern of a piecewise constant signal contaminated with some noise, say Gaussian noise. In that regard, Qian and Jia (2016) highlight the so-called staircase problem: as soon as there are two consecutive jumps in the same direction in the underlying signal separated by a constant segment, no consistent pattern recovery is possible, since the irrepresentable condition (cf. Zhao and Yu (2006)) is violated.
Some cures have been proposed to mitigate the staircase problem. Rojas and Wahlberg (2015) and Ottersten, Wahlberg and Rojas (2016) suggest modifying the algorithm for computing the Fused Lasso estimator. Their strategy is based on the connection made by Rojas and Wahlberg (2014) between the Fused Lasso estimator and a sequence of discrete Brownian Bridges. Owrang et al. (2017) propose instead to normalize the design matrix of the associated Lasso problem to comply with the irrepresentable condition. Another proposal aimed at complying with the irrepresentable condition is the one by Qian and Jia (2016), based on preconditioning the design matrix with the puffer transformation defined in Jia and Rohe (2015), which results in estimating the jumps of the true signal by the soft-thresholded differences of consecutive observations.
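The last proposal admits a compact sketch: after preconditioning, each jump is estimated by soft-thresholding the difference of consecutive observations (a minimal illustration with numpy; the signal, noise level and threshold `lam` are our own hypothetical choices, not values from Qian and Jia (2016)):

```python
import numpy as np

rng = np.random.default_rng(0)
f0 = np.repeat([0.0, 2.0, 4.0], 20)       # staircase signal on a path, n = 60
Y = f0 + 0.2 * rng.standard_normal(f0.size)

lam = 1.0                                  # hypothetical threshold level
diffs = np.diff(Y)
# Soft-thresholding of consecutive differences: jump estimates.
jumps = np.sign(diffs) * np.maximum(np.abs(diffs) - lam, 0.0)

detected = set(np.flatnonzero(jumps))      # 0-based indices of detected edges
```

Note that this estimator recovers the sign pattern of the staircase signal, which the plain Fused Lasso cannot do consistently.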

Approach to pattern recovery for total variation regularized estimators over tree graphs
Let us now consider the case of the Edge Lasso on a tree graph rooted at vertex 1. We saw in Section 1 that the problem can be transformed into an ordinary Lasso problem where the first coefficient is not penalized. We start with the following remark.
Remark (The irrepresentable condition when some coefficients are not penalized). Let us consider the Lasso problem where some coefficients are not penalized, i.e. the estimator
where U, R, S are three subsets partitioning [p]. In particular, U is the set of the unpenalized coefficients, R is the set of truly zero coefficients and S is the set of truly nonzero (active) coefficients. We assume the linear model Y = Xβ^0 + ǫ, ǫ ∼ N_n(0, σ^2 I_n). The vector of true coefficients β^0 can be written as
Moreover we write
Assume that |U| ≤ n and that Σ̂_UU, Σ̂_SS and Σ̂_RR are invertible. We can write the irrepresentable condition as
∃η > 0 : ‖X_R′ A_U X_S (X_S′ A_U X_S)^{−1} z^0_S‖_∞ ≤ 1 − η,
where z^0_S = sgn(β^0_S), A_U = I_n − Π_U is the antiprojection matrix onto V_U, the linear subspace spanned by X_U, and Π_U = X_U Σ̂_UU^{−1} X_U′/n. Indeed, write δ := β̂ − β^0. The KKT conditions can be written as
By solving Equation 3 with respect to δ_U, then inserting into Equation 4 and solving with respect to δ_S, then inserting the expression for δ_R into the expressions for δ_U and δ_S, and by finally inserting them into Equation 5, by analogy with the proof proposed by Zhao and Yu (2006) we find the irrepresentable condition displayed above. Thus, by using the notation of this remark, we let U = {1}, S = S_0 and R = [n] \ (S_0 ∪ {1}).
Lemma 7.5. We have that Proof of Lemma 7.5. See Appendix D.
This means that for tree graphs the irrepresentable condition can be checked for the "active set" {1} ∪ S_0 instead of S_0, but then the first column has to be neglected. This fact is justified, although in a different way than the one we propose, in Qian and Jia (2016) as well.
Remark (The irrepresentable condition for asymptotic pattern recovery of a signal on a graph does not depend on the orientation of the edges of the graph). We assume the linear model Y = f^0 + ǫ, ǫ ∼ N_n(0, σ^2 I_n). Let Ĩ be a diagonal matrix with entries ±1 encoding the orientation of the edges, with first entry 1, and define β := Ĩ D̃ f; then f = X Ĩ β (note that Ĩ^{−1} = Ĩ). The linear model assumed becomes Y = X Ĩ β^0 + ǫ, and it is clear that now the design matrix of the associated Lasso problem is X Ĩ. Let us write, without loss of generality, Ĩ = diag(1, Ĩ_{−1}). According to Lemma 7.5, we can check if ∃η ∈ (0, 1] such that the irrepresentable condition holds with design matrix X Ĩ, where β̃^0 = D̃ f^0, i.e. the vector of truly nonzero jumps when the root has sign +1 and the edges are oriented away from it. Note that Ĩ_{−({1}∪S_0)} does not change the ℓ_∞-norm, and by inserting the above expressions we obtain the same condition with z̃^0_{S_0} = sgn(β̃^0_{S_0}). This means that it is enough to check the irrepresentable condition for the orientation of the edges away from the root in order to know, for all the orientations of the graph, whether it holds. The intuition behind this is that, by choosing the orientation of the edges of the graph, we choose at the same time the sign that the true jumps have across the edges.

Irrepresentable condition for the path graph
Theorem 7.6 (Irrepresentable condition for the transformed Fused Lasso, Theorem 2 in Qian and Jia (2016)). Consider the model for a piecewise constant signal and let S_0 denote the set of indices of the jumps in the true signal.
Remark. This fact can also easily be read off from the considerations made in Section 3, in particular from Lemma 3.1.

Irrepresentable condition for the path graph with one branch
Corollary 7.7 (Irrepresentable condition for the branched path graph). Assume S_0 ≠ ∅. The irrepresentable condition for the branched path graph is satisfied if and only if one of the following cases holds and, in the subvectors β^0_{1:n_1} and β^0_{(b, n_1+1:n)}, there are no two consecutive nonzero entries of β^0 with the same sign separated by some zero entries.
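As an illustration of the "no staircase" clause (the sign patterns below are hypothetical examples, not taken from the source):

```latex
% Two consecutive nonzero jumps with the same sign, separated by zeros:
\operatorname{sgn}(\beta^0)\ \text{along some stretch} = (+1,\, 0,\, \dots,\, 0,\, +1)
\quad \text{(staircase: the condition fails)},
% whereas alternating signs across the zero stretch are allowed:
\operatorname{sgn}(\beta^0)\ \text{along some stretch} = (+1,\, 0,\, \dots,\, 0,\, -1)
\quad \text{(no staircase)} .
```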

The irrepresentable condition for general branching points
When the graph G has a branching point where arbitrarily many branches are attached, for the irrepresentable condition to be satisfied it is required, in addition to the absence of staircase patterns along the path graphs building G, that the last jump in the path graph containing the branching point has sign + (resp. −) and that all the jumps in the other path graphs glued to this branching point have sign − (resp. +), with respect to the orientation of the edges away from the root. For the indices of the K + 1 jumps surrounding the branching point we use the same notation as in Subsection 3.2, i.e. we denote them by {j_1, . . . , j_{K+1}}.
Proof of Theorem 7.8. See Appendix D.

Conclusion
We refined some details of the approach of Dalalyan, Hebiri and Lederer (2017) for proving a sharp oracle inequality for the total variation regularized estimator over the path graph. In particular we followed an approach where one coefficient is left unpenalized, and we gave a proof of a lower bound on the compatibility constant which does not use probabilistic arguments.
The key point of this article is that we proved that the approach applied to the path graph can indeed be generalized to a branched path graph and further to more general tree graphs. In particular we found a lower bound on the compatibility constant and we generalized the result concerning the irrepresentable condition obtained for the path graph by Qian and Jia (2016).

The KKT conditions are X'(Y − Xβ̂)/n = λ ẑ_{−1}, where ẑ_{−1} ∈ R^n is a vector with the first entry equal to zero and the remaining ones equal to the subdifferential of the absolute value of the corresponding entry of β̂. Inserting Y = Xβ^0 + ε into the KKT conditions and multiplying them once by β̂ and once by β, we obtain two inequalities, where the last inequality follows by the dual norm inequality and the fact that ||ẑ_{−1}||_∞ ≤ 1. Subtracting the first inequality from the second and using polarization, we obtain, for any S ⊂ {2, . . . , n}, the "basic" inequality. We are going to utilize the approach described by Dalalyan, Hebiri and Lederer (2017) to handle the remainder term I with care; indeed the antiprojection of elements of V_{{1}∪S} is zero. Restricting ourselves to the set F, for γ ≥ 1, and using the definition of the weighted compatibility constant and the convex conjugate inequality, we see that ||X(β̂ − β)||²_n cancels out. It now remains to find a lower bound for P(F) and a high-probability upper bound for ||Π_{{1}∪S} ε||²_n.
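The polarization step invoked in the derivation above can be spelled out. The choice of u and v below is our reading of the (garbled) display, with u − v = X(β − β^0):

```latex
% Polarization identity: for u = X(\hat\beta - \beta^0) and v = X(\hat\beta - \beta),
2 u^{\prime} v \;=\; \|u\|_2^2 + \|v\|_2^2 - \|u - v\|_2^2 ,
% i.e., since u - v = X(\beta - \beta^0),
2(\hat\beta - \beta^0)^{\prime} X^{\prime} X (\hat\beta - \beta)
= \|X(\hat\beta - \beta^0)\|_2^2 + \|X(\hat\beta - \beta)\|_2^2
  - \|X(\beta - \beta^0)\|_2^2 .
```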

Random part
• First, we lower bound P(F ), thanks to the following lemma.

Appendix B: Proofs of Section 4
Let f ∈ R^n be a function defined at every vertex of a connected nondegenerate graph G. Moreover let f_(1) ≤ f_(2) ≤ · · · ≤ f_(n) be an ordering of the values of f, with arbitrary order within ties. Let D denote the incidence matrix of the graph G.
Lemma B.1 (Lemma 11.9 in van de Geer (2018)). It holds that ||Df||_1 ≥ f_(n) − f_(1).

Remark. For the special case of G being the path graph, we have equality in Lemma B.1 when f is nonincreasing or nondecreasing on the graph.
Proof of Lemma B.1. Since G is connected there is a path between any two vertices. Therefore there is a path connecting the vertices where f takes the values f_(n) and f_(1). The total variation of a function defined on a graph is nondecreasing in the number of edges of the graph. Let us now consider f_P, the restriction of f to a path P connecting f_(1) to f_(n). If f is nondecreasing on the path P, then ||D_P f_P||_1 = f_(n) − f_(1); in general ||D_P f_P||_1 ≥ f_(n) − f_(1). Since G has at least as many edges as P, ||Df||_1 ≥ ||D_P f_P||_1 ≥ f_(n) − f_(1).

Lemma B.2 (Lemma 11.10 in van de Geer (2018)). It holds for any j ∈ {1, . . . , n} that

Proof of Lemma B.2. See van de Geer (2018).
Lemma B.3 (Lemma 11.11 in van de Geer (2018)). Let f ∈ R^n be defined over a connected graph G_f whose incidence matrix is D_f. The total variation of f is ||D_f f||_1. Analogously, let g ∈ R^m be defined over a connected graph G_g whose incidence matrix is D_g. The total variation of g is ||D_g g||_1. Then for any j ∈ {1, . . . , n} and k ∈ {1, . . . , m}

Proof of Lemma B.3. Suppose without loss of generality that f_j ≥ g_k. Then the claim follows by Lemma B.2.

Proof of Corollary 4.4. Let A = I − Π_{{1}∪S} denote the antiprojection matrix on the columns of X indexed by {1} ∪ S. By using the definition of w_i and ω_i, we have the corresponding bound; for the path graph we rely on the calculations of Section 3.
We now have to find an upper bound for K. Since the choice of u_j is arbitrary, we choose u_j = ⌊d_j/2⌋, j ∈ {2, . . . , s}, which minimizes the upper bound among the integers. We thus have that K ≤ 2s/Δ_h, where Δ_h is the harmonic mean of Δ. Finally, for the path graph we obtain Corollary 4.4.

B.1. Outline of proofs by means of a minimal toy example
To give the reader some intuition we present a minimal toy example. Consider the path graph with n = 8 and let S = {3, 7}. In this example d_1 = 2, d_2 = 4, u_2 = 2, d_3 = 2. The idea now is to apply Lemma B.3 twice, once to the path graphs ({1, 2}, (1, 2)) and ({3, 4}, (3, 4)) and once to the path graphs ({5, 6}, (5, 6)) and ({7, 8}, (7, 8)). Note that the term |f_5 − f_4| is not needed to apply Lemma B.3 and thus can be left out. We get the desired bound, where the last step follows by the Cauchy–Schwarz inequality. We thus see that we can handle graphs built by modules consisting of small path graphs containing an edge in S and at least one vertex not involved in this edge on each side. The edges connecting these modules can then be neglected in the upper bound. In the weighted case we define g_i = w_i f_i, i = 1, . . . , 8, and proceed analogously. Here as well, note that the squared difference of the weights across the edge connecting the two modules (smaller but large enough path graphs containing an element of S) can be neglected. The procedure exemplified here can be used to handle larger tree graphs, as long as one is able to decompose them into such smaller modules. The fact that squared weight differences can be neglected at the junction of modules will be of use in the proof of Corollary 5.4.
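The dropping of the connecting term in this toy example can be written out explicitly (a sketch consistent with n = 8 and S = {3, 7}, where edge 3 joins vertices 3, 4 and edge 7 joins vertices 7, 8):

```latex
% Drop |f_5 - f_4|, the edge joining the two modules \{1,\dots,4\} and \{5,\dots,8\}:
\sum_{i=2}^{8} |f_i - f_{i-1}|
\;\ge\; \underbrace{\sum_{i=2}^{4} |f_i - f_{i-1}|}_{\text{module containing edge } 3 \in S}
\;+\; \underbrace{\sum_{i=6}^{8} |f_i - f_{i-1}|}_{\text{module containing edge } 7 \in S} .
```

Lemma B.3 is then applied separately within each module.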
Remark. The limits of this approach are given by Lemma B.3, since its use requires, for each edge in S, the presence of at least one distinct edge not in S on the left and on the right, not sharing vertices with edges used to handle other elements of S. Thus s ≤ n/4. However, this limitation is very likely to be of scarce relevance if some kind of minimal length condition holds; see for instance Dalalyan, Hebiri and Lederer (2017).

Proof of Corollary 5.2. The proof follows by direct calculations in analogy to the one of Corollary 4.2 (i.e. Theorem 6.1 in van de Geer (2018)).
Proof of Lemma 5.3. In the proofs of Lemma 4.1 and Lemma 4.3 (i.e. Theorem 6.1 and Lemma 9.1 in van de Geer (2018)) and in Appendix B.1 it is made clear that the use of Lemma B.3 requires that the edges connecting the smaller pieces into which the path graph is partitioned are taken out of consideration. This results in an upper bound containing only the squares of some of the consecutive pairwise differences between the entries of w, the vector of weights. This "incomplete" sum can then of course be upper bounded by ||Dw||²_2, where D is the incidence matrix of the path graph.
In the case of the branched path graph the same reasoning applies, in particular to the two edges connecting together the three path graphs defined by the second notation. Indeed these can be left out, and it is natural to do so. Thus, in full analogy to the procedure exposed in the proofs of Lemmas 4.1 and 4.3 (i.e. Theorem 6.1 and Lemma 9.1 in van de Geer (2018)) for the path graph, the statement of Lemma 5.3 follows. See Appendix B.1 for an intuition.

Proof of Corollary 5.4. We use the calculations done in Section 3. Indeed we can bound all the terms except the last two by applying the reasoning developed for the path graph.
We are now interested in upper bounding ||D*w||²_2 rather than ||Dw||²_2. We have a bound in which we distinguish four cases. Assume without loss of generality that b_{s_1+s_2+3} = 0.
By using the formula for the inverse of a partitioned matrix (see Lemma D.1) we get an expression in which e_1 = (1, 0, . . . , 0)' ∈ R^s. As a consequence we can perform the corresponding multiplication. We now develop A_1 X_{S_0}(X'_{S_0} A_1 X_{S_0})^{−1} to see if it coincides with the second entry of the matrix we have obtained. In particular, by using Lemma D.2, we can write the second term as X_1 e'_1.
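For reference, the standard partitioned-matrix inverse, which is presumably the content of Lemma D.1 (the block symbols E, F, G, H below are generic and unrelated to the incidence matrix D):

```latex
% 2x2 block inverse via the Schur complement S_E := H - G E^{-1} F,
% valid when E and S_E are invertible:
\begin{pmatrix} E & F \\ G & H \end{pmatrix}^{-1}
= \begin{pmatrix}
E^{-1} + E^{-1} F S_E^{-1} G E^{-1} & - E^{-1} F S_E^{-1} \\
- S_E^{-1} G E^{-1} & S_E^{-1}
\end{pmatrix} .
```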
In the KKT conditions we note that z_1 = 0 (indeed we have the usual normal equations for coefficients that are not penalized) and thus we establish the desired equality.
Proof of Theorem 7.8. We refer to Section 3 for the calculation of the projection coefficients.
We now want to find the signal patterns z for which the irrepresentable condition is satisfied. Consider the first condition: it excludes the signal pattern where all the jumps have the same sign.
Thus, in the following assume w.l.o.g. that z 1 = 1. Now we look at the second condition. We are going to consider the cases where p of the K last elements of the vector (α, 1 − α, −α, . . . , −α) get the sign + and K − p get the sign −. We look for the linear combination with the highest absolute value. This can be seen as finding the linear combination L of (α, −α, . . . , −α) determined by p and then adding sgn(L) to it. We scan the cases p = 1, . . . , K − 1, since the case p = K is already discarded by looking at the first condition.
If K is odd, for p = (K + 1)/2, we have that K + 1 − 2p = 0 and the irrepresentable condition is violated, since the linear combination gives ±1.
Thus, it only remains to consider p = 0. For p = 0 we get the condition |1 − (K + 1)α| < 1 from the first as well as from the second condition above. This condition is satisfied whenever α < 2/(K + 1).
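The last equivalence is elementary; for α > 0:

```latex
|1 - (K+1)\alpha| < 1
\;\Longleftrightarrow\; -1 < 1 - (K+1)\alpha < 1
\;\Longleftrightarrow\; 0 < (K+1)\alpha < 2
\;\Longleftrightarrow\; 0 < \alpha < \tfrac{2}{K+1} .
```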