Exact post-selection inference for the generalized lasso path

Abstract: We study tools for inference conditioned on model selection events that are defined by the generalized lasso regularization path. The generalized lasso estimate is given by the solution of a penalized least squares regression problem, where the penalty is the ℓ1 norm of a matrix D times the coefficient vector. The generalized lasso path collects these estimates as the penalty parameter λ varies (from ∞ down to 0). Leveraging a (sequential) characterization of this path from Tibshirani and Taylor [37], and recent advances in post-selection inference from Lee et al. [22], Tibshirani et al. [38], we develop exact hypothesis tests and confidence intervals for linear contrasts of the underlying mean vector, conditioned on any model selection event along the generalized lasso path (assuming Gaussian errors in the observations). Our construction of inference tools holds for any penalty matrix D. By inspecting specific choices of D, we obtain post-selection tests and confidence intervals for specific cases of generalized lasso estimates, such as the fused lasso, trend filtering, and the graph fused lasso. In the fused lasso case, the underlying coordinates of the mean are assigned a linear ordering, and our framework allows us to test selectively chosen breakpoints or changepoints in these mean coordinates. This is an interesting and well-studied problem with broad applications; our framework applied to the trend filtering and graph fused lasso cases serves several applications as well. Aside from the development of selective inference tools, we describe several practical aspects of our methods, such as (valid, i.e., fully-accounted-for) post-processing of generalized lasso estimates before performing inference in order to improve power, and problem-specific visualization aids that the data analyst can use to choose linear contrasts to be tested. Many examples, from both simulated and real data sources, are presented to examine the empirical properties of our inference methods.

MSC 2010 subject classifications: Primary 62F03.


Introduction
Consider a classic Gaussian model for observations y ∈ R^n, with known marginal variance σ² > 0,

y ∼ N(θ, σ²I), (1.1)

where the (unknown) mean θ ∈ R^n is the parameter of interest. In this paper, we examine problems in which θ is believed to have some specific structure (at least approximately so), in that it is sparse when parametrized with respect to a particular basis. A key example is the changepoint detection problem, in which the components of the mean θ_1, . . . , θ_n correspond to ordered underlying positions (or locations) 1, . . . , n, and many adjacent components θ_i and θ_{i+1} are believed to be equal, with the exception of a sparse number of breakpoints or changepoints to be determined. See the left plot in Figure 1 for a simple example.

Many methods are available for estimation and detection in the changepoint problem. We focus on the 1-dimensional fused lasso [39], also called 1-dimensional total variation denoising [30] in signal processing, for reasons that will become clear shortly. This method, which we call the 1d fused lasso (or simply fused lasso) for short, is often used for piecewise constant estimation of the mean, but it does not come with associated inference tools after changepoints have been detected. In the top right panel of Figure 1, we inspect the 1d fused lasso estimate that has been tuned to detect two changepoints, in a data model where the mean θ has only one true changepoint. Writing the changepoint locations as 1 ≤ I_1 < I_2 ≤ n, we might consider testing the null hypotheses

H_{0,j} : θ̄_{I_{j-1}:(I_j − 1)} = θ̄_{I_j:(I_{j+1} − 1)}, for j = 1, 2,

where we write I_0 = 1 and I_3 = n + 1 for notational convenience, and θ̄_{a:b} for the average of θ over positions a, . . . , b. If we were to naively ignore the data-dependent nature of I_1, I_2 (these are the estimated changepoints from the two-step fused lasso procedure), i.e., treat them as fixed, then the natural tests for the null hypotheses H_{0,j}, j = 1, 2 would reject for large magnitudes of the statistics

T_j = ȳ_{I_j:(I_{j+1} − 1)} − ȳ_{I_{j-1}:(I_j − 1)}, for j = 1, 2.
The table in Figure 1 shows the results of running such naive Z-tests. At location I_2 (labeled location B in the figure), which corresponds to a true changepoint in the underlying mean, the test returns a very small p-value, as expected.
But at location I_1 (labeled A in the figure), a spurious detected changepoint, the naive Z-test also produces a small p-value. This happens because the location I_1 has been selected by the 1d fused lasso, which inexorably links it to an unusually large magnitude of T_1; in other words, it is no longer appropriate to compare T_1 against its supposed Gaussian null distribution, with mean zero and variance σ²(1/(I_1 − I_0) + 1/(I_2 − I_1)). Also shown in the table are the results of running our new truncated Gaussian (TG) test for the 1d fused lasso, which properly accounts for the data-dependent nature of the changepoints detected by the fused lasso, and produces p-values that are exactly uniform under the null, conditional on I_1, I_2 having been selected by the fused lasso in the first place. We now see that only the changepoint at location I_2 has a small associated p-value.
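To make the naive construction concrete, here is a minimal sketch (our illustration, not the authors' code) of the two-sided naive Z-test at a detected changepoint, using the convention above that a segment runs over positions I_{j-1}, . . . , I_j − 1; the function name and interface are assumptions for illustration:

```python
import numpy as np
from scipy.stats import norm

def naive_z_pvalue(y, I0, I1, I2, sigma):
    """Two-sided naive Z p-value for a mean change at I1 (1-indexed),
    treating the detected locations as fixed -- exactly the step that makes
    this test invalid after selection. Segments: I0:(I1-1) and I1:(I2-1)."""
    left = y[I0 - 1:I1 - 1].mean()    # mean over positions I0, ..., I1 - 1
    right = y[I1 - 1:I2 - 1].mean()   # mean over positions I1, ..., I2 - 1
    T = right - left
    se = sigma * np.sqrt(1.0 / (I1 - I0) + 1.0 / (I2 - I1))
    return 2 * norm.sf(abs(T) / se)   # Gaussian null with the stated variance
```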

Summary
In this paper, we make the following contributions.
• We introduce the usage of post-selection inference tools for selection events defined by a class of methods called generalized lasso estimators. The key mathematical task is to show that the model selection event defined by any (fixed) step of the generalized lasso solution path can be expressed as a polyhedron in the observation vector y (Section 3.1). The (conditionally valid) TG tests and confidence intervals from Lee et al. [22], Tibshirani et al. [38] can then be applied, to test or cover any linear contrast of the mean vector θ.

• We describe a stopping rule based on a generic information criterion (akin to AIC or BIC), to select a step along the generalized lasso path at which we are to perform conditional inference. We give a polyhedral representation for the ultimate model selection event that encapsulates both the selected path step and the generalized lasso solution at this step (Section 3.2). Along with the TG tests and confidence intervals, this makes for a practical (nearly automatic) and broadly applicable set of inference tools.

• We study various special cases of the generalized lasso problem, namely, the 1d fused lasso, trend filtering, graph fused lasso, and regression problems, and for each, we develop specific forms for linear contrasts that can be used to test different population quantities of interest (Sections 4.1 through 4.5). In each case, our tests represent new contributions to the space of currently available inferential tools. For example, in the 1d fused lasso case, our tests are the first that we know of that are specifically designed to yield proper inferences after changepoint locations have been detected.

• We present two extensions of the basic tools described above for post-selection inference in generalized lasso problems: a post-processing tool, to improve the power of our methods, and a visualization aid, to improve practical usability.

• We conduct comprehensive simulations across the various special problem cases, to investigate the (conditional) power of our methods, and verify their (conditional) type I error control (Sections 5.1 through 5.5). We also demonstrate a realistic application of our selective inference tools for changepoint detection to a data set of comparative genomic hybridization (CGH) measurements from two glioblastoma multiforme (GBM) tumors (Section 5.6).

Related work
Post-selection inference, also known as selective inference, is a new but rapidly growing field. Unlike other recent developments in high-dimensional inference that use a more classic full-population model, the point of selective inference is to provide a means of testing hypotheses that stem from a selected model, the output of an algorithm that has been applied to the data at hand. In a sequence of papers, Leeb and Pötscher [24,25,26] prove impossibility results about estimating the post-selection distribution of certain estimators in a classical regression setting. Berk et al. [3], Lockhart et al. [27] circumvent this by considering different test statistics, rather than the standard studentized pivot (the default for inference without selection). The former work is very broad and considers all selection mechanisms in regression (hence yielding more conservative inference); the latter is much more specific and considers the lasso estimator in particular. Lee et al. [22], Tibshirani et al. [38] improve on the method in Lockhart et al. [27], and introduce a pivot-based framework for post-selection inference. Lee et al. [22] describe the application to the lasso problem at a fixed tuning parameter λ; Tibshirani et al. [38] describe the application to the lasso path at a fixed number of steps k (and also, the least angle regression and forward stepwise paths). A number of extensions to different problem settings are given in Lee and Taylor [23], Reid et al. [29], Loftus and Taylor [28], Choi et al. [8]. Asymptotics for non-Gaussian error distributions are presented in Tian and Taylor [33], Tibshirani et al. [36]. A broad treatment of selective inference in exponential family models and selective power is presented in Fithian et al. [10]. An improvement based on auxiliary randomization is given in Tian and Taylor [34]. A study of selective sequential tests and stopping rules is given in Fithian et al. [11]. Ours is the first work to consider selective inference in structured problems like the generalized lasso.

Changepoint detection carries a huge body of literature; reviews can be found in, e.g., Brodsky and Darkhovski [4], Chen and Gupta [7], Eckley et al. [9]. Far sparser is the literature on changepoint inference, say, inference for the location or size of changepoints, or segment lengths. Hinkley [17], Worsley [42], Bai [2] are some examples, and Jandhyala et al. [20], Horváth and Rice [19] provide nice surveys and extensions. Some tools are built around likelihood ratio test statistics comparing two nested changepoint models, but at fixed locations.
Since the interesting locations to be tested are typically estimated, these inferences can clearly be invalid (if estimation and inference are both performed on the same data). Other tools use likelihood ratio tests of the null hypothesis of no change, against an alternative of any possible change. Because these are global tests, they are not directly comparable to our post-selection tests of linear contrasts of the mean.
Probably most relevant to our goal of valid post-selection changepoint inference is Frick et al. [12], who develop a simultaneous confidence band for the mean in a changepoint model. Their Simultaneous Multiscale Changepoint Estimator (SMUCE) seeks the most parsimonious piecewise constant fit subject to an upper limit on a certain multiscale statistic, and is solved via dynamic programming. Because the final confidence band has simultaneous coverage (over all components of the mean), it also has valid coverage for any (data-dependent) post-selection target. In contrast, our proposal does not give simultaneous coverage of the mean, but rather, selective coverage of a particular post-selection target. An empirical comparison between the two methods (SMUCE, and ours) is given in Section 5.2. While this comparison is useful and informative, it is also worth emphasizing that the framework in our paper applies far outside of the changepoint detection problem, i.e., to trend filtering, graph clustering, and regression problems with structured coefficients.

Notation
For a matrix D, we will denote by D_S the submatrix formed by the rows of D indexed by a set S ⊆ {1, . . . , m}. We write D_{−S} to mean D_{S^c} = D_{{1,...,m}∖S}. Similarly, for a vector x, we write x_S or x_{−S} to extract the subvector whose components are in S or not in S, respectively. We use A⁺ for the pseudoinverse of a matrix A; row(A), col(A), null(A) for the row space, column space, and null space of A, respectively; and nullity(A) for the dimension of null(A). We write P_L for the projection matrix onto a linear subspace L. Lastly, we will often abbreviate a subsequence (x_a, x_{a+1}, . . . , x_b) of a vector x by x_{a:b}.

The generalized lasso regularization path
Given a response y ∈ R^n, the generalized lasso estimator is defined by the optimization problem

β̂ = argmin_{β ∈ R^p} (1/2)‖y − Xβ‖²₂ + λ‖Dβ‖₁, (2.1)

where X ∈ R^{n×p} is a predictor matrix, D ∈ R^{m×p} is a penalty matrix, and λ ≥ 0 is a regularization parameter. (The solution in (2.1) is not in general unique, but we will restrict our attention to problems with rank(X) = p, in which case it is.) The matrix D is chosen so that sparsity of Dβ̂ induces some type of desired structure in the solution β̂ in (2.1). Important special cases, each corresponding to a specific class of matrices D, include the 1d fused lasso, trend filtering, and graph fused lasso problems. More details on these problems are given in Section 4; see also Section 2 in Tibshirani and Taylor [37].

We review the algorithm of Tibshirani and Taylor [37] to compute the entire solution path in (2.1), i.e., the continuum of solutions β̂(λ) as the regularization parameter λ descends from ∞ to 0. For now, we focus on the problem of signal approximation, where X = I:

β̂ = argmin_{β ∈ R^n} (1/2)‖y − β‖²₂ + λ‖Dβ‖₁. (2.2)

For a general X, a simple modification to the arguments used for (2.2) will deliver the solution path for (2.1), and we refrain from describing this until Section 4.5. The path algorithm of Tibshirani and Taylor [37] for (2.2) is derived from the perspective of its equivalent Lagrange dual problem, namely

û(λ) = argmin_{u ∈ R^m} (1/2)‖y − Dᵀu‖²₂ subject to ‖u‖_∞ ≤ λ, (2.3)

whose solution is related to the primal solution in (2.2) by

β̂(λ) = y − Dᵀû(λ). (2.4)

The strategy is now to compute a solution path û(λ) in the dual problem, as λ descends from ∞ to 0, and then use (2.4) to deliver the primal solution path. Therefore it suffices to describe the path algorithm as it operates on the dual problem; this is given next.
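For concreteness, a single generalized lasso solution at a fixed λ can be computed with an off-the-shelf convex solver; the following sketch (an illustration of problems (2.1) and (2.2), not the path algorithm itself) assumes the cvxpy package is available:

```python
import cvxpy as cp

def generalized_lasso(y, D, lam, X=None):
    """Solve (2.1) at a fixed lambda; X = None gives the signal approximator (2.2)."""
    p = y.shape[0] if X is None else X.shape[1]
    beta = cp.Variable(p)
    fit = beta if X is None else X @ beta
    obj = cp.Minimize(0.5 * cp.sum_squares(y - fit) + lam * cp.norm1(D @ beta))
    cp.Problem(obj).solve()
    return beta.value
```

The path algorithm described next is preferable in practice, since it delivers the entire continuum of solutions, along with the boundary sets needed for inference.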
Algorithm 1 (Dual path algorithm for the generalized lasso, X = I).
Given y ∈ R n and D ∈ R m×n .

1. Compute the first knot λ₁ and the first hitting coordinate-sign pair, and initialize the boundary set B₁ = {i₁} and boundary signs s_{B₁} = (r₁) accordingly.

2. For k = 1, 2, 3, . . ., while λ_k > 0:

(a) Compute the dual solution û(λ) over the interval λ ∈ [λ_{k+1}, λ_k], in terms of the current boundary set B_k and boundary signs s_{B_k}.

(b) Compute the next hitting time λ^hit_{k+1}, as in (2.6). Define the hitting coordinate i^hit_{k+1} and hitting sign r^hit_{k+1} to be the pair achieving the maximum in that expression.

(c) Compute the next leaving time λ^leave_{k+1}, as in (2.7), and set the next knot to λ_{k+1} = max{λ^hit_{k+1}, λ^leave_{k+1}}, as in (2.8), adding the hitting coordinate to, or removing the leaving coordinate from, the boundary set accordingly.

The dual path algorithm, in Algorithm 1, tracks the coordinates of the dual solution û(λ) that are equal to ±λ, i.e., that lie on the boundary of the constraint region [−λ, λ]^m. The collection of such coordinates, at any given step k in the path, is called the boundary set, and is denoted B_k. Critical values of the regularization parameter at which the boundary set changes (i.e., at which coordinates join or leave the boundary set) are called knots, and are denoted λ₁ ≥ λ₂ ≥ λ₃ ≥ · · · ≥ 0. From the form of the dual solution û(λ) as presented in Algorithm 1, and also the primal-dual relationship (2.4), the primal solution path may be expressed in terms of the current boundary set B_k and boundary sign list s_{B_k}, as in

β̂(λ) = P_{null(D_{−B_k})}(y − λ Dᵀ_{B_k} s_{B_k}), for λ ∈ [λ_{k+1}, λ_k]. (2.9)

The above shows that the primal solution lies in the subspace null(D_{−B_k}), which means it expresses a certain type of structure. This will become more concrete as we look at specific cases for D in Section 4, but for now, the important point is that the structure of the generalized lasso solution (2.9) is determined by the boundary set B_k. Thus, by conditioning on the observed boundary set B_k after a certain number of steps k of the path algorithm, we are effectively conditioning on the observed model structure in the generalized lasso solution at step k. This is essentially what is done in Section 3.

Lastly, we note the following important point. In some generalized lasso problems, Step 2(c) in Algorithm 1 does not need to compute a leaving time, i.e., we can formally replace this computation by λ^leave_{k+1} = 0, and accordingly, the boundary set B_k will only grow over iterations k. This is true, e.g., for all 1d fused lasso problems; more generally, it is true for any generalized lasso signal approximator problem in which DDᵀ is diagonally dominant.
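To illustrate (2.9), here is a minimal numpy sketch that evaluates the primal solution from a given boundary set and sign list; it assumes B is a strict subset of {1, . . . , m}, and the helper name is our own:

```python
import numpy as np
from scipy.linalg import null_space

def primal_from_boundary(y, D, B, s_B, lam):
    """Evaluate (2.9): beta_hat(lambda) = P_{null(D_{-B})} (y - lam * D_B^T s_B)."""
    notB = np.setdiff1d(np.arange(D.shape[0]), B)
    Z = null_space(D[notB, :])           # orthonormal basis for null(D_{-B})
    target = y - lam * D[B, :].T @ s_B   # shift y along the boundary rows
    return Z @ (Z.T @ target)            # project onto null(D_{-B})
```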

Exact inference after polyhedral conditioning
Under the Gaussian observation model in (1.1), Lee et al. [22], Tibshirani et al. [38] develop a framework for inference on an arbitrary linear contrast vᵀθ of the mean θ, conditional on y ∈ G, where G ⊆ R^n is an arbitrary polyhedron. A core tool in these works is an exact pivotal statistic for vᵀθ, conditional on y ∈ G: they prove that there exist random variables V^lo, V^up such that

F^{[V^lo, V^up]}_{vᵀθ, σ²‖v‖²₂}(vᵀy) | {y ∈ G} ∼ Unif(0, 1), (2.10)

where F^{[a,b]}_{μ,τ²} denotes the cumulative distribution function of Z ∼ N^{[a,b]}(μ, τ²), a univariate normal random variate with mean μ and variance τ², truncated to lie in the interval [a, b]. The statistic in (2.10) is called the truncated Gaussian (TG) pivot.
Here is some insight into the construction of (2.10). Let us represent our polyhedron as G = {x : Γx ≥ w}, where Γ ∈ R^{q×n} and w ∈ R^q (and the inequality here is interpreted componentwise). Some straightforward algebra shows that we can (essentially) write

V^lo = max_{j : ρ_j > 0} (w_j − (Γy)_j + ρ_j vᵀy) / ρ_j,
V^up = min_{j : ρ_j < 0} (w_j − (Γy)_j + ρ_j vᵀy) / ρ_j,

where ρ = Γv/‖v‖²₂. A simple rearrangement of the above expressions shows that V^lo, V^up are functions of P⊥_v y alone, and so they are independent of vᵀy. This, and integrating out over P⊥_v y, verifies the pivotal property in (2.10). The TG pivotal statistic in (2.10) enables us to test the null hypothesis H₀ : vᵀθ = 0 against the one-sided alternative H₁ : vᵀθ > 0. Namely, it is clear that the TG test statistic

T = 1 − F^{[V^lo, V^up]}_{0, σ²‖v‖²₂}(vᵀy) (2.11)

is itself a p-value for H₀, with finite sample validity, conditional on y ∈ G. (A two-sided test is also possible: we simply use 2 min{T, 1 − T} as our p-value; see Tibshirani et al. [38] for a discussion of the merits of one-sided and two-sided selective tests.) Confidence intervals follow directly from (2.10) as well. For an (equi-tailed) interval with exact finite sample coverage 1 − α, conditional on the event y ∈ G, we take [η_{α/2}, η_{1−α/2}], where η_{α/2}, η_{1−α/2} are obtained by inverting the TG pivot, i.e., defined to satisfy

1 − F^{[V^lo, V^up]}_{η_a, σ²‖v‖²₂}(vᵀy) = a, for a = α/2, 1 − α/2. (2.12)

A one-sided interval with coverage 1 − α of the form [η_α, ∞) can be constructed similarly.
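A minimal sketch of the TG p-value computation, specialized to the cone case w = 0 that arises for the generalized lasso path (Section 3.1); this is our illustrative helper, and a numerically careful implementation would evaluate the truncated Gaussian CDF on the log scale in the far tails:

```python
import numpy as np
from scipy.stats import norm

def tg_pvalue(y, Gamma, v, sigma):
    """One-sided TG p-value for H0: v^T theta = 0 vs H1: v^T theta > 0,
    conditional on {Gamma y >= 0}, following (2.10), (2.11)."""
    vty = v @ y
    rho = Gamma @ v / (v @ v)
    with np.errstate(divide="ignore", invalid="ignore"):
        # (w_j - (Gamma y)_j + rho_j v^T y) / rho_j, with w = 0
        ratio = (rho * vty - Gamma @ y) / rho
    vlo = np.max(ratio[rho > 0], initial=-np.inf)
    vup = np.min(ratio[rho < 0], initial=np.inf)
    tau = sigma * np.sqrt(v @ v)
    a, b, x = vlo / tau, vup / tau, vty / tau
    F = (norm.cdf(x) - norm.cdf(a)) / (norm.cdf(b) - norm.cdf(a))
    return 1.0 - F
```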
At this point, it may seem unclear how this framework applies to postselection inference in generalized lasso problems. The key ingredients are, of course, the polyhedron G and the contrast vector v. In the next section, we will show how to construct polyhedra that correspond to model selection events of interest, at points along the generalized lasso path. In the following section, we will suggest choices of contrast vectors that lead to interesting and useful tests in specific settings, such as the 1d fused lasso, trend filtering, and graph fused lasso problems.

Can we not just use lasso inference tools?
When the penalty matrix D is square and invertible, the generalized lasso problem (2.1) is equivalent to a lasso problem, in the variable α = Dβ, with design matrix XD −1 . More generally, when D has full row rank, problem (2.1) is reducible to a lasso problem (see Tibshirani and Taylor [37]). In this case, existing inference theory for the lasso path (from Tibshirani et al. [38]) could be applied to the equivalent lasso problem, to perform post-selection inference on generalized lasso models. This covers inference for the 1d fused lasso and trend filtering problems. But when D is row rank deficient (when it has more rows than columns), the generalized lasso is not equivalent to a lasso problem (see again Tibshirani and Taylor [37]), and we cannot simply resort to lasso inference tools. This would hence rule out treating problems like the 2d fused lasso, the graph fused lasso (for any graph with more edges than nodes), the sparse 1d fused lasso, and sparse trend filtering from a pure lasso perspective. Our paper presents a unified treatment of post-selection inference across all generalized lasso problems, regardless of the penalty matrix D.

The selection event after a given number of steps k
Here, we suppose that we have run a given (fixed) number of steps k of the generalized lasso path algorithm, and we have a contrast vector v in mind, such that vᵀθ is a parameter of interest (to be tested or covered). Define the generalized lasso model at step ℓ of the path to be

M_ℓ = (B_ℓ, s_{B_ℓ}, R^hit_ℓ, I^leave_ℓ),

where B_ℓ, s_{B_ℓ} are the boundary set and signs at step ℓ, and R^hit_ℓ, I^leave_ℓ are quantities to be defined shortly. We will show that the entire model sequence over steps ℓ = 1, . . . , k, denoted M_{1:k} = (M₁, . . . , M_k), is a polyhedral set in y. By this we mean the following: if M_{1:k}(y) denotes the model sequence as a function of y, and M_{1:k} a given realization, then the set

G_k = {y : M_{1:k}(y) = M_{1:k}}

is a polyhedron, more specifically, a convex cone, and can therefore be expressed as G_k = {y : Γy ≥ 0} for a matrix Γ = Γ(M_{1:k}) that we will show how to construct, based on M_{1:k}.
Our construction uses induction. When k = 1, and we write B₁ = {i₁} and s_{B₁} = (r₁), it is clear from the first step of Algorithm 1 that (i₁, r₁) is the hitting coordinate-sign pair if and only if

r₁ [(DDᵀ)⁺Dy]_{i₁} ≥ ± [(DDᵀ)⁺Dy]_j, for all j ≠ i₁.

Hence we can construct Γ(M₁) to have the corresponding 2(m − 1) rows; to be explicit, these are

r₁ eᵀ_{i₁}(DDᵀ)⁺D − eᵀ_j(DDᵀ)⁺D and r₁ eᵀ_{i₁}(DDᵀ)⁺D + eᵀ_j(DDᵀ)⁺D, for j ≠ i₁,

where e_i denotes the ith standard basis vector. We note that at the first step, there is no characterization needed for R^hit₁ and I^leave₁ (for simplicity, we may think of these as being empty sets).

Now assume that, given a model sequence M_{1:(k+1)} = (M₁, . . . , M_{k+1}), we have constructed a polyhedral representation for G_k = {y : M_{1:k}(y) = M_{1:k}}, i.e., we have constructed a matrix Γ(M_{1:k}) such that G_k = {y : Γ(M_{1:k})y ≥ 0}. To show that G_{k+1} = {y : Γ(M_{1:(k+1)})y ≥ 0} can also be written in the analogous form, we will define Γ(M_{1:(k+1)}) by appending rows to Γ(M_{1:k}) that capture the generalized lasso model at step k + 1 of Algorithm 1. We will add rows to characterize the hitting time (2.6), the leaving time (2.7), and the next action, either hitting or leaving (2.8).

Keeping with the notation in (2.6), a simple argument shows that the next hitting time can be alternatively written as a maximum of ratios of quantities a_i, b_i that are linear in y. Plugging in for a, b, we characterize the viable hitting signs at step k + 1, R^hit_{k+1} = {sign(a_i) : i ∉ B_k}, as well as the next hitting coordinate and hitting sign, i^hit_{k+1} and r^hit_{k+1}, by a collection of linear inequalities in y. This corresponds to 2(m − |B_k|) rows to be appended to Γ(M_{1:k}).

For (2.7), we first define the viable leaving coordinates, denoted I^leave_{k+1}, by the subset of i ∈ B_k for which c_i < 0 and d_i < 0, where c_i, d_i are again linear in y. We may write I^leave_{k+1} = C^leave_{k+1} ∩ D^leave_{k+1}, where C^leave_{k+1} = {i ∈ B_k : c_i < 0} and D^leave_{k+1} = {i ∈ B_k : d_i < 0}. Plugging in for c, d, we notice that only the former set C^leave_{k+1} depends on y, and D^leave_{k+1} is deterministic once we have characterized M_{1:k}. This gives rise to a set of inequalities determining I^leave_{k+1}, which corresponds to |D^leave_{k+1}| ≤ |B_k| rows to be appended to Γ(M_{1:k}). Given this characterization for I^leave_{k+1}, we may then characterize the next leaving coordinate i^leave_{k+1} by inequalities asserting that it achieves the maximum over I^leave_{k+1}; this corresponds to |I^leave_{k+1}| ≤ |B_k| further rows to be appended to Γ(M_{1:k}). Recall that the leaving sign is given by r^leave_{k+1} = r_{i^leave_{k+1}}.

Lastly, for (2.8), we append a single row asserting λ^hit_{k+1} ≥ λ^leave_{k+1}, or the same with the inequality sign flipped, if λ^hit_{k+1} < λ^leave_{k+1}. In either case, only one more row is to be appended to Γ(M_{1:k}). This completes the inductive proof.
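A small sketch of the base case, building the 2(m − 1) rows of Γ(M₁) described above (an illustrative helper, not the authors' implementation):

```python
import numpy as np

def gamma_first_step(D, i1, r1):
    """Rows encoding that (i1, r1) is the first hitting pair:
    r1 * x_{i1} >= +/- x_j for all j != i1, where x = (D D^T)^+ D y."""
    m = D.shape[0]
    P = np.linalg.pinv(D @ D.T) @ D        # x = P y, so each row below acts on y
    base = r1 * P[i1]
    rows = []
    for j in range(m):
        if j != i1:
            rows.append(base - P[j])       # r1 * x_{i1} - x_j >= 0
            rows.append(base + P[j])       # r1 * x_{i1} + x_j >= 0
    return np.vstack(rows)                 # shape (2(m-1), n)
```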
Combining the results of this subsection with the TG pivotal statistic from Section 2.2, we are now equipped to perform conditional inference on the model that is selected at any fixed step k of the generalized lasso path. (Recall, we are assuming that a reasonable contrast vector v has been determined such that vᵀθ is a quantity of interest in the k-step generalized lasso model; in-depth discussion of reasonable choices of contrast vectors, for particular problems, is given in Section 4.) Of course, the choice of which step k to analyze is somewhat critical. The high-level idea is to fix a step k that is large enough for the selected model to be interesting, but not so large that our tests will be low-powered. In some practical applications, choosing k a priori may be natural; e.g., in the 1d fused lasso problem, where the selected model corresponds to detected changepoints (as discussed in the introduction), we may choose (say) k = 10 steps, if in our particular setting we are interested in detecting and performing inference on at most 10 changepoints. But in most practical applications, fixing a step k a priori is likely a difficult task. Hence, we present a rigorous strategy that allows the choice of k to be data-driven, next.

The selection event after an IC-selected number of steps k
We develop approaches based on a generic information criterion (IC), like AIC or BIC, for selecting a number of steps k along the generalized lasso path that admits a "reasonable" model. By "reasonable", we mean that our IC approach selects a k-step generalized lasso solution that balances training error against some notion of complexity. Importantly, we specifically design our IC-based approaches so that the selection event determining k is itself polyhedral in y. We establish this below.
Defined in terms of the generalized lasso model (really, the boundary set B_k) at step k, we consider the general form IC

IC(k) = ‖y − P_{null(D_{−B_k})} y‖²₂ + P_n(nullity(D_{−B_k})). (3.1)

The first term above is the squared loss between y and its projection onto the subspace null(D_{−B_k}); recall that the k-step generalized lasso solution itself lies in this subspace, as written in (2.9), and so here we have replaced the squared loss between y and β̂(λ_k) with the squared loss between y and the unshrunken estimate P_{null(D_{−B_k})} y. (This is needed in order for our eventual IC-based rule to be equivalent to a polyhedral constraint in y, as will be seen shortly.) The second term in (3.1) utilizes nullity(D_{−B_k}), the dimension of null(D_{−B_k}), i.e., the dimension of the solution subspace. It hence penalizes the complexity associated with the k-step generalized lasso solution. Further, P_n is a penalty function that is allowed to depend on n and σ² (the marginal variance in the data model (1.1)). Some natural choices are: P_n(d) = 2σ²d, which makes (3.1) resemble AIC; P_n(d) = σ²d log n, motivated by BIC; and P_n(d) = σ²(d log n + 2γ log (n choose d)), where γ ∈ (0, 1) is a parameter to be chosen (say, γ = 1/2 for simplicity), motivated by the extended BIC of Chen and Chen [6]. Beyond these, any choice of complexity penalty will do, as long as P_n(d) is an increasing function of d.
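A minimal sketch of evaluating (3.1) for a given boundary set, assuming a BIC-like penalty P_n(d) = σ²d log n (the helper name is ours):

```python
import numpy as np
from scipy.linalg import null_space

def info_criterion(y, D, B, sigma2=1.0):
    """Evaluate IC(k) in (3.1) at boundary set B, with P_n(d) = sigma^2 * d * log(n)."""
    n = y.size
    notB = np.setdiff1d(np.arange(D.shape[0]), B)
    Z = null_space(D[notB, :])     # basis for null(D_{-B}); nullity = Z.shape[1]
    fit = Z @ (Z.T @ y)            # unshrunken estimate P_{null(D_{-B})} y
    return np.sum((y - fit) ** 2) + sigma2 * Z.shape[1] * np.log(n)
```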
Unfortunately, choosing to stop the path at the step that minimizes the IC defined in (3.1) does not define a polyhedron in y. Therefore, we use a modified IC-based rule. We first define the set of steps at which we see action (nonzero adjacent differences) in the IC,

I_IC(y) = {k : IC(k) ≠ IC(k − 1)}.

For any step k ∉ I_IC(y), we have null(D_{−B_{k−1}}) = null(D_{−B_k}), meaning that the structure of the primal solution is unchanged between steps k − 1 and k, and the IC is trivially constant as we move across these steps; we will hence restrict our attention to candidate steps in I_IC(y) in crafting our stopping rule. Denoting by k₁ < k₂ < k₃ < . . . the sorted elements of I_IC(y), we define, for each j = 1, 2, 3, . . .,

ŝ_j = sign(IC(k_{j+1}) − IC(k_j)),

the sign of the difference in IC values between steps k_j and k_{j+1} (two adjacent elements in I_IC(y) at which the IC values are known to change nontrivially).

We are now ready to define our stopping rule, which chooses to stop the path at the step

k̂ = k_ĵ, where ĵ = min{j : ŝ_j = ŝ_{j+1} = · · · = ŝ_{j+q−1} = 1},

i.e., it chooses the smallest step k such that the IC defined in (3.1) exhibits q successive rises in a row, among the elements of the candidate set I_IC(y). Here q ≥ 1 is a prespecified integer; in practice, we have found that q = 2 often works well. It helps to see a visual depiction of the rule; see Figure 2. We now show that the following set is a polyhedron in y:

H = {y : M_{1:(k_{j+q})}(y) = M_{1:(k_{j+q})}} ∩ {y : ŝ_ℓ(y) = ŝ_ℓ, ℓ = 1, . . . , j + q − 1} = H₁ ∩ H₂,

where a role in establishing this is played by the matrix A ∈ R^{(j+q−1)×n} whose ℓth row a_ℓ ∈ R^n spans the difference between null(D_{−B_{k_ℓ}}) and null(D_{−B_{k_{ℓ+1}}}), for ℓ = 1, . . . , j + q − 1. (Note that by specifying M_{1:(k_{j+q})}(y) = M_{1:(k_{j+q})}, we have also implicitly specified the first j + q elements of I_IC(y), and so we do not need to explicitly include a realization of the latter set in the definition of H.)
From the previous subsection, we already know that H₁ is polyhedral. Thus it suffices to study H₂, given M_{1:(k_{j+q})}; and as H₂ is defined by pairwise comparisons of IC values, it suffices to show that, for any ℓ = 1, . . . , j + q − 1, the inequality

IC(k_{ℓ+1}) ≤ IC(k_ℓ) (3.4)

is equivalent to a linear constraint on y. A symmetric argument shows that if we flip the inequality sign above, this will still be equivalent to a linear constraint on y, and collecting these constraints over steps ℓ = 1, . . . , j + q − 1 gives the polyhedral representation for H₂. Simply recalling the IC definition in (3.1), and rearranging, we find that (3.4) is equivalent to

‖P_{null(D_{−B_{k_ℓ}})} y‖²₂ − ‖P_{null(D_{−B_{k_{ℓ+1}}})} y‖²₂ ≤ P_n(nullity(D_{−B_{k_ℓ}})) − P_n(nullity(D_{−B_{k_{ℓ+1}}})). (3.5)

Note that, by construction, the sets B_{k_ℓ} and B_{k_{ℓ+1}} differ by at most one element. For concreteness, suppose that B_{k_ℓ} ⊂ B_{k_{ℓ+1}}; the other direction is similar. In this case null(D_{−B_{k_ℓ}}) ⊆ null(D_{−B_{k_{ℓ+1}}}), and the two subspaces differ in dimension by 1. Further, it is not hard to see that the difference in projection operators is itself the projection onto a subspace of dimension 1. Writing a_ℓ for the unit-norm basis vector for this subspace, and −b for the right-hand side in (3.5), we see that (3.5) becomes

(a_ℓᵀ y)² ≥ b.

Note that b ≥ 0 (owing to B_{k_ℓ} ⊂ B_{k_{ℓ+1}}, and the complexity penalty P_n being an increasing function), and assume without loss of generality that the orientation of a_ℓ is chosen so that a_ℓᵀ y ≥ 0. Then the above becomes the linear constraint a_ℓᵀ y ≥ √b on y, as desired. Altogether, with the final polyhedron H, we can use the TG pivot from Section 2.2 to perform valid inference on linear contrasts vᵀθ of the mean θ, conditional on having chosen step k with our IC-based stopping rule, and on having observed a given model sequence over the first k steps of the generalized lasso path.
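A minimal sketch of the q-rises rule from this subsection, taking as input the IC values at the candidate steps k₁ < k₂ < . . . (i.e., the elements of I_IC(y)); indexing is 0-based and the helper is illustrative:

```python
import numpy as np

def stop_by_q_rises(ic_values, q=2):
    """Return the 0-based candidate index j at which IC rises q times in a row
    immediately afterwards; falls back to the last candidate step otherwise."""
    s = np.sign(np.diff(ic_values))     # s[j] = sign(IC(k_{j+2}) - IC(k_{j+1})), 1-based k's
    for j in range(len(s) - q + 1):
        if np.all(s[j:j + q] == 1):     # q successive rises starting here
            return j                    # stop at candidate step k_{j+1}
    return len(ic_values) - 1
```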

What is the conditioning set?
For a fixed k, suppose that we have computed k steps of the generalized lasso path and observed a model sequence M_{1:k}(y) = M_{1:k}. From Section 3.1, we can form a matrix Γ = Γ(M_{1:k}) such that {y : M_{1:k}(y) = M_{1:k}} = {y : Γy ≥ 0}. From Section 2.2, for any vector v, we can invert the TG pivot as in (2.12) to compute a conditional confidence interval C_{1−α} satisfying

P(vᵀθ ∈ C_{1−α} | M_{1:k}(y) = M_{1:k}) = 1 − α. (3.6)

This holds for all possible realizations M_{1:k} of model sequences, and thus we can marginalize along any dimension to yield a valid conditional coverage statement. For example, by marginalizing over all possible realizations M_{1:(k−1)} of model sequences up to step k − 1, we obtain

P(vᵀθ ∈ C_{1−α} | B̂_k(y) = B_k, ŝ_{B_k}(y) = s_{B_k}, R̂^hit_k(y) = R^hit_k, Î^leave_k(y) = I^leave_k) = 1 − α. (3.7)

Above, B̂_k(y) is the boundary set at step k as a function of y, and likewise ŝ_{B_k}(y), R̂^hit_k(y), Î^leave_k(y) are the boundary signs, viable hitting signs, and viable leaving coordinates at step k, respectively, as functions of y. Since a data analyst typically never sees the viable hitting signs or viable leaving coordinates at a generalized lasso solution (i.e., these are "hidden" details of the path computation, at least compared to the boundary set and signs, which are reflected in the structure of the solution itself, recall (2.9) and (2.5)), the conditioning event in (3.6) may seem like it includes "unnecessary" details. Hence, we can again marginalize over all possible realizations R^hit_k, I^leave_k to yield

P(vᵀθ ∈ C_{1−α} | B̂_k(y) = B_k, ŝ_{B_k}(y) = s_{B_k}) = 1 − α. (3.8)

Among (3.6), (3.7), (3.8), the latter is the cleanest statement and offers the simplest interpretation. This is reiterated when we cover specific problem cases in Section 4. Similar statements hold when k is chosen by our IC-based rule, from Section 3.2. Applying the TG framework from Section 2.2 to the full conditioning set, in order to derive a confidence interval C_{1−α} for vᵀθ, and following a reduction analogous to (3.6), (3.7), (3.8), we arrive at the property

P(vᵀθ ∈ C_{1−α} | k̂(y) = k, B̂_k(y) = B_k, ŝ_{B_k}(y) = s_{B_k}) = 1 − α.

Again this is a clean conditional coverage statement and offers a simple interpretation, for k chosen in a data-driven manner.

Changepoint detection via the 1d fused lasso
Changepoint detection is an old topic with a vast literature. It has applications in many areas, e.g., bioinformatics, climate modeling, finance, and audio and video processing. Instead of attempting to thoroughly review the changepoint detection literature, we refer the reader to the comprehensive surveys and reviews in Brodsky and Darkhovski [4], Chen and Gupta [7], Eckley et al. [9]. Broadly speaking, a changepoint detection problem is one in which the distribution of observations along an ordered sequence potentially changes at some (unknown) locations. In a slight abuse of notation, we use the term changepoint detection to refer to the particular setting in which there are changepoints in the underlying mean. Our focus is on conducting valid inference related to the selected changepoints. The existing literature applicable to this goal is relatively small; it is reviewed in Section 1.2 and compared to our methods in Section 5.2.

Among various methods for changepoint detection, the 1d fused lasso [39], also known as 1d total variation denoising in signal processing [30], is of particular interest in the current paper because it is a special case of the generalized lasso. Let y = (y₁, . . . , y_n) denote values observed at positions 1, . . . , n. Then the 1d fused lasso estimator is defined as in (2.2), with the penalty matrix being the discrete first difference operator D = D^(1) ∈ R^{(n−1)×n}:

D^(1) =
[ −1  1  0  · · ·  0  0 ]
[  0 −1  1  · · ·  0  0 ]
[         · · ·         ]
[  0  0  0  · · · −1  1 ].   (4.1)

In the 1d fused lasso problem, the dual boundary set tracked by Algorithm 1 has a natural interpretation: it provides the locations of changepoints in the primal solution, which we can see more or less directly from (2.5) (see also Tibshirani and Taylor [37], Arnold and Tibshirani [1]). Therefore, we can rewrite (2.9) as

β̂(λ) = Σ_{j=1}^{k+1} b̂_j(λ) 1_{(I_{j−1}+1):I_j}. (4.2)

Here I₁ < . . . < I_k denote the sorted elements of the boundary set B_k, with I₀ = 0, I_{k+1} = n for convenience, 1_{p:q} denotes a vector with 1 in positions p, . . . , q and 0 elsewhere, and b̂₁(λ), . . . , b̂_{k+1}(λ) denote the levels estimated by the fused lasso with parameter λ. Note that in (4.2), we have implicitly used the fact that the boundary set after k steps of the path algorithm has exactly k elements; this is true since the path algorithm never deletes coordinates from the boundary set in 1d fused lasso problems (as mentioned following Algorithm 1). The dual boundary signs also have a natural meaning: writing the elements of s_{B_k} as s_{I₁}, . . . , s_{I_k}, these record the signs of the differences (or jumps) between adjacent levels,

sign(b̂_{j+1}(λ) − b̂_j(λ)) = s_{I_j}, for j = 1, . . . , k. (4.3)

Below, we describe several aspects of selective inference with 1d fused lasso estimates. Similar discussions could be given for the other special classes of generalized lasso problems, like trend filtering and the graph fused lasso, but for brevity we only go into such detail for the 1d fused lasso.
Contrasts for the fused lasso. The framework laid out in Section 3 allows us to perform post-selection TG tests for hypotheses about vᵀθ, for any contrast vector v. We introduce two specific forms of interesting contrasts, which we call the segment and spike contrasts. From the k-step fused lasso solution, as portrayed in (4.2), (4.3), there are two natural questions one could ask about the changepoint I_j, for some j ∈ {1, . . . , k}: first, whether there is a difference in the underlying mean exactly at I_j, i.e.,

H₀ : θ_{I_j} = θ_{I_j+1} versus H₁ : s_{I_j}(θ_{I_j+1} − θ_{I_j}) > 0, (4.4)

and second, whether there is a difference between the averages of the underlying mean over the segments to the left and right of I_j, i.e.,

H₀ : θ̄_{(I_{j−1}+1):I_j} = θ̄_{(I_j+1):I_{j+1}} versus H₁ : s_{I_j}(θ̄_{(I_j+1):I_{j+1}} − θ̄_{(I_{j−1}+1):I_j}) > 0, (4.5)

where θ̄_{p:q} denotes the average of θ over positions p, . . . , q. To test (4.4), we use the contrast vector

v_spike = s_{I_j}(1_{I_j+1} − 1_{I_j}). (4.6)

The resulting TG test, as in (2.11) with v = v_spike, is called the spike test, since it tests differences in the mean θ at exactly one location. To test (4.5), we use

v_seg = s_{I_j}( 1_{(I_j+1):I_{j+1}} / (I_{j+1} − I_j) − 1_{(I_{j−1}+1):I_j} / (I_j − I_{j−1}) ). (4.7)

The resulting TG test, as in (2.11) with v = v_seg, is called the segment test, because it tests average differences across segments of the mean θ.
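A minimal sketch of the segment contrast (4.7) in numpy (0-based indexing; the function name is ours):

```python
import numpy as np

def segment_contrast(n, changepoints, j, sign_j):
    """Segment contrast v_seg in (4.7) at the j-th changepoint (j = 1, ..., k),
    with changepoints = sorted boundary set I_1 < ... < I_k, I_0 = 0, I_{k+1} = n."""
    I = np.concatenate(([0], np.asarray(changepoints), [n]))
    v = np.zeros(n)
    v[I[j]:I[j + 1]] = 1.0 / (I[j + 1] - I[j])      # right segment (I_j + 1):I_{j+1}
    v[I[j - 1]:I[j]] = -1.0 / (I[j] - I[j - 1])     # left segment (I_{j-1} + 1):I_j
    return sign_j * v
```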
In practice, the segment test often has more power than the spike test to detect a change in the underlying mean, since it averages over entire segments. However, it is worth pointing out that the usefulness of the segment test at I_j also depends on the quality of the other detected changepoints in the 1d fused lasso model (unlike the spike test, which does not), because these determine the lengths of the segments drawn out on either side of I_j. And, to emphasize what has already been said: unlike the spike test, the segment test does not test the precise location of a changepoint, so a rejection of its null hypothesis must not be mistakenly interpreted as evidence for an exact changepoint at I_j (also, refer to the corresponding coverage statement in (4.8)).
Which test is appropriate ultimately depends on the goals of the data analyst. Figure 3 shows a simple example of the spike and segment tests. The behaviors of these two tests will be explored more thoroughly in Section 5.1.
Alternative motivation for the contrasts. It may be interesting to note that, for the segment contrast v_seg in (4.7), the statistic v_segᵀy = s_{I_j}(ȳ_{(I_j+1):I_{j+1}} − ȳ_{(I_{j−1}+1):I_j}) is, up to its sign, the likelihood ratio test statistic for testing the null hypothesis that the mean is constant over the merged segment (I_{j−1}+1):I_{j+1}, against the alternative that it takes two levels split at I_j, with the changepoints I₁, . . . , I_k treated as fixed. An equivalent way to write these hypotheses, which will be a helpful generalization going forward (as we consider other classes of generalized lasso problems), is

H₀ : θ ∈ null(D_{−(B_k ∖ {I_j})}) versus H₁ : θ ∈ null(D_{−B_k}).

In this notation, the segment contrast v_seg in (4.7) is the unique (up to a scaling factor) basis vector for the rank 1 subspace null(D_{−B_k}) ∩ null(D_{−(B_k ∖ {I_j})})^⊥, and v_segᵀy is the likelihood ratio test statistic for the above pair of null and alternative hypotheses. Lastly, both the segment and spike tests can be viewed from an equivalent regression perspective, after transforming the 1d fused lasso problem in (2.2), (4.1) into an equivalent lasso problem (recall Section 2.3). In this context, it can be shown that the segment test corresponds to a test of a partial regression coefficient in the active model, whereas the spike test corresponds to a test of a marginal regression coefficient.
Inference with an interpretable conditioning event. As explained in Section 3.3, there are different levels of conditioning that can be used to interpret the results of the TG tests for model selection events along the generalized lasso path. Here we demonstrate, for the segment test in (4.5), (4.7), what we see as the simplest interpretation of its conditional coverage property, with respect to its parameter θ̄_{(I_j+1):I_{j+1}} − θ̄_{(I_{j−1}+1):I_j}, for some j ∈ {1, . . . , k}. The TG interval (2.12), computed by inverting the TG pivot, has the exact finite sample property

P( θ̄_{(I_j+1):I_{j+1}} − θ̄_{(I_{j−1}+1):I_j} ∈ C_{1−α} | I₁, . . . , I_k, s_{I₁}, . . . , s_{I_k} ) = 1 − α, (4.8)

obtained by marginalizing over some dimensions of the conditioning set, as done in Section 3.3. In words, the coverage statement (4.8) says that, conditional on the estimated changepoints I₁, . . . , I_k and estimated jump signs s_{I₁}, . . . , s_{I_k} in the k-step 1d fused lasso solution, the interval C_{1−α} traps the difference in segment averages θ̄_{(I_j+1):I_{j+1}} − θ̄_{(I_{j−1}+1):I_j} with probability 1 − α. This all assumes that the choice of step k is fixed; for k chosen by an IC-based rule as described in Section 3.2, the interpretation is very similar and we only need to add k to the right-hand side of the conditioning bar in (4.8). A similar interpretation is also available for the spike test, which we omit for brevity.
One-sided or two-sided inference? We note that both setups in (4.4) and (4.5) use a one-sided alternative hypothesis, and the contrast vectors in (4.6) and (4.7) are defined accordingly. To put it in words, we are testing for a changepoint in the underlying mean θ (either exactly at one location, or in an average sense across local segments) and are looking to reject when a jump in θ occurs in the direction we have already observed in the fused lasso solution, as dictated by the sign s_{I_j}. On the other hand, for coverage statements as in (4.8), we are implicitly using a two-sided alternative, replacing the alternative in (4.5) by H₁ : θ̄_{(I_j+1):I_{j+1}} ≠ θ̄_{(I_{j−1}+1):I_j} (since the coverage interval is the result of inverting a two-sided pivotal statistic). Two-sided tests and one-sided intervals are also possible in our inference framework; however, we find them less natural, and our default is therefore to consider the aforementioned versions.

Knot detection via trend filtering
Trend filtering can be seen as an extension of the 1d fused lasso for fitting higher-order piecewise polynomials [32,21,35]. It can be defined for any desired polynomial order, written as r ≥ 0, with r = 0 giving piecewise constant segments and reducing to the 1d fused lasso of the last subsection. Here we focus on the case r = 1, where piecewise linear segments are fitted. The general case r ≥ 2 is possible by following the exact same logic, though for simplicity, we do not cover it.
As before, we assume the data y = (y₁, . . . , y_n) have been measured at ordered locations 1, . . . , n. The linear trend filtering estimate is defined as in (2.2) with D = D^(2) ∈ R^{(n−2)×n}, the discrete second difference operator:

D^(2) =
[ 1 −2  1  0  · · ·  0  0 ]
[ 0  1 −2  1  · · ·  0  0 ]
[          · · ·          ]
[ 0  0  0  · · ·  1 −2  1 ].   (4.9)

For the linear trend filtering problem, the elements of the boundary set are in one-to-one correspondence with knots, i.e., changes in slope, in the piecewise linear sequence β̂ = (β̂₁, . . . , β̂_n). This comes essentially from (2.5) (for more, see Tibshirani and Taylor [37], Arnold and Tibshirani [1]). Specifically, enumerating the elements of the boundary set B_k as I₁ < . . . < I_q (and using I₀ = 0 and I_{q+1} = n for convenience), each location I_j + 1, j = 1, . . . , q serves as a knot in the trend filtering solution, so that we may rewrite (2.9) as

β̂_i(λ) = b̂_j(λ) + m̂_j(λ) i, for i ∈ {I_{j−1}+1, . . . , I_j}, j = 1, . . . , q + 1. (4.10)

Above, q denotes the number of knots in the k-step linear trend filtering solution, which in general need not be equal to k, since (unlike the 1d fused lasso) the path algorithm for linear trend filtering can both add to and delete from the boundary set at each step. Also, for each j = 1, . . . , q + 1, the quantities b̂_j(λ) and m̂_j(λ) denote the "local" intercept and slope parameters, respectively, of the linear trend filtering solution, over the segment {I_{j−1}+1, . . . , I_j}. Denoting the dual boundary signs s_{B_k} by s_{I₁}, . . . , s_{I_q}, we have, from (4.10) and the fact that the linear pieces in the solution match at the knots, that

sign(m̂_{j+1}(λ) − m̂_j(λ)) = s_{I_j}, for j = 1, . . . , q, λ ∈ [λ_{k+1}, λ_k], (4.11)

i.e., the boundary signs give the signs of the changes in slope between adjacent trend filtering segments.
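For reference, a minimal sketch that builds the discrete difference operators recursively (D^(1) for the fused lasso, D^(2) in (4.9) for linear trend filtering, and higher orders for r ≥ 2):

```python
import numpy as np

def difference_matrix(n, k):
    """Discrete k-th difference operator D^(k), of shape (n - k, n)."""
    D = np.eye(n)
    for _ in range(k):
        m = D.shape[0]
        D1 = np.zeros((m - 1, m))
        idx = np.arange(m - 1)
        D1[idx, idx], D1[idx, idx + 1] = -1.0, 1.0   # first difference
        D = D1 @ D                                   # compose differences
    return D

# difference_matrix(n, 1) has rows (-1, 1, 0, ...); difference_matrix(n, 2)
# has rows (1, -2, 1, 0, ...), matching (4.1) and (4.9).
```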
Contrasts for linear trend filtering. We can construct both spike and segment tests for linear trend filtering using motivations similar to those in the 1d fused lasso problem. Given the trend filtering solution in (4.10), (4.11), we consider testing a particular knot location I_j + 1, for some j = 1, . . . , q. The spike contrast is defined by

v_spike = s_{I_j}(1_{I_j} − 2·1_{I_j+1} + 1_{I_j+2}), (4.12)

and the TG statistic in (2.11) with v = v_spike provides us with a test for

H₀ : θ_{I_j} − 2θ_{I_j+1} + θ_{I_j+2} = 0 versus H₁ : s_{I_j}(θ_{I_j} − 2θ_{I_j+1} + θ_{I_j+2}) > 0, (4.13)

i.e., a test for a change in the slope of the mean exactly at the knot. The segment contrast is harder to define explicitly from first principles, but can be defined following one of the alternative motivations for the segment contrast in the 1d fused lasso problem: consider the rank 1 subspace null(D_{−B_k}) ∩ null(D_{−(B_k ∖ {I_j})})^⊥, and define w to be a basis vector for this subspace (unique up to scaling). The segment contrast is then

v_seg = sign(w_{I_j} − 2w_{I_j+1} + w_{I_j+2}) s_{I_j} w, (4.14)

i.e., we align w so that its second difference around location I_j + 1 matches that in the trend filtering solution. To test v_segᵀθ = 0, we can use the TG statistic in (2.11) with v = v_seg; however, as w is not easy to express in closed form, this null hypothesis is also not easy to express in closed form. Still, we can rewrite it in a slightly more explicit manner: decomposing θ ∈ null(D_{−B_k}) as θ = θ_proj + c·h, the null hypothesis becomes H₀ : c = 0, versus the appropriate one-sided alternative hypothesis. In words, θ_proj is the projection of θ onto the space of piecewise linear vectors with knots at locations I_ℓ + 1, ℓ ≠ j, and h is a single piecewise linear activation vector that rises from zero at location I_j + 1.
The same high-level points comparing the spike and segment tests for the fused lasso also carry over to the linear trend filtering problem: the segment test can often deliver more power, but at a given location I_j + 1, the power of the segment test will depend on the other knot locations in the estimated model, whereas the spike test at location I_j + 1 does not depend on any other knot points in the trend filtering solution. Furthermore, the segment null does not specify a precise knot location, and one must be careful in interpreting a rejection here. Figure 4 gives examples of the segment test for linear trend filtering. More examples are investigated in Section 5.3.

Cluster detection via the graph fused lasso
The graph fused lasso is another generalization of the 1d fused lasso, in which we depart from the 1-dimensional ordering of the components of y = (y₁, . . . , y_n). Now we think of these components as being observed over the nodes V = {1, . . . , n} of a given (undirected) graph, with edges E = {e₁, . . . , e_m}, where each e_ℓ = (i_ℓ, j_ℓ) joins some nodes i_ℓ and j_ℓ, for ℓ = 1, . . . , m. Note that the 1d fused lasso corresponds to the special case in which E = {(i, i + 1) : i = 1, . . . , n − 1}, called the chain graph. For a general graph G = (V, E), we define its edge incidence matrix D_G ∈ R^{m×n} by having rows of the form

D_ℓ = (0, . . . , −1, . . . , 1, . . . , 0),

with a −1 in position i_ℓ and a 1 in position j_ℓ, when the ℓth edge is e_ℓ = (i_ℓ, j_ℓ), with i_ℓ < j_ℓ, for ℓ = 1, . . . , m. The graph fused lasso problem, also called graph total variation denoising, is given by (2.2) with D = D_G. This has been studied by many authors, particularly in the case when G is a 2-dimensional grid, where the resulting program, called the 2d fused lasso, is useful for image denoising (see, e.g., Friedman et al. [13], Chambolle and Darbon [5], Hoefling [18], Tibshirani and Taylor [37], Sharpnack et al. [31], Arnold and Tibshirani [1]). Trend filtering can also be extended to graphs [41]; in principle our inferential treatment here extends to this problem as well, though we do not discuss it.

The boundary set constructed by the dual path algorithm, Algorithm 1, has the following interpretation for the graph fused lasso problem [37,1]. Denoting by C₁, . . . , C_p the connected components of the subgraph formed by removing the edges in B_k from G, the k-step solution is constant over each of the components C₁, . . . , C_p, as in (4.17), and the boundary signs record the signs of the differences in levels across the boundary edges joining these components, as in (4.18).

Contrasts for the graph fused lasso. For the graph fused lasso problem, it is more natural to consider segment (rather than spike) type contrasts, conforming with the notation and concepts introduced for the 1d fused lasso problem. Even restricting our attention to segment tests, many possibilities are available to us, given the graph fused lasso solution as in (4.17), (4.18). For example, we may choose any two "neighboring" connected components C_a and C_b, for some a, b = 1, . . . , p, meaning that there exists at least one edge (in the original graph) between C_a and C_b, and test

H₀ : θ̄_{C_a} = θ̄_{C_b} versus H₁ : s_{ab}(θ̄_{C_b} − θ̄_{C_a}) > 0, (4.19)

where s_{ab} = s_I, for some element I ∈ B_k such that e_I = (i, j), with i < j, and i ∈ C_a, j ∈ C_b. Above, we use the notation θ̄_S = Σ_{i∈S} θ_i / |S| for a subset S. The hypothesis in (4.19) tests whether the averages of θ over components C_a and C_b are equal, versus the alternative that they differ and their difference matches the sign witnessed in the graph fused lasso solution. To test (4.19), we can use the TG statistic in (2.11) with

v_seg = s_{ab}( 1_{C_b} / |C_b| − 1_{C_a} / |C_a| ), (4.20)

where 1_S denotes the indicator vector of a subset S of nodes. As in the 1d fused lasso problem, the above contrast can also be motivated by the fact that v_segᵀy is the likelihood ratio test statistic for an appropriate pair of null and alternative hypotheses. More advanced segment tests are also possible, say, by testing whether the averages of θ are equal over two subsets, each given by a union of connected components among C₁, . . . , C_p. Figure 5 shows a simple example of a segment test of the form (4.19), (4.20) for the graph fused lasso. Section 5.4 gives another example.
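A minimal sketch that builds the edge incidence matrix D_G from an edge list (0-based node indices; the helper name is ours):

```python
import numpy as np

def edge_incidence(n, edges):
    """Edge incidence matrix D_G: row l has -1 at i_l and +1 at j_l for e_l = (i_l, j_l)."""
    D = np.zeros((len(edges), n))
    for l, (i, j) in enumerate(edges):
        D[l, i], D[l, j] = -1.0, 1.0
    return D

# The chain graph recovers the 1d fused lasso penalty D^(1):
# D_chain = edge_incidence(n, [(i, i + 1) for i in range(n - 1)])
```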

Problems with additional sparsity
The generalized lasso signal approximator problem in (2.2) can be modified to impose pure sparsity regularization on β itself, as in

β̂ = argmin_{β ∈ R^n} (1/2)‖y − β‖²₂ + λ‖Dβ‖₁ + λα‖β‖₁, (4.21)

where α ≥ 0 is another tuning parameter. The above may be called the sparse generalized lasso signal approximator problem. In fused lasso settings, both 1d and graph-based, the estimate β̂ in (4.21) will now be piecewise constant across its components, with many attained levels being equal to zero exactly (for a large enough value of α > 0). In fact, the fused lasso as originally defined by Tibshirani et al. [39] was just as in (4.21), with both fusion and sparsity penalties. In trend filtering settings, the estimate β̂ in (4.21) will be similar, except that it will now have a piecewise polynomial structure wherever it is nonzero. There are many examples in which pure sparsity regularization is a useful addition; see Section 5.6, and also, e.g., Tibshirani et al. [39], Friedman et al. [13], Tibshirani and Wang [40], Tibshirani [35]. Of course, problem (4.21) is still a generalized lasso problem, since the two penalty terms in the criterion can be represented by λ‖D̃β‖₁, where D̃ ∈ R^{(m+n)×n} is given by row-binding D ∈ R^{m×n} and αI ∈ R^{n×n}. This means that all the tools presented so far in this paper are applicable, and post-selection inference can be performed for problems like the sparse fused lasso and sparse trend filtering.

Generalized lasso regression problems
Up until this point, our applications have focused on the signal approximation problem in (2.2), but all of our methodology carries over to the generalized lasso regression problem in (2.1). Allowing for a general regression matrix X ∈ R n×p greatly extends the scope of applications; see Section 5.5, and the discussions and examples in, e.g., Tibshirani et al. [39], Friedman et al. [13], Tibshirani and Taylor [37], Arnold and Tibshirani [1].
To tackle the regression problem in (2.1) with our framework, we must assume that rank(X) = p (which requires n ≥ p). We follow the transformation suggested by Tibshirani and Taylor [37], in which we set ỹ = XX⁺y and D̃ = DX⁺, solve the signal approximator problem (2.2) with response ỹ and penalty matrix D̃, and relate the solutions by θ̂ = Xβ̂. In other words, a generic generalized lasso regression problem can be transformed into a generalized lasso signal approximation problem (just with a modified response vector ỹ and penalty matrix D̃), and so all of our tools can be applied to this transformed signal approximation problem in order to perform inference. When rank(X) < p (which always happens in the high-dimensional case n < p), we can simply add a small ridge penalty, which brings us back to the case in which the effective regression matrix has full column rank (see Tibshirani and Taylor [37]). Then the above transformation can be applied.
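A minimal sketch of this reduction (assuming rank(X) = p; the helper name is ours):

```python
import numpy as np

def to_signal_form(y, X, D):
    """Map the regression problem (2.1) to a signal approximator problem (2.2),
    via y_tilde = X X^+ y and D_tilde = D X^+, as in Tibshirani and Taylor [37]."""
    Xpinv = np.linalg.pinv(X)
    return X @ (Xpinv @ y), D @ Xpinv   # (y_tilde, D_tilde)
```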

Post-processing and visual aids
We briefly discuss two extensions for the post-selection inference workflow.
Post-processing. The choices of contrasts outlined in Sections 4.1-4.5 are defined automatically from the generalized lasso selected model. Given such a selected model, before we test a hypothesis or build a confidence interval, we can optionally choose to ignore or change some of the components of the selected model, in defining a contrast of interest. We refer to this as "post-processing"; to be clear, it only affects the contrast vector being used, and does not affect the conditioning set in any way. It helps to give specific examples. In the 1d fused lasso problem, empirical examples reveal that the estimator sometimes places several small jumps close to one larger jump. The practitioner could choose to merge nearby jumps before forming the segment contrast of Section 4.1; we can see from (4.7) that this would correspond to extending the segment lengths on either side of the breakpoint in question, which could result in greater power to detect a change in the underlying mean. See the left panel of Figure 6 for an example. In trend filtering, a practitioner could likewise choose to merge nearby knots before forming the segment contrast in (4.14), from Section 4.2. See the right panel of Figure 6 for an example. Similar post-processing ideas could be carried out for the graph fused lasso and generalized lasso regression problems. It can also be demonstrated that decluttering at one location helps the power for testing at another location that is farther away, though this phenomenon is absent in the fused lasso case (due to the finite support of the segment test contrasts).
Visual aids. In designing contrasts, the data analyst may also benefit from a visualization of the generalized lasso selected model. Such a "visual aid" has a similar goal to that of post-processing, namely, to improve the quality of the question asked, i.e., the hypothesis tested, following a generalized lasso selection event. For the eventual inferences to be valid, the visual aid must not reveal information about the data y that is not contained in the selection event M_{1:k}(y) = M_{1:k}, defined in Section 3.1 (assuming a fixed step number k, for simplicity). Again, it helps to consider the fused lasso as a specific use case. See Figure 7 for an example. We cannot, e.g., reveal the 4-step fused lasso solution to the analyst, ask him/her to hand-craft a contrast to be tested, and then expect type I error control after applying our post-selection inference tools. This is because the solution itself contains information about the data not contained in the selection event (the magnitudes of the fitted jumps), and the decision of which contrast to test could well be affected by this information. This makes the conditioning set incomplete (said differently, it means that the contrast vector is no longer measurable with respect to the conditioning event), and we should not expect our previously established inference guarantees to apply, as a result. We can, however, reveal a caricature of the solution, as long as this caricature is based entirely on the selection event. For the 1d fused lasso, this means that the caricature must be defined in terms of the changepoint locations and signs of the fitted jumps, and we refer to it as a "step-sign plot". Examination of the jump locations and signs can aid the analyst in designing interesting contrasts to test.

1d fused lasso examples
One-jump signal. First, we examine a problem setup with n = 60, and where θ ∈ R^60 has one changepoint at location 30, of height δ. Data y ∈ R^60 were generated by adding i.i.d. N(0, 1) noise to θ. We considered three settings for the signal strength: δ = 0 (no signal), δ = 1 (moderate signal), and δ = 2 (strong signal). See the top left panel of Figure 8 for an example. Over 10,000 repetitions of the data generation process, we fit the 1-step fused lasso estimate, and computed both the spike and segment tests at the detected changepoint location. Their p-values are displayed via QQ plots, in the top middle and top right panels of Figure 8, restricted to repetitions for which the detected location was 30. (This corresponded to roughly 2.2%, 30%, and 65% of the 10,000 total trials when δ = 0, 1, and 2, respectively.)

[Figure 8 caption: QQ plots of spike and segment test p-values in the one-jump (top row) and two-jump (bottom row) settings; the segment test has uniformly higher power. The tables below the plots report empirical coverages of one-sided 95% confidence intervals of the form [η_0.05, ∞), along with the median lower bounds η_0.05; the lower bounds from the segment test are greater than the corresponding lower bounds from the spike test. In the two-jump setting, the power of the segment test improves after 2 steps, as it incorporates the correct second jump into the contrast, while the power of the spike test degrades due to the increased conditioning with the same contrast.]

When δ = 0, we see that both the spike and segment tests deliver uniform p-values, as they should. When δ = 1 and 2, we see that the segment test provides much better power than the spike test, and has essentially full power at the strong signal level δ = 2.
When the fused lasso detects a changepoint at location 29 or 31, i.e., a location that is off by one from the true changepoint at location 30, the spike and segment tests again perform very differently. The spike test yields uniform p-values, as it should, while the segment test offers nontrivial power. See Appendix A for QQ plots of these results.

Two-jump signal. Next, we examine a problem with n = 60 and where θ ∈ R^60 has two changepoints, at locations 20 and 40, each of height δ = 2. Data y ∈ R^60 were again generated around θ by adding i.i.d. N(0, 1) noise. See the bottom left panel of Figure 8 for an example. Over 10,000 repetitions, we fit 2 steps of the fused lasso and recorded spike and segment p-values, at each step, for testing the significance of location 20. The bottom middle and bottom right panels of Figure 8 display QQ plots, restricted at step 1 to simulations in which location 20 was detected (corresponding to about 32% of the total number of simulations), and restricted at step 2 to simulations in which locations 20 and 40 were detected (in either order, corresponding to again about 32% of the total simulations). We see that the spike test has better power at step 1 versus step 2; however, for the segment test, the story is reversed. The spike test contrast for testing at location 20 does not change between steps 1 and 2; the extra conditioning incurred at step 2 only hurts its power. On the other hand, the segment test uses a different contrast at steps 1 and 2, and the contrast at step 2 provides better power, because it leads to an average over a shorter segment (to the right of location 20) over which the mean is truly constant.

Confidence intervals.
As explained in Section 2.2, post-selection confidence intervals are given by inverting the TG pivot. In the 1d changepoint detection setting, the quantity v T θ being covered corresponds to a measure of jump size in the signal: for the spike test contrast (4.6), it is exactly the jump size, and for the segment test contrast (4.7), it is the difference in means between the segments adjacent to the jump. Figure 8 reports coverages and median lower bounds from confidence intervals computed in a variety of settings.
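The inversion of the TG pivot can be sketched as follows, assuming the truncation interval [a, b] for v T y implied by the polyhedral selection event has already been computed, and writing τ = σ‖v‖ 2 for the standard deviation of v T y. This is an illustrative sketch, not the paper's implementation; in particular, the bisection bracket and iteration count are arbitrary, and a careful implementation would evaluate the pivot on the log scale to avoid underflow in the tails.

    from scipy.stats import norm

    def tg_pivot(x, mu, tau, a, b):
        """CDF of a N(mu, tau^2) variable truncated to [a, b], evaluated at
        x = v^T y; this is the TG pivot. Numerically delicate in the tails."""
        num = norm.cdf((x - mu) / tau) - norm.cdf((a - mu) / tau)
        den = norm.cdf((b - mu) / tau) - norm.cdf((a - mu) / tau)
        return num / den

    def tg_lower_bound(x_obs, tau, a, b, alpha=0.05):
        """One-sided lower bound eta_alpha: the mu solving
        tg_pivot(x_obs; mu) = 1 - alpha, found by bisection (the pivot is
        monotone decreasing in mu). Then [eta_alpha, infty) covers v^T theta
        with probability 1 - alpha, conditional on the selection event."""
        lo, hi = x_obs - 20 * tau, x_obs + 20 * tau   # crude bracket
        for _ in range(200):
            mid = 0.5 * (lo + hi)
            if tg_pivot(x_obs, mid, tau, a, b) > 1 - alpha:
                lo = mid   # pivot too large: mu is still too small
            else:
                hi = mid
        return 0.5 * (lo + hi)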

IC-based stopping rules. The left panel of Figure 9 shows the segment test applied to a one-jump signal of length n = 20, with a jump at location 10 of height δ, but this time incorporating the IC-based stopping rules (to determine where along the fused lasso solution path to perform the test). This is a more practical performance gauge because it requires minimal user input on model selection. Shown are power curves (fraction of rejections, at the 0.05 level of type I error control) as functions of δ, computed over p-values from simulations in which location 10 was detected in the final model selected by the AIC- or BIC-type rule described in Section 3.2, using q = 2 (i.e., stopping after 2 rises in the criterion). Note that the p-values here were all adjusted by the number of changepoints in the final AIC- or BIC-selected model, using a Bonferroni correction (so that the familywise type I error is under control). The BIC-type rule has better power than the AIC-type rule, as the latter leads to larger models (AIC stops at 2.5 steps on average, versus 1.7 for BIC), resulting in further conditioning and also in misleading additional detected locations, both of which hurt its performance.
The results are compared to those from the segment test carried out at the 1-step fused lasso solution, over p-values from simulations in which location 10 was detected. With less conditioning (and no need for multiplicity correction), this method dominates the IC-based rules in terms of power. The results are also compared to an oracle rule that knows the correct segments and carries out a test for equality of means (with no conditioning); this serves as an upper bound for what we can expect from our methods. The right panel of Figure 9 shows BIC power curves as the sample size n increases from 20 to 80, in increments of 20. We see a uniform improvement in power across all signal strengths δ as n increases. However, at n = 80, the BIC-based test still delivers power that is noticeably worse than that of the oracle rule at n = 20.
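For concreteness, here is one plausible reading of the q-rise rule from Section 3.2 as a small sketch; whether the q rises must be consecutive, and which step is then selected, are our assumptions here.

    def q_rise_stop(crit, q=2):
        """Select a path step by the q-rise rule: march along the criterion
        values crit[0], crit[1], ... and stop the first time q consecutive
        rises are observed, returning the step just before the run of rises
        began. If no such run occurs, return the last step."""
        rises = 0
        for k in range(1, len(crit)):
            if crit[k] > crit[k - 1]:
                rises += 1
                if rises == q:
                    return k - q   # step before the first of the q rises
            else:
                rises = 0
        return len(crit) - 1

    # Example (hypothetical BIC values along 8 path steps):
    bic = [105.0, 98.2, 94.7, 95.1, 96.4, 93.0, 95.5, 97.0]
    print(q_rise_stop(bic, q=2))   # -> 2: stops after rises at steps 3, 4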

Comparison to SMUCE-based inference
Here we compare our post-selection confidence intervals for the 1d fused lasso to those based on the Simultaneous Multiscale Changepoint Estimator (SMUCE) of Frick et al. [12]. The SMUCE approach provides a simultaneous confidence band for the components of the mean vector θ, from which confidence intervals for any linear contrast of the mean, and in particular valid intervals for post-selection targets, can be obtained. Admittedly, a simultaneous band is a much broader goal, and SMUCE was not designed for post-selection confidence intervals, so we should expect such intervals to be wider than those from our method. It is worthwhile to make empirical comparisons nonetheless.
Data were generated as in the top left panel of Figure 8, with the signal strength parameter δ varying between 0 and 4. We computed the 1d fused lasso path, and stopped using the 2-rise BIC rule. Over simulations in which location 30 appeared in the eventual model selected by this rule, we computed the segment test contrast v seg around location 30, and used the SMUCE band with a nominal confidence level of 0.95 to compute a post-selection interval for v T seg θ. A power curve was then computed, as a function of δ, by keeping track of the fraction of times this interval did not contain 0. Again over simulations in which location 30 appeared in the model chosen by the BIC rule, we used the TG test to compute p-values for the null hypothesis v T seg θ = 0. These p-values were Bonferroni-corrected to account for the multiplicity of changepoints in the model selected by the BIC rule, and a power curve was computed, as a function of δ, by recording the fraction of p-values below 0.05. In the middle panel of Figure 10, we can see that the TG test provides better power until δ is about 2.5, after which both methods provide strong power. The right panel investigates the empirical type I error of each method, as n varies. The SMUCE bands are asymptotically valid, whereas, recall, the TG p-values and intervals are exact in finite samples (assuming Gaussian errors). We can see that SMUCE begins anti-conservative, before the asymptotics have "kicked in", and then, as n grows, becomes overly conservative as a means of testing post-selection targets, because these tests are derived from a much more stringent simultaneous coverage property.

Figure 10 (caption): In each simulation, the 1d fused lasso path was stopped using the 2-rise BIC rule, and segment test contrasts were formed around the detected changepoints. The middle panel shows power curves, computed over simulations in which location 30 appeared in the model selected by BIC, from either the SMUCE band with nominal confidence level 0.95, or the TG test with type I error control at 0.05. The latter method has better power for smaller δ, and both perform well for larger δ, with the SMUCE-based method providing slightly more power. The right panel displays the empirical type I error of the two testing methods, which emphasizes that the SMUCE guarantees are only asymptotic, and that this method can become quite conservative for large n, because simultaneous coverage is a more ambitious goal than post-selection coverage.
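The conversion from a simultaneous band to a post-selection interval for a linear contrast, as used above for SMUCE, is simple interval arithmetic over the box defined by the band; a minimal sketch (the function name is ours):

    import numpy as np

    def contrast_interval_from_band(v, lower, upper):
        """Given a simultaneous band lower[i] <= theta[i] <= upper[i] (all i
        jointly, with probability >= 1 - alpha), the implied interval for
        v^T theta: each coordinate contributes its worst-case endpoint."""
        lo = np.sum(np.where(v > 0, v * lower, v * upper))
        hi = np.sum(np.where(v > 0, v * upper, v * lower))
        return lo, hi

Since each coordinate contributes its worst-case endpoint, the resulting width scales with the ℓ 1 norm of v times the band width, with no benefit from the averaging built into a segment contrast; this is one source of the conservativeness observed above.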

Trend filtering example
We examine a problem with n = 40, and where θ ∈ R 40 has its first 20 components equal to zero, and its next 20 components exhibiting a linear trend of slope δ/20. Data y ∈ R 40 were generated by adding i.i.d. N (0, 1) noise to θ. We considered four settings: δ = 0 (no signal), δ = 1 (weak signal), δ = 2 (moderate signal), and δ = 5 (strong signal). See the left panel of Figure 11 for an example. We computed the trend filtering path, stopped using the 2-rise BIC rule, and considered the segment test at location 20. The right panel of Figure 11 shows the resulting p-values, restricted to repetitions in which location 20 appeared in the eventual model. We can see that when δ = 0, the p-values are uniformly distributed, as we should expect them to be. As δ increases, we can also see the increase in power, with the jump from δ = 2 to δ = 5 providing the segment test with nearly full power.
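Recall that linear trend filtering penalizes discrete second differences, so Dθ is sparse exactly when θ is piecewise linear. The following sketch (with our own helper name) verifies that the simulated mean above activates the penalty only near the knot at location 20.

    import numpy as np

    def second_difference_matrix(n):
        """Linear trend filtering penalty matrix D of dimension (n-2) x n:
        (D theta)_i = theta_i - 2 * theta_{i+1} + theta_{i+2}."""
        D = np.zeros((n - 2, n))
        for i in range(n - 2):
            D[i, i:i + 3] = [1.0, -2.0, 1.0]
        return D

    # The simulated mean: flat over the first 20 coordinates, then linear
    # with slope delta / 20 over the last 20 (delta = 2, for illustration).
    n, delta = 40, 2.0
    theta = np.concatenate([np.zeros(20), (delta / 20) * np.arange(1, 21)])
    D = second_difference_matrix(n)
    print(np.nonzero(np.abs(D @ theta) > 1e-10)[0])   # -> [18], at the knot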

2d fused lasso example
We examine a problem setup where the mean θ is defined over a 2d grid of dimension 10 × 10 (so that n = 100), having all components set to zero, except for a 5 × 5 patch in the lower left corner where all components are equal to δ. Data y ∈ R 100 were generated by adding i.i.d. N (0, 1) noise to θ. We considered the following settings: δ = 0 (no signal), δ = 3 (medium signal), and δ = 5 (strong signal). See the left panel of Figure 12 for a visualization of the mean θ, and the middle panel for example data y, both when δ = 3.
Over many draws of data from the described simulation setup, we computed the 2d fused lasso solution path, and used the 1-rise BIC stopping rule. For δ = 3, 5, we retained only the repetitions in which the BIC-chosen 2d fused lasso estimate had exactly two separate fused components (the bottom left 5 × 5 patch, and its complement), and computed segment test p-values with respect to these two components. For δ = 0, we collected the segment test p-values over all repetitions; these were computed with respect to two arbitrary components appearing in the BIC-chosen estimate, in each data instance. The right panel of Figure 12 shows QQ plots for each value of δ in consideration. When δ = 0, we see uniform p-values, as expected; when δ = 3, 5, we see clear power.
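For reference, the penalty matrix here is the edge incidence matrix of the 10 × 10 grid graph, so that ‖Dθ‖ 1 sums the absolute differences of θ across all grid edges. A dense sketch of its construction follows (our own; in practice one would use a sparse matrix).

    import numpy as np

    def grid_incidence_matrix(nrow, ncol):
        """Edge incidence matrix D of the nrow x ncol grid graph: one row
        per edge, with +1/-1 on the edge's two endpoints, so ||D theta||_1
        is the 2d fused lasso (total variation) penalty over the grid."""
        def idx(r, c):           # flatten (row, col) to a single coordinate
            return r * ncol + c
        rows = []
        for r in range(nrow):
            for c in range(ncol):
                if c + 1 < ncol:                      # horizontal edge
                    e = np.zeros(nrow * ncol)
                    e[idx(r, c)], e[idx(r, c + 1)] = 1.0, -1.0
                    rows.append(e)
                if r + 1 < nrow:                      # vertical edge
                    e = np.zeros(nrow * ncol)
                    e[idx(r, c)], e[idx(r + 1, c)] = 1.0, -1.0
                    rows.append(e)
        return np.array(rows)

    D = grid_incidence_matrix(10, 10)
    print(D.shape)   # -> (180, 100): 180 edges in a 10 x 10 grid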

Regression example
We consider a semi-synthetic stock example, with n = 251 timepoints, and data y ∈ R 251 simulated from a linear model of log daily returns of 3 real Dow Jones Industrial Average (DJIA) stocks, from the year 2015. Data were obtained from the quantmod R package. See the left panel of Figure 13 for a visualization of these stocks (note that what is displayed is not the log daily returns of the stocks, but the raw stock prices themselves).
Denoting the log daily returns as X j ∈ R 251 , j = 1, 2, 3, our model for the response was the varying-coefficient model

y i = Σ 3 j=1 X j,i β * j,i + ε i , i = 1, . . . , 251, with ε i i.i.d. N (0, σ 2 ). (5.1)

The coefficient vectors β * j ∈ R 251 , j = 1, 2, 3 were taken to be piecewise constant; the first coefficient vector β * 1 had two changepoints at locations 83 and 166, and had constant levels -1, 1, -1 from left to right; the second coefficient vector β * 2 had one changepoint at location 125, switching from levels -1 to 1; the third coefficient vector β * 3 had no changepoints, and was set to have a constant level of 1. We generated data once from the model in (5.1), with σ = 0.002 (this is a reasonable noise level, as the log daily returns are on a comparable scale). We then computed the fused lasso regression path, where 1d fused lasso penalties were placed on the coefficient vectors for each of the 3 stocks, in order to enforce piecewise constant behavior in the estimates β̂ j ∈ R 251 , j = 1, 2, 3. The path was terminated using the 2-rise BIC stopping rule, which gave a final model with 9 changepoints among the coefficient estimates. After post-processing ("decluttering") changepoints that occurred within 10 locations of each other, we retained 5 changepoints: 2 in the first estimated coefficient vector, 2 in the second, and 1 in the third. Segment test p-values were computed at each of the decluttered changepoints, and the 3 changepoints that approximately coincided with true changepoints were found to be significant, while the other 2 were found insignificant. See the right panel of Figure 13. For more details on the fused lasso optimization problem, and the contrasts used to define the segment tests, see Appendix B.

Application to CGH data
We examine the use of our fused lasso selective inference tools on a data set of array comparative genomic hybridization (CGH) measurements from two glioblastoma multiforme (GBM) tumors, from the cghFLasso R package. CGH is a molecular cytogenetic method for determining DNA copy numbers of selected genes in a genome, and array CGH is an improved method that provides higher resolution measurements. Each CGH measurement is a log ratio of the number of DNA copies of a gene compared to a reference measurement; aberrations give nonzero log ratios. Tibshirani and Wang [40] considered the sparse 1d fused lasso as a method for identifying regions of DNA copy number aberrations from CGH data, and analyzed the GBM tumor data set as a specific example, with n = 990 data points.
Using the same GBM tumor data set, we examine the significance of changepoints that appear in the 10th step of the 1d fused lasso path, and separately, changepoints that appear in the 28th step of the sparse 1d fused lasso path (in general, unlike the 1d fused lasso, the sparse 1d fused lasso can add and delete changepoints at each step of the path; the estimate at the 28th step here had only 7 changepoints). These steps were chosen by the 2-rise and 1-rise BIC rules, respectively. The resulting estimates are plotted along with the GBM tumor data in Figure 14. Displayed below this is a step-sign plot of the sparse 1d fused lasso estimate, serving as an example of what might be shown to the scientist to allow him/her to hand-design interesting contrasts to be tested.
Below the plot is a table containing the p-values from segment tests of the changepoints in the two models, i.e., from the 1d fused lasso and sparse 1d fused lasso. The segment test contrasts were post-processed (i.e., "decluttered") so as to exclude changepoints that occurred within 2 locations of each other; this only affected the locations labeled E and F in the sparse 1d fused lasso model (and as a result, the significance of the changepoint at location F was not tested). Commonly detected changepoints occur at locations labeled A, D, E, and G; the segment tests from the 1d fused lasso model yield significant p-values at each of these locations, but those from the sparse 1d fused lasso model yield a significant p-value only at location E. This apparent loss of power with the sparse 1d fused lasso may be due to the larger amount of conditioning involved.

We also compare the above to results from simple changepoint tests carried out using sample splitting, as sketched below. This is possible in a structured problem like ours, where there is a sensible way to split the data (in a less structured setting, like a generic graph fused lasso problem, there would be no obvious splitting scheme). We divided the GBM data set into two halves, based on odd and even numbered locations. On the first half, the "estimation set", we fit the 1d fused lasso path and chose the stopping point using 5-fold cross-validation (CV), where the folds were defined to include every 5th data point in the estimation set. After determining the path step that minimized the CV error, we moved back towards the start of the path (back towards step 1) until a further move would yield a CV error greater than one standard error away from the minimum (this is often called the "one standard error rule"; see, e.g., Chapter 7.10 of Hastie et al. [16]). This gave a path step of 18, and hence 18 changepoints in the final 1d fused lasso model. Using the second half of the data set, the "testing set", we then ran simple Z-tests for the equality of means between every pair of adjacent segments partitioned by the 18 changepoints derived from the estimation set. For simplicity, in the table in Figure 14, we only show p-values at locations comparable to the common locations labeled A, D, E, G from the fused lasso estimation procedures run on the full data set. All are significant.
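A minimal sketch of the split-sample Z-test just described (the function and argument names are ours). Because the changepoints were selected on the independent estimation half, no selection adjustment is needed for these p-values.

    import numpy as np
    from scipy.stats import norm

    def segment_z_test(y_test, left_idx, right_idx, sigma):
        """Two-sample Z-test for equality of means of two adjacent segments,
        using only the held-out testing half; left_idx/right_idx index the
        testing half's points falling in the two segments (as determined on
        the estimation half). Assumes known noise level sigma."""
        nl, nr = len(left_idx), len(right_idx)
        z = (np.mean(y_test[right_idx]) - np.mean(y_test[left_idx])) / (
            sigma * np.sqrt(1.0 / nl + 1.0 / nr))
        return 2 * norm.sf(abs(z))   # two-sided p-value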
Lastly, we note that to apply all tests in this subsection, it was necessary to estimate the noise variance σ 2 . To do so, we ran 5-fold CV on the full data set, chose the stopping point using the one standard error rule, and estimated σ 2 based on the residuals. This gave σ̂ = 0.46.

Discussion
We have extended the post-selection inference framework of Lee et al. [22], Tibshirani et al. [38] to the model selection events along the generalized lasso path, as studied by Tibshirani and Taylor [37]. The generalized lasso framework covers a fairly wide range of problem settings, such as the 1d fused lasso, trend filtering, the graph fused lasso, and regression problems in which fused lasso or trend filtering penalties are applied to the coefficients. In this work, we developed a set of tools for conducting formal inferences on components of the adaptively fitted generalized lasso model: e.g., adaptively fitted changepoints in the 1d fused lasso, knots in trend filtering, and clusters in the graph fused lasso. Our methods allow for inferences to be conducted at any fixed step of the generalized lasso solution path, or alternatively, at a step chosen by a rule that tracks AIC or BIC until a given number of rises in the criterion is encountered.
It is worth noting the following important point. In the language of Fithian et al. [10, 11], the development of post-selection tests in this paper was done under a "saturated model" for the mean parameter θ: this treats θ as an arbitrary vector in R n , and the hypotheses being tested are all phrased in terms of certain linear contrasts of the mean parameter being zero, as in v T θ = 0. Fithian et al. [10, 11] show how to also conduct tests under the "selected model"; to use the 1d fused lasso as an example, this would model the mean as a vector that is piecewise constant with breaks at the selected changepoints. The techniques developed in Fithian et al. [11] allow us to perform sequential tests of the selected model: again using the 1d fused lasso as an example, this would allow us to test, at each step of the 1d fused lasso path, that the mean is piecewise constant in the changepoints detected over all previous steps, so that a failure to reject would mean that all relevant changepoints have already been found. The selected model tests of Fithian et al. [11] have the following desirable properties: (i) they do not require the marginal error variance σ 2 to be known; (ii) they often display better power (compared to the tests from this paper) when the selected model is false; (iii) they yield independent p-values across steps in the path for which the selected model is true. The latter property allows us to apply p-value aggregation rules, like the "ForwardStop" rule of Grazier G'Sell et al. [15], to choose a stopping point in the path with a guarantee on the FDR (a sketch of this rule is given below). This is an appealing alternative to the AIC- or BIC-based stopping rules described in Section 3.2. The downside of the selected model tests is that they are computationally expensive (compared to those described in this paper), and require sampling (rather than analytic computation, using a truncated Gaussian pivot) to compute p-values. Furthermore, once we use the selected model p-values to choose a stopping point in the path, it is not clear how to carry out valid post-selection tests in the resulting model (due to the corresponding conditioning region being very complicated). Investigation of selected model inference along the generalized lasso path will be the topic of future work.
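For reference, the ForwardStop rule admits a very short implementation; this sketch follows the rule as stated in Grazier G'Sell et al. [15], with our own function name.

    import numpy as np

    def forward_stop(pvals, alpha=0.05):
        """ForwardStop: given sequential p-values p_1, ..., p_m that are
        independent under their respective nulls, return the largest k with
        -(1/k) * sum_{i <= k} log(1 - p_i) <= alpha (or 0 if none exists);
        stopping at this k controls the FDR at level alpha."""
        stats = -np.cumsum(np.log(1.0 - np.asarray(pvals)))
        stats /= np.arange(1, len(pvals) + 1)
        ok = np.nonzero(stats <= alpha)[0]
        return (ok[-1] + 1) if len(ok) else 0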
There are several other possible follow-up ideas for future work. One that we are particularly keen on is the attachment of post-selection inference tools to existing, commonly-used methods for 1d changepoint detection. It is not hard to show that the selection events associated with many such methods (like binary segmentation, wild binary segmentation of Fryzlewicz [14], and all wavelet thresholding procedures, provided that soft- or hard-thresholding is used) can be characterized as polyhedral sets in the data y. The ideas in this paper can therefore be used to conduct significance tests for the detected changepoints after any number of steps of binary segmentation, wild binary segmentation, or wavelet thresholding, this number of steps being either fixed or chosen by an AIC- or BIC-type rule. Because other 1d changepoint detection methods can often outperform the 1d fused lasso in terms of their accuracy in selecting relevant changepoints (wild binary segmentation, specifically, has this property), pairing them with formal tools for inference could have important practical implications.

Appendix A: Off-by-one detections in the one-jump setting

Here we consider the instances in which the first step of the fused lasso path detects a changepoint at location 29 or 31, i.e., off by one from the true location 30. Figure 15 (right panel) shows QQ plots for the spike and segment tests, applied to test the significance of the detected changepoint, in these instances. We can see that the spike test p-values are uniformly distributed, which is appropriate, because when the detected changepoint is off by one, the spike test null hypothesis is true. The segment test, on the other hand, delivers very small p-values, giving power against its own null hypothesis, which is false in the case of a one-off detection.

Appendix B: Details for the regression example

Recall from Section 5 that, stacking the coefficient blocks, the mean is simply θ = Xβ * , and the data are generated according to the regression model y ∼ N (θ, σ 2 I).
Optimization problem. The fused lasso regression problem that we consider is

β̂ = argmin β∈R 753 (1/2) ‖y − Xβ‖ 2 2 + λ ‖Dβ‖ 1 + (ρ/2) ‖β‖ 2 2 , (B.1)

where X ∈ R 251×753 is as defined above, and, using the block decomposition β = (β 1 , β 2 , β 3 ) ∈ R 753 with each β j ∈ R 251 , the penalty matrix D ∈ R 750×753 is block diagonal, with three copies of the 1d fused lasso difference matrix D (1) ∈ R 250×251 along its diagonal, so that ‖Dβ‖ 1 = Σ 3 j=1 ‖D (1) β j ‖ 1 . Note that a small ridge penalty has been added to the criterion in (B.1) (i.e., ρ > 0 is taken to be a small fixed constant), making the problem strictly convex, thus ensuring it has a unique solution, and also ensuring that we can run the dual path algorithm of Tibshirani and Taylor [37]. Of course, the blocks of the solution β̂ = (β̂ 1 , β̂ 2 , β̂ 3 ) in (B.1) serve as estimates of the underlying coefficient vectors β * 1 , β * 2 , β * 3 .
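The design and penalty matrices just described can be assembled as follows; a sketch with random stand-ins for the log return series X j (not the actual DJIA data), and covering only the matrix construction, not the ridge-augmented path algorithm.

    import numpy as np

    def first_difference_matrix(n):
        """1d fused lasso penalty matrix D1 of dimension (n-1) x n:
        (D1 b)_i = b_{i+1} - b_i."""
        return np.eye(n - 1, n, k=1) - np.eye(n - 1, n)

    n, J = 251, 3
    Xj = [np.random.randn(n) for _ in range(J)]   # stand-ins for log returns
    # Varying-coefficient design: column block j is diag(X_j), so that
    # (X beta)_i = sum_j X_{j,i} * beta_{j,i}, matching (5.1).
    X = np.hstack([np.diag(x) for x in Xj])       # 251 x 753
    # Block-diagonal penalty: D beta stacks D1 beta_j over the three blocks.
    D1 = first_difference_matrix(n)               # 250 x 251
    D = np.kron(np.eye(J), D1)                    # 750 x 753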
For each detected changepoint I j ∈ B, we now define a segment test contrast.

But it is easy to see that X T M P ⊥ col(X m ) P ⊥ col(X m ) X M is proportional to ww T , because if a is any vector that has identical entries across coordinates r I j and r I j + 1, then P ⊥ col(X m ) X M a = P ⊥ col(X m ) X m ã = 0, where ã is simply a with its (r I j )th coordinate removed. This completes the proof.