Scalable methods for Bayesian selective inference

Modeled along the truncated approach in Panigrahi (2016), selection-adjusted inference in a Bayesian regime is based on a selective posterior. Such a posterior is determined together by a generative model imposed on data and the selection event that enforces a truncation on the assumed law. The effective difference between the selective posterior and the usual Bayesian framework is reflected in the use of a truncated likelihood. The normalizer of the truncated law in the adjusted framework is the probability of the selection event; this is typically intractable and it leads to the computational bottleneck in sampling from such a posterior. The current work lays out a primal-dual approach of solving an approximating optimization problem to provide valid post-selective Bayesian inference. The selection procedures are posed as data-queries that solve a randomized version of a convex learning program which have the advantage of preserving more left-over information for inference. We propose a randomization scheme under which the optimization has separable constraints that result in a partially separable objective in lower dimensions for many commonly used selective queries to approximate the otherwise intractable selective posterior. We show that the approximating optimization under a Gaussian randomization gives a valid exponential rate of decay for the selection probability on a large deviation scale. We offer a primal-dual method to solve the optimization problem leading to an approximate posterior; this allows us to exploit the usual merits of a Bayesian machinery in both low and high dimensional regimes where the underlying signal is effectively sparse. We show that the adjusted estimates empirically demonstrate better frequentist properties in comparison to the unadjusted estimates based on the usual posterior, when applied to a wide range of constrained, convex data queries.


Introduction
A line of works Lee et al. (2016); Fithian et al. (2014); Tibshirani et al. (2016); Loftus and Taylor (2014); Yang et al. (2016); Tian and Taylor (2015) has established methodology for exact and asymptotic selection-adjusted inference that provide frequentist coverage guarantees in the regression framework.The driving motivation to adjust for selection is that analysts commonly conduct queries on a database in order to select inferential questions of interest about the population parameters.Inference after such interactions with the data lacks frequentist properties like target coverage when the same data set is used later for answering these very same questions.A Bayesian perspective on modeling the post-selective problem as a truncation is advocated in Yekutieli (2012) and extensions of the former work to the more general set-up of linear models are proposed in Panigrahi et al. (2016).These works propose the use of a fixed parameter view where the truncation is applied to the data exclusively conditional on the parameter.This alters the posterior distribution after selection unlike the usual Bayesian variable selection framework in Mitchell and Beauchamp (1988); George and McCulloch (1997) where the posterior is known to display inadapativity to selection.
More precisely, the truncated view point on inference is based on a selective posterior, formed by a truncated likelihood in conjunction with a prior that allows an analyst to inject a priori information on parameters in a model after selection.Such an approach has the additional flexibility in allowing the analyst to fix a model based on a parametrization that can be guided by a selection procedure.Motivated by the conditional approach of modeling Bayesian inference, the current work focuses on developing concrete, scalable methods that will allow the analyst to exploit the full potent of a Bayesian machinery post a wide range of constrained-convex learning programs.The Bayesian problem is by no means a trivial extension of the existing frequentist methods as it requires a closed form expression for the normalizer of the truncated likelihood.We describe the computational difficulties in providing Bayesian inference in the truncated framework and the contributions of this work more formally after introducing the selective posterior.

Selective posterior
A selective posterior modeled along the conditional approach has two components -a truncated likelihood and a prior distribution on the parameters in the likelihood.The truncation is imposed by selection as the analyst is interested in providing inference for a target parameter only if he observed the associated selection event.A generative model that the analyst is willing to impose on data post selection, together with the truncation to all realized values that lead to an observed selection event determine the truncated likelihood.The prior allows him to inject information on the target from his existing knowledge.
Formally, variable selection is based on an observed data vector S and the selection event of observing an active set of variables Ê(S) = E can be described as {s : Ê(s) = E}, the set of realizations of S that lead to E. It is only after selection that a model is defined, in this case, a generative Bayesian model with a likelihood parametrized by β * , denoted as f (s|β * ) and a prior β * ∼ π.The goal is to provide inference for a target determined by selection event E; that is, we infer about the target only if we observe E. This truncates the generative law of the data conditional on the parameter, resulting in the selection adjusted likelihood f (s|β * )1 { Ê(s)=E} /P( Ê(S) = E|β * ).
As is evident from above, the normalizer of the truncated likelihood is the probability of the selection event, computed as a function of the parameters in the generative density.While sampling from the truncated likelihood in a frequentist regime does not require knowledge of the normalizer (treated as a constant), the normalizer that typically lacks a closed form expression does contribute to the selective posterior in a Bayesian paradigm.Implementing a sampling scheme then becomes impossible in the absence of an expression for the normalizer to the truncated law.Panigrahi et al. (2016) identifies this technical hurdle and proposes an approximation to general affine selection probabilities which gives rise to a pseudo selective posterior.Sampling from the selective posterior necessitates computing the approximation, cast as an optimization problem for each draw of the sampler.The efficiency of any standard sampler thereby, hinges on the computational cost of solving the optimization objective associated with this approximation.In most cases, this can be very expensive and hard to scale with larger sample size and regression dimensions.
We propose in this work a randomization scheme for commonly used selection queries and offer an approximating optimization under the same to facilitate sampling from an approximate selective posterior.The three major gains in the current work associated with the proposed optimization are • an objective with simpler constraints on the optimizing variables, as opposed to polyhedral constraints in Lee et al. (2016) • a partially separable objective function with separability in the selective constraints • a reduction in dimensions of the optimization objective (an objective with smaller number of optimizing variables).
Typically, for popular constrained queries like marginal screening, Lasso, forward stepwise etc., the optimization solves an objective in min(d + |E|, p) dimensions with d as the size of the observed data vector S, p as the dimension of regression and |E| as the size of active set.The key idea behind these reductions is an upper bound to the normalizer that capitalizes on the structure of an inversion map associated with the randomized selective query and a change of measure induced by the same, discussed in Section 2. The problem of analytically getting an approximation for the normalizer is similar to the goals of variational Bayesian approaches in Minka (2001); Hoffman et al. (2013) that use a known parametric distribution to obtain an approximation to an intractable posterior based on the KL-divergence between the two posteriors.We adopt a different approach here by approximating an intractable integral, the normalizer of the truncated law as a function of the parameters in the model; we show that this approximation gives an asymptotic large deviation behavior of the exact normalizer under Gaussian randomization schemes.
These contributions allow a wider scope of applications of the truncated Bayesian approach to different generative models, randomization schemes and constrained selective queries.Such reductions become very useful in not just higher dimensions, but also, in providing inference after multiple selective queries.
imsart-generic ver.2008/08/29 file: primal_dual_approach.tex date: September 12, 2017 Below, we describe methods that demonstrate scalability of truncated Bayesian inference in both high and low dimensional regimes of inference.The effectiveness of the proposed methods is corroborated through Bayesian effect estimates with superior frequentist properties for data-mined variants in a real data set that investigates gene associations with local variants.

A motivating example
Before introducing our methods, we give an example that motivates the readers towards the inferential gains of the selective posterior over the more common inadaptive Bayesian approach.Consider data Y ∈ R n and a fixed predictor matrix X ∈ R n×p with columns scaled by 1/ √ n such that the response is generated as Y = Xβ + , ∼ N (0, σ 2 I n ) given a β ∈ R p and σ 2 = 1.An analyst decides to run Lasso on data (Y, X) in order to choose E, a set of selected predictors.Not having access to the actual generative model, he assumes the screened model from Lasso as a plausible model on his data, that is Y ∼ N (X E β E , I n ) and a non-informative prior π on the parameters β E in the selected model to offer Bayesian inference on β E .Ignoring selection, he uses the unadjusted posterior on to report credible intervals and the posterior mean as inference for target β E .We compare the estimates from the above approach of the analyst to truncated inference post a randomized version of the Lasso query.We give inference on β E using the same selected model and non-informative prior as the analyst where E is the output from (3) Randomization enters the objective as ω T β, perturbing selection that is otherwise based only on y; the above randomized version of Lasso has been proposed in Tian et al. (2016).The objective has a small added ridge penalty = 1/ √ n for existence of a solution and tuning parameter is set as λ = E[ X T ψ ∞ ] as proposed in Negahban et al. (2009) where Ψ ∼ N (0, I).On a high level, our method of providing estimates in the truncated regime involves approximating the intractable posterior truncated to the realizations (y, ω) that lead to the same selection event.We finally use a Langevin walk-based sampler to provide adjusted Bayesian inference based on the approximate posterior.
To compare our methods against the traditional Bayesian inference, we conduct the below experiment with two different generative mechanisms, Model I is a frequentist model with no signal and Model II is a Bayesian model.Let X ∈ R n×p , n = 200, p = 1000 be a design matrix with independent Gaussian entries normalized to have column norm 1.
(1).In each trial, set target and model of inference based on observed E post Lasso query in (3) as described above; we compare estimates based on (2) in the untruncated regime against our method of inference that gives adjusted estimates.To conduct the randomized query (3), we draw ω as an instance of Ω ∼ N (0, τ 2 I p ) in every trial.The below table gives a comparison of coverage of the credible intervals and risk in the frequentist model and of Bayesian FCR and Bayes risk of the posterior mean in the Bayesian model after 50 trials.The target coverage for the intervals is set at 90%.Bayesian FCR in Yekutieli (2012) is defined as E β,Y (V / max(R, 1)), where V is the number of non-covering credible intervals and R = |E| is the number of intervals constructed after selection.Consistent with coverage, we report the proportion of |E| intervals covering the target in the Bayesian model in Table 2 and call it CR.Unlike the nonrandomized intervals in Lee et al. (2016) that are known to grow very wide, the power inherited from randomization is reflected in shorter lengths of the adjusted intervals.The results clearly highlight the superior frequentist properties of our methods, both in terms of coverage of credible intervals and risk of posterior mean.The rest of the paper is organized as follows.Section 2 outlines the truncated framework, giving the recipe for adjusted Bayesian inference using a selective posterior.Section 3 lays out the backbone of the paper, the approximating optimization problem that we solve to sample from a tractable version of the selective posterior and provides a sampler that targets the approximate posterior.Section 4 shows asymptotic validity of the finite sample bounds in Section 3 to the otherwise unavailable normalizer for non-local sequences of parameter.Section 5 lays out the optimization-based approach for popular selection queries.Section 6 includes simulations that demonstrate the inferential gains imsart-generic ver.2008/08/29 file: primal_dual_approach.tex date: September 12, 2017 associated with the truncated Bayesian methods in the current work over the unadjusted analog.Many of these examples bring to light the robustness of our methods to model mis-specifications.Section 6 concludes with an application of our methods to provide adjusted Bayesian effect size estimates for local genetic variants (GTEx gene association data set) that have been data-mined as the strongest effects.

A randomized query and an inversion map
Selection events can be broadly viewed as outputs from queries on a data-base.In the context of variable selection, we are typically interested in the active set of coefficients obtained upon solving convex optimization problems.As a followup on the recent work on randomized inference in Tian and Taylor (2015); Tian et al. (2016), these queries are randomized versions of learning problems with a convex loss (S(X, y), .)and a convex penalty P λ (.) with tuning parameter λ.Though the skepticism with randomization is that different instances of randomization can result in different selection outputs, we view it to be similar in spirit to the much practiced data-splitting where difference in outputs can result from various splits.Just as we can aggregate over the outputs from multiple splits of the data, we can similarly combine selections from multiple queries on the data-base as illustrated in Markovic and Taylor (2016).Also, sharing similarity with the concept of reusable hold-out introduced in the field of differential privacy Dwork et al. (2015), these forms of randomized inference come with the merit of higher statistical power during inference.For the Bayesian problem, randomization results in empirical improvements in the frequentist properties associated with the selective posterior, see Panigrahi et al. (2016) for examples illustrating robustness of the randomized-credible intervals.The empirical results in Section 6 of the current work corroborate these merits of a randomized Bayesian procedure, reflected in the coverage properties and shorter lengths of intervals.To add to these advantages, we leverage randomization to obtain significant computational reductions in solving an approximating optimization to sample from the selective posterior in the current work.The gains associated with randomized queries become clear after details in Section 3.
A randomized selective query taking a convex loss (S(X, y), .)and a convex penalty P λ (.) as inputs, assumes the canonical form with data realization s and randomization instance ω, where S ∼ F independent of randomization Ω ∼ G.The above algorithm has a linear term in randomization ω, drawn from a distribution G with a density g, fully supported on R p .This can be viewed as selection with a perturbed version of data, hence the term "randomized" program.Queries of the above form are termed as objective perturbation in the privacy literature, see Chaudhuri and Monteleoni (2009); Chaudhuri et al. (2011).Some randomized programs like the Lasso have an additional 2 penalty term 2 β 2 2 as in ridge regression in Zou and Hastie (2005) to enforce existence of a solution.The analyst has access to the output E, a function of (s, ω) from such a query in the inferential stage, typically the set of active coefficients along with their signs: see Taylor et al. (2013); Tibshirani et al. (2014); Lee et al. (2016).Selective inference seeks to overcome bias from having known the output of query prior to inference through the conditional approach.Fithian et al. (2014) presents a more natural analog of classical data-splitting in the form of data-carving, which advocates a random split of the data for selection, but allows the analyst to use the entire data for inference.A datacarved query that is performed on a randomly chosen split of the data is given by β(s (1) , ω) = argmin with r as the fraction of data-samples used in selection and s (1) as a random split of the data-vector s.Markovic and Taylor (2016) shows that the above selection can be cast as a randomized query of the form (4).This can allow an analyst to collect new data and view prior selection on existing data as a split on the updated data-base.Hence, it can facilitate valid inference post already conducted exploratory analyses on existing data-bases while having extra power in comparison to an analysis on the new data set only.We discuss the datacarved version of Bayesian inference in more details later in Section 5.
The starting point of achieving computational reductions in approximating the selective posterior is an inversion map that characterizes the output from randomized queries.Such a map is obtained from the subgradient equation of (4).The canonical selection event of observing an active set of coefficients E with signs z E can be described in terms of the randomization ω and data instance s using the inversion map.Denoting by β = ( βE , 0) the solution from a query in (4) with βE as the active non-zero coefficients, the inversion map is given by The above equation maps the randomization instance to realized data S = s and optimization variables O = ( βE , v −E ) where βE denotes the active coefficients and v −E represents the inactive sub-gradient corresponding to the inactive coordinates of ∂P λ (( βE , 0)).We denote the optimization variables corresponding to the active coordinates as O E and the ones corresponding to the inactive subgradient variables as O −E from now on, referring them as active and inactive optimization variables respectively.Post an affine randomized selection event, the canonical inversion map that is the basis of a new measure takes the form ω(s, o) = Ds + P o + q (7) with s and o representing data and optimization variables respectively and D, P, q are fixed.
The scope of randomized queries is quite broad in nature allowing for even discrete versions of randomizations like carving.In practice, the analyst may use a union of outputs Ē or some reasonable combination) based on a sequence of outputs (E 1 , E 2 , ..., E k ) from multi stagewise selective algorithms to determine a target and a generative mechanism in the inferential stage.We demonstrate the extension of our approach to multiple data queries using a combination of these model selection methods in 3.4 under Section 3.This finds similarity in the method of approximately estimating expectations after allowing an analyst to repeatedly query a database in Dwork et al. (2015).

A Bayesian inferential scheme using inversion map
The ingredients for selective Bayesian inference are the same as the usual one, a prior and a likelihood, except that we replace the usual likelihood with a truncated one.To describe our inferential framework, we assume a model f (.|β * ), parametrized by β * post-selection on data S ∈ R d and fix a target denoted as Θ E (β * ).In the linear model settings with a fixed design matrix X ∈ R n×p , f (.|β * ) might correspond to a family of models We emphasize here that we do not have an idea about β * before we run a selection mechanism like the Lasso.There are some settings where the parameterization exists before selection and does not change, example being the saturated model of Lee et al. (2016).Typically, we are running Lasso in order to find something that might be an interesting parameterization.
A common target of inference post selection of an active set E is the usual population coefficient corresponding to ordinary least squares on the selected model E, that is With a random design matrix, the target of inference can be described as with the generative family of models parametrized as {f Remark 1. Generative models and targets: The selected model described in Fithian et al. (2014)  interesting target Θ Ē (β * ), where Ē is determined through E. To be able to highlight the flexibility of our method to various parametrizations of the mean, we use the general notation β * to denote the parameters underlying the Bayesian model assumed post selection.
Remark 2. Prior on variance parameter: The variance σ in the generative likelihood can be modeled in a Bayesian paradigm by putting a joint prior on (β * , σ).We do not delve into details of incorporating a Bayesian model on the variance in the current draft; hereafter, we stick to a fixed variance setting.
Using a change of measure based on the inversion map in (7), the joint truncated density at (s, o) corresponding to a generative model f on data in R d with parameters β * and randomization density g ∈ R p decouples as |J| is a Jacobian reflecting the change of measure, a constant for affine inversion maps as in ( 7).The support is unrestricted on data s and constrained to R O ⊂ R p , representing constraints on optimization variables, imposed by selection output.Tian et al. (2016) advocates this new measure in order to enable a frequentist to sample from a density with a fairly simple support region as opposed to more general affine constraints on data and randomization.
Coming back to a Bayesian setting, the selective posterior for generative parameters β * given data S when β * ∼ π(β * ) is formed by appending the marginal selective density of S to the prior π(.).The truncated marginal of S given parameters β * is obtained by marginalizing over O in the joint density (8).The selective posterior is thus, given by The above posterior is however intractable as the normalizer has no exact closed form expression.The problem reduces to computing the normalizer P((S, O) ∈ R|β * ); we focus on this in Section 3.

Approximate normalizer based on inversion map
Using the inversion map that defines the selection output from a query in (7), we derive an approximating optimization with a constrained objective in d + p dimensions that bounds from above the log normalizer.We state below the first theorem of this paper that gives rise to an upper bound on the volume of a convex and compact selection region R with respect to the joint density of data and optimization variables.It involves computating the log-MGF of the augmented vector of data and optimization variables with respect to a transformed measure induced by the inversion map in (7).
Theorem 1. Denoting Λ * f (.|β * ) as the convex conjugate of the log-MGF Λ f (.|β * ) of data vector S ∈ R d and Λ * g (.) as the conjugate of the log-MGF Λ g of randomization Ω ∈ R p , a Chernoff upper bound to the exact selection probability We prove the above in Appendix A.1.While the above upper bound does hold for compact selection regions, the canonical selective constraints lead to a selection region of the form and R O is typically tensor of orthants and cubes; this lacks compactness.The upper bound derived in 1 can still be applied as an approximation as we can work with a sufficiently large compact and convex subset of R that has an almost 1-measure under prior π.A smooth version of ( 10) is seen to lead to better frequentist properties in Panigrahi et al. (2016) in the non-randomized settings; in the current work, we opt for (12) to solve a smooth objective in place of a constrained optimization.
The bound-based approximation above is given by In particular, χ R O (.) can be interpreted as a function with a uniformly 0 penalty within the selection region.An improved approximation to the selection probability can be obtained by smoothing the discrete penalty χ R O (.) in the bound with a barrier penalty b R O (.), which imposes a continuously decaying penalty as distance from the selective boundary increases.This leads to a smooth, unconstrained version of (10) to approximate log P((S, O) ∈ R|β * ) and is given by using a barrier penalty b R (.) on affine constraints induced on the optimization variables.The gain with (12) in comparison to the prior work is a much easier objective function as the canonical constraints on the optimization variables simplify to sign and cube constraints as in Tian et al. (2016) instead of the complicated affine constraints as in Lee et al. (2016).We can further benefit from separability and achieve more reductions from such an approximation under certain randomizations, as seen later in (15).The unconstrained optimization given by ( 12) in d+p dimensions can be used to approximate selection probabilities under any randomization with a log-MGF Λ g , that is independent of the data vector.In particular, we can use the optimization for inference post data carved queries of the form (5). Randomization in such queries takes the form of the gradient of difference of losses and is asymptotically independent of the data vector for a Gaussian generative model and marginally an asymptotic centered Gaussian with a covariance Σ g .Using the conjugate of the log-MGF of a Gaussian density, we obtain a tractable pseudo posterior.We illustrate inference based on the approximate selective posterior post selection on a random fraction of the data in Section 5.

Reduction in optimization
Under randomizations with a density supported on R p that are independent in all p-component coordinates, we present an approximation that is based on smoothing a modified upper bound.For most common queries, it involves an optimization objective in d + |E| dimensions, where |E| ≤ p is the size of the active set from the selective query.Note that the optimization in (12) involves d+ p optimizing variables.With the reduction in dimensions of the optimization, we make a significant improvement in scalability of our methods in high dimensional sparse problems, when |E| p.Such a reduction is possible due to • decoupling of randomization density under independence • the structure of the canonical inversion map in (13) that allows an exact and easy calculation of the volume of the inactive selection region with respect to the density of O −E .
Before proceeding further, consider a break-up of the canonical randomization map into E active and p − |E| inactive coordinates.Such a decomposition takes the form where o E denotes the active coefficients and o −E represents the inactive subgradient.The inversion map has such a structure in most commonly used queries like the Lasso, forward stepwise, thresholding etc. as we see later in Section 5.The density g under a component-wise independent randomization scheme decouples into the active and inactive coordinates as The constraints on (o E , o −E ) for the canonical map are also separable and particularly, the inactive constraints are separable in each coordinate.The selection region induced by the selective constraints can thus, be denoted by where R E represents the active constraint region, R −E the inactive region and R j,−E , each component inactive constraint.The below theorem uses this separability in constraints and independence to obtain an upper bound on the logarithm of the normalizer of the truncated law.It involves computing the exact probability of the inactive subgradient variables lying in the selection region R −E = j R j,−E as a function of realizations of the active optimization variable o E and data s.
Theorem 2. Under a randomization scheme composed of p independent components and a selective query of the form (13) yielding a compact and convex selection region and R O takes the form (14), an upper bound for log P((S, O) ∈ R|β * ) for a compact, convex selection region R is given by where D j,−E , P j,−E and q j,−E denote the j-th rows of the matrices D −E , P −E and j-th component of vector q −E in (13) respectively.
A proof of the bound is done in Appendix A.1.A heuristic 1 minimax argument together with smoothing of constraints by a barrier penalty yields a reduced analog of ( 12) for canonical selective queries in the paper.An approximating optimization with a barrier penalty on the active constraints denoted as b R E (.) can be written as with B(o E ; s) as defined in Theorem 2.
1 exact is possible if selection region were compact; we still can apply the approximation with a large enough compact subset of the selection region with almost mass 1 under the prior imsart-generic ver.2008/08/29 file: primal_dual_approach.tex date: September 12, 2017 Expression (15) yields an approximating optimization in R d+|E| with a barrier function on the sign constraints of the active optimization variables in R |E| .We use the fact that the volume of the inactive selection region B(o E ; s) can be calculated exactly and easily as p−|E| simple, univariate integrals over intervals R j,−E ⊂ R. For example, for a centered Gaussian randomization with covariance matrix τ 2 I p and the canonical cube constraints on the inactive subgradient variables O −E taking the form R j,−E = {o j,−E : |o j,−E | ≤ λ}, a closed form expression for the logarithm of the volume of the inactive cube region is Here Marginalizing over the inactive optimization variables results in a significant reduction in dimensions of optimization from the objective in ( 12).Similar exact calculations of univariate probabilities of lying within an interval are easily available for other heavier tailed randomizations like the Laplace, Logistic etc. used in implementations in Markovic and Taylor (2016).

Dual problem: low dimensional regime
While solving the pseudo selective posterior using the above optimization as a surrogate to the normalizer is scalable for high dimensional problems, when p d+|E|, it is not very ideal in the low dimensional regime with a large sample size, when d + |E| p. Further, the optimization in (15) requires knowledge of the conjugates of the log-MGFs of the densities of the data and randomization.The dual problem yields an optimization objective in R p and hence, renders a scalable version of the optimization in the low dimensional paradigm.The other distinction from the optimization posed in the primal form is that the dual is based on simply the log-MGFs corresponding to the distributions of data and randomization.In the low dimensional situation or when we do not have closed forms for the conjugates of the log-MGFs of the generative model, we can solve for the dual of the optimization problem instead.
Theorem 3. Denoting Λ f (.|β * ) as the log-MGF of data generative density f and Λ g (.) as the log-MGF of randomization Ω, the dual to the optimization approximating the selection probability log P((S, O) ∈ R|β * ) in ( 12) is given by ) and D, P, q are coefficients of linear terms of map (7).
See proof in the appendix A.1.A point to note is that dual formulation of the approximating optimization involves computing the conjugate of the barrier imsart-generic ver.2008/08/29 file: primal_dual_approach.tex date: September 12, 2017 penalty function on the optimization variables.Since the constraints on the active and inactive optimization problems are separable, this involves solving conjugates of |E| and p − |E| univariate functions that correspond to the active and inactive constraints respectively.That is the conjugate barrier takes the additive form b Details of the computation of the conjugates of the barrier functions used in our implementations are given in Appendix B.
Remark 3. The dual of the constrained Chernoff-based optimization in (11) for the canonical constraint region where P T j,−E denotes the j-th row of matrix transpose of P −E .This is by observing that the convex conjugate of the characteristic function χ R j,E (.) representing the sign constraints on the active optimization variables is and that for the cube constraints χ R j,−E (.) on the inactive subgradients is

Marginalizing over multiple selections
The optimization problem described above is aimed to approximate the selection probability of an event based on a single randomized data query of the form (7). It is however, common practice to apply stages of screening or query the data base multiple times to arrive at a selected set.An example might be laboratory A performing an initial scan of thousands of potential predictors to select a pool that passes a suitably chosen thresholding criterion and laboratory B conducting another screening of predictors.The analyst is interested in combining both screening results to guide her to inference on the same data set that has been analyzed by the two laboratories.
The approximation presented in ( 12), ( 15) and ( 16) can be marginalized over multiple randomizations from multiple stages and hence, be extended to multi-stage selective algorithms.The next Lemma renders an approximation to the normalizer for a K-stage randomized selection with query in each stage corresponding to an inversion map with o k being the optimization variables for the randomized program in stage k.The selection region, determined by constraints on optimization variables o K at each stage, separable in the active and inactive coordinates as before, is given by the unconstrained data-space augmented with the constrained region on optimization variables from each query.
) as the conjugate of the log-MGF Λ g k of randomization Ω k , an upper bound to the logarithm of the exact selection probability log P((S, Proof.The proof is easy to see as with independent randomizations in each stage of selection , we have log An optimization over α ∈ R and α k ∈ R, {k = 1, 2, ..., K} and a minimax equality gives the bound A similar computation of the log-MGF of the augmented vector (S, as in the proof of Theorem 1 based on the change of variables facilitated by the inversion maps in (3.4) completes the proof.
The smooth analog of the constrained optimization in Lemma 1 is given by The dual formulation of this approximation, optimizing over dual variables Remark 4. Cost of optimization: The optimization in Lemma 1, decomposed into active and separable inactive problems can be solved in its primal form in effectively d while the dual has an effective cost of solving a Kp dimensional optimization, if the selected sizes are of smaller order than p.

Sampler: Langevin random walk
We describe below a Langevin random walk to sample from the pseudo posterior post randomized queries based on the approximate normalizer in ( 12), ( 15) and ( 16).The method of approximating a target distribution using a Langevin diffusion is studied in Roberts and Tweedie (1996).Another alternative to the simple Langevin sampler implemented in this work, is a Metropolis version with an accept reject step; the afore mentioned reference introduces the "Metropolis adjusted" version of the algorithm.Depending on the regime of inference, we require the log-MGFs of the generative density and the randomization density for solving the approximating optimization in its dual form or the convex conjugates of the log-MGFs while solving for the primal.A new update β * (K) based on a Langevin random walk with target as the pseudo selective posterior πE (β * |S) is given by where η is the step-size and (K) ∼ N (0, I).This allows us to provide samplebased effect size estimates in the form of credible intervals and point estimates for any function of the parameter of interest β * in the generative model.All that the sampler in ( 19) requires is calculating the gradient of the logposterior πE as a function of each new draw β * (K) .For a Gaussian generative model on data vector S with mean parametrized as µ(β * ), the below theorem shows that the gradient of the log-pseudo posterior can be computed in terms of the optimizer to the problem in (12).
Theorem 4. The gradient of the log-pseudo selective posterior log πE (.|S = s) at β * (K) for a Gaussian generative density for data vector S with mean µ(β * ) : R k → R d and a variance-covariance matrix Σ f given by with respect to parameter β * is given by where For the dual optimization in ( 16) , the optimizer s * can be derived from the K.K.T. conditions as where u * is the dual variable that optimizes arg min This shows that all we need for inference is the solution to the optimization problem cast as ( 12), ( 15) and ( 16) at each fresh draw.See Appendix A.1 for a proof of Theorem 4.
The proof of this is straight-forward from the estimating equation in (20).In the following section, we show that the approximate normalizer in Theorems 1 and 2 give a valid exponential rate to the selection probability on a large deviation scale under a Gaussian randomization and a Gaussian generative density.Under these conditions, the selective MLE obtained by maximizing the pseudo truncated law in Lemma 2 is consistent for β * .

Limiting approximation on large deviation scale
We fix some notations that apply to this section.In the implementations in Section 6, the columns of the predictor matrix X are normalized by a factor of 1/ √ n.We introduce the suppressed scale and denote the normalized X as X/ √ n in this section.The optimization for such a query is described in details in 5.1 under Section 5.The set-up is similar to the randomized logistic lasso query considered in Tian and Taylor (2015), except that we use the Lasso query instead of the logistic Lasso query to illustrate the results in this section.With data where , the Lasso query is given by The tuning parameter is set at a theoretical value Denote the scaled versions of the data vector and optimization variables as S n where with Sn as the mean of data variables Unlike Tian and Taylor (2015) which assumes local alternatives of the form β E,n = o(n −1/2 ), the selection probability is on the scale of a large deviation probability if β E,n = O(1).To simplify notations, we denote β E,n = β E hereafter.
We assume that the randomization instance ω in ( 4) is from a Gaussian density, which is used in all the experiments in Section 6.The infinite divisibility property of Gaussian densities allows us to write perturbation ω = √ nω n where ωn is the mean of n i.i.d.Gaussian variables ω i , i = 1, 2, • • • , n.The tuning parameter λ n converges to a constant; thus, we can treat it as a constant and denote it as λ.Also, noting that X T X/n converges in probability to a constant, we can consider the matrices D n , P n , q n in the inversion map for (4) as fixed.We use notations D, P, q for the affine inversion map.Finally, let based on the inversion map, with Ōn interpreted as the mean of Theorem 5 gives the limiting rate of decay of the volume of a compact and convex selection region R = R S × R O for R S ⊂ R p and R O ⊂ R p with respect to the probability density of the augmented vector ( Sn , Ōn ), whenever the data vector mean satisfies a large deviation principle.Define with Sn as the mean of the data vector array Theorem 5. Whenever the limiting log-MGF sequence (1).Denoting the log-MGF of Gaussian randomization ω 1 as Λ g (.) with conjugate Λ * g (.) and the conjugate corresponding to (2).If the Gaussian randomization density supported on R p is independent in all p coordinates with the conjugate of the log-MGF corresponding to active coordinates denoted as Λ * g E and the selective constraints on the optimization variables are separable as in Theorem 2, then We use in the above theorem the fact that the mean vector Sn satisfies a large deviation principle with rate function Λ imsart-generic ver.2008/08/29 file: primal_dual_approach.tex date: September 12, 2017 Similarly, the conditional probability of Ōn given Sn has a limiting large deviation rate expressed in terms of Λ * g (h n (.)) composed with the affine inversion map h n (.) : R p → R p given by That is, lim This is a consequence of the observation that the limiting rate function is the conjugate of lim n 1 n log E(exp(nλ T Ōn )| Sn = s); the change of measure map yields the following for Gaussian randomization with variance Σ g .The proof then follows by an application of the below Lemma 3, a modified version of Varadhan's Lemma (see Dembo and Zeitouni (1998)).The smooth unconstrained optimizations in ( 12) and ( 15) with a continuous barrier penalty function, scaled appropriately also approximate the selection probability accurately as the sample size grows large.Proofs of the above theorem and Lemma 3 are included in the Appendix A.2.
Lemma 3.For a sequence of functions H n (.) that uniformly converge to a continuous function H on a compact, convex set R ⊂ R d , the limit holds for sequence of variables Zn ∈ R d satisfying a large deviation principle with a rate function Λ * (.).
Under a Gaussian generative density parametrized by β E , a Gaussian randomization with log-MGF Λ g (.) and a compact and convex selection region R = R S × R O and the same asymptotic set-up as in Theorem 5, it follows as a consequence that the sequence in Lemma 2 approximates the exact log-partition function imsart-generic ver.2008/08/29 file: primal_dual_approach.tex date: September 12, 2017 Denote the sequence of selective MLE obtained by maximizing the sequence of pseudo truncated likelihoods as β MLE n,E that satisfies the estimating equation Strong convexity of Γ n (•) with a lower bound B on the indices of convexity leads to the identity Convergence of the approximate log-partition sequence Γ n (•) to the exact one coupled with the identity above prove consistency of the selective MLE β MLE n,E using similar arguments as Theorem 7.6 in Panigrahi et al. (2016).

Illustrations of truncated Bayesian approach
We illustrate truncated Bayesian approach by revisiting some popular selective queries.These examples are discussed in the context of frequentist inference in Tian et al. (2016).In all the below examples, the generative law on the data vector is a Gaussian with mean parametrized as β * .In particular, we assume that E f [Y |X] = X * β * .We assume that the columns of the design matrix X are scaled by 1/ √ n and denote the scaled predictor matrix as X, suppressing the scale n.In particular, we assume independent Gaussian entries for X.The randomized queries are conducted using instances of randomization from a Gaussian density supported on R p with mean 0 and variance τ 2 I p .Remark 6. Prior information on parameters: We provide inferential results based on the selected model in Section 6 where X * = X E and β * = β E ∈ R |E| under a non-informative prior.Our methods however, do allow the analyst to elicit a prior from an expert or prior experiments post the selective analysis.Our simulation results show that in the absence of an informative prior, the analyst can still capitalize upon the merits of a Bayesian machinery to provide valid inference post selection.
The generic recipe for inference using the proposed methods is to compute the inversions maps and selection regions that characterize the output of a query.This is followed by solving the optimization problem for each draw β (K) of the sampler.A function of the optimal data vector gives the gradient of the approximate log-posterior at β (K) in Theorem 4; thus, we sample from a tractable version of the selective posterior to carry out Bayesian inference.For each of the below examples, we give an explicit approximating optimization based on the inversion map and selective constraints that characterize the randomized query.
We present below the canonical Lasso query with the 1 -penalty; we show simulations using both the primal and dual optimizations and a carved version of the Lasso query in Section 6.We describe the optimizations for forward stepwise in 5.2 and the thresholding query that is a screening stage of a multiquery in 5.3, these are examples of popular queries with penalties different from an 1 penalty.. Other natural extensions of the Bayesian approach include the imsart-generic ver.2008/08/29 file: primal_dual_approach.tex date: September 12, 2017 grouped selection of variables with a group Lasso penalty as in Loftus and Taylor (2015).We can also apply our methods to other interesting examples such as inference post selection of edges representative of the conditional dependence structure of variables via the graphical Lasso; frequentist selective inference in such a model has been addressed in Taylor and Tibshirani (2016).We do not explore these extensions here.

A Lasso query
A randomized version of Lasso with design X based on data vector S = y ∈ R n and randomization instance ω ∼ N (0, to give output (E, z E ), the active set with active signs.
The selection region imposed by the 1 -constrained query takes the completely separable form of orthants for active constraints and intervals for inactive constraints; that is Inversion map: The inversion map encoding selection output (E, z E ) is given by Based on the above inversion map, the approximating optimization with n + |E| and p optimizing variables in the primal and dual formulation respectively can be computed as below.
Primal problem: Under the linear model with mean X * β * and covariance matrix σ 2 I n , the approximation to log P((S, O) ∈ R|β * ) leading to pseudo posterior πE (β * |y) is where the volume of the inactive selection region conditional on S = s, O E = o E under the isotropic Gaussian randomization is computed as and D −E , P −E , q −E as given in ( 26).
Dual problem: The corresponding dual in terms of the logarithm of the Gaussian MGFs (both for data and randomization) and conjugate of the barrier function is given by inf where D, P, q are obtained from the map in ( 26).Calculation of the conjugate of the barrier function follows from Appendix B.
A carved query solves the randomized version of lasso described in (5.1) on a random split of the data S(X (1) , y (1) ) leading to the output (E, z E ).Such a query takes the form where r is the fraction of the data used in the above selective query.
Inversion map: The randomization inherited from the random split on the data, as described in Markovic and Taylor (2016) leads to the below inversion map The randomization described as above is asymptotically Gaussian with mean 0 and covariance Σ g and is asymptotically independent of the data vector S described for the random X lasso query.
Primal problem: While we can no longer use the reduced optimization in (15) (as the randomization inherited from the split is not independent in all p-coordinates), we can use the more general approximation to the normalizer in ( 12).The joint on the data and randomization is an asymptotic Gaussian, with the data mean parametrized as µ(β * ).The approximating optimization that we solve to sample from the pseudo selective posterior is given by with Σ g and Σ f estimated by bootstrap.

A forward stepwise query
We discuss the approximating optimization that we solve to give truncated Bayesian inference after 2 steps of forward stepwise selection (FS) next.This can be easily generalized to K steps.In Section 6, we give adjusted estimates in a Bayesian model after 1 step of FS.This can also be viewed as a sequential query on the data.
Inversion maps: Denoting E 1 = {j 1 } and E 2 = {j 1 , j 2 } and the predictor for second stage as X = P ⊥ j1 X −j1 (adjusted for selection of j 1 in the first step), the characterizing inversion maps for the two-stage sequential selection procedure are as below. .
giving selection regions where z j1 and z j2 represent the signs of the active variables entering the model in the first and second steps respectively.
Remark 7. Selection regions in FS: As we can see from above that the selection regions in this example take the form of a cone rather than the usual orthant and cube yielded by the 1 penalty in the variants of Lasso.We can still write the selection region as That is the inactive selective constraints are all separable in the p − 1 coordinates, although they are determined by the active optimization variable unlike the example of Lasso.The probability of the inactive optimization variables being constrained to be smaller in magnitude than |o j1 | can be computed exactly as B(o j1 ; s) conditional on data s and active variable o j1 .A similar computation goes through for more than 1 step of FS.Dual problem: Solve a dual optimization over u 1 ∈ R p and u 2 ∈ R p−1 , as stated below:

A 2-stage query: thresholding followed by Lasso
We present an example of a two-stage screening method in the linear regression setting with a fixed design matrix X with normalized columns; we derive the approximating optimization problem to provide inference in a Bayesian model with prior π on β * and Y |β * ∼ N (X * β * , σ 2 I).The selective analysis comprises of two stages of screening based on realizations ω 1 , ω 2 from independent Gaussian distributions, each with all i.i.d.mean 0 components and variance τ 2 I.
The first query is a randomized marginal screening across the Z-statistics at a nominal threshold vector α, that solves min This results in output (E 1 , z E1 ), the active set of marginally most correlated predictors with active signs from Stage-I screening.Denoting X = X E1 ∈ R n × R E1 , the predictor matrix with selected predictors from the first round of screening, the second query is a randomized lasso query that solves (5.1) with design matrix X to yield active set E 2 with signs z E2 .
The first step describes the inversion maps and selective constraints encoding the two selective queries where data vector S = Y .

Inversion maps:
Using the facts that the convex conjugate of a Gaussian log-MGF with mean µ and variance γ 2 I k at vector x is x − µ 2 2 /2γ 2 and the log-MGF is µ T x + γ 2 x 2 /2 we derive the primal and dual optimization problems to sample from the approximate posterior.
Primal problem: The primal marginalizes over the inactive sub-gradients followed by the optimization over active variables and data in n we have the optimization in the primal form as Dual problem: With P 1 and P 2 identified respectively as from the randomization maps, solve an optimization over u 1 ∈ R p and u 2 ∈ R |E1| as below: imsart-generic ver.2008/08/29 file: primal_dual_approach.tex date: September 12, 2017 6. Experiments

Simulated models
We conduct different experiments to show the coverage and risk properties of estimates, obtained using our methods in comparison to those based on untruncated approach.We vary generative models across our experiments: this highlights that our methods show good performance even under misspecified models.We gives estimates under commonly used selective queries with different losses and penalties.
In the first experiment, we use Model I in Section 1.2 to generate our data.The ground truth is the null model Y ∼ N (0, I).For a fixed design X, we draw Y ∈ R n for every repetition using the same X.For a random design X, we randomly draw X ∈ R n×p with Gaussian entries and draw Y conditional on X and the underlying parameter β in each repetition of the experiment.The columns of design X are scaled by 1/ √ n in all cases.The second experiment uses Model II in Section 1.2 as a generative mechanism.This is a Bayesian model with ground truth Θ E (β) = (X T E X E ) −1 X T E Xβ, determined by E in each trial.In table 3, we compare the empirical coverage of the credible intervals, the risk of the posterior mean and the length of intervals between the approximating method that aims at the pseudo selective posterior and the usual Bayesian posterior inference.For the Bayesian model, we report the empirical (Bayesian) FCR, Bayes risk of the posterior mean and the length of intervals in table 4. The queries as described in Examples under 5.1 are conducted under centered Gaussian randomization with variance τ 2 I p ; with the exception of the carved query which inherits randomization from a randomly chosen split of the data.For inference, we use the selected model N (X E β E , I) and a non-informative prior on β E , where E is the active set from the selective query.In both experiments, we use a misspecified likelihood and prior.Note that despite the fact that the model for inference is a mis-specified one under the true generative models, our methods display superiority in terms of coverage and risk properties in comparison to the unadjusted estimates.The first column states the query-the Lasso with a fixed and random design, a carved Lasso with a random design and 1 step of forward stepwise (FS) and the last column gives the regression dimensions n and p.The generative mechanism in the third experiment is a frequentist model that deviates from the all noise model considered in Experiment 1.It gives an assessment of estimates based on the output from a Lasso query with a fixed X design using the primal and dual problems by varying the sparsity levels in the true generative mechanism.Based on a fixed predictor matrix, we simulate Y ∈ R n in each draw as below for a sparse vector β S with true support S ⊂ {1, 2, We use the primal and dual formulation of the optimization in Example 5.1 for providing estimates in a high dimensional sparse problem n = 500, p = 3000 and in the low dimensional regime n = 3000, p = 500 respectively.We vary the sparsity levels as |S| = 0, 5, 10, 20 signals, each with magnitude 7. Tables 5 and  6 show that the adjusted estimates have superior risk and coverage properties as compared to the unadjusted estimates, both based on a selected model appended to a diffuse prior.The final experiment gives the performance of the estimates post the 2-stage screening query with a fixed X design, described in Section 5.3.We again use both the frequentist Model I and the Bayesian Model II to generate our data.We choose to provide inference using an adaptive target under a model, both of which are determined by the final screened model E 2 that combines the output from the two screenings.The coverage and risk comparisons for the above screening procedure are given in table 7. The first column gives the Model generating the data and the last column gives the dimensions of the simulation.
The only case where the Bayes risk of the adjusted estimate is slightly more than that of the unadjusted posterior mean is for Bayesian model when n = 200, p = 1000.

Data analysis: inference on causal variants
To illustrate the inferential gains with the truncated Bayesian method, we provide adjusted effect size estimates for SNPs (Single nucleotide polymorphisms) that have been data-mined as the strongest associations with gene expression.An analyst will be confronted in defending the strength of these associations if she does not overcome the selective bias encountered in estimation post datasnooping.With gene expression data as the outcome, we give adjusted effect imsart-generic ver.2008/08/29 file: primal_dual_approach.tex date: September 12, 2017 size estimates of SNPs that have been selected as the set of probable causal genetic variants.We highlight the differences between the adjusted Bayesian approach and the unadjusted counterpart (that is inadaptive to selection); we also depict the higher statistical power associated with the adjusted Bayesian estimates post a randomized selection as opposed to the estimates based on Lee et al. ( 2016) post a non-randomized selection.
The data analyzed in this work involves gene expression data Y ∈ R 97 for a gene collected from the human tissue -Liver for a sample of 97 densely genotyped individuals.More details on this data-set are included in the Appendix C. The exceedingly small sample size in this analysis does not allow the analyst to reserve a hold-out data set for inference.The goal here, is to quantify the effect sizes of variants that have been selected from a set of 5233 of potential predictors, namely X ∈ R 97×5233 as predictors that best explain the variance in expression levels of the gene under study.More specifically, the columns of X represent local genetic variants measured as SNPs that lie within 1MB of the transcription start site of the gene.This data has been investigated as a part of a genome-wide association study conducted in Consortium et al. (2015); Ongen et al. (2015) with focus on identifying the significant associations between gene expression and genetic variants across different human tissues.The aforementioned works aimed at recognizing genes with at least one causal variant, called eGenes.A more recent work Aguet et al. (2016) performs a secondary analysis on the eGenes to further identify variants that regulate the expression for these genes.This involves a search over the local variants around the genes.In this work, we employ one such selection procedure, the commonly used Lasso to pick promising predictors and apply our method to give estimates for effect sizes of these selected SNPs based on the truncated posterior.
Below, we outline the analysis that leads to the selection of SNPs.To aid interpretability and recovery of a meaningful set of effects, we a perform hierarchical clustering with a minimax linkage on the set of 5233 SNPs, see Bien and Tibshirani (2011).The distance measure between SNPs X i and X j is defined as d(X i , X j ) = 1 − ρ(X i , X j ) where ρ(X i , X j ) is the empirical correlation between two SNPs X i , X j ∈ R 97 .This algorithm introduced in Ao et al. ( 2004) clusters the SNPs and gives a prototype for each cluster.The number of clusters is chosen so that each of the 5233 SNPs has a correlation of at least 0.5 with at least one of the prototypes.Applying a typical selection procedure like the Lasso on the set of local variants without pruning it to prototypes is not ideal in this analysis as the local variants share substantial empirical correlation; the Lasso will typically suffer from an inability to recover the true set of signals.Reid and Tibshirani (2016) identifies this shortcoming of the Lasso and proposes inference on effect sizes post a Lasso on prototypes of clusters in such scenarios.While the prototypes in Reid and Tibshirani (2016) are determined in a greedy fashion; the cluster representative being the most associated with the response, we adapt a completely unsupervised approach here in order to determine the clusters and prototypes with no data-snooping.Using the described hierarchical clustering, we obtain 320 prototype SNPs, each of which has a correlation of at least 0.5 with the SNPs in its cluster.We finally run a randomized Lasso query imsart-generic ver.2008/08/29 file: primal_dual_approach.tex date: September 12, 2017 given by (5.1) on the prototype SNPs with Gaussian randomization.With the tuning parameter is set at the theoretical λ = σ • E(X T ), the randomized Lasso query selects a set of 21 potential regulatory variants; σ is estimated as 0.4.The ratio of the randomization to noise scale in the data is set at 0.5.
We provide inference for the population least squares coefficients that correspond to the selected set E of SNPs.That is, the adaptive target is used as a quantification of the effect sizes of the selected SNPs.We assume the selected model on the data for inference, that is, Y ∼ N (X E β E , σ2 I) and a non-informative prior on β E (similar to the simulations in Section 6.1). Figure 1 gives a comparison of the effect sizes of selected SNPs using the proposed truncated approach with the unadjusted Bayesian estimates.Under the diffuse prior, the unadjusted estimates will be centered around the OLS estimator (X T E X E ) −1 X T E y with variance given by the diagonal entries of (X T E X E ) −1 .The optimizations that we solve to obtain the truncated Bayesian estimates are laid out in Section 5. We note the differences in effect sizes based on the adaptive posterior and the unadjusted posterior; we can see that the selected SNPs at positions 492, 606, 2960, 3509, 3574 will be reported as significantly associated with the gene expression if the analyst did not account for selection.The adjusted inference however, shows that the effect sizes of these SNPs are significantly biased by selection; the adjusted Bayesian intervals for these reportedly significant SNPs cover 0.
We supplement the above randomized effect size estimates with non randomized frequentist inference of Lee et al. (2016) post the usual Lasso query (without the randomization term in (5.1)).Figure 2 plots the exact frequentist intervals post a Lasso selection compared against the unadjusted intervals.The nonrandomized selection includes 18 SNPs as opposed to 21 SNPs picked up by the randomized Lasso; the common SNPs picked by both queries occur at positions 158,492,606,1830,2259,2786,2876,2926,2960,3155,3509,3574.The exact frequentist intervals adjusted for selection again show that the SNPs at 492,606,2960,3509,3574 are no longer statistically significant effects, as opposed to the unadjusted estimates.The comparison with the randomized intervals in Figure 1 shows that the estimates post a randomized version of Lasso have more statistical power.This is highlighted in the shorter lengths of the randomized intervals when compared against the exact frequentist intervals of Figure 2. Adjusted inference post both randomized and non-randomized versions of the Lasso query identifies SNPs at 158, 2786, 2926 to be significantly associated with gene expression.
The simulations in 6.1 post different selective queries show that the Bayesian estimates have good frequentist properties under a non-informative prior.We show that the adjusted Bayesian estimates indeed mimic the adjusted frequentist estimates based on Panigrahi et al. (2017) under the diffuse prior for the selected SNPs. Figure 6 in the Appendix C depicts the adjusted frequentist intervals alongside the Bayesian intervals.To validate the inferential guarantees of our estimates in the above analysis, we conclude with a simulation design based on the predictor matrix of SNPs X as considered above.We consider a sparse regime varying the number of signals |S| ∈ {0, 1, 2, 3}.In this sparse regime, we simulate the response Y from a model based on the 5233 predictors with |S| true signals as follows: • subsample |S| clusters from the 320 clusters of SNPs (obtained by hierarchical clustering with a minimax linkage, described above) • subsample one SNP further from each of the |S| subsampled cluster as the positions of the true signals; the set of true signals is called S with cardinality |S|.• draw response y ∈ R 97 as y = X S β S + ; ∼ N (0, I), where |β j,S | are of equal strength for j ∈ S. We vary the the magnitude of signals over the set {10, 5, 2.5} corresponding to roughly K √ 2 • log p with K = 4.5, 2, 1; p = 320 respectively.These signal strengths correspond to three SNR regimesstrong, moderate and weak signal regime.
We evaluate the coverage and risk for the adjusted and unadjusted estimates averaged over the selected SNPs and across repetitions of an experiment with 50  trials in the 3 different signal regimes.In each repetition, we provide inference about the ground truth for the population least squares coefficients given by under the selected model and non-informative prior as before.Note that the prototypes might not be positions of true signals in our simulation study, thereby, the model we assume for inference might be a misspecified model.We see that even with a misspecified model, the adjusted Bayesian estimates show superior performance than the unadjusted estimates, both in terms of coverage and risk.We also compare the adjusted Bayesian inference post the randomized Lasso with the non-randomized exact frequentist estimates of Lee et al. (2016).The blue color gives the adjusted Bayesian inference under the diffuse prior, the grey color represents the exact frequentist inference of Lee et al. (2016) post a non-randomized Lasso query and the red color denotes the unadjusted Bayesian inference.Lee et al. (2016) does not give a selection adjusted point estimate; the grey curve in Figure 4 plots the risk of the OLS estimator (X

Concluding remarks
The motivations to adjust for selection through a truncation on the generative model is the same as the frequentist line of works; though the technicalities with imposing a Bayesian model post selection are different.While prior works make progress in formalizing a selective Bayesian methodology, the current work makes significant contributions in proposing a concrete computational recipe to approximate the selective posterior after systematic randomized procedures.
The methods extend to multi stage selective queries, marginalizing over randomizations from each selective stage.An attractive property of this approach is scalability in both regimes of inference with empirical demonstration of frequentist coverage with credible intervals and risk of the posterior mean based on the approximate selective posterior.An interesting future direction includes establishing a Bernstein von Mises result in the selective Bayesian paradigm.We empirically see that the truncated Bayesian methods somewhat recover frequentist coverage under diffuse priors just as they would in the untruncated regime of inference.From a purely application point of view, we see scope of the methodology in this work to be applied to genome-wide studies where the true model describing the association of phenotypes with variants is assumed to be highly sparse.Inference post identification of causal variants is an important goal; our methods can provide reliable and reproducible effect size estimates in such settings.The Bayesian model in particular, allows an analyst to leverage information from an objective or subjective prior that can arise from prior experimentation in these studies.Developing tools to sample from an intractable posterior modeled using the truncated framework has been the focus of this paper, this allows the analyst to take full advantage of the Bayesian machinery post selection and provide estimates with better coverage and risk properties than the usual Bayesian esimsart-generic ver.2008/08/29 file: primal_dual_approach.tex date: September 12, 2017 timates.
based on decoupling of the randomization density into active and inactive coordinates is given by By a minimax equality for compact, convex selection region R S × R E and the expression for log-MGF using the change of measure derived in the proof of Theorem 1, we have the result.

Proof of Theorem 3:
Proof.With the introduction of variable v = Ds + P o + q, the dual of optimization inf gives the optimizing equations This yields and hence, follows (16).

Proof of Theorem 4:
Proof.Plugging in the conjugate of the log-Gaussian MGF of data vector S, the logarithm of the pseudo posterior can be written as The derivative of the log-pseudo posterior is thus given by ∂ log πE (β * |s) for s * satisfying arg inf The last equality follows by noting that (2).For the second part, we use the tower property of expectation to have Also, note that G s n (o E ) is a sequence of continuous, concave functions in o E , s converging to a concave function G s (o E ).The concavity in o E follows from the fact that convolution of log-concave densities with a log-concave indicator preserves concavity.Thus, we have inf finally note that H n (.) is a sequence of continuous functions with a uniform limit H(.) on R S .This completes the proof of the second part with similar arguments as the first part of the theorem.

Proof of Lemma 3
Proof.Using the uniform convergence of H n (.) to H(.) on compact R ∈ R d , we have for any x ∈ R 0 such that Λ * (x) − H(x) < ∞ and δ > 0 such that B(x, δ) ⊂ R 0 To prove the upper bound, let φ j = j min(d(z, R), 1) be a sequence of bounded continuous functions increasing to χ R (.), the characteristic function of R. Again using the uniform convergence of H n (.) to H(.) on R and Varadhan's limit lemma for continuous bounded function φ j (.), we have  2016) as part of an eQTL study.In fact, the mentioned paper aimed at discovering cis-eQTLs that are associations between local genetic variation and gene expression.This work also conducted a secondary analysis on genes that are believed to have at least one regulatory variant to identify potential causal variants.It is of natural interest for the biologist to be able to give reproducible estimates of the effect sizes of these discovered variants post a search over the set of all variants that leads to reporting/identifying the promising ones.We apply our methods in 6.2 to produces estimates for the effect sizes of these possible regulatory variants, chosen through a Lasso analysis.
Comparison of the Bayesian estimates with frequentist approach We have seen in Section 6 that the Bayesian estimates under a flat prior and modeled along the conditional approach have good frequentist properties like coverage and risk.For interested readers, we also compute the adjusted frequentist estimates using the methods in Panigrahi et al. (2017).The mentioned paper uses a sampler free approach to solve for an intractable pivot and provides adjusted intervals and an approximate selective MLE based on the truncated likelihood, conditioned on the selection event.For the discovered SNPs post a randomized Lasso on prototype SNPs as described in 6.2, we plot the frequentist estimates alongside the Bayesian estimates under the diffuse prior.The fact that the Bayesian intervals and posterior mean mimic the frequentist intervals and selective MLE validates that our Bayesian approach displays the Bernstein von Mises phenomenon as the unadjusted Bayesian estimates do.

Remark 5 .
Estimating equation for MAP: It is easy to see that equating (20) to 0 gives rise to an estimating equation for the selective MAP for β * .It gives rise to a convex objective for the MAP problem for any log-concave prior π on β * and a generative mean µ(.) that is linear in β * .Lemma 2 gives the selective MLE under a non-informative prior π ∝ 1 for a Gaussian density with natural parameter as β * .A standard gradient descent can be performed on the log-posterior to solve for the MAP in such cases.The pseudo selective MAP in the non-randomized scenario and the simple additive randomized settings is introduced inPanigrahi et al. (2016).Lemma 2. Under a Gaussian generative density for data vector S considered in 4 with mean parametrized as µ(β * ) = β * and Σ f = I, the approximate selective MLE β * MLE based on the pseudo truncated law ˜ E (•|β * ) given by lower semi-continuous and differentiable in D 0 and for any λ ∈ ∂D, lim γ→λ |∇Λ f (ν|β E )| = ∞, the following hold for a compact and convex selection region R = R S × R O .

Fig 1 :
Fig 1: Effect size estimates: posterior mean and credible intervals are based on the truncated and unadjusted Bayesian posterior.The adjusted intervals have an average length of 2.31, the unadjusted intervals have an average length of 1.86.

Fig 2 :
Fig 2: Effect size estimates: adjusted intervals are the exact frequentist intervals constructed by conditioning on the polyhedral selection region of Lasso.Unadjusted intervals are centered around the OLS estimator post Lasso, (X T E XE) −1 X T E y with variance of j-th coefficient given by (X T E XE) −1 j,j .The adjusted intervals have an average length of 4.25, the unadjusted intervals have an average length of 1.56.

Fig 3 :
Fig 3: Comparison of coverages in the strong, moderate and weak signal regime: the bar plot depicts the average coverages across 50 replications, averaged over the selected SNPs.The black dotted line marks the 90% target coverage.The adjusted Bayesian intervals and exact Lee et al. (2016) intervals cover the true target nearly 90% of the total replications.The unadjusted intervals clearly fall short of the target coverage.

Fig 4 :
Fig 4: Comparison of risk in strong, moderate and weak signal regime: the adjusted posterior mean has smaller risk than the inadaptive posterior mean post both randomized and non-randomized Lasso.The grey curve depicts the risk of the unadjusted posterior mean post non-randomized Lasso.

Fig 5 :
Fig 5: Comparison of lengths in strong, moderate and weak signal regime: the adjusted Bayesian intervals are much shorter than the exact frequentist intervals, they are comparable in length to the unadjusted intervals.

Fig 6 :
Fig 6: Effect size estimates: adjusted posterior mean and credible intervals based on the truncated and unadjusted Bayesian posterior; selective MLE and confidence intervals are based on the sampler free approach in Panigrahi et al. (2017).The Bayesian estimates, both posterior mean and intervals mimic the frequentist estimates.

Table 1 :
Model I : Coverage, Risk and length of intervals

Table 2 :
Model II : Bayesian CR, Bayes risk and length of intervals corresponds to parametrization β * = β E ∈ R |E| with E being the observed active set and the saturated model corresponds to a parametrization β * = µ ∈ R n .The corresponding X * 's in the two models are X E and the identity matrix I n respectively.Of course, other models are possible.
ŌE,n∈R E | Sn = swhere Ō−E,n is the vector of E coordinates of Ōn and similarly, ŌE,n is defined.Denoting Ō−E,n ∈ R −E | ŌE,n = o E , Sn = s) = 1 n log B(o E ,s) imsart-generic ver.2008/08/29 file: primal_dual_approach.tex date: September 12, 2017 we know that G s n (o E ) converges to a continuous function G s (o E ) uniformly on o E ∈ R E using a large deviation rate.Applying Lemma 3 gives ŌE , s) 1 ŌE ∈R E