A flexible convex relaxation for phase retrieval

Abstract: We propose a flexible convex relaxation for the phase retrieval problem that operates in the natural domain of the signal. Therefore, we avoid the prohibitive computational cost associated with "lifting" and semidefinite programming (SDP) in methods such as PhaseLift, and compete with recently developed non-convex techniques for phase retrieval. We relax the quadratic equations for phaseless measurements to inequality constraints, each of which represents a symmetric "slab". Through a simple convex program, our proposed estimator finds an extreme point of the intersection of these slabs that is best aligned with a given anchor vector. We characterize geometric conditions that certify success of the proposed estimator. Furthermore, using classic results in statistical learning theory, we show that for random measurements the geometric certificates hold with high probability at an optimal sample complexity. We demonstrate the effectiveness of the proposed method through numerical simulations using both independent random measurements and coded diffraction patterns. We also extend this formulation to include sparsity constraints on the target vector. With this additional constraint, we show that, considering "nested" measurements, the number of phaseless measurements needed to recover the sparse vector is essentially the same (to within a logarithmic factor) as the number of linear measurements needed by standard compressive sensing techniques.


Introduction
Let x ∈ C^N be a signal that we would like to recover from noisy phaseless measurements

b_i = |a_i^* x|^2 + ξ_i,    i = 1, 2, ..., M,    (1)

Fig 1. A two-dimensional illustration of slabs intersecting at x
where the measurement vectors a_i ∈ C^N are given. To solve this phase retrieval problem with provable accuracy, different methods that rely on semidefinite relaxation have been proposed previously [e.g., 8, 7, 36]. While these methods are guaranteed to produce an accurate solution in polynomial time, they are not scalable because of their use of semidefinite programming (SDP). This drawback of SDP-based methods has motivated the development of alternative non-convex methods that operate in the natural domain of the signal and exhibit better scalability [e.g., 25, 10]. In this paper we follow a completely different approach and propose a convex relaxation of the phase retrieval problem that not only produces accurate solutions but is also scalable. Compared to the non-convex phase retrieval methods, our approach inherits the flexibility of convex optimization in both analysis and application. In Section 3.3, we demonstrate this flexibility by showing how sparsity constraints can (almost seamlessly) be incorporated into the recovery algorithm.
The geometric idea at the core of our proposed method is the following. Relaxing each measurement equation in (1) to an inequality |a_i^* x|^2 ≤ b_i creates a symmetric slab S_i of feasible solutions, as illustrated in Figure 1. Collectively, these slabs describe a "complex polytope" K of feasible solutions. In the noiseless regime (i.e., ξ_i = 0 for all i), the target signal x would be one of the extreme points of K. To distinguish x among all of the extreme points, our idea is to find a hyperplane tangent to K at x. The crucial ingredient in this approach is an "anchor" vector a_0 ∈ C^N \ {0} that acts as the normal for the desired tangent hyperplane and is required to have a non-vanishing correlation with x in the sense that

|⟨a_0, x⟩| ≥ δ ||a_0||_2 ||x||_2    (2)

for some absolute constant δ ∈ (0, 1). This geometric intuition is explained in more detail in Section 3.1. While our main result simply assumes that the anchor vector is given by an oracle, which may use the existing measurements, we discuss in Section 1.1 some realistic scenarios where a valid anchor vector exists or can be computed. We assume that the noise is non-negative (i.e., ξ_i ≥ 0) and that ||ξ||_∞ ≤ η^{-1} ||x||_2^2 for some constant η > 0. Note that the non-negativity of the noise can be dropped at the cost of a slight reduction in the effective signal-to-noise ratio; in particular, one can add the noise upper bound (i.e., η^{-1} ||x||_2^2) to each measurement to ensure non-negativity. Throughout, we treat C^N as an inner-product space over R equipped with the symmetric inner product ⟨x_1, x_2⟩ = Re(x_1^* x_2). Clearly, in this setting C^N is a 2N-dimensional vector space. With these assumptions in place, we propose the solution to the convex program

x̂ = arg max_x ⟨a_0, x⟩  subject to  |a_i^* x|^2 ≤ b_i,  i = 1, 2, ..., M,    (3)

as a computationally efficient estimator for x. Of course, the points equal to x up to a global phase, namely T_x = {e^{ıφ} x : φ ∈ [0, 2π)}, yield the same phaseless measurements. Therefore, the goal is merely to estimate a point in T_x accurately from the phaseless measurements (1).
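As a concrete illustration of the program (3), note that for real-valued signals and measurements each slab constraint |a_i^T x|^2 ≤ b_i is the pair of linear inequalities −√b_i ≤ a_i^T x ≤ √b_i, so (3) reduces to a linear program. The sketch below (an illustrative simplification with hypothetical problem sizes, not the solver used in the paper's experiments) solves this real-valued instance with SciPy:

```python
import numpy as np
from scipy.optimize import linprog

def phasemax_real(A, b, a0):
    """Real-valued sketch of program (3): maximize <a0, x> subject to the
    slab constraints |a_i^T x| <= sqrt(b_i) for every row a_i of A."""
    s = np.sqrt(b)
    A_ub = np.vstack([A, -A])          # A x <= s and -A x <= s
    b_ub = np.concatenate([s, s])
    # linprog minimizes, so negate the objective; allow x to be sign-free
    res = linprog(-a0, A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * A.shape[1])
    return res.x

rng = np.random.default_rng(0)
N, M = 20, 120
x_true = rng.standard_normal(N)
A = rng.standard_normal((M, N))
b = (A @ x_true) ** 2                          # noiseless phaseless measurements
a0 = x_true + 0.3 * rng.standard_normal(N)     # anchor correlated with x_true
x_hat = phasemax_real(A, b, a0)
# up to a global sign, x_hat should coincide with x_true
err = min(np.linalg.norm(x_hat - x_true), np.linalg.norm(x_hat + x_true))
```

In the complex case the slabs become second-order-cone constraints, so the same program remains tractable with any SOCP-capable solver.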
In Lemma 2, below in Section 3, we establish a geometric condition that is sufficient to guarantee accurate estimation of x via the convex program (3).
The sufficient condition given by Lemma 2 can be interpreted in terms of the (non-)existence of a particularly constrained halfspace that includes all of the points a_i a_i^* x. For random measurement vectors a_i, this interpretation resembles the model and theory of linear classifiers studied in statistical learning theory, albeit in an unusual regime. Borrowing classic results from this area (summarized in Appendix A), we show that with high probability (3) produces an accurate estimate of x.
Specifically, in our main result, Theorem 1 in Section 3, we show that drawing M ≳ N i.i.d. random measurements, with the hidden constant factor on the right-hand side depending on δ, would suffice for the conditions of Lemma 2 to hold with probability ≥ 1 − ε. Consequently, the solution x̂ of (3) would obey ||x̂ − x||_2 ≲ η^{-1} ||x||_2.

Choosing the anchor vector
Our approach critically depends on the choice of the anchor vector a 0 that obeys an inequality of the form (2). Below we discuss two interesting scenarios where such a vector would be accessible.
Non-negative signals: Perhaps the simplest scenario is when the target signal x is known to be real and non-negative. In usual imaging modalities these model assumptions are realistic, as natural images are typically represented by pixel intensities. For these types of signals we can choose a_0 = (1/√N) 1, the normalized all-ones vector. Then, for (2) to hold it suffices that ||x||_1 ≥ δ√N ||x||_2 for some absolute constant δ ∈ (0, 1). In particular, we need x to have at least δ²N non-zero entries.
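The last claim is a direct consequence of the Cauchy-Schwarz inequality: for the unit-norm all-ones anchor, ⟨a_0, x⟩ = ||x||_1/√N whenever x is entry-wise non-negative, and ||x||_1 ≤ √(||x||_0) ||x||_2 shows the anchor condition forces at least δ²N non-zero entries. A quick numerical sanity check (with hypothetical values, for illustration only):

```python
import numpy as np

rng = np.random.default_rng(1)
N = 1000
x = np.abs(rng.standard_normal(N))   # a real, non-negative signal (hypothetical example)

a0 = np.ones(N) / np.sqrt(N)         # the normalized all-ones anchor
corr = a0 @ x                        # equals ||x||_1 / sqrt(N) because x >= 0
assert np.isclose(corr, np.linalg.norm(x, 1) / np.sqrt(N))

# Cauchy-Schwarz: ||x||_1 <= sqrt(||x||_0) * ||x||_2, so the anchor condition
# ||x||_1 >= delta * sqrt(N) * ||x||_2 forces at least delta^2 * N nonzero entries.
assert np.linalg.norm(x, 1) <= np.sqrt(np.count_nonzero(x)) * np.linalg.norm(x)
```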
Random measurements: A more interesting scenario is when we can construct the vector a_0 from the (random) measurements. An effective strategy is to set a_0 to be the principal eigenvector of the matrix Σ = (1/M) Σ_{i=1}^M b_i a_i a_i^*. The principal eigenvector of Σ and its "truncated" variants have been used previously for initialization of the Wirtinger Flow algorithm [10] and its refined versions [11, 42]. For example, the following result is shown in [10, Section VII.H].
This result guarantees that the principal eigenvector v of Σ is also a valid anchor vector.
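A minimal sketch of this spectral construction, assuming access to the measurement matrix A (with rows a_i^*) and measurements b; the 50 power iterations match the setting used in the numerical experiments below:

```python
import numpy as np

def spectral_anchor(A, b, iters=50, seed=0):
    """Approximate the principal eigenvector of Sigma = (1/M) sum_i b_i a_i a_i^*
    by the power method, without ever forming the N x N matrix Sigma."""
    M, N = A.shape
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(N) + 1j * rng.standard_normal(N)
    v /= np.linalg.norm(v)
    for _ in range(iters):
        v = A.conj().T @ (b * (A @ v)) / M   # apply Sigma matrix-free
        v /= np.linalg.norm(v)
    return v

rng = np.random.default_rng(2)
N, M = 64, 640
x = rng.standard_normal(N) + 1j * rng.standard_normal(N)
x /= np.linalg.norm(x)
A = (rng.standard_normal((M, N)) + 1j * rng.standard_normal((M, N))) / np.sqrt(2)
b = np.abs(A @ x) ** 2
a0 = spectral_anchor(A, b)
# the spectral estimate should correlate well with the unit-norm target
corr = np.abs(np.vdot(a0, x))
```

Each power iteration costs two matrix-vector products with A, which is what keeps the anchor construction cheap even at large N.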

Sparse phase retrieval using an ℓ1 regularizer
One of the benefits of posing the phase retrieval problem as a convex program in the natural domain of the signal is that the formulation is extensible. If we have additional information about the signal to be recovered that can be expressed using convex constraints and/or penalties, then the recovery procedure can take advantage of this information while remaining computationally tractable. A natural model for many imaging applications is that the target vector x is sparse. Here, we encourage the solution to be sparse by enforcing a constraint on the ℓ1 norm:

x̂ = arg max_x ⟨a_0, x⟩  subject to  |a_i^* x|^2 ≤ b_i for i = 1, 2, ..., M, and ||x||_1 ≤ R,    (4)

where R denotes the ℓ1 norm of the target signal. For our analysis below, we make the simplifying assumption that ||x||_1 is known. In practice, the problem can always be put into Lagrange form, with an appropriate multiple of ||x||_1 subtracted from the objective ⟨a_0, x⟩. As in the case of ordinary phase retrieval, a complete analysis of (4) requires (i) a rigorous method to construct the anchor, and (ii) a sample complexity analysis given the anchor vector. For generic Gaussian measurements, there are obstacles that make a tight mathematical analysis of these procedures complicated. To keep things simple, we will introduce some simple structure on the measurement vectors a_i; this structure will make the sample complexity analysis a relatively straightforward extension of the ordinary case, and will give us a concrete procedure for constructing an anchor vector from a nearly optimal number of samples. We note, though, that a near-optimal sample complexity analysis is possible using an analysis based on Rademacher complexity (see, e.g., the general framework in [3]) in place of the VC-based analysis in Section 3. With generic measurements, the main obstacle currently is the construction of an anchor with optimal sample complexity. We are not aware of any computationally efficient construction that improves on the suboptimal sample complexity that is quadratic in ||x||_0, but such constructions may exist (see, e.g., the discussion in [32]).
In this paper, we consider measurement vectors that have a "nested structure", a model proposed in [20, 1, 39]. Specifically, we have

a_i = Ψ w_i,    (5)

where the w_i ∈ C^{M_0} are i.i.d. standard complex Gaussian vectors, and Ψ ∈ C^{N×M_0} is a (random) matrix satisfying the restricted isometry property (RIP), i.e., for some constant γ_K ∈ (0, 1) we have

(1 − γ_K) ||x||_2^2 ≤ ||Ψ^* x||_2^2 ≤ (1 + γ_K) ||x||_2^2    (6)

for all K-sparse vectors x. These conditions are usually fulfilled for random Ψ [4] when M_0 ≳ K log(N/K). Our second main result, stated precisely as Theorem 2 in Section 3.3 below, shows that if we take noisy phaseless measurements of a K-sparse vector x of the form (5), and we have an anchor vector in place that satisfies (2), then solving (4) with

M ≳ M_0 ≍ K log(N/K)    (7)

measurements produces (with high probability) an estimate x̂ that again obeys ||x̂ − x||_2 ≲ η^{-1} ||x||_2. Thus the number of phaseless nested measurements needed to recover x is of the same order as if we had taken linear measurements ⟨a_i, x⟩ and used standard compressive sensing techniques to form an estimate. The nested measurements also give us a technique for forming a data-driven anchor vector. To do this, we start by forming a matrix similar to the one used for the ordinary case above, but formed from the w_i instead of the a_i. We compute the principal eigenvector w_0 of this matrix, and then form the anchor vector a_0 by hard thresholding Ψ w_0, retaining only its K largest terms. In Section 3.3, we show that when M ≳ M_0 log M_0, this procedure results in an anchor vector that is sufficiently correlated with x. Thus it appears we merely pay an extra log factor over the number of measurements in the sample complexity bound (7).
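The hard-thresholding step of this anchor construction can be sketched in a few lines. The helper below is an illustrative sketch with hypothetical dimensions; it keeps the K largest-magnitude entries of Ψ w_0 and normalizes the result:

```python
import numpy as np

def sparse_anchor(Psi, w0, K):
    """Hard-threshold Psi @ w0 to its K largest-magnitude entries, then
    normalize (a sketch of the anchor construction for nested measurements
    a_i = Psi w_i)."""
    v = Psi @ w0
    a0 = np.zeros_like(v)
    keep = np.argsort(np.abs(v))[-K:]    # indices of the K largest magnitudes
    a0[keep] = v[keep]
    return a0 / np.linalg.norm(a0)

rng = np.random.default_rng(3)
N, M0, K = 200, 40, 10                   # hypothetical sizes for illustration
Psi = rng.standard_normal((N, M0)) / np.sqrt(M0)
w0 = rng.standard_normal(M0)
w0 /= np.linalg.norm(w0)
a0 = sparse_anchor(Psi, w0, K)
```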

Other variations and extensions
There are several other ways the method proposed above can be extended. We will briefly discuss some of these here, and leave further exploration for future research. While the core geometric ideas still apply to these extensions, significant modifications of our theoretical arguments would be necessary to analyze these extensions in full. The gross noise model considered in this paper can be pessimistic in scenarios where we have random noise or deterministic noise with a different type of bound. In these scenarios, augmenting the estimator by a noise regularization term could result in accuracy bounds that gracefully vary with the considered noise.
Another interesting extension of the proposed method is to adapt the current theory to the case of blockwise independent measurements, as in coded diffraction imaging. Our numerical experiments in Section 2 suggest that the proposed method still performs well with these structured measurements. Nevertheless, to extend the analysis we may need to revise the current simple arguments based on Vapnik-Chervonenkis theory using more sophisticated tools from the theory of empirical processes.
Finally, it is possible to extend our convex formulation to the more general problem of recovering a low-rank positive semidefinite matrix from rank-one projections. Suppose that an N × N rank-R positive semidefinite matrix S = X X^T is observed through measurements of the form

b_i = ||X^T a_i||_2^2 + ξ_i = a_i^T S a_i + ξ_i,

where the a_i are measurement vectors and the ξ_i represent measurement noise; instead of measuring the magnitude of an inner product, we are measuring the norm of a matrix-vector multiplication. One application of measurements of this type is the problem of covariance sketching [6, 12, 2]. Similar to (3), we can estimate X via the convex program

maximize_X ⟨A_0, X⟩  subject to  ||X^T a_i||_2^2 ≤ b_i,  i = 1, 2, ..., M,

where A_0 ∈ R^{N×R} is an anchor matrix that is appropriately correlated with X, in the sense of a condition analogous to (2) expressed through a matrix U that has orthogonal columns, each with ℓ2-norm 1/√R, that span the column space of X.

Related work
There is a large body of research on phase retrieval addressing various aspects of the problem (see [21] and the references therein). However, we focus only on the relevant results, mostly developed in recent years. Perhaps among the most important developments are PhaseLift and similar methods that cast the phase retrieval problem as a particular semidefinite program [8, 7, 36]. The main idea used in [8] and [7] is that by lifting the unknown signal using the transformation x → X = xx^*, the (noisy) phaseless measurements (1) that are quadratic in x can be converted to linear measurements of the rank-one positive semidefinite matrix X = x x^*. With this observation, these SDP-based methods aim to solve the corresponding linear equations using the trace norm to induce the rank-one structure in the solution. Inspired by the well-known convex relaxation of the Max-Cut problem, the PhaseCut method [36] considers the measurement phases as the unknown variables and applies a similar lifting transform to formulate a different semidefinite relaxation for phase retrieval. While these SDP-based methods are shown to produce accurate estimates of X at optimal sample complexity for certain random measurement models, they become computationally prohibitive in medium- to large-scale problems where SDP is practically inefficient.
More recently, there has been growing interest in non-convex iterative methods for phase retrieval [see e.g., 25, 10, 30, 11, 42, 37, 33]. These methods typically operate in the natural space of the signal and thus do not suffer the drawbacks of the SDP-based methods. In [25], using a specifically chosen initialization, accuracy guarantees are provided for a variant of the classic methods of Gerchberg and Saxton [15] and Fienup [14]. This alternating minimization method of [25] updates an estimate of the signal assuming an estimate of the measurements' phase, and vice versa, through several iterations. The established sample complexity for this approach is (nearly) optimal in the dimension of the target signal, but it does not vary gracefully with the prescribed precision. Phase retrieval via the Wirtinger Flow (WF), at its core a non-convex gradient descent method, is proposed in [10]. It is shown that for random measurements with a normal distribution, or for certain coded diffraction patterns, the WF iterates with an appropriate initialization exhibit a linear rate of convergence to the target signal. More recent work on the WF method introduces better initializations that exclude the outlier measurements, achieving the optimal sample complexity [11, 42]. The WF class of algorithms and our proposed method both achieve optimal sample complexity (up to the constant factor) and have low computational cost. However, the WF methods need careful tuning of a step-size parameter and their convergence analysis often relies on Gaussian measurements. This is partly because establishing robustness of non-convex methods generally requires stronger conditions. Our method provably works for a broader set of measurement distributions, has no tuning parameters, and can be implemented in various convex optimization software packages.
Shortly after a draft of this manuscript was first posted online, a few independent papers proposed and analyzed the same method and its variants. In [16], which refers to (3) as PhaseMax, a sharper constant in the sample complexity is obtained under the stronger assumption that the anchor is independent of the measurements. Alternative proofs and variations that rely on matrix concentration inequalities appeared later in [19, 18, 17]. Another distinctive feature of our analysis compared to the mentioned results is that it is less sensitive to the measurement distribution, as it relies on VC-type bounds.
Until very recently, convex programming approaches to sparse phase retrieval considered some form of semidefinite relaxation combined with ℓ1-regularization to estimate the lifted sparse and rank-one target matrix [26, 22]. With standard Gaussian measurements, these methods are shown to require M = O(K²) measurements to produce accurate solutions [22]. This suboptimal sample complexity is observed more generally in the estimation of simultaneously sparse and low-rank matrices using naïve convex relaxations [27]. The prohibitive computational cost of (sparse) phase retrieval methods based on semidefinite programming motivated the study of iterative non-convex approaches. Among the non-convex methods with provable guarantees, methods based on alternating minimization [25] and gradient descent [38] are shown to produce accurate solutions for sparse phase retrieval with suboptimal sample complexity M = O(K²).
While achieving optimal sample complexity for sparse phase retrieval with generic Gaussian measurements is still open, a few techniques that exploit specially structured measurements are shown to reach optimal sample complexity with a computationally efficient algorithm. Examples of these methods include [28,40] that construct measurements based on ideas from channel coding, as well as [20,1,39] that rely on "nested" measurements that are restricted to a low-dimensional subspace.
More recently, it is shown in [32] that, given a good initialization, non-convex projected gradient descent with a sparsity constraint can produce accurate solutions to sparse phase retrieval with M = O(K log(N/K)) Gaussian measurements. Furthermore, [17] considers the ℓ1-regularized form of (4) in the real-valued and noiseless scenario and shows that it achieves optimal sample complexity assuming the anchor vector is given. We emphasize that our general framework developed in [3] can establish optimal sample complexity for the ℓ1-regularized variant of (4), but it relies on tools other than VC-dimension to control the sample complexity.

Numerical experiments
We evaluated the performance of our proposed method on synthetic data with the target signal x ∼ Normal(0, ½I) + ıNormal(0, ½I) and i.i.d. standard complex Gaussian measurement vectors a_i, under two noise models: bounded non-negative noise in one case and noise ∼ Normal(0, σ²) in the other. For the latter noise model we replaced any negative b_i by b_i = 0 to avoid negative measurements, and defined the input signal-to-noise ratio SNR as 10 log₁₀ of the ratio of the energy of the noiseless measurements to the energy of the noise. The vector a_0 is constructed as in the initialization of the Wirtinger Flow mentioned in Lemma 1, through 50 iterations of the power method. We implemented the convex program (3) in TFOCS [5] with smoothing parameter μ = 2 × 10⁻³ and at most 500 iterations. Figure 2 illustrates the 0.9-quantile and median of the relative error min_{φ∈[0,2π)} ||x̂ − e^{ıφ} x||_2 / ||x||_2 observed over 100 trials of our algorithm for different sampling ratios M/N between 2 and 17. The plots also show the effect of different levels of noise on the relative error for both of the considered noise models.
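The relative error min_{φ∈[0,2π)} ||x̂ − e^{ıφ} x||_2 / ||x||_2 does not require a search over φ: the minimizing phase is available in closed form from the inner product of the two vectors. A small sketch (our own derivation, not code from the paper):

```python
import numpy as np

def rel_error(x_hat, x):
    """Relative error up to a global phase: min over phi of
    ||x_hat - e^{i phi} x||_2 / ||x||_2. The minimizing phase has the closed
    form e^{i phi} = conj(<x_hat, x>) / |<x_hat, x>|, so no grid search is needed."""
    inner = np.vdot(x_hat, x)      # x_hat^* x (vdot conjugates its first argument)
    phase = inner.conjugate() / abs(inner) if abs(inner) > 0 else 1.0
    return np.linalg.norm(x_hat - phase * x) / np.linalg.norm(x)

rng = np.random.default_rng(4)
x = rng.standard_normal(8) + 1j * rng.standard_normal(8)
x_hat = np.exp(0.7j) * x           # the same signal up to a global phase
assert rel_error(x_hat, x) < 1e-12
```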
We also have another set of experiments comparing our method with the Wirtinger Flow method [10]. In these experiments we generate the measurement matrix A = [a_1 a_2 ··· a_M]^* with i.i.d. standard complex Gaussian entries for different signal dimensions N and sampling ratios M/N. For each choice of M and N we compared the performance of our approach and the Wirtinger Flow in the noiseless setting over 75 trials. Figure 3 compares the two methods in terms of the achieved accuracy. The graph shows the median of the relative error achieved by each of the two methods over the 75 trials. As can be seen, the Wirtinger Flow succeeds at a smaller sampling ratio (i.e., M/N = 4) whereas our approach requires essentially twice as many measurements to succeed (i.e., M/N = 8). However, when both methods recover the signal accurately, our approach appears to be less demanding in terms of computation, and thus running time. This claim is supported by Figure 4, which compares the median of the number of multiplications by A or A^* in each of the two approaches. We excluded the operation counts of the initialization for the Wirtinger Flow and of the anchor construction for (3) from the reported numbers, as both are based on the method in Lemma 1. As Figure 4 suggests, the computational cost of our method peaks at M/N = 7, which can be explained as follows. At sampling ratios where our method fails (i.e., M/N < 8), the convex solver might not achieve the tolerance level set to halt and runs for the entire prescribed maximum number of iterations. Therefore, for these sampling ratios the convex solver runs more iterations (without success) compared to the cases with M/N ≥ 8 where the method succeeds. We examined the numerical performance of (4) in the sparse phase retrieval problem as well. Let M_0 = 2K log(N/K) and M = cM_0. For N = 1000 and different values of K and c, we generated M measurement vectors according to (5) with Ψ ∈ C^{N×M_0} and the w_i ∈ C^{M_0} populated with i.i.d.
standard complex Gaussian entries. Figure 5 illustrates the median of the relative error achieved by (4) over 100 trials. As the graphs suggest, for the considered range of K the proposed estimator recovers the signal (up to the numerical tolerance) once the oversampling factor c is sufficiently large.

Fig 5. Relative error achieved by (4) for sparse phase retrieval

Furthermore, we evaluated our method using noiseless measurements with coded diffraction patterns as described in [9]. Specifically, with indices i = (k, ℓ) for 1 ≤ k ≤ N and 1 ≤ ℓ ≤ L, we used measurements of the form a_i = f_k ∘ φ_ℓ, which is the pointwise product of the k-th discrete Fourier basis vector (i.e., f_k) and a random modulation pattern with i.i.d. symmetric Bernoulli entries (i.e., φ_ℓ). The target signal is an N = 960 × 1280 ≈ 1.2 × 10⁶ pixel image of a Persian leopard. We used L = 20 independent coded diffraction patterns {φ_ℓ}_{1≤ℓ≤L}. Therefore, the total number of (scalar) measurements is M = LN ≈ 2.5 × 10⁷. Similar to the first simulation, the vector a_0 is constructed as the (approximate) principal eigenvector of (1/M) Σ_i b_i a_i a_i^* through 50 iterations of the power method. The convex program is again solved using TFOCS, but this time with smoothing parameter μ = 10⁻⁶ and with the total number of forward and adjoint applications of the coded diffraction operator restricted to 500. The recovered image is depicted in Figure 6; it has a relative error of about 8.2 × 10⁻⁸.
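Coded diffraction measurements of this form can be applied with FFTs rather than an explicit M × N matrix, which is what makes the image-scale experiment feasible. A sketch of the forward operator, under one common sign/normalization convention (the paper's exact convention may differ):

```python
import numpy as np

def cdp_measure(x, masks):
    """Coded-diffraction measurements: for each modulation pattern phi_l, the N
    values |<f_k o phi_l, x>|^2 over k form one FFT of the modulated signal,
    so all L*N measurements cost L FFTs instead of an M x N matrix multiply."""
    return np.abs(np.fft.fft(masks * x, axis=1)) ** 2   # shape (L, N)

rng = np.random.default_rng(5)
N, L = 128, 8
x = rng.standard_normal(N) + 1j * rng.standard_normal(N)
masks = rng.choice([-1.0, 1.0], size=(L, N))   # i.i.d. symmetric Bernoulli patterns
b = cdp_measure(x, masks)

assert b.shape == (L, N) and np.all(b >= 0)
# Parseval (NumPy's unnormalized FFT): per-mask energy is N * ||phi_l * x||_2^2
assert np.allclose(b.sum(axis=1), N * np.linalg.norm(masks * x, axis=1) ** 2)
```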

Theoretical analysis
In this section we provide the precise statement of our results. The main proofs are provided in Appendix B, and proofs of technical lemmas can be found in Appendix C. For simplicity of notation and derivation, but without loss of generality, we make the following assumptions. We assume that a_0^* x is a positive real number, since any point in T_x is a valid target. Furthermore, we assume that x is unit-norm (i.e., ||x||_2 = 1), and thus the bound on the noise reduces to ||ξ||_∞ ≤ η^{-1}. We first establish, in Lemma 2, a geometric condition for the success of phase retrieval through (3). Then we use this lemma to prove our main result for random measurements in Theorem 1. We also rely on tools from statistical learning theory that are outlined in Appendix A.

Geometry of intersecting slabs
To understand the geometry of (3) it is worthwhile to first consider the noiseless scenario. The feasible set is the intersection of the sets S_i = {x ∈ C^N : |a_i^* x|^2 ≤ b_i} corresponding to the pairs (a_i, b_i) for i = 1, 2, ..., M. The sets S_i are effectively symmetric "complex slabs". Denote their intersection by K = ∩_{i=1}^M S_i. In (3) the objective function is linear, thus its maximizer is an extreme point of the convex constraint set K. Clearly, x, as well as any other point in T_x, is an extreme point of K. However, K typically has other extreme points that are not equivalent to x. Intuitively, using the non-vanishing correlation of a_0 with x, the convex program (3) effectively eliminates the superfluous extreme points of K. The geometric interpretation is that the hyperplane normal to a_0 that passes through x is also tangent to K, as Figure 1 suggests. It is not difficult to show that the analogous interpretation from the dual point of view is that a_0 is in the interior of cone{a_i a_i^* x}_{1≤i≤M} (i.e., the conical hull of the points a_i a_i^* x). More generally, with noisy measurements, K is still a symmetric complex polytope that is convex and includes T_x, due to the non-negativity of the noise. We would like to find conditions that guarantee that the solution to (3) is close to x. More specifically, we would like to show that if x̂ = x + h is any solution to (3) and t > 0 is some constant, then the relevant inequalities cannot hold simultaneously with ||h||_2 > (tη)^{-1}. The following lemma provides the desired sufficient condition.

Lemma 2. Let R_δ be the set defined in (11) and ε ≥ 0 be some constant. If every vector h ∈ R_δ with ||h||_2 > ε violates at least one of the inequalities (12), then any solution x̂ to (3) obeys ||x̂ − x||_2 ≤ ε.

Proof. It suffices to show that h = x̂ − x obeys (12) and belongs to R_δ. Feasibility of x and optimality of x̂ guarantee that ⟨a_0, h⟩ = ⟨a_0, x̂⟩ − ⟨a_0, x⟩ ≥ 0. Therefore, we have shown that h satisfies (12).
The constraints of (3) are invariant under a global change of phase (i.e., the action of T). It easily follows that the solution x̂ to (3) should obey Im(a_0^* x̂) = 0. Therefore, we have Im(a_0^* h) = 0, as we assumed α = a_0^* x ∈ R. The same assumption also implies that a_0 = αx + a_{0⊥} for a_{0⊥} = (I − x x^*) a_0, which clearly obeys x^* a_{0⊥} = 0. Thus, using the triangle inequality and the bound (2), we obtain an inequality that completes the proof, as it is equivalent to h ∈ R_δ.

Guarantees for random measurements
In this section we will show that if the vectors a_i for 1 ≤ i ≤ M are drawn from a random distribution and (2) holds for a sufficiently large constant δ, then with high probability (3) produces an accurate estimate of x. Our strategy is to show that for a sufficiently large M, the sufficient condition provided in Lemma 2 holds with high probability. For δ ∈ (0, 1), let C_δ be the convex cone given by

C_δ = {y ∈ C^N : x^* y ≥ δ ||y||_2},    (13)

where x^* y is implicitly assumed to be a real number. The polar cone of a set C is defined as C° := {z : ⟨z, y⟩ ≤ 0 for all y ∈ C}.
It is easy to verify the form of the polar cone of C_δ. Since a_0 ∈ C_δ by assumption, it follows that for every h ∈ C°_δ we have ⟨a_0, h⟩ ≤ 0. Therefore, the inequality ⟨a_0, h⟩ ≥ 0 can hold only for vectors in the closure of the complement of C°_δ. A typical positioning of a_0 and the points a_i a_i^* x needed to guarantee unique recovery is illustrated in Figure 7.

Theorem 1. Suppose that noisy phaseless measurements of a unit vector x are given as in (1) under bounded non-negative noise ξ_1, ξ_2, ..., ξ_M ∈ [0, η^{-1}]. Let a_0 be an anchor vector obeying (2) for some constant δ ∈ (0, 1), and define R_δ and C_δ respectively as in (11) and (13). Furthermore, given a constant t > 0, suppose that for 1 ≤ i ≤ M the measurement vectors a_i are i.i.d. copies of a random vector a ∈ C^N that obeys the lower bound (14) for a constant p_min(δ, t) ∈ (0, 1) depending only on the parameters δ and t. For any ε > 0, if M is sufficiently large, then with probability ≥ 1 − ε the estimate x̂ obtained through (3) obeys ||x̂ − x||_2 ≤ (tη)^{-1}.

To have a concrete example, we can apply Theorem 1 to the case of measurements with a normal distribution. Clearly, it suffices to quantify the constant p_min(δ, t); Lemma 5 in Appendix C provides an explicit admissible choice. The dependence of the sample complexity on p_min(δ, t) in Theorem 1 is not optimal and can be improved. In our analysis the most critical factor that determines the sample complexity is a uniform lower bound for a certain empirical process. While we used Theorem 3 to establish the desired lower bound, a sharper result can be obtained using "variance-normalized" variants of the theorem. For example, the term 1/p²_min(δ, t) in the sample complexity guaranteed by Theorem 1 can be improved to 1/p_min(δ, t) by using [34, Theorem 4.4(2)] instead of Theorem 3 in the analysis.

Sparse phase retrieval
Theorem 2 below gives a precise recovery guarantee (in terms of the number of samples needed) for recovering a K-sparse vector. Its proof, which can be found in Section B.2, follows that of Theorem 1, with only a modification of the set C_δ in (13) that reflects the additional constraint that h be a descent vector for the ℓ1 norm at x.
To ease some of the notation, in what follows we denote the phase of any z ∈ C by z/|z|; for any complex vector z, this operation applies entry-wise.

Theorem 2.
Suppose that noisy phaseless measurements of a K-sparse unit vector x supported on X are given as in (1) under bounded non-negative noise ξ_1, ξ_2, ..., ξ_M ∈ [0, η^{-1}]. Let a_0 be an anchor vector obeying (2) for some constant δ ∈ (0, 1), and define R_δ and C_δ respectively as in (11) and (13). Furthermore, given a constant t > 0, suppose that the measurement vectors obey the analog of the lower bound (14) for a constant p_min(δ, t) ∈ (0, 1) that depends on the parameters δ and t, and implicitly on Ψ. For any ε > 0, if M is sufficiently large, then with probability ≥ 1 − ε the estimate x̂ obtained through (4) obeys ||x̂ − x||_2 ≤ (tη)^{-1}. As mentioned above, with M_0 ≳ K log(N/K) the matrix Ψ can satisfy (6) at sparsity level K with a constant γ_K < 1. Therefore, Theorem 2 implies that it suffices to have M ≳ K log(N/K) to guarantee accuracy of (4) with probability ≥ 1 − ε.
There is one important thing that we have left unaddressed. Unlike ordinary unconstrained phase retrieval, we do not show the existence of a constant p_min(δ, t) in (14) for a particular probability model for the w_i. This appears to be a very technical calculation even in the case where the w_i are i.i.d. Gaussian, and so we leave it for future work.
Constructing the anchor: As described above, the first step in forming the anchor vector is computing the principal eigenvector w_0 of the matrix formed from the w_i. When M ≳_δ M_0 log M_0, Lemma 1 above guarantees that (15) holds, and Lemma 3 below shows that the best K-sparse approximation of Ψ w_0 is sufficiently correlated with x. The proof is provided in Appendix B.

Lemma 3.
Let a_0 denote the best K-sparse approximation of Ψ w_0, where w_0 is a unit vector that satisfies (15), and suppose that Ψ satisfies (6) at sparsity level 2K with parameter γ_2K. Then a_0 is a valid anchor for which (2) holds, with a constant δ determined by the constant in (15) and γ_2K. For concreteness, if (15) holds with constant 7/8 and γ_2K = 1/20, then Lemma 3 guarantees that we can have δ = 1/3.

Appendix A: Tools from statistical learning theory
For reference, here we provide some of the classic results in statistical learning theory that we employed in our analysis. We mostly follow the exposition of the subject presented in [13, Chapters 13 and 14]. Intuitively, the shatter coefficient s(F, n) is the largest number of binary patterns that the functions in F can induce on n points. If F can induce all binary patterns on n points, F is said to "shatter" those n points. Therefore, the VC-dimension of F is the largest number of points that F can shatter.

Lemma 4 (Vapnik and Chervonenkis [35], Sauer [29], Shelah [31]). For a class F of binary functions with VC-dimension d = dim_VC(F) we have

s(F, n) ≤ Σ_{i=0}^{d} (n choose i).

In particular, s(F, n) ≤ (n + 1)^d. The following theorem is originally due to Vapnik and Chervonenkis [35]. We restate it as presented in [13].

Theorem 3 (Vapnik and Chervonenkis [35]). Let F be a class of binary functions and x_1, x_2, ..., x_n be i.i.d. copies of an arbitrary random variable x. Then for every t > 0 we have

P( sup_{f ∈ F} | (1/n) Σ_{i=1}^n f(x_i) − E f(x) | > t ) ≤ 8 s(F, n) e^{−n t² / 32}.
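The two bounds in Lemma 4 are easy to check numerically; the snippet below (illustrative only) evaluates the Sauer-Shelah sum and confirms it is dominated by the simpler bound (n + 1)^d:

```python
from math import comb

def sauer_bound(n, d):
    """Sauer-Shelah bound on the shatter coefficient:
    s(F, n) <= sum_{i=0}^{d} C(n, i) for a class of VC-dimension d."""
    return sum(comb(n, i) for i in range(d + 1))

# e.g. n = 10 points, VC-dimension d = 3
assert sauer_bound(10, 3) == 1 + 10 + 45 + 120

# the coarser bound s(F, n) <= (n + 1)^d used "in particular" above
assert all(sauer_bound(n, d) <= (n + 1) ** d
           for n in range(1, 30) for d in range(1, 8))
```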

B.1. Theorem 1
The main ingredient in the proof below is the VC theorem (i.e., Theorem 3), which helps bound the deviation of an "empirical probability" from its expectation uniformly.
It then follows, after integrating again, that we obtain the stated expression for F(β) when β < 0. We can calculate F(0) as F(0) = P(αv > g), where the last step follows from the fact that g/√(v² + g²) has a uniform distribution over [−1, 1]. Substituting F(0) into the expression for F(β), straightforward simplifications yield the desired result.