Sketching Meets Random Projection in the Dual: A Provable Recovery Algorithm for Big and High-dimensional Data

Sketching techniques have become popular for scaling up machine learning algorithms by reducing the sample size or dimensionality of massive data sets, while still maintaining the statistical power of big data. In this paper, we study sketching from an optimization point of view: we first show that the iterative Hessian sketch is an optimization process with preconditioning, and develop accelerated iterative Hessian sketch via the searching the conjugate direction; we then establish primal-dual connections between the Hessian sketch and dual random projection, and apply the preconditioned conjugate gradient approach on the dual problem, which leads to the accelerated iterative dual random projection methods. Finally to tackle the challenges from both large sample size and high-dimensionality, we propose the primal-dual sketch, which iteratively sketches the primal and dual formulations. We show that using a logarithmic number of calls to solvers of small scale problem, primal-dual sketch is able to recover the optimum of the original problem up to arbitrary precision. The proposed algorithms are validated via extensive experiments on synthetic and real data sets which complements our theoretical results.


Introduction
Machine Learning has gained great empirical success from the massive data sets collected from various domains. Among them a major challenge is to utilize existing computational resources to build predictive and inferential models from such huge data sets, while maintaining the statistical power of big data. One remedy for the big data challenge is to build distributed computer systems and design distributed learning algorithms to make big data learning possible, however, distributed systems may not always available, and the cost of running distributed system can be much higher than one can afford, which makes distributed learning not suitable for all scenarios. An alternative remedy is to use the state-of-the-art randomized optimization algorithms to accelerate the training process, for example, researchers have proposed optimization algorithms for regularized empirical risk minimization problem, with provable fast convergence and low computational cost per iteration (see (Johnson and Zhang, 2013;Shalev-Shwartz and Zhang, 2013;Defazio et al., 2014) for examples), however, the speed of these optimization methods still heavily depends on the condition number of problem at hand, which can be undesirable for many real world problems.
Sketching (Woodruff, 2014), which approximates the solution via constructing some sketched, usually of smaller scale problem from the original data, has become an emerging technique for big data analytics. With the sketching technique, we can find solutions which approximately solve various forms of original large-scale problem, such as least square regression, robust regression, low-rank approximation, singular value decomposition, just to name a few. For survey and recent advances about sketching, we refer the readers to (Halko et al., 2011;Mahoney, 2011;Lu et al., 2013;Alaoui and Mahoney, 2014;Woodruff, 2014;Raskutti and Mahoney, 2015;Yang et al., 2015a;Oymak and Tropp, 2015;Drineas and Mahoney, 2016) and references therein. However, one major drawback of sketching is that typically it's not suitable for the case if we want high accurate solution: to obtain a solution with exponentially smaller approximation error, we often need to increase the sketching dimension also exponentially.
The situation has become better with recent work on "iterative sketch", e.g. iterative Hessian sketch (IHS) (Pilanci and Wainwright, 2016) and iterative dual random projection (IDRP) (Zhang et al., 2014). These methods are able to refine their approximate solution by iteratively solving some small scale sketched problem. Among these innovations, Hessian sketch (Pilanci and Wainwright, 2016) is designed by reducing the sample size of the original problem, while dual random projection (Zhang et al., 2014) is proposed by reducing the dimension. As a consequence, when the sample size and feature dimension are both large, IHS and IDRP still need to solve relatively large-scale subproblems as they can only sketch the problem from one perspective.
In this paper, we make the following improvement upon previous work: we first propose an accelerated version of IHS which requires the same computational cost to solve the IHS subproblem at each sketching iteration, while with provably fewer number of sketching iterations to reach certain accuracy; we then reveal the primal-dual connections between IHS (Pilanci and Wainwright, 2016) and IDRP (Zhang et al., 2014), which are independently proposed by two different groups of researchers. In particular, we show that these two methods are equivalent in the sense that dual random projection is performing Hessian sketch in the dual space. Finally, to alleviate the computational issues raised by big and high-dimensional learning problems, we propose a primal-dual sketching method that can simultaneously reduce the sample size and dimension of the sketched sub-problem, with provable convergence guarantees.
Organization The rest of this paper is organized as follows: in Section 2 we review the iterative Hessian sketch as an optimization process and propose a new algorithm with faster convergence rate. In Section 3 we show that the dual random projection is equivalent to Hessian sketch, and propose the corresponding accelerated dual random projection as well. In Section 4 we combine the sketching from both primal and dual perspectives, and propose iterative algorithms by reduc-ing both sample size and problem dimension. We provide several theoretical analysis in Section 5, though we defer few technical results to the appendices, and conduct extensive experiments in Section 6. Finally we summarize and discuss several future directions in Section 7.
Notation We use bold-faced letters such as w to denote vectors, and bold-faced capital letters such as X to denote matrices. Given a matrix X P R nˆp , we define the following matrix induced norm for any vector w P R p , }w} X " c w J X J Xw n .
We use N pµ, Σq to denote the multivariate normal distribution with mean µ and covariance Σ. We use I n and I p to denote the identity matrices of size nˆn and pˆp, and λ max pHq and λ min pHq to denote the maximum and minimum eigenvalue of H, respectively. For two sequences ta n u 8 n"1 and ta n u 8 n"1 , we denote a n À b n if a n ď Cb n always holds for n large enough with some constant C, and denote a n Á b n if b n À a n . We also use the notation a n " Opb n q if a n À b n , and use Op¨q for Op¨q to hide logarithmic factors.

Iterative Hessian Sketch as Optimization with Preconditioning
In this section, we first review the the iterative Hessian sketch proposed in (Pilanci and Wainwright, 2016) as an iterative preconditioned optimization process, and then propose a faster iterative algorithm by constructing better sketched problem to solve.
For ease of discussion, consider the following 2 regularized least-squares (a.k.a. ridge regression) problem: where X P R nˆp is the data matrix, y P R n is the response vector. Let w˚denote the optimum of problem (2.1). In real applications both n and p can be very large, thus sketching has become a widely used technique for finding an approximate solution of problem (2.1) efficiently (Drineas et al., 2011;Mahoney, 2011;Woodruff, 2014). In particular, to avoid solving a problem of huge sample size, the traditional sketching techniques (e.g. (Sarlos, 2006;) were proposed to reduce the sample size from n to m, where m ! n, by solving the following sketched problem: where Π P R nˆm is a sketching matrix, and typical choice of Π can be random Gaussian matrix, matrix with Rademacher entries, Sub-sampled Randomized Hadamard Transform (Boutsidis and Gittens, 2013) and Sub-sampled Randomized Fourier Transform (Rokhlin and Tygert, 2008), see discussions in Section 2.1 of (Pilanci and Wainwright, 2016) for details. Though the classical sketching has been successful in various problems with provable guarantees, as shown in (Pilanci and Wainwright, 2016) there exists a approximation limit for the classical sketching methods to be practically useful, that is, to obtain an approximate solution with high precision, the sketching dimension m needs to grow exponentially, which is impractical if we want a high accuracy approximation to the original problem, as the main purpose of sketching is to speed up the algorithms via reducing the sample size.
The main idea of Hessian sketch (Pilanci and Wainwright, 2016) is based on the following equivalent formulation of (2.1): and (Pilanci and Wainwright, 2016) proposed to only sketch the quadratic part }Xw} 2 2 with respect to X, but not the linear part xy, Xwy. So the Hessian sketch considers solving the following sketched problem: min wPR p P HS pX, y; Π, wq " min It is not hard to see that (2.4) has the following closed form solution: We see that different from classical sketching where both data matrix X and the response vector y are both sketched, in Hessian sketch the only sketched part is the Hessian matrix, through the following transform: Though the Hessian sketch suffers from the same approximation limit as the classical sketch, one notable feature of Hessian sketch is that one can implement an iterative extension to refine the approximation to higher precision. To this end, define the initial Hessian sketch approximation as w p1q HS : w p1q HS " arg min w w JˆX J ΠΠ J X 2n`λ 2 I p˙w´1 n xy, Xwy.
After obtaining w p1q HS , we can consider the following optimization problem: It is clear that w˚´ w p1q HS is the optimum for above problem. The main idea of iterative Hessian sketch (IHS) is to approximate the residual w˚´ w p1q HS by Hessian sketch again. At time t, let u ptq be the approximate of w˚´ w ptq HS via solving the following sketched problem: " w ptq HS` u ptq . The algorithm for IHS is shown in Algorithm 1. Since (2.6) is a sketched problem with sample size m, it can be solved more efficiently than the original problem (2.1). Besides, we can reuse the previous sketched data Π J X without constructing any new random sketching matrix. Moreover, (Pilanci and Wainwright, 2016) showed that the approximation error of IHS is exponentially decreasing when we increase the number of sketching iterations, thus IHS could find an approximated solution with -approximation error within Oplogp1{ qq iterations, as long as the sketching dimension m is large enough. Moreover, though this powerful technique is originally focused on the least-squares problem (2.1), the idea of IHS can be extended to solve more general problems, such as constrained least-squares (Pilanci and Wainwright, 2016), optimization with self-concordant loss (Pilanci and Wainwright, 2015a), as well as non-parametric methods (Yang et al., 2015b).
Though IHS improved the classical sketching by enabling us to find an high quality approximation more efficiently, it is imperfect due to the following reasons: • The "exponentially approximation error decreasing" guarantee relies on the basis that the sketching dimension m is large enough. The necessary sketching dimension depends on the intrinsic complexity of the problem, however, if the "sufficient sketching dimension" condition is violated, as we will show in experiments, IHS can even diverge, i.e. we obtain arbitrary worse approximation after applying IHS.
• As we will show later, even when the "sufficient sketching dimension" condition is satisfied, the decreasing speed of the approximation error in IHS can be significantly improved.
Here, we show that the iterative Hessian sketch is in fact an optimization process with preconditioning. For notation simplicity let Then it is not hard to see that the IHS in Algorithm 1 is performing the following iterative update: which is like a Newton update where we replace the true Hessian H with the sketched Hessian H. This update can be derived by using the change of variable z " H 1{2 w, and then applying gradient descent in the z space: Algorithm 2: Accelerated Iterative Hessian Sketch (Acc-IHS).

11
Update v pt`1q "´X J X n`λ I p¯p pt`1q .

Equivalence between Dual Random Projection and Hessian Sketch
While Hessian sketch (Pilanci and Wainwright, 2016) tries to resolve the issue of huge sample size, Dual Random Projection (Zhang et al., , 2014 is aimed to resolve the issue of highdimensionality, where random projection as a standard technique is used to significantly reduce the dimension of data points. Again consider the standard ridge regression problem in (2.1), but now random projection is used to transform the original problem (2.1) to a low-dimensional problem: where R P R pˆd is a random projection matrix, and d ! p. Let z " arg min z P RP pXR, y; zq. If we want to recover the original high-dimensional solution, (Zhang et al., 2014) observed that the naive recovered solution w RP " R z is a bad approximation, and propose to recover w˚from the dual solution, which leads to the dual random projection (DRP) approach. To see this, consider the dual problem of the optimization problem in (2.1) as Let α˚" arg max αPR n DpX, y; αq be the dual optimal solution. By standard primal-dual theory (Boyd and Vandenberghe, 2004), we have the following connection between the optimal primal and dual solutions: The dual random projection procedure works as follows: first, we construct solve the low-dimensional, randomly projected problem (3.1) and obtain the solution z, and then calculate the approximated dual variables by: based on the approximated dual solution α DRP . Then we recover the primal solution as: (3.5) By combining above derivations, it is not hard to see that dual random projection for ridge regression has the following closed form solution: In (Zhang et al., 2014) it has been shown that the recovered solution from dual, i.e. w DRP , is a much better approximation than the recovered solution directly from primal w RP . More specifically, they showed that w RP is always a poor approximation of w˚, because w RP lives in a random subspace spanned by the random projection matrix R. For w DRP , (Zhang et al., 2014) proved that as long as the projected dimension d is large enough, w DRP can be a good approximation of w˚. Moreover, (Zhang et al., 2014) proposed an iterative extension of DRP which can exponentially reducing the approximation error. To do so, suppose at iteration t we have the approximate solution w ptq DRP , and consider the following optimization problem: (3.7) It is clear to see w˚´ w ptq DRP is the optimum solution of above optimization problem. The idea of iterative dual random projection (IDRP) is to approximate the residual w˚´ w ptq DRP by applying dual random projection again. That is, we construct the following randomly projected problem given w ptq DRP : (3.8) Let z ptq to be the solution of (3.8), then we update the refined approximation of dual variables: DRP " y´Xw ptq DRP´X R z, as well as the primal variables DRP . The iterative dual random projection (IDRP) algorithm is shown in Algorithm 3. More generally, (Zhang et al., 2014) showed the iterative dual random projection can be used to solve any 2 regularized empirical loss minimization problem as long as the loss function is smooth, typical examples include logistic regression, support vector machines with smoothed hinge loss, etc.
Though a powerful technique to cope with high-dimensionality issue, IDRP suffers from the same limitations as IHS: i) it requires the "large projection dimension" condition to make the approximation error decreasing, ii) the convergence speed of IDRP is not optimal. As will shown later, actually the dual random projection is equivalent to apply the Hessian sketch procedure on the dual problem, and we propose an accelerated IDRP approach to overcome above discussed limitations.
Solve the projected problem in (3.8) and obtain solution z ptq .

Dual Random Projection is Hessian Sketch in Dual Space
In this section we present one of the key observation, i.e., the equivalence between Hessian sketch and dual random projection. Note that the Hessian sketch is used for sample reduction, while the dual random projection is utilized for dimension reduction. Recall that the dual maximization objective (3.2) is a quadratic with respect to α, and we can write the equivalent minimization objective in the following form: We can treat (3.9) as our primal problem and applying the Hessian sketch with sketching matrix R P R pˆd to find an approximated solution for α˚: α HS " arg min αPR n α JˆX RR J X J 2λn`1 2 I n˙α´x y, αy, (3.10) which has the closed form solution as If we substitute α HS to the primal-dual connection (3.3), we obtained an approximated primal solution: w " Compared with the DRP approximation (3.6), we see these two approximations coincident, thus we see that Dual Random Projection is Hessian sketch applied in dual space. For ridge regression Algorithm 4: Accelerated Iterative Dual Random Projection (Acc-IDRP)-Primal Version.
1 Input: Data X, y, projection matrix R.
Update the dual approximation by α pt`1q Solve the projected problem in (3.11) and obtain solution z pt`1q .

10
Update u pt`1q " r pt`1q´X R z pt`1q .

13
Update v pt`1q "´X X J n`λ I n¯p pt`1q .
14 end problem (2.1) one have closed form solutions for various sketching techniques as: As we can see above, Hessian sketch is essentially sketching the covariance matrix : while DRP is essentially sketching the Gram matrix :

Accelerated Iterative Dual Random Projection
Based on the equivalence between dual random projection and Hessian sketch established in Section 3.1, we proposed an accelerated iterative dual random projection algorithms which improves the convergence speed of standard iterative DRP procedure (Zhang et al., 2014). The algorithm is shown in Algorithm 4, in which at each iteration t, we call the solver for the following randomly projected problem based on the residual r ptq : The accelerated IDRP algorithms simulate the accelerated IHS algorithm (2), but run Acc-IHS in the dual space. However, Acc-IDRP is still an primal algorithm since it updates the corresponding dual variables back from solving the randomly projected primal problem (3.11). Algorithm 5 summarized the accelerated IDRP algorithms directly from the dual problem. We note that it is not a practical algorithm as it requires solving the relatively more expensive dual problem, however it is easier to understand as it directly borrows the ideas of Acc-IHS described in Section 25. For the dual version of Acc-IDRP algorithm, at each iteration it is required to solve the following dual optimization problem given the dual residual r ptq : As we will show later, though the computational cost per iteration of Acc-IDRP and standard IDRP is the same, Acc-IDRP has the following advantages over IDRP: • As a preconditioned conjugate gradient procedure, Acc-IDRP is guaranteed to converge, and reach the optimum w˚within n iterations, even when the projection dimension d is very small.
• When the projection dimension d is large enough to make standard IDRP converge quickly to the optimum, Acc-IDRP converges even faster.
1 Input: Data X, y, projection matrix R.
Update the dual approximation by α pt`1q Update u pt`1q by solving (3.12).

12
Update v pt`1q "´X X J n`λ I n¯p pt`1q .

Approach Suitable Situation Reduced Dimension Recovery Iterative
Classical Sketch large n, small p sample reductionˆR andom Projection small n, large p dimension reductionˆĤ essian Sketch large n, small p sample reduction DRP small n, large p dimension reduction the other. For modern massive datasets, it is usually the case where both n and p are very large, for example, the click-through rate (CTR) prediction data sets provided by Criteo 2 has n ě 4ˆ10 9 and p ě 8ˆ10 8 . Thus it is desirable to have a sketching method to simultaneously reduce the huge sample size and dimensionality. Inspired by the primal-dual view described in Section 3.1, we propose the iterative Primal-Dual Sketch, which only involves solving small scale problems. For the original problem (2.1) with data tX, yu, we first construct the randomly projected data, as well as the doubly sketched data, as follows: where XR is the randomly projected data, and Π J XR is doubly sketched data by sketching XR via sample reduction. We initialize the Primal-Dual Sketch solution as w p0q DS " p0q, and at every iteration, we first apply random projection on the primal problem (which is equivalent to Hessian Sketch on the dual problem), and obtain the following problem: (4.1) Algorithm 6: Iterative Primal-Dual Sketch (IPDS).
Solve the sketched problem in (4.2) and obtain solution ∆z pkq .
DS . 12 end which is the same as the iterative dual random projection subproblem (3.8). However, different fro IDRP, we don't directly solve (4.1), but apply the iterative Hessian sketch to find an approximate solution. Note that the expanded form of (4.1) is We apply Hessian sketch to above problem, and iteratively refine the solution. That is, we first initialize an approximate solution of (4.1) of z p0q as 0, then at inner loop iteration k find a solution for the following sketched problem: then update z pk`1q as z pk`1q " z pkq`∆ z pkq .
The key point is that for the subproblem (4.2), the sketched data matrix is only of size mˆd, compared to the original problem size nˆp, where n " m, p " d, in contrast, the IHS still need to solve sub-problems of size mˆp, while IDRP need to solve sub-problems of size nˆd. As will show in the theoretical analysis, we only need to call solvers of mˆd problem (4.2) logarithmic times to obtain a solution of high approximation quality. The pseudo code of Iterative Primal-Dual Sketch (IPDS) is summarized in Algorithm 6. It is also possible to perform iterative Primal-Dual Sketch via another direction, that is, first perform primal Hessian sketch, and then apply dual Hessian sketch to solve the sketched primal problem: Algorithm 7: Accelerated Iterative Primal-Dual Sketch (Acc-IPDS).
, and update the approximation by z pk`1q " z pkq`a pkq P p pkq P .

7
Update r pk`1q P " r pkq P`a pkq P v pkq , and update u pk`1q P by solving (4.3).
, and update the dual approximation by α 14 Update primal approximation: w , and update the approximation by z pk`1q " z pkq`a pkq P p pkq P .

19
Update r pk`1q P " r pkq P`a pkq P v pkq , and update u pk`1q P by solving (4.3).

20
Update β pk`1q P " xr pk`1q P ,u pkq P y xr pkq P ,r pkq P y , and update p pk`1q P "´u pk`1q P`β pk`1q P p pkq P .

21
Update v pk`1q P "´R J X J XR n`λ I p¯p pt`1q P , and update k " k`1. Also, the idea presented in Section can also be adopted to further reduce the number of calls to mˆd scale sub-problems, which leads to the accelerated iterative primal-dual sketch (Acc-IPDS) algorithm (Summarized in Algorithm 7). In Acc-IPDS, we maintains the both vectors in primal space u P , v P , r P and vectors in dual space u D , v D , r D , to make sure the updating directions for both primal variables and dual variables are conjugate with previous updating directions. Moreover, based on the residual vector r P , Acc-IPDS iteratively calls the solver to find solution of the following sketched linear system of scale mˆd: As we will show in the subsequent section, the number of calls for solving problem (4.3) only grows logarithmically with the inverse of approximation error.

Theoretical Analysis
In this section we present the theoretical analysis of various iterative sketching procedures, and defer the omitted proofs to Appendix A. First we will provide a unified analysis of Hessian sketch as dual random projection. The unified analysis basically follows the analysis of (Zhang et al., 2014) and (Pilanci and Wainwright, 2016), but simultaneously provide recovering guarantees for both primal and dual variables of interest. Then we move to the convergence analysis of the proposed accelerated IHS and IDRP algorithms, where we will show improved convergence speed over standard IHS and IDRP. Finally, we prove the iterative primal-dual sketch will converge to the optimum within iterations only grow logarithmically with the target approximation accuracy.

A unified analysis of Hessian Sketch and Dual Random Projection
In this section we provide a simple, unified analysis for the recovery performance of Hessian Sketch and Dual random projection. As in (Pilanci and Wainwright, 2016), we use the following notion of Gaussian width for any set K Ď R p : where g is a random vector drawn from normal distribution N p0, I p q. Intuitively speaking, if the set K is restrictive to certain directions, then WpKq should be small as well (Vershynin, 2015). Given a set K and a random matrix R P R pˆd , the following quantities will be important for further analysis: where S p´1 is the p-dimensional unit-sphere. Firstly we could like the sketching matrix R to satisfy and moreover, the matrix RR J becomes closer to I p as sketching dimension d increases. Thus we would like to push ρ 1 pK, Rq to be close to 1, and ρ 2 pK, R, vq to be close to 0.
For the sake of simplicity, we just assume the random matrix R is sampled i.i.d. from some 1 ? dsub-Gaussian distributions, this can be done by first sample a matrix R where entries are sampled i.i.d. from 1-sub-Gaussian distribution, then perform the normalization as: The following lemma, from , states how large the sketching dimension d should be to make ρ 1 pK, Rq, ρ 2 pK, R, vq to be close to 1 and 0, respectively.
Lemma 5.1. When R is sampled i.i.d. from 1{ ? d-sub-Gaussian distributions, then there exists universal constants C 0 such that, we have with probability at least 1´δ, we have For a set K Ď R p , define the transformed set X J K as where X P R nˆp XK " tu P R n |u " Xv, v P Ku.
To present the main results about the unified analysis. Let's recall the main reductions in Hessian sketch and dual random projection. For Hessian sketch, we perform sample reduction with the transformation X Ñ Π J X; for dual random projection, we perform dimension reduction with the transformation X Ñ XR, where Π P R nˆm , R P R pˆd . Let w HS be the approximated solution via Hessian sketch by solving (2.4), and the corresponding dual variables by the following transform α HS " y´X w HS .
Likewise, let α DRP and w DRP be the approximated dual variables and primal variables obtained by dual random projection. The following theorem established the recovery bound for α HS , α DRP and w HS , w DRP simultaneously.
Theorem 5.2. Suppose we perform Hessian sketch or dual random projection for problem (2.1) with sub-Gaussian sketching matrix Π P R nˆm (for HS) or R P R pˆd (for DRP). Then there exists universal constants C 0 such that with probability at least 1´δ, the following approximation error bounds for HS or DRP holds: For Hessian sketch:

3)
For dual random projection: Remark. We have the following remarks for Theorem 5.2.
• For general low-dimensional problems where n " p, W 2 pXR p q " p, thus we have } w HS´w˚} X À b p m log`1 δ˘} w˚} X , which is the recovery bound proved in (Pilanci and Wainwright, 2016) (Proposition 1 in their paper).
• For high-dimensional problems when p is large, W 2 pX J R n q " n, thus we have } w DRP´w˚} 2 À a n d log`1 δ˘} w˚} 2 . Moreover, when X is low-rank, i.e. rankpXq " r and r ! minpn, pq, we have W 2 pX J R n q " r, thus we have } w DRP´w˚} 2 À a r d log`1 δ˘} w˚} 2 , which is the recovery bound obtained in Theorem 1 of (Zhang et al., 2014), in fact the bound established in Theorem 5.2 improves Theorem 1 of (Zhang et al., 2014) by removing an additional ? log r factor.

Analysis of IHS and DRP when X is approximately low-rank
In this section we provide recovery guarantees for the case when the data matrix X is approximately low rank. To make X can we well approximated by a rank r matrix where r ! minpn, pq, we assume σ r`1 , the r`1-th singular value of X, is small enough. Suppose X admits the following singular value decomposition: wherer denotes the index tr`1, ..., maxpn, pqu. We also requires w˚can be well approximated by the a linear combination of top r left singular vectors of X, i.e. the remaining singular vectors are almost orthogonal with w˚, depends on the method (Hessian sketch or dual random projection), we require the following notion of orthogonality holds for ρ and which are small: where Vr P R pˆr is the remaining right singular vectors of X. Also, to simplify the results, let the entries of the sketching matrix Π P R mˆn and R P R pˆd are sampled i.i.d. from zero-mean Gaussian distributions, with variance 1 m and 1 d , respectively. We have the following recovery bounds for Hessian sketch and dual random projection: where 1 , 2 , τ 1 , τ 2 , υ 1 , υ 2 are defined as Remark. We make the following comments on Theorem 5.3: • When σ r`1 " 0, i.e. X is exactly rank r, above results becomes which reduces to the results in Theorem 5.2.
• We see that if we have σ r`1 , ρ, sufficiently small in the following order, i.e. for Hessian sketch: for DRP: the guarantees (5.7) still hold.

Analysis of the accelerated IHS and IDRP methods
In this section we provide convergence analysis for the proposed Acc-IHS and Acc-IDRP approaches. As discussed before, since Acc-IHS and Acc-IDRP are preconditioned conjugate gradient methods on primal and dual problems, respectively, with a sketched Hessian as a preconditioner. By classical analysis of preconditioned conjugate gradient (Luenberger), we have the following convergence guarantees: Proposition 5.4. For Acc-IHS, we have

8)
and Acc-IDRP, we have From Proposition 5.4, we know the convergence of Acc-IHS and Acc-IDRP heavily depends on the condition number κ HS pX, Π, λq and κ DRP pX, R, λq. Thus the key of the rest of the analysis is to upper bound the condition numbers. To analyze the condition numbers, we make use of the following result in (Mendelson et al., 2007).
Lemma 5.5. If the elements in Π P R nˆm are i.i.d. sampled from a zero-mean 1 m -sub-Gaussian distribution, then there exists universal constants C 0 such that, for any subset K Ď R n , with probability at least 1´δ, we have Based on above lemma, we have the following bounds on the condition numbers κ HS pX, Π, λq and κ DRP pX, R, λq: Theorem 5.6. If the sketching matrix Π P R nˆm and R P R pˆd are sampled from 1 ? m -sub-Gaussian and 1 ? d -sub-Gaussian distributions, repectively, then with probability at least 1´δ, the following upper bounds hold: With Theorem 5.6, we immediately obtain the following corollary which states the overall convergence for Acc-IHS and Acc-IDRP: Corollary 5.7. Suppose the sketching matrix Π P R nˆm and R P R pˆd are sub-Gaussian, if t, the number of iterations of Acc-IHS satisfies If the number of iterations of Acc-IDRP satisfies then we have with probability at least 1´δ, Remark. To compare the convergence rate of Acc-IHS, and Acc-IDRP with standard IHS (Pilanci and Wainwright, 2016) and IDRP (Zhang et al., 2014), we observe that • For IHS, the number of iterations to reach -accuracy (Corollary 1 in (Pilanci and Wainwright, 2016) , which is significant when ρ relatively large. Moreover, IHS requires m Á W 2 pXR p q to holds to converge, while Acc-IHS is always convergent.

¯U
. Moreover, IDRP requires d Á r log r to holds to converge, while Acc-IDRP is always convergent.

Analysis for the primal-dual sketch methods
In this section, we provide runtime theoretical analysis for the proposed primal-dual sketch methods, where the sketched dual problem is not solved exactly, but apprxoimately solved via sketching the primal problem again. At outer loop iteration t, the standard analysis of iterative Hessian sketch ((Pilanci and Wainwright, 2016) and Theorem 5.2), we have the following lemma: Lemma 5.8. Let w pt`1q HS be the iterate defined in Algorithm 1, then we have the following inequality: However, in iterative primal-dual sketch, we don't have access to the exact minimizer w pt`1q HS , instead an approximate minimizer w pt`1q HS which is close to w pt`1q HS . The key is the analyze the iteration complexities of inner loops.
Theorem 5.9. With probability at least 1´δ, we have the following approximation error bound for w pt`1q HS in iterative primal-dual sketch: With Theorem 5.9, we have the following iterative complexity for the proposed IPDS approach.
Corollary 5.10. If the number of outer loops t and number of inner loops k in IPDS satisfying the following: Then with probability at least 1´δ: Proof. Apply Theorem 5.9 and substitute above inequalites for t and k we get Remark. Since the total number of sketched subproblem to solve in iterative primal-dual sketch is tk. To obtain approximation error, we have the total number of subproblems is Thus the iterative primal-dual sketch will be efficient when the Gaussian width of set XR p and X J R n is relatively small. For example, when rankpXq " r ! minpn, pq, we can choose the sketching dimension in IPDS to be m, d Á r, and IPDS can return a solution with -approximation error by just solving log 2`1 ˘s mall scale subproblems of scale rˆr. We next provide iteration complexity for the proposed Acc-IPDS algorithms as shown in Algorithm 7.
Corollary 5.11. If the number of outer loops t and number of inner loops k in IPDS satisfying the following: Then with probability at least 1´δ: Proof. The proof is similar to the proof of Theorem 5.9 and then subsitute the lower bounds for t and k to obtain the result.

Runtime comparison for large n, large p, and low-rank data
To solve problem (2.1), the runtime usually depends on several quantities: sample size n, problem dimension p as well as the problem condition. To make the comparison simpler, we simply assume X is rank r, note that r might be much smaller than n, p: r ! n, p. For (2.1) generally the regularization parameter λ is chosen at the order of Op1{ ? nq to Op1{nq (Sridharan et al., 2009;, here in favor of the iterative optimization algorithms we simply choose the large λ, i.e. of order Op1{ ? nq. For iterative optimization algorithms, the convergence usually depend on the smoothness of the problem. In (2.1), the smoothness parameter is λ max´X J X n`λ I p¯, which is often of the order Oppq (e.g. random sub-Gaussian design). To compare the runtime for solving (2.1), we consider the following methods: • Solving Linear System: which solves the problem exactly using matrix inversion, which requires Opnp 2`p3 q.

Op¨q Op¨q Comment
Linear System np 2`p3 np 2`p3 LS with Low-rank SVD npr`r 3 npr`r 3 Gradient Descent`n 1.5 p 2˘l og`1 ε˘n 1.5 p 2 Acc.Gradient Descent`n 1.25 p 1.5˘l og`1 ε˘n 1.25 p 1.5 Coordinate Descent`n 1.5 p˘log`1 ε˘n 1.5 p SVRG,SDCA,SAG`np`n 0.5 p 2˘l og`1 ε˘n p`n 0.5 p 2 Catalyst,APPA`np`n 0.75 p 1.5˘l og`1 ε˘n p`n 0.75 p 1.5 DSPDC npr``nr`n 0.75 p 1.5 r˘log`1 ε˘n pr`n 0.75 p 1.5 r IHS + Catalyst np log p`n 0.25 p 1.5 r log 2`1 ε˘n p`n 0.25 p 1.5 r Fast when p ! n DRP + Exact np log n`pnr 2`r3 q log`1 ε˘n p`nr 2`r3 Fast when n ! p Iter.primal-dual sketch np log p`pn`r 3 q log 2`1 ε˘n p`r 3 Fast when r ! maxpp, nq Table 2: Comparison of various approaches for solving large scale problems (2.1), the runtime depend on n, p, r, ε.
• Linear System with Low-rank SVD: if we have the factorization X " UV J , where U P R nˆr , V P R pˆr . Then we can solve the matrix inversion efficiently with the Sherman- Which can be done in Opnpr`r 3 q.
• DSPDC (Yu et al., 2015): which requires O´´n`bn L λ p¯log`1 ε˘¯i terations, with each iteration Oprq. Since L " Oppq, λ " O p1{ ? nq. Also, to apply DSPDC, one should compute the low-rank factorization as a preprocessing step which takes Opnprq. Thus we have the overall runtime is O`npr``nr`n 0.75 p 0.5 r˘log`1 ε˘˘.

Experiments
In this section we present extensive comparisons for the proposed iterative sketching approaches on both simulated and real world data sets. We first demonstrate the improved convergence of the proposed Acc-IHS and Acc-IDRP algorithms on simulated data sets, and then show the proposed iterative primal-dual sketch procedure and its accelerated version could simultaneously reduce the sample size and dimensions of the problem, while still maintaining high approximation precision. Then we test these algorithms on some real world data sets.

Simulations for Acc-IHS and Acc-IDRP
We first examine the effectiveness of the proposed Acc-IHS and Acc-DRP algorithms on simulated data. The response variable ty i u iPrns are drawn from the following linear model: where the noise ji is sampled from a standard normal distribution. The true model β˚is a p-dimensional vector where the entries are sampled i.i.d. from a uniform distribution in r0, 1s. We first compare the proposed Acc-IHS with the standard IHS on some "big n", but relatively low-dimensional data. We generate tx i u iPrns from multivariate normal distribution with mean zero vector, and covariance matrix Σ, which controls the condition number of the problem. We will varying Σ to see how it affects the performance of various methods. We set Σ ij " 0.5 |i´j| for the well-conditioned setting, and Σ ij " 0.5 |i´j|{10 for the ill-conditioned setting. We fix the sample size n " 10 5 and varying the dimensions with p " 50, 100, 300. The results are shown in Figure 1, where for each problem setting, we test 3 different sketching dimensions (number inside parentheses in legend). We have the following observations: • For both IHS and Acc-IHS, the larger the sketching dimension m, the faster the iterative converges to the optimum, which is consistent with the theory, as also observed in (Pilanci and Wainwright, 2016) and (Zhang et al., 2014) for IHS and IDRP algorithm.
Figure 1: Comparion of IHS and Acc-IHS on various simulated datasets.
• When compared with IHS and Acc-IHS, we observed Acc-IHS converges significantly faster than IHS. Moreover, when the sketching dimension is small, IHS can diverge and go far away from the optimum, while Acc-IHS still converges.
• For all the cases we tested, Acc-IHS converges faster than IHS even when its sketching dimension is only 1{3 of the sketching dimension in IHS.
We then compare the proposed Acc-IDRP with the standard IDRP on high-dimensional, but relatively low-rank data. We generate tx i u iPrns from a low-rank factorization: X " UV J , where the entries in U P R nˆr , V P R pˆr are sampled i.i.d from standard normal distribution. We fix the sample size n " 10 4 and varying the dimensions with p " 2000, 5000, 20000, we also vary the rank r " 20, 50. The results are shown in Figure 2, where for each problem setting, we test 3 different sketching dimensions (number inside parentheses in legend). We have similar observations with the IHS case, i.e. Acc-IDRP always converges significantly faster than IDRP, even in the low sketching dimension case where IDRP diverge.
Above experiments validate the theoretical analysis which showed the accelerated procedures for IHS and IDRP could significantly boost the convergence of their standard counterpart. Since the computational cost per iteration of the standard iterative sketching techniques and their accelerated version is almost the same, thus Acc-IHS and Acc-IDRP will be useful techniques for iterative sketching with faster convergence speed.

Simulations for IPDS and Acc-IPDS
In this section we demonstrate how iterative primal-dual sketch and its accelerated version works for simulated data. We generated the data using the same procedure as the simulation for Acc-DRP, where we fix the low-rank data with rank 10, and varing the original sample size n and dimension p. For primal-dual sketching, we reduce the sample size to m, and dimension to d, where m ! n, d ! p. We also compare with standard IHS and IDRP, where for IHS we only perform sample reduction from n to m, for IDRP only data dimension is reduced from p to d. Thus the for subproblem size for IPDS (and Acc-IPDS), IHS, IDRP, are mˆd, mˆp, nˆd, respectively. For IPDS and Acc-IPDS, we terminate the inner loop when the 8 distance between two inner iterations are less than 10´1 0 . The results are shown in Figure 3, where the sketched dimension pm, dq is shown in legend. We have the following observations: • Though simultaneously reduce the sample size and data dimension, IPDS and Acc-IPDS are able to recover the optimum to a very high precision. However, they requires generally more iterations to reach certain approximation level compared with IHS and IDRP, where at each iteration we need to solve a substantial larger scale subproblem. Therefore, primal-dual sketching approach still enjoy more computational advantages. For example, on problem of scale pn, pq " p10000, 20000q, IHS and IDRP need to solve 5 sub-problems of scale pm, pq " p500, 20000q and pn, dq " p10000, 500q, respectively, while for Acc-IPDS, it is only required to solve 35 sub-problems of scale pm, dq " p500, 500q to obtain the same approximation accuracy.
• Acc-IPDS converges significantly faster than IPDS, which again verified the effectiveness of

Experiments on real datasets
We also conduct experiments on some real-world data sets where the statistics of them are summarized in Table 3. Among all the data sets, the first 3 are cases where sample size is significantly larger than the data dimension, where we used to compare the IHS and Acc-IHS algorithms; the middle 3 data sets are high-dimensional datasets with small sample size, where we compare to DRP and Acc-DRP procedures; the last 3 datasets are cases where sample size and data dimension are both relatively large, which is suitable for iterative primal-dual sketching methods. For the last 3 data sets we found that standard IHS and DRP often fails (unless with very large sketching dimension), thus we compared with Acc-IHS and Acc-DRP. We follow the same experimental setup with the simulation study, and the convergence plots are summarized in Figure 4. We have the following observations: • Acc-IHS and Acc-DRP converges significantly faster than IHS and DRP, respectively, where similar observation is drawn from simulation studies.
• For the last 3 data sets where n and p are both large, and the data is not exactly low-rank: IHS, DRP and IPDS often diverge because of the requirement of the sketching dimension to ensure convergence is high, while the accelerated versions still converges to the optimum. It is notable that the Acc-IPDS only requires solving several least squares problems where both sample size and dimension are relatively small.

Conclusion and Discussion
In this paper, we focused on sketching techniques for solving large-scale 2 regularized least square problems, we established the equivalence between the recently proposed two emerging techniques (Hessian sketch and dual random projection) from a primal-dual point of view, we also proposed accelerated methods for IHS and IDRP, from the preconditioned optimization perspective, and by combining the primal and dual sketching technique, we proposed a novel iterative primal-dual sketching approach which substantially reduced the computational cost in solving sketched subproblems.
Thus we haveˆX X J n`λ I n˙α˚´ˆX RR J X J n`λ I n˙ α DRP " 0.
So BˆX X J n`λ I n˙α˚´ˆX RR J X J n`λ I n˙ α DRP , α˚´ α DRP F " 0.
By some algebraic manipulations we have BˆX RR J X J n´X X J n˙α˚, α DRP´α˚F " pα˚´ α DRP q JˆX RR J X J n`λ I n˙p α˚´ α DRP q.
Then applying Lemma 5.1 we conclude the proof.