Recovery of simultaneous low rank and two-way sparse coefficient matrices, a nonconvex approach

We study the problem of recovery of matrices that are simultaneously low rank and row and/or column sparse. Such matrices appear in recent applications in cognitive neuroscience, imaging, computer vision, macroeconomics, and genetics. We propose a GDT (Gradient Descent with hard Thresholding) algorithm to efficiently recover matrices with such structure, by minimizing a bi-convex function over a nonconvex set of constraints. We show linear convergence of the iterates obtained by GDT to a region within statistical error of an optimal solution. As an application of our method, we consider multi-task learning problems and show that the statistical error rate obtained by GDT is near optimal compared to minimax rate. Experiments demonstrate competitive performance and much faster running speed compared to existing methods, on both simulations and real data sets.


Introduction
Many problems in machine learning, statistics and signal processing can be formulated as optimization problems with a smooth objective and nonconvex constraints. The objective usually measures the fit of a model, parameter, or signal to the data, while the constraints encode structural requirements on the model. Examples of nonconvex constraints include sparsity where the parameter is assumed to have only a few non-zero coordinates [60,125,100,108,135], group sparsity where the parameter is comprised of several groups only few of which are non-zero [81,69,61,34], and low-rankness where the parameter is believed to be a linear combination of few factors [6,35,38,50,65]. Common approach to dealing with nonconvex constraints is via convex relaxations, which allow for application of simple optimization algorithms and easy theoretical analysis [2,29,46,28,72]. From a practical point of view, it has been observed that directly working with a nonconvex optimization problem can lead to both faster and more accurate algorithms [103,132,126,116]. As a result, a body of literature has recently emerged that tries to characterize good performance of these algorithms [13,131,52].
In this work, we focus on the following optimization problem Θ ∈ arg min Θ∈Ξ f (Θ) (1) where Ξ ⊂ R m 1 ×m 2 is a nonconvex set comprising of low rank matrices that are also row and/or column sparse, where Θ 2,0 = |{i ∈ [m 1 ] | j∈[m 2 ] Θ 2 ij = 0}| is the number of non-zero rows of Θ. Such an optimization problem arises in a number of applications including sparse singular value decomposition and principal component analysis [116,82,58], sparse reduced-rank regression [20,83,35,36,111], and reinforcement learning [26,104,75,119,101]. Rather than considering convex relaxations of the optimization problem (1), we directly work with a nonconvex formulation. Under an appropriate statistical model, the global minimizer Θ approximates the "true" parameter Θ * with an error level . Since the optimization problem (1) is highly nonconvex, our aim is to develop an iterative algorithm that, with appropriate initialization, converges linearly to a stationary pointΘ that is within c · distance of Θ. In order to develop a computationally efficient algorithm, we reparametrize the m 1 × m 2 matrix variable Θ as U V with U ∈ R m 1 ×r and V ∈ R m 2 ×r , and optimize over U and V . That is, we consider (with some abuse of notation) the following optimization problem where U = U(s 1 ) = U ∈ R m 1 ×r | U 2,0 ≤ s 1 and V = V(s 2 ) = V ∈ R m 2 ×r | V 2,0 ≤ s 2 .
Such a reparametrization automatically enforces the low rank structure and will allow us to develop an algorithm with low computational cost per iteration. Note that even though U and V are only unique up to scaling and a rotation by an orthogonal matrix, Θ = U V is usually unique. We make several contributions in this paper. First, we develop an efficient algorithm for minimizing (2), which uses projected gradient descent on a nonconvex set in each iteration. Under conditions on the function f (Θ) that are common in the high-dimensional literature, we establish linear convergence of the iterates to a statistically relevant solution. In particular, we require that the function f (Θ) satisfies restricted strong convexity (RSC) and restricted strong smoothness (RSS), conditions that are given in Condition (RSC/RSS) below. Compared to the existing work for optimization over low rank matrices with (alternating) gradient descent, we need to study a projection onto a nonconvex set in each iteration, which in our case is a hard-thresholding operation, that requires delicate analysis and novel theory. Our second contribution, is in the domain of multi-task learning. Multitask learning is a widely used learning framework where similar tasks are considered jointly for the purpose of improving performance compared to learning the tasks separately [31]. We study the setting where the number of input variables and the number of tasks can be much larger than the sample size. Our focus is on simultaneous variable selection and dimensionality reduction. We want to identify which variables are relevant predictor variables for different tasks and at the same time we want to combine the relevant predictor variables into fewer features that can be explained as latent factors that drive the variation in the multiple responses. We provide a new algorithm for this problem and improve the theoretical results established in [83]. In particular, our algorithm does not require a new independent sample in each iteration and allows for non-Gaussian errors, while at the same time achieves nearly optimal error rate compared to the information theoretic minimax lower bound for the problem. Moreover, our prediction error is much better than the error bound proposed in [20], and matches the error bound in [100]. However, all of the existing algorithms are slow and cannot scale to high dimensions. Finally, our third contribution is in the area of reinforcement learning. We study the Multi-task Reinforcement Learning (MTRL) problem via value function approximation. In MTRL the decision maker needs to solve a sequence of Markov Decision Processes (MDPs). A common approach to Reinforcement Learning when the state space is large is to approximate the value function of linear basis functions (linear in some appropriate feature representation of the states) with sparse support. Thus, it is natural to assume the resulting coefficient matrix is low rank and row sparse. Our proposed algorithm can be applied to the regression step of any MTRL algorithm (we chose Fitted Q-iteration (FQI) for presentation purposes) to solve for the optimal policies for MDPs. Compared to [26] which uses convex relaxation, our algorithm is much more efficient in high dimensions.

Related Work
Our work contributes to several different areas, and thus is naturally related to many existing works. We provide a brief overview of the related literature and describe how it is related to our contributions. For the sake of brevity, we do not provide an extensive review of the existing literature.
Low-rank Matrix Recovery. A large body of literature exists on recovery of low-rank matrices as they arise in a wide variety of applications throughout science and engineering, ranging from quantum tomography to signal processing and machine learning [1,80,102,42]. Recovery of a low-rank matrix can be formulated as the following optimization problem where the objective function f : R m 1 ×m 2 → R is convex and smooth. The problem (3) is highly nonconvex and NP-hard in general [46,47]. A lot of the progress in the literature has focused on convex relaxations where one replaces the rank constraint using the nuclear norm. See, for example, [28,29,27,96,23,94,50,32,60,98,72,54,87,34,120,88,2,95,37,40,39,58,24,123,135,117] and references therein. However, developing efficient algorithms for solving these convex relaxations is challenging in regimes with large m 1 and m 2 [59]. A practical approach, widely used in large scale applications such as recommendation systems or collaborative filtering [105,73,49,138] relies on solving a nonconvex optimization problem where the decision variable Θ is factored as U V , usually referred to as the Burer-Monteiro type decomposition [21,22]. A stationary point of this nonconvex problem is usually found via a block coordinate descent-type algorithm, such as alternating minimization or (alternating) gradient descent. Unlike for the convex relaxation approaches, the theoretical understanding of these nonconvex optimization procedures has been developed only recently [67,68,65,55,56,57,103,132,133,17,16,107,38,137,136,48,78]. Compared to the classical nonconvex optimization theory, which only shows a sublinear convergence to a local optima, the focus of the recent literature is on establishing linear rates of convergence or characterizing that the objective does not have spurious local minima. In addition to the methods that work on the factorized form, [64,76,63,13] consider projected gradient-type methods which optimize over the matrix variable Θ ∈ R m 1 ×m 2 . These methods involve calculating the top r singular vectors of an m 1 × m 2 matrix at each iteration. When r is much smaller than m 1 and m 2 , they incur much higher computational cost per iteration than the methods that optimize over U ∈ R m 1 ×r and V ∈ R m 2 ×r . Our work contributes to this body of literature by studying gradient descent with a projection step on a non-convex set, which requires hard-thresholding. Hard-thresholding in this context has not been considered before. Theoretically we need a new argument to establish linear convergence to a statistically relevant point. [38] considered projected gradient descent in a symmetric and positive semidefinite setting with a projection on a convex set. Our work is most closely related to [132], which used the notion of inexact first order oracle to establish their results, but did not consider the hard-thresholding step.
Structured Low-rank Matrices. Low-rank matrices with additional structure also commonly arise in different problems ranging from sparse principal component analysis (PCA) and sparse singular value decomposition to multi-task learning. In a high-dimensional setting, the classical PCA is inconsistent [66] and recent work has focused on PCA with additional sparse structure on the eigenvectors [5,14,18,25,113,84,130]. Similar sparse structure in singular vectors arises in sparse SVD and biclustering [77,35,82,109,124,11,70,10]. While the above papers use the sparsity structure of the eigenvectors and singular vectors, it is also possible to have simultaneous low rank and sparse structure directly on the matrix Θ. Such a structure arises in multi-task learning, covariance estimation, graph denoising and link prediction [85,97]. Additional structure on the sparsity pattern was imposed in the context of sparse rank-reduced regression, which is an instance of multi-task learning [36,20,83,9,100]. Our algorithm described in Section 2 can be applied to the above mentioned problems. In Section 4, we theoretically study multi-task learning in the setting of [83]. We relax conditions imposed in [83], specifically allowing for non-Gaussian errors and not requiring independent samples at each step of the algorithm, while still achieving the near minimax rate of convergence. We provide additional discussion in Section 4 after formally providing results for the multi-task learning setting. In Section 5, we further corroborate our theoretical results in extensive simulations and show that our algorithm outperforms existing methods in multi-task learning.
Low-rank Plus Sparse Matrix Recovery. At this point, it is worth mentioning another commonly encountered structure on the decision variable Θ that we do not study in the current paper. In various applications it is common to model Θ as a sum of two matrices, one of which is low-rank and the other one sparse. Applications include robust PCA, latent Gaussian graphical models, factor analysis and multi-task learning [30,60,32,39,2,51,131,121,52]. While Burer-Monteiro factorization has been considered for the lowrank component in this context (see, for example, [131] and references therein), the low-rank component is dense as it needs to be incoherent. The incoherence assumption guarantees that the low-rank component is not too spiky and can be identified [28]. An alternative approach was taken in [52] where alternating minimization over the low-rank and sparse component with a projection on a nonconvex set was investigated.

Organization of the paper
In Section 2 we provide details for our proposed algorithm. Section 3 states our assumptions and the theoretical result with a proof sketch. Section 4 shows applications to multi-task learning, while Section 5 presents experimental results. Section 6 provides detailed technical proofs. Conclusion is given in Section 7.

Gradient Descent With Hard Thresholding
In this section, we detail our proposed algorithm, which is based on gradient descent with hard thresholding (GDT). Our focus is on developing an efficient algorithm for minimizing f (Θ) with Θ ∈ Ξ. In statistical estimation and machine learning a common goal is to find Θ * , which is an (approximate) minimizer of E[f (Θ)] where the expectation is with respect to randomness in data. In many settings, the global minimizer of (1) can be shown to approximate Θ * up to statistical error, which is problem specific. In Section 3, we will show that iterates of our algorithm converge linearly to Θ * up to a statistical error. It is worth noting that an argument similar to that in the proof of Theorem 1 can be used to establish linear convergence to the global minimizer Θ in a deterministic setting. That is, suppose ( U , V ) is a global minimizer of the problem (2) and Θ = U V . Then as long as the conditions in Section 3 hold for U , V in place of U * , V * , we can show linear convergence to Θ up to an error level defined by the gradient of the objective function at Θ. See the discussion after Theorem 1.
Our algorithm, GDT, uses a Burer-Monteiro factorization to write Θ = U V , where U ∈ R m 1 ×r and V ∈ R m 2 ×r , and minimizes where g(U, V ) is the penalty function defined as The role of the penalty is to find a balanced decomposition of Θ, one for which σ i ( U ) = σ i (V ), i = 1, . . . , r [136,131]. Note the value of the penalty is equal to 0 for a balanced solution, so we can think of the penalized objective as looking through minimizer of (2) for a one that satisfies U U − V V = 0. In particular, adding the penalty function g does not change the minimizer of f over Ξ. The convergence rate of GDT depends on the condition number of (U * , V * ), the point algorithm converges to. The penalty ensures that the iterates U, V are not ill-conditioned. Gradient descent with hard-thresholding on U and V is used to minimize (4). Details of GDT are given in Algorithm 1. The algorithm takes as input parameters η, the step size; s 1 , s 2 , the sparsity level; T , the number of iterations; and a starting point Θ 0 . The choice of starting point Θ 0 is very important as the algorithm performs a local search in its neighborhood. In Section 3 we will formalize how close Θ 0 needs to be to Θ * , while in Section 4 we provide a concrete way to initialize under a multi-task learning model. In general, we envisage finding Θ 0 by solving the following optimization problem where pen(Θ) is a (simple) convex penalty term making the objective (5) a convex optimization problem. For example, we could use the vector 1 norm, pen(Θ) = Θ 1 . The choice of penalty pen(Θ) should be such that solving the optimization problem in (5) can be done efficiently in a high dimensional setting. In practice, if solving the convex relaxation is slow, we can start from the all zero matrix and perform several (proximal) gradient steps to get an appropriate initialization. See for example [131]. Once an initial estimate Θ 0 is obtained, we find the best rank r approximation Θ = U Σ V to Θ 0 and use it to obtain the initial iterates U 0 and V 0 . In each step, GDT updates U and V by taking a gradient step and hard-thresholding the result. The operation Hard(U, s) keeps s rows of U with the largest 2 row-norm, while setting to zero other rows. Suppose that the target statistical parameter Θ * is in Ξ(r * , s * 1 , s * 2 ). The sparsity level s * 1 and s * 2 as well as the rank r * are not known in practice, but are needed in Algorithm 1. For the convergence proof we require that the input parameters to the algorithm are set as s 1 = c · s * 1 and s 2 = c · s * 2 for some c > 1. In practice, we use an information criteria to select these tuning parameters. For example, [100] develops the scale-free predictive information criterion to select the best sparsity parameters. The rank r can be estimated as in [19].
To the best of our knowledge, GDT is the first algorithm to deal with a nonconvex optimization problem over a parameter space that is simultaneously low rank and row and column sparse. In the following section we will provide conditions on the objective function f and the starting point Θ 0 which guarantee linear convergence to Θ * up to a statistical error. As an application, we consider the multi-task learning problem in Section 4. We show that the statistical error nearly matches the optimal minimax rate, while the algorithm achieves the best performance in terms of estimation and prediction error in simulations.

Theoretical Result
In this section, we formalize the conditions and state the main result on the linear convergence of our algorithm. We begin in Section 3.1 by stating the conditions on the objective function f and initialization that are needed for our analysis. In Section 3.2, we state Theorem 1 that guarantees linear convergence under the conditions to a statistically useful point. The proof outline is given in Section 3.3. In Section 4 to follow, we derive results for multi-task learning as corollaries of our main result.

Regularity Conditions
We start by stating mild conditions on the objective function f , which have been used in the literature on high-dimensional estimation and nonconvex optimization, and they hold with high-probability for a number of statistical models of interest [132,131,52]. Note that all the conditions depend on the choice of s 1 and s 2 (or equivalently, on c).
Restricted Strong Convexity and Smoothness (RSC/RSS). There exist universal constants µ and L such that for all Θ 1 , Θ 2 ∈ Ξ(2r, s 1 , s 2 ) where s 1 = (2c + 1)s * 1 and s 2 = (2c + 1)s * 2 . The next condition is on the initial estimate Θ 0 . It quantifies how close the initial estimator needs to be to Θ * so that iterates of GDT converge to statistically useful solution.
Initialization (I). Define µ min = 1 8 min{1, µL µ+L } and We require where We note that, in general, (7) defines a ball of constant radius around Θ * in which the initial estimator needs to fall into. In particular, when considering statistical learning problems, the initial estimator can be inconsistent as the sample size increases.
Next, we define the notion of the statistical error, Note that the statistical error quantifies how large the gradient of the objective evaluated at the true parameter Θ * can be in the directions of simultaneously low-rank and sparse matrices. It implicitly depends on the choice of c and as we will see later there is a trade-off in balancing the statistical error and convergence rate of GDT. As c increases, statistical error gets larger, but requires us to choose a smaller step size in order to guarantee convergence.
With these two conditions, we are ready to the choice of the step size in Algorithm 1.
Step Size Selection. We choose the step size η to satisfy Furthermore, we require η and c to satisfy and The condition that the step size η satisfies (9) is typical in the literature on convex optimization of strongly convex and smooth functions. Under (10) we will be able to show contraction after one iteration and progress towards Θ * . The second term in (10) is always smaller than 1, while the first term ξ 2 is slightly larger than 1 and is the price we pay for the hard thresholding step. In order to show linear convergence we need to balance the choice of η and ξ 2 to ensure that β < 1. From (10), we see that if we select a small step size η, then we need to have a small ξ 2 , which means a large c. Intuitively, if η is too small, it may be impossible to change row and column support in each iteration. In this case we have to keep many active rows and columns to make sure we do not miss the true signal. This leads to large s 1 and s 2 , or equivalently to a large c. However, the statistical error (8) will increase with increase of c and these are the trade-off on the selection of η and c.
Finally, (11) guarantees that the iterates do not run outside of the initial ball given in (7). In case (11) is violated, then the initialization point Θ 0 is already a good enough estimate of Θ * . Therefore, this requirement is not restrictive. In practice, we found that the selection of η and c is not restrictive and the convergence is guaranteed for a wide range of values of their values.

Main Result
Our main result establishes linear convergence of GDT iterates to Θ * up to statistical error. Since the factorization of Θ * is not unique, we turn to measure the subspace distance of the iterates (U t , V t ) to the balanced decomposition U * (V * ) = Θ * .
With this, we are ready to state our main result.
Furthermore, for The proof sketch of Theorem 1 is given in the following section. Conceptually, Theorem 1 provides a minimal set of conditions for convergence of GDT. The first term in equations (12) and (13) correspond to the optimization error, whereas the second term corresponds to the statistical error. These bounds show that the distance between the iterates and Θ * drop exponentially up to the statistical limit e stat , which is problem specific. In statistical learning problem, it commonly depends on the sample size and the signal-to-noise ratio of the problem.
Theorem 1 provides convergence in a statistical setting to the "true" parameter Θ * . However, as mentioned in Section 2, Algorithm 1 and Theorem 1 can also be used to establish linear convergence to a global minimizer in a deterministic setting. Suppose ( U , V ) ∈ arg min U ∈U ,V ∈V {f (U, V )} is a global minimizer and Θ = U V . Furthermore, assume that the conditions in Section 3.1 are satisfied with Θ in place of Θ * . Then we have that the iterates {Θ t } obtained by GDT converge linearly to a global minimum Θ up to the error e stat defined similar to (8) with Θ in place of Θ * . This error comes from sparsity and hard thresholding. In particular, suppose there are no row or column sparsity constraints in the optimization problem (2), so that we do not have hard-thresholding steps in Algorithm 1. Then we have e stat = 0, so that iterates {Θ t } converge linearly to Θ, recovering the result of [132].

Proof Sketch of Theorem 1
In this section we sketch the proof of our main result. The proof combines three lemmas. We first one quantify the accuracy of the initialization step. The following one quantifies the improvement in the accuracy by one step of GDT. The third lemma shows that the step size assumed in Theorem 1 satisfies conditions of the second lemma. Detailed proofs of these lemmas are relegated to Section 6.
Our first lemma quantifies the accuracy of the initialization step.
Lemma 2. Suppose that the input to GDT, Θ 0 , satisfies initialization condition (7). Then the initial iterates U 0 and V 0 obtained in lines 3 and 4 of Algorithm 1 satisfy where The proof of Lemma 2 is given in Section 6.1.
Then we have where ξ 2 = 1 + 2 √ c−1 . The proof of Lemma 3 is given in Section 6.2.
We have that the choice of step size (9) in Theorem 1 satisfies the condition (15) in Lemma 3.
The proof of Lemma 4 is given in Section 6.3.
Combining the three results above, we can complete the proof of Theorem 1. Starting from initialization Θ 0 satisfying the initialization condition (7), Lemma 2 ensures that (14) is satisfied for Z 0 and Lemma 4 ensures that the choice of step size (9) satisfies the step size condition (15) in Lemma 3. We can then apply Lemma 3 and get the next iterate Z 1 = Z + , which satisfies (16). Using the condition on statistical error (11), initialization (7), and a simple calculation, we can verify that Z 1 satisfies d(Z 1 , Z * ) ≤ I 0 . Therefore we can apply Lemma 2, Lemma 3, and Lemma 4 repeatedly to obtain for each t = 0, 1, ..., T . We then have Finally, for which shows linear convergence up to the statistical error.

Application to Multi-task Learning
In this section, we apply the theory developed in Section 3 on two specific problems. First, in Section 4.1, we apply GDT algorithm to a multi-task learning problem. We show that under commonly used statistical conditions the conditions on the objective function stated in Section 3.1 are satisfied with high-probability. Next, in Section 4.2 we discuss an application to multi-task reinforcement learning problem.

GDT for Multi-task Learning
We apply GDT algorithm to the problem of multi-task learning, which has been successfully applied in a wide range of application areas, ranging from neuroscience [111], natural language understanding [41], speech recognition [99], computer vision [100], and genetics [127] to remote sensing [122], image classification [74], spam filtering [118], web search [33], disease prediction [134], and eQTL mapping [69]. By transferring information between related tasks it is hoped that samples will be better utilized, leading to improved generalization performance. We consider the following linear multi-task learning problem where Y ∈ R n×k is the response matrix, X ∈ R n×p is the matrix of predictors, Θ * ∈ R p×k is an unknown matrix of coefficients, and E ∈ R n×k is an unobserved noise matrix with i.i.d. mean zero and variance σ 2 entries. Here n denotes the sample size, k is the number of responses, and p is the number of predictors.
There are a number of ways to capture relationships between different tasks and success of different methods relies on this relationship. [44] studied a setting where linear predictors are close to each other. In a high-dimensional setting, with large number of variables, it is common to assume that there are a few variables predictive of all tasks, while others are not predictive [108,91,81,71,115]. Another popular condition is to assume that the predictors lie in a shared lower dimensional subspace [7,6,129,8,114]. In contemporary applications, however, it is increasingly common that both the number of predictors and the number of tasks is large compared to the sample size. For example, in a study of regulatory relationships between genome-wide measurements, where micro-RNA measurements are used to explain the gene expression levels, it is commonly assumed that a small number of micro-RNAs regulate genes participating in few regulatory pathways [82]. In such a setting, it is reasonable to assume that the coefficients are both sparse and low rank. That is, one believes that the predictors can be combined into fewer latent features that drive the variation in the multiple response variables and are composed only of relevant predictor variables. Compared to a setting where either variables are selected or latent features are learned, there is much less work on simultaneous variable selection and rank reduction [19,35,36,100]. In addition, we when both p and k are large, it is also needed to assume the column sparsity on the matrix Θ * to make estimation feasible [83], a model that has been referred to as the two-way sparse reduced-rank regression model. We focus on this model here.
Multi-task Model (MTM) In the model (17), we assume that the true coefficient matrix Θ * ∈ Ξ(r, s * 1 , s * 2 ). The noise matrix E has i.i.d. sub-Gaussian elements with variance proxy σ 2 , which requires that each element e ij satisfies E(e ij ) = 0 and its moment generating function satisfies E[exp(te ij )] ≤ exp(σ 2 t 2 /2). The design matrix X is considered fixed with columns normalized to have mean 0 and standard deviation 1. Moreover, we assume X satisfies the following Restricted Eigenvalue (RE) condition [89] for some constant κ(s 1 ) and κ(s 1 ).
We will show that under the condition (MTM), GDT converges linearly to the optimal coefficient Θ * up to a region of statistical error. Compared to the previous methods for estimating jointly sparse and low rank coefficients [19,35,36,83], GDT is more scalable and improves estimation accuracy as illustrated in the simulation Section 5.
In the context of the multi-task learning with the model in (17), we are going to use the least squares loss. The objective function in is f (Θ) = 1 2n Y − XΘ 2 F and we write Θ = U V with U ∈ R p×r and V ∈ R k×r . The constraint set is set as before as U ∈ U(s 1 ) and V ∈ U(s 2 ) with s 1 = c · s * 1 , s 2 = c · s * 2 for some c > 1. The rank r and the sparsity levels s 1 , s 2 are tuning parameters, which can be selected using the information criterion as in [100].
In order to apply the results of Theorem 1, we first verify the conditions in Section 3.1. The condition (RSC/RSS) in is equivalent to and it holds with µ = κ(s 1 ) and L =κ(s 1 ).
Next, we discuss how to initialize GDT in the context of multi-task learning. Under the structural conditions on Θ * in the condition (MTM) there are a number of way to obtain an initial estimator Θ 0 . For example, we can use row and column screening [45], group lasso [128], and lasso [106] among other procedures. Here and in simulations we use the lasso estimator, which takes the form The benefit of this approach is that it is scalable to the high-dimensional setting and trivially parallelizable, since each column of Θ 0 can be estimated separately. The requirement of the initialization condition (I) is effectively a requirement on the sample size. Under the condition (MTM), a result of [89] shows that these conditions are satisfied with n ≥ s * 1 s * 2 log p log k. We then characterize the statistical error e stat under the condition (MTM).

Lemma 5. Under the condition (MTM), with probability at least
for some constant C.
The proof of Lemma 5 is given in Section 6.4.
With these conditions, we have the following result on GDT when applied to the multitask learning model in (17).

Corollary 6. Suppose that the condition (MTM) is satisfied. Then for all
, with probability at least 1 − (p ∨ k) −1 , we have for some constant C.
Note that if there is no error term E in the model (17), then Algorithm 1 converges linearly to the true coefficient matrix Θ * , since e stat = 0 in that case. The error rate in Corollary 6 matches the error rate of the algorithm proposed in [83]. However, our algorithm does not require a new independent sample in each iteration and allows for non-Gaussian errors. Compared to the minimax rate established in [83], we observe that our rate matches up to a multiplicative log factor. When r is comparable to log(p ∨ k) they match up to a constant multiplier. Therefore for large enough T , GDT algorithm attains near optimal rate. To the best of our knowledge, no computationally scalable procedure achieves the minimax rate (18). In case we do not consider column sparsity, that is, when s * 2 = k, Corollary 6 gives error rate and prediction error Compared to the prediction error bound kr + s * 1 r log p s proved in [20], we see that GDT error is much smaller with r+log p in place of r log p. Moreover, GDT error matches the prediction error (k +s * 1 −r)r +s * 1 log p established in [100], as long as k ≥ Cr which is typically satisfied.

Application to Multi-task Reinforcement Learning
Reinforcement learning (RL) and approximate dynamic programming (ADP) are popular algorithms that help decision makers find optimal policies for decision making problems under uncertainty that can be cast in the framework of Markov Decision Processes (MDP) [15,104]. Similar to many other approaches, when the sample size is small these algorithms may have poor performance. A possible workaround then is to simultaneously solve multiple related tasks and take advantage of their similarity and shared structure. This approach is called multi-task reinforcement learning (MTRL) and has been studied extensively [75,119,101]. In this section we show how GDT algorithm can be applied to the MTRL problem. A Markov decision process (MDP) is represented by a 5-tuple M = (S, A, P, R, γ) where S represents the state space (which we assume to be finite for simplicity); A is a finite set of actions; P a (s, s ) = Pr(s t+1 = s | s t = s, a t = a) is the Markovian transition kernel that measures the probability that action a in state s at time t will lead to state s at time t+1 (we assume P a to be time homogeneous); R(s, a) is the state-action reward function measuring the instantaneous reward received when taking action a in state s; and γ is the discount factor. The core problem of MDP is to find a deterministic policy π : S → A that specifies the action to take when decision maker is in some state s. Define the Bellman operator where Q : S × A → R is the state-action value function. The MDP can then be solved by calculating the optimal state-action value function Q * which gives the total discounted reward obtained starting in state s and taking action a, and then following the optimal policy in subsequent time steps. Given Q * , the optimal policy is recovered by the greedy policy: π * (s) = arg max a∈A Q * (s, a).
In MTRL the objective is to solve k related tasks simultaneously where each task k 0 ∈ {1, . . . , k} corresponds to an MDP: M k 0 = (S, A, P k 0 , R k 0 , γ k 0 ). Thus, these k tasks share the same state and action space but each task has a different transition dynamics P k 0 , stateaction reward function R k 0 , and discount factor γ k 0 . The decision maker's goal is to find an optimal policy for each MDP. If these MDPs do not share any information or structure, then it is straightforward to solve each of them separately. Here we assume the MDPs do share some structure so that the k tasks can be learned together with smaller sample complexity than learning them separately.
We follow the structure in [26] and solve this MTRL problem by the fitted-Q iteration (FQI) algorithm [43], one of the most popular method for ADP. In contrast to exact value iteration (Q t = T Q t−1 ), in FQI this iteration is approximated by solving a regression problem by representing Q(s, a) as a linear function in some features representing the state-action pairs. To be more specific, we denote ϕ(s) = [ϕ 1 (s), ϕ 2 (s), ..., ϕ ps (s)] as the feature mapping for state s where ϕ i : S → R denotes the ith feature. We then extend the state-feature vector ϕ to a feature vector mapping state s and action a as: where p = |A|×p s . Finally, for MDP k 0 , we represent the state-action value function Q k 0 (·, ·) as an |S| × |A| dimensional column vector with: where Θ k 0 is a p × 1 dimensional column vector. If Θ ∈ R p×k represents the matrix with columns Θ k 0 , k ∈ {1, . . . , k}, then we see that given the Q k 0 (s, a) state-action value functions, estimating the Θ matrix is just a Multi-Task Learning problem of the form (17) with the response matrix Y . = Q ∈ R n×k where n = |S| × |A| denotes the "sample size" with rows indexed by pairs (s, a) ∈ S × A, X . = Φ ∈ R n×p represents the matrix of predictors (features) with (s, a) th row as φ(s, a), and Θ * is the unknown matrix of ADP coefficients. Consistent with the GDT algorithm, to exploit shared sparsity and structure across the k MDP tasks, we will subsequently assume that the coefficient matrix Θ * is row sparse and low rank.
Algorithm 2 provides details of MTRL with GDT. We assume we have access to the generative model of the k MDPs so that we can sample reward r and state s from R(s, a) and P a (s, s ). With "design states" S k ⊆ S, n s . = |S k | given as input, for each action a and each state s ∈ S k , FQI first generates samples (reward r and transition state s ) from the generative model of each MDP. These samples form a new dataset according to Here Q t−1 k 0 is calculated using the coefficient matrix from previous iteration: We then build dataset D t k 0 = (s i , a), y t i,a,k 0 s i ∈S k ,a∈A with s as predictor and y as response, and apply GDT algorithm on the dataset {D t k 0 } k k 0 =1 to get estimator Θ t . This completes an iteration t and we repeat this process until convergence. Finally the optimal policy π t k 0 is given by greedy policy: π t k 0 (s) = arg max a∈A Q t k 0 (s, a) at each iteration t.
To derive theoretical result analogous to [26], we further assume R(s, a) ∈ [0, 1] and hence the maximum cumulative discounted reward Q max = 1/(1 − γ). Since each task is a meaningful MDP, we do not assume sparsity on columns. Suppose sup s ϕ(s) 2 ≤ L, we have the following theoretical result: Theorem 7. Suppose the linear model holds and suppose the conditions in Section 3 are satisfied for each Θ * a with rank r and row sparsity s * 1 , then after T iterations, with probability for some constant C.
Proof. We start from the intermediate result in [86]: The error term t k 0 (s , b) measures the approximation error in state s ∈ S and action b ∈ A. It can be bounded by We then have Taking average, and plugging in the main result (13) and the statistical error (19) we obtain our desired result.

Experiment
In this section we demonstrate the effectiveness of GDT algorithm by extensive experiments. Section 5.1 shows results on synthetic datasets while Section 5.2 and 5.3 show results on two real datasets.

Algorithm 2 Multi-Task Reinforcement Learning with GDT
for a = 1 to |A| do for k 0 = 1 to k, i = 1 to n s do Generate samples r t i,a,k 0 = R k 0 (s i , a) and s t i,a,k 0 ∼ P a,k 0 (s i , s ) Calculate y t i,a,k 0 = r t i,a,k 0 + γ max a Q t−1 k 0 (s t i,a,k 0 , a ) end for end for Estimate Θ t using GDT algorithm with X = X ((s i , a), . end for Output: Θ T

Synthetic Datasets
We present numerical experiments on MTL problem to support our theoretical analysis. Throughout this section, we generate the instances by sampling all entries of design matrix X, all nonzero entries of the true signal U * and V * , and all entries of the noise matrix E as i.i.d. standard normal.
Linear convergence. We first demonstrate our linear convergence result. Because it is hard to quantify linear convergence with statistical error, we turn to show the linear convergence in some special cases. Firstly, as we discussed after Corollary 6, suppose there is no error term E in the model (17), then Algorithm 1 converges linearly to the true coefficient matrix Θ * . In this case we choose p = 100, k = 50, r = 8, s * 1 = s * 2 = 10, and the estimation error is shown in Figure 1. Secondly, as we discussed at the end of Section 3.2, suppose there are no row or column sparsity constraints on Θ * , then Algorithm 1 converges linearly to global minimum Θ. In this case it is more likely that we are in low dimensions, therefore we choose p = 50. The estimation error is shown in Figure 2. We see that in both cases GDT has linear convergence rate.
Estimation accuracy. We compare our algorithm with the Double Projected Penalization (DPP) method in [83], the thresholding SVD method (TSVD) method in [82], the exclusive extraction algorithm (EEA) in [35], the two methods (denoted by RCGL and JRRS) in [20], and the standard Multitask learning method (MTL, with L 2,1 penalty). Here we set n = 50, p = 100, k = 50, r = 8, s * 1 = s * 2 = 10. The reason why we choose a relatively small scale is that many other methods do not scale to high dimensions, as will shown in Table 3. We will show the effectiveness of our method in high dimensions later. Except for standard MTL, all the other methods need an estimate of the rank to proceed for which we apply the rank estimator in [19]. For the methods that rely on a tuning parameter λ, we generate an independent validation set to select the "best" λ. For our method, we use s 1 = 2s * 1 and s 2 = 2s * 2 . We consider two coefficient matrix settings, one is only row sparse and the other one  is both row sparse and column sparse. Table 1 (row sparse setting) and Table 2 (row and column sparse setting) report the mean and the standard deviation of prediction errors, estimation errors and size of selected models based on 50 replications in each setting. We can see that in both cases GDT has the lowest estimation error. Also GDT has nearly the lowest prediction error, which is still impressive since for some of the other methods we select λ that minimizes the prediction error on validation set.
Running time. We then compare the running time of all these methods. We fix a baseline model size n = 50, p = 80, k = 50, r = 4, s * 1 = s * 2 = 10, and set a free parameter ζ. For ζ = {1, 5, 10, 20, 50, 100}, each time we increase n, p, s * 1 , s * 2 by a factor of ζ and increase k, r by a factor of √ ζ and record the running time (in seconds) of each method for a fixed tolerance level, whenever possible. If for some ζ the algorithm does not converge in 2 hours then we simply record ">2h" and no longer increase ζ for that method. Table 3 summarizes the results. We can see that GDT is fast even in very high dimension, while all of the other methods are computationally expensive.
Effectiveness in high dimension. We finally demonstrate the effectiveness of GDT algorithm in high dimensions. Table 1 and Table 2 are both in low dimensions because we want to compare with other algorithms and they are slow in high dimensions, as shown in    Table 3. Now we run our algorithm only and we choose p = 5000, k = 3000, r = 50, s * 1 = s * 2 = 100. The estimation error and objective value are shown in Figure 3 and Figure 4, respectively. In each figure, iteration 0 is for initialization we obtained by Lasso. We can see that both estimation error and objective value continue to decrease, which demonstrates the effectiveness and necessity of GDT algorithm. From Figure 3 we also find that early stopping can help to avoid overfitting (although not too much), especially when n is small.

Norwegian Paper Quality Dataset
In this section we apply GDT to Norwegian paper quality dataset. This data was obtained from a controlled experiment that was carried out at a paper factory in Norway to uncover the effect of three control variables X 1 , X 2 , X 3 on the quality of the paper which was measured by 13 response variables. Each of the control variables X i takes values in {−1, 0, 1}. To account for possible interactions and nonlinear effects, second order terms were added to the set of predictors, yielding X 1 , X 2 , X 3 , X 2 1 , X 2 2 , X 2 3 , X 1 · X 2 , X 1 · X 3 , X 2 · X 3 . The data set can be downloaded from the website of [62] and its structure clearly indicates that dimension reduction is possible, making it a typical application for reduced rank regression methods [62,4,20,100]. Based on the analysis of [19] and [4] we select the rank r = 3; also suggested by [19] we take s 1 = 6 and s 2 = k = 13 which means we have row sparsity only. GDT selects 6 of the original 9 predictors, with X 2 1 , X 1 · X 2 and X 2 · X 3 discarded, which is consistent with the result in [19].
To compare prediction errors, we split the whole dataset at random, with 70% for training and 30% for test, and repeat the process 50 times to compare the performance of the above methods. All tuning parameters are selected by cross validation and we always center the responses in the training data (and transform the test data accordingly). The average RMSE on test set is shown in Table 4. We can see that GDT is competitive with the best method, demonstrating its effectiveness on real datasets.

In vivo Calcium Imaging Data
As a microscopy technique in neuroscience, calcium imaging is gaining more and more attentions [53]. It records fluorescent images from neurons that have been loaded with genetically encoded fluorescent calcium indicator molecules. When a neuron fires an electrical action potential (or spike), calcium will enter the cell, attach to the fluorescent calcium indicator molecules, and then change its fluorescent properties. By recording the movies of this dynamic it allows us to identify the spiking activity from large populations of neurons.
To achieve this goal, [93] introduce a spatiotemporal model and we briefly introduce this model here. Suppose at each time step a field of view with k = 1 × 2 pixels is observed. Assume also that the field contains a total number of (possibly overlapping) K neurons. Denote c i as the calcium activity for each neuron i, it can be described as the following first order autoregressive model, where s i (t) is the number of spikes that neuron i fired at the t-th time step for t = 1, ..., T and γ is set to be γ = 1 − 1/(frame rate) as suggested by [110]. Furthermore, let a i denote the nonnegative spatial footprint vector for neuron i, our observation at each time step t is In matrix form we can rewrite (20) and (21) as Here C ∈ R T ×K , G ∈ R T ×T , S ∈ R T ×K , Y ∈ R T ×k and A ∈ R K×k . Combining them together, we obtain where X = G −1 is observed and Θ * = SA is the coefficient matrix. The support of A is the location of the neurons and the support of S is the time frames when the neurons fire. It is natural to have A to be row sparse since the area for neurons in the monitored area is small; and S to be column sparse since the neurons do not fire very frequently. Therefore our coefficient matrix Θ * will be both row sparse and column sparse. It is also low rank since it is the product of two "tall" matrices because the number of neurons are usually small. Now we see this is a multitask learning problem with simultaneous row-sparse, column-sparse and low rank coefficient matrix where n = p = T and k = 1 × 2 .
The calcium image data we use here is taken in vivo from the primary auditory cortex of a mouse that was transfected with the genetic calcium indicator GCaMP5 [3]. The dataset is a movie with 559 frames (acquired at approximately 8.64 frames/sec), where each frame is 135 × 131 pixels. Therefore we have n = p = 559 and k = 135 × 131 = 17, 685. We use r = 50, more conservative than the estimator given by [19] and we set s 1 = 100 row sparsity and s 2 = 3000 column sparsity. Figure 5 shows 5 most significant manually labeled regions from the dataset, which can be approximately regarded as the ground truths; Figure 6 are the corresponding signals estimated by our GDT algorithm. We can see that they match very well, which demonstrates the effectiveness of our method.

Technical Proofs
This section collects technical proofs.

Proof of Lemma 2
Let [ U , Σ, V ] = rSVD(Θ 0 ) be the rank r SVD of the matrix Θ 0 and let Since Θ is the best rank r approximation to Θ 0 , we have The triangle inequality gives us Now that both Θ and Θ * are rank r matrices, and according to (7) we have Then, Lemma 5.14 in [107] gives us where the second inequality comes from the initialization condition (7). Finally, Lemma 3.3 in [79] gives

Proof of Lemma 3
For notation simplicity, let Z = U V denote the current iterate and let Z + = U + V + denote the next iterate. Let S U = S(U ) ∪ S(U + ) ∪ S(U * ) and S V = S(V ) ∪ S(V + ) ∪ S(V * ). With some abuse of notation, we define the index set S Z = S U ∪ S V to represent coordinates of Z corresponding to U S U and V S V . For an index set S, let P(U, S) = With these notations, we can write and Let O ∈ O(r) be such that We have that where the last inequality follows from Lemma 3.3 of [79]. Therefore, where For the term T 1 , we have Theorem 2.1.11 of [90] gives where in (i) follows from the definition of statistical error and in (ii) we used the Young's inequality ab ≤ a 2 2 + b 2 2 , for a, b, > 0. Therefore, Finally, for the term T 13 , we have where the last inequality follows from the definition of statistical error and the observation . Under the assumptions, and therefore (24) Combining (23) and (24) we have We then have where the first inequality follows since A + B 2 F ≤ 2 A 2 F + 2 B 2 F , and the last inequality follows since max( U 2 , V 2 ) ≤ Z 2 . Combining the results, we have For R 1 , Lemma B.1 of [92] gives For R 12 , we have that where the first inequality follows from the definition of µ min , the second inequality follows from Lemma 5.4 of [107], and the last equality follows from σ r (Z * ) = 2σ r (Θ * ). For R 13 , recall that ∆Z satisfies (14), we have that R 13 ≤ 1 2 ∇g 2 · ∆Z F · 8 5 µ min σ r (Θ * ) ≤ 2 5 µ min σ r (Θ * ) · d 2 (Z, Z * ) + 1 4 Combining (25), (27), (28), and (29), we obtain For R 2 , we have Combining (26), (30), and (31), we have Under the choice of the step size, η ≤ 1 8 Z 2 2 · min 1 2(µ + L) , 1 , the second term and third term in (32) are non-positive and we drop them to get d 2 (Z, Z * )−2η · (T 1 + R 1 ) + 2η 2 · (T 2 + R 2 ) ≤ 1 − η · 2 5 µ min σ r (Θ * ) · d 2 (Z, Z * ) + η · L + µ L · µ · e 2 stat .

Proof of Lemma 4
Comparing (9) and (15) we see that we only need to show Z 2 2 ≤ 2 Z 0 2 2 . Let O ∈ O(r) be such that By triangular inequality we have where the third inequality follows from the definition of µ min and σ 2 r (Z * ) = 2σ r (Θ * ), and the fourth inequality follows from ab (a+b) 2 ≤ 1 4 . Similarly, we have Combining (34) and (35) we have which completes the proof.

Proof of Lemma 5
Let Ω(s, m) denote a collection of subsets of {1, . . . , m} of size s. Let S U ∈ Ω(s 1 , p) and S V ∈ Ω(s 2 , k) be fixed. With some abuse of notation, let W(S U ) = {U ∈ R p×2r | U S c U = 0, U S U 2 = 1} and W(S V ) = {V ∈ R k×2r | V S c V = 0, V S V F = 1}. Let N U ( ) and N V ( ) be the epsilon net of W U and W V , respectively. Using Lemma 10 and Lemma 11 of [112], we know that |N U ( )| ≤ (3 −1 ) 2r·s 1 , |N V ( )| ≤ (3 −1 ) 2r·s 2 , and For fixed U and V , the random variable tr E XU V is a sub-Gaussian with variance proxy σ 2 X S U U S U V S V 2 F . This variance proxy can be bounded as S U ∈Ω(s 1 ,p) (X X) S U S U 2 = nσ 2κ (s 1 ).
Using a tail bound for sub-Gaussian random variables, we get 1 n tr E XU S U V S V ≤ 2σ κ(s 1 ) log 1 δ n with probability at least 1 − δ. To obtain an upper bound on e stat , we will apply the union bound Ω (s 1 , p), Ω(s 2 , k), N U ( ) and N V ( ). We set = 1 2 and obtain e stat ≤ 8σ κ(s 1 ) n s 1 log p + s 2 log k + 2r(s 1 + s 2 ) log 6 + log 1 δ with probability at least 1 − δ. Taking δ = (p ∨ k) −1 completes the proof.

Conclusion
We proposed a new GDT algorithm to efficiently solve for optimization problem with simultaneous low rank and row and/or column sparsity structure on the coefficient matrix. We show the linear convergence of GDT algorithm up to statistical error. As an application, for multi-task learning problem we show that the statistical error is near optimal compared to the minimax rate. Experiments on multi-task learning demonstrate competitive performance and much faster running speed compared to existing methods. For future extensions, it would be of interest to extend GDT algorithm to non-linear models. Another potential direction would be to adaptively select the sparsity level s 1 and s 2 in hard thresholding step.