Capturing Between-Tasks Covariance and Similarities Using Multivariate Linear Mixed Models

We consider the problem of predicting several response variables using the same set of explanatory variables. This setting naturally induces a group structure over the coefficient matrix, in which every explanatory variable corresponds to a set of related coefficients. Most of the existing methods that utilize this group formation assume that the similarities between related coefficients arise solely through a joint sparsity structure. In this paper, we propose a procedure for constructing an estimator of a multivariate regression coefficient matrix that directly models and captures the within-group similarities, by employing a multivariate linear mixed model formulation, with joint estimation of covariance matrices for coefficients and errors via penalized likelihood. Our approach, which we term Multivariate random Regression with Covariance Estimation (MrRCE) encourages structured similarity in parameters, in which coefficients for the same variable in related tasks sharing the same sign and similar magnitude. We illustrate the benefits of our approach in synthetic and real examples, and show that the proposed method outperforms natural competitors and alternative estimators under several model settings.


Introduction
In many cases, a common set of predictor variables is used for predicting different but related target variables. For example, an on-demand transportation company may attempt forecasting demand and supply in different time frames and geographic locations; a realestate firm may be interested in predicting both the construction costs and the sale prices of residential apartments, given a set of project's physical and financial covariates, and external economic variables.
The general task of modeling multiple responses using a joint set of covariates can be expressed using multivariate regression (MR), or multiple response regression -a generalization of the classical regression model to regressing q > 1 responses on p predictors. In the MR settings, one is presented with n independent observations, {(X i , Y i )} n i=1 , where X i ∈ R p and Y i ∈ R q contain the predictors and responses for the ith sample, respectively. Let X = (X 1 , ..., X n ) T = (x 1 , ..., x p ) ∈ R n×p denote the predictor matrix and Y = (Y 1 , ..., Y n ) T = (y 1 , ..., y q ) ∈ R n×q denote the response matrix. For simplicity of notation, assume that the columns of X and Y have been centered so that we need not consider an intercept term. We further assume that the i.i.d N q (0, Σ) error terms are collected into an n × q error matrix E, where Σ is the among-tasks covariance matrix. The multivariate regression model is given by, where B is a p × q regression coefficient matrix. The random matrices in (1) are assumed to follow a matrix-variate normal distribution (Dawid, 1981;Gupta and Nagar, 2018), E ∼ M V N n×q (0, I n , Σ) and Y ∼ M V N n×q (XB, I n , Σ). For reasons that will later become clear, when considering the noise structure of the MR model, the precision matrix, Ω = Σ −1 , is commonly the preferred object. Straightforward prediction and estimation with the MR model can become quite challenging when the number of predictors and responses is large relative to n, as it requires one to estimate pq parameters. The univariate regression model (q = 1) has been widely studied, and numerous methods have been developed for variable selection (support recovery) and coefficients estimation. A naive approach to the MR problem is to apply one of these methods to each of the q tasks independently. However, in many cases, the different problems are related, and this oversimplified approach fails to utilize all the information contained in the data (see, e.g., Breiman and Friedman (1997) and Rothman, Levina, and Zhu (2010)). For a review of Bayesian approaches for estimation and prediction with the MR model see Deshpande, Rockova, and George (2017) and references therein.
In the MR literature, many approaches seek to reduce the number of parameters to be estimated through a penalized (or constrained) least squares framework. Bunea, She, and Wegkamp (2011) generalized the classical Reduced-Rank Regression (RRR) (Anderson, 1951;Izenman, 1975;Velu and Reinsel, 2013) to high dimensional settings, estimating a low-rank coefficient matrix by penalizing the rank of B. Yuan, Ekici, et al. (2007) proposed a method called Factor Estimation and Selection (FES), in which an L 1 -penalty is applied to the singular values of B. FES induces sparsity in the singular values of B, conducting dimension reduction and coefficients estimation simultaneously. One major drawback of dimension reduction techniques, is that the interpretation of the model is often limited, in terms of the original data, since the set of predictors is reduced to a few important principal factors.
The multivariate regression framework naturally induces a group structure over the coefficient matrix, B, in which every explanatory variable, x i for i = 1, ..., p, corresponds to a group of q coefficients, B i = (β i1 , ..., β iq ) (see Figure 1). While many approaches make no assumption over the group structure, others utilize it for learning structured sparsity. In the multi-task learning literature, the L 1 /L 2 -penalty, also known as the group lasso penalty (Yuan and Lin, 2006), has been applied with the rows of B as groups. The L 1 /L 2 -penalty can be viewed as an intermediate between the L 1 -penalty used in lasso regression (Tibshirani, 1996) and the L 2 -penalty used in ridge regression (Hoerl and Kennard, 1970), aimed Figure 1: The multivariate regression framework naturally induces a group structure over the coefficient matrix B, in which every explanatory variable, x i , corresponds to a group of q coefficients at utilizing the relatedness among tasks for identifying the joint support, i.e., the set of predictors with non-zero coefficients across all q responses (Obozinski, Wainwright, and Jordan, 2009). Peng et al. (2010) proposed a mixed constraint function, by applying both the lasso and the group lasso penalties to the elements and rows of B, respectively. This approach produces element-wise as well as row-wise sparsity in the coefficient matrix. Turlach, Venables, and Wright (2005) studied a different constraint function, placing an L ∞ -penalty over the rows of B. As noted by the authors, this method is only suitable for variable selection and not for estimation. Extensions of mixed norm penalties to overlapping groups have been proposed in order to handle more general and complex group structures (see, e.g., Kim and Xing (2012) and Y. Li, Nan, and Zhu (2015)). These methods produce highly interpretable models, however, they are limited to the case Ω ∝ I n , and do not account for correlated errors. Rothman, Levina, and Zhu (2010), Chen and Huang (2016), and Wilms and Croux (2018) have recently shown that accounting for this additional information in MR problems can be beneficial for both coefficients estimation and prediction.
In multivariate normal theory, the entries of Ω that equal zero correspond to pairs of variables that are conditionally independent, given all of the other variables in the data. The problem of sparse precision matrix estimation has drawn considerable recent attention, and several methods have been proposed for both support recovery and parameter estimation. Perhaps the most widely used approach is the graphical lasso (Friedman, Hastie, and Tibshirani, 2008), in which simultaneous sparsity structure identification and coefficients estimation are achieved by minimizing the L 1 -regularized negative log-likelihood function of Ω (Yuan and Lin, 2007;dÁspremont, Banerjee, and El Ghaoui, 2008;Rothman, Bickel, et al., 2008). Recently, sparse precision matrix estimation has also been considered in regression frameworks, in which the main goal for this explicit estimation is to improve prediction (Witten and Tibshirani, 2009;Rothman, Levina, and Zhu, 2010). Rothman, Levina, and Zhu (2010) proposed Multivariate Regression with Covariance Estimation (MRCE), a method for sparse multivariate regression that directly accounts for correlated errors. MRCE minimizes the negative log-likelihood function with an L 1 -penalty for both B and Ω, arg min where tr (·) denotes the trace, λ 1 and λ 2 are the regularization parameters and ω jj is the (j, j ) element of Ω. Lee and Liu (2012) extended the approach of Rothman, Levina, and Zhu (2010) to allow for weighted L 1 -penalties over the elements of B and Ω. Yin and H. Li (2011) considered a similar objective to the one in (2), and proposed an algorithm for the sparse estimation of the coefficient and inverse covariance matrices. However, unlike Rothman, Levina, and Zhu (2010), their method aimed at improving the estimation of Ω, rather than B. Our work further leverages correlations between the different problems to improve the accuracy of the estimators and predictions, by not only accounting for the correlation between the error terms but the similarities between the coefficients as well.
While MRCE accounts for correlated responses through the precision matrix Ω, it does not learn structured sparsity in B, essentially selecting relevant covariates for each response separately. In a recent work, Wilms and Croux (2018) proposed an algorithm for the multivariate group lasso with covariance estimation, replacing the lasso penalty in (2) with an L 1 /L 2 -penalty over a pre-specified group structure. Chen and Huang (2016) developed a method within the reduced-rank regression framework that simultaneously performs variable selection and sparse precision matrix estimation. These methods for learning group sparsity assume that the sparsity structure is known a-priori. Instead, Sohn and Kim (2012) proposed an approach for group sparse multivariate regression that can jointly learn both the response structure and regression coefficients with structured sparsity.
All the above methods which considered a group structure over the coefficient matrix, essentially assume that the within-group similarities arise solely through a joint sparsity structure. In many applications, these structured (and unstructured) sparsity assumptions are not suitable, for instance, if one expects many covariates of small or medium effect. Furthermore, these sparse estimators encourage within-group coefficients to be of similar absolute magnitude, and do not favor same sign coefficients. However, in various real-life examples it is more natural to encourage coefficients within the same group to also share a sign. To address these issues, we construct an estimator for the multivariate regression by directly modeling and capturing the within-group similarities, while also accounting for the error covariance structures. Our method, titled Multivariate random Regression with Covariance Estimation (MrRCE), involves a multivariate linear mixed model with an underlying group structure over the coefficient matrix, designed to encourage related coefficients to share a common sign and similar magnitude.
Multivariate Linear Mixed Models (mvLMMs) (Henderson, 1984) are MR models that relate a joint set of covariates to multiple correlated responses. mvLMMs are applied in many real-life problems and frequently used in genetics due to their ability to account for relatedness among observations (see, e.g., Kruuk (2004), Kang et al. (2010), Korte et al. (2012), and Vattikuti, Guo, and Chow (2012)). The mvLMMs model can be viewed as a generalization of MR (similar to the way Linear Mixed Models (LMMs) are a generalization of linear regression models), allowing both fixed and random effects. Consider the MR problem (1), but with an additional term for the set of random predictors, collected into the matrix Z = (Z 1 , ..., Z n ) T = (z 1 , ..., z r ) ∈ R n×r . The mvLMM model is given by, where B is a p × q fixed effect coefficient matrix and Γ is an r × q random effect coefficient matrix. Here, R and G are the common covariance matrices of columns and rows of Γ, respectively.
In this paper we consider the problem of estimation and prediction under the multivariate random effect regression -an mvLMMs model strictly involving random effects, Under the proposed formulation and unlike the standard mvLMM framework, we are interested in estimating not only the covariance components but also in predicting the random component Γ. Our method accounts for correlations between responses and similarities among coefficients, captured by estimating a joint equicorrelation covariance matrix for the rows of Γ (see Eq. 5 for details). Hence, the MrRCE method is an example of what one could call structured similarity learning, in which the different coefficient groups are assumed to be independent, whereas a within-group similarity is encouraged. This covariance structure for the random coefficient matrix reduces the MR problem of estimating pq parameters, into the problem of estimating two covariance components -the coefficients' common variance, and the intra-group correlation coefficient, or similarity level. The estimation of the covariance structure is achieved through a penalized likelihood, adding an L 1 -penalty over the off-diagonal entries of Ω = Σ −1 . The remainder of the paper is structured as follows. Section 2 describes the MrRCE method and corresponding computational algorithms. Section 3 establishes a connection between the proposed method and the multivariate Ridge estimator. Simulation studies are performed in Section 4 to compare our method with competing estimators, and Section 5 contains two real data applications of MrRCE. Section 6 concludes with a brief discussion.

The MrRCE Method
Consider the random effect regression model (4) with r = p. Assume both the error matrix E and the coefficient matrix Γ follow a matrix variate normal distribution, Further assume an equicorrelation structure for the matrix C, controlled by the unknown intra-group correlation coefficient ρ ∈ [0, 1), The unknown parameter ρ can be thought of as a relative measure of the within-group similarity (Chatfield, J. Zidek, and Lindsey, 2010). Large values for ρ correspond to high similarity among members of the same group, leading to a similar magnitude and same sign coefficients, whereas ρ = 0 corresponds to i.i.d draws for the entries of the coefficient matrix Γ.
Denote the negative log-likelihood function of Y by h (·), and the collection of parameters by Θ = {Ω, σ 2 , ρ}. We construct an estimator of Θ using a penalized normal likelihood, adding an L 1 -penalty over the off-diagonal entries of Ω, where λ ω > 0 is a regularization parameter.

The Algorithm
We propose an iterative tri-step algorithm for solving (6). On the first step we estimate a sparse precision matrix for the errors, followed by an optimization step over (σ, ρ) and an Empirical-Best Linear Unbiased Predictor (E-BLUP) (Henderson, 1975;Henderson, 1984) step for the prediction of Γ. Alg. 1 provides a schematic overview of the MrRCE algorithm. Using eigendecomposition (similar to Zhou and Stephens (2014) and Furlotte and Eskin (2015)), we write, C = U DU T and ZZ T = LSL T where S and D := D ρ = diag (d 1 (ρ) , ..., d q (ρ)) are diagonal matrices, and U is independent of ρ. We then multiply (4) by the orthogonal matrices U and L T from the right and left correspondingly, to obtain,Ỹ =ZΓ +Ẽ We lose the· notation and assume (with a slight abuse of notation) that the original data is of the form, Next, we describe the algorithm for solving (6) under the assumptions (8).
Starting value and Stopping Criteria. We start by taking Γ = 0 p×q , and note that both steps 1 and 2 ensure a decrease in the objective function value. We consider three alternatives for the MrRCE algorithm's stopping criteria.
1. Set 0 < K max ∈ N. Stop if the number of iteration k > K max .
2. Set a tolerance value, τ > 0. Iterate until the sum of absolute changes in the values of Θ in two successive iterations is smaller than the tolerance value.
3. Let Γ ridge = Z T Z + λI −1 Z T Y , the ridge regression (RR) estimator for Γ. We stop is the (j, ) entry of Γ * at iteration k. This stopping criterion was proposed by Rothman, Levina, and Zhu (2010), and is well-defined since the RR estimator always exists. The criterion requires tighter convergence for small effects versus large ones.

Connection to Ridge Regression
We present a connection between the MrRCE method and the Ridge Regression (RR) estimator (Hoerl and Kennard, 1970). More specifically, we explore a special case in which the BLUP for Γ derived at the third step of the MrRCE algorithm, is equivalent to the multivariate RR estimator (Brown and J. V. Zidek, 1980). For this special case, the MrRCE Algorithm 1 (MrRCE): Iterative tri-step optimization procedure (see text for details) Require: Regularization parameter λ ω > 0.
algorithm can be viewed as an alternating "L 1 -L 2 penalty" procedure: L 1 -penalty for the estimation of Ω, followed by an "L 2 -penalty" for the prediction of Γ.
Consider the model, The joint distribution of (y, γ) is given by, γ y ∼ N 0, Λ ΛZ T ZΛZΛZ T + Σ and the BLUP for the random coefficient vector is the expectation of γ conditional on y, The RR estimator can be extended to the multivariate case as in Brown and J. V. Zidek (1980) where K 0 is the pq × pq ridge matrix. We apply the generalized Sherman-Morrison-Woodbury (Sherman and Morrison, 1950;Woodbury, 1950) formula to the inverse ofZ TZ + K, to obtain,γ Eq. 11 can be simplified as follow, Thus, under the i.i.d error model, i.e., Σ 0 = σ 2 I q , setting K = (Σ 0 I p ) Λ −1 yields, This is a well known connection between the RR estimator and BLUP which proves the following result: Proposition 1. AssumingΣ 0 ∝ I, the BLUP estimator of the third step of the MrRCE algorithm is equivalent to the multivariate RR estimator with Ridge matrix K = Σ 0 I p Λ −1 .
To better understand this result, consider the case Σ 0 = σ 2 I q and Λ 0 = σ 2 γ C, where C = C ρ is an equicorrelation matrix with parameter ρ.
For simplicity, we only examine the penalty structure for q = 2, p = 1. Denote the coefficients vector by γ = (γ 11 , γ 12 ) T . The ridge penalty is given by, Note that (12) can be reduced to the univariate ridge penalty by setting ρ = 0, i.e., by considering i.i.d coefficients. For ρ > 0, the second term in (12) kicks-in. We note that b < 0 for ρ ∈ (0, 1), meaning that the second penalty term in (12) is negative, for same sign coefficients. This simple example illustrates that the MrRCE method favors equal sign coefficients, within groups.

Simulation Study
In this section, we compare the performance of the MrRCE method to other multivariate regression estimators, over several settings of simulated data sets. We show that the Mr-RCE method outperforms all competitors, in terms of Model Error, for the vast majority of simulated settings.

Estimators
We construct estimators using natural competitors of the MrRCE method, and report the results for the following methods: 1. Ordinary Least Squares (OLS): Perform q separate LS regressions.

Group Lasso:
Place an L 1 /L 2 -penalty over the rows of the coefficient matrix, with 3-fold cross-validation (CV) for the selection the tuning parameter.

Ridge Regression:
The tuning parameter is selected via leave-one-out cross-validation (LOO-CV) and is shared across all task.

Models
For each settings and every replication, we generate an n × p predictor matrix Z with rows drawn independently from N p (0, Σ Z ), where (Σ Z ) ij = ρ |i−j| Z and ρ Z = .7 (similar to Yuan, Ekici, et al. (2007), Peng et al. (2010), and Rothman, Levina, and Zhu (2010)). Following Rothman, Levina, and Zhu (2010), the coefficient matrix Γ is generated as the element-wise product of three matrices: First, we sample a p × q matrix W ∼ M V N p×q (0, I p , σ 2 C ρ ), with C ρ = I + ρ (J − I), where J is a matrix of ones and I is the identity matrix, both of dimensions q × q. The values of ρ are ranging from 0 to 0.8, where ρ = 0 corresponds to i.i.d samples, γ ij ∼ N (0, σ 2 ). Next, we set, denotes the element-wise product. The entries of the p × q matrix K are drawn independently from Ber (1 − s), and the elements in each row of the matrix Q are all equal zero or one, according to p independent Bernoulli draws with success probability 1 − s g . Hence, setting s, s g > 0 will induce element-wise and group sparsity in Γ. The rows of the error matrix E are drawn independently from N q (0, Σ). We consider several structures for the error covariance matrix, specified in the form of the transformed error covariance matrix, Σ := U T ΣU , where U is the orthogonal matrix obtained via eigendecomposition over the matrix C ρ (see Eq. 7): 1. Independent Errors. The errors are drawn i.i.d form N q (0, I q ). (1). We letΣ ij = ρ |i−j| E . The transformed error covariance matrix is dense, whereas the precision matrixΩ is a sparse, banded matrix.

Fractional Gaussian Noise (FGN). The transformed error covariance matrix is given by,Σ
with H = .95. Both the transformed error covariance matrixΣ and its inverse have a dense structure.
Both the transformed error covariance matrix and its inverse have a dense structure.

Performance Measure
For a given realization of the coefficient matrix and method m, and for each replication r, let γ (r) j denote the true coefficient vector andγ (r) j (m) denote the estimated coefficient vector, both for the jth response. The mean-squared estimation error is given by, where p (z) and Σ Z are the density function and covariance matrix of z, respectively. We evaluate the performance using the model error (ME), following the approach of Breiman and Friedman (1997), Yuan, Ekici, et al. (2007), and Rothman, Levina, and Zhu (2010), The ME over all N replications is averaged to obtain our performance measure,

Results
We simulate N = 200 replications with n = 50, p = 20 and q = 5, for each setting. The correlation parameter ρ ranges from 0 to 0.8, with 0.2 steps. We note that the MRCE method often failed to converge for medium-high values of the correlation parameter, leading to inferior performance (using the default number of maximal iterations in the R-package implementation). Significance tests were performed using paired t−test.
Independent Errors. We first consider an identity error covariance structure,Σ = I q , and set the sparsity and group sparsity levels at s = 0.2, s g = 0. Hence, for ρ = 0 (i.i.d sampled coefficients), we do not expect any advantage for our method, over the competitors. The average ME is displayed in Figure 2. Indeed, for ρ = 0, all methods perform nearly the same in terms of ME, aside from the OLS method, which performs significantly worse (none of the methods significantly outperform all others). For ρ > 0, the MrRCE method achieves significant improvement over all competitors (all p-values < 10e − 5). Autoregressive (AR). LetΣ ij = ρ |i−j| E , with ρ E = 0.75. We use two settings for the sparsity levels, s = s g = 0, and s = s g = 0.1. Although the transformed precision matrix is a sparse, banded matrix, the assumptions of MrRCE only partially hold, as we induce sparsity in Γ as well. The results are displayed in Figure 3. For both settings, the MrRCE method achieves the best ME performance, with a significant improvement over competing methods (all p-values < 10e − 3).
Fractional Gaussian Noise. This covariance structure for the error terms was also considered by Rothman, Levina, and Zhu (2010). We construct a dense coefficient matrix, by setting s = s g = 0. The results are presented in Figure 4, showing that our proposed method provides a considerable improvement over competitors (all p-values < 10e − 16). The margin by which MrRCE outperforms the other methods increases with ρ.
Equicorrelation. Finally, we letΣ ij = ρ E = 0.9 for i = j, and set s = s g = 0.1. The results are displayed in Figure 5. The MRCE method exploits the correlated errors, achieving better performance than the lasso, ridge and OLS methods, and is second only to MrRCE, which significantly outperforms all competitor methods for all values of ρ (all p-values < 10e − 5).

Applications
We consider two publicly available real-life datasets: 1. NYC Taxi Rides 2 . The data consists of the daily number of New-York City (NYC) taxi rides, ranging from January 2016 to December 2017.

2.
Avocado Prices 3 . The data was provided by the Hass Avocado Board website and represents weekly retail scan data for national retail volume (units) and price.
We measure and report the performance of the following methods: 7. MrRCE. Apply 3-fold CV for selecting the graphical lasso regularization parameter.
NYC Taxi Rides. We consider the problem of forecasting the performance of q = 2 taxi vendors in NYC, using historical records of the daily number of rides, spanning from January 2016 to December 2017 (n = 730). This multivariate time-series data is generated according to human activities and actions, and as such can be expected to be strongly affected by multiple seasonalities and holidays effects (see Figure 6). For a regular period P , we utilize the Fourier series to model the periodic effects (Andrew and Neil, 1993;Taylor and Letham, 2018), by constructing 2 · N P features of the form, Z P (t) = cos 2πnt P , sin 2πnt P n=1,...,N P We account for the weekly and yearly seasonalities and introduce the corresponding P -cyclic covariates. For a holiday H, which occurs at times T (H), we use a simple indicator predictors of the form, Lastly, we incorporate covariates for the modeling of a piecewise linear trend. These transformations shift the multivariate time-series problem into a feature space with p = 68, where the linear assumption is appropriate. We denote the transformed observations by, where Z (t) ∈ R p contains measurements of the covariates, Y (t) ∈ R q contains the q responses, and Y j (t) ∈ [0, 1] represents the scaled response of the jth task at time t, obtained by dividing the original observation by the maximal response value for that given task. We evaluate the forecast performance of the different methods using cross-validation like approach, in which we produce K forecasts at multiple cutoff points along the history (Taylor and Letham, 2018). For cutoff k = 0, ..., K − 1, we use the first n train,k = 365 + k · 14 days for training, and the next n test = 14 observations as the test set. The performance of method m over the kth "fold" is measured according to the Mean Squared Error (MSE), where T k are the time indices for the kth test set, andŷ j,t (m) is the forecast for the jth task at time t, produced using method m. Using the above procedure, we obtain K = 26 realizations of the MSE, {M SE m k } K−1 k=0 , for each method m. The mean and standard deviation of the MSE for each of the methods are reported in Table 1. The MrRCE method attains the best forecast performance, with lowest mean MSE and smallest standard deviation, followed by the Ridge and Separate-Ridge methods. A paired t-test confirms that the improvement in accuracy achieved by our method is significant (all p-values < 0.05). We also note that the estimated similarity level for this data isρ = 0.99.
Our proposed method attains the best prediction performance, with lowest mean MSE and smallest standard deviation. A paired t-test confirms that the improvement in accuracy is significant (all p-values < 0.05). We also report the estimated similarity level for this data, atρ = 0.699.

Summary and Discussion
We have presented the MrRCE method to produce an estimator of the covariance components and a predictor of the multivariate regression coefficient matrix. Our method exploits similarities among random coefficients and accounts for correlated errors. We have pro-  posed an efficient algorithm for computing MrRCE. By using simulated and real data, we have illustrated that the proposed method can outperform the commonly used methods for multivariate regression, in settings were errors or coefficients are related. Our method can be extended in several ways. For example, one could consider an arbitrary group structure over the coefficient matrix, model the similarities via different covariance structure, or allow for per-group similarity coefficient. In addition, as pointed out in Rothman, Levina, and Zhu (2010), one could use other penalties over the precision matrix Ω, in particular, ones which will introduce less bias.

Acknowledgement
This research was partially supported by Israeli Science Foundation grant 1804/16.