Inference for elliptical copula multivariate response regression models

The estimation of the coefficient matrix in a multivariate response linear regression model is considered in situations where we can observe only strictly increasing transformations of the continuous responses and covariates. It is further assumed that the joint dependence between all the observed variables is characterized by an elliptical copula. Penalized estimators of the coefficient matrix are obtained in a high-dimensional setting by assuming that the coefficient matrix is either element-wise sparse or row-sparse, and by incorporating the precision matrix of the error, which is also assumed to be sparse. Estimation of the copula parameters is achieved by inversion of Kendall's tau. It is shown that when the true coefficient matrix is row-sparse, the estimator obtained via a group penalty outperforms the one obtained via a simple element-wise penalty. Simulation studies are used to illustrate this fact and the advantage of incorporating the precision matrix of the error when the correlation among the components of the error vector is strong. Moreover, the use of the normal-score rank correlation estimator is revisited in the context of high-dimensional Gaussian copula models. It is shown that this estimator remains the optimal estimator of the copula correlation matrix in this setting. MSC 2010 subject classifications: Primary 62F12, 62J02; secondary 62H20, 62H99.


Introduction
Suppose that a q × 1 response vector Y has been observed on a random sample of subjects and that we wish to determine whether its behavior is influenced by explanatory variables forming a p×1 vector X measured on the same individuals.

Y. Zhao and C. Genest
A standard approach to this problem consists of assuming that there exist linear relationships between the components of Y and those of X, viz.
Y = (B*)′X + ε, (1.1)
where ε is a q × 1 vector of errors and B* is a p × q matrix of coefficients to be selected wisely. This problem has been intensely investigated, even in the recently emerging context where the dimensions of X and Y are larger than the sample size. In the latter case, various sparsity conditions on B* are typically imposed in order to ensure that its estimation is feasible, reliable and efficient. Model (1.1) is not always realistic in practice. However, it will here be shown that it remains possible to estimate B* in a much broader class of models in which it is merely assumed that strictly increasing transformations of the components of X and Y (but not necessarily the original X and Y themselves) are linked by a multivariate linear regression model. To be more precise, suppose that there exist fixed but unknown functions f_1, ..., f_p and g_1, ..., g_q that are strictly increasing on R and such that
g(Y) = (B*)′f(X) + ε, (1.2)
where f(X_1, ..., X_p) = (f_1(X_1), ..., f_p(X_p))′ and, similarly, g(Y_1, ..., Y_q) = (g_1(Y_1), ..., g_q(Y_q))′. The estimation of B* in Model (1.2) has recently been considered by Cai and Zhang [4] in the special case where q = 1 and the vector (f(X)′, g(Y)′)′ is jointly normal and scaled in such a way that its components have unit variance. These authors provide a rate-optimal estimation procedure for the vector B* which is adaptive to the unknown marginal transformations. They also use it to investigate how American crime statistics are connected to socio-economic and law enforcement data. This paper extends the results of Cai and Zhang [4] in two ways. First, by allowing (f(X)′, g(Y)′)′ to have an elliptically contoured distribution [5], we cover cases in which the variables exhibit greater tail dependence than if they were jointly normal. The multivariate Student t distribution is an example. Second, we address issues pertaining to the estimation of B* that arise only when the response is of dimension q > 1 in Model (1.2).
In this context, it is generally advisable to take into account the overall structure of the matrix B * to reduce the dimension of the problem. This is especially important when p and q are comparable to, or larger than, the sample size n, in which case traditional estimators, e.g., those based on the least squares principle, are either unfeasible or perform poorly.
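To fix ideas, data from Model (1.2) can be simulated in its Gaussian special case as follows. This is a minimal sketch under assumptions of ours: the particular monotone transformations (cube root for f and logarithm for g), the dimensions, and all variable names are illustrative choices, not prescriptions from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, q = 200, 5, 3

# Row-sparse coefficient matrix: only the first two rows are nonzero.
B_star = np.zeros((p, q))
B_star[:2, :] = rng.normal(size=(2, q))

# Latent (unobserved) Gaussian version of the model: g(Y) = B*' f(X) + eps.
fX = rng.normal(size=(n, p))            # f(X): standardized latent covariates
eps = 0.5 * rng.normal(size=(n, q))     # errors, independent of f(X)
gY = fX @ B_star + eps                  # g(Y): latent responses

# Only monotone distortions of the latent variables are observed:
# here f_k = cbrt (since X_k = fX_k**3) and g_l = log (since Y_l = exp(gY_l)).
X = fX ** 3
Y = np.exp(gY)
```

Because the transformations are strictly increasing, the observed X and Y carry the same ranks, and hence the same copula, as the latent f(X) and g(Y); this is exactly what rank-based estimation exploits.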
In the framework of Model (1.2) with elliptical dependence structure and arbitrary dimension q > 1, we consider the estimation of B* under two conditions: row sparsity, in which most rows of B* are assumed to be zero vectors, and element-wise sparsity, where most entries of B* are zero but not according to any specific pattern. Neither condition admits a univariate analog unless cov(ε) = Σ_εε is diagonal. It is here shown that the estimation of a row- or element-wise sparse matrix B* can be achieved within the broad framework of Model (1.2) without assuming that Σ_εε is diagonal.
In the limited context of Model (1.1), Yuan and Lin [44] showed that improved estimation of a row-sparse matrix B* can be achieved through a group Lasso penalty, with different groups corresponding to the different rows of B*. The advantage of group sparsity over simpler element-wise sparsity conditions on B* is further documented in [16, 29, 32]. However, these studies generally assume that the components of ε in (1.1) are uncorrelated. To avoid this restriction, we proceed as in [35, 43], where this assumption is relaxed by requiring only that the precision matrix Ω_εε = Σ_εε^{−1} be sparse in Model (1.1). Both B* and Ω_εε can then be estimated via penalized least squares with Lasso-type penalties on the two matrices. As shown numerically by Rothman et al. [35], the estimation accuracy for B* is improved by incorporating an estimate of Ω_εε when the components of the error vector ε are strongly correlated. The resulting optimization problem is convex in B* for fixed Ω_εε and in Ω_εε for fixed B*, but not jointly. We face a similar issue in Model (1.2), and instead of relying on an iterative procedure to estimate the two matrices, we mimic [35] by adopting a one-step method in which an improved estimate of B* is obtained after estimating Ω_εε once.
Our main result, stated more precisely in Sections 5.3.1 and 5.3.2, is that under the row-sparsity assumption, the group penalty approach leads to a better estimation of B * than the element-wise penalty approach, as already reported by Lounici et al. [29] under Model (1.1). This conclusion could not be reached by simply aggregating the univariate-response estimator considered in [4] over the multiple responses. It is remarkable in that it holds even though the estimation is based on observations from (X, Y) rather than (f (X), g(Y)). We also show through simulation the benefit of incorporating Ω εε when the correlation among the components of the error vector is strong.
Another complication that arises in the broader context we consider is that the raw estimator of the design matrix may fail to be positive semidefinite. The naive formulation of the Lasso program is then no longer convex. To address this issue, we can either modify the Lasso program, as motivated in [6], or adopt a Dantzig selector program. In general, the former is computationally more efficient, while the latter yields faster convergence rates under milder conditions, as we will prove. Consequently, even when q = 1, our approach should be preferable to that of Cai and Zhang [4]; see Section 3.3.

Plan of the paper
The model is further discussed in Section 2. The estimator of the coefficient matrix B * is then developed in three stages described in Sections 3-5 as follows.
In Section 3, we consider a column-by-column estimator B of B * which does not use any information about the precision matrix Ω εε . As shown in Section 3.1, its columns are solutions to the Lasso program (3.2). An alternative approach via the Dantzig selector is described in Section 3.2.
In Section 4, the preliminary estimator B is used to obtain an estimator Ω εε of the precision matrix Ω εε . The estimator Ω εε is the solution to the graphical Lasso algorithm (4.2), though any algorithm, such as CLIME [3] or the D-trace loss [46], yielding comparable performance could also be used.
In Section 5, an improved estimation of the coefficient matrix B * is obtained by incorporating the estimator Ω εε . In addition to an element-wise sparsity structure, we consider a row-sparse model for B * and to this end, we impose a group penalty on the coefficient matrix for its second estimation. For the element-wise sparsity case, the final estimator B of B * is given in (5.4); for the row-sparse case, the final estimator B G via the group Lasso approach appears in (5.7), while (5.14) gives the final estimator B D,G via the group Dantzig selector; see, e.g., [21,26].
Section 6 reports the results of a modest simulation study comparing the estimators of B * produced with or without consideration of the precision matrix Ω εε , and with or without consideration of the possible row-sparse structure of B * . All proofs are grouped in Section 7 and concluding comments can be found in Section 8. Some auxiliary results are deferred to an Appendix.
As an aside, we revisit in Section F the normal-score rank correlation estimator or van der Waerden correlation matrix Σ_n, which is known to be the optimal estimator of the copula correlation matrix of a Gaussian copula in fixed dimension. We show that Σ_n actually retains its optimality in high-dimensional Gaussian copula models. Therefore, efficiency gains could be made by resorting to this estimator in many high-dimensional Gaussian copula modeling contexts where Kendall's tau and Spearman's rho are currently predominant. As this contribution concerns a different initial estimator of the copula component rather than the subsequent regression setup and the estimator of B*, its mostly self-contained presentation can be read independently from the rest of the paper.

Notations and conventions
Let Φ −1 denote the standard normal quantile function. In what follows, the Kronecker product ⊗ and the Hadamard product • always take precedence over the usual matrix product. Furthermore, all functions act component-wise when applied to a vector or a matrix.
For any r ∈ [0, ∞], ‖·‖_r denotes the element-wise matrix ℓ_r norm, while |||·|||_r denotes the induced matrix r-norm, i.e., for M ∈ R^{a×b}, |||M|||_r = max{‖Mx‖_r : x ∈ R^b, ‖x‖_r ≤ 1}. In particular, |||·|||_1 is the maximum absolute column sum and |||·|||_∞ is the maximum absolute row sum. We also use ‖·‖_op ≡ |||·|||_2 to denote the operator norm. If M is symmetric, we write M ⪰ 0 if it is positive semidefinite, in which case we also write λ_max(M) for its largest eigenvalue (which then coincides with its operator norm), λ_min(M) for its smallest eigenvalue, and C(M) = λ_max(M)/λ_min(M) for its condition number.
Let Q be a generic Euclidean space, e.g., Q = R^{p×q} or Q = R^q. For conformal matrices or vectors M and N in Q, we define the inner product ⟨·, ·⟩ by ⟨M, N⟩ = tr(M′N); hence, in particular, ‖M‖_2² = ⟨M, M⟩. For a norm R defined on Q, its dual norm R* is given by R*(v) = sup{⟨u, v⟩ : u ∈ Q, R(u) ≤ 1}. For any a ∈ N = {1, 2, ...}, we write [a] = {1, ..., a}. For readability, we will typically use i, j ∈ [n] as sample indices, and k ∈ [p] and ℓ ∈ [q] as coordinate indices, though sometimes k, ℓ ∈ [p+q]. For a matrix M, we use (M)_{kℓ} to denote its (k, ℓ)th element, (M)_{k•} to denote its kth row, and (M)_{•ℓ} to denote its ℓth column. We generally use S as an index set, sometimes with subscripts. If a ∈ R^p and if S ⊂ [p], we use a_S to denote the same vector as a but with entries at locations [p] \ S set to zero. If M ∈ R^{p×q}, and if S ⊂ [p], we use (M)_{S•} to denote the same matrix as M but with entire rows at locations [p] \ S set to zero, while if S ⊂ [p] × [q], we use (M)_S to denote the same matrix as M but with entries at locations ([p] × [q]) \ S set to zero. By convention, the letter C with a subscript denotes a universal constant that can be taken as fixed throughout the paper. Finally, we make the blanket assumption that the sample size n is even for simplicity.
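The distinction between element-wise and induced matrix norms is easy to get wrong in code; the following sketch (variable names are ours) computes both families for a small example.

```python
import numpy as np

M = np.array([[1., -2.],
              [3.,  4.]])

elem_1   = np.abs(M).sum()               # element-wise l1 norm: sum of |entries|
elem_inf = np.abs(M).max()               # element-wise sup norm: max |entry|
ind_1    = np.abs(M).sum(axis=0).max()   # induced 1-norm: max absolute column sum
ind_inf  = np.abs(M).sum(axis=1).max()   # induced inf-norm: max absolute row sum
op       = np.linalg.norm(M, 2)          # operator norm: largest singular value
```

Note that the operator norm always satisfies the interpolation bound op ≤ sqrt(ind_1 * ind_inf), a useful sanity check.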

Model setup and rank-based estimation
Suppose that Model (1.2) holds for fixed but unknown functions f_1, ..., f_p and g_1, ..., g_q that are strictly increasing on R. For identification purposes, further assume that these functions and the scale of ε are chosen in such a way that all components of f(X) and g(Y) have mean 0 and variance 1. The matrix parameter B* in Models (1.1)-(1.2) is then identifiable, as mentioned in Section 1 of [4].
The above identifiability conditions are not restrictive because Model (1.1) or (1.2) can always be modified to ensure that they hold; see Appendix E. Note also that while B* depends on the identifiability conditions, their effect cancels with that of the conditions imposed on the transformation functions f and g when the coefficient matrix is used for the important task of predicting the responses from a sample of X. In that sense, the identifiability conditions are irrelevant, and an accurate estimator of B* will contribute to good forecasts, as we discuss in Appendix E and in the real-world example in Section 6.2. We thus confine ourselves to the specific task of estimating B* for the rest of the paper.
When the joint distribution of (f (X) , ε ) is normal with f (X) independent of ε, Model (1.2) is referred to as a Gaussian copula regression model. We here assume more generally that the joint distribution of (f (X) , ε ) has a non-degenerate elliptical distribution [5], with f (X) uncorrelated with ε. Then, the unobserved vector (f (X) , g(Y) ) also has a non-degenerate elliptical distribution, i.e., there exists a vector μ ∈ R p+q , a nonnegative continuous random variable R and a (p + q) × (p + q) invertible matrix A such that (f (X) , g(Y) ) = μ + RAU, where U is independent of R and uniformly distributed on the unit sphere in R p+q . The unique copula of the vector (f (X) , g(Y) ) is then said to be elliptical. The properties of elliptical copulas are reviewed, e.g., in [13]. Members of this class are characterized by a univariate generator ψ (in one-to-one correspondence with the distribution of R) and a copula correlation matrix.
Given that f and g are strictly increasing, the vectors Z = (X′, Y′)′ and (f(X)′, g(Y)′)′ have the same elliptical copula; see, e.g., Theorem 2.4.3 in [31]. For this reason, the model is called an elliptical copula multivariate response regression model. Under our identifiability conditions, the common copula correlation matrix Σ of Z and (f(X)′, g(Y)′)′ coincides with the covariance matrix of the latter. Accordingly, Σ can then be estimated from the observed sample of Z by inversion of Kendall's tau, irrespective of ψ.
Let Z_1 = (X_1′, Y_1′)′, ..., Z_n = (X_n′, Y_n′)′ be a random sample of size n ≥ 2 from Z. Also let Z = (Z_1, ..., Z_{p+q})′ and, for each i ∈ [n], set Z_i = (Z_{i1}, ..., Z_{i(p+q)})′. For arbitrary k, ℓ ∈ [p + q], the value of Kendall's tau between the kth and ℓth coordinates of Z is then defined [18, 20] as
τ_{kℓ} = E{sign(Z_{1k} − Z_{2k}) sign(Z_{1ℓ} − Z_{2ℓ})}.
Let T ∈ R^{(p+q)×(p+q)} be the matrix whose (k, ℓ)th entry is τ_{kℓ}. When Z is elliptical, we have
(Σ)_{kℓ} = sin(πτ_{kℓ}/2),
as mentioned, e.g., in [17, 23]. Now consider the empirical analog T̂ of T, whose (k, ℓ)th entry τ̂_{kℓ} is the empirical version of Kendall's tau between the kth and ℓth coordinates of Z, viz.
τ̂_{kℓ} = {2/(n(n−1))} Σ_{1≤i<j≤n} sign(Z_{ik} − Z_{jk}) sign(Z_{iℓ} − Z_{jℓ}).
A plug-in estimator of Σ is then given by Σ̂ = sin(πT̂/2). (2.2) It has also long been known that T̂ is a (matrix) U-statistic, whose limiting distribution is (matrix) Gaussian and centered at T; see, e.g., [10]. Accordingly, Σ̂ is a consistent estimator of
Σ = ( Σ_XX Σ_XY ; Σ_YX Σ_YY ) = ( Σ_XX Σ_XX B* ; (B*)′Σ_XX (B*)′Σ_XX B* + Σ_εε ), (2.3)
where the semicolon separates block rows. Here the last equality in (2.3) follows from the joint ellipticity and the uncorrelatedness of X and ε.
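The plug-in estimator (2.2) can be sketched in a few lines of numpy/scipy; the function name is ours, and the O(p²) pairwise loop is purely illustrative (faster batched implementations exist).

```python
import numpy as np
from scipy.stats import kendalltau

def kendall_corr_estimate(Z):
    """Plug-in estimator (2.2): Sigma_hat = sin(pi * T_hat / 2), where
    T_hat is the matrix of pairwise empirical Kendall's taus."""
    d = Z.shape[1]
    T = np.eye(d)
    for k in range(d):
        for l in range(k + 1, d):
            T[k, l] = T[l, k] = kendalltau(Z[:, k], Z[:, l])[0]
    return np.sin(np.pi * T / 2.0)

# Rank-based, hence invariant under strictly increasing marginal transforms.
rng = np.random.default_rng(1)
A = rng.normal(size=(60, 3))
A[:, 2] = A[:, 0] + 0.5 * A[:, 1]          # induce dependence
S_hat = kendall_corr_estimate(A)
S_hat_transformed = kendall_corr_estimate(np.exp(A))
```

Because Kendall's tau depends only on ranks, applying any strictly increasing transformation to the margins (here, exp) leaves the estimate exactly unchanged, which is the point of the rank-based construction.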
Note that the plug-in estimator Σ̂_XX based on Kendall's tau is not necessarily positive semidefinite; see, e.g., [42], as well as [8] for an early mention of this problem. This can be an issue in regularization routines involving Σ̂_XX as a design matrix. When needed, we can rely on the projection Σ̂+_XX of Σ̂_XX onto the set of positive semidefinite matrices, viz.
Σ̂+_XX = arg min_{M ⪰ 0} ‖M − Σ̂_XX‖_∞. (2.4)
This projection can be computed efficiently using the algorithm described in Appendix A of [6]. It also comes at a minimal cost in terms of the ‖·‖_∞ norm because, as stated in Eq. (2.3) of [6], we have
‖Σ̂+_XX − Σ_XX‖_∞ ≤ 2 ‖Σ̂_XX − Σ_XX‖_∞.
As a final remark, note that in the special case of Gaussian copulas, Σ̂ in (2.2) can be replaced by the better-performing normal-score rank correlation estimator Σ̂_n discussed in Section F.
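A projection onto the positive semidefinite cone can be illustrated as follows. Note the hedge in the docstring: the paper's projection (2.4) is taken in the element-wise sup-norm and computed with the algorithm of Appendix A of [6]; the eigenvalue-clipping sketch below is the simpler Frobenius-norm projection, shown only to convey the idea.

```python
import numpy as np

def project_psd_frobenius(S):
    """Project a symmetric matrix onto the PSD cone in *Frobenius* norm by
    clipping negative eigenvalues at zero.  NOTE: this is a common surrogate;
    the paper's projection (2.4) is in the element-wise sup-norm instead."""
    S = (S + S.T) / 2.0                     # symmetrize defensively
    w, V = np.linalg.eigh(S)
    return (V * np.clip(w, 0.0, None)) @ V.T
```

A matrix that is already positive semidefinite is left unchanged, so the projection is idempotent.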

First estimation of B *
The rank-based estimator Σ of Σ can be used to estimate the coefficient matrix B * through (2.3). In this section we study a preliminary, column-by-column estimation of B * that ignores any information about the precision matrix Ω εε . In Section 3.1, a Lasso program based on the design matrix Σ + XX is considered; an alternative approach rooted in the Dantzig selector is described in Section 3.2.
For each ℓ ∈ [q], let β*_ℓ denote the ℓth column of B* and let S_ℓ be the corresponding support set, i.e., the collection of indices of the nonzero elements of β*_ℓ. For any α > 0, further define the cone set corresponding to the ℓth column as
C_ℓ(α) = {x ∈ R^p : ‖x_{[p]\S_ℓ}‖_1 ≤ α ‖x_{S_ℓ}‖_1}.
The restricted eigenvalue (RE) of Σ_XX over the cone set for the Lasso approach is defined as
κ_ℓ = min_{x ∈ C_ℓ(3)} x′Σ_XX x/(2‖x‖_2²),
while for the Dantzig selector approach, we set κ_{D,ℓ} = min_{x ∈ C_ℓ(1)} x′Σ_XX x/(2‖x‖_2²).
Finally, let R : R^p → R with R(·) = ‖·‖_1 be the penalty function for the column-by-column estimation of B*; the dual norm of R is R*(·) = ‖·‖_∞.

The Lasso approach
For each ℓ ∈ [q], define the loss L_ℓ : R^p → R for the ℓth column of B* by setting, for arbitrary β ∈ R^p,
L_ℓ(β) = β′Σ̂+_XX β/2 − β′(Σ̂_XY)_{•ℓ}. (3.1)
As mentioned in Section 2.2 of [4], L_ℓ is motivated by the standard least squares loss function for the ℓth column of B* in Model (1.1), but with the marginally transformed (yet unobserved) quantities Σ_{i∈[n]} f(X_i)f(X_i)′/n and Σ_{i∈[n]} f(X_i)g(Y_i)′/n replaced by the plug-in estimators Σ̂+_XX and Σ̂_XY, respectively. The loss L_ℓ is convex because Σ̂+_XX is positive semidefinite. The Lasso estimator of β*_ℓ is then given by
β̂_ℓ = arg min_{β ∈ R^p} {L_ℓ(β) + λ_ℓ ‖β‖_1}, (3.2)
where λ_ℓ is a tuning parameter. Computationally, the term L_ℓ(β) in (3.2) can be readily converted to an equivalent univariate response least squares form involving the vector β. This transpires from the discussion around Eq. (2.2) in [6], or by treating L_ℓ as a special case of the loss L in (5.1); see Appendix D.
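A quadratic loss plus an ℓ1 penalty of this form can be minimized by standard proximal-gradient (ISTA) iterations. The sketch below is ours: names, the solver choice, and the iteration count are all illustrative, and any off-the-shelf Lasso solver applies after the usual least squares reformulation.

```python
import numpy as np

def soft(x, t):
    """Element-wise soft-thresholding, the proximal operator of t * ||.||_1."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def lasso_column(S_plus, c, lam, n_iter=2000):
    """ISTA sketch of the column Lasso program:
        minimize  beta' S_plus beta / 2  -  beta' c  +  lam * ||beta||_1,
    where S_plus plays the role of the projected design Sigma+_XX and c that
    of the column (Sigma_XY)_{.l}."""
    step = 1.0 / max(np.linalg.eigvalsh(S_plus).max(), 1e-12)  # 1/Lipschitz
    beta = np.zeros_like(c)
    for _ in range(n_iter):
        beta = soft(beta - step * (S_plus @ beta - c), step * lam)
    return beta
```

With an identity design the minimizer reduces to soft-thresholding of c, a convenient correctness check.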
Hence (3.2) can be solved by various efficient algorithms for the standard Lasso. The recovery rate of the estimator β̂_ℓ is given in the following proposition, which is the analog of Theorem 1 in [4]. In what follows, C_1 is a universal constant specified just below (7.2) in Section 7.
Proposition 3.1. Suppose that κ_ℓ > 0 and that n is sufficiently large to ensure that (3.3) holds. Suppose that the tuning parameter λ_ℓ in (3.2) satisfies (3.4). Then, with probability at least 1 − 1/p² − 2/(pq), the bound (3.5) holds.
Remark 3.2. The event on which (3.5) holds is the intersection of the events E_{∞,1,n} and E_{∞,2,n} introduced at the start of Section 7.1. Observe that E_{∞,1,n} ∩ E_{∞,2,n} does not depend on ℓ ∈ [q]. This fact will be used in the proofs of subsequent results.

The Dantzig selector approach
For each ℓ ∈ [q], define the loss L_{D,ℓ} : R^p → R for the ℓth column of B* by setting, for arbitrary β ∈ R^p,
L_{D,ℓ}(β) = β′Σ̂_XX β/2 − β′(Σ̂_XY)_{•ℓ}.
Note that this expression is identical to (3.1), but with Σ̂+_XX replaced by Σ̂_XX. In what follows, ∇ denotes the gradient operator. The Dantzig selector estimator of β*_ℓ is then given by
β̂_{D,ℓ} = arg min_{β ∈ R^p} ‖β‖_1 subject to ‖∇L_{D,ℓ}(β)‖_∞ ≤ λ_{D,ℓ}, (3.7)
where ∇L_{D,ℓ}(β) = Σ̂_XX β − (Σ̂_XY)_{•ℓ} and λ_{D,ℓ} > 0 is a tuning parameter. Note that this Dantzig selector program is always convex, even when Σ̂_XX is not positive semidefinite.
The following proposition specifies the recovery rate of the Dantzig selector estimator β̂_{D,ℓ}.
Proposition 3.3. Suppose that n is sufficiently large to meet the requisite condition, which involves the condition number C(Σ_XX) of Σ_XX. Suppose that the tuning parameter λ_{D,ℓ} in (3.7) satisfies
λ_{D,ℓ} ≥ C_1 {‖β*_ℓ‖_1 √(ln(p²)/n) + √(ln(pq)/n)}. (3.11)
Then, with probability at least 1 − 2/p² − 2/(pq), the corresponding recovery bound holds.
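A Dantzig selector of this form is a linear program: minimize the ℓ1 norm subject to a sup-norm constraint on the gradient. The sketch below (function name ours) uses the standard positive/negative split beta = u − v and scipy's generic LP solver; it is illustrative, not the paper's implementation.

```python
import numpy as np
from scipy.optimize import linprog

def dantzig_column(S, c, lam):
    """LP sketch of the column Dantzig selector:
        minimize ||beta||_1  subject to  ||S beta - c||_inf <= lam.
    Split beta = u - v with u, v >= 0 and minimize 1'u + 1'v."""
    p = len(c)
    A = np.hstack([S, -S])              # A @ [u; v] = S beta
    A_ub = np.vstack([A, -A])           # encodes  -lam <= S beta - c <= lam
    b_ub = np.concatenate([c + lam, lam - c])
    res = linprog(np.ones(2 * p), A_ub=A_ub, b_ub=b_ub, bounds=(0, None))
    return res.x[:p] - res.x[p:]
```

With an identity design, the solution again reduces to component-wise soft-thresholding of c, matching the Lasso in that special case.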

Discussion
The two approaches described above each have their own merit. In general, the Lasso program is computationally more efficient than the Dantzig selector. When ln(p)/n ≈ 0, however, Proposition 3.1 imposes a more stringent upper bound on the number s_ℓ of nonzero elements in each column β*_ℓ of B* than Proposition 3.3. This stems from the fact that Σ̂+_XX given in (2.4) lacks some of the critical properties of U-statistics that Σ̂_XX inherits from T̂. It is thus more difficult to control the magnitude of u′(Σ̂+_XX − Σ_XX)u than that of u′(Σ̂_XX − Σ_XX)u uniformly over unit vectors u. Refer to the proofs of Propositions 3.1 and 3.3 for details.
In practice, it may happen that Σ̂_XX is positive semidefinite. When this occurs, the Lasso program inherits the same relaxed conditions as the Dantzig selector approach. Therefore, a natural question is: under what conditions is Σ̂_XX likely to be positive semidefinite? Roughly speaking, the operator norm ‖Σ̂_XX − Σ_XX‖_op is on the order of √(p/n) + p/n if we only keep factors that depend explicitly on the sample size n and the ambient dimension p; see, e.g., Lemma B.1 in the Appendix, or Theorem 2.2 in [41]. Thus, if the smallest eigenvalue of Σ_XX is on the order of unity, the transition point for Σ̂_XX from likely to unlikely positive semidefinite occurs when p becomes larger than n. This transition is also verified by the simulation studies in Section 6.
For computational sake, and because the theoretical properties of the Lasso program and the Dantzig selector differ mostly through the condition on the number of nonzero elements of the columns of B * , we take the Lasso approach as our starting point and use B in (3.6) as our preliminary estimator of B * .
Finally, we address the difference between our implementation and that of [4], who deal with the case q = 1. When Σ̂_XX is not positive semidefinite, and translated into our q > 1 context, Cai and Zhang [4] suggest either (i) to use the approach of [27] for non-convex M-estimation; or (ii) to project Σ̂_XX onto the positive semidefinite cone to produce a positive semidefinite update Σ̂^{+,s}_XX that minimizes the operator norm of Σ̂^{+,s}_XX − Σ̂_XX when restricted to vectors with at most s nonzero elements. In our case, s corresponds to the number of nonzero elements of the individual columns of B*.
Option (i) has the disadvantage of involving non-convex analysis and an extra tuning parameter that should bound from above the ℓ1 norms of the individual columns of B*; see, e.g., Theorem 3.1 in [27]. The desire to avoid such complications was a major motivation in [6] (see their discussion in Section 1), which we follow. Option (ii) is difficult to carry out in practice: first, we cannot realistically expect s to be known, and second, even if we do know s, no simple algorithm is available to carry out the required projection, in contrast to (2.4). Therefore, our column-by-column estimation of B* is more straightforward and practical than that directly implied by [4].

Estimation of the precision matrix
In order to incorporate the precision matrix into a refined estimation of B*, we need to estimate Ω_εε. Our starting point is an estimator of Σ_εε; to deduce an estimator of Ω_εε from it, we adopt the approach of [33, 34, 45]. From Eq. (2.3), we have
Σ_εε = Σ_YY − (B*)′Σ_XX B*.
Hence a plug-in estimate of Σ_εε is given, for B̂ as in (3.6), by
Σ̂_εε = Σ̂_YY − B̂′Σ̂_XY − Σ̂_YX B̂ + B̂′Σ̂_XX B̂. (4.1)
By analogy with (11) in [33], we estimate Ω_εε by the solution to the graphical Lasso program, viz.
Ω̂_εε = arg min_{Ω ⪰ 0} {tr(Σ̂_εε Ω) − ln det(Ω) + λ_Ω ‖Ω‖_{1,off}}, (4.2)
where λ_Ω is a tuning parameter, and ‖·‖_{1,off} is the ℓ1 norm of the entries of the argument (required to be a square matrix) excluding the diagonal elements.
As is well known, the graphical Lasso does not require Gaussian input. Also note that the program (4.2) is always convex, even though Σ̂_εε is not necessarily positive semidefinite. Program (4.2) can be solved using various graphical Lasso algorithms. If the selected algorithm requires the input to be positive semidefinite, we can simply project Σ̂_εε onto the positive semidefinite cone in a manner analogous to (2.4) to produce an update Σ̂+_εε, and substitute Σ̂+_εε for Σ̂_εε in (4.2). Then Σ̂+_εε will satisfy an inequality analogous to (7.14) with at most an extra factor of 2 on the right-hand side, and inspection of the proof of Proposition 4.2 reveals that this leads to at most the same extra factor in the convergence rates in the proposition.
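The plug-in step feeding the graphical Lasso can be sketched as follows. The function name is ours, and since the exact display (4.1) is not reproduced above in full generality, the code mirrors the population identity Σ_εε = Σ_YY − B′Σ_XY − Σ_YX B + B′Σ_XX B evaluated at block estimates and a preliminary B̂; treat it as an assumption-laden illustration.

```python
import numpy as np

def residual_cov_plugin(S_YY, S_XY, S_XX, B_hat):
    """Plug-in estimate of Sigma_eps_eps in the spirit of (4.1): substitute
    rank-based block estimates and the preliminary B_hat into
    Sigma_YY - B' Sigma_XY - Sigma_YX B + B' Sigma_XX B."""
    return (S_YY - B_hat.T @ S_XY - S_XY.T @ B_hat
            + B_hat.T @ S_XX @ B_hat)
```

When the population blocks and the true B* are supplied, the formula returns Σ_εε exactly, which is the consistency one expects of the plug-in construction.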
Following [33], we define the maximum degree or row cardinality of Ω_εε as the maximum number of nonzero entries in any of its rows. Following also the notation of [33], for any two subsets T and T′ of [q²], let (Γ)_{TT′} denote the card(T) × card(T′) matrix with rows and columns of Γ indexed by T and T′, respectively.
The recovery rate for the estimator Ω̂_εε is stated below and derived in Section 7 under the following standard irrepresentability condition for the graphical Lasso, introduced as Assumption 1 in [33].
Proposition 4.2. Suppose that the tuning parameter λ_ℓ satisfies (3.4) and that n is sufficiently large to ensure that (3.3) holds (so that Proposition 3.1 applies). Further assume that Assumption 4.1 is satisfied and that n is sufficiently large to meet the additional sample-size requirement. Finally, suppose that the tuning parameter λ_Ω in (4.2) satisfies (4.6). Then, with probability at least 1 − (1/p + 1/q)², the estimator Ω̂_εε satisfies the stated recovery bounds.
Remark 4.3. When (4.6) holds, Ω̂_εε is positive semidefinite if n is large enough.
Remark 4.4. The event with probability at least 1 − (1/p + 1/q)² in Proposition 4.2 is in fact the event E_{∞,1,n} ∩ E_{∞,2,n} ∩ E_{∞,3,n}, for E_{∞,1,n}, E_{∞,2,n} and E_{∞,3,n} introduced at the start of Section 7.1. This explicit representation will be used in the proofs of subsequent results.

Second estimation of the coefficient matrix
We now study improved estimates of the coefficient matrix B * which exploit the estimator Ω εε from Section 4 and incorporate assumptions on the overall structure of B * . Preliminary notions are first reviewed in Section 5.1. An element-wise sparse model and a row-sparse model on B * are then considered in Sections 5.2 and 5.3, respectively. A Lasso program based on Σ + XX is used in Sections 5.2 and 5.3.1. Because the Lasso approach fails to reveal fully the benefit of using the group penalty under the row-sparse model, a group Dantzig selector program based on Σ XX is considered in Section 5.3.2.

Preliminaries
The loss functions L(·; S, S×, Ω) : R^{p×q} → R used below to estimate B* depend on matrices S ∈ R^{p×p}, S× ∈ R^{p×q}, and Ω ∈ R^{q×q}. For arbitrary B ∈ R^{p×q}, set
L(B; S, S×, Ω) = tr(B′SBΩ)/2 − tr(B′S×Ω). (5.1)
The quantities S, S×, Ω in (5.1) will either be Σ_XX, Σ_XY and Ω_εε, respectively, or their estimates. When no ambiguity occurs, we write L(·; S, S×, Ω) simply as L.

Element-wise sparsity
A matrix B* is called element-wise sparse if its support set S ⊂ [p] × [q] is such that s = card(S) ≪ p × q. Accordingly, to obtain an element-wise sparse estimator of B*, it is natural to regularize the least squares program with the penalty R(·) = ‖·‖_1, i.e., to estimate B* by
B̂ = arg min_{B ∈ R^{p×q}} {L(B; Σ̂+_XX, Σ̂_XY, Ω̂_εε) + λ ‖B‖_1}, (5.4)
where λ is a tuning parameter. As for (3.2), the term L(B; Σ̂+_XX, Σ̂_XY, Ω̂_εε) in (5.4) can be readily converted to an equivalent univariate response least squares form involving the vectorized B. Hence (5.4) can be solved by various efficient algorithms for the standard Lasso. See Appendix D for details.
The recovery rate for the estimator B is stated below and derived in Section 7 under the following assumption on the RE condition for the population loss function.
We define κ̄, the empirical counterpart to κ in Assumption 5.1, by replacing the population quantities with their estimates. Then, with probability at least 1 − (1/p + 1/q)², the estimator B̂ satisfies the recovery bound stated in Theorem 5.2.

Row sparsity
A matrix B* is called row-sparse if the set S_G ⊂ [p] of indices corresponding to the nonzero rows of B* is such that s_G = card(S_G) ≪ p. In order to obtain a row-sparse estimator of B*, it is natural to regularize the least squares program by the penalty R_G(·) = ‖·‖_{1,2}, i.e., the ℓ1 norm of the ℓ2 norms of the rows of a matrix, given by
‖M‖_{1,2} = Σ_{k∈[p]} ‖(M)_{k•}‖_2.
The dual norm of R_G is R*_G(·) = ‖·‖_{∞,2}, i.e., the maximum of the ℓ2 norms of the rows of a matrix. In what follows, the appropriate cone set for the row-sparse model is given, for any α > 0, by
C_G(α) = {M ∈ R^{p×q} : ‖(M)_{([p]\S_G)•}‖_{1,2} ≤ α ‖(M)_{S_G•}‖_{1,2}},
i.e., for any matrix in the cone set C_G(α), the ‖·‖_{1,2} norm of the rows of the matrix with indices in [p] \ S_G is bounded by α times the ‖·‖_{1,2} norm of the rows of the same matrix with indices in S_G.

The group Lasso approach
A group Lasso estimator of B* is given by
B̂_G = arg min_{B ∈ R^{p×q}} {L(B; Σ̂+_XX, Σ̂_XY, Ω̂_εε) + λ_G ‖B‖_{1,2}}, (5.7)
where λ_G is a tuning parameter. As discussed immediately following (5.4), (5.7) can be efficiently tackled by any of the fast solvers for the standard group Lasso. The recovery rate for B̂_G is stated next and derived in Section 7 under the following assumption on the RE condition for the population loss function, which is the row-sparse analog of Assumption 5.1. We define κ̄_G, the empirical counterpart to κ_G in Assumption 5.3, by replacing the population quantities with their estimates.
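The key building block of proximal solvers for a group-penalized program like (5.7) is the row-wise group soft-thresholding operator, the proximal map of the ‖·‖_{1,2} penalty. A minimal numpy sketch (the function name is ours):

```python
import numpy as np

def row_group_soft(B, t):
    """Row-wise group soft-thresholding: the proximal operator of
    t * ||.||_{1,2}.  Each row is shrunk toward zero by t in Euclidean
    norm; rows with norm at most t are set exactly to zero, which is
    how the group penalty produces row sparsity."""
    norms = np.linalg.norm(B, axis=1, keepdims=True)
    scale = np.maximum(1.0 - t / np.maximum(norms, 1e-12), 0.0)
    return B * scale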

Theorem 5.4. Suppose that Assumption 5.3 and the assumptions of Proposition 4.2 hold. Further suppose that (i) n is sufficiently large to ensure that (4.7) holds and, for κ_G defined in (5.8), the accompanying sample-size condition is met, where C_2 is the absolute constant that appears in Proposition 7.5. Then, with probability at least 1 − (1/p + 1/q)² − 1/p, the estimator B̂_G satisfies the corresponding recovery bound.

In its current form, the group Lasso approach does not fully reveal the benefit of the row-sparse model. For instance, say B* has exactly s_G nonzero rows, and all elements of these rows are nonzero; there are then qs_G nonzero elements in B*. In this case, if the tuning parameter λ is taken to be the lower limit of the specification (5.6), and if Δ_1(Ω_εε) and the parameters that are not explicitly dependent on the ambient dimensions p and q (e.g., ‖B*‖_1) are all bounded, then by Theorem 5.2, with high probability (i.e., for p and q large), the element-wise Lasso program from Section 5.2 yields a recovery rate of order
√(qs_G) × {√(ln(p)/n) + √(ln(pq)/n)}. (5.10)
By comparison, under the same conditions, but with the tuning parameter λ_G taken to be the lower limit of the specification (5.9), the first term in the curly bracket in (5.9) is dominant, and so by Theorem 5.4 the group Lasso program yields the rate (5.11). Thus if p ≈ q, the element-wise and the group Lasso programs yield the same rate, even though the more structured row-sparse model should intuitively give the latter an advantage, as we now discuss.

When the group Lasso approach is used in the traditional non-copula context with fixed design matrix and independent Gaussian errors, Corollary 4.1 in [29] states that the recovery rate for B* under the ℓ2 (or Frobenius) norm is bounded above, for arbitrary p and q, by a constant multiple of
√(s_G/n) × √(q + ln(p)). (5.12)
This bound is strictly better than the rate (5.11) implied by Theorem 5.4 and, up to a possible replacement of ln(p) by ln(p/s_G), it is also the minimax lower bound; see, e.g., the discussion below Theorem 6.1 in [29].
Moreover, under suitable conditions (e.g., the entries in the nonzero rows of B * all have nonnegligible size) this upper bound is also strictly better than the lower bound achievable by the element-wise Lasso program; see, e.g., the discussion in Section 7 in [29].
For reasons similar to those given in Section 3.3, the bound on the recovery rate in Theorem 5.4 is suboptimal because ‖(Σ̂+_XX − Σ_XX)B*‖_{∞,2} is harder to control than ‖(Σ̂_XX − Σ_XX)B*‖_{∞,2}, as transpires from the proofs of Theorems 5.4 and 5.7. Also, Σ̂_XX may happen to be positive semidefinite, though if the smallest eigenvalue of Σ_XX is on the order of unity, this is increasingly unlikely as p grows larger than n. On the event {Σ̂_XX ⪰ 0} = {Σ̂+_XX = Σ̂_XX}, we have the following alternative to Theorem 5.4.

Proposition 5.5. Assume the same conditions as in Theorem 5.4, except that the tuning parameter λ_G in (5.7) now satisfies (5.13). Then, on the intersection of the event {Σ̂_XX ⪰ 0} and another event with probability at least 1 − (1/p + 1/q)² − 9/p, the estimator B̂_G satisfies the corresponding recovery bound.

As will be explained below Theorem 5.7, if the tuning parameter λ_G is taken to be the lower limit of the specification (5.13), whose right-hand side is precisely twice the right-hand side of (5.17), and under suitable conditions, the recovery rate for B* stated in Proposition 5.5 under the ℓ2 norm matches (5.12) and thus reflects the improvement brought about by the group Lasso program (over an element-wise Lasso program) under a row-sparse model. However, this result is not universally applicable, because the stated rates are only established on the event {Σ̂_XX ⪰ 0}, whose probability becomes very small when n is much smaller than p. To properly tackle the row-sparse model, it is thus preferable to estimate B* using a group Dantzig selector program, as detailed next.

The group Dantzig selector approach
The group Dantzig selector estimator of B * is defined as (5.14) subject to where λ D,G is a tuning parameter. As already noted in Section 3.2, this Dantzig selector program is always convex even when Σ XX is not positive semidefinite. The recovery rate for B D,G given next is derived in Section 7 under the following assumption on the RE condition for the population loss function, which is the Dantzig selector analog of Assumption 5.3.
We define κ D,G , the empirical counterpart to κ D,G in Assumption 5.6, as follows. Then, with probability at least 1 − (1/p + 1/q) 2 − 9/p, the estimator B D,G satisfies (5.18). To compare the recovery rate given in Theorem 5.7 to (5.12) in terms of the ℓ2 norm, assume for simplicity that all quantities not explicitly dependent on the ambient dimensions p and q are bounded. Further assume that the tuning parameter λ D,G is taken to be the lower limit of the specification (5.17), and that Conditions (A) and (B) hold. Condition (A) ensures that Ω εε is a good estimate of Ω εε , while Condition (B) guarantees that the row cardinality p of B * is not too large compared to n. The latter condition stems from the second term on the right-hand side of (5.17), which traces back to a second-order term in the Taylor expansion of the sine transformation (2.2) from the Kendall's tau matrix estimate T XX to the copula correlation matrix estimate Σ XX .
Under these mild conditions, λ D,G is on the order of {q + ln(p)}/n. Thus the recovery rate in (5.18), which is in terms of the ℓ2 norm for estimating B * under the row-sparse multivariate response elliptical copula regression model, matches the recovery rate (5.12), which is also the rate achievable by the group Lasso estimator in the traditional non-copula context with Gaussian errors. In addition, this rate is indeed superior to the rate achievable with the element-wise Lasso estimator.
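Although the paper defers optimization details, group-penalized programs of the kind discussed in this section are typically computed with proximal methods whose key ingredient is the row-wise (group) soft-thresholding operator. The following is our own illustrative sketch, not the paper's algorithm:

```python
import math

def group_soft_threshold(B, t):
    """Proximal operator of t * sum_k ||B_{k,.}||_2: each row of B is
    shrunk toward zero by t in Euclidean norm, and any row whose norm
    is at most t is set to zero exactly -- the source of row sparsity."""
    out = []
    for row in B:
        nrm = math.sqrt(sum(x * x for x in row))
        scale = max(0.0, 1.0 - t / nrm) if nrm > 0 else 0.0
        out.append([scale * x for x in row])
    return out
```

Applied inside a proximal-gradient loop, this operator zeroes out entire rows at once, which is exactly how a group penalty promotes row-sparse solutions.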

Numerical performance
In this section we investigate the finite-sample numerical properties of the proposed estimators.

Simulation studies
In this simulation study, we focus exclusively on estimators of B * based on the plug-in estimator (2.2) of Σ obtained by inversion of Kendall's tau. Under the univariate response scenario considered in Section 4.1 of [4], when f and g are not the identity function, such rank-based estimators of B * (i) routinely offer a tenfold improvement in estimation accuracy over estimators based on the sample covariance matrix, because the sample covariance matrix is not robust under marginal transformations, and (ii) are almost as good as the oracle estimator that has full knowledge of f and g. Therefore, we omit comparisons with methods based on the sample covariance matrix.
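The plug-in step (2.2) is simple to implement: compute pairwise Kendall's tau and apply the sine transform Σ jk = sin(π τ jk /2). A self-contained sketch (naive O(n²) per pair, adequate for moderate n; function names are ours):

```python
import math

def kendall_tau_matrix(X):
    """Pairwise Kendall's tau for an n x d sample given as a list of rows."""
    n, d = len(X), len(X[0])
    T = [[1.0] * d for _ in range(d)]
    for j in range(d):
        for k in range(j + 1, d):
            s = 0
            for a in range(n):
                for b in range(a + 1, n):
                    da = X[a][j] - X[b][j]
                    db = X[a][k] - X[b][k]
                    s += ((da > 0) - (da < 0)) * ((db > 0) - (db < 0))
            T[j][k] = T[k][j] = 2.0 * s / (n * (n - 1))
    return T

def plugin_correlation(X):
    """Inversion of Kendall's tau: the sine transform (2.2)."""
    return [[math.sin(math.pi * t / 2.0) for t in row]
            for row in kendall_tau_matrix(X)]
```

Because the estimator depends on the data only through ranks, it is invariant under strictly increasing marginal transformations, which is exactly the property exploited in the simulation design below.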
For the high-dimensional setting, we chose to expand B * by enlarging p. Two competing factors are at work here. As discussed below Proposition 5.5 and Theorem 5.7, under a row-sparse model for B * and under appropriate conditions such as those stated in Eq. (5.20), the group Lasso estimator B G achieves the rate (5.12) on the event { Σ XX ⪰ 0}, while the element-wise Lasso estimator B achieves the rate (5.10). It is easily checked from these rates that enlarging p increasingly favors the group Lasso estimator. When p becomes comparable to, or larger than, n, however, the event { Σ XX ⪰ 0} very often fails to hold. The extent to which the group Lasso estimator remains superior to the element-wise Lasso estimator when Σ XX ⪰ 0 no longer holds must then be assessed empirically.
The structure of B * was obtained as follows. First, following [35], Toeplitz structures were imposed on Σ XX and Σ εε , with D = 0.5 and the value of ρ ranging from 0 to 0.9 to model the varying strength of the correlation among the components of the error vector ε. The precision matrix Ω εε = Σ εε −1 is then a tri-diagonal sparse matrix. To obtain a row-sparse model for B * , we first generated a p × q matrix B * in which exactly 80% of the p rows (chosen at random) were entirely filled with zeros. For the remaining 20% of the rows, within each column, 75% of the p/5 elements (chosen at random within each column) were drawn from the uniform distribution on the two-point set {−1, +1}, while the remaining 25% were set to zero.
To obtain an element-wise sparse model on B * instead, we first generated a p × q matrix B * so that in each of its columns, 80% of its p elements (chosen at random within each column) are equal to zero; the remaining 20% of the entries of B * were drawn from the uniform distribution on {−1, +1}.
In both cases, each column of B * was normalized to ensure that the diagonal elements of B * Σ XX B * all equal 1 − D. As a result, the diagonal elements of the sum Σ εε + B * Σ XX B * are all 1, and hence this sum is indeed a correlation matrix. This normalized version of B * was taken to be the final coefficient matrix B * . Samples (X 1 , Y 1 ), . . . , (X 100 , Y 100 ) were then generated from a multivariate normal distribution with covariance matrix Σ given in (2.3). This served as the observed sample because Kendall's tau is invariant under increasing transformations of the margins, so nothing else is needed to estimate B * .
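The exact Toeplitz specifications are not reproduced above. The sketch below assumes (Σ εε ) kℓ = D ρ |k−ℓ| (consistent with the requirement diag(Σ εε ) = D) and uses an illustrative AR(1)-type Σ XX whose parameter 0.5 is our own choice; the row-sparse generation and the column normalization follow the description in the text.

```python
import math, random

def toeplitz(p, rho):
    """Toeplitz correlation matrix with entries rho**|k - l|."""
    return [[rho ** abs(k - l) for l in range(p)] for k in range(p)]

def quad_form(b, S):
    """Compute b' S b for a vector b and a matrix S."""
    n = len(b)
    return sum(b[k] * S[k][l] * b[l] for k in range(n) for l in range(n))

def row_sparse_coefficients(p, q, D=0.5, rho_x=0.5, seed=0):
    """Generate a row-sparse B*: 80% of rows identically zero; within the
    active rows, 75% of each column's entries are +/-1 and the rest are 0.
    Each column is rescaled so that (B*' Sigma_XX B*)_{jj} = 1 - D."""
    rng = random.Random(seed)
    S = toeplitz(p, rho_x)  # illustrative Sigma_XX; parameter is our assumption
    active = rng.sample(range(p), max(1, p // 5))
    B = [[0.0] * q for _ in range(p)]
    for j in range(q):
        for k in rng.sample(active, max(1, (3 * len(active)) // 4)):
            B[k][j] = rng.choice([-1.0, 1.0])
        col = [B[k][j] for k in range(p)]
        c = math.sqrt((1.0 - D) / quad_form(col, S))
        for k in range(p):
            B[k][j] *= c
    return B, S
```

The normalization constant c is computed column by column, so the unit-diagonal requirement on Σ εε + B * Σ XX B * holds exactly by construction.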
Using this design, we were then able to compare the performance of three estimators: (i) the element-wise Lasso estimator B with the precision matrix included, as described in Section 5.2; (ii) the group Lasso estimator B G with the precision matrix included, as described in Section 5.3.1; (iii) as a benchmark, the element-wise Lasso estimator without the precision matrix incorporated, obtained in the same way as B in Section 5.2 but with the precision matrix estimate Ω εε simply set to the identity matrix in R q×q . This benchmark estimator of B * seemed to perform better than the preliminary column-by-column estimator B described in Section 3.1. For the first two estimators, the tuning parameters λ 1 , . . . , λ q needed for the preliminary estimation in Section 3.1 were chosen column by column via 5-fold cross-validation, while the tuning parameter combination (λ Ω , λ) from Sections 4 and 5.2 (for the first estimator) or (λ Ω , λ G ) from Sections 4 and 5.3.1 (for the second estimator) was chosen jointly on a 2D grid, also via 5-fold cross-validation.
Following [35], the performance of the three estimators was measured, for any estimator B, by the model error ME( B, B * ). In each case, 500 replicates were drawn for each combination of B * and ρ. The median of ME( B, B * ) over the 500 replicates is presented in Figure 1 for the three estimators. The graphs in the left panels of Figure 1 correspond to the element-wise sparse model. They should favor the element-wise Lasso estimator over the group Lasso estimator; as ρ increases, they should also favor the estimator with the precision matrix incorporated over the one without. The latter comment also applies to the graphs in the right panels of Figure 1, which correspond to the row-sparse model. In this case, however, the group Lasso estimator should be preferred over the element-wise Lasso estimator, at least in the moderate-dimension case with (p, q) = (20, 10), where empirically only 0.04% of all the Σ XX generated (from both the element-wise sparse model and the row-sparse model, at all ρ values) failed to be positive semidefinite.
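The display defining the performance measure is not shown above. In studies following [35], it is typically the model error ME( B, B * ) = tr{( B − B * ) Σ XX ( B − B * )}, which we adopt here as a plausible, hedged reconstruction:

```python
def model_error(B_hat, B_star, S_xx):
    """ME(B_hat, B*) = trace{(B_hat - B*)' S_xx (B_hat - B*)}.
    A plausible reconstruction of the elided display, in the style of [35];
    matrices are plain lists of rows."""
    p, q = len(B_star), len(B_star[0])
    D = [[B_hat[k][j] - B_star[k][j] for j in range(q)] for k in range(p)]
    return sum(D[k][j] * S_xx[k][l] * D[l][j]
               for j in range(q) for k in range(p) for l in range(p))
```

When S_xx is the identity, ME reduces to the squared Frobenius norm of the estimation error, which makes the criterion easy to sanity-check.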
These are indeed the general trends featured in Figure 1, with a few anomalies. For the high-dimensional case with (p, q) = (100, 10), none of the Σ XX generated was positive semidefinite. It may thus be a little surprising that even in this case, under the row-sparse model, the group Lasso estimator outperforms the element-wise Lasso estimator, by an even larger margin than in the moderate-dimension setting. Perhaps the projection operation (2.4) does not alter Σ XX much in passing to Σ + XX , even though (2.5) is our only theoretical guarantee. It is also unclear why, in this high-dimensional case and under the element-wise sparse model, the group Lasso estimator still outperforms the element-wise Lasso estimator. This should be investigated further.

Illustration
We also studied the performance of our estimators on data collectively known as the "Better Life Index," gathered for the member states of the Organisation for Economic Co-operation and Development (OECD). The index "aims to involve citizens in the debate on measuring the well-being of societies, and to empower them to become more informed and engaged in the policy-making process that shapes all our lives." For each member state, the index consists of 24 socioeconomic variables. We used the data for the first N = 34 member states in the 2017 edition of the index, which is publicly available at https://stats.oecd.org/Index.aspx?DataSetCode=BLI. We divided the 24 variables into q = 7 responses and p = 17 explanatory variables. Three of the response variables were those labeled "Life satisfaction average score," "Quality of support network" and "Feeling safe walking alone," which are arguably subjective measures. The four other response variables were household income and wealth, as well as measures of air and water quality.
Because in practice we do not have access to the true coefficient matrix B * , we measured the performance of the various estimators by their predictive power. Given that the sample size is small, we computed for each member state i ∈ [N ] an empirical predictor U * (−i) using the data {x j : j ∈ [N ], j ≠ i} from the remaining member states, and then compared the empirical predictor to the actual response y i . Specifically, the empirical predictor was computed as in (E.1), where Φ denotes the cdf of the standard normal distribution N (0, 1), while F * n,k and G * n,ℓ are, respectively, the empirical distribution functions of the kth marginal of X and the ℓth marginal of Y; see (F.1). Refer to Appendix E for the justification of these choices of f and g.
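The marginal transforms used for prediction are normal scores of the form Φ −1 (F * n,k ). The sketch below uses the common rank/(n + 1) rescaling of the empirical cdf to keep the scores finite; the exact rescaling in (F.1) may differ, and the function name is ours.

```python
from statistics import NormalDist

def normal_scores(x):
    """Map observations to Phi^{-1}(rank/(n+1)): the normal-score
    transform, invariant under strictly increasing marginal maps.
    Assumes continuous data (no ties)."""
    n = len(x)
    ranks = [0] * n
    for r, i in enumerate(sorted(range(n), key=lambda i: x[i]), start=1):
        ranks[i] = r
    inv = NormalDist().inv_cdf
    return [inv(r / (n + 1)) for r in ranks]
```

Because the scores depend only on ranks, the transform is adaptive to the unknown marginal functions f and g, which is the point of the copula formulation.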
We considered seven estimators of B * : the element-wise Lasso estimator with the precision matrix incorporated, B Ω ; the group Lasso estimator with the precision matrix incorporated, B G,Ω ; their counterparts B and B G without the precision matrix incorporated; the regular (i.e., ignoring the copula structure) element-wise Lasso estimator B and the regular group Lasso estimator B G , neither incorporating the precision matrix; and finally the ordinary least squares estimator B OLS . Following [11], we considered the ℓ1 norm of the deviation of the empirical predictor from the actual response. Table 1 summarizes the performance of the empirical predictor U * (−i) for the various estimators. The results can be summarized as follows: (a) Both the element-wise Lasso estimator B and the regular group Lasso estimator B G , which ignores the copula structure, perform better than the regular element-wise Lasso estimator B. (b) The estimator B G , which uses both the group penalty and the copula structure, thus combining the features of both B and B G , outperforms the latter two estimators. (c) Further incorporating precision matrix estimation into the estimators B and B G , which results in the estimators B Ω and B G,Ω respectively, brought mixed results. (d) The group Lasso estimator B G,Ω with precision matrix estimation performs somewhat better than B G , but the element-wise Lasso estimator with precision matrix estimation, B Ω , performs slightly worse than its counterpart B. A careful check of the fitted precision matrices Ω εε reveals that they do not differ significantly from the identity matrix for most of the 34 empirical predictors calculated. In this illustration, therefore, the benefit of incorporating precision matrix estimation outweighs the added estimation variability in the group Lasso case, but not in the element-wise Lasso case. As a side note, personal income is consistently chosen by the estimator B G,Ω as the variable having the most predictive power.

Preliminaries
The following basic results and terminology will be used in the proofs of the main results. First, recall some basic deviation properties of Kendall's tau matrix T and the plug-in estimator Σ. It is straightforward to show, e.g., through the equation display just above (4.28) on p. 1207 of [41], that there exist events E ∞,1,n , E ∞,2,n , E ∞,3,n , with probabilities at least 1 − 1/p 2 , 1 − 2/(pq) and 1 − 1/q 2 , respectively, such that the deviation bounds (7.1)-(7.3) hold on the events E ∞,1,n , E ∞,2,n , E ∞,3,n , respectively. Here C 1 is an absolute constant that can be set equal to √ 2π. Eqs. (2.1), (2.2) and the Lipschitz property of the sine function then imply the corresponding bounds (7.4)-(7.6) for the plug-in estimator Σ.

Next, we introduce some terminology regarding the Lasso and the Dantzig selector. Given a generic Euclidean space Q, consider a pair of subspaces M ⊂ M of Q. We denote the orthogonal complement of M by M ⊥ , and the projections of θ ∈ Q onto M, M and M ⊥ by θ M , θ M and θ M ⊥ , respectively. Following [30], we say that a norm-based penalty function R : Q → R is decomposable with respect to the pair (M, M ⊥ ) if R(θ + γ) = R(θ) + R(γ) for all θ ∈ M and γ ∈ M ⊥ . We also define the subspace compatibility constant with respect to the pair (R, · 2 ) as Ψ(M) = sup{R(θ)/ θ 2 : θ ∈ M, θ ≠ 0}. By a cone set, we mean a subset C of Q satisfying the property that there exists a constant c > 0 with R(δ M ⊥ ) ≤ c R(δ M ) for all δ ∈ C.

Next, let F : Q → R be a generic loss function and let β * denote the true (but unknown) parameter value. Let also C ⊂ Q be an arbitrary constraint set, which will typically be a cone set. Following [30], we define, for all δ ∈ Q, the first-order Taylor error δF (δ, β * ) of F around β * . Then F is said to satisfy a restricted strong convexity (RSC) condition with curvature κ F and tolerance function τ F over the set C if the corresponding lower bound (7.7) holds for all δ ∈ C. If τ F (β * ) is zero, then F is said to satisfy a restricted eigenvalue (RE) condition with constant κ F over the set C. Conversely, the RE condition with constant κ F over a set C implies the RSC condition with curvature κ F and tolerance function equal to zero over the same set C. The definitions of δL and the RE condition of L given in Section 5.1 are special cases of these general concepts.
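The display defining the RSC condition is elided above; in the notation of [30] it takes the following form (a reconstruction consistent with the surrounding text, not a verbatim quote of (7.7)):

```latex
\delta F(\Delta, \beta^*) \;:=\; F(\beta^* + \Delta) - F(\beta^*) - \langle \nabla F(\beta^*), \Delta \rangle
\;\ge\; \kappa_F \,\|\Delta\|_2^2 \;-\; \tau_F(\beta^*)
\qquad \text{for all } \Delta \in \mathcal{C}.
```

Setting the tolerance τ F (β * ) to zero recovers the RE condition described next.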
Finally, for the analysis of Dantzig selectors, a variant of the above RSC condition will be used. By analogy with [28], a function F is said to satisfy the RSC-D condition with curvature κ F and tolerance function τ F over the set C if the inequality (7.8) holds for all δ ∈ C. If F is a quadratic function, as will be the case here, the left-hand sides of (7.7) and (7.8) are both quadratic in δ (the exact expressions differ by a factor of 2), and thus the RSC and RSC-D conditions are largely identical on the same cone set C, though they could differ further for more general loss functions. One relation between the two conditions is that, as noted on p. 565 of [28], if F is convex, then the RSC condition implies the RSC-D condition.

Proofs for Section 3
Proof of Proposition 3.1. The proof relies on the following result on the RSC condition.
Proposition 7.1. On the event E ∞,1,n the loss function L satisfies RSC condition (7.7) with curvature κ − 16 C 1 s ln(p 2 )/n and tolerance function vanishing on the cone set C (3).
Proof of Proposition 7.1. We fix arbitrary δ ∈ C (3). It is easy to see that It then follows from two successive applications of Hölder's inequality and the definition of κ that Because δ ∈ C (3), we also have Now we focus on the event E ∞,1,n . Plugging (7.10) into (7.9), we have where the last step follows by (7.6), which holds on the event E ∞,1,n . This concludes the proof.
We wish to apply Corollary 1 in [30]. To this end, we need to (i) identify the subspaces M and M; (ii) check that the appropriate RSC condition holds; (iii) check that the tuning parameter is large enough compared to the noise level. We carry out these tasks in sequence. We focus on the event E ∞,1,n ∩ E ∞,2,n , whose probability is at least 1 − 1/p 2 − 2/(pq). First, we set M and M as in (7.11). Then β * ∈ M and the penalty R is decomposable with respect to (M, M ⊥ ).
Furthermore, recall that the dual norm R * of R is given by R * (·) = · ∞ . We then have and hence In the transition to the last line, we have invoked (7.4) and (7.6), and the last inequality follows by the choice of λ in (3.4). The conclusions of the proposition then follow from Corollary 1 in [30].
Proof of Proposition 3.3. Set δ = 1/p 2 and consider the event E op,k,n = E op,k,δ,n from Lemma B.1, whose probability is at least 1 − 1/p 2 . The proof relies on the following result on the RSC-D condition. Proof. We fix an arbitrary δ ∈ C (1). If s = 0, then δ = 0 and the conclusion of the proposition follows trivially, so we assume that s > 0. We have from which it is easy to see that δ|. (7.12) We fix δ = 1/p 2 , let k ∈ N be an arbitrary integer, and assume that (B.1) holds. We focus on the event E ∞,1,n ∩ E op,k,n , on which (B.2) in Lemma B.1 holds with u = δ (and δ = 1/p 2 ). We also have where the second inequality follows because δ ∈ C (1) implies δ 2 1 ≤ 4 s δ 2 2 by a derivation similar to that of (7.10).
To balance the terms 4 s / √ k and √ k in the last line of (7.13), we choose k = 4s . Then (B.1) translates into (3.8), which holds by assumption, and so (7.13) also holds. Plugging (7.13) into (7.12), we then find as claimed in the proposition.
We wish to apply Lemma A.2. To this end, we check the three conditions at the beginning of the proof of Proposition 3.1, but with the appropriate RSC-D condition instead of the RSC condition. We focus on the event E ∞,1,n ∩E ∞,2,n ∩ E op,4s ,n whose probability is at least 1 − 2/p 2 − 2/(pq).
First, we set M and M as in (7.11) again. Then again β * ∈ M, the penalty R is decomposable with respect to (M, M ⊥ ), and the subspace compatibility constant is Ψ(M) = √ s . Next, starting from Proposition 7.2, we conclude that, over the cone set C (1), the loss function L D, satisfies RSC-D condition (7.8) with tolerance function equal to zero and curvature κ D, > 0, because by (3.9) we have κ D, ≤ 2κ D, − {8 C (Σ XX ) s ln(12p)/n + 2 C 2 1 s ln(p 2 )/n}. Furthermore, the dual norm R * of R is given by R * (·) = · ∞ , and we have Therefore, where the last inequality follows by the choice of λ D, in (3.11). The conclusions of the proposition then follow from Lemma A.2.

Proof of Proposition 4.2
Proof. We have From now on we focus on the event E ∞,1,n ∩E ∞,2,n ∩E ∞,3,n whose probability is at least 1 − (1/p + 1/q) 2 . Then in the last line above, the first term satisfies the bound (7.5), while for the second term, We treat the terms on the right-hand side above in sequence. First, and hence Here we have invoked the conclusion of Proposition 3.1 (see Remark 3.2). Next, Finally, we have In the end, we obtain that (7.14) The conclusions of our proposition then follow from a slight variation of Theorem 1 in [33]. We let Σ εε be some generic estimator of Σ εε , and suppose that on some event A, the estimator Σ εε satisfies the error bound Σ εε − Σ εε ∞ ≤ δ. Note that in the context and notation of Theorem 1 in [33], Σ εε is the observed covariance matrix for a random sample drawn directly from a distribution with covariance Σ εε , and Σ εε − Σ εε ∞ ≤δ f (n, p τ ) with probability at least 1 − p 2−τ .

Proofs for Section 5
Proof of Theorem 5.2. The proof relies in part on the following result on the RSC condition. We treat the three terms on the right-hand side of (7.15) one by one. For the first term, by definition of δL(Δ, B * ; Σ XX , Σ XY , Ω εε ) and Assumption 5.1, we have For the second term, we have For the third term, we have and hence Now, we bound the last factor in (7.18) as and therefore, continuing from (7.18), we have Combining (7.15), (7.16), (7.17) and (7.19), we have and hence, upon writing the right-hand side in a different form,

By a derivation similar to that of (7.10), we have (7.21). Then, plugging (7.21) into (7.20), we conclude that (7.22) holds. Next we focus on the event E ∞,1,n ∩ E ∞,2,n ∩ E ∞,3,n . Then (7.6) holds, and if in addition the assumptions in Proposition 4.2 are satisfied, then Proposition 4.2 states that (4.5) and (4.6) also hold. Thus we can further bound the right-hand side of (7.22) from below, as required. This concludes the proof of Proposition 7.3.
We wish to apply Corollary 1 in [30]. To this end, we check the three itemized conditions at the beginning of the proof of Proposition 3.1. We focus on the event E ∞,1,n ∩ E ∞,2,n ∩ E ∞,3,n whose probability is at Then B * ∈ M, and the penalty R is decomposable with respect to (M, M ⊥ ).
Proof of Theorem 5.4. The proof partially relies on the following result on the RSC condition. Proposition 7.4. Suppose that Assumption 5.3 holds, the assumptions of Proposition 4.2 hold, and n is sufficiently large to ensure that (4.7) holds. Then on the event E ∞,1,n ∩ E ∞,2,n ∩ E ∞,3,n the loss L(·; Σ + XX , Σ XY , Ω εε ) satisfies RSC condition (7.7) with curvature κ G introduced in (5.8) and tolerance function equal to zero over the cone set C G (3).
Proof. We fix arbitrary Δ ∈ C G (3), and use the same decomposition (7.15) in the proof of Proposition 7.3. The treatment for the second term in the last line of (7.15) remains the same as (7.17). For the first term, by definition of δL(Δ, B * ; Σ XX , Σ XY , Ω εε ) and Assumption 5.3, we have For the third term, suppose that Ω εε is positive semidefinite. Then we can take its symmetric positive semidefinite square root Ω 1/2 εε and use the fact that The second inequality follows by (B.4) in Proposition B.2. Therefore, combining (7.15), (7.23), (7.17) and (7.24) we have

Then, plugging (7.26) into (7.25), we arrive at (7.27). Now we focus on the event E ∞,1,n ∩ E ∞,2,n ∩ E ∞,3,n . Then (7.6) holds, and if in addition the assumptions in Proposition 4.2 are satisfied, then Proposition 4.2 states that (4.6) also holds. Moreover, if in addition (4.7) holds, then Ω εε is indeed positive semidefinite. Then (7.27) holds, and furthermore we can bound the right-hand side of (7.27) from below, as required. This concludes the proof of Proposition 7.4.
We wish to apply Corollary 1 in [30]. To this end, we check the three itemized conditions at the beginning of the proof of Proposition 3.1. We focus on the event E ∞,1,n ∩ E ∞,2,n ∩ E ∞,3,n . First, we set M and M as in (7.28). Then B * ∈ M, and the penalty R G is decomposable with respect to (M, M ⊥ ). Next, by Proposition 7.4 and the assumption on κ G , over the cone set C G (3), the loss L(·; Σ + XX , Σ XY , Ω εε ) satisfies RSC condition (7.7) with tolerance function equal to zero and curvature κ G = κ G /2 > 0. Finally, we verify the bound (7.29) on the noise level. The necessary bound for Σ XY − Σ XY ∞, 2 , which appears in (7.29), is given in the next proposition.

Proposition 7.5. There exists an event F 1,n with probability at least 1 − 1/p such that on the event F 1,n we have, for some absolute constant C 2 , the stated bound on Σ XY − Σ XY ∞, 2 .

Proof. The proof can be found in Section C.1.
Proposition 7.6. Let D ∈ R p×q be an arbitrary non-random matrix. There exists an event F D,n with probability at least 1 − 4/p such that on the event E ∞,1,n ∩ F D,n we have Here C 2 is the same absolute constant as introduced in Proposition 7.5.
Proof. The proof can be found in Section C.2.
From Proposition 7.6, in which we take D to be B * Ω εε or B * , we conclude that there exist events F 2,n and F 3,n , each with probability at least 1 − 4/p, such that the corresponding bounds hold on E ∞,1,n ∩ F 2,n and E ∞,1,n ∩ F 3,n , respectively. From now on we further focus on the event E ∞,1,n ∩ E ∞,2,n ∩ E ∞,3,n ∩ F 1,n ∩ F 2,n ∩ F 3,n . Starting from (7.30), (7.31), (7.32) and Proposition 7.5, we then obtain the required control of the noise level; the last step follows by Proposition 4.2 and the choice of λ G in (5.13). Next, focus on the event E ∞,1,n ∩ E ∞,2,n ∩ E ∞,3,n ∩ F 1,n ∩ F 2,n ∩ F 3,n , whose probability is at least 1 − (1/p + 1/q) 2 − 9/p, intersected with one additional event; we then start from a decomposition identical to (7.15) apart from a factor of 2.
We fix Δ ∈ C G (1), and focus on the event E ∞,1,n ∩ E ∞,2,n ∩ E ∞,3,n . If s G = 0, then Δ = 0 and the conclusion of the proposition follows trivially, so we assume that s G > 0. The treatment for the second term in the last line of (7.34) remains the same as (7.17). For the first term, by definition of δL(Δ, B * ; Σ XX , Σ XY , Ω εε ) and Assumption 5.6, we have For the third term, because the assumptions stated in Proposition 4.2 are satisfied, (4.6) holds. Together with (4.7), this further implies that Ω εε is positive semidefinite. Hence we can take its symmetric positive semidefinite square root Ω 1/2 εε and use Now, fix an arbitrary integer k ∈ N, and assume that (B.1) holds with δ = 1/p 2 . In addition to the event E ∞,1,n ∩ E ∞,2,n ∩ E ∞,3,n , we further focus on the event E op,k,n . Then, (B.2) in Lemma B.1 applies with δ = 1/p 2 , and the first step of (7.13) with the substitution of δ by (Δ Ω Now the right-hand term can be rewritten as (7.38) By a derivation similar to that of (7.26), now for Δ ∈ C G (1), we have Then, plugging (7.39) into (7.38), we conclude that . Now, to balance the terms 4 s G / √ k and √ k in the last line above, we choose k = 4 s G . Then (B.1) translates into (3.8) with s replaced by s G , which holds by assumption, and therefore we have Then, from (7.34), (7.35), (7.17) and (7.40), we have Finally, invoking (4.6) in (7.41) yields the conclusion of the proposition.
We wish to apply Lemma A.2. To this end, we check the three itemized conditions at the beginning of the proof of Proposition 3.1, except that now we check that appropriate RSC-D condition, instead of RSC condition, holds. We focus on the event E ∞,1,n ∩ E ∞,2,n ∩ E ∞,3,n ∩ E op,4sG,n ∩ F 1,n ∩ F 2,n ∩ F 3,n whose probability is at least 1 − (1/p + 1/q) 2 − 9/p. Here the event F 1,n is introduced in Proposition 7.5, and the events F 2,n and F 3,n are introduced below Proposition 7.6.
First, we set M and M as in (7.28) again. Then again B * ∈ M, the penalty R G is decomposable with respect to (M, M ⊥ ), and the subspace compatibility constant is Ψ(M) = √ s G . Next, by Proposition 7.7 and the assumption on κ D,G , over the cone set C G (1), the loss L(·; Σ XX , Σ XY , Ω εε ) satisfies RSC-D condition (7.8) with tolerance function equal to zero and curvature κ D,G = κ D,G > 0. Finally, essentially by (7.33), we can verify the required bound on the noise level. The conclusions of the theorem then follow from Lemma A.2.

Discussion
In this paper, we studied the estimation of the coefficient matrix in an elliptical copula multivariate response regression model. In this model, the joint distribution of the responses and covariates has an elliptical copula and arbitrary marginal distributions: after applying some unknown monotone marginal transformations, the responses and covariates are jointly elliptically distributed and are linked to one another through a linear regression model. As such, this model is much more flexible than the conventional multivariate response linear regression model. We provide penalized estimators of the coefficient matrix that are computationally efficient, adaptive to the unknown marginal transformations, able to incorporate the precision matrix of the error vector, and able to take advantage of the potential row-sparsity of the coefficient matrix.
Extensions are possible. First, we could consider the issue of support recovery of B * , which we have not dealt with here. In particular, in analogy with the improvement in the estimation of B * obtained by employing a group penalty under a row-sparse model, it would be interesting to investigate the extent to which a group penalty helps with the recovery of the row-support of B * in our copula context. Second, our current algorithm terminates once an improved estimate of B * is obtained. To further improve the estimation accuracy of B * and Ω εε , we could iterate the procedures of Sections 4-5, starting each pass with the improved estimator of B * instead of B in (4.1). A convergence guarantee for such an iteration scheme, analogous to that derived in [22], would then be needed.
We follow the convention set in Section 7.1. We let Q be a generic Euclidean space, and consider two subspaces M ⊂ M of Q. Assume that the norm-based penalty function R is decomposable with respect to (M, M ⊥ ), let R * denote the dual norm of R, and let Ψ(M) be the subspace compatibility constant with respect to the pair (R, · 2 ). We define the cone set C as in Section 7.1. Let L : Q → R be a generic loss function and let β * denote the true (but unknown) parameter value. To estimate β * , consider the solution β to the Dantzig selector program β = argmin β∈Q R(β) subject to R * {∇L(β)} ≤ λ D , where λ D is a tuning parameter. Further let δ = β − β * . The first lemma below is a slight extension of a well-known result on the geometry of the solution of a Dantzig selector program; see, e.g., Section III in [9]. The second lemma states bounds on δ.
Proof. The derivation is standard. Here we follow the first part of the proof of Theorem 4.3 in [21]. By the triangle inequality and the assumption that β * ∈ M, we have . From the decomposability of R and the assumption β * ∈ M, we also have . Furthermore, given that β * is a feasible solution, we have R( β) ≤ R(β * ) and hence we can conclude.
Lemma A.2. Assume that β * ∈ M and that λ D ≥ R * {∇L(β * )}. Further assume that L satisfies RSC-D condition (7.8) with curvature κ F = κ 1 > 0 and tolerance function τ F (β * ) = κ 2 R 2 (δ) with κ 2 ≥ 0 over the cone set C, and that Then and Proof. First observe that the assumption on λ D makes β * a feasible solution, so that δ ∈ C by Lemma A.1. Next note that by Hölder's inequality and the triangle inequality, The assumption on λ D and the fact that β is a feasible solution then imply that the right-hand side is bounded above by 2λ D R(δ). Furthermore, the triangle inequality, the fact that δ ∈ C, and the definition of the subspace compatibility constant successively imply that Therefore, Moreover, inequality (A.3) and the condition (7.8) on L imply

B.1. Deviation bound in operator norm for the plug-in estimator based on Kendall's tau
We propose a slight reformulation of Corollary 4.8 from [1].
For arbitrary u ∈ R p , it follows easily from Lemma 4.9 and the proof of Corollary 4.8 in [1] that (B.3) holds. Next, Lemma 4.6 in [1] also states that a corresponding operator-norm bound holds on an event E op,k,δ,n with probability at least 1 − δ. Thus (B.2) follows from (B.3), the above bound on E op,k,δ,n , and the bound (7.1) on E ∞,1,n .

B.2. A matrix inequality
We have the following result.

Proposition B.2. Consider an arbitrary matrix
Proof. Assume without loss of generality that A has nonnegative entries a kℓ ≥ 0. It then suffices to show that the claimed inequality holds for each pair (k, ℓ); this is an easy consequence of the Cauchy-Schwarz inequality.

B.3. Conditional moments of Gaussian distributions
Lemma B.3. Let Y and Z be N (0, 1) random variables that are in addition jointly normal with correlation ρ. For any integer r ∈ N, where r!! denotes the double factorial (or semifactorial) of an integer r, i.e., the product of all the integers from 1 up to r that have the same parity (odd or even) as r.
Proof. Given that Y |Z = z has the same distribution as ρz + (1 − ρ 2 ) 1/2 Y , we can write the conditional expectation accordingly; the conclusion then follows from the formula for the rth absolute moment of Y .
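The absolute-moment formula invoked at the end of this proof is the standard identity E|Y| r = (r − 1)!! for even r, and (r − 1)!! √(2/π) for odd r, when Y ∼ N (0, 1). The sketch below (helper names are ours) checks it by numerical quadrature:

```python
import math

def semifactorial(r):
    """r!! = product of the integers from 1 up to r with the same parity as r."""
    out = 1
    while r > 1:
        out *= r
        r -= 2
    return out

def abs_moment(r):
    """E|Y|^r for Y ~ N(0,1) via the double-factorial formula."""
    m = semifactorial(r - 1)
    return m if r % 2 == 0 else m * math.sqrt(2.0 / math.pi)

def abs_moment_numeric(r, h=1e-3, L=12.0):
    """Riemann-sum check of the same moment against the normal density."""
    total, y = 0.0, -L
    while y <= L:
        total += abs(y) ** r * math.exp(-0.5 * y * y) * h
        y += h
    return total / math.sqrt(2.0 * math.pi)
```

For r = 1 this recovers E|Y| = √(2/π), and for r = 2 the unit variance, matching the parity convention for r!! stated in Lemma B.3.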
Lemma B.4. Let X 1 , . . . , X n be centered independent random variables such that for every integer k ≥ 2 and all i ∈ [n], E(|X i | k ) ≤ k!σ 2 i c k−2 /2, and set

Appendix C: Bounding the · ∞, 2 norm of some random matrices

Let S q−1 = {v ∈ R q : v 2 = 1} be the unit Euclidean sphere in R q and let N be a (1/2)-net of S q−1 equipped with the Euclidean norm; see Definition 5.1 in [40]. From, e.g., Lemma 5.2 in [40], we can and will take N to satisfy card(N ) ≤ {1 + 1/(1/2)} q = 5 q . Let ι k ∈ R p have a 1 at the kth coordinate, and 0 elsewhere. Then, for any matrix M ∈ R p×q , (C.1) holds. The following result shows that the supremum over v ∈ S q−1 in (C.1) can be replaced by a supremum over the net N at the cost of a constant factor.
Proof. In view of (C.1), we can find k * ∈ [p] and v * ∈ S q−1 such that This is enough to conclude.
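The cardinality bound card(N ) ≤ 5 q can be probed empirically in a toy case. The sketch below (our own illustration) builds a maximal (1/2)-separated subset of a fine grid on the unit circle (q = 2); by maximality, the subset is a (1/2)-net of that grid:

```python
import math

def greedy_half_net_circle(m=2000):
    """Greedy maximal 1/2-separated subset of m grid points on the unit
    circle S^1; maximality makes it a 1/2-net of the grid, since any
    rejected point lies within distance 1/2 of an accepted one."""
    pts = [(math.cos(2 * math.pi * i / m), math.sin(2 * math.pi * i / m))
           for i in range(m)]
    net = []
    for p in pts:
        if all(math.dist(p, c) >= 0.5 for c in net):
            net.append(p)
    return pts, net
```

For q = 2, the bound 5 q = 25 is quite loose: points at mutual distance at least 1/2 on the circle number only about a dozen.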

C.1. Proof of Proposition 7.5
First note that because the sine function is Lipschitz, we have So it suffices to bound the right-hand side from above. The proof proceeds in three steps.

C.1.1. Reduction to a net
By Proposition C.1, we have For fixed k ∈ [p], v ∈ N and δ > 0, consider the event Using the Chernoff bound technique, we then have, for all t > 0,

C.1.2. Decoupling the U -statistic
and for any permutation i 1 , . . . , i n of the integers 1, . . . , n, define Summing over all possible permutations, we can then write in terms of the sample (X 1 , Y 1 ), . . . , (X n , Y n ). It then follows from (C.2) and the convexity of the exponential function that

C.1.3. Conversion into an average of sub-Gaussian random variables
For each i ∈ [n/2], let For each i ∈ [n/2], further let U k,i = ι k S X,i and W v,i = v S Y,i . Then

Given that the variables S X,1 , . . . , S X,n/2 are iid, as are S Y,1 , . . . , S Y,n/2 , we have and hence (C.4) becomes The problem is thus reduced to finding an upper bound on the moment generating function of U k,1 W v,1 . To this end, define the sub-Gaussian norm · ψ2 of a random variable R by as in Definition 5.7 of [40]. By Remark 5.18 in [40], and considering that |U k,1 | = 1, we have Furthermore, by Lemma 5.5 in [40], notably the equivalence of the constants K 2 and K 4 in that lemma up to an absolute constant factor, there exists an absolute constant C such that where the second inequality follows by (C.6).
Next, Lemmas 4.4-4.5 in [1] imply that for all i ∈ [n/2], S Y,i is C(Σ YY )-sub-Gaussian, i.e., for any fixed w ∈ R q , we have Thus, for all θ ∈ R, By the above and again by Lemma 5.5 in [40], in particular the equivalence of the constants K 2 and K 4 in that lemma up to an absolute constant factor, the centered random variable W v,1 satisfies for some absolute constant C . Combining (C.7) and (C.8), we conclude that with C = (4C ) × (C ) 2 . Combining (C.9) and (C.5), we get for all t ∈ (0, ∞), and hence also, by minimizing over all such values of t, (5) on the event F 1,n , which is the desired conclusion.

C.2. Proof of Proposition 7.6
It mimics the proof of Proposition 7.5 but extra steps are needed to tackle the right-multiplication of Σ XX − Σ XX by D. First, a Taylor expansion yields which, by the triangle inequality, is bounded above by where T XX is a symmetric random matrix such that each entry (T XX ) k is a random number strictly between ( T XX ) k and (T XX ) k . The second summand in (C.10) is bounded above by It remains to find an upper bound on the first summand in (C.10).

C.2.1. Reduction to a net
By Proposition C.1, the first summand in (C.10) is no larger than For fixed k ∈ [p], v ∈ N and δ > 0, consider the event Using the Chernoff bound technique, we have, for all t > 0,

C.2.2. Treating the cosine function transformation
By Lemma E.1 in [1], there exist vectors a 1 , a 2 , . . . and b 1 , b 2 , . . . , all belonging to R p , with a r ∞ , b r ∞ ≤ 1 for every integer r ∈ N, and a sequence t 1 , t 2 , . . . of non-negative numbers adding up to 4, such that For any vector u, let diag(u) be the diagonal matrix with the elements of u arranged on the diagonal. Plugging the above expression into the expectation term on the right-hand side of (C.12), and calling on the fact that the exponential function is convex, we find

C.2.3. Decoupling the U -statistic
and for any permutation i 1 , . . . , i n of the integers 1, . . . , n, write Summing over all possible permutations, we can again write Thus, for any integer r ∈ N, and from the convexity of the exponential function, this expression is bounded above by Therefore, where S X,1 , . . . , S X,n/2 are as defined in (C.3) and iid.

C.2.4. Conversion into an average of sub-Gaussian random variables
For each i ∈ [n/2], define U k,r,i = ι k diag(a r )S X,i and V v,r,i = v D diag(b r )S X,i . Given that the pairs (U k,r,i , V v,r,i ) are iid, we can rewrite the right-hand side of (C.14) as It then follows from (C.13) that The problem is thus reduced to finding an upper bound on the moment generating function of U k,r,1 V v,r,1 . To this end, we use its sub-Gaussian property.
Given that a r ∞ ≤ 1, we have |U k,r,1 | ≤ 1 and hence U k,r,1 V v,r,1 ψ2 ≤ V v,r,1 ψ2 . By Remark 5.18 in [40], we also have U k,r,1 V v,r,1 − E(U k,r,1 V v,r,1 ) ψ2 ≤ 2 U k,r,1 V v,r,1 ψ2 ≤ 2 V v,r,1 ψ2 . (C.16) By Lemma 5.5 in [40], and specifically the equivalence of the constants K 2 and K 4 in that lemma up to an absolute constant factor, there exists an absolute constant C, as introduced in the proof of Proposition 7.5, such that where the second inequality follows by (C.16).
Next, Lemmas 4.4-4.5 in [1] imply that for all i ∈ [n/2], S X,i is C(Σ XX )-sub-Gaussian, i.e., for any fixed u ∈ R p , we have Thus, for all θ ∈ R, From the above and again by Lemma 5.5 in [40], in particular the equivalence of the constants K 2 and K 4 in that lemma up to an absolute constant factor, the centered random variable V v,r,1 satisfies for the same constant C as in (C.9). In view of (C.19), (C.15) thus becomes for all t ∈ (0, ∞), and hence also, by minimizing over all such values of t, Thus Pr{(B k,v,δ * ) } ≤ 4/(p 2 5 q ) as soon as Finally, consider the event F D,n = ∩ k∈[p],v∈N B k,v,δ * . From the above upper bound on Pr{(B k,v,δ * ) }, valid for all k ∈ [p] and v ∈ N , together with the union bound, we get Pr(F D,n ) ≥ 1 − 4/p. Summing up (in particular recall (C.11)), we have, on the event E ∞,1,n ∩ F D,n , and hence, by invoking (7.1), we have, on the same event, which is the desired conclusion.

Appendix D: Computational aspects
We can write (5.1) in a form to which a solver for a standard Lasso problem can be readily applied. Section 2 of [6], specifically at the top of p. 6 around Eq. (2.2), describes a procedure similar in spirit but applicable to a univariate response problem. To see how, first recall that S, S × and Ω serve as proxies for Σ XX , Σ XY and Ω εε , respectively. We further assume that S and Ω are positive definite. Let A ∈ R (pq)×(pq) be a Cholesky factor of Ω ⊗ S, so that A A = Ω ⊗ S. Next, let y ∈ R (pq) be such that A y = vec(S × Ω). We can then rewrite the right-hand side of (5.1) as a standard Lasso objective.
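The reduction can be sketched numerically as follows; all names and dimensions are ours, and we only verify the defining identities A A = Ω ⊗ S and A y = vec(S × Ω), under which the smooth part of the penalized objective coincides with the least-squares loss ‖y − A vec(B)‖²/2 up to a constant not involving B:

```python
import numpy as np

rng = np.random.default_rng(3)
p, q = 4, 3

# Positive-definite stand-ins for the proxies S and Omega, and a p x q
# matrix playing the role of the cross proxy S_x (all hypothetical).
G = rng.standard_normal((2 * p, p)); S = G.T @ G / (2 * p)
H = rng.standard_normal((2 * q, q)); Omega = H.T @ H / (2 * q)
Sx = rng.standard_normal((p, q))

K = np.kron(Omega, S)            # Omega ⊗ S, positive definite
L = np.linalg.cholesky(K)        # K = L L'
A = L.T                          # a Cholesky factor with A'A = Omega ⊗ S
rhs_vec = (Sx @ Omega).reshape(-1, order="F")   # vec(Sx Omega)
y = np.linalg.solve(A.T, rhs_vec)               # A'y = vec(Sx Omega)

# For any b = vec(B), ||y - A b||^2 / 2 equals
# b'(Omega ⊗ S)b/2 - vec(Sx Omega)'b plus a constant in b.
b = rng.standard_normal(p * q)
lsq = 0.5 * np.sum((y - A @ b) ** 2)
quad = 0.5 * b @ K @ b - rhs_vec @ b + 0.5 * y @ y
```

With this A and y in hand, any off-the-shelf Lasso solver applied to (y, A) handles the element-wise penalized problem.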

Appendix E: Identifiability issues and prediction
Suppose we are given Model (1.1) but not all components of X and Y have mean 0 and variance 1. The model can then be transformed into a form that does. Model (1.1) implies that where diag denotes the diagonal matrix with diagonal elements identical to those of the argument. Obviously, all components of diag(Σ XX ) −1/2 {X − E(X)} and diag(Σ YY ) −1/2 {Y − E(Y)} have mean 0 and variance 1. Then, by making the simultaneous substitutions, we can convert the original Model (1.1) into a new form where all components of the covariates and the responses have mean 0 and variance 1. In Model (1.2), the substitutions performed on f (X) and g(Y) can be absorbed into the functions f and g, respectively.
Now we briefly discuss the prediction problem for Model (1.2) under the elliptical copula multivariate response regression model, where given a value X * of the covariate, we would like to predict the response Y * . Our discussion essentially follows Section 2.1 in [11], which deals with a variant of the Gaussian copula regression model. In the ideal setting where the transformation functions f and g and the coefficient matrix B * are known, the oracle predictor for the median of Y * is This results from adapting Eq. (4) in [11] to our context using the conditionals of elliptical distributions; see, e.g., Theorem 2.18 in [12]. The oracle predictor U * has a unique value, irrespective of the identifiability conditions. Prediction at other quantile levels of Y * is also possible but more involved.
In practice, f , g and B * are not available, and it is natural to replace them by their estimates f , g and B when the latter are available. Then, an empirical predictor for the median is given by where g −1 is a suitable inverse of g. Therefore, accurate estimation of f , g and B * is important for achieving good prediction performance. We have addressed the estimation of B * in this paper.
In the most commonly encountered Gaussian copula regression model, the estimators of f and g are also straightforward to construct. We focus our discussion on f because the treatment of g is analogous. In this case, because f (X) is multivariate normal and all its components have mean 0 and variance 1 by our identification conditions, f (x 1 , . . . , x p ) = (f 1 (x 1 ), . . . , f p (x p )) is explicitly given by f r = Φ −1 • F r , where F r is the marginal distribution function of the rth coordinate of X, and Φ −1 is the N (0, 1) quantile function. To obtain an estimator f r of f r , it is typical to substitute F r in f r = Φ −1 • F r by the empirical marginal distribution function F * n,k in (F.1), sometimes in a Winsorized, i.e., truncated, version; see, e.g., Eq. (3.26) in [19] and Section 4 in [25]. In the more general case where f (X) is elliptical, Φ −1 should be replaced by proper quantile functions for the marginals of f (X) to obtain f .
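To make the construction concrete, here is a small sketch (entirely ours) of estimating a single transformation f r = Φ −1 • F r from data with a log-normal margin, in which case the true transform is the logarithm; plugging the rescaled empirical distribution function rank/(n + 1) into Φ −1 keeps the estimate finite without Winsorization:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
n = 5000
# X = exp(Z) with Z ~ N(0,1): here F_r is the log-normal cdf, so the
# true transformation f_r = Phi^{-1}(F_r(x)) is simply log(x).
x = np.exp(rng.standard_normal(n))

# Estimate f_r by substituting the rescaled empirical cdf into Phi^{-1}.
ranks = stats.rankdata(x)
f_hat = stats.norm.ppf(ranks / (n + 1))
true = np.log(x)
```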

Appendix F: Use of the normal-score rank correlation estimator in the Gaussian copula case
In this Appendix, we examine the properties of the normal-score rank correlation estimator Σ n , also known as the van der Waerden correlation matrix, as an estimator of the copula correlation matrix when the underlying copula is Gaussian.
The discussion is carried out in a general context. For reasons of space, only a brief discussion of the application of Σ n to the Gaussian copula regression problem is included at the end. Let X ∈ R p be a continuous random vector with Gaussian copula and copula correlation matrix Σ. Let F 1 , . . . , F p denote the marginal distribution functions of X = (X 1 , . . . , X p ) . For each i ∈ [n], let X i = (X i,1 , . . . , X i,p ) be an independent copy of X. We define the (rescaled version of the) empirical marginal distribution function for the kth coordinate of X as Then, the normal-score rank correlation estimator, or van der Waerden correlation matrix, is defined as in, e.g., Eq. (7) on p. 113 in [15]. Here φ n is a deterministic correction factor given by For each i ∈ [n], define the Gaussianized observation and let Σ n be the corresponding sample covariance matrix, viz.
The correction φ n is asymptotically negligible, but with it the diagonal elements of Σ n all equal 1, so that Σ n is a genuine correlation matrix. The elements of Σ n are instances of the multivariate rank order statistics that are common in the literature; see [15, 36] for some early references.
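A minimal implementation sketch (ours) of Σ n : Gaussianize each coordinate through Φ −1 of the rescaled ranks, form the sample covariance, and divide by the correction φ n , which makes the diagonal exactly 1 when the data are continuous (each column's ranks are then a permutation of 1, . . . , n):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
n, p = 500, 4
# Gaussian copula data: componentwise monotone transforms of a
# correlated normal vector; the non-normal margins leave ranks intact.
Sigma = 0.5 * np.eye(p) + 0.5 * np.ones((p, p))   # equicorrelation 0.5
Z = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
X = np.exp(Z)

# Gaussianized observations Phi^{-1}{rank/(n+1)}, columnwise.
G = stats.norm.ppf(stats.rankdata(X, axis=0) / (n + 1))

# Correction phi_n = (1/n) sum_i [Phi^{-1}{i/(n+1)}]^2; dividing by it
# puts ones on the diagonal of the resulting matrix.
phi_n = np.mean(stats.norm.ppf(np.arange(1, n + 1) / (n + 1)) ** 2)
Sigma_n = (G.T @ G) / (n * phi_n)
```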
In fixed dimension, the van der Waerden correlation matrix Σ n enjoys one crucial property: it is asymptotically semiparametrically efficient. More precisely, Σ n achieves the asymptotic covariance matrix lower bound in the Hájek-Le Cam convolution theorem; see [19]. The analogous estimators based on Kendall's tau and Spearman's rho do not share this property.
Whether or not the dimension is fixed, it also follows readily from (F.4) that Σ n is always positive semidefinite. As a result, when the Lasso estimator studied earlier in this paper is equipped with the estimator Σ n , it no longer requires the projection trick in (2.4), which sometimes leads to complications in the analysis, as in Section 3.3 or 5.3.1. Therefore, at least in fixed dimension, when estimating the copula correlation matrix for Gaussian copulas, the van der Waerden correlation matrix Σ n should clearly be favored over the estimators based on Kendall's tau and Spearman's rho.
That the estimator Σ n is semiparametrically efficient in the fixed-dimensional setting is equivalent to the characterization that Σ n is asymptotically linear in the efficient influence function; see, e.g., Lemma 25.23 in [39]. The main goal of this Appendix is to establish, in Theorem F.1, a high-dimensional counterpart to this statement. The theorem states that, in high dimensions, each element of the estimator Σ n is the sum of a term that is linear in the efficient influence function (and thus asymptotically normal after scaling by n 1/2 ) and a remainder term that is, with high probability, uniformly small (at rate n −1 up to logarithmic factors) over all coordinates.
While the elements of the Kendall's tau or Spearman's rho matrices are also asymptotically linear in their own influence functions, those influence functions are inefficient. Thus, our result implies that the elements of Σ n have smaller asymptotic variance than those of both the Kendall's tau and the Spearman's rho estimators, so long as the remainder terms do not appreciably affect the covariance matrix. Further discussion is provided in Section F.3.
To state Theorem F.1, first define the efficient influence function for estimating elements of Σ as in, e.g., Theorem 3.1 in [19]. For any given correlation coefficient ρ ∈ R and arbitrary marginal distribution functions G and H, let In the statement of Theorem F.1 and in its proof, detailed in Section F.2, C denotes a finite absolute constant that does not depend on n and p (nor on k, k , m, r, which occur later). However, this constant C may change at each occurrence. Furthermore, ≲ denotes an inequality that holds up to such a C as the multiplicative factor.
where the remainder term R n,kk satisfies, with probability at least 1 − C/p 2 , |R n,kk | ≤ C{ln(p) ln 2 (n) + ln 3/2 (p)}/n. (F.6) The condition ln(p) = o(n) is common in the literature on high-dimensional statistics. The other condition ln(n) = O(p) is purely for convenience: with it, δ n,p in (F.7) admits the simple upper bound C ln 1/2 (p). This simplification will be used repeatedly without further mention.

F.1. Preliminaries
At a high level, the proof follows that of Theorem 3.1 in [19] up to Eq. (3.35). However, the derivation there is for the fixed-dimensional setting, and even then only heuristic, with the remainder term simply stated as having order o p (n −1/2 ), in contrast to our (F.6). Theorem 3.1 in [19] was formally established using the result from [36], which in turn is strictly for the fixed-dimensional setting. The core of our proof is showing that the term B 21,kk − B 21,kk is small, which relies heavily on the realization that this term has a hidden canonical U -statistic structure.
We also record two auxiliary lemmas whose proofs we omit. Below, φ denotes the N (0, 1) density.
Proof. We treat the "tail region" where i ∈ [n] satisfies F k (X i,k ) ∈ A 1,n separately from the region where i ∈ [n] satisfies F k (X i,k ) ∈ A 2,n . We write We only analyze the term B 12,kk in detail; this will culminate in the bound (F.16). The term B 11,kk is bounded similarly (with a possibly different C). This follows by an argument similar to that for B 12,kk , but simply using the bound max i∈[n] |Φ −1 {F * n,k (X i,k )}| ln 1/2 (n), which follows from the facts that min{F * n,k (X i,k ), 1 − F * n,k (X i,k ) : i ∈ [n]} ≥ 1/(n + 1), and the right-hand inequality in (F.10). We first control the expectation of B 12,kk . We let F kk be the bivariate distribution function of (X k , X k ), and let F n,kk be its empirical counterpart based on {(X i,k , X i,k ) : i ∈ [n]}. We can write and so n 1/2 1{u ∈ (0, a n ]}{ln(1/2u) + ln 1/2 (1/2u)}du n −1/2 ln(p) ln(n). (F.14) The second step follows by integrating over z upon conditioning on y, using the fact that given X 1,k = y, the variable Φ −1 {F k (X 1,k )} is Gaussian with mean r kk Φ −1 {F k (y)} and variance 1 − r 2 kk , and calling on Lemma B.3. In the third step, we have invoked the right-hand inequality in (F.10).
Next, we study the concentration of B 12,kk around E(B 12,kk ). We will apply Bernstein's inequality, slightly simplified here as Lemma B.4. To this end, we bound the rth absolute moment, where r ≥ 2, of the integrand in (F.13). We obtain where we have again invoked Lemma B.3. We first concentrate on the first term in the curly bracket above. Using the right-hand inequality of (F.10), we have 2 2r−1 1{u ∈ (0, a n ]} ln r (1/2u)du and the right-hand side can be successively rewritten as follows: (−1) r 2 2r−1 1{u ∈ (0, a n ]} ln r (2u)du Furthermore, the right-hand most term in the above expression reduces to Therefore, In the last step, we simply used the formula for the sum of a geometric series. For the second term in the curly bracket in (F.15), Hölder's inequality implies 1{u ∈ (0, a n ]} ln r (1/2u)du 1/2 1{u ∈ (0, a n ]}du Therefore, overall the left-hand side of (F.15) can be bounded above by the multiple of a constant which is uniform over all integers r ≥ 2, and the quantity uniformly over r and n. We can then apply Bernstein's inequality with c ln{1/(2a n )} and σ 2 i a n ln 2 (1/2a n ), and conclude that Combining the above with the expectation bound (F.14) earlier, and setting u = C ln(p) for C large enough, we conclude that Together with an identical bound for B 11,kk (as argued earlier), we conclude that there exists an event E 5,kk ,n with probability at least 1 − 1/p 4 on which |B 1,kk | ≤ Cn −1/2 ln(p) ln(n). (F.17) Part 2: Treatment of the term B 2,kk . By Taylor expansion to second order and Lemma F.3, the term B 2,kk can be written as In the above, for each i, the quantity F * n,k,i is a random number strictly between F * n,k (X i,k ) and F k (X i,k ). We can write B 21,kk equivalently as We intend to approximate B 21,kk by We now show that B 21,kk − B 21,kk is indeed small through a U -statistic argument. Introduce the functions f n,kk , g n,kk : R 2 × R 2 → R specifically as where we have introduced the quantities F n,k (y)dF n,kk (y, z).
The quantity Δ 21,kk ,1 is clearly (the "off-diagonal" part of) a U -statistic, with kernel g n,kk . Moreover, E{g n,kk (X k , X k ; y 2 , z 2 )} = 0, E{g n,kk (y 1 , z 1 ; X k , X k )} = 0. (F.20) Indeed, the first equality follows by integration with respect to the measure F kk whereas the second stems from the fact that E{1(X k ≤ y 1 ) − F k (y 1 )} = 0. Thus g n,kk is canonical (i.e., completely degenerate) for the measure F kk . If the kernel g n,kk were "nice," this would suggest that Δ 21,kk ,1 could be on the order of n −1/2 . However, this kernel is unbounded, and it diverges ever more quickly as n increases. Thus, establishing a deviation inequality for Δ 21,kk ,1 is quite tedious. The formal treatment of Δ 21,kk ,1 , as well as the "leftover" terms Δ 21,kk ,2 (which is the "diagonal", i.e., the i = j part of a U -statistic) and Δ 21,kk ,3 , will be considered next.
We wish to apply Theorem 3.2 in [14] to bound, for every integer m ≥ 3, the mth moment of the quantity above. Via Markov's inequality, this will lead to a tail probability bound for the quantity above and, via the decoupling theorem, to a tail probability bound for the un-decoupled Δ 21,kk ,1 .
Let · L 2 →L 2 be defined as in (3.9) in [14]. Also, for a (generic) random variable X, let E X denote the expectation taken with respect to the random variable X only. Then, Theorem 3.2 in [14] states that there exists a universal constant K < ∞ such that (for the canonical kernel h n,kk of two variables, and independent random variables X i , X * i for all i ∈ [n]), for every integer m ≥ 2, Note that while the statement of the theorem requires a bounded kernel, inspection of the proof reveals that boundedness is not necessary. If in doubt, we can always truncate the unbounded kernel h n,kk , and then let the truncation level go to infinity on both sides of (F.25).
In what follows, sometimes we will write h n,kk simply as h. To apply (F.25), we need to bound four quantities, namely We carry out these tasks in sequence. For (i), first by Jensen's inequality, and the latter can be rewritten as follows: Therefore, upon invoking Lemma B.3 and then (F.10), we can deduce that For (ii), first by definition (see (3.9) in [14]), we can write h L 2 →L 2 as so that, by Jensen's inequality, Now, by (F.26), the right-hand side of this inequality can be bounded above by For (iii), we first compute the inner expectation with respect to X * j . To this end, the contributions from the squares of the four terms in the curly bracket in (F.24) are computed separately. For the first term involving h n,kk ,1 , E X * j {h 2 n,kk ,1 (y 1 , z 1 ; X * j,k , X * j,k )} = E X * j {f 2 n,kk (y 1 , z 1 ; X * j,k , X * j,k )}.
Then, realizing that 1(X * j,k ≤ y 1 ) is Bernoulli with success probability F k (y 1 ), we can rewrite the right-hand term as For the second term involving h n,kk ,2 , given that where the last step follows from (F.26). For the third term involving h n,kk ,3 , note that (F.28) To be precise, the last equality holds with probability 1 when eventually we set y 1 = X i,k . If F k (y 1 ) ≤ a n , (F.28) can be rewritten and bounded from above as follows: If a n < F k (y 1 ) ≤ 1/2, dividing the range of integration based on the position of F k (X * j,k ) relative to F k (y 1 ) and 1/2, and making multiple uses of (F.10), we can rewrite (F.28) and bound it as follows: As the case where F k (y 1 ) > 1/2 is analogous, we conclude that For the fourth term involving h n,kk ,4 , note that h n,kk ,4 (y 1 , z 1 ; y 2 , z 2 ) = −E{h n,kk ,3 (y 1 , z 1 ; X j,k , X j,k )}.
Therefore, by Jensen's inequality, which then admits the same bound as (F.29). Thus, for the term (iii), we first have where the right-hand side is defined by For m ≥ 3, the expectation ofh m/2 (X i,k , X i,k ) can be bounded above as follows: dF kk (y, z) Finally, for the term (iv), first compute, for any integer m ≥ 3,  Next, we rewrite so that The first step follows by integrating over z upon conditioning on y and the fact that given X k = y, the variable Φ −1 {F k (X k )} is Gaussian with mean r kk Φ −1 {F k (y)}. The third step and (F.37) follow by integration by parts, and upon setting D 1 = 1{F k (y) ∈ A 1,n } √ n Φ −1 {F k (y)} 2 d(F n,k − F k )(y) and D 2 = √ n Φ −1 (1 − a n ) 2 (F n,k − F k ){F ← k (1 − a n )} − √ n Φ −1 (a n ) 2 (F n,k − F k ){F ← k (a n )}.
The term D 1 has expectation zero, and its concentration around expectation can be treated in a similar way as B 12,kk earlier to arrive at a bound similar to (F.16). Specifically, there exists an event E 7,k,n with probability at least 1−1/p 3 on which |D 1 | ≤ Cn −1/2 ln(p) ln(n). (F.38) Lemma F.2 also easily ensures that on E 4,k,n , |D 2 | ≤ Cn −1/2 ln(p) ln(n). (F.39) We finally treat the term B 22,kk from (F.18). By the mean value theorem and (F.9) on the event E 4,k,n , we have, on the same event, and for F k (y) ∈ A 1,n , Here F * n,k,i is a random number strictly between F * n,k,i and F k (X i,k ). Therefore, |B 22,kk | can be bounded above by on the event E 4,k,n , and hence, by (F.8), Together these bounds yield that, on the intersection of some events E n,k and E n,kk with probabilities at least 1 − C/p 3 and 1 − C/p 4 respectively, the magnitude of the sum of these remainder terms is bounded by Cn −1/2 {ln(p) ln 2 (n) + ln 3/2 (p)}. Taking the intersection of E n,k and E n,kk over k, k ∈ [p] then yields the conclusion of Proposition F.5.
By an analogous argument, the term Ξ 3,n,kk in (F.11) satisfies where the remainder term R 2,n,kk satisfies the same bound as R 1,n,kk in Proposition F.5. Finally, the term Ξ 4,n,kk in (F.11), which is a second-order term, is treated by the following proposition. The conclusion of Theorem F.1 then follows.
For the first term on the right-hand side above, as in (F.12), we again decompose the sum over i ∈ [n] into the cases F k (X i,k ) ∈ A 1,n and F k (X i,k ) ∈ A 2,n . We then apply the mean value theorem to the summands in the second case. We arrive at In the above, for each i, F * n,k,i is a random number strictly between F * n,k (X i,k ) and F k (X i,k ). For the term B 5,k , similar to (F.17), there exists an event E 9,k,n with probability at least 1 − 1/(2p 3 ) on which B 5,k ≤ Cn −1/2 ln(p) ln(n).
For the term B 6,k , we can follow the derivation of the bound (F.43) for the term B 22,kk in (F.18) (in fact, the present case is easier because of the lack of the function Φ −1 in the numerator). We obtain that there exists an event E 10,k,n with probability at least 1 − 1/(2p 3 ) on which |B 6,k | ≤ Cn −1/2 ln(p) ln(n).
Then, it is easy to see that the bound (F.44) holds on the intersection ∩ k∈[p] (E 9,k,n ∩ E 10,k,n ) with probability at least 1 − 1/p 2 . This concludes the proof of Proposition F.7.

F.3. Discussion
To the best of our knowledge, this paper is the first to provide a justification for the use of a matrix Σ n of normal-score rank correlations, or van der Waerden correlations, to estimate the copula correlation matrix in high-dimensional Gaussian copula models. The analysis of this estimator faces the double hurdle of an unbounded score function (namely Φ −1 ) and the non-independence of the Gaussianized observations in (F.3). Nonetheless, in view of the clear superiority of the estimator Σ n in the fixed-dimensional setting, it is perhaps a bit surprising that it has not received more attention in high dimensions. Theorem F.1 shows that the estimator Σ n retains its advantages in the latter regime. Moreover, this result has other immediate and broad practical applications, which we briefly describe below.
It should first be observed that the literature often requires deviation inequalities on the element-wise maximum norm of the difference between the target Σ and an estimate thereof, previously most often based on Kendall's tau or Spearman's rho. When the estimator Σ n is used for this purpose, the remainder term in (F.5) converges so fast that it can practically be ignored. Next, the leading term in (F.5), which is linear in the efficient influence function, involves only iid, centered, sub-exponential random variables, so establishing such deviation inequalities for this term is trivial. Thus Theorem F.1 implies that in all such circumstances, the van der Waerden correlation matrix Σ n can be used instead. In fact, as mentioned earlier, efficiency gains can be expected from the use of Σ n due to its established linearity in the efficient influence function. Relevant examples include, but are not limited to, [11, 24, 42] and, in the present context but restricted to the Gaussian copula regression model, Propositions 3.1, 4.2 and Theorem 5.2, for which the estimator Σ in (2.2) of the copula correlation matrix Σ in (2.1) can simply be replaced by Σ n .
Second, it sometimes happens that matrix deviation inequalities stronger than those in terms of the element-wise maximum norm are called for. For instance, in this paper, Proposition 3.3 and, later in the case of row-sparsity, Theorems 5.4 and 5.7 require a deviation inequality for either the operator norm or the · ∞,2 norm. For the leading term in (F.5), this should again be straightforward. In contrast, simply aggregating the remainder term R n,kk element-wise may yield suboptimal (though useful) results. Properly treating the remainder term in these cases may require more refined arguments, which we do not pursue further herein.
For completeness, we should mention that a Winsorized version Σ n of the van der Waerden correlation matrix Σ n was considered by Liu et al. [25], who established a convergence rate of n −1/4 , up to logarithmic factors. Such a rate is clearly too slow to be competitive. These authors later claimed in [24] that the "efficiency result (from [19]) cannot be generalized to the high-dimensional setting." To establish a satisfactory convergence rate, estimation of the copula correlation matrix in Gaussian copulas then quickly and overwhelmingly switched to the Kendall's tau or the Spearman's rho estimator, whose explicit U -statistic structure with bounded score function makes these estimators more amenable to analysis; see, e.g., [1, 24, 42]. Despite the claim in [25] that the lack of Winsorization "does not lead to accurate inference," Theorem F.1 is in fact obtained without truncation. Nevertheless, some truncation, perhaps not as heavy as that in [25], may still improve performance.
In [24], a simulation study comparing Σ n and its Winsorized version Σ n to the estimators of Σ based on Kendall's tau and Spearman's rho was carried out. The authors concluded that without contamination, all these estimators performed similarly, while with contamination, the estimators based on Kendall's tau and Spearman's rho outperformed Σ n and Σ n . We carried out a small simulation study to verify these claims. Specifically, we repeated Models 1 and 2 in Section 4.1 of [42], with n = p = 100. Without contamination, on average Σ n outperforms the Kendall's tau and Spearman's rho estimators by 5% in terms of Frobenius norm. With contamination present, the message is mixed: under light, deterministic contamination, the estimator Σ n can outperform the Kendall's tau and the Spearman's rho estimators by as much as 20%, although in other scenarios it may perform less well. While a 5% gain may seem modest, in fixed dimensions and under a constrained parametrization (by a lower-dimensional copula parameter θ) such as the Toeplitz matrix model, the efficiency gain for θ through the one-step estimator involving Σ n can sometimes be as large as 400%; see, e.g., Example 5.2 in [37]. We defer the high-dimensional analogue of this phenomenon to future studies.
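For readers wishing to reproduce such a comparison on a small scale, the following sketch (toy dimensions of our choosing, not the settings of [42]) computes the Frobenius errors of the van der Waerden and the Kendall's tau estimators of a Toeplitz copula correlation matrix; we make no claim here about which error is smaller on a given draw:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
n, p = 200, 4
idx = np.arange(p)
Sigma = 0.5 ** np.abs(idx[:, None] - idx[None, :])   # Toeplitz target
Z = rng.multivariate_normal(np.zeros(p), Sigma, size=n)

# van der Waerden estimator: Gaussianized ranks plus correction phi_n.
G = stats.norm.ppf(stats.rankdata(Z, axis=0) / (n + 1))
phi_n = np.mean(stats.norm.ppf(np.arange(1, n + 1) / (n + 1)) ** 2)
vdw = (G.T @ G) / (n * phi_n)

# Kendall's tau estimator sin(pi * tau_hat / 2), entry by entry.
tau = np.eye(p)
for i in range(p):
    for j in range(i + 1, p):
        t, _ = stats.kendalltau(Z[:, i], Z[:, j])
        tau[i, j] = tau[j, i] = t
kt = np.sin(np.pi * tau / 2)

err_vdw = np.linalg.norm(vdw - Sigma)   # Frobenius errors
err_kt = np.linalg.norm(kt - Sigma)
```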