Sparse Graph Learning Under Laplacian-Related Constraints

We consider the problem of learning a sparse undirected graph underlying a given set of multivariate data. We focus on graph Laplacian-related constraints on the sparse precision matrix that encodes conditional dependence between the random variables associated with the graph nodes. Under these constraints the off-diagonal elements of the precision matrix are non-positive (total positivity), and the precision matrix may not be full-rank. We investigate modifications to widely used penalized log-likelihood approaches to enforce total positivity but not the Laplacian structure. The graph Laplacian can then be extracted from the off-diagonal precision matrix. An alternating direction method of multipliers (ADMM) algorithm is presented and analyzed for constrained optimization under Laplacian-related constraints and lasso as well as adaptive lasso penalties. Numerical results based on synthetic data show that the proposed constrained adaptive lasso approach significantly outperforms existing Laplacian-based approaches. We also evaluate our approach on real financial data.


I. INTRODUCTION
Graphical models provide a powerful tool for analyzing multivariate data [9], [23]. In a statistical graphical model, the conditional statistical dependency structure among p random variables x_1, x_2, ..., x_p (x = [x_1 x_2 ... x_p]^⊤) is represented using an undirected graph where there is no edge between nodes i and j iff the random variables x_i and x_j associated with these two nodes are conditionally independent. The precision matrix Ω of x encodes this conditional dependence. Such models for x have been extensively studied, with a focus on estimating Ω. Given n samples of x in high-dimensional settings, one estimates Ω under some sparsity constraints; see [1], [7], [12], [14], [22], [25], [26], [28], [29], [34], [35].
More recently, several authors have considered Gaussian graphical models under the constraint that the distribution is multivariate totally positive of order 2 (MTP_2), or equivalently, that all partial correlations are non-negative (see [24], [38] and references therein). Such models are also known as attractive Gaussian random fields [36]. Note that a Gaussian distribution is MTP_2 if and only if its precision matrix Ω is an M-matrix, i.e., Ω_ij ≤ 0 for all i ≠ j [19]. As discussed in [38], MTP_2 is a strong form of positive dependence, which is relevant for modeling in various applications. A large majority of the prior work does not impose total positivity. Graphical models have also been inferred from considerations other than statistical [9]. One class of graphical models is based on signal smoothness [8], [9], [15], [16], where graph learning from data becomes equivalent to estimation of the graph Laplacian matrix [9], [15]. The graph Laplacian is positive semi-definite with non-positive off-diagonal entries; hence, it can be viewed as a rank-deficient precision matrix for an MTP_2 Gaussian random vector. Another set of approaches is based on statistical considerations under the graph Laplacian constraint [9], [10], [20], [30], [31], [40], where the Laplacian L (or a generalized version) plays the role of the precision matrix Ω. Thus, under a Gaussian distribution we have an MTP_2 model. A key contribution of [40] has been to show that under the convex lasso (ℓ_1) penalty, Laplacian-constrained log-likelihood approaches do not yield sparse graphs; non-convex penalties are required; see also [41].
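To make the MTP_2 / M-matrix connection concrete, the following is a small numpy sketch (our own illustration, not code from the cited references) that computes the partial correlations −Ω_ij/√(Ω_ii Ω_jj) from a precision matrix and checks the M-matrix (non-positive off-diagonal) property; the helper names are ours:

```python
import numpy as np

def partial_correlations(Omega):
    """rho_ij = -Omega_ij / sqrt(Omega_ii * Omega_jj) for i != j."""
    d = 1.0 / np.sqrt(np.diag(Omega))
    R = -(d[:, None] * Omega * d[None, :])
    np.fill_diagonal(R, 1.0)
    return R

def is_m_matrix(Omega, tol=1e-12):
    """True iff all off-diagonal entries are non-positive (MTP2 / attractive)."""
    off = Omega - np.diag(np.diag(Omega))
    return bool((off <= tol).all())

Omega = np.array([[ 2.0, -0.8,  0.0],
                  [-0.8,  2.0, -0.5],
                  [ 0.0, -0.5,  1.5]])
R = partial_correlations(Omega)   # off-diagonal entries are all >= 0 here
```

In this example Ω is an M-matrix, so every off-diagonal partial correlation is non-negative, illustrating the equivalence noted above.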
Recent reviews of various graph learning approaches may be found in [33] and [39]. A large variety of graph learning models and approaches exist, motivated by diverse applications in signal processing, machine learning, and other areas. In [39] existing graph learning methods are classified into four broad categories: graph signal processing based methods, matrix factorization based methods, random walk based methods, and deep learning based methods. In terms of these four categories, our approach falls in the category of graph signal processing based methods. On the other hand, [33] categorizes graph learning methods based on two graph construction steps: (1) determine the edge set E (see Sec. II-A), called E-step, and (2) based on E, determine the edge weight matrix W (see Sec. II-A), called W -step, even though in some methods these two steps may be merged into one, or the second step may be executed first yielding W which then determines E. For instance, our approach yields a matrix equivalent to W which then determines E (see Sec. II-A).
A class of graph learning approaches are motivated by specific application tasks such as clustering and semi-supervised classification. Examples of such approaches include [17], [18] and relevant references in [33] and [39]. In such approaches an important consideration is how to incorporate prior information relevant to the intended application, in the graph model. For instance, both local and global structure information is incorporated in the model of [18], together with a rank constraint on the graph Laplacian to reflect the number of clusters. As noted in [33], "... how to select a suitable graph construction/learning strategy in practice ... is a challenging problem without a universal solution, since it depends on many factors ..."

A. Our Contributions
We investigate modifications to widely used penalized log-likelihood approaches to enforce total positivity but not the Laplacian structure. The graph Laplacian can then be extracted from the off-diagonal precision matrix. We use the log-sum penalty [6], resulting in adaptive lasso (initialized with lasso), and our approach does not require prior knowledge of the nature of the graph Laplacian (how many components, generalized or not, etc.). An alternating direction method of multipliers (ADMM) algorithm is presented for constrained optimization under total positivity. Numerical results based on synthetic data show that the proposed constrained adaptive lasso approach significantly outperforms existing Laplacian-based approaches [10], [15], [40].

B. Outline and Notation
The rest of the paper is organized as follows. In Sec. II we formulate the problem for the case where precision matrix Ω is full rank. Past related work for the two cases, Ω is full rank and Ω is rank-deficient, is discussed in Sec. II. An ADMM algorithm is presented in Sec. III to optimize the proposed cost function and a pseudocode for the ADMM algorithm is given in Algorithm 1. In Sec. IV we analyze consistency (Theorem 1) and sparsistency (Theorem 2) of the proposed approach. Numerical results based on synthetic as well as real data are presented in Sec. V to illustrate the proposed approach. Proofs of Theorems 1 and 2 are given in the two appendices.
We use S ⪰ 0 and S ≻ 0 to denote that the symmetric matrix S is positive semi-definite and positive definite, respectively. The set of real numbers is denoted by R. For a set V, |V| or card(V) denotes its cardinality, i.e., the number of elements in V. Given A ∈ R^{p×p}, we use φ_min(A), φ_max(A), |A| and tr(A) to denote the minimum eigenvalue, maximum eigenvalue, determinant and trace of A, respectively, and A^† to denote its pseudo-inverse. For B ∈ R^{p×q}, ‖B‖, ‖B‖_F and ‖B‖_1 denote its operator norm, Frobenius norm and vectorized ℓ_1 norm, respectively. A^+ is a diagonal matrix with the same diagonal as A, and A^− = A − A^+ is A with all its diagonal elements set to zero. The symbol ⊗ denotes the matrix Kronecker product and 1_A denotes the indicator function (which equals 1 if A is true, else 0). For y_n, x_n ∈ R^p, y_n ≍ x_n means that y_n = O(x_n) and x_n = O(y_n), where the latter means there exists 0 < M < ∞ such that ‖x_n‖ ≤ M‖y_n‖ ∀ n ≥ 1. The notation y_n = O_P(x_n) for random vectors y_n, x_n ∈ R^p means that for any ε > 0, there exists 0 < M < ∞ such that sup_n P(‖y_n‖ > M‖x_n‖) < ε.

II. PROBLEM FORMULATION AND RELATED WORK

In this section we formulate the problem for the case where the precision matrix Ω is full rank, but later, in simulations, we apply it to rank-deficient Ω also. Past related work for the two cases, Ω full rank and Ω rank-deficient, is also discussed.

A. Graphical Models and Graph Laplacians
An undirected simple weighted graph is denoted by G = (V, E, W), where V = {1, 2, ..., p} is the set of nodes, E ⊆ {{i, j} : i, j ∈ V, i ≠ j} is the set of undirected edges, and W = W^⊤ ∈ R^{p×p} stores the non-negative weights W_ij ≥ 0 associated with the undirected edges. If W_ij > 0, then edge {i, j} ∈ E; otherwise {i, j} ∉ E. In a simple graph there are no self-loops or multiple edges, so E consists of distinct pairs {i, j}, i ≠ j, and W_ii = 0. In graphical models of data variables x ∈ R^p, a weighted graph G = (V, E, W) (or an unweighted G = (V, E)) with |V| = p is used to capture relationships between the p variables x_i [9], [23]. If {i, j} ∈ E, then x_i and x_j are related in some sense, with higher W_ij indicating stronger similarity or dependence. A statistical graphical model G is a conditional independence graph (CIG) where {i, j} ∉ E iff x_i and x_j are conditionally independent. In particular, Gaussian graphical models (GGMs) are CIGs where x is multivariate Gaussian. Suppose x has positive semi-definite covariance matrix Σ with precision matrix Ω = Σ^†. Then Ω_ij, the (i, j)-th element of Ω, is zero iff x_i and x_j are conditionally independent.
The (combinatorial) graph Laplacian of G = (V, E, W) is defined as L = D − W, where D is the diagonal weighted degree matrix with D_ii = Σ_{j=1}^p W_ij. This makes rank(L) < p, and the off-diagonal elements satisfy L_ij = −W_ij ≤ 0 for i ≠ j. A generalized graph Laplacian is defined as L_g = L + V where V is diagonal [10]. If all diagonal elements of V are strictly positive, then L_g is positive definite. There has been considerable recent interest in GGMs where one takes Ω = L ([10], [20], [40]) or Ω = L_g ≻ 0 ([10], [21], [30], [31], [36]-[38]). Both cases result in Ω_ij ≤ 0 for i ≠ j, and this is the model addressed in this paper. Our objective is to determine E and L = D − W for both cases. We estimate Ω as Ω̂ under the constraint Ω_ij ≤ 0 for i ≠ j, and then set Ŵ = −Ω̂^− and L̂ = D̂ − Ŵ.
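The construction L = D − W and the extraction step Ŵ = −Ω̂^−, L̂ = D̂ − Ŵ described above can be sketched in a few lines (a minimal numpy illustration; the function names are ours):

```python
import numpy as np

def laplacian_from_weights(W):
    """Combinatorial Laplacian L = D - W with D_ii = sum_j W_ij."""
    return np.diag(W.sum(axis=1)) - W

def extract_laplacian(Omega_hat):
    """W_hat = -Omega_hat^- (negated off-diagonal part), L_hat = D_hat - W_hat."""
    W_hat = -Omega_hat.copy()
    np.fill_diagonal(W_hat, 0.0)   # discard the diagonal; weights live off-diagonal
    return W_hat, laplacian_from_weights(W_hat)

# 3-node chain with unit weights: edges {1,2} and {2,3}
W = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
L = laplacian_from_weights(W)     # rows sum to zero; L is PSD with rank p-1
W_hat, L_hat = extract_laplacian(L)
```

Applying `extract_laplacian` to a precision matrix that is itself a Laplacian (as here) recovers W and L exactly, illustrating why the Laplacian can be read off from the off-diagonal precision matrix.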

B. Full-Rank Precision Matrix Under Total Positivity
Suppose we are given n i.i.d. observations {x(t)}_{t=1}^n, x ∈ R^p, where x is zero-mean Gaussian with covariance Σ and precision matrix Ω. In graphical lasso [14], with Σ̂ = (1/n) Σ_{t=1}^n x(t)x^⊤(t), one seeks Ω to yield min_{Ω ≻ 0} f_L(Ω), where

f_L(Ω) = tr(Σ̂Ω) − ln|Ω| + λ‖Ω^−‖_1   (1)

and λ‖Ω^−‖_1 = λ Σ_{i≠j} |Ω_ij| is the lasso penalty with λ > 0. In this paper we investigate approaches for the case where we have the additional constraint

Ω ∈ V^p,   (2)

where V^p is the space of all p × p matrices that are symmetric with non-positive off-diagonal elements. The convex penalty λ‖Ω^−‖_1 in (1) is replaced with the non-convex log-sum penalty (LSP) Σ_{i≠j} P_λ(Ω_ij), motivated by [6] (and [44]), defined as (ε > 0 is small)

P_λ(θ) = λ ln(1 + |θ|/ε).   (3)

That is, we replace (1) with

f_LSP(Ω) = tr(Σ̂Ω) − ln|Ω| + Σ_{i≠j} P_λ(Ω_ij)   (4)

and seek a solution to

min_{Ω ≻ 0, Ω ∈ V^p} f_LSP(Ω).   (5)

As for the SCAD (smoothly clipped absolute deviation) penalty in [22], we solve the non-convex problem (5) iteratively, where in each iteration the problem is convex. Using ∂P_λ(|θ|)/∂|θ| = λ/(|θ| + ε), a local linear approximation to P_λ(|θ|) around θ_0 yields the symmetric linear function

P_λ(|θ|) ≈ P_λ(|θ_0|) + (λ/(|θ_0| + ε))(|θ| − |θ_0|).   (6)

With θ_0 fixed, we need to consider only the term dependent upon θ for optimization w.r.t. θ:

(λ/(|θ_0| + ε)) |θ|.   (7)

Suppose we have a "good" initial solution Ω̃ to the problem (from, e.g., using lasso f_L(Ω) instead of f_LSP(Ω)). Then, given Ω̃, using the local linear approximation to P_λ(Ω_ij) as in (7), after ignoring terms dependent upon Ω̃, we have

Σ_{i≠j} P_λ(Ω_ij) ≈ Σ_{i≠j} λ_ij |Ω_ij|,   (8)

λ_ij = λ/(|Ω̃_ij| + ε).   (9)

Therefore, in the next iteration we seek

min_{Ω ≻ 0, Ω ∈ V^p} { tr(Σ̂Ω) − ln|Ω| + Σ_{i≠j} λ_ij |Ω_ij| }   (10)

with λ_ij as in (9). This is then adaptive lasso [44]; strictly speaking, [44] has ε = 0. If we initialize with Ω̃_ij = 0 (or some other constant) for all i ≠ j, we obtain a lasso cost. Since in each iteration we have a convex optimization problem, we obtain a global minimum to the linearized problem. But since the original problem (5) is non-convex (the LSP is non-convex), overall we are only guaranteed a local minimum of the original problem.
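The LSP of (3) and the per-entry adaptive-lasso weights λ_ij = λ/(|Ω̃_ij| + ε) of (9) are straightforward to compute; a small numpy sketch (helper names are ours, not from the paper):

```python
import numpy as np

lam, eps = 0.1, 1e-5   # penalty level and (small) LSP parameter

def lsp(theta):
    """Log-sum penalty P_lam(theta) = lam * ln(1 + |theta|/eps)."""
    return lam * np.log1p(np.abs(theta) / eps)

def adaptive_weights(Omega_tilde):
    """lam_ij = lam / (|Omega_tilde_ij| + eps): the adaptive-lasso weights
    from the local linear approximation of the LSP; diagonal unpenalized."""
    lam_ij = lam / (np.abs(Omega_tilde) + eps)
    np.fill_diagonal(lam_ij, 0.0)
    return lam_ij

# entries that were large in the initial (lasso) estimate receive small
# weights, so they are penalized less in the next iteration
Omega_tilde = np.array([[ 2.0, -0.5],
                        [-0.5,  1.5]])
lam_ij = adaptive_weights(Omega_tilde)
```

This makes the adaptive-lasso effect explicit: an entry estimated near zero receives a weight close to λ/ε (heavy shrinkage), while a large entry is barely penalized.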
The unconstrained lasso minimizer of f_L(Ω) specified in (1) is consistent [35] (where local consistency implies global consistency), and consistency of the constrained lasso minimizer of f_L(Ω) under the additional constraint Ω ∈ V^p follows as in the proof of Theorem 1 in Sec. IV. Therefore, if we initialize with the constrained lasso minimizer of f_L(Ω), i.e., choose Ω̃ to be the constrained lasso minimizer, we should expect the local minimum of the iterative solution to the original non-convex problem to be close to the global minimum.

C. Related work and comparisons
There are two lines of related work on statistical models: one dealing with MTP_2 models where the precision matrix is full rank [36]-[38], and the other dealing with an explicit Laplacian constraint [9], [10], [20], [40], [41]. Note that neither sparsity nor a large sample size is required under the MTP_2 assumption for existence of the precision matrix estimate [36]. The full-rank precision matrix assumption is central to [36]-[38], whereas empirical evidence suggests that our approach does not require it. [37] deals with certain theoretical guarantees under Stein loss for precision matrix estimation. [38] does not estimate the precision matrix, only the edges. [20], [40], [41] assume single-component Laplacians (only one zero eigenvalue) as precision matrices, whereas in this paper this is not required: our approach can handle multi-component Laplacians. [20] uses a spectral constraint which we do not, while [40], [41] use non-convex penalties (SCAD and the related minimax concave penalty (MCP) in [40], and MCP in [41]) whereas we use the non-convex LSP. In our approach, replacing LSP with SCAD yielded only marginal improvements over the convex lasso penalty in our numerical results. [9], [10] use convex penalties.

III. ADMM SOLUTION
To solve (10) we use the alternating direction method of multipliers (ADMM) [5] after variable splitting. Using variable splitting, consider

min_{Ω ≻ 0, V ∈ V^p} { tr(Σ̂Ω) − ln|Ω| + Σ_{i≠j} λ_ij |V_ij| } subject to Ω = V.   (11)

The scaled augmented Lagrangian for this problem is [5]

L_ρ(Ω, V, U) = tr(Σ̂Ω) − ln|Ω| + Σ_{i≠j} λ_ij |V_ij| + (ρ/2)‖Ω − V + U‖_F²,   (12)

where U is the dual variable and ρ > 0 is the penalty parameter. Given the results Ω^(k), V^(k), U^(k) of the kth iteration, in the (k+1)st iteration an ADMM algorithm executes three updates:

(a) Ω^(k+1) = arg min_{Ω ≻ 0} L_ρ(Ω, V^(k), U^(k));
(b) V^(k+1) = arg min_{V ∈ V^p} L_ρ(Ω^(k+1), V, U^(k));
(c) U^(k+1) = U^(k) + (Ω^(k+1) − V^(k+1)).

The solution to update (a) follows from [5, Sec. 6.5] and is given in step 5 of Algorithm 1, presented later in this section. For any ρ > 0, by construction, Ω^(k+1) ≻ 0. This is so even if we apply the algorithm to a problem with true Ω ⪰ 0. In update (b), with A := Ω^(k+1) + U^(k), we claim that the solution is given by

V_ii^(k+1) = A_ii;   V_ij^(k+1) = S_neg(A_ij, λ_ij/ρ), i ≠ j,   (14)

where, with (a)_+ := max(0, a) and (a)_− := min(0, a), S_neg(a, β) := (a + β)_− denotes scalar soft thresholding for negative values of a and hard thresholding for a > 0. When i ≠ j, we need to minimize

L_v = λ_ij |v| + (ρ/2)(v − A_ij)² + νv,

where ν ≥ 0 is the Lagrange multiplier for the inequality constraint V_ij ≤ 0. With v* denoting an optimal solution, the KKT conditions for minimization are 0 ∈ ∂L_v and νv* = 0, where ∂L_v denotes the subdifferential of L_v at v*, i.e., λ_ij t + ρ(v* − A_ij) + ν = 0 for some t with |t| ≤ 1 and t = sign(v*) if v* ≠ 0. When A_ij > 0, our claimed solution is v* = 0. We need to check if ν ≥ 0 and 0 ∈ ∂L_v for some |t| ≤ 1. The choice t = 0 and ν = ρA_ij > 0 satisfies the KKT conditions. When A_ij ≤ 0, our claimed solution is the well-known soft-thresholding solution, which satisfies the KKT conditions with ν = 0. If |A_ij| ≤ λ_ij/ρ, then v* = 0 and t = ρA_ij/λ_ij satisfies the KKT conditions since |t| ≤ 1. If |A_ij| > λ_ij/ρ, then the given solution with t = A_ij/|A_ij| satisfies the KKT conditions. This proves that the solution (14) minimizes L_v. A pseudocode for the ADMM algorithm used in this paper is given in Algorithm 1, where the outer loop (indexed by k_o in lines 2, 3 and 14 of the code) refers to iterative minimization of f_LSP(Ω) via (9), and the inner loop (indexed by k in lines 6-12 and 14) refers to minimization of a local linear approximation to f_LSP(Ω), as specified in (10).
For (constrained) lasso we take λ_ij = λ_0 for all i ≠ j in Algorithm 1; this is outer loop iteration k_o = 1. In subsequent outer loop iterations, we use λ_ij as specified in line 14 of the code. As implemented in this paper, we run the outer loop for a fixed number of outer iterations, and we obtain excellent results with two outer iterations. One could use a stopping criterion for the outer loop also.
In Algorithm 1, we use the stopping (convergence) criterion following [5]. The inputs to Algorithm 1 are: the data {x(t)}_{t=1}^n, x ∈ R^p; regularization and penalty parameters λ_0 and ρ_0; tolerances τ_abs and τ_rel; variable penalty factor µ; LSP parameter ε; maximum number of outer loop iterations k_o,max; and maximum number of inner loop iterations k_i,max.

[Algorithm 1 (pseudocode): the Ω-update is computed via an eigendecomposition, Ω^(k+1) = QDQ^⊤; the V-update applies the thresholding operator S_neg(a, β) of (14); convergence is then checked and the tolerances set; finally the penalty parameter ρ is updated.]

In this case, at the (k+1)st iteration, the primal residual is given by Ω^(k+1) − V^(k+1) and the dual residual by ρ^(k)(V^(k+1) − V^(k)). The convergence criterion is met when the norms of these residuals are below the primal and dual tolerances τ_pri and τ_dual, respectively; see line 8 of Algorithm 1. In turn, τ_pri and τ_dual are chosen using an absolute and a relative criterion as in line 10 of Algorithm 1, where τ_abs and τ_rel are user-chosen absolute and relative tolerances, respectively. As stated in [5, Sec. 3.4.1], one may use "possibly different penalty parameters ρ (k) for each iteration, with the goal of improving the convergence in practice, as well as making performance less dependent on the initial choice of the penalty parameter." Line 11 of Algorithm 1 follows typical choices given in [5, Sec. 3.4.1].
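A compact sketch of one inner (ADMM) iteration, combining an eigendecomposition-based Ω-update in the style of [5, Sec. 6.5] with the S_neg thresholding of (14), might look as follows. This is our own minimal Python rendering of the updates, not the paper's Algorithm 1: it omits the convergence checks, the ρ adaptation, and the outer (adaptive-lasso) loop.

```python
import numpy as np

def s_neg(a, beta):
    """S_neg(a, beta) = min(0, a + beta): soft thresholding for negative a,
    hard thresholding (to zero) for positive a."""
    return np.minimum(0.0, a + beta)

def admm_step(Sigma_hat, lam_ij, V, U, rho):
    """One ADMM iteration for min tr(Sigma_hat Omega) - ln|Omega|
    + sum_{i!=j} lam_ij |V_ij|  s.t.  Omega = V, off-diag(V) <= 0."""
    # (a) Omega-update: solve rho*Omega - Omega^{-1} = rho*(V - U) - Sigma_hat
    # via eigendecomposition; the constructed Omega is positive definite
    d, Q = np.linalg.eigh(rho * (V - U) - Sigma_hat)
    omega = (d + np.sqrt(d ** 2 + 4.0 * rho)) / (2.0 * rho)
    Omega = (Q * omega) @ Q.T
    # (b) V-update: element-wise thresholding; diagonal copied (unpenalized)
    A = Omega + U
    V = s_neg(A, lam_ij / rho)
    np.fill_diagonal(V, np.diag(A))
    # (c) dual update
    U = U + Omega - V
    return Omega, V, U

# tiny usage example: 200 samples of a 4-variable Gaussian
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 4))
Sigma_hat = X.T @ X / X.shape[0]
lam_ij = 0.1 * (1.0 - np.eye(4))        # lasso: equal weights off the diagonal
V, U = np.eye(4), np.zeros((4, 4))
for _ in range(200):
    Omega, V, U = admm_step(Sigma_hat, lam_ij, V, U, rho=1.0)
```

By construction every Ω iterate is positive definite (the eigenvalue map ω_i = (d_i + √(d_i² + 4ρ))/(2ρ) is strictly positive), and every V iterate has non-positive off-diagonal entries, mirroring the properties claimed for Algorithm 1.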

IV. THEORETICAL ANALYSIS
In this section we analyze consistency (Theorem 1) and sparsistency (Theorem 2) of the proposed approach under the assumption that Ω 0; proofs are in the Appendix. For consistency we follow the method of [35] which deals with the lasso penalty. Dependence of p and λ on sample size n is explicitly denoted as p n and λ n , respectively.
Given real numbers δ_1 ∈ (0, 1), δ_2 > 0 and C_1 > 1, let C_2 = 1 + C_1, and define the constants C_0, M, r_n, N_1 and N_2 used below (see (21)-(24)).

Theorem 1 (Consistency): Suppose the regularization parameter λ_n/ε satisfies the stated rate conditions. Then, if the sample size n > max{N_1, N_2} and assumptions (A1)-(A2) hold true, there exists a local minimizer Ω̂_λ such that

‖Ω̂_λ − Ω_0‖_F ≤ M r_n   (27)

with probability greater than 1 − 1/p_n^{τ−2}. In terms of rate of convergence, ‖Ω̂_λ − Ω_0‖_F = O_P(r_n).

Sparsistency refers to the property that all parameters that are zero are actually estimated as zero with probability tending to one, as n → ∞ [22]. Theorem 2 deals with sparsistency of Ω̂_λ. Its proof follows that of [22, Theorem 2] pertaining to lasso and SCAD penalties.

Theorem 2 (Sparsistency): Suppose Theorem 1 holds true so that (27) holds. In addition, suppose that there exists a sequence η_n → 0 such that ‖Ω̂_λ − Ω_0‖ = O_P(η_n) and √(ln(p_n)/n) + η_n = O(λ_n). Then, with probability tending to one, Ω̂_λ,ij = 0 for all {i, j} ∈ E_0^c.

Remark 1: For both consistency and sparsistency to be satisfied, the chosen regularization parameters λ_n need to be compatible: Theorem 1 imposes upper and lower bounds on the rate of λ_n and Theorem 2 specifies a lower bound, so the two must hold simultaneously (28). The consequences depend upon the η_n required to attain ‖Ω̂_λ − Ω_0‖ = O_P(η_n). As discussed in [22] for lasso, we consider two cases, using the inequalities ‖Ω̂_λ − Ω_0‖ ≤ ‖Ω̂_λ − Ω_0‖_F. Then, for (28) to hold true, we should have 1 + √(p_n + s_n0) ≍ 1 + √(p_n/s_n0), which holds only if s_n0 = O(1).

V. NUMERICAL RESULTS
We now present numerical results for both synthetic and real data to illustrate the proposed approach. In synthetic data examples the ground truth is known and this allows for assessment of the efficacy of various approaches. In real data examples where the ground truth is unknown, our goal is visualization and exploration of the dependency structures underlying the data, similar to [7], [30], [31], [40].

A. Synthetic Data
We consider two Gaussian graphical models: a chain graph where p nodes are connected in succession, and an Erdös-Rènyi graph G_ER^{p,p_er} where any two of the p nodes are connected with probability p_er = 0.03. In each model, the nonzero upper-triangular entries Ω_ij ({i, j} ∈ E, i < j) are drawn uniformly from [−0.3, −0.1]. With Ω = Ω^⊤, we take Ω_ii = −Σ_{j≠i} Ω_ij for every i, yielding the combinatorial Laplacian L = Ω. We then add κI to Ω, with κ picked to make the minimum eigenvalue of Ω + κI equal to 0, 0.001 or 0.1, and, with ΦΦ^⊤ = (Ω + κI)^†, we generate x = Φw with w ∈ R^p Gaussian, w ∼ N(0, I).
While the single-component combinatorial Laplacian matrix for the chain graph always corresponds to a graph in which every node is connected (i.e., the degree of each node is at least one), that for a p-node Erdös-Rènyi graph G_ER^{p,p_er} may not always be so, particularly if the probability p_er of any two nodes being connected is low; in our simulation p_er = 0.03. Therefore, in each run we checked whether every node had degree ≥ 1. If not, we randomly connected an unconnected node to one of the other p − 1 nodes (with Ω_ij uniformly distributed over [−0.3, −0.1]).
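The chain-graph data-generation procedure above (Laplacian precision matrix with uniform edge weights, then samples x = Φw) can be sketched as follows (our own Python rendering; the function names are ours):

```python
import numpy as np

rng = np.random.default_rng(42)

def chain_precision(p, low=-0.3, high=-0.1):
    """Chain-graph combinatorial Laplacian used as precision matrix:
    off-diagonal entries U[-0.3,-0.1] on the chain edges, and a diagonal
    chosen so that every row sums to zero."""
    Omega = np.zeros((p, p))
    for i in range(p - 1):
        Omega[i, i + 1] = Omega[i + 1, i] = rng.uniform(low, high)
    np.fill_diagonal(Omega, -Omega.sum(axis=1))
    return Omega

def sample_data(Omega, n, kappa=0.0):
    """n samples x = Phi w, w ~ N(0, I), with Phi Phi^T = (Omega + kappa I)^+."""
    p = Omega.shape[0]
    d, Q = np.linalg.eigh(np.linalg.pinv(Omega + kappa * np.eye(p)))
    Phi = Q @ np.diag(np.sqrt(np.clip(d, 0.0, None))) @ Q.T  # symmetric sqrt
    return rng.standard_normal((n, p)) @ Phi.T

Omega0 = chain_precision(p=10)
X = sample_data(Omega0, n=400)
```

For κ = 0 the covariance is the pseudo-inverse of a rank-(p−1) Laplacian, so the samples live in the subspace orthogonal to the all-ones vector; κ > 0 makes Ω + κI invertible.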
We apply six methods for estimating the true edge set E_0 and true off-diagonal Ω_0^−: (i) the combinatorial graph Laplacian (CGL) method of [10], using the MATLAB function estimate_cgl.m from [11]; (ii) the generalized graph Laplacian (GGL) method of [10], using the MATLAB function estimate_ggl.m from [11]; (iii) the signal smoothness-based method of [15], which yields the weighted adjacency matrix W, equaling −Ω_0^− (the MATLAB software is available in [32]); (iv) the Laplacian-constrained graphical model (LCGM) method of [40] with MCP penalty, using the R implementation cited in [40]; (v) our proposed constrained adaptive lasso (CAL) method with LSP (λ_ij = λ/(|Ω̃_ij| + ε), ε = 10^−5); and (vi) our proposed constrained lasso (CL) method (λ_ij = λ), the solution of which is taken to be the initial Ω̃ for the CAL approach. The performance measures are the F_1-score ∈ [0, 1] for efficacy in edge detection (higher is better), and the normalized Frobenius error norm in estimating Ω_0^− (off-diagonal true Ω_0) (lower is better), defined as ‖cΩ̂^− − Ω_0^−‖_F / ‖Ω_0^−‖_F, where c = 1 for all approaches except [15], which yields W only up to a scale factor; therefore, c is picked to minimize the error norm in that case. The F_1-score is defined as F_1 = 2PR/(P + R), where precision P = |Ê ∩ E_0|/|Ê|, recall R = |Ê ∩ E_0|/|E_0|, and E_0 and Ê denote the true and estimated edge sets, respectively. In Fig. 1, for a sample size n = 400, we show the performance as a function of the penalty parameter λ (λ/ε for adaptive lasso, and β for [15]) for κ ∈ {0, 0.001, 0.1}: higher λ (lower β for [15]) should lead to sparser graphs. CGL and LCGM are designed specifically for κ = 0 whereas the other approaches do not crucially depend on it; we did not implement CGL and LCGM for κ = 0.1. As noted in [40], [41], Laplacian-constrained log-likelihood approaches do not yield sparse graphs under convex penalties; we see this in the performance of CGL: it was not implemented for κ = 0.1, and for κ = 0.001 its Frobenius error norm is off the graph (> 17).
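The two performance measures can be computed as below (a minimal Python sketch; the edge-detection threshold `tol` is our own assumption, since in practice estimated entries are only numerically zero):

```python
import numpy as np

def edge_set(Omega, tol=1e-8):
    """Edges {i,j}, i<j, where the off-diagonal entry is (numerically) nonzero."""
    p = Omega.shape[0]
    return {(i, j) for i in range(p) for j in range(i + 1, p)
            if abs(Omega[i, j]) > tol}

def f1_score(E_true, E_hat):
    """F1 = 2PR/(P+R) with P = |E_hat & E_true|/|E_hat|, R = .../|E_true|."""
    tp = len(E_true & E_hat)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(E_hat), tp / len(E_true)
    return 2 * precision * recall / (precision + recall)

def frobenius_error(Omega_hat, Omega0):
    """Normalized off-diagonal Frobenius error ||Omega_hat^- - Omega0^-||_F
    divided by ||Omega0^-||_F (the c = 1 case)."""
    off = lambda A: A - np.diag(np.diag(A))
    return np.linalg.norm(off(Omega_hat) - off(Omega0)) / np.linalg.norm(off(Omega0))
```

For the scale-invariant method of [15], one would additionally optimize the scalar c before taking the norm; that step is omitted here.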
Our proposed CAL has the best F_1 performance for all values of κ: for some choice of λ, the F_1 score exceeds 0.95, whereas the other approaches do not perform nearly as well. [15] performs better than the other approaches except the proposed CAL in terms of F_1 score, but has poor Frobenius error performance (as was also noted in [10]). In Fig. 2 we show performance for sample sizes n = 50, 100, 200, 400, 2000 for [15], [40] and the proposed constrained adaptive lasso (CAL), where the penalty parameters were optimized for best F_1 scores. LCGM with MCP penalty [40] (not implemented for κ = 0.1) has a numerical conditioning problem for n = 100, resulting in a poor F_1 score and excessive Frobenius error (such a phenomenon has also been noted in [41]). We see that the proposed CAL performs the best independently of the κ value, while LCGM deteriorates sharply for κ > 0.
In Fig. 3 we present a comparison between the results of optimization of f_LSP(Ω) with and without the constraint (2), labeled "Const. Adap. Lasso" and "Unconst. Adap. Lasso," respectively. The results are for the Erdös-Rènyi graph G_ER^{100,0.03}, p = 100, n ∈ {50, 100, 200, 400}, based on 100 runs. The largest performance improvement due to imposition of the constraint (2) is in the estimation of Ω (equivalently W), which is not surprising since Ω_0 ∈ V^p.
In Table I we present results for a fixed sample size n = 400 with varying graph size p ∈ {100, 200, 400, 1000, 2000} to illustrate performance with scaling of the problem size. The results are for the Erdös-Rènyi graph G_ER^{p,3/p}, where two nodes are connected with probability p_er = 3/p (as in [15]), and for data generation we used κ = 0. We show the F_1-score and average time per run for four approaches: the proposed CAL, GGL [10], LCGM [40] and the signal smoothness-based method [15], where the penalty parameters were optimized for best F_1 scores. All algorithms were run under the Windows 10 Home operating system with an Intel(R) Core(TM) i5-6400T CPU @ 2.20 GHz and 12 GB RAM, and all MATLAB implementations were run on MATLAB R2020b. The shown results are based on 10 runs only, as the time per run increases significantly for larger values of p. It is seen that while the proposed CAL method is the most demanding computationally, its F_1-score performance is the best by a wide margin.
In Table II we show results for the chain graph, corresponding to Fig. 2, for sample sizes n = 50, 100, 200, 400 and κ ∈ {0, 0.001}. The discussion pertaining to Fig. 2 applies here as well.
In Fig. 4a, for n = 400 and p = 100, we show the performance of our proposed CAL approach as a function of λ/ε, when applied to a two-component Laplacian precision matrix (two zero eigenvalues), κ = 0, where each component is an independent Erdös-Rènyi G_ER^{50,0.03} graph. We see that our approach works well (whereas LCGM [40] is designed only for single-component Laplacians: one zero eigenvalue). In Figs. 4b and 4c we show the true and estimated weighted adjacency matrices for a single run using the λ value from Fig. 4a that maximizes the F_1 score.
1) Model Selection: In practice, one would select λ via cross-validation or an information criterion. For selection of λ, we use the Bayesian information criterion (BIC) based on the optimized negative log-likelihood −ln f_X(X) ∝ (n/2)[tr(Σ̂Ω̂) − ln|Ω̂|]. The tuning parameter λ is selected over a grid of values to minimize the BIC. We search over λ values in the range [λ_ℓ, λ_u], selected via the following heuristic. We first find the smallest λ, labeled λ_sm, for which we get a no-edge model (i.e., |Ê| = 0). Then we set λ_u = λ_sm and λ_ℓ = λ_u/100, and for synthetic data we search over 10 logarithmically spaced values in [λ_ℓ, λ_u]. The given choice of λ_u precludes "extremely" sparse models while that of λ_ℓ precludes "very" dense models. The results based on 20 Monte Carlo runs are shown in Table III for the chain and Erdös-Rènyi graphs with p = 100 and κ = 0, for two sample sizes: n = 200 and n = 2000. The BIC-selected λ was used in each run to estimate Ω and compute the F_1-score and Frobenius error norm. The proposed approach seems to work well. No such model selection procedures are available in [10], [15], [40], [41].
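The BIC grid-search heuristic can be sketched as follows. Note that the complexity term ln(n)·|Ê| in the BIC below is our assumption for illustration (a standard choice counting one parameter per estimated edge); the text above only specifies the data-fit (negative log-likelihood) term, and Ω̂ is assumed positive definite:

```python
import numpy as np

def bic(Sigma_hat, Omega_hat, n):
    """BIC surrogate: n * (tr(Sigma_hat Omega_hat) - ln|Omega_hat|)
    + ln(n) * (number of estimated edges).  Complexity term is our assumed
    parameter count; Omega_hat must be positive definite."""
    edges = int((np.abs(Omega_hat[np.triu_indices_from(Omega_hat, k=1)]) > 1e-8).sum())
    _, logdet = np.linalg.slogdet(Omega_hat)
    return n * (np.trace(Sigma_hat @ Omega_hat) - logdet) + np.log(n) * edges

def lambda_grid(lam_sm, num=10):
    """num log-spaced values in [lam_u/100, lam_u], with lam_u = lam_sm,
    the smallest lambda yielding a no-edge model."""
    return np.logspace(np.log10(lam_sm / 100.0), np.log10(lam_sm), num)

grid = lambda_grid(1.0)                     # 0.01, ..., 1.0
score = bic(np.eye(3), np.eye(3), n=100)    # tr = 3, logdet = 0, no edges
```

One would then fit the constrained estimator at each grid value and keep the λ minimizing the BIC.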

B. Real data: Financial Time Series
We consider daily share prices (at close of the day) of 97 stocks in the S&P 100 index from Jan. 1, 2013 through Jan. 1, 2018, yielding 1259 samples. This data was gathered from the Yahoo Finance website. If y_m(t) is the share price of the mth stock on day t, we consider (as is conventional in such studies) x_m(t) = ln(y_m(t)/y_m(t−1)) as the time series to analyze, yielding n = 1258 and p = 97. These 97 stocks are classified into 11 sectors (according to the Global Industry Classification Standard), and we order the nodes to group them by sector (information technology, etc.).

[Table I caption: based on 10 runs; sample size n = 400, variable graph size p. Entry *** indicates that the algorithm failed to converge in 1000 sec.]
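Computing the log-return series x_m(t) = ln(y_m(t)/y_m(t−1)) from a matrix of closing prices is a one-liner (minimal sketch; the toy prices below are purely illustrative, not the actual S&P 100 data):

```python
import numpy as np

def log_returns(prices):
    """x_m(t) = ln(y_m(t) / y_m(t-1)); row t = day, column m = stock."""
    prices = np.asarray(prices, dtype=float)
    return np.log(prices[1:] / prices[:-1])

# toy example: 3 daily closes for 2 stocks give 2 return samples per stock,
# matching the paper's 1259 prices -> n = 1258 returns
prices = [[100.0, 50.0],
          [101.0, 49.0],
          [102.0, 49.5]]
X = log_returns(prices)   # shape (2, 2)
```

The resulting rows of X are the samples x(t) fed to the graph-learning procedure.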

VI. CONCLUSIONS
The problem of learning a sparse undirected graph under graph Laplacian-related constraints on the sparse precision matrix was considered. Under these constraints the off-diagonal elements of the precision matrix are non-positive and the precision matrix may not be full rank. We investigated modifications to widely used penalized log-likelihood approaches to enforce total positivity but not the Laplacian structure. The graph Laplacian can then be extracted from the off-diagonal precision matrix. An ADMM algorithm was presented for constrained optimization under Laplacian-related constraints and the LSP penalty. Numerical results based on synthetic data show that the proposed constrained adaptive lasso approach significantly outperforms existing Laplacian-based approaches. We also evaluated our approach on real financial data. Our approach is applicable independently of prior knowledge of the nature of the graph Laplacian (how many components, generalized or not), as illustrated by our synthetic data results based on one- and two-component Laplacian precision matrices (with one and two zero eigenvalues, respectively). However, our theoretical results hold only under the assumption that Ω ≻ 0.

APPENDIX

Lemma 1 follows from [34, Lemma 1]. Lemma 1: Under Assumption (A2), the sample covariance Σ̂ satisfies the tail bound max_{i,j} |Σ̂_ij − Σ_0,ij| ≤ C_0 √(ln(p_n)/n) with probability greater than 1 − 1/p_n^{τ−2} for τ > 2, if the sample size n > N_1, where C_0 is defined in (21) and N_1 is defined in (24).
where M and r_n are as in (22) and (23), respectively. Since G(∆̂) ≤ G(0) = 0, if we can show that inf_∆ {G(∆) : ∆ ∈ Θ_n(M)} > 0, then the minimizer ∆̂ must be inside Θ_n(M), and hence ‖∆̂‖_F ≤ M r_n. It is shown in [35, (9)] that, with v denoting a scalar,

ln|Ω_0 + ∆| − ln|Ω_0| = tr(Σ_0∆) − vec(∆)^⊤ [∫_0^1 (1 − v)(Ω_0 + v∆)^{−1} ⊗ (Ω_0 + v∆)^{−1} dv] vec(∆).

Noting that Ω_0^{−1} = Σ_0 and using P_λ(θ) = λ ln(1 + |θ|/ε), we can rewrite G(∆) as in (32). Following [35, p. 502], we lower-bound the quadratic term, where we have used the fact that Ω_0 = Σ_0^{−1}. We now consider A_2 in (32). We have A_2 = L_21 + L_22, splitting the off-diagonal terms over E_0 and E_0^c, where E_0^c denotes the complement of the set E_0. For an index set B and a matrix C ∈ R^{p_n×p_n}, we write C_B to denote the matrix in R^{p_n×p_n} such that [C_B]_ij = C_ij if (i, j) ∈ B, and [C_B]_ij = 0 if (i, j) ∉ B. Using this notation, we bound L_21 using the Cauchy-Schwarz inequality and Lemma 1, with probability > 1 − 1/p_n^{τ−2}. We consider L_22 later as a part of A_3, where we have used the fact that, for {i, j} ∈ E_0^c, Ω_0,ij = 0, hence P_λn(Ω_0,ij) = 0.