COSMO: A Conic Operator Splitting Method for Convex Conic Problems

Garstka, Michael; Cannon, Mark; Goulart, Paul

doi:10.1007/s10957-021-01896-x

COSMO: A Conic Operator Splitting Method for Convex Conic Problems

Open access
Published: 29 August 2021

Volume 190, pages 779–810, (2021)
Cite this article

Download PDF

You have full access to this open access article

Journal of Optimization Theory and Applications Aims and scope Submit manuscript

COSMO: A Conic Operator Splitting Method for Convex Conic Problems

Download PDF

2735 Accesses
21 Citations
6 Altmetric
Explore all metrics

Abstract

This paper describes the conic operator splitting method (COSMO) solver, an operator splitting algorithm and associated software package for convex optimisation problems with quadratic objective function and conic constraints. At each step, the algorithm alternates between solving a quasi-definite linear system with a constant coefficient matrix and a projection onto convex sets. The low per-iteration computational cost makes the method particularly efficient for large problems, e.g. semidefinite programs that arise in portfolio optimisation, graph theory, and robust control. Moreover, the solver uses chordal decomposition techniques and a new clique merging algorithm to effectively exploit sparsity in large, structured semidefinite programs. Numerical comparisons with other state-of-the-art solvers for a variety of benchmark problems show the effectiveness of our approach. Our Julia implementation is open source, designed to be extended and customised by the user, and is integrated into the Julia optimisation ecosystem.

Douglas–Rachford splitting method for semidefinite programming

Article 13 August 2015

Conic Optimization via Operator Splitting and Homogeneous Self-Dual Embedding

Article 22 February 2016

Peaceman–Rachford splitting for a class of nonconvex optimization problems

Article 13 May 2017

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

1 Introduction

We consider convex optimisation problems in the form

$$\begin{aligned} \begin{array}{lll} \text{ minimise } &{} f(x) &{}\\ \text{ subject } \text{ to } &{} g_i(x) \preceq _{{\mathcal {K}}_i} 0, &{} i=1,\ldots ,l\\ &{} h_i(x) = 0, &{} i=1,\ldots ,k, \end{array} \end{aligned}$$

(1)

with proper convex cones ${\mathcal {K}}_i \subseteq {\mathbb {R}}^{m_i}$ and where we assume that both the objective function $f :{\mathbb {R}}^n \rightarrow {\mathbb {R}}$ and the inequality constraint functions $g_i :{\mathbb {R}}^{m_i} \rightarrow {\mathbb {R}}$ are convex, and that the equality constraints $h_i(x) := a_i^\top x -b_i$ are affine. Convex optimisation problems feature heavily in a wide range of research areas and industries, including problems in machine learning [16], finance [14], optimal control [12], and operations research [11]. Concrete examples of problems fitting the general form (1) include linear programming (LP), convex quadratic programming (QP), second-order cone programming (SOCP), and semidefinite programming (SDP) problems. Methods to solve each of these standard problem classes are well known and a number of open- and closed-source solvers are widely available. However, the trend for data and training sets of increasing size in decision making problems and machine learning poses a challenge for state-of-the-art software.

The most common approach taken by modern convex solvers is the interior-point method (IPM) [63], which stems from Karmarkar’s original projective algorithm [35], and is able to solve both LPs and QPs in polynomial time. IPMs have since been extended to problems with positive semidefinite (PSD) constraints in [3, 30]. The primal-dual IPMs apply variants of Newton’s method to iteratively find a solution to a set of optimality KKT conditions. At each iteration, the algorithm alternates between a Newton step that involves factoring a Jacobian matrix and a line search to determine the magnitude of the step to ensure a feasible iterate. Most notably, the Mehrotra predictor–corrector method in [39] forms the basis of several implementations because of its strong practical performance [62]. However, IPMs typically do not scale well for very large problems since the Jacobian matrix must be re-calculated and re-factored at each step.

There are two main approaches to overcome this limitation, both of which are areas of active research. The first of these is a renewed focus on first-order methods with computationally cheaper per-iteration-cost. First-order methods are known to handle larger problems well at the expense of reduced accuracy compared to IPMs. In the 1960s, Everett [21] proposed a dual decomposition method that allows one to decompose a separable objective function, making each iteration cheaper to compute. Augmented Lagrangian methods by Miele [40,41,42], Hestenes [31], and Powell [49] are more robust and helped to remove the strict convexity conditions on problems, while losing the decomposition property. By splitting the objective function, the alternating direction method of multipliers (ADMM), first described in [25], allowed the advantages of dual decomposition to be combined with the superior convergence and robustness of augmented Lagrangian methods. Subsequently, it was shown that ADMM can be analysed from the perspective of monotone operators as a special case of Douglas–Rachford splitting [20] as well as of the proximal point algorithm in [52], which allowed further insight into the method. ADMM methods are simple to implement and computationally cheap, even for large problems. However, they tend to converge slowly to a high accuracy solution and the detection of infeasibility is more involved. Thus, they are typically used in applications where a modestly accurate solution is sufficient [48].

The second area of active research has been the exploitation of sparsity in the problem data. A method of exploiting the sparsity pattern of PSD constraints in an IPM was developed in [24]. This work showed that if the coefficient matrices of an SDP exhibit an aggregate sparsity pattern represented by a chordal graph, then both primal and dual problems can be decomposed into a problem with many smaller PSD constraints on the nonzero blocks of the original matrix variable. These blocks are associated with subsets, called cliques, of graph vertices. Moreover, it can be advantageous to merge some of these blocks, for example, if they overlap significantly. The optimal way to merge the blocks, or equivalently the corresponding graph cliques, after the initial decomposition depends on the solver algorithm and is still an open question. Sparsity in SDPs has also been studied for problems with underlying graph structure, e.g. optimal power-flow problems in [44] and graph optimisation problems in [2].

Related Work Widely used solvers for conic problems, especially SDPs, include SeDuMi [57], SDPT3 [59] (both open source, MATLAB), and MOSEK [45] (commercial, C) among others. All of these solvers implement primal-dual IPMs. Both Fukuda [24] and Sun [58] developed interior-point solvers that exploit chordal sparsity patterns in PSD constraints. Some heuristic methods to merge cliques have been proposed for IPMs in [44] and been implemented in the SparseCoLO package [22] and the CHOMPACK package [4]. Several solvers based on the ADMM method have been released recently. The solver OSQP [56] (C) detects infeasibility based on the differences of the iterates [7], but only solves LPs and QPs. The C-based SCS [47] implements an operator splitting method that solves the primal-dual pair of conic programs in order to provide infeasibility certificates. The underlying homogeneous self-dual embedding method has been extended by [66] to exploit sparsity and implemented in the MATLAB solver CDCS. The conic solvers SCS and CDCS are unable to handle quadratic cost functions directly. Instead, they are forced to reformulate problem with quadratic objective functions by adding a second-order cone constraint, which increases the problem size and removes any favourable sparsity. Moreover, they rely on self-dual embedding formulations to detect infeasibility.

Outline In Sect. 2, we define the general conic problem format of interest and describe the ADMM algorithm used by COSMO. Section 3 explains how to apply chordal decomposition to large sparse SDPs. In Sect. 4, we describe a new clique merging strategy for SDPs and compare it to existing approaches. Implementation details and code-related design choices are discussed in Sect. 5. Section 6 shows benchmark results of COSMO vs. other state-of-the art solvers on a number of test problems. Section 7 concludes the paper.

Contributions In this paper we make the following contributions:

1.
We describe a first-order method for large conic problems that directly supports quadratic objective functions, thus reducing overheads for applications with both quadratic objective function and PSD constraints. This also avoids a major disadvantage of conic solvers compared to native QP solvers, i.e. no additional matrix factorisation for the conversion is needed and favourable sparsity in the objective can be maintained. These benefits are demonstrated using benchmark sets of QPs and SDPs with norm constraints.
2.
For large SDPs with complex sparsity patterns, we show that substantial performance improvements can be achieved relative to standard chordal decomposition methods by recombining some of the sub-blocks (i.e. the cliques) of the initial decomposition in an optimal way. We show that our new clique graph-based merging strategy reduces the projection time of the algorithm by up to 60% in benchmark tests. The combination of this decomposition strategy with a multithreaded projection step allows us to solve very large sparse and structured problems orders of magnitudes faster than competing solvers.
3.
We describe our open-source solver COSMO written in the fast and flexible programming language Julia. The design of this package is highly modular, allowing users to extend the solver by specifying a custom linear system solvers and by defining their own convex cones and custom projection methods. We show how a custom projection method can speed up the algorithm by a factor of 5 for large QPs.

Notation The following notation and definitions will be used throughout this paper. Denote the space of real numbers $\mathbb {R}$, the n-dimensional real space $\mathbb {R}^n$, the n-dimensional zero cone $\{0\}^n$, the nonnegative orthant ${\mathbb {R}}_+^n$, the space of symmetric matrices $\mathbb {S}^n$, and the set of positive semidefinite matrices $\mathbb {S}^n_+$. Denote the vectorisation of a matrix X by stacking it columns as $x =\text {vec}(X)$ and the inverse operation as $\text {vec}^{-1}(X)=\text {mat}(x)$. We denote the space of vectorised symmetric positive semidefinite matrices as ${\mathcal {S}}_+^n$. For a convex cone $\mathcal {K}$, denote the polar cone by $\mathcal {K}^\circ =\left\{ y \in {\mathbb {R}}^n \mid \underset{x \in \mathcal {K}}{\sup } \langle x,y \rangle \le 0 \right\} $ and, following [51], the recession cone of $\mathcal {K}$ by $\mathcal {K}^{\infty } =\left\{ y \in {\mathbb {R}}^n \mid x + ay \in \mathcal {K},\, \forall x \in \mathcal {K}, \forall a \ge 0 \right\} $. We denote the indicator function of a nonempty, closed convex set $\mathcal {C} \subseteq \mathbb {R}^n$ by $I_{\mathcal {C}}$ and the projection of $x \in \mathbb {R}^n$ onto $\mathcal {C}$ by ${\varPi }_{\mathcal {C}}(x)=\underset{y \in \mathcal {C}}{{{\,\mathrm{argmin}\,}}} \left\Vert x-y\right\Vert _2^2$.

2 Problem Description and Core Algorithm

We address convex optimisation problems with a quadratic objective function and a number of conic constraints in the standard form:

$$\begin{aligned} \begin{array}{ll} \text{ minimise } &{} \textstyle {\frac{1}{2}}x^\top Px + q^\top x\\ \text{ subject } \text{ to } &{} Ax + s = b, \, s \in {\mathcal {K}}, \end{array} \end{aligned}$$

(2)

where $x \in {\mathbb {R}}^n$ is the primal decision variable and $s\in {\mathbb {R}}^m$ is the primal slack variable. The objective function is defined by positive semidefinite matrix $P \in {\mathbb {S}}^n_+$ and vector $q \in {\mathbb {R}}^n$. The constraints are defined by matrix $A \in {\mathbb {R}}^{m\times n}$, vector $b \in {\mathbb {R}}^m$ and a non-empty, closed, convex cone ${\mathcal {K}}$ which itself can be a Cartesian product of cones in the form

$$\begin{aligned} {\mathcal {K}} = {\mathcal {K}}_1^{m_1} \times {\mathcal {K}}_2^{m_2} \times \cdots \times {\mathcal {K}}_N^{m_N}, \end{aligned}$$

(3)

with cone dimensions $\sum _{i=1}^N m_i = m$. Note that any LP, QP, SOCP, or SDP can be written in the form (2) using an appropriate choice of cones. The conditions for optimality (assuming linear independence constraint qualification) follow from the Karush–Kuhn–Tucker (KKT) conditions:

$$\begin{aligned} Ax +s = b, \quad Px + q - A^\top y = 0, \quad s \in {\mathcal {K}}, \quad y \in ({\mathcal {K}}^\infty )^\circ . \end{aligned}$$

(4)

Assuming strong duality, if there exists a $x^* \in {\mathbb {R}}^n$, $s^* \in {\mathbb {R}}^m$, and $y^* \in {\mathbb {R}}^m$ that fulfil (4) then the pair $(x^*,s^*)$ is called the primal solution and $y^*$ is called the dual solution of problem (2).

ADMM Algorithm We employ the splitting used in [56] and extend it to accommodate constraints defined using proper convex cones and transform problem (2) into ADMM-compatible format. The problem is rewritten by introducing the dummy variables ${\tilde{x}}= x $ and ${\tilde{s}}= s$:

$$\begin{aligned} \text {minimise} ~&\textstyle {\frac{1}{2}}{\tilde{x}}^\top P {\tilde{x}}+ q^\top {\tilde{x}}+ I_{Ax+s=b}({\tilde{x}},{\tilde{s}}) + I_{{\mathcal {K}}}(s)\nonumber \\ \text {subject to}&({\tilde{x}},{\tilde{s}}) = (x,s), \end{aligned}$$

(5)

where the indicator functions of the sets $\{ (x,s) \in {\mathbb {R}}^n \times {\mathbb {R}}^m \mid Ax + s = b\}$ and ${\mathcal {K}}$ were used to move the constraints of (2) into the objective. We then apply standard form ADMM [13] to (5) and refer the reader to [56] for a derivation of the algorithm steps shown in Algorithm 1.

The evaluation of line 2 of the algorithm amounts to solving an equality constrained QP in the variables ${\tilde{x}}$ and ${\tilde{s}}$. The solution is obtained from the reduced linear system representing the corresponding KKT optimality conditions

$$\begin{aligned} \begin{bmatrix} P + \sigma I &{} A^\top \\ A &{}- \frac{1}{\rho }I \end{bmatrix}\begin{bmatrix}{\tilde{x}}^{k+1} \\ \nu ^{k+1}\end{bmatrix}= \begin{bmatrix}-q+\sigma x^k \\ b-s^k+\frac{1}{\rho }y^k\end{bmatrix}, \quad {\tilde{s}}^{k+1} = s^k - \frac{1}{\rho }\left( \nu ^{k+1} + y^{k}\right) , \end{aligned}$$

(6)

with dual variable $\nu \in {\mathbb {R}}^m$.

The particular choice of splitting in (5) ensures that the coefficient matrix in (6) is always quasi-definite, and hence always has a well-defined $\hbox {LDL}^\top $ factorisation. Assuming a problem dimension $N =m+n$, the factorisation must be computed only once at the start of the algorithm at a cost of $\frac{1}{3}N^3$ flops. Afterwards, the evaluation of line 2 involves forward- and back-substitution steps involving at most $2N^2$ flops. However, in practice the coefficient matrix is very sparse, so a permutation allows us to achieve much sparser factors and therefore computationally cheaper substitution steps. Lines 3 and 5 involve only vector scaling and additions at a complexity of $\mathcal {O}(n)$ and $\mathcal {O}(m)$, respectively. The projection onto ${\mathcal {K}}$ in line 4 is crucial to the performance of the algorithm depending on the particular cones employed in the model: projections onto the zero-cone or the nonnegative orthant are inexpensive at $\mathcal {O}(m)$, while a projection onto the positive-semidefinite cone of dimension $n_\mathcal {K}$ involves an eigenvalue decomposition. Since direct methods for eigendecompositions have a complexity of ${\mathcal {O}}(n_\mathcal {K}^3)$, this turns line 4 into the most computationally expensive operation of the algorithm for large SDPs; Sects. 3 and 4 describe how to improve the efficiency of this step.

Termination Criteria To measure progress of the algorithm, we define the primal and dual residuals of the problem as:

$$\begin{aligned} r_p =Ax + s -b, \quad r_d =Px + q - A^\top y. \end{aligned}$$

(7a)

According to Section 3.3 of [13], a valid termination criterion is that the size of the norms of the residual iterates in (7) are small. Our algorithm terminates if the residual norms are below the sum of an absolute and a relative tolerance term:

$$\begin{aligned} \left\Vert r_p^k\right\Vert _\infty&\le \epsilon _{\mathrm {abs}} + \epsilon _{\mathrm {rel}} \, \text {max} \left\{ \left\Vert Ax^k\right\Vert _\infty ,\left\Vert s^k\right\Vert _\infty , \left\Vert b\right\Vert _\infty \right\} , \end{aligned}$$

(8a)

$$\begin{aligned} \left\Vert r_d^k\right\Vert _\infty&\le \epsilon _{\mathrm {abs}} + \epsilon _{\mathrm {rel}} \, \text {max} \left\{ \left\Vert Px^k\right\Vert _\infty ,\left\Vert q\right\Vert _\infty , \left\Vert A^\top y^k\right\Vert _\infty \right\} , \end{aligned}$$

(8b)

where $\epsilon _{\mathrm {abs}}$ and $\epsilon _{\mathrm {rel}}$ are user defined tolerances. Moreover, we use the infeasibility conditions based on one-step differences $\delta x^k = x^k - x^{k-1} $, $\delta y^k = y^k - y^{k-1}$ in [7] to detect infeasibility and provide certificates.

3 Chordal Decomposition

As noted in Sect. 2, for large SDPs the principal performance bottleneck for Algorithm 1 is the eigendecomposition required in the projection step of line 4. However, since large-scale SDPs often exhibit a certain structure or sparsity pattern, a sensible strategy is to exploit any such structure to alleviate this bottleneck. It is well known that, if the aggregated sparsity pattern is chordal, Agler’s [1] and Grone’s [28] theorems can be used to decompose a large PSD constraint into a collection of smaller PSD constraints and additional coupling constraints. The projection step applied to the set of smaller PSD constraints is usually significantly faster than when applied to the original constraint. Since the projections are independent, they are also easily parallelised.

In this section, we describe the standard mathematical machinery behind chordal decomposition, with a view to improving it substantially in Sect. 4 when applied to large-scale SDPs.

3.1 Graph Preliminaries

In the following, we define graph-related concepts that are useful to describe the sparsity structure of a problem. A good overview of this topic is given in the survey paper [61]. Consider the undirected graph G(V, E) with vertex set $V=\{1,\ldots ,n\}$ and edge set $E \subseteq V \times V$. If all vertices are pairwise adjacent, i.e. $E = \{ \{v,u\} \mid v,u \in V, \, v \ne u \}$, the graph is called complete. We follow the convention of [61] by defining a clique as a subset of vertices ${\mathcal {C}}_{} \subseteq V$ that induces a maximal complete subgraph of G. The number of vertices in a clique is given by the cardinality $|{\mathcal {C}}_{}|$. A cycle is a path of edges (i.e. a sequence of distinct edges) joining a sequence of vertices in which only the first and last vertices are repeated. The following decomposition theory relies on a subset of graphs that exhibit the important property of chordality. A graph G is chordal (or triangulated) if every cycle of length greater than three has a chord, which is an edge between nonconsecutive vertices of the cycle. A non-chordal graph can always be made chordal by adding extra edges. An undirected graph with n vertices can be used to represent the sparsity pattern of a symmetric matrix $S \in {\mathbb {S}}^{n}$. Every nonzero entry $S_{ij} \ne 0$ in the lower triangular part of the matrix introduces an edge $(i,j)\in E$. An example of a sparsity pattern and the associated graph is shown in Fig. 1a, b.

For a given sparsity pattern G(V, E), we define the following symmetric sparse matrix cones:

$$\begin{aligned} {\mathbb {S}}^{n}\left( E, 0\right)&=\left\{ S \in {\mathbb {S}}^{n}\mid S_{ij} = S_{ji} = 0 \; \text {if } i \ne j \text { and } (i,j) \notin E\right\} , \end{aligned}$$

(9)

$$\begin{aligned} {\mathbb {S}}_{+}^{n}\left( E, 0\right)&=\left\{ S \in {\mathbb {S}}^{n}(E, 0) \mid S \succeq 0\right\} . \end{aligned}$$

(10)

Given the definition in (9) and a matrix $S \in {\mathbb {S}}^{n}\left( E, 0\right) $, note that the diagonal entries $S_{ii}$ and the off-diagonal entries $S_{ij}$ with $(i,j)\in E$ may be zero or nonzero. Moreover, we define the cone of positive semidefinite completable matrices as

$$\begin{aligned} {\mathbb {S}}_{+}^{n}(E, ?) =\left\{ Y \mid \exists {\hat{Y}} \in {\mathbb {S}}_{+}^{n}, Y_{ij} = {\hat{Y}}_{ij}, \text {if}\, i=j \,\text {or}\, (i,j) \in E \right\} . \end{aligned}$$

(11)

For a matrix $Y \in {\mathbb {S}}_{+}^{n}(E, ?)$, we can find a positive semidefinite completion by choosing appropriate values for all entries $(i, j) \notin E$. An algorithm to find this completion is described in [61]. An important structure that conveys a lot of information about the nonzero blocks of a matrix, or equivalently the cliques of a chordal graph, is the clique tree (or junction tree). For a chordal graph G, let $\mathcal {B} = \{\mathcal {C}_1, \ldots , \mathcal {C}_p \}$ be the set of cliques. The clique tree $\mathcal {T}(\mathcal {B},\mathcal {E})$ is formed by taking the cliques as vertices and by choosing edges from $\mathcal {E} \subseteq \mathcal {B} \times \mathcal {B}$ such that the tree satisfies the running-intersection property, i.e. for each pair of cliques ${\mathcal {C}}_{i}$, ${\mathcal {C}}_{j} \in \mathcal {B}$, the intersection ${\mathcal {C}}_{i} \cap {\mathcal {C}}_{j}$ is contained in all the cliques on the path in the clique tree connecting ${\mathcal {C}}_{i}$ and ${\mathcal {C}}_{j}$. This property is also referred to as the clique-intersection property in [46] and the induced subtree property in [61]. For a given chordal graph, a clique tree can be computed using the algorithm described in [15]. The clique tree for an example sparsity pattern is shown in Fig. 1c. In a post-ordered clique tree the descendants of a node are given consecutive numbers, and a suitable post-ordering can be found via depth-first search. For a clique ${\mathcal {C}}_{\ell }$, we refer to the first clique encountered on the path to the root as its parent clique ${\mathcal {C}}_{\mathrm {par}}$. Conversely ${\mathcal {C}}_{\ell }$ is called the child of ${\mathcal {C}}_{\mathrm {par}}$. If two cliques have the same parent clique we refer to them as siblings.

We define the function $\mathrm {par} : 2^V \rightarrow 2^V$ and the multivalued function $\mathrm {ch} : 2^V \rightrightarrows 2^V$ such that $\mathrm {par}({\mathcal {C}}_{\ell })$ and $\mathrm {ch}({\mathcal {C}}_{\ell })$ return the parent clique and set of child cliques of ${\mathcal {C}}_{\ell }$, respectively, where $2^{V}$ is the power set (set of all subsets) of V. Note that each clique in Fig. 1c has been partitioned into two sets. The upper row represents the separators $\eta _\ell = {\mathcal {C}}_{\ell } \cap \mathrm {par}({\mathcal {C}}_{\ell })$, i.e. all clique elements that are also contained in the parent clique. We call the sets of the remaining vertices shown in the lower rows the clique residuals or supernodes $\nu _\ell = {\mathcal {C}}_{\ell } {\setminus } \eta _\ell $. Keeping track of which vertices in a clique belong to the supernode and the separator is useful as the information is needed to perform a positive semidefinite completion. Following the authors in [29], we say that two cliques ${\mathcal {C}}_{i}$, ${\mathcal {C}}_{j}$ form a separating pair $\mathcal {P}_{ij} = \{ {\mathcal {C}}_{i},{\mathcal {C}}_{j} \}$ if their intersection is non-empty and if every path in the underlying graph G from a vertex ${\mathcal {C}}_{i} {\setminus } {\mathcal {C}}_{j}$ to a vertex ${\mathcal {C}}_{j} {\setminus } {\mathcal {C}}_{i}$ contains a vertex of the intersection ${\mathcal {C}}_{i} \cap {\mathcal {C}}_{j}$.

3.2 Agler’s and Grone’s Theorems

To explain how the concepts in the previous section can be used to decompose a positive semidefinite constraint, we first consider an SDP in standard primal form and standard dual form:

$$\begin{aligned} \begin{array}{llll} \underset{}{\mathrm{minimise}}&{}\langle C, X \rangle &{}\underset{}{\mathrm{maximise}}&{}b^\top y \\ \mathrm{subject~to}&{}\langle A_k, X \rangle = b_k, \, k =1,\ldots , m&{} \mathrm{subject~to}&{} \sum _{k = 1}^{m}A_k y_k + S = C\\ &{} X \in {\mathbb {S}}_{+}^{n}, &{} &{} S \in {\mathbb {S}}_{+}^{n},\\ \end{array} \end{aligned}$$

(12)

with variable X, coefficient matrices $A_k, C \in {\mathbb {S}}^{n}$, dual variable $y \in {\mathbb {R}}^m$ and slack variable S. Let us assume that the problem matrices in (12) each have their own sparsity pattern $A_k \in {\mathbb {S}}^{n}(E_{A_k}, 0) \text { and } C \in {\mathbb {S}}^{n}(E_{C}, 0)$. The aggregate sparsity of the problem is given by the graph G(V, E) with edge set $E = E_{A_1} \cup E_{A_2} \cup \cdots \cup E_{A_m} \cup E_{C}$. In general G(V, E) will not be chordal, but a chordal extension can be found by adding edges to the graph. We denote the extended graph as $G(V,{\bar{E}})$, where $E \subseteq {\bar{E}} $. Finding the minimum number of additional edges required to make the graph chordal is an NP-complete problem [64]. Consider a 0–1 matrix M with edge set E. A commonly used heuristic method to compute the chordal extension is to perform a symbolic Cholesky factorisation of M [24], with the edge set of the Cholesky factor then providing the chordal extension ${\bar{E}}$. To simplify notation, we assume that E represents a chordal graph or has been appropriately extended. Using the aggregate sparsity of the problem, we can modify the constraints on the matrix variables in (12) to be in the respective sparse positive semidefinite matrix spaces:

$$\begin{aligned} X \in {\mathbb {S}}_{+}^{n}(E, ?) \; \text {and} \; S \in {\mathbb {S}}_{+}^{n}(E, 0). \end{aligned}$$

(13)

We further define the entry-selector matrices $T_\ell \in {\mathbb {R}}^{ |{\mathcal {C}}_{\ell }| \times n }$ for a clique ${\mathcal {C}}_{\ell }$ with $(T_\ell )_{ij} = 1$ if $ {\mathcal {C}}_{\ell }(i) = j$ and 0 otherwise, where ${\mathcal {C}}_{\ell }(i)$ is the ith vertex of ${\mathcal {C}}_{\ell }$.

Assume that the sparsity pattern of the constraint is defined by a chordal graph G(V, E) with a set of maximal cliques $\{ {\mathcal {C}}_{1},\ldots ,{\mathcal {C}}_{p}\}$. Then, we can express the constraints in (13) in terms of multiple smaller coupled constraints using Grone’s and Agler’s theorems:

$$\begin{aligned} \begin{array}{ll} \hbox { Grone's theorem } [39]: &{} X \in {\mathbb {S}}_{+}^{n}(E, ?) \Leftrightarrow X_\ell = T_\ell X T_\ell ^\top \in {\mathbb {S}}_{+}^{|{\mathcal {C}}_{\ell }|}, \hbox { for }\ell =1,\ldots , p.\\ \hbox {Agler's theorem } [38]: &{} S \in {\mathbb {S}}^n_+(E,0) \Leftrightarrow \exists S_\ell \in {\mathbb {S}}_{+}^{|{\mathcal {C}}_{\ell }|}\hbox { for } \ell =1,\ldots ,p\\ &{} \hbox { s.t. }\ S = \sum _{\ell =1}^p T_\ell ^\top S_\ell T_\ell . \end{array} \end{aligned}$$

Applying Grone’s theorem to the primal form SDP in (12) while restricting X to the positive semidefinite completable matrix cone as in (13) yields the decomposed primal SDP:

$$\begin{aligned} \begin{array}{lllll} \underset{}{\mathrm{minimise}}&{}\quad \langle C, X \rangle \quad &{} &{} \underset{}{\mathrm{maximise}} &{} \quad b^\top y \quad \\ \mathrm{subject~to}&{} \langle A_k, X \rangle = b_k, &{} &{} \mathrm{subject~to}&{} \sum _{k = 1}^{m}A_k y_k + \\ &{} X_\ell = T_\ell X T^\top _\ell , &{} &{} &{} \sum _{\ell = 1}^p T^\top _\ell S_\ell T_\ell = C \\ &{} X_\ell \in {\mathbb {S}}_{+}^{|\mathcal {C}_\ell |}, &{} &{} &{} S_\ell \in {\mathbb {S}}_{+}^{|\mathcal {C}_\ell |}, \end{array} \end{aligned}$$

(14)

for $k =1,\ldots , m $ and $ \ell = 1,\ldots ,p$. Similarly, by utilising Agler’s theorem we can transform the dual form SDP in (12) with the restriction on S in (13) to obtain the decomposed dual SDP. Note that the matrices $T_\ell $ serve to extract the submatrix $S_\ell $ such that $S_\ell =T_\ell S T_\ell ^\top $ has rows and columns corresponding to the vertices of the ${\mathcal {C}}_{\ell }$.

4 Clique Merging

Given an initial decomposition with (chordally completed) edge set E and a set of cliques $\{{\mathcal {C}}_{1},\ldots ,{\mathcal {C}}_{p}\}$, we are free to re-merge cliques back into larger blocks. This is equivalent to treating structural zeros in the problem as numerical zeros. Considering the decomposed dual problem in (14), the effects of merging two cliques ${\mathcal {C}}_{i}$ and ${\mathcal {C}}_{j}$ are twofold. First, we replace two positive semidefinite matrix constraints of dimensions $\left| {\mathcal {C}}_{i}\right| $ and $\left| {\mathcal {C}}_{j}\right| $ with one constraint on a larger clique with dimension $\left| {\mathcal {C}}_{i} \cup {\mathcal {C}}_{j} \right| $, where the increase in dimension depends on the size of the overlap. Second, we remove consistency constraints for the overlapping entries between ${\mathcal {C}}_{i}$ and ${\mathcal {C}}_{j}$, thus reducing the size of the linear system of equality constraints. When merging cliques these two factors must be balanced, and the optimal merging strategy depends on the particular SDP solution algorithm employed.

In [46, 58], a clique tree is used to search for favourable merge candidates. We will refer to their two approaches as SparseCoLO and the parent–child strategy, respectively, in the sequel. We will then propose a new merging method in Sect. 4.2 whose performance is superior to both methods when used in ADMM. Given a merging strategy, Algorithm 2 describes how to merge a set of cliques within the set $\mathcal {B}$ and update the edge set $\mathcal {E}$ accordingly.

4.1 Existing Clique Tree-Based Strategies

The parent–child strategy described in [58] traverses the clique tree depth-first and merges a clique ${\mathcal {C}}_{\ell }$ with its parent clique ${\mathcal {C}}_{\mathrm {par}(\ell )} =\mathrm {par}({\mathcal {C}}_{\ell })$ if at least one of the two following conditions are met:

$$\begin{aligned} \left( \left| \mathcal {C}_{\mathrm {par}(\ell )}\right| - \left| \eta _\ell \right| \right) \left( \left| {\mathcal {C}}_{\ell }\right| - \left| \eta _\ell \right| \right) \le t_\mathrm {fill}, \quad \max \left\{ \left| \nu _\ell \right| , \left| \nu _{\mathrm {par}(\ell )}\right| \right\} \le t_\mathrm {size}, \end{aligned}$$

(15)

with heuristic parameters $t_{\mathrm {fill}}$ and $t_{\mathrm {size}}$. These conditions keep the amount of extra fill-in and the supernode cardinalities below the specified thresholds. The SparseCoLO strategy described in [23, 46] considers parent–child as well as sibling relationships. Given a parameter $\sigma > 0$, two cliques ${\mathcal {C}}_{i}, {\mathcal {C}}_{j}$ are merged if the following merge criterion holds

$$\begin{aligned} \min \left\{ \frac{\left| {\mathcal {C}}_{i} \cap {\mathcal {C}}_{j}\right| }{\left| {\mathcal {C}}_{i}\right| }, \frac{\left| {\mathcal {C}}_{i} \cap {\mathcal {C}}_{j}\right| }{\left| {\mathcal {C}}_{j}\right| } \right\} \ge \sigma . \end{aligned}$$

(16)

This approach traverses the clique tree depth-first, performing the following steps for each ${\mathcal {C}}_{\ell }$:

1.
For each clique pair $\left\{ ({\mathcal {C}}_{i}, {\mathcal {C}}_{j}) \mid {\mathcal {C}}_{i}, {\mathcal {C}}_{j} \in \mathrm {ch}\left( {\mathcal {C}}_{\ell } \right) \right\} $, check if (16) holds, then:
- ${\mathcal {C}}_{i}$ and ${\mathcal {C}}_{j}$ are merged, or
- if $({\mathcal {C}}_{i} \cup {\mathcal {C}}_{j}) \supseteq {\mathcal {C}}_{\ell }$, then ${\mathcal {C}}_{i}$, ${\mathcal {C}}_{j}$, and ${\mathcal {C}}_{\ell }$ are merged.
2.
For each clique pair $\left\{ \left( {\mathcal {C}}_{i}, {\mathcal {C}}_{\ell }\right) \mid {\mathcal {C}}_{i} \in \mathrm {ch}\left( {\mathcal {C}}_{\ell }\right) \right\} $, merge ${\mathcal {C}}_{i}$ and ${\mathcal {C}}_{\ell }$ if (16) is satisfied.

We note that the open-source implementation of the SparseCoLO algorithm described in [46] follows the algorithm outlined here, but also employs a few additional heuristics. An advantage of these two approaches is that the clique tree can be computed easily and the conditions are inexpensive to evaluate. However, a disadvantage is that choosing parameters that work well on a variety of problems and solver algorithms is difficult. Secondly, clique trees are not unique and in some cases it is beneficial to merge cliques that are not directly related on the particular clique tree that was computed. To see this, consider a chordal graph G(V, E) consisting of three connected subgraphs $G_a(V_a, E_a), G_b(V_b, E_b)$ and $G_c(V_c, E_c)$ with:

$$\begin{aligned} V_a = \{3, 4, \ldots , m_a\}, \, V_b = \{m_{a}+2, m_{a}+3, \ldots , m_b\}, \, V_c = \{m_{b}+1, m_{b}+2, \ldots , m_c\} \end{aligned}$$

and some additional vertices $\{1,2,m_a + 1\}$. The graph is connected as shown in Fig. 2a, where the complete subgraphs are represented as nodes $V_a,V_b,V_c$. A corresponding clique tree is shown in Fig. 2b. By choosing the cardinality $\left| V_c\right| $, the overlap between cliques ${\mathcal {C}}_{1} = \{1, 2\} \cup V_c$ and ${\mathcal {C}}_{3} = \{m_a+1\} \cup V_c $ can be made arbitrarily large while $\left| V_a\right| $, $\left| V_b\right| $ can be chosen so that any other merge is disadvantageous. However, neither the parent–child strategy nor SparseCoLO would consider merging ${\mathcal {C}}_{1}$ and ${\mathcal {C}}_{3}$ since they are in a “nephew–uncle” relationship. In fact, for the particular sparsity graph in Fig. 2a, eight different clique trees can be computed. Only in half of them do the cliques ${\mathcal {C}}_{1}$ and ${\mathcal {C}}_{3}$ appear in a parent–child relationship. Therefore, a merge strategy that only considers parent–child relationships would miss this favourable merge in half the cases.

4.2 A New Clique Graph-Based Strategy

To overcome the limitations of existing strategies, we propose a merging strategy based on the reduced clique graph $\mathcal {G}(\mathcal {B}, \xi )$, which is defined as the union of all possible clique trees of G; see [29] for a detailed discussion. The set of vertices of this graph is given by the maximal cliques of the sparsity pattern. We then create the edge set $\xi $ by introducing edges between pairs of cliques $({\mathcal {C}}_{i}, {\mathcal {C}}_{j})$ if they form a separating pair ${\mathcal {P}}_{ij}$. We remark that $\xi $ is a subset of the edges present in the clique intersection graph which is obtained by introducing edges for every two cliques that intersect. However, the reduced clique graph has the property that it remains a valid reduced clique graph of the altered sparsity pattern after performing a permissible merge between two cliques. This is not always the case for the clique intersection graph. For convenience, we will refer to the reduced clique graph simply as the clique graph in the following sections. Based on the permissibility condition for edge reduction in [29], we define a permissibility condition for merges:

Definition 4.1

(Permissible merge) Given a reduced clique graph $\mathcal {G}(\mathcal {B}, \xi )$, a merge between two cliques $({\mathcal {C}}_{i}, {\mathcal {C}}_{j}) \in \xi $ is permissible if for every common neighbour ${\mathcal {C}}_{k}$ it holds that ${\mathcal {C}}_{i} \cap {\mathcal {C}}_{k} = {\mathcal {C}}_{j} \cap {\mathcal {C}}_{k}$.

We further define an edge weighting function $e\left( {\mathcal {C}}_{i}, {\mathcal {C}}_{j}\right) = w_{ij}$ with $e :2^{V} \times 2^{V} \rightarrow {\mathbb {R}}$ that assigns a weight $w_{ij}$ to each edge $({\mathcal {C}}_{i}, {\mathcal {C}}_{j}) \in \xi $: This function is used to estimate the per-iteration computational savings of merging a pair of cliques depending on the targeted algorithm and hardware. It evaluates to a positive number if a merge reduces the per-iteration time and to a negative number otherwise. For a first-order method, whose per-iteration cost is dominated by an eigenvalue factorisation with complexity $\mathcal {O}\bigl (\left| {\mathcal {C}}_{}\right| ^3\bigr )$, a simple choice would be:

$$\begin{aligned} e({\mathcal {C}}_{i}, {\mathcal {C}}_{j}) = \left| {\mathcal {C}}_{i}\right| ^3 + \left| {\mathcal {C}}_{j}\right| ^3 - \left| {\mathcal {C}}_{i} \cup {\mathcal {C}}_{j}\right| ^3. \end{aligned}$$

(17)

More sophisticated weighting functions can be determined empirically; see Sect. 6.1. After a weight has been computed for each edge $({\mathcal {C}}_{i}, {\mathcal {C}}_{j})$ in the clique graph, we merge cliques as outlined in Algorithm 3. This strategy considers the edges in terms of their weights, starting with the permissible clique pair $({\mathcal {C}}_{i}, {\mathcal {C}}_{j})$ with the highest weight $w_{ij}$. If the weight is positive, the two cliques are merged and the edge weights for all edges connected to the merged clique ${\mathcal {C}}_{m} = {\mathcal {C}}_{i} \cup {\mathcal {C}}_{j}$ are updated. This process continues until no edges with positive weights remain.

The clique graph for the clique tree in Fig. 1c is shown in Fig. 3a with the edge weighting function in (17). Following Algorithm 3, the edge with the largest weight is considered first and the corresponding cliques are merged, i.e. $\{3, 6, 7, 8\}$ and $\{6, 7, 8, 9\}$. The merge is permissible because both cliques intersect with the only common neighbour $\{4,5,8\}$ in the same way. The revised clique graph $\mathcal {G}(\hat{\mathcal {B}}, {\hat{\xi }})$ is shown in Fig. 3b. Since no positively weighted edges remain, the algorithm stops. After Algorithm 3 has terminated, it is possible to recompute a valid clique tree from the revised clique graph. This can be done in two steps. First, the edge weights in $\mathcal {G}(\hat{\mathcal {B}}, {\hat{\xi }})$ are replaced with the cardinality of their intersection ${\tilde{w}}_{ij} = \left| {\mathcal {C}}_{i} \cap {\mathcal {C}}_{j}\right| $ for all $({\mathcal {C}}_{i},{\mathcal {C}}_{j}) \in {\hat{\xi }}$. Second, a clique tree is then given by any maximum weight spanning tree of the newly weighted clique graph, e.g. using Kruskal’s algorithm described in [36].

Our merging strategy has several clear advantages over competing approaches. Since the clique graph covers a wider range of merge candidates, it will consider edges that do not appear in clique tree-based approaches such as the “nephew–uncle” example in Fig. 2. Moreover, the edge weighting function allows one to make a merge decision based on the particular solver algorithm and hardware used. One downside is that this approach is more computationally expensive than the other methods. However, our numerical experiments show that the time spent on finding the clique graph, merging the cliques, and recomputing the clique tree is only a very small fraction of the total computational savings relative to other merging methods when solving SDPs.

5 Open-Source Implementation

We have implemented our algorithm in the conic operator splitting method (COSMO), an open-source package written in Julia [9]. The source code and documentation are available at https://github.com/oxfordcontrol/COSMO.jl.

Problems can be passed to COSMO via a direct interface, the Julia modelling packages JuMP [19] and Convex.jl [60], and a Python interface. By default, COSMO supports the zero cone, the nonnegative orthant, the hyperbox, the second-order cone, the PSD cone, the exponential cone and its dual, and the power cone and its dual. We further allow the user to define their own convex cones^{Footnote 1} and custom projection functions. To implement a custom cone $\mathcal {K}_c$ the user must provide a projection function that projects an input vector onto the cone. An example that shows the advantages of defining a custom cone is provided in Sect. 6.3. Custom projection functions in COSMO have been used by the authors in [53] to reduce solve time by a factor of 20. To solve the linear system in (6) the user can choose a direct solver such as QDLDL [56], SuiteSparse [17], Pardiso [34, 55], or an indirect solver such as conjugate gradient, minimum residual method.

Multithreaded Projection Given a problem with multiple conic constraints, i.e. $\mathcal {K} = \mathcal {K}_1 \times \cdots \mathcal {K}_p$, the projection step in Algorithm 1

$$\begin{aligned} s^{k+1} = \varPi _{{\mathcal {K}}}\left( \alpha {\tilde{s}}^{k+1}+(1-\alpha )s^k+\frac{1}{\rho }y^k\right) \end{aligned}$$

(18)

decomposes into p independent projections onto the individual cones and can be parallelised. This is particularly advantageous when used in combination with chordal decomposition, which typically yields a large number of smaller PSD constraints.

For the eigendecomposition involved in the projection step of a PSD constraint, the LAPACK [6] function sveyr is used, which can also utilise multiple threads. Consequently, this leads to two-level parallelism in the computation, i.e. on the higher level the projection functions are carried out in parallel and each projection function independently calls sveyr. Determining the optimal allocation of the CPU cores to each of these tasks depends on the number of PSD constraints and their dimensions and is a difficult problem. The impact of using different numbers of Julia threads and different BLAS settings on the accumulated projection time is shown for two example problems in Table 1.

Table 1 Impact of multithreading and BLAS settings on projection times

Full size table

We compare the cases that each call to sveyr is restricted to one single thread or that it gets automatically chosen by BLAS. We generate a best case SDP problem with a sparsity pattern, that can be decomposed into an equivalent problem with 120 $10\times 10$ PSD constraints. When each projection uses one thread, we see that doubling the number of threads halves the projection time. When BLAS chooses the number of threads for each projection dynamically we do not see an improvement for more than four threads. Next, we generate an SDP that decomposes into a problem with one dominant $400\times 400$ block and 80 $10\times 10$ blocks. For the BLAS single-threaded configuration, this means that the projection of the large matrix block dominates and the number of threads do not make a significant difference.

Consequently, for the problem sets considered in Sect. 6 we achieved the best performance by running sveyr single-threaded and using all physical cores to carry out the projection functions in parallel.

6 Numerical Results

This section presents QP and SDP benchmark results of COSMO against the interior-point solver MOSEK v9.0 and the accelerated first-order ADMM solver SCS v2.1.1. When applied to a QP, COSMO’s main algorithm becomes very similar to the first-order QP solver OSQP. To test the performance penalty of using a pure Julia implementation against a similar C implementation, we therefore also compare our solver against OSQP v0.6.0 on QP problems. We selected a number of problem sets to test different aspects of COSMO.

In Sect. 6.1, we first compare the impact of our clique graph-based merging algorithm and other clique tree-based strategies on the projection time of our solver. The clique graph-based merging strategy with COSMO is then compared to MOSEK and SCS on large structured SDPs from the SDPLib benchmark set [10], non-chordal SDPs generated with sparsity patterns from the SuiteSparse Matrix Collections [18], and SDPs with block-arrow shaped sparsity pattern.

In Sect. 6.2, we demonstrate the advantage of directly supporting a quadratic objective function in a conic solver. This is shown first by solving the nearest correlation matrix problem, a problem with both a quadratic objective and a positive semidefinite constraint. Secondly, we show that our conic solver performs well on a classic QP benchmark set, such as the Maros and Mészáros QP repository [38].

Finally, to highlight the advantages of custom constraints, we consider a problem set with doubly stochastic matrices in Sect. 6.3.

All the experiments were carried out on a computing node of the University of Oxford ARC-HTC cluster with 16 logical Intel Xeon E5-2560 cores and 64 GB of DDR3 RAM. All the problems were run using Julia v1.3 and the problems were passed to the solvers via MathOptInterface [37]. To evaluate the accuracy of the returned solution, we compute three errors adapted from the DIMACS error measures for SDPs [33]:

$$\begin{aligned} \epsilon _1 = \frac{\left\Vert A_ax - b_a\right\Vert _2}{1 + \left\Vert b_a\right\Vert _2}, \quad \epsilon _2 = \frac{\left\Vert Px + q - A_a^\top y_a\right\Vert _2}{1 + \left\Vert q\right\Vert _2}, \quad \epsilon _3 = \frac{ \left| x^\top P x + q^\top x - b_a^\top y_a\right| }{ 1 + \left| q^\top x\right| + \left| b_a^\top y_a\right| },\nonumber \\ \end{aligned}$$

(19)

where $A_a$, $b_a$ and $y_a$ correspond to the rows of A, b and y that represent active constraints. This is to ensure meaningful values even if the problem contains inactive constraints with very large, or possibly infinite, values $b_i$. The maximum of the three errors for each problem and solver is reported in the results below. We configured COSMO, MOSEK, SCS and OSQP to achieve an accuracy of $\epsilon = 10^{-3}$. We set the maximum allowable solve time for the Maros and Mészáros problems to ${5}\,{\mathrm{min}}$ and to $30\,{\mathrm{min}}$ for the other problem sets. Standard values were used for all other solver parameters. COSMO uses a Julia implementation of the QDLDL solver to factor the quasi-definite linear system. Similarly, we configured SCS to use a direct solver for the linear system. The datasets generated and analysed in this section are available from the corresponding author upon request.

6.1 Clique Merging for Decomposable SDPs

In this section, we first compare our proposed clique graph-based merge approach with the clique tree-based strategies of [46, 58], all of which were discussed in Sect. 4. We then implement the clique graph-based approach in our solver and compare it against other solvers on three different SDP problem sets.

Comparison of Clique Merging Strategies All three clique merging strategies discussed in Sect. 4 were used to preprocess large sparse SDPs from SDPLib, a collection of SDP benchmark problems [10]. This problem set contains maximum cut problems, SDP relaxations of quadratic programs, and Lovász theta problems. Moreover, we consider a set of test SDPs generated from (non-chordal) sparsity patterns of matrices from the SuiteSparse Matrix Collections [18]. Both problem sets have been used in previous work to benchmark structured SDPs [5, 66].

In this section, we consider the effect of different clique decomposition and merging techniques on the per-iteration computation times of COSMO when applied to these benchmark sets.

For the strategy described in [46], we used the SparseCoLO package to decompose the problem. The parent–child method discussed in [58] and the clique graph-based method described in Sect. 4.2 are available as options in COSMO. We further investigate the effect of using different edge weighting functions. The major operation affecting the per-iteration time is the projection step. This step involves an eigenvalue decomposition of the matrices corresponding to the cliques. Since the eigenvalue decomposition of a symmetric matrix of dimension N has a complexity of $\mathcal {O}\left( N^3 \right) $, we define a nominal edge weighting function as in (17). However, the exact relationship will be different because the projection function involves copying of data and is affected by hardware properties such as cache size. We therefore also consider an empirically estimated edge weighting function. To determine the relationship between matrix size and projection time, the execution time of the relevant function inside COSMO was measured for different matrix sizes. We then approximated the relationship between projection time, $t_{\mathrm {proj}}$, and matrix size, N, as a polynomial $t_{\mathrm {proj}}(N) = aN^3 + bN^2$, where a, b were estimated using least squares. The weighting function is then defined as

$$\begin{aligned} e({\mathcal {C}}_{i}, {\mathcal {C}}_{j}) = t_{\mathrm {proj}}\left( \left| {\mathcal {C}}_{i}\right| \right) + t_{\mathrm {proj}} \left( \left| {\mathcal {C}}_{j}\right| \right) - t_{\mathrm {proj}} \left( \left| {\mathcal {C}}_{i} \cup {\mathcal {C}}_{j}\right| \right) . \end{aligned}$$

(20)

Six different cases were considered: no decomposition (NoDe), no clique merging (NoMer), decomposition using SparseCoLO (SpCo), parent–child merging (ParCh), and the clique graph-based method with nominal edge weighting (CG1) and estimated edge weighting (CG2). The SparseCoLO method was used with default parameters. All cases were run single-threaded. Since the per-iteration projection times for some problems lie in the millisecond range every problem was benchmarked ten times and the median values are reported. Table 7 shows the solve time, the mean projection time, the number of iterations, the number of cliques after merging, and the maximum clique size of the sparsity pattern for each problem and strategy. The solve time includes the time spent on decomposition and clique merging. We do not report the total solver time when SparseCoLO was used for the decomposition because this has to be done in a separate preprocessing step in MATLAB which was orders of magnitude slower than the other methods.

Our clique graph-based methods lead to a consistent and substantial reduction in required projection time, and consequently to a reduction in overall solver time. The method with estimated edge weighting function CG2 achieves the lowest average projection times for the majority of problems. In four cases, ParCh has a narrow advantage. The geometric mean of the ratios of projection time of CG2 compared to the best non-graph method is $0.701$, with a minimum ratio of $0.407$ for problem mcp500-2. There does not seem to be a clear pattern that relates the projection time to the number of cliques or the maximum clique size of the decomposition. This is expected as the optimal merging strategy depends on the properties of the initial decomposition such as the overlap between the cliques. The merging strategies ParCh, CG1 and CG2 generally result in similar maximum clique sizes compared to SparseCoLO, with CG1 being the most conservative in the number of merges.

Table 2 Benchmark results for non-chordal sparse SDPs from SDPLib and SDPs generated with sparsity patterns from the SuiteSparse Matrix Collection

Full size table

Non-chordal Decomposable SDPs After verifying the effectiveness of the clique graph-based merging strategy on the projection step, we implemented the merging strategy CG2 in COSMO and compared it against MOSEK and SCS. Table 2 shows the benchmark results for the three solvers. The decomposition helps COSMO to solve most problems faster than MOSEK and SCS. This is even more significant for the larger problems that were generated from the SuiteSparse Matrix Collection. The decomposition does not seem to provide a major benefit for the slightly denser problems mcp500-3 and mcp500-4. Furthermore, COSMO seems to converge slowly for qpG51 and thetaG51. Similar observations for mcp500-3, mcp500-4, and thetaG51 have been made by the authors in [5]. Finally, many of the larger problems were not solvable within the time limit or caused out-of-memory problems if no decomposition was used in MOSEK and SCS.

Block-Arrow Sparse SDPs To demonstrate the scaling behaviour of our solver on decomposable SDPs with increasing dimension, we consider randomly generated SDPs in standard dual form (12) with a block-arrow aggregate sparsity pattern similar to test problems in [5, 66]. Figure 4 shows the sparsity pattern of the PSD constraint. The sparsity pattern is generated based on the following parameters: block size d, number of blocks $N_b$ and width of the arrow head w. Note that the graph corresponding to the sparsity pattern is always chordal and that, for this sparsity pattern, clique merging yields no benefit. In the following, we study the effects of independently increasing the block size d and the number of blocks $N_b$. The solve times for all solvers are shown in Table 3. The first part shows problems with increasing block size d and fixed number of blocks $N_b=50$, arrowhead width $w=10$ and number of constraints $m = 100$. In the second part, the number of blocks $N_b$ is varied for fixed $d = 10$, $w=10$, $m = 100$. The results show that both COSMO with and without chordal decomposition solve this problem type consistently faster than MOSEK and SCS. COSMO performs better than the other first-order solver (SCS) because it requires a the significantly smaller number of iterations. Furthermore, one can see that in both cases the solve time for each solver rises when the number of blocks and the block sizes are increased. The increase is smaller for COSMO(CD) which is more affected by the number of iterations than the problem dimension.

Table 3 Benchmark results for block arrow sparse SDPs with varying number of blocks $N_\mathrm{b}$ and block size d

Full size table

6.2 Advantages of Supporting a Quadratic Objective

COSMO’s problem format (2) allows us to handle QPs and SDPs with quadratic objectives, without the need to transform the quadratic term into an additional second-order cone constraint. This avoids an additional matrix factorisation, preserves sparsity in the objective, and in many cases achieves faster convergence due to the strict convexity of the objective. To demonstrate these advantages, we solve the nearest correlation matrix problem and problems from the Maros and Mészáros QP test set.

Nearest Correlation Matrix Consider the problem of projecting a matrix C onto the set of correlation matrices, i.e. real symmetric positive semidefinite matrices with diagonal elements equal to 1. This problem is relevant in portfolio optimisation [32]. The correlation matrix of a stock portfolio might lose its positive semidefiniteness due to noise and rounding errors of previous data manipulations. Consequently, we are interested to find the nearest correlation matrix X to a given data matrix $C \in {\mathbb {R}}^{n\times n}$:

$$\begin{aligned} \begin{array}{lll} \text{ minimise } &{} \textstyle {\frac{1}{2}} \left\Vert X- C\right\Vert _F^2 \\ \text{ subject } \text{ to } &{} \text {diag}(X) = {\mathbf {1}}_n, &{} X \in {\mathbb {S}}_+^n. \end{array} \end{aligned}$$

(21)

In order to transform the problem into the standard form (2) used by COSMO, C and X are vectorised and the objective function is expanded:

$$\begin{aligned} \begin{array}{lll} \text{ minimise } &{} (1/2)(x^\top x - 2c^\top x + c^\top c)\\ \text{ subject } \text{ to } &{} \begin{bmatrix}E \\ -I \end{bmatrix} x +s = \begin{bmatrix} {\mathbf {1}}_{n} \\ {\mathbf {0}}_{n^2} \end{bmatrix},&s \in \{0\}^n \times {\mathcal {S}}_+^n, \end{array} \end{aligned}$$

(22)

with $c=\text {vec}(C) \in {\mathbb {R}}^{n^2}$ and $x=\text {vec}(X) \in {\mathbb {R}}^{n^2}$. Here $E \in {\mathbb {R}}^{n\times n^2}$ is a matrix that extracts the n diagonal entries $X_{ii}$ from its vectorised form x. For the benchmark problems, we randomly sample the data matrix C with entries $C_{i,j} \sim \mathcal {U}(-1,1)$ from a uniform distribution. Table 4 shows the benchmark results for increasing matrix dimension n.

Table 4 Benchmark results for nearest correlation matrix problems

Full size table

Unsurprisingly, the first-order methods SCS and COSMO outperform the interior-point solver MOSEK for these large SDPs. Furthermore, for larger problems the solve times of COSMO and SCS scale in a similar way. However, SCS has to transform the objective into an additional second-order cone constraint, which increases the factorisation- and projection time compared to COSMO.

Maros and Mészáros QP Test Set The Maros and Mészáros test problem set [38] is a repository of challenging convex QP problems that is widely used to compare the performance of QP solvers. For comparison metrics we compute the failure rate, the number of fastest solve times and the normalised shifted geometric mean for each solver. The shifted geometric mean is more robust against large outliers (compared to the arithmetic mean) and against small outliers (compared to the geometric mean) and is commonly used in optimisation benchmarks; see [43, 56]. The shifted geometric mean $\mu _{g, s}$ is defined as $\mu _{g, s} =\root n \of {\prod _{p}(t_{p, s} + \mathrm {sh}) - \mathrm {sh} }$ with total solver time $t_{p, s}$ of solver s and problem p, shifting factor $\mathrm {sh}$ and size of the problem set n. In the reported results a shifting factor of $sh = 10$ was chosen and the maximum allowable time $t_{p, s} = {300}\,{\mathrm{s}}$ was used if solver s failed on problem p. Lastly, we normalise the shifted geometric mean for solver s by dividing by the geometric mean of the fastest solver. The failure rate $f_{r, s}$ is given by the number of unsolved problems compared to the total number of problems in the problem set. As unsolved problems, we count instances where the algorithm does not converge within the allowable time or fails during the setup or solve phase. Table 5 shows the normalised shifted geometric mean and the failure rate for each solver. Additionally, the number of cases where solver s was the fastest solver is shown.

Table 5 Shifted geometric mean and solver failure rates for the Maros and Mészáros QP test set

Full size table

OSQP shows the best performance in terms of lowest failure rates, number of fastest solves and in the shifted geometric mean of solve times. COSMO follows very closely. The shifted geometric mean of MOSEK seems to suffer from a higher failure rate compared to OSQP/COSMO, and SCS fails on a large number of problems. The higher failure rate is likely to be due to the necessary transformation into a second-order cone problem. For this problem set of QPs, COSMO’s algorithm reduces, with some minor differences, to the algorithm of OSQP. Consequently, this benchmark is useful to evaluate the performance penalty incurred by COSMO due to its implementation in the higher-order language Julia. The results in Table 5 show that the performance difference is very small. This is confirmed by the solve times for increasing problem dimension, shown in Fig. 5.

COSMO achieves lower solve times than both MOSEK and SCS especially for very small and very large problems. COSMO and OSQP have nearly identical performance, aside from very small problems that are solved in the range of $1 \times 10^{-5}\,\mathrm{s}\hbox { to }1 \times 10^{-4}\,\mathrm{s}$. The marginally better performance of OSQP for the smallest problems in the test set is the reason that OSQP is the faster solver in a larger number of cases in Table 5. This difference is primarily due to overheads incurred from features in our Julia implementation that support more than one constraint type during problem setup. We demonstrate the resulting extensibility and customisability of our solver by solving QPs with custom constraints in Sect. 6.3.

6.3 Custom Convex Cones

In many cases, writing a custom solver algorithm for a particular problem can be faster than using available solver packages if a particular aspect of the problem structure can be exploited to speed up parts of the computations. As mentioned earlier, COSMO supports user customisation by allowing the definition of new convex cones. This is useful if constraints of the problem can be expressed using this new convex cone and a fast projection method onto the cone exists. A fast specialised projection method in an ADMM framework has for example been used by the authors in [8] to solve the error-correcting code decoding problem.

To demonstrate the advantage of custom convex cones, consider the problem of finding the doubly stochastic matrix that is nearest, in the Frobenius norm, to a given symmetric matrix $C \in {\mathbb {S}}^{n}$. Doubly stochastic matrices are used for instance in spectral clustering [65] and matrix balancing [50]. A specialised algorithm for this problem type is discussed in [54]. Doubly stochastic matrices have the property that all rows and columns each sum to one and all entries are nonnegative. The nearest doubly stochastic matrix X can be found by solving the following optimisation problem:

$$\begin{aligned} \begin{array}{ll} \text{ minimise } &{} \textstyle {\frac{1}{2}} \left\Vert X- C\right\Vert _F^2\\ \text{ subject } \text{ to } &{} X_{ij} \ge 0, \, X {\mathbf {1}} = {\mathbf {1}}, \, X^\top {\mathbf {1}} = {\mathbf {1}}, \end{array} \end{aligned}$$

(23)

with symmetric real matrix $C \in {\mathbb {S}}^{n}$ and decision variable $X \in {\mathbb {R}}^{n\times n}$. This problem can be solved as a QP in the following form using equality and inequality constraints:

$$\begin{aligned} \begin{array}{lll} \text{ minimise } &{} \frac{1}{2}(x^\top x - 2c^\top x + c^\top c)\\ \text{ subject } \text{ to } &{} \begin{bmatrix} {\mathbf {1}}^\top _n \otimes I_n \\ I_n \otimes {\mathbf {1}}^\top _n \\ - I_{n^2} \end{bmatrix} x + s = \begin{bmatrix} {\mathbf {1}}_{2n} \\ {\mathbf {1}}_{2n} \\ {\mathbf {0}}_{n^2} \end{bmatrix},&s \in \{0\}^{4n} \times {\mathbb {R}}_+^{n^2}, \end{array} \end{aligned}$$

(24)

with $x = \mathrm {vec}(X)$ and $c = \mathrm {vec}(C)$ However, the problem can be written in a more compact form by using a custom projection function to project the matrix iterate onto the affine set of matrices ${\mathcal {C}}_{\sum }$, whose rows and columns each sum to one. In general the projection of vector $s \in {\mathbb {R}}^n$ onto the affine set $\mathcal {C}_a = \{ s \in {\mathbb {R}}^n \mid As = b \}$ is given by $\varPi _{\mathcal {C}_a}(s) = s - A^\top \left( A A^\top \right) ^{-1} (A s - b)$, where A is assumed to have full rank. In the case of $\mathcal {C}_a = \mathcal {C}_{\sum }$ we can exploit the fact that the inverse of $AA^\top $ can be efficiently computed. The projection can be carried out as described in Algorithm 4; see Appendix A.1 for a derivation.

Notice that Algorithm 4 can be implemented efficiently without ever assembling and storing A and ${\mathbf {1}} {\mathbf {1}}^\top $. By using the custom convex set ${\mathcal {C}}_{\sum }$ and the corresponding projection function, (23) can now be rewritten as:

$$\begin{aligned} \begin{array}{lll} \text{ minimise } &{} (1/2)(x^\top x - 2c^\top x + c^\top c)\\ \text{ subject } \text{ to } &{} \begin{bmatrix} -I_{n^2} \\ - I_{n^2} \end{bmatrix} x + s = {\mathbf {0}}_{2 n^2},&s \in {\mathcal {C}}_{\sum } \times {\mathbb {R}}_+^{n^2}. \end{array} \end{aligned}$$

(25)

The sparsity pattern of the new constraint matrix A only consists of two diagonals and the number of nonzeros reduces from $3n^2$ to $2n^2$. We expect this to reduce the initial factorisation time of the linear system in (6) as well as the forward- and back-substitution steps.

Figure 6 shows the total solve time of all the solvers for problem (23) with randomly generated dense matrix C with $C_{ij} \sim {\mathcal {U}}(0, 1)$ and increasing matrix dimension. Additionally, we show the solve time for COSMO in the problem form (23) and with a specialised custom set and projection function as in (25).

It is not surprising that COSMO and OSQP scale in the same way for this problem type. MOSEK is slightly slower for smaller problem dimensions and overtakes COSMO/OSQP for problems of dimensions $n \ge 500$. This is probably due to the use in MOSEK of a faster multithreaded linear system solver while OSQP/COSMO relies on the single-threaded solver QDLDL. The longer solve time of SCS is due to slow convergence of the algorithm for this problem type. Furthermore, when the problem is solved with a custom convex set as in (25) COSMO(CS) is able to outperform all other solvers. Table 6 shows the total solve time and the factorisation time of the two versions of COSMO for small, medium and large problems. As predicted, the lower solve time can be mainly attributed to the faster factorisation time.

Table 6 Solve times and factorisation times of COSMO and COSMO(CS) for small, medium and large doubly stochastic matrix problems

Full size table

7 Conclusions

This paper describes the first-order solver COSMO which combines direct support of quadratic objectives, infeasibility detection, custom constraints, chordal decomposition of PSD constraints and automatic clique merging. The performance of the solver is illustrated on a number of benchmark problems that challenge different aspects of modern solvers. WE show how our clique graph-based merging strategy can achieve up to 50% time reduction in the projection step when solving sparse SDPs. We further show how multithreading and chordal decomposition can be utilised to solve very large SDPs that arise in real applications within minutes where competing solvers run out of memory or need hours to find a solution. Our implementation in the high-level Julia language facilitates rapid development and testing of ideas and allows users to customise the solver for their applications. We illustrate the utility of these features by demonstrating a 5x reduction in solve time when using a custom constraint projection for large QPs. It further allows the abstraction of precision and array types which in principle make it possible for COSMO to run on GPUs. Further performance gains are likely to be achieved by exploring acceleration methods to speed up convergence to higher accuracies and reduce the dependency on problem scaling.

Notes

To allow infeasibility detection, the user has to either define a convex cone, a convex compact set, or a composition of the two.

References

Agler, J., Helton, W., McCullough, S., Rodman, L.: Positive semidefinite matrices with a given sparsity pattern. Linear Algebra Appl. 107, 101–149 (1988)
Article MathSciNet MATH Google Scholar
Alizadeh, F.: Interior point methods in semidefinite programming with applications to combinatorial optimization. SIAM J. Optim. 5(1), 13–51 (1995)
Article MathSciNet MATH Google Scholar
Alizadeh, F., Haeberly, J., Overton, M.L.: Primal-dual interior-point methods for semidefinite programming: convergence rates, stability and numerical results. SIAM J. Optim. 8(3), 746–768 (1998)
Article MathSciNet MATH Google Scholar
Andersen, M.S., Vandenberghe, L.: CHOMPACK: a python package for chordal matrix computations (2015)
Andersen, M.S., Dahl, J., Vandenberghe, L.: Implementation of nonsymmetric interior-point methods for linear optimization over sparse matrix cones. Math. Program. Comput. 2(3–4), 167–201 (2010)
Article MathSciNet MATH Google Scholar
Anderson, E., Bai, Z., Bischof, C., Blackford, L.S., Demmel, J., Dongarra, J., Du Croz, J., Greenbaum, A., Hammarling, S., McKenney, A.: LAPACK Users’ Guide. SIAM, Philadelphia (1999)
Book MATH Google Scholar
Banjac, G., Goulart, P., Stellato, B., Boyd, S.: Infeasibility detection in the alternating direction method of multipliers for convex optimization. J. Optim. Theory Appl. 183(2), 490–519 (2019)
Article MathSciNet MATH Google Scholar
Barman, S., Liu, X., Draper, S.C., Recht, B.: Decomposition methods for large scale LP decoding. IEEE Trans. Inf. Theory 59(12), 7870–7886 (2013)
Article MathSciNet MATH Google Scholar
Bezanson, J., Edelman, A., Karpinski, S., Shah, V.B.: Julia: a fresh approach to numerical computing. SIAM Rev. 59(1), 65–98 (2017)
Article MathSciNet MATH Google Scholar
Borchers, B.: SDPLIB 1.2, a library of semidefinite programming test problems. Optim. Methods Softw. 11(1–4), 683–690 (1999)
Article MathSciNet MATH Google Scholar
Borm, P., Hamers, H., Hendrickx, R.: Operations research games: a survey. Top 9(2), 139 (2001)
Article MathSciNet MATH Google Scholar
Boyd, S., El Ghaoui, L., Feron, E., Balakrishnan, V.: Linear Matrix Inequalities in System and Control Theory, vol. 15. SIAM, Philadelphia (1994)
Book MATH Google Scholar
Boyd, S., Parikh, N., Chu, E., Peleato, B., Eckstein, J.: Distributed optimization and statistical learning via the alternating direction method of multipliers. Found. Trends Mach. Learn. 3(1), 1–122 (2011)
Article MATH Google Scholar
Boyd, S., Busseti, E., Diamond, S., Kahn, R.N., Koh, K., Nystrup, P., Speth, J.: Multi-period trading via convex optimization. Found. Trends Optim. 3(1), 1–76 (2017)
Article Google Scholar
Coleman, T.F., Li, Y.: Compact clique tree data structures in sparse matrix factorizations. In: Large-scale Numerical Optimization. SIAM (1990)
Cortes, C., Vapnik, V.: Support-vector networks. Mach. Learn. 20(3), 273–297 (1995)
Article MATH Google Scholar
Davis, T.A.: Suitesparse: a suite of sparse matrix software. http://faculty.cse.tamu.edu/davis/suitesparse.html (2015)
Davis, T.A., Hu, Y.: The University of Florida sparse matrix collection. ACM Trans. Math. Softw. (TOMS) 38(1), 1–25 (2011)
MathSciNet MATH Google Scholar
Dunning, I., Huchette, J., Lubin, M.: JuMP: a modeling language for mathematical optimization. SIAM Rev. 59(2), 295–320 (2017)
Article MathSciNet MATH Google Scholar
Eckstein, J.: Splitting methods for monotone operators with applications to parallel optimization. Ph.D. thesis, Massachusetts Institute of Technology (1989)
Everett, H.: Generalized Lagrange multiplier method for solving problems of optimum allocation of resources. Oper. Res. 11(3), 399–417 (1963)
Article MathSciNet MATH Google Scholar
Fujisawa, K., Kim, S., Kojima, M., Okamoto, Y., Yamashita, M.: User’s manual for SparseCoLO: Conversion methods for sparse conic-form linear optimization problems. Research Report B-453, Dept. of Math. and Comp. Sci. Japan, Tech. Rep., pp. 152–8552 (2009)
Fujisawa, K., Fukuda, M., Nakata, K.: Preprocessing sparse semidefinite programs via matrix completion. Optim. Methods Softw. 21(1), 17–39 (2006)
Article MathSciNet MATH Google Scholar
Fukuda, M., Kojima, M., Murota, K., Nakata, K.: Exploiting sparsity in semidefinite programming via matrix completion I: general framework. SIAM J. Optim. 11(3), 647–674 (2001)
Article MathSciNet MATH Google Scholar
Gabay, D., Mercier, B.: A dual algorithm for the solution of non linear variational problems via finite element approximation. Institut de recherche d’informatique et d’automatique (1975)
Garstka, M., Cannon, M., Goulart, P.: COSMO: a conic operator splitting method for large convex problems. In: 18th European Control Conference (ECC) (Naples, Italy), pp. 1951–1956 (2019)
Garstka, M., Cannon, M., Goulart, P.: A clique graph based merging strategy for decomposable SDPs. IFAC-PapersOnLine 53(2), 7355–7361 (2020). 21th IFAC World Congress
Article Google Scholar
Grone, R., Johnson, C.R., Sá, E.M., Wolkowicz, H.: Positive definite completions of partial Hermitian matrices. Linear Algebra Appl. 58, 109–124 (1984)
Article MathSciNet MATH Google Scholar
Habib, M., Stacho, J.: Reduced clique graphs of chordal graphs. Eur. J. Comb. 33(5), 712–735 (2012)
Article MathSciNet MATH Google Scholar
Helmberg, C., Rendl, F., Vanderbei, R.J., Wolkowicz, H.: An interior-point method for semidefinite programming. SIAM J. Optim. 6(2), 342–361 (1996)
Article MathSciNet MATH Google Scholar
Hestenes, M.R.: Multiplier and gradient methods. J. Optim. Theory Appl. 4(5), 303–320 (1969)
Article MathSciNet MATH Google Scholar
Higham, N.J.: Computing the nearest correlation matrix-a problem from finance. IMA J. Numer. Anal. 22(3), 329–343 (2002)
Article MathSciNet MATH Google Scholar
Johnson, D., Pataki, G., Alizadeh, F.: Seventh dimacs implementation challenge: semidefinite and related problems (2000). https://dimacs.rutgers.edu/Challenges/Seventh
Kalinkin, A., Anders, A., Anders, R.: Intel Math Kernel Library PARDISO* for Intel Xeon Phi manycore coprocessor. Appl. Math. 6(08), 1276 (2015)
Article Google Scholar
Karmarkar, N.: A new polynomial-time algorithm for linear programming. In: Proceedings of the 16th Annual ACM Symposium on Theory of Computing, pp. 302–311. ACM (1984)
Kruskal, J.B.: On the shortest spanning subtree of a graph and the traveling salesman problem. Proc. Am. Math. Soc. 7(1), 48–50 (1956)
Article MathSciNet MATH Google Scholar
Legat, B., Dowson, O., Garcia, J.D., Lubin, M.: MathOptInterface: a data structure for mathematical optimization problems (2020). arXiv: 2002.03447 [math.OC]
Maros, I., Mészáros, C.: A repository of convex quadratic programming problems. Optim. Methods Softw. 11(1–4), 671–681 (1999)
Article MathSciNet MATH Google Scholar
Mehrotra, S.: On the implementation of a primal-dual interior point method. SIAM J. Optim. 2(4), 575–601 (1992)
Article MathSciNet MATH Google Scholar
Miele, A., Cragg, E.E., Iyer, R.R., Levy, A.V.: Use of the augmented penalty function in mathematical programming problems, part 1. J. Optim. Theory Appl. 8(2), 115–130 (1971)
Article MathSciNet MATH Google Scholar
Miele, A., Cragg, E.E., Levy, A.V.: Use of the augmented penalty function in mathematical programming problems, part 2. J. Optim. Theory Appl. 8(2), 131–153 (1971)
Article MathSciNet MATH Google Scholar
Miele, A., Moseley, P.E., Levy, A.V., Coggins, G.M.: On the method of multipliers for mathematical programming problems. J. Optim. Theory Appl. 10(1), 1–33 (1972)
Article MathSciNet MATH Google Scholar
Mittelmann, H.: Benchmarks for optimization software (2020). http://plato.asu.edu/bench.html. Accessed 22 June 2020
Molzahn, D.K., Holzer, J.T., Lesieutre, B.C., DeMarco, C.L.: Implementation of a large-scale optimal power flow solver based on semidefinite programming. IEEE Trans. Power Syst. 28(4), 3987–3998 (2013)
Article Google Scholar
MOSEK, Aps. The MOSEK optimization toolbox for MATLAB manual. Version 8.0 (2017)
Nakata, K., Fujisawa, K., Fukuda, M., Kojima, M., Murota, K.: Exploiting sparsity in semidefinite programming via matrix completion II: implementation and numerical results. Math. Program. 95(2), 303–327 (2003)
Article MathSciNet MATH Google Scholar
O’Donoghue, B., Chu, E., Parikh, N., Boyd, S.: Conic optimization via operator splitting and homogeneous self-dual embedding. J. Optim. Theory Appl. 169, 1042–1068 (2016)
Article MathSciNet MATH Google Scholar
Parikh, N., Boyd, S.: Proximal algorithms. Found. Trends Optim. 1(3), 127–239 (2014)
Article Google Scholar
Powell, M.J.D.: A method for nonlinear constraints in minimization problems. In: Fletcher, R. (ed.) Optimization, pp. 283–298. Academic Press, London, New York (1969)
Rao, S., Huntley, M.H., Durand, N.C., Stamenova, E.K., Bochkov, I.D., Robinson, J.T., Sanborn, A.L., Machol, I., Omer, A.D., Lander, E.S.: A 3d map of the human genome at kilobase resolution reveals principles of chromatin looping. Cell 159(7), 1665–1680 (2014)
Article Google Scholar
Rockafellar, R.T.: Convex Analysis. Princeton University Press, Princeton (1970)
Book MATH Google Scholar
Rockafellar, R.T.: Monotone operators and the proximal point algorithm. SIAM J. Control Optim. 14(5), 877–898 (1976)
Article MathSciNet MATH Google Scholar
Rontsis, N., Goulart, P.J., Nakatsukasa, Y.: Efficient semidefinite programming with approximate ADMM (2019). arXiv:1912.02767 [math.OC]
Rontsis, N., Goulart, P.: Optimal approximation of doubly stochastic matrices. In: International Conference on Artificial Intelligence and Statistics, pp. 3589–3598. PMLR (2020)
Schenk, O., Gärtner, K., Karypis, G., Röllin, S., Hagemann, M.: Pardiso solver project. http://www.pardiso-project.org (2010)
Stellato, B., Banjac, G., Goulart, P., Bemporad, A., Boyd, S.: OSQP: an operator splitting solver for quadratic programs. Math. Program. Comput. 12, 637–672 (2020)
Article MathSciNet MATH Google Scholar
Sturm, J.F.: Using SeDuMi 1.02, a MATLAB toolbox for optimization over symmetric cones. Optim. Methods Softw. 11(1–4), 625–653 (1999)
Article MathSciNet MATH Google Scholar
Sun, Y., Andersen, M.S., Vandenberghe, L.: Decomposition in conic optimization with partially separable structure. SIAM J. Optim. 24(2), 873–897 (2014)
Article MathSciNet MATH Google Scholar
Toh, K., Todd, M.J., Tütüncü, R.H.: SDPT3–a MATLAB software package for semidefinite programming, version 1.3. Optim. Methods Softw. 11(1–4), 545–581 (1999)
Article MathSciNet MATH Google Scholar
Udell, M., Mohan, K., Zeng, D., Hong, J., Diamond, S., Boyd, S.: Convex optimization in Julia. In: SC14 Workshop on High Performance Technical Computing in Dynamic Languages (2014)
Vandenberghe, L., Andersen, M.S.: Chordal graphs and semidefinite optimization. Found. Trends Optim. 1(4), 241–433 (2015)
Article Google Scholar
Wright, S.J.: Primal-Dual Interior-Point Methods, vol. 54. SIAM, Philadelphia (1997)
Book MATH Google Scholar
Wright, S., Nocedal, J.: Numerical Optimization, vol. 35. Springer, Berlin (1999)
MATH Google Scholar
Yannakakis, M.: Computing the minimum fill-in is NP-complete. SIAM J. Algebraic Discrete Methods 2(1), 77–79 (1981)
Article MathSciNet MATH Google Scholar
Zass, R., Shashua, A.: Doubly stochastic normalization for spectral clustering. In: Advances in Neural Information Processing Systems, pp. 1569–1576 (2007)
Zheng, Y., Fantuzzi, G., Papachristodoulou, A., Goulart, P., Wynn, A.: Chordal decomposition in operator-splitting methods for sparse semidefinite programs. Math. Program. 180(1), 489–532 (2020)
Article MathSciNet MATH Google Scholar

Download references

Author information

Authors and Affiliations

University of Oxford, Parks Road, Oxford, OX1 3PJ, UK
Michael Garstka, Mark Cannon & Paul Goulart

Authors

Michael Garstka
View author publications
You can also search for this author in PubMed Google Scholar
Mark Cannon
View author publications
You can also search for this author in PubMed Google Scholar
Paul Goulart
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Michael Garstka.

Additional information

Communicated by Suliman Saleh Al-Homidan.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Preliminary versions of this work appeared in [26, 27]. MG is supported by the Clarendon Scholarship.

Appendix

1.1 Projection onto ${\mathcal {C}}_{\sum }$

The projection of a symmetric matrix $S \in {\mathbb {S}}^{n}$ onto the set of matrices where the sum of rows and columns each equal to one can be written as the following optimisation problem:

$$\begin{aligned} \begin{array}{llll} \text{ minimise }&\textstyle {\frac{1}{2}} \left\Vert s_p - s\right\Vert _F^2&\text{ subject } \text{ to }&A s_p = {\mathbf {1}}_{2n-1} \end{array} \end{aligned}$$

(26)

where $s = \text {vec}(S)$ is the vectorised matrix, $s_p$ is the projected vector and A is partitioned $A = [A_r, A_c]^\top $. $A_r = [{\mathbf {1}}_n^\top \otimes I_n]$ is used to constrain the rows of S and $A_c = [I_{n-1} \otimes {\mathbf {1}}_n^\top , {\mathbf {0}}_{n-1 \times n}]$ is used to constrain the columns of S. Notice that $A_c$ is a $(n-1) \times n$ matrix because the redundant constraint on the last column was removed. The KKT conditions for (26) are given by:

$$\begin{aligned} A s_p&= {\mathbf {1}}_{2n-1} \text { and } s_p - s + A^\top \eta = 0, \end{aligned}$$

(27)

with dual variable $\eta $. Eliminating $s_p$ and solving for $\eta $ yields $\eta = (A A^\top )^{-1} (A s - {\mathbf {1}}_{2n-1})$. The projected vector $s_p$ can be recovered from (27):

$$\begin{aligned} s_p = s - A^\top \eta . \end{aligned}$$

(28)

It turns out that the inverse of $AA^\top $ can be efficiently computed without a factorisation. To see this, write the equation for $\eta $ as a linear system:

$$\begin{aligned} \begin{bmatrix} n I_n &{} {\mathbf {1}}_n {\mathbf {1}}_{n-1}^\top \\ {\mathbf {1}}_{n-1} {\mathbf {1}}_{n}^\top &{} n I_{n-1} \end{bmatrix} \begin{bmatrix} \eta _1 \\ \eta _2 \end{bmatrix} = \begin{bmatrix} r_1 \\ r_2 \end{bmatrix} \, \text {with } \begin{bmatrix} r_1 \\ r_2 \end{bmatrix} = A s - {\mathbf {1}}_{2n-1}, \end{aligned}$$

(29)

where $\eta $ and the right-hand side were partitioned in such a way that $\eta _1, r_1 \in {\mathbb {R}}^n$ and $\eta _2, r_2 \in {\mathbb {R}}^{n-1}$. By eliminating $\eta _1$ from (29) and applying the Sherman–Morrison formula to compute the matrix inverse we compute $\eta _2$

$$\begin{aligned} \left( I_{n-1} - \textstyle {\frac{1}{n}} {\mathbf {1}}_{n-1} {\mathbf {1}}_{n-1}^\top \right) \eta _2&=\textstyle {\frac{1}{n}} \left( r_2 - \textstyle {\frac{1}{n}} {\mathbf {1}}_{n-1} {\mathbf {1}}_{n}^\top r_1 \right) \end{aligned}$$

(30)

$$\begin{aligned} \eta _2&= \textstyle {\frac{1}{n}} \left( I_{n-1} + {\mathbf {1}}_{n-1} {\mathbf {1}}_{n-1}^\top \right) \left( r_2 - \textstyle {\frac{1}{n}} {\mathbf {1}}_{n-1} {\mathbf {1}}_{n}^\top r_1 \right) . \end{aligned}$$

(31)

By substituting $\eta _2$ into the upper equation of (29), we compute $ \eta _1 = \textstyle {\frac{1}{n}} (r_1 - {\mathbf {1}}_{n} {\mathbf {1}}_{n-1}^\top \eta _2 )$. Having computed the dual variable $\eta $, the projected vector $s_p$ is obtained by solving (28).

1.2 Benchmark Results

See Table 7.

Table 7 Benchmark results for non-chordal sparse SDPs from SDPLib and SDPs generated with sparsity patterns from the SuiteSparse Matrix Collection for different merging strategies

Full size table

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Reprints and permissions

About this article

Cite this article

Garstka, M., Cannon, M. & Goulart, P. COSMO: A Conic Operator Splitting Method for Convex Conic Problems. J Optim Theory Appl 190, 779–810 (2021). https://doi.org/10.1007/s10957-021-01896-x

Download citation

Received: 08 January 2021
Accepted: 16 June 2021
Published: 29 August 2021
Issue Date: September 2021
DOI: https://doi.org/10.1007/s10957-021-01896-x

Keywords

Mathematics Subject Classification

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

COSMO: A Conic Operator Splitting Method for Convex Conic Problems

Abstract

Similar content being viewed by others

Douglas–Rachford splitting method for semidefinite programming

Conic Optimization via Operator Splitting and Homogeneous Self-Dual Embedding

Peaceman–Rachford splitting for a class of nonconvex optimization problems

1 Introduction

2 Problem Description and Core Algorithm