GPU Acceleration of ADMM for Large-Scale Quadratic Programming

The alternating direction method of multipliers (ADMM) is a powerful operator splitting technique for solving structured convex optimization problems. Due to its relatively low per-iteration computational cost and ability to exploit sparsity in the problem data, it is particularly suitable for large-scale optimization. However, the method may still take prohibitively long to compute solutions to very large problem instances. Although ADMM is known to be parallelizable, this feature is rarely exploited in real implementations. In this paper we exploit the parallel computing architecture of a graphics processing unit (GPU) to accelerate ADMM. We build our solver on top of OSQP, a state-of-the-art implementation of ADMM for quadratic programming. Our open-source CUDA C implementation has been tested on many large-scale problems and was shown to be up to two orders of magnitude faster than the CPU implementation.


Introduction
Convex optimization has become a standard tool in many engineering fields including control [GPM89,RM09], signal processing [MB10], statistics [Hub64,Tib96,CWB08], finance [Mar52, CT06, BMOW14, BBD + 17], and machine learning [CV95]. In some of these applications one seeks solutions to optimization problems whose dimensions can be very large. For such problems, classical optimization algorithms, such as interior-point methods, may fail to provide a solution.
In the last decade operator splitting methods, such as the proximal gradient method and the alternating direction method of multipliers (ADMM), have gained increasing attention in a wide range of application areas [BPC + 11, PB13, BSM + 17]. These methods scale well with the problem dimensions, can exploit sparsity in the problem data efficiently, and are often easily parallelizable. Moreover, requirements on the solution accuracy are often moderate because of the noise in the data and arbitrariness of the objective. This supports the use of operator splitting methods, which return solutions of a medium accuracy at a reasonable computational effort.
Graphics processing units (GPUs) are hardware accelerators that offer an unmatched amount of parallel computational power for their relatively low price. They provide far greater memory bandwidths than conventional CPU-based systems, which is especially beneficial in applications that process large amounts of data. It is thus no surprise that the use of GPUs has seen many applications in the area of machine learning, ranging from training deep neural networks [KSH17, LTH + 17, CDY + 19] to autonomous driving [LTZG17]. Many software tools for machine learning, including PyTorch [SDC + 19], Ten-sorFlow [ABC + 16], Theano [The16], and CNTK [SA16], have native support for GPU acceleration.
This paper explores the possibilities offered by the massive parallelism of GPUs to accelerate solutions to large-scale quadratic programs (QPs). We build our solver on top of the ADMM-based OSQP solver [SBG + 19]. The authors in [OCPB16] have demonstrated that GPUs can be used to accelerate the solution to the linear system arising in their method. We follow a similar approach to accelerate OSQP by replacing its direct linear system solver with an indirect (iterative) one, which we implement on the GPU. Moreover, we perform all vector and matrix operations on the GPU, which further improves the performance of our implementation. While the authors in [AK17,CMR19] use GPUs to solve linear programs (LPs) and QPs in batches, i.e., they solve numerous different problems within one operation, our solver is designed for solving a single but large-scale problem at a time.

Outline
We introduce the problem of interest in Section 2 and summarize the algorithm used by the OSQP solver in Section 3. We then present in Section 4 an alternative method for solving the linear system arising in OSQP. We give a short summary of general GPU programming strategies in Section 5, followed by implementation details of the proposed GPU-based solver in Section 6. Section 7 demonstrates the performance of our solver on large-scale numerical examples. Finally, Section 8 concludes the paper.

Notation
Let R denote the set of real numbers, R n the n-dimensional real space, R m×n the set of real m-by-n matrices, and S n ++ (S n + ) the set of real n-by-n symmetric positive (semi)definite matrices. We denote by I and 1 the identity matrix and the vector of all ones (of appropriate dimensions), respectively. For a vector x ∈ R n , we denote its i-th element by x i , the Euclidean norm by x 2 := √ x T x, and the ℓ ∞ norm by x ∞ := max i |x i |. For a matrix K ∈ S n ++ , we denote the K-norm of x ∈ R n by x K := √ x T Kx. The gradient of a differentiable function f : R n → R evaluated at x ∈ R n is denoted by ∇f (x). For a nonempty, closed, and convex set C ⊆ R n , we denote the Euclidean projection of x ∈ R n onto C by Π C (x) := argmin y∈C x − y 2 . The Euclidean projections of x ∈ R n onto the nonnegative and nonpositive orthants are denoted by x + := max(x, 0) and x − := min(x, 0), respectively.

Problem Description
Consider the following QP: where x ∈ R n is the optimization variable. The objective function is defined by a positive semidefinite matrix P ∈ S n + and a vector q ∈ R n , and the constraints by a matrix A ∈ R m×n and vectors l and u so that l i ∈ R ∪ {−∞}, u i ∈ R ∪ {+∞}, and l i ≤ u i for all i = 1, . . . , m. Linear equality constraints can be encoded in this way by setting l i = u i .

Optimality and Infeasibility Conditions
By introducing a variable z ∈ R m , we can rewrite problem (1) in an equivalent form (2) The optimality conditions for problem (2) are given by [SBG + 19] Ax − z = 0 (3a) where y ∈ R m is a Lagrange multiplier associated with the constraint Ax = z. If there exist x ∈ R n , z ∈ R m , and y ∈ R m that satisfy (3), then we say that (x, z) is a primal and y is a dual solution to problem (2).
Problem (1) need not have a solution. If there existsȳ ∈ R m such that then problem (1) is infeasible and we say thatȳ is a certificate of primal infeasibility. Similarly, if there existsx ∈ R n such that for all i = 1, . . . , m, then the dual of problem (1) is infeasible and we say thatx is a certificate of dual infeasibility. We refer the reader to [BGSB19,Prop. 3.1] for more details.

OSQP Solver
OSQP is an open-source numerical solver for convex QPs. It is based on ADMM and was shown to be competitive to and even faster than commercial QP solvers [SBG + 19]. An iteration of OSQP is shown in Algorithm 1. The scalar α ∈]0, 2[ is called the relaxation parameter, and σ > 0 and R ∈ S n ++ are the penalty parameters. OSQP uses diagonal positive definite matrix R, which makes R −1 easily computable. In step 6 of Algorithm 1 it evaluates the Euclidean projection onto the box [l, u] := {z ∈ R m | l ≤ z ≤ u}, which has a simple closed-form solution where min and max operators should be taken elementwise.
If problem (2) is solvable, then the sequence (x k , z k , y k ) generated by Algorithm 1 converges to its primal-dual solution [BGSB19, SBG + 19]. On the other hand, if the problem is primal or dual infeasible, then the iterates (x k , z k , y k ) do not converge, but the sequence (δx k , δz k , δy k ) := (x k − x k−1 , z k − z k−1 , y k − y k−1 ) always converges and can be used to certify infeasibility of the problem. In particular, if the problem is primal infeasible then δy := lim k→∞ δy k will satisfy (4), whereas δx := lim k→∞ δx k will satisfy (5) if it is dual infeasible [BGSB19, Thm. 5.1].

Termination Criteria
For the given iterates (x k , z k , y k ), we define the primal and dual residuals as The authors in [SBG + 19] show that the pair (z k , y k ) satisfies optimality conditions (3c)-(3d) for all k > 0 regardless of whether the problem is solvable or not. If the problem is also solvable, then the residuals r k prim and r k dual will converge to zero [BGSB19, Prop. 5.3].
A termination criterion for detecting optimality is thus implemented by checking that r k prim and r k dual are small enough, i.e., where ε prim > 0 and ε dual > 0 are some tolerance levels, which are often chosen relative to the scaling of the algorithm iterates [BPC + 11, §3.3].

Solving the KKT System
Step 4 of Algorithm 1 requires the solution to an equality-constrained QP, which is equivalent to solving the following linear system: from whichz k+1 can be obtained as We refer to the matrix in (7) as the KKT matrix.
OSQP uses a direct method that computes the exact solution to (7) by first computing a factorization of the KKT matrix and then performing forward and backward substitutions. The KKT matrix is symmetric quasi-definite for all σ > 0 and R ∈ S n ++ , which ensures that it is nonsingular and has a well-defined LDL T factorization with diagonal D [GIK00]. Since the KKT matrix does not depend on the iteration counter k, OSQP performs factorization at the beginning of the algorithm, and reuses the factors in subsequent iterations.

Preconditioning
A known weakness of ADMM is its inability to deal effectively with ill-conditioned problems and its convergence can be very slow when data are badly scaled. Preconditioning is a common heuristic aiming to speed-up convergence of first-order methods. OSQP uses a variant of Ruiz equilibration [Rui01,KRU14], given in Algorithm 2, which computes a cost scaling scalar c > 0 and diagonal positive definite matrices D and E that effectively modify problem (1) into the following: where the optimization variables arex = D −1 x,z = E −1 z andȳ = cE −1 y, and the problem data arē P = cDP D,q = cDq,Ā = EAD,l = El,ū = Eu.

Parameter Selection
OSQP sets α = 1.6 and σ = 10 −6 by default and the choice of these parameters does not seem to be critical for the ADMM convergence rate. However, the choice of R = diag(ρ 1 , . . . , ρ m ) is a key determinant of the number of iterations required to satisfy a termination criterion. OSQP sets a higher value of ρ i that is associated with an equality constraint, i.e., whereρ > 0. Having a fixed value ofρ does not provide satisfactory performance of the algorithm across different problems. To compensate for this sensitivity, OSQP adopts an adaptive scheme, which updatesρ during the iterations based on the ratio between norms of the primal and dual residuals [Woh17].
The proposed parameter update scheme makes the algorithm much more robust, but also introduces additional computational burden since updating R changes the KKT matrix in (7), which then needs to be refactored. Updatingρ is thus performed only a few times during the runtime of the algorithm.

Preconditioned Conjugate Gradient Method
An alternative way to solve the equality-constrained QP in step 4 of Algorithm 1 is by using an indirect method. As observed in [SBG + 19], eliminating ν k+1 from (7) results in the reduced KKT system from whichz k+1 can be obtained asz k+1 = Ax k+1 . Note that the reduced KKT matrix is always positive definite, which allows us to use the conjugate gradient (CG) method for solving (9).

Conjugate Gradient Method
The CG method is an iterative method for solving linear systems of the form where K ∈ S n ++ is a symmetric positive definite matrix. The method computes the solution to the linear system in at most n iterations [NW06, Thm. 5.1]. However, when solving large-scale linear systems, one aims to terminate the method after d ≪ n iterations, which yields an approximate solution to (10).
Solving (10) is equivalent to solving the following unconstrained optimization problem:

Conjugate directions
A set of nonzero vectors {p 0 , . . . , p n−1 } is said to be conjugate with respect to K if Successive minimization of f along the conjugate directions p k , i.e., evaluating produces x k+1 that minimizes f over the expanding subspace S k , which is spanned by the previous conjugate directions {p 0 , . . . , p k } [NW06, Thm. 5.2]. The minimization in (11a) has the following closed-form solution: where r k := ∇f (x k ) = Kx k − b is the residual at step k.

Conjugate gradient
There are various choices for the conjugate direction set {p 0 , . . . , p n−1 }. For instance, the eigenvectors of K form a set of conjugate directions with respect to K, but are impractical to compute for large matrices. The cornerstone of the CG method is its ability to generate a set of conjugate directions efficiently. It computes a new direction p k using only the previous direction p k−1 , which imposes low computational and memory requirements. In particular, a new direction p k is computed as a linear combination of the negative gradient −r k and the previous direction p k−1 , The scalar β k is determined from the conjugacy requirement (p k ) T Kp k−1 = 0, leading to The first conjugate direction is set to the negative gradient, i.e., p 0 = −r 0 . Combining the successive minimization (11) and the computation of conjugate directions (12) yield the CG method.

Preconditioning
Since the CG method is a first-order optimization method, it is sensitive to the problem scaling. To improve convergence of the method, we can precondition the linear system by using a coordinate transformationx where C ∈ R n×n is a nonsingular matrix. Applying the CG method to the transformed linear system yields the preconditioned conjugate gradient (PCG) method, which is shown in Algorithm 3 [NW06, Alg. 5.3]. It turns out that C need not be formed explicitly, but rather acts through M = C T C.
In general, a good preconditioner should satisfy M ≈ K and at the same time make the linear system My = r easy to solve [GVL13, §11.5]. One of the simplest choices is the diagonal or Jacobi preconditioner, which contains the diagonal elements of K, making M −1 easily computable.
More advanced choices include the incomplete Cholesky, the incomplete LU, and polynomial preconditioners. The incomplete preconditioners produce an approximate decomposition of K with a high sparsity, so that solving My = r is computationally cheap. The family of polynomial preconditioners include the Chebyshev and least-squares polynomial preconditioners, both of which require a bound on the spectrum of K [AMO92].

GPU Architecture and Programming Strategies
GPUs have been used for general-purpose computing for more than two decades [BHS13]. They come in many different variations and architectures, but we will restrict our discussion to the latest NVIDIA Turing-based architecture. Most of the concepts also apply to older NVIDIA GPUs; for further details, we refer the reader to [NVIa].
A GPU consists of an array of several streaming multiprocessors (SMs), each of which contains multiple integer and floating-point arithmetic units, local caches, shared memory, and several schedulers. The on-board RAM of the GPU is called global memory, in contrast to the shared memory that is local to each SM. While an SM of the Turing generation has 96 KB of shared memory, the global memory is much larger and is typically in order of GB. Compared to the shared memory, it has a much higher latency and a lower bandwidth, but is still much faster than system RAM; bandwidths of 500 GB/s are not uncommon for the GPU global memory, whereas system RAM is limited to 40-50 GB/s.
The main challenge in using GPUs is to leverage the increasing number of processing cores and develop applications that scale their parallelism. A solution designed by NVIDIA to overcome this challenge is called CUDA, a general-purpose parallel computing platform and programming model.

CUDA Architecture
CUDA is an extension of the C programming language created by NVIDIA. Its main idea is to have a large number of threads that solve a problem cooperatively. This section explains how threads are organized into cooperative groups and how CUDA achieves scalability.

Kernels
Kernels are special C functions that are executed on a GPU and are defined by the global keyword. In contrast to regular C functions, kernels get executed N times in parallel by N different threads, where each thread executes the same code, but on different data. This concept is known as single instruction, multiple data (SIMD). The number of threads is specified when calling a kernel, which is referred to as a kernel launch.

Thread hierarchy
While kernels specify the code that is executed by each thread, the thread hierarchy dictates how the individual threads are organized. CUDA has a two-level hierarchy to organize the threads, a grid-level and a block-level. A grid contains multiple blocks and a block contains multiple threads. A kernel launch specifies the grid size (number of blocks) and the block size (number of threads per block).
The threads within one block can cooperate to solve a subproblem. The problem needs to be partitioned into independent subproblems by the programmer so that a grid of thread blocks can solve it in parallel. Each block is scheduled on one of the available SMs, which can happen concurrently or sequentially, depending on the number of blocks and available hardware. If there are enough resources available, then several blocks can be scheduled on a single SM.
The threads within a single block have a unique thread index that is accessible through the built-in variable threadIdx, which is defined as a 3-dimensional vector. This allows threads to be indexed in one-, two-, or three-dimensional blocks, which allows for a natural indexing in the problem domain. Similarly, blocks within a grid have a unique block index that is accessible through the variable blockIdx. It is also defined as a 3-dimensional vector and allows for one-, two-, or three-dimensional indexing.

Accelerating numerical methods
Numerical methods make extensive use of floating-point operations, but their performance is not solely determined by the system's floating-point performance. Although GPUs offer magnitudes larger floating-point power than CPUs, it is the memory bandwidth that limits the performance of many numerical operations [VCC + 10]. Fortunately, GPUs also offer an order of magnitude larger memory bandwidth, but utilizing their full potential is not an easy task since the parallel nature of the GPU requires different programming strategies.
Listing 1 implements the simple Basic Linear Algebra Subprograms (BLAS) routine axpy (y = y + ax) on the CPU. The code uses a simple for loop to iterate through the elements of x and y. A GPU implementation of axpy is shown in Listing 2. The code looks very similar to the CPU version, but has two important differences. First, the for loop is replaced by a simple if condition. Instead of one thread iterating through a loop element-by-element, there is a thread for each element to be processed, which is a common pattern in GPU computing. Second, a thread ID is used to determine the data element which each thread is operating on. A global thread ID is calculated from a local thread index and a block index. The if condition disables threads with a thread ID larger than the number of elements. This is necessary since threads are launched in blocks and the total number of threads usually does not match the number of elements, while the cost of few idle threads is negligible. Figure 1 compares the achieved memory throughput and the floating-point performance of the CPU and GPU implementations of the axpy operation; see Section 7 for the hardware specifications. For large vector sizes the simple GPU implementation is approximately 15 times faster. Note, however, that small problems cannot be accelerated well with GPUs, as there is not enough work to keep the GPU busy and a kernel launch and data transfer come with a constant overhead that cannot be amortized.
The GPU reaches the maximum memory throughput of 522 GB/s, which is 85% of its theoretical peak of 616 GB/s, whereas the peak value of the floating-point performance is around 87 GFLOPS/s, which is less than 1% of its theoretical peak of 13.45 TFLOPS/s. This shows that the performance of axpy is clearly limited by the memory bandwidth.

Segmented reduction
A reduction is an operation that takes a vector x ∈ R n and an associative binary operator ⊕, and returns a scalar y ∈ R [Läm08], This abstract formulation allows us to formulate many operations as a reduction, among others the sum of elements, the maximum value of elements, the ℓ 1 norm, the ℓ ∞ norm etc. The only difference between reduction and segmented reduction is that the latter reduces individual segments of x and outputs a vector that computes reduction over the segments. There exist very efficient parallel implementations for both reduction and segmented reduction, and thus any problem that can be reformulated as one of them can be easily accelerated by a GPU.

CUDA Libraries
There exist multiple libraries shipped with the CUDA Toolkit that implement various functions on the GPU [NVIb]. We summarize in the sequel the NVIDIA libraries used in this work.
• Thrust is a CUDA C ++ library based on the C ++ Standard Template Library (STL). It provides a high-level interface for high-performance parallel applications and all essential data parallel primitives, such as scan, sort, and reduce. • cuBLAS is a CUDA implementation of BLAS, which enables easy GPU acceleration of code that uses BLAS functions. We use only level-1 cuBLAS API functions that implement the inner product, axpy operation, scalar-vector multiplication, and computation of norms. • cuSPARSE is a CUDA library that contains a set of linear algebra subroutines for handling sparse matrices. It requires the matrices to be in one of the sparse matrix formats described in the next section.

COO
The coordinate (COO) format is one of the simplest sparse matrix formats. It is mainly used as an intermediate format to perform matrix operations, such as transpose, concatenation, or the extension of an upper triangular to a full symmetric matrix. It holds the number of rows m, the number of columns n, the number of nonzero elements nnz, and three arrays of dimension nnz: Value, RowIndex, and ColumnIndex. The cuSPARSE API assumes that the indices are sorted by rows first and then by columns within each row, which makes the representation unique.
The 4 × 5 matrix given below: Note that we use the zero-based indexing in the example above and throughout the paper.

CSR
The compressed sparse row (CSR) format differs from the COO format only in the RowIndex array, which is compressed in the CSR format. The compression can be understood as a two-step process. First, we determine from RowIndex the number of nonzero elements in each row, which results in an array of length m. Then, we calculate the cumulative sum of this array and insert a zero at the beginning, which results in an array of length m+1. The obtained array is denoted by RowPointer since it points to the beginning of a row in Value and ColumnIndex arrays.
The RowPointer array has the property that the difference between its two consecutive elements,

RowPointer[k+1] -RowPointer[k],
is equal to the number of nonzero elements in row k. Noting that RowPointer[0] = 0 and applying the property above recursively, it follows that

RowPointer[m] = nnz.
Matrix A given in (13) has the following CSR representation: The CSR format is used for Sparse Matrix-Vector multiplication (SpMV) in cuSPARSE.

CSC
The compressed sparse column (CSC) format differs from the CSR format in two ways: the values are stored in the column-major format and the column indices are compressed. The compressed array has dimension n+1 and is denoted by ColumnPointer. Matrix A given in (13) has the following CSC representation: Value = 1 7 5 2 1 1 4 1 RowIndex = 0 3 1 2 1 3 0 2 ColumnPointer = 0 2 4 6 6 8 .
The CSC format is not used directly for computations in cuSPARSE. However, we can interpret the CSC representation of a matrix A as the CSR representation of A T using the following mapping:

GPU Acceleration of OSQP
Profile-driven software development is based on identifying major computational bottlenecks in the code, as performance will increase the most when removing these [BHS13]. This section identifies and analyzes computational bottlenecks of OSQP when solving large-scale QPs, and shows how we can remove them by making use of GPU's parallelism.

OSQP Computational Bottlenecks
Given a QP in the form (1), we denote the total number of nonzero elements in matrices P and A by N := nnz(P ) + nnz(A). Profiling the OSQP code reveals that for largescale problem instances all operations whose execution time scales with N represent a potential bottleneck since N is typically much larger than the number of QP variables n and constraints m.
The main computational bottleneck is using a direct linear system solver to tackle the KKT system (7). When N is very large, the computational cost of factoring the KKT matrix becomes prohibitively large. This issue also limits the number of parameter updates, which can improve convergence rate of the algorithm, but require the KKT matrix to be refactored. Furthermore, in each ADMM iteration we need to evaluate forward and backward substitutions, which cannot be fully parallelized.
As shown in Section 3.1, evaluating termination criteria requires several SpMVs. Performing these computations in each ADMM iteration can slow down the solver considerably, and thus, OSQP evaluates these criteria every 25 iterations by default so that the overall computational burden is reduced. This means that the algorithm can terminate only when the iteration counter k is a multiple of 25.
Profiling reveals that the matrix equilibration procedure described in Algorithm 2 is also demanding for large-scale problems, where the main bottlenecks are computing the column-norms in step 5 and matrix scaling in step 7.

Representation of Matrices
OSQP represents matrices P and A in the CSC format. Moreover, since P is symmetric, only the upper triangular part of P is actually stored in memory. The preferred way of storing matrices in the GPU memory is using the CSR format since it has a superior SpMV performance on the GPU. However, when using A in the CSR format, computing A T y is around 10 times slower than computing Ax [NVIb]. This inefficiency can be avoided by storing both A and A T in the CSR format, though this doubles the memory requirements.
Similarly, storing only the upper triangular part of P is memory-efficient, but computing P x in that case is much slower than when the full P is stored [NVIb]. Therefore, we store the full P in the CSR format since it improves the SpMV performance.

Reduced KKT System
As discussed in Section 4, we can avoid factoring the KKT matrix by solving the reduced KKT system (9) with the PCG method, which only evaluates SpMVs and can be easily parallelized. Although OSQP uses the parameter matrix of the form R = diag(ρ 1 , . . . , ρ m ), where ρ i is set as in (8), numerical tests show that this choice of R makes the PCG method converge slowly. This can be understood by looking at the effect of R on the reduced KKT matrix. Since R appears in the term A T RA, setting it as in (8) has the effect of scaling the rows of A by different values, which effectively increases the condition number of the matrix.
The convergence rate of the PCG method can be improved by using R =ρI instead. This choice will in general result in more iterations of Algorithm 1, but will reduce the number of iterations of Algorithm 3 considerably. The linear system (9) now reduces to Note that the coefficient matrix above need not be formed explicitly. Instead, the matrixvector product r ← (P + σI +ρA T A)x can be evaluated as z ←ρAx r ← P x + σx + A T z.

Preconditioner
We use the Jacobi preconditioner, for which solving My = r amounts to a simple diagonal matrix-vector product. The diagonal of the Jacobi preconditioner for (14) can be computed as diag(M) = diag(P ) + σ1 +ρ diag(A T A).
Note that we need not compute the full product A T A, but only its diagonal elements, where A i denotes the i-th column of A.

Parameter update
Once diag(P ) and diag(A T A) are available, computing M becomes extremely easy. This makes the parameter update computationally cheap since we only need to update the preconditioner M. This allows us to updateρ more often than is done in OSQP. Our numerical tests perform well whenρ is updated every 10 iterations.

Termination criteria and warm starting
The solution to (14) need not be carried out exactly for Algorithm 1 to converge [BPC + 11, §3.4.4]. This fact can be used to motivate an early termination of the PCG method, meaning that we solve (14) only approximately at first, and then more accurately as the iterations progress. This can be achieved by performing a relatively small number of PCG iterations to obtain an approximate solution, and using warm-starting by initializing x 0 in Algorithm 3 to the solutionx k computed in the previous ADMM iteration.
Finding a good termination criterion for Algorithm 3 is essential for reducing the total runtime of ADMM. If the PCG method returns solutions with low accuracy, then ADMM may converge slower, or even diverge. On the other hand, if the PCG method solves the subproblems with unnecessarily high accuracy, this may increase the total runtime of ADMM. The SCS solver [OCPB16] sets ε in Algorithm 3 as a decreasing function of the ADMM iteration counter k. We adopt a different strategy in which ε is determined based on the ADMM residuals. In particular, we use wherer k prim := Er k prim andr k dual := cDr k dual are the scaled primal and dual residuals. Parameter λ ∈]0, 1[ ensures that ε is always lower than the geometric mean of the scaled primal and dual residuals. We set λ = 0.15 and ε min = 10 −7 . Since we use the ℓ ∞ norms for ADMM residuals, we use the same norm in step 3 of Algorithm 3. As ε depends on the ADMM residuals, which are computed when evaluating ADMM termination criteria, we evaluate these criteria after every 5 ADMM iterations.

Matrix Equilibration
As mentioned in Section 6.1, the main bottlenecks of the matrix equilibration procedure are computing the column-norms in step 5 and matrix scaling in step 7. The matrix scaling requires pre-and post-multiplying P and A by diagonal matrices, which is equivalent to scaling rows and columns of P and A.

Computing column norms
The CSR representation of a sparse matrix allows for efficient computation of its row norms since the RowPointer array defines segments of the Value array corresponding to different rows. Since we also store the matrix transpose, we can efficiently compute column norms of a matrix since they are equivalent to the row norms of its transpose.
A naive approach would be to have one thread per row computing its norm, but this approach is not the most efficient. First, the workload may be distributed poorly among the threads since one row can have zero elements, and another can have many. Second, the memory is accessed almost randomly as each thread iterates through its row, which can considerably deteriorate performance.
A more efficient way of computing the row norms of a matrix in the CSR format is to represent the operation as a segmented reduction, where the segments are defined by the RowPointer array and, in the case of the ℓ ∞ norm, the associated binary operator is given by x 1 ⊕ x 2 = max(|x 1 |, |x 2 |).

Matrix post-multiplication
Matrix post-multiplication refers to evaluating the product MD, where D ∈ R n×n is a diagonal matrix stored as an n-dimensional vector, and M ∈ R m×n is a general sparse matrix in the CSR format. The ColumnIndex array can be used to determine the diagonal element of D that multiplies each element of A, This operation can be performed by many threads concurrently and independently. As the memory read and write access to the array Value is fully coalesced, all memory addresses can be combined into a larger transaction. However, the read access from D can be partly coalesced, but this does not impact the performance too much. 6.5 cuOSQP Table 1 summarizes the main differences between OSQP and our GPU implementation of Algorithm 1. Although our implementation requires two times more memory to store the problem matrices, it does not need to store any matrix factorizations. Moreover, we can reduce the memory requirements by using the single-precision floating-point representation, which also leads to faster computations (see Section 7.4).
We refer to our CUDA C implementation of the OSQP solver as cuOSQP. The code is available online at https://github.com/oxfordcontrol/osqp/tree/cuda-1.0, and its Python interface at https://github.com/oxfordcontrol/cuosqp cuOSQP uses cuBLAS, cuSPARSE, and Thrust libraries, which are included within the CUDA Toolkit. Our code has been tested on both Linux and Windows machines.

Numerical Results
We evaluate performance of cuOSQP and compare it against OSQP (version 0.6.0), which has already been benchmarked against other QP solvers in [SBG + 19]. Our main goal is to demonstrate how a parallel GPU implementation can improve performance of an optimization solver for large-scale problems. The sizes of benchmark problems range from 10 4 to 10 8 nonzero elements in P and A. We use the default parameters for both solvers. By default, we use the single-precision floating-point representation with cuOSQP, but we also compare the single-and double-precision variants in Section 7.4.
All numerical tests were performed on a Linux-based system with an i9-9900K @ 3.6GHz (8 cores) processor and 64 GB of DDR4 3200Mhz RAM, which is equipped with the NVIDIA GeForce RTX 2080 Ti GPU with 11 GB of VRAM.

OSQP Benchmark Problems
We use the set of benchmark problems described in [SBG + 19, Appendix A], which consist of QPs from 7 problem classes, ranging from standard random problems to applications in control, finance, statistics, and machine learning. The problems are available online at [SB19] and are summarized in the sequel.
• Control. The problem of controlling a linear time-invariant dynamical system can be formulated as the following constrained finite-time optimal control problem: • Equality. This class consists of the following equality-constrained QPs: • Huber. Huber fitting or the robust least-squares problem performs linear regression under the assumption that there are outliers in the data. The problem can be written as where the Huber penalty function φ hub : R → R is defined as • Lasso. The least absolute shrinkage and selection operator (lasso) is a well-known technique aiming to obtain a sparse solution to a linear regression problem by adding an ℓ 1 regularization term in the objective. The problem can be formulated as minimize Ax − b 2 2 + λ x 1 .
• Portfolio. Portfolio optimization is a problem arising in finance that seeks to allocate assets in a way that maximizes the risk-adjusted return. The problem has the following form: where x ∈ R n represents the portfolio, µ ∈ R n the vector of expected returns, γ > 0 the risk aversion parameter, and Σ ∈ S n + the risk covariance matrix. • Random. This class consists of the following QP with randomly generated data: • SVM. Support vector machine (SVM) problem seeks an affine function that approximately classifies two sets of points. The problem can be stated as where b i ∈ {−1, +1} is the set label and a i the vector of features for the i-th point.
All instances were obtained from realistic non-trivial random data. For each problem class we generate 10 different instances for 15 dimensions giving a total of 1050 problems. As a performance metric, we use the average runtime across 10 different problem instances of the same size. Figures 2-3 show the computation runtimes achieved by OSQP and cuOSQP. The figures show that OSQP is faster than cuOSQP for problem sizes of the order up to 10 5 . However, for larger problem instances cuOSQP is significantly faster. Furthermore, the slope of the runtimes achieved by OSQP is approximately constant, whereas for cuOSQP it is flatter for smaller problems and increases for larger. This behavior is expected since smaller problems cannot fully utilize the GPU and the data transfer latency is predominant. Figure 4 shows the number of ADMM iterations needed to satisfy the termination condition (6). Updatingρ every 10 iterations helps decrease the total number of ADMM iterations for the problem classes Equality, Lasso, and SVM, which explains the obtained speedup shown in Figures 2-3. For the Control, Portfolio, and Random classes the benefit is not apparent, while for the Huber class updatingρ less frequently seems to work better; in fact, our numerical tests indicate that the smallest number of iterations is achieved whenρ is kept constant.

QDLDL
When compared to OSQP's default linear system solver QDLDL [GSB], the maximum speedups achieved by cuOSQP range from 15 to 270 times. The largest reduction in runtime is achieved for the Equality class, where OSQP takes 6.5 min to solve the largest problem instance, while cuOSQP solves it in 1.4 s. The second largest reduction is achieved for the SVM class with a reduction from 5.6 min to 3.2 s.
For some problems, one can observe that OSQP runtimes do not necessarily increase with the problem size. This behavior comes from computing the permutation of the KKT matrix prior to its factorization, which is performed by the AMD routine [ADD04], whose runtimes do not depend only on the number of nonzero elements in the KKT matrix.

MKL Pardiso
Apart from its single-threaded QDLDL linear system solver, OSQP can be interfaced with Intel MKL Pardiso [Int], a multi-threaded parallel direct sparse solver. Figures 2-3 show that the computation runtimes are monotonically increasing with the problem size when using MKL Pardiso. Also, for smaller problem sizes OSQP is faster when using QDLDL, but for larger problems using MKL Pardiso reduces its runtimes significantly. However, the maximum ratio of runtimes achieved by OSQP and cuOSQP is still between 3.7 and 57 times, depending on the problem class. Figure 5 shows the average computation times when running cuOSQP on the Portfolio benchmark class for both single-and double-precision floating-point representations. The penalty in computation times when using double-over single-precision is less than 2 times over all problem sizes. Moreover, our numerical results suggest that for other problem classes this penalty is even smaller (data not shown). This is counter-intuitive at first since the GPU used in our tests has 32 times higher single-precision floatingpoint performance than in double-precision. However, most numerical methods that we use, especially SpMV, are memory-bound operations, which means that the computation times are limited by the memory bandwidth.

Conclusions
We have explored the possibilities offered by the massive parallelism of GPUs to accelerate solutions to large-scale QPs and have managed to solve problems with hundreds of millions nonzero entries in the problem matrices in only a few seconds. Our implementation cuOSQP is built on top of OSQP, a state-of-the-art QP solver based on ADMM. The large speedup is achieved by using the PCG method for solving the linear system arising in ADMM and by parallelizing all vector and matrix operations. Our numerical tests confirm that GPUs are not suited for solving small problems for which the CPU implementation is generally much faster. Our open-source implementation is written in CUDA C, and has been tested on both Linux and Windows machines.