On the Geometry and Refined Rate of Primal-Dual Hybrid Gradient for Linear Programming

We study the convergence behaviors of primal-dual hybrid gradient (PDHG) for solving linear programming (LP). PDHG is the base algorithm of a new general-purpose first-order method LP solver, PDLP, which aims to scale up LP by taking advantage of modern computing architectures. Despite its numerical success, the theoretical understanding of PDHG for LP is still very limited; the previous complexity result relies on the global Hoffman constant of the KKT system, which is known to be very loose and uninformative. In this work, we aim to develop a fundamental understanding of the convergence behaviors of PDHG for LP and to develop a refined complexity rate that does not rely on the global Hoffman constant. We show that there are two major stages of PDHG for LP: in Stage I, PDHG identifies active variables and the length of the first stage is driven by a certain quantity which measures how close the non-degeneracy part of the LP instance is to degeneracy; in Stage II, PDHG effectively solves a homogeneous linear inequality system, and the complexity of the second stage is driven by a well-behaved local sharpness constant of the system. This finding is closely related to the concept of partial smoothness in non-smooth optimization, and it is the first complexity result of finite time identification without the non-degeneracy assumption. An interesting implication of our results is that degeneracy itself does not slow down the convergence of PDHG for LP, but near-degeneracy does.


Introduction
Linear programming (LP) is one of the most fundamental and important classes of optimization problems in operation research and computer science with a vast range of applications, such as network flow, revenue management, transportation, scheduling, packing and covering, and many others [10,17,3,9,14,29,47,53].Since the 1940s, LP has been extensively studied in both academia and industry.The state-of-the-art methods to solve LP problems, simplex methods [18] and interior-point methods (IPMs) [34], are quite reliable to provide solutions with high accuracy and they serve as the base algorithms for nowadays commercial LP solvers.The success of both methods depends heavily on the efficient factorization methods to solve linear systems arising in the updates, which makes both algorithms highly challenging to further scale up.There are two fundamental reasons for this: (1) the storage of the factorization is quite memory-demanding and it usually requires significantly more memory than storing the original LP instance; (2) it is highly challenging to take advantage of modern computing architectures such as distributed computing and high-performance computing on GPUs when solving linear systems because factorization is sequential in nature.
Recent applications surge the interest in developing new algorithms for LPs with scale far beyond the capability of simplex methods and IPMs.Matrix-free, i.e., no need to solve linear system, is a central feature of promising candidate algorithms, which guarantees low per-iteration computational cost and being friendly to distributed computation.In this sense, first-order methods (FOMs) have become an attractive solution.The update of FOMs requires only the gradient information and is known for its capability of parallelization, thanks to the recent development in deep learning.In the context of LP, the computational bottleneck of FOMs is merely matrix-vector multiplication which is fairly cheap and exhibits easily distributed implementation, in contrast to solving linear systems for simplex methods and IPMs that is costly and highly nontrivial to parallelize.
One notable instance that exemplifies the effectiveness of such methodology is the recent development of an open-source LP solver PDLP1 [4].A distributed version of PDLP has been used to solve real-world LP instances with as many as 92 billion non-zeros in the constraint matrix, which is a hundred to a thousand times larger than the scale state-of-the-art commercial LP solvers can solve [54].The base algorithm of PDLP is primal-dual hybrid gradient (PDHG) [12], a form of operator splitting method with alternating updates between the primal and dual variables.The implementation of PDLP also involves a few other enhancements/heuristics on top of PDHG, such as restart, preconditioning, adaptive step-size, etc, to further improve the numerical performance [4].A significant difference between PDLP and other commercial LP solvers is that PDLP is totally matrix-free, namely, there is no necessity for PDLP to solve any linear system.The main computational bottleneck is matrix-vector multiplication which is much cheaper for solving large instances and more tailored for large-scale application in the distributed setting.
Figure 1: PDHG on four LP relaxation of instances from MIPLIB Despite the numerical success, there is a huge gap between the theoretical understanding and the practice of PDHG for LP.Recent works [50,6] present the complexity theory of (restarted) PDHG for LP, which shows that the iterates of PDHG linearly converge to an optimal solution, but the linear convergence rate depends on the global Hoffman constant of the KKT system, which is known to be exponentially loose.On the other hand, it is evident that the numerical performance of the algorithm does not depend on the overly-conservative Hoffman constant; instead, the algorithm often exhibits a two-stage behavior: an initial sublinear convergence followed by a linear convergence.For instances, Figure 1 shows the representative behaviors of PDHG on four instances from MIPLIB 2017.The convergence patterns of the algorithm differ dramatically across these LP instances.The instance mad converges linearly to optimality within only few thousands of steps.Instances neos16 and gsvm2r13 eventually reach the linear convergence stage but beforehand there exists a relatively flat slow stage.The instance sc105 does not exhibit linear rate within twenty thousand iterations.Furthermore, the eventual linear rates are not equivalent: gsvm2r13 exhibits much faster linear rate than neos16.Similar observation holds for the "warmup" sublinear stage: neos16 has much shorter sublinear period than gsvm2r13.The global linear convergence with the conservative Hoffman constant [50,6] is clearly not enough to interpret the diverse empirical behaviors.The goal of this paper is to bridge such gap by answering the following two questions: • How to understand the two-stage behaviors of PDHG for LP, and what are the geometric quantities that drive the length of the first stage and the convergence rate of the second stage?
• Can we obtain complexity theory of PDHG for LP without using the overly-conservative global Hoffman constant?
1: for k = 0, 1, ... do 2: Algorithm 1 presents the primal-dual hybrid gradient method (PDHG, a.k.aChambolle and Pock algorithm [12]) for LP (2).Without loss of generality, we assume the primal and the dual step-sizes are the same throughout the paper.This can be achieved by rescaling the problem instance [6].The previous works [13,26,50] show the following complexity of PDHG for LP to achieve ϵ-accuracy solution: where R is essentially an upper bound on the norm of the iterates and H is the Hoffman constant of the KKT system of LP: The major issue of the above complexity is the reliance on the global Hoffman constant H, which is notoriously conservative.Consider a general linear inequality system F x ≤ g.A commonly-used characterization to its Hoffman constant [60] is The inner part of the Hoffman constant is an extension of the minimal non-zero singular value of the submatrix F J , which is expected to appear in the linear convergence rate of first-order methods [55].However, the outer maximization optimizes over exponentially many subsets of linear constraints, which makes the Hoffman constant overly conservative and cannot characterize the behaviors of the algorithms.Intuitively this is because when calculating the global Hoffman constant, one needs to look at the local geometry on every boundary set (i.e., extreme points, edges, faces, etc) of the feasible region, and consider the worst-case situation.See Appendix B for an example where the Hoffman constant can be arbitrarily loose.
In this paper, we show that the performance of PDHG for LP does not rely on the overlyconservative global Hoffman constant; instead, the algorithm exhibits a two-stage behavior: • In the first stage (see Section 3 for more details), the algorithm aims to identify the nondegenerate variables.Notice that the algorithm may not converge to an optimal solution that satisfies strict complementary slackness, which we call degeneracy in this paper, and this definition is consistent with the partial smoothness literature.We further call the primal variables that satisfy strict complementary slackness the non-degenerate variables.This stage finishes within a finite number of iterations, and the convergence rate in this stage is sublinear.
The driving force of the first stage is how the non-degenerate part of the LP is close to degeneracy.
• In the second stage (see Section 4 for more details), the algorithm effectively solves a homogeneous linear inequality system.The algorithm converges linearly to an optimal solution, and the driving force of the linear convergence rate is a local sharpness constant for the homogeneous linear inequality system.This local sharpness constant is much better than the global Hoffman constant, and it is a generalization of the minimal non-zero singular value of a certain matrix.Intuitively, this happens due to the structure of the homogeneous linear inequality system so that one just needs to focus on the local geometry around the origin (See [61] for a detailed characterization), avoiding going through the exponentially-many boundary set as in the calculation of global Hoffman constant.
To gain more intuitions of the two-stage convergence behavior, we consider a simple yet representative class of two-dimension house-shaped dual LPs parameterized by 0 ≤ δ ≤ κ < 1 (see Figure Parameter δ essentially measures how close the dual LP is to degeneracy.When δ = 0 (i.e., the top three constraints in Figure 2a intersect), the problem is degenerate; when δ is close to 0, the problem is near-degenerate; when δ is reasonably large, the problem is far away from degeneracy.Parameter κ roughly measures the condition number of the constraint matrix, i.e., the ratio between the minimal and the maximal singular values of the matrix.Figure 2b presents the numerical performance of PDHG on this two-dimension instance with different choices of δ and κ.Indeed, with a proper choice of the parameters δ and κ, the convergence behavior of Figure 2b mimics well the real MIPLIB instances shown in Figure 1.Furthermore, one can see that the smaller the value of κ is, the slower the eventual linear convergence is; the smaller the value of δ is, the longer the initial slow convergence stage.Yet, if δ = 0, i.e., the problem is degenerate, the problem does not have the slow first stage.The above numerical observations are not limited to this two-dimension example, and they are formalized for general LP in Section 3 and Section 4.
Indeed, this two-stage behavior has been documented in the literature of interior-point methods and non-smooth optimization.For example, [71,28,72] discuss the two-stage behaviors of interior-point method (IPM) for LP: IPM has linear convergence in the first stage and super-linear convergence in the second stage.The phase transition happens when certain active basis are identified, and the length of the first phase depends on a certain metric that measures how close the analytical center of the optimal solution set (i.e., the converging optimal solution) is to degeneracy.In the context of non-smooth optimization, there are also fruitful results studying identification of firstorder methods under partial smoothness [70,38,43,44].However, almost all of these works require the non-degeneracy condition, which, in the context of LP, refers to the algorithm converges to an optimal solution that satisfies strict complementary slackness.Unfortunately, this condition barely holds for real-world LP instances with PDHG as well as many other problems, and thus it is also called "irrepresentable condition" in the literature [24,25].Note that the identification results of IPMs also have an implicit reliance on the non-degeneracy condition, since the converging optimal solution of IPMs (i.e., analytical center of optimal solution face) always satisfies strict complementary slackness [71,72,28].In contrast to these literatures, our work presents the first complexity result on finite time identification without the non-degeneracy assumption, in the context of LP.Indeed, our results suggest that degeneracy itself does not slow down the convergence for first-order methods, but near-degeneracy does.

Contributions
The goal of the paper is to provide a theoretical foundation for PDLP, a new first-order method LP solver based on PDHG.The contributions of the paper can be summarized as follows: • Motivated by the empirical behaviors of PDHG for LP, we propose a two-stage characterization of its convergence behaviors: -In the first stage, PDHG attempts to identify the active variables and the near-todegeneracy parameter drives the identification complexity.
-In the second stage, PDHG attempts to solve a homogeneous linear inequality system, and a local sharpness constant controls the linear convergence rate.
• We provide the first linear convergence complexity of PDHG for LP without the dependency on the global Hoffman constant.
• Of independent interest, we present the first complexity bound on finite-time identification in the partial-smoothness context without the "irrepresentative" non-degeneracy condition.
• PDLP [4] utilizes primal-dual hybrid gradient method (PDHG) as its base algorithm and introduces practical algorithmic enhancements, such as presolving, preconditioning, adaptive restart, adaptive choice of step size, and primal weight, on top of PDHG.Right now, it has three implementations: a prototype implemented in Julia (FirstOrderLp.jl) for research purposes, a production-level C++ implementation that is included in Google OR-Tools, and an internal distributed version at Google.The internal distributed version of PDLP has been used to solve real-world problems with as many as 92 billion non-zeros [54], which is one of the largest LP instances that are solved by a general-purpose LP solver.
• ABIP [46,21] is an ADMM-based interior-point method.The core algorithm of ABIP is still a homogeneous self-dual embedded interior-point method.Instead of approximately minimizing the log-barrier penalty function with a Newton step, ABIP utilizes multiple steps of alternating direction method of multipliers (ADMM).The O 1 ϵ log 1 ϵ sublinear complexity of ABIP was presented in [46].Recently, [21] includes new enhancements, i.e., preconditioning, restart, hybrid parameter tuning, on top of ABIP (the enhanced version is called ABIP+).ABIP+ is numerically comparable to the Julia implementation of PDLP.ABIP+ now also supports a more general conic setting when the proximal problem associated with the log-barrier in ABIP can be efficiently computed.
• ECLIPSE [7] is a distributed LP solver designed specifically for addressing large-scale LPs encountered in web applications.These LPs have a certain decomposition structure, and the effective constraints are usually much less than the number of variables.ECLIPSE looks at a certain dual formulation of the problem, then utilizes accelerated gradient descent to solve the smoothed dual problem with Nesterov's smoothing.This approach is shown to have O( 1 ϵ ) complexity, and it is used to solve web applications with 10 12 decision variables [7] and real-world web applications at LinkedIn platform [64,1].
• SCS [58,57] tackles the homogeneous self-dual embedding of general conic programming using ADMM.As a special case of conic programming, SCS can also be used to solve LP.Each iteration of SCS involves projecting onto the cone and solving a system of linear equations with similar forms so that it only needs to store one factorization in memory.Furthermore, SCS supports solving linear equations with an iterative method, which only uses matrix-vector multiplications.
Complexity theory of FOMs for LP.One major drawback of applying first-order methods to solving LP is their slow tail convergence.Due to the lack of strong convexity, one can only obtain sublinear convergence rate when directly applying the classic results of FOMs on LP [55,8].This suggests that FOMs cannot produce the high-accuracy solutions that are typically expected from LP solvers within a reasonable time limit.Luckily, it turns out that LP satisfies additional growth condition structures that can improve the convergence of FOMs from sublinear to linear.For example, [22] showed that a variant of ADMM can have linear convergence for LP.It is shown in [43,44,45] that under non-degeneracy condition, many primal-dual algorithms have eventual linear convergence when solving LP, even though it may take a long time before reaching the linear local regime.Recently, [6] introduced a sharpness condition that is satisfied by LP, and proposed a restarted scheme for a variety of primal-dual methods to solve LP.It turns out that the restarted variants of many FOMs, such as PDHG, extra-gradient method (EGM) and ADMM, can achieve global linear convergence, and such linear convergence is the optimal rate, namely, there is a worstcase LP instance such that no FOMs can achieve better than such linear rate.Later on, [50] shows that PDHG (without restart) also has linear convergence for LP, and the linear convergence rate is slower than that in [6].However, the obtained linear convergence rates in [6,50] depend on the global Hoffman constant of the KKT system of LP, which is known to be very loose and cannot characterize the behaviors of the algorithm.In this paper, we aim to develop a tighter complexity theory of PDHG for LP without using global Hoffman constant.
Primal-dual hybrid gradient method (PDHG).PDHG was initially proposed for applications in image processing and computer vision [12,15,23,32,74].The first convergence guarantee of PDHG was obtained in [12], which shows the O(1/k) sublinear ergodic convergence rate of PDHG for convex-concave primal-dual problems.Later on, simplified analyses for the O(1/k) sublinear ergodic rate of PDHG were presented in [13,51].More recently, it is shown that the last iterates of PDHG exhibit linear convergence under a mild regularity condition that is satisfied by many applications, including LP [26,50].Moreover, many variants of PDHG have been proposed, including adaptive version [27,52,62,68] and stochastic version [2,11].It is also shown that PDHG is equivalent to Douglas-Rachford Splitting up to a linear transformation [48,59].
Identification and partial smoothness.Active-set identification is a classic topic in constrained optimization and in non-smooth optimization, with both theoretical and computational importance.Partial smoothness, a notion meaning smoothness along some directions and sharpness in other directions, plays a crucial role in the analysis of identification.Under the partial smoothness assumption, a refined analysis is proposed for analyzing the identification property of certain algorithms.More specifically, manifold identification of dual averaging is shown in [35].It is shown in [42,43] that the finite time identification of forward-backward-type methods.In [44,45], it is shown the active set identification and local convergence rate of many primal-dual algorithms including PDHG and ADMM.In [63], it shows that stochastic methods also share a similar identification property.Similar results are derived for Newton's method in [39].The eventual sharp linear rate of Douglas-Rachford Splitting on basis pursuit is derived under non-degeneracy condition in [20].
In [19], the non-smooth dynamic of sub-gradient methods near active manifolds is studied.
However, all the above works assume the non-degeneracy condition of the problem, namely, 0 ∈ ri(F(x * )), where ri(•) represents the relative interior of a set and F(x * ) denotes the sub-differential set at the converging optimal solution x * .The only exceptions that do not require the nondegeneracy condition are [25,24], where they showed that the active set of the iterates can potentially be larger than the active set of optimal solutions because of degeneracy.Unfortunately, it is unclear how to check the non-degeneracy condition in general, thus it is also called "irrepresentable condition" [25,24].In the case of LP, this condition means that the converging optimal solution satisfies strict complimentary slackness, which unfortunately is almost never the case when using FOMs to solve real-world LP instances.To the best of our knowledge, this work is the first complexity result of finite-time identification without the non-degeneracy condition.
On a related note, the identification behaviors of interior-point methods (IPMs) for LP have been observed [71,72,28], where it was shown that IPM has super-linear convergence after identification.Indeed, these results implicitly utilize the fact that IPM always converges to the analytical center of the optimal face, which satisfies strict complementary slackness [71,72,28].

Organization
We discuss the sharpness of homogeneous linear inequality system and collect necessary convergence results of PDHG in Section 2. The analysis of identification stage (Stage I) and local convergence (Stage II) are presented respectively in Section 3 and 4. Section 5 illustrates and verifies the theoretical results through numerical experiments.We conclude the paper and propose several future research directions in Section 6.

Notations
For matrix A ∈ R m×n , denote A j the j-th column of matrix A and Denote ∥A∥ 2 the operator norm of a matrix A. Denote ∥x∥ 2 the Euclidean norm for a vector x and ⟨x, y⟩ for its associated inner product.For a positive definite matrix M ≻ 0, let ⟨x, y⟩ M = ⟨x, M y⟩ and ∥x∥ M = √ x T M x the norm induced by the inner product.Let dist M (z, Z) represent the distance between point z and set Z under the norm In particular, dist(z, Z) = min u∈Z ∥u − z∥ 2 .Denote P Z (z) := arg min u∈Z {∥u − z∥ 2 } the projection onto set Z under l 2 norm.Denote the optimal solution set to (2) as Z * .f (x) = O(g(x)) means that for sufficiently large x, there exists constant C such that f (x) ≤ Cg(x).Let ι C (•) be the indicator function of set C. Let z = (x, y), and denote and 1 n represents all 1's vector of length n.For set

Preliminaries
We here present preliminary results on sharpness of primal-dual problems, convergence properties of PDHG and sharpness of homogeneous linear inequality system that we will use later.

Sharpness of primal-dual problems
Consider a generic convex-concave primal-dual problem where L(x, y) is a convex function in x and a concave function in y.We use z = (x, y) to represent the primal and the dual solution together.
Normalized duality gap was introduced in [6] as a progress metric for primal-dual problems: Definition 1.For a primal-dual problem (6) and a solution z = (x, y), the normalized duality gap with radius r is defined as where W r (z) = {ẑ ∈ Z | ∥z − ẑ∥ 2 ≤ r} is a ball centered at z with radius r intersected with Z = X × Y.
The normalized duality gap ρ r (z) is a valid progress metric for LP, since it provides an upper bound on the KKT residual: Proposition 1 ([6, Lemma 4]).It holds for any z = (x, y) with x ≥ 0 and ∥z∥ 2 ≤ R that for any r ∈ (0, R], The sharpness of a primal-dual problem is defined in [6]: Definition 1]).We say a primal-dual problem is α-sharp on the set S if ρ r (z) is α-sharp on S for all r, i.e., it holds for all z ∈ S that The following proposition shows that the primal-dual form of LP is sharp.

Convergence of PDHG iterates
In this subsection, we present some basic convergence properties of PDHG on LP in literature.
Many of these convergence results hold for generic primal-dual problems, and we here state them in the context of LP for convenience.
The next lemma presents the non-expansiveness and convergence of PDHG on LP: ).Consider the iterations {z k } ∞ k=0 of PDHG on LP (Algorithm 1).Let Z * be the optimal solution set to (2).Then it holds for any k that The next theorem presents the sublinear and linear convergence rate of PDHG on LP.These results can be obtained utilizing a similar proof technique as in [50].We present the formal proof in Appendix A for completeness.
We comment that we assume the step-size s ≤ 1 2∥A∥ 2 in Theorem 1 and throughout the paper later on.While many of our results hold for s < 1 ∥A∥ 2 , assuming s ≤ 1 2∥A∥ 2 makes it easy to convert between P s norm and ℓ 2 norm, as stated in Lemma 2: It holds for any z ∈ R m+n that:

Sharpness of linear inequality system
The existing linear convergence results of PDHG for LP [50,6] heavily depend on the Hoffman constant of the KKT system.However, it is well-known that the global Hoffman constant is generally very loose and cannot be easily characterized.Recently, [61] presents a simple and tight bound on the Hoffman constant (i.e., inverse of sharpness constant) for homogeneous linear inequality system.We here present these characterizations and discuss their fundamental differences.
First, recall the definition of the sharpness constant of a linear inequality system: Definition 3. Consider a linear inequality system F x = g, F x ≤ g and its solution set We call α > 0 a sharpness constant to the system if it holds for any The results of Hoffman [33,60] show the existence of a sharpness constant α for any linear inequality system.The value of sharpness constant α is the inverse of the Hoffman constant to the linear inequality system.Indeed, it is highly difficult to provide simple characterization to sharpness constant α.One characterization to α for system F x = g, F x ≤ g is described in [60]: where Informally speaking, the inner minimization in (11) computes an extension of minimal positive singular value for a certain matrix, which is specified by an "active set" J of constraints.The outer minimization takes the minimum of these extended minimal positive singular values over all possible "active sets" (intuitively, it goes through every extreme point, edge, face, etc., of X * ).While the inner minimization is expected when characterizing the behaviors of an algorithm, similar to the strong convexity in a minimization problem, there are usually exponentially many "active sets" for a polytope X * , thus α defined in ( 11) is known to be a loose bound and it is generally NP-hard to compute its exact value or even just a reasonable bound.
On the other hand, it turns out a simple characterization exists for the sharpness of a homogeneous linear inequality system as shown recently in [61].It does not require going through the exponential many "active sets" as in (11) and dramatically simplifies the characterization of the sharpness constant.
Proposition 3 (Proposition 1-3, Theorem 1 in [61]).Consider a linear inequality system F x = 0, F x ≤ 0 where F ∈ R m×n and F ∈ R m×n , and its solution set Denote α the sharpness constant to the system, i.e., it holds for any

Then:
(a) There exists a unique partition P ∪ Q = {1, ..., m} such that for F = FP FQ , (b) Moreover, define where and Remark 1. Similar to the interpretation of the inner minimum in (11), α 0 (K) and α 0 (L) can be interpreted as an extension of the minimal non-zero singular value of the matrices.They are indeed highly related to and can be lower-bounded by the distance-to-infeasibility defined in Renegar's condition number [65].α(L, K) essentially measures the angle between K and L and it is strictly positive by the construction of L and K [61].
Compared to the characterization of sharpness constant of a non-homogeneous linear system (11), which requires computing the minimum over potentially exponentially many values, the bound in (12) for homogeneous linear system essentially just looks at the corresponding values of two sub-systems and combines them using the angle measurement α(L, K).As a result, the sharpness constant of a homogeneous system is much simpler and tighter than the non-homogeneous counterpart, and can be computed in polynomial time [61].

Stage I: finite time identification
In this section, we characterize the initial slow convergence stage and the phase transition between the two stages as described in Section 1.It turns out that PDHG can identify the "active basis" of the converging optimal solution in the first stage.We show that this stage always finishes (i.e., the phase transition happens) within a finite number of iterations, and the length of the first phase is driven by a quantity that characterizes how close the non-degeneracy part of the optimal solution is to degeneracy.
Such phase transition behavior is closely related to the concept of partial smoothness that was proposed in previous literature to understand non-smooth optimization and primal-dual algorithms [70,37,56,30,31,40,16,36,45].In contrast to the previous literature, we do not assume the non-degeneracy condition of the problem.In the context of LP, the non-degeneracy condition in partial smoothness refers to that the iterates converge to an optimal solution which satisfies strict complimentary slackness.Such condition is also called "irrepresentable condition" in the optimization literature [25,24], because it is highly rare for first-order methods to converge to a strictly complementary optimal solution in practice.
Indeed, the major difficulty of our analysis is to handle the degeneracy of the problem.To the best of our knowledge, this work is the first complexity result on finite-time identification without the non-degeneracy assumption.
To describe the identification property of the algorithm, we first introduce some new notations.
Definition 4. For a linear programming (2) and an optimal primal-dual solution z * = (x * , y * ), we define the index partition (N, B 1 , B 2 ) of the primal variable coordinates as Since (x * , y * ) is an optimal solution pair to the LP, it follows from complementary slackness that x * N = 0.However, (x * , y * ) may not satisfy strict complementary slackness, i.e., B 2 can be a non-empty set.The lack of strict complementary slackness is called degeneracy herein.
In analogy to the notions in simplex method, we call the elements in N the non-basic variables, the elements in B 1 the non-degenerate basic variables, and the elements in B 2 the degenerate basic variables.Furthermore, we call the elements in N ∪ B 1 the non-degenerate variables and the elements in B 2 the degenerate variables.We also denote B = B 1 ∪ B 2 as the set of basic variables.
Next, we introduce the non-degeneracy metric δ, which essentially measures how close the nondegenerate part of the optimal solution is to degeneracy: Definition 5.For a linear programming (2) and an optimal primal-dual solution pair z * = (x * , y * ), we define the non-degeneracy metric δ as: The non-degeneracy metric δ looks at the non-degenerate variables (i.e.N ∪ B 1 ) and measures how close they are to degeneracy.For non-degenerate non-basic variables in N , it computes the minimal scaled reduced cost min i∈N c i −A T i y * ∥A∥ 2 ; for non-degenerate basic variables in B 1 , it computes the minimal primal variable slackness min i∈B 1 x * i ; and then δ takes the minimum of the two.For every optimal solution z * to a linear programming, the non-degeneracy metric δ is always strictly positive.The smaller the δ is, the closer the non-degenerate part of the solution is to degeneracy.In the rest of this section, we show that the value of δ drives the identification of the algorithm.
Next, we introduce the sharpness to a homogeneous conic system α L 1 that arises in the complexity of identification.We can view it as a local sharpness condition of a homogeneous linear inequality system around the converging optimal solution.Following the results stated in Section 2.3, α L is an extension of the minimal non-zero singular value of a certain matrix and does not depend on the overly-conservative global Hoffman constant.
More formally, we consider the following homogeneous linear inequality system and denote (15).Denote α L 1 the sharpness constant to (15), i.e., for any (u, v) ∈ R m+n , Theorem 2 presents the main result for this section.It essentially states that after a certain number of iterations, PDHG iterates are reasonably close to the converging optimal solution and the algorithm can identify the elements in set N and B 1 .Furthermore, we present the complexity theory for identification in terms of the non-degenerate metric δ and the local sharpness constant α L 1 .Notice that δ does not rely on the degenerate coordinates, i.e., those in B 2 , which shows that degeneracy does not slow down the identification, in contrast to previous literature on partial smoothness.We also comment that the set B 2 may not be identifiable by PDHG iterates due to the degeneracy of the problem.
Theorem 2 (Identification Complexity).Consider PDHG for solving (2) with step-size s ≤ 1 2∥A∥ 2 .Let {z k = (x k , y k )} ∞ k=0 be the iterates of the algorithm and let z * be the converging optimal solution, i.e., z k → z * .Denote (N, B 1 , B 2 ) the partition of the primal variable indices using Definition 4. Denote α L 1 the sharpness constant to the homogeneous linear inequality system (15).Denote R = 2 ∥z 0 − z * ∥ 2 + ∥z * ∥ 2 + 1.Then, it holds for any Remark 2. Theorem 2 shows that after iterations, the solution z k is δ/2-close to the convergent optimal solution z * .Furthermore, the set N and B 1 can be identified after K iterations.
The rest of this section presents the proof of Theorem 2. First, we consider the linear inequality system and denote The next lemma builds up the connections between Z * L and K.More specifically, it states that (a) Z * L is a shift of the cone K, (b) the sharpness constant of the two linear inequality systems ( 15) and ( 18) are the same, (c) one can lower bound dist(0, F(z)) using dist(z, Z * L ) and the sharpness constant of the systems.(b).α L 1 is the sharpness constant to system (18), that is, for any z = (x, y) ∈ R m+n , Proof.(a).For any z ∈ z * + K, there exists w = (u, v) ∈ K such that z = z * + w.By the definition of set N, B, B 1 , B 2 , we have On the other hand, for any z ∈ Z * L , we have z = z * + (z − z * ).Notice that Combining (19) and (20), we arrive at (c).From the definition of sharpness and Lemma 3, it holds for any z ∈ B R (0) with x ≥ 0 that where the third and fourth inequalities follow from Proposition 1 and [49, Proposition 1] respectively.
The next lemma shows when z is close enough to z * , the non-basic dual slacks in set N and the non-degenerate basic variables in set B 1 are bounded away from zero.
Lemma 4. For any z such that ∥z − z * ∥ 2 ≤ δ 2 , it holds for any i ∈ N , j ∈ B 1 that As a result, it holds for any j ∈ B 1 and i ∈ N that and thus by definition of δ, we have The next lemma connects Z * L and the optimal solution set Z * .Lemma 5.
Proof.Following the same proof of Lemma 4, we know that for any z ∈ B δ (z * ), Recall the definition of Z * L as the solution set to the system Note that combining ( 21) and (22) gives exactly the KKT system of original LP (2) and thus The following lemma ensures that after a certain number of iterations, the iterates {z k } stay in the ball centered at z * with radius δ 2 .Lemma 6.Under the same condition of Theorem 2, for any iteration k such that we have Proof.We first show that for any k ≥ K 1 , it holds that The proof heavily relies on the non-expansive nature of PDHG iterates.

Notice that
where the first and third inequalities use Lemma 2 while the second one follows from part (a) in Theorem 1. Furthermore, it holds that where the first inequality follows from part (c) in Lemma 3, the second and fourth inequalities utilize Lemma 2, and the third inequality is from part (b) of Theorem 1.Therefore, we have ∥z k+1 − z k ∥ 2 ≤ δ 16 and dist(z k , Z * L ) ≤ δ 8 whenever iteration number satisfies k ≥ K 1 .In the rest of the proof, we consider two cases, separating by whether ∥P Z * L (z k ) − z * ∥ 2 ≤ δ 2 or not.We will show that the first case implies ∥z k − z * ∥ 2 ≤ δ 2 , and it is impossible for the second case to happen, which then finishes the proof of the lemma.

Case I: ∥P
In this case, we have P Z * L (z k ) ∈ Z * L ∩ B δ/2 (z * ) ⊂ Z * from Lemma 5 and therefore it follows from part (c) in Lemma 1 that Thus, it holds that whereby Case II: We aim to show this case is impossible under conditions ∥z k+1 − z k ∥ 2 ≤ δ 16 and dist(z k , Z * L ) ≤ δ 8 .First, since the iterates converge to an optimal solution (i.e.z k → z * ), we know that and hence where the first one is due to the conclusion in Case I and the second one is the consequence of nonexpansiveness of projection.It then follows from Lemma 5 that If P Z * L (z N +1 ) = z * , then from (23) we have By triangle inequality, we have where the second inequality follows from and Consider the two-dimensional plane that z * , z and P Z * L (z N ) lay (see Figure 3).We know by construction of z that P Z * L (z N +1 ) lies on the segment between z * and z.Let and ∥P Z * L (z N ) − z∥ 2 = w δ 8 , then we have t ≥ 1 and w ≥ 1 by ( 23) and (25).Define θ as the angle Again by law of cosine and note that t ≥ 1 and (26), we have By two-dimensional geometry, we have δ .
Combining (a)-(c) and ( 24), we have This leads to the contradiction.To sum up, for any iteration k, when , Case II cannot happen) and thus ∥z k − z * ∥ 2 ≤ δ 2 following the argument in Case I, which completes the proof of Lemma 6.
By combining Lemma 4 and 6, we are ready to prove Theorem 2.
We just need to show that x k N = 0. Notice it holds from Lemma 4 that for any i ∈ N and k ≥ K 1 Consider the update formula of PDHG Thus, it follows from Lemma 2 that Then for any q > 2δ sδ∥A∥ 2 = 2 s∥A∥ 2 and i ∈ N , we have q−1 k=0 s(c , thus it holds that x K 1 +q N = 0 , that is, after at most K := K 1 + 2 s∥A∥ 2 steps, the coordinates in set N are fixed to zero.This finishes the proof of Theorem 2.

Stage II: local convergence
In the previous section, we show that after a certain number of iterations, the iterates of PDHG stay close to an optimal solution, and the iterates identify the non-degenerate coordinates set N and B 1 .In this section, we characterize the local behaviors of PDHG for LP afterward.More specifically, we show that PDHG iterates converge to an optimal solution linearly after identification, and the local linear convergence rate is characterized by a local sharpness constant α L 2 for a homogeneous linear inequality system.The local convergence can be much faster than the global linear convergence rate obtained in [6], which is characterized by the Hoffman constant of the KKT system of the LP.This provides a theoretical explanation of the eventual fast rate of PDHG for LP.
Throughout this section, we consider the iterates z k after identification, i.e., for k ≥ K.In this stage, it holds that There is a slight difference of the sharpness constants α L 1 and α L 2 , because they correspond to slightly different linear inequality systems, ( 15) and (30), respectively.The difference is that in the Stage II (i.e., α L 2 ), the non-basic set N is identified, i.e., x k N = 0. Thus in Stage II, the iterates of PDHG are driven by sharpness α L 2 of system (30) which can ignore the constraints in non-basic set N , while sharpness α L 1 of (15) does depend on the whole constraint matrix.Nevertheless, α L 1 and α L 2 are sharpness of homogeneous linear inequality systems ( 15) and ( 30) respectively and thus exhibit simpler and tighter characterizations, as stated in Section 2.3.We present a simple example in Appendix B that demonstrates α L 2 can be arbitrarily larger than the global sharpness constant α.
Theorem 3 presents the local linear convergence of PDHG after identification.
k=0 be the iterates of the algorithm and let z * be the converging optimal solution, i.e., z k → z * .Then it holds for any k > K that , where K is the maximal length of the first stage defined in (17).
Theorem 3 shows that after identification, PDHG can obtain an ϵ-close optimal solution within iterations.Combining Theorem 2 and Theorem 3, the complexity of PDHG finding an ϵ-close optimal solution is In particular, when setting step-size s = C ∥A∥ 2 with C ≤ 1 2 , the total complexity becomes The rest of this section presents a proof of Theorem 3. First, we show that α L 2 is a local sharpness constant of the non-homogeneous linear inequality system (29) for (x B , y) that is close enough to (x * B , y * ).
Finally, by the definition of α L 1 , we have where the first equality uses (32) and the last inequality utilizes (34).We finish the proof by combining (33) and (35).
Now we are ready to prove Theorem 3.
Proof of Theorem 3. Notice that k > K.It follows from Theorem 2 that ∥z k − z * ∥ 2 ≤ δ 2 and x k N = 0. Denote the local normalized duality gap as where Then we have for any r ∈ (0, R] that where the second inequality is due to definition of α L 2 and the last inequality utilizes Proposition 1 and which finishes the proof.

Numerical experiments
In this section, we present numerical experiments that verify our theoretical results in the previous sections.
Dataset.In the experiments, we utilize the root-node LP relaxation of instances from the MIPLIB 2017.We convert the instances to the following form: and consider its primal-dual formulation where We choose the instances that can be solved by PDHG (after preconditioning, see below for more details) to 10 −10 accuracy of KKT residual (see Progress metric for formal definition) within 3×10 5 iterations, and there are 50 such instances.Among all 50 instances, the number of variables ranges from 86 to 3648, the number of constraints ranges from 29 to 6972, and the number of nonzeros in the constraint matrices ranges from 376 to 22584.Notice that there is a difference between (36) and the standard form (1) we present in the paper due to the existence of the inequality constraints.Indeed, all of our theory can be generalized to this form without loss of generality.We choose to present the results for the standard form LP (1) because that leads to theoretical results in relatively clean forms.
Preprocessing.As a first-order method, PDHG suffers from slow convergence on many instances from MIPLIB due to the ill-conditioning nature of the instances.To overcome this issue, we leverage the diagonal preconditioning heuristic developed in PDLP [4].More specifically, we rescale the constraint matrix A to Ã = D 1 AD 2 with positive definite diagonal matrices D 1 and D 2 .Both vectors b and c are correspondingly rescaled as b = D 1 b and c = D 2 c.The matrices D 1 and D 2 are obtained by running 10 steps of Ruiz scaling [66] followed by an l 2 Pock-Chambolle scaling [62] on the constraint matrix A. More details about this preconditioning scheme can be found in [4].
Progress metric.We use KKT residual of (37), i.e., a combination of primal infeasibility, dual infeasibility and primal-dual gap, to measure the performance of current iterates.More formally, the KKT residual of ( 36) is given by Computing identification time and δ.In the experiments, we run PDHG to 10 −8 accuracy to obtain the converging optimal solution z * = (x * , y * ).With z * , we can then identify the non-basic solution set N , non-degenerate basic solution set B 1 , and the degenerate basic solution set B 2 with respect to z * for the primal and dual variables, respectively, by and Then we go backward from the last iteration to find the first moment when either N or B 1 changes.This iteration is referred to as the empirical identification moment.The non-degeneracy metric δ for (37) can be computed via Results. Figure 4 presents KKT residual versus PDHG iterations on six representative instances.The orange vertical line in Figure 4 represents the empirical identification moment, i.e., subsequently the active set will be fixed and identical to the limit optimal point.As illustrated in Figure 4, there is a phase transition happening around the identification moment (orange line): PDHG identifies active variables in Stage I with a sublinear rate; PDHG has a fast linear convergence in Stage II after identification.Theorem 2 shows a bound on the identification time with the non-degeneracy parameter δ. Figure 5 presents the scatter plot of the metric R/δ versus the empirical iteration number for identification, where R = 2∥z 0 − z * ∥ 2 + 2∥z * ∥ 2 + 1 and δ is defined as (38).Each point in Figure 5 represents an LP instance from our dataset.We observe from Figure 5 that there is a strong correlation between the empirical identification time and the metric R/δ.Indeed, when running a linear regression model, the R 2 is 0.712, which means that 71.2% of the variability in the observed identification complexity is captured by this single quantity.An interesting implication of our result is that degeneracy itself does not slow down the convergence, but the near-degeneracy of the non-degenerate part does.Figure 6 demonstrates this theoretical implication by preliminary numerical experiments.We choose three representative LP instances from the dataset.Then, we slightly perturb the instances by adding a small Gaussian noise in A, b, and c, then run PDHG on both the original and the perturbed instances.We use Gurobi to verify the feasibility of the three perturbed problems.Figure 6 plots the KKT residual and the number of PDHG iterations for the original LP and the perturbed LP.For each instance in Figure 6, the blue line represents the performance of PDHG on the original LP, and the purple line is on the perturbed LP.While the original LP and the perturbed LP are almost identical (because the perturbation is tiny), PDHG converges much faster on the original LP than the perturbed LP.Indeed, almost all practical LP, including the three instances in Figure 6, are degenerate.Adding the tiny perturbation makes the problem non-degenerate.This showcases that degeneracy itself does not slow down the convergence of PDHG, but near-degeneracy can make the instance much harder to solve, which verifies our theory.
In summary, the numerical experiment verifies our geometric understandings of PDHG on LPs: (1) the convergence behavior of PDHG for LP have two distinct stages, that is, Stage I for active variable identification and Stage II for local linear convergence; (2) the non-degeneracy metric δ plays a crucial role in the length of the first stage; (3) degeneracy itself does not deteriorate the convergence of first-order methods, but near-to-degeneracy does.

Conclusion and future directions
In conclusion, this paper aims to bridge the gap between the practical success and loose complexity bound of PDHG for LP, and to identify geometric quantities of LP that drive the performance of the algorithm.To achieve this, we show that the behaviors of PDHG for LP have two stages: in the first sublinear stage, the algorithm identifies the non-basic and non-degenerate basic set (i.e., the set N and B 1 ) and we present the complexity result of identification in terms of the neardegeneracy metric δ; in the second linear stage, the algorithm effectively solves a homogeneous linear inequality system, and the linear convergence rate is driven by the local sharpness parameter of the system.Compared to existing complexity results of PDHG on LP, our results do not depend on the global Hoffman's constant of the KKT system, which is known to be too loose in the literature.Such two-stage behavior is also tied with the concept of partial smoothness in the non-smooth optimization literature.We introduce a new framework of complexity analysis without assuming the "irrepresentable" non-degeneracy condition.
We end the paper by presenting a few interesting open questions for further investigation: • Extensions to non-smooth optimization.The partial smoothness literature in nonsmooth optimization always assumes the non-degeneracy condition.Unfortunately, the iterates of first-order methods usually converge to a degenerate solution, thus violating this condition.A typical argument for avoiding this issue in theory is by adding a small perturbation to the original problem.However, as shown in our numerical experiments (Figure 6), adding the small perturbation can significantly slow down the convergence of the algorithm.Instead, we show in this paper that degeneracy itself does not slow down the convergence in the context of LP.How to extend this result to general non-smooth optimization can be an interesting future direction.
• Extensions to other primal-dual algorithms.We believe the two-stage behaviors are not limited to PDHG.Indeed, most of the theoretical results can be extended to ADMM.This should provide new theoretical understandings for ADMM-based solvers, such as OSQP [67] and SCS [58].
• Extensions to the restarted algorithm.PDLP utilizes a new technique, restarting, to obtain the optimal linear convergence rate with global sharpness condition.We believe our theoretical results can be extended to the restarted algorithms, and we leave it as future work.
• Extensions to infeasible problems.In the paper, we consider the case when the LP instance is feasible and bounded.Such two-stage behaviors are also numerically observed in infeasible problems.How to theoretically analyze it without the non-degeneracy condition can be another interesting question.
Thus we finish the proof by triangle inequality It turns out that distance to optimality has linear convergence under sharpness (8).where the first inequality is exactly Lemma 8 and the second one utilizes (40).

B Examples to illustrate the looseness of global sharpness constant
In this section, we present a series of simple LP instances which exhibits very bad global sharpness behaviors (so as the Hoffman constant), whereas the three quantities we use in the analysis, δ, α L 1 and α L 2 , are of reasonable sizes.This indicates that the previous complexity result based on Hoffman constant argument [6] can be arbitrarily loose, and may not be able to explain the behaviors of the algorithm.
where the primal form (1) is given by The feasible region of ( 41) is plotted in Figure 7.It is easy to check that there are multiple dual optimal solutions given by y * = (y 1 , 1) with 0 ≤ y 1 ≤ 1 + 1 κ , and the optimal primal solution is unique given by x * = (1, 0, 0).Denote α the sharpness constant (i.e., the reciprocal of the Hoffman constant) of the KKT system where (A, b, c) is given in (42) and R is essentially the distance between initial point to origin.This is the term that appears in the complexity of previous results [6].Table 1 presents bounds on α, α L 1 , α L 2 and δ for the LP when κ is set to 10 −10 .The value of α can be obtained by noticing that α ≤ κ (see Lemma 10 for a proof).The bound of α L 1 and α L 2 are obtained from the algorithm presented in [61].The bound on δ can be computed via Lemma 11.As we can see, for this specific example, the value of α is tiny, while the value of α L 1 , α L 2 and δ are of reasonable sizes.

Figure 2 :
Figure 2: Plots to illustrate the geometry of the LP instance (5), and the numerical behaviors of PDHG for solving the LP instance with different parameter κ, δ. 2a (a) (non-expansiveness) ∥z k+1 − z * ∥ Ps ≤ ∥z k − z * ∥ Ps for any z * ∈ Z * , (b) (convergence) there exists z * ∈ Z * such that z * = lim k→∞ z k , and (c) (property of the optimal solution set) ∥z k − z * ∥ Ps ≥ ∥z * − z * ∥ Ps for any z * ∈ Z * , where z * is the converging optimal solution defined in (b).

Figure 3 :
Figure 3: Illustration of the proof of Case II in Lemma 6

Figure 4 :
Figure 4: Plots showing the KKT residual in log scale versus number of iterations of PDHG on six representative LP instances from MIPLIB.

Figure 5 :
Figure 5: Plot showing the iteration number at identification in log scale versus non-degeneracy metric R/δ in log scale on the 50 MIPLIB instances.

Figure 6 :
Figure 6: Comparison of original and perturbed LP instances.