New Bregman proximal type algorithms for solving DC optimization problems

Difference of Convex (DC) optimization problems have objective functions that are differences between two convex functions. Representative ways of solving these problems are the proximal DC algorithms, which require that the convex part of the objective function have $L$-smoothness. In this article, we propose the Bregman Proximal DC Algorithm (BPDCA) for solving large-scale DC optimization problems that do not possess $L$-smoothness. Instead, it requires that the convex part of the objective function has the $L$-smooth adaptable property that is exploited in Bregman proximal gradient algorithms. In addition, we propose an accelerated version, the Bregman Proximal DC Algorithm with extrapolation (BPDCAe), with a new restart scheme. We show the global convergence of the iterates generated by BPDCA(e) to a limiting critical point under the assumption of the Kurdyka-{\L}ojasiewicz property or subanalyticity of the objective function and other weaker conditions than those of the existing methods. We applied our algorithms to phase retrieval, which can be described both as a nonconvex optimization problem and as a DC optimization problem. Numerical experiments showed that BPDCAe outperformed existing Bregman proximal-type algorithms because the DC formulation allows for larger admissible step sizes.


Introduction
We are interested in solving Difference of Convex (DC) optimization problems: where $f_1$ and $f_2$ are convex functions on $\mathbb{R}^d$, and $\overline{C}$ is the closure of a nonempty, convex, and open set $C$. Note that the function $f_1 - f_2$ is not always convex. Also, $g$ may be nonsmooth, such as the $\ell_1$-norm $\|x\|_1$ in [8,20,29], or alternatively, $f_2$ may be nonsmooth [16]. Some interesting examples of (P) can be found in [28]. Although we will place some assumptions on $C$, it can be regarded as $\mathbb{R}^d$ for simplicity. Applications of DC optimization are summarized in [13,17,26].
The DC Algorithm (DCA) (see for instance [17]) is a well-known iterative method for solving the DC optimization problems (P). At each iteration, its computational burden mainly depends on the resolution of the subproblem (1). Solving subproblem (1) may be computationally demanding unless $f_1$ and $g$ have simple structure or (P) is small-scale. When $g$ is convex, the proximal DC Algorithm (pDCA) (see for instance [28]) is an alternative method for solving large-scale DC optimization problems. However, to guarantee global convergence of its iterates to a critical point, $f_1$ needs to be $L$-smooth; i.e., its gradient needs to be globally Lipschitz continuous. Each step of pDCA is given by a proximal update in which $\xi^k \in \partial_c f_2(x^k)$, $x^k \in C$, $\lambda > 0$ satisfies $0 < \lambda L < 1$, and $\|\cdot\|$ denotes the Euclidean norm. Since $\lambda\ (< 1/L)$ plays the role of a step size, finding a larger upper bound $1/L$, i.e., finding a smaller $L$, is of fundamental importance to achieving fast convergence. Wen et al. [28] proposed the proximal DC Algorithm with extrapolation (pDCAe) to accelerate pDCA with the extrapolation technique, which is used, for instance, in the Fast Iterative Shrinkage-Thresholding Algorithm (FISTA) of Beck and Teboulle [4] and in Nesterov's extrapolation technique [21,22]. Bolte et al. [8], who incorporated the kernel generating distance (function) $h$ and Bregman distances [9] into the algorithmic framework, came up with the notion of the $L$-smooth adaptable property (see also [2]), which is less restrictive than $L$-smoothness. A variant of the Bregman Proximal Gradient algorithm (BPG) proposed by Mukkamala et al. [20] iteratively estimates a small $L$, while Zhang et al. [29] proposed the Bregman Proximal Gradient algorithm with extrapolation (BPGe), which combines BPG with a line search step for the extrapolation parameters.
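As a concrete illustration of the pDCA step, the sketch below applies the update to a toy DC decomposition; the functions $f_1$, $f_2$ and all parameter values here are hypothetical choices for illustration only. With $g = \mu\|\cdot\|_1$, the subproblem reduces to soft-thresholding of a gradient step.

```python
import numpy as np

def soft_threshold(v, tau):
    # proximal operator of tau * ||.||_1
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def pdca_step(x, grad_f1, subgrad_f2, lam, mu):
    """One pDCA step for Psi = f1 + mu*||.||_1 - f2 (sketch):
    x+ = argmin_u mu*||u||_1 + <grad f1(x) - xi, u> + (1/(2*lam))||u - x||^2,
    which is soft-thresholding of the gradient step."""
    xi = subgrad_f2(x)
    return soft_threshold(x - lam * (grad_f1(x) - xi), lam * mu)

# hypothetical toy instance: f1(x) = ||x||^2 (L = 2), f2(x) = ||x||^2 / 2,
# g = mu*||x||_1, so Psi(x) = ||x||^2/2 + mu*||x||_1 with minimizer 0
grad_f1 = lambda x: 2.0 * x
subgrad_f2 = lambda x: x
x = np.array([2.0, -0.3, 0.0])
for _ in range(200):                 # lam = 0.4 < 1/L = 0.5
    x = pdca_step(x, grad_f1, subgrad_f2, lam=0.4, mu=0.1)
```

On this toy instance, the iterates are driven to the minimizer at the origin.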
In this paper, we propose two new algorithms, namely, the Bregman Proximal Difference of Convex Algorithm (BPDCA) and the Bregman Proximal Difference of Convex Algorithm with extrapolation (BPDCAe), which are inspired by pDCA(e) and BPG(e). These novel algorithms combine pDCA(e) with the Bregman distances. In the subproblem of BPDCA(e), the use of Bregman distances guarantees the accuracy of a linear approximation of f 1 − f 2 .
The novelty of our contributions can be better understood through a comparison with existing work. As already mentioned, Bregman distances allow us to extend the class of functions $f_1$ to be minimized from the $L$-smooth functions of pDCAe [28] to the larger class of $L$-smooth adaptable pairs of functions $(f_1, h)$. In addition, the function $g$ does not need to be convex in the case of BPDCA. By assuming that either $f_2$ or $g$ is differentiable and that its gradient is locally Lipschitz continuous, the iterates of BPDCA(e) converge globally to a limiting stationary point (Theorems 2 and 7) or a limiting critical point (Theorems 3 and 8), where the definitions of these convergent points are given in Definition 6. This means that either $g$ or $f_2$ may be nonsmooth.
Compared with BPG-type algorithms [8,20,29], BPDCA(e) better exploits the structure of the objective function. When applying these BPG-type algorithms to solve problem (P), we must decompose $\Psi$ into two functions, and there are two naive ways to do so. First, we consider the decomposition $\Psi = f_1 + (g - f_2)$ to apply BPG. In this case, BPG solves its subproblem $\min_x \{g(x) - f_2(x) + \langle \nabla f_1(x^k), x - x^k \rangle + \frac{1}{\lambda} D_h(x, x^k)\}$ at the $k$th iteration, where $D_h$ is the Bregman distance associated with a kernel generating distance $h$ (see Definition 2) and $\lambda < 1/L$ is a positive parameter. In general, it is difficult to solve this subproblem efficiently because $f_2$ often lacks simple structure such as separability; this remains true even when $g$ is both convex and separable. With BPDCA, we only need to solve the subproblem $\min_x \{g(x) + \langle \nabla f_1(x^k) - \xi^k, x - x^k \rangle + \frac{1}{\lambda} D_h(x, x^k)\}$, where $\xi^k \in \partial_c f_2(x^k)$. If $g$ is additionally convex, the subproblem becomes convex and hence is often efficiently solvable. Moreover, if $g$ and $h$ are also separable, the subproblem reduces to $d$ independent one-dimensional convex optimization problems. Even without separability of $h$, it often has closed-form solution formulae, as mentioned in Section 5. As an alternative decomposition of $\Psi$, we consider $\Psi = (f_1 - f_2) + g$ to apply BPG. In this case, to guarantee global convergence, the $L$-smooth adaptability of $(f_1 - f_2, h)$ is required (see Definition 3). Meanwhile, for the global convergence of BPDCA(e), the $L'$-smooth adaptability of $(f_1, h')$ is required. Comparing these constants, $L' \le L$ in general, and thus we can expect a substantial decrease in the objective function at each iteration (Lemma 5 and [8, Lemma 4.1]). This fact has dramatic consequences in practice, as we found in numerical experiments on phase retrieval (Subsection 5.1).
The convergence of our algorithms and the monotonicity of the objective function rest on standard assumptions. Our new restart scheme (Subsection 3.2) plays an important role in guaranteeing the non-increasing property of the objective function of BPDCAe without the need for a line search as in [29]. We show global convergence under local Lipschitz continuity of the gradients and the Kurdyka-{\L}ojasiewicz property or subanalyticity of the objective function. Additionally, we evaluate the rates of convergence of BPDCA(e).
To evaluate the performance of BPDCA(e), we applied them to phase retrieval, a well-known application in nonconvex optimization. Phase retrieval arises in many fields of science and engineering, such as X-ray crystallography and image processing [10,24]. It can be formulated as a nonconvex optimization problem or as a DC optimization problem (P), such as in [14]. It cannot be solved via pDCA or proximal algorithms, since the function we want to minimize is not $L$-smooth. When we formulated phase retrieval as a DC optimization problem, we obtained much smaller $L$-smooth adaptable parameters than the existing ones [8, Lemma 5.1], [20]. Thus, our algorithms outperformed BPG(e) in our numerical experiments. Further experiments showed that, under the Gaussian model, BPDCAe had a higher success rate of phase retrieval than Wirtinger flow [10]. Although the kernel generating distance $h$ we utilized does not satisfy Assumption 4 (i), the sequences generated by BPDCA(e) converged in the numerical experiments. Therefore, we conjecture that all of the convergence analyses can be carried out under weaker conditions.
This paper is organized as follows. Section 2 summarizes notions such as the limiting subdifferential, Bregman distances, the $L$-smooth adaptable property, the Kurdyka-{\L}ojasiewicz property, and subanalytic functions. Section 3 introduces our Bregman proximal-type algorithms under the assumption that $(f_1, h)$ has the $L$-smooth adaptable property. Section 4 (and Appendix A) establishes the global convergence of BPDCA(e) to a limiting stationary point or a limiting critical point of problem (P) and analyzes the rate of convergence. Section 5 derives small values for the constant $L$ and compares the performance of our algorithms with that of BPG(e). Section 6 summarizes our contributions and discusses future work.

Preliminaries
Here, we review the important notions we will need in the subsequent sections.

Subdifferentials
Definition 1 (Limiting Subdifferential [23]) For a proper and lower semicontinuous function $f: \mathbb{R}^d \to (-\infty, +\infty]$, the limiting subdifferential of $f$ at $x \in \operatorname{dom} f$ is defined by $\partial f(x) := \{v \in \mathbb{R}^d \mid \exists\, x^k \to x,\ f(x^k) \to f(x),\ v^k \to v \text{ with } v^k \in \hat{\partial} f(x^k)\}$, where $\hat{\partial} f$ denotes the Fréchet (regular) subdifferential. Note that when $f$ is convex, the limiting subdifferential coincides with the (classical) subdifferential [23, Proposition 8.12], that is, $\partial f(x) = \partial_c f(x)$ for all $x \in \mathbb{R}^d$.

Bregman Distances
First, we define kernel generating distances and Bregman distances.
Definition 2 (Kernel Generating Distance [8]) Let $C$ be a nonempty, convex, and open subset of $\mathbb{R}^d$. A function $h: \mathbb{R}^d \to (-\infty, +\infty]$ associated with $C$ is called a kernel generating distance if it satisfies the following:
(i) $h$ is proper, lower semicontinuous, and convex, with $\operatorname{dom} h \subset \overline{C}$ and $\operatorname{dom} \partial h = C$;
(ii) $h$ is continuously differentiable on $\operatorname{int} \operatorname{dom} h = C$.
We denote the class of kernel generating distances by $\mathcal{G}(C)$. Given $h \in \mathcal{G}(C)$, the Bregman distance $D_h: \operatorname{dom} h \times \operatorname{int} \operatorname{dom} h \to \mathbb{R}$ is defined by $D_h(x, y) := h(x) - h(y) - \langle \nabla h(y), x - y \rangle$. From the gradient inequality, $h$ is convex if and only if $D_h(x, y) \ge 0$ for any $x \in \operatorname{dom} h$ and $y \in \operatorname{int} \operatorname{dom} h$. When $h$ is a strictly convex function, equality holds if and only if $x = y$. When $h = \frac{1}{2}\|\cdot\|^2$, $D_h(x, y) = \frac{1}{2}\|x - y\|^2$, which is the squared Euclidean distance. In addition, Bregman distances satisfy the three-point identity [8], $D_h(x, z) - D_h(x, y) - D_h(y, z) = \langle \nabla h(y) - \nabla h(z), x - y \rangle$, for any $y, z \in \operatorname{int} \operatorname{dom} h$ and $x \in \operatorname{dom} h$.
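These identities can be checked numerically; the sketch below (with $h = \frac{1}{2}\|\cdot\|^2$, chosen only for illustration) verifies that $D_h$ reduces to the squared Euclidean distance and that the three-point identity holds.

```python
import numpy as np

def bregman(h, grad_h, x, y):
    # Bregman distance D_h(x, y) = h(x) - h(y) - <grad h(y), x - y>
    return h(x) - h(y) - np.dot(grad_h(y), x - y)

h_sq = lambda x: 0.5 * np.dot(x, x)      # h = (1/2)||.||^2
gh_sq = lambda x: x                      # its gradient

x = np.array([1.0, -2.0])
y = np.array([0.5, 0.0])
z = np.array([-1.0, 3.0])

# for this h, D_h(x, y) is the squared Euclidean distance (up to factor 1/2)
d = bregman(h_sq, gh_sq, x, y)

# three-point identity:
# D_h(x, z) - D_h(x, y) - D_h(y, z) = <grad h(y) - grad h(z), x - y>
lhs = (bregman(h_sq, gh_sq, x, z) - bregman(h_sq, gh_sq, x, y)
       - bregman(h_sq, gh_sq, y, z))
rhs = np.dot(gh_sq(y) - gh_sq(z), x - y)
```

The identity holds for any kernel $h$; the quadratic kernel merely makes both sides easy to evaluate by hand.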

Smooth Adaptable Functions
Now let us define the notion of L-smooth adaptable.
Definition 3 ($L$-smooth adaptable [8]) Consider a pair of functions $(f, h)$ with $h \in \mathcal{G}(C)$ and $f: \mathbb{R}^d \to (-\infty, +\infty]$ proper and lower semicontinuous with $\operatorname{dom} h \subset \operatorname{dom} f$, continuously differentiable on $C = \operatorname{int} \operatorname{dom} h$. The pair $(f, h)$ is called $L$-smooth adaptable ($L$-smad) on $C$ if there exists $L > 0$ such that $Lh - f$ and $Lh + f$ are convex on $C$. Since our focus is on DC optimization, the function $f_1$ in (P) is always convex. Thus, it is enough to verify that $Lh - f_1$ is convex on $C$ for $(f_1, h)$ to be $L$-smad on $C$.
From the L-smooth adaptable property comes the Descent Lemma [8].
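A 1-D numerical sketch may help fix ideas: for $f(x) = \frac{1}{4}x^4$ (whose gradient is not globally Lipschitz) paired with $h(x) = \frac{1}{4}x^4 + \frac{1}{2}x^2$, we have $h - f = \frac{1}{2}x^2$ convex, so $(f, h)$ is $1$-smad, and the descent inequality $f(x) \le f(y) + f'(y)(x - y) + L\,D_h(x, y)$ can be checked at random points. The functions here are illustrative choices, not the paper's application.

```python
import numpy as np

f  = lambda x: 0.25 * x ** 4        # gradient x^3 is not globally Lipschitz
df = lambda x: x ** 3
h  = lambda x: 0.25 * x ** 4 + 0.5 * x ** 2
dh = lambda x: x ** 3 + x

def bregman(phi, dphi, x, y):
    # Bregman distance D_phi(x, y) in one dimension
    return phi(x) - phi(y) - dphi(y) * (x - y)

rng = np.random.default_rng(1)
L = 1.0   # valid since L*h - f = x^2/2 is convex

# descent lemma gap: L * D_h(x, y) - [f(x) - f(y) - f'(y)(x - y)] >= 0
gaps = [L * bregman(h, dh, x, y) - bregman(f, df, x, y)
        for x, y in rng.uniform(-10.0, 10.0, size=(1000, 2))]
worst = min(gaps)   # should be nonnegative up to rounding
```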
Definition 4 (Kurdyka-{\L}ojasiewicz property [7]) Let $f: \mathbb{R}^d \to (-\infty, +\infty]$ be proper and lower semicontinuous.
(i) $f$ is said to have the Kurdyka-{\L}ojasiewicz (KL) property at $\bar{x} \in \operatorname{dom} \partial f$ if there exist $\eta \in (0, +\infty]$, a neighborhood $U$ of $\bar{x}$, and a function $\varphi \in \Xi_\eta$ such that $\varphi'(f(x) - f(\bar{x})) \operatorname{dist}(0, \partial f(x)) \ge 1$ for all $x \in U$ with $f(\bar{x}) < f(x) < f(\bar{x}) + \eta$.
(ii) If $f$ has the KL property at each point of $\operatorname{dom} \partial f$, then it is called a KL function.
Lemma 2 (Uniformized KL property [7]) Suppose that $f: \mathbb{R}^d \to (-\infty, +\infty]$ is a proper and lower semicontinuous function and let $\Gamma$ be a compact set. If $f$ is constant on $\Gamma$ and has the KL property at each point of $\Gamma$, then there exist positive scalars $\epsilon, \eta > 0$ and $\varphi \in \Xi_\eta$ such that $\varphi'(f(x) - f(\bar{x})) \operatorname{dist}(0, \partial f(x)) \ge 1$ for any $\bar{x} \in \Gamma$ and any $x$ satisfying $\operatorname{dist}(x, \Gamma) < \epsilon$ and $f(\bar{x}) < f(x) < f(\bar{x}) + \eta$.
Next, we describe subanalytic functions.
Definition 5 (Subanalyticity [6]) A set $S \subset \mathbb{R}^d$ is semianalytic if it can be locally represented by finitely many equalities and inequalities involving functions $f_{ij}, g_{ij}: V \to \mathbb{R}$ that are real-analytic for all $i, j$; $S$ is subanalytic if it is locally a projection of a relatively compact semianalytic set, and a function is subanalytic if its graph is a subanalytic set. For instance, given a subanalytic set $S$, $\operatorname{dist}(x, S)$ is subanalytic, and every real-analytic function is subanalytic. Note that subanalytic functions are KL functions. See [5,6] for further properties of subanalyticity.

Proposed Methods: Bregman Proximal DC Algorithms
We place the following assumptions on the DC optimization problem (P). Recall that C = int dom h.
Note that Assumption 1 (iv) holds when C is compact [8].

Bregman Proximal DC Algorithm (BPDCA)
To define the Bregman Proximal DC Algorithm (BPDCA) mapping $T_\lambda$ for some $\lambda > 0$, we recast the objective function of (P) via a DC decomposition and, given $x \in C = \operatorname{int} \operatorname{dom} h$ and $\xi \in \partial_c f_2(x)$, define the mapping $T_\lambda(x)$ as the solution set of the corresponding subproblem (5). Additionally, we place the following assumption on (P).
Assumption 2 For all x ∈ C and λ > 0, we have Note that Assumption 2 is not so restrictive because it holds when C ≡ R d . Under Assumptions 1 and 2, we have the following lemma [8, Lemma 3.1].
Lemma 3 Suppose that Assumptions 1 and 2 hold, and let x ∈ C = int dom h. Then, the set T λ (x) is a nonempty and compact subset of C for any λ > 0.
Note that when $h$ is strictly convex, $T_\lambda(x)$ is a singleton. Also, when $g$ and $h$ are separable, this mapping is easily computable since the problem defining $T_\lambda(x)$ decomposes into one-dimensional problems and often has a closed-form solution. For instance, when $h(x) = \frac{1}{2}\|x\|^2$, $T_\lambda(x)$ becomes the soft-thresholding operator for $g(x) = \|x\|_1$, or the hard-thresholding operator for $g(x) = \|x\|_0$. Other well-known examples where this mapping has a closed-form solution arise when we use an appropriate $h$, such as the Burg entropy [2], the Shannon entropy [3], or $h(x) = \frac{1}{4}\|x\|^4 + \frac{1}{2}\|x\|^2$ [8], for the corresponding $g$; note that this last $h$ is not separable. For further examples, see [12]. The Bregman Proximal DC Algorithm (BPDCA), which we are proposing, is listed as Algorithm 1.
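With $h(x) = \frac{1}{2}\|x\|^2$, these two thresholding forms of $T_\lambda$ can be written down explicitly; the sketch below uses hypothetical $\nabla f_1$ and $\xi$ purely to show the structure of the two closed forms.

```python
import numpy as np

def soft_threshold_map(x, grad_f1, xi, lam):
    # T_lam(x) for h = (1/2)||.||^2 and g = ||.||_1:
    # soft-thresholding of the Bregman gradient step
    v = x - lam * (grad_f1(x) - xi)
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

def hard_threshold_map(x, grad_f1, xi, lam):
    # T_lam(x) for g = ||.||_0: keep v_i with v_i^2 > 2*lam, zero the rest
    # (one selection from the solution set when ties occur)
    v = x - lam * (grad_f1(x) - xi)
    return np.where(v * v > 2.0 * lam, v, 0.0)

grad_f1 = lambda x: x        # hypothetical smooth part
xi = np.zeros(3)             # hypothetical subgradient of f2
x = np.array([4.0, 0.05, -1.0])
s = soft_threshold_map(x, grad_f1, xi, lam=0.5)   # shrinks all entries
t = hard_threshold_map(x, grad_f1, xi, lam=0.5)   # keeps only large entries
```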

Algorithm 1 Bregman Proximal DC Algorithm (BPDCA)
Input: $h \in \mathcal{G}(C)$ with $C = \operatorname{int} \operatorname{dom} h$ such that the pair $(f_1, h)$ is $L$-smad on $C$.

Bregman Proximal DC Algorithm with Extrapolation (BPDCAe)
Algorithm 2, which we are proposing, is an acceleration of BPDCA that uses the extrapolation technique [4,21,22] to solve the DC optimization problem (P).

Algorithm 2 Bregman Proximal DC Algorithm with Extrapolation (BPDCAe)
Input: $h \in \mathcal{G}(C)$ with $C = \operatorname{int} \operatorname{dom} h$ such that the pair $(f_1, h)$ is $L$-smad on $\mathbb{R}^d$, $\rho \in [0, 1)$, and $0 < \lambda < 1/L$. for $k = 0, 1, 2, \ldots$ do: take any $\xi^k \in \partial_c f_2(x^k)$ and compute the next iterate; end for.
When $\beta_k \equiv 0$ for all $k \ge 0$, BPDCAe reduces to BPDCA. Here, we adopt the popular choice of the coefficients $\beta_k$ (and $\theta_k$) given in [28] for acceleration. Accordingly, (6) guarantees that $\{\beta_k\}_{k=0}^\infty \subset [0, 1)$ and $\sup_{k \ge 0} \beta_k < 1$. These properties are needed to prove global subsequential convergence of the iterates (see Theorem 6 (ii)). Algorithm 2 introduces a new adaptive restart scheme, which resets $\theta_k$ and $\beta_k$ whenever $D_h(x^k, y^k) > \rho D_h(x^{k-1}, x^k)$ is satisfied for a fixed $\rho \in [0, 1)$. This adaptive restart scheme guarantees the non-increasing property for BPDCAe (see Lemma 6). In addition, we can enforce this resetting every $K$ iterations for a given positive integer $K$. In our numerical experiments, we used both the fixed and the adaptive restart schemes for $\{\beta_k\}_{k=0}^\infty$.
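A minimal sketch of the extrapolation coefficients, assuming the FISTA-type updates $\theta_{k+1} = (1 + \sqrt{1 + 4\theta_k^2})/2$ and $\beta_k = (\theta_{k-1} - 1)/\theta_k$ from [28]; the restart test $D_h(x^k, y^k) > \rho D_h(x^{k-1}, x^k)$ is abstracted as a set of iteration indices at which it fires, and the fixed every-$K$ reset is optional.

```python
import math

def update_theta(theta):
    # theta_{k+1} = (1 + sqrt(1 + 4 * theta_k^2)) / 2
    return (1.0 + math.sqrt(1.0 + 4.0 * theta * theta)) / 2.0

def beta_sequence(n, restarts=(), K=None):
    """Return beta_0, ..., beta_{n-1}; `restarts` lists iterations where the
    adaptive restart test fired, K (if given) forces a reset every K steps."""
    thetas = [1.0, 1.0]            # theta_{-1} = theta_0 = 1
    betas = []
    for k in range(n):
        if k in restarts or (K is not None and k > 0 and k % K == 0):
            thetas = [1.0, 1.0]    # reset => beta_k = 0 at this iteration
        betas.append((thetas[0] - 1.0) / thetas[1])
        thetas = [thetas[1], update_theta(thetas[1])]
    return betas

betas = beta_sequence(50)
```

Without any restart, the coefficients increase toward (but never reach) $1$, which is exactly the property $\{\beta_k\} \subset [0, 1)$ with $\sup_k \beta_k < 1$ used in the convergence proofs.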
When the extrapolated point $x^k + \beta_k(x^k - x^{k-1}) \notin C$, Algorithm 2 enforces $\beta_k = 0$, and BPDCAe is not accelerated at the $k$th iteration. This operation guarantees that $y^k$ always stays in $C$. In practice, however, the extrapolation is usually valid and accelerates BPDCAe.
We define the BPDCAe mapping $S_\lambda(x, y)$ for all $x, y \in C = \operatorname{int} \operatorname{dom} h$ and $\lambda \in (0, 1/L)$. Similarly to the case of BPDCA, we make Assumption 3 below and can prove Lemma 4, which shows $S_\lambda(x, y) \subset C$.
Assumption 3 For all $x, y \in C$ and $\lambda > 0$, the condition analogous to Assumption 2 holds.
Lemma 4 Suppose that Assumptions 1 and 3 hold, and let $x, y \in C = \operatorname{int} \operatorname{dom} h$. Then, the set $S_\lambda(x, y)$ is a nonempty and compact subset of $C$ for any $\lambda > 0$.

Convergence Analysis
Throughout this section, we will assume that the pair of functions (f 1 , h) is L-smad on C.

Properties of BPDCA
First, we show the decreasing property of the BPDCA mapping for $0 < \lambda L < 1$ (the argument is adapted from [8, Lemma 4.1]).
Lemma 5 Suppose that Assumptions 1 and 2 hold. For any $x \in C = \operatorname{int} \operatorname{dom} h$ and any $x^+ \in C = \operatorname{int} \operatorname{dom} h$ defined by the BPDCA update, where $\xi \in \partial_c f_2(x)$ and $\lambda > 0$, the descent inequality holds. In particular, a sufficient decrease in the objective function value $\Psi$ is ensured when $0 < \lambda L < 1$.
Proof From the global optimality of $x^+$, taking $u = x \in \operatorname{int} \operatorname{dom} h$ and $\xi \in \partial_c f_2(x)$, we obtain a first inequality. Invoking the full extended descent lemma (Lemma 1) for $f_1$, the definition of the subgradient for $f_2$, and the above inequality, we obtain the claimed bound. The last statement follows with $0 < \lambda L < 1$.
Proposition 1 Suppose that Assumptions 1 and 2 hold. Let {x k } ∞ k=0 be a sequence generated by BPDCA with 0 < λL < 1. Then, the following statements hold:

Convergence Analysis of BPDCA
Assumption 4 Suppose that the following conditions hold:
(i) $h$ is $\sigma$-strongly convex on $\mathbb{R}^d$;
(ii) $\nabla h$ and $\nabla f_1$ are Lipschitz continuous on any bounded subset of $\mathbb{R}^d$;
(iii) the objective function $\Psi$ is level-bounded; i.e., for any $r \in \mathbb{R}$, the lower level sets $\{x \in \mathbb{R}^d \mid \Psi(x) \le r\}$ are bounded.

Definition 6
We say that $\bar{x}$ is a limiting critical point of (P) if $0 \in \nabla f_1(\bar{x}) - \partial_c f_2(\bar{x}) + \partial g(\bar{x})$ (11). The set of all limiting critical points of (P) is denoted by $\mathcal{X}$. In addition, we say that $\bar{x}$ is a limiting stationary point of (P) if $0 \in \partial \Psi(\bar{x})$ (12). Although limiting stationary points are sometimes called limiting critical points in some papers, for example [7, Definition 1 (iv)], we distinguish these two terms for the following reasons. When $\Psi$ is convex, we call $\bar{x}$ a stationary point if it satisfies $0 \in \partial_c \Psi(\bar{x})$. Because (12) is its natural extension obtained by replacing $\partial_c \Psi$ with $\partial \Psi$, we use the terminology "limiting stationary point" after [11, Definition 6.1.4]. We similarly name $\bar{x}$ satisfying (11): when $g$ is convex, we call $\bar{x}$ a critical point if it satisfies $0 \in \nabla f_1(\bar{x}) - \partial_c f_2(\bar{x}) + \partial_c g(\bar{x})$. Because (11) is its natural extension obtained by replacing $\partial_c g$ with $\partial g$, we use the terminology "limiting critical point." Being a limiting stationary point is a first-order necessary condition for local optimality; this relation is known as the generalized Fermat rule [23, Theorem 10.1]. Combining this with [19, Corollary 3.4] and [23, Corollary 10.9], it generally holds that $\partial \Psi(\bar{x}) \subset \nabla f_1(\bar{x}) - \partial_c f_2(\bar{x}) + \partial g(\bar{x})$, which implies that the notion of a limiting critical point is weaker than that of a limiting stationary point.
Next, using Lemma 5 and Proposition 1, we will show global subsequential convergence of the iterates to a limiting critical point of problem (P). We can easily see that Theorem 1 (i) holds from the level-boundedness of $\Psi$. Theorem 1 (iii) and (iv) will play an essential role in establishing the global convergence and the rate of convergence of BPDCA.
Theorem 1 (Global subsequential convergence of BPDCA) Suppose that Assumptions 1, 2, and 4 hold. Let $\{x^k\}_{k=0}^\infty$ be a sequence generated by BPDCA with $0 < \lambda L < 1$ for solving (P). Then, the following statements hold: (i) the sequence $\{x^k\}_{k=0}^\infty$ is bounded; (ii) the sequence $\{\xi^k\}_{k=0}^\infty$ is bounded; (iii) $\lim_{k \to \infty} \|x^{k+1} - x^k\| = 0$; (iv) any accumulation point of $\{x^k\}_{k=0}^\infty$ is a limiting critical point of (P).
(ii) From Assumptions 1 (ii) and 4 (i), the convexity of $f_2$, and the definition of the subgradients of convex functions, we have that the subgradient inequality holds for any $y \in \mathbb{R}^d$. Assume for a moment that $\xi^k \neq 0$. Letting $\{d^k\}_{k=0}^\infty$ be the subsequence given by (13), we obtain a bound, which is also true when $\xi^k = 0$ by defining $d^k = 0$. Taking $k \to \infty$, we obtain (14). We can take a compact set $S$ such that (15) holds for some value $\bar{f}_2 \le f_2(x^k)$, $k \ge 0$. Then, (14) and (15) contradict $\|\xi^k\| \to \infty$.
(iii) From (10), we obtain an inequality whose last step holds since $h$ is a $\sigma$-strongly convex function from Assumption 4 (i). Summing this inequality from $k = 1$ to $\infty$, we obtain a finite sum, which shows that $\lim_{k \to \infty} \|x^{k+1} - x^k\| = 0$.
(iv) Let $\bar{x}$ be an accumulation point of $\{x^k\}_{k=0}^\infty$ and let $\{x^{k_j}\}$ be a subsequence such that $\lim_{j \to \infty} x^{k_j} = \bar{x}$. Then, from the first-order optimality condition of subproblem (5) under Assumption 2, we obtain (17). From the boundedness of $\{x^{k_j}\}$ and the Lipschitz continuity of $\nabla h$ on bounded subsets of $\mathbb{R}^d$, there exists $A_0 > 0$ such that (18) holds. Note that the sequence $\{\xi^{k_j}\}$ is bounded due to (ii). Thus, by taking the limit as $j \to \infty$, or more precisely along a further subsequence, we can assume without loss of generality that $\lim_{j \to \infty} \xi^{k_j} =: \bar{\xi}$ exists, which belongs to $\partial_c f_2(\bar{x})$ since $f_2$ is continuous due to its convexity on $\mathbb{R}^d$. Using this and (18), we can take the limit of (17). Using $x^{k_j+1} - x^{k_j} \to 0$ and invoking the lower semicontinuity of $g$ and the continuity of $\nabla f_1$, we obtain $\bar{\xi} \in \nabla f_1(\bar{x}) + \partial g(\bar{x})$, which shows that $\bar{x}$ is a limiting critical point of (P).
We can estimate the objective value at an accumulation point from $\liminf_{j \to \infty} \Psi(x^{k_j})$ and $\limsup_{j \to \infty} \Psi(x^{k_j})$. Consequently, we can prove that $\Psi$ is constant on the set of accumulation points of BPDCA.
(ii) Take any $\bar{x} \in \Omega$, that is, $\lim_{j \to \infty} x^{k_j} = \bar{x}$. From (5), the optimality of the iterate yields an inequality, which, together with the convexity of $f_1$ at $x^k$, gives the desired bound. Substituting $k_j$ for $k$ and letting $j \to \infty$, we obtain the claim from Proposition 1 (ii).
To discuss the global convergence of BPDCA, we will suppose either of the following two assumptions. Assumption 5 is not restrictive because many functions in [28], including the $f_2$ in our numerical experiments, satisfy it. Thus, let us discuss the global convergence of Algorithm 1 under Assumption 5 by following the argument presented in [28]. Note that every limiting critical point is a limiting stationary point from the differentiability of $f_2$ under Assumption 5.
Theorem 2 (Global convergence of BPDCA under the local differentiability of $f_2$) Suppose that Assumptions 1, 2, 4, and 5 hold and that $\Psi$ is a KL function. Let $\{x^k\}_{k=0}^\infty$ be a sequence generated by BPDCA with $0 < \lambda L < 1$ for solving (P). Then, the following statements hold: (i) $\lim_{k \to \infty} \operatorname{dist}(0, \partial \Psi(x^k)) = 0$; (ii) the sequence $\{x^k\}_{k=0}^\infty$ converges to a limiting stationary point of (P); moreover, $\sum_{k=1}^\infty \|x^{k+1} - x^k\| < \infty$.
Proof (i) From Theorem 1 (iv), we also have $\Omega \subseteq \mathcal{X}$. Thus, for any $\mu > 0$, there exists $k_0 > 0$ such that $\operatorname{dist}(x^k, \Omega) < \mu$ and $x^k \in N_0$ for any $k \ge k_0$, where $N_0$ is defined in Assumption 5. Since $\Omega$ is compact from the boundedness of $\{x^k\}_{k=0}^\infty$, by decreasing $\mu$ if needed, we can suppose without loss of generality that $\nabla f_2$ is globally Lipschitz continuous on $N := \{x \in N_0 \mid \operatorname{dist}(x, \Omega) < \mu\}$.
The subdifferential of $\Psi$ at $x^k$ for $k \ge k_0$ is given by (20). Moreover, considering the first-order optimality condition of subproblem (5), we have the corresponding inclusion for any $k \ge k_0 + 1$, since $f_2$ is $C^1$ on $N$ and $x^{k-1} \in N$ for any $k \ge k_0 + 1$. Using this and (20), we obtain a bound on the subgradients. From the global Lipschitz continuity of $\nabla f_1$, $\nabla f_2$, and $\nabla h$ on $N$, there exists $A_1 > 0$ such that (21) holds for $k \ge k_0 + 1$. From Theorem 1 (iii), we conclude that $\lim_{k \to \infty} \operatorname{dist}(0, \partial \Psi(x^k)) = 0$.
(ii) From Theorem 1 (iv), it is sufficient to prove that $\{x^k\}_{k=0}^\infty$ is convergent. First, consider the case in which there exists a positive integer $k > 0$ such that $\Psi(x^k) = \zeta$. From Proposition 1 (i) and Proposition 2 (i), the sequence $\{\Psi(x^k)\}_{k=0}^\infty$ is non-increasing and converges to $\zeta$. Hence, for any $\hat{k} \ge 0$, we have $\Psi(x^{k+\hat{k}}) = \zeta$. By recalling (16), we conclude that there exists a positive scalar $A_2$ such that (22) holds. From (22), we obtain $x^k = x^{k+\hat{k}}$ for any $\hat{k} \ge 0$, meaning that $\{x^k\}_{k=0}^\infty$ is finitely convergent. Next, consider the case in which $\Psi(x^k) > \zeta$ for all $k \ge 0$. Since $\{x^k\}_{k=0}^\infty$ is bounded, $\Omega$ is a compact subset of $\operatorname{dom} \partial \Psi$, and $\Psi \equiv \zeta$ on $\Omega$ from Proposition 2 (ii). From Lemma 2 and since $\Psi$ is a KL function, there exist a positive scalar $\epsilon > 0$ and a continuous concave function $\varphi \in \Xi_\eta$ with $\eta > 0$ such that (23) holds for all $x \in U$. From (19), there exists $k_1 > 0$ such that $\operatorname{dist}(x^k, \Omega) < \epsilon$ for any $k \ge k_1$. Since $\{\Psi(x^k)\}_{k=0}^\infty$ is non-increasing and converges to $\zeta$, there exists $k_2 > 0$ such that $\zeta < \Psi(x^k) < \zeta + \eta$ for all $k \ge k_2$. Taking $\bar{k} = \max\{k_0 + 1, k_1, k_2\}$, the tail $\{x^k\}_{k \ge \bar{k}}$ belongs to $U$. Hence, from (23), we obtain (24). Since $\varphi$ is a concave function, we obtain a chain of inequalities for any $k \ge \bar{k}$, where the second inequality holds from (24) and the fact that $\{\Psi(x^k)\}_{k=0}^\infty$ is non-increasing, and the last inequality holds from (22). From this chain and (21), we obtain (25). Taking the square root of (25) and using the inequality of arithmetic and geometric means, we find that $\sum_{k=0}^\infty \|x^{k+1} - x^k\| < \infty$. This shows that $\{x^k\}_{k=0}^\infty$ converges to a limiting critical point of (P) from Theorem 1 (iv). Because every limiting critical point is a limiting stationary point from the differentiability of $f_2$, $\{x^k\}_{k=0}^\infty$ converges to a limiting stationary point of (P).
Next, suppose that Assumption 6 holds instead of Assumption 5. Here, we can show the global convergence of BPDCA by referring to [16,Theorem 3.4]. We will use subanalyticity instead of the KL property in the proof.
Theorem 3 (Global convergence of BPDCA under the local differentiability of $g$) Suppose that Assumptions 1, 2, 4, and 6 hold and that $\Psi$ is subanalytic. Let $\{x^k\}_{k=0}^\infty$ be a sequence generated by BPDCA with $0 < \lambda L < 1$ for solving (P). Then, the sequence $\{x^k\}_{k=0}^\infty$ converges to a limiting critical point of (P); moreover, $\sum_{k=1}^\infty \|x^{k+1} - x^k\| < \infty$.
Proof Since $g$ is differentiable, $g$ is continuous on $\mathbb{R}^d$. Since the convexity of $f_1$ and $f_2$ implies their continuity, $\Psi$ is continuous on $\mathbb{R}^d$. From Assumption 1 (v), $-\Psi$ is finite. In addition, by recalling the continuity and subanalyticity of $-\Psi$ on $B(\bar{x}, \epsilon_0)$, we can apply [6, Theorem 3.1] to the subanalytic function $-\Psi$ and obtain $\nu_0 > 0$ and $\theta_0 \in [0, 1)$ such that (26) holds, where $\zeta = \Psi(\bar{x})$.
Let Ω be the set of accumulation points of {x k } ∞ k=0 .
Finally, we will show the rate of convergence in a manner following [1,28].
(ii)-(iii) Next, consider the case $\theta \in (0, 1)$. If there exists $k_0 > 0$ such that $\Psi(x^{k_0}) = \zeta$, then we can show that the sequence $\{x^k\}_{k=0}^\infty$ is finitely convergent in the same way as in the proof of (i). Therefore, for $\theta \in (0, 1)$, we only need to consider the case in which $\Psi(x^k) > \zeta$ for all $k > 0$. Define $S_k := \sum_{j=k}^\infty \|x^{j+1} - x^j\|$, which is well-defined due to Theorem 2 (ii). From (26), for any $k \ge \bar{k}$, where $\bar{k}$ is defined in (24), we obtain (33). On the other hand, since $\lim_{k \to \infty} x^k = \bar{x}$ and $\{\Psi(x^k)\}$ is non-increasing and converges to $\zeta$, the KL inequality (24) with $\varphi(s) = cs^{1-\theta}$ ensures that (34) holds for all sufficiently large $k$. From the definition of $S_k$ and (21), we also have that (35) holds for all sufficiently large $k$. Combining (34) and (35), we have $R_k^\theta \le A_1 c (1 - \theta)(S_{k-1} - S_k)$ for all sufficiently large $k$. Raising this inequality to the power of $\frac{1-\theta}{\theta}$ and scaling both sides by $c$, then combining the result with (33) and recalling $\varphi(R_k) = cR_k^{1-\theta}$, we find that (36) holds for all sufficiently large $k$, with a suitable constant $A_3$. Since $\lim_{k \to \infty} \|x^{k+1} - x^k\| = 0$ by Theorem 1 (iii), $\lim_{k \to \infty} (S_{k-1} - S_k) = 0$. From these considerations and (36), we conclude that there exists $k_1 > 0$ such that, for all $k \ge k_1$, $S_k \le (A_3 + 1)(S_{k-1} - S_k)$, which implies $S_k \le \frac{A_3 + 1}{A_3 + 2} S_{k-1}$. Therefore, the linear rate of (ii) holds for all $k \ge k_1$. (iii) For $\theta \in (\frac{1}{2}, 1)$, we have $\frac{1-\theta}{\theta} < 1$. From (36) and $\lim_{k \to \infty} (S_{k-1} - S_k) = 0$, there exists $k_2 > 0$ such that the corresponding inequality holds for all $k \ge k_2$. Raising it to the power of $\frac{\theta}{1-\theta}$, for any $k \ge k_2$ we obtain a recursion for $S_k$ that yields the claimed sublinear rate. Using Theorem 3, we can obtain another rate of convergence in the same way as in the proof of [1, Theorem 2] or [16, Theorem 3.5].

Properties of BPDCAe
Inspired by [29], we introduce the auxiliary function $H_M$. To show the decreasing property of $H_M$, instead of $\Psi$, with respect to $\{x^k\}_{k=0}^\infty$, we further assume the convexity of $g$.

Assumption 7
The function g is convex.
Under the adaptive restart scheme (see (8)), we show the decreasing property of $H_M$.
Lemma 6 Suppose that Assumptions 1, 3, and 7 hold. For any $x^k, y^k \in C = \operatorname{int} \operatorname{dom} h$ and any $x^{k+1} \in C = \operatorname{int} \operatorname{dom} h$ defined by (37), inequality (38) holds. Furthermore, when $0 < \lambda L < 1$ and $\{\beta_k\}_{k=0}^\infty$ is given by the adaptive restart scheme (8), a refined inequality holds. In addition, when $\frac{\rho}{\lambda} \le M \le \frac{1}{\lambda}$ for $\rho \in [0, 1)$, the auxiliary function $H_M$ is ensured to be non-increasing.
Proof From the first-order optimality condition for (37), we obtain an inclusion. From the convexity of $g$, we obtain a first inequality. Using the three-point identity (3) of Bregman distances, and from the convexity of $f_1$ and Lemma 1, we obtain further bounds. These inequalities and the definition of the subgradient for $f_2$ lead to a bound that implies inequality (38). If $\beta_k = 0$, then $y^k = x^k$ and $D_h(x^k, y^k) = 0$. If $\beta_k \neq 0$, since we chose the adaptive restart scheme, there is a $\rho \in [0, 1)$ satisfying $D_h(x^k, y^k) \le \rho D_h(x^{k-1}, x^k)$. From the definition of $H_M(x^k, x^{k-1})$ and $0 < \lambda L < 1$, we obtain the desired inequality, where the second inequality comes from the bounds on $M$. We can use Lemma 6 to prove Proposition 3.
Proposition 3 Suppose that Assumptions 1, 3, and 7 hold. Let $\{x^k\}_{k=0}^\infty$ be a sequence generated by BPDCAe with $0 < \lambda L < 1$, and assume that the auxiliary function $H_M$ satisfies $\frac{\rho}{\lambda} \le M \le \frac{1}{\lambda}$ for $\rho \in [0, 1)$. Then, the following statements hold: Proof (i) The statement was proved in Lemma 6.
(ii) Modify (40) into a telescoping inequality, where the last inequality comes from $(1 - \lambda L) D_h(x^{k+1}, y^k) \ge 0$. Let $n$ be a positive integer. Summing this inequality from $k = 0$ to $n$ and letting $\Psi^* = v(\mathrm{P}) > -\infty$, we obtain a uniform bound, where the second inequality comes from $D_h(x^{-1}, x^0) = 0$ (since $x^{-1} = x^0$) and $D_h(x^n, x^{n+1}) \ge 0$. Note that $x^{n+1} \in C$ by Assumption 3. Taking the limit as $n \to \infty$, we arrive at the former statement of (ii). The latter statement directly follows from the former.

Convergence Analysis of BPDCAe
The proofs of Theorems 6, 7, 8, and Proposition 4 are given in the Appendix. They follow arguments that are similar to their BPDCA counterparts.
Since $H_M(x, y)$ has a Bregman distance term, the subdifferential of $H_M(x, y)$ involves a $\nabla h$ term. To prove Theorem 7, we additionally suppose that the subdifferential of the gradient $\nabla h$ is bounded on bounded sets [29].
Assumption 8 There exists a bounded u such that u ∈ ∂(∇h) on any bounded subset of R d .
We can prove the following theorems by supposing the KL property or the subanalyticity of the auxiliary function H M (x, y) in relation to x and y.
Theorem 7 (Global convergence of BPDCAe under the local differentiability of $f_2$) Suppose that Assumptions 1, 3, 4, 5, 7, and 8 hold and that the auxiliary function $H_M(x, y)$ is a KL function with $\frac{\rho}{\lambda} \le M \le \frac{1}{\lambda}$ for $\rho \in [0, 1)$. Let $\{x^k\}_{k=0}^\infty$ be a sequence generated by BPDCAe with $0 < \lambda L < 1$ for solving (P). Then, the sequence $\{x^k\}_{k=0}^\infty$ converges to a limiting stationary point of (P); moreover, $\sum_{k=1}^\infty \|x^{k+1} - x^k\| < \infty$.
Theorem 8 (Global convergence of BPDCAe under the local differentiability of $g$) Suppose that Assumptions 1, 3, 4, 6, 7, and 8 hold and that the auxiliary function $H_M(x, y)$ is subanalytic with $\frac{\rho}{\lambda} \le M \le \frac{1}{\lambda}$ for $\rho \in [0, 1)$. Let $\{x^k\}_{k=0}^\infty$ be a sequence generated by BPDCAe with $0 < \lambda L < 1$ for solving (P). Then, the sequence $\{x^k\}_{k=0}^\infty$ converges to a limiting critical point of (P); moreover, $\sum_{k=1}^\infty \|x^{k+1} - x^k\| < \infty$.
Finally, we have theorems regarding the convergence rate of BPDCAe, whose proofs are almost identical to those of Theorems 4 and 5.

Application to Phase Retrieval
In phase retrieval, we are interested in finding a (parameter) vector $x \in \mathbb{R}^d$ that approximately solves the system $\langle a_r, x \rangle^2 \approx b_r$, $r = 1, 2, \ldots, m$ (42), where the vectors $a_r \in \mathbb{R}^d$ describe the model and $b = (b_1, b_2, \ldots, b_m)^T$ is a vector of (usually noisy) measurements. As described in [8,10], the system (42) can be formulated as the nonconvex optimization problem $\min_{x \in \mathbb{R}^d} \frac{1}{4} \sum_{r=1}^m (\langle a_r, x \rangle^2 - b_r)^2 + \theta g(x)$ (43), where $\theta \ge 0$ is a trade-off parameter between the data-fidelity criterion and the regularizer $g$. We define $g: \mathbb{R}^d \to \mathbb{R}$ as, in particular, $g(x) = \|x\|_1$.
In this case, the underlying space of (P) is $\mathbb{R}^d$, and the data-fidelity term is a nonconvex differentiable function that does not admit a globally Lipschitz continuous gradient. The objective function of the phase retrieval problem can also be reformulated as a difference of two convex functions, as in [14]. That is, $f_1(x) = \frac{1}{4} \sum_{r=1}^m (\langle a_r, x \rangle^4 + b_r^2)$ and $f_2(x) = \frac{1}{2} \sum_{r=1}^m b_r \langle a_r, x \rangle^2$ (44), both of which are convex since $b_r \ge 0$. When we do not regard phase retrieval (43) as a DC optimization problem, the Bregman Proximal Gradient algorithm (BPG) can be used instead [8]. Enhancements using the extrapolation technique have been proposed: the Bregman Proximal Gradient algorithm with extrapolation (BPGe) [29] and Convex-Concave Inertial BPG [20] for estimating $L$. For BPG(e), assuming the $L$-smad property for the pair $(f_1 - f_2, h)$ with $h(x) = \frac{1}{4}\|x\|^4 + \frac{1}{2}\|x\|^2$, the constant $L$ must satisfy the inequality [8, Lemma 5.1] $L \ge \sum_{r=1}^m (3\|a_r\|^4 + \|a_r\|^2 |b_r|)$ (45). For DC optimization problems, on the other hand, we define $h: \mathbb{R}^d \to \mathbb{R}$ by $h(x) = \frac{1}{4}\|x\|^4$ (46), which is simpler than the kernel used for the nonconvex formulation. The function $h(x) = \frac{1}{4}\|x\|^4$ is not $\sigma$-strongly convex; therefore, this function does not satisfy Assumption 4 (i).
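The sketch below checks on random data that this DC split reproduces the data-fidelity term of (43); the functions written out here are our reading of (44), expanded purely for illustration.

```python
import numpy as np

def fidelity(x, A, b):
    # (1/4) * sum_r (<a_r, x>^2 - b_r)^2, the nonconvex data-fidelity term
    return 0.25 * np.sum(((A @ x) ** 2 - b) ** 2)

def f1(x, A, b):
    # convex part: (1/4) * sum_r (<a_r, x>^4 + b_r^2)
    return 0.25 * np.sum((A @ x) ** 4 + b ** 2)

def f2(x, A, b):
    # convex part: (1/2) * sum_r b_r * <a_r, x>^2 (convex since b_r >= 0)
    return 0.5 * np.sum(b * (A @ x) ** 2)

rng = np.random.default_rng(0)
A = rng.standard_normal((30, 6))          # rows are the vectors a_r
b = (A @ rng.standard_normal(6)) ** 2     # noiseless squared measurements
x = rng.standard_normal(6)

# f1 - f2 should coincide with the fidelity term up to rounding
gap = f1(x, A, b) - f2(x, A, b) - fidelity(x, A, b)
```

The equality follows from expanding $(\langle a_r, x\rangle^2 - b_r)^2 = \langle a_r, x\rangle^4 - 2 b_r \langle a_r, x\rangle^2 + b_r^2$ and splitting the convex and concave parts.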
Proposition 5 Let $f_1$ and $h$ be as defined above. Then, for any $L$ satisfying $L \ge 3 \left\| \sum_{r=1}^m \|a_r\|^2 a_r a_r^T \right\|$ (47), the function $Lh - f_1$ is convex on $\mathbb{R}^d$. Therefore, the pair $(f_1, h)$ is $L$-smad on $\mathbb{R}^d$.
Proof Let $x \in \mathbb{R}^d$. Since $f_1$ and $h$ are $C^2$ on $\mathbb{R}^d$, to guarantee the convexity of $Lh - f_1$, it is sufficient to find $L > 0$ such that $L \lambda_{\min}(\nabla^2 h(x)) \ge \lambda_{\max}(\nabla^2 f_1(x))$, where $\lambda_{\min}(M)$ and $\lambda_{\max}(M)$ denote the minimal and maximal eigenvalues of a matrix $M$, respectively. The Hessians of $f_1$ and $h$ are $\nabla^2 f_1(x) = 3 \sum_{r=1}^m \langle a_r, x \rangle^2 a_r a_r^T$ and $\nabla^2 h(x) = \|x\|^2 I + 2 x x^T$, so that $\lambda_{\min}(\nabla^2 h(x)) \ge \|x\|^2$. From the well-known fact $\lambda_{\max}(M) \le \|M\|$ and $\langle a_r, x \rangle^2 \le \|a_r\|^2 \|x\|^2$, we have $\lambda_{\max}(\nabla^2 f_1(x)) \le 3 \|x\|^2 \left\| \sum_{r=1}^m \|a_r\|^2 a_r a_r^T \right\|$. Therefore, we obtain the desired result.
Comparing the right-hand side of (45) with that of (47), we can see that
$$3\sum_{r=1}^m \|a_r\|^4 \le \sum_{r=1}^m \left(3\|a_r\|^4 + \|a_r\|^2 |b_r|\right). \qquad (48)$$
The constant $L$ plays the important role of defining the step size, and thereby affects the performance of the algorithms. Note that the left-hand side of (48) never exceeds the right-hand side because $\sum_{r=1}^m \|a_r\|^2 |b_r| \ge 0$. When $h(x) = \frac{1}{4}\|x\|^4 + \frac{1}{2}\|x\|^2$, the subproblems of BPG(e) have a closed-form solution formula [8, Proposition 5.1]. When $h(x) = \frac{1}{4}\|x\|^4$, subproblems (5) and (7) also have a closed-form solution formula, which is obtained by slight modifications of those in BPG(e).
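For $h(x) = \frac{1}{4}\|x\|^4$ and $g = \theta\|\cdot\|_1$, the closed-form subproblem solution reduces to soft-thresholding followed by a radial rescaling: writing $u = \nabla h(x^k) - \lambda(\nabla f_1(x^k) - \xi^k)$ and $v = \mathcal{S}_{\lambda\theta}(u)$, the optimality condition becomes $\|x\|^2 x = v$, whose solution is $x = v / \|v\|^{2/3}$. A hedged sketch of this closed form (our own adaptation of the formula in [8, Proposition 5.1]; all names are ours):

```python
import numpy as np

def soft_threshold(u, tau):
    return np.sign(u) * np.maximum(np.abs(u) - tau, 0.0)

def bpdca_step(xk, p, lam, theta):
    """One subproblem with h(x) = (1/4)||x||^4 and g = theta*||.||_1.
    p = grad f1(xk) - xi_k; solves ||x||^2 x = S_{lam*theta}(grad h(xk) - lam*p)."""
    u = np.dot(xk, xk) * xk - lam * p          # grad h(xk) - lam * p
    v = soft_threshold(u, lam * theta)
    nv = np.linalg.norm(v)
    if nv == 0.0:
        return np.zeros_like(xk)
    return v / nv ** (2.0 / 3.0)               # then ||x||^2 x = v holds exactly

rng = np.random.default_rng(3)
xk = rng.standard_normal(6)
p = rng.standard_normal(6)
x_new = bpdca_step(xk, p, lam=0.1, theta=0.5)
# stationarity residual on the nonzero components:
res = (np.dot(x_new, x_new) * x_new + 0.1 * (p + 0.5 * np.sign(x_new))
       - np.dot(xk, xk) * xk)[x_new != 0.0]
```

The residual check verifies that the update satisfies the first-order optimality condition of the subproblem on its support.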
In this application, the functions $f_1$, $f_2$, $g$, and $h$ satisfy Assumptions 1 to 8, except Assumptions 4 (i) and 6. In particular, Assumption 4 (i) is not satisfied for our choice $h(x) = \frac{1}{4}\|x\|^4$, but it is satisfied if we replace it by $h(x) = \frac{1}{4}\|x\|^4 + \frac{1}{2}\|x\|^2$. Finally, $\Psi$ and $H_M$ are KL functions due to their semi-algebraicity [1]. Therefore, in this application, Assumption 6 is not required for the global convergence of BPDCAe.

Lower Bound on the L-smooth Adaptable Parameter in the Gaussian Model
We dealt with the following Gaussian model. We generated the elements of $m$ vectors $a_r \in \mathbb{R}^d$ and the ground truth $\bar{x} \in \mathbb{R}^d$, which was a sparse vector (sparsity of 5%), independently from the standard Gaussian distribution. Then, we generated $b_r = \langle a_r, \bar{x}\rangle^2$, $r = 1, 2, \ldots, m$, from $a_r$ and $\bar{x}$.
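The data generation just described can be reproduced in a few lines (our own sketch; the way the 5% sparsity mask is drawn is an assumed implementation detail):

```python
import numpy as np

def gaussian_instance(m, d, sparsity=0.05, seed=0):
    """a_r ~ N(0, I_d) as rows of A; sparse Gaussian ground truth xbar;
    noiseless measurements b_r = <a_r, xbar>^2."""
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((m, d))
    xbar = rng.standard_normal(d)
    mask = rng.random(d) < sparsity          # keep roughly 5% of the entries
    xbar = np.where(mask, xbar, 0.0)
    b = (A @ xbar) ** 2
    return A, xbar, b

A, xbar, b = gaussian_instance(m=200, d=64, seed=7)
```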
From the linearity of the expectation, we consider the expectation of $\nabla^2 f_1$:
$$\mathbb{E}\left[\nabla^2 f_1(x)\right] = 3\sum_{r=1}^m \mathbb{E}\left[\langle a_r, x\rangle^2 a_r a_r^T\right].$$
Since the elements of $a_r$ are independently generated from the standard Gaussian distribution, the $j$-th diagonal element of the above matrix is $3m(\|x\|^2 + 2x_j^2)$, and the off-diagonal $(j, k)$ elements are $6m x_j x_k$; that is, $\mathbb{E}[\nabla^2 f_1(x)] = 3m(\|x\|^2 I + 2xx^T)$. Under the Gaussian model, we can reduce the lower bound of $L$ given in Proposition 5 with high probability by applying [10, Lemma 7.4], as shown in the following proposition.
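The underlying Gaussian identity $\mathbb{E}[\langle a, x\rangle^2 a a^T] = \|x\|^2 I + 2xx^T$ can be checked by Monte Carlo (our own sanity check, with a deliberately loose tolerance):

```python
import numpy as np

# Monte-Carlo check of E[<a,x>^2 a a^T] = ||x||^2 I + 2 x x^T  for a ~ N(0, I_d)
rng = np.random.default_rng(4)
d, n = 3, 200_000
x = np.array([1.0, -0.5, 0.25])
a = rng.standard_normal((n, d))
w = (a @ x) ** 2
emp = (a * w[:, None]).T @ a / n               # empirical average of <a,x>^2 a a^T
exact = np.dot(x, x) * np.eye(d) + 2.0 * np.outer(x, x)
err = np.max(np.abs(emp - exact))              # entrywise deviation
```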

Proposition 6
Let the functions $f_1$ and $h$ be given by (44) and (46), respectively. Moreover, assume that the vectors $a_r$ are independently distributed according to the Gaussian model with a sufficiently large number of measurements. Let $\gamma$ and $\delta$ be fixed positive numerical constants and let $c(\cdot)$ be a sufficiently large numerical constant depending on $\delta$; this means that the number of samples obeys $m \ge c(\delta) \cdot d \log d$ in the Gaussian model. Then, with high probability, for any $L$ satisfying
$$L \ge 3(1 + \delta)m, \qquad (49)$$
the function $Lh - f_1$ is convex on $\mathbb{R}^d$, and hence the pair $(f_1, h)$ is $L$-smad on $\mathbb{R}^d$.

Proof Consider the expectation of $\sum_{r=1}^m \langle a_r, x\rangle^2 a_r a_r^T$. Since the elements of $a_r$ are independently generated from the standard Gaussian distribution, for any $y \in \mathbb{R}^d$, we have
$$\mathbb{E}\left[\sum_{r=1}^m \langle a_r, x\rangle^2 \langle a_r, y\rangle^2\right] = m\left(\|x\|^2\|y\|^2 + 2\langle x, y\rangle^2\right) = m\, y^T\left(\|x\|^2 I + 2xx^T\right)y.$$
From [10, Lemma 7.4], with $m \ge c(\delta)\cdot d \log d$, the matrix $\frac{1}{m}\sum_{r=1}^m \langle a_r, x\rangle^2 a_r a_r^T$ deviates from its expectation $\|x\|^2 I + 2xx^T$ by at most $\delta\|x\|^2$ in operator norm with high probability; hence
$$\sum_{r=1}^m \langle a_r, x\rangle^2 a_r a_r^T \preceq m\left(\|x\|^2 I + 2xx^T\right) + \delta m \|x\|^2 I \preceq (1+\delta)m\left(\|x\|^2 I + 2xx^T\right),$$
where the last relation uses $xx^T \succeq 0$. Since $\nabla^2 h(x) = \|x\|^2 I + 2xx^T$, it follows from (49) that
$$\nabla^2 f_1(x) = 3\sum_{r=1}^m \langle a_r, x\rangle^2 a_r a_r^T \preceq 3(1+\delta)m\,\nabla^2 h(x) \preceq L\,\nabla^2 h(x),$$
which shows that $Lh - f_1$ is convex on $\mathbb{R}^d$, i.e., $(f_1, h)$ is $L$-smad on $\mathbb{R}^d$.
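Taking the Proposition 6 bound as $L = 3(1+\delta)m$ (our reading of (49)) against the deterministic Proposition 5 bound $3\sum_r \|a_r\|^4 \approx 3md^2$ for Gaussian rows, the improvement is roughly a factor of $d^2$. A quick comparison on one instance (our own code; $\delta = 0.1$ is an arbitrary sample value):

```python
import numpy as np

rng = np.random.default_rng(5)
m, d, delta = 1000, 128, 0.1
A = rng.standard_normal((m, d))
L1 = 3.0 * np.sum(np.sum(A ** 2, axis=1) ** 2)   # Proposition 5: 3 * sum ||a_r||^4
L2 = 3.0 * (1.0 + delta) * m                     # Proposition 6 (Gaussian model)
# For Gaussian rows, L1 concentrates near 3*m*(d^2 + 2d), so L1 >> L2.
```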

Remark 1
Since each element of $a_r$ independently follows the standard Gaussian distribution, $\|a_r\|^2$ follows the chi-squared distribution with $d$ degrees of freedom. Thus, $\|a_r\|^2 \ge 3$ holds with high probability for sufficiently large $d$. This implies that the bound given in Proposition 6 is smaller than that given in Proposition 5.

Performance Results for Phase Retrieval with the Gaussian Model
Here, we summarize the results for the Gaussian model. All numerical experiments were performed in Python 3.7 on an iMac with a 3.3 GHz Intel Core i5 Processor and 8 GB 1867 MHz DDR3 memory.
First, let us examine the results for the Bregman proximal-type algorithms, i.e., BPG [8], BPGe [29], BPDCA (Algorithm 1), and BPDCAe (Algorithm 2). We compared the averages of 100 random instances in terms of the number of iterations, CPU time, and accuracy (Tables 1 and 2). Let $\hat{x}$ be a recovered solution and $\bar{x}$ be the ground truth generated according to the method described in Subsection 5.2. In order to compare the objective function values, we took the difference $\log_{10}|\Psi(\hat{x}) - \Psi(\bar{x})|$ to be the accuracy. In the numerical experiments, $\Psi(\hat{x}) > \Psi(\bar{x})$. The termination criterion was $\|x^k - x^{k-1}\| / \max\{1, \|x^k\|\} \le 10^{-6}$. The equation numbers under each algorithm in Tables 1 and 2 indicate the value of $\lambda$; that is, we set $\lambda = 1/L$ for $L$ satisfying the corresponding equations. For the restart schemes, we used the adaptive restart scheme with $\rho = 0.99$ and the fixed restart scheme with $K = 200$. We set $\theta = 1$ for the regularizer $g$ in (43). We forcibly stopped the algorithms when they reached the maximum number of iterations (50,000). Table 2 compares the results of BPGe and BPDCAe under the same settings as those in Table 1. BPDCA with (49) was the fastest among the algorithms without extrapolation (Table 1). On the other hand, the extrapolation technique makes each algorithm faster (Table 2). We can conclude that, at least for phase retrieval, BPDCA has a clear advantage over BPG because of its reformulation as a nonconvex DC optimization problem (44), which permits choosing the smaller $L$ in (47) instead of (45). In particular, for the Gaussian model, we can use the even smaller $L$ in (49) with high probability. The extrapolation technique can further enhance performance. Also, we can see that the iterates of BPDCA(e) globally converge to their optimal solutions even though the kernel generating distance $h$ in (46) does not satisfy Assumption 4 (i). This suggests that this condition may be relaxed in some cases.
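A minimal end-to-end BPDCA run matching the setup above (termination rule $\|x^k - x^{k-1}\|/\max\{1, \|x^k\|\} \le 10^{-6}$, $\theta = 1$) can be sketched as follows. This is our own illustrative implementation, not the authors' code; we use $\lambda = 0.99/L$ with the Proposition 5 bound $L = 3\sum_r \|a_r\|^4$ so that $\lambda L < 1$ holds strictly, and the closed-form subproblem for $h = \frac{1}{4}\|x\|^4$:

```python
import numpy as np

def soft_threshold(u, tau):
    return np.sign(u) * np.maximum(np.abs(u) - tau, 0.0)

def objective(x, A, b, theta):
    return 0.25 * np.sum(((A @ x) ** 2 - b) ** 2) + theta * np.linalg.norm(x, 1)

def bpdca(A, b, x0, theta=1.0, tol=1e-6, max_iter=50_000):
    """Plain BPDCA (no extrapolation) for the phase-retrieval DC split."""
    L = 3.0 * np.sum(np.sum(A ** 2, axis=1) ** 2)   # Proposition 5 bound
    lam = 0.99 / L                                  # step size with lam * L < 1
    x = x0.copy()
    for _ in range(max_iter):
        Ax = A @ x
        grad_f1 = A.T @ (Ax ** 3)       # grad f1(x) = sum <a_r,x>^3 a_r
        xi = A.T @ (b * Ax)             # xi = grad f2(x) = sum b_r <a_r,x> a_r
        u = np.dot(x, x) * x - lam * (grad_f1 - xi)
        v = soft_threshold(u, lam * theta)
        nv = np.linalg.norm(v)
        x_new = np.zeros_like(x) if nv == 0.0 else v / nv ** (2.0 / 3.0)
        if np.linalg.norm(x_new - x) / max(1.0, np.linalg.norm(x_new)) <= tol:
            return x_new
        x = x_new
    return x

rng = np.random.default_rng(6)
A = rng.standard_normal((60, 10))
b = (A @ rng.standard_normal(10)) ** 2
x0 = rng.standard_normal(10)
x_out = bpdca(A, b, x0)
```

Since BPDCA is a descent method when $\lambda L < 1$, the objective value at the output never exceeds the one at the starting point, which gives a simple correctness check.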
Table 1 Average number of iterations, CPU time, and accuracy for BPG [8] and BPDCA using 100 random instances of phase retrieval (over the Gaussian model) for different values of $L$.

Next, we compared the empirical probability of success for BPDCAe and the Wirtinger flow [10], which is a well-known algorithm for phase retrieval. Here, we took $x^0$ in BPDCAe to be the value calculated in the initialization step of the Wirtinger flow. The empirical probability of success in Fig. 1 is an average over 100 trials. We regarded an algorithm as successful if the relative error $\|\hat{x} - \bar{x}\| / \|\bar{x}\|$ fell below $10^{-5}$ after 2,500 iterations. The dimension $d$ was fixed at 128, and we varied the number of measurements $m$. We used the adaptive restart scheme with $\rho = 0.99$ and the fixed restart scheme with $K = 200$. We set $\theta = 0$; i.e., we solved (43) without its regularizer. From the figure, we can see that BPDCAe with the initialization step of the Wirtinger flow achieved an almost 100% success rate when $m/d \ge 6$ and obtained more stable results than those of the Wirtinger flow.

Conclusions
We proposed two Bregman proximal-type algorithms for solving DC optimization problems (P). One is the Bregman Proximal DC Algorithm (BPDCA); the other is BPDCA with extrapolation (BPDCAe). Proximal-type algorithms, including ours, are effective on large-scale problems. In addition, our algorithms assume that the function $f_1$ has the $L$-smooth adaptable property in relation to the kernel generating distance $h$, instead of $L$-smoothness. The restart condition of our adaptive restart scheme differs from the existing ones. We conducted convergence analyses of our algorithms. Assuming the Kurdyka-{\L}ojasiewicz property or subanalyticity of the objective function together with some standard assumptions, we established that the iterates generated by BPDCA(e) globally converge to a limiting stationary point or a limiting critical point, and we derived their convergence rates.

Table 2 Average number of iterations, CPU time, and accuracy for BPGe [29] and BPDCAe using 100 random instances of phase retrieval (over the Gaussian model) for different values of $L$.
We applied our algorithms to phase retrieval. The numerical experiments demonstrated that BPDCAe is faster than the other Bregman proximal-type algorithms. For the Gaussian model, BPDCAe gave more stable results than the Wirtinger flow [10]. We conclude that BPDCAe is a powerful method for solving large-scale and structured DC optimization problems. Although the kernel generating distance $h$ in (46) does not satisfy Assumption 4 (i), the sequences generated by BPDCA(e) converged in the numerical experiments. Therefore, we conjecture that most of the convergence results can be established under weaker conditions. As future work, since $g$ in BPDCA does not need to be convex, we will attempt to prove the monotonicity of the auxiliary function of BPDCAe (Lemma 6) without assuming Assumption 7.
Other Bregman proximal-type algorithms have been proposed. Mukkamala et al. [20] choose the $L$-smad parameter by using a line search. As this parameter is generally difficult to estimate accurately, we could incorporate such a line search into our algorithms.
For constrained problems, Wang et al. [27] proposed the Bregman alternating direction methods with multipliers. Tu et al. [25] also developed a Bregman-type algorithm for solving linearly constrained DC optimization problems. These variational methods may inspire further improvements and extensions.
(ii) From (39), we obtain an inequality in which the last step holds because $h$ is a $\sigma$-strongly convex function and the first two terms are nonnegative. Summing this inequality from $k = 0$ to $\infty$, we obtain a finite sum, which shows that $\lim_{k\to\infty} \|x^{k+1} - x^k\| = 0$ due to $\frac{1}{\lambda} - L > 0$ and $\sup_{k>0} \beta_k < 1$. (iii) Let $\tilde{x}$ be an accumulation point of $\{x^k\}_{k=0}^{\infty}$ and let $\{x^{k_j}\}$ be a subsequence such that $\lim_{j\to\infty} x^{k_j} = \tilde{x}$. Then, from the first-order optimality condition of subproblem (7) under Assumption 3, we have
$$0 \in \partial^c g(x^{k_j+1}) + \nabla f_1(y^{k_j}) - \xi^{k_j} + \frac{1}{\lambda}\left(\nabla h(x^{k_j+1}) - \nabla h(y^{k_j})\right). \qquad (55)$$
Therefore, using $\|x^{k_j+1} - x^{k_j}\| \to 0$ and $\|x^{k_j} - x^{k_j-1}\| \to 0$, we obtain $\lim_{j\to\infty} y^{k_j} = \tilde{x}$. Note that the sequence $\{\xi^{k_j}\}$ is bounded, as shown in Theorem 1 (ii), and the sequence $\{x^{k_j}\}$ is bounded and converges to $\tilde{x}$. Thus, by taking the limit as $j \to \infty$, or more precisely along a further subsequence, we can assume without loss of generality that $\lim_{j\to\infty} \xi^{k_j} =: \tilde{\xi}$ exists, which belongs to $\partial^c f_2(\tilde{x})$ since $f_2$ is continuous. Using this and $y^{k_j} \to \tilde{x}$, we take the limit of (55). Invoking $\|x^{k_j+1} - x^{k_j}\| \to 0$ and the continuity of $g$ and $\nabla f_1$, we obtain $\tilde{\xi} \in \partial^c g(\tilde{x}) + \nabla f_1(\tilde{x})$. Therefore, $0 \in \partial^c g(\tilde{x}) + \nabla f_1(\tilde{x}) - \partial^c f_2(\tilde{x})$, which shows that $\tilde{x}$ is a limiting critical point of (P).