High-order methods beyond the classical complexity bounds, I: inexact high-order proximal-point methods

In this paper, we introduce a \textit{Bi-level OPTimization} (BiOPT) framework for minimizing the sum of two convex functions, where both can be nonsmooth. The BiOPT framework involves two levels of methodologies. At the upper level of BiOPT, we first regularize the objective by a $(p+1)$th-order proximal term and then develop the generic inexact high-order proximal-point scheme and its acceleration using the standard estimation sequence technique. At the lower level, we solve the corresponding $p$th-order proximal auxiliary problem inexactly either by one iteration of the $p$th-order tensor method or by a lower-order non-Euclidean composite gradient scheme with the complexity $\mathcal{O}(\log \tfrac{1}{\varepsilon})$, for the accuracy parameter $\varepsilon>0$. Ultimately, if the accelerated proximal-point method is applied at the upper level, and the auxiliary problem is handled by a non-Euclidean composite gradient scheme, then we end up with a $2q$-order method with the convergence rate $\mathcal{O}(k^{-(p+1)})$, for $q=\lfloor p/2 \rfloor$, where $k$ is the iteration counter.


Introduction
Motivation. Central to the entire discipline of convex optimization is the concept of complexity analysis for evaluating the efficiency of a wide spectrum of algorithms dealing with such problems; see [20,25]. For example, under Lipschitz continuity of the gradient of the objective function, the fastest convergence rate for first-order methods is of order $O(k^{-2})$ for the iteration counter $k$; cf. [4,5,21,23]. Likewise, if the objective is twice differentiable with Lipschitz continuous Hessian, the best complexity for second-order methods is of order $O(k^{-7/2})$; see [7]. In recent years, there has been increasing interest in applying high-order methods to both convex and nonconvex problems; see, e.g., [1,7,10,12,16]. If the objective is $p$-times differentiable with Lipschitz continuous $p$th derivatives, then the fastest convergence rate for $p$th-order methods is of order $O(k^{-(3p+1)/2})$; cf. [7].
In general, for convex problems, the classical setting involves a one-to-one correspondence between methods and problem classes. In other words, there exists an unimprovable complexity bound for a class of methods applied to a class of problems. In fact, under Lipschitz (Hölder) continuity of the $p$th derivatives, a $p$th-order method is called optimal if it attains the convergence rate $O(k^{-(3p+1)/2})$, and if a method attains a faster convergence rate (under stronger assumptions than those of the optimal methods), we call it superfast. For example, first-order methods with the convergence rate $O(k^{-2})$ and second-order methods with the convergence rate $O(k^{-7/2})$ are optimal under Lipschitz (Hölder) continuity of the first and second derivatives, respectively. Recently, in [29], a superfast second-order method with the convergence rate $O(k^{-4})$ was presented, which is faster than the classical lower bound $O(k^{-7/2})$. The latter method is an implementation of a third-order tensor method whose auxiliary problem is handled by a Bregman gradient method requiring only second-order oracles; i.e., the scheme is implemented as a second-order method. We note that this method assumes Lipschitz continuity of the third derivatives, whereas classical second-order methods apply to problems with Lipschitz continuous Hessian. This explains why the convergence rate $O(k^{-4})$ for this method does not contradict the classical complexity theory for second-order methods.
One of the classical methods for solving optimization problems is the proximal-point method, given by
$$x_{k+1} = \arg\min_{y \in E} \Big\{ h(y) + \tfrac{1}{2\lambda}\|y - x_k\|^2 \Big\}$$
for a function $h(\cdot)$, a given point $x_k$, and $\lambda > 0$. The first appearance of this algorithm dates back to 1970 in the works of Martinet [18,19], and it was further studied by Rockafellar [31], where $\lambda$ is replaced by a sequence of positive numbers $\{\lambda_k\}_{k\ge 0}$. Since its first presentation, this algorithm has been the subject of great interest in both Euclidean and non-Euclidean settings, and many extensions have been proposed; see, e.g., [6,9,11,14,15,32]. Recently, Nesterov [28] proposed a bi-level unconstrained minimization (BLUM) framework by defining a novel high-order proximal-point operator using a $p$th-order regularization term,
$$\mathrm{prox}^p_{h/H}(x) = \arg\min_{y \in E} \Big\{ h(y) + \tfrac{H}{p+1}\|y - x\|^{p+1} \Big\};$$
see Section 2 for more details. This framework consists of two levels: the upper level involves a scheme using the high-order proximal-point operator, and the lower level is a scheme for solving the corresponding proximal-point minimization inexactly. One therefore has the freedom to choose the order $p$ of the proximal-point operator and to choose a proper method to approximate the solution of the proximal-point auxiliary problem. Applying this framework to twice-smooth unconstrained problems with $p = 3$, using an accelerated third-order method at the upper level and solving the auxiliary problem by a Bregman gradient method, leads to a second-order method with the convergence rate $O(k^{-4})$. The main goal of this paper is to extend the results of [28] to the composite case.
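To make the classical scheme concrete, here is a minimal one-dimensional sketch (our own illustration, not part of the paper) for $h(y) = |y|$, whose proximal step has the well-known closed form of soft-thresholding:

```python
# Classical proximal-point iteration
#   x_{k+1} = argmin_y { h(y) + (1/(2*lam)) * (y - x_k)**2 }
# for h(y) = |y|; the step has the closed form of soft-thresholding.

def prox_abs(x, lam):
    """Prox of lam*|.| at x: shrink x toward 0 by lam."""
    if x > lam:
        return x - lam
    if x < -lam:
        return x + lam
    return 0.0

def proximal_point(x0, lam=0.25, iters=20):
    x = x0
    for _ in range(iters):
        x = prox_abs(x, lam)
    return x

assert proximal_point(3.0) == 0.0   # reaches the minimizer of |.| exactly
```

Each iterate moves a fixed distance `lam` toward the minimizer $x^* = 0$ and then stays there, illustrating the global (if slow, for fixed $\lambda$) progress of the plain scheme.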

Content
In this paper, we introduce a Bi-level OPTimization (BiOPT) framework that extends the BLUM framework (see [28]) to convex composite minimization. In our setting, the objective function is the sum of two convex functions, both of which can be nonsmooth. In the first step, we regularize the objective function by a power of the Euclidean norm $\|\cdot\|^{p+1}$ with $p \ge 1$, in the same vein as (1.1). The resulting mapping is called the high-order proximal-point operator, which is assumed to be minimized approximately at a reasonable cost. If the first function in our composite objective is smooth enough, we show in Section 2 that this auxiliary problem can be inexactly solved by one step of the $p$th-order tensor method (see Section 2.1). Afterwards, we show that the plain proximal-point method attains the convergence rate $O(k^{-p})$ (see Section 2.2), while its accelerated counterpart obtains the convergence rate $O(k^{-(p+1)})$ (see Section 2.3). We next present our bi-level optimization framework in Section 3, which opens up entirely new ground for developing highly efficient algorithms for simple constrained and composite minimization problems. In the upper level, we can choose the order $p$ of the proximal-point operator and apply both the plain and accelerated proximal-point schemes using the estimation sequence technique. We then assume that the differentiable part of the proximal-point objective is smooth and strongly convex relative to some scaling function (see [9,17]) and design a non-Euclidean composite gradient algorithm using a Bregman distance to solve this auxiliary problem inexactly. It is shown that the latter algorithm stops after $O(\log \frac{1}{\varepsilon})$ iterations, for the accuracy parameter $\varepsilon > 0$. Hence, by choosing a lower-order scaling function for the Bregman distance, it is possible to apply lower-order schemes for solving the auxiliary problem, which leads to lower-order methods in our convex composite setting.
Following our BiOPT framework, we finally pick a constant $p$ for the $p$th-order proximal-point operator and apply the accelerated method to the composite problem at the upper level. Then, we introduce a high-order scaling function and show that the differentiable part of the proximal-point objective is $L$-smooth and $\mu$-strongly convex relative to this scaling function, for $L, \mu > 0$. We consequently apply the non-Euclidean composite gradient method to the auxiliary problem, which only needs the $p$th-order oracle for even $p$ and the $(p-1)$th-order oracle for odd $p$. Therefore, we end up with a high-order method with the convergence rate of order $O(k^{-(p+1)})$ under suitable assumptions. We emphasize that while this convergence rate is faster than the classical lower bound $O(k^{-(3p-2)/2})$ for $p = 3$, it is sub-optimal for other choices of $p$. However, we show that our method can surpass the classical optimal rates for some classes of structured problems. We finally deliver some conclusions in Section 4.

Notation and generalities
In what follows, we denote by $E$ a finite-dimensional real vector space and by $E^*$ its dual space composed of linear functions on $E$. For such a function $s \in E^*$, we denote by $\langle s, x \rangle$ its value at $x \in E$.
Let us measure distances in $E$ and $E^*$ in a Euclidean norm. For that, using a self-adjoint positive-definite operator $B: E \to E^*$ (notation $B = B^* \succ 0$), we define
$$\|x\| = \langle Bx, x \rangle^{1/2}, \quad x \in E, \qquad \|s\|_* = \langle s, B^{-1}s \rangle^{1/2}, \quad s \in E^*.$$
Sometimes, it will be convenient to treat $x \in E$ as a linear operator from $\mathbb{R}$ to $E$, and $x^*$ as a linear operator from $E^*$ to $\mathbb{R}$. In this case, $xx^*$ is a linear operator from $E^*$ to $E$, acting as follows:
$$(xx^*)s = \langle s, x \rangle x, \quad s \in E^*.$$
For a smooth function $f: E \to \mathbb{R}$, denote by $\nabla f(x)$ its gradient and by $\nabla^2 f(x)$ its Hessian evaluated at the point $x \in E$. Note that $\nabla f(x) \in E^*$ and $\nabla^2 f(x)h \in E^*$ for $h \in E$. We denote by $\ell_x(\cdot)$ the linear model of the convex function $f(\cdot)$ at the point $x \in E$ given by
$$\ell_x(y) = f(x) + \langle \nabla f(x), y - x \rangle.$$
Using the above norm, we can define the standard Euclidean prox-functions
$$d_{p+1}(x) = \tfrac{1}{p+1}\|x\|^{p+1},$$
where $p \ge 1$ is an integer parameter. These functions have the following derivatives:
$$\nabla d_{p+1}(x) = \|x\|^{p-1}Bx, \qquad \nabla^2 d_{p+1}(x) = \|x\|^{p-1}B + (p-1)\|x\|^{p-3}Bxx^*B.$$
In what follows, we often work with directional derivatives. For $p \ge 1$, denote by $D^p f(x)[h_1, \ldots, h_p]$ the $p$th directional derivative of $f$ at $x$ along the directions $h_i \in E$; it is a symmetric $p$-linear form. Its norm is defined in the standard way:
$$\|D^p f(x)\| = \max_{h_1, \ldots, h_p} \big\{ |D^p f(x)[h_1, \ldots, h_p]| : \|h_i\| \le 1,\ i = 1, \ldots, p \big\}.$$
If all directions $h_1, \ldots, h_p$ are the same, we apply the notation $D^p f(x)[h]^p$. Note that, in general, we have (see, for example, [30, Appendix 1])
$$\|D^p f(x)\| = \max_{h} \big\{ |D^p f(x)[h]^p| : \|h\| \le 1 \big\}.$$
In this paper, we work with functions from the problem classes $\mathcal{F}_p$, which are convex and $p$ times continuously differentiable on $E$.
Denote by $M_p(f)$ the uniform upper bound for the $p$th derivative:
$$M_p(f) = \sup_{x \in E} \|D^p f(x)\|.$$

Inexact high-order proximal-point methods

Let the function $f: E \to \mathbb{R}$ be closed convex and possibly non-differentiable, and let $\psi: E \to \mathbb{R}$ be a simple closed convex function such that $\mathrm{dom}\,\psi \subseteq \mathrm{int}(\mathrm{dom}\, f)$. We now consider the convex composite minimization problem
$$\min_{x \in \mathrm{dom}\,\psi} \big\{ F(x) := f(x) + \psi(x) \big\}, \tag{2.1}$$
where it is assumed that (2.1) has at least one optimal solution $x^* \in \mathrm{dom}\,\psi$, and we set $F^* = F(x^*)$. This class of problems is general enough to cover many practical problems from application fields such as signal and image processing, machine learning, and statistics. In particular, for a simple closed convex set $Q \subseteq E$, the simple constrained problem
$$\min_{x \in Q} f(x) \tag{2.2}$$
can be rewritten in the form (2.1), i.e., $\min_{x \in E} \{ f(x) + \delta_Q(x) \}$, where $\delta_Q(\cdot)$ is the indicator function of the set $Q$, given by $\delta_Q(x) = 0$ if $x \in Q$ and $+\infty$ otherwise. Let us define the $p$th-order composite proximal-point operator
$$\mathrm{prox}^p_{F/H}(x) = \arg\min_{y} \big\{ F(y) + H d_{p+1}(y - x) \big\} \tag{2.4}$$
for $H > 0$ and $p \ge 1$, which is an extension of the $p$th-order proximal-point operator given in [28]. Moreover, if $p = 1$, it reduces to the classical proximal operator. Our main objective is to investigate the global rate of convergence of high-order proximal-point methods in accelerated and non-accelerated forms, where we approximate the proximal-point operator (2.4) and study the complexity of such an approximation. To this end, let us introduce the set of acceptable solutions of (2.4) by
$$\mathcal{A}^p_H(x, \beta) = \big\{ (T, g) : g \in \partial\psi(T),\ \|\nabla f^p_{x,H}(T) + g\|_* \le \beta \|\nabla f(T) + g\|_* \big\}, \tag{2.5}$$
where $f^p_{x,H}(y) = f(y) + H d_{p+1}(y - x)$ and $\beta \in [0, 1)$ is the tolerance parameter. Note that if $\psi \equiv 0$, then the set $\mathcal{A}^p_H(x, \beta)$ reduces to the inexact acceptable solutions for the problem (2.4) that were recently studied for smooth convex problems in [28]. Let us emphasize that extending the definition of inexact acceptable solutions from [28] to nonsmooth functions is not a trivial task, because not all subgradients $g \in \partial\psi(T)$ satisfy the inequality (2.5). In the more general setting of composite minimization, we address this issue in Section 3.1 using a non-Euclidean composite gradient scheme that suggests which subgradient $g \in \partial\psi(T) \neq \emptyset$ can be explicitly used in (2.5).
Since the function $F(\cdot)$ is convex and $d_{p+1}(\cdot)$ is uniformly convex, the minimization problem (2.4) has a unique solution, which we assume to be computable at reasonable cost. Let us first see how the exact solution $T = \mathrm{prox}^p_{F/H}(x)$ of (2.4) satisfies (2.5). The first-order optimality conditions for (2.4) ensure that
$$0 \in \nabla f(T) + H\|T - x\|^{p-1}B(T - x) + \partial\psi(T).$$
Thus, for $g = H\|T - x\|^{p-1}B(x - T) - \nabla f(T)$, the inequality in (2.5) holds with any $\beta \in [0, 1)$, i.e., $(\mathrm{prox}^p_{F/H}(x), g) \in \mathcal{A}^p_H(x, \beta)$. Furthermore, since $\nabla f^p_{x,H}(x) = \nabla f(x)$, we have $(x, g) \notin \mathcal{A}^p_H(x, \beta)$ unless $x = x^*$. In the next subsection, we show that an acceptable approximation of the operator (2.4) can be computed by applying one step of the $p$th-order tensor method (see [26]) satisfying (2.5), while a lower-order method will be presented in Section 3.1. Let us highlight that we may not be able to find an inexact solution in the sense of (2.5) for all points $x$ in a neighbourhood of the solution $x^*$; however, the exact solution always satisfies this inequality. We study this in the following example.
We first present the following lemma, which is a direct consequence of the definition (2.5) of acceptable solutions.

Lemma 2.2 (properties of acceptable solutions). Let $(T, g) \in \mathcal{A}^p_H(x, \beta)$ for some $g \in \partial\psi(T)$. Then the inequalities (2.8) and (2.9) hold.

Proof. From (2.5) and the reverse triangle inequality, we obtain the inequality (2.7). Squaring both sides of the inequality in (2.5), we arrive at the right-hand side of the inequality (2.10) with $r = \|T - x\|$. From the inequality (2.7), taking the derivative of $\zeta$ at $r$ and using $\beta \le \frac{1}{p}$, together with (2.10), we obtain (2.9).

Solving (2.4) with pth-order tensor methods
In this section, we assume that $f(\cdot)$ is $p$-times differentiable with $M_{p+1}(f) < +\infty$ and show that an acceptable solution satisfying the inequality (2.5) can be obtained by applying one step of the tensor method given in [26].
The Taylor expansion of the function $f(\cdot)$ at $x \in E$ is denoted by
$$\Omega_{x,p}(y) = f(x) + \sum_{i=1}^{p} \tfrac{1}{i!} D^i f(x)[y - x]^i,$$
and it holds that
$$\|\nabla f(y) - \nabla \Omega_{x,p}(y)\|_* \le \tfrac{M_{p+1}(f)}{p!}\|y - x\|^p.$$
Let us define the augmented Taylor approximation as
$$\Omega_{x,p}(y) + \tfrac{M}{(p+1)!}\|y - x\|^{p+1} + \psi(y),$$
which is a uniform upper bound for $F(\cdot)$ whenever $M \ge M_{p+1}(f)$.
In the case $M \ge pM_{p+1}(f)$, the augmented Taylor approximation is convex, as confirmed by [26, Theorem 1], which implies that one is able to minimize the problem (2.1) by the tensor step (2.12). We next show that an approximate solution of (2.12) can be employed as an acceptable solution of the proximal-point operator (2.4) via the inexact $p$th-order tensor method proposed in [13,26].
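For intuition, here is a one-dimensional sketch (our own, with an illustrative constant $M$) of a single tensor step for $p = 2$, i.e., a cubic-regularized Newton step, on $f(x) = x^4$ with $\psi \equiv 0$; the convex model is minimized by ternary search:

```python
# One tensor step for p = 2 on f(x) = x**4: minimize the augmented model
#   Omega(y) = f(x) + f'(x)(y-x) + f''(x)(y-x)^2/2 + (M/6)|y-x|^3.
# M = 200 is an illustrative choice; since f'' + M|h| > 0 the model is
# convex, so a ternary search finds its minimizer.

def cubic_newton_step(x, M=200.0, lo=-2.0, hi=2.0, tol=1e-10):
    f1 = 4 * x**3          # f'(x)
    f2 = 12 * x**2         # f''(x)
    def model(y):
        h = y - x
        return f1 * h + f2 * h * h / 2 + M * abs(h)**3 / 6
    while hi - lo > tol:   # ternary search on the convex model
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if model(m1) < model(m2):
            hi = m2
        else:
            lo = m1
    return (lo + hi) / 2

x0 = 1.0
x1 = cubic_newton_step(x0)
assert x1**4 < x0**4       # a single tensor step decreases f
```

The step moves from $x_0 = 1$ toward the minimizer $x^* = 0$ while keeping the model an upper bound on $f$, which is the mechanism the tensor method exploits.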
We note that, setting $M = \frac{1+\beta}{\beta(1-\gamma)-\gamma} M_{p+1}(f)$ and $H = \frac{M}{p!}$, the inequality (2.14) can be rewritten in the form (2.5). In order to illustrate the results of Lemma 2.3, we study the following one-dimensional example.
Example 2.4. Let us consider the minimization of the one-dimensional function $F: \mathbb{R} \to \mathbb{R}$ given by $F(x) = x^4 + |x|$, where $x^* = 0$ is its unique solution. In the setting of the problem (2.1), we have $f(x) = x^4$ and $\psi(x) = |x|$. Let us set $p = 3$, i.e., we have $M_4(f) = 24$, and take $M = 1.9 M_4(f)$. Setting $\gamma = \frac{8}{19} \in [0, \frac{9}{19})$ and $x = 0.8$, we illustrate the feasible area and the acceptable solutions in Subfigures (a) and (b) of Figure 2, respectively. We note that with our choice of $\gamma$ and $M$, we have $(1 - \gamma)M > M_4(f)$, which implies that all assumptions of Lemma 2.3 are valid. In Section 3, we further extend our discussion concerning the computation of an acceptable solution in $\mathcal{A}^p_H(x, \beta)$ for the $p$th-order proximal-point problem (2.4) by lower-level methods.
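The data of Example 2.4 can be checked numerically; the following sketch (ours) solves the proximal subproblem for $p = 3$, $x = 0.8$, $H = M/3! = 7.6$ by ternary search and verifies that the exact solution makes the left-hand side of (2.5) vanish:

```python
# Numerical check of Example 2.4: f(y) = y**4, psi(y) = |y|, p = 3,
# M = 1.9 * M_4(f) = 45.6 and H = M / 3! = 7.6. The exact solution T of
# the proximal subproblem makes the residual in (2.5) vanish, so (T, g)
# with g a subgradient of |.| is acceptable for every beta in [0, 1).

H, x = 45.6 / 6.0, 0.8

def phi(y):  # f(y) + psi(y) + H * d_4(y - x)
    return y**4 + abs(y) + H * (y - x)**4 / 4

lo, hi = 0.0, 0.8
while hi - lo > 1e-12:   # ternary search on the convex subproblem
    m1, m2 = lo + (hi - lo) / 3, hi - (hi - lo) / 3
    if phi(m1) < phi(m2):
        hi = m2
    else:
        lo = m1
T = (lo + hi) / 2
g = 1.0                  # subgradient of |.| at T > 0
residual = abs(4 * T**3 + H * (T - x)**3 + g)   # LHS of (2.5)
assert residual < 1e-5
assert residual <= 0.3 * abs(4 * T**3 + g)      # (2.5) with beta = 0.3
```

Here the choice $\beta = 0.3$ is an illustrative tolerance; for the exact minimizer the inequality holds for any $\beta \in [0, 1)$, as noted after (2.5).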

Inexact high-order proximal-point method
In this section, we introduce our inexact high-order proximal-point method for the composite minimization problem (2.1) and verify its rate of convergence. Our first inexact high-order proximal-point scheme generates a sequence of iterates satisfying
$$(x_{k+1}, g_{k+1}) \in \mathcal{A}^p_H(x_k, \beta), \quad k \ge 0, \tag{2.15}$$
which we summarize in Algorithm 1.
Algorithm 1: Inexact High-Order Proximal-Point Algorithm

In order to verify the convergence rate of Algorithm 1, we need the next lemma, which was proved in [27, Lemma 11].
Lemma 2.5 ([27, Lemma 11]). Let $\{\xi_k\}_{k\ge 0}$ be a sequence of positive numbers satisfying the condition (2.16) for $\alpha \in (0, 1]$. Then, for $k \ge 0$, the bound (2.17) holds.

Let us investigate the rate of convergence of Algorithm 1, first defining the radius of the initial level set of the function $\psi$ in (2.1).

Theorem 2.6 (convergence rate of Algorithm 1). Let the sequence $\{x_k\}_{k\ge 0}$ be generated by the inexact high-order proximal-point method (2.15) with $\beta \in [0, 1/p]$. Then, for $k \ge 0$, the bound (2.18) holds.

Proof. From the convexity of $\psi(\cdot)$ and (2.9), with $g \in \partial\psi(x_{k+1})$ and $(x_{k+1}, g) \in \mathcal{A}^p_H(x_k, \beta)$, and by the Cauchy–Schwarz inequality, it follows from the last two inequalities that the condition (2.16) is satisfied for all $k \ge 0$ with $\alpha = 1/p$. Therefore, from Lemma 2.5, we obtain (2.18).
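Continuing the example $F(x) = x^4 + |x|$, the following sketch (ours, with exact prox steps computed by ternary search, which are acceptable for any $\beta \in [0, 1/p]$) runs the iteration (2.15) and exhibits the monotone decrease of $F$:

```python
# Sketch of the proximal-point iteration (2.15) in one dimension:
# F(x) = x**4 + |x|, p = 3, H = 7.6 (illustrative instance). Each prox
# subproblem is solved essentially exactly by ternary search.

H = 7.6

def prox_step(x):
    def phi(y):
        return y**4 + abs(y) + H * (y - x)**4 / 4
    lo, hi = min(0.0, x), max(0.0, x)   # minimizer lies between x and x* = 0
    while hi - lo > 1e-12:
        m1, m2 = lo + (hi - lo) / 3, hi - (hi - lo) / 3
        if phi(m1) < phi(m2):
            hi = m2
        else:
            lo = m1
    return (lo + hi) / 2

F = lambda x: x**4 + abs(x)
x, vals = 0.8, []
for _ in range(10):
    x = prox_step(x)
    vals.append(F(x))
assert all(vals[i + 1] <= vals[i] for i in range(len(vals) - 1))
assert vals[-1] < 1e-6    # monotone decrease toward F* = 0
```

The monotone decrease reflects the defining property of the prox step: the new iterate minimizes $F$ plus a nonnegative regularizer, so $F(x_{k+1}) \le F(x_k)$ always holds.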

Accelerated inexact high-order proximal-point method
In this section, we accelerate the scheme (2.15) by applying a variant of the standard estimating sequences technique, which has been used as a standard tool for accelerating first- and second-order methods; see, e.g., [2,8,21,22,23,24,25].
Let $\{A_k\}_{k\ge 0}$ be a sequence of positive numbers generated by $A_{k+1} = A_k + a_{k+1}$ for $a_{k+1} > 0$. The idea of the estimating sequences technique is to generate a sequence of estimating functions $\{\Psi_k(x)\}_{k\ge 0}$ of $F(\cdot)$ in such a way that, at each iteration $k \ge 0$, the inequality (2.19) holds. Following [28,29], we choose the coefficients according to (2.20). For $x_0, y_k \in E$ and $(T_k, g) \in \mathcal{A}^p_H(y_k, \beta)$, we define the estimating sequence by (2.21).

Proof. The proof is by induction on $k$. For $k = 0$, we have $\Psi_0 = d_{p+1}(x - x_0)$, and so (2.22) holds.
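Schematically, the estimating-sequence mechanism behind the accelerated rate can be summarized as follows (our paraphrase of the standard argument; the precise constants appear in the proofs):

```latex
% Invariants maintained by the estimating sequence (cf. (2.19), (2.22)):
%   A_k F(x_k) \le \Psi_k^* := \min_x \Psi_k(x),
%   \Psi_k(x)  \le A_k F(x) + d_{p+1}(x - x_0) \quad \forall x \in \mathrm{dom}\,\psi.
% Evaluating the second bound at x = x^* and chaining,
%   A_k F(x_k) \le \Psi_k(x^*) \le A_k F^* + d_{p+1}(x^* - x_0),
% hence
F(x_k) - F^* \;\le\; \frac{d_{p+1}(x^* - x_0)}{A_k},
% so any growth A_k = \Omega(k^{p+1}) yields the rate O(k^{-(p+1)}).
```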
We now assume that (2.22) holds for $k$ and show it for $k + 1$. It then follows from (2.21) and the subgradient inequality that the desired bound holds. We next present an accelerated version of the scheme (2.15).
Algorithm 2: Accelerated Inexact High-Order Proximal-Point Algorithm, where at each iteration $A_{k+1}$ and $a_{k+1}$ are computed by (2.20).

In the subsequent result, we investigate the convergence rate of the sequence generated by the accelerated inexact high-order proximal-point method (Algorithm 2).

Theorem 2.8 (convergence rate of Algorithm 2). Let the sequence $\{x_k\}_{k\ge 0}$ be generated by Algorithm 2 with $\beta \in [0, 1/p]$. Then the following statements hold:

Proof. We first show by induction that (2.19) holds. Since $A_0 = 0$ and $\Psi_0 = d_{p+1}(x - x_0)$, it clearly holds for $k = 0$. We now assume that the inequality (2.19) holds for $k \ge 0$ and prove it for $k + 1$. From (2.22), the induction assumption $\Psi_k^* \ge A_k F(x_k)$, and the subgradient inequality, we obtain the required estimate for all $x \in \mathrm{dom}\,\psi$. It follows from (2.9) that, combining the last three inequalities, we arrive at (2.24). On the other hand, from (2.20) it can be deduced that, together with (2.24) and $F(T_k) \ge F(x_{k+1})$, this ensures $\Psi_{k+1}^* \ge A_{k+1} F(x_{k+1})$, i.e., assertion (i) holds.

Bi-level optimization framework
As we have seen in the previous sections, solving the convex composite problem (2.1) by an inexact high-order proximal-point method involves two steps: (i) choosing a $p$th-order proximal-point method as an upper-level scheme; (ii) choosing a lower-level method for computing a point $T \in \mathcal{A}^p_H(x, \beta)$. This gives us two degrees of freedom in the strategy of finding a solution to the problem (2.1), which is why we call this framework Bi-level OPTimization (BiOPT). At the upper level, we do not need to impose any assumption on the objective $F(\cdot)$ apart from its convexity. At the lower level, however, we need some additional assumptions on this objective function. Moreover, in the BiOPT setting, the complexity of a scheme depends on the complexities of both the upper- and lower-level methods.
On the basis of the results of Section 2.1, the auxiliary problem (2.4) can be solved by applying one step of the $p$th-order tensor method. This demands the computation of the $i$th ($i = 1, \ldots, p$) directional derivatives of the function $f(\cdot)$ and the condition (2.13), which might not be practical in general. Therefore, we can instead apply a lower-order method to the auxiliary problem (2.4), which leads to an efficient implementation of the BiOPT framework. This is the main motivation of the following sections.

Non-Euclidean composite gradient method
Let us assume that $k$ is a fixed iteration of either Algorithm 1 or Algorithm 2, and that we need to compute an acceptable solution $z_k$ of (2.4) satisfying (2.5). To do so, we introduce a non-Euclidean composite gradient method and analyze the convergence properties of the sequence $\{z_i\}_{i\ge 0}$ generated by this scheme, which satisfies the inequality (2.5) in the limit. Our main tool for these developments is the relative smoothness condition (see [9,17] for more details and examples).
Notice that an acceptable solution of the auxiliary problem (2.4) requires that the function $f^p_{y_k,H}(\cdot) + \psi(\cdot)$ be minimized approximately, delivering a point in $\mathrm{dom}\,\psi$ for which the inequality (2.5) holds for a given subgradient $g \in \partial\psi$.
Let us consider a simple example in which $f: \mathbb{R} \to \mathbb{R}$ with $f \equiv 0$, $y_k = 0$, and $H = 1$. Then the function $f^2_{0,H}: \mathbb{R} \to \mathbb{R}$ is given by $f^2_{0,H}(z) = \frac{1}{3}|z|^3$ with $\nabla f^2_{0,H}(z) = |z|z$, which is not Lipschitz continuous. This shows that one cannot expect Lipschitz smoothness of $f^p_{y_k,H}(\cdot)$ for $p \ge 2$. However, it can be shown that this function belongs to the wider class of relatively smooth functions.
Let the function $\rho: E \to \mathbb{R}$ be closed, convex, and differentiable; we call it a scaling function. The non-symmetric Bregman distance $\beta_\rho: E \times E \to \mathbb{R}$ with respect to $\rho$ is given by
$$\beta_\rho(x, y) = \rho(y) - \rho(x) - \langle \nabla\rho(x), y - x \rangle.$$
For $x, y, z \in E$, it is easy to see (e.g., the proof of Lemma 3 in [27]) that
$$\beta_\rho(x, z) = \beta_\rho(x, y) + \beta_\rho(y, z) + \langle \nabla\rho(y) - \nabla\rho(x), z - y \rangle.$$
A function $h$ is called $L_h$-smooth relative to $\rho(\cdot)$ if $(L_h \rho - h)(\cdot)$ is convex, and we call it $\mu_h$-strongly convex relative to $\rho(\cdot)$ if there exists $\mu_h > 0$ such that $(h - \mu_h \rho)(\cdot)$ is convex; cf. [9,17]. The constant $\kappa_h = \mu_h/L_h$ is called the condition number of $h(\cdot)$ relative to the scaling function $\rho(\cdot)$.
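A quick numerical sanity check (ours) of the Bregman distance and the three-point identity quoted above, for the scaling function $\rho = d_4$ in $\mathbb{R}^2$:

```python
# Bregman distance beta_rho(x, y) = rho(y) - rho(x) - <grad rho(x), y - x>
# for rho = d_4, i.e. rho(x) = ||x||^4 / 4 with grad rho(x) = ||x||^2 x,
# together with the three-point identity
#   beta_rho(x, z) = beta_rho(x, y) + beta_rho(y, z)
#                    + <grad rho(y) - grad rho(x), z - y>.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def rho(x):
    return dot(x, x)**2 / 4

def grad_rho(x):
    n2 = dot(x, x)
    return [n2 * xi for xi in x]

def breg(x, y):
    return rho(y) - rho(x) - dot(grad_rho(x), [b - a for a, b in zip(x, y)])

x, y, z = [1.0, -0.5], [0.3, 0.7], [-1.2, 0.4]
assert breg(x, y) >= 0          # nonnegativity: convexity of rho
lhs = breg(x, z)
rhs = breg(x, y) + breg(y, z) + dot(
    [gy - gx for gx, gy in zip(grad_rho(x), grad_rho(y))],
    [zi - yi for yi, zi in zip(y, z)])
assert abs(lhs - rhs) < 1e-12   # three-point identity
```

The identity is purely algebraic (the $\rho(y)$ and $\langle \nabla\rho(x), \cdot\rangle$ terms telescope), which is why it holds to machine precision for any choice of points.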
In the following lemma, we characterize the latter two conditions.
In this subsection, for the sake of generality, we assume the existence of the scaling function ρ(•) such that the conditions (H1)-(H2) hold; however, in Section 3.2 we introduce a specific scaling function satisfying (H1)-(H2).
We are now in a position to develop a non-Euclidean composite gradient scheme for minimizing (3.1) based on the assumptions (H1)-(H2). For given $y_k, z_i \in \mathrm{dom}\,\psi$ and $H, L > 0$, we introduce the non-Euclidean composite gradient scheme
$$z_{i+1} = \arg\min_{z} \big\{ \langle \nabla f^p_{y_k,H}(z_i), z - z_i \rangle + \psi(z) + L\,\beta_\rho(z_i, z) \big\}, \tag{3.4}$$
which is a first-order method; the point $z^*_k$ denotes the optimal solution of (3.4). Note that the first-order optimality conditions for (3.4) lead to the variational principle (3.5). For the sequence $\{z_i\}_{i\ge 0}$ generated by the scheme (3.4), we next show the monotonicity of the sequence $\{\varphi_k(z_i)\}_{i\ge 0}$.

Lemma 3.2 (non-Euclidean composite gradient inequalities). Let $\{z_i\}_{i\ge 0}$ be generated by the scheme (3.4). Then the inequalities (3.6) and (3.7) hold.

Proof. Since $z_{i+1}$ is a solution of (3.4), this together with the $L$-smoothness of $f^p_{y_k,H}(\cdot)$ relative to $\rho(\cdot)$ gives (3.6). Setting $x = z_{i+1}$ and $y = z_i$ in the three-point inequality (3.3) and applying the inequality (3.5), we arrive at (3.7).
In summary, we come to the following non-Euclidean composite gradient algorithm.
Algorithm 3: Non-Euclidean Composite Gradient Algorithm

We now assume that the auxiliary problem (3.4) can be solved exactly. For the sequence $\{z_i\}_{i\ge 0}$ given by (3.4), we stop the scheme as soon as
$$\|\nabla f^p_{y_k,H}(z_{i+1}) + g\|_* \le \beta \|\nabla f(z_{i+1}) + g\|_*$$
holds, and we then set $z_k = z_{i+1}$. In the remainder of this section, we show that this stopping criterion holds for $i$ large enough.
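The following self-contained one-dimensional sketch (our construction; the constants and the scaling function are illustrative choices, not the paper's $\rho_{y_k,H}$) runs the iteration (3.4) with the stopping criterion above, for $f(z) = z^4$, $\psi = |\cdot|$, $p = 3$, $y_k = 0.8$:

```python
# 1-D sketch of the lower-level scheme (3.4) with the stopping rule:
# phi(z) = f^3_{y,H}(z) = z**4 + H * (z - y)**4 / 4 with H = 7.6.
# As an illustrative scaling function we take rho(z) = (z**4 + (z-y)**4)/4;
# one checks (L*rho - phi)'' = 10.8 * z**2 >= 0 for L = 7.6, so phi is
# L-smooth relative to rho on the whole line.

y, H, L, beta = 0.8, 7.6, 7.6, 0.25

def phi_prime(z):
    return 4 * z**3 + H * (z - y)**3

def f_prime(z):
    return 4 * z**3

def rho(z):
    return (z**4 + (z - y)**4) / 4

def rho_prime(z):
    return z**3 + (z - y)**3

def bregman_step(zi):
    # z_{i+1} = argmin <phi'(z_i), z - z_i> + psi(z) + L * beta_rho(z_i, z)
    def model(z):
        return (phi_prime(zi) * (z - zi) + abs(z)
                + L * (rho(z) - rho(zi) - rho_prime(zi) * (z - zi)))
    lo, hi = -2.0, 2.0
    while hi - lo > 1e-12:          # ternary search on the convex model
        m1, m2 = lo + (hi - lo) / 3, hi - (hi - lo) / 3
        if model(m1) < model(m2):
            hi = m2
        else:
            lo = m1
    znew = (lo + hi) / 2
    # the step's optimality condition delivers the subgradient g (cf. (3.5))
    g = -phi_prime(zi) - L * (rho_prime(znew) - rho_prime(zi))
    return znew, g

z, stopped = y, False
for _ in range(200):
    z, g = bregman_step(z)
    if abs(phi_prime(z) + g) <= beta * abs(f_prime(z) + g):  # stopping rule
        stopped = True
        break
assert stopped and abs(g) <= 1 + 1e-6
```

Note that the subgradient $g$ is not chosen arbitrarily: it is the one produced by the optimality condition of the step, which is exactly the mechanism by which the lower-level scheme selects a usable $g \in \partial\psi$ for (2.5).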
Setting $z = z_0$ in the inequality (3.7), it follows from the $(p+1)$-uniform convexity of $\rho(\cdot)$ that the iterates remain in a bounded convex set, which we define accordingly. The next result shows that the sequence $\{\mathrm{dist}(0, \partial\varphi_k(z_i))\}_{i\ge 0}$ vanishes, for $\{z_i\}_{i\ge 0}$ generated by Algorithm 3; for doing so, we also require an additional condition.
We now show the well-definedness and complexity of Algorithm 3 in the subsequent result.

Theorem 3.4 (well-definedness of Algorithm 3). Assume that all conditions of Lemma 3.3 hold, and let $\{z_i\}_{i\ge 0}$ be a sequence generated by Algorithm 3, where $x^*$ is a minimizer of $F$ and $\varepsilon > 0$ is the accuracy parameter. Moreover, assume that there exists a constant $D > 0$ such that $\|z_i - x^*\| \le D$ for all $i \ge 0$. Then, for the corresponding subgradients and $z_{i^*_k} \in \mathrm{dom}\,\psi$, the maximum number of iterations $i^*_k$ needed to guarantee the stopping inequality is of order $O(\log \frac{1}{\varepsilon})$, with a constant depending on $C$ defined in (3.9).

Bi-level high-order methods
In the BiOPT framework, we here consider Algorithm 2 using the $p$th-order proximal-point operator at the upper level, while at the lower level we solve the auxiliary problem by the high-order non-Euclidean composite gradient method described in Algorithm 3. As such, our proposed algorithm only needs the $p$th-order oracle for even $p$ and the $(p-1)$th-order oracle for odd $p$, and it attains the complexity of order $O(\varepsilon^{-1/(p+1)})$.
In the remainder of this section, we set $p \ge 2$ and $q = \lfloor p/2 \rfloor$. Let us define the scaling function $\rho_{y_k,H}: E \to \mathbb{R}$ by (3.16), which is uniformly convex of degree $p + 1$; this is not a trivial result. For $p = 3$, this function reduces to the one given in [28]. Owing to this foundation, we can show that the function $f^p_{y_k,H}(\cdot)$ is $L$-smooth and $\mu$-strongly convex relative to the scaling function $\rho_{y_k,H}(\cdot)$, which paves the way toward algorithmic developments. We begin by showing the uniform convexity of $\rho_{y_k,H}(\cdot)$. To this end, we need the $p$th-order Taylor expansion of the function $f$ around $y \in \mathrm{dom}\, f$ given by (3.17), for $x \in \mathrm{dom}\, f$ and $\Omega_{y,p}(x) = f(y) + \sum_{i=1}^{p} \frac{1}{i!} D^i f(y)[x - y]^i$; see [26, Theorem 1].
Theorem 3.5 (uniform convexity and smoothness of $\rho_{y_k,H}(\cdot)$). For any $h = x - y_k \in E$ and $\xi > 1$, if $p \ge 2$ and $q = \lfloor p/2 \rfloor$, then the bounds (3.19) hold. Moreover, the function $\rho_{y_k,H}(\cdot)$ given in (3.16) is uniformly convex of degree $p + 1$. Indeed, replacing $h$ by $\xi h$ in the last inequality, dividing by $\xi^{p-2}$ for $\xi > 1$, and splitting the sum into its odd and even terms, we arrive at the left-hand side of (3.19). Replacing $h$ by $-h$, we obtain the right-hand side of (3.19).
From the $p$th-order Taylor expansion of the function $f$ at $y_k$, (3.18), (3.19), and (1.3), together with the convexity of $f(\cdot)$ and (3.23), we obtain the convexity of $\rho_{y_k,H}(\cdot)$. Moreover, its uniform convexity of degree $p + 1$ follows from that of $d_{p+1}(\cdot)$. It follows from (3.17) that the corresponding remainders satisfy $\|r_{p+1}(\pm h)\| \le \frac{M_{p+1}}{(p-1)!}\|h\|^{p-1}$. Summing up the latter identities and the above facts, Theorem 3.5 clearly implies that the assumptions (H1) and (H4) are satisfied for the scaling function $\rho_{y_k,H}(\cdot)$ given in (3.16). In the subsequent result, we show that the assumption (H2) also holds for this function.
Theorem 3.6 (relative smoothness and strong convexity of $f^p_{y_k,H}(\cdot)$). Let $H \ge M_{p+1}(f)$, $p \ge 2$, and $q = \lfloor p/2 \rfloor$. Then the function $f^p_{y_k,H}(\cdot)$ is smooth and strongly convex relative to $\rho_{y_k,H}(\cdot)$, where $\xi$ is the unique solution of the quadratic equation appearing in the statement.

Proof. In light of the definition of $f^p_{y_k,H}(\cdot)$, the $p$th-order Taylor expansion (3.17) and (3.19) yield the desired result.
Motivated by the equations (3.23), in the remainder of this section we fix the constants accordingly. Additionally, in view of (2.20), we consider the corresponding coefficient sequence. We now present our accelerated high-order method by combining all of the above facts with Algorithm 2, leading to the following algorithm.
Algorithm 4: Bi-Level High-Order Algorithm

Now, let us take a look at the optimality conditions for the auxiliary problem (3.4) with our $p$th-order proximal-point operator, which should be solved exactly in our setting. We next translate this inclusion for the convex constrained problem (2.2).
Example 3.7. We here revisit the convex constrained problem (2.2) and its unconstrained reformulation. For given $z_i \in E$, writing the first-order optimality conditions for this problem leads to an inclusion in which $\partial\psi(z_{i+1}) = N_Q(z_{i+1})$; therefore, the normal cone plays a crucial role in finding a solution of the auxiliary problem (3.4). As an example, let us consider the Euclidean ball $Q = \{x \in \mathbb{R}^n : \|x\| \le \delta\}$, for which
$$N_Q(z) = \begin{cases} \{0\}, & \|z\| < \delta, \\ \{\alpha B z : \alpha \ge 0\}, & \|z\| = \delta. \end{cases}$$
We now set $p = 3$ and consider two cases: (i) $\|z_{i+1}\| < \delta$; (ii) $\|z_{i+1}\| = \delta$. In Case (i), the inclusion determines $z_{i+1}$ directly, where $r = \|z_{i+1} - y_k\|$ can be computed by solving a one-dimensional equation. In Case (ii) ($\|z_{i+1}\| = \delta$), there exists $\alpha > 0$ such that the inclusion holds, where $r = \|z_{i+1} - y_k\|$ and $\alpha$ are obtained by solving a two-dimensional nonlinear system. Finally, we arrive at the solution $z_{i+1}$ with $r$ and $\alpha$ computed from the above-mentioned nonlinear systems.
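A small numerical check (ours, taking $B = I$) of the normal cone of the Euclidean ball used above: every vector $\alpha z$ with $\alpha \ge 0$ at a boundary point $z$ satisfies $\langle \alpha z, y - z \rangle \le 0$ for all $y \in Q$:

```python
# Normal-cone check for the ball Q = {x : ||x|| <= delta} with B = I:
# at a boundary point z (||z|| = delta), the inequality
#   <alpha * z, y - z> <= 0  for all y in Q
# follows from <z, y> <= ||z|| * ||y|| <= delta**2 = <z, z>.
import math

def in_normal_cone(z, alpha, y, delta):
    """Check <alpha*z, y - z> <= 0 for a boundary point z and y in Q."""
    assert abs(math.sqrt(sum(v * v for v in z)) - delta) < 1e-9
    return sum(alpha * zi * (yi - zi) for zi, yi in zip(z, y)) <= 1e-12

delta = 1.5
z = [delta / math.sqrt(3.0)] * 3          # boundary point of the ball
for y in ([0.0, 0.0, 0.0], [1.5, 0.0, 0.0], [-0.5, 0.7, 0.2], z):
    assert in_normal_cone(z, 2.0, y, delta)
```

This is the defining variational inequality of $N_Q$, and it is exactly the condition that couples $\alpha$ to the nonlinear system in Case (ii).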
In order to upper bound the Bregman term $\beta_{\rho_k}(\cdot, \cdot)$, we next define norm-dominated scaling functions, a notion that will be needed in the remainder of this section.
for all x ∈ S and y ∈ E.
From now on, and for the sake of simplicity, we denote $\rho_{y_k,H}(\cdot)$ by $\rho_k(\cdot)$. In order to show that the scaling function $\rho_k(\cdot)$ given in (3.16) is norm-dominated, we first need the following technical lemma.

Lemma 3.9 (norm-dominatedness on the Euclidean ball). Let $p \ge 2$ and $q = \lfloor p/2 \rfloor$. Then the function $d_{p+1}(\cdot)$ is norm-dominated on the Euclidean ball $B_R = \{x \in E : \|x\| \le R\}$, with the corresponding bound holding for all $\tau \ge 0$.
Proof. Let us first assume that $p$ is odd, i.e., $p = 2q + 1$. For $x \in B_R$ and $y = x + h \in E$, it follows from the inequality $(a^{1/t} + b^{1/t})^t \le 2^{t-1}(a + b)$, valid for $a, b \ge 0$ and $t \ge 1$, together with $x \in B_R$, that the claimed bound holds. For even $p$, i.e., $p = 2q$ with $q \ge 1$, $x \in B_R$, and $y = x + h \in E$, the same elementary inequality combined with $x \in B_R$ yields the corresponding bound. To further simplify our upper bounds, for $p = 2q + 1$ we search for constants $\alpha_1, \beta_1 > 1$ such that the bound holds. Minimizing the right-hand side of this inequality with respect to $\tau$ leads to the optimal point $\tau_1 = \big(\frac{(\beta_1 - 1) d_1}{(\alpha_1 - 1) a_1}\big)^{\frac{1}{2q+2}}$. Substituting this into the last inequality, we obtain (3.29) for $p = 2q + 1$. On the other hand, for $p = 2q$, we look for constants $\alpha_2, \beta_2 > 1$ such that the analogous inequality holds. Minimizing its right-hand side with respect to $\tau$ and substituting the resulting point into the last inequality leads to the remaining inequalities.
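The elementary inequality invoked in the proof can be spot-checked numerically (our sketch):

```python
# Check of the elementary inequality used in the proof of Lemma 3.9:
#   (a**(1/t) + b**(1/t))**t <= 2**(t-1) * (a + b)
# for a, b >= 0 and t >= 1 (a consequence of the power-mean inequality).
for a in [0.0, 0.3, 1.0, 7.5]:
    for b in [0.0, 0.2, 2.0, 9.1]:
        for t in [1.0, 1.5, 2.0, 4.0]:
            lhs = (a ** (1 / t) + b ** (1 / t)) ** t
            assert lhs <= 2 ** (t - 1) * (a + b) + 1e-12
```

Equality is attained at $a = b$ (e.g., $a = b = 1$, $t = 2$ gives $4 \le 4$), which is why the constant $2^{t-1}$ cannot be improved.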
We now have all the ingredients to address the complexities of the upper and lower levels of Algorithm 4, which is the main result of this section. To this end, for the auxiliary minimization problem (3.4), we assume the required conditions hold; the upper-level method then needs the stated number of iterations, for the accuracy parameter $\varepsilon > 0$, and the auxiliary problem (3.4) is approximately solved by Algorithm 3 in at most the stated number of iterations.
Example 3.12. Let us consider a vector $b \in \mathbb{R}^N$, vectors $a_i \in \mathbb{R}^n$, and univariate functions $f_i: \mathbb{R} \to \mathbb{R}$ that are four times continuously differentiable, for $i = 1, \ldots, N$. Then we define the function $f: \mathbb{R}^n \to \mathbb{R}$ as $f(x) = \sum_{i=1}^{N} f_i(\langle a_i, x \rangle - b_i)$.
In the same way, for $p = 4$, we need the fourth-order oracle of $f_i(\cdot)$, for $i = 1, \ldots, N$. Moreover, Theorem 3.11 ensures that the sequence generated by Algorithm 4 attains the complexity $O(\varepsilon^{-1/5})$ for $p = 4$ and $O(\varepsilon^{-1/6})$ for $p = 5$, which are worse than the optimal complexity $O(\varepsilon^{-2/13})$, for the accuracy parameter $\varepsilon$. On the other hand, setting $h = x - y_k$, the required derivative terms simplify. Let us in particular evaluate these terms for $f_i(x) = -\log(x)$ ($i = 1, \ldots, N$) for $x \in (0, +\infty)$. In this case, the implementation of Algorithm 4 with $p = 4$ and $p = 5$ only requires the second-order oracle of $f_i(\cdot)$ ($i = 1, \ldots, N$) and the first-order oracle of $\psi(\cdot)$. Therefore, we end up with a second-order method with complexity of order $O(\varepsilon^{-1/5})$ for $p = 4$ and $O(\varepsilon^{-1/6})$ for $p = 5$, which is much faster than the second-order methods' optimal bound $O(\varepsilon^{-2/7})$; moreover, choosing the odd order $p = 5$, Algorithm 4 attains the better complexity of the two.
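The key structural point, that for $f_i(t) = -\log(t)$ the fourth derivative $6/t^4$ equals $6(f_i''(t))^2$, so fourth-order information is available from the second-order oracle, can be checked numerically (our sketch, using a finite-difference stencil):

```python
# For f(t) = -log(t): f''(t) = 1/t**2 and f''''(t) = 6/t**4 = 6*(f''(t))**2,
# so the fourth derivative is a simple function of the second. We verify
# this with a standard 5-point central finite-difference stencil.
import math

def f2(t):   # second derivative of -log(t)
    return 1.0 / t**2

def f4_fd(t, h=1e-2):   # fourth derivative by central finite differences
    f = lambda s: -math.log(s)
    return (f(t - 2 * h) - 4 * f(t - h) + 6 * f(t)
            - 4 * f(t + h) + f(t + 2 * h)) / h**4

for t in [0.5, 1.0, 2.0]:
    assert abs(f4_fd(t) - 6 * f2(t)**2) < 1e-2 * 6 * f2(t)**2
```

The same pattern (even derivatives expressible through lower-order ones) is what lets Algorithm 4 run with only a second-order oracle in this structured case.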

Conclusion
In this paper, we suggested BiOPT, a novel bi-level optimization framework for solving convex composite minimization problems, which generalizes the BLUM framework given in [28] and involves two levels of methodologies. In the upper level, we only assume convexity of the objective function and design upper-level schemes using proximal-point iterations of arbitrary order. In the lower level, on the other hand, we need to solve the proximal-point auxiliary problem inexactly by some lower-level scheme. In this step, we require additional properties of the objective function in order to develop efficient algorithms providing acceptable solutions for this auxiliary problem at a reasonable computational cost. The overall complexity of the method is the product of the complexities of both levels. We developed the plain $p$th-order inexact proximal-point method and its acceleration using the estimation sequence technique, which achieve the convergence rates $O(k^{-p})$ and $O(k^{-(p+1)})$, respectively, for the iteration counter $k$. Assuming the $L$-smoothness and $\mu$-strong convexity of the differentiable part of the proximal-point objective relative to some scaling function (for $L, \mu > 0$), we designed a non-Euclidean composite gradient method to inexactly solve the proximal-point problem. It turns out that this method attains the complexity $O(\log \frac{1}{\varepsilon})$, for the accuracy parameter $\varepsilon > 0$.
In the BiOPT framework, we applied the accelerated $p$th-order proximal-point algorithm at the upper level, introduced a new high-order scaling function, showed that the differentiable part of the auxiliary objective is smooth and strongly convex relative to this function, and solved the auxiliary problem by a non-Euclidean composite gradient method at the lower level. We consequently arrive at a bi-level high-order method with complexity of order $O(\varepsilon^{-1/(p+1)})$, which surpasses the classical complexity bound of second-order methods for $p = 3$, as was known from [28]. In general, for $p = 2$ and $p \ge 3$, the complexity of our bi-level method is sub-optimal; however, we showed that for some classes of structured problems it can surpass the classical complexity bound $O(\varepsilon^{-2/(3p+1)})$. Overall, the BiOPT framework paves the way toward methodologies that use the $p$th-order proximal-point operator at the upper level while requiring oracles of order lower than $p$ at the lower level. Owing to this framework, we can therefore design lower-order methods with convergence rates surpassing the classical complexity bounds for convex composite problems. This opens up entirely new ground for developing novel efficient algorithms for convex composite optimization, which was not possible within the classical complexity theory.
Several extensions of our framework are possible. As an example, we will present an extension using a segment search in the upcoming article [3]. Moreover, the proximal-point auxiliary problem can be solved by more efficient methods such as the non-Euclidean Newton-type method presented in [6]. In addition, the introduced high-order scaling function can be employed to extend the second-order methods presented in [26,27,28,29] to higher-order methods.
[Figure caption: acceptable solutions for x = 1.4.]
This leads to (2.22) for $k + 1$. The right-hand side inequality in (2.22) is a direct consequence of the definition of $\Psi_k(\cdot)$ and (1.4).