Newton-MR: Inexact Newton Method With Minimum Residual Sub-problem Solver

We consider a variant of inexact Newton Method, called Newton-MR, in which the least-squares sub-problems are solved approximately using Minimum Residual method. By construction, Newton-MR can be readily applied for unconstrained optimization of a class of non-convex problems known as invex, which subsumes convexity as a sub-class. For invex optimization, instead of the classical Lipschitz continuity assumptions on gradient and Hessian, Newton-MR's global convergence can be guaranteed under a weaker notion of joint regularity of Hessian and gradient. We also obtain Newton-MR's problem-independent local convergence to the set of minima. We show that fast local/global convergence can be guaranteed under a novel inexactness condition, which, to our knowledge, is much weaker than the prior related works. Numerical results demonstrate the performance of Newton-MR as compared with several other Newton-type alternatives on a few machine learning problems.



Introduction
For a twice differentiable function f : R^d → R, consider the unconstrained optimization problem

min_{x ∈ R^d} f(x).  (1)

The canonical example of second-order methods is arguably the classical Newton's method, whose iterations are typically written as

x_{k+1} = x_k − α_k [H(x_k)]^{−1} g(x_k),  (2)

where x_k, g(x_k) = ∇f(x_k), H(x_k) = ∇²f(x_k), and α_k are, respectively, the current iterate, the gradient, the Hessian matrix, and the step-size, which is often chosen using an Armijo-type line-search to enforce sufficient decrease in f [76]. When f is smooth and strongly convex, it is well known that the local and global convergence rates of the classical Newton's method are, respectively, quadratic and linear [18,73,76]. For such problems, the Hessian matrix is uniformly positive definite. As a result, if forming the Hessian matrix explicitly and/or solving the linear system in (2) exactly is prohibitive, the update direction p_k is obtained approximately using the conjugate gradient method (CG), resulting in the celebrated Newton-CG [76]. Newton-CG is elegantly simple in that its iterations merely involve solving linear systems followed by a certain type of line-search. Despite its simplicity, it has been shown to enjoy various desirable theoretical and algorithmic properties, including insensitivity to problem ill-conditioning [81,94] as well as robustness to hyper-parameter tuning [15,60]. However, in the absence of strong convexity, Newton's method lacks any favorable convergence analysis. Indeed, recall that to obtain the update direction p_k, Newton-CG aims at (approximately) solving quadratic sub-problems of the form [76]

min_{p ∈ R^d} ⟨g(x_k), p⟩ + ½ ⟨p, H(x_k) p⟩.  (3)
However, if the Hessian is indefinite (as in non-convex settings), or if it is positive semi-definite but g(x_k) ∉ Range(H(x_k)) (as in weakly convex problems), then (3) is simply unbounded below.
In this light, many Newton-type variants have been proposed which aim at extending Newton-CG beyond strongly convex problems to more general settings, e.g., trust-region [32] and cubic regularization [25,26,75]. However, many of these extensions involve subproblems that are themselves non-trivial to solve, e.g., the sub-problems of trust-region and cubic regularization methods are themselves non-linear and (potentially) non-convex [92,93]. This has prompted many authors to design effective methods for approximately solving such sub-problems, e.g., [16,24,25,34,52,61,85]. These methods can be highly effective in many settings. For example, the trust-region method coupled with CG-Steihaug [85] has shown great promise in several machine learning applications [93]. However, solving such non-trivial sub-problems can be a major challenge in many other settings. For instance, despite the obvious advantages of the CG-Steihaug method, e.g., one matrix-vector product per iteration and the ability to naturally detect negative curvature directions, it is not guaranteed to solve the trust-region sub-problem to an arbitrary accuracy. This can have negative consequences for the convergence speed of the trust-region method. Indeed, if either of the negative curvature or the boundary is encountered too early, the CG-Steihaug method terminates and the resulting step is only slightly, if at all, better than the Cauchy direction [32]. If this occurs too often, the trust-region method's convergence can slow down to be comparable to that of the simple gradient descent method.
Recently, [68] proposed a novel Newton-type method whose sub-problems are linear systems and which, as long as these sub-problems are solved exactly, has provably fast global convergence for weakly convex objectives. To allow for optimization of a larger class of problems beyond convex, while maintaining the simplicity offered by linear sub-problems, one can consider replacing (3) with an equally simple ordinary least-squares (OLS) problem

min_{p ∈ R^d} ½ ‖H(x_k) p + g(x_k)‖².  (4)
Clearly, regardless of the Hessian matrix, there is always a solution (in fact, at times, infinitely many solutions) to the above OLS problem. For example, the exact minimum-norm solution to (4) amounts to iterations of the form

x_{k+1} = x_k − α_k [H(x_k)]† g(x_k),  (5)

where A† denotes the Moore-Penrose generalized inverse of a matrix A, and α_k is an appropriately chosen step-size. Our aim in this paper is to give an in-depth treatment of the above alternative and study the convergence properties of the resulting algorithm, which we call Newton-MR, under a variety of settings.
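To make the iteration above concrete, it can be sketched numerically with the pseudo-inverse and a back-tracking line-search on the squared gradient norm (the line-search used later in the paper). This is a minimal illustrative sketch, not the authors' implementation; the function name `newton_mr_exact` and all parameter defaults are our own choices:

```python
import numpy as np

def newton_mr_exact(grad, hess, x0, rho=1e-4, tol=1e-8, max_iter=100):
    """Hypothetical sketch of exact Newton-MR: p_k = -H_k^+ g_k, with an
    Armijo-type back-tracking line-search on ||g||^2 rather than on f."""
    x = np.asarray(x0, dtype=float).copy()
    for _ in range(max_iter):
        g, H = grad(x), hess(x)
        if np.linalg.norm(g) <= tol:
            break
        p = -np.linalg.pinv(H) @ g                       # minimum-norm OLS solution
        alpha, slope = 1.0, 2.0 * rho * (p @ (H @ g))    # slope <= 0 by construction
        while (np.linalg.norm(grad(x + alpha * p))**2 > g @ g + alpha * slope
               and alpha > 1e-16):
            alpha /= 2.0                                 # back-track
        x = x + alpha * p
    return x

# Weakly convex example: rank-deficient least squares, so the Hessian is singular.
A = np.array([[1., 0., 0., 0.], [0., 2., 0., 0.], [1., 2., 0., 0.]])
b = A @ np.ones(4)
x = newton_mr_exact(lambda x: A.T @ (A @ x - b), lambda x: A.T @ A, np.zeros(4))
print(np.linalg.norm(A.T @ (A @ x - b)))   # gradient norm driven to ~0
```

Note that the classical update (2) would fail here, since A⊤A is not invertible, while the pseudo-inverse step is well defined.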
The rest of this paper is organized as follows. We end Section 1 by introducing the notation used throughout the paper. In Section 2, we take a look at high-level consequences of using a direction from (an approximation of) (4). Convergence properties of Newton-MR are gathered in Section 3. This is done, first, by an extensive discussion of the various assumptions underlying the analysis in Section 3.1, followed by a detailed theoretical development to obtain local and global convergence of Newton-MR in Section 3.2.

Notation

In what follows, vectors and matrices are denoted by bold lowercase and bold uppercase letters, e.g., v and V, respectively. We use regular lowercase and uppercase letters to denote scalar constants, e.g., c or L. The transpose of a real vector v is denoted by v⊤. For two vectors v, w, their inner product is denoted by ⟨v, w⟩ = v⊤w. For a vector v and a matrix V, ‖v‖ and ‖V‖ denote the vector ℓ₂ norm and the matrix spectral norm, respectively. For any x, z ∈ R^d, y ∈ [x, z] denotes y = x + τ(z − x) for some 0 ≤ τ ≤ 1. For a natural number n, we denote [n] = {1, 2, …, n}. For any finite collection S, its cardinality is denoted by |S|. The subscript, e.g., x_k, denotes the iteration counter. For a matrix A ∈ R^{d×d}, Range(A) and Null(A) denote, respectively, its range, i.e., Range(A) = {Ax | x ∈ R^d}, and its null-space, i.e., Null(A) = {y ∈ R^d | Ay = 0}. The Moore-Penrose generalized inverse of a matrix A is denoted by A†. The identity matrix is written as I, and the indicator function of a set S is denoted by 1_{S}. When the minimum is attained at more than one point, "Arg min" denotes the set of minimizers; otherwise, "arg min" implies a unique minimizer. Finally, we use g(x) ≜ ∇f(x) and H(x) ≜ ∇²f(x) for the gradient and the Hessian of f at x, respectively, and at times we drop the dependence on x by simply writing g and H, e.g., g_k = g(x_k) and H_k = H(x_k).

Newton-MR
Before providing a detailed look into the convergence implications of using (4), let us first study higher-level consequences of such a choice as they relate to solving (1).

Minimum residual sub-problem solver
For solving OLS problems, a plethora of iterative solvers exists. Each of these iterative methods can be highly effective once applied to the class of problems for which they have been designed. Among them, the one method which has been specifically designed to effectively obtain the least-norm solution for all systems, compatible or otherwise, involving real symmetric but potentially indefinite/singular matrices (such as those in (4)) is MINRES-QLP [29,30]. In fact, by requiring only marginally more computations/storage, it has been shown that MINRES-QLP is numerically more stable than MINRES [79] on ill-conditioned problems.
Perhaps a natural question at this stage is: "what are the descent properties of the (approximate) solution to (4) obtained using MINRES-QLP?" Recall that the t-th iteration of MINRES-QLP yields a search direction p_k^{(t)} which satisfies [49]

p_k^{(t)} ∈ Arg min_{p ∈ K_t(H_k, g_k)} ‖H_k p + g_k‖,  (6)

where K_t(H_k, g_k) denotes the t-th Krylov sub-space generated by H_k and g_k. Since 0 ∈ K_t(H_k, g_k), it follows that for any t in (6), we have ‖H_k p_k^{(t)} + g_k‖ ≤ ‖g_k‖, which in turn implies ⟨p_k^{(t)}, H_k g_k⟩ ≤ 0. In other words, p_k^{(t)} is a descent direction for ‖g(x)‖² at x_k, and we take p_k = p_k^{(t)} for some t ≥ 1. As a consequence of this particular descent property of p_k, the step-size α_k can be chosen using an Armijo-type line-search [76] to reduce the norm of the gradient. Hence, we choose α_k ≤ 1 as the largest value such that, for some 0 < ρ < 1,

‖g(x_k + α_k p_k)‖² ≤ ‖g_k‖² + 2 ρ α_k ⟨p_k, H_k g_k⟩.  (7)

This can be approximately achieved using any standard back-tracking line-search strategy.
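The two properties just described, residual monotonicity and descent for ‖g‖², can be checked numerically. The sketch below uses SciPy's `minres` as a convenient stand-in for MINRES-QLP (an assumption on our part: MINRES-QLP itself is not in SciPy, but for a nonsingular symmetric H both methods produce the same Krylov iterates in exact arithmetic):

```python
import numpy as np
from scipy.sparse.linalg import minres

rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((6, 6)))
H = Q @ np.diag([3., 1., 0.5, -0.5, -1., -2.]) @ Q.T   # symmetric indefinite
g = rng.standard_normal(6)

# Truncated MINRES iterate p^(t): minimizes ||H p + g|| over K_t(H, g).
p, _ = minres(H, -g, maxiter=3)
r = H @ p + g

# Residual monotonicity: 0 is in the Krylov subspace, so ||r|| <= ||g||.
assert np.linalg.norm(r) <= np.linalg.norm(g) + 1e-12
# Descent for ||g(x)||^2 / 2: <p, H g> = ||r||^2 - ||g||^2 <= 0.
assert p @ (H @ g) <= 1e-10
```

The second assertion is exactly the descent property used to drive the line-search (7).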

Invexity
From the above discussion, we always have ⟨p_k, H_k g_k⟩ ≤ 0, which, in light of the line-search (7), implies that

‖g(x_{k+1})‖ ≤ ‖g(x_k)‖.  (8)

In other words, by appropriate application of the Hessian, and regardless of the non-convexity of the problem, one can always obtain descent directions corresponding to the auxiliary problem

min_{x ∈ R^d} ½ ‖g(x)‖².  (9)

In certain applications, e.g., chemical physics [2,66] and deep learning [46-48], instead of minimizing the function, the goal is to find zeros of its gradient field, which can be obtained by solving (9). However, when the goal is solving (1), a natural question is: "what is the class of objective functions for which (9) is equivalent to (1)?" Clearly, any global optimum of (1) is also a solution to (9). However, the converse only holds for a special class of non-convex functions, known as invex [55,69].
Definition 1 (Invexity). Suppose f : R^d → R is differentiable. Then, f is said to be invex if there exists a vector-valued function φ : R^d × R^d → R^d such that

f(y) − f(x) ≥ ⟨φ(y, x), g(x)⟩, ∀ x, y ∈ R^d.  (10)

The class of functions satisfying (10) with the same φ is denoted by F_φ. The class of all invex functions is denoted by F.
Invexity characterizes the class of functions for which the first order optimality condition is also sufficient. It is also easily seen that, by choosing φ(y, x) = y − x in Definition 1, differentiable convex functions are subsumed in the class of invex functions; see [14,69] for detailed treatments of invexity. Much like all generalized convex functions [23], invexity has been introduced in order to weaken, as much as possible, the convexity requirements in optimization problems, and extend many results related to convex optimization theory to more general functions.
Invexity has recently garnered attention in the machine learning community. Restricted to a special subclass, often referred to as Polyak-Łojasiewicz functions (cf. Section 3.2.3), numerous works have established the convergence of first-order optimization algorithms, e.g., [7,37,57,59,78,90]. In addition, it has recently been shown that several deep learning models enjoy such invex properties in certain regimes, e.g., [6,64]. For more general non-convex machine learning models, [33] proposes a novel regularization framework, which ensures the invexity of the regularized function and is shown to be advantageous over the classical 2 -regularization in terms of the sensitivity to hyperparameter tuning and out-of-sample generalization performance.
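As a concrete illustration of a non-convex function in the Polyak-Łojasiewicz (PL) subclass mentioned above, consider the scalar function f(x) = x² + 3 sin²(x), a standard example from the PL literature, with PL constant commonly quoted as μ = 1/32. The grid-based verification below is our own sketch:

```python
import numpy as np

# f(x) = x^2 + 3 sin^2(x): non-convex (f'' = 2 + 6 cos(2x) < 0 near x = pi/2),
# yet Polyak-Lojasiewicz, so every stationary point is a global minimizer.
f  = lambda x: x**2 + 3 * np.sin(x)**2
df = lambda x: 2 * x + 3 * np.sin(2 * x)

xs = np.linspace(-8.0, 8.0, 4001)
xs = xs[np.abs(xs) > 1e-6]        # avoid 0/0 at the minimizer x* = 0, f(0) = 0
mu = 1 / 32

# PL inequality: 0.5 * |f'(x)|^2 >= mu * (f(x) - f(0)), with f(0) = 0.
assert np.all(0.5 * df(xs)**2 >= mu * f(xs))
# Non-convexity: the sampled second difference is negative somewhere.
assert np.min(np.diff(f(xs), 2)) < 0
```

Since the PL inequality forces every stationary point to be a global minimum, such a function is in particular invex.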

Connections to root finding algorithms
Considering (9) in lieu of g(x) = 0 is, in fact, a special case of a more general framework for solving non-linear systems of equations involving a vector-valued function F : R^m → R^n. More specifically, as an alternative to solving F(x) = 0, minimization of ‖F(x)‖ has been considered extensively in the literature, e.g., [10,11,28,74,80,97,100]. Consequently, Newton-MR can be regarded as a member of the class of inexact Newton methods with line-search for solving non-linear systems [35,40,41,70]. In our case, the non-linear system of equations arises from the optimality condition g(x) = 0, i.e., F(x) = g(x), which has a symmetric Jacobian. Hence, through the perspective of the application of MINRES within its iterations, Newton-MR can be viewed as a special case of the more general Newton-GMRES algorithms for solving non-linear systems [1,12,13,19,20,58].
In light of these connections, the bare-bones iterations of Newton-MR do not constitute novel elements of our work here. What sets our contributions in this paper apart is the study of the plethora of desirable theoretical and algorithmic properties of Newton-MR in the context of (9) and its connection to (1). In particular, we give a thorough treatment of the implications of the assumptions that we make, as well as the novel inexactness conditions that we propose, as they relate to the convergence of Newton-MR. For example, in most prior works on Newton-GMRES, the Jacobian of the non-linear function is assumed full-rank [12,13,20,91]. In our setting, this amounts to assuming that the Hessian matrix is invertible, which can easily be violated in non-convex settings. In fact, to address rank-deficient cases, the ubiquitous approach has been to employ regularization techniques such as Levenberg-Marquardt [8,9,36,45,53,95]. By introducing a novel assumption on the interplay between the gradient and the sub-space spanned by the Hessian matrix, we obviate the need for full-rank assumptions or Hessian regularization techniques. As another example, to guarantee convergence, we will introduce a novel inexactness condition that constitutes a major relaxation of the typical relative-residual conditions used, almost ubiquitously, in prior works.

Theoretical analysis
In this section, we study the convergence properties of some variants of Newton-MR (Algorithms 1 and 2). For this, in Section 3.1, we first give the assumptions underlying our analysis. Under these assumptions, in Section 3.2, we provide detailed local/global convergence analysis of these Newton-MR variants.

Assumptions
For the theoretical analysis of Newton-MR, we make the following blanket assumptions regarding the properties of the objective function f in (1). Some of these assumptions might seem unconventional at first; however, they are simply generalizations of many typical assumptions made in the related literature. For example, a strongly convex function with Lipschitz continuous gradient and Hessian satisfies all of the assumptions in this section.
Assumption 1 (Differentiability). The function f : R^d → R is twice differentiable. In particular, all the first partial derivatives are themselves differentiable, but the second partial derivatives are allowed to be discontinuous. Recall that requiring the first partial derivatives to be differentiable implies the equality of crossed partial derivatives, which amounts to a symmetric Hessian matrix [56, pp. 732-733].
In the literature on the analysis of non-convex Newton-type methods for (1), to obtain non-asymptotic quantitative convergence rates, it is typically assumed that the function is sufficiently smooth, in that its gradient and Hessian are Lipschitz continuous. Specifically, for some 0 ≤ L_g < ∞ and 0 ≤ L_H < ∞, and for all x, y ∈ R^d, it is assumed that

‖g(x) − g(y)‖ ≤ L_g ‖x − y‖,  (11a)
‖H(x) − H(y)‖ ≤ L_H ‖x − y‖.  (11b)

For Newton-MR, smoothness is instead required for the auxiliary function ‖g(x)‖², which amounts to a more relaxed smoothness condition.

Assumption 2 (Moral-smoothness). For a given x_0 ∈ R^d, let X_0 ≜ {x ∈ R^d | ‖g(x)‖ ≤ ‖g(x_0)‖}. There are constants 0 < β and 0 < L(x_0) < ∞ such that

‖H(y) g(y) − H(x) g(x)‖ ≤ L(x_0) ‖y − x‖^β, ∀ x, y ∈ X_0.  (12)
Note that in Assumption 2, the constant L(x_0) depends on the choice of x_0. We recall that, for β = 1, Assumption 2 is similar to the assumption that is often made in the analysis of many non-linear least-squares algorithms for minimizing ‖F(x)‖² for some non-linear mapping F(x), in which ∇‖F(x)‖² is assumed Lipschitz continuous (on a level set). We refer to Assumption 2 as "moral-smoothness" since, by (12), it is only the action of the Hessian on the gradient that is required to be Lipschitz continuous; the gradient and/or the Hessian individually can be highly irregular, e.g., the gradient can be very non-smooth and the Hessian can even be discontinuous.
Example 1 (A morally-smooth function with discontinuous Hessian). Consider a quadratically smoothed variant of the hinge-loss function,

f(x) = ½ (max{0, b⟨a, x⟩})²,

for a given (a, b) ∈ R^d × R. It is easy to see that

g(x) = b² ⟨a, x⟩ 1_{b⟨a,x⟩ > 0} a,  H(x) = b² 1_{b⟨a,x⟩ > 0} a a⊤,

where N ≜ {x ∈ R^d | b⟨a, x⟩ = 0}. Clearly, the Hessian is discontinuous on N. Since N is a set of measure zero, i.e., μ(N) = 0 with respect to the Lebesgue measure μ, we can arbitrarily define H(x) ≜ 0 for x ∈ N (in fact, any other definition would work as well). It follows that for any x, y ∈ R^d, we have

‖H(x)g(x) − H(y)g(y)‖ = |b|³ ‖a‖³ |b⟨a, x⟩ 1_{b⟨a,x⟩ > 0} − b⟨a, y⟩ 1_{b⟨a,y⟩ > 0}| ≤ b⁴ ‖a‖⁴ ‖x − y‖.

The last inequality follows by considering the four cases with 1_{b⟨a,x⟩ > 0} = 1_{b⟨a,y⟩ > 0} and 1_{b⟨a,x⟩ > 0} ≠ 1_{b⟨a,y⟩ > 0}. Indeed, for the former two cases, the last inequality follows immediately. For the latter two cases, suppose, without loss of generality, that 1_{b⟨a,x⟩ > 0} = 1 and 1_{b⟨a,y⟩ > 0} = 0. In this case, using the fact that b⟨a, y⟩ ≤ 0, we get b⟨a, x⟩ ≤ b⟨a, x⟩ − b⟨a, y⟩ = b⟨a, x − y⟩ ≤ |b| ‖a‖ ‖x − y‖.

It is easy to show that (12) is implied by (11), and hence (12) constitutes a relaxation of the typical smoothness requirements on the gradient and the Hessian commonly found in the literature.

Lemma 1 (Moral-smoothness (12) is less strict than smoothness (11)). Suppose (11) holds. Then, for any x_0, we have (12) with β = 1 and L(x_0) = L_g² + L_H ‖g(x_0)‖.

Proof. Suppose (11) is satisfied and fix any x_0 with the corresponding X_0 as defined in Assumption 2. Since for any two matrices A, B and two vectors x, y, we have ‖Ax − By‖ ≤ ‖A‖ ‖x − y‖ + ‖A − B‖ ‖y‖, it follows that for any x, y ∈ X_0,

‖H(x)g(x) − H(y)g(y)‖ ≤ ‖H(x)‖ ‖g(x) − g(y)‖ + ‖H(x) − H(y)‖ ‖g(y)‖ ≤ (L_g² + L_H ‖g(x_0)‖) ‖x − y‖,

where the last inequality follows since ‖H(x)‖ ≤ L_g, ‖g(x) − g(y)‖ ≤ L_g ‖x − y‖, ‖H(x) − H(y)‖ ≤ L_H ‖x − y‖, and ‖g(y)‖ ≤ ‖g(x_0)‖ since y ∈ X_0. □
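The computation in Example 1 can be sanity-checked numerically. The sketch below assumes the reading f(x) = ½(max{0, b⟨a, x⟩})² of the smoothed hinge and verifies the claimed Lipschitz bound with constant L = b⁴‖a‖⁴ on random pairs of points, including pairs straddling the discontinuity set N:

```python
import numpy as np

rng = np.random.default_rng(1)
a, b = rng.standard_normal(4), -1.3        # fixed data (a, b), chosen arbitrarily

def Hg(x):
    """H(x) g(x) for f(x) = 0.5 * max(0, b<a,x>)^2, with H := 0 on the kink set N."""
    u = b * (a @ x)
    return (b**4) * (a @ x) * (a @ a) * a * (u > 0)

L = b**4 * np.linalg.norm(a)**4            # claimed moral-smoothness constant
for _ in range(2000):
    x, y = rng.standard_normal(4), rng.standard_normal(4)
    lhs = np.linalg.norm(Hg(x) - Hg(y))
    assert lhs <= L * np.linalg.norm(x - y) + 1e-9
```

Note the individual Hessian H(x) is discontinuous, yet the product H(x)g(x) is globally Lipschitz, which is exactly what (12) asks for.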

Example 2 (Smoothness (11) is not implied by moral-smoothness (12)).

1. For 0 < β < 1, consider a solution to the ODE (13a), where b ∈ R and a ∈ R₊. Such a solution satisfies (12) with 0 < β < 1 and L = 1. It is easily verified that f′(x) and f″(x) are both unbounded; hence, f violates (11). Note that, on (x_0, ∞), the solution of (13a) is convex, and hence, by definition, it is invex.
2. For β = 1, in one variable, the condition (12) reads |f″(x)f′(x) − f″(y)f′(y)| ≤ L |x − y|. For b ∈ R and a ∈ R₊₊, the solution to the ODE (13b) satisfies (12) with β = L = 1. It can be easily verified that f′ and f″ are unbounded, and so f does not satisfy (11a) or (11b). Similarly, the solution to the ODE (13c) satisfies (12) with β = L = 1; however, since f′ and f″ are both unbounded, such a function again satisfies neither (11a) nor (11b).
3. For β > 1, the solution to the ODE (13d) shows that there are non-trivial functions which can satisfy (12) for β > 1. In addition, since f′ and f″ are both unbounded, such an f does not satisfy either assumption in (11). It is clear that, on (c/2, ∞), the solution of (13d) is convex, and hence it is invex.

Assumption 2 allows the Hessian to grow unboundedly; however, it implies a certain restriction on the growth rate of the action of the Hessian on the gradient, i.e., H(x)g(x).

Lemma 2 (Growth rate of H(x)g(x)). Under Assumption 2, for any x ∈ X_0,

‖H(x)g(x)‖ ≤ L(x_0) ‖x − x_0‖^β + ‖H(x_0)g(x_0)‖,

where L(x_0), β, and X_0 are as defined in Assumption 2.
We also require the following regularity on the pseudo-inverse of the Hessian matrix.

Assumption 3 (Pseudo-inverse regularity). For any x ∈ X_0, there is a constant 0 < γ(x_0) < ∞ such that

‖[H(x)]†‖ ≤ 1/γ(x_0),  (14)

where X_0 is as in Assumption 2.
It turns out that (14) is equivalent to the regularity of the Hessian matrix on its range space.

Lemma 3. Assumption 3 holds if and only if, for any x ∈ X_0,

‖H(x) p‖ ≥ γ(x_0) ‖p‖, ∀ p ∈ Range(H(x)).  (15)

The proof of Lemma 3 follows simply by considering the eigenvalue decomposition of H. Assumption 3 is trivially satisfied by all strongly convex functions. However, it is also satisfied by any function whose (possibly rank-deficient) Hessian is uniformly positive definite on the sub-space spanned by its range.

Example 3 (Functions satisfying Assumption 3). A simple example of a non-strongly convex function satisfying Assumption 3 is the underdetermined least-squares problem

f(x) = ½ ‖Ax − b‖², with A ∈ R^{n×d}, b ∈ R^n, and n < d.

This problem is clearly only weakly convex, since the Hessian matrix A⊤A ∈ R^{d×d} is rank-deficient. However, it is easy to see that (14) holds with γ(x_0) = γ = σ_n²(A), where σ_n is the smallest non-zero singular value of A.
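The claim of Example 3 is easy to verify numerically; the following sketch (our own illustrative check) confirms that ‖[H]†‖ = 1/σ_n²(A) for a random full-row-rank A:

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 3, 7
A = rng.standard_normal((n, d))       # full row rank with probability 1
H = A.T @ A                           # rank-deficient Hessian: rank(H) = n < d

sigma = np.linalg.svd(A, compute_uv=False)   # n non-zero singular values
gamma = sigma[-1]**2                          # smallest non-zero eigenvalue of H

# (14): the spectral norm of the pseudo-inverse equals 1/gamma.
assert np.isclose(np.linalg.norm(np.linalg.pinv(H), 2), 1.0 / gamma)
```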
Finally, we make the following structural assumption about f with regards to the gradient and its projection onto the range space of Hessian.

Assumption 4 (Gradient-Hessian null-space property). For any x ∈ R^d, let U_x and U_x^⊥ denote arbitrary orthogonal bases for Range(H(x)) and its orthogonal complement, respectively. A function is said to satisfy the gradient-Hessian null-space property if, for any x_0 ∈ R^d, there is a constant 0 < ν(x_0) ≤ 1 such that

‖U_x⊤ g(x)‖² ≥ ν(x_0) ‖g(x)‖², ∀ x ∈ X_0,  (16)

where X_0 is as in Assumption 2.
Lemma 4 gives some simple, yet useful, consequences of Assumption 4.

Lemma 4. Under Assumption 4, for any x_0 ∈ R^d and any x ∈ X_0, we have

‖U_x U_x⊤ g(x)‖ ≥ √(ν(x_0)) ‖g(x)‖,  (17a)
‖U_x^⊥ (U_x^⊥)⊤ g(x)‖ ≤ √(1 − ν(x_0)) ‖g(x)‖.  (17b)

Proof. For a given x_0 ∈ R^d, take any x ∈ X_0. For (17a), since U_x has orthonormal columns, we have ‖U_x U_x⊤ g(x)‖ = ‖U_x⊤ g(x)‖ ≥ √(ν(x_0)) ‖g(x)‖ by (16). Also, (17b) is obtained similarly, using ‖U_x^⊥ (U_x^⊥)⊤ g(x)‖² = ‖g(x)‖² − ‖U_x U_x⊤ g(x)‖². □
Assumption 4 ensures that, as long as the gradient is non-zero, its angle with the sub-space spanned by the Hessian matrix is uniformly bounded away from zero. In other words, the gradient will never become arbitrarily orthogonal to the range space of Hessian.
Example 4 (Over-parameterized ERM using linear predictor models). Consider an empirical risk minimization (ERM) problem involving linear predictor models [84],

f(x) = Σ_{i=1}^n f_i(⟨a_i, x⟩),  (18)

where a_i ∈ R^d, i = 1, …, n, are given data points and each f_i : R → R is some non-linear misfit or loss corresponding to the i-th data point. Consider the over-parameterized setting (n ≤ d), which has recently garnered significant attention within the machine learning community, e.g., [64,71,86,88]. Assume that the data points are linearly independent, i.e., the matrix A = [a_1, …, a_n]⊤ ∈ R^{n×d} has full row rank. Further, suppose each loss function f_i is such that if f_i″(t) = 0, then we must also have f_i′(t) = 0. Many loss functions satisfy this assumption, e.g., the convex function t^{2p} for any p ≥ 1, so that f_i(⟨a_i, x⟩) = ⟨a_i, x⟩^{2p}. Another example is when each f_i is such that f_i″(t) > 0, ∀t, e.g., the logistic function log(1 + e^{−t}) or any strongly convex f_i. Note that even if each f_i is strongly convex, since n ≤ d, the overall objective f in (18) is still only weakly convex. Define

r(x) ≜ [f_1′(⟨a_1, x⟩), …, f_n′(⟨a_n, x⟩)]⊤, D(x) ≜ diag(f_1″(⟨a_1, x⟩), …, f_n″(⟨a_n, x⟩)).

It is easy to see that g(x) = A⊤ r(x) and H(x) = A⊤ D(x) A. For the gradient to be in the range of the Hessian, we must have, for some v ∈ R^d,

A⊤ D(x) A v = A⊤ r(x).  (19)

Under the assumption on f_i, it can be seen that v = A† [D(x)]† r(x) satisfies (19), where A† = A⊤(A A⊤)^{−1} ∈ R^{d×n}. Hence, Assumption 4 holds with ν(x_0) = ν = 1.
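The range condition (19) of Example 4 can also be verified numerically. The sketch below (our own illustrative check) instantiates f_i(t) = t⁴ and confirms that the gradient is left unchanged by the orthogonal projection onto Range(H), i.e., that ν = 1:

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 4, 9
A = rng.standard_normal((n, d))        # rows a_i; full row rank w.p. 1 (n <= d)
x = rng.standard_normal(d)

t = A @ x                              # <a_i, x>
# f_i(t) = t^4, so f_i'' = 0 only where f_i' = 0, as required in Example 4.
r = 4 * t**3                           # f_i'(<a_i, x>)
D = np.diag(12 * t**2)                 # f_i''(<a_i, x>)
g = A.T @ r                            # gradient of sum_i f_i(<a_i, x>)
H = A.T @ D @ A                        # (weakly convex) Hessian, rank <= n < d

# g lies in Range(H): projecting onto Range(H) leaves it unchanged.
P = H @ np.linalg.pinv(H)              # orthogonal projector onto Range(H)
assert np.linalg.norm(P @ g - g) <= 1e-8 * np.linalg.norm(g)
```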

Convergence analysis
In this section, we discuss the convergence properties of Newton-MR. For this, in Section 3.2.1, we first consider the slightly simpler, yet less practical, case where the subproblems (20) are solved exactly (Algorithm 1). Clearly, this is too stringent, and as a result, in Section 3.2.2, it is subsequently relaxed to allow inexact solutions (Algorithm 2). Convergence under a generalized variant of Polyak-Łojasiewicz inequality will be treated in Section 3.2.3.
The following simple lemma relating to (12) is frequently used in our theoretical analysis. The proof can be found in most textbooks and is given here only for completeness.

Lemma 5. Under Assumption 2, for any x, z such that the segment [x, z] lies in X_0, we have

‖g(z)‖² ≤ ‖g(x)‖² + 2 ⟨H(x) g(x), z − x⟩ + (2 L(x_0)/(1 + β)) ‖z − x‖^{1+β}.

Proof. For any y ∈ [x, z], using the mean value theorem, we have

‖g(z)‖² − ‖g(x)‖² = 2 ∫_0^1 ⟨H(x + τ(z − x)) g(x + τ(z − x)), z − x⟩ dτ
= 2 ⟨H(x) g(x), z − x⟩ + 2 ∫_0^1 ⟨H(x + τ(z − x)) g(x + τ(z − x)) − H(x) g(x), z − x⟩ dτ
≤ 2 ⟨H(x) g(x), z − x⟩ + 2 L(x_0) ‖z − x‖^{1+β} ∫_0^1 τ^β dτ,

which gives the result. □

Exact update
The underlying sub-problem of Newton-MR at the k-th iteration involves an OLS problem of the form (4). Under the invexity assumption, the Hessian matrix can be indefinite and rank-deficient, and as a result, the sub-problem (4) may have infinitely many solutions. Indeed, the general solution of (4) can be written as

p = −[H_k]† g_k + (I − [H_k]† H_k) q, for an arbitrary q ∈ R^d,

which implies that when H_k has a non-trivial null-space, the sub-problem (4) has infinitely many solutions. Among these, the one with the minimum norm is defined uniquely as

p_k = −[H_k]† g_k.  (20)

Algorithm 1 Exact Newton-MR.
Input: Initial iterate x_0, termination tolerance ε > 0, and line-search parameter 0 < ρ < 1.
while ‖g(x_k)‖ > ε do
    Find p_k using (20), i.e., set p_k = −[H_k]† g_k.
    Find the step-size α_k using the line-search (7).
    Set x_{k+1} = x_k + α_k p_k, and k ← k + 1.
end while
Output: x for which ‖g(x)‖ ≤ ε.

Theorem 1 (Convergence of Algorithm 1). Consider Assumptions 1 to 4. For the iterates of Algorithm 1, we have

‖g_{k+1}‖² ≤ (1 − τ(x_0)) ‖g_k‖²,

where 0 < τ(x_0) < 1 depends only on ρ, β, and the constants L(x_0), γ(x_0), and ν(x_0) of Assumptions 2 to 4.

Proof. We first note that, by (8), all iterates of Algorithm 1 remain in X_0. Let U_k ∈ R^{d×r} be any orthogonal basis for the range of H_k, where r = rank(H_k) ≤ d. It follows that

⟨p_k, H_k g_k⟩ = −⟨[H_k]† g_k, H_k g_k⟩ = −‖U_k⊤ g_k‖².  (22)

Using (7), we get the reduction in the gradient norm as

‖g_{k+1}‖² ≤ ‖g_k‖² − 2 ρ α_k ‖U_k⊤ g_k‖².  (21)

All that is left is to obtain a non-zero, iteration-independent lower bound on α_k, i.e., α_k ≥ α > 0, for which (21) holds. For this, using Assumption 2 and Lemma 5 with z = x_k + α p_k and x = x_k, we obtain

‖g(x_k + α p_k)‖² ≤ ‖g_k‖² + 2 α ⟨p_k, H_k g_k⟩ + (2 L(x_0)/(1 + β)) α^{1+β} ‖p_k‖^{1+β}.

Using the above as well as (21), we require α to satisfy

(2 L(x_0)/(1 + β)) α^{1+β} ‖p_k‖^{1+β} ≤ 2 (1 − ρ) α ‖U_k⊤ g_k‖².

But from Assumption 3 and Lemma 3, ‖p_k‖ = ‖[H_k]† g_k‖ ≤ ‖U_k⊤ g_k‖ / γ(x_0), so the above is satisfied by some α bounded away from zero uniformly in k, and any step-size returned from the line-search will be at least as large as this lower bound. Now, using Assumption 4 and Lemma 4 again to bound ‖U_k⊤ g_k‖² ≥ ν(x_0) ‖g_k‖², we get the claimed contraction of ‖g_{k+1}‖². □

Remark 1. The convergence rate of Theorem 1, for β = 1, seems rather complicated. However, one can ensure that it is indeed meaningful, i.e., the rate is positive and strictly less than one. From Lemma 2 and Assumptions 3 and 4, it follows that, for all x ∈ X_0, the relevant quantities in the rate are uniformly bounded, where U_x ∈ R^{d×r} is any orthogonal basis for the range of H(x). It can also be shown that the factor ρ(1 − ρ)^{1/β} appearing in the rate is maximized with equality holding at ρ = β/(1 + β). Hence, noting that ν(x_0) ≤ 1, we obtain the simplified bound (23).

Remark 2 (Parallels between Newton-MR/Newton-CG and MINRES/CG). In the traditional setting of strongly convex and smooth functions, the global convergence rate in Theorem 1 with β = 1 is identical to that of Newton-CG; see [81] for example. However, the former indicates the rate of reduction in the norm of the gradient of the function, while the latter corresponds to the reduction in the function value itself. This relationship is reminiscent of the convergence guarantees of MINRES and CG for linear systems involving symmetric positive definite matrices.
Indeed, while having identical convergence rates, the former is stated in terms of the reduction in the norm of the residual of the iterates, while the latter indicates the reduction in the error of the iterates themselves. Also, recall that MINRES, unlike CG, can be applied beyond positive-definite matrices.
In the same spirit, Newton-MR, unlike Newton-CG, is applicable beyond the traditional convex settings to invex functions.
As in the case of many Newton-type methods, we can obtain local convergence guarantees for Newton-MR with unit step-size, i.e., α_k = 1. We further show that such a result greatly generalizes the classical analysis of Newton-CG. We will show that a local (super-)linear rate of convergence is possible under either Assumption 2 with β > 1, or Assumption 5.
Assumption 5 (Hölder continuity of Hessian). There are constants 0 < L_H < ∞ and 0 < β such that

‖H(x) − H(y)‖ ≤ L_H ‖x − y‖^β, ∀ x, y ∈ X_0.  (24)

Although we do not know of a particular way to verify Assumption 5 a priori, it is easy to see that (24) with β = 1 is implied by (11b), and is hence weaker. In fact, unlike the usual local convergence analysis of Newton-type methods, analyzing the iterations in terms of the gradient norm allows us to weaken (11b) and instead consider (24).
Proof. Recall that (20) and α_k = 1 imply x_{k+1} = x_k + p_k with p_k = −[H_k]† g_k. Throughout the proof, let U_x and U_x^⊥ denote any orthogonal bases for Range(H(x)) and its orthogonal complement, respectively. (i) From (22) with α_k = 1 and using Assumptions 3 and 4, we get the claimed bound. (ii) Since, by assumption, g(x) is continuously differentiable, using the mean value theorem for vector-valued functions ([31, Theorem 7.9-1(d)]) for g_{k+1} = g(x_k + p_k), we have

g_{k+1} = g_k + ∫_0^1 H(x_k + τ p_k) p_k dτ.

Note that H_k p_k = −H_k [H_k]† g_k = −U_{x_k} U_{x_k}⊤ g_k. Hence, it follows that

‖g_{k+1}‖ ≤ ‖g_k − U_{x_k} U_{x_k}⊤ g_k‖ + ‖∫_0^1 (H(x_k + τ p_k) − H_k) p_k dτ‖ ≤ √(1 − ν(x_0)) ‖g_k‖ + (L_H/(1 + β)) ‖p_k‖^{1+β},

where the first term is bounded using Lemma 4 and the second using Assumption 5. Using Assumption 3, we get ‖p_k‖ ≤ ‖g_k‖/γ(x_0), and hence the result follows. □

Remark 3. Theorem 2-(ii) implies convergence in gradient norm with a local rate that is linear when ν(x_0) < 1, super-linear when ν(x_0) = 1, and quadratic when ν(x_0) = β = 1. In other words, the local convergence rate is problem-independent, which is a characteristic similar to that of exact Newton-CG; see [81]. If ν(x_0) = β = 1, then Theorem 2-(ii) implies that

‖g_{k+1}‖ ≤ (L_H / (2 γ²(x_0))) ‖g_k‖²,  (26)

which greatly resembles the quadratic convergence of Newton-CG, but in terms of ‖g_k‖ in lieu of ‖x_k − x*‖. For strongly convex objectives, (26) coincides exactly with the well-known bound on ‖g‖ in the literature, e.g., see [18, Eqn. (9.33)]. For comparison, the local convergence rate of the Newton-type method proposed in [68] for strongly convex objectives is super-linear.

Inexact update
Clearly, in almost all practical applications, it is rather unreasonable to assume that (20) can be solved exactly. Approximations to (20) in the related literature are typically obtained by requiring

‖H_k p_k + g_k‖ ≤ θ ‖g_k‖,  (27)

for some appropriate 0 ≤ θ < 1. In other words, we simply require that an approximate solution p_k is, at least, better than "p = 0" by some factor. Such inexactness conditions have long been used in the analysis of Newton-CG, e.g., see [17,21,76,81]. However, Newton-MR allows for a further relaxation of this condition. Indeed, as we will see later in this section, we can loosen (27) by merely requiring that an approximate solution satisfies

⟨p_k, H_k g_k⟩ ≤ −((1 − θ)/2) ‖g_k‖².  (28)

It is easy to see that (27) implies (28). In our theoretical analysis below, we will employ (15), and hence we need to ensure that for the update directions p_k, we have p_k ∈ Range(H_k). In this light, we introduce the following relaxation of (27):

Find p ∈ Range(H_k) such that p satisfies (28).  (29)
The inexactness condition in (29) involves two criteria for an approximate solution p, namely the feasibility of p in (28) and the requirement p ∈ Range(H_k). To seamlessly enforce the latter, recall that the t-th iteration of MINRES-QLP can be described as [29,30]

p^{(t)} = arg min_{p ∈ K_t(H, g)} ‖H p + g‖,  (30)

where K_t(A, b) = Span{b, Ab, …, A^{t−1} b} denotes the t-th Krylov sub-space generated by A and b, with t ≤ r ≜ rank(A). If ν(x_0) = 1 in Assumption 4, then we necessarily have g ∈ Range(H), and hence p^{(t)} ∈ Range(H), ∀t. Otherwise, when ν(x_0) < 1 in Assumption 4, which implies g ∉ Range(H), we cannot necessarily expect to have p^{(t)} ∈ Range(H). However, one can easily remedy this by modifying the MINRES-QLP iterations to incorporate K_t(H, Hg) instead of K_t(H, g). Compared with (30), this essentially boils down to performing one additional matrix-vector product to compute the vector Hg, which is then normalized and used as the initial vector within the Lanczos process; see [22,54] for similar modifications applied to MINRES. Clearly, this will not change the guarantees of MINRES-QLP regarding the monotonicity of the residuals as well as the final solution at termination, i.e., p† = −[H]† g. Since for all t we then have p^{(t)} ∈ Range(H), it follows that any feasible p^{(t)} satisfies (29). We note that, in general, Krylov subspace methods that define their iterates in H·K_t(H, g), as opposed to K_t(H, g), can exhibit slower convergence, e.g., see [43]. However, in our experience, such a side-effect can be negligible given the relatively small number of iterations that is often required to satisfy the inexactness condition (28), as opposed to (27); see also Remark 4.
For the analysis of Newton-type methods, to the best of our knowledge, (29) has never been considered before and constitutes the most relaxed inexactness condition on the sub-problems in the related literature.

Remark 4 ((28) is more relaxed than (27)). For a fixed θ, the relative residual of any solution to (28) is often much larger than that required by (27). Indeed, from (30) and [29, Lemma 3.3], we have ⟨p_k, H_k (H_k p_k + g_k)⟩ = 0. Now, from (28), we get

‖H_k p_k + g_k‖² = ‖g_k‖² + ⟨p_k, H_k g_k⟩ ≤ ((1 + θ)/2) ‖g_k‖²,

which implies that (28) is equivalent to requiring the relative residual condition (27), albeit with the larger tolerance √((1 + θ)/2). This, in turn, implies that for a given θ, we can satisfy (29) in roughly less than half as many iterations as what is needed to ensure the more stringent (27).
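The identity underlying Remark 4 can be verified numerically. The sketch below uses SciPy's `minres` as a stand-in for MINRES-QLP (an assumption on our part; the two agree for nonsingular symmetric systems) and checks both ⟨p_k, H_k(H_k p_k + g_k)⟩ = 0 and the resulting relation ‖H_k p_k + g_k‖² = ‖g_k‖² + ⟨p_k, H_k g_k⟩:

```python
import numpy as np
from scipy.sparse.linalg import minres

rng = np.random.default_rng(4)
Q, _ = np.linalg.qr(rng.standard_normal((8, 8)))
H = Q @ np.diag([4., 2., 1., 0.5, -0.5, -1., -2., -3.]) @ Q.T  # symmetric indefinite
g = rng.standard_normal(8)

p, _ = minres(H, -g, maxiter=4)   # truncated MINRES iterate
r = H @ p + g                     # residual of the sub-problem

# MINRES optimality within the Krylov subspace gives <p, H r> = 0, ...
assert abs(p @ (H @ r)) <= 1e-8 * (g @ g)
# ... hence ||r||^2 = ||g||^2 + <p, H g>, linking (27) and (28).
assert np.isclose(r @ r, g @ g + p @ (H @ g), rtol=1e-8)
```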

Condition 1 (Inexactness tolerance θ). The inexactness tolerance θ in (28) is chosen such that

1 − 2ν(x_0) ≤ θ < 1,

where ν(x_0) is as in Assumption 4.
Of course, because of ν(x_0), the interval in Condition 1 also depends on x_0, which consequently affects the range of possible choices for θ. In this sense, the proper notation for the inexactness tolerance is θ(x_0). However, to simplify our results, we drop the dependence of θ(x_0) on x_0. Also, note that if ν(x_0) > 1/2, then we can take θ to be negative.
Replacing the exact update (20) in Algorithm 1 with the inexact condition (29) and considering Condition 1, we get an inexact variant of Newton-MR, depicted in Algorithm 2, and Theorem 3 provides its convergence properties.

Theorem 3 (Convergence of Algorithm 2). Consider Assumptions 1 to 4. For the iterates of Algorithm 2, we have
This, in turn, implies a bound on the decrease of ‖g‖² along p_k. Now, as in the proof of Theorem 1, to obtain a lower bound on the step-size returned by the line-search, using the above and (7), we consider α satisfying the line-search condition; this implies that the accepted step-size is bounded below. With this lower bound on the step-size, we obtain the desired result by noting that ‖g_k‖² + 2ρα⟨p_k, H_k g_k⟩ can be bounded above as in the proof of Theorem 1. Recalling that 1 − θ ≤ 2ν(x_0) from Condition 1, we can obtain a bound similar to (23), with τ̃(x_0) replacing τ(x_0), for the rate given by Theorem 3.

Remark 5.
For θ = 1 − 2ν(x_0), i.e., when the sub-problems are solved exactly, Theorem 3 coincides with Theorem 1. More generally, for any ν(x_0) ∈ (0, 1] and θ ∈ [1 − 2ν(x_0), 1), we obtain a rate similar to that of the exact update in Theorem 1, up to a constant factor of (1 − θ)/(2ν(x_0)); i.e., the effects of the problem-related quantities such as L(x_0) and γ(x_0) remain the same. This is in contrast to Newton-CG, for which, in order to obtain a rate similar to that of the exact algorithm, one has to solve the linear system to a high enough accuracy, i.e., with an inexactness tolerance of order 1/√κ, where κ is the condition number of the problem. Otherwise, the dependence of the convergence rate on the problem-related quantities is significantly worsened; see [81, Theorem 2].
Local convergence results similar to those in Theorem 2 can also be obtained for the case where the update direction, p_k, is computed approximately. Note that, again as in Remark 5, when θ = 1 − 2ν(x_0), the results of Theorem 4 coincide with those of Theorem 2.
Theorem 4 (Error recursion of Algorithm 2 with α k = 1). Under the same assumptions of Theorem 2, suppose x k ∈ X 0 , and consider one iteration of Algorithm 2 with α k = 1.

ii) If Assumption 5 holds and H(x) is continuous, then
Here, L(x 0 ), γ(x 0 ) and L H are defined, respectively, in Assumptions 2, 3, and 5, β refers to the respective constants of Assumptions 2 and 5, and θ is an inexactness tolerance chosen according to Condition 1.

Remark 6.
Here, as in Remark 3, one can obtain a local linear convergence rate that is problem-independent. For example, from Theorem 4-(ii), it follows that if (1 + θ)/2 < c < 1 and the gradient is small enough, we get ‖g_{k+1}‖ ≤ c‖g_k‖; see [58, Section 6] for a similar statement in the context of more general nonlinear equations. This implies that local convergence in the norm of the gradient is very fast, even with a very crude solution of the sub-problem. We also note that if ν(x_0) = β = 1 and θ_k ∈ O(‖g_k‖²) − 1, i.e., 1 + θ_k ∈ O(‖g_k‖²), then Theorem 4-(ii) implies a quadratic convergence rate.
Finally, we can obtain the iteration complexity of finding a solution satisfying ‖g(x)‖ ≤ ε for a desired ε > 0.

Corollary 1 (Iteration complexity of Algorithm 2). Under the same assumptions as in Theorem 3, consider finding an iterate for which ‖g_k‖ ≤ ε for a desired ε > 0. Then, we have the following two cases:

Proof. Recall that we always have ‖g_k‖ ≤ ‖g_{k−1}‖ ≤ . . . ≤ ‖g_0‖. Dropping the dependence of τ(x_0) on x_0, we have the following. (i) For β ≥ 1, requiring ‖g_k‖² ≤ ε implies an upper bound on the number of iterations, from which we obtain the result. (ii) For β < 1, as long as ‖g_k‖² ≥ ε, from (32), we have a per-iteration decrease; now, we obtain the result as above.
Corollary 1 implies that (inexact) Newton-MR is guaranteed to converge for functions which have traditionally not been considered suitable candidates for Newton-type methods. For example, Algorithm 2 can still be applied to the optimization of a twice continuously differentiable objective for which β ≪ 1 in Assumption 2. Such functions can be extremely non-smooth and, as a consequence, have been labeled, rather inaccurately, as cases where the application of curvature might be useless, e.g., [4,5].

Convergence under generalized Polyak-Łojasiewicz inequality
The set of global optima of an invex function, f (x), is characterized by the zeros of the gradient, g(x). If, in addition, the distance to the optimal set, in terms of iterates and/or their respective function values, is also somehow related to g(x), then one can obtain rates at which the iterates and/or their objective values approach optimality. In this section, we consider an important sub-class of invex problems, which allows us to do that.

Definition 2 (Generalized Polyak-Łojasiewicz inequality). Let X ⊆ R d be an open set.
A differentiable function f on X is said to satisfy the generalized Polyak-Łojasiewicz (GPL) inequality on X if there exist 1 < η < ∞ and 0 < μ < ∞ such that

‖g(x)‖^η ≥ μ (f(x) − f⋆)^(η−1), ∀x ∈ X,    (34)

where f⋆ is the optimal value of (1). The class of functions satisfying (34) is denoted by F^GPL_{η,μ}.
It is clear that F^GPL_{η,μ} ⊂ F (cf. Definition 1). Most often in the literature, the Polyak-Łojasiewicz (PL) inequality refers to (34) with η = 2, e.g., [57], which excludes many functions. The generalized notion in (34), with any 1 < η < ∞, encompasses many such functions. For example, the weakly convex function f(x) = x⁴ clearly violates the typical PL inequality (i.e., with η = 2), but it indeed satisfies (34) with η = 4 and μ = 256. In fact, (34) holds, with suitable η and μ, for a broad family of polynomial functions. But how large is the class F^GPL_{η,μ} in Definition 2? We aim to shed light on this question by considering other classes of functions that are, in some sense, equivalent to F^GPL_{η,μ}. In doing so, we draw on various relaxations of strong convexity. More specifically, to alleviate the restrictions imposed by strong convexity assumptions, several authors have introduced relaxations under which desirable convergence guarantees of various algorithms are maintained, e.g., the quadratic growth condition [3], the restricted secant inequality [89], and the error bounds [65]. The relationships among these classes of functions have also been established; see [57,72,83,98,99]. We now give natural extensions of these conditions to invex functions and show that F^GPL_{η,μ} is an equivalent class of functions.
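The claim for f(x) = x⁴ can be checked numerically. Assuming (34) reads ‖g(x)‖^η ≥ μ(f(x) − f⋆)^(η−1) (our reconstruction of the elided display, under which η = 4 and μ = 256 hold with equality for this f), a direct check is:

```python
import numpy as np

# f(x) = x^4 has f* = 0 and g(x) = 4x^3. With eta = 4 and mu = 256,
# |g(x)|^4 = 256 x^12 = 256 (f(x) - f*)^3, so the assumed GPL form holds
# with equality, while the classical PL inequality (eta = 2) fails near 0
# for any mu > 0, since g(x)^2 / (f(x) - f*) = 16 x^2 -> 0.
xs = np.linspace(-2, 2, 401)
assert np.all(np.abs(4 * xs**3)**4 >= 256 * (xs**4)**3 * (1 - 1e-9))

x = 1e-3
assert (4 * x**3)**2 < 1e-3 * x**4   # PL with mu = 1e-3 already fails here
```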
Let us define the set of optimal points of problem (1) as

X ≜ {x ∈ R^d | g(x) = 0}.    (35)

Further, assume that X is non-empty, and denote the optimal value of (1) by f⋆. Recall that when f is invex, X need not be a convex set, but it is clearly closed; see [69, p. 14].

Definition 3 (Generalized functional growth property).
A differentiable function f is said to satisfy the generalized functional growth (GFG) property if there exist 1 < η < ∞, 0 < μ < ∞ and a vector-valued function φ : R^d × R^d → R^d such that

f(x) − f⋆ ≥ μ min_{y∈X} ‖φ(y, x)‖^η, ∀x ∈ R^d.    (36)

The class of functions satisfying (36) with the same φ is denoted by F^GFG_{φ,η,μ}.

Definition 4 (Generalized restricted secant inequality).
For a vector-valued function φ : R^d × R^d → R^d, define Y_φ(x) ≜ {y ∈ X | ‖φ(y, x)‖ ≤ ‖φ(ȳ, x)‖, ∀ȳ ∈ X}. A differentiable function f is said to satisfy the generalized restricted secant (GRS) inequality if there exist 1 < η < ∞, 0 < μ < ∞ and a mapping φ such that

⟨g(x), −φ(y, x)⟩ ≥ μ ‖φ(y, x)‖^η, ∀x ∈ R^d, ∀y ∈ Y_φ(x).    (37)

The class of functions satisfying (37) with the same φ is denoted by F^GRS_{φ,η,μ}.
It is easy to see that when f is convex and φ(y, x) = y − x, the set Y_φ(x) = arg min_{y∈X} ‖y − x‖ is the singleton containing the unique orthogonal projection of x onto the set of optimal solutions X (which, in this case, is convex). Hence, the generalized condition (37) coincides with, and is therefore a generalization of, the usual definition of the restricted secant inequality for convex functions with η = 2; see [57,83,98,99].

Definition 5 (Generalized error bound property).
A differentiable function f is said to satisfy the generalized error bound (GEB) property if there exist 1 < η < ∞, 0 < μ < ∞ and a vector-valued function φ : R^d × R^d → R^d such that

‖g(x)‖ ≥ μ min_{y∈X} ‖φ(y, x)‖^(η−1), ∀x ∈ R^d.    (38)

The class of functions satisfying (38) with the same φ is denoted by F^GEB_{φ,η,μ}.
Lemma 6 establishes a loose notion of equivalence among the classes of functions mentioned above, as they relate to invexity. When restricted to a particular class of invex functions for a given φ, however, Lemma 6 shows that F^GPL_{η,μ} ∩ F_φ can indeed be larger. In contrast, when φ(y, x) = y − x, equivalence between these classes has been established in [57]. This is in particular so since, unlike [57], here we have made no smoothness assumptions on f, such as Lipschitz continuity of its gradient. This is a rather crucial distinction, as for our main results in this paper we make smoothness assumptions that are less strict than those typically found in the literature; see Assumption 2 in Section 3.1.

Lemma 6.
(i) For any φ, we have (ii) There exists a φ for which (iii) These classes are equivalent in the sense that

Proof. (i) We start by showing that, for any f ∈ F_φ, (36) implies (37). Consider any x ∈ R^d and y ∈ Y_φ(x). By (36) and invexity of f with respect to φ, noticing that f(y) = f⋆, we have ⟨g(x), −φ(y, x)⟩ ≥ f(x) − f⋆ ≥ μ min_{ȳ∈X} ‖φ(ȳ, x)‖^η = μ ‖φ(y, x)‖^η. Since this holds for all y ∈ Y_φ(x), we obtain exactly (37). Next, a simple application of the Cauchy-Schwarz inequality on (37), together with min_{y∈Y_φ(x)} ‖φ(y, x)‖ = min_{y∈X} ‖φ(y, x)‖, yields ‖g(x)‖ min_{y∈X} ‖φ(y, x)‖ ≥ μ min_{y∈X} ‖φ(y, x)‖^η. If x is such that min_{y∈X} ‖φ(y, x)‖ = 0, then the inequality (38) trivially holds; otherwise, dividing both sides by min_{y∈X} ‖φ(y, x)‖ gives (38).

(ii) We will show that (34) implies that there exists a φ for which (36) holds. Indeed, for any invex function f, we can always define a corresponding φ in terms of g(x). For any x ∈ X, the inequality (36) trivially holds. Suppose x ∉ X, which implies g(x) ≠ 0. For any y ∈ X, the desired bound on f(x) − f⋆ then follows, where the last inequality is a consequence of (34).

(iii) This is a simple implication of the first two parts.
Recalling that F^GPL_{η,μ} is the class of generalized Polyak-Łojasiewicz (GPL) functions as defined in Definition 2, we immediately have the following convergence guarantee for Algorithm 2 in terms of the objective value f(x_k). Similar results for Algorithm 1 can also easily be obtained.

Theorem 5 (Convergence of Algorithm 2 under GPL (34)). Under the same assumptions as in Theorem 3, if f(x) satisfies the GPL inequality (34), then there are constants 0 < C < ∞, 0 < ζ < 1 and ω > 0, such that for the iterates of Algorithm 2 we have the following.
Here, ρ is the line-search parameter of Algorithm 2, η and μ are as in Definition 2, and τ (x 0 ) is as in Theorem 3.
Proof. (i) For notational simplicity, we drop the dependence of τ(x_0) on x_0. Consider β ≥ 1. From (32) and (34), we obtain the stated linear rate, where τ is as in Theorem 3, and by (23) we also have that 0 < ζ < 1. (ii) Now take 0 < β < 1. From (32), we have the corresponding per-iteration decrease, and using (34), we get the result.

The following lemma shows that, under some assumptions, the norm of the gradient at each point can be estimated using the distance of the point to the optimal set X.

Lemma 7. Under Assumptions 2 to 4, for any
where X is as in (35).
Proof. By Assumption 2, since X ⊆ X_0, we have a bound on ‖H(x)g(x)‖. Since the left-hand side of the above inequality is independent of x_0, we get a uniform bound. Now, from Assumptions 3 and 4, it follows that for any (x, x⋆) ∈ X_0 × X, we have a bound on ‖g(x)‖ in terms of ‖x − x⋆‖. The result follows by taking the minimum of the right-hand side over X.

Theorem 6 (Error recursion of Algorithm 2 under the GEB property). Under the same assumptions as in Theorem 4, suppose f satisfies (38) with φ(y, x) = y − x. Then, (i) an error recursion in min_{x⋆∈X} ‖x_k − x⋆‖ holds, and (ii) if Assumption 5 holds and H(x) is continuous, then
Here, L(x 0 ), γ(x 0 ), L H and θ are defined, respectively, in Assumptions 2, 3 and 5 and Condition 1, X is as in (35), L g (x 0 ) is as in Lemma 7, μ and η are as in Definition 5, and β refers to the respective constants of Assumptions 2 and 5.
Remark 7. Assuming (38) with φ(y, x) = y − x (locally) is weaker than requiring an isolated (local) minimum. For example, suppose we have ν(x_0) = 1 and η = 2. By setting θ_k ∈ O(min_{x⋆∈X} ‖x_k − x⋆‖^(2β)) − 1, from Theorem 6-(ii) we get super-linear convergence to the set of optimal solutions, X. In fact, the rate is quadratic for β = 1 as a special case, which matches those from many prior works, e.g., [35,62,101]. Clearly, convergence in terms of min_{x⋆∈X} ‖x_k − x⋆‖ relaxes the notion of an isolated minimum, which is at times used for the local convergence analysis of Newton-CG. Hence, error bound conditions similar to that used in Theorem 6 have been extensively used in the related literature to establish convergence to non-isolated (local) minima [44,45,62,63,95,101,102].

Numerical experiments
In this section, we evaluate the empirical performance of inexact Newton-MR (Algorithm 2) as compared with several other Newton-type optimization methods on some machine learning problems. Empirical comparisons to a variety of first-order methods are left for future work.
Optimization methods. In the following performance evaluations, we compare Newton-MR with several widely used optimization methods, as listed below; see [76] for the details of each of these algorithms. The main hyper-parameters for these methods, used in our simulations, can be found in Table 1. In all of our experiments, we run each method until the norm of the gradient falls below 1E-10, the maximum number of iterations is reached, or the algorithm fails to make progress. The latter case is detected when Newton-CG encounters a negative curvature and its CG inner iterations are terminated, or when, for any method, the maximum number of line-search trials has been reached without an accepted step-size. These scenarios are depicted by a cross "×" on all the plots.

Table 1. Hyper-parameters used in optimization methods. The initial trial step-size for all methods, with the exception of nonlinear-CG, is set to α = 1. For nonlinear-CG, we use the strategy described in [76, Section 3.5]. The initial trust-region size is also set to one.

Table 2. Complexity measure per iteration for each of the algorithms. N_s and N_l denote, respectively, the total number of iterations for the corresponding inner solver and that resulting from performing the line-search or the trust-region radius update.
Although Newton-CG is not meant to be used on problems where Hessian can become singular or indefinite, we run plain Newton-CG on our examples without any modifications, e.g., we do not attempt to employ the encountered negative curvature as in [82,96]. As a result, we terminate its iterations if CG fails. This is a judicious choice to highlight a significant difference between Newton-MR and Newton-CG. Specifically, this serves to distinguish between cases where the trajectory of Newton-CG remains in regions that are locally strongly-convex and those where it enters areas with high degree of weak-convexity or non-convexity. For example, in Section 4.2, more often than not, Newton-CG fails at the very outset with x 0 , whereas Newton-MR makes continuous progress.
Performance evaluation measure. In all of our experiments, we plot the objective value vs. the total number of oracle calls of function, gradient and Hessian-vector product. This is because measuring "wall-clock" time can be greatly affected by individual implementation details, whereas counting the number of oracle calls, as an implementation- and system-independent unit of complexity, is more appropriate and fair. More specifically, after computing each function value, computing the corresponding gradient is equivalent to one additional function evaluation. Our implementations are Hessian-free, i.e., we merely require Hessian-vector products instead of the explicit Hessian. Each Hessian-vector product amounts to two additional function evaluations, as compared with a gradient evaluation. The number of such oracle calls per iteration for all the algorithms is given in Table 2.
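One possible reading of this cost model (the class and unit choices are ours, not the paper's) can be sketched as a small instrumentation wrapper:

```python
class OracleCounter:
    """Tally oracle calls in function-evaluation equivalents, under one
    reading of the text's cost model: a function value costs 1 unit, a
    gradient adds 1 more unit, and a Hessian-vector product adds 2 units
    on top of a gradient. The exact accounting should be checked against
    Table 2 in the paper."""
    def __init__(self, f, g, hv):
        self._f, self._g, self._hv = f, g, hv
        self.calls = 0
    def f(self, x):
        self.calls += 1
        return self._f(x)
    def g(self, x):
        self.calls += 1
        return self._g(x)
    def hv(self, x, v):
        self.calls += 2
        return self._hv(x, v)
```

Wrapping an objective this way lets any of the compared solvers be plotted against the same implementation-independent x-axis.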

Softmax regression
Here, we consider the softmax cross-entropy minimization problem without regularization, which is used in machine learning for multi-class classification. More specifically, we have the objective in (40), where {a_i, b_i}_{i=1}^n, with a_i ∈ R^p and b_i ∈ {1, . . . , C}, denote the training data, C is the total number of classes for each input data a_i, and x = (x_1, x_2, . . . , x_{C−1}). Note that, in this case, we have d = (C − 1) × p. It can be shown that, depending on the data, (40) is either strictly convex or merely weakly convex. In either case, however, by an analysis similar to that in Example 4, one can show that ν(x) = 1. Hence, it follows that g(x_k) ∈ Range(H(x_k)) and the CG iterations within Newton-CG are well-defined. The datasets used for our experiments in this section are listed in Table 3, and the performance of each method is depicted in Fig. 1.
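Since the display of (40) is not reproduced here, the following is one concrete reading of such a softmax cross-entropy objective, with the last class's logits pinned to zero so that the parameter dimension is d = (C − 1) × p (the function name and exact parameterization are ours):

```python
import numpy as np

def softmax_loss_and_grad(X, A, b):
    """Softmax cross-entropy with the last class as reference (zero logits).
    X: p x (C-1) parameters, A: n x p data, b: n labels in {0, ..., C-1}."""
    n, p = A.shape
    Z = A @ X                                   # n x (C-1) logits for classes 0..C-2
    Zfull = np.hstack([Z, np.zeros((n, 1))])    # implicit zero logit for class C-1
    Zfull -= Zfull.max(axis=1, keepdims=True)   # numerical stabilization
    P = np.exp(Zfull)
    P /= P.sum(axis=1, keepdims=True)
    f = -np.mean(np.log(P[np.arange(n), b]))
    Y = np.zeros_like(Z)
    mask = b < Z.shape[1]
    Y[np.arange(n)[mask], b[mask]] = 1.0        # one-hot on the first C-1 classes
    G = A.T @ (P[:, :-1] - Y) / n               # p x (C-1) gradient
    return f, G
```

A finite-difference check confirms the gradient formula; Hessian-vector products for this model can be obtained analogously without forming the (C − 1)p × (C − 1)p Hessian.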

Gaussian mixture model
Here, we consider an example involving a mixture of Gaussian densities, where the goal is to recover some mixture parameters, such as the mean vectors and the mixture weights. Although this problem is generally not invex, it has been shown in [67] that it exhibits features that are close to being invex, e.g., small regions of saddle points and large regions containing the global minimum. Nonetheless, the non-convexity of this problem results in a Hessian matrix that can become indefinite and/or singular across iterations.

Fig. 1. Performance of the methods of Table 2 for the datasets of Table 3 on the model problem (40). As, for this convex problem, Gauss-Newton is mathematically equivalent to Newton-CG, it is naturally excluded from comparisons. All algorithms are always initialized at x_0 = 0. As can be clearly seen, Newton-MR almost always performs very competitively and, in particular, greatly outperforms Newton-CG.
For simplicity, we consider a mixture model with two Gaussian components as in (41), where Φ denotes the density of the p-dimensional standard normal distribution, a_i ∈ R^p are the data points, x_1 ∈ R^p and x_2 ∈ R^p are the mean vectors and Σ_1 ∈ R^{p×p} and Σ_2 ∈ R^{p×p} the covariance matrices of the two Gaussian distributions, x_0 ∈ R, and ω(t) = 1/(1 + e^{−t}) ensures that the mixing weight lies within [0, 1]. Note that, here, d = 2p + 1. We run the experiments 500 times and plot the performance profile [39,51] of each method; see Fig. 2. For each run, we generate 1,000 random data points from the mixture distribution (41) with p = 100, and ground-truth parameters x_1 ∼ U_p[0, 1] and x_2 ∼ U_p[3, 4], where U_p[a, b] denotes the distribution of a p-dimensional random vector whose independent components are drawn uniformly over the interval [a, b]. Covariance matrices are constructed randomly, at each run, with controlled condition number, such that they are not axis-aligned. For this, we first randomly generate two p × p matrices, W_1 and W_2, whose elements are drawn i.i.d. from the standard normal distribution. We then find the corresponding orthogonal bases, Q_1 and Q_2, using QR factorizations. This is then followed by forming each Σ_i from Q_i and a diagonal matrix whose entries are chosen equidistantly from the interval [1, 100], i.e., the condition number of each Σ_i is 100.

Fig. 2. Performance profiles of the methods on the Gaussian mixture model problem. For a given λ on the x-axis, the corresponding value on the y-axis is the proportion of times that a given solver's performance lies within a factor λ of the best possible performance over all runs. All algorithms are always initialized at random using a multivariate normal distribution with mean zero and identity covariance. As can be seen, Newton-MR greatly outperforms the alternative second-order methods in this example.
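The covariance construction can be sketched as follows (the helper name is ours). Note that a prescribed condition number of 100 corresponds to eigenvalues equidistant on [1, 100], which is the reading adopted here:

```python
import numpy as np

def random_covariance(p, cond=100.0, rng=None):
    """Random, non-axis-aligned covariance with prescribed condition number:
    draw W ~ N(0,1)^{p x p}, orthogonalize via QR, then form Q diag(D) Q^T
    with equidistant eigenvalues D on [1, cond]."""
    rng = rng or np.random.default_rng()
    Q = np.linalg.qr(rng.standard_normal((p, p)))[0]
    D = np.linspace(1.0, cond, p)
    return (Q * D) @ Q.T        # equivalent to Q @ np.diag(D) @ Q.T
```

Because Q is orthogonal, the eigenvalues of the result are exactly the entries of D, so the condition number is cond by construction.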
The performance profiles of all the methods are gathered in Fig. 2. As can be seen, Newton-MR greatly outperforms all alternative second-order methods in this example. The trust-region method also shows reasonable performance. The L-BFGS and nonlinear-CG methods have comparable performances. It is worthwhile to highlight that Newton-CG fails to converge in many of the runs. This is because, for this non-convex problem, the Hessian matrix can become indefinite and, as a result, the CG iterations can fail before the underlying inexactness condition is met. Furthermore, the Gauss-Newton method performed extremely poorly, which can be an indication that the underlying Gauss-Newton matrix is a low-quality approximation to the true Hessian in such a model.

Nonlinear least-squares
The examples of Sections 4.1 and 4.2 show that, when the problem is (approximately) invex, the inexact Newton-MR method, depicted in Algorithm 2, can be a highly effective optimization algorithm. We now study a potential drawback of using the plain Newton-MR method for the optimization of objectives that are highly non-invex. We set out to show that, for such functions, Algorithms 1 and 2, by virtue of the monotonic reduction of the gradient norm, run the risk of converging to undesirable saddle points or local maxima. For this, following [93], we consider the simple, yet illustrative, nonlinear least-squares problems arising from the task of binary classification with squared loss. More specifically, given the data, we consider the objective in (42), where φ(z) = 1/(1 + e^{−z}) is the sigmoid function and ψ(x) is some regularization term.
To make the objective highly non-invex, we add a non-convex regularization term ψ(x) defined component-wise in terms of the entries x_i of x. We compare the performance of the inexact variant of Newton-MR, Algorithm 2, the Gauss-Newton method, as well as the trust-region algorithm with the CG-Steihaug solver. The Gauss-Newton method has traditionally been a method of choice for nonlinear least-squares problems (of course, without a non-convex regularization). The trust-region algorithm can take advantage of the negative curvature directions that arise as part of the Steihaug-CG iterations, and it is an effective method for most non-convex problems. The approximate Hessian for the Gauss-Newton method is taken as J_r^T(x) J_r(x) + ∇²ψ(x), where J_r(x) ∈ R^{n×d} is the Jacobian of the sigmoid mapping within the squared loss. Since this matrix is not necessarily positive semi-definite (due to the non-convexity of ψ(x)), the CG inner iterations might fail for this problem, in which case we terminate the Gauss-Newton method. We initialize the algorithms using x_0 = 0, x_0 = 1, and one drawn from the multivariate normal distribution with mean zero and identity covariance. We evaluate the performance of the methods on three datasets from Table 3, namely 20 Newsgroups, mnist and UJIIndoorLoc. For our binary classification setting, we have relabeled the odd and even classes, respectively, as 0 and 1. The results are depicted in Fig. 3.
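A sketch of such a regularized squared-loss objective and its gradient follows. Since the display defining the regularizer is not reproduced here, the penalty ψ(x) = λ Σ_i x_i²/(1 + x_i²) below is a hypothetical stand-in for the paper's non-convex term, chosen only because it has the stated component-wise, bounded, non-convex form:

```python
import numpy as np

def nls_objective(x, A, b, lam=1.0):
    """Squared-loss sigmoid classification objective of the general form (42):
    f(x) = (1/2n) sum_i (phi(<a_i, x>) - b_i)^2 + psi(x).
    HYPOTHETICAL regularizer: psi(x) = lam * sum_i x_i^2 / (1 + x_i^2),
    a stand-in for the paper's (elided) non-convex penalty."""
    z = A @ x
    phi = 1.0 / (1.0 + np.exp(-z))                  # sigmoid
    n = A.shape[0]
    f = 0.5 * np.mean((phi - b)**2) + lam * np.sum(x**2 / (1 + x**2))
    dphi = phi * (1 - phi)                          # sigmoid derivative
    g = A.T @ ((phi - b) * dphi) / n + lam * 2 * x / (1 + x**2)**2
    return f, g
```

The bounded, saturating penalty makes the Hessian indefinite away from the origin, which is exactly the regime where the text reports Gauss-Newton's CG iterations failing and plain Newton-MR stalling at saddle points.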
It is clear that, when the optimization landscape is somewhat less rugged near the initialization, e.g., the origin for most of the datasets, Newton-MR (as well as Gauss-Newton) can perform reasonably well. However, near more "hostile" regions with a high degree of non-convexity, Newton-MR fails to find a good solution and can converge to saddle points or local maxima. In such highly non-convex regions, the matrix used within the Gauss-Newton iterations can also become indefinite, causing the underlying CG iterations to fail, in which case the Gauss-Newton method is terminated (indicated by "×"). In contrast, by leveraging the negative curvature directions, the trust-region method can reliably converge to a local minimum.
Experiments such as those depicted in Fig. 3 highlight the need for further refinement of the vanilla Newton-MR method studied in this paper, in order to design more general-purpose variants that, just like trust-region, can leverage the negative curvature directions that may arise as part of the inner solver. This way, one can extend Algorithms 1 and 2 far beyond invex problems to arbitrary non-convex objectives. We leave this important venture to future work.

Fig. 3. Comparison of the Newton-MR, Gauss-Newton, and trust-region algorithms for a highly non-invex optimization problem, namely the problem (42) on three datasets from Table 3. We consider three initialization strategies: x_0 = 0, x_0 = 1, and one that is randomly drawn from the multivariate normal distribution with mean zero and identity covariance. Since this problem is highly non-invex, with unfortunate initialization, Newton-MR can converge to saddle points or even local maxima, whereas the trust-region algorithm, by leveraging the negative curvature directions arising as part of the CG-Steihaug iterations, can navigate its way out of these undesirable regions. The mark "×" indicates the Gauss-Newton iteration where the CG method fails due to the indefiniteness of the underlying approximate Hessian matrix.

Conclusions
Motivated to extend the simplicity of the iterations of Newton-CG to beyond strongly convex problems, we consider an alternative algorithm, called Newton-MR, which can be readily used for unconstrained optimization of invex objectives. The iterations of Newton-MR merely involve ordinary least-squares problems, which are (approximately) solved by minimum residual iterative methods, followed by Armijo-type line-search to determine the step-size. We show that under mild assumptions and weak inexactness conditions for sub-problems, we can obtain a variety of local/global convergence results for Newton-MR.

Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
|⟨u_x, g(x)⟩| = 2ax_1, where the lower bound is obtained by considering the minimum of the function h(y) = 2/y + y at y = x_1/(x_2 − b). Hence, it follows that ν = 8/9. Clearly, f is unbounded below and, admittedly, this example is of little interest in optimization. Nonetheless, it serves as a non-trivial example of a function that satisfies Assumption 4.

A.2. Composition of functions and Assumption 4
We now consider Assumption 4 in the context of the composition of functions. More specifically, consider f(x) ≜ h(r(x)), (43), where r : R^d → R^p and h : R^p → R are smooth functions. We have

g(x) ≜ ∇f(x) = J_r^T(x) ∇h(r(x)),    H(x) ≜ ∇²f(x) = J_r^T(x) ∇²h(r(x)) J_r(x) + ∂J_r(x) · ∇h(r(x)),

where J_r(x) ∈ R^{p×d} and ∂J_r(x) ∈ R^{d×d×p} are, respectively, the Jacobian of r and the tensor of all second-order partial derivatives of r, and ∇h(r(x)) ∈ R^p and ∇²h(r(x)) ∈ R^{p×p} are the gradient and Hessian of h, respectively. Lemma 8 gives a sufficient condition for f to satisfy Assumption 4.
Proof. Let x_0 ∈ R^d be given and take any x ∈ X_0. Define B ≜ A + E, where B = H(x), A = J_r^T(x) ∇²h(r(x)) J_r(x), and E = ∂J_r(x) · ∇h(r(x)).

Let U_A and U_B be, respectively, arbitrary orthonormal bases for Range(A) and Range(B), and denote by U_A^⊥ and U_B^⊥ their respective orthogonal complements.

By assumption, we have r_A ≜ rank(A) ≤ rank(B) ≜ r_B. Let Ũ_A ∈ R^{d×r_B} be U_A augmented by any r_B − r_A vectors from U_A^⊥ such that rank(Ũ_A) = rank(U_B). By the matrix perturbation theory applied to B = A + E [77, Theorem 19], we have a bound in terms of σ_{r_B}(B), the smallest non-zero singular value of B; this, in turn, implies a corresponding bound (see [50, Theorem 2.5.1]). Now, since ∇²h(r(x)) is full rank, Range(A) = Range(J_r^T(x)). Therefore g(x) = U_A U_A^T g(x), and hence U_A^⊥ (U_A^⊥)^T g(x) = 0, which implies ‖g(x)‖ = ‖U_A^T g(x)‖. Rearranging the terms above, we obtain (44), which completes the proof.
The requirement that ∇²h(r(x)) be full rank is satisfied in many applications, e.g., nonlinear least squares, where h(z) = ‖z‖²/2; statistical parameter estimation using the negative log-likelihood, where h(z) = −log(z); or multi-class classification using the cross-entropy loss, where h(z) = −Σ_{i=1}^p y_i log(z_i) for some y with Σ_{i=1}^p y_i = 1 and y_i ≥ 0. The requirement that rank(H(x)) ≥ rank(J_r(x)), however, might not hold for many problems. Nonetheless, Lemma 8 can give a qualitative guide to understanding the connection between Newton-MR and Gauss-Newton when both are applied to f(x) = h(r(x)).
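For the nonlinear least-squares case h(z) = ‖z‖²/2 (so ∇h(r) = r and ∇²h = I), the Hessian-vector product implied by the expression for H(x) can be sketched as follows (function names ours); the first term is exactly the Gauss-Newton matrix acting on v, and the second is the correction involving the second derivatives of r:

```python
import numpy as np

def hvp_least_squares(r, J, hess_r, x, v):
    """Hessian-vector product for f(x) = ||r(x)||^2 / 2, i.e.
    H v = J^T J v + sum_j r_j(x) (grad^2 r_j(x)) v,
    matching H = J_r^T (grad^2 h) J_r + dJ_r . grad h with h(z) = ||z||^2/2."""
    Jx = J(x)
    gauss_newton = Jx.T @ (Jx @ v)      # J^T (grad^2 h) J v with grad^2 h = I
    correction = sum(rj * (Hj @ v) for rj, Hj in zip(r(x), hess_r(x)))
    return gauss_newton + correction
```

Dropping the correction term recovers the Gauss-Newton product; its size relative to the first term is exactly the quantity ‖S(x)‖ discussed in the connection to Gauss-Newton below.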
Connection to Gauss-Newton. Suppose f in (43) has isolated local minima, e.g., if J_r(x⋆) is full-rank at a local minimum x⋆. Define S(x) ≜ ∂J_r(x) · ∇h(r(x)). It is well known that the convergence of Gauss-Newton is greatly affected by the magnitude of ‖S(x⋆)‖; see [87, Section 7.2]. Indeed, when ‖S(x⋆)‖ is large, Gauss-Newton can perform poorly, whereas small values of ‖S(x⋆)‖ imply that the Gauss-Newton matrix is a good approximation to the full Hessian. In fact, local quadratic convergence of Gauss-Newton can only be guaranteed when S(x⋆) = 0; otherwise, the convergence can degrade significantly. For example, if ‖S(x⋆)‖ is too large, the Gauss-Newton method may not even converge, while if ‖S(x⋆)‖ is small enough relative to ‖J_r^T(x⋆) ∇²h(r(x⋆)) J_r(x⋆)‖, the convergence is linear.
Lemma 8 allows us to, at least qualitatively and superficially, compare the performance of Gauss-Newton with that of Newton-MR for the minimization of (43). Roughly speaking, Lemma 8 relates ‖S(x⋆)‖ to ν, which directly affects the performance of Newton-MR; e.g., see Theorem 1, where larger values of ν imply faster convergence for Newton-MR. More specifically, suppose ‖S(x⋆)‖ is small. Assume that S is smooth enough that ‖S(x)‖ is also small for any x in some compact neighborhood of x⋆, denoted by C. Let x_0 be the point in this compact neighborhood such that ‖g(x)‖ ≤ ‖g(x_0)‖, ∀x ∈ C. With X_0 defined by this x_0, consider (44). Since the left-hand side is assumed small, ν(x_0) can be taken large, which directly implies a better convergence rate for Newton-MR.
In this light, we can expect Newton-MR to perform well whenever Gauss-Newton exhibits good behaviors. Our empirical evaluations of Section 4.2 indeed support this, while hinting at a possibility that the converse might not necessarily be true. Nonetheless, to make a fair and more accurate theoretical comparison, one needs to analyze the Gauss-Newton method under the assumptions of this paper (for example by drawing upon the ideas in [42] related to the minimum-norm solution). This endeavor is left for future work.