Depth with Nonlinearity Creates No Bad Local Minima in ResNets

In this paper, we prove that depth with nonlinearity creates no bad local minima in a type of arbitrarily deep ResNets studied in previous work, in the sense that the values of all local minima are no worse than the global minimum values of corresponding shallow linear predictors with arbitrary fixed features, and are guaranteed to further improve via residual representations. As a result, this paper provides an affirmative answer to an open question stated in a paper presented at the conference on Neural Information Processing Systems (NIPS) 2018. We note that even though our paper advances the theoretical foundation of deep learning and non-convex optimization, there is still a gap between theory and many practical deep learning applications.


Introduction
Deep learning has seen practical success with a significant impact on the fields of computer vision, machine learning and artificial intelligence. In addition to its practical success, deep learning has been theoretically studied and shown to have strong expressive power. For example, neural networks with one hidden layer can approximate any continuous function (Leshno et al., 1993; Barron, 1993), and deeper neural networks can approximate functions of certain classes more compactly (Montufar et al., 2014; Livni et al., 2014; Telgarsky, 2016).
However, one of the major concerns in both theory and practice is that training a deep learning model requires us to deal with highly non-convex and high-dimensional optimization. Optimization problems with a general non-convex function and with a certain non-convex function induced by some specific neural networks are both known to be NP-hard (Murty and Kabadi, 1987; Blum and Rivest, 1992), which would pose no serious challenge if only it were not high-dimensional (Kawaguchi et al., 2015, 2016). Therefore, a hope is that non-convex high-dimensional optimization in deep learning allows some additional structure or assumption to make the optimization tractable. Under several simplification assumptions, recent studies have proven the existence of novel loss landscape structures that may play a role in making the optimization tractable in deep learning (Dauphin et al., 2014; Choromanska et al., 2015; Kawaguchi, 2016). More recently, Shamir (2018) has shown that a specific type of neural network, namely residual network (ResNet) with a single output unit (a scalar-valued output), has no local minimum with a value higher than the global minimum value of scalar-valued linear predictors (or equivalently, one-layer networks with a single output unit). However, Shamir (2018) remarks that while it is natural to ask whether this result can be extended to networks with multiple output units (vector-valued outputs) as they are commonly used in practice, it is currently unclear how to prove such a result and the question is left to future research.
As a step towards establishing the optimization theory in deep learning, this paper presents theoretical results that provide an answer to the open question remarked in (Shamir, 2018). Moreover, this paper proves a quantitative upper bound on the local minimum value, which shows that not only the local minimum values of deep ResNets are always no worse than the global minimum value of vector-valued linear predictors (or one-layer networks with multiple output units), but also further improvements on the quality of local minima are guaranteed via non-negligible residual representations.

Preliminaries
The Residual Network (ResNet) is a class of neural networks that is commonly used in practice with state-of-the-art performance in many applications (He et al., 2016a,b; Kim et al., 2016; Xie et al., 2017; Xiong et al., 2018). When compared to standard feedforward neural networks, ResNets introduce skip connections, which add the output of some previous layer directly to the output of some following layer. A main idea of ResNets is that these skip connections allow each layer to focus on fitting the residual of the target output that is not covered by the previous layers' outputs. Accordingly, we may expect a trained ResNet to be no worse than a trained shallower network consisting of only the earlier layers. However, because of the non-convexity, it is unclear whether ResNets actually exhibit this behavior, instead of getting stuck around some arbitrarily poor local minimum.

Model
To study the non-convex optimization problems of ResNets, both the previous study (Shamir, 2018) and this paper consider a type of arbitrarily deep ResNets, for which the pre-activation output h(x, W, V, θ) ∈ R^{d_y} of the last layer can be written as

h(x, W, V, θ) = W(x + V z(x, θ)).     (1)

Here, W ∈ R^{d_y×d_x}, V ∈ R^{d_x×d_z} and θ consist of trainable parameters, x ∈ R^{d_x} is the input vector in any fixed feature space embedded in R^{d_x}, and z(x, θ) ∈ R^{d_z} represents the outputs of arbitrarily deep residual functions parameterized by θ. Also, d_y is the number of output units, d_x is the number of input units, and d_z represents the dimension of the outputs of the residual functions.
There is no assumption on the structure of z(x, θ), and z(x, θ) is allowed to represent some possibly complicated deep residual functions that arise in ResNets. For example, the model in Equation (1) can represent arbitrarily deep pre-activation ResNets (He et al., 2016b), which are widely used in practice. To facilitate and simplify theoretical study, Shamir (2018) assumed that every entry of the matrix V is unconstrained and fully trainable (e.g., instead of V representing convolutions). This paper adopts this assumption, following the previous study.
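As an illustration (not part of the formal development), the model in Equation (1) can be sketched in a few lines of numpy. The two-layer tanh branch below is a hypothetical stand-in for z(x, θ); the results place no structural assumption on it.

```python
import numpy as np

def z_residual(x, theta):
    """A toy deep residual branch standing in for z(x, theta).

    Two tanh layers are used here purely for illustration; the paper
    allows z(x, theta) to be an arbitrary (possibly very deep) function.
    """
    A1, A2 = theta
    return A2 @ np.tanh(A1 @ x)

def h(x, W, V, theta):
    """Pre-activation output h(x, W, V, theta) = W (x + V z(x, theta))."""
    return W @ (x + V @ z_residual(x, theta))

rng = np.random.default_rng(0)
dx, dy, dz, d_hidden = 5, 3, 4, 8
W = rng.standard_normal((dy, dx))
V = rng.standard_normal((dx, dz))
theta = (rng.standard_normal((d_hidden, dx)),
         rng.standard_normal((dz, d_hidden)))
x = rng.standard_normal(dx)

out = h(x, W, V, theta)   # vector-valued output in R^{d_y}
```

The variable names (`z_residual`, `d_hidden`) are ours, not the paper's; only the functional form h = W(x + Vz) comes from Equation (1).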
Remark 1. (On arbitrary hand-crafted features) All of our results hold true with x in any fixed feature space embedded in R^{d_x}. Indeed, an input x to neural networks represents an input in any such feature space (instead of only in a raw input space); e.g., given a raw input x_raw and any fixed feature map φ, we can set x = φ(x_raw).

Remark 2. (On bias terms) All of our results hold true for the model with or without bias terms; i.e., given original x_original and z_original(x, θ), we can always set x = [(x_original)^⊤, 1]^⊤ ∈ R^{d_x} and z(x, θ) = [(z_original(x, θ))^⊤, 1]^⊤ ∈ R^{d_z} to account for bias terms if desired.

Optimization problem
The previous study (Shamir, 2018) and this paper consider the following optimization problem:

minimize_{W,V,θ}  L(W, V, θ) := E_{(x,y)∼µ}[ℓ(h(x, W, V, θ), y)],     (2)

where W, V, θ are unconstrained, ℓ is some loss function to be specified, and y ∈ R^{d_y} is the target vector. Here, µ is an arbitrary probability measure on the space of the pair (x, y) such that, whenever the partial derivative exists, the identity

∂_{(W,V)} E_{(x,y)∼µ}[ℓ(h(x, W, V, θ), y)] = E_{(x,y)∼µ}[∂_{(W,V)} ℓ(h(x, W, V, θ), y)]     (3)

holds at every local minimum (W, V, θ) (of L);¹ for example, an empirical measure µ with a training dataset {(x_i, y_i)}_{i=1}^m of finite size m always satisfies this condition. Therefore, all the results in this paper always hold true for the standard training error objective,

L(W, V, θ) = (1/m) Σ_{i=1}^m ℓ(h(x_i, W, V, θ), y_i),     (4)

which corresponds to µ = (1/m) Σ_{i=1}^m δ_{(x_i,y_i)} with the Dirac measures δ_{(x_i,y_i)}. In general, the objective function L(W, V, θ) in Equations (2) and (4) is non-convex in (W, V) even with a convex map h → ℓ(h, y).
This paper analyzes the quality of the local minima in Equation (2) in terms of the global minimum value L*_{x} of the linear predictors Rx with an arbitrary fixed basis x (e.g., x = φ(x_raw) with some feature map φ), defined as

L*_{x} := min_{R ∈ R^{d_y×d_x}} E_{(x,y)∼µ}[ℓ(Rx, y)].

Similarly, define L*_{x,z(x,θ)} to be the global minimum value of the linear predictors R[x^⊤ z(x, θ)^⊤]^⊤ with the concatenated basis [x^⊤ z(x, θ)^⊤]^⊤:

L*_{x,z(x,θ)} := min_{R ∈ R^{d_y×(d_x+d_z)}} E_{(x,y)∼µ}[ℓ(R[x^⊤ z(x, θ)^⊤]^⊤, y)].

1. A simple sufficient condition to satisfy Equation (3) is for ∂_{(W,V)} ℓ(h(x, W, V, θ), y) to be bounded in the neighborhood of every local minimum (W, V, θ) of L. Different sufficient conditions to satisfy Equation (3) can be easily obtained by applying various convergence theorems (e.g., the dominated convergence theorem) to the limit (in the definition of derivative) and the integral (in the definition of expectation).
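As a numerical illustration, under the squared loss and an empirical measure (the concrete setting of Section 3.2), both global minimum values reduce to least-squares problems. The sketch below, with randomly generated matrices standing in for a training set and for z(x, θ), computes L*_{x} and L*_{x,z(x,θ)} and checks that enlarging the basis can only help.

```python
import numpy as np

rng = np.random.default_rng(1)
m, dx, dz, dy = 50, 5, 4, 3
X = rng.standard_normal((m, dx))     # rows are x_i^T (stand-in training inputs)
Z = rng.standard_normal((m, dz))     # rows stand in for z(x_i, theta)^T, fixed theta
Y = rng.standard_normal((m, dy))     # rows are y_i^T (stand-in targets)

def lstar(F, Y):
    """Global minimum of (1/m) ||F R - Y||_F^2 over R: linear predictors on basis F."""
    R, *_ = np.linalg.lstsq(F, Y, rcond=None)
    return np.sum((F @ R - Y) ** 2) / len(F)

L_x = lstar(X, Y)                    # L*_{x}
L_xz = lstar(np.hstack([X, Z]), Y)   # L*_{x, z(x,theta)}
```

Restricting the coefficients of the z-part to zero recovers the X-only predictor, so L*_{x,z(x,θ)} ≤ L*_{x} always holds, mirroring the discussion after Theorem 1.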

Background
Given any fixed θ, let L_θ(W, V) := L(W, V, θ) be a function of (W, V). The main additional assumptions in the previous study (Shamir, 2018) are the following:

PA1. The output dimension is one: d_y = 1.
PA2. For any y, the map h → ℓ(h, y) is convex and twice differentiable.
PA3. On any bounded subset of the domain of L, the function L_θ(W, V), its gradient ∇L_θ(W, V), and its Hessian ∇²L_θ(W, V) are all Lipschitz continuous.

The previous work (Shamir, 2018) also implicitly requires Equation (3) to hold at all relevant points for optimization, including every local minimum (see the proof in the previous paper for more detail), which is not required in this paper. Under these assumptions, along with an analysis for a simpler decoupled model (Wx + V z(x, θ)), the previous study (Shamir, 2018) provided a quantitative analysis of approximate stationary points, and proved the following main result for the optimization problem in Equation (2).

Proposition 1. (Shamir, 2018) If PA1, PA2 and PA3 hold, every local minimum (W, V, θ) of L satisfies L(W, V, θ) ≤ L*_{x}.
The previous paper (Shamir, 2018) remarked that it is an open question whether Proposition 1, along with the quantitative analysis of approximate stationary points, can be obtained for networks with multiple (d_y > 1) output units.

Main results
Our main results are presented in Section 3.1 for a general case with arbitrary loss and arbitrary measure, and in Section 3.2 for a concrete case with the squared loss and the empirical measure.

Result for arbitrary loss and arbitrary measure
This paper discards the above assumptions from the previous literature, and adopts the following assumptions instead:

Assumption A1. The output dimension is no greater than the other dimensions: d_y ≤ min(d_x, d_z).

Assumption A2. For any y, the map h → ℓ(h, y) is convex and differentiable.
Assumptions A1 and A2 can be easily satisfied in many practical applications in deep learning. For example, we usually have that d_y = 10 ≪ d_x, d_z in multi-class classification with MNIST, CIFAR-10 and SVHN, which satisfies Assumption A1. Assumption A2 is usually satisfied in practice as well, because it is satisfied by simply using a common ℓ such as the squared loss, cross-entropy loss, logistic loss and smoothed hinge loss, among others.
Using these mild assumptions, we now state our main result in Theorem 1 for arbitrary loss and arbitrary measure (including the empirical measure).
Theorem 1. If Assumptions A1 and A2 hold, every local minimum (W, V, θ) of L satisfies

L(W, V, θ) ≤ L*_{x} − (L*_{x} − L*_{x,z(x,θ)}) = L*_{x,z(x,θ)}.     (5)

Remark 3. From Theorem 1, one can see that if Assumptions A1 and A2 hold, the objective function L(W, V, θ) has the following properties: (i) every local minimum value is at most the global minimum value of linear predictors with the arbitrary fixed basis x, as L(W, V, θ) ≤ L*_{x}; and (ii) every local minimum value improves further on L*_{x} by at least the margin L*_{x} − L*_{x,z(x,θ)} ≥ 0.

Here, the set of our assumptions is strictly weaker than the set of assumptions used to prove Proposition 1 in the previous work (Shamir, 2018) (including all assumptions implicitly made in the description of the model, optimization problem, and probability measure), in that the latter implies the former but not vice versa. For example, one can compare Assumptions A1 and A2 against the previous paper's assumptions PA1, PA2 and PA3 in Section 2.3. We note that along with Proposition 1, the previous work (Shamir, 2018) also provided an analysis of approximate stationary points, for which some additional continuity assumption such as PA2 and PA3 would be indispensable (i.e., one can infer the properties around a point from those at the point via some continuity).
In addition to responding to the open question, Theorem 1 further states that the guarantee on the local minimum value of ResNets can be much better than the global minimum value of linear predictors, depending on the quality of the residual representation z(x, θ). In Theorem 1, we always have that L*_{x} − L*_{x,z(x,θ)} ≥ 0. This is because a linear predictor with the basis φ_θ(x) = [x^⊤ z(x, θ)^⊤]^⊤ can achieve L*_{x} by restricting the coefficients of z(x, θ) to be zero and minimizing only over the rest. Accordingly, if z(x, θ) is non-negligible in the sense that L*_{x} − L*_{x,z(x,θ)} > 0, the local minimum value of a ResNet is guaranteed to be strictly better than the global minimum value of linear predictors, the degree of which is abstractly quantified in Theorem 1 and concretely quantified in the next subsection.

Result for squared loss and empirical measure
To provide a concrete example of Theorem 1, this subsection sets ℓ to be the squared loss and µ to be the empirical measure. That is, this subsection discards Assumption A2 and uses the following assumptions instead:

Assumption B1. The map h → ℓ(h, y) represents the squared loss: ℓ(h, y) = ‖h − y‖²₂.
Assumption B2. The measure µ is the empirical measure: µ = (1/m) Σ_{i=1}^m δ_{(x_i,y_i)}.
Assumptions B1 and B2 imply that L(W, V, θ) = (1/m) Σ_{i=1}^m ‖h(x_i, W, V, θ) − y_i‖²₂. Let us define the matrix notation of the relevant terms as X := [x_1 x_2 ··· x_m]^⊤ ∈ R^{m×d_x}, Y := [y_1 y_2 ··· y_m]^⊤ ∈ R^{m×d_y}, and Z(X, θ) := [z(x_1, θ) z(x_2, θ) ··· z(x_m, θ)]^⊤ ∈ R^{m×d_z}. Let P[M] be the orthogonal projection matrix onto the column space (or range space) of a matrix M. Let P_N[M] := I − P[M] be the orthogonal projection matrix onto the null space (or kernel space) of M^⊤, i.e., the orthogonal complement of the column space of M. Let ‖·‖_F be the Frobenius norm. We now state a concrete example of Theorem 1 for the case of the squared loss and the empirical measure.

Theorem 2. If Assumptions A1, B1 and B2 hold, every local minimum (W, V, θ) of L satisfies

L(W, V, θ) ≤ (1/m)‖P_N[X] Y‖²_F − (1/m)‖P[P_N[X] Z(X, θ)] P_N[X] Y‖²_F.
As in Theorem 1, one can see in Theorem 2 that every local minimum value is at most the global minimum value of linear predictors. When compared with Theorem 1, each term in the upper bound in Theorem 2 is more concrete. The global minimum value of linear predictors is L*_{x} = (1/m)‖P_N[X] Y‖²_F, which is the (averaged) norm of the target data matrix Y projected onto the null space of X^⊤. The further improvement term via the residual representation is (1/m)‖P[P_N[X] Z(X, θ)] P_N[X] Y‖²_F. This is the (averaged) norm of the residual P_N[X] Y projected onto the column space of P_N[X] Z(X, θ). Therefore, a local minimum can obtain the further improvement if the residual P_N[X] Y is captured by the residual representation Z(X, θ) in directions that differ from X, as intended in the residual architecture. More concretely, as the column space of Z(X, θ) differs more from the column space of X, the further improvement term ‖P[P_N[X] Z(X, θ)] P_N[X] Y‖²_F becomes closer to ‖P[Z(X, θ)] P_N[X] Y‖²_F, which gets larger as the residual P_N[X] Y gets more captured by the column space of Z(X, θ).
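The two terms of this concrete bound can be checked numerically. The sketch below, with random matrices standing in for X, Z(X, θ) and Y, verifies that L*_{x} minus the improvement term equals the least-squares optimum over the concatenated basis [X Z], i.e., L*_{x,z(x,θ)}.

```python
import numpy as np

rng = np.random.default_rng(2)
m, dx, dz, dy = 40, 5, 4, 3
X = rng.standard_normal((m, dx))
Z = rng.standard_normal((m, dz))     # stands in for Z(X, theta)
Y = rng.standard_normal((m, dy))

P = lambda M: M @ np.linalg.pinv(M)           # projection onto column space of M
PN = lambda M: np.eye(len(M)) - P(M)          # projection onto its orthogonal complement

L_x = np.sum((PN(X) @ Y) ** 2) / m            # (1/m) ||P_N[X] Y||_F^2
improvement = np.sum((P(PN(X) @ Z) @ PN(X) @ Y) ** 2) / m

# Cross-check against direct least squares over the concatenated basis [X Z]:
R, *_ = np.linalg.lstsq(np.hstack([X, Z]), Y, rcond=None)
L_xz = np.sum((np.hstack([X, Z]) @ R - Y) ** 2) / m
```

For generic random data all matrices are full rank, and L_x − improvement agrees with L_xz up to numerical precision.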

Proof idea and additional results
This section provides overviews of the proofs of the theoretical results. The complete proofs are provided in the Appendix at the end of this paper. In contrast to the previous work (Shamir, 2018), this paper proves the quality of the local minima with the additional further improvement term and without assuming the scalar output (PA1), twice differentiability (PA2) and Lipschitz continuity (PA3). Accordingly, our proofs largely differ from those of the previous study (Shamir, 2018).
Along with the proofs of the main results, this paper proves the following lemmas.

Lemma 1. (derivatives of predictor)
The function h(x, W, V, θ) is differentiable with respect to (W, V), and the partial derivatives have the following forms:

∂vec(h(x, W, V, θ))/∂vec(W)^⊤ = (x + V z(x, θ))^⊤ ⊗ I_{d_y},   ∂vec(h(x, W, V, θ))/∂vec(V)^⊤ = z(x, θ)^⊤ ⊗ W.

Lemma 2. (first-order conditions at local minima)
Let D := ∂ℓ(h, y)/∂h evaluated at h = h(x, W, V, θ), viewed as a row vector in R^{1×d_y}. If Assumptions A1 and A2 hold, every local minimum (W, V, θ) of L satisfies E_{x,y∼µ}[z(x, θ)D] = 0 and E_{x,y∼µ}[xD] = 0.

Proof overview of lemmas
Lemma 1 follows from a standard observation and a common derivation. Lemma 2 is proven with a case analysis, separately for the case of rank(W) ≥ d_y and the case of rank(W) < d_y.
In the case of rank(W) ≥ d_y, the statement of Lemma 2 follows from the first-order necessary condition of local minima, ∂_{(W,V)} L(W, V, θ) = 0, along with the observation that the derivative of L with respect to (W, V) exists.
In the case of rank(W) < d_y, instead of relying solely on the first-order conditions, our proof directly utilizes the definition of local minimum as follows. We first consider a family of sufficiently small perturbations Ṽ of V such that L(W, Ṽ, θ) = L(W, V, θ), and observe that if (W, V, θ) is a local minimum, then (W, Ṽ, θ) must be a local minimum via the definition of local minimum and the triangle inequality. Then, by checking the first-order necessary conditions of local minima for both (W, Ṽ, θ) and (W, V, θ), we obtain the statement of Lemma 2.
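This overview leaves the family of perturbations implicit. A minimal numeric sketch is given below, assuming the null-space form Ṽ(ν) = V + uν^⊤ with Wu = 0 and ‖u‖₂ = 1, which is one natural instantiation consistent with the requirement L(W, Ṽ, θ) = L(W, V, θ); it checks that the model output, and hence the loss, is unchanged by the perturbation.

```python
import numpy as np

rng = np.random.default_rng(3)
dy, dx, dz = 2, 5, 4                 # rank(W) <= dy < dx, so null(W) is nontrivial
W = rng.standard_normal((dy, dx))
V = rng.standard_normal((dx, dz))
x = rng.standard_normal(dx)
z = rng.standard_normal(dz)          # stands in for z(x, theta)

# A unit vector u in the null space of W: the trailing right-singular
# vectors of W span null(W) when rank(W) < dx.
u = np.linalg.svd(W)[2][-1]
assert np.allclose(W @ u, 0)

nu = 1e-3 * rng.standard_normal(dz)  # a small perturbation direction
V_tilde = V + np.outer(u, nu)        # hypothetical perturbation V + u nu^T

h = lambda V_: W @ (x + V_ @ z)      # pre-activation output as a function of V
```

Since W(V_tilde z) = W(Vz) + (ν^⊤z)Wu = W(Vz), the output is identical for every ν, not only small ones; the smallness is needed only to stay inside the ball from the definition of local minimum.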

Proof overview of theorems
Theorem 1 is proven by showing that, from Lemma 2, every local minimum (W, V, θ) induces a globally optimal linear predictor on the concatenated basis [x^⊤ z(x, θ)^⊤]^⊤, of the form R(W, V) = [W, WV], so that L(W, V, θ) ≤ L*_{x,z(x,θ)}. In the proof of Theorem 2, we derive the specific form of L*_{x,z(x,θ)} for the case of the squared loss and the empirical measure, obtaining the statement of Theorem 2.
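A quick numeric check of this reduction, assuming the induced predictor takes the form R(W, V) = [W, WV] (a natural candidate, since h = Wx + WVz):

```python
import numpy as np

rng = np.random.default_rng(4)
dy, dx, dz = 3, 5, 4
W = rng.standard_normal((dy, dx))
V = rng.standard_normal((dx, dz))
x = rng.standard_normal(dx)
z = rng.standard_normal(dz)              # stands in for z(x, theta)

R = np.hstack([W, W @ V])                # candidate induced linear predictor [W, WV]
h = W @ (x + V @ z)                      # ResNet output from Equation (1)
```

The ResNet output coincides exactly with the linear prediction R[x^⊤ z^⊤]^⊤, so the loss of the network equals the loss of this linear predictor on the concatenated features.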

Conclusion
In this paper, we partially addressed an open problem on a type of deep ResNets by showing that instead of having arbitrarily poor local minima, all local minimum values are no worse than the global minimum value of linear predictors, and are guaranteed to further improve via the residual representation. This paper considered the exact same (and more general) optimization problem of ResNets as in the previous literature. However, the optimization problem in this paper and the literature does not yet directly apply to many practical applications, because the parameters in the matrix V are considered to be unconstrained. To improve the applicability, future work would consider the problem with constrained V .
Mathematically, we can consider a map that takes a classical machine learning model with linear predictors (with arbitrary fixed features) as an input and outputs a deep version of the classical model. We can then ask what structure this "deepening" map preserves. In terms of this context, this paper proved that in a type of ResNets, depth with nonlinearity (a certain "deepening" map) does not create local minima with loss values worse than the global minimum value of the original model.

Appendix
This appendix presents complete proofs.

Appendix A. Proofs of lemmas
A.1 Proof of Lemma 1

Proof. The differentiability follows from the fact that h(x, W, V, θ) is linear in W and affine in V given the other variables being fixed; i.e., vec(h(x, W, V, θ)) = vec(W(x + V z(x, θ))) = ((x + V z(x, θ))^⊤ ⊗ I_{d_y}) vec(W), and vec(h(x, W, V, θ)) = vec(Wx) + vec(WV z(x, θ)) = vec(Wx) + (z(x, θ)^⊤ ⊗ W) vec(V). Taking derivatives of h(x, W, V, θ) in these forms with respect to vec(W) and vec(V) respectively yields the desired statement.
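The linearity in W and affineness in V can be verified directly, as an illustration: the increments below are exact, with no higher-order remainder, which is what makes the derivative forms in Lemma 1 elementary.

```python
import numpy as np

rng = np.random.default_rng(5)
dy, dx, dz = 3, 5, 4
W = rng.standard_normal((dy, dx))
V = rng.standard_normal((dx, dz))
x = rng.standard_normal(dx)
z = rng.standard_normal(dz)              # stands in for z(x, theta)

h = lambda W_, V_: W_ @ (x + V_ @ z)     # the model of Equation (1)

dW = rng.standard_normal((dy, dx))       # arbitrary (not small) increments
dV = rng.standard_normal((dx, dz))

# Linear in W: the exact increment is dW (x + V z).
lin_W = np.allclose(h(W + dW, V) - h(W, V), dW @ (x + V @ z))
# Affine in V: the exact increment is W dV z.
aff_V = np.allclose(h(W, V + dV) - h(W, V), W @ (dV @ z))
```

Because both identities hold for arbitrary increments, the differentials read off immediately, matching the Kronecker-product forms obtained via vec(ABC) = (C^⊤ ⊗ A) vec(B).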

A.2 Proof of Lemma 2
Proof. This proof considers two cases in terms of rank(W), and proves that the desired statement holds in both cases. Note that from Lemma 1 and Assumption A2, ℓ(h(x, W, V, θ), y) is differentiable with respect to (W, V), because a composition of differentiable functions is differentiable. From the condition on µ, this implies that L(W, V, θ) is differentiable with respect to (W, V) at every local minimum (W, V, θ). Also, note that since a W (or a V) in our analysis is either an arbitrary point or a point depending on µ (as well as ℓ and h), we can write E_{x,y∼µ}[g(x, y)W(µ)] = ∫ g(x, y)W(µ) dµ(x, y) = E_{x,y∼µ}[g(x, y)]W(µ), where g is some function of (x, y) and W(µ) = W with the possible dependence made explicit (the same statement holds for V). Let z = z(x, θ), E_{x,y} = E_{x,y∼µ}, and D := ∂ℓ(h, y)/∂h evaluated at h = h(x, W, V, θ) (a row vector in R^{1×d_y}) for notational simplicity.
Case of rank(W) ≥ d_y: From the first-order condition of local minimum with respect to V,

0 = ∂L(W, V, θ)/∂vec(V) = E_{x,y}[vec(W^⊤ D^⊤ z^⊤)],

and since rank(W) ≥ d_y means that W^⊤ ∈ R^{d_x×d_y} has full column rank, this yields E_{x,y}[zD] = 0. Similarly, from the first-order condition of local minimum with respect to W,

0 = ∂L(W, V, θ)/∂vec(W) = E_{x,y}[vec((x + Vz)D)],

where the second equality follows from Lemma 1. This implies that

0 = E_{x,y}[(x + Vz)D] = E_{x,y}[xD] + V E_{x,y}[zD] = E_{x,y}[xD],

where the last equality follows from that E_{x,y}[zD] = 0. Therefore, if (W, V, θ) is a local minimum and if rank(W) ≥ d_y, we have that E_{x,y}[zD] = 0 and E_{x,y}[xD] = 0.
Case of rank(W) < d_y: If (W, V, θ) is a local minimum, (W, V) must be a local minimum with respect to (W, V) (given the fixed θ). If (W, V) is a local minimum with respect to (W, V) (given the fixed θ), by the definition of a local minimum, there exists ǫ > 0 such that L(W, V, θ) ≤ L(W′, V′, θ) for all (W′, V′) ∈ B_ǫ(W, V), where B_ǫ(W, V) is an open ball of radius ǫ with the center at (W, V). Since rank(W) < d_y ≤ d_x, the null space of W is nontrivial, and we can fix a vector u ∈ R^{d_x} with Wu = 0 and ‖u‖₂ = 1; for ν ∈ R^{d_z}, define the perturbation Ṽ(ν) := V + uν^⊤, which satisfies h(x, W, Ṽ(ν), θ) = h(x, W, V, θ) and hence L(W, Ṽ(ν), θ) = L(W, V, θ). For any sufficiently small ν ∈ R^{d_z} such that (W, Ṽ(ν)) ∈ B_{ǫ/2}(W, V), if (W, V) is a local minimum, every (W, Ṽ(ν)) is also a local minimum, because there exists ǫ′ = ǫ/2 > 0 such that L(W, Ṽ(ν), θ) = L(W, V, θ) ≤ L(W′, V′, θ) for all (W′, V′) ∈ B_{ǫ′}(W, Ṽ(ν)) ⊆ B_ǫ(W, V) (the inclusion follows the triangle inequality), which satisfies the definition of local minimum for (W, Ṽ(ν)). Thus, for any such sufficiently small ν ∈ R^{d_z}, we have that

∂L(W, Ṽ(ν), θ)/∂vec(W) = 0,

since otherwise (W, Ṽ(ν)) does not satisfy the first-order necessary condition of local minima (i.e., W can be moved in the direction of the nonzero partial derivative with a sufficiently small magnitude ǫ′ ∈ (0, ǫ/2) to decrease the loss value, which contradicts (W, Ṽ(ν)) being a local minimum). Hence, for any such sufficiently small ν ∈ R^{d_z},

0 = ∂L(W, Ṽ(ν), θ)/∂vec(W) = E_{x,y}[vec((x + Ṽ(ν)z)D)] = E_{x,y}[vec((x + Vz)D)] + vec(uν^⊤ E_{x,y}[zD]) = vec(uν^⊤ E_{x,y}[zD]),

where the last line follows from the fact that 0 = ∂L(W, V, θ)/∂vec(W) = ∂L(W, Ṽ(0), θ)/∂vec(W) = E_{x,y}[vec((x + Vz)D)] and hence E_{x,y}[(x + Vz)D] = 0. Since ‖u‖₂ = 1, by multiplying both sides by u^⊤ from the left, we have that ν^⊤ E_{x,y}[zD] = 0 for any sufficiently small ν ∈ R^{d_z} such that (W, Ṽ(ν)) ∈ B_{ǫ/2}(W, V), and hence E_{x,y}[zD] = 0. Then, from ∂L(W, V, θ)/∂vec(W) = 0, we obtain E_{x,y}[xD] = E_{x,y}[(x + Vz)D] − V E_{x,y}[zD] = 0. This completes the proof of Lemma 2.

Appendix B. Proofs of theorems

B.1 Proof of Theorem 1

Proof. Given any fixed θ, define L_R(R, θ) := E_{x,y}[ℓ(R[x^⊤ z^⊤]^⊤, y)] for R ∈ R^{d_y×(d_x+d_z)}, and let R(W, V) := [W, WV], so that h(x, W, V, θ) = R(W, V)[x^⊤ z^⊤]^⊤ and L(W, V, θ) = L_R(R(W, V), θ). Since the map h → ℓ(h, y) is convex and an expectation of convex functions is convex, E_{x,y}[ℓ(h, y)] is convex in h. Since a composition of a convex function with an affine function is convex, L_R(R, θ) is convex in R. Therefore, from the convexity, if ∂L_R(R, θ)/∂vec(R) = 0 at R = R(W, V), then R(W, V) is a global minimum of L_R(·, θ). We now show that if (W, V, θ) is a local minimum, then ∂L_R(R, θ)/∂vec(R) vanishes at R = R(W, V), and hence R = R(W, V) is a global minimum of L_R(·, θ).
On the one hand, with the same calculations as in the proofs of Lemmas 1 and 2, we have that ∂L_R(R, θ)/∂vec(R) at R = R(W, V) equals E_{x,y}[vec([x^⊤ z^⊤]^⊤ D)]. On the other hand, Lemma 2 states that if (W, V, θ) is a local minimum of L, we have that E_{x,y}[xD] = 0 and E_{x,y}[zD] = 0, and hence E_{x,y}[[x^⊤ z^⊤]^⊤ D] = 0. Therefore, ∂L_R(R, θ)/∂vec(R) vanishes at R = R(W, V), and R(W, V) is a global minimum of L_R(·, θ), which yields L(W, V, θ) = L_R(R(W, V), θ) = L*_{x,z(x,θ)} and hence the statement of Theorem 1.

B.2 Proof of Theorem 2

Proof. From Theorem 1, we have that L(W, V, θ) ≤ L*_{x,z(x,θ)}. In this proof, we derive the specific form of L*_{x,z(x,θ)} for the case of the squared loss and the empirical measure. Let Z = Z(X, θ) for notational simplicity. Since the map h → ℓ(h, y) is assumed to represent the squared loss in this theorem, the global minimum value L*_{x,z(x,θ)} of linear predictors is the global minimum value of

g(R) = (1/m)‖[X Z]R − Y‖²_F,

where R ∈ R^{(d_x+d_z)×d_y}. From the convexity and differentiability of g(R), R is a global minimum if and only if ∂g(R)/∂vec(R) = 0. Solving ∂g(R)/∂vec(R) = 0 for all solutions of R yields the normal equation [X Z]^⊤[X Z]R = [X Z]^⊤Y, and hence [X Z]R = P[[X Z]]Y.
Also, the same proof step obtains the fact that (1/m)‖P_N[X]Y‖²_F is the global minimum value of g′(R) = (1/m)‖XR − Y‖²_F, which is the objective function with linear predictors R^⊤x; i.e., L*_{x} = (1/m)‖P_N[X]Y‖²_F.
On the other hand, since the span of the columns of [X Z] is the same as the span of the columns of [X, P_N[X]Z], we have that P[[X Z]] = P[[X, P_N[X]Z]] = P[X] + P[P_N[X]Z], where the second equality holds because the columns of P_N[X]Z are orthogonal to those of X. Hence,

L*_{x,z(x,θ)} = (1/m)‖P[[X Z]]Y − Y‖²_F = (1/m)‖P[P_N[X]Z]Y − P_N[X]Y‖²_F = (1/m)‖P_N[X]Y‖²_F − (1/m)‖P[P_N[X]Z]P_N[X]Y‖²_F,

where the last equality uses P[P_N[X]Z]Y = P[P_N[X]Z]P_N[X]Y and the orthogonality of the decomposition. Combined with Theorem 1, this yields the statement of Theorem 2.
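The projection decomposition used in this step can be confirmed numerically, with random matrices standing in for X and Z(X, θ):

```python
import numpy as np

rng = np.random.default_rng(6)
m, dx, dz = 30, 5, 4
X = rng.standard_normal((m, dx))
Z = rng.standard_normal((m, dz))     # stands in for Z(X, theta)

P = lambda M: M @ np.linalg.pinv(M)  # orthogonal projection onto column space
PN_X = np.eye(m) - P(X)              # projection onto the complement of col(X)

# col([X Z]) = col(X) + col(P_N[X] Z), and the two subspaces are orthogonal,
# so the corresponding projections add.
lhs = P(np.hstack([X, Z]))
rhs = P(X) + P(PN_X @ Z)
```

For generic random matrices the subspaces are in general position and the two sides agree up to numerical precision.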