Existence, uniqueness, and convergence rates for gradient flows in the training of artificial neural networks with ReLU activation

The training of artificial neural networks (ANNs) with rectified linear unit (ReLU) activation via gradient descent (GD) type optimization schemes is nowadays a common industrially relevant procedure. Till this day in the scientific literature there is in general no mathematical convergence analysis which explains the numerical success of GD type optimization schemes in the training of ANNs with ReLU activation. GD type optimization schemes can be regarded as temporal discretization methods for the gradient flow (GF) differential equations associated to the considered optimization problem and, in view of this, it seems to be a natural direction of research to first aim to develop a mathematical convergence theory for time-continuous GF differential equations and, thereafter, to aim to extend such a time-continuous convergence theory to implementable time-discrete GD type optimization methods. In this article we establish two basic results for GF differential equations in the training of fully-connected feedforward ANNs with one hidden layer and ReLU activation. In the first main result of this article we establish in the training of such ANNs under the assumption that the probability distribution of the input data of the considered supervised learning problem is absolutely continuous with a bounded density function that every GF differential equation admits for every initial value a solution which is also unique among a suitable class of solutions. In the second main result of this article we prove in the training of such ANNs under the assumption that the target function and the density function of the probability distribution of the input data are piecewise polynomial that every non-divergent GF trajectory converges with an appropriate rate of convergence to a critical point and that the risk of the non-divergent GF trajectory converges with rate 1 to the risk of the critical point.


Introduction
The training of artificial neural networks (ANNs) with rectified linear unit (ReLU) activation via gradient descent (GD) type optimization schemes is nowadays a common industrially relevant procedure which appears, for instance, in the context of natural language processing, face recognition, fraud detection, and game intelligence. Although there exist a large number of numerical simulations in which GD type optimization schemes are effectively used to train ANNs with ReLU activation, till this day in the scientific literature there is in general no mathematical convergence analysis which explains the success of GD type optimization schemes in the training of such ANNs.
GD type optimization schemes can be regarded as temporal discretization methods for the gradient flow (GF) differential equations associated to the considered optimization problem and, in view of this, it seems to be a natural direction of research to first aim to develop a mathematical convergence theory for time-continuous GF differential equations and, thereafter, to aim to extend such a time-continuous convergence theory to implementable time-discrete GD type optimization methods.
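In schematic notation (generic objective function L, generalized gradient G, and initial value θ; the specific risk function and its generalized gradient considered in this article are introduced in Section 2 below), the GF differential equation and its explicit Euler discretization, i.e., the plain GD scheme with learning rates γ_n ∈ (0, ∞), read
\[
  % schematic notation only; not the precise formulation of the GF/GD dynamics studied below
  \Theta_0 = \theta, \qquad \tfrac{\mathrm{d}}{\mathrm{d}t}\, \Theta_t = - G(\Theta_t) \quad \text{for } t \in [0, \infty),
  \qquad\text{and}\qquad
  \theta_{n+1} = \theta_n - \gamma_n\, G(\theta_n) \quad \text{for } n \in \mathbb{N}_0 .
\]
In this sense every convergence statement for the time-continuous GF dynamics can be regarded as a first step towards a convergence statement for the associated time-discrete GD scheme.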
Although there is in general no theoretical analysis which explains the success of GD type optimization schemes in the training of ANNs in the literature, there are several auspicious analysis approaches as well as several promising partial error analyses regarding the training of ANNs via GD type optimization schemes and GFs, respectively, in the literature. For convex objective functions, the convergence of GF and GD processes to the global minimum in different settings has been proved, e.g., in [5,23,34,35,38]. For general non-convex objective functions, even under smoothness assumptions GF and GD processes can show wild oscillations and admit infinitely many limit points, cf., e.g., [1]. A standard condition which excludes this undesirable behavior is the Łojasiewicz inequality and we point to [1,3,4,8,16,28,29,30,31,33,36] for convergence results for GF and GD processes under Łojasiewicz type assumptions. It is in fact one of the main contributions of this work to demonstrate that the objective functions occurring in the training of ANNs with ReLU activation satisfy an appropriate Łojasiewicz inequality, provided that both the target function and the density of the probability distribution of the input data are piecewise polynomial. For further abstract convergence results for GF and GD processes in the non-convex setting we refer, e.g., to [6,20,32,37,40] and the references mentioned therein.
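For orientation we recall the classical form of the Łojasiewicz inequality referred to above (a standard formulation from the literature, not a statement from this article): for every real analytic f : R^n → R and every x ∈ R^n there exist ε, C ∈ (0, ∞) and an exponent α ∈ (0, 1) such that
\[
  % classical gradient inequality of Lojasiewicz for real analytic functions
  | f(y) - f(x) |^{\alpha} \le C\, \| (\nabla f)(y) \| \qquad \text{for all } y \in \mathbb{R}^n \text{ with } \| y - x \| \le \varepsilon .
\]
Estimates of this type exclude the oscillatory behavior described above and are the key tool for deriving convergence rates for GF and GD trajectories in the non-convex setting.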
In the overparametrized regime, where the number of training parameters is much larger than the number of training data points, GF and GD processes can be shown to converge to global minima in the training of ANNs with high probability, cf., e.g., [2,14,17,19,21,22,41]. As the number of neurons increases to infinity, the corresponding GF processes converge (with appropriate rescaling) to a measure-valued process which is known in the scientific literature as a Wasserstein gradient flow. For results on the convergence behavior of Wasserstein gradient flows in the training of ANNs we point, e.g., to [9], [12], [13], [18, Section 5.1], and the references mentioned therein.
A different approach is to consider only very special target functions and we refer, in particular, to [10,25] for a convergence analysis for GF and GD processes in the case of constant target functions and to [26] for a convergence analysis for GF and GD processes in the training of ANNs with piecewise linear target functions. In the case of linear target functions, a complete characterization of the non-global local minima and the saddle points of the risk function has been obtained in [11].
In this article we establish two basic results for GF differential equations in the training of fully-connected feedforward ANNs with one hidden layer and ReLU activation. Specifically, in the first main result of this article, see Theorem 1.1 below, we establish in the training of such ANNs under the assumption that the probability distribution of the input data of the considered supervised learning problem is absolutely continuous with a bounded density function that every GF differential equation possesses for every initial value a solution which is also unique among a suitable class of solutions (see (1.4) in Theorem 1.1 for details). In the second main result of this article, see Theorem 1.2 below, we prove in the training of such ANNs under the assumption that the target function and the density function are piecewise polynomial (see (1.6) below for details) that every non-divergent GF trajectory converges with an appropriate speed of convergence (see (1.9) below) to a critical point.
In Theorems 1.1 and 1.2 we consider ANNs with d ∈ N = {1, 2, 3, . . . } neurons on the input layer (d-dimensional input), H ∈ N neurons on the hidden layer (H-dimensional hidden layer), and 1 neuron on the output layer (1-dimensional output). There are thus Hd scalar real weight parameters and H scalar real bias parameters to describe the affine linear transformation between the d-dimensional input layer and the H-dimensional hidden layer and there are thus H scalar real weight parameters and 1 scalar real bias parameter to describe the affine linear transformation between the H-dimensional hidden layer and the 1-dimensional output layer. Altogether there are thus 𝔡 = Hd + H + H + 1 = Hd + 2H + 1 real numbers to describe the ANNs in Theorems 1.1 and 1.2.
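For illustration we note that the realization function of such an ANN can be written out explicitly. The indexing convention in the following display is our own shorthand, inferred from the gradient components G_{(i−1)d+j}, G_{Hd+i}, G_{H(d+1)+i}, and G_𝔡 appearing in Sections 2 and 3 below, and is not a verbatim quotation of Setting 2.1: for a parameter vector θ = (θ_1, . . . , θ_𝔡) ∈ R^𝔡 the realization N^θ : R^d → R is given by
\[
  % illustrative indexing convention (inferred, not verbatim from Setting 2.1)
  N^{\theta}(x)
  = \theta_{\mathfrak{d}}
  + \sum_{i=1}^{H} \theta_{H(d+1)+i} \, \max\!\Big\{ \theta_{Hd+i} + \sum_{j=1}^{d} \theta_{(i-1)d+j}\, x_j ,\, 0 \Big\} ,
  \qquad x = (x_1, \dots, x_d) \in \mathbb{R}^d .
\]
For instance, for input dimension d = 2 and H = 3 hidden neurons this amounts to 𝔡 = 3·2 + 2·3 + 1 = 13 trainable real parameters.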
The real numbers 𝒶 ∈ R, 𝒷 ∈ (𝒶, ∞) in Theorems 1.1 and 1.2 are used to specify the set [𝒶, 𝒷]^d in which the input data of the considered supervised learning problem takes values and the function f : [𝒶, 𝒷]^d → R in Theorem 1.1 specifies the target function of the considered supervised learning problem.
In Theorem 1.1 we assume that the target function is an element of the set C([𝒶, 𝒷]^d, R) of continuous functions from [𝒶, 𝒷]^d to R but besides this continuity hypothesis we do not impose further regularity assumptions on the target function.
The function p : [𝒶, 𝒷]^d → [0, ∞) in Theorems 1.1 and 1.2 is an unnormalized density function of the probability distribution of the input data of the considered supervised learning problem and in Theorem 1.1 we impose that this unnormalized density function is bounded and measurable.
In Theorems 1.1 and 1.2 we consider ANNs with the ReLU activation function R ∋ x ↦ max{x, 0} ∈ R. The ReLU activation function fails to be differentiable and this lack of regularity also transfers to the risk function of the considered supervised learning problem; cf. (1.3) below. We thus need to employ appropriately generalized gradients of the risk function to specify the dynamics of the gradient flows. As in [25, Setting 2.1 and Proposition 2.3] (cf. also [10,24]), we accomplish this, first, by approximating the ReLU activation function through continuously differentiable functions which converge pointwise to the ReLU activation function and whose derivatives converge pointwise to the left derivative of the ReLU activation function and, thereafter, by specifying the generalized gradient function as the limit of the gradients of the approximated risk functions; see (1.1) and (1.3) in Theorem 1.1 and (1.7) and (1.8) in Theorem 1.2 for details.
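One admissible approximating family (an illustrative choice on our part; the construction in [25, Setting 2.1] may differ in detail) consists, for n ∈ N, of the continuously differentiable functions R_n : R → R given by
\[
  % illustrative C^1 approximation of the ReLU function (not necessarily the choice in [25])
  R_n(x) =
  \begin{cases}
    0 & : x \le 0, \\[2pt]
    \tfrac{n}{2}\, x^2 & : 0 < x < \tfrac{1}{n}, \\[2pt]
    x - \tfrac{1}{2n} & : x \ge \tfrac{1}{n},
  \end{cases}
  \qquad\text{with}\qquad
  (R_n)'(x) =
  \begin{cases}
    0 & : x \le 0, \\[2pt]
    n x & : 0 < x < \tfrac{1}{n}, \\[2pt]
    1 & : x \ge \tfrac{1}{n}.
  \end{cases}
\]
Indeed, for every x ∈ R we have R_n(x) → max{x, 0} and (R_n)'(x) → 1_{(0,∞)}(x) as n → ∞, and 1_{(0,∞)} is precisely the left derivative of the ReLU function.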
We now present the precise statement of Theorem 1.1 and, thereafter, provide further comments regarding Theorem 1.2.
(1.9) Theorem 1.2 above is an immediate consequence of Theorem 5.4 in Subsection 5.3 below. Theorem 1.2 is related to Theorem 1.1 in our previous article [24]. In particular, [24, Theorem 1.1] uses weaker assumptions than Theorem 1.2 above but Theorem 1.2 above establishes a stronger statement when compared to [24, Theorem 1.1]. Specifically, on the one hand in [24, Theorem 1.1] the target function is only assumed to be a continuous function and the unnormalized density is only assumed to be measurable and integrable while in Theorem 1.2 it is additionally assumed that both the target function and the unnormalized density are piecewise polynomial in the sense of (1.6) above. On the other hand [24, Theorem 1.1] only asserts that the risk of every bounded GF trajectory converges to the risk of a critical point while Theorem 1.2 assures that every non-divergent GF trajectory converges with a polynomial rate of convergence to a critical point and also assures that the risk of the non-divergent GF trajectory converges with rate 1 to the risk of the critical point.
The remainder of this article is organized in the following way. In Section 2 we establish several regularity properties for the risk function of the considered supervised learning problem and its generalized gradient function. In Section 3 we employ the findings from Section 2 to establish existence and uniqueness properties for solutions of GF differential equations. In particular, in Section 3 we present the proof of Theorem 1.1 above. In Section 4 we establish under the assumption that both the target function f : [𝒶, 𝒷]^d → R and the unnormalized density function p : [𝒶, 𝒷]^d → [0, ∞) are piecewise polynomial that the risk function is semialgebraic in the sense of Definition 4.3 in Section 4 (see Corollary 4.10 in Section 4 for details). In Section 5 we combine the results from Sections 2 and 4 to establish several convergence rate results for solutions of GF differential equations and, thereby, we also prove Theorem 1.2 above.

Properties of the risk function and its generalized gradient function
In this section we establish several regularity properties for the risk function L : R^𝔡 → R and its generalized gradient function G : R^𝔡 → R^𝔡. In particular, in Proposition 2.12 in Subsection 2.5 below we prove for every parameter vector θ ∈ R^𝔡 in the ANN parameter space R^𝔡 = R^{dH+2H+1} that the generalized gradient G(θ) belongs to the limiting subdifferential ∂L(θ) of the risk function L : R^𝔡 → R at θ. In Definition 2.8 in Subsection 2.5 we recall the notion of subdifferentials (which are sometimes also referred to as Fréchet subdifferentials in the scientific literature) and in Definition 2.9 in Subsection 2.5 we recall the notion of limiting subdifferentials. For completeness we also include in this section a detailed proof of Lemma 2.5. In Setting 2.1 in Subsection 2.1 below we present the mathematical setup to describe ANNs with ReLU activation, the risk function L : R^𝔡 → R, and its generalized gradient function G : R^𝔡 → R^𝔡. Moreover, in (2.6) in Setting 2.1 we define for a given parameter vector θ ∈ R^𝔡 the set of hidden neurons which have all input parameters equal to zero. Such neurons are sometimes called degenerate (cf. [11]) and can cause problems with the differentiability of the risk function, which is why we exclude degenerate neurons in Proposition 2.3 and Corollary 2.7 below.
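Although (2.6) is not reproduced here, the description above suggests the following illustrative reconstruction (ours, not a verbatim quotation of Setting 2.1): writing w^θ_{i,j} = θ_{(i−1)d+j} for the inner weights and b^θ_i = θ_{Hd+i} for the inner biases of a parameter vector θ ∈ R^𝔡, the set of degenerate neurons is
\[
  % illustrative reconstruction of the set of degenerate neurons (cf. (2.6))
  D_{\theta} = \big\{ i \in \{1, 2, \dots, H\} \colon b^{\theta}_i = 0 \text{ and } w^{\theta}_{i,1} = w^{\theta}_{i,2} = \dots = w^{\theta}_{i,d} = 0 \big\} .
\]
For every i ∈ D_θ the inner affine transformation of the i-th hidden neuron vanishes identically, so the ReLU nonlinearity is evaluated exactly at its kink for every input, which is the source of the differentiability issues indicated above.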

Local Lipschitz continuity of active neuron regions
(2.8) Proof of Lemma 2.4. Observe that for all v, w ∈ R d+1 we have that Moreover, note that the fact that for all y ∈ R it holds that y ≥ −|y| ensures that for all Combining this and (2.10) demonstrates for all In the following we distinguish between the case max i∈{1,2,...,d} |u i | = 0, the case (max i∈{1,2,...,d} |u i |, We first prove (2.8) in the case max i∈{1,2,...,d} |u i | = 0. (2.14) Note that (2.14) and the assumption that u ∈ R d+1 \{0} imply that |u d+1 | > 0. Moreover, observe that (2.14) shows that for all This establishes (2.8) in the case max i∈{1,2,...,d} |u i | = 0. In the next step we prove (2.8) in the case (2.17) For this we assume without loss of generality that |u 1 | > 0. In the following let J v,w Next observe that Fubini's theorem and the fact that for all v ∈ R d+1 it holds that I v is measurable show that for all v, w ∈ R d+1 we have that (2.20) Furthermore, observe that (2.10) demonstrates for all This, (2.13), and (2.9) establish (2.8) in the case (max i∈{1,2,...,d} |u i |, d) ∈ (0, ∞) × {1}. The proof of Lemma 2.4 is thus complete.

Local Lipschitz continuity properties for the generalized gradient function
be locally bounded and measurable, assume for all r ∈ (0, ∞) that Proof of Lemma 2.5. Observe that (2.25) and the assumption that φ is locally bounded ensure that there exists ∈ R which satisfies for all y, z ∈ {v ∈ R n : Next note that (2.26) shows for all y, z ∈ R n that Moreover, observe that (2.27) assures for all y, z ∈ {v ∈ R n : x − v ≤ ε} that In the next step we combine (2.27) with the assumption that for all y, z ∈ {v ∈ R n : This, (2.28), and (2.29) demonstrate for all y, z ∈ {v ∈ R n : The proof of Lemma 2.5 is thus complete.
Corollary 2.6. Assume Setting 2.1, let φ : R^𝔡 × [𝒶, 𝒷]^d → R be locally bounded and measurable, and assume for all r ∈ (0, ∞) that is locally Lipschitz continuous and is locally Lipschitz continuous.
Proof of Corollary 2.6. First note that Lemma 2.5 (applied for every is locally Lipschitz continuous, and (iii) it holds for all i ∈ {1, 2, . . . , H} that is locally Lipschitz continuous.
Proof of Corollary 2.7. Note that (2.7) and Corollary 2.6 establish items (i)-(iii). The proof of Corollary 2.7 is thus complete.

Subdifferentials
Definition 2.8 (Subdifferential). Let n ∈ N, f ∈ C(R^n, R), x ∈ R^n. Then we denote by ∂̂f(x) ⊆ R^n the set given by ∂̂f(x) = {y ∈ R^n : lim inf_{R^n\{0} ∋ h → 0} ([f(x+h) − f(x) − ⟨y, h⟩]/‖h‖) ≥ 0}. Definition 2.9 (Limiting subdifferential). Let n ∈ N, f ∈ C(R^n, R), x ∈ R^n. Then we denote by ∂f(x) ⊆ R^n the set of all y ∈ R^n for which there exist sequences (x_k)_{k∈N}, (y_k)_{k∈N} ⊆ R^n with lim_{k→∞} x_k = x, lim_{k→∞} y_k = y, and y_k ∈ ∂̂f(x_k) for all k ∈ N. Then In addition, observe that for all n ∈ N, i ∈ D_θ it holds that b^{ϑ_n}_i = −1/n < 0. This shows for all In addition, observe that for all n ∈ N, i ∈ D_θ we have that I^{ϑ_n}_i = I^θ_i = ∅. Hence, we obtain for all i ∈ D_θ, j ∈ {1, 2, . . . , d} that lim_{n→∞} G_{(i−1)d+j}(ϑ_n) = 0 = G_{(i−1)d+j}(θ) and lim_{n→∞} G_{Hd+i}(ϑ_n) = 0 = G_{Hd+i}(θ).
Combining this, (2.45), and (2.46) demonstrates that lim_{n→∞} G(ϑ_n) = G(θ). This and Lemma 2.10 assure that G(θ) ∈ ∂L(θ). The proof of Proposition 2.12 is thus complete.
In other words, in Theorem 3.3 we prove the unique existence of GF solutions with the property that once a neuron has become degenerate it will remain degenerate for subsequent times.
Our strategy for the proofs of Theorem 3.3 and Proposition 3.1 can, loosely speaking, be described as follows. Corollary 2.7 above implies that the components of the generalized gradient function G : R^𝔡 → R^𝔡 corresponding to non-degenerate neurons are locally Lipschitz continuous so that the classical Picard–Lindelöf local existence and uniqueness theorem for ordinary differential equations can be brought into play for those components. On the other hand, if at some time t ∈ [0, ∞) the i-th neuron is degenerate, then Proposition 2.2 above shows that the corresponding components of the generalized gradient function G : R^𝔡 → R^𝔡 vanish. The GF differential equation is thus satisfied if the neuron remains degenerate at all subsequent times s ∈ [t, ∞). Using these arguments we prove in Proposition 3.1 the existence of GF solutions by induction on the number of non-degenerate neurons of the initial value.

Existence properties for solutions of GF differential equations
(3.7) Note that (3.6) assures that U ⊆ R^𝔡 is open. In addition, observe that Corollary 2.7 implies that G is locally Lipschitz continuous. Combining this with the Picard–Lindelöf theorem demonstrates that there exist a unique maximal τ ∈ (0, ∞] and Ψ ∈ C([0, τ), U) which satisfy for all t ∈ [0, τ) that Next note that the fact that for all This, (3.7), and (2.7) demonstrate for all t ∈ [0, τ) that G(Ψ_t) = G(Ψ_t). In addition, observe that (3.6) and (3.9) imply for all t ∈ [0, τ) that D_{Ψ_t} = D_θ. Hence, if τ = ∞ then Ψ satisfies (3.1). Next assume that τ < ∞. Note that the Cauchy–Schwarz inequality and [24, Lemma 3.1] prove for all s, t ∈ [0, τ) with s ≤ t that (3.10) Hence, we obtain for all (t_n)_{n∈N} ⊆ [0, τ) with lim inf_{n→∞} t_n = τ that (Ψ_{t_n})_{n∈N} is a Cauchy sequence. This implies that ϑ := lim_{t↑τ} Ψ_t ∈ R^𝔡 exists. Furthermore, observe that the fact that τ is maximal proves that ϑ ∉ U. Therefore, we have that D_ϑ\D_θ ≠ ∅. Moreover, note that (3.9) shows that for all i ∈ D_θ, j ∈ {1, 2, . . . , d} it holds that w^ϑ_{i,j} = b^ϑ_i = 0 and, therefore, i ∈ D_ϑ. This demonstrates that #(D_ϑ) > #(D_θ). Combining this with the induction hypothesis ensures that there exists In the following let Θ : (3.12) Observe that the fact that ϑ = lim_{t↑τ} Ψ_t and the fact that Φ_0 = ϑ imply that Θ is continuous. Furthermore, note that the fact that G is locally bounded and (3.8) ensure that (3.14) This shows that Θ satisfies (3.1). The proof of Proposition 3.1 is thus complete.

Uniqueness properties for solutions of GF differential equations
Then it holds for all t ∈ [0, ∞) that Θ^1_t = Θ^2_t. Proof of Lemma 3.2. Assume for the sake of contradiction that there exists t ∈ [0, ∞) such that Θ^1_t ≠ Θ^2_t. By translating the variable t if necessary, we may assume without loss of generality that inf{t ∈ [0, ∞) : Θ^1_t ≠ Θ^2_t} = 0. Next observe that the fact that Θ^1 and Θ^2 are continuous implies that there exists δ ∈ (0, ∞) which satisfies for all t In addition, observe that the fact that Θ^1 and Θ^2 are continuous implies that there exists a compact K ⊆ {ϑ ∈ R^𝔡 : D_ϑ ⊆ D_θ} which satisfies for all t ∈ [0, δ], k ∈ {1, 2} that Θ^k_t ∈ K. Moreover, note that Corollary 2.7 proves that for all i ∈ {1, 2, . . . , H}\D_θ, j ∈ {1, 2, . . . , d} it holds that G_{(i−1)d+j}, G_{Hd+i}, G_{H(d+1)+i}, G_𝔡 : K → R are Lipschitz continuous. This and (3.16) show that there exists L ∈ (0, ∞) such that for all t ∈ [0, δ] we have that In the following let M : [0, ∞) → [0, ∞) satisfy for all t ∈ [0, ∞) that M_t = sup_{s∈(0,t]} ‖Θ^1_s − Θ^2_s‖. Observe that the fact that inf{t ∈ [0, ∞) : Θ^1_t ≠ Θ^2_t} = 0 proves for all t ∈ (0, ∞) that M_t > 0. Moreover, note that (3.17) ensures for all t ∈ (0, δ) that (3.18) Combining this with the fact that M is non-decreasing shows for all t ∈ (0, δ), s ∈ (0, t] that This demonstrates for all t ∈ (0, min{L^{−1}, δ}) that which is a contradiction. The proof of Lemma 3.2 is thus complete.
Note that the risk function L : R^𝔡 → R is given through a parametric integral in the sense that for all θ ∈ R^𝔡 we have that L(θ) = ∫_{[𝒶,𝒷]^d} (f(y) − N^θ(y))^2 p(y) λ(dy), where N^θ : [𝒶, 𝒷]^d → R denotes the realization function of the ANN with parameter vector θ. In general, parametric integrals of semialgebraic functions are no longer semialgebraic functions and the characterization of functions that can occur as such integrals is quite involved (cf. Kaiser [27]). This is the reason why we introduce in Definition 4.6 in Subsection 4.2 below a suitable subclass of the class of semialgebraic functions which is rich enough to contain the realization functions of ANNs with ReLU activation (cf. (4.28) in Subsection 4.2 below) and which can be shown to be closed under integration (cf. Proposition 4.8 in Subsection 4.2 below for the precise statement).
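A standard example illustrating this obstruction (a textbook example, not taken from [27]) is the semialgebraic integrand (0, ∞) × [1, 2] ∋ (θ, y) ↦ (θ + y)^{−1} ∈ R: its parametric integral satisfies
\[
  % textbook example of a parametric integral of a semialgebraic integrand which is not semialgebraic
  \int_{1}^{2} \frac{1}{\theta + y}\, \mathrm{d}y = \ln(\theta + 2) - \ln(\theta + 1) \qquad \text{for all } \theta \in (0, \infty),
\]
and the function (0, ∞) ∋ θ ↦ ln(θ + 2) − ln(θ + 1) ∈ R is not semialgebraic since a semialgebraic function of one real variable is algebraic on each piece of a suitable finite partition, whereas the logarithm is transcendental. The subclass in Definition 4.6 avoids this obstruction: it still contains the realization functions of ANNs with ReLU activation (cf. (4.28)) while being closed under integrating out the y-variable (cf. Proposition 4.8).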

Semialgebraic sets and functions
Definition 4.1 (Set of polynomials). Let n ∈ N_0. Then we denote by 𝒫_n ⊆ C(R^n, R) the set of all polynomials from R^n to R. Definition 4.2 (Semialgebraic sets). Let n ∈ N and let A ⊆ R^n be a set. Then we say that A is a semialgebraic set if and only if there exist k ∈ N, (P_{i,j,ℓ})_{(i,j,ℓ)∈{1,2,...,k}^2×{0,1}} ⊆ 𝒫_n such that A = ⋃_{i=1}^{k} ⋂_{j=1}^{k} {x ∈ R^n : P_{i,j,0}(x) = 0 < P_{i,j,1}(x)}. Definition 4.6. Let m ∈ N, n ∈ N_0. Then we denote by 𝒜_{m,n} the R-vector space given by Proof of Lemma 4.7. Throughout this proof let r ∈ N, A_1, A_2, . . . , A_r ∈ {{0}, [0, ∞), (0, ∞)}, R ∈ ℛ_m, P = (P_i)_{i∈{1,2,...,r}} ⊆ 𝒫_m, and let g : R^m → R satisfy for all θ ∈ R^m that

On the semialgebraic property of certain parametric integrals
Observe that the graph of R^m ∋ θ ↦ R(θ) ∈ R is given by Since both of these sets are described by polynomial equations and inequalities, it follows that R^m ∋ θ ↦ R(θ) ∈ R is semialgebraic. In addition, note that for all i ∈ {1, 2, . . . , r} the graph of R^m ∋ θ ↦ 1_{[0,∞)}(P_i(θ)) ∈ R is given by This demonstrates for all i ∈ {1, 2, . . . , r} that R^m ∋ θ ↦ 1_{[0,∞)}(P_i(θ)) ∈ R is semialgebraic.
Combining this and (4.4) with Lemma 4.4 demonstrates that g is semialgebraic. The proof of Lemma 4.7 is thus complete.

Similarly, the other indicator functions can be brought into the correct form, taking into account the different signs of P_{j,n}(θ) for j ∈ A and j ∈ B. Moreover, observe that the remaining terms can be written as linear combinations of rational functions in θ and polynomials in x. Hence, we obtain that the expressions (I), (II), (III), (IV) are elements of 𝒜_{m,n−1}. The proof of Proposition 4.8 is thus complete.

On the semialgebraic property of the risk function
Definition 4.9. Let d ∈ N, let A ⊆ R^d be a set, and let f : A → R be a function. Then we say that f is piecewise polynomial if and only if there exist n ∈ N, α_1, α_2, . . . , α_n ∈ R^{n×d}, β_1, β_2, . . . , β_n ∈ R^n, P_1, P_2, . . . , P_n ∈ 𝒫_d such that for all x ∈ A it holds that
Note that (4.25) and the assumption that f and p are piecewise polynomial assure that (cf. Definition 4.6). In addition, observe that the fact that for all θ ∈ R^𝔡, x ∈ R^d we have that Combining this with (4.26) and the fact that 𝒜_{d,d} is an algebra proves that This, Proposition 4.8, and induction demonstrate that Fubini's theorem hence implies that L ∈ 𝒜_{d,0}. Combining this and Lemma 4.7 shows that L is semialgebraic. The proof of Corollary 4.10 is thus complete.
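As a simple illustration of Definition 4.9 (an example of ours, formulated in the spirit of the definition rather than quoted from it), the ReLU function itself is piecewise polynomial on R since it is the product of an indicator of a half-line and a polynomial,
\[
  % illustrative example of a piecewise polynomial function in the sense of Definition 4.9
  \max\{x, 0\} = \mathbb{1}_{[0, \infty)}(x) \cdot x \qquad \text{for all } x \in \mathbb{R} ,
\]
and, more generally, every realization function of a fully-connected feedforward ANN with one hidden layer and ReLU activation is continuous and piecewise affine and therefore piecewise polynomial. In particular, target functions which are themselves realizations of such ANNs satisfy the piecewise polynomiality assumption on the target function in Theorem 1.2.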

Convergence rates for solutions of GF differential equations
In this section we employ the findings from Sections 2 and 4 to establish several convergence rate results for solutions of GF differential equations. In the proof of Proposition 5.1 the classical Łojasiewicz inequality for semialgebraic or subanalytic functions (cf., e.g., Bierstone & Milman [7]) is not directly applicable since the generalized gradient function G : R^𝔡 → R^𝔡 is not continuous. We will employ the more general results from Bolte et al. [8] which also apply to not necessarily continuously differentiable functions.
The arguments used in the proof of Proposition 5.2 are slight adaptations of well-known arguments in the literature going back to the works of Łojasiewicz and Kurdyka. In [1, Theorem 2.2] it is assumed that the objective function of the considered optimization problem is analytic and in Bolte et al. [8, Theorem 4.5] it is assumed that the objective function of the considered optimization problem is convex or lower C^2, whereas Proposition 5.2 does not require these assumptions. On the other hand, Bolte et al. [8, Theorem 4.5] consider more general differential dynamics and the considered gradients are allowed to be more general than the specific generalized gradient function G : R^𝔡 → R^𝔡 which is considered in Proposition 5.2. Combining this with the fact that for all θ ∈ R^𝔡 it holds that M(θ) ≤ ‖G(θ)‖ and the fact that sup_{θ∈B_ε(ϑ)} |L(θ) − L(ϑ)| < ∞ demonstrates that for all θ ∈ B_ε(ϑ), α ∈ (a, 1) we have that

Generalized Łojasiewicz inequality for the risk function
This completes the proof of Proposition 5.1.

(5.5)
Then there exists δ ∈ (0, ε) such that for all Proof of Proposition 5.2. Note that the fact that L is continuous implies that there exists δ ∈ (0, ε/3) which satisfies for all θ ∈ B_δ(ϑ) that In the following let In the first step we show that for all t ∈ [0, ∞) it holds that Θ_t ∈ B_ε(ϑ). We intend to show that T = ∞. Note that (5.8) assures for all t ∈ [0, T) that L(t) ≥ 0. Moreover, observe that (5.10) and (5.11) ensure that for almost all t ∈ [0, T) it holds that L is differentiable at t and satisfies Note that the fact that L is non-increasing implies that for all s ∈ [τ, T) it holds that L(s) = 0. Combining this with (5.10) demonstrates for almost all s ∈ (τ, T) that G(Θ_s) = 0. This proves for all s ∈ [τ, T) that Θ_s = Θ_τ. Next observe that (5.5) ensures that for all t ∈ [0, τ) it holds that (5.14) Combining this with the chain rule proves for almost all t ∈ [0, τ) that In addition, note that the fact that [0, ∞) ∋ t ↦ L(t) ∈ R is absolutely continuous and the fact that for all r ∈ (0, ∞) it holds that [r, ∞) ∋ y ↦ y^{1−α} ∈ R is Lipschitz continuous demonstrate for all t ∈ [0, τ) that [0, t] ∋ s ↦ [L(s)]^{1−α} ∈ R is absolutely continuous. Integrating (5.15) hence shows for all s, t ∈ [0, τ) with t ≤ s that This and the fact that for almost all s ∈ (τ, T) it holds that G(Θ_s) = 0 ensure that for all Combining this with (5.7) demonstrates for all t ∈ [0, T) that This, the fact that δ < ε/3, and the triangle inequality assure for all t ∈ [0, T) that Combining this with (5.12) proves that T = ∞. This establishes (5.9). Next observe that the fact that T = ∞ and (5.18) prove that Note that (5.20) proves that lim sup_{t→∞} σ(t) = 0. In addition, observe that (5.20) assures that there exists ψ ∈ R^𝔡 such that lim sup_{t→∞} ‖Θ_t − ψ‖ = 0. (5.22) In the next step we combine the weak chain rule for the risk function in (5.10) with (5.9) and (5.5) to obtain that for almost all t ∈ [0, ∞) we have that In addition, note that the fact that L is non-increasing and (5.7) ensure that for all t ∈ [0, ∞) it holds that L(t) ≤ L(0) ≤ 1. Therefore, we get for almost all t ∈ [0, ∞) that Combining this with the fact that for all t ∈ [0, τ) it holds that L(t) > 0 establishes for almost all t ∈ [0, τ) that d/dt The fact that for all t ∈ [0, τ) it holds that [0, t] ∋ s ↦ L(s) ∈ (0, ∞) is absolutely continuous hence demonstrates for all t ∈ [0, τ) that Therefore, we infer for all t ∈ [0, τ) that This and the fact that for all t ∈ [τ, ∞) it holds that L(t) = 0 prove that for all t ∈ [0, ∞) we have that Furthermore, observe that (5.22) and the fact that L is continuous imply that lim sup_{t→∞} |L(Θ_t) − L(ψ)| = 0. Hence, we obtain that L(ψ) = L(ϑ). This shows for all t ∈ [0, ∞) that In the next step we establish a convergence rate for the quantity ‖Θ_t − ψ‖, t ∈ [0, ∞). We accomplish this by employing an upper bound for the tail length of the curve Θ_t ∈ R^𝔡, t ∈ [0, ∞).
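Schematically (with generic constants C ∈ (0, ∞) and α ∈ (0, 1); the precise constants, exponents, and the weak chain rule employed in the actual proof are those fixed above), if the generalized Łojasiewicz inequality |L(Θ_t) − L(ψ)|^α ≤ C ‖G(Θ_t)‖ holds along the trajectory and if (d/dt) L(Θ_t) = −‖G(Θ_t)‖^2 for almost all t, then for almost all t with L(Θ_t) > L(ψ) we have
\[
  % schematic version of the standard Lojasiewicz tail-length estimate (generic constants)
  \tfrac{\mathrm{d}}{\mathrm{d}t} \big( L(\Theta_t) - L(\psi) \big)^{1-\alpha}
  = -(1-\alpha) \big( L(\Theta_t) - L(\psi) \big)^{-\alpha} \, \| G(\Theta_t) \|^2
  \le - \tfrac{1-\alpha}{C} \, \| G(\Theta_t) \| ,
\]
so that integrating from t to s and letting s → ∞ yields the tail-length bound
\[
  \| \Theta_t - \psi \|
  \le \int_t^{\infty} \| G(\Theta_u) \| \, \mathrm{d}u
  \le \tfrac{C}{1-\alpha} \big( L(\Theta_t) - L(\psi) \big)^{1-\alpha} .
\]
Combined with the decay of L(Θ_t) − L(ψ) established above, an estimate of this type produces the polynomial convergence rate for ‖Θ_t − ψ‖.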