Fast convergence rates of deep neural networks for classification

We derive fast convergence rates of a deep neural network (DNN) classifier with the rectified linear unit (ReLU) activation function learned using the hinge loss. We consider three cases for a true model: (1) a smooth decision boundary, (2) a smooth conditional class probability, and (3) the margin condition (i.e., the probability of inputs near the decision boundary is small). We show that the DNN classifier learned using the hinge loss achieves fast convergence rates for all three cases provided that the architecture (i.e., the number of layers, number of nodes, and sparsity) is carefully selected. An important implication is that DNN architectures are very flexible for use in various cases without much modification. In addition, we consider a DNN classifier learned by minimizing the cross-entropy, and show that the DNN classifier achieves a fast convergence rate under the condition that the conditional class probabilities of most data are sufficiently close to either 1 or 0. This assumption is not unusual for image recognition because human beings are extremely good at recognizing most images. To confirm our theoretical explanation, we present the results of a small numerical study conducted to compare the hinge loss and cross-entropy.


Introduction
Deep learning [Hinton and Salakhutdinov, 2006, Larochelle et al., 2007, Goodfellow et al., 2016] has received much attention for dimension reduction and classification of objects, such as images, speech, and language. Various supervised/unsupervised deep learning architectures, such as the deep belief network [Hinton et al., 2006], have been developed and applied to large-scale real data with great success. A key ingredient of the success of deep learning is its ability to discover multiple levels of representation of the given dataset, with higher levels of representation defined hierarchically in terms of lower-level representations. The central motivation is that higher-level representations can potentially capture relevant higher-level abstractions. See Goodfellow et al. [2016] for details.
Theoretical explanations for the success of deep learning have recently been studied. Many researchers have demonstrated that deep neural networks (DNNs) are much more efficient in representing certain complex functions than their shallow counterparts [Montufar et al., 2014, Raghu et al., 2016, Eldan and Shamir, 2016]. This has been reconfirmed by Yarotsky [2017] and Petersen and Voigtlaender [2018], who showed that DNNs can approximate a large class of functions, including even discontinuous functions, with a parsimonious number of parameters. In turn, using this efficient approximation property of a DNN, Schmidt-Hieber [2017] and Imaizumi and Fukumizu [2018] proved that, for regression problems, a complex function, including a discontinuous one, can be estimated by a DNN at the (in the minimax sense) optimal convergence rate. A surprising result is that any linear estimator, which includes the ridge penalized kernel estimator, is sub-optimal in estimating a discontinuous function, whereas the DNN is optimal.
In this paper, we consider classification problems. It is known that estimating the classifier directly instead of estimating the conditional class probability (i.e., η(x) = Pr(Y = 1|X = x)) can help achieve fast convergence rates [Mammen and Tsybakov, 1999, Tsybakov, 2004, Tsybakov and van de Geer, 2005, Audibert and Tsybakov, 2007] under Tsybakov's low-noise condition. We prove that the estimation of a classifier based on a DNN with the hinge loss can achieve fast convergence rates under various situations.
In practice, estimating the classifier directly is difficult because the classifier itself is discontinuous. Mammen and Tsybakov [1999], Tsybakov [2004], and Tsybakov and van de Geer [2005] considered estimating the classifier directly, which may be computationally infeasible in practice. Under a smoothness assumption on the conditional class probability, Audibert and Tsybakov [2007] estimated the conditional class probability using a local polynomial estimator and obtained a plug-in classifier. Finding the best plug-in classifier, however, requires searching over a given sieve, which is computationally demanding. In contrast, learning a DNN is relatively straightforward owing to the gradient descent algorithm, despite a risk of arriving at bad local minima.
We consider three cases regarding a true classifier: (1) a smooth decision boundary, (2) a smooth conditional class probability, and (3) the margin condition (i.e., the probability of inputs near the decision boundary is small). We prove that the DNN classifier can achieve fast convergence rates for all three of these cases if the architecture (i.e., the number of layers, number of nodes, and sparsity of the weights) of the DNN is carefully selected. In particular, the DNN classifier is minimax optimal for a smooth conditional class probability, and achieves faster convergence rates under the margin condition. To the best of the authors' knowledge, no other estimator achieves fast convergence rates for these three cases simultaneously.
The cross-entropy is the standard objective function used in learning a DNN, and is the empirical risk with respect to the logistic loss (i.e., the negative log-likelihood of the logistic model). It is well known that the logistic loss estimates the conditional class probability rather than the classifier, and hence is expected to be sub-optimal. However, learning a DNN with the cross-entropy performs quite well in practice. We justify the use of the cross-entropy in learning a DNN by showing that the corresponding classifier also achieves a fast convergence rate when most data have a conditional class probability close to 1 or 0. Note that this assumption is reasonable for image recognition because human beings recognize most real-world images quite well.
The remainder of this paper is organized as follows. Section 2 describes the hinge loss and the DNN classifier. Section 3 derives the convergence rates of the excess risk of a DNN classifier for the aforementioned three cases regarding a true model. The fast convergence rate of the DNN classifier with the cross-entropy is derived in Section 4, and concluding remarks follow in Section 5.

Notations
For a function f : X → R, where X denotes the domain of the function, we let ‖f‖_∞ = sup_{x ∈ X} |f(x)|. For two given sequences {a_n}_{n∈N} and {b_n}_{n∈N} of real numbers, we write a_n ≲ b_n if there exists a constant C > 0 such that a_n ≤ C b_n for all sufficiently large n, and we write a_n ≍ b_n if a_n ≲ b_n and b_n ≲ a_n. We denote by C^m(X), m ∈ N, the space of m times differentiable functions on X whose partial derivatives of order k, for every multi-index k with |k| ≤ m, are continuous. For a positive real value α, we write α = m_α + s, where m_α is a nonnegative integer and s ∈ (0, 1]. The Hölder space of order α is defined as H^α(X) = {f ∈ C^{m_α}(X) : ‖f‖_{H^α(X)} < ∞}, where ‖·‖_{H^α(X)} denotes the Hölder norm defined by

‖f‖_{H^α(X)} = Σ_{|k| ≤ m_α} ‖∂^k f‖_∞ + Σ_{|k| = m_α} sup_{x, x' ∈ X, x ≠ x'} |∂^k f(x) − ∂^k f(x')| / ‖x − x'‖^s.

We let H^{α,r}(X) = {f ∈ H^α(X) : ‖f‖_{H^α(X)} ≤ r}, which is a closed ball in the Hölder space of radius r with respect to the Hölder norm.

Estimation of the classifier with DNNs
We consider a binary classification problem. The data are given as (x_1, y_1), ..., (x_n, y_n), where x_i ∈ X ⊂ R^d are input vectors and y_i ∈ {−1, 1} are class labels. Here, for simplicity, we set X = [0, 1]^d; however, this can be extended to any compact subset of R^d. We assume that the (x_i, y_i) are independent copies of a random vector (X, Y) ∼ Pr for a certain probability measure Pr. We let P_X be the marginal distribution of X induced by the joint distribution Pr.

Necessity of the hinge loss
Before going further, we first review why we consider the hinge loss instead of the logistic loss to achieve fast convergence rates. Let C be the class of all classifiers (i.e., all measurable mappings from X to {−1, 1}). The objective of classification is to find the optimal classifier (called the Bayes classifier) C*, which is defined as

C* = argmin_{C ∈ C} E[1{Y ≠ C(X)}],

where 1{·} is 1 if the event in {·} is true and 0 otherwise. Because we do not know the probability measure Pr generating the data, we cannot find C*. Instead, we estimate C* based on the training data. The most popular method for estimating C* is the empirical risk minimization approach, where we estimate C* by minimizing the empirical risk. That is, we estimate C* using Ĉ, where

Ĉ = argmin_{C ∈ C_n} (1/n) Σ_{i=1}^n 1{y_i ≠ C(x_i)},   (2.1)

and C_n is a given class of classifiers depending on the sample size n.
In practice, Ĉ is not computationally feasible because minimizing the empirical risk with the 0-1 loss over C_n is NP-hard [Bartlett et al., 2006]. An alternative approach is to replace the 0-1 loss with computationally easier losses, so-called surrogate losses. In addition, instead of a class of classifiers C_n, we consider a class of real-valued functions F_n. For a given surrogate loss φ, we estimate f by minimizing the surrogate empirical risk (or empirical φ-risk)

E_{φ,n}(f) = (1/n) Σ_{i=1}^n φ(y_i f(x_i))   (2.2)

over F_n, and construct a classifier by Ĉ(x) = sign(f̂(x)).
A question in using a convex surrogate loss is the relation between the minimizer of the 0-1 empirical risk (2.1) and that of the empirical φ-risk (2.2). Because the empirical φ-risk converges to the population φ-risk E_φ(f) = E[φ(Y f(X))] for a given f by the law of large numbers, we can consider f̂ as an estimator of f*_φ, which is defined as

f*_φ = argmin_{f ∈ F_∞} E_φ(f),

where F_∞ is the limit of F_n in a certain sense. When F_∞ is the set of all measurable functions, we say that the surrogate loss φ is Fisher consistent if sign(f*_φ(x)) = C*(x). It is known [Lin, 2004, Bartlett et al., 2006] that Fisher consistency holds under very mild conditions on φ. In particular, f*_φ is known for various surrogate losses. For example, when φ is the logistic loss (i.e., φ(z) = log(1 + e^{−z})), we have f*_φ(x) = log{η(x)/(1 − η(x))}, where η(x) = Pr(Y = 1|X = x) [Friedman et al., 2000]. Hence, the logistic loss satisfies Fisher consistency, which justifies the use of the cross-entropy when learning a deep neural network. That is, deep learning with the cross-entropy essentially estimates the log odds of the conditional class probability.
As explained in the Introduction, it would be better to estimate the Bayes classifier directly, which is realized conceptually if f*_φ is the Bayes classifier. The hinge loss φ(z) = (1 − z)_+ = max{1 − z, 0} has this property [Lin, 2002], which is why we consider the hinge loss. Note that there are other losses for which f*_φ = C*. An example is the ψ-loss [Shen et al., 2003], which is also known as the ramp loss [Collobert et al., 2006]. Although the ψ-loss has many advantages over the hinge loss, the ψ-loss is nonconvex, and learning a DNN classifier using the ψ-loss would be extremely difficult because the DNN classifier is nonconvex as well.
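To make the distinction concrete, the following small Python sketch (not from the paper; the grid-search minimizer is purely illustrative) computes the pointwise population minimizer f*_φ(x) for a fixed conditional class probability η(x): the logistic loss recovers the log odds, whereas the hinge loss recovers the Bayes classifier sign(2η(x) − 1) directly.

```python
import numpy as np

def pointwise_minimizer(eta, loss, grid=np.linspace(-5, 5, 2001)):
    """Minimize E[phi(Y f) | X = x] = eta*phi(f) + (1 - eta)*phi(-f) over a grid of f values."""
    if loss == "logistic":
        phi = lambda z: np.log1p(np.exp(-z))
    elif loss == "hinge":
        phi = lambda z: np.maximum(1.0 - z, 0.0)
    else:
        raise ValueError("unknown loss")
    risks = eta * phi(grid) + (1.0 - eta) * phi(-grid)
    return grid[np.argmin(risks)]

for eta in [0.1, 0.4, 0.6, 0.9]:
    f_logit = pointwise_minimizer(eta, "logistic")   # approx log(eta / (1 - eta)), the log odds
    f_hinge = pointwise_minimizer(eta, "hinge")      # approx sign(2*eta - 1), i.e., the Bayes classifier
    bayes = 1 if eta >= 0.5 else -1
    print(f"eta={eta:.1f}  logistic minimizer={f_logit:+.2f}  "
          f"hinge minimizer={f_hinge:+.2f}  Bayes={bayes:+d}")
```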

Learning DNN with the hinge loss
We consider DNNs that take d-dimensional inputs and produce one-dimensional outputs. A DNN with L layers and N^{(l)}, l ∈ [L], nodes at each layer is defined recursively as

h^{(l)}_k(x) = σ( Σ_{j=1}^{N^{(l−1)}} W^{(l)}_{kj} h^{(l−1)}_j(x) + b^{(l)}_k ) for k ∈ [N^{(l)}] and l ∈ [L],
f(x) = Σ_{j=1}^{N^{(L)}} W^{(L+1)}_{1j} h^{(L)}_j(x) + b^{(L+1)}_1,

with N^{(0)} = d and h^{(0)}_k(x) = x_k. We consider the ReLU activation function σ(z) = (z)_+. We denote f(x) by f(x|Θ), where Θ = ((W^{(l)}, b^{(l)}))_{l=1,...,L+1} is the parameter set including all weights and biases.
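The layer-wise definition above amounts to a short forward pass. The following numpy sketch is only an illustration of the recursion; the weights and the tiny example are arbitrary and not from the paper.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def dnn_forward(x, weights, biases):
    """Evaluate f(x | Theta) for a ReLU network.

    weights: list of matrices W^(1), ..., W^(L+1); biases: list of vectors b^(1), ..., b^(L+1).
    Hidden layers use the ReLU activation; the output layer is linear and one-dimensional.
    """
    h = np.asarray(x, dtype=float)               # h^(0)(x) = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = relu(W @ h + b)                      # h^(l)(x) = sigma(W^(l) h^(l-1)(x) + b^(l))
    return (weights[-1] @ h + biases[-1]).item() # linear, one-dimensional output layer

# Tiny example: d = 3 inputs, two hidden layers with 4 nodes each.
rng = np.random.default_rng(0)
Ws = [rng.normal(size=(4, 3)), rng.normal(size=(4, 4)), rng.normal(size=(1, 4))]
bs = [rng.normal(size=4), rng.normal(size=4), rng.normal(size=1)]
print(dnn_forward([0.2, 0.5, 0.8], Ws, bs))
```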
For a given Θ, let |Θ| be the number of layers of the network f(·|Θ). Let N_max(Θ) be the maximum number of nodes per layer, that is, f(·|Θ) has at most N_max(Θ) nodes at each layer. We define ‖Θ‖_0 as the number of nonzero parameters in Θ, that is,

‖Θ‖_0 = Σ_{l=1}^{L+1} ( ‖vec(W^{(l)})‖_0 + ‖b^{(l)}‖_0 ),

where vec(W^{(l)}) transforms the matrix W^{(l)} into the corresponding vector by concatenating the column vectors. Similarly, we define ‖Θ‖_∞ as the largest absolute value of the parameters in Θ, that is,

‖Θ‖_∞ = max_{l ∈ [L+1]} max{ ‖vec(W^{(l)})‖_∞, ‖b^{(l)}‖_∞ }.

For a given n, let F_n be the class of DNNs

F_n = { f(·|Θ) : |Θ| ≤ L_n, N_max(Θ) ≤ N_n, ‖Θ‖_0 ≤ S_n, ‖Θ‖_∞ ≤ B_n, ‖f(·|Θ)‖_∞ ≤ F_n },

where the positive constants L_n, N_n, S_n, B_n, and F_n are specified later.
We let f̂^{DNN}_{φ,n} be the minimizer of the empirical φ-risk E_{φ,n}(f) over F_n for a given surrogate loss φ, i.e.,

f̂^{DNN}_{φ,n} = argmin_{f ∈ F_n} E_{φ,n}(f) = argmin_{f ∈ F_n} (1/n) Σ_{i=1}^n φ(y_i f(x_i)).   (2.3)

In the following section, we prove fast convergence rates of f̂^{DNN}_{φ,n} for various cases of the true model when φ is the hinge loss and L_n, N_n, S_n, B_n, and F_n are carefully selected. For the detailed formulas of L_n, N_n, S_n, B_n, and F_n in terms of the sample size n, see the proofs of the corresponding theorems in the Appendix.
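A rough PyTorch sketch of the estimator (2.3) is given below. It is a hypothetical illustration, not the authors' implementation: the depth, width, output clipping, and weight clamping are crude stand-ins for L_n, N_n, F_n, and B_n, and the sparsity constraint S_n is not enforced.

```python
import torch
import torch.nn as nn

class ReLUNet(nn.Module):
    def __init__(self, d, width=64, depth=3, f_max=1.0):
        super().__init__()
        layers, in_dim = [], d
        for _ in range(depth):                       # depth and width play the roles of L_n and N_n
            layers += [nn.Linear(in_dim, width), nn.ReLU()]
            in_dim = width
        layers.append(nn.Linear(in_dim, 1))
        self.net = nn.Sequential(*layers)
        self.f_max = f_max

    def forward(self, x):
        # Clipping the output to [-F_n, F_n] mimics the sup-norm constraint on f(.|Theta).
        return torch.clamp(self.net(x), -self.f_max, self.f_max).squeeze(-1)

def hinge_risk(model, x, y):
    # Empirical phi-risk (2.2) with the hinge loss phi(z) = max(1 - z, 0), labels y in {-1, +1}.
    return torch.clamp(1.0 - y * model(x), min=0.0).mean()

torch.manual_seed(0)
x = torch.rand(512, 5)                                # toy inputs on [0, 1]^5
y = 2.0 * (x[:, 0] + x[:, 1] > 1.0).float() - 1.0     # toy labels in {-1, +1}
model = ReLUNet(d=5)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(200):
    opt.zero_grad()
    loss = hinge_risk(model, x, y)
    loss.backward()
    opt.step()
    with torch.no_grad():                             # crude stand-in for the parameter bound B_n
        for p in model.parameters():
            p.clamp_(-5.0, 5.0)
print("final empirical hinge risk:", float(hinge_risk(model, x, y)))
```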

Fast convergence rates of DNN classifiers with the hinge loss
In this section, we consider the hinge loss and derive the convergence rates of the excess risk of f̂^{DNN}_{φ,n}. For a given function f, the excess risk E(f, C*) is defined as

E(f, C*) = Pr(Y ≠ sign(f(X))) − Pr(Y ≠ C*(X)),

and the excess φ-risk E_φ(f, f*_φ) is defined by

E_φ(f, f*_φ) = E[φ(Y f(X))] − E[φ(Y f*_φ(X))].

Throughout this paper, we always assume the Tsybakov noise condition [Mammen and Tsybakov, 1999, Tsybakov, 2004].
(N) There exist C > 0 and q ∈ [0, ∞] such that for any t > 0,

Pr(|2η(X) − 1| ≤ t) ≤ C t^q.

We call the parameter q appearing in assumption (N) the noise exponent.
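As a toy illustration of condition (N) (with a hypothetical η, not one considered in the paper), take η(x) = x and X uniform on [0, 1]; then Pr(|2η(X) − 1| ≤ t) = t, so (N) holds with noise exponent q = 1. The following Monte Carlo check in Python confirms this.

```python
import numpy as np

# Hypothetical conditional class probability eta(x) = x with X ~ Uniform[0, 1]:
# Pr(|2*eta(X) - 1| <= t) = t, so condition (N) holds with noise exponent q = 1.
rng = np.random.default_rng(1)
x = rng.uniform(size=1_000_000)
for t in [0.05, 0.1, 0.2, 0.4]:
    prob = np.mean(np.abs(2 * x - 1) <= t)
    print(f"t={t:.2f}  Pr(|2*eta(X)-1| <= t) ~= {prob:.3f}  (theory: {t:.3f})")
```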
We consider three cases regarding a true model: (1) a smooth decision boundary, (2) a smooth conditional class probability, and (3) the margin condition. We derive fast convergence rates of the DNN classifier with the hinge loss for all three cases.

Case 1: Smooth boundary
To describe a Bayes classifier with a smooth decision boundary, we introduce the notion of piecewise constant functions with smooth boundaries. We adopt the notations and definitions from Petersen and Voigtlaender [2018] and Imaizumi and Fukumizu [2018]. For g ∈ H^{α,r}([0, 1]^{d−1}) and j ∈ [d], we define a horizon function Ψ_{g,j} : [0, 1]^d → {0, 1} by

Ψ_{g,j}(x) = 1{x_j ≥ g(x_{−j})},

where x_{−j} = (x_1, ..., x_{j−1}, x_{j+1}, ..., x_d). For each horizon function, we define the corresponding basis piece I_{g,j} as

I_{g,j} = {x ∈ [0, 1]^d : Ψ_{g,j}(x) = 1}.

We define a piece as the intersection of K basis pieces. The set of pieces is denoted by

A_{α,r,K} = { A ⊂ [0, 1]^d : A = ∩_{k=1}^K I_{g_k, j_k} for some g_k ∈ H^{α,r}([0, 1]^{d−1}) and j_k ∈ [d] }.

Let C_{α,r,K,T} be the set of classifiers of the form

C(x) = 2 Σ_{t=1}^T 1{x ∈ A_t} − 1

for T ∈ N and disjoint subsets A_1, ..., A_T of X in A_{α,r,K}. In this subsection, we assume that the Bayes classifier belongs to C_{α,r,K,T}.
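The following numpy sketch (with an arbitrary, hypothetical boundary function g) illustrates a horizon-type basis piece and the resulting piecewise constant classifier on [0, 1]^2.

```python
import numpy as np

def horizon_indicator(x, g, j):
    """Basis-piece indicator 1{x_j >= g(x_{-j})} for a smooth boundary function g."""
    x = np.asarray(x, dtype=float)
    x_minus_j = np.delete(x, j)
    return 1.0 if x[j] >= g(x_minus_j) else 0.0

# Hypothetical smooth boundary on [0, 1]^2: the piece is {x : x_1 >= g(x_0)}.
g = lambda u: 0.5 + 0.2 * np.sin(2 * np.pi * u[0])

def classifier(x):
    # Piecewise constant classifier taking values in {-1, +1}, with a single piece A_1.
    return 2.0 * horizon_indicator(x, g, j=1) - 1.0

print(classifier([0.25, 0.9]), classifier([0.25, 0.3]))   # prints 1.0 -1.0
```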
The following theorem proves the convergence rate of the DNN classifier with the hinge loss.
Theorem 1. Assume (N) with the noise exponent q ∈ [0, ∞]. If the surrogate loss φ is the hinge loss, the classifier f̂^{DNN}_{φ,n} defined by (2.3) with carefully selected L_n, N_n, S_n, B_n, and F_n satisfies the excess risk bound (3.1), where the expectation is taken over the training data.
Tsybakov [2004] derived a minimax lower bound for this class, where the infimum is taken over all estimators f̂_n : (X × Y)^n → F and F is the set of all measurable functions. Unfortunately, the convergence rate (3.1) is not optimal in the minimax sense. However, the difference becomes small when the noise exponent q is large. Note that the estimators in Mammen and Tsybakov [1999] and Tsybakov [2004] have slower convergence rates than that in (3.1) when α < d − 1. However, the estimator of Tsybakov and van de Geer [2005] achieves the minimax lower bound for any α > 0. At this point, we do not know whether the sub-optimal convergence rate (3.1) is inevitable owing to the use of the hinge loss rather than the 0-1 loss. We will pursue this issue in the near future.

Case 2: Smooth conditional class probability
We assume that η(x) is smooth, that is, η(·) ∈ H^{β,r}([0, 1]^d) for some β > 0 and r > 0. The following theorem provides the convergence rate of the DNN classifier.
Theorem 2. Assume (N) with the noise exponent q ∈ [0, ∞]. If the surrogate loss φ is the hinge loss, the classifier f̂^{DNN}_{φ,n} defined by (2.3) with carefully selected L_n, N_n, S_n, B_n, and F_n satisfies the excess risk bound (3.2).

Audibert and Tsybakov [2007] showed that when η(·) ∈ H^β([0, 1]^d), the minimax lower bound of the excess risk is of order n^{−β(q+1)/(β(q+2)+d)}. Hence, the convergence rate (3.2) is minimax optimal up to a logarithmic factor.

Case 3: Margin condition
The convergence rate can be improved if we assume that the density of the input vector is small around the decision boundary. Let Δ(x) denote the distance from x to the decision boundary of the Bayes classifier, that is, Δ(x) = inf{‖x − x'‖_2 : x' ∈ X, C*(x') ≠ C*(x)}, where ‖·‖_2 denotes the Euclidean norm. We introduce the following condition on the probability measure P_X.
(M) There exist C > 0, ε_0 > 0, and γ ∈ [1, ∞] such that for any ε ∈ (0, ε_0],

P_X(x : Δ(x) ≤ ε) ≤ C ε^γ.

The condition (M) was considered by Steinwart and Christmann [2008], where the parameter γ in (M) is called the margin exponent. Steinwart and Christmann [2008] proved that the support vector machine with the Gaussian kernel achieves a fast convergence rate under the condition (M). The following theorem shows that a similar convergence rate can be achieved by the DNN classifier.

Theorem 3. Suppose that the Bayes classifier C* belongs to C_{α,r,K,T}. Assume (N) with the noise exponent q ∈ [0, ∞] and (M) with the margin exponent γ ∈ [1, ∞]. If the surrogate loss φ is the hinge loss, the classifier f̂^{DNN}_{φ,n} defined by (2.3) with carefully selected L_n, N_n, S_n, B_n, and F_n satisfies the excess risk bound (3.3).

An interesting feature of the convergence rate (3.3) is that the dependency on the input dimension d diminishes as γ increases. In the extreme case where γ → ∞, the convergence rate becomes n^{−(q+1)/(q+2)} up to a logarithmic factor, which depends on neither the smoothness of the boundary nor the dimension of the input. This partly explains why the DNN classifier works well with high-dimensional inputs such as images.
To investigate the validity of the margin condition (M), we explore the area near the decision boundary obtained from the cat and dog images in the CIFAR10 dataset. We first fit the decision boundary using a convolutional neural network (CNN) trained on the cat and dog images in the CIFAR10 dataset. We then randomly select two images, one from the dog class and the other from the cat class, and take convex combinations of them to obtain a sequence of images between the two selected images. Figure 1 shows five sequences of images from five randomly selected pairs of dog and cat images. The images in the red box, which are the interpolated images with weights of the dog image ranging from 0.3 to 0.7, are visually unrealistic, which suggests that image classification has a large margin exponent.
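A rough sketch of this interpolation experiment is given below; `trained_cnn`, the image tensors, and the weight grid are placeholders rather than the actual setup used for Figure 1.

```python
import torch

def interpolate_pair(img_dog, img_cat, weights=(0.0, 0.3, 0.5, 0.7, 1.0)):
    """Convex combinations w * dog + (1 - w) * cat for a sequence of weights w."""
    return torch.stack([w * img_dog + (1.0 - w) * img_cat for w in weights])

# Hypothetical usage: `trained_cnn` would be a CNN already fitted to the cat/dog subset of
# CIFAR10, and `img_dog`, `img_cat` normalized 3x32x32 tensors from the two classes.
# probs = torch.softmax(trained_cnn(interpolate_pair(img_dog, img_cat)), dim=1)
# Interpolated images with dog weights between 0.3 and 0.7 tend to look unrealistic,
# suggesting that little probability mass of P_X lies near the fitted decision boundary.
img_dog, img_cat = torch.rand(3, 32, 32), torch.rand(3, 32, 32)
print(interpolate_pair(img_dog, img_cat).shape)   # torch.Size([5, 3, 32, 32])
```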

Remarks regarding adaptive estimation
In practice, we know none of q, α, β, and γ, which affect the choice of the DNN architecture parameters L_n, N_n, S_n, B_n, and F_n. We may select them data-adaptively. General tools for constructing an adaptive classifier have been developed by Tsybakov [2004] and Audibert and Tsybakov [2007]. These tools can be applied to a DNN classifier with minor modification.
For example, the model selection approach with a data split proposed by Audibert and Tsybakov [2007] can be applied without much difficulty, as sketched below. We first split the data into two parts of sizes n_1 and n_2, learn candidate DNN classifiers on the first part over a grid of architecture parameters, and select the final classifier using the second part. By standard model selection arguments (see, e.g., Audibert and Tsybakov [2007]), the selected model achieves the best possible convergence rate r*_n as long as n_1/n → 1 and r*_n/n_2 → 0. We plan to report the detailed results of this soon.
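A schematic version of this data-split selection is sketched below; `fit_dnn_classifier` and the grid of candidate architectures are placeholders, not part of the original procedure.

```python
import numpy as np

def select_architecture(x, y, architectures, fit_dnn_classifier, split=0.9, seed=0):
    """Data-split model selection: fit each candidate on the first part of the data and
    pick the candidate with the smallest 0-1 error on the held-out second part."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    n1 = int(split * len(y))                        # split fraction close to 1 mimics n_1/n -> 1
    train, valid = idx[:n1], idx[n1:]
    best_arch, best_err = None, np.inf
    for arch in architectures:                      # e.g., dicts specifying depth/width/sparsity
        clf = fit_dnn_classifier(x[train], y[train], **arch)
        err = np.mean(clf(x[valid]) != y[valid])    # held-out 0-1 error
        if err < best_err:
            best_arch, best_err = arch, err
    return best_arch, best_err
```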

Use of cross-entropy
The logistic loss does not estimate the classifier directly, and hence the convergence rate is sub-optimal in general. However, in practice, a DNN with the logistic loss (i.e., learned by minimizing the cross-entropy) works quite well. In this section, we investigate when the logistic loss works well with a DNN. We prove that the convergence rate of the excess risk of the DNN estimator with the logistic loss can be fast when the true conditional class probabilities of most of the data are close to 1 or 0. This condition is expected to hold in most image recognition problems because human beings, who are thought to act as the Bayes classifier, are very good at recognizing most images. We formalize this as the condition (E) and state the corresponding result in Theorem 4; the convergence rate in Theorem 4 is equivalent to that in Theorem 3 with q = ∞ up to a logarithmic factor.
To investigate the validity of the condition (E), Figure 2 shows a histogram of the estimated conditional class probabilities for the test data of the CIFAR10 dataset, where most of the estimated probabilities are concentrated near 0 or 1.

We compare the performance of the two DNN classifiers learned using the two surrogate losses: the logistic loss and the hinge loss. We analyze three benchmark datasets for image recognition, namely MNIST, SVHN, and CIFAR10, where for each dataset we select the two classes that are most difficult to distinguish. The data descriptions and selected classes are summarized in Table 1. The detailed DNN architectures for the three datasets are given in Appendix A.10. Adam is used for optimization with a learning rate of 10^{-3}. Table 2 summarizes the test error rates for various sizes of training data. The results are the averages (and standard errors) over 100 randomly selected training datasets, which show that the two classifiers compete well with each other.
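For completeness, a generic PyTorch sketch of such a comparison is given below. It is not the architecture or data pipeline used in our experiments; only the optimizer (Adam with learning rate 10^{-3}) and the two surrogate losses match the description above.

```python
import torch
import torch.nn as nn

def make_net(d):
    # A generic fully connected network; the actual architectures are given in Appendix A.10.
    return nn.Sequential(nn.Linear(d, 128), nn.ReLU(), nn.Linear(128, 128), nn.ReLU(), nn.Linear(128, 1))

def surrogate_loss(scores, y, kind):
    if kind == "hinge":
        return torch.clamp(1.0 - y * scores, min=0.0).mean()
    # Cross-entropy for labels in {-1, +1} equals the logistic loss log(1 + exp(-y f(x))).
    return torch.nn.functional.softplus(-y * scores).mean()

def train_and_test(x_tr, y_tr, x_te, y_te, kind, epochs=50):
    net = make_net(x_tr.shape[1])
    opt = torch.optim.Adam(net.parameters(), lr=1e-3)
    for _ in range(epochs):
        opt.zero_grad()
        loss = surrogate_loss(net(x_tr).squeeze(-1), y_tr, kind)
        loss.backward()
        opt.step()
    with torch.no_grad():
        err = (torch.sign(net(x_te).squeeze(-1)) != y_te).float().mean()
    return float(err)

torch.manual_seed(0)
x = torch.rand(2000, 10)                                  # synthetic stand-in for image features
y = 2.0 * (x[:, :2].sum(dim=1) > 1.0).float() - 1.0       # labels in {-1, +1}
print("hinge test error:   ", train_and_test(x[:1500], y[:1500], x[1500:], y[1500:], "hinge"))
print("logistic test error:", train_and_test(x[:1500], y[:1500], x[1500:], y[1500:], "logistic"))
```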

Concluding Remarks
We showed that a DNN is very flexible in the sense that it achieves fast convergence rates for various cases regarding a true model. It is interesting to note that a DNN is good at estimating not only a smooth decision boundary but also a smooth conditional class probability. In addition, a DNN can fully utilize the margin condition.
We showed that using the cross-entropy is also promising when the true conditional class probability is close to either 0 or 1 for most data. However, we conjecture that learning a DNN by minimizing the cross-entropy would be sub-optimal when the conditional class probability is not extreme.
Our theoretical results could be used to develop model selection procedures, particularly for the optimal selection of L_n and N_n. Moreover, it would be interesting to develop an online learning algorithm that can select L_n and N_n data-adaptively.
We did not consider computational issues in this paper. Learning a DNN with a sparsity constraint has not been fully studied, although some methods have been proposed (e.g., Liu et al. [2015], Han et al. [2015], and Wen et al. [2016]). A learning algorithm that supports our theoretical results will be worth pursuing.

A.1 Complexity measures of a class of functions
We introduce the complexity measures of a given class of functions. Let ‖·‖_p for 1 ≤ p < ∞ be defined as ‖f‖_p = (∫_X |f(x)|^p dµ(x))^{1/p}, where µ denotes the Lebesgue measure, and let ‖f‖_∞ = sup_{x ∈ X} |f(x)|. Let F be a given class of real-valued functions defined on X, and let δ > 0. A collection {f_1, ..., f_N} is called a δ-covering set of F with respect to the L_p norm if, for all f ∈ F, there exists an f_i in the collection such that ‖f − f_i‖_p ≤ δ. The cardinality of the minimal δ-covering set is called the δ-covering number of F with respect to the L_p norm and is denoted by N(δ, F, ‖·‖_p). Similarly, a collection {(f^L_1, f^U_1), ..., (f^L_N, f^U_N)} is called a δ-bracketing set of F with respect to the L_p norm if ‖f^U_i − f^L_i‖_p ≤ δ for all i, and for any f ∈ F, there is a pair (f^L_i, f^U_i) in the collection such that f^L_i ≤ f ≤ f^U_i. The cardinality of the minimal δ-bracketing set is called the δ-bracketing number of F with respect to the L_p norm, and is denoted by N_B(δ, F, ‖·‖_p). The δ-bracketing entropy, denoted by H_B(δ, F, ‖·‖_p), is the logarithm of the δ-bracketing number, i.e., H_B(δ, F, ‖·‖_p) = log N_B(δ, F, ‖·‖_p).
For any δ > 0, it is known (see, for example, Lemma 2.1 of van de Geer) that H(δ, F, ‖·‖_p) ≤ H_B(2δ, F, ‖·‖_p) for any p ∈ [1, ∞), and that H_B(2δ, F, ‖·‖_p) ≤ H(δ, F, ‖·‖_∞), where H(δ, F, ‖·‖) = log N(δ, F, ‖·‖) denotes the δ-entropy.

A.2 Convergence rate of the excess φ-risk for general surrogate losses

In this subsection, we derive the convergence rate of the excess φ-risk under regularity conditions, which is used repeatedly in the following subsections.
The regularity conditions and the techniques of the proof are minor modifications of those in Park [2009]; however, we present the complete conditions and proof for the readers' convenience. We assume the following regularity conditions.
(A1) φ is Lipschitz, i.e., there exists a constant C_1 > 0 such that |φ(z) − φ(z')| ≤ C_1 |z − z'| for all z, z' ∈ R.

(A2) For a positive sequence a_n = O(n^{−a_0}) as n → ∞ for some a_0 > 0, there exists a sequence of function classes {F_n}_{n∈N} such that E_φ(f_n, f*_φ) ≤ a_n for some f_n ∈ F_n.
(A3) There exists a sequence {F_n}_{n∈N} with F_n ≳ 1 such that sup_{f ∈ F_n} ‖f‖_∞ ≤ F_n.
(A4) There exists a constant ν ∈ (0, 1] such that for any f ∈ F_n and any n ∈ N,

E[{φ(Y f(X)) − φ(Y f*_φ(X))}^2] ≤ C_2 {E_φ(f, f*_φ)}^ν

for a constant C_2 depending only on φ and η(·).
For a proof of the general convergence result, we apply the large deviation inequality of Shen and Wong [1994] presented in Lemma 1.
Lemma 1 (Theorem 3 of Shen and Wong [1994]). Let F be a class of functions bounded above by F. Assume that Ef(Z) = 0 for any f ∈ F and that v ≥ sup_{f ∈ F} Var(f(Z)) for some v > 0. Suppose that there exists ζ > 0 such that the entropy condition of Shen and Wong [1994] holds. Then the corresponding exponential tail bound holds, where Pr* denotes the outer probability measure.
The following theorem is the main result of this section; it gives the convergence rate of the excess φ-risk.
Theorem 5. Suppose that the conditions (A1)-(A5) are met, and let ε_n be the sequence defined in the proof below. Then the empirical φ-risk minimizer f̂_{φ,n} over F_n satisfies E[E_φ(f̂_{φ,n}, f*_φ)] ≲ ε_n^2, where the expectation is taken over the training data.
Proof. Let C_1, C_2, and C_3 be the constants appearing in assumptions (A1), (A4), and (A5), respectively. Let ε_n^2 = max{2a_n, 2^7 δ_n/C_1}. We define the empirical process and the associated quantities as in Park [2009], and we introduce the notation M_{n,i} = 2^{i−2} ε_n^2 for a concise expression. Through the triangle inequality and (A4), we obtain a variance bound for the supremum over the relevant class. To bound the right-hand side, we apply Lemma 1 to this class of functions, where the first inequality follows from (A1), the second follows from the entropy condition, and the fourth is due to (A5). Choosing the constants C_4, C_5, and C_6 appropriately leads to the desired result; here, the last inequality follows from the assumption that n(ε_n^2)^{2−ν} ≳ 1.

A.3 Generic convergence rate for the hinge loss
We derive the convergence rate of the excess risk of the hinge loss under the conditions (A2), (A3), and (A5).Note that (A1) holds with C 1 = 1 for the hinge loss.We adopt the following lemma for the variance bound (A4).
For the second inequality of (A.2), note that f*_φ = C* for the hinge loss, and (A4) is satisfied with ν = q/(q + 1) by Lemma 2. Hence, Theorem 5 can be applied to complete the proof.

A.4 Entropy of the class of DNNs
The following proposition states the upper bound of the δ-entropy of a neural network function space.

A.5 Proof of Theorem 1
The following proposition given by Petersen and Voigtlaender [2018] proves that DNNs are good at approximating piecewise constant functions with smooth boundaries.
Proposition 2 (Corollary 3.7 of Petersen and Voigtlaender [2018]). Let d ≥ 2, α, r > 0, K ∈ N, and T ∈ N. For any C ∈ C_{α,r,K,T} and any sufficiently small ξ > 0, there exists a neural network approximating C, where the positive constants L_0, N_0, S_0, B_0, and b_0 depend only on d, α, r, K, and T.

Proof of Theorem 1. We check the conditions (A2), (A3), and (A5) in Section A.2, and apply Theorem 6 to complete the proof. For (A2), let {ξ_n}_{n∈N} be a positive sequence such that ξ_n ↓ 0. By Proposition 2, there exists a neural network in F_n approximating the Bayes classifier, and hence (A2) and (A3) hold with a_n = ξ_n and F_n = 1. For (A5), (A.1) implies that (A5) is satisfied if we choose ε_n appropriately, which leads to the best possible convergence rate and completes the proof by Theorem 6.

A.6 Proof of Theorem 2
We first introduce the smooth function approximation result of DNNs.
Proposition 3. For any function f ∈ H^{α,r}([0, 1]^d) and any sufficiently small ξ > 0, there exists a neural network f̃_ξ such that ‖f̃_ξ − f‖_∞ ≤ ξ, whose numbers of layers, nodes at each layer, and nonzero parameters are of orders log(1/ξ), ξ^{−d/α}, and ξ^{−d/α} log(1/ξ), respectively, where the constants L_0, N_0, S_0, and F depend only on d, α, and r.
Proof of Theorem 2. For a given ξ_n, by Proposition 3, there exists a neural network η̃_n such that ‖η̃_n − η‖_∞ ≤ ξ_n, with at most C_1 log(1/ξ_n) layers, C_2 ξ_n^{−d/β} nodes in each layer, and C_3 ξ_n^{−d/β} log(1/ξ_n) nonzero parameters for some positive constants C_1, C_2, and C_3. We construct the neural network f_n by adding one layer (involving the ReLU activation function σ) to η̃_n(x). Note that η̃_n(x) − 1/2 > ξ_n when 2η(x) − 1 > 4ξ_n. Similarly, we can show that η̃_n(x) − 1/2 < −ξ_n when 2η(x) − 1 < −4ξ_n. Therefore, by (N) we obtain the desired bound, where the inequality in the last line holds since ‖f_n‖_∞ ≤ 1.
Note that f_n is also a DNN in which the last layer has a finite number of parameters, and the maximum absolute value of the parameters is bounded above by ξ_n^{−1}. Hence, we can construct the DNN class accordingly.

A.7 Proof of Theorem 3
The main technique of the proof is to approximate a piecewise constant function using a DNN with respect to the supremum norm on a specific subset of the domain, where this subset depends on the function to be approximated.
Let d ≥ 2, α, r > 0, and K ∈ N. Let T ∈ N, let A_1, ..., A_T ∈ A_{α,r,K} be disjoint sets, and let C(x) be the classifier of the form (A.3). For a given ξ > 0, define the set B_ξ as in (A.4). It turns out that any point in B_ξ is at a distance greater than ξ from the decision boundary of C(x) in the supremum norm. The following proposition proves that a DNN approximates C(x) well on B_ξ.
Proposition 4. Let d ≥ 2, α, r > 0, K ∈ N, and T ∈ N. For any C ∈ C_{α,r,K,T} and sufficiently small ξ > 0, there exists a neural network f̃_ξ that approximates C uniformly on B_ξ, where the positive constants L_0, N_0, S_0, B_0, and b_0 depend only on d, α, r, K, and T; here C(x) is the function defined in (A.3), and B_ξ is defined in (A.4).
Proof.The proof is deferred to Section A.9.
Proof of Theorem 3. Let x* be the d-dimensional vector whose j^{(t,k)}-th component is equal to g^{(t,k)}(x_{−j^{(t,k)}}) and whose other components are the same as the corresponding components of x. Combining Proposition 4 with the condition (M), the approximation error is bounded by Cξ_n^γ for some constant C > 0, and hence (A2) and (A3) hold with a_n = Cξ_n^γ and F_n = 1.
For (A5), if we take ε_n appropriately, we obtain the best possible convergence rate, which completes the proof by Theorem 6.

A.8 Proof of Theorem 4
For the logistic loss, the following two lemmas are needed. The first lemma states that the φ-risks of both the φ-risk minimizer and the Bayes classifier are bounded. The second lemma provides the variance bound of the logistic loss.
Lemma 3. Let φ be the logistic loss. Assume (E) with λ_n ≍ e^{−F_n}. There then exist constants C_1 > 0 and C_2 > 0 such that the two bounds hold.

Proof. We divide A_n into the two disjoint sets {x : f*_φ(x) > F_n} and {x : f*_φ(x) < −F_n}. On the first set, the relevant term is bounded by a constant multiple of F_n e^{−F_n}. Similarly, we can show that G_n ≲ F_n e^{−F_n} on {x : f*_φ(x) < −F_n}. We use a similar argument for f*_n and obtain the same upper bound on {x : f*_φ(x) < −F_n}.
Proof of Theorem 4. Let f*_n = F_n C*. As in the proof of Theorem 3, for a positive sequence {ξ_n}_{n∈N} approaching zero, we can find a neural network f_n that approximates f*_n with an error bound involving some constant C_1 > 0. By Lemma 3 and the Lipschitz property of the logistic loss, the corresponding bound on the excess φ-risk holds for some positive constants C_2 and C_3. We now take F_n = κ(log n − 3 log(log n)), so that F_n e^{−F_n} ≍ n^{−κ} log^{3κ+1} n, and thus the conditions (A2) and (A3) in Section A.2 hold with a_n ≍ n^{−κ} log^{3κ+1} n and F_n = κ(log n − 3 log(log n)).
A.9 Proof of Proposition 4

Before we provide the proof of Proposition 4, we introduce some useful definitions and techniques for the construction of DNNs, which are mostly from Petersen and Voigtlaender [2018].
Proof of Proposition 4. We give the proof only for the case of T = 1; the extension to the cases T ≥ 2 is straightforward. Thus, we omit the subscript t in all expressions.
We now show that

Figure 1: Interpolated images between cat and dog. The images in the red box seem to be unrealistic examples.

Figure 2: Histogram of the conditional class probabilities estimated using a DNN with the logistic loss for the CIFAR10 data. The blue bins are for the 'dog' samples, and the red bins indicate the 'cat' samples.
This completes the proof based on Theorem 6.

Table 2: Test errors of the DNN classifiers learned using the hinge and logistic losses with various training data sizes.