Statistical theory for image classification using deep convolutional neural networks with cross-entropy loss under the hierarchical max-pooling model

Convolutional neural networks (CNNs) trained with cross-entropy loss have proven to be extremely successful in classifying images. In recent years, much work has been done to improve the theoretical understanding of neural networks. Nevertheless, this understanding remains limited when the networks are trained with cross-entropy loss, mainly because of the unboundedness of the target function. In this paper, we aim to fill this gap by analyzing the rate of convergence of the excess risk of a CNN classifier trained with cross-entropy loss. Under suitable assumptions on the smoothness and structure of the a posteriori probability, it is shown that these classifiers achieve a rate of convergence which is independent of the dimension of the image. These rates are in line with practical observations about CNNs.


Introduction
Deep convolutional neural networks (CNNs) have led to state-of-the-art performance in solving various problems, especially visual recognition tasks, see, e.g., LeCun et al. (2015), Krizhevsky et al. (2012), Schmidhuber (2015), Rawat and Wang (2017). While deep learning applications are characterised above all by a high degree of flexibility, ranging from different initialisation strategies (Larochelle et al. (2009)) to the choice of the right activation function (Janocha and Czarnecki (2017)) and the application of a proper learning algorithm (Le et al. (2011)), one thing has so far been treated as fixed for classification: the log or cross-entropy loss. The smoothness of this loss function simplifies the optimisation procedure and leads to good practical performance (Goodfellow et al. (2016), Simonyan and Zisserman (2015)). However, statistical risk bounds for neural networks trained with logistic loss exist only under very restrictive conditions (see, e.g., Kim et al. (2021)), mainly because of the unboundedness of the corresponding excess φ-risk minimizer (see (2)), which leads to slow convergence rates.

In general, many results on CNNs are based on considering them as a special type of feedforward neural networks (FNNs) and then using results on FNNs to derive theoretical properties (Oono and Suzuki (2019) and the literature cited therein). Unfortunately, these results do not cover situations where CNNs outperform FNNs, which is the case in many practical applications, especially in image classification. Generalisation bounds for CNNs with arbitrarily ordered fully connected and convolutional layers were derived in Lin and Zhang (2019). Here the model complexity is bounded by the norm of the convolutional weights, leading to tighter bounds than existing bounds for FNNs. Yarotsky (2021) obtained approximation properties of deep CNNs, but only in an abstract setting, where it is unclear how to apply those results. Kohler et al. (2022) analysed plug-in classifiers based on CNNs and showed that, under proper assumptions on the structure of the a posteriori probability, suitably defined CNNs trained by squared loss achieve a rate of convergence which does not depend on the input dimension of the image. But as, e.g., the experimental results in Golik et al. (2013) show, CNNs learned by cross-entropy loss allow one to find a better local optimum than the squared loss criterion. Therefore, CNNs learned by cross-entropy loss are of higher practical relevance.

Cross-entropy loss or, more generally, convex surrogate loss functions have been studied in Bartlett et al. (2006) and Zhang (2004). Bartlett et al. (2006) showed that for convex loss functions φ satisfying a certain uniform strict convexity condition, the rate of convergence can be strictly faster than the classical n^{-1/2}, depending on the strictness of convexity of φ and the complexity of the class of classifiers. Zhang (2004) analysed how closely the optimal Bayes error rate can be approached using a classification algorithm that computes a classifier by minimizing a convex upper bound of the classification error function. Some results of this article (see Lemma 1) are also used in our analysis. In Lemma 1 b) we derive a modification of Zhang's bound which enables us to derive a better rate of convergence under proper assumptions on the a posteriori probability.
In this paper we derive dimension-free rates for CNN classifiers with cross-entropy loss in a binary image classification problem, where we impose some hierarchical structure on the a posteriori probability. In case that, with high probability, the a posteriori probability is very close to zero or one, meaning that the optimal classification rule makes only a very small error, our rate can even be improved. The first result can be framed as an extension of the analysis of Kohler et al. (2022), who analysed plug-in classifiers based on a class of CNNs in a similar setting. However, it is not straightforward to extend these results to CNNs with cross-entropy loss, as one cannot analyse the classification problem as a nonparametric regression setting and a network approximation for the logistic function is needed. We deal with these difficulties with novel approximation results as well as an alternative proof strategy to bound the excess risk of the classifier.

Image classification
We consider a binary classification problem. Let d_1, d_2 ∈ N, let X = [0, 1]^{d_1 × d_2} be an image space and Y = {-1, 1} the set of corresponding binary labels. We describe a (random) image from (random) class Y ∈ Y by a (random) matrix X ∈ X with d_1 columns and d_2 rows, which contains at position (i, j) the grey scale value of the pixel of the image at the corresponding position. Let P be the probability measure of (X, Y) on X × Y and define the so-called a posteriori probability by η(x) = P{Y = 1 | X = x}. Our aim is to predict Y by a deterministic function g: X → R such that the sign of g(X) is a good prediction of Y. In particular, we aim to minimize the prediction error (0-1 risk), where sgn(x) = 1 if x > 0 and -1 otherwise, and 1(E) is the indicator function of the event E, that is, 1 if the event E occurs and 0 otherwise. It is well known that the Bayes classifier f*(x) = 2η(x) - 1 minimizes the prediction error among all measurable functions (cf., e.g., Theorem 2.1 in Devroye et al. (1996)). But, as the probability measure P of (X, Y) is unknown in practice, we cannot find f*. Instead we estimate f* using the training data D_n = {(X_i, Y_i)}_{i=1}^n, where the (X_i, Y_i) are independent copies of the random vector (X, Y) ∼ P. A popular approach is estimating f* by minimizing the empirical risk over a class of real-valued functions F_n. However, minimizing the empirical risk with 0-1 loss over F_n is NP-hard and thus computationally not feasible (Bartlett et al. (2006)).
By replacing the number of misclassifications by a convex surrogate loss φ, one can overcome these computational problems. For a given loss φ we search for an estimate fn ∈ F_n that minimizes the empirical φ-risk.
By the law of large numbers, the empirical φ-risk converges to the population φ-risk as n → ∞. A wide variety of classification methods are based on the idea of replacing the 0-1 risk by some kind of convex surrogate loss. In particular, AdaBoost (Friedman et al. (2000)) employs the exponential loss exp(-z), support vector machines often use the hinge loss max(1 - z, 0) (Vapnik (1998)), and logistic regression applies the log loss log(1 + exp(-z)) (Hastie et al. (2009)). In the context of CNNs and image classification it is standard to use the cross-entropy or log loss. Therefore we fix φ(x) = log(1 + exp(-x)) in the following.
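To make the role of the surrogate loss concrete, the following small numerical sketch (purely illustrative; data and classifier scores are randomly generated placeholders) evaluates the empirical φ-risk for the three surrogate losses mentioned above.

```python
import numpy as np

def exponential_loss(z):    # AdaBoost surrogate
    return np.exp(-z)

def hinge_loss(z):          # support vector machine surrogate
    return np.maximum(1.0 - z, 0.0)

def logistic_loss(z):       # log / cross-entropy loss phi(z) = log(1 + exp(-z))
    return np.log1p(np.exp(-z))

def empirical_phi_risk(scores, labels, phi):
    # (1/n) sum_i phi(Y_i * f(X_i)); all three losses depend only on the margin Y_i * f(X_i)
    return np.mean(phi(labels * scores))

rng = np.random.default_rng(0)
labels = rng.choice([-1, 1], size=200)                    # labels Y_i in {-1, +1}
scores = labels * 1.5 + rng.normal(0.0, 1.0, size=200)    # scores f(X_i) of a hypothetical classifier

for name, phi in [("exponential", exponential_loss),
                  ("hinge", hinge_loss),
                  ("logistic", logistic_loss)]:
    print(name, empirical_phi_risk(scores, labels, phi))
```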
The classification performance of an estimator fn minimizing the empirical φ-risk over F_n is measured by its excess risk. Accordingly, we also consider the excess φ-risk, where in case of φ(x) = log(1 + exp(-x)) the φ-risk is minimized by f*_φ(x) = log(η(x)/(1 - η(x))) (cf. Friedman et al. (2000)). Our following result states a relation between the excess risk and its logistic surrogate counterpart.
Lemma 1. Define fn, f* and f*_φ as above.
a) Then
In both parts the expectation is taken over the training data D_n.
Remark 1. We use two different bounds on the excess risk because, under a proper assumption on the distribution of (X, Y) (see Assumption 2 below), E_φ(f*_φ) is small, so that part (b) of this lemma leads to faster rates.

Hierarchical max-pooling model
In order to derive nontrivial rate of convergence results for the excess φ-risk of any estimate it is necessary to restrict the class of distributions (cf. Cover (1968) and Devroye (1982)). In case of the logistic loss we have f*_φ(x) = log(η(x)/(1 - η(x))), showing that f*_φ is a monotone transformation of the a posteriori probability η. Hence we need to impose some assumptions on η. As in Kohler et al. (2022) we assume that the a posteriori probability fulfills some (p, C)-smooth hierarchical max-pooling model. As smoothness measure we use the following definition of (p, C)-smoothness, for which we use the usual multi-index notation.

Definition 1. Let p = q + s for some q ∈ N_0 and 0 < s ≤ 1, and let C > 0. A function f: R^d → R is called (p, C)-smooth, if for every α = (α_1, . . . , α_d) ∈ N_0^d with Σ_{j=1}^d α_j = q the partial derivative ∂^q f/(∂x_1^{α_1} · · · ∂x_d^{α_d}) exists and satisfies |∂^q f/(∂x_1^{α_1} · · · ∂x_d^{α_d})(x) - ∂^q f/(∂x_1^{α_1} · · · ∂x_d^{α_d})(z)| ≤ C · ||x - z||^s for all x, z ∈ R^d.

For the next definitions we frequently use the following notation: for M ⊆ R^d and x ∈ R^d we define x_M as the part of x corresponding to M.

The definition of hierarchical max-pooling models is motivated by the following observation: human beings often decide whether a given image contains some object, e.g., a car, by scanning subparts of the image and checking whether the searched object is in this subpart. For each subpart the human estimates a probability that the searched object is on it. The probability that the whole image contains the object is then simply the maximum of the probabilities for the subparts of the image. This leads to the definition of a max-pooling model for the a posteriori probability.
Definition 2. We say that m satisfies a max-pooling model with index set I, if m(x) is given by the maximum of a fixed function applied to the subparts of x whose positions are collected in the index set I.

Additionally, the probability that a subpart contains the searched object is composed of several decisions about whether parts of the searched object are identifiable. This motivates the hierarchical structure of our model. In the following we denote the four block matrices of a matrix by its top left, top right, bottom left and bottom right quarters.

Definition 3. We say that a function satisfies a hierarchical model of level l, if there exist functions g_{k,s} which compute its value by recursively combining, over l levels, the values obtained on the four block matrices of the input.

An illustration of Definition 3 for l = 2 is shown in Figure 1.
Combining Definitions 1, 2 and 3 leads to the final definition of (p, C)-smooth hierarchical max-pooling models.
Definition 4. We say that m: [0, 1]^{d_1 × d_2} → R satisfies a (p, C)-smooth hierarchical max-pooling model of level l, if the function in the definition of this max-pooling model satisfies a hierarchical model with level l and if all functions g_{k,s} in the definition of the functions of m are (p, C)-smooth for some C > 0.
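As an informal illustration of Definitions 2-4 (a minimal sketch only: one placeholder function per level instead of the general collection g_{k,s}, and no smoothness is enforced), the following code evaluates a level-2 hierarchical model on every 4 × 4 subpart of a toy image and takes the maximum over all subparts.

```python
import numpy as np

def hierarchical_value(patch, g_funcs, level):
    """Evaluate a hierarchical model of the given level on a 2^level x 2^level patch.
    g_funcs[level] maps the four values computed on the four block matrices at the
    previous level to one value."""
    if level == 0:
        return patch.item()                 # a single pixel value
    half = patch.shape[0] // 2
    blocks = [patch[:half, :half], patch[:half, half:],
              patch[half:, :half], patch[half:, half:]]
    values = [hierarchical_value(b, g_funcs, level - 1) for b in blocks]
    return g_funcs[level](*values)

def max_pooling_model(image, g_funcs, level):
    """m(x) = maximum of the hierarchical value over all 2^level x 2^level subparts of x."""
    size = 2 ** level
    d1, d2 = image.shape
    return max(hierarchical_value(image[i:i + size, j:j + size], g_funcs, level)
               for i in range(d1 - size + 1) for j in range(d2 - size + 1))

# Placeholder functions for the two levels (hypothetical; the model only assumes
# suitable smoothness of the functions g_{k,s}).
g_funcs = {1: lambda a, b, c, d: (a + b + c + d) / 4.0,
           2: lambda a, b, c, d: max(a, b, c, d)}

rng = np.random.default_rng(1)
image = rng.uniform(0.0, 1.0, size=(8, 8))   # toy grey-scale image
print(max_pooling_model(image, g_funcs, level=2))
```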

Convolutional neural networks
We consider CNNs that take d_1 × d_2-dimensional images as input and produce a one-dimensional output. As the name suggests, the most important operation of a CNN is its convolution. The main idea behind it is to apply filters, i.e., small weight matrices, to the input image to extract high-level information. Mathematically, a convolution can be described as follows: Let X be a d_1 × d_2 input matrix, let X_{i,j} be its ℓ × ℓ block matrix with entries (X_{i+a,j+b})_{a,b=0,...,ℓ-1} and let W be a corresponding filter of size ℓ. The entry (i, j) of a resulting channel C can then be described as the sum of the entries of W ⊙ X_{i,j}, where ⊙ denotes the Hadamard product. Finally an activation function σ is applied componentwise, i.e., C_{i,j} := σ(C̃_{i,j}). This in turn means that the final channel C consists of entries computed by applying the activation function σ to the sum of the entries of the Hadamard product between the filter and the respective block matrix of the input. We set C = σ(W ⋆ X), with σ(x) = max{x, 0} being the ReLU activation function. Here ⋆ describes the computation of each entry C̃_{i,j} as in (5), where σ is applied componentwise. One can see that the weights generating the feature map C are shared, which has the advantage of reducing the complexity of the model and the training time of the networks.

Usually a CNN consists of several convolutional layers. Each convolutional layer l (l ∈ {1, . . . , L}) consists of k_l ∈ N channels (also called feature maps), while the filter size M_l ∈ {1, . . . , min{d_1, d_2}} per layer is fixed. In our setting we make use of so-called zero-padding, meaning that we enlarge each channel by appending zero matrices on each side such that the convolution does not change the in-plane dimension. This, in turn, means that every resulting channel has size d_1 × d_2. For given filters, the s-th channel of layer l (s = 1, . . . , k_l, l = 1, . . . , L) can be described recursively in terms of the channels of the previous layer, with C_{1,0} = X and k_0 = 1. In our network, a max-pooling layer is applied only in the last step to the values of the last convolutional layer L. As in Langer and Schmidt-Hieber (2022) we consider a global max-pooling, where we extract from every channel C_{s,L} (s = 1, . . . , k_L) the largest absolute value. A CNN with L ∈ N convolutional layers and one pooling layer, channel vector k = (k_1, . . . , k_L) ∈ N^L and filter vector M = (M_1, . . . , M_L) ∈ N^L, where k_i describes the number of channels and M_i describes the size of the filters in layer i, respectively, can be described as a function f with C_{s,L} recursively defined as in (6). We denote this network class by F^C_{L,k,M}.

After the convolutional and pooling layers, typically several fully connected layers are applied. Again we choose the ReLU activation function σ(x) = max{x, 0}. Following the definition in Schmidt-Hieber (2020), a fully connected network with L ∈ N hidden layers and width vector k = (k_0, . . . , k_{L+1}) ∈ N^{L+2}, where k_i denotes the number of neurons in layer i, can be described by a function f composed of affine maps and componentwise applications of σ, where W_j is a k_j × k_{j+1} weight matrix and v_j is the j-th shift vector. We denote the class of fully connected neural networks by F_{L,k}.
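The following NumPy sketch illustrates (5) and the global max-pooling step for a single convolutional layer (illustrative code only, not the network class used in the proofs; the zero-padding is appended on the right and at the bottom, and the filter weights are random placeholders).

```python
import numpy as np

def conv2d_zero_pad(channel, filt):
    """One output channel of a convolution: zero-padding, Hadamard product of the filter
    with every block matrix, summation of its entries and componentwise ReLU (cf. (5))."""
    ell = filt.shape[0]
    d1, d2 = channel.shape
    # zero-padding so that the resulting channel keeps the in-plane dimension d1 x d2
    padded = np.zeros((d1 + ell - 1, d2 + ell - 1))
    padded[:d1, :d2] = channel
    out = np.empty((d1, d2))
    for i in range(d1):
        for j in range(d2):
            block = padded[i:i + ell, j:j + ell]   # the ell x ell block matrix at (i, j)
            out[i, j] = np.sum(filt * block)       # sum of the entries of the Hadamard product
    return np.maximum(out, 0.0)                    # ReLU activation, applied componentwise

def global_max_pooling(channels):
    """Extract from every channel the largest absolute value."""
    return np.array([np.max(np.abs(c)) for c in channels])

rng = np.random.default_rng(2)
X = rng.uniform(0.0, 1.0, size=(16, 16))               # input image in [0, 1]^{d1 x d2}
filters = [rng.normal(size=(3, 3)) for _ in range(4)]  # k_1 = 4 filters of size M_1 = 3
channels = [conv2d_zero_pad(X, W) for W in filters]    # channels of the first convolutional layer
print(global_max_pooling(channels))                    # one pooled value per channel
```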
Our final function class F_{n,Θ} is then a composition of convolutional and fully connected layers, where Θ = (L, k^{(1)}, k^{(2)}, M) collects the corresponding depth, channel, width and filter parameters. Accordingly, we denote by f^{CNN}_n the empirical risk minimizer, defined in (8), based on our class of CNNs.
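Continuing in the same illustrative spirit, the next sketch applies a small fully connected ReLU network to a stand-in for the pooled channel values and evaluates the empirical φ-risk of the resulting composed function on a toy sample; the weights are random placeholders, whereas the estimate in (8) would minimize exactly this quantity over F_{n,Θ}.

```python
import numpy as np

def fully_connected(z, weights, biases):
    """Fully connected ReLU network applied to the pooled channel values z."""
    for W, b in zip(weights[:-1], biases[:-1]):
        z = np.maximum(W @ z + b, 0.0)            # hidden layers with ReLU activation
    return (weights[-1] @ z + biases[-1]).item()  # one-dimensional output

rng = np.random.default_rng(3)
# Stand-in for the convolutional part: pooled values of k_L = 4 channels per image.
pooled = rng.uniform(0.0, 1.0, size=(50, 4))
labels = rng.choice([-1, 1], size=50)

# Random placeholder parameters; an ERM would minimize the quantity below over all of them.
weights = [rng.normal(size=(8, 4)), rng.normal(size=(1, 8))]
biases = [rng.normal(size=8), rng.normal(size=1)]

scores = np.array([fully_connected(z, weights, biases) for z in pooled])
print("empirical phi-risk:", np.mean(np.log1p(np.exp(-labels * scores))))
```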
In general, deep learning theory can be roughly divided into three parts, namely expressivity, generalisation and optimisation (see Kutyniok (2020)). While the intersection of all three aspects has only been analysed in very limited settings so far, e.g., for shallow neural networks and a restricted class of regression functions (see, e.g., Braun et al. (2021)), most statistical risk bounds for neural networks exclude the optimisation algorithm and deal with the empirical risk minimizer (ERM) instead (see, e.g., Schmidt-Hieber (2020), Bauer and Kohler (2019), Kohler and Langer (2021)). Following this line of work, we also analyse the ERM based on a particular class of CNNs. It therefore remains an open question whether similar rates can be shown for CNNs trained with (stochastic) gradient descent. In case of overparametrized CNNs, Du et al. (2019), e.g., could show that gradient descent is able to find the global minimum of the empirical loss function. But for overparametrized networks one cannot use standard generalisation bounds, as these usually depend on the number of parameters. Therefore a completely new statistical approach is needed for this analysis.

Main Result
In this section we derive convergence rates for the excess risk of f^{CNN}_n under the assumption that the a posteriori probability η fulfills a (p, C)-smooth hierarchical max-pooling model (see Definition 4). Before providing our results, we impose two assumptions on the distribution of (X, Y).
Assumption 1. For p ≥ 1 and C > 0 arbitrary, η(x) = P{Y = 1|X = x} satisfies a (p, C)-smooth hierarchical max-pooling model of finite level l, and supp(P_X) ⊆ [0, 1]^{d_1 × d_2}.

The second assumption is a margin condition on the a posteriori probability.

Assumption 2. For f*_φ being the minimizer of E_φ(g) and n ∈ N, it holds

Assumption 2 requires that with high probability the a posteriori probability is very close to zero or one, and hence the optimal classification rule makes only a very small error. This is in particular realistic for various image classification tasks, where object classes can often be confidently distinguished (cf. Kim et al. (2021)). Additionally, a similar assumption was applied in Bos and Schmidt-Hieber (2022) within the context of a multiclass classification problem, also analysing a ReLU network classifier minimising cross-entropy loss.
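Remark 1 noted that E_φ(f*_φ) is small under Assumption 2. To see the connection with η being close to zero or one, note that, writing the φ-risk as E[φ(Y f(X))], the conditional φ-risk of f*_φ(x) = log(η(x)/(1 - η(x))) equals the binary entropy -η log η - (1 - η) log(1 - η), which vanishes as η approaches 0 or 1. A quick numerical check (illustrative values of η only):

```python
import numpy as np

def conditional_phi_risk_of_bayes(eta):
    """phi-risk of f*_phi(x) = log(eta/(1 - eta)) given eta = P{Y = 1 | X = x}:
    eta * phi(f*) + (1 - eta) * phi(-f*) = -eta*log(eta) - (1 - eta)*log(1 - eta)."""
    return -eta * np.log(eta) - (1.0 - eta) * np.log(1.0 - eta)

for eta in [0.5, 0.9, 0.99, 0.999, 1e-6]:
    print(f"eta = {eta:g}: conditional phi-risk = {conditional_phi_risk_of_bayes(eta):.6f}")
```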
Theorem 1. Suppose Assumption 1 holds. Set

where the function π: {1, . . . , L_n} → {1, . . . , l} is defined by

and k^{(2)} = (c_5, . . . , c_5) ∈ N^{L^{(2)}_n}, and define the estimate f^{CNN}_n as in (8). Assume that the constants c_2, . . . , c_5 are sufficiently large.
a) There exists a constant c_6 = c_6(η, d_1, d_2, p, C, l) > 0 such that we have for any n > 1
b) If, in addition, Assumption 2 holds, then there exists a constant c_7 = c_7(η, d_1, d_2, p, C, l) > 0 such that we have for any n > 1
In both parts the expectation is taken over the training data D_n.
Remark 2. An interesting feature of the convergence rates in Theorem 1 is that both rates do not depend on the dimension d_1 · d_2 of the input image. Thus, given that the a posteriori probability fulfills a (p, C)-smooth hierarchical max-pooling model, our estimator circumvents the curse of dimensionality. Under Assumption 2 the rate can even be improved. In our view, these results partly explain the good performance of CNN classifiers on image data.
Remark 3. The definition of the parameters of the estimate in Theorem 1 (such as L^{(1)}_n and L^{(2)}_n) depends on the smoothness and the level of the hierarchical max-pooling model for the a posteriori probability, which are usually unknown in applications. In this case it is possible to define these parameters in a data-dependent way, e.g., by using a splitting-of-the-sample approach (cf., e.g., Chapter 7 in Györfi et al. (2002)).
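A minimal sketch of such a splitting-of-the-sample approach (with hypothetical helper names; any training routine producing a classifier for a given candidate parameter choice could be plugged in): the candidates are fitted on the first half of the data, and the empirical φ-risk on the second half selects among them.

```python
import numpy as np

def logistic_loss(z):
    return np.log1p(np.exp(-z))

def select_by_sample_splitting(X, y, candidate_params, fit):
    """Fit one classifier per candidate parameter set on the first half of the data and
    return the parameters whose classifier has the smallest empirical phi-risk on the
    held-out second half."""
    n = len(y)
    X_learn, y_learn = X[: n // 2], y[: n // 2]
    X_test, y_test = X[n // 2 :], y[n // 2 :]
    best_params, best_risk = None, np.inf
    for params in candidate_params:
        f = fit(X_learn, y_learn, params)          # returns a function x -> real-valued score
        risk = np.mean(logistic_loss(y_test * np.array([f(x) for x in X_test])))
        if risk < best_risk:
            best_params, best_risk = params, risk
    return best_params

# Example usage with a trivial placeholder fitting routine (hypothetical).
rng = np.random.default_rng(4)
X = rng.uniform(size=(100, 8, 8))
y = rng.choice([-1, 1], size=100)
candidates = [{"level": l, "p": p} for l in (1, 2) for p in (1.0, 2.0)]
fit = lambda X_learn, y_learn, params: (lambda x: float(np.mean(x)) - 0.5)
print(select_by_sample_splitting(X, y, candidates, fit))
```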
On the proof. To prove Theorem 1 we use Lemma 1 in combination with the following general upper bound on the excess φ-risk of an empirical risk minimizer fn ∈ F_n, where F_n can be a general function space consisting of functions f: R^{d_1 × d_2} → R.
Lemma 2. Let φ be the logistic loss and D_n = {(X_i, Y_i)}_{i=1}^n. Then the empirical risk minimizer fn defined as in (1) satisfies

Lemma 2 shows that the excess φ-risk of an ERM is bounded from above by the sum of two terms. The first term is the so-called generalisation error. It is closely related to the complexity of the function class and can be bounded using results from empirical process theory. The second one is the approximation error, which measures how rich the function class F_n is, i.e., whether we can express the problem under consideration by a function of F_n. In the following, F_n is chosen to be a class of convolutional neural networks, i.e., F_n = F_{n,Θ}, and the estimator under consideration is defined as in (8).
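For orientation, the usual decomposition behind such a bound is sketched below (a standard argument, written with \hat f_n for the ERM fn; the precise constants in Lemma 2 may differ).

```latex
% For any f in F_n, since \hat f_n minimizes the empirical phi-risk E^n_phi over F_n:
E_\varphi(\hat f_n) - E_\varphi(f^*_\varphi)
  = \underbrace{E_\varphi(\hat f_n) - E^n_\varphi(\hat f_n)}_{\le\, \sup_{g \in \mathcal{F}_n}|E_\varphi(g)-E^n_\varphi(g)|}
  + \underbrace{E^n_\varphi(\hat f_n) - E^n_\varphi(f)}_{\le\, 0}
  + \underbrace{E^n_\varphi(f) - E_\varphi(f)}_{\le\, \sup_{g \in \mathcal{F}_n}|E_\varphi(g)-E^n_\varphi(g)|}
  + \bigl(E_\varphi(f) - E_\varphi(f^*_\varphi)\bigr).
% Taking the infimum over f in F_n yields
E_\varphi(\hat f_n) - E_\varphi(f^*_\varphi)
  \le 2\,\sup_{g \in \mathcal{F}_n}\bigl|E_\varphi(g) - E^n_\varphi(g)\bigr|
    + \inf_{f \in \mathcal{F}_n}\bigl(E_\varphi(f) - E_\varphi(f^*_\varphi)\bigr).
```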

Approximation error
The approximation error from Lemma 2 can be bounded in terms of how well functions in F_{n,Θ} approximate g(η(x)), with g closely related to the logit transformation appearing in f*_φ. This, in turn, means that in order to find a satisfying bound for our approximation error we need to build a CNN which approximates g(η(x)) properly. Using the compositional structure of neural networks, one can break this task down into two parts. On the one hand we show that CNNs can approximate η(x), i.e., (p, C)-smooth hierarchical max-pooling models. On the other hand we build a fully connected neural network that approximates g. The approximation result for g is the following.
Lemma 3. Let K ∈ N with K ≥ 6. Let η: R^d → [0, 1] and let η̄: R^d → R be such that ||η - η̄||_∞ ≤ ε for some 0 ≤ ε ≤ 1/K. Then there exists a neural network ḡ: R → R with ReLU activation function and K + 3 hidden layers with 7 neurons per layer, which is bounded in absolute value by log(K + 1) and which satisfies

A complete proof is found in the appendix. Roughly, it is based on the idea that functions of the corresponding form ḡ(z) can be computed by a ReLU network with K + 3 hidden layers and 7 neurons per layer.
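The construction in the appendix is not reproduced here; as a rough numerical illustration only, the following sketch approximates the logit transformation z ↦ log(z/(1 - z)), which appears in f*_φ, on the interval [1/(K + 1), K/(K + 1)] by a piecewise linear interpolant with K + 1 breakpoints (the kind of function a ReLU network of the stated size can represent), and reports the resulting sup-norm error and the bound on its absolute value.

```python
import numpy as np

K = 20
logit = lambda z: np.log(z / (1.0 - z))
# Breakpoints on [1/(K+1), K/(K+1)]; on this interval the logit stays bounded by log(K).
z_breaks = np.linspace(1.0 / (K + 1), K / (K + 1), K + 1)

def g_bar(z):
    """Piecewise linear interpolant of the logit at the breakpoints (constant outside);
    a ReLU network with O(K) layers of constant width can represent such a function."""
    return np.interp(z, z_breaks, logit(z_breaks))

grid = np.linspace(1.0 / (K + 1), K / (K + 1), 10001)
print("sup-norm error on the interval:", np.max(np.abs(g_bar(grid) - logit(grid))))
print("max |g_bar|:", np.max(np.abs(g_bar(grid))), "<= log(K+1) =", np.log(K + 1))
```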
Combining Lemma 3 with the approximation result on the hierarchical max-pooling models, we are then able to show the following approximation result.
The complete proof of this result is given in the appendix.

Generalisation error
The generalisation error sup_{f ∈ F_n} |E_φ(f) - E^n_φ(f)| can be bounded using results from empirical process theory together with bounds on the covering number of CNNs.
In particular, using Theorem 9.1 in Györfi et al. (2002), it holds for any ε > 0 that the deviation between the empirical and the population φ-risk can be bounded in terms of a covering number. Here H_n denotes the class of functions h(x, y) = φ(y · f(x)) with f ∈ F_{n,Θ}, and N_1(ε, F, x_1^n) describes the ε-covering number of F on x_1^n, that is, the smallest number N ∈ N of functions f_1, . . . , f_N such that for every f ∈ F there exists an i ∈ {1, . . . , N} with (1/n) Σ_{j=1}^n |f(x_j) - f_i(x_j)| ≤ ε. As every ε-cover of F_{n,Θ} induces an ε-cover of H_n, the covering number of H_n can be bounded by that of F_{n,Θ}. The following bound on the covering number of F_{n,Θ} then helps us to bound the generalisation error.
Lemma 4. Define F_{n,Θ} as in (7) and let k_max and M_max denote the maximal number of channels respectively neurons per layer and the maximal filter size. Then we have for any ε ∈ (0, 1),

for some constant c_11 > 0 which depends only on k_max and M_max.
The proof of this result follows from Lemma 7 in Kohler et al. (2022).
where the expectation is taken over the training data D_n. Note that the last inequality follows since fn(x) ≥ 0 implies fn(x) ≥ 1/2 and fn(x) < 0 implies fn(x) < 1/2, and consequently we have

and

Using the mean value theorem it follows for j ∈ {1, 2} and z_1, z_2 ∈ [0, 1]

Choosing z_1 = fn(x) and z_2 = η(x), this, in turn, means that (9) can be further bounded by

Proofs of Section 3
This section contains the missing proofs of Theorem 1 and Lemma 2.
Proof of Theorem 1. By Lemma 1 b), and by using Assumption 2 together with Lemma 3 in Kim et al. (2021), where we choose

an application of Lemma 2, Lemma 8 and Theorem 2 yields the assertion. □

Proof of Lemma 2. This result is the standard error bound for empirical risk minimization. For the sake of completeness we nevertheless present a complete proof.
Let f ∈ F_n be arbitrary. Then the definition of fn implies

Proofs of Section 4
This section presents the missing proofs of Lemma 3 and Theorem 2.
In the same way we obtain the remaining bound. □

In order to prove Theorem 2 we need the following three auxiliary results.
Lemma 5. Let d ∈ N, let f: R^d → R be (p, C)-smooth for some p = q + s, q ∈ N_0 and s ∈ (0, 1], and C > 0. Let A ≥ 1 and M ∈ N be sufficiently large (independent of the size of A, but depending on the remaining parameters).

Proof. See Lemma 5 in Kohler et al. (2022). □

Proof of Theorem 2. For each g_{k,s} in the hierarchical max-pooling model for η we select an approximating neural network from Lemma 5 a) which approximates g_{k,s}: R^4 → R up to an error of order (L^{(1)}_n)^{-2p/4}. Then we use Lemma 7 to generate with these networks a convolutional neural network, and combine this network with the feedforward neural network with L^{(2)}_n hidden layers.

Lemma 8. Let φ be the logistic loss and D_n = {(X_i, Y_i)}_{i=1}^n. Let F_{n,Θ} be defined as in (5). Then

Proof. Let h_i(x, y) = φ(y · f_i(x)) ((x, y) ∈ R^{d_1 × d_2} × {-1, 1}) for some f_i: R^{d_1 × d_2} → R. Together with the Lipschitz continuity of φ, it follows

Thus, if {f_1, . . . , f_ℓ} is an ε-cover of F_n, then {h_1, . . . , h_ℓ} is an ε-cover of H_n. Then

Set L_max = max{L^{(1)}_n, L^{(2)}_n}. Lemma 4 implies

and the term involving √n is bounded from below by 1/n.
Here we have used for the last inequality that w.l.o.g. we can assume min{c_10, c_12, c_13} ≥ 1. Then the assertion follows. □