Nonoverlapping Feature Projection Convolutional Neural Network With Differentiable Loss Function

We propose NullSpaceNet, a novel network that maps from the pixel-level image to a joint-nullspace, as opposed to the traditional feature space. The features in the proposed learned joint-nullspace have a clearer interpretation and are more separable. NullSpaceNet ensures that all input images belonging to the same class are collapsed into one point in this new joint-nullspace, while input images of different classes are collapsed into different points with high separation margins. Moreover, a novel differentiable loss function is proposed that has a closed-form solution with no free parameters. The NullSpaceNet architecture consists of two components: 1) a feature extractor backbone (i.e., the convolution and pooling layers), which extracts features from the input, and 2) a nullspace layer, which maps from the pixel-level image to the joint-nullspace. This novel architecture and formulation results in a significant reduction in the number of learnable parameters in the network. NullSpaceNet is architecture-agnostic, meaning it can use any feature extractor as the backbone of its first component. NullSpaceNet exhibits superior performance when tested on four different datasets against VGG16, MobileNet-224, and MNASNET1-0. In general, NullSpaceNet needs only 1–30% of the time it takes a traditional CNN to classify a batch of images, with accuracy gains of up to +2.57%.

Impact Statement: Convolutional neural networks (CNNs) have achieved excellent performance in most computer vision tasks. However, the current formulation of CNNs lacks a clear interpretation of the learned features in the feature space. Moreover, most learnable parameters are located in the classifier component (i.e., the fully connected layers), which requires extensive computation during training and inference. We propose a novel feature space called NullSpaceNet. We provide theoretical and experimental evidence that the learned nullspace features are more discriminative than the feature-space features. Moreover, the new formulation decreases the number of parameters by up to 86% while improving accuracy by up to +2.57%. NullSpaceNet sets a new line of architecture-formulation research by improving performance while decreasing the number of learnable parameters. The computer vision and deep learning communities will benefit from the NullSpaceNet formulation, as it is architecture-agnostic.


I. INTRODUCTION
In recent years, convolutional neural networks (CNNs) have revolutionized computer vision tasks such as object tracking [1]-[4], surveillance systems [5], image understanding [6], computer interactions [7], and generative models [8]. Image classification is one of the core tasks in computer vision, especially in large-scale visual recognition challenges (e.g., ILSVRC15) [9]. Most classification networks consist of two components: 1) the feature extractor and 2) the classifier. The feature extractor uses a stack of convolutional layers to extract deep features from the input images through consecutive convolutional operations. The classifier uses fully connected layers followed by a softmax layer. It has been shown that most learnable parameters within a classification network are located in the fully connected layers [10]. For example, the classifier in VGG16 has 102.76 million parameters, while the feature extractor has only 32 million. This huge number of learnable parameters requires extensive computation during both training and inference.
In this article, we propose NullSpaceNet, a novel network that maps from the pixel-level image to a joint-nullspace, as opposed to a traditional CNN that maps to a classical feature space. The features in this newly learned joint-nullspace have a clear interpretation and are more separable. In particular, instead of using the fully connected layers with categorical cross-entropy, NullSpaceNet maps the pixel-level image to a joint-nullspace. All input images from the same class are collapsed into one point in this new joint-nullspace and the input images from different classes are collapsed into different points with high separation margins. Moreover, the hyperplane that has the orthonormal vectors of the projected joint-nullspace features is well-defined and can be described, as shown in Fig. 2 and (24).
The architecture of NullSpaceNet consists of two components: 1) a feature extractor backbone (i.e., the convolution and pooling layers), which is used to extract features from the input, and 2) a nullspace layer, which maps from the pixel-level image to the joint-nullspace. NullSpaceNet is architecture-agnostic, which means it can use any feature extractor as a backbone in its first component. For example, the VGG16 feature extractor component (i.e., VGG16 without the fully connected layers) can be used as the feature extractor backbone in NullSpaceNet. The core idea of NullSpaceNet is to minimize the within-class scatter matrix to be zero or very close to zero, while keeping the between-class scatter matrix always positive. This makes the classification task more robust, as shown in Fig. 4. The pretrained network and the source code are available online.¹ To summarize, the main contributions of this article are as follows.
1) A novel network (NullSpaceNet) that learns to map from the pixel-level image to a joint-nullspace. The formulation of NullSpaceNet ensures that the input images from the same class are collapsed into a single point, while the ones from different classes are collapsed into different points with high separation margins. NullSpaceNet is architecture-agnostic, which means it can easily integrate different feature extractors in its architecture.

2) A differentiable loss function is developed to train NullSpaceNet. The proposed loss function differs from the standard categorical cross-entropy: it ensures that the within-class scatter matrix vanishes while maintaining a positive between-class scatter matrix, and it has a closed-form solution with no free parameters.
3) The proposed NullSpaceNet has a clear interpretation of the learned features, both mathematically and geometrically.

These three contributions result in NullSpaceNet needing only 1–30% of the time it takes a traditional CNN to classify a batch of images, with an accuracy gain of up to 2.57% across the four datasets used in testing.
The rest of the article is organized as follows: related work is presented in Section II, then Section III details the proposed NullSpaceNet method. The training and inference phases are presented in Section IV. The experimental results are presented in Section V. The results and discussion are detailed in Section VI. Finally, Section VII concludes this article.
II. RELATED WORK

Advancements in Deep Learning Architecture: VGG [13] was one of the earliest architectures with deep layers. VGG has different configurations, such as VGG with sixteen layers (i.e., VGG16) and VGG with nineteen layers (i.e., VGG19). The core idea behind the VGG architecture is to increase the depth of the network with a very small receptive field of (3 × 3). Increasing the depth improved performance, and VGG was the winner of ILSVRC2014 [14]. VGG16 has over 134 million trainable parameters, 15.3 billion floating-point operations (FLOPs), and 74.4% accuracy, which makes the inference time per image about 2 ms.
Another paradigm of deep architectures uses skip connections in forward and backward propagation. This paradigm includes two architectures, DenseNet [15] and ResNet [16]. In DenseNet, the authors noted that the layers close to the input layer are more accurate and efficient than those close to the output layer. Therefore, DenseNet connects each layer to every other layer in a feed-forward fashion, which makes the feature map at each layer a composition of all preceding layers. Hence, DenseNet tackles the vanishing gradient problem and boosts performance. Similar to the VGG architecture, DenseNet has fully connected layers for classification. DenseNet121 has 7.2 million parameters, 3 billion FLOPs, 74.98% accuracy, and an inference time of about 0.4 ms. ResNet adopts the skip connection in its architecture and adds more layers. ResNet has different configurations in terms of the number of layers; for example, ResNet18 has eighteen layers, ResNet34 has thirty-four, ResNet50 has fifty, and ResNet101 has one hundred and one. ResNet50 performs 3.8 billion FLOPs during forward propagation at test time, with 25.7 million parameters and 75.30% accuracy.
To design a lightweight CNN, an automated mobile neural architecture search (MnasNet) was proposed for mobile devices. MnasNet is an optimized neural search with a tradeoff between accuracy and latency. MnasNet has 3.9 million parameters, 75.20% accuracy, and an average inference time of 78 ms. It is worth mentioning that MnasNet has about 1 million parameters in its fully connected layers. Similar to MnasNet, MobileNet has been proposed to work efficiently on embedded devices. MobileNet is based on depthwise separable convolutions. MobileNet has 5.4 million parameters in total, of which 469,460 are in the classification part (i.e., the fully connected layer). The accuracy of MobileNet is 70.60%, and the average inference time per image is 0.98 ms with 5.8 million FLOPs.
More recent architectures have been developed or modified to further improve performance. For example, EfficientNet [17] used neural architecture search to carefully balance network depth, width, and resolution. By scaling up the baseline architecture (ResNet) along depth, width, and resolution, EfficientNet achieved better performance: 78.8% accuracy with 16.7 billion FLOPs and an average inference time of 0.1 s. Similarly, RegNet [18] is designed based on the observation that network widths and depths can be parametrized by a quantized linear function. Moreover, instead of designing individual network instances, RegNet parametrizes populations of networks. RegNet achieved 79.9% accuracy and 8 billion FLOPs, with 39.2 million parameters and an average inference time of 0.0022 s. NFNet [19] developed adaptive gradient clipping to train ResNet without normalization. This class of normalizer-free architectures improved performance and is up to 8.7× faster to train compared to ResNet. NFNet achieved 83.96% accuracy and 12.38 billion FLOPs, with 71.5 million parameters and an average inference time of 0.0088 s.
Nullspace and Linear Discriminant Analysis: Nullspace and linear discriminant analysis (LDA) have existed as analytical methods for a long time [20]- [25]. LDA has frequently been employed as a dimensionality reduction tool or feature extractor within the field of classification [26]- [34].
The authors in [35] proposed applying hybrid orthogonal projection estimation (HOPE) to CNNs for image classification. HOPE is a hybrid model that combines orthogonal linear projection, for feature extraction, with mixture models. The idea in HOPE is to allow the extraction of useful information from high-dimensional feature vectors while filtering out irrelevant noise. HOPE has an error rate of 7.57%. Tian et al. [36] used LDA with the Fisher criterion on VGG16 to classify gender from face images and reported an accuracy of 97.5% on CelebA [37]. LDA was applied to the output of the last layer to derive a lightweight version of VGG16, and Bayesian classification was then used to classify the output.
DeepLDA [38] proposed using LDA to learn to maximize the eigenvalues of the Fisher criterion. After training, DeepLDA uses the entire training set to extract the dominant basis vectors onto which new samples are projected, and reports an accuracy of 81.40% on STL10. It is worth mentioning that DeepLDA makes no use of the nullspace.
In contrast to all previous methods in the literature, which modify the baseline architecture (i.e., the internal conv layers), we use the nullspace on any backbone in a learnable way, with a differentiable loss function, to project the pixel-level image to a joint-nullspace. This formulation does not change the baseline architecture (i.e., the internal conv layers), and it removes the majority of the parameters in the classification layers.

III. NULLSPACENET
As discussed in Section II, the proposed NullSpaceNet is novel and different from previous LDA-based approaches. We reformulate the problem using the mathematical definition of the nullspace to train the network end-to-end to project from the pixel-level image onto a new joint-nullspace. The NullSpaceNet architecture consists of two components: 1) a feature extractor backbone and 2) a nullspace layer. NullSpaceNet can use different feature extractor backbones; a (Conv-BatchNormalization-ReLU) layer can be added before the nullspace component to accommodate different backbones and datasets.
In this section and for the sake of demonstration, the formulation is applied to a NullSpaceNet network that uses the VGG16 feature extractor as its backbone. This formulation is also applied to other feature extractor backbones, such as DenseNet [15] and ResNet50 [16], as outlined in Section III-H.

A. Problem Definition
Given a dataset of training images $X = \{x_1, x_2, \ldots, x_N\} \in \mathbb{R}^{w \times h \times d}$, where $w$, $h$, and $d$ are the width, height, and depth of each image, respectively, and $N$ is the number of images in the training dataset. Each image is associated with a class from $C = \{c_1, c_2, \ldots, c_n\}$, where $n$ is the number of classes in the training dataset. In this article, the VGG16 feature extractor component $\phi(x;\theta)$ is used as the backbone feature extractor in NullSpaceNet.
The objective is to force the network to learn a strongly discriminative space, as opposed to the feature space, that maps from the pixel-level image to the joint-nullspace. The features are naturally projected onto the joint-nullspace during training and, hence, during testing. Thus, inference happens in the joint-nullspace rather than in the classical classifiers attached to the backbone network (e.g., VGG16, ResNet, and DenseNet), which contain the majority of the learnable parameters in the network. In other words, there are no fully connected layers; instead, there is a learned joint-nullspace, as shown in Fig. 1.

B. Proposed Architecture
NullSpaceNet uses, for example, the VGG16 feature extractor as the backbone. In this setting, each layer consists of (Conv-BatchNormalization-ReLU), and pooling is considered a stand-alone layer.
The novelty of NullSpaceNet lies in the nullspace layer and the differentiable loss function, which is detailed in Section III-C. The nullspace layer forces the network, through backpropagation, to learn the projection from the pixel-level image to a joint-nullspace in which the features have optimal separation margins. The nullspace layer achieves this through the spanning vectors of the optimal within-class scatter matrix (see Section III-C). Formulating the nullspace layer this way prevents the network from encountering the small sample size (SSS) problem (i.e., the model has high-dimensional output features while training on small batches of images) [39].

C. Mathematical Formulation of the Loss Function
Background: VGG16 with FC layers, or any other backbone, tends to minimize the within-class scatter matrix (i.e., the spread of samples within the same class). However, it places no constraint on the between-class scatter matrix (i.e., the spread of classes relative to one another). For example, visual inspection of Fig. 4(b) shows that the feature distribution learned by VGG16+FC is scattered with no constraint on the between-class scatter matrix; some classes (e.g., classes #4 and #5, and classes #2 and #8) overlap.
To derive a differentiable loss function to learn the joint-nullspace, we start from LDA [40]. Here, we emphasize that NullSpaceNet does not learn LDA features; LDA is only used as a starting point to derive the equations of the nullspace. In this article, we assume that the output of the feature extractor component of the network for each batch is $F \in \mathbb{R}^{D \times B}$, where $D$ is the dimension of the predicted vector (i.e., the feature vector before it is fed to the classifier) and $B$ is the batch size. We seek an optimal projection space (i.e., joint-nullspace) $P \in \mathbb{R}^{d \times B}$, where $d < D$, that simultaneously minimizes the within-class scatter matrix and maximizes the between-class scatter matrix. To achieve this optimal space, we maximize the Fisher discriminant criterion $J(P)$:

$$J(P) = \frac{|P^T S_b P|}{|P^T S_w P|} \tag{1}$$

where $P$ is the projection space and $S_b$ and $S_w$ are the between-class and within-class scatter matrices, respectively. The projection space that maximizes the between-class scatter matrix and minimizes the within-class scatter matrix is found by optimizing (1):

$$P^* = \arg\max_{P} \frac{|P^T S_b P|}{|P^T S_w P|} \tag{2}$$

Solving (2) yields the generalized eigenvalue problem $S_b P = \lambda S_w P$. If $S_w$ has full rank (i.e., its inverse exists), this can be converted into a standard eigenvalue problem $S_w^{-1} S_b P = \lambda P$, whose solutions are the eigenvectors of $S_w^{-1} S_b$ corresponding to nonzero eigenvalues $\lambda$.
Derivation of the Proposed Novel Loss Function: In addition to the previous derivation, NullSpaceNet forces two constraints on the learning process. In particular, the between-class scatter matrix should always be large and positive, while the within-class scatter matrix is minimized to approach zero:

$$P^T S_b P > 0 \tag{6}$$

$$P^T S_w P \to 0 \tag{7}$$

Lemma 1: When NullSpaceNet satisfies the two constraints in (6) and (7), the feature distribution of each class in the new projection space (i.e., joint-nullspace) approaches the Dirac delta function.
Proof: Assuming, for simplicity, that the features follow a 1-D normal distribution

$$g(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right) \tag{8}$$

where $g(x)$ models the distribution of the features projected by the network onto the joint-nullspace, $\mu$ is the Gaussian mean, and $\sigma$ is the standard deviation. The limit of (8) as $\sigma^2$ approaches zero,

$$\lim_{\sigma^2 \to 0} g(x) = \delta(x - \mu) \tag{9}$$

is the Dirac delta function. In other words, this proves that the feature distribution in the joint-nullspace has the optimal separability among different classes.
Using Lemma 1 in NullSpaceNet: Using (6), (7), and (9) to find the limit of (1) (which guarantees the best separability, as explained above), we get

$$\lim_{P^T S_w P \to 0} J(P) = \lim_{P^T S_w P \to 0} \frac{|P^T S_b P|}{|P^T S_w P|} \to \infty \tag{10}$$

Since the between-class scatter matrix $S_b$ in (6) is hard to calculate, especially for high-dimensional features, we calculate $S_b$ from the total-class scatter matrix $S_t$ and the within-class scatter matrix $S_w$:

$$S_b = S_t - S_w \tag{11}$$

Then, by substituting (11) into (7) and using (6), we get

$$P^T S_t P = P^T S_b P > 0 \quad \text{when} \quad P^T S_w P \to 0 \tag{12}$$

To calculate the scatter matrices in (11) for the output of NullSpaceNet, let the output be $F \in \mathbb{R}^{D \times B}$ for an input batch $X \in \mathbb{R}^{W \times H \times D \times B}$, where $B$ is the batch size. We define the within-class scatter matrix $S_w$ and the total-class scatter matrix $S_t$ as

$$S_w = F_w F_w^T, \qquad S_t = F_t F_t^T \tag{13}$$

where $F_w$ contains the class-mean-centered output features (i.e., the class mean is subtracted from each output feature belonging to that class) and $F_t$ contains the global-mean-centered output features:

$$F_w = [\ldots, f_i - \mu_c, \ldots], \qquad F_t = [\ldots, f_i - \mu_g, \ldots] \tag{14}$$

where $\mu_c$ is the class mean and $\mu_g$ is the global mean of the dataset.
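For concreteness, a minimal PyTorch sketch of (13) and (14) is given below; it computes the centered matrices and scatter matrices for one batch of backbone features. The function and variable names are illustrative and not taken from the released code.

```python
import torch

def scatter_matrices(F, labels):
    """Compute S_w and S_t from backbone features, following (13)-(14).

    F      : (D, B) features, one column per image in the batch.
    labels : (B,) integer class labels.
    """
    mu_g = F.mean(dim=1, keepdim=True)               # global mean of the batch
    F_w = torch.empty_like(F)
    for c in labels.unique():
        idx = (labels == c)
        mu_c = F[:, idx].mean(dim=1, keepdim=True)   # class mean
        F_w[:, idx] = F[:, idx] - mu_c               # center within each class
    F_t = F - mu_g                                    # center around global mean
    S_w = F_w @ F_w.T                                 # within-class scatter (D, D)
    S_t = F_t @ F_t.T                                 # total-class scatter (D, D)
    return S_w, S_t, F_w, F_t
```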
Now, we want to integrate the scatter matrices derived in (13) into the joint-nullspace formulation. Let $U_t$ denote the nullspace of the total-class scatter matrix and $U_w$ denote the nullspace of the within-class scatter matrix. From the definition of the nullspace, and using the fact that $S_t$ is nonnegative definite, we get

$$U_t = \{\, u \mid S_t u = 0 \,\} \tag{15}$$

and, similarly, we get $U_w$ (see details in Appendix (1)).

Lemma 2: The projection matrix $P$ that satisfies the constraints in (6) and (7) can be achieved if, and only if, $P$ lies in the shared space between $U_t^{\perp}$ and $U_w$, mathematically represented as

$$P \in U_t^{\perp} \cap U_w \tag{16}$$

where $U_t^{\perp}$ is the orthogonal complement subspace of $U_t$, spanned by the centered global mean output features. $U_t^{\perp}$ can be obtained using the Gram-Schmidt process [41].
Proof: Geometrically, by examining (15) and (1) (in the Appendix), the only space that satisfies $S_t u = 0$ and $S_w u = 0$ is the joint space where $U_t^{\perp}$ and $U_w$ overlap [26].

Using Lemma 2 in NullSpaceNet: We now have the nullspace of $S_w$, which is $U_w$, and the nullspace of $S_t$, which is $U_t$. One problem when calculating the nullspace of $S_w$ is that its dimensionality is at least $(D + C - n)$, where $D$ is the data dimensionality (which is high when we use the output of NullSpaceNet, e.g., 2048), $C$ is the number of classes, and $n$ is the sample size, as proven in [39]. To address this problem, we revert to (11), where it can be seen geometrically that the nullspace of $S_t$ is the intersection of the nullspace of $S_b$ and the nullspace of $S_w$ [40]. Hence, the nullspace of $S_t$ can be removed based on this observation. We proceed using singular value decomposition (SVD) to decompose $F_t$ as

$$F_t = Y \Sigma_t Z^T \tag{17}$$

where $Y$ and $Z$ are orthogonal and $Y$ has an orthonormal basis.
$\Sigma_t \in \mathbb{R}^{t \times t}$ is the diagonal matrix of singular values. Now we can represent $S_t$ as

$$S_t = F_t F_t^T = Y \Sigma_t \Sigma_t^T Y^T \tag{18}$$

We select a portion of the basis $Y$ with dimension $Y_1 \in \mathbb{R}^{m \times t}$, where $t = \operatorname{rank}(S_t)$. Using the new subspace $U_1$, spanned by this new set of basis vectors, we project the scatter matrices as

$$\tilde{S}_b = Y_1^T S_b Y_1, \qquad \tilde{S}_w = Y_1^T S_w Y_1, \qquad \tilde{S}_t = Y_1^T S_t Y_1 \tag{20}$$

where $\tilde{(\cdot)}$ represents the reduced version of the decomposed $S_b$, $S_w$, and $S_t$. After reducing the complexity in this way, the nullspace of the reduced matrix $\tilde{S}_w$ is used.

Algorithm 1: Steps to Calculate the Proposed Loss Function.
Input: the output of the last layer of VGG16, $F \in \mathbb{R}^{D \times B}$
Output: optimized weights of NullSpaceNet using the proposed differentiable loss function $L$ based on the nullspace formulation
1: Calculate the matrices $F_w$ and $F_t$;
2: Calculate $\mathrm{SVD}(F_t)$;
3: Calculate the scatter matrices $S_w$, $S_b$, $S_t$;
4: Calculate the matrices $\tilde{S}_t$, $\tilde{S}_b$, $\tilde{S}_w$ from (20);
5: Calculate the nullspace $W$ of $\tilde{S}_w$ using (21);
6: Solve for the eigenvalues of $W^T \tilde{S}_b W$ using (22);
7: Formulate the loss function over the average of the nonzero eigenvalues using (23);
8: Use the proposed differentiable loss function in (23) and its derivative, as shown in the Appendix (50), to train the network.
$$\tilde{S}_w W = 0 \tag{21}$$

where $W$ is the nullspace of $\tilde{S}_w$. Finally, the projection matrix that satisfies (6) and (7) is calculated as

$$P = Y_1 W M \tag{22}$$

where $M$ contains the eigenvectors of $W^T \tilde{S}_b W$ corresponding to the nonzero eigenvalues. Consequently, maximizing the eigenvalues of $W^T \tilde{S}_b W$ leads NullSpaceNet to project the features onto the joint-nullspace.
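Putting (17)–(22) together, the reduction and projection steps can be sketched as follows. The thresholding of small singular values and eigenvalues, and the composition $P = Y_1 W M$ back to the original feature space, follow our reading of the equations and should be treated as assumptions rather than the authors' exact implementation.

```python
import torch

def joint_nullspace_projection(F_w, F_t, S_b, tol=1e-6):
    """Sketch of (17)-(22): reduce via the SVD of F_t, take the nullspace of
    the reduced within-class scatter, then keep eigenvectors of W^T S_b~ W."""
    Y, sigma, _ = torch.linalg.svd(F_t, full_matrices=False)   # (17)
    Y1 = Y[:, sigma > tol]                    # basis spanning the range of S_t
    Sw_r = Y1.T @ (F_w @ F_w.T) @ Y1          # reduced within-class scatter, (20)
    Sb_r = Y1.T @ S_b @ Y1                    # reduced between-class scatter, (20)
    w_vals, w_vecs = torch.linalg.eigh(Sw_r)  # eigenvalues in ascending order
    # Detach the nullspace basis: autograd through eigenvectors is unstable
    # when eigenvalues repeat; the paper instead derives an analytic gradient
    # (Appendix (50)).
    W = w_vecs[:, w_vals < tol].detach()      # nullspace W of reduced S_w, (21)
    b_vals, M = torch.linalg.eigh(W.T @ Sb_r @ W)
    M = M[:, b_vals > tol]                    # eigenvectors with nonzero eigenvalues
    P = Y1 @ W @ M                            # projection matrix, (22)
    return P, b_vals
```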

D. Loss Function and Its Gradient
Training NullSpaceNet requires the loss function to be differentiable everywhere. Hence, we propose a novel differentiable loss function that maximizes the positive (or, equivalently, minimizes the negative) of the average nonzero eigenvalues of the decomposed $W^T \tilde{S}_b W$. Given $C$ classes, let $E_i$ denote an eigenvalue and $k = C - 1$ the number of nonzero eigenvalues. The steps to calculate the proposed differentiable loss function are shown in Algorithm 1. The final loss function is

$$L(\phi_E(x;\theta)) = -\frac{1}{k} \sum_{i=1}^{k} E_i \tag{23}$$

where the $E_i$ are the nonzero eigenvalues of $W^T \tilde{S}_b W$ and $\epsilon$ is used for numerical stability of the calculations.
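A sketch of (23) in PyTorch, reusing the two helpers above, could look as follows. The exact placement of $\epsilon$ is not given in closed form here, so the normalization below is an assumption; the text states only that $\epsilon$ stabilizes the computation (and Section V sets it to 1).

```python
import torch

def nullspace_loss(F, labels, eps=1.0):
    """Sketch of (23): the negative average of the (up to) k = C - 1 nonzero
    eigenvalues of W^T S_b~ W. Reuses scatter_matrices and
    joint_nullspace_projection from the sketches above."""
    S_w, S_t, F_w, F_t = scatter_matrices(F, labels)
    S_b = S_t - S_w                                  # (11)
    _, b_vals = joint_nullspace_projection(F_w, F_t, S_b)
    k = labels.unique().numel() - 1                  # k = C - 1
    top = torch.topk(b_vals, min(k, b_vals.numel())).values
    # Assumed role of eps: damp each eigenvalue for numerical stability; the
    # paper states only that eps stabilizes the computation.
    return -(top / (top + eps)).mean()
```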
Using the chain rule [42], the derivative of the loss function in (23) w.r.t. the last layer $H$ is given by $\frac{\partial L(\phi_E(x;\theta))}{\partial H}$ (see details in the Appendix (50)).

E. Metric Learning
The proposed NullSpaceNet is related to metric learning in that both push similar embeddings as close together as possible and dissimilar embeddings as far apart as possible. To achieve this, metric learning learns a similarity function, using either distance functions (such as mean squared error) as loss functions or contrastive/triplet loss functions. These loss functions require multiple embeddings from the backbone network to measure the distance between them, which is usually achieved with two branches of a CNN (i.e., a Siamese network) or three branches. Another line of work trains a single network in a supervised fashion and adds another network, which shares the same parameters, during the inference/testing phase [43]-[46]. In NullSpaceNet, by contrast, we directly optimize the proposed loss function on a single network during training, and no other network is needed in the testing phase. Moreover, NullSpaceNet's loss function does not require multiple embeddings, since we do not learn a similarity function. An additional experiment was conducted on four metric learning datasets, namely, CUB-200-2011 [47], Cars-196 [48], SOP [49], and In-Shop [50], at different Recall@K. The results on these datasets are reported in Table XIII and show that NullSpaceNet outperforms the state-of-the-art methods listed there.

F. Time Complexity of NullSpaceNet
From (17) and (20), NullSpaceNet applies the SVD to $F_t$ with a time complexity of $O(Dn^2)$ instead of $O(D^2 n)$. Moreover, NullSpaceNet with VGG16 has a complexity of 9.8 billion FLOPs, compared to 15.3 billion FLOPs for VGG16. Formulating NullSpaceNet with the approximation of $S_w$ (i.e., $\tilde{S}_w$), as shown in (21), significantly speeds up the training process. Moreover, it gives the network two advantages: 1) the model does not suffer from the small sample size (SSS) problem (i.e., high-dimensional output features while training on small batches of images), as in [38], and 2) it is faster than solving the generalized eigenvalue problem.

G. Insights Into NullSpaceNet
In this section, we provide a deeper look, both mathematically and geometrically, into the proposed NullSpaceNet.
Mathematical Insights: The main idea of NullSpaceNet is to learn to map the input data to another subspace (different from the traditional feature space) that satisfies the two constraints in (6) and (7). The proposed subspace (i.e., the joint-nullspace) mathematically forces the within-class scatter matrix to vanish through the optimization of the proposed loss function in (23). Meanwhile, the joint-nullspace mathematically forces the between-class scatter matrix to always be positive through the same optimization.

Geometric Insights: The features produced by NullSpaceNet live in the hyperplane represented by $U_t^{\perp} \cap U_w$, as shown in Fig. 2. This hyperplane is well defined, and all the features are located in a confined space that can be precisely described both mathematically and geometrically.
Based on the discussion of the above insights, this proves our claim that the same class inputs are collapsed into one point in the joint-nullspace, and the different classes are collapsed into different points with high separation margins.

H. Using Other Feature Extractors
The NullSpaceNet formulation can be applied to different feature extractor backbones. The only requirement is that the last layer produces features of spatial size $F \in \mathbb{R}^{D \times 1}$. For example, with the MnasNet feature extractor, a convolutional layer with kernel size 3 is used to produce a 2-D tensor of shape $D \times 1$, as shown in Fig. 3. With the MobileNet feature extractor, a convolutional layer with kernel size 7 is used to produce a tensor of shape $D \times 1$, as shown in Fig. 3(b).
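As an illustration of this adaptation, the sketch below wraps a feature extractor so that a batch yields feature columns $F \in \mathbb{R}^{D \times B}$ (each image a $D \times 1$ column). The kernel sizes follow the text (3 for MnasNet-style maps, 7 for MobileNet-style maps), while the wrapper class and channel handling are our own.

```python
import torch
import torch.nn as nn

class NullSpaceBackbone(nn.Module):
    """Wrap a feature extractor so each image yields a (D, 1) feature column."""
    def __init__(self, features: nn.Module, in_ch: int, kernel: int):
        super().__init__()
        self.features = features
        # kernel = 7 collapses a MobileNet-style 7x7 final map to 1x1;
        # kernel = 3 matches the MnasNet case described in the text.
        self.reduce = nn.Conv2d(in_ch, in_ch, kernel_size=kernel)

    def forward(self, x):
        f = self.reduce(self.features(x))   # (B, D, 1, 1) when kernel fits the map
        return f.flatten(1).T               # (D, B), one column per image
```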

IV. TRAINING AND INFERENCE PHASES

A. NullSpaceNet Training Phase
The input batch of images is fed into the input layer as shown in Fig. 1. The batch goes through NullSpaceNet's feature extractor layers, which include convolution, batch normalization, and pooling. Then, it goes to the new nullspace layer, where all calculations and the new loss function in (23) are performed, as shown in Algorithm 1.
During training, we keep track of the mean of each class (i.e., $\mu_k = (\mu_1, \ldots, \mu_C)$) from the last layer using a moving average. After training, the eigenvectors of the decomposed $W^T \tilde{S}_b W$ are calculated using the moving average of each class, which is then used in (22) to calculate the projection matrix $P$.
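A sketch of this bookkeeping is shown below; the momentum value of the moving average is an assumed hyperparameter, as it is not specified in the text.

```python
import torch

class ClassMeanTracker:
    """Exponential moving average of per-class mean features (mu_1 .. mu_C)."""
    def __init__(self, num_classes: int, dim: int, momentum: float = 0.9):
        self.means = torch.zeros(num_classes, dim)  # one row per class
        self.momentum = momentum                     # assumed value

    @torch.no_grad()
    def update(self, F, labels):
        # F: (D, B) batch features; labels: (B,) class indices.
        for c in labels.unique():
            mu_c = F[:, labels == c].mean(dim=1)
            self.means[c] = (self.momentum * self.means[c]
                             + (1 - self.momentum) * mu_c)
```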

B. NullSpaceNet Inference Phase
In the inference phase, the output of NullSpaceNet, $F$, can be classified using the hyperplane equation in (24), where $\beta$ is the hyperplane and $\Sigma = P \times P$.
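Since (24) is not reproduced here, the following sketch shows one plausible reading of inference in the joint-nullspace: project the test feature and the tracked class means with $P$ and pick the nearest class. This nearest-point rule is our assumption, not the paper's exact hyperplane equation.

```python
import torch

@torch.no_grad()
def classify(F, P, class_means):
    """Assumed inference rule: nearest projected class mean in the joint-nullspace.

    F           : (D, B) test features from the backbone.
    P           : (D, d) projection matrix from (22).
    class_means : (C, D) tracked class means.
    """
    z = P.T @ F                          # project features: (d, B)
    centers = class_means @ P            # project class means: (C, d)
    dists = torch.cdist(z.T, centers)    # (B, C) pairwise distances
    return dists.argmin(dim=1)           # predicted class per image
```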

V. EXPERIMENTAL RESULTS
Implementation Details: NullSpaceNet is defined in mixed precision using the publicly available NVIDIA APEX library [51]. NullSpaceNet was trained on four Tesla V100 GPUs with 32 GB each and implemented in Python using the PyTorch framework [52]. All experiments were performed on Linux with a Xeon E5 @2.20 GHz CPU and an NVIDIA Titan XP GPU, on networks trained from scratch. We set $\epsilon$ to 1 and the number of training epochs to 200. We used the Adam optimizer [53] with a learning rate that anneals geometrically at each epoch, starting from 0.001, and a momentum of 0.9.
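The stated training setup could be sketched as follows, assuming the geometric annealing corresponds to an exponential learning-rate schedule and the quoted momentum maps to Adam's $\beta_1$; the decay factor, toy backbone, and toy data are illustrative only.

```python
import torch
import torch.nn as nn

# Toy stand-ins so the sketch runs end to end; any backbone and loader would do.
backbone = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
                         nn.AdaptiveAvgPool2d(1), nn.Flatten())
train_loader = [(torch.randn(16, 3, 32, 32), torch.randint(0, 10, (16,)))] * 4

optimizer = torch.optim.Adam(backbone.parameters(), lr=0.001, betas=(0.9, 0.999))
# "Anneals geometrically at each epoch" read as an exponential schedule;
# the decay factor gamma is an assumed value.
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.97)

for epoch in range(200):
    for images, labels in train_loader:
        F = backbone(images).T                     # (D, B) feature columns
        loss = nullspace_loss(F, labels, eps=1.0)  # loss from (23), sketched above
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    scheduler.step()
```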
ImageNet is a large-scale dataset for classification and detection. The classification task has 1000 categories of natural images and consists of 1.3 million images.
CIFAR10 and CIFAR100 have 10 and 100 classes of spatial size (32 × 32), respectively. The images were collected from natural scenes. Each dataset has 50 000 images for training and 10 000 images for testing. We used 49 000 for training, 1000 for validation, and 10 000 for testing for both datasets.
The STL10 dataset has ten classes, and the images have been resized to (64 × 64). STL10 has 5000 training images and 8000 testing images. We used the 5000 images for training, 1000 images from the testing set for validation, and the remaining 7000 for testing.

A. NullSpaceNet Results With VGG16 as a Backbone
Results on ImageNet: ImageNet has a higher resolution (spatial size 224 × 224) than the CIFAR datasets and STL10. Results of the proposed NullSpaceNet with VGG16 as backbone, compared to the traditional VGG16 with FC, tested on ImageNet are shown in Table I. The number of parameters is reduced from ≈134 million in VGG16+FC to ≈18 million in NullSpaceNet. Similarly, the number of FLOPs is decreased by 50%, from 19.6 billion to 9.8 billion. As Table I shows, the accuracy gain is in favor of NullSpaceNet: NullSpaceNet has a top-1 accuracy of 75.7%, while VGG16+FC has 74.4%, a gain of 1.3%. The average inference time is reduced by 99.17%.
Results on CIFAR10 Dataset: The results on the CIFAR10 dataset are shown in Table II. VGG16+FC achieves an accuracy of 93.51%, while the proposed NullSpaceNet achieves 94.01%, a difference of ≈0.5% in favor of NullSpaceNet.
More importantly, there is a significant reduction in the network parameters of NullSpaceNet compared to VGG16+FC: the parameters decreased from ≈134 million in VGG16+FC to ≈18 million in NullSpaceNet.

Results on CIFAR100 Dataset: The results on the CIFAR100 dataset are shown in Table III. NullSpaceNet outperforms VGG16+FC by 0.07% in accuracy (a less pronounced gain than on CIFAR10). However, the number of parameters is again reduced from ≈134 million to ≈18 million. Moreover, Table III shows that the inference time per batch is 0.6841 s for VGG16+FC and only 0.0051 s for NullSpaceNet, a reduction of 99.25% in favor of NullSpaceNet.
Conducting this experiment on the CIFAR100 dataset is important to prove that NullSpaceNet's performance is not significantly affected by an increase in the number of classes in the classification task.
Results on STL10 Dataset: The results on the STL10 dataset are shown in Table IV. NullSpaceNet outperforms VGG16+FC with an accuracy gain of 2.57%, a parameter reduction of 86.29%, and an inference time reduction of 99.22%. It is worth noting that NullSpaceNet significantly benefits from the higher image resolution; STL10 has an image resolution of 64 × 64.
Visualization: To visualize the features learned by NullSpaceNet and VGG16+FC on the STL10 dataset, t-SNE is used to produce Fig. 4. Each color is associated with a number that represents a class in the STL10 dataset. Fig. 4(a) shows that the within-class scatter for all classes has been reduced to a minimum and the between-class scatter has been maximized, with high-margin separation among all classes; it visualizes the power of the learned joint-nullspace, which has optimal separation among different classes. In Fig. 4(b), by contrast, the classes overlap and the separation margin is not optimal.

B. NullSpaceNet With Other Feature Extractor Backbones
Two further experiments were conducted with MobileNet [63] and MnasNet [64] as the backbones of NullSpaceNet. The architectures of NullSpaceNet with MobileNet and MnasNet as backbones are shown in Fig. 3. Table V shows that MobileNet has 5.4 million parameters with an accuracy of 70.6%. With MobileNet as the backbone of NullSpaceNet, the accuracy rises to 72.30%, and a reduction of ≈8.7% in the number of parameters is achieved; consequently, the inference time decreases by 70.67%. NullSpaceNet with the MobileNet feature extractor does not significantly reduce the parameter count, because the final fully connected layer in the original MobileNet has only 469,460 parameters. However, the inference time is significantly reduced.
Similarly, Table V shows that NullSpaceNet with the MnasNet backbone has a gain of 1.7% in accuracy and a parameter reduction rate of 27.95%. The average inference time of NullSpaceNet with MnasNet as a backbone is significantly reduced, by 69.05%. This confirms that NullSpaceNet can be applied to different backbones, with the benefits of parameter reduction and faster inference.
We conducted two experiments on DenseNet and ResNet to show the effect of the proposed joint-nullspace formulation on architectures with skip connections. As shown in Table VII, when NullSpaceNet uses DenseNet121 as a backbone, the top-1 accuracy increases from 74.98% to 76.86%. More importantly, the number of parameters decreases by 94.45% (from 7.10 million to ≈0.39 million), and the average inference time drops from 0.17325 to 0.00144 s per batch. In the case of ResNet in Table VIII, the number of parameters decreases by 86.788% (from 25.66 million to 3.39 million), while the top-1 accuracy rises from 75.30% to 78.43%. Consequently, the average inference time per batch decreases from 0.75316 to 0.00627 s. From these two experiments, it is clear that backbones with skip connections significantly benefit from the proposed joint-nullspace formulation. This can be justified by the fact that the proposed joint-nullspace improves the gradient flow between the layers; hence, the calculation of the projection onto the joint-nullspace does not suffer from a singularity when maximizing (1).

VI. RESULTS AND DISCUSSION

Impact of Batch Size
To study the effect of batch size on performance, we conducted four experiments on the CIFAR10 and ImageNet datasets using NullSpaceNet with VGG16 and ResNet50 at different batch sizes. As shown in Table XI, we performed an ablation study on CIFAR10 with batch sizes of 4, 8, 64, and 128. We conducted another ablation study on ImageNet, as shown in Table XII, with batch sizes of 256, 512, 1024, and 2048. These values were chosen because CIFAR10 has only ten classes, while ImageNet has 1000. The results show the effect of batch size on performance. With batch sizes of 4 and 8 on CIFAR10 (smaller than the number of classes), the accuracy of NullSpaceNet with VGG16 is 90.16% and 92.35%, respectively, far below the accuracy with batch sizes of 64 and 128 (96.02% and 96.31%, respectively). Similarly, NullSpaceNet with ResNet50 achieves 91.54% and 93.76% with batch sizes of 4 and 8, respectively; increasing the batch size to 64 and 128 raises the accuracy to 96.98% and 97.82%, respectively. On ImageNet, as shown in Table XII, NullSpaceNet with VGG16 achieves 70.12% and 72.65% with batch sizes of 256 and 512 (smaller than the number of classes) and 73.30% and 75.74% with batch sizes of 1024 and 2048 (larger than the number of classes). NullSpaceNet with ResNet achieves 73.95% and 75.84% with batch sizes of 256 and 512, respectively; increasing the batch size beyond the number of classes significantly boosts performance, to 77.47% and 78.43% with batch sizes of 1024 and 2048, respectively.
It is clear from these experiments that NullSpaceNet is sensitive to the batch size. More specifically, the best performance is reported when the batch size is greater than the number of classes in the dataset. More insights into this performance are provided in the NullSpaceNet Analysis subsection below.

Impact of Image Resolution
The accuracy gain of the proposed NullSpaceNet with VGG16 as backbone over the traditional VGG16+FC on CIFAR10 and CIFAR100 is 0.5% and 0.07%, respectively. These gains suggest that accuracy does not significantly benefit from the projection onto the proposed joint-nullspace in these cases. This can be justified by the fact that the image resolution in CIFAR10 and CIFAR100 is 32 × 32: the number of pixel-level features to be mapped to either the feature space or the joint-nullspace is small, which explains the low accuracy gain.
This justification is further supported in light of the results on the STL10 (which has a higher image resolution of 64 × 64), and thus better accuracy in favor of NullSpaceNet.
Furthermore, another experiment was performed on a reduced-resolution version of STL10, with all training images downsampled to 32 × 32, matching CIFAR10 and CIFAR100. NullSpaceNet was trained on this modified version of STL10, and the results are shown in Table VI. The accuracy gain is 0.02%, which confirms that the projection provides little accuracy benefit at low resolutions.
Similarly, in Table I, NullSpaceNet has a gain of +1.3% in top-1 accuracy over VGG16+FC on ImageNet, which has 1000 classes and a spatial size of 224 × 224. This supports our justification that NullSpaceNet achieves better accuracy on higher-resolution images. In general, NullSpaceNet outperforms VGG16+FC in all cases. All results are summarized in Table IX.
Top-k Error Rate on STL10: The top-k error rate is the fraction of the testing set for which the true label is not among the k labels that the model considers most likely [10]. Fig. 5 shows the top-1 performance on the STL10 dataset, where NullSpaceNet clearly has a lower error rate than VGG16+FC. In Fig. 6, the top row shows the training and testing loss on CIFAR10, CIFAR100, and STL10; NullSpaceNet converges to the minimum by 200 epochs without overfitting to the training set. The bottom row reports top-1 and top-5 accuracy on the same three datasets; the consistently high top-5 accuracy shows that NullSpaceNet is robust to image challenges.
Impact of Different Optimizers: I-GWO is an optimizer that was proposed to achieve a proper compromise between exploration and exploitation.

NullSpaceNet Analysis: In this subsection, we provide insights into the performance of the proposed NullSpaceNet from two aspects: time reduction and accuracy.
From the time-reduction aspect, the core idea of NullSpaceNet is to project from the pixel-level image onto the joint-nullspace and use this space to classify images. As mentioned before, the majority of the trainable parameters are located in the fully connected layers. Therefore, when NullSpaceNet completely removes the fully connected layers from the backbone, the number of parameters in the network drops significantly, as seen in Table IX.

From the accuracy aspect, NullSpaceNet constrains the margin between the classes such that the between-class scatter matrix is large and positive while the within-class scatter matrix vanishes through the optimization of the proposed loss function. This formulation improves the accuracy of NullSpaceNet, as shown in Table IX. It is worth mentioning that classifiers with a small number of parameters do not benefit much from parameter reduction (e.g., MNASNET1-0 in Table V); however, these networks still benefit from the performance boost. We attribute this boost to the fact that NullSpaceNet explicitly constrains the separation margin between classes in the joint-nullspace to be large and positive, while the spread of different images from the same class approaches zero. Moreover, an outlier data point has a higher chance of being correctly classified in the proposed nullspace than in the feature space, as shown in Fig. 4.
NullSpaceNet performs better when $B > C$, where $B$ is the batch size and $C$ is the number of classes. The reason is that the formulation depends heavily on the number of classes. To clarify this point, consider the steps of Algorithm 1: we first calculate the class-mean-centered output $F_w$ and the global-mean-centered output features $F_t$ of NullSpaceNet. Then, $F_t$ is decomposed via SVD to calculate the scatter matrix $S_t$. The other scatter matrix, $S_w$, is calculated from $F_w$, which depends on the number of classes in the dataset. Then, $S_b$ is calculated to produce the reduced versions $\tilde{S}_w$, $\tilde{S}_b$, $\tilde{S}_t$. Finally, we solve the eigenvalue problem for $W^T \tilde{S}_b W$ to maximize the nonzero eigenvalues in the loss function (23), which depends on the number of classes.

Limitation of NullSpaceNet
Although NullSpaceNet provides a principled way to project from the pixel-level image onto the joint-nullspace, its performance degrades when the batch size is smaller than the number of classes. This is because the formulation of the projection depends on the number of classes, as shown in (13). Therefore, during training, we constrain the batch size $B$ to be larger than the number of classes $C$ (i.e., $B > C$). Moreover, NullSpaceNet requires the population of the training set to be represented in each batch; in other words, during training, we sample at least one image from each class.
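An illustrative batch sampler that enforces these two conditions ($B > C$ and at least one image per class) is sketched below; it is not the authors' released sampler.

```python
import random
from collections import defaultdict

def balanced_batches(labels, batch_size, seed=0):
    """Yield index batches containing at least one sample per class (B > C)."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[int(y)].append(idx)
    classes = sorted(by_class)
    assert batch_size > len(classes), "NullSpaceNet requires B > C"
    all_idx = list(range(len(labels)))
    while True:
        batch = [rng.choice(by_class[c]) for c in classes]       # one per class
        batch += rng.sample(all_idx, batch_size - len(classes))  # fill remainder
        yield batch
```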
From an uncertainty perspective, uncertainty in deep learning can be divided into two categories: 1) epistemic uncertainty and 2) aleatoric uncertainty. Epistemic uncertainty is caused by the model itself, while aleatoric uncertainty is caused by the training data. Epistemic uncertainty is reducible by improving the model (e.g., the network design and the loss function), while aleatoric uncertainty is hard to reduce [65]. To improve the model, different approaches have been used, such as Bayesian inference and deterministic networks with a data uncertainty component [66]-[68]. Most of these approaches aim to give a strong signal indicating whether a data point is from the training distribution (in-distribution) or outside it (out-of-distribution), so that the model has lower uncertainty. To quantify uncertainty, the model needs to produce a probability distribution over the classes. However, in NullSpaceNet, we infer the class at test time by projecting onto the nullspace. This is a limitation of NullSpaceNet that could be tackled by introducing a modified, differentiable version of the softmax function in (24) to provide a probability distribution over points/classes in NullSpaceNet.
Regarding further technical limitations, NullSpaceNet is hard to train without mixed precision; therefore, we train NullSpaceNet within the APEX framework [51]. NullSpaceNet also needs the mean of each class in the dataset to be tracked over the course of training. Therefore, during training, NullSpaceNet stores the moving averages to be used in each iteration, as shown in (14).

VII. CONCLUSION
A typical CNN optimizes the weights of the network by maximizing the likelihood between the estimated probability of the predicted class and the true probability of the correct class. NullSpaceNet instead learns to project the features from the pixel level (i.e., the input image) to a joint-nullspace. All features from the same class are collapsed into a single point in the learned joint-nullspace, whereas features from different classes are collapsed into different points with high separation margins. Moreover, a novel differentiable loss function is developed to train NullSpaceNet to project the features onto the joint-nullspace. NullSpaceNet with the proposed differentiable loss function exhibits superior performance, with accuracy gains of 0.02–2.57% and a reduction in inference time of ≈70–99% in favor of NullSpaceNet. This means NullSpaceNet needs only 1–30% of the time a traditional CNN takes to classify a batch of images, with competitive accuracy.
Since NullSpaceNet is architecture-agnostic, it can be used for different tasks, for example, semantic segmentation, where each pixel can be labeled. Moreover, NullSpaceNet can be effectively utilized in object tracking, where the only information available about the object of interest is in the first frame; NullSpaceNet can treat this information as a class of its own and project similar objects very close to this class. As another example, NullSpaceNet can be used in few-shot learning, where the backbone can learn very strong discriminative features for each class. Future work will pursue these examples and formulate the joint-nullspace as a learnable regularizer and an auxiliary loss that merges the feature space and the nullspace in one network, leveraging the best of both spaces.