Deep Classification with Linearity-Enhanced Logits to Softmax Function

Recently, there has been a rapid increase in deep classification tasks, such as image recognition and object detection. As one of the most crucial components in Convolutional Neural Network (CNN) architectures, softmax arguably encourages CNNs to achieve better performance in image recognition. Under this scheme, we present a conceptually intuitive learning objective function: Orthogonal-Softmax. Its primary property is a linear approximation model designed by Gram–Schmidt orthogonalization. Firstly, compared with the traditional softmax and Taylor-Softmax, Orthogonal-Softmax has a stronger linear relationship through orthogonal polynomial expansion. Secondly, a new loss function is introduced to acquire highly discriminative features for classification tasks. Finally, we present a linear softmax loss to further promote intra-class compactness and inter-class discrepancy simultaneously. Extensive experimental results on four benchmark datasets demonstrate the validity of the presented method. In addition, we plan to explore non-ground-truth samples in future work.


Introduction
In the past few years of artificial intelligence research, Convolutional Neural Networks (CNNs) have played a crucial role in deep learning classification tasks. Benefiting from advanced network architectures [1][2][3][4][5][6][7][8][9][10][11][12][13][14][15][16] and discriminative capacity [17], CNNs have dramatically improved performance across various visual classification tasks, such as object recognition [18,19], face verification [20,21], molecular biology [22,23], and hand-written digit recognition [24]. A recent trend is to strengthen CNNs with more discriminative power and broader application scenarios. In machine learning research, Long [25] proposed a new self-training semi-supervised deep learning (SSDL) method to further improve fault diagnosis models. Xu [26] presented the Global Contextual Multiscale Fusion Network (GCMFN) to better accommodate noisy and unbalanced scenarios. In addition, several studies have employed CNNs in the medical field. Fan [27] attached an SVM to the fully connected layer to better identify cancer datasets. Sekhar [28] proposed a novel transfer learning method to detect brain tumors. As arguably one of the most crucial components in CNN architectures, the softmax is widely used in image classification tasks.
Intuitively, the softmax loss is a popular choice for learning discriminative features in the pioneering work [29], but the original softmax loss only partially discriminates between features and does not separate inter-class features sufficiently. Several variants have been proposed to enhance the discriminative capacity of the softmax loss. The center loss [30] was proposed to compact the intra-class distance by calculating the L2 distance between each feature vector and its class center. Through joint penalization with the softmax, center loss achieves stronger discriminability and a smaller intra-class distance. However, updating the actual class centers becomes impractical as the number of training samples grows. Some research has also applied distance constraints to a pair or a triplet of samples to improve discriminative power, so that similar samples are as compact as possible and dissimilar samples are as far apart as possible. For example, contrastive loss [31] further distinguishes similar samples from different samples through feature extraction. Regarding triplet loss, Refs. [32][33][34] first introduced triplet training samples. This method encourages an anchor sample to be far from a negative sample and close to a positive sample within a triplet of samples. Furthermore, batch-all triplet loss [35] and hard triplet loss [36] were proposed to impose more constraints and achieve stronger generalization of features. Building on triplet loss, N-pair loss [37] employs one positive sample and multiple negative samples for an anchor sample to train the network. Specifically, N-pair loss uses N−1 negative samples in each training step, which exploits more information and increases the convergence speed. However, neither contrastive loss nor triplet loss imposes constraints on every sample, which leads to unstable convergence as the number of samples grows dramatically. Although these methods can achieve higher discriminative power and mitigate this problem, they also complicate the network and make training more difficult.
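To make the triplet constraint described above concrete, the following is a minimal sketch (not taken from the cited works) of a single-triplet hinge loss; the margin value 0.2 and the use of squared Euclidean distance are illustrative assumptions.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge-style triplet loss: push the anchor at least `margin` closer
    to the positive sample than to the negative sample."""
    d_pos = np.sum((anchor - positive) ** 2)  # squared L2 distance to the positive
    d_neg = np.sum((anchor - negative) ** 2)  # squared L2 distance to the negative
    return max(d_pos - d_neg + margin, 0.0)   # zero once the constraint is satisfied

# Toy usage with 4-dimensional embeddings
rng = np.random.default_rng(0)
anchor, positive, negative = rng.normal(size=(3, 4))
print(triplet_loss(anchor, positive, negative))
```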
On the other hand, various studies have attempted to reformulate the softmax by implementing a margin-based loss function. Unlike the previous methods, these studies aim to improve the discriminability of the softmax loss by introducing an angular penalty between the feature vectors and the corresponding weight vectors of the last fully connected layer. Through an angular margin m, margin-based loss functions enlarge the inter-class distance and pursue stronger discriminability. For instance, SphereFace [38,39] and L-Softmax [40] first introduced the concept of angular margins through a multiplicative angular penalty, which further separates different classes and compacts samples of the same class. These loss functions enhance feature discriminability by changing the decision boundary, but they can lead to an unstable training process because they are difficult to optimize. CosFace [41], AM-Softmax [42], and Soft-Softmax [43] proposed enhancing angular discriminative power with an additive cosine margin, which further improves discriminative power and admits an intuitive interpretation. Building on these methods, ArcFace [44,45] presented an additive angular margin within a framework that unifies the multiplicative angular margin, additive cosine margin, and additive angular margin. Benefiting from this unified framework, ArcFace plays a crucial role in deep classification and achieves features with stronger discriminability. In ArcFace, however, the margin between different classes is fixed and identical, which may not reflect the real situation of the various classes. Several lines of research have since extended ArcFace in various directions. For example, Dyn-ArcFace [46] replaced the fixed margin with a flexible penalty based on the distance between each class center and the other class centers. ElasticFace loss [47] relaxes the fixed single margin by deploying a random margin drawn from a normal distribution. To better reflect the real properties of class separability, GroupFace [48] enriches the feature representations with group-aware representations on top of ArcFace. AdaptiveFace put forward hard prototype mining (HPM) to adaptively adjust the margins between classes and thus address imbalanced training data in deep classification. Moreover, UniformFace [49] equalizes the distances between class centers by adding a new loss term to SphereFace. ASL [50] mitigates the bias induced by data imbalance and increases inter-class diversity. Evidently, flexible-margin models perform better than fixed-margin ones. Generally, margin-based loss functions enhance the inter-class discrepancy through an angular penalty between the feature vectors and the corresponding weight vectors of the last fully connected layer. However, these methods only penalize part of the samples in the angular space, which leads to unfair treatment of some classes.
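As a rough illustration of the three margin families just surveyed, the sketch below shows how each one reshapes the target-class logit cos θ. The scale factor s and the margin values are assumptions made for illustration, not settings from the cited papers.

```python
import numpy as np

def target_logit(cos_theta, kind, m, s=30.0):
    """Illustrative target-class logit under the three margin families.
    `s` is a feature-scale hyperparameter (an assumed value, not from the paper)."""
    theta = np.arccos(np.clip(cos_theta, -1.0, 1.0))
    if kind == "multiplicative":      # SphereFace / L-Softmax style: cos(m * theta), m an integer >= 1
        return s * np.cos(m * theta)
    if kind == "additive_cosine":     # CosFace / AM-Softmax style: cos(theta) - m
        return s * (cos_theta - m)
    if kind == "additive_angular":    # ArcFace style: cos(theta + m)
        return s * np.cos(theta + m)
    raise ValueError(kind)

# Toy usage: compare the penalized target logits for the same angle
print(target_logit(0.8, "multiplicative", m=4))
print(target_logit(0.8, "additive_cosine", m=0.35))
print(target_logit(0.8, "additive_angular", m=0.5))
```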
Based on this, we present a linear Orthogonal-Softmax loss to achieve stronger discriminability. Inspired by the Taylor-Softmax [51], the proposed Orthogonal-Softmax approximates the exponential function $e^z$ of the softmax with orthogonal polynomials of various orders, which are constructed by Gram-Schmidt orthogonalization. By employing an approximated linear logit, the proposed Orthogonal-Softmax has a stronger linear relationship than the softmax loss and the Taylor-Softmax. In addition, drawing on the ideas of CosFace and AM-Softmax, we add a margin m to the new loss to obtain Orthogonal-M. Compared to Orthogonal-Softmax, Orthogonal-M increases the inter-class separation and achieves stronger discriminative power. The principal contributions can be outlined as follows: (1) The proposed Orthogonal-Softmax applies Gram-Schmidt orthogonalization to the softmax loss, which yields orthogonal polynomial approximations of the exponential function in the softmax. Additionally, to verify the fitting effect of the new loss functions, we compare orthogonal polynomials of various orders with the corresponding Taylor series.
(2) To achieve stronger discriminative power, we apply the idea of an inter-class margin m to Orthogonal-Softmax and obtain Orthogonal-M. The proposed Orthogonal-M has a better geometric property, which enhances inter-class discrepancy and intra-class compactness.
(3) Extensive experiments are conducted on four benchmark datasets (MNIST, Fashion-MNIST, CIFAR10, and CIFAR100). The results demonstrate the effectiveness of Orthogonal-Softmax and Orthogonal-M, which outperform the Taylor-Softmax and softmax losses.

Related Work
In recent years, the softmax loss has been widely used as a key method to learn discriminative features for multiclass classification. Several margin-based methods have been presented to enhance the discriminative power of the softmax loss. These studies add a margin penalty between classes to create inter-class feature separability. SphereFace [38,39], CosFace [41], AM-Softmax [42], and ArcFace [44,45] all introduce an angular margin between the features and their corresponding weights, in various manners. On the other hand, Taylor-Softmax [51], LinCos-Softmax [52], and LinArc [53] proposed approximated linear models, which create a stronger linear relationship through Taylor expansion.
In addition, margin-based softmax loss functions enforce better intra-class compactness and inter-class diversity, but these studies have not effectively emphasized every sample according to its practical importance. To a certain extent, Taylor-approximated softmax losses enhance the linear relationship with the angle, but they may not have enough discriminative power. Building on margin-based softmax losses and approximated linear softmax losses, we introduce a novel loss function through Gram-Schmidt orthogonalization. By combining the strengths of both, Orthogonal-Softmax achieves a better approximation and enhanced discriminative power, as demonstrated by experiments on four datasets.

Overview of the Proposed Method
In this section, we first introduce the relevant definitions and derivation of orthogonal polynomials. Based on these definitions, we then present the proposed Orthogonal-Softmax and the whole process of the Gram-Schmidt orthogonalization algorithm.

Introduction of Orthogonal Polynomials
Orthogonal polynomials are generally constructed by Gram-Schmidt orthogonalization, and we mainly use the idea of the smallest distance between an orthogonal polynomial and the target function, as follows. Following the definition of Axler [54]: let U be a finite-dimensional subspace of an inner-product space V with an orthonormal basis $e_1, \ldots, e_j$ of U. Then every vector $v \in V$ can be decomposed as

$$v = P_U v + w, \quad w \in U^{\perp}, \qquad (1)$$

and

$$P_U v = \langle v, e_1 \rangle e_1 + \cdots + \langle v, e_j \rangle e_j, \qquad (2)$$

where $P_U v$ is the orthogonal projection of v onto U and the $e_i$ are the orthonormal basis vectors of U.
Based on this definition, it can be inferred that $P_U v$ is the point of U closest to v, i.e., $\|v - P_U v\| \le \|v - u\|$ for every $u \in U$. In this way, we take the vector v to be the exponential function in the softmax and the orthogonal polynomial to be $P_U v$. Finding the point of the polynomial subspace U closest to the vector v therefore means finding the best orthogonal-polynomial approximation of the exponential function $e^z$ in the inner-product space.
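The following sketch illustrates this best-approximation idea numerically: it runs Gram-Schmidt on the monomials {1, x, x²} and projects e^x onto their span. The integration interval [-1, 1] and the grid-based inner product are assumptions made for illustration; the derivation above does not fix a particular interval.

```python
import numpy as np

xs = np.linspace(-1.0, 1.0, 20001)
dx = xs[1] - xs[0]

def inner(f, g):
    """Grid-based inner product <f, g> ~ integral of f(x)*g(x) over [-1, 1]."""
    return np.sum(f * g) * dx

# Monomial basis {1, x, x^2} sampled on the grid
alphas = [np.ones_like(xs), xs, xs ** 2]

# Gram-Schmidt: orthogonal basis beta_i, then orthonormal basis e_i
es = []
for a in alphas:
    beta = a - sum(inner(a, e) * e for e in es)   # subtract projections onto earlier e_i
    es.append(beta / np.sqrt(inner(beta, beta)))  # normalize to obtain e_i

# Orthogonal projection P_U(e^x): the best quadratic approximation in the L2 sense
target = np.exp(xs)
approx = sum(inner(target, e) * e for e in es)

print(np.max(np.abs(approx - target)))  # worst-case gap on the interval
```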

Orthogonal-Softmax
In more detail, the softmax loss is defined as

$$L_{\text{softmax}} = -\log \frac{e^{z_{y_i}}}{\sum_{k=1}^{K} e^{z_k}}, \qquad (3)$$

where $z = (z_1, \ldots, z_K) \in \mathbb{R}^K$, $z_k$ denotes the deep feature (logit) of the ith input for class k, $y_i$ is the corresponding class, and K is the number of classes. In order to enhance the discriminability of features and produce an excellent result, we design a linearly approximated logit by applying orthogonal polynomials. The Orthogonal-Softmax loss is defined as

$$L_{\text{orthogonal}} = -\log \frac{f(z_{y_i})}{\sum_{k=1}^{K} f(z_k)},$$

where $f = P_U(e^z)$ is the orthogonal-polynomial approximation of the exponential function, constructed as follows. Let $\alpha_1, \ldots, \alpha_n$ be a linearly independent list of vectors in the inner-product space V. Applying the Gram-Schmidt procedure to the $\alpha_i$,

$$\beta_i = \alpha_i - \sum_{j=1}^{i-1} \frac{\langle \alpha_i, \beta_j \rangle}{\langle \beta_j, \beta_j \rangle}\, \beta_j, \qquad e_i = \frac{\beta_i}{\|\beta_i\|},$$

we obtain the orthogonal basis $\beta_i$ and the orthonormal basis $e_i$. Furthermore, U is the subspace of V spanned by these vectors, and for $v \in V$,

$$P_U v = \langle v, e_1 \rangle e_1 + \cdots + \langle v, e_i \rangle e_i$$

is the orthogonal projection of v onto U. Since the orthogonal projection is the point of U closest to v, $P_U v$ is the best approximation, that is, the polynomial with the smallest distance to the exponential function; hence, we can use the orthogonal polynomial to approximate the exponential in the softmax. To present the whole process for the second-order approximated logit, the procedure starts from the basis $\alpha_1 = 1$, $\alpha_2 = x$, $\alpha_3 = x^2$; after a simple and intuitive calculation, the projection

$$f_2(z) = \langle e^x, e_1 \rangle\, e_1(z) + \langle e^x, e_2 \rangle\, e_2(z) + \langle e^x, e_3 \rangle\, e_3(z)$$

gives the approximated linear polynomial. This establishes the second-order orthogonal-polynomial approximation of the softmax, and the whole Gram-Schmidt orthogonalization process for higher orders is carried out in the same way. The orthogonal basis $\beta_i(x)$ and orthonormal basis $e_i(x)$ of the various series are presented in Table 1, and the corresponding orthogonal polynomials $f_i(x)$ are used directly in the classification experiments of Section 3.
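A minimal PyTorch sketch of the resulting loss is given below. The polynomial coefficients are placeholders (mirroring 1 + z + z²/2 purely for illustration) standing in for those obtained from the projection in Table 1, and the positivity clamp is our own safeguard; the text here does not specify how negative polynomial values are handled.

```python
import torch

def orthogonal_softmax_loss(logits, labels, coeffs, eps=1e-8):
    """Sketch of the Orthogonal-Softmax loss: the exponential of the softmax is
    replaced by a polynomial f(z) = sum_j coeffs[j] * z**j whose coefficients come
    from the orthogonal projection of e^z (e.g., the Gram-Schmidt sketch above)."""
    powers = torch.stack([logits ** j for j in range(len(coeffs))], dim=-1)
    f = (powers * torch.as_tensor(coeffs, dtype=logits.dtype)).sum(dim=-1)
    f = f.clamp_min(eps)                      # keep the pseudo-probabilities positive
    probs = f / f.sum(dim=1, keepdim=True)    # normalize over the K classes
    return -torch.log(probs[torch.arange(len(labels)), labels]).mean()

# Toy usage: 4 samples, 10 classes, placeholder second-order coefficients
logits = torch.randn(4, 10, requires_grad=True)
labels = torch.tensor([0, 3, 7, 1])
loss = orthogonal_softmax_loss(logits, labels, coeffs=[1.0, 1.0, 0.5])
loss.backward()
```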

Table 1. Orthogonal basis β_i(x) and orthonormal basis e_i(x) for each series (columns: Series, Function).

Comparison with Taylor-Softmax
Using a Taylor series approximation, the Taylor-Softmax [51] is defined as

$$L_{\text{taylor}} = -\log \frac{f_n(z_{y_i})}{\sum_{k=1}^{K} f_n(z_k)}, \qquad \text{where} \quad f_n(z) = \sum_{j=0}^{n} \frac{z^j}{j!}$$

is the n-th order Taylor expansion of $e^z$. The linear approximation effects of the orthogonal polynomials and the Taylor polynomials at various series are shown in Figure 1. Black lines denote the exponential function of the softmax, red lines denote the orthogonal polynomials of Orthogonal-Softmax, and blue lines denote the Taylor polynomials of Taylor-Softmax. Compared with the Taylor polynomials, the orthogonal polynomials present a better approximation of the exponential function at every series. In addition, as the series increases, both polynomials achieve stronger approximation effects, which we demonstrate further in the following experiments. Returning to Equation (3), in a binary-class scenario with class 1 and class 2, the original softmax loss gives the decision boundary $z_1 = z_2$. To make the intra-class features more compact, CosFace [41] and AM-Softmax [42] introduce an additive margin m, and the decision boundary becomes

$$z_1 - m = z_2 \ \ \text{(class 1)}, \qquad z_2 - m = z_1 \ \ \text{(class 2)}, \qquad (15)$$

where m is a fixed parameter that controls the inter-class margin. By replacing $z_{y_i}$ with $z_{y_i} - m$, the logits can be defined as

$$L_{\text{M}} = -\log \frac{f(z_{y_i} - m)}{f(z_{y_i} - m) + \sum_{k \ne y_i} f(z_k)}.$$

Based on the idea of AM-Softmax, we also introduce the margin m into Orthogonal-Softmax and Taylor-Softmax, obtaining Orthogonal-M and Taylor-M, respectively. As illustrated in Figure 2, by employing an additive margin, the proposed Orthogonal-M enhances inter-class discrepancy and intra-class compactness.
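The sketch below shows the Taylor-expanded logit f_n(z) and the additive-margin variant in which z_{y_i} is replaced by z_{y_i} − m; the order n = 2 and the margin m = 0.2 are illustrative values, not settings from the paper.

```python
import math
import torch

def taylor_f(z, n):
    """n-th order Taylor expansion of exp(z); positive for even n, as noted above."""
    return sum(z ** j / math.factorial(j) for j in range(n + 1))

def taylor_m_loss(logits, labels, n=2, m=0.2):
    """Sketch of the additive-margin variant: subtract m from the target-class logit
    before applying the Taylor-expanded 'exponential'."""
    z = logits.clone()
    z[torch.arange(len(labels)), labels] -= m   # z_{y_i} -> z_{y_i} - m
    f = taylor_f(z, n)
    probs = f / f.sum(dim=1, keepdim=True)
    return -torch.log(probs[torch.arange(len(labels)), labels]).mean()

# Toy usage: 4 samples, 10 classes
logits = torch.randn(4, 10)
labels = torch.tensor([2, 5, 0, 9])
print(taylor_m_loss(logits, labels, n=2, m=0.2))
```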

Implementation Details
In the following experiments, we employ MNIST, Fashion-MNIST, CIFAR10, and CIFAR100 as our training datasets in order to make a fair comparison with Taylor-Softmax [51]. It is important to note that $f_n(z)$ is guaranteed to be positive only for even orders n = 2k in Taylor-Softmax, so for the convenience of comparison we examine orthogonal-polynomial series expansions of orders 2, 4, and 6 in the experimental design. Our aim was not to achieve the best accuracy on these datasets but to explore new ways to enhance the discriminative power of the softmax.
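The even-order restriction can be checked numerically: odd-order Taylor expansions of e^z dip below zero for sufficiently negative logits, while even orders stay positive, which is why only n = 2k yields a usable logit. The logit range below is chosen only for illustration.

```python
import math
import numpy as np

z = np.linspace(-6.0, 2.0, 1601)
for n in (1, 2, 3, 4):
    f_n = sum(z ** j / math.factorial(j) for j in range(n + 1))
    print(f"order {n}: min value {f_n.min():.3f}")  # negative for odd n, positive for even n
```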

Experimental Setting
As shown in Table 2, we present the network structure corresponding to each dataset. During the experiments, we evaluate the generalized softmax losses on visual classification. For the CNN architecture, we adopt VGG-net, as also used by Taylor-Softmax, for convenient comparison. In all experiments, we use PReLU as the activation function and a batch size of 256. The networks are trained with SGD with momentum 0.9 and weight decay 0.0005. The margin parameter m plays a crucial role in the proposed Orthogonal-M. Following the setup of previous work, we varied the margin m from 0 to 0.6 with a step size of 0.1 for both Taylor-M and Orthogonal-M. As shown in Figure 3, for the sixth-order orthogonal series of the proposed Orthogonal-M, MNIST and CIFAR10 achieved the highest accuracy at m = 0.4 and m = 0.5, respectively. This indicates that different datasets have different intrinsic properties, and it may not be effective to treat all datasets with the same fixed parameter m. Thus, we report each dataset's best accuracy over the various m values. In all these topologies, we replace the final softmax function with each of its alternatives in our experiments.
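A minimal sketch of this training configuration is shown below (PyTorch assumed). The backbone here is a stand-in rather than the actual VGG-net, and the learning rate is an assumed value; only the optimizer settings, batch size, activation, and margin sweep come from the text.

```python
import torch
import torch.nn as nn

# Stand-in backbone (the paper uses VGG-net); PReLU activations as in the experiments.
model = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, padding=1), nn.PReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(64, 10),
)

# SGD with momentum 0.9 and weight decay 0.0005, as stated above;
# the learning rate 0.1 is an assumption.
optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=5e-4)

batch_size = 256
margins = [round(0.1 * k, 1) for k in range(7)]  # m swept from 0.0 to 0.6 in steps of 0.1
```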

Evaluation Results
The experimental results shown in Table 3 are the verification accuracies on the four datasets under the various methods. As shown in Table 3, we report the performance of our proposed Orthogonal-Softmax with series orders 2, 4, and 6, corresponding to Taylor-Softmax. The bold number in each column is the highest verification result among the models on that dataset. From the quantitative comparison among all the methods on the four datasets, Orthogonal-Softmax performs better than the softmax loss and Taylor-Softmax. Benefiting from the introduction of the margin-based loss, the proposed Orthogonal-M further enhances the discriminative power of Orthogonal-Softmax. MNIST and Fashion-MNIST are well known to be simple, typical datasets for deep classification tasks, so the improvements of the proposed methods on them are not significant. On CIFAR10, the proposed Orthogonal-M achieves an accuracy of 90.45%, which is 3.58% higher than the softmax loss and 3.28% higher than Taylor-Softmax. Additionally, on CIFAR100, the proposed Orthogonal-M is 8.45% higher than the softmax loss and 7.08% higher than Taylor-Softmax. It is worth noting that the proposed Orthogonal-M achieves slightly lower performance at series 2 and 4 when compared with Orthogonal-Softmax. The reason is that series 6 has a better and more stable approximation effect, so the added margin better reflects the behavior of margin-based softmax. As observed, all the verification experiments show that the proposed linear logits can appropriately distinguish features and outperform the other existing methods.

Conclusions
In this paper, we proposed the Orthogonal-Softmax loss function, which employs an approximated linear logit to effectively replace the softmax loss. By applying Gram-Schmidt orthogonalization, the proposed Orthogonal-Softmax achieves a better linear relationship and stronger discriminative power. We also provided a mathematical explanation of the whole process of Gram-Schmidt orthogonalization and its advantages. Experimental results and analyses on four well-known benchmark datasets (MNIST, Fashion-MNIST, CIFAR10, and CIFAR100) demonstrate the superiority of the proposed Orthogonal-Softmax and Orthogonal-M compared to other loss functions. In future work, we will investigate how to resolve the conundrum of non-ground-truth classes and how to extend the method to larger datasets.