GO Loss: A Gaussian Distribution-Based Orthogonal Decomposition Loss for Classification



Introduction
In recent years, deep neural networks have achieved great success [1,2], and classification has been widely applied in various fields [3][4][5][6]. The loss function is an indispensable part of deep learning; various loss functions, such as MSE and BCE, are available for different tasks, including image-based object recognition [7][8][9], face recognition [10][11][12], and speech recognition [13,14]. The performance of loss functions has been widely studied [15,16]. In theory, a good loss function should separate the feature distributions of different classes while keeping the features of the same class as compact as possible.
Among the existing loss functions, soft-max cross-entropy is the most common [9][17][18][19]. However, soft-max only ensures the separability of the features of different classes and lacks the ability to compress distances among features within the same class. As a result, the distance between features of different classes can be smaller than the distance between features of the same class, as shown in Figure 1(a).
Several variants have been proposed to improve the intraclass compactness of soft-max. Some metric learning methods are used to improve classification effectively [20][21][22].
These studies attempt to resolve this problem through feature normalization [23,24] or by adding an extra regularization item to construct a joint supervision [25][26][27][28]. In these studies, the stochastic gradient descent algorithm has been widely used.
This algorithm determines a convergence direction on the fly in each iteration, depending on the network parameters and training samples at the time.
The feature, as a vector, can be decoupled into two components, namely, direction and norm. Theoretically, the two components determine the interclass separability and intraclass compactness of the distribution of the sample features, respectively. Therefore, if we treat the feature as a whole, as existing works do, the optimizations of the two components become intertwined. The computation of the convergence center then has to consider the two components simultaneously, and they interfere with each other and thus affect the final classification results.
In this paper, we propose an orthogonal decomposition-based loss function called GO loss, which decomposes the convergence direction into two mutually orthogonal components. Moreover, we assume that the two components follow Gaussian distributions. Specifically, the norm of the feature in the radial direction follows a Gaussian distribution, while the angle (cosine value) between the feature and the corresponding center vector (class weight vector) of its class in the tangential direction also follows a Gaussian distribution. This assumption enables the use of Bayes' rule in loss computation, which is an effective way to model training features. We can therefore (1) model the classification loss as the cross entropy between the posterior probability of features and the corresponding class labels in the tangential direction, called tangential loss, and (2) compute the difference between the norm of the feature distribution and the assumed distribution in the radial direction using the negative log likelihood, called radial loss.
The two losses form a joint supervision that balances interclass separability and intraclass compactness in the learned feature space, which supports high classification accuracy, as shown in Figure 1.
In summary, the main contribution of the paper is a novel loss function for classification, namely, GO loss, which integrates the following:
(i) A strategy to optimize the loss function by decomposing the convergence direction into two mutually orthogonal components and optimizing each of them separately. This approach differs from most traditional methods, which mainly rely on feature normalization and added regularization items. The rationale is to avoid mutual influence between the optimizations of the two components so as to obtain a stable convergence center.
(ii) A solution that implements the optimization. This solution decouples the feature into direction and norm, associated with interclass separability and intraclass compactness, respectively, and conducts optimization on the two components under the assumption that they follow Gaussian distributions.

Related Works
For various classification tasks, the loss function directly affects the classification results [29][30][31]. In existing methods, metric learning is widely used in the loss function to improve classification [32][33][34]. The idea of GO loss builds on existing loss functions. We highlight the most related aspects below. Soft-max is one of the most common loss functions in classification. It uses the inner product metric to implement the classification function. However, its loose intraclass feature distribution makes complex classification problems difficult to handle. Many other metrics, such as Euclidean and cosine distances, have been used to resolve this problem.
Thus, many variants of soft-max are available.
Contrastive loss [25] uses a predefined margin to train a Siamese network for face recognition. It minimizes the Euclidean distances between positive pairs and enlarges the Euclidean distances between negative face image pairs. However, the combinatorial explosion of image pairs greatly increases the number of iterations.
Triplet loss [26] applies Euclidean distance regularization for loss optimization. The regularization is conducted on image triplets rather than the image pairs of contrastive loss to achieve high face recognition accuracy. However, it has the same problem as contrastive loss in terms of computational complexity.
Complexity
Center loss [27] minimizes the Euclidean distance between each feature vector and its class center. However, the extra regularization item generates two convergence directions, which not only increases computational complexity but also makes the convergence center unstable to some extent.
Gaussian mixture (GM) loss [28] is an effective alternative to soft-max. Center loss is a special case of the likelihood regularization in the GM loss.
The problem of GM loss is the same as that of center loss. Thus, GM loss also incurs increased computation overhead.
Ring loss [23] utilizes a different optimization mechanism, which normalizes all features through a convex augmentation of the primary loss function. In that case, all the features are placed around a ring. As a result, all features have the same norm, so the norm can no longer be exploited for optimization.
Large-margin soft-max loss [24] uses the cosine distance metric to solve the inconsistency problem of distance measurements. It introduces an angular margin into soft-max through a well-designed angular distance function. It mainly focuses on angular variation while ignoring the important influence of the norm on classification results.
The abovementioned methods optimize the loss function from the perspective of the feature distribution. Regularizing the extracted features or adding regularization terms makes the features of the same class compact and the features of different classes separated. On this basis, several loss functions for classification have been studied from the perspective of redesigning clusters [35,36], such as GCPL loss [37] and Structure-aware loss [38].
L2T-DLF [39], which stands for "learning to teach with dynamic loss functions," is a novel model that trains the loss function.
Through the training process, the model adjusts and changes the loss function. The trained loss function is the one best suited to the dataset and therefore yields the best classification results.
Noise-robust loss [40] uses the joint supervision of categorical cross-entropy loss and mean absolute error to optimize the loss function from the perspective of noise robustness. When the labels contain a wide range of noise, this loss function can classify better than other loss functions that normalize the features. SL [41], which stands for "symmetric cross-entropy learning," is also proposed to address noise robustness. It boosts cross-entropy symmetrically with a noise-robust counterpart called reverse cross-entropy. SL overcomes the overfitting and underlearning problems of cross-entropy when the labels are noisy.
Recent research on loss functions focuses on their application scenarios. These methods design the loss function around the characteristics of the datasets, such as the presence of noisy labels.
Like the existing works, we improve classification from the perspective of intraclass compactness and interclass separation of the feature distribution. The aforementioned methods regard direction and norm as a whole when optimizing the loss. In contrast, GO loss optimizes the two characteristics separately. To the best of our knowledge, this approach has not been tried before. A sufficiently large sample from an unknown distribution can be approximated as obeying a Gaussian distribution. Considering the characteristics of the datasets, we assume that the features obey a Gaussian distribution and use it to guide the optimization process.

Problem Statement
3.1. General Consideration.
Several aspects should be explained before introducing the approach. The first aspect is how the convergence direction changes during iteration in existing loss functions and how the indeterminate direction affects classification results. In a loss function, the affinity score (logit) is usually calculated by a metric such as the inner product or the Euclidean distance. These metrics are used either directly to calculate affinity scores or, when they take the form of an extra regularization item, as part of that calculation. This makes the convergence direction depend on the network parameters and training samples, which change at each iteration.
The indeterminate convergence direction makes it difficult to obtain a stable convergence center, which indirectly increases the errors of the established model.
Here, we use soft-max as an example to illustrate this effect. For a $K$-class classification task, let $x_i$ be the extracted deep feature vector and $\omega_k$ the class weight vector for class $k$. For the inner product metric, the convergence direction is the same as the direction of $x_i$. For the Euclidean distance, the convergence direction, which is the vector from $x_i$ to $\omega_k$, is determined by both the direction and the norm of the feature, as shown in Figure 2.
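As a worked illustration, one way to read the "convergence direction" of a metric is as the gradient of its affinity score. The score names $s_{\mathrm{ip}}$ and $s_{\mathrm{euc}}$, and the choice of which variable to differentiate with respect to, are assumptions introduced here for illustration rather than notation from the paper:

```latex
\begin{aligned}
s_{\mathrm{ip}}(x_i,\omega_k) &= \omega_k^{\top} x_i,
  & \nabla_{\omega_k}\, s_{\mathrm{ip}} &= x_i, \\
s_{\mathrm{euc}}(x_i,\omega_k) &= -\lVert x_i-\omega_k\rVert_2^{2},
  & \nabla_{x_i}\, s_{\mathrm{euc}} &= 2\,(\omega_k - x_i).
\end{aligned}
```

Under this reading, the first gradient points along $x_i$ and the second along the vector from $x_i$ to $\omega_k$. Both depend on the current sample and the current weights, which is why the direction changes from one iteration to the next.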
The second aspect is the decoupling of the feature into direction and norm. The feature vector is determined by two characteristics, namely, direction and norm, which are naturally coupled. Considering only one of the characteristics during optimization, as the cosine metric does, is therefore incomplete. Existing metrics always treat the two characteristics as a whole. Thus, the optimization inevitably involves both of them. The two characteristics may then interfere with each other, which affects the final classification results.

Approach Overview.
As discussed above, we first need to decompose the convergence direction into two mutually orthogonal directions. To facilitate implementation, we take the feature as the subject of decoupling and split it into two components, namely, direction and norm, which correspond to the tangential and radial directions, respectively. This step has two advantages. First, we can optimize the two components separately to prevent them from interacting with each other. Second, the relationship between the two components can be explicitly determined. The decomposition makes it convenient to obtain the convergence center because only one component (direction or norm) enters the calculation at a time.
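The decoupling of a feature into norm and direction can be sketched in a few lines. This is a minimal illustration using only the Python standard library; the function name `decompose` and the example vector are hypothetical and not taken from the paper:

```python
import math


def decompose(feature):
    """Split a feature vector into (norm, unit_direction).

    The norm is the radial component; the unit vector is the
    tangential (directional) component.
    """
    norm = math.sqrt(sum(v * v for v in feature))
    if norm == 0.0:
        raise ValueError("the zero vector has no direction")
    direction = [v / norm for v in feature]
    return norm, direction


# Hypothetical 2-D feature used only to exercise the function.
norm, direction = decompose([3.0, 4.0])
```

Multiplying `direction` by `norm` recovers the original feature, so the two components carry the full information while remaining separately optimizable.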
We assume that the two components follow Gaussian distributions to further improve the accuracy of the model. We believe that this assumption is reasonable, especially when the overall distribution is unknown and the sample size is sufficiently large.
Figure 3 shows the optimization process in a classification task, in which $x_i$ is the extracted deep feature vector for an input sample and $\omega_k$ is the class weight vector for the class $k$ to which $x_i$ belongs. As observed, the convergence direction remains fixed during each iteration.
We design the loss function in the tangential direction to conduct classification. Given that the core purpose of classification is to separate the different classes from each other, the loss function in the tangential direction is mainly responsible for interclass separation. We adopt the widely used cross entropy to implement the classification function.
The ability to classify is provided by the tangential loss, which achieves interclass separation of the feature distributions. Thus, the radial loss does not need to classify. We design the radial loss to be primarily responsible for intraclass compactness, which further improves classification. We achieve intraclass compactness by reducing the difference between the actual distribution of features and the ideal Gaussian distribution of features. We use the likelihood function to measure this difference.

GO Loss
In this section, we first introduce the optimization of the tangential and radial components and then give the method for merging the two parts as the GO loss for implementing the joint supervision.

Optimization on Tangential Direction.
In the tangential direction, we first provide the formal definition of the Gaussian distribution. Then, we use Bayes' rule to calculate the posterior probability distribution. Finally, we use cross entropy to calculate the classification loss.

Gaussian Distribution.
Let $\hat{x}_i$ be the normalized feature following the Gaussian distribution, as shown in equation (1), where $\omega_k$ is the class weight to which $x_i$ corresponds and $\sigma_k$ represents the covariance of class $k$ in the feature space. For a $K$-class classification task with unknown class frequencies, we assume that all classes are equally probable so that the prior probability is constant. The prior probability of class $k$ is $p(k) = 1/K$. The hyperparameter $\alpha$ is used to control the difficulty of the training process.
Ideally, we would guarantee that the angle between the feature and its corresponding class weight obeys a Gaussian distribution. However, a Gaussian distribution over angles is too complicated to calculate. To avoid complex angle calculations, we work with the normalized feature and its normalized class weight vector instead of the angle between them. According to the law of cosines, $(\hat{x}_i - \hat{\omega}_k)^2$ can be expressed through the cosine of the angle between the feature and its corresponding class center vector. Thus, equation (1) can be understood as a Gaussian-like distribution over the angular cosine, which supports the feasibility of the replacement.
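The link between the squared distance and the angular cosine can be written out explicitly. The identity below holds whenever both vectors have unit norm; the symbol $\theta_{i,k}$ for the angle between the feature and the class weight is introduced here for illustration:

```latex
\begin{aligned}
\lVert \hat{x}_i - \hat{\omega}_k \rVert_2^{2}
  &= \lVert \hat{x}_i \rVert_2^{2} + \lVert \hat{\omega}_k \rVert_2^{2}
     - 2\,\hat{x}_i^{\top}\hat{\omega}_k \\
  &= 2 - 2\cos\theta_{i,k},
  \qquad \text{since } \lVert \hat{x}_i \rVert_2 = \lVert \hat{\omega}_k \rVert_2 = 1 .
\end{aligned}
```

A Gaussian in the distance between normalized vectors is therefore a function of $\cos\theta_{i,k}$ alone and does not depend on the feature norm.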

Bayes' Rule.
Assume $\hat{x}_i$ is a normalized feature with the label $z_i \in [1, K]$. Under the Gaussian distribution assumption, its conditional probability distribution is given by equation (2). According to Bayes' rule, its posterior probability distribution is given by equation (3).

4.1.3. Cross-Entropy Loss.
We finally use the cross entropy between the posterior probability distribution and the class label to calculate the loss in the tangential direction, which is written as $L_T$ and defined in equation (4). There, $\mathbb{1}(\cdot)$ is an indicator function, defined in equation (5), which equals 1 when its condition holds and 0 otherwise.
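The three steps of the tangential loss (Gaussian likelihood on normalized vectors, Bayes' rule with equal priors, cross entropy against the label) can be sketched as follows. This is an illustrative reading rather than the paper's exact formula: the function names, the single shared variance `sigma2`, the 0-based labels, and in particular the way `alpha` scales the exponent are all assumptions, because the text does not state where $\alpha$ enters.

```python
import math


def normalize(v):
    """Return v scaled to unit l2-norm."""
    n = math.sqrt(sum(c * c for c in v))
    return [c / n for c in v]


def tangential_loss(features, labels, weights, sigma2=1.0, alpha=1.0):
    """Mean cross entropy between Gaussian posteriors and labels.

    ASSUMPTIONS (not from the source): one variance `sigma2` shared by
    all classes, labels in 0..K-1, and `alpha` scaling the exponent.
    """
    unit_w = [normalize(w) for w in weights]
    total = 0.0
    for x, z in zip(features, labels):
        xh = normalize(x)
        # Per-class Gaussian log-likelihood, up to a shared constant.
        logits = [
            -alpha * sum((a - b) ** 2 for a, b in zip(xh, w)) / (2.0 * sigma2)
            for w in unit_w
        ]
        # Equal priors p(k) = 1/K cancel in Bayes' rule, so the
        # posterior is a soft-max over the log-likelihoods.
        m = max(logits)
        log_norm = m + math.log(sum(math.exp(l - m) for l in logits))
        total += -(logits[z] - log_norm)
    return total / len(features)
```

Because the features are normalized before the distance is computed, rescaling a feature leaves this loss unchanged; only its direction matters.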

Optimization on Radial Direction.
In the radial direction, we first give the formal definition of the Gaussian distribution. Then, we use Bayes' rule to calculate the posterior probability distribution. Finally, we use the likelihood to calculate the loss.

Gaussian Distribution.
Similar to the tangential direction, we assume that the $l_2$-norm of the feature, $\|x_i\|_2$, in the radial direction also follows a Gaussian distribution, defined in equation (6), where $\|\omega_k\|_2$, $\Sigma_k$, and $p(k)$ are the $l_2$-norm of the class weight vector, the covariance, and the prior probability of class $k$, respectively. As in the tangential direction, the prior probability is constant, so the prior probability of class $k$ is $p(k) = 1/K$.
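For reference, the density of a one-dimensional Gaussian over the feature norm and its negative logarithm take the following standard form. Treating $\Sigma_k$ as a scalar variance and using the shorthands $r_i$ and $\mu_k$ are assumptions made here for illustration:

```latex
\begin{aligned}
r_i &= \lVert x_i \rVert_2, \qquad \mu_k = \lVert \omega_k \rVert_2, \\
p(r_i \mid k) &= \frac{1}{\sqrt{2\pi\Sigma_k}}
  \exp\!\left(-\frac{(r_i-\mu_k)^{2}}{2\Sigma_k}\right), \\
-\log p(r_i \mid k) &= \frac{(r_i-\mu_k)^{2}}{2\Sigma_k}
  + \tfrac{1}{2}\log\!\left(2\pi\Sigma_k\right).
\end{aligned}
```

The negative log density is quadratic in the gap between the feature norm and the norm of the class weight, so minimizing it pulls the norms of same-class features toward a common value.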

Bayes' Rule.
Assume $\|x_i\|_2$ is the $l_2$-norm of a feature with the label $z_i \in [1, K]$. Under the Gaussian distribution assumption, its conditional probability distribution is given by equation (7). According to Bayes' rule, its posterior probability distribution is given by equation (8).

4.2.3. Likelihood Loss.
For a complete dataset $\{X, Z\}$, the likelihood is expressed in equation (9), and the negative log likelihood in equation (10). According to the Gaussian distribution assumption, the prior probability $p(z_i)$ is a constant equal to $1/K$ for a $K$-class classification problem. Therefore, the loss in the radial direction, written as $L_R$, simplifies to equation (11).

4.3. Joint Supervision.
We have obtained the loss functions in the tangential and radial directions, namely, $L_T$ and $L_R$. In this section, we merge the two loss functions to construct the final GO loss. Let $L_O$ denote the GO loss, which is composed of $L_T$ and $L_R$, as shown in equation (12). Naturally, $L_T$ is related only to the cosine of the angle between the feature vector and its corresponding class weight vector, while $L_R$ is related only to the norm of the feature vector. Without loss of generality, $L_T$ and $L_R$ share all the parameters:

$$L_O = L_T + \lambda L_R. \tag{12}$$

A hyperparameter $\alpha$ is used in $L_T$ to control the difficulty of the training process. A nonnegative weighting coefficient $\lambda$ balances the two loss functions. If $\lambda$ is set to 0, only $L_T$ is used in the optimization, whereas $L_T$ and $L_R$ have the same importance when $\lambda$ is set to 1. The influence of the hyperparameters is investigated in the subsequent experiments.
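A radial term and the weighted combination described above can be sketched as follows. The function names, the single scalar `variance` shared by all classes, and the 0-based labels are assumptions for illustration; the combination rule follows the description of $\lambda$ in the text (tangential loss only at 0, equal weight at 1).

```python
import math


def l2_norm(v):
    """Euclidean norm of a vector given as a list of floats."""
    return math.sqrt(sum(c * c for c in v))


def radial_loss(features, labels, weights, variance=1.0):
    """Mean negative log-likelihood of feature norms under a 1-D Gaussian
    centred on the norm of the labelled class weight.

    ASSUMPTION (not from the source): a single scalar `variance`
    shared by all classes.
    """
    total = 0.0
    for x, z in zip(features, labels):
        diff = l2_norm(x) - l2_norm(weights[z])
        total += diff * diff / (2.0 * variance)
        total += 0.5 * math.log(2.0 * math.pi * variance)
    return total / len(features)


def go_loss(l_t, l_r, lam=0.1):
    """Weighted sum of the tangential and radial terms."""
    if lam < 0:
        raise ValueError("lam must be nonnegative")
    return l_t + lam * l_r
```

The radial term looks only at norms and the tangential term only at directions, so the two terms constrain different components of the same feature.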

MNIST Datasets.
In the first experiment, we compare GO loss with soft-max loss on the MNIST handwritten digit dataset [42]. The classification results, which are in the form of high-dimensional vectors, are projected onto a 2D plane, as shown in Figure 1. As observed, the features spread over about 300 units with traditional soft-max loss and over about 3 units with our GO loss. Our GO loss has better intraclass compactness and interclass separability than soft-max loss.
We train the network with different loss functions, namely, soft-max loss, center loss [27], ring loss with soft-max [23], LGM loss [28], GCPL loss [37], and SL [41]. Among these methods, center loss, ring loss, LGM loss, and GCPL loss optimize the loss function from the perspective of intraclass compactness and interclass separation of features, which is consistent with the goal of our GO loss. SL, in contrast, is a popular method for datasets with noisy labels, and we include it to compare against a different optimization perspective. We use SampleNet, which has five convolution layers, each with 32 dimensions, and a fully connected layer with a two-dimensional output. For the existing loss functions, we tune the hyperparameters and record the best results. The networks are trained with a batch size of 128 for 50 epochs, and the learning rate is set to 0.1 and then divided by 2 every 20 epochs. The hyperparameter $\alpha$ is set to 20. The classification accuracy of the different methods is shown in Table 1. As observed, GO loss performs better than the other loss functions on MNIST.
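The step schedule described above can be written as a small helper. This is a sketch under one assumption that the text leaves open: epochs are counted from 0 and the rate drops at the start of epochs 20 and 40. The function name `step_lr` is hypothetical.

```python
def step_lr(epoch, base_lr=0.1, factor=0.5, step=20):
    """Learning rate for a 0-based epoch under step decay.

    ASSUMPTION: the decay takes effect at epochs that are exact
    multiples of `step` (20, 40, ...).
    """
    return base_lr * factor ** (epoch // step)


# Learning rate for each of the 50 training epochs.
schedule = [step_lr(e) for e in range(50)]
```

Changing `factor` and `step` gives other step schedules of the same form, such as halving every 60 epochs or dividing by 10 every 120 epochs.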

Parameter Analysis.
We also conduct experiments to investigate the influence of the hyperparameters $\alpha$ and $\lambda$ on performance. We set $\alpha$ to 10, 20, 30, and 40, each with $\lambda$ of 0.1 and 0.01. Table 2 shows that the accuracy of GO loss is highest when $\alpha$ is 20 and $\lambda$ is 0.1. We therefore use this setting for the other experiments.
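The eight settings of this sweep can be enumerated programmatically. The sketch below only builds the grid and provides a helper for picking the best entry from measured accuracies; the names `grid` and `pick_best` are hypothetical, and no accuracy values are assumed.

```python
from itertools import product

alphas = [10, 20, 30, 40]
lambdas = [0.1, 0.01]

# Every (alpha, lambda) pair evaluated in the sweep.
grid = list(product(alphas, lambdas))


def pick_best(accuracy_by_setting):
    """Return the (alpha, lambda) key with the highest accuracy.

    `accuracy_by_setting` maps each setting to a measured accuracy
    supplied by the caller.
    """
    return max(accuracy_by_setting, key=accuracy_by_setting.get)
```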
We then examine the effects of the tangential and radial losses on the overall GO loss. We first set $\lambda$ to 0, so that only the tangential loss is used in GO loss. The radial loss alone cannot perform classification, so instead of isolating it we set $\lambda$ to 1, which gives the radial loss a much larger contribution than in the other experiments. The classification results, which are in the form of high-dimensional vectors, are projected onto a 2D plane, as shown in Figure 4. The experimental results show that the distance between features of the same class becomes significantly larger when only the tangential loss is used as the loss function. This result shows that the radial loss can effectively control intraclass compactness. When the proportion of radial loss is too large, the features of different classes become intertwined, which results in poor interclass separation. This indicates that the tangential loss plays a decisive role in the separation between classes.
The features spread over about 300 units with traditional soft-max loss in Figure 1(a) and over about 2 units with our tangential loss in Figure 4(a), yet the shapes of the two distributions are similar. Most existing loss functions produce feature distributions in two-dimensional space that resemble that of soft-max. However, to the best of our knowledge, the reason has never been discussed. Traditional soft-max uses the inner product metric, which is essentially a linear constraint. As a result, the feature distribution is linearly separable. Although our tangential loss is calculated from the normalized feature, it is also related to the cosine according to the law of cosines.
The cosine is the inner product of the normalized vectors, which is also a linear constraint. Thus, the two distributions are similar in shape. The Euclidean distance and our radial loss are quadratic or bilinear constraints. Thus, the resulting features are different, as shown in Figure 4(b).

Fashion-MNIST Datasets.
We conduct another experiment on the Fashion-MNIST dataset [43], which contains 70,000 grayscale images with a pixel resolution of 28 × 28. The dataset contains 10 categories of fashion products and is divided into 60,000 training samples and 10,000 testing samples. We adopt the same network and training parameters as for MNIST. The classification accuracy is shown in Table 3. As observed, GO loss also has the best performance on this dataset. WRN-28 [18] has been reported to achieve the best results on the Fashion-MNIST dataset. We therefore also try our GO loss on this network structure. The classification accuracy is shown in Table 4. The experimental results show that our GO loss also performs well with this advanced network structure.

Table 1: Classification accuracy (%) of different loss functions on MNIST.
Method              Hyperparameter    Accuracy (%)
Center loss [27]    λ = 0.1           99.62
Ring loss [23]      λ = 0.1           99.58
LGM loss [28]       α = 1             99.36
GCPL loss [37]      λ = 0.1           99.41
SL [41]             η = 0.0           99.32
GO loss             λ = 0.1           99.66 ± 0.03

CIFAR-10 and CIFAR-100 Datasets.
We train three more complex networks with GO loss on the CIFAR-10 and CIFAR-100 datasets [44]. Each dataset contains 60,000 color images with a pixel resolution of 32 × 32, divided into 50,000 training images and 10,000 testing images.
For CIFAR-10, we use ResNet [9] with a depth of 20 as the network structure. The batch size is set to 128 and the number of epochs to 300. We set the learning rate to 0.1 and halve it every 60 epochs. The hyperparameter $\alpha$ is set to 20. We use a weight decay of $5 \times 10^{-4}$ and the SGD optimization algorithm with a momentum of 0.9. The method introduced in [45] is used to initialize the network weights. The main purpose of the experiment is to compare the classification accuracy of soft-max loss and GO loss. Moreover, we compare the classification accuracy under different values of the balance parameter $\lambda$ (0.1 and 0.01), which describes the relative contribution of the radial and tangential loss functions to the final GO loss (Section 4.3).
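For readers unfamiliar with how momentum and weight decay interact, one common form of the SGD update is sketched below on a plain list of scalar parameters. The text does not say which variant of momentum SGD was used, so this particular formulation (weight decay added to the gradient, no dampening, no Nesterov correction) and the function name `sgd_step` are assumptions.

```python
def sgd_step(w, grad, velocity, lr=0.1, momentum=0.9, weight_decay=5e-4):
    """One SGD update; returns (new_w, new_velocity) as new lists."""
    new_w, new_v = [], []
    for wi, gi, vi in zip(w, grad, velocity):
        g = gi + weight_decay * wi   # L2 penalty folded into the gradient
        v = momentum * vi + g        # running velocity
        new_v.append(v)
        new_w.append(wi - lr * v)    # step against the velocity
    return new_w, new_v
```

With a zero gradient, the weight decay term alone shrinks each parameter slightly toward zero on every step.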
The experimental results are shown in Table 5. As expected, GO loss achieves better results than traditional soft-max loss. To rule out deviations in the experimental results caused by the network structure, we also evaluate GO loss on another network, namely, DenseNet-BC [1] with 12 feature maps.
This experiment is also conducted on the CIFAR-10 dataset. The experimental results shown in Table 6 indicate that GO loss also performs better than the others under this experimental condition.
For CIFAR-100, we use ResNet [9] with a depth of 50 as the network structure. The batch size is set to 128 and the number of epochs to 300. We set the learning rate to 0.1 and divide it by 10 every 120 epochs. The hyperparameter $\alpha$ is set to 20. We use a weight decay of $5 \times 10^{-4}$ and the SGD optimization algorithm with a momentum of 0.9. The method introduced in [45] is used to initialize the network weights. The main purpose of the experiment is to compare the classification accuracy of soft-max loss and GO loss. Moreover, we compare the classification accuracy under different values of the balance parameter $\lambda$ (0.1 and 0.01), which describes the relative contribution of the radial and tangential loss functions to the final GO loss (Section 4.3). The experimental results shown in Table 7 indicate that GO loss performs best when $\lambda = 0.1$. Therefore, the tangential direction, which is more closely related to interclass separability than the radial direction, has a greater impact on the classification accuracy. To rule out deviations caused by the network structure, we again evaluate GO loss on DenseNet-BC with 12 feature maps. This experiment is also performed on the CIFAR-100 dataset.
The experimental results in Table 8 indicate that GO loss also performs better than the others under this experimental condition.

ImageNet Dataset.
To verify the scalability of GO loss, we evaluate it on the larger ImageNet dataset [46]. A more complex network, namely, ResNet-101 [9], is used. Soft-max is selected as the reference, and its classification accuracy is compared with that of GO loss. We use 8 Titan GPUs to train all the models. The batch size and the number of epochs are set to 128 and 120, respectively. The learning rate is initialized to 0.01 and decayed by 50% every 40 epochs. We also investigate the influence of varying the balance parameter $\lambda$ on the accuracy.
The results shown in Table 9 indicate that GO loss is also effective on large-scale datasets and achieves better performance with a larger $\lambda$.

Conclusions
In this paper, we present an orthogonal decomposition-based loss for classification.
Our approach can be summarized as follows:
(1) We propose a new optimization perspective. Specifically, we consider the optimization problem from the perspective of convergence direction.
(2) We decompose the convergence direction into two mutually orthogonal components, namely, tangential and radial directions, and conduct optimization on them separately.
(3) We decouple the direction and norm of the feature to avoid their interference with each other during the optimization process.
(4) We associate the direction and norm of the feature with interclass separability and intraclass compactness, respectively.
(5) We use Gaussian distributions to guide the optimization processes on the direction and norm of the feature.
We train six networks on five datasets of different sizes to evaluate the proposed GO loss. The results demonstrate the effectiveness of GO loss. In our future work, we plan to make two improvements. First, we plan to apply GO loss to other datasets for a thorough evaluation of its performance under different application scenarios. Second, we will propose a method to quantitatively determine the values of the hyperparameters, such as by visual analytics [6] or adaptive scaling [47].

Figure 1: Distributions of the features trained using (a) soft-max and (b) GO loss on MNIST. Each color represents a class. The GO loss has better intraclass compactness and interclass separation than traditional soft-max loss. Best viewed in color.

Figure 2: Illustration of convergence direction change in the cases of (a) inner product metric and (b) Euclidean distance metric.

Figure 3: Convergence direction of GO loss, which is fixed in (a) the tangential and (b) radial directions during the optimization process.

Figure 4: The distributions of the features trained using GO loss with (a) $\lambda = 0$ and (b) $\lambda = 1$ on MNIST. Different colors represent different classes. The contribution of radial loss and tangential loss to the overall GO loss is shown. Best viewed in color.

Table 2: Classification accuracy under different hyperparameters on MNIST.