Unimodal regularisation based on beta distribution for deep ordinal regression

Currently, the use of deep learning for solving ordinal classification problems, where categories follow a natural order, has received little attention. In this paper, we propose a unimodal regularisation based on the beta distribution applied to the cross-entropy loss. This regularisation encourages the distribution of the labels to be a soft unimodal distribution, which is more appropriate for ordinal problems. Given that the beta distribution has two parameters that must be adjusted, a method to automatically determine them is proposed. The regularised loss function is used to train a deep neural network model with an ordinal scheme in the output layer. The results obtained are statistically analysed and show that the combination of these methods increases the performance in ordinal problems. Moreover, the proposed beta distribution performs better than other distributions proposed in previous works, while also achieving a reduced computational cost. © 2021 The Author(s). Published by Elsevier Ltd. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).


Introduction
In the last decade, ordinal classification/regression has received increasing interest in the literature [1][2][3]. The methods focused on solving this kind of problem aim to determine the discrete category or ranking of a pattern on an ordinal scale. This ordinal scale is given by the natural ordering of the categories existing in the problem considered [4]. For instance, in medical problems where we obtain a diagnosis from images, the category is usually on an ordinal scale (e.g. Diabetic Retinopathy (DR) detection [5] with five levels of the disease). Another possible example is the prediction of the age range of people from photographs of their faces [6].
There are many real world problems where the available data have an underlying ordinal structure. In [7], the authors aim to predict the level of Parkinson's disease based on volumetric images obtained through encephalograms. The patients are classified depending on the state of this pathology (1: healthy patient, 2: slight alteration, 3: more advanced alteration, etc.). In [8], the authors try to predict convective situations in the Madrid-Barajas airport in Spain, which is crucial for this kind of transportation facility as they can cause a severe impact on flight scheduling and safety. These situations can be present in several degrees, resulting in different classes that follow a natural order. Finally, in [9], the authors try to automatically detect prostate cancer at different degrees based on the Gleason Score, which is a standard for measuring the aggressiveness of this type of cancer. According to this metric, prostate cancer is divided into 5 categories, where the first one does not require a treatment while the others do. Also, depending on the aggressiveness degree, the treatment must be different, and it is important to determine this level accurately to avoid excessive or insufficient treatments.
In any of these real world problems, misclassifying a pattern in an adjacent class is always less important than misclassifying it in a distant class. This is the main reason why taking the ordinal information into account when solving this kind of problem is essential. Moreover, when ordinal classifiers are used, the order of the labels is explicitly considered in the model, which generally accelerates the learning process and reduces the amount of data needed for training.
Machine learning methods based on Deep Learning [10] have been used for a wide variety of tasks. Deep Neural Networks (DNN) have the ability to obtain high level abstract representations of low level features. Each layer of the network extracts higher level features from the previous layer. Specifically, Convolutional Neural Networks (CNN) take a grayscale or RGB colour image as input data and extract a set of features that are used to classify the pattern into one of the different categories or rankings. This kind of model has been used in problems related to image classification [11], speech recognition [12], control problems [13], object detection [14], privacy and security protection [15], recovery of human pose [16], semantic segmentation [17], image retrieval [18,19], visual recognition [20], etc.
However, the resolution of ordinal problems with deep learning models has not received much attention. The most common approach to solve ordinal data problems is to treat them as multiclass problems and optimise the model using the standard cross-entropy (CE) loss [21]. This approach has a major drawback: it does not take into account the order between categories. Some previous works [22,23] have explored different loss functions to address this problem. Another common approach is to convert the ordinal regression problem into a standard regression one [24]. This approach keeps the order between rankings but assumes that the discrete categories are continuous and equispaced.
Another problem when working with ordinal data is found in the way the labels are encoded. Usually, the one-hot encoding is used, which represents the label as a binary vector whose j-th element is 1 when the true class is j. However, this encoding incurs the same penalty for all misclassification errors, without taking into account the distance to the true label. A better approach for ordinal regression problems is to use a smooth label representation, in such a way that classes which are close to the real label produce a smaller error than classes that are far away. This method, known as label smoothing or unimodal regularisation for the loss function, has been used in previous works, and different distribution functions have been used to model the shape of these smooth labels (Poisson, binomial [25] or exponential [26]).
In this work, we propose to use beta distributions for the label smoothing method, together with an approach to automatically determine the parameters of these distributions. To evaluate the performance of the proposed method, we use the unimodal regularised loss to train a CNN model with three separate image datasets. The unimodal regularisation is combined with the recently proposed stick-breaking scheme for the output layer [27] but also tested with the standard softmax function. As shown in the following sections, combining these two elements results in improved performance for ordinal problems with respect to previously published alternatives.
The rest of this paper is organised as follows: previous related works, including the stick-breaking and the loss regularisation, are described in Section 2 ; in Section 3 , the new cross-entropy loss regularisation with the beta distribution is explained; in Section 4 , the design of the experiments and the datasets used are described; in Section 5 , the results of the experiments are shown and compared with previous works; and, finally, in Section 6 , the conclusions of this work are presented.

Stick-breaking
The stick-breaking approach considers the problem of breaking a stick of length 1 into J segments. This methodology is related to non-parametric Bayesian methods and can be considered a subset of the random allocation processes [28] . Also, this method has been applied as a generalisation of the continuation ratio models [29] .
In Latent Gaussian Models (LGMs), the latent variables follow a Gaussian distribution where the probability of each categorical variable is parametrised in terms of a linear projection $\eta_1, \ldots, \eta_J$ [30], where $J$ is the number of classes of the problem. The probability of the first category is modelled as $\sigma(\eta_1)$, where $\sigma(x) = 1/(1 + e^{-x})$. This represents the first piece of the stick. The length of the remainder of the stick is $1 - \sigma(\eta_1)$. The probability of the second category is a fraction $\sigma(\eta_2)$ of the remainder of the stick. Following this process, we can define the probability of each of the $J$ categories. The probabilities (stick lengths) are all positive and sum to one; they thus define a valid probability distribution. A different function for $\sigma(x)$ can be used (such as the probit function). However, the logit allows us to use efficient variational bounds. The stick-breaking parametrisation can be written compactly as:

$$P(y = C_i) = \sigma(\eta_i) \prod_{j<i} \left(1 - \sigma(\eta_j)\right), \; i < J, \qquad P(y = C_J) = \prod_{j<J} \left(1 - \sigma(\eta_j)\right).$$

Recently, a stick-breaking approach was presented in [27] as an alternative to the standard softmax for ordinal classification problems where the output distribution should be unimodal. The authors define a stick of unit length and sequentially break off parts of the stick. The lengths of the generated stick fragments represent the discrete probabilities of the classes.
In the first stick-breaking step, two parts with lengths $\sigma(\eta_1)$ and $1 - \sigma(\eta_1)$ are created. These fragments represent the probability of the first class:

$$P(y = C_1 \mid \mathbf{x}) = \sigma(\eta_1),$$

and the probability of the remaining classes in the ordinal scale ($y \succ C_1$):

$$P(y \succ C_1 \mid \mathbf{x}) = 1 - \sigma(\eta_1).$$

Then, the remaining part $1 - \sigma(\eta_1)$ is broken, obtaining two parts of lengths $\sigma(\eta_2)(1 - \sigma(\eta_1))$ and $(1 - \sigma(\eta_2))(1 - \sigma(\eta_1))$. This breaking process can be mathematically written as:

$$P(y = C_i \mid \mathbf{x}) = \begin{cases} \sigma(f_i(\mathbf{x})) \prod_{j<i} \left(1 - \sigma(f_j(\mathbf{x}))\right), & i < J, \\ \prod_{j<J} \left(1 - \sigma(f_j(\mathbf{x}))\right), & i = J, \end{cases}$$

where the length of each fragment gives the probability of the corresponding class. The stick-breaking process is used for training deep ordinal neural networks [27]. To do this, the authors set $J - 1$ output neurons for a problem with $J$ ranks or ordinal categories; $f_i(\mathbf{x})$ is a scalar denoting the $i$th output of the neural network and replaces the linear projections ($\eta_i$) of the LGMs. The conventional cross-entropy loss, CE, can be used to train the model.
It can be derived that each output associated with $f_i(\mathbf{x})$ actually models the ratio:

$$\sigma(f_i(\mathbf{x})) = \frac{P(y = C_i \mid \mathbf{x})}{P(y \succeq C_i \mid \mathbf{x})}.$$

Consequently, $f_i(\mathbf{x})$ can be interpreted as defining a decision boundary that tries to separate the $i$th class from all the classes that come after it. By doing so, the prediction is still a discrete probability distribution.
An interesting property of this method is that, unlike other approaches that only output a single distribution value [25,31] , it is more expressive, because each boundary of two adjacent classes has its own scalar output f i (x ) .
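As an illustration of the process above, the mapping from the $J-1$ network outputs to $J$ class probabilities can be sketched in a few lines of NumPy (a minimal sketch; `stick_breaking_probs` is a hypothetical helper, not code from [27]):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def stick_breaking_probs(f):
    """Map J-1 raw outputs f_1..f_{J-1} to J class probabilities
    by sequentially breaking off sigma(f_i) of the remaining stick."""
    s = sigmoid(np.asarray(f, dtype=float))
    # remainder[i] = prod_{j<i} (1 - sigma(f_j)); remainder[0] = 1 (whole stick)
    remainder = np.concatenate(([1.0], np.cumprod(1.0 - s)))
    # p_i = sigma(f_i) * remainder[i]; the last class takes whatever is left
    return np.concatenate((s, [1.0])) * remainder

p = stick_breaking_probs([0.3, -0.1, 1.2, 0.0])  # a 5-class problem
assert np.isclose(p.sum(), 1.0) and np.all(p >= 0.0)
```

The fragment lengths telescope to one by construction, so no softmax normalisation is needed.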

Unimodal regularisation
Label smoothing is a general regularisation technique to address the noisy label problem, which encourages the model to be less confident [27]. In the case of a one-hot label, the distribution of the label probability is $q(i) = \delta_{i,y}$, where $y$ is the ground truth class and $\delta_{i,y}$ is a Kronecker delta, which equals 1 for $i = y$ and 0 otherwise. Label smoothing is applied to the cross-entropy loss by replacing $q(i)$ with a more conservative target distribution:

$$q'(i) = (1 - \eta)\,\delta_{i,y} + \eta \frac{1}{J},$$

where $\eta$ is the parameter that controls the linear combination.
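For reference, this smoothed target can be computed directly (a minimal sketch; `smoothed_one_hot` is a hypothetical name):

```python
import numpy as np

def smoothed_one_hot(true_class, num_classes, eta=0.1):
    """Standard label smoothing: mix the one-hot target with a
    uniform distribution over the classes, weighted by eta."""
    q = np.full(num_classes, eta / num_classes)  # uniform part: eta / J per class
    q[true_class] += 1.0 - eta                   # the true class keeps the rest
    return q

q = smoothed_one_hot(2, 5, eta=0.1)
assert np.isclose(q.sum(), 1.0) and q.argmax() == 2
```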
In ordinal classification, when a pattern is misclassified, it is most likely assigned to one of the classes closest to its real class. Therefore, building unimodal distributions which have their mode in the centre of the interval, for the case of middle classes, or in the upper or lower bounds, for extreme classes, should yield a more accurate loss computation. Moreover, it is quite important that the probability distribution has small variance and that the majority of its probability mass is concentrated in the interval associated with the real class. In this way, the probability is drastically reduced as we move further from the correct class.
The distributions proposed in previous works [25][26][27] to model the targets in a soft manner have improved the performance of ordinal classifiers with respect to the standard one-hot encoding. However, they have high variance or do not offer the required flexibility to position the mode of each distribution in the centre of the class interval while preserving a small variance. Also, some of the proposed methods require different parameters to be adjusted experimentally.
In [25], the authors used Poisson distributions to model the probabilities. The mean and the variance of this distribution are both equal to the distribution parameter $\lambda$; therefore, it has limited flexibility to obtain a small variance. For this reason, they also used the binomial distribution, which has two parameters: the number of classes, $J$, and the probability, $p$. Even though the mean ($Jp$) and the variance ($Jp(1-p)$) have different expressions, it is not easy to position the mode at the right point of the interval while obtaining a small variance. Finally, the authors of [27] proposed to sample an exponential function $e^{-|i-l|/\tau}$, where $l$ is the class of the pattern and $i = 1, \ldots, J$, followed by a softmax normalisation. However, the value of $\tau$ must be adjusted experimentally and, in some cases, the probability mass is not sufficiently concentrated in the interval of the correct class.
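The exponential scheme of [27] is easy to reproduce; a binomial variant is also sketched below, with the success probability placed at the interval centre, which is our own assumption rather than the exact parametrisation of [25]:

```python
import math
import numpy as np

def exponential_soft_labels(true_class, J, tau=1.0):
    """Soft targets from e^{-|i-l|/tau} followed by softmax normalisation."""
    i = np.arange(1, J + 1)
    logits = -np.abs(i - true_class) / tau
    e = np.exp(logits - logits.max())  # shift for numerical stability
    return e / e.sum()

def binomial_soft_labels(true_class, J):
    """Soft targets from a binomial pmf; the success probability is set to the
    centre of the true class interval (an assumption made for this sketch)."""
    p = (2 * true_class - 1) / (2 * J)
    return np.array([math.comb(J - 1, k) * p**k * (1 - p) ** (J - 1 - k)
                     for k in range(J)])
```

Both vectors sum to one and place their mode at (or near) the true class, but neither offers direct control over the variance, which motivates the beta-based alternative proposed in this work.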
To overcome the issues related to the Poisson and binomial distributions and the exponential function described, we propose in this work a set of probability distributions associated with the beta distribution, given that their variance is small and the domain of the distribution is between 0 and 1. As a graphical example, Fig. 1 illustrates the shape of the distribution associated with each class and type of distribution for a problem with five classes. For the discrete distributions, the class number is represented in the x axis while the class intervals are used for the continuous distributions.

Beta regularised cross-entropy loss
The main idea behind this work is to use probability distributions to model the targets as unimodal distributions instead of using the one-hot encoding. In this way, we obtain the soft target distributions $q(i)$ discussed in Section 2. To do that, we consider the beta distribution, which is defined in the range $[0, 1]$; therefore, there is no need to apply any normalisation and it does not lead to high variance. The beta distribution has been applied to model the behaviour of random variables limited to intervals of finite length in a wide variety of disciplines. Some properties of this distribution are described below.
In its standard form, the beta distribution, $\beta(a, b)$, is a continuous distribution whose probability density function (pdf) is:

$$f(x; a, b) = \frac{x^{a-1}(1-x)^{b-1}}{B(a, b)}, \qquad 0 < x < 1, \; a > 0, \; b > 0,$$

also known as the classical beta distribution. The normalising function $B(a, b)$ has the form:

$$B(a, b) = \int_0^1 t^{a-1}(1-t)^{b-1}\,dt = \frac{\Gamma(a)\Gamma(b)}{\Gamma(a+b)}.$$

The pdf is zero at $x = 0$ when $a > 1$ and at $x = 1$ when $b > 1$. Finally, if $a = b = 1$, then $f(x)$ becomes the uniform distribution.
Since the range of $f(x)$ is finite, all its moments exist. Its mean is given by the expression $E(x) = \frac{a}{a+b}$ and its variance is defined as:

$$V(x) = \frac{ab}{(a+b)^2(a+b+1)}.$$

In order to analyse the behaviour of this distribution, we consider an ordinal classification problem with five classes ($J = 5$). We assume that the distribution of the labels of each class is a beta distribution whose mass is concentrated in the corresponding class interval. In this way, the probability of each class $j$ can be calculated as:

$$P(\text{class } j) = \int_{(j-1)/J}^{j/J} f(x; a, b)\,dx.$$

In the same way, we can compute the distributions and the probabilities associated with the other classes by finding the $a$ and $b$ parameters that make the distribution centred in the corresponding intervals. However, adjusting these parameters by trial and error is not the best method, as it requires additional computational time. So, in the next section, an alternative method to determine $a$ and $b$ is proposed.
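The class probabilities above can be approximated numerically, for instance with a midpoint rule over each interval (a sketch; the helper names are hypothetical):

```python
import math
import numpy as np

def beta_pdf(x, a, b):
    """Classical beta pdf f(x; a, b) = x^(a-1) (1-x)^(b-1) / B(a, b)."""
    B = math.gamma(a) * math.gamma(b) / math.gamma(a + b)
    return x ** (a - 1) * (1 - x) ** (b - 1) / B

def class_probabilities(a, b, J, n=20000):
    """P(class j) = integral of beta(a, b) over ((j-1)/J, j/J),
    approximated with the midpoint rule (which also avoids x = 0 and x = 1)."""
    probs = []
    for j in range(1, J + 1):
        edges = np.linspace((j - 1) / J, j / J, n + 1)
        mids = (edges[:-1] + edges[1:]) / 2
        probs.append(np.mean(beta_pdf(mids, a, b)) / J)  # interval width is 1/J
    return np.array(probs)

q = class_probabilities(a=1, b=9, J=5)  # first-class distribution for J = 5
assert q.argmax() == 0 and abs(q.sum() - 1.0) < 1e-3
```

With $\beta(1, 9)$ roughly 87% of the mass falls in the first interval $[0, 1/5]$, which is exactly the concentration property sought for the first class.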

Beta distribution parameters based on number of classes
In this section, we propose a method to find the parameters ( a, b) for each class on any ordinal problem based on the number of classes.
First, we define the thresholds of each class based on the number of classes ($J$). For any value of $J$, the class intervals are $[0, 1/J], [1/J, 2/J], \ldots, [(J-1)/J, 1]$, and their centres can be obtained as $1/(2J), 3/(2J), \ldots, (2J-1)/(2J)$. The length of the first interval is $1/J$, and its centre is $1/(2J)$. Consequently, imposing the mean of the beta distribution on this centre:

$$E(x) = \frac{a}{a+b} = \frac{1}{2J} \;\Longrightarrow\; b = (2J-1)a.$$

Then, the variance can be written as:

$$V(x) = \frac{ab}{(a+b)^2(a+b+1)} = \frac{2J-1}{4J^2(2Ja+1)},$$

and the standard deviation is $S(x) = \sqrt{V(x)}$. We assume that most of the mass of the distribution should lie in the range $E(x) \pm S(x)$. In this way, we obtain the constraints for the first interval:

$$0 \le E(x) - S(x), \qquad E(x) + S(x) \le \frac{1}{J}.$$

Solving the first inequality, $S(x) \le 1/(2J)$, we get:

$$\frac{2J-1}{4J^2(2Ja+1)} \le \frac{1}{4J^2} \;\Longrightarrow\; a \ge 1 - \frac{1}{J}.$$

As a consequence, for any $J$, we can use $a = 1$ and $b = a(2J-1)$.
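The derivation above is easy to verify numerically: with $b = (2J-1)a$ and $a = 1$, the one-standard-deviation interval of the first-class distribution stays inside $[0, 1/J]$ (the helper names below are hypothetical):

```python
import math

def first_class_params(J, a=1.0):
    """Mean constraint E(x) = a/(a+b) = 1/(2J) gives b = (2J-1)a."""
    return a, (2 * J - 1) * a

def beta_mean_std(a, b):
    """Mean and standard deviation of the beta(a, b) distribution."""
    mean = a / (a + b)
    var = a * b / ((a + b) ** 2 * (a + b + 1))
    return mean, math.sqrt(var)

J = 5
a, b = first_class_params(J)
mean, std = beta_mean_std(a, b)
assert abs(mean - 1 / (2 * J)) < 1e-12
# the one-sigma interval stays inside the first class interval [0, 1/J]
assert mean + std <= 1 / J
```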
In the same way, we obtain the parameters for the second interval, whose centre is $3/(2J)$. Imposing $E(x) = a/(a+b) = 3/(2J)$ yields $b = a(2J-3)/3$. The variance and the standard deviation are obtained as before, and the constraints for this interval are given by:

$$\frac{1}{J} \le E(x) - S(x), \qquad E(x) + S(x) \le \frac{2}{J},$$

which leads to the corresponding values of $a$ and $b$. In the same way, we can obtain the parameters for the rest of the intervals. The parameters of the beta distributions for each class are shown in Table 1 for different numbers of classes ($J \le 8$).
Finally, the beta regularised cross-entropy loss can be expressed as:

$$L = -\sum_{j=1}^{J} q(j) \log p(j),$$

where $q(j) \propto f(x_j, a, b)$, $x_j = (2j-1)/(2J)$ is the centre of the $j$th class interval, and $f(x, a, b)$ is the probability value sampled from the beta distribution associated with the true class, whose $a$ and $b$ parameters are obtained using the method described in this section.
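A sketch of the resulting loss, assuming the soft target of the true class has already been discretised (the interval masses shown are approximate values for $\beta(1, 9)$ with $J = 5$, i.e. the first class; `regularised_cross_entropy` is a hypothetical name):

```python
import numpy as np

def regularised_cross_entropy(pred, soft_target):
    """Cross-entropy between predicted probabilities and a unimodal soft
    target (e.g. the per-interval mass of the true class's beta distribution)."""
    pred = np.clip(pred, 1e-12, 1.0)  # guard against log(0)
    return -np.sum(soft_target * np.log(pred))

# approximate interval masses of beta(1, 9) over the 5 class intervals
q = np.array([0.8658, 0.1241, 0.0098, 2.6e-4, 5.1e-7])
q = q / q.sum()
loss_good = regularised_cross_entropy(np.array([0.7, 0.2, 0.05, 0.03, 0.02]), q)
loss_bad = regularised_cross_entropy(np.array([0.02, 0.03, 0.05, 0.2, 0.7]), q)
assert loss_good < loss_bad  # predictions near the true class are penalised less
```

Note how a prediction concentrated on the wrong end of the ordinal scale is penalised much more heavily than one concentrated near the true class, which is precisely the behaviour the regularisation is designed to induce.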

Beta distribution properties regarding ordinal classes representation
The main benefit of using the beta distribution for modelling the probability distribution is the fact that it has two parameters that allow obtaining different distribution shapes with small variance (see Table 1, which lists the beta parameters $(a, b)$ for each class and each number of classes). Thus, it can represent the distribution of the extreme classes along with the symmetric distribution of the middle class. When $a = 1$, the probability density function is given by:

$$f(x) = b(1-x)^{b-1},$$

while the pdf for $b = 1$ is:

$$f(x) = a x^{a-1}.$$

As shown in Fig. 2a, these functions can easily represent the distributions associated with the extreme classes.
On the other hand, the beta distribution with $a = b$ and $b \to \infty$ approaches a normal distribution. Let the random variable $X$ follow a $\beta(b, b)$ distribution with probability density function:

$$f(x) = \frac{x^{b-1}(1-x)^{b-1}}{B(b, b)},$$

where $b$ is a real positive parameter. The mean of $X$ is $E[X] = 0.5$ and the variance of $X$ is $V[X] = \frac{1}{4(2b+1)}$.
Proposition. The standardised $\beta(b, b)$ distribution converges to the standard normal distribution when $b \to \infty$, that is:

$$\frac{X - E[X]}{\sqrt{V[X]}} \;\xrightarrow{d}\; \mathcal{N}(0, 1).$$

The proof of this proposition is included in Appendix A. Therefore, the beta distribution can also accurately represent the distribution of the middle class through a symmetric distribution when the problem has an odd number of classes, keeping the variance small (see Fig. 2b). When the value of $b$ increases, the variance of the distribution becomes smaller. The symmetry of the distribution in the aforementioned cases can be easily checked with the skewness coefficient, which is calculated as:

$$\gamma_1 = \frac{2(b-a)\sqrt{a+b+1}}{(a+b+2)\sqrt{ab}},$$

and is zero due to the fact that $a = b$. As mentioned before, the rest of the classes have asymmetrical distributions with small variance, as can be observed in Fig. 2c. However, when the number of classes increases, the resulting distributions get closer to a normal distribution with small variance (see the distributions for $(17, 37)$ or $(37, 17)$ in Fig. 2c).
The properties described in this section make the beta distribution an excellent choice for modelling the probability distribution of each class in an ordinal problem, as it can precisely represent both the extreme and the middle classes.
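The convergence claimed by the proposition can be checked numerically by comparing the standardised $\beta(b, b)$ density with the standard normal density; evaluating the pdf in log space avoids overflowing the gamma function for large $b$ (the helper names are hypothetical):

```python
import math

def beta_pdf(x, a, b):
    # log-space evaluation: math.gamma overflows for arguments above ~171
    log_B = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    return math.exp((a - 1) * math.log(x) + (b - 1) * math.log(1 - x) - log_B)

def standardised_beta_pdf(y, b):
    """Density of Y = (X - 1/2) / sqrt(V[X]) for X ~ beta(b, b),
    where V[X] = 1 / (4(2b + 1))."""
    std = 1.0 / (2.0 * math.sqrt(2 * b + 1))
    return beta_pdf(0.5 + y * std, b, b) * std

def std_normal_pdf(y):
    return math.exp(-y * y / 2) / math.sqrt(2 * math.pi)

# the approximation error at y = 1 shrinks as b grows
err_small_b = abs(standardised_beta_pdf(1.0, 5) - std_normal_pdf(1.0))
err_big_b = abs(standardised_beta_pdf(1.0, 200) - std_normal_pdf(1.0))
assert err_big_b < err_small_b
```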

Data
The ordinal classification of images has not been widely explored yet and, therefore, there are not many ordinal image benchmark datasets that can be used to test our approach. We have evaluated the different proposals using the most well-known ordinal image datasets.

Diabetic Retinopathy
Diabetic Retinopathy (DR) is a dataset consisting of extremely high-resolution fundus images. It was used in a Kaggle competition and has been used in several previous works [32,33] as a benchmark dataset for ordinal classification. The training set consists of 17563 pairs of images (where a pair includes a left and a right eye image corresponding to a patient). In this dataset, we try to predict the correct category from five levels of DR: no DR (25810 images), mild DR (2443 images), moderate DR (5292 images), severe DR (873 images), or proliferative DR (708 images). The test set contains 26788 pairs of images, which are distributed in the same five classes with the following proportions: 39532, 3762, 7860, 1214 and 1208 images. These images were taken under variable conditions: different cameras, illumination conditions and resolutions. They come from the EyePACS dataset that was used in the DR detection competition hosted on the Kaggle platform. The images are resized to 128 × 128 pixels and the value of each pixel is standardised using the mean and the standard deviation of the training set. Some images from the test set are presented in Fig. 3a.

Adience
The Adience dataset consists of 26580 faces belonging to 2284 subjects. It has been used in previous works [34] for gender and age classification. The faces that appear in the original images of this dataset have been pre-cropped and aligned in order to ease the training process. Also, the images have been resized to 256 × 256 pixels and contrast-normalised, and the distribution of the pixels was standardised. The original dataset was split into five cross-validation folds. The training set is built by merging the first four folds, which comprise a total of 15554 images. The last fold is used as the test set. Fig. 3b shows some images taken from the test set.

FGNet
FGNet (https://yanweifu.github.io/FG_NET_data/index.html) is the smallest dataset considered in this work. It consists of 1002 colour images of faces (128 × 128 pixels) from 82 different subjects. From these images, we took 80% for training and the remaining 20% for testing. These partitions were done in a stratified way. Each image was labelled with the exact age of the subject at the moment the picture was taken. We grouped these ages into six categories based on age ranges (0-3, 3-11, 11-16, 16-24, 24-40, >40).
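The age-to-range mapping can be done with `np.digitize` (a sketch; the treatment of ages falling exactly on a boundary is an assumption of this snippet, not something the paper specifies):

```python
import numpy as np

def age_to_class(ages, edges=(3, 11, 16, 24, 40)):
    """Map exact ages to the six ordinal ranges used for FGNet:
    0-3, 3-11, 11-16, 16-24, 24-40, >40.
    An age equal to a boundary falls into the upper range (assumption)."""
    return np.digitize(ages, edges)

labels = age_to_class([1, 5, 14, 20, 30, 55])
assert list(labels) == [0, 1, 2, 3, 4, 5]
```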

Model
The model considered for this work is a Residual Convolutional Network [27], as it can achieve good generalisation capabilities with a reduced number of parameters. Figure 4 shows more details about the layers that compose the network architecture. The kernel size and stride are specified for each convolutional and pooling layer. The structure of a residual block (ResBlock NxNsS) is shown in Fig. 4 too. The output of each residual block is concatenated with its input. The parameters of every convolutional and batch normalisation layer are L2 regularised (factor $10^{-4}$). He normal initialisation [35] has been used for the weights and biases of these layers. The global average pooling layer replaces each channel of its input with the mean value of all the pixels of the channel. This layer achieves a high reduction of data dimensionality, significantly reducing the number of parameters at the end of the network while maintaining good performance.
In the output layer of the model, two different alternatives are considered: (1) a dense layer with N units and the standard softmax function; (2) a dense layer with N − 1 neurons and sigmoid activation, followed by the stick-breaking layer described in Section 2.1.

Experimental design
The model described above is trained using the three datasets described in Section 4.1. The convolutional network model was optimised using the well-known batch-based first-order optimisation algorithm Adam [36]. The initial learning rate of the optimiser ($\eta = 10^{-4}$) and the batch size (128) were adjusted by cross-validation. The training process is run for 100 epochs and repeated 10 times following a 10-fold validation scheme, where we take 9 folds for training and the remaining one for validation. To ease comparison and make the experiments reproducible, the folds considered were the same for all the experiments.
Using the aforementioned validation set, an early stopping mechanism is applied in order to stop the training process when the validation loss has not improved for 20 epochs. Also, when the validation loss has not decreased for 8 epochs, the learning rate is multiplied by a factor of 0.5, until it reaches $10^{-6}$.
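The learning-rate schedule just described can be sketched as a simple rule (a hypothetical helper, not the actual training code):

```python
def reduce_lr_on_plateau(lr, epochs_without_improvement,
                         patience=8, factor=0.5, min_lr=1e-6):
    """Halve the learning rate after `patience` stagnant epochs,
    flooring it at min_lr, as in the schedule described above."""
    if epochs_without_improvement >= patience:
        return max(lr * factor, min_lr)
    return lr

lr = 1e-4
for _ in range(10):  # simulate ten successive plateau checks
    lr = reduce_lr_on_plateau(lr, epochs_without_improvement=8)
assert lr == 1e-6    # the schedule bottoms out at the minimum learning rate
```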
Data augmentation techniques are applied, as previous works [37] have shown that they help to avoid over-fitting and reduce the amount of data needed to train a deep learning model. We considered the following transformations: horizontal flipping, random zoom in or out within a [−20%, 20%] range and random width shifting within a [−10%, 10%] range. They are individually applied to every image in the training set with a certain probability. In this way, more than one transformation can be applied to the same image. Also, for the zoom in/out and the width shifting, the magnitude of the transformation is randomly selected from the ranges described.
In terms of the loss function used for the optimisation algorithm, we have considered five different alternatives, all based on the standard cross-entropy loss:

• Standard cross-entropy (CE).
• Cross-entropy loss with Poisson regularisation (CE-P) [26].
• Cross-entropy loss with binomial regularisation (CE-B) [26].
• Cross-entropy loss with exponential regularisation (CE-E) [27].
• Cross-entropy loss with the beta regularisation (CE-β) proposed in this work (Section 3.1). The parameters used for the distribution are obtained using the method described in Section 3.2.
Since the datasets are imbalanced, the loss function is weighted based on the a priori probabilities of the classes (considering the number of instances of each class in the training set), following the method described in [38]. Classes with few samples receive a higher weight than classes with many instances.
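A common inverse-frequency weighting scheme can be sketched as follows (an illustration; the exact scheme of [38] may differ in detail):

```python
import numpy as np

def class_weights(counts):
    """Inverse-frequency weights: each class contributes the same total
    weighted mass, so rare classes get proportionally larger weights."""
    counts = np.asarray(counts, dtype=float)
    return counts.sum() / (len(counts) * counts)

# DR training class distribution from Section 4.1
w = class_weights([25810, 2443, 5292, 873, 708])
assert w[4] > w[0]  # the rarest class gets the largest weight
```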
Considering the different alternatives for the output layer described in Section 4.2 and the different loss functions described in this section, ten different experiments were run. As mentioned before, each of these experiments was repeated ten times using the described 10-fold cross-validation scheme. These experiments can be reproduced by running the code available in our public repository.

Results
The results of the experiments described in Section 4 are presented in this section. The evaluation metrics used are the Quadratic Weighted Kappa (QWK) [33], the Correct Classification Rate (CCR) or accuracy, the Minimum Sensitivity (MS) [39] and the execution time. All the values presented in Table 2 are the mean and the standard deviation over all the executions run for each method on the test set. The experiments with softmax in the output layer are denoted as S and the experiments using the stick-breaking scheme as SB. The best result for each metric is highlighted in bold, while the second best is in italics. All the metrics must be maximised, except the execution time.

Statistical analysis
In this section, a statistical analysis has been carried out in order to obtain robust conclusions from the experimental results. Each of the metrics presented in Section 5 was analysed separately.
First, the Kolmogorov-Smirnov test for the QWK reported that the values of this metric are normally distributed. Then, an ANOVA II test [40] with the method and the dataset as factors was performed, in order to check whether these factors had any impact on the value of this metric. The parametric test reported that both factors were significant ($p$-value $< 0.001$) and that there is an interaction between them. Given that the factors considered are significant, a post-hoc Tukey's HSD test was performed [41]. The results of this test are shown in Table 3. The stick-breaking with the binomial regularised cross-entropy loss obtained the best mean results. However, there are no significant differences with S CE-B, SB CE-β and SB CE-E.
The same analysis was carried out for the CCR metric. The Kolmogorov-Smirnov test reported that the values are normally distributed, and the ANOVA II test found a significant influence of the factors considered, as well as an interaction between them. The post-hoc test results are shown in Table 4. In this case, the best methodology is the one that uses the softmax output layer combined with the beta regularisation for the cross-entropy loss. However, there are no significant differences with S, S CE-β, SB, S CE-E and S CE-B.
In the case of the MS metric, the values are also normally distributed, and there are significant differences based on the two factors considered. The post-hoc Tukey's test results are displayed in Table 5. Again, the best method was the one that uses softmax in the output layer and the beta regularised cross-entropy loss. However, there are no significant differences with SB CE-β and S CE-E.
Finally, the execution time is also normally distributed, and the ANOVA II test reported significant differences based on the factors considered. The post-hoc test (Table 6) showed that the method with the best average time is the standard softmax coupled with the cross-entropy, followed by the stick-breaking with cross-entropy. Among the methods that use regularisation, the beta regularised cross-entropy, with softmax or stick-breaking, is the one with the best time. When we analyse the results of all the metrics combined, we find that the method that uses stick-breaking with the beta regularised loss (SB CE-β) achieves the best result for QWK and CCR, and the second best for MS. Also, as mentioned before, it obtains the best time among the methods that use regularisation. These facts make this method a competitive alternative that can be applied to solve other ordinal classification problems.

Conclusions
In this work, we have proposed the application of a unimodal regularisation based on beta distributions for the cross-entropy loss. The method described improved the performance on problems where classes follow a natural order. The proposed regularisation benefits from the fact that, in ordinal regression problems, misclassifications tend to fall in adjacent classes and, consequently, slightly modifying the labels considering the ordinal scale should increase the robustness of the model in the presence of noisy targets. Thus, the main advantage of the proposed regularised loss is that it encourages the classification errors to be in the adjacent classes and minimises the number of errors in distant classes, achieving more accurate results for ordinal problems.
The distribution used to regularise the loss function has two parameters. Therefore, a method to automatically determine these parameters has been introduced. This method avoids learning them from the training data, thus improving the computational time with respect to other alternatives with free parameters to be adjusted. The parameters obtained through this method have been used for the label smoothing that has been applied as a regularisation method for the loss function and tested with three datasets and one CNN model. Even though the model used was a deep learning method, the proposal of this work is also suitable for other kinds of modelling techniques.
This regularised loss has been combined, on the one hand, with the standard softmax function in the output layer and, on the other, with the stick-breaking method. Moreover, it has also been compared with previously proposed alternatives as well as with standard nominal classification methods. The statistical tests that were carried out on the obtained results showed that the proposed method improves the performance on ordinal problems for several metrics. Also, these tests corroborated an interaction between the ordinal method and the dataset considered, which means that some methodologies are more accurate than others for some datasets. However, the stick-breaking with the beta regularised cross-entropy achieved the best global results when analysing the three datasets. Therefore, the proposed method has significantly improved the performance on the benchmark ordinal problems and, in the future, can be applied to real world problems that have an underlying ordinal structure.

Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgment
This work was supported in part by the Agencia Española de Investigación under Grant PID2020-115454GB-C22, in part by the Consejería de Economía, Conocimiento, Empresas y Universidad de la Junta de Andalucía under Grant UCO-1261651, in part by the Consejería de Transformación Económica, Industria, Conocimiento y Universidades (Junta de Andalucía) y Programa Operativo

Appendix A. Beta convergence to a normal distribution
If we subtract the mean and divide by the standard deviation, i.e., we make a change of scale and origin, the proposition can be written in a more particular form: the density of $Y = \frac{X - 1/2}{\sqrt{V[X]}}$, with $\sqrt{V[X]} = \frac{1}{2\sqrt{2b+1}}$, must converge to the standard normal density. This density is:

$$f_Y(y) = \frac{1}{2\sqrt{2b+1}} \cdot \frac{\left(\frac{1}{4} - \frac{y^2}{4(2b+1)}\right)^{b-1}}{B(b, b)}.$$

Stirling's approximation gives an approximate value for the gamma function for $n \to \infty$:

$$n! \approx \sqrt{2\pi n}\, n^n e^{-n}.$$

Therefore:

$$B(b, b) = \frac{\Gamma(b)^2}{\Gamma(2b)} \approx 2\sqrt{\frac{\pi}{b}}\, 4^{-b},$$

and:

$$f_Y(y) \approx \frac{1}{\sqrt{2\pi}} \sqrt{\frac{2b}{2b+1}} \left(1 - \frac{y^2}{2b+1}\right)^{b-1}.$$

Moreover:

$$\lim_{b \to \infty} \left(1 - \frac{y^2}{2b+1}\right)^{b-1} = e^{-y^2/2}.$$

Thus, this limit converges pointwise to the probability density function of a standard normal random variable, $g_Y(y)$, when $b \to \infty$. So, by Scheffé's theorem [42], the distribution of $Y$ converges to the standard normal distribution.