Rank consistent ordinal regression for neural networks with application to age estimation

In many real-world prediction tasks, class labels include information about the relative ordering between labels, which is not captured by commonly-used loss functions such as multi-category cross-entropy. Recently, the deep learning community adopted ordinal regression frameworks to take such ordering information into account. Neural networks were equipped with ordinal regression capabilities by transforming ordinal targets into binary classification subtasks. However, this method suffers from inconsistencies among the different binary classifiers. To resolve these inconsistencies, we propose the COnsistent RAnk Logits (CORAL) framework with strong theoretical guarantees for rank-monotonicity and consistent confidence scores. Moreover, the proposed method is architecture-agnostic and can extend arbitrary state-of-the-art deep neural network classifiers for ordinal regression tasks. The empirical evaluation of the proposed rank-consistent method on a range of face-image datasets for age prediction shows a substantial reduction of the prediction error compared to the reference ordinal regression network.


Introduction
Ordinal regression (also called ordinal classification) describes the task of predicting labels on an ordinal scale. Here, a ranking rule or classifier h maps each object x_i ∈ X into an ordered set, h : X → Y, where Y = {r_1 ≺ ... ≺ r_K}. In contrast to classification, the labels provide enough information to order objects. However, as opposed to metric regression, the difference between label values is arbitrary.
While the field of machine learning has developed many powerful algorithms for predictive modeling, most algorithms have been designed for classification tasks. The extended binary classification approach proposed by Li and Lin (2007) forms the basis of many ordinal regression implementations. However, neural network-based implementations of this approach commonly suffer from classifier inconsistencies among the binary rankings (Niu et al., 2016). This inconsistency problem among the predictions of individual binary classifiers is illustrated in Figure 1. We propose a new method and theorem for guaranteed classifier consistency that can easily be implemented in various neural network architectures. Furthermore, along with the theoretical rank-consistency guarantees, this paper presents an empirical analysis of our approach on challenging real-world datasets for predicting the age of individuals from face images using convolutional neural networks (CNNs). Aging can be regarded as a non-stationary process since age progression effects appear differently depending on the person's age. During childhood, facial aging is primarily associated with changes in the shape of the face, whereas aging during adulthood is defined mainly by changes in skin texture (Ramanathan et al., 2009; Niu et al., 2016). Based on this assumption, age prediction can be modeled using ordinal regression-based approaches (Yang et al., 2010; Chang et al., 2011; Cao et al., 2012; Li et al., 2012).
The main contributions of this paper are as follows: 1. The consistent rank logits (CORAL) framework for ordinal regression with theoretical guarantees for classifier consistency; 2. Implementation of CORAL to adapt common CNN architectures, such as ResNet (He et al., 2016), for ordinal regression; 3. Experiments on different age estimation datasets showing that CORAL's guaranteed binary classifier consistency improves predictive performance compared to the reference framework for ordinal regression. Note that this work focuses on age estimation to study the proposed method's efficacy for ordinal regression. However, the proposed technique can be used for other ordinal regression problems, such as crowd-counting, depth estimation, biological cell counting, customer satisfaction, and others.

Ordinal regression and ranking
Several multivariate extensions of generalized linear models have been developed for ordinal regression in the past, including the popular proportional odds and proportional hazards models (McCullagh, 1980). Moreover, the machine learning field developed ordinal regression models based on extensions of well-studied classification algorithms, by reformulating the problem to utilize multiple binary classification tasks (Baccianella et al., 2009). Early work in this regard includes the use of perceptrons (Crammer and Singer, 2002;Shen and Joshi, 2005) and support vector machines (Herbrich et al., 1999;Shashua and Levin, 2003;Rajaram et al., 2003;Chu and Keerthi, 2005). Li and Lin (2007) proposed a general reduction framework that unified the view of a number of these existing algorithms.

Ordinal regression CNN
While earlier works on using CNNs for ordinal targets have employed conventional classification approaches (Levi and Hassner, 2015; Rothe et al., 2015), the general reduction framework from ordinal regression to binary classification by Li and Lin (2007) was recently adopted by Niu et al. (2016) as Ordinal Regression CNN (OR-CNN). In the OR-CNN approach, an ordinal regression problem with K ranks is transformed into K − 1 binary classification problems, with the k-th task predicting whether the age label of a face image exceeds rank r_k, k = 1, ..., K − 1. All K − 1 tasks share the same intermediate layers but are assigned distinct weight parameters in the output layer.
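The reduction step used by OR-CNN (and, later, by CORAL) can be sketched in a few lines of Python; the function name and 1-based rank indexing below are illustrative choices, not part of the original implementation:

```python
def extend_labels(rank_index, num_ranks):
    """Encode a rank r_k (given by its 1-based index) as K - 1 binary
    labels, where the k-th label indicates whether the rank exceeds r_k."""
    return [1 if rank_index > k else 0 for k in range(1, num_ranks)]

# Example with K = 5 ranks: rank r_3 becomes [1, 1, 0, 0]
print(extend_labels(3, 5))
```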
While the OR-CNN was able to achieve state-of-the-art performance on benchmark datasets, it does not guarantee consistent predictions, such that predictions for individual binary tasks may disagree. For example, in an age estimation setting, it would be contradictory if the k-th binary task predicted that the age of a person was more than 30, but a previous task predicted the person's age was less than 20. This inconsistency could be suboptimal when the K − 1 task predictions are combined to obtain the estimated age. Niu et al. (2016) acknowledged the classifier inconsistency as not being ideal and also noted that ensuring the K − 1 binary classifiers are consistent would increase the training complexity substantially (Niu et al., 2016). The CORAL method proposed in this paper addresses both these issues with a theoretical guarantee for classifier consistency and without increasing the training complexity. Chen et al. (2017) proposed a modification of the OR-CNN (Niu et al., 2016), known as Ranking-CNN, that uses an ensemble of CNNs for binary classifications and aggregates the predictions to estimate the age label of a given face image. The researchers showed that training an ensemble of CNNs improves the predictive performance over a single CNN with multiple binary outputs (Chen et al., 2017), which is consistent with the well-known fact that an ensemble model can achieve better generalization performance than each individual classifier in the ensemble (Raschka and Mirjalili, 2019).

Other CNN architectures for age estimation
Recent research has also shown that training a multi-task CNN that shares lower-layer parameters for various face analysis tasks (face detection, gender prediction, age estimation, etc.) can improve the overall performance across different tasks compared to a single-task CNN (Ranjan et al., 2017).
Another approach for utilizing binary classifiers for ordinal regression is the siamese CNN architecture proposed by Polania et al. (2019), which computes the rank from pair-wise comparisons between the input image and multiple, carefully selected anchor images.

Proposed method
This section describes our proposed CORAL framework that addresses the problem of classifier inconsistency in the OR-CNN by Niu et al. (2016), which is based on multiple binary classification tasks for ranking.

Preliminaries
Let D = {(x_i, y_i)}, i = 1, ..., N, be the training dataset consisting of N training examples. Here, x_i ∈ X denotes the i-th training example and y_i the corresponding rank, where y_i ∈ Y = {r_1, r_2, ..., r_K} with ordered ranks r_K ≻ r_{K−1} ≻ ... ≻ r_1. The ordinal regression task is to find a ranking rule h : X → Y such that a loss function L(h) is minimized.
Let C be a K×K cost matrix, where C_{y,r_k} is the cost of predicting an example (x, y) as rank r_k (Li and Lin, 2007). Typically, C_{y,y} = 0 and C_{y,r_k} > 0 for y ≠ r_k. In ordinal regression, we generally prefer each row of the cost matrix to be V-shaped, that is, C_{y,r_{k−1}} ≥ C_{y,r_k} if r_k ≤ y and C_{y,r_k} ≤ C_{y,r_{k+1}} if r_k ≥ y. The classification cost matrix has entries C_{y,r_k} = 1{y ≠ r_k} that do not consider ordering information. In ordinal regression, where the ranks are treated as numerical values, the absolute cost matrix is commonly defined by C_{y,r_k} = |y − r_k|.
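As a small illustration, the two cost matrices and the V-shape condition can be written out directly; the helper names are hypothetical and the ranks are treated as 0-based numerical indices:

```python
def classification_cost(K):
    # C[y][k] = 1 if the predicted rank differs from the true rank, else 0
    return [[0 if y == k else 1 for k in range(K)] for y in range(K)]

def absolute_cost(K):
    # C[y][k] = |y - k| when ranks are treated as numerical values
    return [[abs(y - k) for k in range(K)] for y in range(K)]

def is_v_shaped(C):
    # Each row must be non-increasing up to the true rank y
    # and non-decreasing after it.
    for y, row in enumerate(C):
        left_ok = all(row[k - 1] >= row[k] for k in range(1, y + 1))
        right_ok = all(row[k] <= row[k + 1] for k in range(y, len(row) - 1))
        if not (left_ok and right_ok):
            return False
    return True

print(absolute_cost(5)[2])  # row for y = 2 -> [2, 1, 0, 1, 2]
```

Both the classification and the absolute cost matrices satisfy the V-shape condition, but only the absolute cost matrix encodes how far a misprediction is from the true rank.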
Li and Lin (2007) proposed a general reduction framework for extending an ordinal regression problem into several binary classification problems. This framework requires a cost matrix that is convex in each row (C_{y,r_{k+1}} − C_{y,r_k} ≥ C_{y,r_k} − C_{y,r_{k−1}} for each y) to obtain a rank-monotonic threshold model. Since the cost-related weighting of each binary task is specific to each training example, this approach is considered infeasible in practice due to its high training complexity (Niu et al., 2016).
Our proposed CORAL framework requires neither a cost matrix with convex-row conditions nor explicit weighting terms that depend on each training example to obtain a rank-monotonic threshold model and produce consistent predictions for each binary task.

Ordinal regression with a consistent rank logits model
In this section, we describe our proposed consistent rank logits (CORAL) framework for ordinal regression. Subsection 3.2.1 describes the label extension into binary tasks used for rank prediction. The loss function of the CORAL framework is described in Subsection 3.2.2. In Subsection 3.2.3, we prove the theorem for rank consistency among the binary classification tasks, which guarantees that the binary tasks produce consistently ranked predictions.

Label extension and rank prediction
Given a training dataset D, a rank label y_i is first extended into K − 1 binary labels y_i^{(1)}, ..., y_i^{(K−1)}, where y_i^{(k)} ∈ {0, 1} indicates whether y_i exceeds rank r_k, that is, y_i^{(k)} = 1{y_i ≻ r_k}. The indicator function 1{·} is 1 if the inner condition is true and 0 otherwise. Using the extended binary labels during model training, we train a single CNN with K − 1 binary classifiers in the output layer, which is illustrated in Figure 2. Based on the binary task responses, the predicted rank label for an input x_i is obtained via h(x_i) = r_q. The rank index q is given by

q = 1 + Σ_{k=1}^{K−1} f_k(x_i), (1)

where f_k(x_i) ∈ {0, 1} is the prediction of the k-th binary classifier in the output layer. We require that {f_k}_{k=1}^{K−1} reflect the ordinal information and are rank-monotonic, f_1(x_i) ≥ f_2(x_i) ≥ ... ≥ f_{K−1}(x_i), which guarantees consistent predictions. To achieve rank-monotonicity and guarantee binary classifier consistency (Theorem 1), the K − 1 binary tasks share the same weight parameters but have independent bias units (Figure 2).
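The rank prediction rule h(x_i) = r_q can be sketched in one line; the function name is illustrative:

```python
def predict_rank_index(binary_preds):
    """Rank index q = 1 + sum_k f_k(x_i) over the K - 1 binary task
    outputs; with rank-monotonic predictions this equals one plus the
    position of the last '1' in the sequence."""
    return 1 + sum(binary_preds)

# A rank-monotonic output [1, 1, 0, 0] maps to rank index q = 3
print(predict_rank_index([1, 1, 0, 0]))
```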

Loss function
Let W denote the weight parameters of the neural network excluding the bias units of the final layer. The penultimate layer, whose output is denoted as g(x_i, W), shares a single weight with all nodes in the final output layer; K − 1 independent bias units are then added to g(x_i, W) such that {g(x_i, W) + b_k}_{k=1}^{K−1} are the inputs to the corresponding binary classifiers in the final layer. Let

σ(z) = 1 / (1 + exp(−z)) (2)

be the logistic sigmoid function. The predicted empirical probability for task k is defined as

P̂(y_i^{(k)} = 1) = σ(g(x_i, W) + b_k). (3)

For model training, we minimize the loss function

L(W, b) = − Σ_{i=1}^{N} Σ_{k=1}^{K−1} λ^{(k)} [ log(σ(g(x_i, W) + b_k)) y_i^{(k)} + log(1 − σ(g(x_i, W) + b_k)) (1 − y_i^{(k)}) ], (4)

which is the weighted cross-entropy of K − 1 binary classifiers. For rank prediction (Eq. 1), the binary labels are obtained via

f_k(x_i) = 1{P̂(y_i^{(k)} = 1) > 0.5}. (5)

In Eq. 4, λ^{(k)} denotes the weight of the loss associated with the k-th classifier (assuming λ^{(k)} > 0). In the remainder of the paper, we refer to λ^{(k)} as the importance parameter for task k. Some tasks may be less robust or harder to optimize, which can be considered by choosing a non-uniform task weighting scheme. For simplicity, we carried out all experiments with uniform task weighting, that is, ∀k : λ^{(k)} = 1. In the next section, we provide the theoretical guarantee for classifier consistency under uniform and non-uniform task importance weighting, given that the task importance weights are positive numbers.
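The loss in Eq. 4 can be sketched in plain Python for a single training example; this is a didactic scalar version with hypothetical function names, not the vectorized PyTorch implementation used in the experiments:

```python
import math

def sigmoid(z):
    # Logistic sigmoid (Eq. 2)
    return 1.0 / (1.0 + math.exp(-z))

def coral_loss(g, biases, extended_labels, task_weights=None):
    """Weighted cross-entropy over K - 1 binary tasks (Eq. 4) for one example.
    g: scalar output of the penultimate layer, g(x_i, W)
    biases: the K - 1 independent bias units b_k
    extended_labels: the K - 1 binary labels y_i^{(k)}"""
    if task_weights is None:
        task_weights = [1.0] * len(biases)  # uniform weighting, λ^{(k)} = 1
    loss = 0.0
    for b, y, lam in zip(biases, extended_labels, task_weights):
        p = sigmoid(g + b)  # predicted probability for task k (Eq. 3)
        loss -= lam * (y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return loss
```

Note that all tasks receive the same logit g and differ only in their bias b_k, mirroring the weight-sharing constraint of the CORAL output layer.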

Theoretical guarantees for classifier consistency
The following theorem shows that by minimizing the loss L (Eq. 4), the learned bias units of the output layer are non-increasing, such that b_1 ≥ b_2 ≥ ... ≥ b_{K−1}. Consequently, the predicted confidence scores or probability estimates of the K − 1 tasks are decreasing, that is, P̂(y_i^{(1)} = 1) ≥ P̂(y_i^{(2)} = 1) ≥ ... ≥ P̂(y_i^{(K−1)} = 1) for all i, ensuring classifier consistency. Consequently, {f_k}_{k=1}^{K−1} (Eq. 5) are also rank-monotonic.
This intuition can be illustrated with the simplest possible case: K − 1 binary logistic regression classifiers p_i = σ(wx + b_i) with a single feature x. If the weight w is not shared across the K − 1 equations, the S-shaped curves of the probability scores p_i can intersect, making the p_i's non-monotone at some given input x. Only if w is shared across the K − 1 equations are the S-shaped curves horizontally shifted without intersecting.
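This intuition can also be checked numerically; the particular weights and biases below are arbitrary illustrative values:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Shared weight w with ordered biases: the curves are horizontal shifts
# of one another and preserve the same ordering at every input x.
w, biases = 1.5, [2.0, 0.5, -1.0]
for x in [-5.0, -1.0, 0.0, 1.0, 5.0]:
    probs = [sigmoid(w * x + b) for b in biases]
    assert probs == sorted(probs, reverse=True)  # p_1 >= p_2 >= p_3

# Independent weights: the S-curves can intersect, so the task with the
# larger bias is not guaranteed the larger probability everywhere.
x = 2.0
p_small_w = sigmoid(0.5 * x + 1.0)  # larger bias, smaller weight
p_large_w = sigmoid(3.0 * x + 0.0)  # smaller bias, larger weight
assert p_large_w > p_small_w
```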
Theorem 1 (Ordered bias units). By minimizing the loss function defined in Eq. 4, the optimal solution (W*, b*) satisfies b_1* ≥ b_2* ≥ ... ≥ b_{K−1}*.

Proof. Suppose (W, b) is an optimal solution and b_k < b_{k+1} for some k. Claim: replacing b_k with b_{k+1}, or replacing b_{k+1} with b_k, decreases the objective value L. Let

A_1 = {n : y_n^{(k)} = y_n^{(k+1)} = 1}, A_2 = {n : y_n^{(k)} = y_n^{(k+1)} = 0}, A_3 = {n : y_n^{(k)} = 1, y_n^{(k+1)} = 0}.

By the ordering relationship, we have A_1 ∪ A_2 ∪ A_3 = {1, 2, ..., N}. Denote p_n(b_k) = σ(g(x_n, W) + b_k) and

δ_n = log(p_n(b_{k+1})) − log(p_n(b_k)), δ'_n = log(1 − p_n(b_k)) − log(1 − p_n(b_{k+1})).

Since p_n(b_k) is increasing in b_k, we have δ_n > 0 and δ'_n > 0. If we replace b_k with b_{k+1}, the loss terms related to the k-th task are updated. The change of loss L (Eq. 4) is given as

∆_1 L = λ^{(k)} ( − Σ_{n ∈ A_1} δ_n + Σ_{n ∈ A_2} δ'_n − Σ_{n ∈ A_3} δ_n ).

Accordingly, if we replace b_{k+1} with b_k, the change of L is given as

∆_2 L = λ^{(k+1)} ( Σ_{n ∈ A_1} δ_n − Σ_{n ∈ A_2} δ'_n − Σ_{n ∈ A_3} δ'_n ).

By adding (1/λ^{(k)}) ∆_1 L + (1/λ^{(k+1)}) ∆_2 L = − Σ_{n ∈ A_3} (δ_n + δ'_n) < 0, we know that either ∆_1 L < 0 or ∆_2 L < 0. Thus, our claim is justified. We conclude that any optimal solution (W*, b*) that minimizes L satisfies b_1* ≥ b_2* ≥ ... ≥ b_{K−1}*.

Note that the theorem for rank-monotonicity proposed by Li and Lin (2007), in contrast to Theorem 1, requires a cost matrix C with each row being convex. Under this convexity condition, let λ_{y_n}^{(k)} = |C_{y_n, r_k} − C_{y_n, r_{k+1}}| be the weight of the loss associated with the k-th task on the n-th training example, which depends on the label y_n. Li and Lin (2007) proved that by using training example-specific task weights λ_{y_n}^{(k)}, the optimal thresholds are ordered; Niu et al. (2016) noted that example-specific task weights are infeasible in practice. Moreover, this assumption requires that λ_{y_n}^{(k)} ≥ λ_{y_n}^{(k+1)} when r_{k+1} ≺ y_n and λ_{y_n}^{(k)} ≤ λ_{y_n}^{(k+1)} when r_{k+1} ≻ y_n. Theorem 1 is free from this requirement and allows us to choose a fixed weight for each task that does not depend on the individual training examples, which greatly reduces the training complexity. Also, Theorem 1 allows for choosing either a simple uniform task weighting or taking dataset imbalances into account under the guarantee of rank-monotonic predicted probabilities and consistent task predictions. Under Theorem 1, the only requirement for guaranteeing rank monotonicity is that the task weights are positive.

Datasets and preprocessing
The MORPH-2 dataset (Ricanek and Tesafaye, 2006), containing 55,608 face images, was downloaded from https://www.faceaginggroup.com/morph/ and preprocessed by locating the average eye-position in the respective dataset using facial landmark detection (Sagonas et al., 2016) and then aligning each image in the dataset to the average eye position using the EyepadAlign function in MLxtend v0.14 (Raschka, 2018). The faces were then re-aligned such that the tip of the nose was located in the center of each image. The age labels used in this study were in the range of 16-70 years.
The CACD dataset (Chen et al., 2014) was downloaded from http://bcsiriuschen.github.io/CARC/ and preprocessed similarly to MORPH-2 such that the faces spanned the whole image with the nose tip at the center. The total number of images is 159,449 in the age range of 14-62 years.
The Asian Face Database (AFAD) by Niu et al. (2016) was obtained from https://github.com/afad-dataset/tarball. The AFAD database used in this study contained 165,501 faces in the age range of 15-40 years. Since the faces were already centered, no further preprocessing was required.
Following the procedure described in Niu et al. (2016), each image database was randomly divided into 80% training data and 20% test data. All images were resized to 128×128×3 pixels and then randomly cropped to 120×120×3 pixels to augment the model training. During model evaluation, the 128×128×3 RGB face images were center-cropped to a model input size of 120×120×3.
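The center-crop geometry used during evaluation can be sketched as follows; the helper is an illustrative stand-in for the crop operation of an image library, not code from the paper's repository:

```python
def center_crop_box(src_size, target_size):
    """Return the (top, left, bottom, right) box for a center crop,
    e.g. cropping a 128x128 image to the 120x120 model input size."""
    offset = (src_size - target_size) // 2
    return (offset, offset, offset + target_size, offset + target_size)

# A 128-pixel side center-cropped to 120 pixels keeps rows/cols 4..123
print(center_crop_box(128, 120))
```

During training, the crop offset would instead be drawn at random from [0, src_size − target_size] for each image, which is the augmentation described above.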
We share the training and test partitions for all datasets, along with all preprocessing code used in this paper in the code repository (Section 4.4).

Neural network architectures
To evaluate the performance of CORAL for age estimation from face images, we chose the ResNet-34 architecture (He et al., 2016), a modern CNN architecture that achieves good performance on a variety of image classification tasks. For the remainder of this paper, we refer to the original ResNet-34 CNN with standard cross-entropy loss as CE-CNN. To implement a ResNet-34 CNN for ordinal regression using the proposed CORAL method, we replaced the last output layer with the corresponding binary tasks (Figure 2) and refer to this implementation as CORAL-CNN. Similar to CORAL-CNN, we modified the output layer of ResNet-34 to implement the ordinal regression reference approach described in Niu et al. (2016); we refer to this architecture as OR-CNN.

Training and evaluation
For model evaluation and comparison, we computed the mean absolute error (MAE) and root mean squared error (RMSE) on the test set after the last training epoch:

MAE = (1/N) Σ_{i=1}^{N} |y_i − h(x_i)|, RMSE = sqrt( (1/N) Σ_{i=1}^{N} (y_i − h(x_i))² ),

where y_i is the ground truth rank of the i-th test example and h(x_i) is the predicted rank. The model training was repeated three times with different random seeds (0, 1, and 2) for model weight initialization, while the random seeds were consistent between the different methods to allow fair comparisons. Since this study focuses on investigating rank consistency, an extensive comparison between optimization algorithms is beyond the scope of this article; all CNNs were trained for 200 epochs with stochastic gradient descent via adaptive moment estimation (Kingma and Ba, 2015) using exponential decay rates β_1 = 0.90 and β_2 = 0.99 (default settings) and a batch size of 256. To avoid introducing empirical bias by designing our own CNN architecture for comparing the ordinal regression approaches, we adopted a standard architecture (ResNet-34 (He et al., 2016); Section 4.2) for this comparison. Moreover, we chose a uniform task weighting for the cross-entropy of K − 1 binary classifiers in CORAL-CNN, that is, we set ∀k : λ^{(k)} = 1 in Eq. 4.
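The two evaluation metrics can be sketched in a few lines (hypothetical function names), treating ranks as integers:

```python
import math

def mae(y_true, y_pred):
    # Mean absolute error between ground truth ranks and predicted ranks
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    # Root mean squared error; penalizes large rank errors more than MAE
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))
```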
The learning rate was determined by hyperparameter tuning on the validation set. For the various losses (cross-entropy, ordinal regression CNN (Niu et al., 2016), and the proposed CORAL method), we found that a learning rate of α = 5 × 10^−5 performed best across all models, which is likely due to using the same base architecture (ResNet-34). All models were trained for 200 epochs; the best model was selected via its MAE performance on the validation set and then evaluated on the independent test set, from which the reported MAE and RMSE performance values were obtained. In addition, we recorded the best test set performance within the 200 training epochs and provide the complete training logs in the source code repository (Section 4.4).

Hardware and software
All loss functions and neural network models were implemented in PyTorch 1.5 (Paszke et al., 2019) and trained on NVIDIA GeForce RTX 2080Ti and Titan V graphics cards. The source code is available at https://github.com/Raschka-research-group/coral-cnn.

Results and discussion
We conducted a series of experiments on three independent face image datasets for age estimation (Section 4.1) to compare the proposed CORAL method (CORAL-CNN) with the ordinal regression approach proposed by Niu et al. (2016) (OR-CNN). All implementations were based on the ResNet-34 architecture, as described in Section 4.2. We include the standard ResNet-34 classification network with cross-entropy loss (CE-CNN) as a performance baseline. As summarized in Table 1, the proposed rank-consistent CORAL method shows a substantial performance improvement over OR-CNN (Niu et al., 2016), which does not guarantee classifier consistency.
Moreover, we repeated each experiment three times using different random seeds for model weight initialization and dataset shuffling to ensure that the observed performance improvement of CORAL-CNN over OR-CNN is reproducible and not coincidental. We can conclude that guaranteed classifier consistency via CORAL has a noticeable positive effect on the predictive performance of an ordinal regression CNN (a more detailed analysis of the OR-CNN's rank inconsistency is provided in Section 5.2).
For all methods (CE-CNN, CORAL-CNN, and OR-CNN), the overall performance on the different datasets appeared in the following order: MORPH-2 > AFAD > CACD (Table 1). A possible explanation is that MORPH-2 has the best overall image quality, and the photos were taken under relatively consistent lighting conditions and viewing angles. For instance, we found that AFAD includes images with very low resolutions (for example, 20×20 pixels). CACD also contains some lower-quality images. Because CACD has approximately the same size as AFAD, the overall lower performance achieved on this dataset may also be explained by the wider age range that needs to be considered (CACD: 14-62 years, AFAD: 15-40 years).

Empirical rank inconsistency analysis
By design, our proposed CORAL guarantees rank consistency (Theorem 1). In addition, we analyzed the rank inconsistency empirically for both CORAL-CNN and OR-CNN (an example of rank inconsistency is shown in Figure 3). Table 2 summarizes the average numbers of rank inconsistencies for the OR-CNN and CORAL-CNN models on each test dataset. As expected, CORAL-CNN has 0 rank inconsistencies. When comparing the average numbers of rank inconsistencies considering only those cases where OR-CNN predicted the age correctly versus incorrectly, the average number of inconsistencies is higher when OR-CNN makes wrong predictions. This observation can be seen as evidence that rank inconsistency harms predictive performance. Consequently, this finding suggests that addressing rank inconsistency via CORAL is beneficial for the predictive performance of ordinal regression CNNs.
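One plausible way to count rank inconsistencies in a vector of K − 1 binary task outputs is to count positions where a later task predicts 1 after an earlier task predicted 0; this adjacent-transition counting scheme is our illustrative assumption, not necessarily the exact rule used for Table 2:

```python
def count_inconsistencies(binary_preds):
    """Count adjacent 0 -> 1 transitions in the K - 1 binary task outputs,
    i.e. cases where task k+1 predicts 'age > r_{k+1}' although task k
    already predicted 'age <= r_k'. Rank-monotonic outputs yield 0."""
    return sum(1 for a, b in zip(binary_preds, binary_preds[1:]) if a < b)

# A rank-monotonic output has no inconsistencies
print(count_inconsistencies([1, 1, 0, 0]))
```

By Theorem 1, CORAL-CNN outputs always score 0 under any such counting scheme, whereas OR-CNN outputs can contain one or more 0 → 1 transitions.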

Conclusions
In this paper, we developed the CORAL framework for ordinal regression via extended binary classification with theoretical guarantees for classifier consistency. Moreover, we proved classifier consistency without requiring rank- or training label-dependent weighting schemes, which permits straightforward implementations and efficient model training. CORAL can be readily implemented to extend common CNN architectures for ordinal regression tasks. The experimental results showed that the CORAL framework substantially improved the predictive performance of CNNs for age estimation on three independent age estimation datasets. Our method can be readily generalized to other ordinal regression problems and different types of neural network architectures, including multilayer perceptrons and recurrent neural networks.

Generalization Bounds
Based on well-known generalization bounds for binary classification, we can derive new generalization bounds for our ordinal regression approach that apply to a wide range of practical scenarios, as we only require C_{y,r_k} = 0 if r_k = y and C_{y,r_k} > 0 if r_k ≠ y. Moreover, Theorem 2 shows that if each binary classification task in our model generalizes well in terms of the standard 0/1 loss, the final rank prediction via h (Eq. 1) also generalizes well.
By taking the expectation on both sides with (x, y) ∼ P, we arrive at Eq. (8).
In Li and Lin (2007), by assuming the cost matrix to have V-shaped rows, the researchers derived generalization bounds by constructing a discrete distribution on {1, 2, . . . , K − 1} conditional on each y, given that the binary classifications are rank-monotonic or every row of C is convex. However, the only case they provided for the existence of rank-monotonic binary classifiers was the ordered threshold model, which requires a cost matrix with convex rows and example-specific task weights. In other words, when the cost matrix is only V-shaped but does not meet the convex-row condition, for instance, C_{y,r_k} − C_{y,r_{k−1}} > C_{y,r_{k+1}} − C_{y,r_k} > 0 for some r_k ≻ y, the method proposed in Li and Lin (2007) does not provide a practical way to bound the generalization error. Consequently, our result does not rely on cost matrices with V-shaped or convex rows and can be applied to a broader variety of real-world use cases.