An ordinal CNN approach for the assessment of neurological damage in Parkinson's disease patients

3D image scans are an assessment tool for neurological damage in Parkinson's disease (PD) patients. This diagnosis process can be automatized to help medical staff through Decision Support Systems (DSSs), and Convolutional Neural Networks (CNNs) are good candidates, because they are effective when applied to spatial data. This paper proposes a 3D CNN ordinal model for assessing the level or neurological damage in PD patients. Given that CNNs need large datasets to achieve acceptable performance, a data augmentation method is adapted to work with spatial data. We consider the Ordinal Graph-based Oversampling via Shortest Paths (OGO-SP) method, which applies a gamma probability distribution for inter-class data generation. A modification of OGO-SP is proposed, the OGO-SP-$\beta$ algorithm, which applies the beta distribution for generating synthetic samples in the inter-class region, a better suited distribution when compared to gamma. The evaluation of the different methods is based on a novel 3D image dataset provided by the Hospital Universitario 'Reina Sof\'ia' (C\'ordoba, Spain). We show how the ordinal methodology improves the performance with respect to the nominal one, and how OGO-SP-$\beta$ yields better performance than OGO-SP.


Introduction
In the medical field, plenty of data is obtained from patients in order to give an accurate diagnosis and decide on a beneficial treatment. From simple scalar to fully structured spatiotemporal data, like electrocardiograms, projectional radiographs, and computed tomography scans, all of these sources aid medical staff look for abnormalities and signs of degradation or damage. Nonetheless, these examinations are time-consuming and costly for hospitals and other medical administrations. A great deal of expertise is needed in order to interpret all this data into useful information.

Machine Learning
Machine Learning (ML) techniques can automatize some of these tasks by leveraging on existing data. They can serve as the core to a Decision Support System (DSS) in order to help medical staff make a better judgment or give an additional opinion.
In the case of structured data, Convolutional Neural Networks (CNNs) have shown excellent performance in tasks such as classification, segmentation or regression. These models rely on large amounts of labeled data in order to learn. Examples of such tasks can be diagnosis (classification) (El-Dahshan et al., 2014;Mazaheri & Khodadadi, 2020), automatic identification of anatomical regions (segmentation) (Peng et al., 2020) or landmark location (regression) (Payer et al., 2019).

Parkinson's disease
One of many medical applications of ML explored in the literature is Parkinson's disease (PD) diagnosis. PD is a degenerative nervous system disorder affecting the brain whose symptoms are primarily motor-related: shaking, gait disturbances, slowness and difficulty walking. Other symptoms are related to sleeping, emotional or sensory problems. The cost on society of this disease grows as the symptoms worsen, as the greatest component of cost is in patient care and nursing home costs. Just in the UK, the total cost has been estimated to be between £445 million and £3.3 billion annually (Findley, 2007).
Assessing the severity of the neurological damage of PD patients is a crucial part for the correct treatment, as an unnecessarily high dose of levodopa (the most common medication for PD) may worsen symptoms in the longterm (Tomlinson et al., 2010). To evaluate this, doctors rely both on observations of motor capabilities (Marino et al., 2012) as well as imaging techniques like Magnetic Resonance Imaging (MRI) (Marino et al., 2012) and nuclear tomography like Single-Photon Emission Computerized Tomography (SPECT) or Positron Emission Tomography (PET) (Arbizu et al., 2014). Ioflupane ( 123 I) (known commercially as "DaTscan") is a neuro-imaging drug often used to evaluate the dopaminergic activity in the nigrostriatal dopaminergic pathway when the disease may be in the early stages (Darcourt et al., 2010). It is injected into the patient's bloodstream before taking a SPECT image, so that the brain's dopaminergic activity can be inspected visually. The evaluation of this damage may require a great deal of experience.
In recent years there has been a growing interest on the application of ML techniques to this kind of images, which require no previous assumptions on the regions of interest or relevant areas for the given task. Instead, these methods are able to discover for themselves where and what to look for in images, based solely on data previously labeled by doctors. Popular methods tackle binary ("healthy control" or "disease") (Vinícius dos Santos Ferreira et al., 2018) or nominal ("healthy control", "disease A", "disease B", ...) classification (Akyol, 2020).
Some datasets related to PD are available online for research use, such as the Parkinson's Progression Markers Initiative (PPMI) 1 or the LRRK2 Cohort Consortium (LCC) 2 .

Ordinal classification
In the last decade, the concept of "ordinal classification" (sometimes referred as "ordinal regression") has been popularized as a way to exploit extra informa-tion in a classification problem where a natural ordering of the classes is present (Gutiérrez et al., 2016;Ben-David, 2008). This approach has been proven to outperform the classic nominal perspective in medical applications such as melanoma diagnosis (Sánchez-Monedero et al., 2018) and liver transplantation (Dorado-Moreno et al., 2017). Up until now, research into this methodology applied to PD diagnosis has not been addressed, because existing scales, such as the Hoehn and Yahr scale (Hoehn & Yahr, 1967) or the Unified Parkinson's Disease Rating Scale (UPDRS), require subjective evaluation from clinicians of tremor, rigidity or movement (Ramaker et al., 2002) and are difficult to quantify in a single measure.
A complete taxonomy of ordinal classification methods can be found in Gutiérrez et al. (2016). The most naive methods perform a simple regression using the class labels and then round the values when predicting (Kramer et al., 2010) or simply apply a label distance cost penalty to a classical nominal classification method (Kotsiantis & Pintelas, 2004). These basic approaches lack an understanding of the underlying label distance, as different misclassifications may represent a different error cost.
Threshold Models are another popular approach for this task. An underlying latent continuous variable is assumed to exist, from which the different ranks arise by assigning certain thresholds. Thus, in this framework, both the value of the latent variable and the thresholds need to be learned from the data. Some approaches, like the classical Proportional Odds Model (POM) (McCullagh, 1980) or the more recent gologit model (Williams, 2006) fall into the Cumulative Link Models (CLMs) framework, a probabilistic method for predicting probabilities of groups of contiguous categories, taking the ordinal scale into account.
Other ordinal approaches consist on decomposing the ordinal problem into a set of binary problems (called Ordinal Binary Decomposition (OBD)). Sometimes these decompositions are solved by a set of different models, like in the cascade linear utility model (Wu et al., 2003). In other cases they are modeled by several outputs of the same underlying model (Jianlin Cheng et al., 2008). All OBDs present the same challenge: combining the results of all decompositions into a single final classification. The simplest approaches assign the first class to reach a certain threshold (Wu et al., 2003), but this can lead to an imbalance as the last classes are more difficult to select. Error Correcting Codes (ECOCs) are better suited to this task, as this approach considers all outputs equally in the final decision (Allwein et al., 2001).
The available PD datasets mentioned previously only provide label information of binary, nominal or continuous nature. Our study presents a novel dataset, collected by the Clinical Management Unit of Nuclear Medicine of the Hospital Universitario "Reina Sofía" (Córdoba, Spain), containing SPECT Ioflupane ( 123 I) images of PD patients classified in 4 distinct ordinal labels, according to their stage of presynaptic dopamine binding damage ("no alteration", "slight alteration", "more advanced alteration" and "severe alteration"). Ordinal methods are perfect candidates to tackle the task of predicting these labels, which serve as a better indicator for the medical decision process.

Data augmentation
Classification tasks may suffer from unbalanced data, especially in the medical field, as healthy patients are much more common than sick patients. Also, data gathering and proper labeling is expensive and time consuming. In this situation, data augmentation techniques are required in order to boost the performance of ML models.
The most basic strategy for augmenting spatial data, like medical images, is image translation, rotation, flipping and cropping (Perez & Wang, 2017). Several of these can be used depending on the specific task to learn. For example, object detection tasks, such as anomaly or lesion detection, can be accelerated by using cropped Regions of Interest (ROIs) as samples (Rey et al., 2020).
Classic techniques like Synthetic Minority Oversampling (SMOTE) (Chawla et al., 2002;Rivera & Xanthopoulos, 2016) perform well on low dimensional data. Some techniques, such as Autoencoders (Hinton & Zemel, 1993) or Generative Adversarial Networks (GANs) (Goodfellow et al., 2014) are able to leverage convolutional operations, which improve the performance and efficiency over spatial data. Those techniques, however, require a vast amount of training examples in order to avoid pitfalls like mode collapse.
In order to meet this data volume requirement, more sophisticated data augmentation methods are being applied to medical data as of recently. As an example, Salazar et al. (2021) combine GANs with Markov Random Field models to augment 3D functional MRI multi-subject data and enhance nominal classification performance.
There also exist data augmentation methodologies that make use of ordinal information in the class labels to improve the synthetic data generation process. A family of such methods, presented by Pérez-Ortiz et al. (2015), are the Ordinal Graph-based Oversampling (OGO) methods. These consist on computing a graph estimating the latent manifold structure in the data by exploiting ordinal information in the labels, and then using the edges on that graph to generate samples, similar to SMOTE.
Classic proven techniques such as SMOTE and OGO can be adapted to work on spatial data: a CNN can first be trained so that a projection from high-dimensional data is learned, and then a traditional data augmentation method can be applied to the resulting low-dimensional data.

Limitations of OGO
Pérez -Ortiz et al. (2015) propose the use of a gamma distribution for the generation of samples in the inter-class space of the latent manifold. This distribution is skewed towards the part of the graph adjacent to the class to be augmented but permits the inclusion of features in the frontier of both classes.
This, however, presents some limitations. The gamma distribution is not suited for the generation of values in a closed domain (in this case, δ ∈ [0, 1]), as its original domain is in the interval [0, ∞). This is not only a theoretical overlook, but it also hinders the ability to tune its parameters in a meaningful way, subject to the dataset to which it is applied.
In this paper, an alternative to the gamma distribution is proposed: using the better suited beta distribution, which is bounded in the [0, 1] interval and has easier-to-tune parameters, enabling a higher degree of flexibility which can help achieve better performance.

Goals
To the best of our knowledge, there are no previous works using ML methods for the assessment of the severity of brain damage from a patient's brain SPECT 3D image. These methods could help doctors in the diagnosis and treatment of PD and other parkinsonisms through DSSs and contribute towards the relief of the public health cost of this disease.
Thus, the goals of the present work are: 1. Exploring the potential classification performance improvement in using ordinal label information.
2. Adapting the use of classical data augmentation and class balancing techniques to spatial three-dimensional data.
3. Analyzing the developed methodologies in points 1 and 2 using a novel and extensive database of SPECT images from Hospital "Reina Sofía" (Córdoba, Spain).
4. Studying a potential improvement to the data augmentation methodology presented by Pérez-Ortiz et al. (2015) by applying a better suited probability distribution for generating synthetic samples in the class frontiers.
The rest of this paper is organized as follows: In Section 2 we describe the data to be used and we propose a fully 3D Deep CNN model for the evaluation of presynaptic deficit in SPECT Ioflupane ( 123 I) images. We design two versions of the same base model, one nominal and the other ordinal, differing on the output layer shape and the activation function as well as the loss function. We propose a novel approach for data augmentation using the beta probability distribution with efficiently estimated parameters. In Section 3, we describe the design of the experiments and metrics for the evaluation of the proposed methods. We show that our proposal performs better than the nominal methodology as well as the previous ordinal method. Finally, in Section 4, we discuss the results and propose future work.

Materials and methods
In this section, we present the dataset used for the experimentation, as well as the architecture of the proposed models and the novel data augmentation proposal.
These models have been designed to tackle the task of assessing the alteration of dopaminergic activity in the brain of PD patients by examining 3D SPECT scans of the brain.

Data description
The dataset consists of 508 3D images provided by the Clinical Management Unit of Nuclear Medicine of the Hospital Universitario "Reina Sofía" (Córdoba, Spain). They are obtained by first administering the patients with Ioflupane ( 123 I), a radiopharmaceutical which binds to the presynaptic dopamine transporters in the brain. Some time later, a SPECT scan is performed, so as to inspect the dopaminergic activity in the nigrostriatal dopaminergic pathway, which is one of the neuropathological characteristics of PD.
Of these images, 314 (61.8 %) are of healthy patients (class 0), 42 (8.3 %) show slight alteration (class 1), 52 (10.2 %) show more advanced alteration (class 2) and 100 (19.7 %) show severe alteration (class 3). It is common for medical diagnosis datasets like this one to have a severe imbalance problem. In our case, more than 60 % of samples are of healthy patients, and less than 10 % belong to class 1.
The doctors' diagnosis is attached as a class label to each of the images. Because of the gradual nature of these classes, the task of recognizing which of the four categories an image belongs to can be posed as an ordinal classification problem and, thus, specific techniques can be employed to exploit the order information.
Automatic linear image registration has been performed on all images using the FMRIB's Linear Image Registration Tool (FLIRT) from the FMRIB Software Library (FSL) (Smith et al., 2004) considering the T2 version of the MNI152 2mm standard space SPECT template (Evans et al., 2012). Thus, all images have a final resolution of 91 × 109 × 91 voxels. Also, during training, the symmetrical nature of the images is exploited: with 50 % probability, the images are flipped on the frontal axis (left to right) each time they are used.

Global architecture
The overall architecture of the CNN model considered in this paper consists on convolutional blocks of repeating layers reducing the size of the image while increasing the number of feature maps. Afterwards, the output of the convolutional part of the network is fed into a fully connected layer of neurons before computing the output decision of the model.
Each convolutional block consists of a 3D convolutional layer followed by a batch normalization layer. The kernel size and the stride of the convolution are parameters to be cross-validated during the training process. The output of each block is then fed into a Leaky Rectified Linear Unit (LReLU) activation function, which has been proven to have good convergence properties such as scale-invariance and 1-Lipschitz continuity (Suzuki, 2018).
The low resolution feature maps which are the output of the convolutional blocks are then the input of a densely connected neuron layer (the number of neurons of this layer is also cross-validated during training) using, again, LReLU as the activation function. A final output layer computes the final classification given by the model. In the training phase, the model weights are updated using the Adam optimization algorithm (Kingma & Ba, 2017) so that the outputs align with the annotated labels.
We test two different architectures: a classic architecture based on nominal classification and an ordinal architecture, considering the ordering of the class labels. In Fig. 1, it can be noted that both architectures share the same structure for their convolutional part, but they are different in the way the final output is computed. Moreover, two different class balancing methods are applied in each case, as explained in Section 2.5.

Classic nominal classification architecture
As is the case with classic CNN architectures, the output of the convolutional part of the network is then fed to a single fully connected layer. This is then again fully connected with the output layer, which has as many outputs as classes to decide. Then, a softmax activation function maps the output of the network into a set of probabilities o q of belonging to each class C q : o q = P (y = C q ).
The optimizing criterion is the categorical cross-entropy loss, described as: where Q is the number of classes, 1{y i = C q } is the indicator function that is equal to 1 when y i = C q and 0 otherwise and P (y = C q | x i ) is the predicted probability of x i belonging to class C q .
To evaluate the effectiveness of the trained model, a new unseen given sample x is classified as belonging to the class with maximum predicted probabilitŷ y = arg max 1≤q≤Q P (y = C q | x).

Proposed ordinal classification architecture
When the classes are ordered, instead of dealing with the complete problem as previously mentioned, OBD can be applied: the problem is decomposed into Q − 1 binary decision problems. Each problem q consists on deciding if y C q conditioned to x (1 ≤ q < Q).
This would normally require Q−1 different models, each solving one of these binary problems. This approach, originally presented by Frank & Hall (2001), would imply to first compute every probability p q = P (y = C q ) based on the obtained models and then select the highest probability class. The individual probabilities are computed as a function of the cumulative probabilities, P (y C q ), estimated by the binary models: However, different problems are associated to this approach: because the outputs of the different decompositions are not combined in the same training process, the basic probability assumptions (that is, P (y C q ) ≥ P (y C q+1 ), p q ≥ 0 and q p q = 1) are not necessarily satisfied, which can lead to inconsistencies. Moreover, when computing the individual probabilities, at most only two of the model outputs are considered, instead of all of them. Finally, when obtaining the decomposition, the ratio of positive to negative samples becomes very unbalanced in the extreme classes. In the case of an already unbalanced dataset, this procedure can become unrealistic.
To circumvent this limitation, a compromise is proposed: a single convolutional model can be trained simultaneously to then be fed into multiple fully connected blocks, each one solving an individual binary classification subproblem. This way, the imbalance of the training data can be less acute, and training can be done in parallel. The output of each of the Q − 1 fully connected blocks has a sigmoid activation function representing the probability Moreover, for obtaining the final probabilities, we use a more stable option based on the decision function of the ECOC framework (Allwein et al., 2001). The correct ideal output code for each class is considered as the coordinates of a vertex of a hypercube in Q − 1 dimensions, e.g. for a 4 class ordinal problem classes C 1 , C 2 , C 3 and C 4 would be associated to the codes (0, 0, 0), (1, 0, 0), (1, 1, 0) and (1, 1, 1), respectively. This way, all the model outputs are considered for classification. To decide which class a sample x belongs to, the class with the nearest code according to some distance d is selected: is the vector of output values, and c q is the code vector associated with class C q . It can be noted that the aforementioned probability assumptions do not need to hold in order to assign a class in this framework.
Instead of individually training the different binary subproblems (as proposed for ECOC) and to be consistent with the decision criterion, the global optimization criterion of the network is set to be the mean squared error (M SE) loss: While classifying new samples, the L 2 norm is used as the distance metric d, as it aligns with the M SE optimization criterion:

Class balancing
For problems of imbalanced nature, such as medical diagnosis, where samples of specific classes are a minority compared to the rest, some considerations need to be made in the training process to avoid biases in the final model that can undermine its generalization capabilities.  Different techniques such as class balancing can help with this problem. These generally consist on sampling or generating techniques that balance the ratio of training samples of each class presented to the classifier during training.
A popular approach is SMOTE: in order to generate new samples of a minority class, several existing samples of said class from the training set are randomly selected and a random weighted sum of their features is used as a new synthetic sample of that class. SMOTE can work well when the number of features is not too large and all features are continuous.
In order to account for the ordinal information in the sample synthesis process, OGO techniques based on the idea of SMOTE have been previously proposed, such as OGO via shortest paths using a probability function for the inter-class edges (abbreviated to OGO-SP) (Pérez-Ortiz et al., 2015). This algorithm defines a way to construct a graph which captures the neighbouring relations between samples of the dataset, considering the ordering information provided by the class labels.
From this, an N -vertex undirected graph G = (V, E) will be constructed, where V corresponds to the vertices representing the N training samples and E ⊆ V 2 corresponds to the edges representing the neighbouring relations between samples: If q is the index of the class to be over-sampled, a graph G q of this form will be constructed based on three subgraphs: The edges of graph G q−1,q are determined by the intersection of two different sets obtained from a neighbourhood analysis based on the distance relation d: where X c is the subset of all samples with label y = C c , N d (X 1 , X 2 , k) is the k-neighbourhood of X 1 with respect to X 2 according to some distance d and nn d (x, X, k) is the set containing the k nearest neighbours of x from set X. The vertices V q−1,q are all those appearing on E q−1,q : Using only the intersection of both neighbourhoods ensures that only the connecting regions of each class are considered. Parameter k will control how broad is the region to consider.
Graph G q,q+1 is defined in an analogous way: Finally, G q is simply defined as: For the case of the extreme classes (C 1 and C Q ), one of G q−1,q or G q,q+1 may be empty and only the connecting region to the one adjacent class will be considered.
Based on the ordinal classification hypothesis that the distance to adjacent classes is lower than the distance to non-adjacent classes, the final graph G q = (V q , E q ) will be constructed based on the previously constructed G q . Ideally, a distance-based manifold exists in the class label, such that X q lies in the space between X q−1 and X q+1 . In reality, some outliers may exist in X q that are not desirable in the over-sampling procedure. In order to identify the key samples which lie between the adjacent classes, the shortest paths between the samples of X q−1 and X q+1 are used to decide the edges present in the final G q .
A path between two vertices v 1 and v z of the graph is defined as the sequence If a cost function f : E → R assigning a cost to every edge is defined, the shortest path P 1,z is that which minimizes the total sum of the costs of the edges ). In our implementation, the cost function selected is the same as distance d used for N d , which is the L 2 norm or euclidean distance: In order to find those patterns in X q lying in the latent ordinal manifold, all the shortest paths between all the vertices in V q−1,q and all in V q,q+1 will be computed using Dijkstra's algorithm (Dijkstra, 1959), and only the edges contained in one or more of these paths will be included in E q : Note that, if q is any of the extreme classes, V q,q will have to be used instead of V q−1,q or V q,q+1 , depending on the case.
An example of the computed graph (V q , E q ) is shown in Fig. 2. Finally, new synthetic samples can be generated from G q : in order to generate sample (x , C q ), a random edge e = (x i , x j ) ∈ E q is selected so that x lies in the segment between x i and x j : where δ is a random variable in the range [0, 1]. The distribution from where δ is sampled will depend on the selected edge e: • If both y i = C q and y j = C q (i.e. e is an intra-class edge), then δ is sampled from a uniform distribution U (0, 1) just like SMOTE.
• If y i = C q but y j = C q (i.e. e is an inter-class edge), then δ is sampled from an asymmetrical distribution so that the new synthetic sample favours the augmented class but is able to capture the class transition phase. In the original OGO-SP paper (Pérez-Ortiz et al., 2015), δ ∼ Gamma(k = 2, θ = 0.15). While this has the previously mentioned properties, the gamma distribution is not bounded, δ ∈ [0, ∞), which means that P (δ > 1) > 0, and there is a risk that x lies in a different region of the manifold than C q and is therefore incorrectly labelled.
Figure 2: Example of the OGO-SP graph construction procedure. The markers represent samples of the dataset. The graph (V 2 , E 2 ) corresponding to C 2 is constructed. The top diagram shows E 1,2 (in red) and E 2,3 (in blue). The bottom diagram shows the shortest path between two vertices in V 1,2 and V 2,3 (in green) and the edges of the final constructed graph E 2 (in black).
In order to overcome this problem of the original algorithm, we propose the following modification: instead of weighting on a parameter δ ∼ Gamma(k, θ), the better suited beta distribution is used δ ∼ Beta(α, β), where the parameters α > 0 and β > 0 control the shape of the distribution (Keeping, 1995). We hypothesize that the beta distribution is better suited than the gamma distribution for this purpose, because it is bounded in the interval [0, 1], it has lower variance and its parametrization allows the probability density to be skewed towards a specific extreme, depending on the values of α and β.
The beta distribution has been applied to model the behaviour of random variables limited to intervals of finite lengths in a wide variety of disciplines. This distribution, in its standard form, is a continuous distribution with probability density function f (x) given by: for 0 < x < 1 and α > 0, β > 0, where B is the beta function. Depending on the parameter values, the following properties of the distribution can be outlined: • If α > 1, then f (0) = 0. Similarly, if β > 1, then f (1) = 0.
We refer to this new method from now on as OGO-SP with beta frontier distribution (OGO-SP-β).
Based on the two different possibilities of the two endpoints f (0) and f (1), four different asymmetric shapes can be obtained for this distribution. One of these shapes (α > 1 and β < 1) will not be considered, as this would put more probability mass in the neighbouring class side of the distribution. To ensure this, we use the quantile constraint P (δ < 0.5) = 0.75, so that the majority of the probability mass is in the augmented class side. Van Dorp & Mazzuchi (2000) prove that two quantile constraints are sufficient to parametrize the beta distribution and provide a numerical method to obtain the values of α and β that satisfy these constraints. We therefore choose three other quantile constraints that, in combination with the previous one, yield the three different shapes, and, using the aforementioned method, we compute the values of α and β: All three configurations were tested and compared to the original OGO-SP. Our hypothesis is that the beta distribution will be a better candidate for synthetic sample generation for certain datasets, like the one that will be studied in this paper (Section 2.1). While configuration (a) imitates the original gamma distribution just for comparison, configuration (b) and (c) of OGO-SP-β exploit the versatility of the beta distribution.
A graphical visualization of the shape of the probability density function for all distributions can be seen in Fig. 3. From there, it can be noted that when α < 1 the value of the PDF for δ = 0 tends to infinity, as more probability mass is directed to that extreme. The same thing happens for β < 1 and δ = 1. This way, the probability of generating samples in the inter-class region can be controlled, while favouring the generation of samples in the augmented class region with respect to the neighbouring class.
The detailed process is finally specified in algorithm 1. In our case, the data described in Section 2.1 is severely imbalanced, so data augmentation is crucial for the performance of the classification model. We if y i = C q then Swap so that x i is the one with label C q 12 Sample δ from an asymmetrical distribution (see Section 2.6) argue that the OGO-SP-β algorithm is well suited here due to the ordinal nature of the problem: the alteration of dopaminergic activity is a gradual process, so it is expected to see a more severe damage in a later stage of the disease and vice versa. Moreover, because the intermediate classes are the minority ones, the beta distribution will favour the generation of samples in their regions.

Application to spatial data
Applying techniques like SMOTE or OGO-SP to spatial data, such as images or 3D scans is not appropriate. They are unable to capture the variability of the position of different objects in a scene or anatomical elements on a CT scan. Applying these techniques in the original space that the images are sampled on yields completely unnatural and inappropriate synthetic samples that detract from the generalization capabilities of the resulting model.
On the other hand, the convolutional part of a CNN model tries to achieve a projection from the original space to a small number of features that can separate the samples correctly according to the classification problem at hand. The space of this projection should be better suited for interpolation and, consequently, for the application of SMOTE and derived techniques. Thus, in this paper, we propose a two-step training process for the application of class balancing: 1. First the whole network (convolutional + fully connected parts) is trained on the original dataset D.
2. Once the stopping criterion is reached, the convolutional part g is used to project the original dataset D into a new space with reduced dimensionality in order to obtain D : where d is the original dimensionality of the data and d is the new reduced dimensionality.
3. New synthetic samples for each class q are generated by using SMOTE in the nominal case and OGO-SP/OGO-SP-β in the ordinal case. If n q is defined as the number of samples labelled as C q in D , then D +q are the generated samples of class q. The number of samples to generate for each class is chosen so as to equalize the number of training samples of all classes. Note that this means that no synthetic sample will be generated for the majority class: 4. Synthetic samples are then merged with dataset D to produce the augmented dataset D + : 5. Finally, only the fully connected part of the original model is trained again using D + , with the same stopping criterion.

Experimentation
The five models previously described (one nominal and four ordinal for the different distributions for δ) will be compared against each other, in order to evaluate the effect of using the ordinal information in the learning process.

Experimental design
A stratified 5-fold cross-validation over the complete dataset is performed. The dataset is split into 5 (approximately) equal size subsets in a stratified fashion, so that the class distribution is maintained for each fold. For each step, one subset is used for testing and the rest are used as training samples.
For each of these 5-fold steps, in the first phase, a model selection process is performed, i.e. the hyperparameters of the algorithm are tuned. For this, three 90/10 holdout splits are performed and all the possible combinations of the following parameter values are considered: • Learning rate (η): { 10 −3 , 10 −4 }. • Neighbourhood size for the data augmentation method: {3, 5}.
The mean MAE score across the three splits is used to rank the parameter combinations, and the best combination for each fold is then used for evaluation.
Once the hyperparameter selection phase is completed, the optimal parameter combination is used in the second phase for final evaluation. The model is initialized and trained 30 times with different 90/10 train/validation splits, as well as a different random seed for initialization of the weights of the network and data augmentation.
The same scheme is repeated for each of the data folds. An illustration of this process can be seen in Fig. 4. In every training instance, the validation samples are used for early stopping of the training process: when the loss over the validation set does not improve for 50 epochs ("patience" parameter), training is stopped, and the best performing weights are restored.
A batch size of 32 samples is used in all cases. The code used to perform the experiments can be accessed through GitHub 4 .

Evaluation metrics
Multi-class nominal and ordinal metrics. While the Correct Classification Rate (CCR) is usually the most important criterion in classification tasks, in the case of high class imbalance, it is less relevant (Provost & Fawcett, 1997). In scenarios like this, a model that disregards the minority classes can still obtain a high CCR score, which is not desirable, and per-class sensitivity needs to be addressed (Sánchez-Monedero et al., 2011). Still, while optimizing sensitivity, specificity may suffer a performance hit, so it also needs to be monitored. In addition to this, CCR does not consider how much each prediction deviates from the ground truth, as it is designed primarily for nominal classification problems (where all the mistakes are equally penalised). For ordinal classification problems, a classification error of only 1 class is more desirable over an error of 2 classes. For this reason, rank difference metrics like the Mean Absolute Error (MAE), Spearman's rank correlation coefficient (r s ), Kendall's rank correlation coefficient (τ b ) (Cardoso & Sousa, 2011) and Weighted Cohen's Kappa (κ) (Ben-David, 2008) are better suited to evaluate the performance of the model. Moreover, MAE may also be deceiving in high class imbalance scenarios, so per-class MAE is also useful to consider.
When monitoring per-class metrics, it is useful to look at the minimum in order to ensure that performance does not increase at the expense of ignoring some of the classes.
In this study, the following metrics will be used. Metrics to be maximized are marked with (↑) and metrics to be minimized with (↓). In the following equations, N denotes the number of test samples, Q denotes the number of classes, y i is the class label for sample x i andŷ i is the estimated label for sample x i . where per-class sensitivity (S q ) is: • Geometric mean of the specificities and minimum specificity (↑): where per-class specificity (Sp q ) is: • Average and maximum MAE (↓): where per-class MAE (MAE q ) is: where w ij is the disagreement cost when y = C i andŷ = C j (w ij = |i − j|), p ij is the observed agreement and e ij is the expected agreement due to chance.
• Kendall rank correlation coefficient (↑): where n c , n d , n 1 and n 2 are computed from every pair {(y i ,ŷ i ), (y j ,ŷ j )}. n c is the number of concordant pairs, n d is the number of discordant pairs, and n 1 and n 2 is the number of tied pairs only for y and only forŷ, respectively: • Spearman rank correlation coefficient (↑): r s = Cov(y,ŷ) σ y σŷ , where Cov(y,ŷ) is the covariance between the ground truth labels and the predicted labels and σ y and σŷ is their standard deviation.
Of the previous metrics, CCR, MS, MAE, AMAE, MMAE, κ, τ b and r s are completely defined by Cruz-Ramírez et al. (2014) and GMS by Pérez-Ortiz et al. (2015). CCR, GMS, MS, GMSp and MSp are all purely nominal metrics, while the rest apply only to ordinal class labels and are generally more relevant for this kind of classification problems.
Also, GMS, MS, MSp, GMSp, AMAE and MMAE were chosen as class imbalance-sensitive metrics, to asses the effectiveness in these scenarios.
Lastly, based on the output scores of the models, the area under the Receiver Operating Characteristic (ROC) curve (AUC) will be computed. In order to obtain this value, as many curves as the number of classes are obtained, where each curve is based on a binary one-vs-rest labelling (OVR). Then, the average AUC is obtained from the Q curves.
Binary metrics. Given that no previous literature exists on the diagnosis of the stage of presynaptic damage from PD, some binary metrics will be extracted from the results with the only purpose of comparing it to previous works, which deal with the diagnosis problem of discerning PD patients from Healthy Controls (HC). Directly comparing CCR or any other of the previously mentioned metrics against the case of a binary classifier would be incorrect and unfair, as the ordinal metric is quite more demanding, having to classify into four different classes instead of only two.
For this purpose, all class 0 labels will be considered as "negative class" or HC and the rest (class 1 through 3) will be considered as "positive class" or PD. From this, a confusion matrix can be extracted.  Table 1 includes a summary of all the results from the 150 executions of the different methodologies, based on the different performance metrics introduced in the previous subsection. A graphical summary of the results can be found in Fig. 5.

Experimental results
From these results, it can be noted that some configuration of OGO-SP-β always shows better average performance over all other classifiers in all metrics, with the exception of CCR for which the nominal methodology obtains the best results, with OGO-SP-β (c) in second place. By looking at the GMS and MS metrics, it is very clear that the nominal methodology fails to address the present class imbalance and ignores the minority classes, arranging all test samples into a couple of the majority classes, which drops the score to zero in most of the evaluation splits.
Comparing OGO-SP-β to the original OGO-SP, OGO-SP-β generally obtains better average performance over all splits, specially according to MAE.
Both ordinal methodologies clearly outperform the nominal one. Fig. 6 shows the corresponding ROCs for the different classes of the problem. Both ordinal methodologies obtain a noticeable advantage in the intermediate minority classes, specially for class 1, the class with the least number of samples.
We used a Wilcoxon signed-rank test to detect significant differences between the metrics. This version of the test was used to measure the performance of the classifiers, because the variables come from the same sample (Wilcoxon, 1945). Purely nominal methodology was compared with the overall best performing ordinal methodology using OGO-SP-β (configuration (c)). We also used the test to compare the ordinal methodology using OGO-SP with the best performing ordinal methodology using OGO-SP-β (configuration (c)).
The critical values (p-values), for a bilateral test of each metric, are provided in Table 2.
Significant differences in performance (with α = 0.05) favouring the ordinal methodology over the nominal one are found for all metrics, but CCR and κ. As expected, the nominal methodology fixates on improving CCR instead of the rest of ordinal metrics, which are generally more desirable for ordinal type classification problems. Moreover, the nominal methodology tends to ignore minority classes, obtaining significantly worse results for GMS, MS, MSp, GMSp and AUC. Furthermore, OGO-SP-β performs significantly better in CCR, MAE and GMSp compared to the original OGO-SP using the gamma distribution.   The curves are obtained according to the output scores of each model using a one-vs-rest labelling (OVR) for each class. Table 2: p-values for the two-tailed Wilcoxon signed rank test. "N" stands for the nominal methodology, "γ" stands for the methodology using the original OGO-SP algorithm and "β" stands for the methodology using the OGO-SP-β algorithm. Values less than α = 0.05 have been highlighted in bold. For these tests, the (+) subscript denotes an advantage for OGO-SP-β and (-) otherwise. γ vs. β (c) 7.3 × 10 −2 8.0 × 10 −2 6.7 × 10 −1 Clearly, even when the tests do not find significant differences, OGO-SP-β shows better overall performance in the ordinal metrics. The inconclusiveness present in the imbalance-sensitive metrics (GMS, MS, MSp and MMAE) could be explained by their unstable nature, which results in a larger standard deviation, and thus it is more difficult for the test to draw a conclusion.

CCR GMS MS
Lastly, the results for the binary metrics results can be found in Table 3. The results are compared against the following works: • Rizzo et al. (2016): A meta-analysis of 20 different studies, all using different techniques, between 1988 and 2014.
• Martinez-Murcia et al. (2017): A CNN approach for binary classification from SPECT imaging.
• Orozco-Arroyave et al. (2016): An application of radial base Support Vector Machines to running speech audio samples from patients.
• El Maachi et al. (2020): An application of neural networks to gait sensor data from patients.
Note that the task considered in these works is different, as the different possibilities for positive labels (PD) are not differentiated, greatly reducing the complexity. However, even considering that the proposed models are more informative and taking into account that the experimental settings and the datasets are not same, we can conclude that the performance obtained by the different proposals is competitive, specially when trying to achieve a balance between the three binary metrics. That is, the extra information provided by the proposed multi-class classifiers is not obtained at the cost of losing performance in the binary task.

Conclusions
We have confirmed experimentally that the exploitation of ordinal information can improve the performance of a complex task such as the assessment of brain activity alteration in PD. This exploitation comes from aspects such as the model architecture, the optimization target and the data augmentation strategy. This expands on the current range of models capable of exploiting ordinal information, in this case, a fully 3D CNN.
This methodology is able to alleviate the class imbalance problem while improving ordinal performance metrics. This approach could be applied to already existing ordinal classification tasks which suffer from this same problem, which is common among the medical field.
Furthermore, the proposed OGO-SP-β ordinal augmentation algorithm improves performance of ordinal and nominal metrics compared to the original OGO-SP. Table 3: Binary metrics results of the five proposed methodologies along with three other works. Rows marked as type "A" correspond to an automatic method while rows marked as type "H" correspond to studies about diagnosis by human experts. The Data column indicates the reference data used for diagnosis: imaging (I), speech (S), gait (G) or a refined clinical diagnosis (C).

Method/work
Type Data Accuracy Sensitivity Specificity From the more classical binary diagnosis point of view, a good performance is achieved, as can be noted in the comparison against expert diagnoses and other ML techniques.
Future work lines include further data acquisition or the application of this methodology to publicly available data. Current available data commonly lacks ordinal information, but ordinal target labels may be extracted from already available information. Larger datasets could help obtain more relevant and precise results. Also, 3D ordinal applications outside the medical can be explored to asses the performance over nominal approaches.