Imbalanced classification for protein subcellular localization with multilabel oversampling

Abstract

Motivation: Subcellular localization of human proteins is essential to comprehend their functions and roles in physiological processes, which in turn helps in diagnostic and prognostic studies of pathological conditions and impacts clinical decision-making. Since proteins can reside at multiple locations at the same time and a few subcellular locations host far more proteins than others, the computational task for their subcellular localization is to train a multilabel classifier while handling data imbalance. In imbalanced data, minority classes are underrepresented, leading to a heavy bias towards the majority classes and degraded predictive capability for the minority classes. Furthermore, data imbalance in multilabel settings is an even more complex problem due to the coexistence of majority and minority classes.

Results: Our studies reveal that, depending on the extent of concurrence of majority and minority classes, oversampling of minority samples through appropriate data augmentation techniques holds promising scope for boosting the classification performance for the minority classes. We measured the magnitude of data imbalance per class and the concurrence of majority and minority classes in the dataset. Based on the obtained values, we identified minority and medium classes, and a new oversampling method is proposed that includes non-linear mixup, geometric and colour transformations for data augmentation, and a sampling approach to prepare minibatches. Performance evaluation on the Human Protein Atlas Kaggle challenge dataset shows that the proposed method is capable of achieving better predictions for minority classes than existing methods.

Availability and implementation: Data used in this study are available at https://www.kaggle.com/competitions/human-protein-atlas-image-classification/data. Source code is available at https://github.com/priyarana/Protein-subcellular-localisation-method.
Supplementary information Supplementary data are available at Bioinformatics online.


Introduction
A cell contains approximately 10^9 proteins residing at different subcellular locations (Chou and Shen, 2006), namely organelles, which are subcellular structures that perform specific cellular functions. The subcellular location of a protein defines its functionality and helps understand protein behaviour. Detecting abnormal translocation of proteins between subcellular environments is vital for clinical diagnosis and drug-targeting treatment.
The spatial patterns of protein distribution in high-contrast immunofluorescence images are interpretable, intuitive and can be effectively utilized for the detection of protein translocation. However, protein subcellular localization (PSL) is challenging due to the proteins being located in multiple organelles, which motivated the study of the PSL problem by Peng et al. (2010) and Coelho et al. (2010). Additionally, proteins exhibit highly variant distribution, which makes PSL an imbalanced multilabel classification task. Accordingly, the classification model is expected to recognize protein patterns among different organelles and also cell phenotypes, which further adds to the complexity of PSL. More specifically, the model classifies protein images into multiple subcellular locations provided in the label sets. The difference in label frequencies across the dataset leads to varied representation of classes (subcellular locations/organelles) during training. The majority classes with higher number of samples are frequently observed during training, and minority classes with lower number of samples are rarely encountered. Subsequently, the trained model can be heavily biased towards the majority classes and demonstrate very poor performance in minority classes.
In the past, PSL has been studied extensively using handcrafted features (Rana et al., 2021; Xu et al., 2018). Advancements in microscopic imaging have shifted the focus of current studies towards deep learning with state-of-the-art performance. Existing deep learning-based PSL methods have utilized different datasets and approaches to handle data imbalance. The Human Protein Atlas (HPA) project is the most widely used public imaging database for studying human proteins (Thul and Lindskog, 2018).
In reference to the HPA project, the 'Human Protein Atlas Image Classification' challenge is hosted on the Kaggle platform (Ouyang et al., 2019). The winning team utilized 31 072 full-sized multilabel images (a mix of 2048 × 2048 and 3072 × 3072 pixels) from the Kaggle competition and 78 000 images from external sources. Thereafter, they replaced the multi-labels with single protein IDs to apply metric learning using ArcFace loss (Deng et al., 2019), which works by exploiting 'batch effects' in the dataset through so-called 'hidden variables', a pitfall in machine learning (Ouyang et al., 2019). Likewise, all top scorers of the Kaggle competition used external data and ensembling of multiple models to boost their macro F1 scores (Ouyang et al., 2019). Another study (Zhang et al., 2021) utilized the same image sets and applied threshold optimization, cycle learning (Smith, 2017) and a weighted loss function with a hard-data sampler for improved performance. In order to save computational cost, Wang (2020) utilized only scaled Kaggle images (512 × 512 pixels) and proposed a novel composite loss function, which combines focal loss (Lin et al., 2017) and Lovász-Softmax loss (Berman et al., 2018). They trained three different models which were later ensembled to achieve better results. Aggarwal et al. (2022) proposed a convolutional neural network (CNN)-based model which outperformed a stacked ensemble of three pretrained CNN models.
Recently, PSL methods have incorporated multi-instance learning settings where labels are provided for each protein instead of an image, and each protein is considered a bag of multiple images/instances. Arcamone et al. (2021) applied few-shot learning, under which the model is trained on majority classes using a contrastive representation learning framework (Le-Khac et al., 2020) with the ArcFace loss function and cosine distance metric. During testing, the trained feature extractor provides the feature embeddings for minority classes and evaluation is performed using a 9-way 20-shot learning scheme. Tu et al. (2022) proposed a new method, namely SIFLoc, which performs training in two stages. Stage one uses a contrastive learning-based self-supervised pretraining method which includes hybrid data augmentation and a modified contrastive loss function, followed by stage two of supervised learning. Another PSL study by Zhang et al. (2022) applied a multi-task learning strategy for pretraining and proposed a combination of discrimination loss and reconstruction loss for better performance. During prediction, a multi-instance method is applied with a balanced sampling strategy.
These existing methods have mostly utilized different subsets of HPA that are different in the number of classes, images and degree of imbalance. They have achieved improvement over their base models; however, their performance with respect to each other is not known. Furthermore, all the methods so far have focused on image feature description, to make the model learn to distinguish different classes. However, there are no existing methods that have been specifically designed for the problem of multilabel imbalanced classification in PSL.
Data imbalance is identified when certain classes have far fewer samples in the dataset than other classes. However, in multilabel settings, data imbalance also arises with respect to label sets. A label set constitutes all the classes associated with a sample; label sets can contain different combinations of classes with varying numbers of classes per sample, causing further data imbalance. Therefore, in order to handle the data imbalance in multilabel settings effectively, it is essential to quantify the imbalance, particularly the magnitude of concurrence among majority and minority classes. The SCUMBLE (Score of ConcUrrence among iMBalanced LabEls) metric (Charte et al., 2014) was proposed to assess the magnitude of concurrence in a multilabel dataset. Values for SCUMBLE fall in the range [0, 1], where a value close to 0 (e.g. ≤ 0.1) indicates little coexistence of minority and majority classes in the dataset, while higher values indicate stronger concurrence.
Existing techniques to handle multilabel data imbalance can be categorized into classifier-dependent and independent approaches.
Classifier-dependent approaches include ensembling, algorithm-specific adaptations and cost-sensitive learning (Tarekegn et al., 2021). Ensemble methods utilize different classifier models trained on the same data to obtain diverse predictions, which are combined to boost classifier performance. Algorithm-specific adaptations enhance existing machine learning algorithms, for instance, multilabel-KNN (Zhang and Zhou, 2007), multilabel SVM (Elisseeff and Weston, 2001) and multilabel neural networks (Zhang, 2009; Zhang and Zhou, 2006). Cost-sensitive learning applies a high cost to misclassified minority samples. It is, however, not a widely applied method, as it is challenging to determine an effective cost matrix for a given dataset (Tarekegn et al., 2021).
Classifier-independent approaches to handle class imbalance include oversampling of minority samples. However, due to the coexistence of majority and minority classes in multilabel images, only datasets with a low SCUMBLE metric value benefit from oversampling techniques (Charte et al., 2015a, 2019a). Oversampling of multilabel samples can be performed by approaches such as LP-ROS (Label Powerset-Random Oversampling) (Charte et al., 2013) and ML-ROS (Multi-Label-Random Oversampling) (Charte et al., 2015a). Both clone the original minority instances using various geometric and colour transformations. LP-ROS considers the label set to evaluate imbalance, while ML-ROS evaluates the imbalance level of each label/class. These methods, if applied repetitively, often cause overfitting.
In order to create more variations in synthetic images, the Multilabel Synthetic Minority Oversampling Technique (MLSMOTE) (Charte et al., 2015b) was proposed, which creates new images by interpolating two neighbouring images from the pool of minority class samples. However, generating new instances and their corresponding labels in multilabel settings risks introducing noise. Accordingly, Charte et al. (2019b) proposed a resampling approach for datasets with high concurrence, namely REMEDIAL (REsampling MultilabEl datasets by Decoupling highly ImbAlanced Labels), which decouples imbalanced concurrent labels and applies resampling for improved classification performance. The 'mixup' method (Zhang et al., 2018) and its variants have been widely used to handle data imbalance in multiclass and single-label settings (Chou et al., 2020; Galdran et al., 2021; Shorten and Khoshgoftaar, 2019; Tarekegn et al., 2021); however, its application in multilabel settings has not been explored much.
In this study, we propose an oversampling method to handle data imbalance in multilabel settings, by creating synthetic samples using multiple data augmentation techniques combined with an imbalance-aware sampling approach. In multilabel settings, each sample is associated with more than one class, due to which oversampling often generates noise. Therefore, it is crucial to identify the classes for oversampling and design the data augmentation approach accordingly. We first investigate the imbalance ratio per class and the extent of concurrence of majority and minority classes using the SCUMBLE metric. Based on the value obtained, the benefit of applying oversampling to handle data imbalance is measured. Thereafter, our oversampling method is applied to handle data imbalance in multilabel protein images for PSL.
The proposed oversampling method constructs synthetic samples using non-linear mixup and random combinations of geometric and colour transformations to achieve better diversity in the training set for improved classification performance. Our non-linear mixup method enhances the regularization capability of mixup by combining the transformations in the spatial domain and colour space. In particular, regularization is further enhanced in the spatial domain by using a matrix as mixup coefficient, followed by applying 3D rotation in colour space about different reference channels which correspond to proteins of interest and the relevant cellular regions of minority classes. We also design an imbalance-aware sampling approach, which measures imbalance using the median and mean of the 'imbalance ratio per class' values and accordingly identifies the medium and minority classes. Thereafter, synthetic samples of minority and medium classes form separate minibatches for training and supplement the minibatch of original images during different iterations, so that the minority class samples are utilized more than the medium class samples.
The proposed oversampling method improves our previous approach (Rana et al., 2022a), which handles data imbalance in single-label settings. The proposed improved method uses a matrix as the mixup coefficient instead of the standard scalar. Furthermore, non-linear mixup images are generated by applying 3D rotation specifically about the colour components (image channels) that correspond only to minority classes, unlike our previous approach, which randomly selects the axis of 3D rotation for single-channel grey-scale images. In addition, our new imbalance-aware sampling approach measures the imbalance in the dataset and focuses on medium and minority classes, while our previously proposed minority-focused sampling approach did not include any such measures and focused only on the minority classes. The proposed method was evaluated on the Kaggle challenge dataset (Ouyang et al., 2019), as it contains the highest number of classes among all the HPA subsets, with the most extreme data imbalance and thus closest to real-world settings (Fig. 1). The results demonstrate that our proposed framework achieves better predictions for all the classes in the dataset than existing methods.

Dataset
The Kaggle challenge dataset has 28 classes, 31 072 samples for training and 11 702 samples for testing (TS2) (labels are not provided for the test set). Each sample consists of four channels: green, red, blue and yellow, representing the antibody-stained protein of interest (POI), microtubules, nucleus and endoplasmic reticulum, respectively (see Supplementary Fig. S1). The goal of the study is to predict the subcellular location of the POI, which is in the green channel, while the red, yellow and blue channels function as references. Images are available in their original high-resolution size of 2048 × 2048 pixels and a downsized version of 512 × 512 pixels. The class distribution of the dataset, with the number of samples in each class and the number of images with their respective label set sizes, is shown in Figure 1. There are 1–5 labels per image.
Due to the unavailability of test set labels, we randomly selected 20% of images from each class of the given training dataset to use as another test set (TS1) for the evaluation of methods.

Methods
First, we measure the extent of concurrence of the majority and minority classes in the dataset to assess if oversampling can boost the classification performance. Next, minority and medium classes are identified ( Fig. 1), and our proposed oversampling approach is employed for model training.

Data imbalance measures
Data imbalance in a multilabel dataset can be measured using two main measures, namely imbalance ratio per label (IRL) and SCUMBLE (Tarekegn et al., 2021). In a multilabel dataset M with L labels/classes and label sets Y, with Y_i denoting the label set of an instance i, the IRL of a label l is the ratio of the number of samples of the most frequent label (majority class) to the number of samples of label l:

IRL(l) = \frac{\max_{l' \in L} \sum_{i=1}^{|M|} h(l', Y_i)}{\sum_{i=1}^{|M|} h(l, Y_i)}, \qquad h(l, Y_i) = \begin{cases} 1 & l \in Y_i \\ 0 & \text{otherwise.} \end{cases}

In addition, SCUMBLE measures the coexistence of majority and minority labels in the multilabel dataset as:

SCUMBLE(M) = \frac{1}{|M|} \sum_{i=1}^{|M|} \left[ 1 - \frac{1}{\overline{IRL_i}} \left( \prod_{l=1}^{L} IRL_i^{\,l} \right)^{1/L} \right],

where IRL_i^l = IRL(l) if label l is present in the instance Y_i (else it is 0) and \overline{IRL_i} denotes the average imbalance level of the labels appearing in the ith instance. A low SCUMBLE value implies lower concurrence of imbalanced labels. A value ≤ 0.1 is taken to benefit from resampling (Charte et al., 2019b).

Mixup
In this study, we used mixup as a data augmentation technique to improve the regularization effect on the classifier in multilabel settings. It adopts the standard mixup strategy that generates synthetic images through weighted linear interpolation of two input images (Zhang et al., 2018):

I_{mix} = \lambda I_{x_0} + (1 - \lambda) I_{x_1},

where I_{x_0} and I_{x_1} are two image matrices and \lambda is the mixup coefficient for each sample pair, which is a scalar \in [0, 1] in standard settings.
In order to obtain better regularization, this study uses λ as a matrix of random values ∈ [0, 1] of the same size as the image. The matrix λ provides more diverse variations in the resultant mixup image, as it expands the space between the reference images used to generate synthetic images. To avoid the mixup image being too close to either reference image, the elements of the mixing matrix λ are set to random values in the range 0.35-0.65. A pair of images from the same minority class is randomly chosen and interpolation is carried out between the corresponding pixels of the two reference images to generate a new mixup image. The label set of the new mixup image is the intersection of the label sets of the two reference images, i.e. the labels which appear in both reference images.
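The matrix-coefficient mixup and the label-intersection rule can be sketched as follows. This is an illustrative implementation under our own assumptions: the paper specifies a λ matrix the same size as the image, and here we share one λ value across the channels of each pixel for simplicity:

```python
import numpy as np

rng = np.random.default_rng(0)

def matrix_mixup(img_a, img_b, low=0.35, high=0.65):
    """Per-pixel mixup: lambda is a matrix of random values in
    [low, high] rather than a single scalar."""
    lam = rng.uniform(low, high, size=img_a.shape[:2])[..., None]
    return lam * img_a + (1.0 - lam) * img_b

def mix_labels(labels_a, labels_b):
    """Label set of the mixup image: labels present in both references."""
    return labels_a & labels_b

a = rng.random((4, 4, 3))  # two toy reference images, H x W x C
b = rng.random((4, 4, 3))
mixed = matrix_mixup(a, b)
print(mix_labels({8, 0}, {8, 25}))  # -> {8}
```

Restricting λ to [0.35, 0.65] keeps every pixel of the result a genuine blend, never a near-copy of either reference.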

Non-linear mixup
In order to enhance the regularization effect of mixup, we earlier proposed a non-linear mixup method (Rana et al., 2022a) that modifies the pixel values of the mixup image in colour space using 3D rotation. Non-linear mixup produces a synthetic image with transformations in both the spatial domain and colour space, providing a better regularization effect than standard mixup. Non-linear mixup is carried out as follows:

Step 1: For two randomly chosen reference images from a certain minority class, mixup is first used to create a new synthetic image I_mix through weighted interpolation of the corresponding channel images, using a matrix as the mixup coefficient. This step transforms pixel values in the spatial domain.

Step 2: Each pixel of I_mix obtained from Step 1 undergoes 3D rotation in colour space for further transformation. The 3D rotation is carried out about one of the colour axes at a randomly chosen angle θ using the corresponding rotation matrix. Consequently, the pixel value of the colour component that serves as the axis of rotation remains unchanged, while new pixel values for the other two colour components are created, generating a different colour.

Step 3: The label set of the obtained non-linear mixup image is inherited from its corresponding mixup image, which is the intersection of the label sets of the two input images.
To adapt this non-linear mixup method to multilabel HPA images, consider that HPA images have four channels: the green channel as POI, the blue channel as the reference for the inner nuclear region, and the red and yellow channels as the references for the outer nuclear region. Since mixup and non-linear mixup are applied only to the minority classes, which are subcellular locations in the outer nuclear region, in this study the blue colour component of the 3D RGB colour space is replaced with the yellow colour component to construct a 3D colour space comprising red, green and yellow (Fig. 2). Subsequently, non-linear mixup is used to generate two image variants, namely NL mixup (G) and NL mixup (R), by applying 3D rotation to the obtained mixup image about the green and red colour components, respectively. 3D rotation about the green axis keeps the POI pixels unchanged while creating variations in the outer nuclear region; consequently, NL mixup (G) has the same POI pattern as mixup but with a variant background (Fig. 2). NL mixup (R) brings variations in the green and yellow colour components while the red colour component remains unchanged. With a controlled tweak, such as restricting the range from which θ is selected, NL mixup (R) accentuates POI pixels (Fig. 2).
More specifically, each pixel in the 3D colour space, represented as [I_p^R, I_p^G, I_p^Y], is rotated by an angle θ using rotation matrices Rot_G and Rot_R for NL mixup (G) and NL mixup (R), respectively (Fig. 2). During 3D rotation about G, a new pixel value is created on a circle whose centre is [0, I_p^G, 0] with radius \sqrt{(I_p^R)^2 + (I_p^Y)^2}, and similarly for the rotation about R. Accordingly, the transformations of pixels for NL mixup (G) and NL mixup (R) are defined as:

[I_p^R, I_p^G, I_p^Y]^T_{NL(G)} = Rot_G \, [I_p^R, I_p^G, I_p^Y]^T \quad \text{and} \quad [I_p^R, I_p^G, I_p^Y]^T_{NL(R)} = Rot_R \, [I_p^R, I_p^G, I_p^Y]^T,

respectively, where Rot_G and Rot_R are the standard rotation matrices about the second (G) and first (R) axes:

Rot_G = \begin{bmatrix} \cos\theta & 0 & \sin\theta \\ 0 & 1 & 0 \\ -\sin\theta & 0 & \cos\theta \end{bmatrix}, \qquad Rot_R = \begin{bmatrix} 1 & 0 & 0 \\ 0 & \cos\theta & -\sin\theta \\ 0 & \sin\theta & \cos\theta \end{bmatrix}.

In order to obtain more variations in the non-linear mixup images across the iterations, and to avoid the resultant NL mixup image being too similar to the mixup image, θ is chosen randomly, alternating between two ranges ([60°-110°] and [120°-300°]).
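The colour-space rotation can be sketched as below. The matrices are the standard axis rotations consistent with the description above (component on the rotation axis unchanged); sign conventions in the paper's implementation may differ, and the image and angle here are toy values:

```python
import numpy as np

def rot_g(theta):
    """Rotation about the G axis of the (R, G, Y) colour space:
    G values are unchanged, R and Y are mixed."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, 0.0, s],
                     [0.0, 1.0, 0.0],
                     [-s, 0.0, c]])

def rot_r(theta):
    """Rotation about the R axis: R values unchanged, G and Y mixed."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[1.0, 0.0, 0.0],
                     [0.0, c, -s],
                     [0.0, s, c]])

def nl_mixup(img_rgy, rot):
    """Apply the 3x3 colour rotation to every pixel of an HxWx3 image."""
    return img_rgy @ rot.T

rng = np.random.default_rng(0)
img = rng.random((4, 4, 3))                # toy mixup image in RGY space
theta = np.deg2rad(rng.uniform(60, 110))   # one of the two angle ranges
out_g = nl_mixup(img, rot_g(theta))
# the green (POI) channel is untouched by rotation about G
print(np.allclose(out_g[..., 1], img[..., 1]))  # True
```

Because the transformation is a rotation, each pixel keeps its distance from the colour-space origin, so the synthetic image stays in a plausible intensity range while its hue shifts.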

Imbalance-aware sampling
The standard approach to identify classes that need oversampling is to consider all classes in the dataset with IRL > mean(IRL_{C_1,...,C_N}) as minority classes. However, in cases of extreme data imbalance, the distribution of the classes' IRL values is skewed to the right. Therefore, apart from minority classes, there is another category of classes that may benefit from oversampling, which we have named medium classes. The medium classes have an IRL too high to be considered a majority class and too low to be a minority class. Since, for a right-skewed distribution, the mean is typically greater than the median, this study utilizes the median of the IRL values to identify medium classes. Accordingly, all classes which satisfy median < IRL < mean are considered medium classes.
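The median/mean partition above amounts to a few lines of code. A minimal sketch with hypothetical IRL values (not the HPA figures):

```python
import statistics

def split_classes(irl_by_class):
    """Partition classes by IRL: minority if IRL > mean,
    medium if median < IRL < mean (right-skewed distribution)."""
    vals = list(irl_by_class.values())
    mean = statistics.mean(vals)
    median = statistics.median(vals)
    minority = [c for c, v in irl_by_class.items() if v > mean]
    medium = [c for c, v in irl_by_class.items() if median < v < mean]
    return minority, medium

# toy right-skewed IRL values: one extreme class drags the mean up
irls = {'c0': 1.0, 'c1': 1.2, 'c2': 2.0, 'c3': 3.5, 'c4': 40.0}
print(split_classes(irls))  # (['c4'], ['c3'])
```

Note how the extreme class 'c4' pulls the mean (9.54) far above the median (2.0), opening the gap in which medium classes like 'c3' sit.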
Adapting from our earlier approach (Rana et al., 2022b), in this study each minibatch during an iteration is supplemented with another minibatch of synthetic samples from the minority or medium classes. Our sampling approach does not require a preset sample distribution ratio; instead, it finetunes the number of samples per class added for training during an iteration through a hyperparameter n. For the supplementary batch of minority samples, n = 1 adds three images [original image (with geometric and colour transformation) + NL mixup (R) + NL mixup (G)], while the supplementary batch of medium samples consists of three images obtained after random combinations of geometric and colour transformations. Consequently, the method balances the representation of the target classes at minibatch level and reduces computational cost, as it allows the inclusion of the minimum number of synthetic samples without affecting the representation of the original samples.
We note that 65% of the iterations per epoch include the supplementary batch of minority samples and 35% of the iterations per epoch include the supplementary batch of medium samples. Since the number of original samples in the medium classes is comparatively high, with a wide range of label set sizes, we observed that if mixup/NL mixup is applied to medium classes, the new label sets obtained after intersection affect the predictions of other classes. Therefore, only geometric and colour transformations are applied to medium class samples, and only during a small percentage of iterations.
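One way to realize the 65%/35% split is to precompute, per epoch, which iterations receive which kind of supplementary batch. This is our own sketch of such a scheduler; the paper does not specify whether the assignment is shuffled or follows a fixed pattern:

```python
import random

def iteration_plan(n_iters, minority_frac=0.65, seed=0):
    """Assign each iteration's supplementary batch type: 65% of the
    iterations per epoch use minority synthetics, 35% medium ones."""
    n_minority = round(n_iters * minority_frac)
    kinds = ['minority'] * n_minority + ['medium'] * (n_iters - n_minority)
    random.Random(seed).shuffle(kinds)  # spread both kinds across the epoch
    return kinds

plan = iteration_plan(100)
print(plan.count('minority'), plan.count('medium'))  # 65 35
```

At each iteration, the scheduled supplementary batch is concatenated with the minibatch of original images before the forward pass, so minority synthetics are seen roughly twice as often as medium ones.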

Experimental setup and implementation
The proposed method and several state-of-the-art PSL methods were evaluated on the multilabel HPA Kaggle challenge images. The primary goal of our study is to address the key problem of PSL, namely imbalanced multilabel data. Therefore, considering the computational cost, we used only the downsized images provided by the Kaggle competition (31 072 training images of size 512 × 512 pixels) with their original multilabel sets and without using any external data.
We computed the IRL and SCUMBLE metrics to measure the data imbalance and, based on the IRL values, identified the minority and medium classes. The minority classes are 8, 9, 10, 15 and 27, and the medium classes are 6, 12, 13, 16, 17, 18, 20, 22, 24 and 26 (see Supplementary Table S1 for a description of the classes). The SCUMBLE value for the dataset was below the 0.1 threshold, which implies low concurrence of majority and minority classes, and therefore oversampling could be useful to handle the data imbalance. Accordingly, minority samples were constructed using geometric and colour transformations and mixup, and further transformed using non-linear mixup, while oversampling of medium classes was performed using only geometric and colour transformations. We utilized geometric and colour transformations such as horizontal/vertical flips, rotation at 90°, 180° and 270°, brightness adjustments in the range of 50-150% of the original pixel value, and cropping followed by padding.

Fig. 2. Mixup and non-linear mixup (RGY view): two images randomly selected from the same minority class '8' generate a mixup image (I_mix) with a label set obtained from the intersection of the label sets of the reference images. The non-linear mixup image is generated by applying 3D rotation to each pixel about the G and R axes in colour space. Rotation about the G axis at two different angles θ provides variation in the background of the POI (green channel); rotation about the R axis at two different angles θ accentuates POI pixels.

For classification, the available training data after separation of the TS1 images were divided as follows: 80% for training and 20% for validation, for 5-fold stratified cross-validation. Subsequently, five models were obtained after training on each set of training and validation images. The final predictions for TS1 and TS2 images were obtained by ensembling, i.e. averaging the predictions of all five models.
The classification model was trained by finetuning the ImageNet pre-trained ResNet-50 (He et al., 2016) architecture, a model commonly used in PSL studies, on the training dataset. The loss function was binary cross-entropy (BCE). The minibatch size was 32, where the number of original samples was 17 and n was set to 1 for all the minority classes. The weight parameters were updated using stochastic gradient descent (Robbins and Monro, 1951) with a momentum of 0.9 and a weight decay of 0.0001 over 60 epochs. The initial learning rate was 0.03 and the StepLR scheduler offered by PyTorch was used to schedule the learning rate, with gamma and step size set to 0.05 and 15, respectively. The initial hyperparameter settings (batch size, decay and learning rate) and the learning scheduler were the same for all the experiments.
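The resulting learning-rate schedule can be reproduced arithmetically. This is a sketch of the decay rule; PyTorch's `torch.optim.lr_scheduler.StepLR` applies the same formula during training:

```python
def step_lr(epoch, base_lr=0.03, gamma=0.05, step_size=15):
    """Learning rate under a StepLR-style schedule: the base rate is
    decayed by a factor of gamma every step_size epochs."""
    return base_lr * gamma ** (epoch // step_size)

# schedule over the 60 training epochs used in the paper
for e in (0, 15, 30, 45):
    print(e, step_lr(e))  # 0.03, 0.0015, 7.5e-05, 3.75e-06
```

With gamma = 0.05 the rate drops by a factor of 20 at epochs 15, 30 and 45, giving three progressively finer finetuning phases.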
The temperature hyperparameter for SIFLoc (Tu et al., 2022) was set to 0.1. This study used the macro F1-score for performance evaluation of the proposed method, which is the arithmetic mean of the F1-scores of all 28 classes and thus reflects the performance on all classes in the dataset equitably. Methods were evaluated based on the individual F1-scores of all the classes in our test set (TS1), the macro F1-score of TS1 and the macro F1-score of the Kaggle test set (TS2). In addition, the Wilcoxon rank-sum test at a significance level of 1% was used to compare the performance of the proposed method with the other compared methods and demonstrate the statistical significance of the results.
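The macro F1-score used throughout the evaluation can be illustrated on toy multilabel predictions. A minimal sketch (the label matrices below are invented for illustration):

```python
import numpy as np

def macro_f1(y_true, y_pred):
    """Arithmetic mean of per-class F1 over all classes (columns of
    binary label matrices), weighting every class equally."""
    f1s = []
    for c in range(y_true.shape[1]):
        tp = int(np.sum((y_true[:, c] == 1) & (y_pred[:, c] == 1)))
        fp = int(np.sum((y_true[:, c] == 0) & (y_pred[:, c] == 1)))
        fn = int(np.sum((y_true[:, c] == 1) & (y_pred[:, c] == 0)))
        denom = 2 * tp + fp + fn
        f1s.append(2 * tp / denom if denom else 0.0)
    return float(np.mean(f1s))

# toy multilabel ground truth and predictions over three classes
y_true = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0], [1, 0, 0]])
y_pred = np.array([[1, 0, 0], [0, 1, 0], [1, 0, 0], [1, 0, 0]])
print(macro_f1(y_true, y_pred))  # (1 + 2/3 + 0) / 3 ≈ 0.556
```

Because each class contributes equally regardless of its sample count, a model that ignores minority classes is penalized heavily, which is why macro F1 is the metric of choice for this imbalanced task; the accompanying Wilcoxon rank-sum comparison can be run with `scipy.stats.ranksums` on the two methods' score samples.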

Results and discussion
We first compare the performance of the proposed method, base model and existing PSL methods which are based on contrastive learning (Tu et al., 2022), metric learning (Ouyang et al., 2019) and supervised learning using Focal-Lovász (Wang, 2020) loss function. The base model results are from the pre-trained ResNet-50 model finetuned on the original images of the Kaggle challenge dataset using the BCE loss function.
As the results show (Table 1), our proposed method outperforms the other compared methods. For the medium classes, our method improved the macro F1-score from 0.61 to 0.66. As the IRL value of a class rises from the median of the IRL values, i.e. from Class 6 to Class 20, the percentage improvement in the F1-score after oversampling also rises, from 4.6% to 31.3% (see Supplementary Table S2).
In multilabel settings, since each sample is associated with more than one class, oversampling often generates noise due to a mismatch between the assigned label set and the newly constructed synthetic sample. Therefore, it is important to correctly identify the classes for oversampling before applying data augmentation. Our imbalance-aware sampling approach measures imbalance in the dataset using the mean and median of the 'imbalance ratio per class' values and accordingly identifies the medium and minority classes. Noise and overfitting are avoided while handling imbalance: synthetic medium samples are generated using only geometric and colour transformations and supplement the original minibatch for fewer iterations, while synthetic minority samples are generated using NL mixup as well as geometric and colour transformations, and supplement the original minibatch for the majority of iterations. Noise is further avoided during NL mixup by selecting image pairs from the same class and reference channels that correspond only to minority classes, along with the intersection approach to generate new label sets.
We note that the proposed non-linear mixup does not necessarily produce a biologically relevant image. However, it achieves a better regularization effect than the current data augmentation approaches as it expands the space between reference images to generate new images and hence provides more variations in the synthetic images. This makes a diverse training set which improves the generalization ability of the model and in turn helps in training a better classification model. For instance, as shown in Supplementary Figure S3, sample images have both majority and minority classes in their original label set, and Wang (2020) predicted mostly majority classes while our method is able to predict the complete label set correctly.
Among the compared approaches, SIFLoc (Tu et al., 2022) utilizes a modified contrastive loss function to perform self-supervised pretraining followed by supervised learning. The model aims to capture the semantic relationships between images to generate better image representations. Moreover, unlike existing PSL methods that applied metric learning with ArcFace loss and protein IDs (Le-Khac et al., 2020;Ouyang et al., 2019), for further comparison, we also applied triplet margin loss function (Ding et al., 2015) to evaluate metric learning with multi-labels (see Supplementary Section S2 for details). Metric learning based on the triplet margin loss function does not perform well in comparison to other methods specifically for minority and medium classes. It could not capture the intraclass variance of minority samples as the chance of multiple minority class samples to fall together in a minibatch is low.
As shown in Table 1, supervised learning methods using BCE and Focal-Lovász loss perform better than SIFLoc and metric learning. The latter methods performed well in their respective PSL studies because of comparatively less extreme data imbalance in the utilized datasets and the use of protein IDs. The supervised learning method using Focal-Lovász loss performed comparatively better than the other PSL methods. In line with the existing study (Wang, 2020), Focal-Lovász loss achieves more than 3.5% improvement over the base model (with BCE loss) on both test sets in this study. However, it does not perform as well as the proposed method, particularly for classes C15 and C27. This can be due to the very low representation of classes C15 (21 samples) and C27 (11 samples) in the dataset. Furthermore, most of the samples in these two classes have a label set of size 3, while the other minority classes typically have label sets of size 1 or 2. In addition, there are very few samples representing each label set, which makes model training particularly challenging for these two classes. We note that the macro F1-score of 0.5292 reported in the study by Wang (2020) is obtained by ensembling the predictions of three different models. The size of the training data used in their study is different from ours, as we used 20% of the images from each class (TS1) for evaluation and performed 5-fold cross-validation on the remaining images, while Wang (2020) used all the images for 5-fold cross-validation. Accordingly, the macro F1-score of 0.447 for TS2 in Table 1 is obtained by applying Focal-Lovász loss with our dataset setup and ResNet-50.

Note (Table 1): Best performance per class and per method is indicated in bold. C8, C9, C10, C15 and C27 are the F1-scores of minority classes 8, 9, 10, 15 and 27, respectively, and TS1 is the macro F1-score of the sampled test set (extracted from the Kaggle training set) in this study. TS2 is the macro F1-score of the official Kaggle test set obtained by submitting classification results to the Kaggle competition. F1-scores for all individual classes in TS1 are shown in Supplementary Table S3.
For ablation studies of each component of the proposed approach, we explored different combinations of the utilized data augmentation techniques (geometric and colour transformations, non-linear mixup about the red and green axes) and alternatives to non-linear mixup, such as standard mixup, MLSMOTE (Charte et al., 2015b) and manifold mixup (Verma et al., 2019). The compared methods are as follows: (i) GCT + mixup + NL mixup (G), which is oversampling of minority classes with images obtained after applying a random combination of geometric and colour transformations, mixup and non-linear mixup about the green axis, using imbalance-aware sampling; (ii) GCT + mixup + NL mixup (R), which is the same as (i) but with non-linear mixup about the red axis instead of the green axis; (iii) mixup + NL mixup (R) + NL mixup (G), which is oversampling of minority classes with images obtained after applying mixup and non-linear mixup about the green and red axes, using imbalance-aware sampling; (iv) GCT + mixup, which is oversampling of minority classes with images obtained after applying a random combination of geometric and colour transformations and mixup, using imbalance-aware sampling; (v) GCT + manifold mixup, which is the same as (iv) but with manifold mixup in place of mixup; (vi) GCT + MLSMOTE, which is the same as (iv) but with MLSMOTE in place of mixup; (vii) GCT, which is oversampling of minority classes with images obtained after applying only a random combination of geometric and colour transformations, using imbalance-aware sampling; (viii) weighted sampling, which replaces imbalance-aware sampling in the base model, assigns each sample a weight that is the inverse of its class frequency during training and applies a random combination of geometric and colour transformations; (ix) standard sampling, which replaces imbalance-aware sampling and creates minibatches by randomly selecting images; data augmentation and oversampling of minority classes are applied whenever a minority-class sample is encountered; (x) scalar mixup coefficient, which replaces the proposed matrix mixup coefficient with a scalar mixup coefficient. We note that the model architecture, loss function, learning-rate scheduler, optimizer and other initial hyperparameter settings were the same for all experiments.
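The weighted sampling baseline (variant viii) can be sketched as below. This is a minimal illustration under our assumptions: for a multi-label sample, we weight by the inverse frequency of its rarest class, a choice the paper does not specify.

```python
import numpy as np

def inverse_frequency_weights(label_sets, n_classes):
    """Per-sample sampling weights for the weighted-sampling baseline:
    each sample is weighted by the inverse frequency of its rarest class
    (rarest-class choice is an assumption for multi-label samples)."""
    counts = np.zeros(n_classes)
    for labels in label_sets:
        for c in labels:
            counts[c] += 1
    weights = np.array([max(1.0 / counts[c] for c in labels)
                        for labels in label_sets])
    return weights / weights.sum()  # normalise to a sampling distribution

# Minibatches would then be drawn with these probabilities, e.g.:
# idx = np.random.choice(len(label_sets), size=batch_size, p=weights)
```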
As shown in Table 2, the proposed oversampling framework that includes geometric and colour transformations, NL mixup (R) and NL mixup (G) with imbalance-aware sampling achieves the best performance in the ablation study. It is observed that both non-linear image variants [NL mixup (R) + NL mixup (G)] contribute most to the performance of the classification model. Since NL mixup (R) accentuates POI pixels, while NL mixup (G) varies the background of the POI, this implies that transforming selected channels according to the task requirements, with controlled tweaks, can further enhance the regularization effect. We note that using a matrix as the mixup coefficient improves the performance of the model by approximately 2% compared with using a scalar value.
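The difference between a scalar and a matrix mixup coefficient can be sketched as follows. This is an illustrative NumPy formulation, not the paper's exact implementation: we draw a per-pixel coefficient from Beta(α, α) and, as an assumption, mix the labels by the mean of the matrix.

```python
import numpy as np

rng = np.random.default_rng(0)

def matrix_mixup(img_a, img_b, y_a, y_b, alpha=0.4):
    """Mixup with a matrix coefficient instead of a scalar: every pixel
    gets its own mixing weight drawn from Beta(alpha, alpha). Mixing the
    labels by the mean of the matrix is our assumption, not necessarily
    the paper's exact formulation."""
    lam = rng.beta(alpha, alpha, size=img_a.shape)
    mixed = lam * img_a + (1.0 - lam) * img_b
    lam_bar = lam.mean()
    mixed_y = lam_bar * y_a + (1.0 - lam_bar) * y_b
    return mixed, mixed_y
```

A scalar coefficient applies one global blend; the matrix variant produces locally varying blends, which plausibly yields more diverse synthetic samples and a stronger regularization effect.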
The gap in the obtained macro F1-scores for TS1 and TS2 implies that TS2 contains images that are vastly different from the training images and therefore challenging to classify. It is observed that methods using two different data augmentation techniques do not show the same degree of performance improvement on TS2 (∼2% increase over the base model) as on TS1 (∼6% increase over the base model). In contrast, more diverse data generated using three different data augmentation approaches lead to considerable improvement on both TS1 (∼9.2% increase over the base model) and TS2 (∼9% increase over the base model). Experiments also demonstrated better performance with the imbalance-aware sampling approach than with the standard and weighted sampling approaches. However, for the best results the proposed sampling approach must be applied with multiple data augmentation techniques to avoid overfitting.
Furthermore, as shown in Table 2, the proposed method with NL mixup (R/G) performs better than standard mixup, MLSMOTE or manifold mixup. MLSMOTE applies interpolation to an image pair selected from the pool of minority-class samples using a nearest-neighbour approach, without considering their respective classes. Since the dataset has images from different cell lines (see Supplementary Fig. S2), some synthetic images obtained from MLSMOTE introduced noise, leading to comparatively poor performance. Manifold mixup applies interpolations in deep hidden layers between images from the same class for smoother decision boundaries. In the current setting, when applied to minority classes it affects the predictions of concurrent classes such as classes 4, 6 and 19, reducing the overall macro F1-scores; however, its performance on TS2 is almost as good as standard mixup.
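The MLSMOTE interpolation step discussed above can be sketched as follows, to make the source of the noise concrete. This is a simplified feature-space illustration, not the full MLSMOTE algorithm (which also assigns labels to the synthetic sample).

```python
import numpy as np

def mlsmote_interpolate(sample, neighbours, rng=np.random.default_rng(1)):
    """MLSMOTE-style interpolation: pick one of the sample's nearest
    minority-class neighbours and interpolate at a random point on the
    segment between them. The neighbour is chosen purely by feature
    distance, so pairs from different cell lines can be mixed, which is
    the noise source discussed above."""
    ref = neighbours[rng.integers(len(neighbours))]
    gap = rng.random()  # interpolation position in [0, 1)
    return sample + gap * (ref - sample)
```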
As shown in Tables 1 and 2, the obtained P-values are smaller than 0.01, which implies that the proposed method performs significantly better than the compared methods. In particular, the P-values in Table 1 are obtained by comparing the probabilities of the correct label of each test-set sample in TS1. However, the various data augmentation methods in Table 2 exhibit similar performance on the majority classes; therefore, the P-values in Table 2 for these methods are obtained by comparing the probabilities of the correct label of each test-set sample from only the minority and medium classes of TS1.
Note (Table 2): P-values smaller than 0.01 imply statistically improved performance of the proposed method over the compared methods; best performance is indicated in bold. GCT, geometric and colour transformations; NL mixup (G) and NL mixup (R), non-linear mixup with the green and red colour components, respectively, as the axis for 3D rotation.
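A paired comparison of per-sample correct-label probabilities, as described above, can be illustrated with a simple exact sign test. The paper does not state which paired test produced its P-values, so this is only a stand-in to show the shape of the computation.

```python
import math

def sign_test_p(scores_a, scores_b):
    """Two-sided exact sign test on paired per-sample correct-label
    probabilities: counts how often method A beats method B and tests
    against a fair coin. A stand-in; not necessarily the paper's test."""
    diffs = [a - b for a, b in zip(scores_a, scores_b) if a != b]
    n, wins = len(diffs), sum(d > 0 for d in diffs)
    # exact two-sided binomial tail probability under p = 0.5
    tail = sum(math.comb(n, k) for k in range(min(wins, n - wins) + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```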

Conclusion
In this article, we have proposed an oversampling method to handle multilabel data imbalance for the PSL task. The method computes the SCUMBLE metric, which is used to decide whether oversampling should be applied to handle data imbalance. Synthetic samples for oversampling are generated from non-linear mixup and random combinations of geometric and colour transformations. The regularization capability of non-linear mixup is enhanced by using a matrix as the mixup coefficient and applying 3D rotation in colour space about different reference channels, which correspond to the POI and the relevant cellular regions of minority classes. Synthetic samples are included for training through an imbalance-aware sampling approach, which measures the imbalance ratio per class and accordingly identifies medium and minority classes. The proposed method, evaluated on a publicly available Kaggle challenge dataset, outperforms existing PSL methods. Experimental results demonstrate the advantage of applying oversampling with multiple data augmentation techniques and an imbalance-aware sampling approach to improve the predictions for minority and medium classes.
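The two quantities the method relies on, the per-label imbalance ratio (IRLbl) and SCUMBLE, can be sketched as below, following the standard definitions of Charte et al.; this is an illustrative implementation, not the paper's code.

```python
import numpy as np

def irlbl(label_sets, n_classes):
    """Imbalance ratio per label: count of the most frequent label
    divided by each label's count (higher means rarer)."""
    counts = np.zeros(n_classes)
    for labels in label_sets:
        for c in labels:
            counts[c] += 1
    return counts.max() / counts

def scumble(label_sets, n_classes):
    """SCUMBLE: per instance, 1 minus the ratio of the geometric to the
    arithmetic mean of IRLbl over the instance's labels, averaged over
    all instances. High values mean minority and majority labels
    frequently co-occur in the same samples."""
    ir = irlbl(label_sets, n_classes)
    vals = []
    for labels in label_sets:
        r = ir[np.array(labels)]
        vals.append(1.0 - r.prod() ** (1.0 / len(r)) / r.mean())
    return float(np.mean(vals))
```

SCUMBLE is 0 when no instance mixes labels of different imbalance levels, and grows as majority and minority labels co-occur, which is exactly the concurrence condition that motivates the proposed oversampling.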
As a further application, it would be interesting to assess the ability of the proposed method to estimate the fractions of proteins residing in different subcellular regions.