DeepLoc 2.0: multi-label subcellular localization prediction using protein language models

Abstract The prediction of protein subcellular localization is of great relevance for proteomics research. Here, we propose an update to the popular tool DeepLoc with multi-localization prediction and improvements in both performance and interpretability. For training and validation, we curate eukaryotic and human multi-location protein datasets with stringent homology partitioning and enriched with sorting signal information compiled from the literature. We achieve state-of-the-art performance in DeepLoc 2.0 by using a pre-trained protein language model. It has the further advantage that it uses sequence input rather than relying on slower protein profiles. We provide two means of better interpretability: an attention output along the sequence and highly accurate prediction of nine different types of protein sorting signals. We find that the attention output correlates well with the position of sorting signals. The webserver is available at services.healthtech.dtu.dk/service.php?DeepLoc-2.0.


DATA PARTITIONING
To generate high-quality data partitions for the SwissProt dataset, we adopted the procedure described by Gíslason et al. (7) to generate label-balanced splits for 5-fold cross-validation. This procedure ensures that no pair of training and test folds shares sequences with global sequence identity greater than 30%, as determined using ggsearch36, which is part of the FASTA package (8).
The Human Protein Atlas (9) project provides annotation for 78,136 proteins. The HPA independent dataset was constructed using the following steps: 1. Homology reduction to ensure a maximum of 30% sequence identity to the whole SwissProt dataset using USEARCH v11.0.667, 32-bit (10). This leaves 23,422 proteins.
ProtT5 respectively throughout the rest of the manuscript. Additionally, a suffix "(S)" or "(M)" is added to indicate whether the model was trained with single- or multi-location labels. The maximum sequence length used for training was 1022 for the ESM models and 4000 for ProtT5. Proteins exceeding this length had the middle portion of their sequence removed so that both ends are retained. The limit is lower for the ESM models because of constraints of the ESM1b model architecture.
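The middle-truncation step described above can be sketched as follows (the function name and the exact split between the two ends are illustrative assumptions, not the authors' implementation):

```python
def truncate_middle(seq, max_len):
    """Remove the middle of an over-long sequence, keeping both termini,
    where N- and C-terminal sorting signals reside."""
    if len(seq) <= max_len:
        return seq
    half = max_len // 2
    # keep the first `half` residues and the last `max_len - half` residues
    return seq[:half] + seq[len(seq) - (max_len - half):]
```

For example, truncating a 2000-residue protein to the ESM limit of 1022 keeps the first 511 and last 511 residues.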

Multi-label prediction loss
We used weighted focal loss (15) with the binary cross-entropy objective to train the location predictions. The weight of each of the ten localization labels was set inversely proportional to the label's frequency in the training dataset, so that all labels contribute equally to the loss. The γ parameter of the focal loss is set to 1, following previous work that uses similar losses for multi-label classification (16,17).
$$\mathcal{L}_{\mathrm{loc}} = -\sum_{l \in L} w_l \left[ y_l \, (1 - p_l)^{\gamma} \log p_l + (1 - y_l) \, p_l^{\gamma} \log(1 - p_l) \right]$$

where L is the set of all labels, y_l is the target for label l, w_l is the weight for the label, p_l is the output probability for the label, and γ is the focal loss parameter.
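A minimal sketch of the weighted focal binary cross-entropy described above (a reference implementation over plain Python lists, not the authors' PyTorch code):

```python
import math

def focal_bce_loss(y, p, w, gamma=1.0):
    """Weighted focal binary cross-entropy summed over labels.
    y: 0/1 targets, p: predicted probabilities, w: per-label weights.
    With gamma = 0 this reduces to ordinary weighted BCE."""
    total = 0.0
    for y_l, p_l, w_l in zip(y, p, w):
        if y_l == 1:
            # well-classified positives (p_l near 1) are down-weighted by (1-p)^gamma
            total += -w_l * (1.0 - p_l) ** gamma * math.log(p_l)
        else:
            total += -w_l * p_l ** gamma * math.log(1.0 - p_l)
    return total
```

With γ = 1 as in the paper, a positive label predicted at p = 0.5 contributes 0.5·log 2 ≈ 0.347 rather than the plain BCE value log 2 ≈ 0.693.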

Supervision using sorting signals
Normalized KL-divergence loss between the attention and the annotated signal is used whenever available. Additionally, we weight the loss to make the effective number of samples of each signal type the same.
where L is the length of the full sequence and q is the attention distribution. p is the target probability distribution, defined to be 0 at positions where the sorting signal is not present and 1/L_p otherwise, where L_p is the length of the sorting signal.
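The supervision term above can be sketched as follows. The normalization by log L is an assumption on our part (the paper says only "normalized KL-divergence"); the function name is illustrative:

```python
import math

def attention_kl_loss(attention, signal_mask):
    """KL divergence D(p || q) between the target distribution p
    (uniform, 1/L_p, over annotated signal positions; 0 elsewhere)
    and the attention q. Divided by log(L) here as a length
    normalization (an assumption)."""
    L = len(attention)
    Lp = sum(signal_mask)  # length of the annotated sorting signal
    kl = 0.0
    for q_i, in_signal in zip(attention, signal_mask):
        if in_signal:  # positions with p_i = 0 contribute nothing
            p_i = 1.0 / Lp
            kl += p_i * math.log(p_i / q_i)
    return kl / math.log(L)
```

When the attention exactly matches the target distribution, the loss is zero; attention spread uniformly over the whole sequence while the signal covers a single residue gives the maximum value under this normalization.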

DCT-prior-based regularization
In previous work, Tseng et al. (18) regularized the saliency (input times gradient) based on the Fourier transform and found that this improves interpretability and training stability by increasing the signal-to-noise ratio. We apply this idea to our attention-pooling layer instead. It is known that some sorting signals are present at the N- and C-termini, and we expect the learned attention to reflect this. Since the Fourier transform assumes the signal is periodic, discontinuities at the ends of the sequence lead to artifacts. We avoid this issue by replacing the Fourier transform with the Discrete Cosine Transform (DCT).
We first smooth the raw attention scores (before the softmax) using a Gaussian 1-D convolutional filter of size 5, clipped at one standard deviation (K). The attention is then padded on both sides by repeating the value at each end. The regularization loss is computed by summing the DCT coefficients using the following weighting scheme: the first ⌊L/6⌋ normalized DCT coefficients have weight 1, and the rest are weighted by the decaying function 1/(1 + x^0.2), where L is the length of the sequence and x is the index of the coefficient minus ⌊L/6⌋.
where L is the length of the full sequence, S is the raw attention scores, K is the Gaussian kernel described above, the * operation represents convolution, and the w_i are the weights defined by the scheme above.
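A sketch of the DCT-based penalty, assuming an orthonormal DCT-II and omitting the Gaussian smoothing and edge padding for brevity (the weighting scheme is our reading of the description above):

```python
import math

def dct2(x):
    """Orthonormal DCT-II of a real sequence (pure-Python reference,
    matching the 'ortho' normalization used by common libraries)."""
    N = len(x)
    out = []
    for k in range(N):
        s = sum(x[n] * math.cos(math.pi * (n + 0.5) * k / N) for n in range(N))
        scale = math.sqrt(1.0 / N) if k == 0 else math.sqrt(2.0 / N)
        out.append(scale * s)
    return out

def dct_regularizer(scores):
    """Weighted sum of |DCT coefficients| of the raw attention scores:
    the first L//6 (low-frequency) coefficients get weight 1, later ones
    are down-weighted by 1/(1 + x**0.2), penalizing high-frequency noise."""
    L = len(scores)
    coeffs = dct2(scores)
    cutoff = max(1, L // 6)
    loss = 0.0
    for i, c in enumerate(coeffs):
        w = 1.0 if i < cutoff else 1.0 / (1.0 + (i - cutoff) ** 0.2)
        loss += w * abs(c)
    return loss
```

Unlike the DFT, the DCT implicitly mirrors the sequence at its boundaries, so attention concentrated at the termini does not create the spurious high-frequency content that a periodic extension would.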
The final loss is a weighted combination of the three losses. The attention supervision loss and the regularization loss are scaled by 0.1 to ensure that the secondary losses do not dominate the multi-label loss.
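The combination above amounts to the following one-liner (the 0.1 scale is from the text; the function and argument names are illustrative):

```python
def total_loss(loc_loss, attn_supervision_loss, dct_reg_loss, aux_scale=0.1):
    """Combine the three training objectives; the two attention-related
    terms are down-weighted so the multi-label localization loss dominates."""
    return loc_loss + aux_scale * (attn_supervision_loss + dct_reg_loss)
```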

Signal type prediction
Since optimizing over multiple tasks is challenging, we freeze the model parameters after training with the multi-label localization and sorting-signal prediction objectives. The pooled embedding vector after attention and the final prediction probabilities are used as input to a multi-layer perceptron (MLP) that predicts signal types in a multi-label fashion.
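A forward-pass sketch of this head. The single hidden layer, ReLU activation, and tiny dimensions are illustrative assumptions; only the input construction (pooled embedding concatenated with localization probabilities, sigmoid multi-label outputs) follows the text:

```python
import math

def mlp_signal_head(pooled_embedding, loc_probs, W1, b1, W2, b2):
    """Signal-type head on top of the frozen backbone:
    input = [pooled embedding ; localization probabilities],
    one ReLU hidden layer, independent sigmoid outputs per signal type."""
    x = list(pooled_embedding) + list(loc_probs)
    hidden = [max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(W1, b1)]
    logits = [sum(w * hi for w, hi in zip(row, hidden)) + b
              for row, b in zip(W2, b2)]
    # multi-label: each signal type gets its own probability
    return [1.0 / (1.0 + math.exp(-z)) for z in logits]
```

Because the backbone is frozen, only W1, b1, W2, b2 are trained in this stage, which sidesteps the multi-task optimization difficulty mentioned above.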

Training details
Different learning rates were used for the transformer encoder (5×10^−6 if fine-tuning, 0 otherwise) and for the attention-pooling and classification layers (5×10^−5). Training was run for a fixed number of epochs, chosen large enough that the models always overfit as judged on a randomly sampled validation set: a maximum of 5 epochs when fine-tuning and 15 for the frozen models. The final model was picked by taking the best validation loss over all epochs.
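The learning-rate setup and checkpoint selection can be sketched as follows (group names are illustrative; in practice these would be per-parameter-group options of a PyTorch optimizer):

```python
def make_param_groups(finetune):
    """Per-module learning rates used during training (a sketch)."""
    encoder_lr = 5e-6 if finetune else 0.0  # frozen encoder gets lr 0
    return [
        {"name": "encoder", "lr": encoder_lr},
        {"name": "attention_pooling", "lr": 5e-5},
        {"name": "classifier", "lr": 5e-5},
    ]

def pick_best_epoch(val_losses):
    """Select the checkpoint with the lowest validation loss,
    matching 'best validation loss over all epochs'."""
    best = min(range(len(val_losses)), key=val_losses.__getitem__)
    return best, val_losses[best]
```

Training past the point of overfitting and then restoring the best checkpoint is equivalent to early stopping without having to tune a patience parameter.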
Mixed-precision and model-sharding techniques were used to fine-tune the models efficiently. Training and testing used the PyTorch Lightning (19) library on hardware provided by Google Colaboratory GPUs and two Tesla V100s.

Benchmarking existing methods
For our experiments, DeepLoc 1.0, a retrained version of DeepLoc 1.0, and ESM12 (S) are the single-location predictors used as baselines; the rest predict multiple locations. Each method considers a different set of possible locations, so we map each of these to the locations defined in Section 2.1. After this reduction, the HPA dataset contains no or very few proteins in the Extracellular, Plastid, Lysosome/Vacuole, and Peroxisome locations, so we exclude these from the true labels. As a consequence, single-label predictors can have an average number of predicted labels below one on the HPA independent test set, since predictions can fall outside the six remaining true labels.
Sequence-based methods: For YLoc+, we downloaded the standalone version of the predictor and used the animal, plant, or fungi predictor as appropriate, without the GO-terms option. DeepLoc-1.0 was retrained using the same procedure as originally described in (20).
GO-based methods: Sequence-based methods are easier to benchmark, since proper homology partitioning between the training and test sets ensures that the performance evaluation is a good estimate for unseen proteins. GO-based methods, on the other hand, rely on local-alignment scores against a large database of indirect localization labels, i.e. GO terms. Since the partitioning only considers the training and test sets and not this database, there is potential homology "leakage", which can lead to overestimating the performance of these methods. We found that ∼93% of sequences remaining after Step 4 in Section 3.1 have a sequence-identity match greater than 30% with the ProSeq-GO database. Therefore, we reimplement Fuel-mLoc, changing only the database used. Based on global sequence identity measured using USEARCH, we exclude 10,550 sequences (about 2% of the database) from the BLAST search. To ensure a fair comparison, we first consider Fuel-mLoc itself, using the webserver with the "Eukaryotes" and "Local database" options to obtain predictions. We then run our reimplementation without any changes to the database to confirm that it matches the original as closely as possible. Finally, we use the reduced database to produce the results for comparison. Detailed results are provided in Supplementary Table 5.
The mapping of locations for the methods used in the comparison is provided in Supplementary Table 2 and 3.

Metrics for subcellular localization
We use the following metrics to comprehensively quantify the classification performance on the datasets:
• Number of predicted labels: Averaged over all predictions, this reveals the bias of the predictor.
• Accuracy: Requires the exact location(s) to be predicted. Since the dataset is skewed towards proteins with single localization, this metric provides an advantage to single-label predictors.
• Jaccard: Size of the intersection of the actual and predicted label sets divided by the size of their union.
• MicroF1: F1 score considering the total number of true positives, false negatives, and false positives.
• MacroF1: F1 score computed for each class and then averaged, providing equal emphasis on rare and frequent classes.
• Matthews Correlation Coefficient: Measured for each class, it requires the model to perform well on all four confusion matrix entries.
For our multi-label prediction models, we computed the thresholds for each class by maximizing the MCC on the training set. The prediction thresholds can be found on the output page after submitting proteins to the webserver.
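The per-class threshold selection can be sketched as follows (the candidate grid of 0.01 steps is an assumption; the paper does not specify the search procedure):

```python
import math

def mcc(y_true, y_pred):
    """Matthews correlation coefficient for binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom

def best_threshold(y_true, probs, candidates=None):
    """Pick the decision threshold for one class that maximizes
    MCC on the training set."""
    if candidates is None:
        candidates = [i / 100 for i in range(1, 100)]
    return max(candidates,
               key=lambda t: mcc(y_true, [int(p >= t) for p in probs]))
```

Running this once per localization class yields the per-class thresholds that are then fixed and reported on the webserver output page.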

Measuring the interpretability of attention
For quantifying the relevance of attention to the sorting signals, we use the following metrics:
• Importance in signal: Total attention mass present within the signal.
• Signal over background: The average attention value within the signal over the average value outside the signal.
• Metric Entropy: The entropy of the attention distribution normalized by log L for a protein of length L. It ranges from 0 to 1, with lower values indicating that more attention mass is placed on fewer residues.
• KL-Divergence: Distributional dissimilarity between the signal and attention.
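The first three metrics above can be computed as follows (a sketch over a normalized attention vector and a binary signal mask; function and variable names are ours):

```python
import math

def attention_metrics(attention, signal_mask):
    """Interpretability metrics for a normalized attention distribution
    against a binary sorting-signal mask."""
    L = len(attention)
    in_sig = [a for a, m in zip(attention, signal_mask) if m]
    out_sig = [a for a, m in zip(attention, signal_mask) if not m]
    # Importance in signal: total attention mass inside the signal
    importance = sum(in_sig)
    # Signal over background: mean attention inside / mean attention outside
    sig_over_bg = ((sum(in_sig) / len(in_sig)) /
                   (sum(out_sig) / len(out_sig)))
    # Metric entropy: Shannon entropy normalized by its maximum, log L
    entropy = -sum(a * math.log(a) for a in attention if a > 0)
    metric_entropy = entropy / math.log(L)
    return importance, sig_over_bg, metric_entropy
```

Uniform attention yields a signal-over-background ratio of 1 and a metric entropy of 1, the uninformative baseline against which the trained models are compared.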

Preliminary questions
Improvements due to transformer models: We trained ESM12 (Single) and retrained DeepLoc-1.0 on our cross-validation dataset, using only proteins with a single location. From Table 6, we find that ESM12 (Single) significantly outperforms both DeepLoc-1.0 and its retrained version. This is to be expected, since ESM12 (Single) is a much larger model with unsupervised pretraining on a large dataset.
Improvements due to multi-label annotations: Comparing ESM12 (Single) and ESM12 (Multi), we see that while the absolute accuracy drops, the remaining metrics show that the predictions are better overall as well as for each location. We conclude that the multi-label annotations provide additional useful information to the models.

Loss term ablation
We trained the largest and smallest models with and without the two additional losses for the attention layer, i.e. the supervision and regularization losses. We observed that the multi-label localization performance was mostly unaffected, whereas interpretability increased dramatically when these loss terms were included. Supplementary Tables 8 and 9 show that, when trained with these terms, the signals are more prominent relative to the background, more of the attention is placed within the signal, the attention is sparser, and the KL-divergence is lower, implying that the attention correlates better with the signal. Supplementary