OrganelX web server for sub-peroxisomal and sub-mitochondrial protein localization and peroxisomal target signal detection

We present the OrganelX e-Science Web Server that provides a user-friendly implementation of the In-Pero and In-Mito classifiers for sub-peroxisomal and sub-mitochondrial localization of peroxisomal and mitochondrial proteins and the Is-PTS1 algorithm for detecting and validating potential peroxisomal proteins carrying a PTS1 signal sequence. The OrganelX e-Science Web Server is available at https://organelx.hpc.rug.nl/fasta/.


Introduction
Signatures in the amino acid sequences of proteins have been associated with domains, families, functional sites and sub-cellular localization [1][2][3][4]. These sequences can be used in association with machine learning (ML) approaches to develop prediction tools, which nowadays are easily findable and accessible [5][6][7][8]. Deep learning approaches have recently been used to embed (encode) protein sequences, with promising results for several tasks, including sub-cellular classification [9][10][11][12][13][14][15]. The Unified Representation (UniRep) [9] and Sequence-to-Vector (SeqVec) [10] are two of the most promising protein sequence embeddings already in use. UniRep provides an amino-acid embedding that summarizes physico-chemical properties and phylogenetic clusters and has been shown to be efficient at distinguishing proteins from different structural classes [9]. SeqVec showed optimal performance for predicting sub-cellular localization [10]. The potential of these embeddings has recently been explored for highly specific tasks, such as sub-organelle localization: in particular, they have been used for sub-peroxisomal and sub-mitochondrial protein localization [16]. Peroxisomes and mitochondria are ubiquitous organelles, surrounded by a single (peroxisomes) or a double (mitochondria) biomembrane, that are relevant to many metabolic and non-metabolic pathways [17,18]. The full extent of the functions of peroxisomes and mitochondria, and of the pathways involved, is still largely unknown [19]: in this light, the discovery of new peroxisomal and mitochondrial proteins can facilitate further knowledge acquisition. Here we present the OrganelX Web Server (available at https://organelx.hpc.rug.nl/fasta/), which hosts two existing algorithms designed to predict the sub-peroxisomal (In-Pero) and sub-mitochondrial (In-Mito) localization of a (set of) protein(s) starting from the amino acid sequence(s).
The In-Pero and In-Mito algorithms were introduced in [16] and can be used to predict the sub-cellular localization of known or putative peroxisomal and mitochondrial proteins whose localization is unknown. We also introduce a new functionality (the Is-PTS1 algorithm) for the classification of protein sequences as peroxisomal (i.e. proteins that can be imported into the peroxisome) or non-peroxisomal, starting from the detection of a specific peroxisomal targeting signal (PTS1) [20].
To our knowledge, there are no online resources that allow simple and fast prediction of the sub-peroxisomal and sub-mitochondrial localization, or the prediction of peroxisomal proteins through identification of the PTS1 signal, starting from the amino acid sequence. The tools offered online through the OrganelX server facilitate research on peroxisomes and mitochondria by making the prediction of protein sub-cellular localization easy to perform (only the upload of protein FASTA sequences is needed). OrganelX can be used without programming skills and significantly reduces the number of bioinformatic steps that would otherwise be required to extract relevant information from the protein sequences of interest [21].

In-Pero and In-Mito classifiers
The In-Pero and In-Mito algorithms and the prediction models implemented in the OrganelX Web Server have been introduced and described in Anteghini et al. (2021) [16]. We give here a brief account of the most important characteristics. We refer the reader to the original publication for full details on algorithm development, training and validation.
The In-Pero prediction model was originally trained on a curated, non-redundant (40% sequence identity) data set of 160 peroxisomal proteins [16] with validated sub-cellular localization; the In-Mito model was trained on a curated, non-redundant (40% sequence identity) data set of 424 mitochondrial proteins [16], also with validated sub-cellular localization.
The classification problems are solved using Support Vector Machines [22]. The In-Pero algorithm predicts whether a peroxisomal protein belongs to the matrix or is a (trans)membrane protein, resulting in a binary classification problem. The In-Mito algorithm predicts the possible localization of mitochondrial proteins: matrix, inner membrane, inter-membrane space and outer membrane, resulting in a four-class classification problem.

The Is-PTS1 classifier
The proteins that are imported into peroxisomes (peroxisomal proteins) are directed to the peroxisome through the PEX5 receptor, which recognizes a specific region of the peroxisomal protein called the peroxisomal targeting signal 1 (PTS1) [23]. Operationally, the PTS1 is defined as the dodecamer sequence at the C-terminal end of the protein, which accommodates physical contacts with both the surface and the binding cavity of PEX5 and ensures accessibility of the extreme C-terminus [20]. However, the presence of the PTS1 does not guarantee the import of a protein across the peroxisomal membrane [20,24,25]. The problem is then to predict whether a protein carrying the PTS1 is a peroxisomal protein or not.
Due to some limitations in the embedding generation procedure, we recommend that the user upload sequences with fewer than 1200 residues [9,10]. When dealing with longer sequences, we recommend keeping the C-terminal part and, if necessary, removing the N-terminal part.
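The length recommendation and the operational definition of the PTS1 region can be illustrated with a minimal Python sketch. The function names and the constant `MAX_RESIDUES` are hypothetical and are not part of OrganelX; the sketch only trims a sequence from the N-terminal side and extracts the C-terminal dodecamer. It does not decide whether that dodecamer is a PTS1 signal.

```python
MAX_RESIDUES = 1199  # assumption: "fewer than 1200 residues", as recommended above
PTS1_LENGTH = 12     # the PTS1 is operationally defined as a C-terminal dodecamer


def trim_to_c_terminus(sequence: str, max_residues: int = MAX_RESIDUES) -> str:
    """Keep at most `max_residues` residues, dropping the N-terminal excess."""
    sequence = sequence.strip().upper()
    # Negative slicing keeps the C-terminal end; shorter sequences are unchanged.
    return sequence[-max_residues:]


def c_terminal_dodecamer(sequence: str) -> str:
    """Return the last 12 residues, i.e. the region where a PTS1 would sit."""
    sequence = sequence.strip().upper()
    if len(sequence) < PTS1_LENGTH:
        raise ValueError("sequence is shorter than 12 residues")
    return sequence[-PTS1_LENGTH:]
```

Trimming from the N-terminal side preserves the C-terminal region on which the PTS1 definition relies.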
If the PTS1 signal is found, the full amino acid sequence is encoded using the concatenation of the UniRep [9] and SeqVec [10] protein embeddings, as in the case of the In-Pero and In-Mito algorithms [16]. The binary classification of the protein sequence as peroxisomal or non-peroxisomal is carried out using a Support Vector Machine classifier [22] trained on a non-redundant (40% sequence identity) data set consisting of 72 peroxisomal proteins (positives) and 155 non-peroxisomal proteins (negatives), all carrying a putative PTS1 signal.
An additional data set comprising the proteomes of five organisms (Saccharomyces cerevisiae, Homo sapiens, Danio rerio, Mus musculus and Bos taurus) was assembled to assess how many proteins contain a putative PTS1 signal. The protein sequences were downloaded from UniProt [27] (release 04_2022) and only the reviewed sequences were considered. An overview of the number of proteins containing a PTS1 signal is reported in Table 1. Considering the proteins from all the species, 6.4% of the reviewed proteins carrying a putative PTS1 signal are also annotated as peroxisomal.

Model optimization
The training, hyper-parameter optimization and validation procedures of the In-Pero, In-Mito and Is-PTS1 prediction models were carried out using a repeated double cross-validation approach [28,29], as detailed in [16].
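The structure of a repeated double (nested) cross-validation can be sketched as follows. This is only an illustration of how the data are partitioned, not the procedure of [16]: the function names, the number of inner folds and repeats, and the absence of class stratification are all assumptions.

```python
import random


def k_fold_indices(indices, k, rng):
    """Shuffle `indices` and split them into k nearly equal folds."""
    shuffled = list(indices)
    rng.shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]


def repeated_double_cv(n_samples, outer_k=5, inner_k=5, repeats=2, seed=0):
    """Yield (repeat, outer_test, inner_splits) for a repeated double CV.

    `inner_splits` is a list of (inner_train, inner_validation) pairs built
    only from the outer training samples, so the outer test fold never
    influences hyper-parameter selection.
    """
    rng = random.Random(seed)
    for repeat in range(repeats):
        outer_folds = k_fold_indices(range(n_samples), outer_k, rng)
        for i, outer_test in enumerate(outer_folds):
            # Outer training set: every sample not in the current test fold.
            outer_train = [s for j, fold in enumerate(outer_folds)
                           if j != i for s in fold]
            inner_folds = k_fold_indices(outer_train, inner_k, rng)
            inner_splits = []
            for m, inner_val in enumerate(inner_folds):
                inner_train = [s for n, fold in enumerate(inner_folds)
                               if n != m for s in fold]
                inner_splits.append((inner_train, inner_val))
            yield repeat, outer_test, inner_splits
```

In such a scheme the inner splits would be used to select hyper-parameters and the outer test fold to estimate performance; repeating the whole procedure with different shuffles gives a spread for each quality metric.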

Prediction results
The results of the prediction (peroxisomal and mitochondrial sub-cellular localization, presence of the PTS1/peroxisomal protein) are given with an associated probability. For the binary classifiers (In-Pero and Is-PTS1), the class probability is calibrated using Platt scaling [30], i.e. a logistic regression on the SVM scores, fitted by an additional cross-validation on the training data. For the multiclass classifier (In-Mito), the class probability is calculated using the improved version of the coupling approach [31,32].
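Platt scaling maps an SVM decision value f to a probability through a sigmoid, P(y = 1 | f) = 1 / (1 + exp(A f + B)), where A and B are fitted on held-out scores. The sketch below only applies the sigmoid; the function name and any parameter values passed to it are hypothetical and are not the values used by OrganelX.

```python
import math


def platt_probability(score: float, a: float, b: float) -> float:
    """Map an SVM decision value to a probability: 1 / (1 + exp(a*score + b))."""
    z = a * score + b
    # Numerically stable evaluation of the sigmoid for large |z|.
    if z >= 0:
        e = math.exp(-z)
        return e / (1.0 + e)
    return 1.0 / (1.0 + math.exp(z))
```

With a negative `a`, larger decision values give probabilities closer to 1, and a score on the decision boundary with `b = 0` gives 0.5.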

Data sets for extra validation
We assembled two additional data sets for extra validation of the In-Pero and Is-PTS1 algorithms (Web server implementation). The In-Mito algorithm was already externally validated in the original publication against two existing tools: DeepMito [33] and DeepPred-SubMito [34] (see Table 3 in [16]).
For the validation of In-Pero, we queried UniProt [35] for reviewed proteins with a clear sub-peroxisomal annotation in the membrane ("SL-0203" and "GO:0005778") or matrix ("SL-0202" and "GO:0005782"). The resulting sequences were then clustered at 40% sequence identity with CD-HIT [36]. Sequences overlapping with our original training set were removed, obtaining 85 membrane proteins and 59 matrix proteins.
Table 1 Summary statistics (per organism) of proteins with putative PTS1 signals retrieved from UniProt. 'n. protein' indicates the total number of proteins retrieved per organism, with the number of peroxisomal proteins in brackets; 'n. matches' indicates the number of proteins containing a putative PTS1 signal in the C-terminal part of the sequence; 'true matches' (TM) indicates how many of the 'n. matches' are annotated as peroxisomal.
To validate the Is-PTS1 algorithm we retrieved from UniProt (and processed in a similar way) 15 peroxisomal proteins carrying the PTS1 signal (true positives) and 15 non-peroxisomal proteins carrying the PTS1 signal (true negatives).
The internal services for running the classification algorithms are located on Peregrine, the high-performance computing cluster at the University of Groningen, the Netherlands. For more information, see https://www.rug.nl/society-business/centre-for-information-technology/research/services/hpc/facilities/peregrine-hpc-cluster?lang=en.

Performance of In-Pero and In-Mito
The performance and benchmarking of the In-Pero and In-Mito algorithms are exhaustively illustrated and discussed in [16]. For convenience, we give in Table 2 a summary of the validation results from [16].

Validation of the Performance of the Is-PTS1 algorithm
The Is-PTS1 predictor is a newly implemented algorithm. Its overall performance was assessed against the data set containing peroxisomal proteins carrying a PTS1 (see Table 1). The yeast peroxisome is the organelle with the highest protein concentration, which partially explains the high number of annotated peroxisomal proteins carrying a PTS1 signal found in UniProt [39]. Also, peroxisomal proteins are often studied in yeast as a model organism [40]. Is-PTS1 performance on the indicated data set is excellent: ACC = 0.92 ± 0.01 (Accuracy),

Extra validation of the Web server implementation
The performance of the In-Pero predictor on the extra validation data set is given in Table 3: the quality metrics are in line with those observed in the original publication [16]. The performance of the Is-PTS1 predictor is consistent with the results obtained on the training data set (see Section 3.1.2).
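For reference, the threshold-based quality metrics used in the tables (accuracy, F1 score and Matthews' Correlation Coefficient) can be computed from the four counts of a binary confusion matrix, as in the sketch below. The function name is hypothetical, and any counts passed to it in an example are made up rather than taken from the validation results.

```python
import math


def binary_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute accuracy, F1 and MCC from binary confusion-matrix counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    # MCC is conventionally set to 0 when any marginal count is zero.
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0
    return {"ACC": accuracy, "F1": f1, "MCC": mcc}
```

Unlike accuracy, the MCC uses all four counts symmetrically, which makes it informative when the two classes have different sizes.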

Using the OrganelX Web Server
An overview of the functionalities available on the OrganelX Web Server is shown in Fig. 1. The different prediction tools (In-Pero, In-Mito and Is-PTS1) are accessible from the homepage, as shown in Fig. 2.

OrganelX Web Server: input
The input for the In-Pero (sub-cellular localization of peroxisomal proteins), In-Mito (sub-cellular localization of mitochondrial proteins) and Is-PTS1 (detection of a peroxisomal targeting signal) algorithms available on the OrganelX Web Server is a FASTA text file containing one or more protein sequences. Each sequence begins with a single-line description, followed by the amino-acid sequence data. The single-line description has the form >sp-ID-Desc, where >sp is a fixed prefix, ID is the sequence name and Desc is a descriptive text; the amino-acid sequence follows on the next lines. Alternatively, the single-line description can be simply >ID, as in a basic FASTA file. The input window of the OrganelX Web Server is shown in Fig. 3A.
Table 2 Performance of the In-Pero and In-Mito prediction algorithms from [16]. Results are given as mean ± standard deviation over a 5-fold Double Cross Validation. Prediction quality metrics: F1 score (F1, the harmonic mean of precision and recall), Accuracy (ACC), Matthews' Correlation Coefficient (MCC) [38] and the Area Under the Curve (AUC). The performance of the In-Mito classifier is quantified using the MCC for each sub-cellular mitochondrial compartment: outer membrane (O), inner membrane (I), inter-membrane space (T) and matrix (M). The In-Mito performance is benchmarked against two other methods, namely DeepMito [33] and DeepPred-SubMito (DP-SM) [34].
Table 3 Performance of the In-Pero and Is-PTS1 predictors on the two extra validation data sets. Performance quality metrics: F1 score (F1), Accuracy (ACC), Matthews' Correlation Coefficient (MCC) [38] and Area Under the Curve (AUC).
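The FASTA input format described in this section can be checked locally before upload with a small parser such as the sketch below. The function name is hypothetical, and the sketch assumes that the separators in the long header form are the `|` characters used in UniProt FASTA headers (for example `>sp|ID|Desc`). This assumption should be checked against the example file provided on the server.

```python
def parse_fasta(text: str) -> dict:
    """Parse FASTA text into {sequence_id: sequence}."""
    records = {}
    current_id = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            header = line[1:].strip()
            if not header:
                raise ValueError("empty FASTA header")
            fields = header.split("|")
            if len(fields) >= 3 and fields[0] == "sp":
                # Assumed UniProt-style header: sp|ID|Desc
                current_id = fields[1]
            else:
                # Basic header: the ID is the first whitespace-delimited token
                current_id = header.split()[0]
            if current_id in records:
                raise ValueError(f"duplicate sequence ID: {current_id}")
            records[current_id] = ""
        else:
            if current_id is None:
                raise ValueError("sequence data found before any '>' header")
            # Sequence data may span several lines; join them.
            records[current_id] += line.upper()
    return records
```

Rejecting duplicate IDs and headerless sequence data up front avoids ambiguous rows when the results are later matched back to the submitted entries.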

OrganelX Web Server: submitting a job
When submitting a job, the user can specify a username and an email address (optional) and upload a FASTA file. The user can either wait for the results via email or refresh the result web page. The result page is automatically refreshed every 3 min. The computation time may change depending on the file size and the traffic on the website. If an email address has been provided, the user will receive a message including the results attached in a .csv file (e.g. Fig. 4). An example can be accessed at https://organelx.hpc.rug.nl/fasta/test_example, from where an example FASTA file can be downloaded.

OrganelX Web Server: output
The results are given in a .csv file, which allows easy manipulation and re-use for further analysis. The .csv file contains the classification results and the probabilities for each predicted class. The results for each class are reported under its specific column, while each row contains the ID of the corresponding classified entry. The output window of the OrganelX Web Server is shown in Fig. 3B.
Fig. 2. Homepage of the OrganelX Web Server (https://organelx.hpc.rug.nl/fasta/). From the homepage, the user can access the three prediction tools: Is-PTS1 (prediction of peroxisomal proteins based on the presence of the Peroxisome Target Signal), In-Pero (sub-cellular localization of peroxisomal proteins) and In-Mito (sub-cellular localization of mitochondrial proteins). An example is also available. The blue 'Go to page' buttons redirect the user to the specific tool.
Fig. 3. Input and output windows of the OrganelX Web Server. (A) The user can specify an arbitrary username and an email address at which to receive the results. The FASTA file is uploaded by clicking on the 'Choose file' button; (B) the probabilities for each protein in the FASTA file appear next to a specific class (e.g. 'pred membrane' or 'pred matrix'). In case of errors during the embedding generation, the protein ID is flagged as 'not encoded'.
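As an illustration of how the .csv results could be re-used, the sketch below reads a results table and reports the most probable class for each entry. The exact column layout of the OrganelX output is not specified here, so the layout is an assumption: the identifier column name `ID` is hypothetical, the class column names follow the labels quoted in the Fig. 3 caption, and non-numeric cells such as 'not encoded' are simply skipped.

```python
import csv
import io


def top_predictions(csv_text: str, id_column: str = "ID") -> dict:
    """Return {protein_id: (best_class, probability)} from a results table."""
    results = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        protein_id = row[id_column]
        probabilities = {}
        for column, value in row.items():
            if column == id_column:
                continue
            try:
                probabilities[column] = float(value)
            except (TypeError, ValueError):
                continue  # non-numeric cell, e.g. 'not encoded'
        if probabilities:
            best = max(probabilities, key=probabilities.get)
            results[protein_id] = (best, probabilities[best])
        else:
            results[protein_id] = None  # no usable probability for this entry
    return results


# Hypothetical example table, not real OrganelX output.
example = "ID,pred membrane,pred matrix\nP1,0.91,0.09\nP2,not encoded,not encoded\n"
summary = top_predictions(example)
```

Entries that could not be encoded are kept in the summary with a value of `None`, so that they remain visible in downstream analyses.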

Conclusions
The In-Pero predictor allows accurate classification of membrane and matrix proteins inside peroxisomes. The Is-PTS1 predictor detects peroxisomal proteins carrying a PTS1 signal. The In-Mito predictor can be used as a complementary tool when investigating ambiguous or double localization in mitochondrial proteins. These tools proved to be accurate and a valid alternative to the commonly used pipelines, which are less precise, fragmented and time-consuming. The three prediction algorithms are now easily accessible and simple to use through the OrganelX Web Server. OrganelX provides a solution to the problem of accurately performing sub-organelle classification, helps address the lack of specific computational methods in peroxisomal research and will facilitate the work of the many groups working on peroxisome and mitochondria research.

Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.