Modelling and Recognition of Protein Contact Networks by Multiple Kernel Learning and Dissimilarity Representations

Multiple kernel learning is a paradigm which employs a properly constructed combination of kernel functions able to simultaneously analyse different data or different representations of the same data. In this paper, we propose a hybrid classification system based on a linear combination of multiple kernels defined over multiple dissimilarity spaces. The core of the training procedure is the joint optimisation of the kernel weights and of the representative selection in the dissimilarity spaces. This equips the system with a two-fold knowledge discovery phase: by analysing the weights, it is possible to check which representations are more suitable for solving the classification problem, whereas the pivotal patterns selected as representatives can give further insights on the modelled system, possibly with the help of field-experts. The proposed classification system is tested on real proteomic data in order to predict proteins' functional role starting from their folded structure: specifically, a set of eight representations is drawn from the graph-based description of the protein fold. The proposed multiple kernel-based system has also been benchmarked against a clustering-based classification system, also able to exploit multiple dissimilarities simultaneously. Computational results show remarkable classification capabilities and the knowledge discovery analysis is in line with current biological knowledge, suggesting the reliability of the proposed system.


Introduction
Dealing with structured data is an evergreen challenge in pattern recognition and machine learning. Indeed, many real-world systems can effectively be described by structured domains such as networks (e.g., images [1,2]) or sequences (e.g., signatures [3]). Biology is a seminal field in which many complex systems can be described by networks [4], as the biologically relevant information resides in the interaction among constituting elements: common examples include protein contact networks [5,6], metabolic networks [7] and protein-protein interaction networks [8,9].
Pattern recognition in structured domains poses additional challenges as many structured domains are non-metric in nature (namely, the pairwise dissimilarities in such domains might not satisfy the four properties of a metric: non-negativity, symmetry, identity, triangle inequality) and patterns may lack any geometrical interpretation [10].
In order to deal with such domains, five mainstream approaches can be pursued [10]:
1. Feature generation and/or feature engineering, where numerical features are extracted ad hoc from structured patterns (e.g., using their properties or via measurements) and can be further merged according to different strategies (e.g., in a multi-modal way [11]);
2. Ad-hoc dissimilarities in the input space, where custom dissimilarity measures are designed in order to process structured patterns directly in the input domain without moving towards Euclidean (or metric) spaces. Common (possibly parametric) edit distances include the Levenshtein distance [12] for sequence domains and graph edit distances [13] for graph domains;
3.
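As a concrete instance of such an ad-hoc dissimilarity for sequence domains, the Levenshtein distance can be sketched with the standard dynamic-programming recurrence; the function name and the two-row memory optimisation are illustrative choices, not tied to any implementation cited above.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two sequences: minimum number of insertions,
    deletions and substitutions turning a into b."""
    # prev[j] holds the distance between the current prefix of a and b[:j]
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion of ca
                curr[j - 1] + 1,           # insertion of cb
                prev[j - 1] + (ca != cb),  # substitution (free on a match)
            ))
        prev = curr
    return prev[len(b)]
```

Note that this distance is a proper metric on strings, whereas many other ad-hoc dissimilarities for structured data are not, which is precisely what motivates the embedding strategies discussed next.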
This paper proposes a novel classification system based on a hybridisation of the latter two strategies: while dissimilarity representations describe (structured) patterns according to their pairwise dissimilarities, kernel methods encode pairwise similarities. Nonetheless, the class of properly-defined kernel functions is restricted: (conditional) positive definiteness may not hold in the case of non-metric (dis)similarities. The use of kernel methods in state-of-the-art (non-linear) classifiers such as Support Vector Machines (SVMs) [34,35] is strictly related to their (conditional) positive definiteness due to the quadratic programming optimisation involved: indeed, non-(conditionally) positive definite kernels do not guarantee convergence to the global optimum. Although there is some research on learning from indefinite kernels (see, e.g., [36][37][38][39][40]), often relying on matrix regularisation or other tricks to enforce positive definiteness, evaluating kernels on top of Euclidean spaces (e.g., dissimilarity spaces) retains (conditional) positive definiteness. The proposed classification system is able to simultaneously explore multiple dissimilarities following a multiple kernel learning approach, where each kernel considers a different (dissimilarity) representation. The relative importance of the several kernels involved is automatically determined via genetic optimisation in order to maximise the classifier performance. Further, the very same genetic optimisation is in charge of determining a suitable subset of representative (prototype) patterns in the dissimilarity space [27] in order to shrink the modelling complexity. Hence, the proposed system allows a two-fold a posteriori knowledge discovery phase:
1. By analysing the kernel weights, one can determine the most suitable representation(s) for the problem at hand;
2. The patterns elected as representatives for the dissimilarity space (hence determined as pivotal for tracing the decision boundary amongst the problem-related classes) can give some further insights into the problem at hand.
In order to validate the proposed classification system, a bioinformatics-related application is considered, namely protein function prediction. Proteins' 3D structure (both tertiary and quaternary) can effectively be modelled by a network, namely the so-called Protein Contact Network (PCN) [5]. A PCN is a minimalistic (unweighted and undirected) graph-based protein representation where nodes correspond to amino-acids and an edge between two nodes exists if the Euclidean distance between the residues' α-carbon atom coordinates lies within [4, 8] Å. The lower bound is defined in order to discard trivial connections due to closeness along the backbone (first-order neighbour contacts), whereas the upper bound is defined by considering the peptide bond geometry (indeed, 8 Å roughly corresponds to two van der Waals radii between residues' α-carbon atoms [41]). It is worth stressing that both node labels (i.e., the type of amino-acid) and edge labels (i.e., the distance between neighbouring residues) are deliberately discarded in order to focus only on the proteins' topological configuration. Despite the minimalistic representation, PCNs have been successfully used in pattern recognition problems for tasks such as solubility prediction/folding propensity [42,43] and physiological role prediction [44][45][46]; furthermore, their structural and dynamical properties have been extensively studied in works such as [47][48][49][50].
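The PCN construction described above can be sketched as follows, starting from an array of α-carbon coordinates; the NumPy-based formulation and the function name are illustrative, not taken from the paper's codebase.

```python
import numpy as np

def pcn_adjacency(ca_coords: np.ndarray, lo: float = 4.0, hi: float = 8.0) -> np.ndarray:
    """Build the (unweighted, undirected) PCN adjacency matrix from an
    N x 3 array of alpha-carbon coordinates (in Angstrom).

    An edge (i, j) exists iff lo <= ||c_i - c_j|| <= hi: the lower bound
    discards trivial contacts between residues adjacent along the backbone,
    the upper bound follows from the peptide bond geometry.
    """
    diff = ca_coords[:, None, :] - ca_coords[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)   # pairwise Euclidean distances
    adj = (dist >= lo) & (dist <= hi)      # keep the [4, 8] Angstrom band
    np.fill_diagonal(adj, False)           # no self-loops
    return adj.astype(np.uint8)
```

By construction the matrix is binary and symmetric, matching the unweighted and undirected nature of the PCN.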
In order to investigate how the protein function is related to its topological structure, a subset of the entire Escherichia coli bacterium proteome, corresponding to E. coli proteins whose 3D structure is known, is considered. The problem itself is cast into a supervised pattern recognition task, where each pattern (protein) is described according to eight different representations drawn from its PCN, and its respective Enzyme Commission (EC) number [51] serves as the ground-truth class label. The EC nomenclature scheme classifies enzymes according to the chemical reaction they catalyse and a generic entry is composed of four numbers separated by periods. The first digit (1-6) indicates one of the six major enzymatic groups (EC 1: oxidoreductases; EC 2: transferases; EC 3: hydrolases; EC 4: lyases; EC 5: isomerases; EC 6: ligases) and the latter three numbers represent a progressively finer functional enzyme classification. In this work, only the first number is considered. However, proteins with no enzymatic characteristics (or proteins for which enzymatic characteristics are still unknown) are not provided with an EC number, thus an additional class of not-enzymes will be considered, identified by the categorical label 7. It is worth noting that the EC classification only loosely relates to the global protein 3D configuration, given that structure is affected by many determinants other than the catalysed reaction, such as solubility, localisation in the cell, interaction with other proteins and so forth. This makes the classification task intrinsically very difficult.
This paper is organised as follows: Section 2 overviews some theory related to kernel methods and dissimilarity spaces; Section 3 presents the proposed methodology; Section 4 shows the results obtained with the proposed approach, along with a comparison against a clustering-based classifier (also able to explore multiple dissimilarities), and we also provide some remarks on the two-fold knowledge discovery phase. Finally, Section 5 concludes the paper. The paper also features two appendices: Appendix A describes in detail the several representations used for describing PCNs, whereas Appendix B lists the proteins selected as prototypes for the dissimilarity representations.

Theoretical Background
Let D = {x_1, ..., x_{N_P}} be the dataset at hand, lying in a given input space X. Moving the problem towards a dissimilarity space [26] consists in expressing each pattern from D according to its pairwise distances with respect to all other patterns, including itself. In other words, the dataset is cast into the pairwise distance matrix D ∈ R^{N_P × N_P} defined as:

D_{ij} = d(x_i, x_j), for i, j = 1, ..., N_P    (1)

where d(·, ·) is a suitable dissimilarity measure on D, that is d : D × D → R. Without loss of generality, hereinafter let us consider D to be symmetric: if d(·, ·) is at least symmetric, D is trivially symmetric; in case of asymmetric dissimilarity measures, D can be 'forced' to be symmetric, e.g., D := (1/2)(D + D^T). The major advantage of moving the problem from a generic input space X towards R^{N_P × N_P} is that the latter can be equipped with algebraic structures such as the inner product or the Minkowski distance, whereas the former might not be metric at all. As such, in the latter, standard computational intelligence and machine learning techniques can be used without alterations [10]. On the negative side, the explicit evaluation of D can be computationally expensive, as it leads to a time and space complexity of O(N_P^2). To this end, in [27], a 'reduced' dissimilarity space representation is proposed, where a subset of prototype patterns R ⊂ D is properly chosen and each pattern is described according to its pairwise distances with respect to the prototypes only. This leads to the definition of a 'reduced' pairwise distance matrix D̃ ∈ R^{N_P × |R|} defined as:

D̃_{ij} = d(x_i, r_j), with r_j ∈ R    (2)

Since usually |R| < |D|, there is no need to solve a quadratic-complexity problem such as evaluating Equation (1). On the negative side, however, the selection of the subset R is a delicate and challenging task [10] since:

1. The representatives must well characterise the decision boundary between patterns in the input space;
2. The fewer, the better: the number of representatives has a major impact on the model complexity (cf. Equation (1) vs. Equation (2)).

Several heuristics have been proposed in the literature, ranging from clustering the input space to (possibly class-aware) random selection [10,27,52].
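The (reduced) dissimilarity representation of Equation (2) can be sketched as follows; the function name is illustrative and the dissimilarity d is left generic, since the paper deliberately abstracts over the measure used in each representation. Passing the whole dataset as the prototype set recovers the full matrix of Equation (1).

```python
import numpy as np

def dissimilarity_representation(data, prototypes, d):
    """Embed each pattern as its vector of dissimilarities to the prototypes:
    row i holds [d(x_i, r_1), ..., d(x_i, r_|R|)]."""
    return np.array([[d(x, r) for r in prototypes] for x in data])
```

For |R| prototypes out of N_P patterns the cost drops from O(N_P^2) to O(N_P · |R|) dissimilarity evaluations, which is the computational motivation for prototype selection.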
Kernel methods are usually employed when the input space has an underlying Euclidean geometry. Indeed, the simplest kernel (namely, the linear kernel [30,53]) is the plain inner product between real-valued vectors, whose kernel matrix K (also known as the Gram matrix) can easily be defined as:

K_{ij} = ⟨x_i, x_j⟩    (3)

More generally, let K be a symmetric and positive semi-definite kernel function from the input space X towards R, that is K : X × X → R such that

Σ_{i=1}^{N} Σ_{j=1}^{N} c_i c_j K(x_i, x_j) ≥ 0 for any N > 0, x_1, ..., x_N ∈ X and c_1, ..., c_N ∈ R    (4)

As in the linear kernel case, starting from pairwise kernel evaluations, one can easily evaluate the kernel matrix as

K_{ij} = K(x_i, x_j)    (5)

and if K is a positive semi-definite kernel matrix, then K is a positive semi-definite kernel function. One of the most intriguing properties of kernel methods relies on the so-called kernel trick [29,30]: kernels of the form of Equations (4) and (5) are also known as Mercer's kernels, as they satisfy the Mercer condition [32]. Such kernel functions can be seen as the inner product evaluated in a high-dimensional (possibly infinite-dimensional) and usually unknown Hilbert space H. The kernel trick is usually described by the following, seminal, equation:

K(x_i, x_j) = ⟨φ(x_i), φ(x_j)⟩_H    (6)

where φ : X → H is the implicit (and usually unknown) mapping function. The need for a non-linear and higher-dimensional mapping is a direct consequence of Cover's theorem [33]. Thanks to the kernel trick, one can use one of the many kernel functions available (e.g., polynomial, Gaussian, radial basis function) in order to perform such non-linear and higher-dimensional mapping without knowing and without explicitly evaluating the mapping function φ(·). Further, kernel methods can be used in many state-of-the-art classifiers such as the (kernelised) SVM [35,54]. In multiple kernel learning, the kernel matrix K is defined as a properly-defined combination of a given number N_K of kernels. The most intuitive combination is a linear combination of the form:

K = Σ_{i=1}^{N_K} β_i K^(i)    (8)

where the sub-kernels K^(i) are single Mercer's kernels.
The weights β_i can be learned according to different strategies and can be constrained in several ways; see, e.g., [55][56][57][58][59][60][61], or the survey [62]. The rationale behind using multiple kernel learning instead of plain single kernel learning depends on the application: for example, if data come from different sources, one might want to explore such different sources according to several kernels or, dually, one might want to explore the same data using different kernels, which may differ in shape and/or type. In this work, a mixture of the two approaches is pursued: same source (PCN), but different representations (see Appendix A). Further, a linear convex combination of radial basis function kernels is employed. The ith radial basis function kernel is defined as

K^(i)(x, y) = exp(−γ_i ‖x − y‖^2)    (9)

where γ_i is its shape parameter. Further, the weights β_i are constrained as

β_i ≥ 0, for i = 1, ..., N_K    (10)

Σ_{i=1}^{N_K} β_i = 1    (11)

It is rather easy to demonstrate that these choices for both kernels and weights lead to a final kernel matrix (as in Equation (8)) which is still a valid Mercer's kernel, therefore usable in kernelised SVMs. Indeed, Cristianini and Shawe-Taylor in [31] showed that the sum of two valid kernels is still a valid kernel. Further, Horn and Johnson in [63] showed that a positive semi-definite matrix multiplied by a non-negative scalar is still a positive semi-definite matrix. Combining these two results proves that kernels of the form (8) and (9) with constraints (10) and (11) are valid kernels.
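A minimal sketch of the convex combination of RBF kernels in Equations (8)-(11); the vectorised NumPy formulation and the function name are illustrative assumptions. The positive semi-definiteness argued above can be checked numerically on the resulting matrix.

```python
import numpy as np

def multiple_rbf_kernel(X_list, betas, gammas):
    """Convex combination of RBF kernels, one per representation in X_list
    (each an n x d_i matrix over the same n patterns).

    betas must be non-negative and sum to one; gammas are the shape
    parameters of the individual RBF kernels.
    """
    assert all(b >= 0 for b in betas) and abs(sum(betas) - 1.0) < 1e-9
    n = X_list[0].shape[0]
    K = np.zeros((n, n))
    for X, beta, gamma in zip(X_list, betas, gammas):
        sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
        K += beta * np.exp(-gamma * sq)   # each summand is a Mercer kernel
    return K
```

Since each RBF Gram matrix has unit diagonal and the weights form a convex combination, the combined matrix also has unit diagonal and remains symmetric positive semi-definite.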

Proposed Methodology
Let D be the dataset at hand, split into three non-overlapping subsets D_TR, D_VAL and D_TS (namely, training set, validation set and test set). Especially for structured data, several representations (e.g., sets of descriptors) might hold for the same data; therefore, let {X^(1), ..., X^(N_R)} be the set of N_R representations, split in the same fashion (i.e., X^(i)_TR, X^(i)_VAL and X^(i)_TS for each representation i). Finally, let {d^(1)(·, ·), ..., d^(N_R)(·, ·)} be the set of dissimilarity measures suitable for working in their respective representations.
The respective training, validation and test pairwise dissimilarity matrices, as in Equation (1), can be evaluated as follows:

D^(i)_TR = [d^(i)(x, y)] for x, y ∈ X^(i)_TR;  D^(i)_VAL = [d^(i)(x, y)] for x ∈ X^(i)_VAL, y ∈ X^(i)_TR;  D^(i)_TS = [d^(i)(x, y)] for x ∈ X^(i)_TS, y ∈ X^(i)_TR;  for i = 1, ..., N_R    (12)

Let w ∈ {0, 1}^{|D_TR|} be a binary vector in charge of selecting columns from all matrices in Equation (12): the full pairwise dissimilarity matrices can be sliced to their 'reduced' versions (cf. Equation (1) vs. Equation (2)), hence:

D̃^(i)_TR = D^(i)_TR(:, w);  D̃^(i)_VAL = D^(i)_VAL(:, w);  D̃^(i)_TS = D^(i)_TS(:, w);  for i = 1, ..., N_R    (13)

where, due to the number of subscripts and superscripts in Equation (13), for ease of notation, we used a MATLAB-like notation for indexing matrices. In other words, w acts as a feature (prototype) selector. Given this newly obtained dataset, it is possible to train a kernelised ν-SVM [64] whose multiple kernel has the form of Equation (8), where each sub-kernel has the form of Equation (9), thus:

K(x_i, x_j) = Σ_{k=1}^{N_R} β_k exp(−γ_k ‖Δ^(k)_{ij}‖^2)    (14)

where Δ^(k)_{ij} = D̃^(k)(i, :) − D̃^(k)(j, :) denotes the pairwise difference between the (reduced) dissimilarity vectors of x_i and x_j in the kth representation. Hence, each dissimilarity representation is subject to a proper non-linear kernel (N_K ≡ N_R). A genetic algorithm [65] acts as a wrapper method in order to automatically tune, in a fully data-driven fashion, the several free parameters introduced in this problem. The choice of a genetic algorithm stems from their popularity in the context of derivative-free optimisation, from their being embarrassingly easy to parallelise and, finally, for the sake of consistency with competing techniques (see Section 4.4). For our problem, the genetic code has the form:

[ν, β_1, ..., β_{N_R}, γ_1, ..., γ_{N_R}, w]    (15)

where ν ∈ (0, 1] is the SVM regularisation term, β and γ contain the kernel weights and shapes, and w properly selects prototypes in the dissimilarity space, as described above. For the sake of argument, it is worth remarking that there have been several attempts to use evolutionary strategies in order to tune multiple kernel machines: for example, in [66] a genetic algorithm has been used in order to tune the kernel shapes (namely, γ), whereas in [67] both the kernel shapes and the kernel weights have been tuned by means of a (µ + λ) evolution strategy [68]. Conversely, the idea of using a genetic algorithm for prototype selection in the dissimilarity space has been inherited from a previous work [44].
The fitness function to be maximised is the informedness J (also known as Youden's index [69]), defined as:

J = sensitivity + specificity − 1    (16)

which is, by definition, bounded in the range [−1, 1] (the closer to 1, the better). For the sake of comparison with other performance measures (e.g., accuracy, F-score and the like), which are, by definition, bounded in [0, 1], the fitness function sees a scaled version of the informedness [23][24][25], hence:

f_1(·) = (J + 1) / 2    (17)

The rationale behind using the informedness rather than other more common performance measures (mainly accuracy and F-score) is that the informedness is well suited for unbalanced classes, without being biased towards the most frequent class (the same is not true for accuracy) and whilst also considering true negative predictions (the same is not true for F-score) [70].
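The informedness can be computed directly from predictions, as in the following plain-Python sketch (the function name is illustrative):

```python
def informedness(y_true, y_pred, positive=1):
    """Youden's J = sensitivity + specificity - 1, bounded in [-1, 1];
    rescaling via (J + 1) / 2 maps it to [0, 1]."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p != positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return sensitivity + specificity - 1.0
```

Note that a classifier always predicting the majority class scores J = 0 regardless of class imbalance, which is exactly why this index is preferred to accuracy here.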
By assuming that the full dissimilarity matrices are pre-evaluated beforehand, the objective function evaluation is performed for each individual of the current generation as follows:

1. The individual receives the N_R full dissimilarity matrices between training data samples, i.e., D^(i)_TR as in Equation (12);
2. According to the w portion of its genetic code (see Equation (15)), a subset of prototypes is selected, leading to the 'reduced' dissimilarity matrices between training data, i.e., D̃^(i)_TR as in Equation (13);
3. Considering the β and γ values in its genetic code, the (multiple) kernel matrix is evaluated by using Equation (14);
4. A ν-SVM is trained using the regularisation term ν from the genetic code and the kernel matrix from step #3;
5. The individual receives the N_R full dissimilarity matrices between training and validation data, each of which is computed by considering all possible (x, y)-pairs where x belongs to the validation set and y belongs to the training set, i.e., D^(i)_VAL as in Equation (12);
6. The 'reduced' dissimilarity matrices are projected thanks to w, i.e., D̃^(i)_VAL as in Equation (13);
7. The (multiple) kernel matrix between training and validation data is evaluated thanks to β and γ, as in Equation (14);
8. The (multiple) kernel matrix from step #7 is fed to the SVM trained in step #4 and the predicted classes on the validation set are returned;
9. The fitness function is evaluated.
At the end of the evolution, the best individual (i.e., the one with the best performance on the validation set) is retained and its final performance is evaluated on the test set.
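Steps #4 and #8 of the procedure above can be sketched with a precomputed-kernel ν-SVM; here scikit-learn's `NuSVC` stands in for the LibSVM/MATLAB setup actually used in the experiments, and both helper names are illustrative.

```python
import numpy as np
from sklearn.svm import NuSVC

def rbf_gram(A, B, gamma=1.0):
    """Gram matrix of an RBF kernel between the rows of A (n_a x d) and B (n_b x d)."""
    sq = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
    return np.exp(-gamma * sq)

def train_and_validate(K_tr, y_tr, K_val, nu=0.5):
    """Fit a nu-SVM on the precomputed kernel between training patterns,
    then predict validation patterns from their kernel evaluations against
    the training set (K_val is n_val x n_tr)."""
    clf = NuSVC(nu=nu, kernel='precomputed')
    clf.fit(K_tr, y_tr)
    return clf.predict(K_val)
```

With a precomputed kernel, the cross-kernel between validation (or test) patterns and the training patterns is all the classifier needs at prediction time, which is what steps #5-#8 compute.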
Finally, it is worth remarking on the rationale behind the proposed, structured, genetic code, since a genetic code of the form of Equation (15) allows, in a two-fold manner, a deeper a posteriori knowledge discovery phase. Indeed, assuming good classification results upfront (for the sake of reliability), by looking at β it is possible to check which kernels (representations) are considered the most important (higher weights) by the learning machine in order to solve the problem at hand. Similarly, by looking at w, it is possible to check which training set patterns have been selected as representatives and ask why those patterns have been selected instead of others, leading to a pattern-wise check (possibly with the help of field-experts). Especially the latter a posteriori check might be troublesome if a huge number of representatives is selected. In order to alleviate this problem (if present), it is possible to re-state the fitness function (formerly (17)) by considering a convex linear combination between the performance index and the feature selector sparsity, hence:

f_2(·) = (1 − ω) · (1 − f_1(·)) + ω · ‖w‖_0 / |D_TR|    (18)

where ω ∈ [0, 1] is a user-defined parameter which tunes the convex linear combination by weighting the rightmost term (sparsity) against the leftmost term (performance). It is worth noting that whilst fitness (17) should be maximised, (18) should be minimised.
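A sketch of such a sparsity-aware fitness, assuming the convex combination pairs the complemented performance index with the fraction of selected prototypes (the exact published form of Equation (18) may differ in this parametrisation):

```python
def f2(perf, w, omega=0.5):
    """Sparsity-aware fitness to be minimised: convex combination of the
    complemented performance index perf (in [0, 1], higher is better) and
    the fraction of prototypes selected by the binary vector w."""
    sparsity = sum(w) / len(w)                      # fraction of selected prototypes
    return (1.0 - omega) * (1.0 - perf) + omega * sparsity
```

With omega = 0.5 a perfect classifier using no prototypes scores 0, the ideal (if unattainable) minimum, while a useless classifier using every prototype scores 1.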

Data Collection and Pre-Processing
The data retrieval process can be summarised as follows. Using the Python BioServices library [71]:
1. the entire protein list for Escherichia coli str. K12 has been retrieved from UniProt [72];
2. this list has been cross-checked against the Protein Data Bank [73] in order to discard unresolved proteins (i.e., proteins whose 3D structure is not available).
Then:
1. .pdb files have been downloaded for all resolved proteins;
2. information such as the EC number and the measurement resolution (if present) has been parsed from the .pdb file headers;
3. proteins having multiple EC numbers have been discarded.
Moreover, in case of multiple equivalent models within the same .pdb file, only the first model is retained; similarly, for atoms having alternate coordinate locations, only the first location is retained.
After this retrieval stage, a total of 6685 proteins have been successfully collected. Some statistics on the measurement resolutions and the number of nodes are sketched in Figure 1a,b, respectively. In order to keep only good quality structures (with reliable atomic coordinates), all proteins with missing resolution in their respective .pdb files and proteins whose resolution is greater than 3 Å have been discarded. Further, proteins having more than 1500 nodes have been discarded as well. These filtering procedures dropped the number of available proteins from 6685 to 4957. The class label (EC number) distribution is summarised in Table 1. For each of the 4957 available proteins, its respective eight representations (see Appendix A) have been evaluated using the following tools: • The NetworkX library [76] (Python) for evaluating centrality measures (X^(2)) and the Vietoris-Rips complex (X^(1)); • The NumPy and SciPy libraries [77,78] (Python) for several algebraic computations, mainly spectral decompositions for energy, Laplacian energy, heat trace, heat content invariants (X^(3), X^(5), X^(6), X^(8)) and the homology group rank (X^(1)); • The Rnetcarto (https://cran.r-project.org/package=rnetcarto) library (R) for network cartography (X^(4)).
As in previous works [45,46], the 7-class classification problem is cast into seven binary classification problems in a one-against-all fashion, hence the ith classifier sees the ith class as positive and all other classes as negative. The eight representations X^(1), ..., X^(8) are split into training, validation and test sets in a stratified manner in order to preserve the label distribution across splits. Thus, each of the seven classifiers sees a different training-validation-test split due to the one-against-all label recoding. The genetic optimisation and classification stage has been performed in MATLAB R2018a using the built-in genetic algorithm and LibSVM [79] for ν-SVMs.
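The one-against-all recoding with stratified splits can be sketched as follows; the 60/20/20 proportions, the scikit-learn splitter and the function name are illustrative assumptions, not the paper's exact setup.

```python
from sklearn.model_selection import train_test_split

def one_vs_all_split(X, ec_labels, positive_class, seed=0):
    """Recode the 7-class EC labels in a one-against-all fashion, then split
    into training/validation/test (60/20/20 here) while preserving the
    binary label distribution via stratification."""
    y = [1 if ec == positive_class else 0 for ec in ec_labels]
    X_tr, X_tmp, y_tr, y_tmp = train_test_split(
        X, y, test_size=0.4, stratify=y, random_state=seed)
    X_val, X_ts, y_val, y_ts = train_test_split(
        X_tmp, y_tmp, test_size=0.5, stratify=y_tmp, random_state=seed)
    return (X_tr, y_tr), (X_val, y_val), (X_ts, y_ts)
```

Since each of the seven classifiers recodes the labels differently, stratifying on the recoded binary labels yields a different split per classifier, as noted above.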

Computational Results with Fitness Function f 1
The first test suite sees f 1 (17) as the fitness function, hence the system aims at the maximisation of the (normalised) informedness.
The genetic algorithm has been configured to host 100 individuals for a maximum of 100 generations; each individual's genetic code (upper/lower bounds and constraints, if any) is summarised in Table 2. At each generation, the elitism is set to the top 10% of individuals; the crossover operates in a scattered fashion; the selection operator follows the roulette wheel heuristic; the mutation adds to each real-valued gene (ν, β, γ) a random number drawn from a zero-mean Gaussian distribution whose variance shrinks as generations go by, whereas it acts in a flip-the-bit fashion for Boolean-valued genes (w).

Table 2. Genetic algorithm parameters description.

Table 3 shows the performances obtained by the proposed Multiple Kernels over Multiple Dissimilarities (MKMD, for short) approach using the fitness function f_1. Due to the randomness in genetic optimisation, five runs have been performed for each classifier and the average results are shown. Figures of merit include:

• (Normalised) informedness, as in Equation (17);
• Accuracy, precision, recall and AUC, defined in terms of TP, TN, FP and FN, which indicate true positives, true negatives, false positives and false negatives, respectively.
Similarly, Figure 2 shows the ROC curves of all classifiers by considering their respective run with the greatest AUC.

Computational Results with Fitness Function f 2
These experiments see the fitness function f 2 (Equation (18)) in lieu of f 1 (Equation (17)), where the weighting parameter ω is set to 0.5 in order to give the same importance to performances and sparsity. In order to ensure a fair comparison with the previous analysis, the same training-validation-test splits have been used for all seven classifiers, along with the same genetic algorithm setup (genetic code, number of individuals and generations, genetic operators). Table 4 shows the average performances obtained by the seven classifiers across five genetic algorithm runs. As in the previous case, Figure 3 shows the ROC curves for all classifiers by considering their respective run with greatest AUC.

Benchmarking against a Clustering-Based One-Class Classifier
In order to properly benchmark the proposed MKMD system, a One-Class Classification System (hereinafter OCC or OCC_System) capable of exploiting multiple dissimilarities is used. This classification system was initially proposed in [81] and later used for modelling complex systems such as smart grids [81][82][83] and protein networks [44].
The main idea in order to build a model through the One-Class Classifier is to use a clustering-evolutionary hybrid technique [81,82]. The main assumption is that similar protein types have similar chances of generating a specific class, reflecting the cluster model. Therefore, the core of the recognition system is a custom dissimilarity measure computed as a weighted Euclidean distance, that is:

d(x̌_1, x̌_2; W) = sqrt( (x̌_1 ⊖ x̌_2)^T W (x̌_1 ⊖ x̌_2) )    (19)

where x̌_1, x̌_2 are two generic patterns and W is a diagonal matrix whose elements are generated through a suitable vector of weights w. The dissimilarity measure is component-wise; therefore, the symbol ⊖ represents a generic dissimilarity measure, tailored to each pattern subspace, that has to be specified depending on the semantics of the data at hand. In this study, patterns are represented by dissimilarity vectors extracted from each sub-dissimilarity matrix, one for each feature adopted to describe the protein (see Section 2). In other words, patterns pertain to a suitable dissimilarity space.
The decision region of each cluster C_i is constructed around the medoid c_i, bounded by the average radius δ(C_i) plus a threshold σ, considered together with the dissimilarity weights w = diag(W) as free parameters. Given a test pattern x̌, the decision rule consists in evaluating whether it falls inside or outside the overall target decision region, by checking whether it falls inside the closest cluster. The learning procedure consists in clustering the training set D_TR, composed of target patterns, adopting a standard genetic algorithm in charge of evolving a family of cluster-based classifiers, considering the weights w and the thresholds of the decision regions as the search space, guided by a proper objective function. The latter is evaluated on the validation set D_VAL, taking into account a linear combination of the classification accuracy (which we seek to maximise) and the extension of the thresholds (which should be minimised). Note that in building the classification model only target patterns are used, while non-target ones are used in the cross-validation phase; hence, the adopted learning paradigm is One-Class classification [84,85]. Moreover, in order to overcome the well-known initialisation limitations of the standard k-means algorithm, the OCC_System initialises more than one instance of the clustering algorithm with random starting representatives, namely medoids, since the OCC_System is capable of dealing with arbitrarily structured data [86][87][88]. At test stage (or during validation), a voting procedure over the cluster models is performed. This technique allows building a more robust protein model. Figure 4 shows the schematic of the core subsystems of the proposed OCC_System, such as the ones performing the clustering procedure and the genetic algorithm.
Moreover, the Test subsystem is shown: given a generic test pattern and a learned model, it is possible to associate a score value (soft decision) besides the Boolean decision. Hence, we equip each cluster C_i with a suitable membership function, denoted in the following as µ_{C_i}(·). In practice, we generate a fuzzy set [89] over C_i. The membership function allows quantifying the uncertainty (expressed by the membership degree in [0, 1]) of a decision about the recognition of a test pattern. Membership values close to either 0 or 1 denote "certain", hence reliable, decisions. When the membership degree assigned to a test pattern is close to 0.5, there is no clear indication of whether such a test pattern is really a target pattern or not (regardless of the correctness of the Boolean decision). For this purpose, we adopt a parametric sigmoid model for µ_{C_i}(·), which is defined as follows:

µ_{C_i}(x̌) = 1 / (1 + exp((d(x̌, c_i) − b_i) / a_i))

where a_i, b_i ≥ 0 are two parameters specific to C_i and d(·, ·) is the dissimilarity measure (19). Notably, a_i is used to control the steepness of the sigmoid (the lower the value, the faster the rate of change), and b_i is used to translate the function in the input domain. If a cluster (which models a typical protein found in the training set) is very compact, then it describes a very specific scenario: therefore, no significant variations should be accepted when considering test patterns as members of this cluster. Similarly, if a cluster is characterised by a wide extent, then we might be more tolerant in the evaluation of the membership. Accordingly, the parameter a_i is set equal to δ(C_i). On the other hand, we define b_i = δ(C_i) + σ_i/2. This allows us to position the part of the sigmoid that changes faster right in-between the area of the decision region determined by dissimilarity values falling in [δ(C_i), B(C_i)], where in turn B(C_i) = δ(C_i) + σ_i is the boundary of the decision region related to the ith cluster.
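The parametric sigmoid membership described above can be sketched as follows; this is a paraphrase under the stated choices a_i = δ(C_i) and b_i = δ(C_i) + σ_i/2, and the exact published parametrisation may differ.

```python
import math

def membership(dist, delta, sigma):
    """Sigmoid membership of a test pattern lying at dissimilarity `dist`
    from the medoid of a cluster with average radius `delta` and
    threshold `sigma`.

    a = delta controls the steepness (smaller -> steeper) and
    b = delta + sigma / 2 centres the fast-changing part of the sigmoid
    in the middle of the decision band [delta, delta + sigma].
    """
    a = delta
    b = delta + sigma / 2.0
    return 1.0 / (1.0 + math.exp((dist - b) / a))
```

By construction the membership equals 0.5 exactly at the centre of the decision band, approaches 1 for patterns close to the medoid and decays towards 0 well outside the boundary.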
Finally, the soft decision function s(·) is defined as

s(x̌) = µ_{C*}(x̌)

where C* is the cluster in which the test (target) pattern falls. To summarise, the OCC_System works in two phases:
1. Learning a cluster model of proteins through a suitable dataset divided into two disjoint sets, namely the training and validation sets;
2. Using the learned model in order to recognise or classify unseen proteins drawn from the test set, assigning to each pattern a probability value.
The OCC parameters defining the model are optimised by means of a genetic algorithm guided by a suitable objective function that takes into account the classification accuracy. For the sake of comparison, the same genetic operators (selection, mutation, crossover, elitism) as for the MKMD system have been considered (see Section 4.2). As concerns the complexity of the model, measured as the cardinality of the partition k, a suitable value k = 120 has been chosen. Table 5 shows the comparison between the OCC_System and the MKMD approach. In order to ensure a fair comparison, since the OCC_System does not perform representative selection in the dissimilarity space, the weights vector w has been removed from the MKMD genetic code (cf. Equation (15)) and all weights have been considered unitary (i.e., no representative selection). Similarly, Figures 5b and 5a show the ROC curves for OCC and MKMD, respectively. From Table 5 it is evident that MKMD outperforms OCC in terms of accuracy, informedness and AUC (see also the ROC curves in Figures 5b and 5a), but a clear winner does not exist as regards precision and recall. As regards the structural complexity, OCC is bounded by the number of clusters k, whereas MKMD is bounded by the number of support vectors returned by the training phase [24]. Indeed, the computational burden required to classify new test data is given by:
• The pairwise distances between the test data and the k cluster centres (for OCC);
• The dot product between the test data and the support vectors (for MKMD).
Specifically, for OCC, a suitable number of 120 clusters has been defined for all classes, whereas the training phase for MKMD returned an average of 1300 support vectors (∼52% of the training data) for class 1, 1881 support vectors (∼76%) for class 2, 1745 support vectors (∼70%) for class 3, 1213 support vectors (∼49%) for class 4, 767 support vectors (∼31%) for class 5, 864 support vectors (∼35%) for class 6 and 1945 support vectors (∼78%) for class 7. In conclusion, whilst MKMD outperforms OCC in terms of performance, the latter outperforms the former in terms of structural complexity.

Comparing against Previous Works
Table 6 reports the performances (in terms of AUC only, for brevity) of the proposed MKMD approach with fitness function f1 (Table 3), with fitness function f2 (Table 4) and with no representatives selection in the embedding space (Table 5) against our previous studies addressing the same classification problem. For the sake of completeness, the results obtained by OCC (Table 5) are also included. In [44], two experiments were performed: the first relied on the Dissimilarity Matrix Embedding (DME) by considering different protein representations (similar to the ones considered in this work) and the second relied on OCC, which is able to explore those different representations simultaneously (as in this work). There are three main differences between this work and [44]: first, the set of representations is different; second, in [44] we only addressed the binary classification problem between enzymes and non-enzymes; third, the set of considered proteins is different. In fact, in [44], we performed an additional filtering stage in order to select (for the same UniProt ID) only the PDB entry with the best resolution: we found that this heavily limits the number of protein samples available, possibly reducing the learning capabilities.
In [45,46] we used the sampled spectral density of the protein contact networks (more information can be found in Appendix A.8) and the Betti numbers (more information can be found in Appendix A.1), respectively: the results in Table 6 feature the same proteins set used in this work. Indeed, thanks to the observation above, experiments have been repeated with an augmented number of protein samples [90,91].
Results in Table 6 highlight that:
1. Avoiding filtering out PDB structures by considering only the best resolution for a given UniProt ID (as carried out also in this work) helps in improving classification models: indeed, the performances from [44] are amongst the lowest;
2. The proposed MKMD approach, regardless of the fitness function and/or representatives selection, outperforms all competitors for all EC classes (including non-enzymes).

On the Knowledge Discovery Phase
Apart from the good generalisation capabilities, it is worth remarking that an interesting aspect of the proposed multiple kernel approach is the two-fold knowledge discovery phase:

1.
By analysing the kernel weights β, it is possible to determine the most important representations for the problem at hand;

2.
By analysing w, namely the binary vector in charge of selecting prototypes from the dissimilarity space, it is possible to determine and analyse the patterns (proteins, in this case) elected as prototypes.
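The mechanism behind these two knowledge discovery handles, namely a dissimilarity-space embedding per representation followed by a weighted sum of kernels, can be sketched as follows. This is a minimal illustration only: the Gaussian kernel choice, the function names and the gamma parameter are our assumptions, not the paper's exact formulation.

```python
import numpy as np

def dissimilarity_embedding(D, prototype_idx):
    """Embed patterns as vectors of dissimilarities to the selected
    prototypes (the columns indexed by the binary selector w)."""
    return D[:, prototype_idx]

def combined_kernel(embeddings, betas, gamma=1.0):
    """Weighted sum of Gaussian kernels, one per dissimilarity space:
    K = sum_m beta_m * K_m, with K_m built on the m-th embedding."""
    n = embeddings[0].shape[0]
    K = np.zeros((n, n))
    for X, beta in zip(embeddings, betas):
        # squared Euclidean distances in the dissimilarity space
        sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
        K += beta * np.exp(-gamma * sq)
    return K
```

In this sketch, analysing `betas` after optimisation plays the role of the kernel-weight analysis, while `prototype_idx` corresponds to the patterns elected as prototypes via w.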
Let us start our discussion from the latter point. From a chemical viewpoint, proteins are linear hetero-polymers in the form of non-periodic sequences of 20 different monomers (amino-acid residues). While artificial (periodic) polymers are very large extended molecules forming a matrix, the majority of proteins fold as self-contained water-soluble structures. Thus, we can consider the particular linear arrangement of amino-acid residues as a sort of 'recipe' for making a water-soluble polymer with a well-defined three-dimensional architecture [92]. "Well-defined three-dimensional structure" should not be interpreted as a 'fixed architecture': many proteins appear as partially or even totally disordered when analysed with spectroscopic methods. This apparent disorder corresponds to an efficient organisation in terms of the protein's physiological role, giving the molecule the possibility to adapt to rapidly changing microenvironment conditions [93].
This implies that the two main drivers of the 3D arrangement of amino-acid residues (from which the particular properties of the relative contact networks derive) are:

1. To efficiently accomplish the task of being water-soluble while maintaining a stable structure (or dynamics);
2. To allow for an efficient spreading of the signal across the amino-acid residues contact network, so as to sense relevant microenvironment changes and to reshape accordingly (allosteric effect, see [94]).
Currently, we have only a coarse-grained knowledge of such complex tasks, and biochemists are still very far from being able to reproduce this behaviour with synthetic constructs.
The ability to catalyse a specific class of chemical reactions (the property the EC classification is based upon), while being crucial for the biological role of protein molecules, is, from the point of view of the topological and geometrical structure of proteins, only a very minor modulation of their global shape [92]. Notwithstanding this, a thorough analysis of the representative proteins (which are pivotal for discrimination) can give us some general hints, not confined to the specific classification task but extending to all 'hard' classification problems based upon very tiny details of the statistical units.
Looking at the representative proteins (hence, endowed with meaningful discriminative power) in Tables A1-A7 (Appendix B), we immediately note that the pivotal proteins come from all the analysed EC categories and not only from the specific class to be discriminated. This is expected given the absence of a simple form-function relation; hence, they can be considered as an 'emergent property' of the discrimination task. The presence of molecules from different classes being crucial for modelling a specific category can be read in light of a peculiar strategy adopted by the system, analogous to the use of 'paired samples' in statistical investigation [95,96]. When only minor details discriminate statistical units pertaining to different categories, the only possibility is to adopt a paired-samples strategy in which each element of a category is paired with a very similar example of another category, so as to rely on their differences (on a sample-by-sample basis) instead of looking for general 'class-specific' properties. This is the case of proteins, whose general shape is only partially determined by the chemical reaction they catalyse: looking at the 3D structures of relevant proteins, we can easily verify they pertain to three basic patterns (Figure 6):
2. A globular pattern with 'duplication': the protein can be considered as two identical half-structures (Figure 6b);
Even if the three above-mentioned patterns have slightly different relative frequencies in the EC classes (e.g., pattern 3 is more frequent in non-enzymatic proteins), they are present in all the analysed classes, thus allowing for the 'between-categories' sample-by-sample pairing mentioned above. This peculiar situation is in line with current biochemical knowledge (minimal effect exerted by the catalysed reaction on the global structure) and it is a relevant proof-of-concept of both the reliability of the classification solution and the power of the proposed approach. On the other hand, it is very hard to de-convolve the discriminating structural nuances from the obtained solution which, as it is, only confirms the presence of 'tiny and still unknown' structural details linked to the catalytic activity of the studied molecules.
As regards the former point, Figure 7 shows the average weights vector β across the aforementioned five runs for ω = 0.5: the MKMD approach considers, for almost all classes, the centrality measures (X (2)) and the protein size (X (7)) as the most relevant representations, followed by the Betti numbers sequence (X (1)), the heat content invariants (X (5)) and the heat kernel trace (X (6)). It is worth noting that enzymes have a more pronounced allosteric effect with respect to non-enzymatic structures. This is a consequence of the need to modulate chemical kinetics according to microenvironment conditions; allostery is the modulating effect that a modification happening in a site different from the catalytic site exerts on the efficiency of the reaction [97]. Allostery implies an efficient transport of the signal along the protein structure and it was discovered to be efficiently interpreted in terms of PCN descriptors [98]; thus, the observed kernel weights fit well with current biochemical knowledge.

Conclusions
In this paper, we proposed a classification system able to simultaneously explore multiple representations, following a hybridisation between multiple kernel learning and dissimilarity spaces, hence exploiting the discriminative power of kernel methods and the customisability of dissimilarity spaces.
Specifically, several representations are treated using their respective dissimilarity representations and combined in a multiple kernel fashion, where each kernel function considers a specific dissimilarity representation. A genetic algorithm (although any derivative-free evolutionary metaheuristic could be used instead) simultaneously selects suitable representatives in the dissimilarity space and tunes the kernel weights, allowing a two-fold a posteriori knowledge discovery phase regarding the most suitable representations (higher kernel weights) and the patterns elected as prototypes in the dissimilarity space. The proposed MKMD system has been applied to a real-world problem, namely protein function prediction, with satisfactory results, greatly outperforming our previous works in which graph-based descriptors extracted from PCNs were tested for solving the very same problem. Further, the proposed system has been benchmarked against a One-Class Classifier, also able to simultaneously explore multiple dissimilarities: whilst the former outperforms the latter in terms of accuracy, AUC and informedness, a clear winner between the two methods does not exist in terms of precision and recall.
As far as the two-fold knowledge discovery phase for the proposed application is concerned, results both in terms of selected representatives in the dissimilarity space and weights automatically assigned to different representations are in line with current biological knowledge, showing the reliability of the proposed system.
Furthermore, due to its flexibility, the proposed system can be applied to any input domain (not necessarily graphs), provided that several representations can be extracted from the structured data at hand and that suitable dissimilarity measures can be defined for such heterogeneous representations.

Funding: This work has been partially supported by the Sapienza Research Calls project "PARADISE - PARAllel and DIStributed Evolutionary agent-based systems for machine learning and big data mining", 2018.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript:

Appendix A. Selected Representations
The set of eight representations X (1) , . . . , X (8) used to characterise PCNs are described in the following eight subsections.

Appendix A.1. Betti Numbers
Topological Data Analysis [99,100] is a novel data analysis approach, useful whenever data can be described by topological structures (networks): it consists of a set of techniques to extract information from data (starting from topological information) by means of dimensionality reduction, manifold estimation and persistent homology, in order to study how components lying in a multi-dimensional space are connected (e.g., in terms of loops and multi-dimensional surfaces). One can start either from so-called point clouds, where objects are described by their coordinates in a multi-dimensional space equipped with a notion of distance, or by explicitly providing the pairwise distance matrix between objects. Hereinafter, the former case is considered.
The most intuitive scenario in which to study how components lying in a multi-dimensional space are connected is (trivially) the connectivity itself. To this end, it is worth defining simplices as (multi-dimensional) topological objects which can be extracted from a given topological space X: points, lines, triangles and tetrahedrons are (for example) 0-dimensional, 1-dimensional, 2-dimensional and 3-dimensional simplices and, obviously, higher-order analogues exist. Simplices can be seen as descriptors of the space under analysis, thus worthy of attention when studying X. Starting from simplices, it is possible to define simplicial complexes as properly-constructed collections of simplices able to capture the multi-scale organisation (or multi-way relations) in complex networks [101][102][103]. The two seminal examples of simplicial complexes are the Čech complex and the Vietoris-Rips complex [99,100,104,105]; however, due to its intuitiveness and lighter computational complexity, the latter is typically used in practice. The Vietoris-Rips complex can be built according to the following rule: initially, all 0-dimensional simplices belong to the complex; then, a given set of k points forms a (k − 1)-dimensional simplex to be included in the Vietoris-Rips complex if the pairwise distances are all less than or equal to a user-defined threshold ε.
The homology of a simplicial complex can be described by its Betti numbers. Formally, the ith Betti number is the rank of the ith homology group in the simplicial complex. Informally, the ith Betti number corresponds to the number of i-dimensional 'holes' in a topological surface. In this work, 3-dimensional graphs are considered and the first three Betti numbers have the following interpretations: the 0th Betti number is the number of connected components, the 1st Betti number is the number of 1-dimensional (circular) holes, the 2nd Betti number is the number of 2-dimensional holes (cavities). The Betti numbers vanish after the spatial dimension.
From the above Vietoris-Rips complex definition, it is clear that the choice of ε is critical, as it somewhat defines the resolution of the simplicial complex. In many cases, one builds a sequence of Vietoris-Rips complexes as ε varies in order to study how 'holes' appear and disappear as the resolution changes, and then selects a desired value by studying the 'holes' lifetime in order to obtain a useful homology summary: in algebraic topology, this concept is known as persistence [106].
Instead of having a 'topological summary', following a previous work [46], the rationale is to keep proper track of the number of holes as ε changes. To this end, the range ε ∈ [4, 8] with sampling step 1 is considered, according to the PCN connectivity range. Hence, the first representation X (1) sees each protein as a 15-length integer-valued vector obtained by the concatenation of b_4, b_5, b_6, b_7, b_8, where b_i is (in turn) a 3-dimensional vector containing the first three Betti numbers for ε = i. Technically speaking, for a given ε, the Vietoris-Rips complex can be evaluated in two steps [107]:

1. Build the Vietoris-Rips neighbourhood graph G_VR(V, E_ε): an undirected graph in which an edge connects two nodes if their pairwise distance is less than or equal to ε;
2. The set of maximal cliques in G_VR forms the Vietoris-Rips complex.
Let ∂_k : S_k → S_{k−1} be the boundary operator, an incidence-like matrix which maps S_k (i.e., the set of simplices of order k) to the set of simplices of order k − 1. The k-order homology group is defined as [108]:

H_k = ker{∂_k} / im{∂_{k+1}}

where ker{·} and im{·} denote the kernel and image operators. The rank of H_k, namely the kth Betti number, is then defined as [102]:

b_k = rank(H_k) = rank(ker{∂_k}) − rank(im{∂_{k+1}})

or, thanks to the Rank-Nullity theorem [109]:

b_k = |S_k| − rank(im{∂_k}) − rank(im{∂_{k+1}})

where the rank of the image corresponds to the plain matrix rank in linear algebra.
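The two-step construction above can be sketched for the simplest case. The following minimal example (pure Python plus numpy; all function names are ours) builds the Vietoris-Rips neighbourhood graph for a threshold ε and computes the first two Betti numbers of its 1-skeleton: b_0 is the number of connected components and b_1 = |E| − |V| + b_0 is the cycle rank. Note that in the full Vietoris-Rips complex, 3-cliques (triangles) would fill some 1-cycles, so this is a sketch of the neighbourhood-graph step only.

```python
import numpy as np
from itertools import combinations

def vr_neighbourhood_graph(points, eps):
    """Step 1: edges between points at Euclidean distance <= eps."""
    pts = np.asarray(points, float)
    n = len(pts)
    edges = [(i, j) for i, j in combinations(range(n), 2)
             if np.linalg.norm(pts[i] - pts[j]) <= eps]
    return n, edges

def betti_0_1(n, edges):
    """b0 = connected components (union-find); b1 = |E| - |V| + b0."""
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x
    for i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
    b0 = len({find(i) for i in range(n)})
    b1 = len(edges) - n + b0
    return b0, b1
```

For example, four points at the corners of a unit square with ε = 1 yield one connected component and one 1-dimensional hole.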

Appendix A.2. Centrality Measures
In graph theory and network analysis, centrality measures indicate node/edge importance with respect to a given criterion. Let G = (V, E) be a graph, with V and E the sets of nodes and edges, respectively. The following centrality measures are considered:

• The degree centrality [110] DC(v_i) for node v_i ∈ V, defined as the percentage of nodes connected to it:

DC(v_i) = (1 / (|V| − 1)) Σ_j A_ij (A4)

where A is the adjacency matrix, defined as in Equation (A22). The normalisation coefficient 1/(|V| − 1) takes into account the maximum attainable degree in a simple graph, thus making the degree centrality in Equation (A4) independent of the number of nodes in the graph;
• The eigenvector centrality [110] ranks highly those nodes that are connected to other high-rank nodes. Formally, the eigenvector centrality e_i for node v_i ∈ V is given by:

e_i = (1/λ) Σ_j A_ji e_j (A5)

where λ ≠ 0 is a scalar constant. Equation (A5) can be re-written in matrix form as:

e^T A = λ e^T (A6)

Hence, the eigenvector centrality vector e is the left-hand eigenvector of the adjacency matrix A associated with the eigenvalue λ. According to the Perron-Frobenius theorem, by choosing λ as the largest (in absolute value) eigenvalue of A, the solution e is unique and all its entries are positive;
• The PageRank centrality [110] p_i for node v_i ∈ V is given by:

p_i = α Σ_j (A_ji / D(v_j)) p_j + (1 − α)/|V| (A7)

where α is a scalar constant (usually α = 0.85) and D(v_j) is the degree of node v_j. It is worth remarking the difference between degree and degree centrality: the degree is the number of nodes connected to a given node (namely, Equation (A4) without the normalisation term), whereas the degree centrality includes the normalisation term.
As in the eigenvector centrality case, Equation (A7) can be re-written in matrix form as:

p = α A D^{−1} p + β (A8)

where D^{−1} is a diagonal matrix whose ith element equals 1/D(v_i) and β is a vector whose elements are all equal to (1 − α)/|V|;
• The Katz centrality [110,111] k_i for node v_i ∈ V is given by:

k_i = α Σ_j A_ij k_j + β (A9)

where β controls the initial centrality (first-neighbourhood weights) and α < 1/λ_max attenuates the importance with respect to higher-order neighbours (in turn, λ_max is the largest eigenvalue of A). It is worth noting that if α = 1/λ_max and β = 0, the Katz centrality equals the eigenvector centrality;
• The closeness centrality [110] CC(v_i) for node v_i ∈ V is the inverse sum of shortest path distances between node v_i ∈ V and all other n − 1 reachable nodes. Formally:

CC(v_i) = ((n − 1)/(|V| − 1)) · ((n − 1) / Σ_{v_j} δ(v_i, v_j)) (A10)

where δ(·, ·) indicates the shortest path distance. The normalisation factor takes into account the graph size in order to allow comparisons between nodes of graphs having different sizes, also in the case of multiple connected components [112]. Indeed, n can be seen as the number of nodes in the connected component in which v_i lies. In the case of a single connected component, the scale factor (n − 1)/(|V| − 1) can be neglected since n = |V|;
• The betweenness centrality [110] BC(v_i) quantifies how many times a given node v_i ∈ V acts as a bridge along the shortest paths between any two nodes:

BC(v_i) = Σ_{v_j ≠ v_i ≠ v_k} s^(v_i)(v_j, v_k) / s(v_j, v_k) (A11)

where s(v_j, v_k) is the number of shortest paths from v_j to v_k and s^(v_i)(v_j, v_k) is the number of shortest paths from v_j to v_k passing through v_i.
As in the closeness centrality case, it is often customary to normalise the betweenness centrality in order to avoid dependency on the number of nodes, thus:

BC(v_i) = BC(v_i) / ((|V| − 1)(|V| − 2)/2) (A12)

• The edge betweenness centrality [113] EBC(e_i) is the edge counterpart of the "standard" (node) betweenness centrality, as it quantifies how many times a given edge e_i ∈ E acts as a bridge along the shortest paths between two nodes:

EBC(e_i) = Σ_{v_i ≠ v_j} s^(e_i)(v_i, v_j) / s(v_i, v_j) (A13)

where s^(e_i)(v_i, v_j) is the number of shortest paths between nodes v_i and v_j passing through edge e_i and s(v_i, v_j) is the total number of shortest paths between nodes v_i and v_j. As in the "standard" betweenness centrality, the edge betweenness centrality can be normalised as follows:

EBC(e_i) = EBC(e_i) / (|V|(|V| − 1)/2) (A14)

• The load centrality [113,114] LC(v_i) for node v_i ∈ V is the percentage of the total number of shortest paths passing through v_i;
• The edge load centrality ELC(e_i) for edge e_i ∈ E is the edge-related counterpart of the load centrality (like betweenness vs. edge betweenness): it is defined as the percentage of the total number of shortest paths crossing edge e_i;
• The subgraph centrality [115] SC(v_i) for node v_i ∈ V is the sum of (weighted) closed walks (i.e., connected subgraphs) starting and ending at v_i (the longer the walk, the lower the weight).
It can be evaluated thanks to the spectral decomposition of the adjacency matrix, which reads as A = B Λ B^T, where Λ is a diagonal matrix containing the eigenvalues in increasing order and B contains the corresponding unitary-length eigenvectors, thus:

SC(v_i) = Σ_j (b_j(v_i))^2 e^{λ_j} (A15)

where λ_j and b_j are the eigenvalue and eigenvector associated with node v_j ∈ V and b_j(v_i) indicates the value related to v_i in the jth eigenvector;
• The Estrada Index [116] EI(G) of a graph G quantifies the compactness (or 'folding', since the Estrada Index was indeed originally proposed in order to study molecular 3D compactness) of a graph, starting from the spectral decomposition of the adjacency matrix (as in the subgraph centrality):

EI(G) = Σ_j e^{λ_j} (A16)

• The harmonic centrality [117] HC(v_i) is the sum of inverse shortest path distances from a given node v_i ∈ V to all other nodes:

HC(v_i) = Σ_{v_j ≠ v_i} 1/δ(v_i, v_j) (A17)

• The global reaching centrality [118] GRC(G) of a graph G is the average (over all nodes) of the difference between the maximum local reaching centrality and each node's local reaching centrality. Formally:

GRC(G) = Σ_{v_i ∈ V} (LRC_max − LRC(v_i)) / (|V| − 1) (A18)

where LRC(v_i) is the local reaching centrality of node v_i ∈ V and LRC_max is the maximum local reaching centrality amongst all nodes. In turn, the local reaching centrality for a given node v_i is defined as the percentage of nodes reachable from v_i;
• The average clustering coefficient [119] ACC(G) of a graph G is given by:

ACC(G) = (1/|V|) Σ_{v_i ∈ V} cc(v_i) (A19)

where cc(v_i) is the clustering coefficient for node v_i, defined as:

cc(v_i) = 2 · tri(v_i) / (D(v_i)(D(v_i) − 1)) (A20)

where, in turn, tri(v_i) is the number of triangles passing through node v_i and D(v_i) is its degree;
• The average neighbour degree [120] AND(v_i) of node v_i ∈ V is given by:

AND(v_i) = (1/|N(v_i)|) Σ_{v_j ∈ N(v_i)} D(v_j) (A21)

where N(v_i) is the set of neighbours of node v_i.
Apart from ACC, EI and GRC, which are global characteristics (i.e., related to the whole graph), the others are local characteristics (i.e., related to each node or edge). As such, it is impossible to compare graphs having different sizes (number of nodes and/or edges) by considering their local centralities directly. The second representation X (2) sees each protein as a 27-length real-valued vector containing the mean and standard deviation across nodes/edges of DC, e, p, k, CC, BC, EBC, LC, ELC, SC, HC and AND, plus the (global) EI, GRC and ACC.
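The mean/standard-deviation aggregation of local centralities into a fixed-length feature vector can be sketched as follows. This is a minimal, pure-Python illustration on two of the measures above (degree centrality, Equation (A4), and the clustering coefficient, Equation (A20)); the function names are ours and a full implementation would cover all the listed measures.

```python
import statistics

def degree_centrality(adj):
    """DC(v) = deg(v) / (|V| - 1); assumes a simple graph with |V| > 1."""
    n = len(adj)
    return [sum(row) / (n - 1) for row in adj]

def clustering_coefficient(adj):
    """cc(v) = 2*tri(v) / (deg(v)*(deg(v)-1)); 0 when deg(v) < 2."""
    n = len(adj)
    cc = []
    for i in range(n):
        nbrs = [j for j in range(n) if adj[i][j]]
        d = len(nbrs)
        if d < 2:
            cc.append(0.0)
            continue
        # count edges among the neighbours of i (triangles through i)
        tri = sum(adj[u][v] for u in nbrs for v in nbrs if u < v)
        cc.append(2.0 * tri / (d * (d - 1)))
    return cc

def centrality_features(adj):
    """Graph-size-independent feature vector: mean/std of DC, plus ACC."""
    dc = degree_centrality(adj)
    cc = clustering_coefficient(adj)
    return [statistics.mean(dc), statistics.pstdev(dc), sum(cc) / len(cc)]
```

Because each entry is a per-graph statistic rather than a per-node value, graphs of different sizes map to vectors of the same length and become directly comparable.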

Appendix A.3. Energy and Laplacian Energy
Let G = (V, E) be a graph, with V and E the sets of nodes and edges, respectively. Since in this work unweighted and undirected graphs are considered, the adjacency matrix A is a binary |V| × |V| matrix defined as:

A_ij = 1 if (v_i, v_j) ∈ E, 0 otherwise (A22)

From A, it is possible to define the diagonal |V| × |V| degree matrix as:

D = diag(D(1), . . . , D(|V|)) (A23)

where D(i) is the degree of the ith node. In turn, from A and D, it is possible to define the Laplacian matrix as:

L = D − A (A24)

The spectrum and Laplacian spectrum of G are defined as the sets of eigenvalues of A and L, respectively [121]:

λ(A) = {λ_1, . . . , λ_|V|} (A25)
λ(L) = {μ_1, . . . , μ_|V|} (A26)

From Equations (A25) and (A26), it is possible to define the graph energy E and the Laplacian energy LE as:

E(G) = Σ_i |λ_i| (A27)
LE(G) = Σ_i |μ_i − 2|E|/|V|| (A28)

The third representation X (3) sees each protein as a 2-length real-valued vector containing E and LE.
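The two energies can be computed directly from the eigenvalues of A and L. A minimal numpy sketch (the function name is ours):

```python
import numpy as np

def graph_energies(A):
    """Graph energy E = sum |lambda_i(A)| and Laplacian energy
    LE = sum |mu_i(L) - 2|E|/|V||, with L = D - A."""
    A = np.asarray(A, float)
    n = A.shape[0]
    m = A.sum() / 2.0                       # number of edges |E|
    L = np.diag(A.sum(axis=1)) - A          # Laplacian matrix
    lam = np.linalg.eigvalsh(A)             # adjacency spectrum
    mu = np.linalg.eigvalsh(L)              # Laplacian spectrum
    E = np.abs(lam).sum()
    LE = np.abs(mu - 2.0 * m / n).sum()
    return E, LE
```

For the 3-node path graph, the adjacency spectrum is {−√2, 0, √2} and the Laplacian spectrum is {0, 1, 3}, giving E = 2√2 and LE = 10/3.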

Appendix A.4. Nodes Functional Cartography
Guimerà and Amaral in their seminal work [122] proposed a methodology in order to extract functional modules from a graph by maximising its modularity using simulated annealing [123]. Their definition of modularity takes into account both within-module degree and between-module degree with the idea that a good graph partition (i.e., high modularity) must have many within-module links and few between-module links.
Each node is then assigned two scores: the z-score and the participation coefficient P. The former measures how well-connected a given node is with respect to other nodes in its own module. The latter quantifies how many connections a given node has with nodes belonging to different modules.
The z − P plane has been heuristically divided into seven regions and each node can be classified into one of seven functional roles by considering its z-score and its participation coefficient P. Nodes having z < 2.5 are non-hubs, whereas nodes having z ≥ 2.5 are hubs. In turn, non-hub nodes can be divided into: ultra-peripherals (if P ≤ 0.05), peripherals (if P ∈ (0.05, 0.62]), non-hub connectors (if P ∈ (0.62, 0.8]) and non-hub kinless (if P > 0.8). Finally, hub nodes can be divided into: provincial hubs (if P ≤ 0.3), connector hubs (if P ∈ (0.3, 0.75]) and kinless hubs (if P > 0.75).
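The seven-region partition of the z-P plane described above translates directly into a small decision function (the function name and role labels are ours; the thresholds are those stated in the text):

```python
def functional_role(z, P):
    """Classify a node in the z-P plane following the heuristic
    regions of Guimera and Amaral's functional cartography."""
    if z < 2.5:  # non-hub nodes
        if P <= 0.05:
            return "ultra-peripheral"
        if P <= 0.62:
            return "peripheral"
        if P <= 0.8:
            return "non-hub connector"
        return "non-hub kinless"
    # hub nodes (z >= 2.5)
    if P <= 0.3:
        return "provincial hub"
    if P <= 0.75:
        return "connector hub"
    return "kinless hub"
```

The fourth representation then counts, for each graph, the fraction of nodes falling into each of the seven roles.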
The fourth representation X (4) sees each protein as an 8-length real-valued vector containing the modularity (as returned by the simulated annealing) and the percentage of nodes belonging to each functional role.

Appendix A.5. Heat Content Invariant
From the graph Laplacian and degree matrices (Equations (A24) and (A23), respectively), the normalised Laplacian matrix can be evaluated as:

L̃ = D^{−1/2} L D^{−1/2} (A29)

The spectral decomposition of L̃ reads as:

L̃ = V Λ V^T (A30)

where Λ is a |V| × |V| diagonal matrix containing the eigenvalues in increasing order and V contains the corresponding unitary-length eigenvectors.
The heat equation associated to L̃ is given by [124,125]:

∂H(t)/∂t = −L̃ H(t) (A31)

where H(t) is the |V| × |V| heat kernel matrix at time t. The heat content HC(t) of H(t) is given by:

HC(t) = Σ_{v_i ∈ V} Σ_{v_j ∈ V} Σ_k e^{−λ_k t} v_k(v_i) v_k(v_j) (A32)

where v_k(v_i) is the value related to node v_i in the kth eigenvector. The MacLaurin series for the negative exponential reads as:

e^{−λ_k t} = Σ_{m=0}^{∞} (−λ_k)^m t^m / m! (A33)

and substituting Equation (A33) in Equation (A32) yields:

HC(t) = Σ_{v_i ∈ V} Σ_{v_j ∈ V} Σ_k v_k(v_i) v_k(v_j) Σ_{m=0}^{∞} (−λ_k)^m t^m / m! (A34)

By re-writing Equation (A32) in terms of a power series as:

HC(t) = Σ_{m=0}^{∞} q_m t^m (A35)

the set of coefficients q_m are the so-called heat content invariants, which can be evaluated in closed form as:

q_m = Σ_k (Σ_{v_i ∈ V} v_k(v_i))^2 (−λ_k)^m / m! (A36)

The fifth representation X (5) sees each protein as a 4-length real-valued vector containing the first four coefficients from Equation (A36); that is, q_1, q_2, q_3, q_4.
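The closed form for the q_m coefficients can be sketched with numpy (the function name is ours). A useful sanity check is that q_0 always equals |V|, since HC(0) is the sum of the entries of the identity matrix:

```python
import numpy as np
from math import factorial

def heat_content_invariants(A, n_terms=4):
    """q_m = sum_k (sum_u v_k(u))^2 * (-lambda_k)^m / m!, computed
    from the eigen-decomposition of the normalised Laplacian."""
    A = np.asarray(A, float)
    deg = A.sum(axis=1)
    d_inv_sqrt = np.diag(1.0 / np.sqrt(deg))
    L = np.diag(deg) - A
    L_norm = d_inv_sqrt @ L @ d_inv_sqrt     # normalised Laplacian
    lam, V = np.linalg.eigh(L_norm)
    s = V.sum(axis=0)                        # sum of each eigenvector's entries
    # note: eigenvector sign ambiguity is harmless since s is squared
    return [sum(s[k] ** 2 * (-lam[k]) ** m / factorial(m)
                for k in range(len(lam))) for m in range(n_terms)]
```

For a regular graph such as the triangle, only the constant eigenvector has a non-zero entry sum, so q_0 = |V| = 3 and q_1 = 0.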

Appendix A.6. Heat Kernel Trace
Recalling the heat equation from Equation (A31) and the spectral decomposition of the normalised Laplacian matrix from Equation (A30), the solution to the former (already used in Equation (A32)) reads as:

H(t) = e^{−t L̃} = V e^{−t Λ} V^T (A37)

The heat kernel trace is evaluated by taking the trace of H(t):

Tr[H(t)] = Σ_k e^{−λ_k t} (A38)

The sixth representation X (6) sees each protein as a 10-length real-valued vector containing the heat kernel trace for t = 1, 2, . . . , 10. These values for t have been chosen by visual inspection: indeed, for t > 10 the heat kernel trace decay makes proteins indistinguishable from one another.
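Since the trace only involves the eigenvalues, the 10-length descriptor reduces to evaluating Σ_k e^{−λ_k t} at the chosen time points. A minimal numpy sketch (the function name is ours):

```python
import numpy as np

def heat_kernel_trace(A, times):
    """Tr[H(t)] = sum_k exp(-lambda_k * t) over the normalised
    Laplacian spectrum, for each t in `times`."""
    A = np.asarray(A, float)
    deg = A.sum(axis=1)
    d = np.diag(1.0 / np.sqrt(deg))
    L_norm = d @ (np.diag(deg) - A) @ d      # normalised Laplacian
    lam = np.linalg.eigvalsh(L_norm)
    return [float(np.exp(-lam * t).sum()) for t in times]
```

At t = 0 the trace equals |V|, and it decays monotonically towards the number of connected components as t grows, which is why large t values wash out the differences between proteins.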

Appendix A.7. Size
The seventh representation X (7) sees each protein as a 4-length real-valued vector containing the number of nodes, the number of edges, the number of protein chains and the radius of gyration. Whilst the first two items are rather straightforward, the latter two deserve some further comments. Proteins are composed of one or more amino-acid chains (linear polymers), thus the number of chains may impact the overall protein size. Finally, the radius of gyration [126] is a measure of how compact the overall folded protein structure is with respect to its centre of mass.

Appendix A.8. Spectral Density

Let λ(L̃) = {λ_1, . . . , λ_|V|} be the normalised Laplacian spectrum (namely, the set of eigenvalues of L̃). One of the interesting properties of the normalised Laplacian matrix is that its spectrum lies in the range [0, 2], regardless of the underlying graph [127]. The size of the spectrum, however, equals the number of nodes and therefore one cannot easily compare graphs having different sizes just by considering their respective spectra. In order to overcome this problem, following previous works [45,50], it is possible to estimate the (normalised Laplacian) spectral density using a kernel density estimator (also known as a Parzen window [128]) equipped with the Gaussian kernel. The spectral density thus has the form:

p(x) = (1/|V|) Σ_i (1/(σ√(2π))) exp(−(x − λ_i)^2 / (2σ^2)) (A39)

where σ is the kernel bandwidth, which determines the resolution of the estimate. Following [50], Scott's rule [129] has been used in order to determine a proper bandwidth value, hence:

σ = 3.5 · std(λ(L̃)) / |λ(L̃)|^{1/3} (A40)
In this manner, the bandwidth scales in a graph-wise fashion by considering each graph's spectrum size (denominator) and its standard deviation (numerator). Let G_1 and G_2 be two graphs; their distance can be evaluated by considering the ℓ2 norm between their respective spectral densities p_1(x) and p_2(x):

d(G_1, G_2) = (∫_0^2 (p_1(x) − p_2(x))^2 dx)^{1/2} (A41)

The same operation can be carried out in the discrete domain by extracting n samples from p(x) (the same for all graphs), whereby the latter collapses into the standard Euclidean distance.
The eighth representation X (8) sees each protein as a 100-length real-valued vector containing n = 100 samples uniformly drawn from their respective normalised Laplacian spectral densities.
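The sampled-density pipeline, namely a Gaussian kernel density estimate with a Scott-style bandwidth, sampled uniformly on [0, 2] and compared by Euclidean distance, can be sketched as follows (a minimal illustration; the function names are ours):

```python
import numpy as np

def spectral_density_samples(eigvals, n_samples=100):
    """Gaussian-KDE estimate of the spectral density, sampled
    uniformly on [0, 2] (the normalised Laplacian spectrum range)."""
    lam = np.asarray(eigvals, float)
    # Scott-style bandwidth: scales with spread and spectrum size
    sigma = 3.5 * lam.std() / len(lam) ** (1.0 / 3.0)
    x = np.linspace(0.0, 2.0, n_samples)
    diffs = (x[:, None] - lam[None, :]) / sigma
    dens = np.exp(-0.5 * diffs ** 2).sum(axis=1)
    dens /= len(lam) * sigma * np.sqrt(2.0 * np.pi)
    return dens

def spectral_distance(eigs1, eigs2, n_samples=100):
    """Euclidean distance between two sampled spectral densities."""
    p1 = spectral_density_samples(eigs1, n_samples)
    p2 = spectral_density_samples(eigs2, n_samples)
    return float(np.linalg.norm(p1 - p2))
```

Because every graph is mapped to the same 100 sample points, graphs of arbitrary sizes become directly comparable through an ordinary vector distance.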

Appendix B. Selected Prototypes
In the following, the sets of proteins elected as prototypes for each of the seven classification problems are shown. In order to shrink the output size, our a posteriori analysis has been carried out only on proteins which have been selected in all five runs of the genetic algorithm (in order to remove 'spurious' representatives due to randomness in the optimisation procedure).
Table A3. Selected proteins in order to discriminate EC 3 (hydrolases) vs. all the rest.