On component-wise dissimilarity measures and metric properties in pattern recognition

In many real-world applications of pattern recognition techniques, automatically learning the most appropriate dissimilarity measure for comparing objects is of utmost importance. Real-world objects are often complex entities that need a specific representation grounded on a composition of different heterogeneous features, leading to a non-metric starting space where Machine Learning algorithms operate. However, in these so-called unconventional spaces a family of dissimilarity measures can still be exploited, namely the set of component-wise dissimilarity measures, in which each component is treated with a specific sub-dissimilarity that depends on the nature of the data at hand. These dissimilarities are likely to be non-Euclidean, hence the underlying dissimilarity matrix is not isometrically embeddable in a standard Euclidean space, because the latter may not be structurally rich enough. On the other hand, in many metric learning problems, a component-wise dissimilarity measure can be defined as a weighted linear convex combination whose weights can be suitably learned. This article, after introducing some hints on the relation between distances and the metric learning paradigm, provides a discussion along with some experiments on how weights, intended as mathematical operators, interact with the Euclidean behavior of dissimilarity matrices.


INTRODUCTION
In the past few decades, the discipline of pattern recognition (PR), which aims to automatically discover regularities in data, has focused most of its efforts on frameworks conceived to learn from examples, that is, from observations. These frameworks exploit several machine learning techniques grounded on the data-driven approach (Bishop, 2006). In these cases, the goal of a PR system is to find regularities in data and to reach good generalization capabilities by building a model from known observations (Hart, Stork & Duda, 2000). Thereby, at the basis of an automated PR pipeline there are the observations, which can be any type of measurement on real-world objects. Observations can be collected by hand or automatically by sensors. Furthermore, observations can be labeled, and labels make it possible to distinguish the class or category in which the object falls (Martino, Giuliani & Rizzi, 2018). Leaving aside the "philosophical" problem of the differences between objects that live in the real world (a topic that deserves a systematic discussion of its own), it can be stated that it is really difficult to enumerate all the differences between two real-world objects, at least at the raw (atomic) level. What we can reveal are the differences between the (physical or virtual) properties of two objects, and on that basis tell whether the objects can be considered different. This task is part of the PR process and deserves a thorough discussion. In the PR jargon, the problem is known as finding a good representation of objects: e.g., if weight is an important property in defining the objects, it should be taken into account; otherwise the system should not consider the "weight" feature. A really interesting theoretical treatment, within the context of cognitive science, of how natural properties arise from generic objects in building a suitable representation is provided by Peter Gärdenfors with the theory of conceptual spaces (Gärdenfors, 2004).
A representation can exist in several forms, such as numbers, strings, graphs, images, spectra, time series, densities and similarities. Robert P. W. Duin states that: (i) "every real-world difference between objects that may play a role in the human judgment of their similarity should make a difference in the representation" and (ii) "the representation of a real world object, i.e., the mapping from the object to its representation, should be continuous". These prescriptions indicate that the representation should consider the real-world properties judged as important and, furthermore, that two similar objects should also be similar in their representations. On top of a good representation, it is possible to train a myriad of learning algorithms capable of generating a model from data objects and, finally, of generalizing towards previously unseen data. In fully supervised learning, the generalization process (classification) needs labeled examples, while unlabeled examples are used in unsupervised learning schemes (Jain, Murty & Flynn, 1999; Jain, Duin & Mao, 2000). As we will see, alongside the classical learning algorithms adopted in machine learning, it can be useful to learn a dissimilarity function tailored to the data at hand. This particular task belongs to the metric learning (ML) paradigm, a flourishing research field in PR (Lu et al., 2018; Bengio, Courville & Vincent, 2013).
As anticipated, many real-world objects in PR cannot be simply described by a set of measurements collected in real-valued vectors. In other words, the representation of objects may not easily start from a vector space, and in this case the dissimilarity measure cannot be simply defined as, for example, a plain Minkowski distance. A data structure known as the dissimilarity matrix then becomes important. Thereby, in many cases, the core of a PR system is a custom-based dissimilarity measure, that is, a way to measure the dissimilarity between samples of a given complex process described by a set of measurements that can (even simultaneously) involve real numbers, integers, vectors, categorical variables, graphs, spectra, histograms, uneven sequences of objects/events, time series, etc. This happens when real-world objects possess a complex description arising from different intrinsic characteristics, each one captured by a suitable data structure. The overall dissimilarity can then be chosen within the family of Euclidean distances, or within the general class of Minkowski distances. However, the structure of the given distance needs to take into account the different data structures. In the technical literature, distances involving complex and possibly heterogeneous data structures are known as component-wise or element-wise distances (Jimenez, Gonzalez & Gelbukh, 2016): for each component a specific difference operator is used for the data structure at hand and, once collected, all of them are synthesized in a "template" distance that may have the Euclidean or Minkowski general form. In other words, distances are a function of the additive combination of the contributions of their components (Beals, Krantz & Tversky, 1968). A further generalization can be derived from the weighted Euclidean distance (WED), where a weight is associated to each component. In ML tasks, these weights can be suitably learned automatically, usually through an optimization procedure.
The WED is widely applied in PR problems, for instance in bioinformatics and personalized medicine (Hu & Yan, 2007; Martino et al., 2020; Di Noia et al., 2020), speech synthesis (Lei, Ling & Dai, 2010) or in the industrial field (Rao, 2012). For example, the WED is used in clustering applications dealing with side information (Xing et al., 2002). In fact, if a clustering algorithm such as k-means initially fails to find a solution that is meaningful from the user's point of view, the user is forced to manually tweak the metric until sufficiently good clusters are found. In Schultz & Joachims (2003), the authors present a method for learning a distance metric starting from relative comparisons such as "A is closer to B than A is to C". A similar application can be found in Kumar & Kummamuru (2008), where a local metric is learned.
Moreover, in many real-world problems dealing with complex systems, the starting space is not a vector space and is often also non-metric, e.g., in life sciences (Münch et al., 2020; Martino, Giuliani & Rizzi, 2018), engineering applications (D'urso & Massari, 2019; Kim, Lee & Kim, 2018) or cybersecurity (Granato et al., 2020, 2022). Consequently, only the dissimilarity representation is available through the dissimilarity matrix, as stated above. Hence, in such cases, the dissimilarity matrix is a primitive data structure compared to the data matrix. As we will see in the following, a dissimilarity matrix $D$ is said to be "Euclidean" if it is perfectly (isometrically) embeddable in a Euclidean vector space, i.e., if the distances calculated in that space are identical to the entries of $D$ (De Santis, Rizzi & Sadeghian, 2018). Several standard classifiers are designed to work effectively on Euclidean vector spaces, so operating with a non-Euclidean (or even non-metric) dissimilarity matrix may cause some problems. As an example, a non-Euclidean distance matrix leads to a non-positive definite kernel, and the quadratic optimization procedure used to train a support vector machine (Vapnik, 1998; Schölkopf, Burges & Smola, 1999) may thereby fail, since the Mercer conditions are not fulfilled (Mercer, 1909; Duin, Pękalska & Loog, 2013; Pękalska & Duin, 2005). However, some solutions exist for training standard classifiers on this kind of data. The two main ones are based either on considering the rows of the dissimilarity matrix as vectors in a space endowed with the standard Euclidean distance (dissimilarity space representation) or on adopting a suitable transformation of the dissimilarity matrix, leading to the Pseudo-Euclidean (PE) space (Pękalska & Duin, 2005; De Santis, Rizzi & Sadeghian, 2018). In the current study, we will consider the latter case. For the former, the interested reader is referred to Pękalska & Duin (2005).
It is well known that, from dissimilarity data collected in the form of a dissimilarity matrix $D$, the starting Euclidean space where the original data points lie can be "reconstructed" (De Santis, Rizzi & Sadeghian, 2018). The reconstruction process (known as embedding) tries to generate a vector space in which the distances are preserved as well as possible. Classical multi-dimensional scaling is an example of such an embedding procedure (Borg & Groenen, 2005). A Euclidean distance matrix can be embedded isometrically in a Euclidean space, i.e., all the distances are preserved. For non-Euclidean distance matrices the Euclidean space is not "large enough" to embed the dissimilarity data, even though they can still be embedded in the so-called PE space (Goldfarb, 1984). The embedding procedure involves the eigendecomposition of the kernel matrix $G = XX^T$, also known as the Gram matrix (Horn & Johnson, 2013), where $X$ is the configuration matrix with data points organized as rows. The Gram matrix is a similarity matrix, obtainable through a suitable linear transformation of the dissimilarity matrix $D$.
In this work, we consider a class of PR problems involving a dissimilarity matrix $D$ deriving from a custom-based component-wise dissimilarity measure $d(x, y, w): \mathcal{F} \times \mathcal{F} \to \mathbb{R}^{+}$. The following study is based on the characterization of $d$ as a composite dissimilarity measure computed as the $\ell_2$ norm of the vector $\bar{d}_c$ that collects the component-wise (sub)-dissimilarity measures, whose functional form is related to the specific features (i.e., data structures) within a suitable structured non-metric feature space $\mathcal{F}$.
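To make the idea of a composite component-wise dissimilarity concrete, the following minimal Python sketch combines three sub-dissimilarities into one value. Everything in it is an illustrative assumption rather than the measure used in this article: the three feature types, the sub-dissimilarity functions (`d_real`, `d_categorical`, `d_sequence`) and the way the weights enter the norm, namely as $\sqrt{\sum_j (w_j \bar{d}_{c,j})^2}$.

```python
import math

def d_real(a, b):
    """Sub-dissimilarity for real-valued components: absolute difference."""
    return abs(a - b)

def d_categorical(a, b):
    """Sub-dissimilarity for categorical components: 0 if equal, 1 otherwise."""
    return 0.0 if a == b else 1.0

def d_sequence(a, b):
    """Sub-dissimilarity for equal-length strings: normalized Hamming distance."""
    if len(a) != len(b):
        raise ValueError("this sketch only handles equal-length sequences")
    if not a:
        return 0.0
    return sum(ca != cb for ca, cb in zip(a, b)) / len(a)

# One (hypothetical) sub-dissimilarity per component of the structured object.
SUB_DISSIMILARITIES = (d_real, d_categorical, d_sequence)

def composite_dissimilarity(x, y, w):
    """l2 norm of the weighted vector of component-wise sub-dissimilarities."""
    d_c = [f(xi, yi) for f, xi, yi in zip(SUB_DISSIMILARITIES, x, y)]
    return math.sqrt(sum((wi * di) ** 2 for wi, di in zip(w, d_c)))

x = (1.0, "red", "ACGT")
y = (4.0, "blue", "ACGA")
print(composite_dissimilarity(x, y, w=(1.0, 1.0, 1.0)))  # all components count
print(composite_dissimilarity(x, y, w=(0.0, 1.0, 0.0)))  # only the categorical one
```

Setting a weight to zero switches the corresponding component off, which is the simplest way to see how learned weights reshape the overall dissimilarity.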
Within this framework, in this article we provide two characterizations. The first one addresses the claim that the behavior of a general dissimilarity measure depends on the behavior of the component-wise (sub)-dissimilarities. Specifically, $d$ generates a Euclidean dissimilarity matrix if the (sub)-dissimilarities $d_{F_j}$ are Euclidean. Therefore, the features $F_j$ over which a particular dissimilarity measure is induced, i.e., a structural dissimilarity in the sense of Duin & Pękalska (2010), can influence the nature of the mathematical space where the learning algorithm works.
As concerns the second characterization, it is interesting to provide a mathematical interpretation of the weights of the custom-based dissimilarity matrix, in particular by asking what influence a weighting matrix $W$ has on the eigenspectrum of the underlying Gram matrix $G_w$, that is, the Gram matrix obtained from the weighted version of the dissimilarity matrix $D$. Unfortunately, the relationship between the eigenvalues of $G$ and $G_w$ is not straightforward in the general case, being an open problem in mathematics (Zhang & Zhang, 2006; Fulton, 2000). It is approachable in particular cases of commuting matrices (in the case of matrices sharing a complete set of eigenvectors, i.e., normal matrices) or when one of the two is a scalar matrix, i.e., a matrix of the form $W = \lambda I$. We will trace some results in the latter case.
Although this article addresses these characterizations from a theoretical and mathematical viewpoint, the interested reader can find practical applications in the following articles. In De Santis et al. (2015) and De Santis, Rizzi & Sadeghian (2018), a One-Class classification approach is used in the field of predictive maintenance and in the real-time recognition of faults in a real-world power grid, by processing heterogeneous information coming from smart sensors related to the power grid equipment and to the surrounding environment. The system exploits a clustering-genetic algorithm (GA) (Goldberg, 1989) approach where the weights of a custom-based Euclidean dissimilarity measure are learned by solving a suitable optimization problem. In De , we addressed the problem of finding suitable representative elements in the dissimilarity space in order to classify protein contact networks according to their enzymatic properties, and in De , the dissimilarity space embedding has been used to recognize signals pertaining to malfunctioning states of pressurization systems for high-speed railway trains. Finally, in Martino et al. (2020) the same problem of classifying protein contact networks according to their enzymatic properties has been solved by a hybridization of dissimilarity spaces and multiple kernel learning.
The degree of "non-metricity", and even of non-Euclidean behavior, can be suitably measured with specific indexes obtained from the PE embedding, such as the Eigen-Ratio, the Negative Eigen-Fraction and the Non-Metricity Fraction, each of which measures the non-Euclidean behavior of a given dissimilarity matrix. Therefore, while the second question concerns the relation between $G$ and $G_w$, in the first characterization we ask what the influence of the dissimilarity weights on the Negative Eigen-Fraction is, and hence how it is possible to tune the non-Euclidean behavior of a custom-based dissimilarity matrix.
The article is organized as follows. "Metric Learning" provides a brief review of the various ML paradigms treated in the literature. "On Metric Spaces and Dissimilarity Matrices" is a concise description of metric spaces and related dissimilarity matrices that serves as background. "The Weighted Euclidean Distance" examines in more depth the structure of the Euclidean distance and its weighted component-wise counterpart. "Characterization of a Composite Component-wise Dissimilarity" and "On the Presence of Weights in a Component-wise Dissimilarity and the Eigenspectrum of the Gram Matrix" sketch an experimental evaluation of the two principal investigations and, finally, "Conclusion" concludes the article.

METRIC LEARNING
The ML problem is concerned with learning a distance function tuned to a particular task and has been shown to be useful when exploited in conjunction with techniques relying explicitly on distances or dissimilarities, such as clustering algorithms, nearest-neighbor classifiers, etc. For example, if the task is to assess the similarity (or dissimilarity) between two images with the aim of finding a match, e.g., in face recognition, we would like to discover a proper distance function that emphasizes appropriate features (hair color, ratios of distances between facial key-points, etc.). Although this task can be performed by hand, it is very useful to develop tools for automatically learning the subset of meaningful features for the problem at hand. In fact, as anticipated in "Introduction", useful representations can also be learned. However, at least on a theoretical level, representation learning must be kept separate from classification tasks, as depicted in Fig. 1 and discussed in Bellet, Habrard & Sebban (2013).
The ML step can be conceived as a first step in the open-loop pipeline depicted in Fig. 1, to be performed before the model synthesis stage. Moreover, both tasks can be done together in the same system, representing an advanced closed-loop and automatic PR system. This is the case, for example, of feature selection and feature extraction techniques. These procedures can be done manually, but if they are automated (i.e., optimized) they can be a building block of the classification system itself. There are many methodologies capable of learning a representation; some authors distinguish between neural learning, that is, learning by means of deep learning techniques, and ML. Despite this distinction, both approaches generally ground the learning procedure on optimization techniques. Neural learning is useful in finding a good feature space, while ML involves learning a suitable manifold where data objects lie and where they can be well represented for solving the problem at hand.
Figure 1 Scheme of the common process in Metric Learning. A metric is learned from data coming from a suitable distribution and plugged into a predictor (e.g., a classifier, a regressor, a recommender system, etc.). The predictor fed with the learned metric hopefully performs better than a predictor induced by a standard, non-learned, metric (Bellet, Habrard & Sebban, 2013).

Many variants of ML are available and, according to Fig. 2, they can be summarized in three principal paradigms: fully supervised, weakly supervised and semi-supervised. An informal formulation of the supervised ML task is as follows: given an input distance function $d(x, y)$ between objects $x$ and $y$ (for example, the Euclidean distance), along with supervised information regarding an ideal distance, construct a new distance function $\tilde{d}(x, y)$ which is "better" than the original distance function (Kulis, 2012). Normally, fully supervised paradigms have access to a set of labeled training instances, whose labels are used to generate a set of constraints. In other words, supervised ML is cast into pairwise constraints: equivalence constraints, where pairs of data points belong to the same class, and inequivalence constraints, where pairs of data points belong to different classes (Bar-Hillel et al., 2003; Xing et al., 2002). In weakly supervised learning algorithms we do not have access to the labels of individual training examples, and learning constraints are given in a different form as side information, while semi-supervised paradigms also exploit samples for which neither labels nor side information are available. Some authors (e.g., Saul & Roweis (2003)) deal with unsupervised ML paradigms, sometimes also called manifold learning, referring to the idea of learning an underlying low-dimensional manifold (see footnote 2) where geometric relationships (e.g., the distance) between most of the observed data are preserved.
Often this paradigm coincides with the dimensionality reduction paradigm, as in the well-known Principal Component Analysis (PCA) (Shlens, 2014; Giuliani, 2017) and Classical Multi-Dimensional Scaling, both based on linear relations. As concerns non-linear counterparts, it is worth mentioning embedding methods such as ISOMAP (Tenenbaum, De Silva & Langford, 2000), Locally Linear Embedding (Roweis & Saul, 2000) and Laplacian Eigenmap (Belkin & Niyogi, 2003). Other methods are based on information-theoretic relations such as the Mutual Information. Hence, the form or structure of the learned metric can be linear, non-linear or local. Linear ML paradigms are based on the learning of a metric in the form of a generalized Mahalanobis distance (Mahalanobis, 1936) between data objects, whose parameters have to be learned. In other words, the learning algorithm learns a linear transformation $x \to Wx$ that better represents the similarity in the target domain. Sometimes the available data contain non-linear structures that linear algorithms are unable to capture. This limitation leads to the non-linear ML paradigm, which can be based on the "kernelization" of linear methods or on purely non-linear mapping methods. For the Euclidean distance, the latter lead to a kernelized version combining the learned transformation $f(x): \mathbb{R}^m \to \mathbb{R}^{\bar{m}}$ with a Euclidean distance function able to capture highly non-linear similarity relations, that is, $d_f(x, y) = \lVert f(x) - f(y) \rVert_2$ (et al., 2012). Local metric learning refers to a problem where multiple local metrics are learned, and it often relies on heterogeneous data objects. In this setting, algorithms learn using only local pairwise constraints. According to the scheme depicted in Fig. 2, the scalability of the solution is a challenging task, especially if we consider the growing availability of data in the Big Data era. Scalability can be important with respect to the dataset size $n$ and/or the dimensionality of the data $m$.
Finally, the intrinsic optimization task underlying the ML paradigm makes the optimality of the solution another important aspect. The latter depends on the structure of the optimization scheme, that is, on whether the problem is convex or not (Boyd & Vandenberghe, 2004). In fact, for convex formulations any local optimum is also a global optimum. On the contrary, for non-convex formulations, the solution may only be a local optimum.

2 A manifold is a topological space that resembles Euclidean space near each point. Hence each point of an n-dimensional manifold has a neighborhood that is homeomorphic to the Euclidean space of dimension n.
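As an illustration of how supervised ML casts class labels into pairwise constraints, as described earlier in this section, the sketch below enumerates equivalence and inequivalence pairs from a list of labels. The function name `pairwise_constraints` and the toy labels are hypothetical; real ML algorithms usually subsample these pairs rather than enumerating all of them.

```python
from itertools import combinations

def pairwise_constraints(labels):
    """Split all index pairs (i, j), i < j, into equivalence and inequivalence sets."""
    equivalent, inequivalent = [], []
    for i, j in combinations(range(len(labels)), 2):
        if labels[i] == labels[j]:
            equivalent.append((i, j))    # same class: should end up close
        else:
            inequivalent.append((i, j))  # different classes: should end up far
    return equivalent, inequivalent

labels = ["a", "b", "a", "b"]
equivalent, inequivalent = pairwise_constraints(labels)
print(equivalent)    # [(0, 2), (1, 3)]
print(inequivalent)  # [(0, 1), (0, 3), (1, 2), (2, 3)]
```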

ON METRIC SPACES AND DISSIMILARITY MATRICES
Definitions
The standard Euclidean space, as a vector space, is highly structured from the algebraic viewpoint. Moreover, the Euclidean distance is experienced daily by human beings. PR problems do not necessarily involve spaces with such a high-level structure. Basically, from the PR point of view, a finite number of objects have to possess those properties that guarantee generalization, hence learning. The principal property is "closeness", which relies on the notion of neighborhood, a primitive property applicable to general topological spaces (Deza & Deza, 2009). Furthermore, the metric properties that enrich the structure of primitive mathematical objects can be induced not only on a space but also on a set (e.g., the set of binary strings).
If all conditions (reflexivity, symmetry, definiteness and the triangle inequality) are fulfilled, d is properly called a distance function. Conversely, if some conditions are weakened, the space continues to have some structure and d is better known in PR as a dissimilarity. For example, a space (X, d) that obeys only the reflexivity condition is known as a hollow space; a hollow space that obeys the symmetry constraint is a pre-metric space; a pre-metric space obeying definiteness is a quasi-metric space; a pre-metric space satisfying the triangle inequality is a semi-metric space.
Definition 2 (Metric for Dissimilarity Matrix $\mathcal{D}$ (De Santis, Rizzi & Sadeghian, 2018)). Let $\mathcal{D}$ be a symmetric dissimilarity matrix with positive off-diagonal elements $d_{ij}$, built on a set of $n$ objects, where $d_{ij}$ is an admissible measure of the dissimilarity between the objects $o_i$ and $o_j$. $\mathcal{D}$ is metric if the triangle inequality $d_{ij} + d_{jk} \geq d_{ik}$ holds for all triplets $(i, j, k)$.

It is worth noting that if two objects are similar in a metric sense, every other object that has a relation with one will have a similar relation with the other. This property makes either of the two objects eligible to become a prototype in learning algorithms (Pękalska & Duin, 2005).
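Definition 2 translates directly into a brute-force check over all triplets. The following sketch is only meant for small matrices (it costs $O(n^3)$); the function name `is_metric`, the tolerance and the two toy matrices are assumptions made for illustration.

```python
from itertools import permutations

def is_metric(D, tol=1e-12):
    """Check d_ij + d_jk >= d_ik for all triplets (i, j, k) of distinct indices."""
    n = len(D)
    return all(D[i][j] + D[j][k] >= D[i][k] - tol
               for i, j, k in permutations(range(n), 3))

# Three collinear points at positions 0, 1, 2: a metric matrix.
D_metric = [[0, 1, 2],
            [1, 0, 1],
            [2, 1, 0]]

# Same matrix with d_13 inflated to 3: now d_12 + d_23 = 2 < 3.
D_nonmetric = [[0, 1, 3],
               [1, 0, 1],
               [3, 1, 0]]

print(is_metric(D_metric))     # True
print(is_metric(D_nonmetric))  # False
```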
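Here $\mathcal{J} = I - \frac{1}{n}\mathbf{1}\mathbf{1}^T$ denotes the centering matrix, where $I$ is the identity matrix and $\mathbf{1}$ is the all-ones vector. If $\mathcal{D}$ is Euclidean, it is also metric (Gower & Legendre, 1986). Given a vector configuration $\{x_1, x_2, \ldots, x_n\}$ in a Euclidean space $(\mathbb{R}^m, d_2)$ equipped with the standard inner product $\langle x_i, x_j \rangle$ and organized in an $n \times m$ configuration matrix $X$, the Gram matrix $G$, also known as linear kernel matrix, can be expressed by the inner product between all pairs of vectors $x_i, x_j$ as $G = XX^T$. Since the squared distance $d_2^2$ can be expressed in terms of the inner product as

$$d_2^2(x_i, x_j) = \langle x_i, x_i \rangle + \langle x_j, x_j \rangle - 2\langle x_i, x_j \rangle,$$

a linear relation between the Gram matrix $G$ and the matrix of squared Euclidean distances $\mathcal{D}^{*2}$ can be found. The relation between $G$ and $\mathcal{D}^{*2}$ is:

$$G = -\frac{1}{2}\,\mathcal{J}\mathcal{D}^{*2}\mathcal{J}.$$

Conversely, the relation between $\mathcal{D}^{*2}$ and $G$ is:

$$\mathcal{D}^{*2} = g\mathbf{1}^T + \mathbf{1}g^T - 2G,$$

where $g = \mathrm{diag}(G)$.

6 Here the bold notation is not used, to indicate that X is a set of generic objects and not only a vector space. Hereinafter, the calligraphic notation (instead of the bold one) will be used for the dissimilarity matrix $\mathcal{D}$ and for the so-called centering matrix $\mathcal{J}$.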
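The two relations between the Gram matrix and the matrix of squared distances can be checked numerically. The sketch below is written in plain Python for three hypothetical points in the plane; the helper names are assumptions. Note that double centering returns the Gram matrix of the configuration translated so that its centroid is at the origin, which is why the reference Gram matrix is computed from the centered points.

```python
def gram_from_sq_distances(D2):
    """G = -1/2 * J * D2 * J with J = I - (1/n) 1 1^T (double centering)."""
    n = len(D2)
    row_mean = [sum(r) / n for r in D2]
    col_mean = [sum(D2[i][j] for i in range(n)) / n for j in range(n)]
    total_mean = sum(row_mean) / n
    return [[-0.5 * (D2[i][j] - row_mean[i] - col_mean[j] + total_mean)
             for j in range(n)] for i in range(n)]

def sq_distances_from_gram(G):
    """D2 = g 1^T + 1 g^T - 2 G with g = diag(G)."""
    n = len(G)
    return [[G[i][i] + G[j][j] - 2.0 * G[i][j] for j in range(n)]
            for i in range(n)]

# Hypothetical configuration: three points in R^2, one per row.
X = [(0.0, 0.0), (3.0, 0.0), (0.0, 4.0)]
D2 = [[sum((a - b) ** 2 for a, b in zip(p, q)) for q in X] for p in X]

G = gram_from_sq_distances(D2)

# Reference: Gram matrix of the centered configuration.
centroid = [sum(p[k] for p in X) / len(X) for k in range(2)]
Xc = [[p[k] - centroid[k] for k in range(2)] for p in X]
G_ref = [[sum(a * b for a, b in zip(p, q)) for q in Xc] for p in Xc]

D2_back = sq_distances_from_gram(G)
print(G)
print(D2_back)
```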
Given a non-metric (pre-metric) or non-Euclidean symmetric dissimilarity matrix $\mathcal{D}$, the eigendecomposition of the Gram matrix $G = Q\Lambda Q^T$, where $\Lambda$ is a diagonal matrix of eigenvalues organized in descending order and $Q$ is an orthogonal matrix of the corresponding eigenvectors, reveals negative eigenvalues, i.e., the corresponding Gram matrix $G$ is indefinite. However, an embedding is still possible by constructing a suitable space, i.e., the PE space, with a suitable inner product and norm.
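The indefiniteness of the Gram matrix can be observed without a full eigendecomposition: a symmetric matrix that admits a vector $v$ with $v^T G v < 0$ cannot be positive semi-definite, so it has at least one negative eigenvalue. The sketch below uses a hypothetical 3 × 3 dissimilarity matrix that violates the triangle inequality and a hand-picked test vector; both are assumptions chosen for illustration.

```python
def gram_from_distances(D):
    """Square the entries of D, then double-center: G = -1/2 * J * D^{*2} * J."""
    n = len(D)
    D2 = [[d * d for d in row] for row in D]
    row_mean = [sum(r) / n for r in D2]
    total_mean = sum(row_mean) / n
    # D2 is symmetric, so column means coincide with row means.
    return [[-0.5 * (D2[i][j] - row_mean[i] - row_mean[j] + total_mean)
             for j in range(n)] for i in range(n)]

def quadratic_form(G, v):
    """Evaluate v^T G v."""
    n = len(v)
    return sum(v[i] * G[i][j] * v[j] for i in range(n) for j in range(n))

# Non-metric matrix: d_12 + d_23 = 2 < d_13 = 3.
D = [[0.0, 1.0, 3.0],
     [1.0, 0.0, 1.0],
     [3.0, 1.0, 0.0]]

G = gram_from_distances(D)
v = [1.0, -2.0, 1.0]
q = quadratic_form(G, v)
print(q)  # negative, so G is indefinite
```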
A generalization of the well-known Euclidean distance on a vector space $X \subseteq \mathbb{R}^m$ is the Minkowski distance.
Definition 4 (Minkowski distance). Given two vectors $x, y \in \mathbb{R}^m$, the Minkowski distance of order $p \in (-\infty, +\infty)$ is defined as:

$$d_p(x, y) = \left( \sum_{i=1}^{m} |x_i - y_i|^p \right)^{1/p}.$$

Depending on the value of the parameter $p$, this distance generalizes the Euclidean distance (p = 2) or the Manhattan distance (p = 1). However, the distance is not metric for all values of $p$. For p = 2 it is trivially metric, being the standard Euclidean distance. For every value p ≥ 1 the Minkowski distance is metric, while the triangle inequality can fail for $p \in (0, 1)$, as can already be verified in dimension m = 2 with three suitably chosen points.

As for the PE space $\mathbb{R}^{(p,q)}$, it is endowed with an indefinite inner product that is positive in $\mathbb{R}^p$ and negative in $\mathbb{R}^q$. Hence, given two vectors $x, y$ in this space, the bilinear inner product can be defined as:

$$\langle x, y \rangle_{pq} = x^T J_{pq}\, y = \sum_{i=1}^{p} x_i y_i - \sum_{i=p+1}^{p+q} x_i y_i.$$

In the same way, the squared norm is defined as $\lVert x \rVert_{pq}^2 = \langle x, x \rangle_{pq} = \sum_{i=1}^{p} |x_i|^2 - \sum_{i=p+1}^{p+q} |x_i|^2$, which can also be negative. The Gram matrix $G = -\frac{1}{2}\mathcal{J}\mathcal{D}^{*2}\mathcal{J}$ is now expressed as:

$$G = X J_{pq} X^T,$$

where $J_{pq} = \mathrm{diag}(I_p, -I_q)$ is known as the fundamental symmetry in the PE space $\mathbb{R}^{(p,q)}$. The isometric embedding can be found by a proper decomposition of $G$ in a PE space:

$$G = Q \Lambda Q^T = Q\, |\Lambda|^{1/2} \begin{bmatrix} J_{pq} & 0 \\ 0 & 0 \end{bmatrix} |\Lambda|^{1/2}\, Q^T,$$

where p + q = k and $|\Lambda|^{1/2}$ is a diagonal matrix whose diagonal elements are the square roots of the absolute values of the eigenvalues, organized in descending order: first the positive ones, then the negative ones, followed by the zeros. $X_k = Q_k |\Lambda_k|^{1/2}$ is the configuration of vectors in the PE space $\mathbb{R}^k = \mathbb{R}^{(p,q)}$, where the $k$ non-zero eigenvalues, corresponding to $k$ eigenvectors in $Q$, are preserved.
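The indefinite inner product of the PE space is easy to state in code. The sketch below assumes that a vector of $\mathbb{R}^{(p,q)}$ is stored as a flat sequence whose first $p$ coordinates belong to the positive subspace and whose last $q$ coordinates belong to the negative one; the function names are hypothetical.

```python
def pe_inner(x, y, p, q):
    """<x, y>_pq: sum over the first p coordinates minus sum over the last q."""
    if len(x) != p + q or len(y) != p + q:
        raise ValueError("vectors must have p + q coordinates")
    positive = sum(a * b for a, b in zip(x[:p], y[:p]))
    negative = sum(a * b for a, b in zip(x[p:], y[p:]))
    return positive - negative

def pe_sq_norm(x, p, q):
    """Squared PE norm <x, x>_pq; unlike a Euclidean norm it can be negative."""
    return pe_inner(x, x, p, q)

x = (1.0, 0.0, 2.0)  # a vector in R^(2,1)
y = (3.0, 1.0, 1.0)
print(pe_inner(x, y, 2, 1))  # 1*3 + 0*1 - 2*1 = 1.0
print(pe_sq_norm(x, 2, 1))   # 1 + 0 - 4 = -3.0
```

With q = 0 the function reduces to the standard inner product, which is the sense in which the PE space extends the Euclidean one.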
Finally, the estimated PE covariance matrix $C$ can be found as:

$$C = \frac{1}{n-1}\, X^T X J_{pq} = \frac{1}{n-1}\, \Lambda_k.$$

Hence $X$ is an uncorrelated representation and, even if $C$ is not positive definite in the Euclidean sense, it is positive definite in the PE sense, and $X$ can be interpreted in the general context of the indefinite kernel PCA approach.
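Returning to Definition 4, the failure of the triangle inequality for $p \in (0, 1)$ can be seen on a small example. The three points below are hypothetical (they are not necessarily the ones used in the original counterexample), and the function name `minkowski` is an assumption.

```python
def minkowski(x, y, p):
    """Minkowski distance of order p (only p > 0 is exercised in this sketch)."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

# Hypothetical points in dimension m = 2.
a, b, c = (0.0, 0.0), (1.0, 1.0), (0.0, 1.0)

for p in (0.5, 1.0, 2.0):
    direct = minkowski(a, b, p)
    detour = minkowski(a, c, p) + minkowski(c, b, p)
    status = "holds" if direct <= detour + 1e-12 else "violated"
    print(f"p={p}: d(a,b)={direct:.4f}, d(a,c)+d(c,b)={detour:.4f} -> {status}")
```

For p = 0.5 the direct distance is 4 while the detour through c costs only 2, so the triangle inequality is violated; for p = 1 and p = 2 it holds.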

THE WEIGHTED EUCLIDEAN DISTANCE
Let $X \in \mathbb{R}^{m \times n}$ be an m × n data matrix with n data objects, arranged as columns, where m is the dimension of the vector space in which the data points lie. The vector space is endowed with the standard scalar product

$$\langle x_i, x_j \rangle = x_i^T x_j = \sum_{k=1}^{m} x_{ik}\, x_{jk},$$

where $x_{ik}$ is the coordinate of $x_i$ along $e_k$, the k-th standard basis vector, i.e., a vector of all zeros except for the entry k, which has a 1. The Euclidean distance function (see footnote 9) in such a space, equipped with the standard inner product $\langle \cdot, \cdot \rangle$, can be expressed as:

$$d(x_i, x_j) = \sqrt{\langle x_i - x_j,\, x_i - x_j \rangle} = \sqrt{(x_i - x_j)^T (x_i - x_j)},$$

where the elements $\mathcal{D}_{ij} = d(x_i, x_j)$, $i, j = 1, 2, \ldots, n$, form the entries of the n × n distance matrix $\mathcal{D}$ between the objects $x_i$ and $x_j$ in $X$. Given a symmetric positive-definite matrix $M$ with real-valued entries, i.e., $M = M^T$ and $x^T M x > 0$ for all $x \neq 0$, the entry $\mathcal{D}^M_{ij}$ of the WED matrix $\mathcal{D}^M$ can be expressed as:

$$\mathcal{D}^M_{ij} = d(x_i, x_j, M) = \sqrt{(x_i - x_j)^T M (x_i - x_j)} = \sqrt{(x_i - x_j)^T W^T W (x_i - x_j)}, \qquad (8)$$

where $M = W^T W$ is the Cholesky decomposition (Strang, 1976) of the matrix $M$ which, in the general Hermitian case, is the decomposition of a Hermitian positive-definite matrix into the product of a lower triangular matrix and its conjugate transpose.

9 The standard Euclidean distance is an instance of a more general family of distances parametrized by the exponent p, known as the Minkowski distance family. See "Characterization of a Composite Component-wise Dissimilarity" for a short introduction.
In the ML literature the distance in Eq. (8) is often referred to as a generalized Mahalanobis distance. The above properties hold trivially for a standard Euclidean space where $M = I$. The matrix $W$ can be seen as a linear operator that transforms the shape of the space where the data points, i.e., the data vectors, lie. Specifically, $W$ defines a suitable transformation (endomorphism) $V \to V$ of the (abstract) space $V$ spanned by the data vectors of $X$ into itself: given a vector $x$ in the starting space $S_1$, the matrix $W$ maps this vector into a new vector $x_w = Wx$ that lies in the space $S_2$, where $S_1$ and $S_2$ are isomorphic to $V$. In the new transformed space the inner product becomes the standard inner product $\langle \cdot, \cdot \rangle$, i.e.,

$$\langle x_w, y_w \rangle = (Wx)^T (Wy) = x^T W^T W y = x^T M y.$$

The arrival space is endowed with a squared norm given by

$$\lVert x_w \rVert^2 = x_w^T x_w = x^T M x.$$

Observation 1. The weighted distance $d_M(x, y) = d(x, y, M)$ with $M = W^T W$ equals $d_I(x_w, y_w) = d(x_w, y_w, I)$, where $x_w = Wx$, $y_w = Wy$ and $I$ is the identity matrix.
Proof. The proof follows by the same algebraic manipulation of Eq. (8). Let $M = W^T W$, $x_w = Wx$ and $y_w = Wy$. It holds that:

$$d_M^2(x, y) = (x - y)^T W^T W (x - y) = (Wx - Wy)^T (Wx - Wy) = (x_w - y_w)^T I\, (x_w - y_w) = d_I^2(x_w, y_w).$$ □

The matrix $W$ is an instance of an operator that defines a rotation and a scaling of the objects upon which it operates. $W$ maps a circle in the unweighted Euclidean space into an ellipse in the weighted Euclidean space (see Fig. 4). Hence we can state the following theorem.
Theorem 1. Applying a transformation $W$ to all the points of a circle of radius $r$, the resulting points form an ellipse whose center is the same as that of the circle and whose axes have lengths equal to $2r$ times the square roots of the eigenvalues of $M = W^T W$.
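Observation 1 and Theorem 1 can both be checked numerically in dimension two. The sketch below is an illustration under stated assumptions: the matrix `W`, the test vectors and the radius are arbitrary choices, the circle is assumed to be centered at the origin, and the axis lengths of the image are estimated by sampling the circle rather than computed in closed form.

```python
import math

def matvec(A, x):
    """Product of a small dense matrix with a vector."""
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]

def weighted_distance(x, y, M):
    """d(x, y, M) = sqrt((x - y)^T M (x - y))."""
    diff = [a - b for a, b in zip(x, y)]
    return math.sqrt(sum(d * m for d, m in zip(diff, matvec(M, diff))))

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def eig_sym_2x2(M):
    """Closed-form eigenvalues (largest first) of a symmetric 2x2 matrix."""
    a, b, c = M[0][0], M[0][1], M[1][1]
    mean, dev = (a + c) / 2.0, math.hypot((a - c) / 2.0, b)
    return mean + dev, mean - dev

W = [[2.0, 1.0],
     [0.0, 1.0]]
# M = W^T W
M = [[sum(W[k][i] * W[k][j] for k in range(2)) for j in range(2)]
     for i in range(2)]

# Observation 1: weighting with M equals a plain distance after mapping with W.
x, y = (1.0, 2.0), (3.0, -1.0)
print(weighted_distance(x, y, M), euclidean(matvec(W, x), matvec(W, y)))

# Theorem 1: image of a circle of radius r centered at the origin.
r, samples = 1.5, 20000
norms = []
for k in range(samples):
    t = 2.0 * math.pi * k / samples
    norms.append(math.hypot(*matvec(W, (r * math.cos(t), r * math.sin(t)))))
lam_max, lam_min = eig_sym_2x2(M)
print(2 * max(norms), 2 * r * math.sqrt(lam_max))  # major axis length
print(2 * min(norms), 2 * r * math.sqrt(lam_min))  # minor axis length
```

The sampled axis lengths agree with $2r\sqrt{\lambda_i}$ up to the sampling resolution, since the square roots of the eigenvalues of $W^T W$ are the singular values of $W$.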
The weight matrix $M$ can be decomposed into its rotation and scaling components by means of the eigendecomposition. Specifically, $M = QDQ^T$, where $Q$ is an orthogonal matrix with normalized column vectors, that is $Q^T Q = I$ and $Q^T = Q^{-1}$, and $D$ is a diagonal matrix (see footnote 12). $D$ contains the eigenvalues $\lambda_1, \lambda_2, \ldots, \lambda_m$ (organized in decreasing order), which are the scaling factors, while $Q$ is the rotation operator matrix that leaves the (squared) norm of vectors unchanged, that is $\lVert Qx \rVert^2 = x^T Q^T Q x = \lVert x \rVert^2$ (Strang, 1976). At this point it is possible to express the WED in terms of the above eigendecomposition:

$$d(x, y, M) = \sqrt{(x - y)^T Q D Q^T (x - y)} = \sqrt{(\hat{x} - \hat{y})^T D\, (\hat{x} - \hat{y})}. \qquad (10)$$

From Eq. (10) it follows that $d(x, y, M)$ can be expressed, through the eigendecomposition of the weighting matrix $M$, as another weighted distance with weights given by the eigenvalue matrix $D$. This new distance takes into account the new vectors $Q^T x = \hat{x}$ and $Q^T y = \hat{y}$, which are the rotated counterparts of the original vectors $x$ and $y$. In other words, the two vectors $\hat{x}$ and $\hat{y}$ are the rotated, but not scaled, versions of $x_w$ and $y_w$, which both originate in the space $S_2$. It can be demonstrated that the length of the axis of the ellipsoid in the direction of the i-th eigenvalue $\lambda_i$ is equal to $\sqrt{d(x, y, M)/\lambda_i}$.

11 We can assert that each space of vectors $x$ comes with its dual space of linear functionals $w^T$. In the scalar product $w^T x$, $w^T$ acts linearly upon vectors $x$ and $y$, i.e., $w^T(\lambda x + \mu y) = \lambda w^T x + \mu w^T y$. At the same time $x$ acts linearly upon $v^T$ and $w^T$, i.e., $(\lambda v^T + \mu w^T)x = \lambda v^T x + \mu w^T x$. So the linear functionals $w^T$ form a vector space dual (or conjugate) to the space of vectors $x$. Each space is dual to the other, and they have the same finite dimension.

12 The eigendecomposition is a safe operation because $M$ is a (square) real symmetric matrix; furthermore, it can be demonstrated (spectral theorem (Strang, 1976)) that $M$ is diagonalizable by the matrix of its eigenvectors, i.e., $Q^T M Q = D$.
Finally, if the weighting matrix $M$ is a diagonal matrix with real entries, the above eigendecomposition reduces to $M = EDE^{T}$, where $E = [e_1, e_2, \dots, e_m]$ is the eigenvector matrix, whose columns contain the standard basis of $\mathbb{R}^m$ with the property $e_i^{T}e_j = \delta_{ij}$, where $\delta_{ij}$ is the Kronecker delta. In this case the matrix $E$ represents the identity element of the rotation operator, leaving vectors in their original place, while they are scaled by factors given by the diagonal entries of $M$, since the eigenvalues of a diagonal matrix are its diagonal entries. An example of this phenomenon is given in Fig. 5.
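Theorem 1 lends itself to a quick numerical check. The sketch below, assuming NumPy and an arbitrarily chosen $2 \times 2$ matrix `W` (a hypothetical example, not the one used for Fig. 5), maps a sampled circle through `W` and compares the measured axis lengths of the resulting ellipse with $2r\sqrt{\lambda_i}$.

```python
import numpy as np

r = 2.0
# Hypothetical transformation, chosen only for illustration.
W = np.array([[2.0, 1.0],
              [0.0, 0.5]])
M = W.T @ W

# Sample the circle of radius r densely and map every point through W.
theta = np.linspace(0.0, 2.0 * np.pi, 20001)
circle = r * np.vstack([np.cos(theta), np.sin(theta)])  # shape (2, N)
ellipse = W @ circle

# The ellipse is centred at the origin, so its semi-axes are the largest
# and smallest distances of the mapped points from the origin.
radii = np.linalg.norm(ellipse, axis=0)
measured_axes = 2.0 * np.array([radii.max(), radii.min()])

# Prediction of Theorem 1: 2 r sqrt(lambda_i), eigenvalues in decreasing order.
lambdas = np.sort(np.linalg.eigvalsh(M))[::-1]
predicted_axes = 2.0 * r * np.sqrt(lambdas)
```

The measured values are obtained from a finite sampling of the circle, so the agreement with the prediction is approximate rather than exact.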

CHARACTERIZATION OF A COMPOSITE COMPONENT-WISE DISSIMILARITY
When the PR problem at hand deals with heterogeneous measures on objects, and these measures are both structurally and semantically different (graphs, time series, images, real numbers, etc.), a composite dissimilarity measure can be useful, for example in clustering applications. The dissimilarity measure is a combination of (sub)-dissimilarities suitably defined depending on the nature of the data. Before constructing a toy composite dissimilarity measure, it is worth mentioning the following corollary, valid when a dissimilarity measure is computed by combining the dissimilarities pertaining to all of the $m$ attributes separately. In fact, given $m$ features, a dissimilarity measure can be computed as $d(x, y) = \sum_{i=1}^{m} f(x_i, y_i)$, where $f(x_j, x_j) = 0$ and $f(x_j, y_j) = f(y_j, x_j) \geq 0$ for all $j$. The corollary states:

Corollary 1. Let $x, y \in \mathbb{R}^m$. Then $d(x, y) = \sum_{i=1}^{m} f(x_i, y_i)$ is metric iff $f$ is metric on $\mathbb{R}$.

Proof. Since $f$ is non-negative, symmetric and $f(s, s) = 0$ for $s \in \mathbb{R}$, the first three axioms of metric spaces, i.e., reflexivity, symmetry and definiteness, are fulfilled. Furthermore (⇒), since $d$ is metric, $d(x, y) + d(y, z) \geq d(x, z)$ holds for all $x, y, z$. If we consider $x_j = c_x$, $y_j = c_y$, $z_j = c_z$ for all $j$ and some constants $c_x, c_y, c_z$, all $m$ terms of each sum coincide and the triangle inequality for $d$ reduces to $f(c_x, c_y) + f(c_y, c_z) \geq f(c_x, c_z)$. The converse (⇐) is trivial, since summing the triangle inequalities of $f$ over the $m$ components yields the triangle inequality for $d$.
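Corollary 1 can be illustrated with two simple choices of $f$. The sketch below is only an illustration under assumed inputs: the absolute difference is a metric on $\mathbb{R}$, while the squared difference is not, and the constant vectors mirror the construction used in the proof.

```python
import numpy as np

def componentwise(f, x, y):
    """d(x, y) = sum_i f(x_i, y_i) for a scalar sub-dissimilarity f."""
    return float(sum(f(a, b) for a, b in zip(x, y)))

def abs_diff(a, b):
    return abs(a - b)          # a metric on R

def sq_diff(a, b):
    return (a - b) ** 2        # symmetric and non-negative, but not a metric

# Constant vectors x_j = c_x, y_j = c_y, z_j = c_z, as in the proof.
x, y, z = np.zeros(3), np.ones(3), 2.0 * np.ones(3)

def triangle_holds(f):
    """Check d(x, y) + d(y, z) >= d(x, z) on the three vectors above."""
    return componentwise(f, x, y) + componentwise(f, y, z) >= componentwise(f, x, z)

metric_case = triangle_holds(abs_diff)      # 3 + 3 >= 6
non_metric_case = triangle_holds(sq_diff)   # 3 + 3 >= 12 fails
```

With the squared difference the triangle inequality already fails for $f$ on the constants 0, 1, 2, and the failure carries over to the sum $d$.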
□ Moreover, it can be demonstrated (Gower & Legendre, 1986) that, at least for the Euclidean case ($p = 2$ in the Minkowski distance definition), if f :

We now demonstrate the following claim, valid for a composite dissimilarity measure, making use of Def. 3, which characterizes the Euclidean behavior of dissimilarity matrices, and of Def. 2 for the metric behavior.
Claim 1. Given two general objects $x, y \in H$, where $H$ is a generic feature space, and a component-wise custom-based dissimilarity $d(x, y) = \sqrt{(x \ominus y)^{T}(x \ominus y)}$, if at least one component-wise dissimilarity is not Euclidean, then the dissimilarity matrix that arises from $d$ applied to objects within $H$ is not Euclidean.

As stated in "On Metric Spaces and Dissimilarity Matrices", the expression "non-Euclidean" means that there is no set of vectors in a vector space of any dimensionality for which the Euclidean distances between the objects are identical to the given ones (Duin & Pękalska, 2010). We now show how Claim 1 can be demonstrated with a constructive example. Let $x = (x_1, x_2, \dots, x_k, x_{k+1}, \dots, x_m)$ and $y = (y_1, y_2, \dots, y_k, y_{k+1}, \dots, y_m)$ be two objects in a vector space $H^{v}$. We define a set of component-wise dissimilarities for the first $k$ components as $f^{cw}_j(x_j, y_j) = |x_j - y_j|$, $j = 1, 2, \dots, k$, and a single component-wise dissimilarity for the remaining $m - k$ components as $f^{p}(x_{s=k+1,\dots,m}, y_{s=k+1,\dots,m}) = \left(\sum_{s=k+1}^{m} |x_s - y_s|^{p}\right)^{1/p}$. In other words, we split the starting space $H^{v}$ as the Cartesian product (Strang, 1976) of two sub-spaces: the space $H^{cw}$ generated by the first $k$ components, in which the component-wise dissimilarities are computed as $f^{cw}$, and the space $H^{p}$, where the dissimilarity is computed as the Minkowski distance $f^{p}$ applied to the last $m - k$ components. Finally, the overall dissimilarity between two objects, say $x, y$, is induced by the $\ell_2$ norm in the following way:

$$d(x, y) = \sqrt{(f^{cw} \oplus f^{p})^{T}(f^{cw} \oplus f^{p})} \qquad (11)$$

where $(f^{cw} \oplus f^{p})$ is the vector of dimension $k + 1$ constructed by the concatenation of the (sub)-dissimilarities $f^{cw}_j$, $j = 1, 2, \dots, k$, and $f^{p}$.
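The composite dissimilarity used in the constructive example can be sketched as follows. This is a minimal illustration assuming NumPy; the function name `composite_dissimilarity` and its signature are hypothetical and introduced only here.

```python
import numpy as np

def composite_dissimilarity(x, y, k, p):
    """l2 norm of the concatenation of k absolute differences and one
    Minkowski (sub)-dissimilarity computed on the remaining components."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    f_cw = np.abs(x[:k] - y[:k])                           # f_j^cw, j = 1..k
    f_p = np.sum(np.abs(x[k:] - y[k:]) ** p) ** (1.0 / p)  # single scalar f^p
    v = np.append(f_cw, f_p)                               # vector of dimension k + 1
    return float(np.sqrt(v @ v))

rng = np.random.default_rng(0)
x, y = rng.normal(size=8), rng.normal(size=8)

d_p2 = composite_dissimilarity(x, y, k=5, p=2.0)    # Minkowski block with p = 2
d_p08 = composite_dissimilarity(x, y, k=5, p=0.8)   # Minkowski block with p = 0.8
d_euclid = float(np.linalg.norm(x - y))
```

For $p = 2$ the squared Minkowski term equals the sum of the squared differences of the last $m - k$ components, so `d_p2` coincides with the plain Euclidean distance `d_euclid`; for other values of $p$ the two generally differ.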
To evaluate the validity of Claim 1, the dissimilarity in Eq. (11) is computed on a sample drawn from a multivariate Gaussian distribution with dimension $m$ parameterized as $m = k + l$, where $k$ is kept fixed without loss of generality and $l$ is varied. Note that the parameter $p$ controls the nature of the Minkowski distance, making the (sub)-dissimilarity $f^{p}$ metric or non-metric (and even non-Euclidean; see footnote 13) depending on its value, as demonstrated above: for $p \geq 1$ it is metric.
In order to measure the non-Euclidean behavior of the space induced by the Minkowski distance, we introduce the Negative Eigen-Fraction (NEF):

$$\mathrm{NEF} = \frac{\sum_{j=p+1}^{p+q} |\lambda_j|}{\sum_{i=1}^{p+q} |\lambda_i|} \qquad (12)$$

where $(p, q)$ is the signature of the PE space and $\lambda_i$ are the eigenvalues of the Gram matrix decomposition. The NEF measures the degree of the non-Euclidean influence by evaluating the ratio between the sum of the magnitudes of the negative eigenvalues and that of the overall set of eigenvalues. Another index that helps to quantify the non-Euclidean influence is the Negative Eigen-Ratio (NER):

$$\mathrm{NER} = \frac{|\lambda_{\min}|}{\lambda_{\max}} \qquad (13)$$

where $\lambda_{\min}$ and $\lambda_{\max}$ are the minimum and maximum eigenvalues of the Gram matrix. Fig. 6 reports, following the same experimental scheme proposed in , several curves representing the NEF for a Gaussian sample of 100 points, varying the $p$ parameter of the Minkowski distance, as a function of the dimensionality. It is clear that the Minkowski distance is non-Euclidean for any $p \neq 2$, but for very high dimensionality values the Euclidean behavior is restored independently of $p$.

However, from Def. 3, we know that an $n \times n$ ($n \geq m$) dissimilarity (distance) matrix $D$ is Euclidean if it can be embedded in a Euclidean space $(\mathbb{R}^m, d_2)$, where $d_2$ is the standard Euclidean distance. This means that the Gram matrix $G$ obtained as described in "On Metric Spaces and Dissimilarity Matrices" does not contain negative eigenvalues, hence it is a positive semi-definite matrix. The remainder of the discussion is therefore based on the eigenvalue spectrum of the Gram matrix computed from the dissimilarity matrix $\hat{D} = [\hat{d}_{ij}]$. Fig. 7 reports the eigenvalue spectra of the Gram matrix $\hat{G}$ obtained from the dissimilarity matrix $\hat{D}$ computed for a fixed $k = 5$ and for $l = 0, 1, \dots, 4$. The dashed lines are the cases $l = 0$ and $l = 1$. The first one represents the spectrum deduced from the first $k = 5$ components of $H^{v}$ and, as expected, it contains only positive eigenvalues; thereby the dissimilarity matrix $\hat{D}$ is isometrically embeddable.
The same holds for $l = 1$ because, trivially, $f^{p}(x, y) = (|x - y|^{p})^{1/p} = |x - y|$ for $x, y \in \mathbb{R}$, thus the dissimilarity measure $f^{p}$ remains metric. For $l > 1$ the spectra contain both positive and negative eigenvalues, making the Gram matrix $\hat{G}$ indefinite. As a counterexample, Fig. 8 depicts the spectra of the dissimilarity in Eq. (11) when the parameter $p$ of the Minkowski distance is set to $p = 2$. As expected, in this case the dissimilarity behaves in a Euclidean fashion.
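The two indices can be computed directly from a dissimilarity matrix through the double-centred Gram matrix. The sketch below assumes NumPy, the usual definitions of NEF and NER based on eigenvalue magnitudes, and an arbitrary Gaussian sample; sample size, dimension, seed and function names are illustrative assumptions and do not reproduce Fig. 6.

```python
import numpy as np

def minkowski_matrix(X, p):
    """Pairwise Minkowski dissimilarities between the rows of X."""
    diff = np.abs(X[:, None, :] - X[None, :, :])
    return (diff ** p).sum(axis=-1) ** (1.0 / p)

def gram_matrix(D):
    """Double centring: G = -1/2 J D^{*2} J, with J the centring matrix."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    return -0.5 * J @ (D ** 2) @ J

def nef_and_ner(D):
    lam = np.linalg.eigvalsh(gram_matrix(D))
    nef = np.abs(lam[lam < 0]).sum() / np.abs(lam).sum()
    ner = abs(lam.min()) / lam.max()
    return nef, ner

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))

nef_p2, ner_p2 = nef_and_ner(minkowski_matrix(X, 2.0))    # Euclidean case
nef_p08, ner_p08 = nef_and_ner(minkowski_matrix(X, 0.8))  # non-metric case

# A small hand-made matrix that violates the triangle inequality (1 + 1 < 5).
D_bad = np.array([[0.0, 1.0, 5.0],
                  [1.0, 0.0, 1.0],
                  [5.0, 1.0, 0.0]])
nef_bad, ner_bad = nef_and_ner(D_bad)
```

For $p = 2$ the NEF is zero up to round-off, while the hand-made matrix `D_bad` yields a Gram matrix with eigenvalues 12.5, 0 and -3.5, hence a clearly positive NEF.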

ON THE PRESENCE OF WEIGHTS IN A COMPONENT-WISE DISSIMILARITY AND THE EIGENSPECTRUM OF THE GRAM MATRIX
In the discussion related to Claim 1, a suitable component-wise custom dissimilarity was introduced that, in general, combines several (sub)-dissimilarities, each one induced for a specific feature type $F_j$. Specifically, we had two groups: in the first, the (sub)-dissimilarities act on a vector subspace and are computed as the component-wise $\ell_1$ norm $|x_i - y_i|$, while in the second group a single (sub)-dissimilarity is computed as the Minkowski distance ($p = 0.8$, hence neither metric nor Euclidean). We now discuss the case in which the same family of custom-based dissimilarities is weighted, hence taking the form described in "The Weighted Euclidean Distance" for the WED. In other words, given a pair of objects $x_i$ and $y_j$, the dissimilarity measure under analysis is the weighted dissimilarity $d_w(x_i, y_j)$ of Eq. (14). Given $n$ objects $x_i$, $i = 1, 2, \dots, n$, the weighted dissimilarity matrix whose entries are given by $d_w(x_i, y_j)$ (see Eq. (14)) is hereinafter referred to as $D_w$, for convenience. The latter can be decomposed according to Eq. (1) as $G_w = -\frac{1}{2} J D_w^{*2} J$, where $G_w$ is the Gram matrix parametrized by the weight matrix $W$. As discussed in "Metric Learning", the weights act as a linear mapping $M : x \mapsto Wx$.

Starting from the above settings, two questions arise. The first is whether, in principle, it is possible to find a suitable weighting matrix $W$ that makes the dissimilarity matrix $D_w$ "more Euclidean". The second concerns the behavior of the Gram matrix $G_w$ in terms of eigendecomposition; in other words, one may ask what the relationship is between the eigenvalues (and eigenvectors) of the non-weighted Gram matrix $G$ and those of the weighted one $G_w$. The two questions are strongly interrelated; however, the first is simpler than the second.

To answer the first question, one may conceive a simple problem in which the NEF defined in Eq. (12) is minimized. We can consider a diagonal matrix $W_{diag} = \mathrm{diag}([w_1, w_2, \dots, w_d])$, and the task is to solve the following minimization problem:

$$\underset{w_1, \dots, w_d}{\arg\min}\ \mathrm{NEF}(G_w) \quad \text{s.t.} \quad 0 \leq w_i \leq 1, \quad i = 1, 2, \dots, d \qquad (15)$$

The NEF (see Eq. (12)) depends on the eigenvalues $\lambda_i$ of the Gram matrix, which, in turn, depend through a non-linear operation on the dissimilarity matrix $D_w$, which depends on the weighted dissimilarity measure $d_w(x_i, y_j)$, which, finally, depends on the weight matrix $W_{diag}$ (if diagonal). The optimization problem can be studied in the same setting used to discuss Claim 1. Specifically, it is a simple exercise to adopt a meta-heuristic, such as a GA, to solve the optimization problem in Eq. (15). The two subspaces, $H^{cw}$ and $H^{p}$, have a dimensionality equal to 3, and the parameter of the Minkowski distance acting on $H^{p}$ is set to 0.8 (hence neither metric nor Euclidean).
Starting from a random population of 30 individuals (chromosomes) for the weights $w$, the GA converges to the (sub)-optimal solution $w^{*} = [1, 1, 0.999, 0.0001]$ with a fitness value (the NEF) equal to 2.0380e-06, hence negligible. As expected, the GA finds a solution with higher weights for the "Euclidean" components and a practically null value for the "Minkowski" component.
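The flavor of this experiment can be reproduced with a much simpler optimizer. The sketch below deliberately replaces the GA with a plain random search over $[0, 1]^4$, and it assumes that the weights multiply the entries of the vector of (sub)-dissimilarities before the $\ell_2$ norm is taken; this weighted form, the sample, the seed and all names are assumptions made for illustration and are not the article's exact setup.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, l, p = 40, 3, 3, 0.8
X = rng.normal(size=(n, k + l))

# Pairwise (sub)-dissimilarities stacked as an (n, n, k + 1) array:
# k absolute differences plus one Minkowski term on the last l components.
diff = np.abs(X[:, None, :] - X[None, :, :])
f_cw = diff[:, :, :k]
f_p = (diff[:, :, k:] ** p).sum(axis=-1) ** (1.0 / p)
F = np.concatenate([f_cw, f_p[:, :, None]], axis=-1)

J = np.eye(n) - np.ones((n, n)) / n  # centring matrix

def nef(w):
    """NEF of the Gram matrix of the weighted dissimilarity (assumed form)."""
    D_w = np.sqrt(((F * w) ** 2).sum(axis=-1))
    lam = np.linalg.eigvalsh(-0.5 * J @ (D_w ** 2) @ J)
    return np.abs(lam[lam < 0]).sum() / np.abs(lam).sum()

# Random search (not a GA), starting from the unweighted case w = 1.
best_w = np.ones(k + 1)
nef_unweighted = nef(best_w)
best_nef = nef_unweighted
for _ in range(300):
    w = rng.uniform(0.0, 1.0, size=k + 1)
    value = nef(w)
    if value < best_nef:
        best_w, best_nef = w, value

# Switching the Minkowski component off leaves a weighted Euclidean distance.
nef_without_minkowski = nef(np.array([1.0, 1.0, 1.0, 0.0]))
```

Under these assumptions, setting the last weight to zero removes the only non-Euclidean component, so `nef_without_minkowski` vanishes up to round-off, in line with the kind of solution reported for the GA.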
Although the answer to the first question is trivial, the second question, about the relationship between the spectra of $G$ and $G_w$, is only apparently simple. Here we give a sketch of the problem. Suppose that $F$ is a vector space endowed with the standard inner product $\langle \cdot, \cdot \rangle$, and $X \in \mathbb{R}^{m \times n}$ is a data matrix with the $n$ data points organized as columns. The discussion can be restricted to a Euclidean space equipped with the standard Euclidean distance $d(x_i, x_j) = \|x_i - x_j\|_2$ for $x_i, x_j \in X$. The scalar product matrix, or Gram matrix, with data vectors in the columns of the data matrix and variables in its rows, is $G = X^{T}X$. The linear mapping $M : x \mapsto Wx$ transforms the data matrix $X$ into $M(X) = WX = Y$. Thereby, the Gram matrix becomes $G_w = Y^{T}Y = X^{T}W^{T}WX$.

In trying to find a relation between the eigenvalues of $Y^{T}Y$ and those of $X^{T}X$, we can make use of the relation between the Singular Value Decomposition (SVD) of an $m \times n$ matrix $A$ and the eigendecomposition of the $n \times n$ matrix $A^{T}A$. In fact, any $m \times n$ matrix can be factored as $A = U \Sigma V^{T}$ (Strang, 1976), where the columns of $U$ ($m \times m$) are the eigenvectors of $AA^{T}$ and the columns of $V$ ($n \times n$) are the eigenvectors of $A^{T}A$; finally, the $r = \mathrm{rank}(A)$ singular values on the diagonal of $\Sigma$ ($m \times n$) are the square roots of the nonzero eigenvalues of both $A^{T}A$ and $AA^{T}$ (footnote 15).
Let $Y = U_w \Sigma_w V_w^{T}$ be the SVD of $Y$ and $X = U \Sigma V^{T}$ be the decomposition of $X$. If we multiply both sides of the first relation on the left by $W^{-1}$ (assuming $W$ is invertible), we obtain $X = W^{-1} U_w \Sigma_w V_w^{T}$. If we equate these two relations and multiply both sides by $U^{T}$ on the left and by $V$ on the right, and further consider that $V^{T}V = I = U^{T}U$, we come to the relation:

$$\Sigma = U^{T} W^{-1} U_w \Sigma_w V_w^{T} V \qquad (16)$$

Equation (16) is a (complex) relation between the (diagonal) singular value matrix $\Sigma$, whose entries are the singular values of $X$, and the singular values of $Y = WX$, placed on the diagonal of $\Sigma_w$. Unfortunately, the calculation cannot be carried further in closed form unless we make additional assumptions on $W$. The reason becomes clear if we think of $WX$ as the product of two matrices: in fact, the original question about the relationship between the eigenvalues of the Gram matrices $G$ and $G_w$ can be translated into the relation among the eigenvalues of the matrices $A$, $B$ and $AB$. However, this sought-after relationship between the eigenvalues of the product of general matrices and those of its factors is still an open problem of mathematics, even if in the literature there are a number of works that provide several inequalities for the matrix product and sum problem (Zhang & Zhang, 2006; Fulton, 2000; Watkins, 1970; Thompson & Therianos, 1971).

If $W$ is a scalar matrix of the form $W = kI$, the relation shown in Eq. (16) becomes simple. In fact, we can write $WX = kIX = U_w \Sigma_w V_w^{T}$, but $X = U \Sigma V^{T}$, hence $WX = kIU \Sigma V^{T} = U_w \Sigma_w V_w^{T}$. It means that the singular vectors are the same, $U = U_w$ and $V = V_w$, and therefore Eq. (16) becomes $\Sigma = k^{-1}\Sigma_w$. Hence, for the spectrum of $G_w = X^{T}W^{T}WX$, we have $\Sigma^{T}\Sigma = k^{-2}\Sigma_w^{T}\Sigma_w$, that is, the eigenvalues of $G_w$ are $k^{2}$ times those of $G$ (footnote 16).

Footnote 15: It is easy to show the relation between the eigenvalues and the singular values: $AA^{T} = U \Sigma V^{T} V \Sigma^{T} U^{T} = U \Sigma \Sigma^{T} U^{T}$, being $V^{T}V = I$. In the same way, $A^{T}A = V \Sigma^{T} U^{T} U \Sigma V^{T} = V \Sigma^{T} \Sigma V^{T}$, being $U^{T}U = I$. $V$ and $U$ are orthogonal matrices for a real $A$ (for complex $A$ they are unitary matrices). $\Sigma^{T}\Sigma$ ($n \times n$) and $\Sigma\Sigma^{T}$ ($m \times m$) are diagonal matrices whose nonzero diagonal entries are the squares of the singular values of $A$, that is, the nonzero eigenvalues of $A^{T}A$ and $AA^{T}$.
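The scalar case $W = kI$ is easy to verify numerically: the singular values of $WX$ are those of $X$ multiplied by $k$, so the eigenvalues of the Gram matrix are multiplied by $k^{2}$. The sketch below assumes NumPy, a random data matrix and an arbitrary value of $k$; all of them are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(3, 10))   # m = 3 variables (rows), n = 10 points (columns)
k = 0.5                        # hypothetical scalar weight
W = k * np.eye(3)
Y = W @ X

# Singular values of X and of Y = WX (returned in decreasing order).
sv_X = np.linalg.svd(X, compute_uv=False)
sv_Y = np.linalg.svd(Y, compute_uv=False)

# Spectra of the Gram matrices G = X^T X and G_w = X^T W^T W X.
eig_G = np.linalg.eigvalsh(X.T @ X)
eig_Gw = np.linalg.eigvalsh(Y.T @ Y)

singular_values_scale_by_k = np.allclose(sv_Y, k * sv_X)
eigenvalues_scale_by_k2 = np.allclose(eig_Gw, k ** 2 * eig_G, atol=1e-10)
```

Both flags evaluate to true for this example; for a generic, non-scalar `W` no such uniform scaling of the spectrum is to be expected.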
Ultimately, no general relationship is available between the spectrum of the product of two generic matrices and the spectra of the individual matrices, except in simple cases (footnote 17). In general, two generic matrices do not share the same set of eigenvectors, and this makes the analysis infeasible. In order to show graphically, in a computational fashion, the relationship between the eigenvalues of the Gram matrix obtained from a weighted dissimilarity matrix and those obtained from a non-weighted dissimilarity matrix, we generated a random bi-dimensional matrix $X^{test} \in \mathbb{R}^{2 \times 20}$, hence containing 20 random 2-D vectors. The dissimilarity matrix $D^{test}_w$ on $X^{test}$ is computed through the standard Euclidean distance, and the Gram matrix $G^{test}_w$ is then extracted. The dissimilarity measure is weighted with a diagonal matrix of the form $W = \begin{pmatrix} \alpha & 0 \\ 0 & \beta \end{pmatrix}$, where $\alpha, \beta \in (0, 1]$ (footnote 18). Finally, the eigendecomposition of $G^{test}_w$ is performed, yielding the first two eigenvalues $\lambda_{w1}$ and $\lambda_{w2}$ as functions of $W$.
Fig. 9 depicts the values of the first and second eigenvalues of $G^{test}_w$, respectively, as a function of $\alpha$ and $\beta$ in the predefined interval. Fig. 10, instead, reports the values of the first and second eigenvalues in the case $\alpha = \beta = k$, that is, the case $W = kI$.

Footnote 17: It is possible to demonstrate that diagonalizable matrices share the same eigenvector matrix $S$ if and only if $AB - BA = 0$, that is, if they commute (Strang, 1976). The result also holds for normal matrices $N$, that is, matrices where $N$ commutes with $N^{H}$ (Wilkinson, Wilkinson & Wilkinson, 1965).

Footnote 18: We note that the eigenvalues of a diagonal matrix are its diagonal entries, i.e., $\alpha$ and $\beta$, and the eigenvectors are the canonical basis of $\mathbb{R}^m$.
For completeness, Fig. 11 reports the sum, the product, and the quotient of the first two eigenvalues of $G^{test}_w$, while Fig. 12 shows the same operations in the case $\alpha = \beta = k$.
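An experiment of this kind can be sketched in a few lines. The code below assumes NumPy, uniformly distributed random 2-D points, a $10 \times 10$ grid of weights and a fixed seed; these are illustrative choices and do not reproduce the data behind Figs. 9 to 12.

```python
import numpy as np

rng = np.random.default_rng(42)
X_test = rng.random((2, 20))   # 20 random 2-D vectors stored as columns

def top_two_eigenvalues(alpha, beta, X=X_test):
    """First two eigenvalues of the Gram matrix of the weighted Euclidean
    dissimilarity obtained with W = diag(alpha, beta)."""
    Y = np.diag([alpha, beta]) @ X
    diff = Y[:, :, None] - Y[:, None, :]
    D = np.sqrt((diff ** 2).sum(axis=0))       # weighted dissimilarity matrix
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    G = -0.5 * J @ (D ** 2) @ J                # double-centred Gram matrix
    lam = np.linalg.eigvalsh(G)[::-1]          # decreasing order
    return lam[0], lam[1]

# Eigenvalue surfaces over a grid of weights in (0, 1].
grid = np.linspace(0.1, 1.0, 10)
lam1 = np.array([[top_two_eigenvalues(a, b)[0] for b in grid] for a in grid])
lam2 = np.array([[top_two_eigenvalues(a, b)[1] for b in grid] for a in grid])

# Derived quantities: sum, product and quotient of the two eigenvalues.
lam_sum, lam_prod, lam_quot = lam1 + lam2, lam1 * lam2, lam1 / lam2

# Scalar case alpha = beta = k compared with the unweighted spectrum.
k = 0.5
base = np.array(top_two_eigenvalues(1.0, 1.0))
scaled = np.array(top_two_eigenvalues(k, k))
```

In the scalar case the two eigenvalues are simply those of the unweighted Gram matrix multiplied by $k^{2}$, whereas for $\alpha \neq \beta$ the surfaces `lam1` and `lam2` have no such simple form.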

CONCLUSION
In solving real-world problems in pattern recognition we may run into a complex representation of objects, with the need for a custom-based dissimilarity measure whose components are (sub)-dissimilarities tailored to the nature of the object at hand. Moreover, the starting space can be non-metric, and standard machine learning algorithms cannot operate directly due to the absence of a vector space endowed with a well-defined norm. The dissimilarity template can be a weighted Euclidean distance whose weights are learned by exploiting a metric learning paradigm. Often, in real-world applications, the adopted custom-based dissimilarity measure leads to non-Euclidean dissimilarity matrices. The non-Euclidean behavior can be suitably measured by studying the spectrum of the related Gram matrix. The adopted framework shows how the chosen (sub)-dissimilarity measure can affect the Euclidean behavior and how a weighting scheme can suitably address this phenomenon. The weighting scheme affects the spectrum of the Gram matrix of the underlying dissimilarity, but only in some simple cases can the problem be addressed theoretically. Alongside the present work, which is of a more theoretical nature, as future directions we plan to evaluate the impact of the non-metricity of the dissimilarity matrices in some real-world applications (e.g., predictive maintenance) and how a correction expressed directly in the objective function of an optimization system (in line with our theoretical discussion) impacts the performance of a classification system in terms of generalization capabilities.

ADDITIONAL INFORMATION AND DECLARATIONS
Funding
The authors received no funding for this work.