Three Similarity Measures between One-Dimensional Data Sets

Based on an interval distance, three functions are given in order to quantify similarities between one-dimensional data sets by using first-order statistics. The Glass Identification Database is used to illustrate how to analyse a data set prior to its classification and/or to exclude dimensions. Furthermore, a non-parametric hypothesis test is designed to show how these similarity measures, based on random samples from two populations, can be used to decide whether these populations are identical. Two comparative analyses are also carried out with a parametric test and a non-parametric test. This new non-parametric test performs reasonably well in comparison with classic tests.


Introduction
Today, in many tasks in which data sets are analysed, researchers strive to achieve some way of measuring the features of data sets, for instance, to distinguish between informative and non-informative dimensions. A first step could be to study whether several sets of data are similar. The similarity may be defined as a measure of correspondence between the data sets under study. That is, a function which, given two data sets X and Y, returns a real number that measures their similarity.
In data mining, there exist several similarity measures between data sets: for instance, in Parthasarathy & Ogihara (2000), a similarity is used which compares the data sets in terms of how they are correlated with the attributes in a database. A similar problem, studied in Burrell (2005), is the measurement of the relative inequality of productivity between two data sets using the Gini coefficient (González-Abril, Velasco, Gavilán & Sánchez-Reyes 2010). A similarity measure based on mutual information (Bach & Jordan 2003) is used to determine the similarity between images in Nielsen, Ghugre & Panigrahy (2004). Similarity between molecules is used in Sheridan, Feuston, Maiorov & Kearsley (2004) to predict the nearest molecule and/or the number of neighbours in the training set.
A common problem with the aforementioned similarity measures is that their underlying assumptions are often not explicitly stated. This study aims to use first-order statistics to explain the similarity between data sets. In this paper, similarity is established in the sense that two one-dimensional data sets are deemed similar simply by comparing the statistics of the variables in each data set.
In statistics, other similarity measures between data sets are also available (González, Velasco & Gasca 2005), for instance, those which are used in hypothesis testing. Accordingly, a non-parametric hypothesis test based on the proposed similarity is presented in this paper and a comparative analysis is carried out with several well-known hypothesis tests.
The remainder of the paper is arranged as follows: In Section 2, we introduce some notation and definitions. Sections 3 and 4 are devoted to giving two similarity measures between one-dimensional data sets. An example is presented in Section 5 to show their use. A non-parametric test is derived in Section 6 and experimental results are given to illustrate its behaviour and good features. Finally, some conclusions are drawn and future research is proposed.

Concepts
Following Lin (1998), with the purpose of providing a formal definition of the intuitive concept of similarity between two entities X and Y, the intuitions about similarity must be clarified. Thus: i) the similarity is related to their commonality, in that the more commonality they share, the more similar they are; ii) the similarity is related to the differences between them, in that the more differences they have, the less similar they are; and iii) the maximum similarity is reached when X and Y are identical.
Let us denote a similarity measure between X and Y by K(X, Y). Ideally, this function must satisfy the following properties: 1. Identity: K(X, Y) at its maximum corresponds to the fact that the two entities are identical in all respects; 2. Distinction: K(X, Y) = 0 corresponds to the fact that the two entities are distinct in all respects; and 3. Relative ordinality: if K(X, Y) > K(X, Z), then X is more similar to Y than it is to Z.
Hence, certain similarities are defined in this paper which are consistent with the above intuitions and properties.
Let us consider four one-dimensional data sets, DS_1, DS_2, DS_3 and DS_4 (see the Appendix), where the DS_1 and DS_2 data sets are taken from a N(1, 1) distribution, the DS_3 data set from a N(0.5, 1) distribution, and the DS_4 data set from a N(1.5, 1.25) distribution, where a N(µ, σ) distribution is a Normal distribution with mean µ and standard deviation σ. In practice, comparison of these data sets involves: a) plotting graphical summaries, such as histograms and boxplots, next to each other; b) simply comparing the means and variances (see Figure 1); or c) calculating correlation coefficients (if items of data are appropriately paired). These methods are straightforward to interpret and explain. Nevertheless, these approaches contain a major drawback since the interpretation is subjective and the similarities are not quantified.
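The informal comparison in b) can be sketched numerically. The snippet below draws samples from the three generating distributions of the Appendix and places their first-order statistics side by side, as in Figure 1; the sample size and seed are illustrative assumptions, not the actual Appendix data.

```python
import numpy as np

# Draw illustrative samples from the generating distributions of the
# Appendix data sets (sample size and seed are assumptions).
rng = np.random.default_rng(0)
ds1 = rng.normal(1.0, 1.0, 50)
ds2 = rng.normal(1.0, 1.0, 50)
ds3 = rng.normal(0.5, 1.0, 50)
ds4 = rng.normal(1.5, 1.25, 50)

# Side-by-side comparison of means and standard deviations.
stats_ = {name: (round(x.mean(), 2), round(x.std(ddof=1), 2))
          for name, x in [("DS1", ds1), ("DS2", ds2),
                          ("DS3", ds3), ("DS4", ds4)]}
```

As the text notes, reading such a table is subjective: the drawback this paper addresses is that nothing here quantifies *how* similar DS_1 and DS_2 are.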
Let us introduce the concept of interval distance. Given an open interval (similarly for another kind of interval) of finite length, there are two main ways to represent that interval: using the extreme points as (a, b) (classic notation) or as an open ball B_r(c) (Borelian notation), where c = (a + b)/2 (centre) and r = (b − a)/2 (radius). Using Borelian notation, the following distance between intervals given in González, Velasco, Angulo, Ortega & Ruiz (2004) is considered:

Definition 1. Let I_1 = (c_1 − r_1, c_1 + r_1) and I_2 = (c_2 − r_2, c_2 + r_2) be two real intervals. A distance between these intervals is defined as follows:

d_W(I_1, I_2) = \sqrt{(\Delta c, \Delta r)\, W\, (\Delta c, \Delta r)^t}    (1)

where ∆c = c_2 − c_1, ∆r = r_2 − r_1, and W is a symmetric, positive-definite 2 × 2 matrix, called the weight-matrix.
It is clear from matrix algebra that W can be written as W = P^t P, where P is a non-singular 2 × 2 matrix, and hence d_W(I_1, I_2) = ‖P (∆c, ∆r)^t‖, where ‖·‖ is the Euclidean norm in R², and therefore d_W(·, ·) is an ℓ₂-distance. It can be observed that, through the matrix W, the weight assigned to the position of the intervals, c, and to their size, r, can be controlled. Furthermore, the distance (1) provides more information on the intervals than does the Hausdorff distance (González et al. 2004).
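A minimal sketch of the interval distance of Definition 1, converting classic (a, b) notation into Borelian (centre, radius) coordinates and applying the quadratic form with a weight-matrix W (the identity by default):

```python
import numpy as np

def interval_distance(i1, i2, W=None):
    """Distance between two intervals in Borelian form:
    d_W(I1, I2) = sqrt((dc, dr) W (dc, dr)^t), with W a symmetric
    positive-definite 2x2 weight-matrix (identity by default)."""
    c1, r1 = (i1[0] + i1[1]) / 2, (i1[1] - i1[0]) / 2
    c2, r2 = (i2[0] + i2[1]) / 2, (i2[1] - i2[0]) / 2
    v = np.array([c2 - c1, r2 - r1])
    W = np.eye(2) if W is None else np.asarray(W, dtype=float)
    return float(np.sqrt(v @ W @ v))

# With W = I the distance is the Euclidean norm of (dc, dr):
# I1 = (0, 2) has centre 1, radius 1; I2 = (1, 5) has centre 3, radius 2,
# so dc = 2, dr = 1 and the distance is sqrt(5).
d = interval_distance((0.0, 2.0), (1.0, 5.0))
```

Through W one can weight the positional difference dc against the size difference dr, as the text observes.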
From the distance given in (1), three new similarity measures are defined in this paper.

A First Similarity
Definition 2. Given a data set X = {x_1, . . ., x_n} and a parameter λ > 1, the λ-associated interval of X, denoted by I_X^λ, is defined as follows:

I_X^λ = (\bar{X} − λ S_X, \bar{X} + λ S_X)

where \bar{X} and S_X are the mean and the standard deviation of X, respectively.
It is worth noting that Chebyshev's inequality states that at least a (1 − 1/λ²) proportion of the observations x_i lie in the interval I_X^λ. Hence, the similarity between two data sets X and Y can be quantified from the distance between the intervals I_X^λ and I_Y^λ. However, it is possible that some instances z ∈ X ∪ Y exist such that z ∉ I_X^λ ∪ I_Y^λ. Thus, a penalizing factor (the proportion of instances within I_X^λ and I_Y^λ) is taken into account in the following similarity measure.
Definition 3. Given two data sets X = {x_1, . . ., x_n}, Y = {y_1, . . ., y_m} and a parameter λ > 1, a similarity measure between X and Y, denoted by K_W^λ(X, Y), is defined as follows:

K_W^λ(X, Y) = \frac{\#\{z \in X \cup Y : z \in I_X^λ \cup I_Y^λ\}}{n + m}\; e^{-d_W^2(I_X^λ, I_Y^λ)}

where #A denotes the cardinality of the set A.
The function defined, K_W^λ, is a similarity measure (Cristianini & Shawe-Taylor 2000) which has been proposed based on distance measurements in Lee, Kim & Lee (1993) and Rada, Mili, Bicknell & Blettner (1989). Furthermore, for any λ and W, K_W^λ is a positive, symmetrical function since it is a radial basis function (Schölkopf & Smola 2002).
Thus, the K_W^λ similarity takes into account the following characteristics: i) the position of the whole data set on the real line, given by the mean; ii) the spread of the data set around its mean, given by the standard deviation multiplied by a parameter λ > 1; iii) the weighted importance of the mean and the standard deviation of each data set, given by the weight-matrix W; and iv) a factor which quantifies, from the number of outlying values, the goodness of fit of the associated intervals.
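The characteristics above can be sketched in code. This is a sketch under stated assumptions: the penalizing factor is taken as the proportion of instances of X ∪ Y lying inside the associated intervals, and the distance enters through a radial-basis term e^{−d²} with W = I; the paper's exact closed form is not reproduced here.

```python
import numpy as np

def associated_interval(x, lam=2.0):
    # lambda-associated interval (mean - lam*S, mean + lam*S)
    m, s = np.mean(x), np.std(x, ddof=1)
    return m - lam * s, m + lam * s

def k_similarity(x, y, lam=2.0):
    """Sketch of the K_W^lambda similarity with W = I.  The penalizing
    factor (share of instances inside the associated intervals) follows
    the text; the radial-basis form exp(-d^2) is an assumption."""
    ax, bx = associated_interval(x, lam)
    ay, by = associated_interval(y, lam)
    # interval distance with W = I, in (centre, radius) coordinates
    dc = (ay + by) / 2 - (ax + bx) / 2
    dr = (by - ay) / 2 - (bx - ax) / 2
    d2 = dc ** 2 + dr ** 2
    z = np.concatenate([x, y])
    inside = np.mean(((z > ax) & (z < bx)) | ((z > ay) & (z < by)))
    return inside * np.exp(-d2)
```

For identical data sets the distance term is 1 and only the Chebyshev-style penalizing factor remains; shifting one data set drives the similarity towards 0.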
Example 1. For the data sets of the Appendix, λ = 2 and W = I (the 2 × 2 identity matrix), the similarity K_I^{λ=2} is given in Table 4. It can be seen that the similarities obtained are consistent with the distributions generating the data sets.
After having experimented with different choices of λ and W, it is observed that the numerical results differ slightly but the conclusions on the similarities remain the same.

A Second Similarity
When the size of the data set is large, consideration of only the number of outlying values, the mean and the standard deviation is grossly insufficient to obtain meaningful results. Furthermore, it is clear that these features are not likely to be very helpful outside a normal distribution family (the mean and variance are highly sensitive to heavy tails and outliers, and are unlikely to provide good measures of location, scale or goodness-of-fit in their presence). Hence more characteristics which summarize the information of each data set must be taken into account.
In this framework, the percentiles of the data set are used. Let Q_X = {p_{1X}, . . ., p_{qX}} be a set of q percentiles of a data set X, with p_{iX} ≤ p_{(i+1)X} and q ≥ 2. Hence, q − 1 intervals, denoted by I_{iX} = (p_{iX}, p_{(i+1)X}) for i = 1, . . ., q − 1, are considered.

Example 2. Given the DS_1 data set, an example of the Q_{DS_1} set is given by the percentiles 2.5, 25, 50, 75, and 97.5, whereby q = 5.

Definition 4. Given a weight-matrix W and two sets of q percentiles, Q_X and Q_Y, of the data sets X and Y, respectively, a similarity between X and Y, denoted by K_W^Q(X, Y), is defined as follows:

K_W^Q(X, Y) = \frac{1}{q - 1} \sum_{i=1}^{q-1} e^{-d_W^2(I_{iX}, I_{iY})}

The K_W^Q similarity has the following properties: i) the function is positive and symmetrical; ii) the similarity is low if the percentiles are far from each other; and iii) it is a radial basis function.
In Table 1, several examples of d_W(I_{iX}, I_{iY}) can be seen whereby the symmetric weight-matrix W = (w_{ij}), i, j = 1, 2, is varied. In cases 1 and 2, W is a singular matrix (det(W) = 0) and therefore this situation is inadequate. In case 3, W = I is the identity matrix, and case 4 provides a straightforward weight-matrix which presents the cross product between the percentiles.
On the other hand, there are many different ways to choose the Q_X set for a fixed data set X; in this paper the discretization process based on equal-frequency intervals (Chiu, Wong & Cheung 1991) is used. Furthermore, in order to obtain a specific value of q, there are several selections based on experience, such as Sturges' formula, q_1 = Int[1 + log_2 n], where the operator Int[·] is the integer part and n is the size of the data set. Henceforth, q ≡ q_1 is considered with n = max{#X, #Y}, and the set of percentiles Q is obtained such that each interval I_{i•} contains the same quantity of items of data.
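The percentile-based construction can be sketched as follows. Sturges' value of q and the equal-frequency percentiles follow the text; the averaged radial-basis combination of the interval distances in K_W^Q is an assumption (the closed form is stated in Definition 4 of the original but not reproduced in full here), and W = I is taken for simplicity.

```python
import numpy as np

def sturges_q(n):
    # Sturges' formula: q1 = Int[1 + log2(n)]
    return int(1 + np.log2(n))

def percentile_intervals(x, q):
    # q equal-frequency percentiles -> q - 1 intervals (p_i, p_{i+1}),
    # each holding the same quantity of items of data
    p = np.percentile(x, np.linspace(0, 100, q))
    return list(zip(p[:-1], p[1:]))

def kq_similarity(x, y):
    """Sketch of K_W^Q with W = I: interval distances between matching
    percentile intervals, combined through an (assumed) averaged
    radial-basis term."""
    q = sturges_q(max(len(x), len(y)))
    total = 0.0
    for (a1, b1), (a2, b2) in zip(percentile_intervals(x, q),
                                  percentile_intervals(y, q)):
        dc = (a2 + b2) / 2 - (a1 + b1) / 2   # centre difference
        dr = (b2 - a2) / 2 - (b1 - a1) / 2   # radius difference
        total += np.exp(-(dc ** 2 + dr ** 2))
    return total / (q - 1)
```

Identical data sets give similarity 1, and the similarity decays as the matching percentile intervals drift apart, in line with property ii) above.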
In the following section, an example is presented to show how these similarities could be used.

The Glass Identification
The Glass Identification Database is obtained from the UC Irvine Machine Learning Repository (Bache & Lichman 2013). This database is often used to compare the performance of different classifiers. Its main properties are: 214 instances, 9 continuous attributes and 1 class attribute with 6 labels. The number of instances in each class is 70, 76, 17, 13, 9 and 29, respectively. Suppose that a preliminary analysis of this data set is desired before applying a classifier. Firstly, K_W^Q similarities between continuous attributes are given in Table 2 for W = Id. It is observed that attributes 1 and 4 are very similar to each other; and attributes 6, 8 and 9 are also very similar, particularly attributes 8 and 9. Hence, it may be a good idea to eliminate some attributes before the implementation of the classifier, for instance attributes 4, 6 and 8.
Let us study the attributes to determine similarities between the same attributes but with different labels. Hence, if the similarity obtained is low, then the classification is straightforward.
The number of instances with label 1 is 70, and with label 2 it is 76; K_W^Q similarities between the nine attributes are given in Table 3. It can be seen that these values are very high, which indicates that the discrimination between these two labels is not easy.
On the other hand, the number of instances with label 3 is 17, and with label 4 it is 13; K_W^λ similarities between the nine attributes are given in Table 3 for λ = 2. Hence, attributes 3 and 7 are the best in order to separate labels 3 and 4. However, the main problem with respect to labels 3 and 4 is that there are very few instances. The main conclusion of this brief preliminary analysis is that the classes of the Glass Identification Database are difficult to separate based only on individual features for the given instances. A good classifier is therefore necessary in order to obtain acceptable accuracy for this classification problem.
An experiment is carried out to show that the conclusions of this brief analysis are correct. Thus, the algorithm considered is the standard 1-v-r SVM formulation (Vapnik 1998), by following the recommendation given in Salazar, Vélez & Salazar (2012), and its performance (in the form of accuracy rate) has been evaluated using the Gaussian kernel, k(x, y) = e^{−‖x−y‖²/(2σ²)}, where two hyperparameters must be set: the regularization term C and the width of the kernel σ. This space is explored on a two-dimensional grid whose values include 2¹, 2², 2³, 2⁴ and 2⁵ for C, with an analogous grid for σ². The criterion used to estimate the generalized accuracy is a ten-fold cross-validation on the whole training data. This procedure is repeated 10 times in order to ensure good statistical behaviour. The optimization algorithm used is the exact quadratic program-solver provided by Matlab software.
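The grid-search procedure can be sketched with scikit-learn as a stand-in for the Matlab solver. The data, labels, and the σ² grid below are illustrative assumptions (the Glass attributes are not bundled here), and note that sklearn's SVC parametrizes the Gaussian kernel by gamma = 1/(2σ²).

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (shapes only loosely mimic the Glass set).
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 9))
y = np.arange(60) % 2          # two balanced classes, 30/30

# Powers-of-two grid for C; the sigma^2 grid is an assumption,
# expressed through gamma = 1 / (2 * sigma^2).
grid = {"C": [2.0 ** k for k in range(1, 6)],
        "gamma": [1.0 / (2.0 * 2.0 ** k) for k in range(-2, 3)]}

# Ten-fold cross-validation over the grid, as in the text.
search = GridSearchCV(SVC(kernel="rbf"), grid, cv=10)
search.fit(X, y)
```

The pair (C, σ²) with the best cross-validation mean rate is then read off `search.best_params_`, mirroring the selection described above.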
The best cross-validation mean rate among the several pairs (C, σ²) is obtained for C = 1 and σ² = 1, with a 70.95% accuracy rate, when all attributes are used; when attributes 4, 6 and 8 are eliminated, the best cross-validation mean rate is obtained for C = 16 and σ² = 1, with a 68.38% accuracy rate. This experiment indicates that the Glass Identification Database is difficult to separate and that the elimination of attributes 4, 6 and 8 only slightly modifies the accuracy rates.
In the following section, a new hypothesis test is designed and is compared with other similar hypothesis tests.

Hypothesis Testing
Definition 5. Let X = {x_1, . . ., x_n} and Y = {y_1, . . ., y_m} be two data sets. Two further data sets X_c and Y_c, called the quasi-typified data sets of X and Y, respectively, are defined as follows:

X_c = {(x_i − \bar{Z})/S_Z : x_i ∈ X}   and   Y_c = {(y_j − \bar{Z})/S_Z : y_j ∈ Y}

where Z = {x_1, . . ., x_n, y_1, . . ., y_m} and \bar{Z} and S_Z are the mean and the standard deviation of Z. This process is called quasi-typification.
It is straightforward to prove that the pooled set of the quasi-typified data has zero mean and unit standard deviation. From Definition 5, a third similarity measure between data sets is given as follows:

Definition 6. Let X and Y be two data sets. A measure of similarity between these sets is defined as KC_W^Q(X, Y) = K_W^Q(X_c, Y_c), provided that X_c and Y_c are the quasi-typified data sets of X and Y, W is a weight-matrix, and Q_X and Q_Y are the sets of q percentiles of the data sets X and Y, respectively.
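A sketch of the quasi-typification, assuming, consistently with Definition 5 (whose formula is not reproduced in full in this extraction), that each sample is centred and scaled with the mean and standard deviation of the pooled set Z:

```python
import numpy as np

def quasi_typify(x, y):
    """Quasi-typified data sets of X and Y: both samples are centred
    and scaled with the mean and standard deviation of the pooled set
    Z = X u Y (an assumption consistent with Definition 5)."""
    z = np.concatenate([x, y])
    m, s = np.mean(z), np.std(z, ddof=1)
    return (np.asarray(x) - m) / s, (np.asarray(y) - m) / s
```

By construction, the pooled quasi-typified data has zero mean and unit standard deviation, which is what makes KC_W^Q comparable across pairs of data sets on different scales.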
Example 3. KC_I^Q similarities between the data sets DS_1, DS_2, DS_3 and DS_4 are given in Table 4. In Figure 2, each subplot depicts the boxplots of the data sets DS_i, DS_j, T_i and T_j, where the T_i's are the quasi-typified data sets of DS_i and DS_j for i, j = 1, 2, 3, 4 and i ≠ j. It is worth noting that all three similarities agree that the similarity between DS_1 and DS_2 is the highest and the similarity between DS_3 and DS_4 is the lowest. Thus, the similarities obtained are consistent with the distributions that generate the data sets.
Several percentiles are obtained from the simulated distribution of KC_I^Q similarities between random samples of size 100 from two N(0, 1) distributions. The results are shown in Table 5. It is important to point out that the thresholds have been simulated 1,000,000 times and it is observed that the sensitivity of the thresholds is very low (less than 10⁻⁵ units). Hence, it is now possible to use these percentiles as critical values: let {(X, Y) : KC_I^Q(X, Y) < P(n*, α)} be the critical region of size α, where P(n*, α) is the percentile α of the simulated distribution of KC_I^Q between two N(0, 1) distributions for n* = min(n, m). Henceforth, this test is denoted as the GA-test.
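The Monte Carlo construction of the thresholds P(n*, α) can be sketched as follows. The statistic is passed in as a function (a stand-in for KC_I^Q, which is not reproduced here), and the number of repetitions is kept far below the 1,000,000 used in the paper.

```python
import numpy as np

def simulate_threshold(stat, n, alpha=0.05, reps=10000, seed=0):
    """Monte Carlo percentile P(n, alpha) of a similarity statistic
    between two N(0, 1) samples of size n.  'stat' is any function of
    two samples (here a stand-in for KC^Q_I)."""
    rng = np.random.default_rng(seed)
    vals = [stat(rng.standard_normal(n), rng.standard_normal(n))
            for _ in range(reps)]
    return float(np.percentile(vals, 100 * alpha))

# Critical region of size alpha: reject H0 when stat(X, Y) < threshold.
```

With the true KC_I^Q plugged in for `stat`, the returned percentile plays the role of P(n*, α) in the GA-test's critical region.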
Note 1. It is worth noting that this test is valid for normal or similar populations. If another type of population is given, then the corresponding percentiles should be calculated.

Comparison with a Parametric Test
Let the following test be: H_0: F = F′ versus H_1: F ≠ F′, where F = N(µ_1, σ_1), F′ = N(µ_2, σ_2), and µ_1, µ_2, σ_1 and σ_2 are unknown parameters. In this case, the null hypothesis states that the two normal populations have both identical means and identical variances.
Let X = {X_1, . . ., X_100} and Y = {Y_1, . . ., Y_100} be two random samples from N(µ_1, σ_1) and N(µ_2, σ_2) distributions, respectively. A classic test (C-test) is considered, which is the union of two tests. Firstly, a test is performed to determine whether two samples from a normal distribution have the same mean when the variances are unknown but assumed equal. The critical region of size 0.025 is RC_1 = {(X, Y) : |T| > 2.258}, where T is the pooled two-sample Student's t statistic and 2.258 is the percentile 0.9875 of Student's t distribution with 198 degrees of freedom. Another test is also performed to determine whether two samples from a normal distribution have the same variance. The critical region of size 0.025 is RC_2 = {(X, Y) : F < 0.6353 or F > 1.5740}, where F = S_X²/S_Y² and 0.6353 and 1.5740 are the percentiles 0.0125 and 0.9875 of Snedecor's F distribution, both with 99 degrees of freedom.
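The two-part C-test can be sketched with scipy; p-values replace the fixed critical values above (an equivalent formulation), with each component test run at individual size 0.025:

```python
import numpy as np
from scipy import stats

def c_test(x, y, level=0.025):
    """Union of two classic tests, each of individual size 'level':
    a pooled two-sample t-test for equal means and a two-sided F-test
    for equal variances.  Returns True when H0 (identical normal
    populations) is rejected by either component."""
    n, m = len(x), len(y)
    # Two-sample t-test with equal variances assumed (198 df for n=m=100).
    t_stat, t_p = stats.ttest_ind(x, y, equal_var=True)
    # Two-sided F-test on the ratio of sample variances, F(n-1, m-1).
    f_stat = np.var(x, ddof=1) / np.var(y, ddof=1)
    f_p = 2 * min(stats.f.cdf(f_stat, n - 1, m - 1),
                  stats.f.sf(f_stat, n - 1, m - 1))
    return (t_p < level) or (f_p < level)
```

Rejecting when either component rejects is what gives the union RC_1 ∪ RC_2 its overall size of roughly 1 − 0.975² = 0.0494.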
In this framework, a comparison is made between the classic test for normal populations, whose critical region of size 0.0494 (= 1 − 0.975²) is RC = RC_1 ∪ RC_2, and the GA-test, whose critical region of size 0.05 is {(X, Y) : KC(X, Y) < 0.70166} (see Table 5). For this comparison, it is considered that one population is N(20, 4) and the other population is N(µ, σ), and the hypothesis test is carried out for 100,000 simulations for each value µ = 18, 19, 20, 21, 22 and σ = 3, 3.5, 4, 4.5, 5. The results of the experiment are given in Table 6, where the percentage of acceptance of the null hypothesis is shown for the two tests. The best result for each value of the parameters is printed in bold, that is, the minimum of the two values, except for the case σ = 4 and µ = 20, in which the null hypothesis is true and hence the maximum of the two values is printed in bold.
The first noteworthy conclusion is that there are no major differences between the two methods and therefore the results of the GA-test are good. As expected, the results are almost symmetrical for equidistant values from the true mean and variance. When only one of the two parameters is at its actual value, the classic test behaves better in general, possibly due to the fact that the classic test is sequential and the other is simultaneous. However, when both parameters only slightly differ from the actual values, then the GA-test performs better. The same holds true for values of the mean that differ from the actual value and for great differences in the variance.

Comparison with a Non-Parametric Test
In this section, the GA-test is used with non-normal distributions that remain similar to a normal distribution. At this point, the interest lies in testing H_0: F = F′ versus H_1: F ≠ F′ for a number of populations F and F′. The GA-test is compared against the Kolmogorov-Smirnov test. In both cases the desired level of significance is 0.05, and the hypothesis test is carried out for 10,000 simulations where the populations are Bi(100, 0.2) (Binomial), Po(20) (Poisson) and N(20, 4) (Normal). Figure 3 shows that these distributions are very similar; the size of the random samples is 100.
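Only the Kolmogorov-Smirnov side of the comparison is sketched below, for samples of size 100 from the three populations above (the GA-test statistic is not reproduced here, and the seed is an illustrative assumption):

```python
import numpy as np
from scipy import stats

# Samples of size 100 from the three populations compared in the text.
rng = np.random.default_rng(0)
samples = {
    "binomial": rng.binomial(100, 0.2, 100),   # Bi(100, 0.2)
    "poisson": rng.poisson(20, 100),           # Po(20)
    "normal": rng.normal(20, 4, 100),          # N(20, 4)
}

# Two-sample Kolmogorov-Smirnov test; accept H0 at the 5% level when
# the p-value exceeds 0.05.
p = stats.ks_2samp(samples["binomial"], samples["poisson"]).pvalue
```

Repeating this over many simulated pairs and tabulating the acceptance percentages reproduces the Kolmogorov-Smirnov columns of Tables 7 and 8.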
The results of the experiment are given in Table 7 in the form of the percentage of acceptance of the null hypothesis. Again, the best result in each case is printed in bold, that is, the minimum of the two values when the null hypothesis is false (values outside the diagonal) and the maximum of the two values when the null hypothesis is true (values on the diagonal). It can be seen that the GA-test can differentiate between the Poisson distribution and the other two better than can the Kolmogorov-Smirnov test. Nevertheless, the Kolmogorov-Smirnov test behaves better than the GA-test in Binomial and Poisson populations under the null hypothesis (the opposite is true for the normal distribution) and when distinguishing between the normal and the binomial distributions.

A final comparison is carried out with Student's t distributions with several degrees of freedom, since these distributions are similar to a standard normal distribution. The desired level of significance is 0.05, the size of the random samples is 100 and the hypothesis test is carried out for 10,000 simulations. The results of the experiment are given in Table 8; again, the best result in each case is printed in bold. It is important to point out that the GA-test tends to provide smaller values and therefore tends to accept the null hypothesis less frequently than does the classic test (the classic test therefore tends to be more conservative). As a consequence, the GA-test behaves better when the null hypothesis is false (values outside the diagonal), by differentiating between Student's t distributions with different degrees of freedom better than does the Kolmogorov-Smirnov test, and worse (but not much worse) when the null hypothesis is true (values on the diagonal); that is, the Kolmogorov-Smirnov test behaves slightly better under the null hypothesis.

Conclusions and Future Work
Several similarity measures between one-dimensional data sets have been developed which can be employed to compare data sets, and a new hypothesis test has been designed. Two comparisons of this test with other classic tests have been made under the null hypothesis that two populations are identical. The main conclusion is that the new test performs reasonably well in comparison with the classic tests considered and, in certain circumstances, performs even better than said classic tests.
With the distance developed in this paper, various classifications of a data set can be carried out, either by applying neural network techniques, SVMs, or other available procedures.
Although there are other approaches to the choice of the set Q of percentiles for the K_W^Q function from a data set X, such as, for example, the equal-width interval (Chiu et al. 1991), k-means clustering (Hartigan 1975), cumulative roots of frequency (González & Gavilan 2001), Ameva (González-Abril, Cuberos, Velasco & Ortega 2009), and the maximum entropy marginal approach (Wong & Chiu 1987), these have not been considered in this paper and will be studied in future papers.
Only the one-dimensional setting is considered in this paper; the possible correlations that can exist between features of multi-dimensional data sets lie outside the scope of this paper and will constitute the focus of study in future work.
Another potential line of research involves the improvement of the design of our hypothesis-testing procedures by using these similarity measures, and the execution of comparisons with other existing methods. For example, the chi-squared test on quantiled bins or the Wald-Wolfowitz runs test can be tested under the null hypothesis that the two samples come from identical distributions.

Appendix. Data Sets of Section 2
The DS_1 and DS_2 data sets are taken from a N(1, 1) distribution, the DS_3 data set from a N(0.5, 1) distribution, and the DS_4 data set from a N(1.5, 1.25) distribution.

Figure 1: Histograms, boxplots, means, and variances of the data sets of the Appendix.

Figure 2: Boxplots of each pair of data sets of the Appendix before (DS_i data sets) and after (T_i data sets) applying the quasi-typification.

Figure 3: Graphical representation of the probability mass functions of the Bi(100, 0.2) and Po(20) distributions, and the probability density function of the N(20, 4) distribution.

Table 1: Distance between intervals for different weight-matrices W.

Table 2: K_W^Q similarities between continuous attributes of the Glass data set.

Table 3: K_W^λ and K_W^Q similarities between different labels of the Glass data set.

Table 4: K_Id^{λ=2}, K_Id^Q and KC_Id^Q similarities between the data sets in the Appendix.

Table 5: Percentiles of the simulated distribution of KC_I^Q between two N(0, 1) distributions for n = 100.
Definition 7. Let X = {X_1, . . ., X_n} and Y = {Y_1, . . ., Y_m} be two random samples from populations F and F′, and let the hypothesis test be H_0: F = F′ versus H_1: F ≠ F′.

Table 7: Acceptance percentage of the null hypothesis in the comparison between the Kolmogorov-Smirnov test and the GA-test for various populations. The desired level of significance is 5% and the best result in each case is printed in bold.

Table 8: Acceptance percentage of the null hypothesis in the comparison between the Kolmogorov-Smirnov test and the GA-test for Student's t distributions. The desired level of significance is 5% and the best result in each case is printed in bold.