Elsevier

Neurocomputing

Volume 72, Issues 7–9, March 2009, Pages 1379-1389

Nearly homogeneous multi-partitioning with a deterministic generator

https://doi.org/10.1016/j.neucom.2008.12.024

Abstract

The need for homogeneous partitions, in which all parts have the same distribution, is ubiquitous in machine learning and in other fields of scientific study, especially when only a few partitions can be generated. In that case, validation sets need to be distributed the same way as training sets to obtain good estimates of model complexity. Likewise, when standard data analysis tools cannot handle very large data sets, the analysis can be performed on a smaller subset, provided that this subset is sufficiently homogeneous with the larger set to yield relevant results. However, pseudo-random generators may produce partitions whose parts have very different distributions, because the geometry of the data is ignored. In this work, we propose an algorithm which deterministically generates partitions whose parts have, on average, empirically greater homogeneity than parts arising from pseudo-random partitions. The data to partition are seriated using a nearest neighbor rule and assigned to a part of the partition according to their rank in this seriation. We demonstrate the efficiency of this algorithm on toy and real data sets. Since the algorithm is deterministic, it also provides a way to make reproducible machine learning experiments that are usually based on pseudo-random partitions.

Introduction

In machine learning and in many domains of scientific study, data sets need to be partitioned to obtain reliable estimates of some statistics, either through training and validation sets or through patient and control groups [12]; to save computation time by studying a sufficiently small subset of the entire set [7]; or to carry out opinion polls. In all these cases, it may happen that averaging the estimates obtained over many different partitions of the same data set is not possible, either because it would exceed some time or money budget (getting good estimates from huge data sets with some machine learning model, acquiring the labels of all the data from human experts, performing a census, etc.) or because it is not physically feasible (giving a drug and a placebo to the same patient under different partitions). If only one partition can be used, it should therefore be homogeneous, i.e. all its parts should have the same distribution, in order to reduce the bias of the estimated quantities. Usually, practitioners either rely on simple random draws with equal probability for each datum to generate such a partition, or they attempt to control the odds of getting a partition with higher homogeneity.

Stratification [4] is a common way to bias randomness in favor of homogeneous partitions. Strata are defined by hand according to some variable, or automatically using vector quantization (e.g. K-means); data are then selected at random inside each stratum to build each part of the partition. However, there is no general rule for obtaining a good stratification.
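
As an illustration, the following is a minimal sketch of a stratified bipartition using K-means strata; the function names, the number of strata, and the use of NumPy and scikit-learn are assumptions made for the example, not part of the methods discussed in this paper.

```python
# A minimal sketch of stratified bipartitioning with K-means strata.
# Names and parameter choices are illustrative only.
import numpy as np
from sklearn.cluster import KMeans

def stratified_bipartition(X, n_strata=10, seed=0):
    """Split X into two parts by sampling half of each K-means stratum."""
    rng = np.random.default_rng(seed)
    strata = KMeans(n_clusters=n_strata, random_state=seed).fit_predict(X)
    part1 = []
    for s in range(n_strata):
        idx = np.flatnonzero(strata == s)
        rng.shuffle(idx)
        part1.extend(idx[: len(idx) // 2])   # half of the stratum goes to part 1
    part1 = np.array(sorted(part1))
    part2 = np.setdiff1d(np.arange(len(X)), part1)   # remaining data form part 2
    return part1, part2

# Example: bipartition 1000 points drawn from a 2-D Gaussian.
X = np.random.default_rng(0).normal(size=(1000, 2))
p1, p2 = stratified_bipartition(X)
```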

Another approach, recommended in a machine learning context [6, p. 135], is similar to the accept–reject sampling method: draw several random partitions and keep the one which minimizes the Kullback–Leibler (KL) divergence between the density functions estimated on each part.¹ Measuring homogeneity can also be carried out using two-sample tests [8]. However, running such tests or building such density estimates is time consuming (up to $O(N^3)$). The accept–reject approach may therefore need many runs, and hence much time, with no assurance of obtaining partitions with high homogeneity.
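
The accept–reject strategy can be sketched as follows; the histogram-based symmetric KL estimate and the number of candidate partitions are illustrative assumptions, not the density estimators used in [6].

```python
# A minimal sketch of the accept-reject strategy: draw several random
# bipartitions and keep the one whose parts are closest in distribution,
# using a crude per-dimension histogram estimate of the densities.
import numpy as np

def sym_kl_hist(a, b, bins=20, eps=1e-12):
    """Symmetric KL divergence between histogram estimates of two 1-D samples."""
    lo, hi = min(a.min(), b.min()), max(a.max(), b.max())
    p, _ = np.histogram(a, bins=bins, range=(lo, hi), density=True)
    q, _ = np.histogram(b, bins=bins, range=(lo, hi), density=True)
    p, q = p + eps, q + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p)))

def accept_reject_bipartition(X, n_candidates=50, seed=0):
    rng = np.random.default_rng(seed)
    N = len(X)
    best, best_score = None, np.inf
    for _ in range(n_candidates):
        perm = rng.permutation(N)
        p1, p2 = perm[: N // 2], perm[N // 2 :]
        # average the per-dimension divergences between the two parts
        score = np.mean([sym_kl_hist(X[p1, d], X[p2, d]) for d in range(X.shape[1])])
        if score < best_score:
            best, best_score = (p1, p2), score
    return best
```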

Another way to obtain homogeneous partitions is to pose the partitioning problem as an optimization problem and to solve it with some optimization method. For example, in a simple local search setting, a first partition is generated at random, and the elements of different parts are then permuted so as to maximize the homogeneity of the partition. For a partition of N data into two parts of equal size, there are $(N/2)^2$ possible permutations from the initial state (a binary vector with N bits assigning each datum to one of the two parts), and the diameter of the search space is N/2 (no more than N/2 permutations are needed to pass from one state to any other). Reaching a local optimum of the search space therefore takes $N^2$ times the complexity of the homogeneity testing function. However, the cost of testing homogeneity is usually high (up to $O(N^3)$ if graph-based two-sample tests are used [8]), leading to an $O(N^5)$ suboptimal algorithm. Moreover, the size of the space to explore is huge: $\binom{N}{N/2} = \frac{N!}{(N/2)!\,(N/2)!}$ ways of taking N/2 elements out of a set of N, so there is a high probability of getting stuck in a local optimum, without knowing how far it is from the global one.
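
A minimal sketch of such a local search is given below; the surrogate homogeneity score (distance between the part means) is an illustrative stand-in for the expensive two-sample statistics discussed above, and the names are hypothetical.

```python
# A minimal sketch of local search over bipartitions: start from a random
# bipartition and greedily swap elements between the parts while a simple
# surrogate homogeneity score improves.
import numpy as np

def score(X, mask):
    """Lower is better: distance between the mean vectors of the two parts."""
    return float(np.linalg.norm(X[mask].mean(axis=0) - X[~mask].mean(axis=0)))

def local_search_bipartition(X, n_iter=1000, seed=0):
    rng = np.random.default_rng(seed)
    N = len(X)
    mask = np.zeros(N, dtype=bool)
    mask[rng.choice(N, N // 2, replace=False)] = True   # random initial bipartition
    best = score(X, mask)
    for _ in range(n_iter):
        i = rng.choice(np.flatnonzero(mask))     # one element of part 1
        j = rng.choice(np.flatnonzero(~mask))    # one element of part 2
        mask[i], mask[j] = False, True           # tentative swap
        new = score(X, mask)
        if new < best:
            best = new                           # keep the improving swap
        else:
            mask[i], mask[j] = True, False       # undo the swap
    return mask
```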

Another example is the matched random sampling approach devised in [3], [12]. The graph non-bipartite matching (NBM) algorithm of Greevy et al. [12] is designed for the optimal pairing of treatment and control patients in a medical context [17]. It is based on pairing all the data so as to minimize the sum of the distances within the pairs [10]. This problem is encoded as a weighted graph matching problem, where pairs of vertices (the data) have to minimize the total weight of the edges connecting them. It can provide partitions into two parts of equal size by randomly assigning each datum of a pair to a different part. The NBM solves a specific instance of the general maximum weight graph matching problem [15], and algorithms designed to solve it could be used to find partitions with more than two parts. The complexity of optimal algorithms scales as $O(|E|\,|V|^{1/2})$ [15], where $|E|$ and $|V| = N$ are, respectively, the number of edges and the number of vertices of the graph. However, if the complete graph of the data is used, then $|E| = N(N-1)/2$ and the complexity scales as $O(N^{5/2})$, while if a more reasonable proximity graph is built, such as the K-nearest neighbor graph (KNNG) or the Gabriel graph (GG) [9], the number of edges scales as $O(N)$ but the graph building process itself scales as $O(N^2)$ (KNNG) or $O(N^3)$ (GG). Finally, the objective function optimized in these approaches has not been clearly related to an overall homogeneity measure of the partition.
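
A sketch of the matched random sampling idea follows, assuming NetworkX's maximum weight matching is run on negated distances to obtain a minimum-distance pairing; the function names are illustrative, and this is not Greevy et al.'s implementation.

```python
# A minimal sketch of matched random sampling: pair points so as to minimize
# the within-pair distances (via maximum weight matching on negated distances),
# then send the two members of each pair to different parts at random.
import numpy as np
import networkx as nx

def matched_bipartition(X, seed=0):
    rng = np.random.default_rng(seed)
    N = len(X)                       # assumed even
    G = nx.Graph()
    for i in range(N):
        for j in range(i + 1, N):
            G.add_edge(i, j, weight=-float(np.linalg.norm(X[i] - X[j])))
    pairs = nx.max_weight_matching(G, maxcardinality=True)   # perfect matching
    part1, part2 = [], []
    for i, j in pairs:
        if rng.random() < 0.5:       # each member of a pair goes to a different part
            i, j = j, i
        part1.append(i)
        part2.append(j)
    return np.array(part1), np.array(part2)
```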

Our work follows yet another way: it attempts to build a homogeneous partition incrementally. We propose a deterministic algorithm designed to generate nearly homogeneous partitions into two or more parts, possibly of unequal size. This algorithm is a heuristic which experimentally increases the homogeneity between the parts of the partition it generates. It is based on seriating the data and then assigning them to the parts of the partition according to their rank in the seriation.
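
The sketch below only conveys this general idea under simple assumptions (a greedy nearest-neighbor seriation and a round-robin assignment of ranks); the exact HMP algorithm is specified in Section 3 and may differ in its seriation rule and assignment scheme.

```python
# A minimal sketch of "seriate, then assign by rank"; illustrative only.
import numpy as np

def nn_seriation(X):
    """Order the data by repeatedly jumping to the nearest unvisited point."""
    N = len(X)
    visited = np.zeros(N, dtype=bool)
    order = [0]                        # deterministic start on the first datum
    visited[0] = True
    for _ in range(N - 1):
        d = np.linalg.norm(X - X[order[-1]], axis=1)
        d[visited] = np.inf            # exclude already seriated data
        nxt = int(np.argmin(d))
        order.append(nxt)
        visited[nxt] = True
    return np.array(order)

def rank_based_partition(X, n_parts=2):
    """Assign seriated data to parts in a round-robin fashion over their ranks."""
    order = nn_seriation(X)
    return [order[k::n_parts] for k in range(n_parts)]

# Example: two nearly homogeneous parts of a 2-D Gaussian sample.
X = np.random.default_rng(0).normal(size=(500, 2))
P1, P2 = rank_based_partition(X, n_parts=2)
```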

In the next section, we define a new divergence measure between continuous pdfs. We then show that minimizing this divergence and minimizing a Hausdorff-like distance between finite samples drawn from these pdfs are equivalent in the limit of an infinite sample size, at least for univariate densities and the Euclidean metric. In Section 3, we propose a heuristic to minimize this distance measure between finite samples and study its properties. Finally, in Section 4, we demonstrate its efficiency in producing homogeneous partitions on multivariate artificial and real data sets.

Section snippets

A new divergence measure

Here we propose a new symmetric divergence measure between two pdfs. There exist several divergence measures [11] intended to measure how far two pdfs are from each other. Given pdfs p and q with the same support X, the KL divergence $D_{KL}$ is given by $D_{KL}(p,q) = \int_X p(x) \log \frac{p(x)}{q(x)}\,dx$, the Jensen–Shannon divergence is given by $D_{JS} = \frac{1}{2}\left[ D_{KL}\!\left(p, \frac{p+q}{2}\right) + D_{KL}\!\left(q, \frac{p+q}{2}\right) \right]$, and Rényi's divergence is defined as $D_R(p,q) = \frac{1}{\alpha-1} \ln \int_X [p(x)]^{\alpha} [q(x)]^{1-\alpha}\,dx$, $\alpha \neq 1$, $\alpha > 0$, which corresponds to Bhattacharyya's divergence for $\alpha = \frac{1}{2}$. The main problem
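
For concreteness, the following is a small numerical illustration of these standard divergences on discretized (histogram-like) densities; it is not the new divergence measure proposed in this paper, and the example distributions are arbitrary.

```python
# KL, Jensen-Shannon and Renyi divergences for discrete densities on a common
# support; a small epsilon guards against log(0) and division by zero.
import numpy as np

def kl(p, q, eps=1e-12):
    p, q = p + eps, q + eps
    return float(np.sum(p * np.log(p / q)))

def js(p, q):
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def renyi(p, q, alpha=0.5, eps=1e-12):
    return float(np.log(np.sum((p + eps) ** alpha * (q + eps) ** (1 - alpha))) / (alpha - 1))

# Two normalized histograms on the same support.
p = np.array([0.1, 0.4, 0.3, 0.2])
q = np.array([0.3, 0.3, 0.2, 0.2])
print(kl(p, q), js(p, q), renyi(p, q, alpha=0.5))
```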

The homogeneous multi-partitioning algorithm

Minimizing $H_d(P_1,P_2)$ is a difficult combinatorial problem.² The size of the solution space is that of all possible partitions of N objects into two parts, which equals $\binom{N}{N/2}$ (the central binomial coefficient). We propose a heuristic to solve it. Our method is based on the seriation algorithm proposed in [5].³

Illustration of HMP versus K-means clustering

We consider a D-dimensional data set containing an even number N of points drawn from a Gaussian with zero mean and unit variance. We compute the HMP and random bipartitions of this set, and we compare them to a partition obtained using the K-means clustering method with K=N/2. The N/2 prototypes are initialized on N/2 data chosen at random. After training with the standard K-means procedure, we assign the datum closest to each prototype to part P1 and the remaining data to part P2. Most of the
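
A sketch of this K-means baseline, assuming scikit-learn; the initialization and tie-handling details are illustrative assumptions rather than the exact experimental setup.

```python
# A minimal sketch of the K-means baseline: fit K = N/2 prototypes, send the
# datum closest to each prototype to part P1, and the remaining data to P2.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances_argmin

def kmeans_bipartition(X, seed=0):
    N = len(X)                                     # assumed even
    km = KMeans(n_clusters=N // 2, init="random", n_init=1, random_state=seed).fit(X)
    # index of the datum closest to each of the N/2 prototypes
    closest = pairwise_distances_argmin(km.cluster_centers_, X)
    part1 = np.unique(closest)
    part2 = np.setdiff1d(np.arange(N), part1)
    return part1, part2

X = np.random.default_rng(0).normal(size=(200, 5))   # D = 5, N = 200
P1, P2 = kmeans_bipartition(X)
```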

Conclusion

We propose a new divergence measure between pdfs. We then show that finding two pdfs minimizing this divergence is equivalent, in a one-dimensional Euclidean space, to finding a bipartition whose parts, drawn from these pdfs, minimize a Hausdorff-like homogeneity criterion. We propose the Homogeneous Multi Partition (HMP) heuristic to minimize this homogeneity criterion. This algorithm partitions a data set into two or more parts and tends to favor parts that are homogeneous to the initial set. We compare

Acknowledgments

We would like to thank the anonymous reviewers for their useful comments and suggestions.

References (17)

Michael Aupetit received his M.Sc. degree in computer science engineering from the Ecole pour les Etudes et la Recherche en Informatique et Electronique (EERIE, Nimes, France) in 1998. He obtained his Ph.D. with highest honor in Industrial Engineering from the Institut National Polytechnique de Grenoble (France) in 2001. He worked for 6 years at CEA DAM applying Machine Learning tools to analyze and monitor seismic events. Now he is with the Multi-sensor intelligence and machine learning laboratory (LIMA) at CEA DRT. His research activities cover Machine Learning and Data Mining dealing with high-dimensional quantitative data.
