Invariant and Metric Free Proximities for Data Matching: An R Package

Data matching is a typical statistical problem in nonexperimental and/or observational studies or, more generally, in cross-sectional studies in which one or more data sets are to be compared. Several methods are available in the literature, most of which are based on a particular metric or on statistical models, either parametric or nonparametric. In this paper we present two methods to calculate a proximity that have the property of being invariant under monotonic transformations. These methods require at most the notion of ordering. Open-source software in the form of an R package is also presented.


Why a metric-free proximity?
In social sciences, researchers often need to estimate the effect of a "treatment" (like a social program, a medical therapy, etc.) on a population of individuals. The estimation generally entails the comparison of an outcome variable between the group of subjects exposed to the treatment and a "control" group which did not receive the treatment.
The evaluation of the differential treatment effect on the outcome for treated versus control individuals has to be done under the same pre-treatment conditions (see Heckman, Ichimura, and Todd 1997; Heckman, Ichimura, Smith, and Todd 1998). In experimental studies, two similar groups of individuals are randomly selected from a population: one is then exposed to the treatment and the other is not. In observational studies, the assignment to the treatment is usually non-randomized and, therefore, the distributions of pre-treatment covariates differ between the two groups. Finding control units similar to each treated individual is the preliminary problem of the analysis. Therefore, a technique is required for matching observations coming from different data sets. If the match fails partially or completely, it means that the distributions of the covariates in the two groups do not overlap. This is a case of (partial or total) lack of common support.
When there are many variables, an exact match might become unfeasible due to dimensionality. Hence, since the seminal paper by Cochran and Rubin (1973), many authors have faced the matching problem, and several matching techniques have been developed to overcome this dimensionality issue (see also Rubin 1973a,b). Cochran and Rubin (1973) proposed to solve the problem of multivariate matching using the Mahalanobis distance to match the nearest available individuals. Later, Rosenbaum and Rubin (1983) introduced the notion of the propensity score (PS) as the probability that an individual receives the treatment, conditional on his/her covariates. Rosenbaum (1989) further introduced the notion of optimal matching, that is, a strategy to group treated and control units in such a way as to minimize the overall distance between observations. More recently, Diamond and Sekhon (2005) proposed a matching technique based on genetic algorithms.
The drawback of all unconstrained methods based on distances or propensity scores is that, if two data sets have no common support, a match can always be found among the "less distant" observations, but the resulting matching may be useless. In such cases, it is more effective to use a caliper (a radius or a bound) on the distance, on the propensity score, or on a mix of the two, as explained in Gu and Rosenbaum (1993).
Propensity score matching was brought back to the attention of the statistical community by the work of Dehejia and Wahba (1999). In their paper, the authors suggest that propensity score matching is a good device to reduce bias in the estimation of the average treatment effect in observational studies (regardless of the data sets to be matched). The debate that followed (Dehejia and Wahba 2002; Smith and Todd 2005a,b; Dehejia 2005) was mainly focused on the sensitivity of the match to the model used to estimate the propensity score.
To summarize, from the recent literature it emerges that propensity score methods seem to be too sensitive to model specification, while distance-based methods apply properly only to data which lie in $\mathbb{R}^k$, possibly under multi-normal assumptions on the distribution of the data. Hence the need for a measure of proximity which is not tied to any notion of statistical model and does not require additional assumptions on the structure of the data. In the next sections we present two implementations of the same idea of an invariant and metric-free proximity. The main argument behind this notion of proximity is that the proximities $\pi_{ij}$ are interpreted as the "belief that observations $i$ and $j$ are equal in attributes", in the sense that they occupy the same region of the space (whatever the structure of that space!).
We will introduce two different methods to generate a proximity. The first approach, presented in Section 2.1, uses regression trees to obtain rank-related proximities. This is a Monte Carlo method already known in the literature, which we present for completeness. Section 2.2 shows a faster and non-randomized method, introduced for the first time in this paper. This second method also allows for an easier interpretation of the proximities obtained. A comparison of the two methods is presented at the end of the section. Applications to data matching are presented in Section 3, and the corresponding package rrp for the R language (R Development Core Team 2008), available from the Comprehensive R Archive Network at http://CRAN.R-project.org/package=rrp, is described in Section 4.

Two different algorithms
To obtain a metric-free proximity, instead of defining a notion of center and fixing a radius, we make use of ranks to decide whether two (or more) observations are in the same portion of the space. For non-ordinal variables, ranks are not even needed. Such a rank-based proximity is also invariant under monotonic transformations of the ranked data, because ranks are preserved under such transformations. It should be noticed that the definition of rank in this paper is generalized in the following sense: if data are raw, then ranks take the usual meaning. When observations are coarsened into intervals, each observation is associated with the interval to which it belongs, and hence the rank of the observation is the rank of the corresponding interval. An example of this rank assignment is given at the end of the present section.
The proximities $\pi_{ij}$ take values in $[0, 1]$. Under our algorithms, $\pi_{ij} = 1$ means that observations $i$ and $j$ are "likely to be equal", while $\pi_{ij} = 0$ means that $i$ and $j$ are completely different observations. When $\pi_{ij} \in (0, 1)$, the interpretation of this value is strictly related to the algorithm in use.
Let $x_i = (x_{i1}, x_{i2}, \ldots, x_{ip})$ be the vector of attributes $(X_1, X_2, \ldots, X_p)$ for observation $i = 1, \ldots, n$, where $n$ is the sample size. Working with ranks, there is always the possibility that extremal data disturb the analysis, given that we do not make use of any distance. In order to reduce this risk, we assume that each numeric variable $X_k$ in the data set is preliminarily discretized by dividing the support of $X_k$ into $m$ intervals, so that extremal values are likely to be isolated in cells near the border of the support of the data. After discretization, the variable $X_k$ is reclassified as an ordinal categorical variable. Both boolean and unordered categorical variables are treated as unordered categorical variables. So, from now on, we assume that the data consist only of categorical variables (either ordinal or not, possibly obtained by discretization). Once the data are ready, we proceed with rank assignment. It might happen that, in case of discretization into $m$ intervals, some intervals remain empty. In such a case, we assign progressive ranks to the observations, keeping the information that some cells are empty. The following example clarifies the procedure of assigning the ranks: suppose we have the following 7 data for variable $X_k$: 1, 1, 1, 5, 5, 6, 8, and suppose we decided to split the support of $X_k$, i.e., the interval $[1, 8]$, into $m = 4$ equally spaced intervals: $[1, 2.75]$, $(2.75, 4.50]$, $(4.50, 6.25]$ and $(6.25, 8]$. No observation falls into interval #2, i.e., $(2.75, 4.50]$, and the ranks of the data are assigned taking this information into account, as follows:

$x_{ik}$: 1 1 1 5 5 6 8
$R_{ik}$: 1 1 1 3 3 3 4

where $R_{ik}$ is the rank assigned to observation $i$ along variable $k$. For unordered variables we do not make use of ranks, so observations with different values will be treated as different observations.
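The rank assignment just described can be reproduced in a few lines of R. The following is only a sketch of the idea, not the package internals (all object names are ours):

R> x <- c(1, 1, 1, 5, 5, 6, 8)
R> m <- 4
R> breaks <- seq(min(x), max(x), length.out = m + 1)  # 1.00 2.75 4.50 6.25 8.00
R> cell <- cut(x, breaks = breaks, include.lowest = TRUE)  # interval of each value
R> as.numeric(cell)  # interval indices keep track of the empty cell #2
[1] 1 1 1 3 3 3 4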

The RRP proximity
The RRP algorithm discussed in Iacus and Porro (2007a,b) is as follows: to each observation $i$ a fictitious and exogenous response variable $U_i \sim U(0,1)$ is assigned, where $U(0,1)$ is the uniform distribution on $[0,1]$. A regression tree that models the random vector $(U_1, \ldots, U_n)$ as a function of the attributes $(X_1, X_2, \ldots, X_p)$ is grown, yielding as a result a random, recursive and non-empty partition of the data. A proximity measure $\pi$ is induced from this random partition by defining the following set of binary parameters: $\pi_{ij} = 1$ for all observations $i$ and $j$ in the same leaf of the regression tree; $\pi_{ij} = 0$ otherwise. This partition and the proximity measure entirely depend on the realization of the random vector $(U_1, \ldots, U_n)$: therefore the procedure is replicated $R$ times and the final proximity measure is defined as the average of the $\pi_{ij}$'s over all the replications. Regression trees work on ranks, hence the suggested discretization of the data is well suited to the method. The RRP algorithm is the following:

1. Choose a value of $R$ (usually $R = 250$). Set $r = 1$.
2. Draw $U_i \sim U(0,1)$, $i = 1, \ldots, n$, and grow a regression tree for $(U_1, \ldots, U_n)$ as a function of $(X_1, X_2, \ldots, X_p)$. Set $\pi^{(r)}_{ij} = 1$ if observations $i$ and $j$ belong to the same leaf of the tree, and $\pi^{(r)}_{ij} = 0$ otherwise.
3. If $r < R$, set $r = r + 1$ and repeat from step 2.
4. Define the RRP-proximity matrix $\Pi_{RRP}$ and the RRP-dissimilarity matrix $\Delta_{RRP}$ with entries

$$\pi^{RRP}_{ij} = \frac{1}{R} \sum_{r=1}^{R} \pi^{(r)}_{ij}, \qquad \delta^{RRP}_{ij} = 1 - \pi^{RRP}_{ij}. \qquad (1)$$

A self-contained sketch of this loop is given below.
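To fix ideas, the following sketch implements one possible version of the RRP loop on top of the rpart package; it is meant only to illustrate the algorithm (the optimized implementation is the rrp.dist function described in Section 4, and the control settings below are illustrative):

library(rpart)

rrp.sketch <- function(X, R = 250, msplit = 10) {
  n <- nrow(X)
  P <- matrix(0, n, n)
  for (r in 1:R) {
    U <- runif(n)  # fictitious exogenous response U_i ~ U(0,1)
    tree <- rpart(U ~ ., data = data.frame(U = U, X),
                  control = rpart.control(minsplit = msplit, cp = 0))
    leaf <- tree$where  # leaf membership of each observation
    P <- P + outer(leaf, leaf, "==")  # pi^(r)_ij = 1 if i and j share a leaf
  }
  P / R  # the RRP-proximity matrix: average over the R replications
}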

The rank-based proximity
The same idea can be exploited using the ranks directly, without making use of a Monte Carlo approach and regression trees. This second method is cleaner and, for obvious reasons, more efficient. The main argument is to consider, for each observation in the data set, its rank along each variable $X_k$ and to set $\pi^{(k)}_{ij} = 1$ if, along variable $X_k$, observations $i$ and $j$ have ranks that are equal or consecutive, and $\pi^{(k)}_{ij} = 0$ otherwise. The final proximity is obtained by averaging over all the variables:

1. for each variable $X_k$, $k = 1, \ldots, p$:
2.    assign the (generalized) ranks $R_{ik}$ and set $\pi^{(k)}_{ij} = 1$ if $|R_{ik} - R_{jk}| \le 1$, and $\pi^{(k)}_{ij} = 0$ otherwise (for unordered variables, set $\pi^{(k)}_{ij} = 1$ only if $x_{ik} = x_{jk}$);
   end for
3. define $\Pi_{RANK}$ and $\Delta_{RANK}$ as follows:

$$\pi^{RANK}_{ij} = \frac{1}{p} \sum_{k=1}^{p} \pi^{(k)}_{ij}, \qquad \delta^{RANK}_{ij} = 1 - \pi^{RANK}_{ij}. \qquad (2)$$

We call $\Pi_{RANK}$ the RANK-proximity matrix and $\Delta_{RANK}$ the RANK-dissimilarity matrix. A sketch of this computation is given below.
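Again as a sketch (the actual implementation is the rank.dist function of Section 4), the rank-based proximity can be computed in a few lines of R; here X is assumed to be a data frame whose columns are ordered factors (obtained, e.g., by discretization) or unordered factors:

rank.sketch <- function(X) {
  n <- nrow(X)
  p <- ncol(X)
  P <- matrix(0, n, n)
  for (k in 1:p) {
    xk <- X[[k]]
    if (is.ordered(xk)) {
      Rk <- as.numeric(xk)  # generalized ranks (interval indices)
      P <- P + (abs(outer(Rk, Rk, "-")) <= 1)  # equal or consecutive ranks
    } else {
      P <- P + outer(as.character(xk), as.character(xk), "==")  # equal values only
    }
  }
  P / p  # the RANK-proximity matrix: average over the p variables
}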

Comparison and properties of the two proximities
Clearly, both $\Delta_{RRP}$ and $\Delta_{RANK}$ are not metrics; therefore it is hard to make comparisons with other widely used metrics (e.g., Euclidean, Mahalanobis, Manhattan). An important difference with respect to distances is that neither of the dissimilarity matrices (1) and (2) should be used in hierarchical clustering, because all totally different (up to our methods) observations have the same dissimilarity ($\delta_{ij} = 1$), making it impossible to aggregate the "closest" units. Finally, both dissimilarities are not full rank (in the sense of Little and Rubin 2002, page 69): $\delta_{ij} = 0$ can occur even if the condition $x_{ik} = x_{jk}$ is not met for all attributes $k = 1, \ldots, p$, due to discretization.
An advantage of the proximity defined in (2) is that it has a clear interpretation. Indeed, $\pi^{RANK}_{ij} = q$ means that observations $i$ and $j$ have proximity $\pi^{(k)}_{ij} = 1$ along a fraction $q$ of the variables $X_k$, $k = 1, \ldots, p$. Nevertheless, even if $\pi^{RANK}_{ij} = \pi^{RANK}_{ik} = 0.75$ means "observations $i$ and $j$, as well as $i$ and $k$, are similar in 75% of the variables", it is not necessarily the same subset of variables. Unfortunately, the same interpretation cannot be attributed to the RRP-proximities.
Both proximities are metric-free and invariant under monotonic transformations of the data, but the rank proximity (2) is easier to understand, simpler to implement, and considerably faster to obtain.
Other approaches to data matching based on ranks (defined in the usual way) do exist. For example, let $M_k$ be the maximum rank for variable $k$ and $r_k(i)$ the rank of observation $i$ along variable $X_k$; then

$$z_k(i) = \frac{r_k(i)}{M_k}$$

is a score which is also invariant under monotonic transformations of the data. The vector $(z_1(i), z_2(i), \ldots, z_p(i))$ can be used to replace $(x_{i1}, x_{i2}, \ldots, x_{ip})$ in the data matrix, and then any distance-based matching method can be applied to the new matrix. Alternatively, it is possible to assign the score $z_i = (z_1(i) + z_2(i) + \cdots + z_p(i))/p$ to each observation and treat it as a new balance score.
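This alternative transformation amounts to a couple of lines of R (a sketch; these scores are not computed by the rrp package, and X stands for a generic data frame):

R> Z <- sapply(X, function(x) rank(x) / max(rank(x)))  # z_k(i) = r_k(i) / M_k
R> z <- rowMeans(Z)                                    # the scalar balance score z_i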
The function rrp.dist in the package rrp implements the algorithm for the RRP method, and the function rank.dist implements the rank-based algorithm. Both functions may return an R object of class dist, for memory saving and compatibility with other matching algorithms.
In the case of rank.dist, the default object which is returned is a list with one element per observation. Element $i$ of the list contains a named vector of proximities, where the names correspond to the row names $j$ of the original data set for which the proximity between observation $i$ and observation $j$ is positive.
The RANK-proximity algorithm also makes it possible to weight the variables included in the match differently. If, for example, there are hints or requirements for a stricter match on some of the covariates, it is possible to redefine the proximity (2) as follows:

$$\pi^{RANK}_{ij} = \frac{\sum_{k=1}^{p} w_k \, \pi^{(k)}_{ij}}{\sum_{k=1}^{p} w_k},$$

where $w_k > 0$ are the weights. Obviously, the interpretation of the values of $\pi^{RANK}_{ij} \in (0, 1)$ is less clear in this case.
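In the package, the weights enter through the weights argument of the rank.dist function described in Section 4. For instance, assuming a stricter match is required on the first covariate of a generic data matrix X, one could double its weight:

R> px <- rank.dist(X, cut.in = 15, weights = c(2, rep(1, ncol(X) - 1)))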

Applications to data matching
The problem of data matching has been described in the Introduction. Several packages of the R statistical environment (R Development Core Team 2008) implement different matching methods. Among these, we mention MatchIt (Ho, Imai, King, and Stuart 2007), which is a generic framework for data matching; Matching (Diamond and Sekhon 2005), which also implements genetic matching; and optmatch (Hansen 2004), which implements optimal full matching as described in Rosenbaum (1991). In the next sections we present an analysis of randomly generated data that allows for an easy interpretation of the role of the proximities in the identification of a common support, and another data matching application on real data borrowed from the econometric literature.

In both cases, we calculated the curve S(λ), whose graph is shown in Figure 1.

An econometric example
Many matching algorithms have been tested against the Lalonde data set. These data (LL in the following) are taken from the National Supported Work (NSW) Demonstration, a job training program implemented during the Seventies in the United States and analysed by Lalonde (1986). From April 1975 to August 1977 the program was carried out as a randomized experiment: some applicants were assigned to the program, while some others, randomly chosen, were assigned to a control group and not allowed to participate in the program. The program provided training to the participants for 12-18 months and helped them in finding a job. As an effect, the program was supposed to yield an increase in the earnings of participants: therefore, real earnings in 1978 is the outcome variable of the analysis. Several pre-treatment variables were registered for the applicants (both participants and control individuals): age (age), years of education (education), marital status (married), lack of a high school diploma (nodegree), ethnic indicators (black, hispanic) and real earnings in 1974 (re74) and 1975 (re75). Two indicator variables, u74 and u75, are included in the analysis to signal unemployment in 1974 and 1975.

We will not discuss here the problem of the estimation of the average treatment effect for these data, which have been considered by many authors (see Dehejia and Wahba 1999, 2002; Smith and Todd 2005a,b; Dehejia 2005), including Iacus and Porro (2007a,b) for the RRP algorithm. In this section, we just show the performance of the two methods proposed in this paper on the data set. In his study, Lalonde used several non-experimental control groups, coming from the Panel Study of Income Dynamics (PSID) and the Current Population Survey-Social Security Administration File (CPS), and tried to replicate the experimental target. He concluded that methods based on non-experimental data cannot correctly estimate the effect of the randomized experiment, consequently shedding some doubts on the reliability of these procedures. In more recent years, Dehejia and Wahba (1999, 2002) tried to show that the matching method based on propensity scores can be successful in replicating the experimental average treatment effect estimated by Lalonde even in a non-experimental context. To this aim, they selected from the experimental sample used by Lalonde a subsample (the DW sample) made of 185 treated and 260 control units. As a non-experimental control group, they used a sample of 2490 units coming from the PSID. Many authors (see the cited references) argued about the real possibility of matching the DW and PSID data. The application of the RRP- and RANK-proximities provides evidence of the fact that the DW and PSID data sets cannot be matched. The following R code performs such an analysis:

R> require("rrp")
R> data("DWvsPSID")
R> n <- dim(DWvsPSID)[1]
R> ctr <- which(DWvsPSID$treated == 0)
R> trt <- which(DWvsPSID$treated == 1)
R> group <- (DWvsPSID$treated == 1)
R> DWvsPSID$u74 <- factor(DWvsPSID$re74 > 0)
R> DWvsPSID$u75 <- factor(DWvsPSID$re75 > 0)
R> DWvsPSID$black <- factor(DWvsPSID$black)
R> DWvsPSID$married <- factor(DWvsPSID$married)
R> DWvsPSID$nodegree <- factor(DWvsPSID$nodegree)
R> DWvsPSID$hispanic <- factor(DWvsPSID$hispanic)
R> str(DWvsPSID)
'data.frame': 2675 obs. of 12 variables:
 $ treated : num 1 1 1 1 1 1 1 1 1 1 ...
 ...
 $ black   : Factor w/ 2 levels "0","1": 1 2 2 2 1 1 1 2 1 2 ...
 $ married : Factor w/ 2 levels "0","1": 2 2 2 2 1 2 2 2 1 2 ...
 $ nodegree: Factor w/ 2 levels "0","1": 1 1 2 2 1 1 1 1 1 ...
 ...
 $ re78    : num 12418 15953 12804 1294 12591 ...
 $ hispanic: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 2 1 1 1 ...
 $ u74     : Factor w/ 2 levels "FALSE","TRUE": 1 2 2 1 2 2 2 1 1 2 ...
 $ u75     : Factor w/ 2 levels "FALSE","TRUE": 1 2 2 2 2 2 2 1 1 2 ...

We remove the group variable and re78:

R> DWvsPSID <- DWvsPSID[-c(1, 9)]

We now calculate the RANK-proximity. A list will be returned:
R> px <- rank.dist(DWvsPSID, cut.in = 20)

and identify the twins between treated and control units (one possible implementation of this twin search is sketched below). Comparing the summary statistics of the covariates before and after the match, we can see the effectiveness of the match: after the match, the distributions of the covariates of treated and control units look more similar between the two groups. Unfortunately, the observations involved in the match are only a few of the original ones:

pre-match:  treated = 185, controls = 2490
post-match: treated = 41,  controls = 32

showing that a reliable match can only be attained for an extremely small subset of the data. The same results can be obtained using the RRP-proximity, as shown in Iacus and Porro (2007b).
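For reference, the twin search mentioned above can be performed by scanning the list returned by rank.dist for control units with proximity 1 to each treated unit. The following lines are one possible way to do it, not the original code of the analysis; they assume the default numeric row names, so that the names of each proximity vector can be converted back to row indices:

R> twins <- lapply(trt, function(i) {
+    p <- px[[i]]
+    j <- as.numeric(names(p)[p == 1])  # observations identical to i up to ranks
+    intersect(j, ctr)                  # keep only control units
+ })
R> matched.trt <- trt[sapply(twins, length) > 0]  # treated units with a twin
R> matched.ctr <- unique(unlist(twins))           # distinct control twins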

A brief account of the software
This section is intended as a short introduction to the rrp package for the R language (R Development Core Team 2008). For this reason, it also reviews functions related to applications of the proximities not discussed above: the interested reader can refer to Iacus and Porro (2007a,b) for further details. We assume that the reader has some knowledge of the R language, even if the code should be understandable to non R experts too. The package is available from the Comprehensive R Archive Network at http://CRAN.R-project.org/package=rrp.
One of the main functions of the package is rrp.dist, which generates a dissimilarity matrix, say D (the proximity can be obtained as 1 - D). This matrix can be used in further analyses. The rrp.dist function accepts several parameters. The ones relevant to the applications of this paper are

rrp.dist(X, msplit = 10, Rep = 250, cut.in = 15, asdist = FALSE)

where X is the data matrix; msplit is the minimum split parameter (by default 10), which corresponds to the same argument in the rpart package; Rep is the number of RRP replications (by default 250); and cut.in is the number of equally spaced cut points (by default 15) used to split the support of the numeric variables of X, i.e., the number of intervals equals cut.in - 1.
The parameter asdist specifies whether the return value should be an R object of class dist. If asdist is set to FALSE (the default value), the return value is an object of class externalptr (external pointer), for which coercion methods exist in the package. This choice was made to increase efficiency in the presence of large data sets and to speed up the algorithm, because the dist object is allocated in memory only once and is not passed (and hence copied) back and forth between successive R calls. The function XPtrToDist converts an external pointer to a dist object, and newXPtr creates a brand new external pointer object. For example, newXPtr(n, k = 0) creates an external pointer of class XPtr (a class specific to the rrp package) as if it were a dist object of size n and initializes it with zeroes.
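As an illustration (with X a generic data matrix), the two return types can be obtained as follows:

R> D <- rrp.dist(X, Rep = 250, cut.in = 15, asdist = TRUE)  # a dist object
R> xp <- rrp.dist(X, Rep = 250, cut.in = 15)                # an XPtr (default)
R> D2 <- XPtrToDist(xp)  # coerce the external pointer to a dist object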
The function rank.dist has an interface similar to rrp.dist. The most relevant parameters are the following:

rank.dist(X, cut.in = 0, thr = 0.75, weights, asdist = FALSE)

The thr argument is a threshold below which the value of the proximity is not retained by the algorithm. The argument weights is a vector of weights, which by default is a vector of ones, and asdist, when set to FALSE (the default), makes the function return a list with one element per observation. Each element of the list is a named vector containing the proximities greater than or equal to thr; the names of the vector correspond to the row names of X for the observations for which the proximity is greater than or equal to thr. Each vector is also sorted in decreasing order to allow for fast nearest neighbor classification. The choice of returning a list instead of an external pointer was made because, for the rank-based proximity (2), it is quite easy to choose a reasonable threshold for the proximities, hence the corresponding dist (or XPtr) object is essentially a sparse vector. In our experience, the list representation is more efficient in terms of memory occupation for big data sets.
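As an illustrative call (again with X a generic data matrix), the following retains only the proximities of at least 0.9 and inspects the potential matches of the first observation:

R> px <- rank.dist(X, cut.in = 15, thr = 0.9)
R> px[[1]]  # named vector of proximities >= 0.9, sorted in decreasing order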

Other useful functions of the package
We now show an example of classification on the iris data set. We recall that the iris data set contains 4 measurement variables and 1 class variable (called Species). We randomly sample 10 observations as a test set and use the remaining observations as a training set. The dissimilarity matrix is built on the whole data set by excluding the class variable (which is variable number 5) from the data matrix.
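The set-up just described translates into the following preliminary code (the sampled indices differ from run to run):

R> data("iris")
R> idx <- sample(1:nrow(iris), 10)  # 10 observations form the test set
R> D <- rrp.dist(iris[, -5])        # dissimilarity on the whole data set, without Species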