An Efficient Approach for Statistical Matching of Survey Data Through Calibration, Optimal Transport and Balanced Sampling

Statistical matching aims to integrate two statistical sources. These sources can be two samples or a sample and the entire population. If two samples have been selected from the same population and information has been collected on different variables of interest, then it is interesting to match the two surveys to analyse, for example, contingency tables or covariances. In this paper, we propose an efficient method for matching two samples that may each contain a weighting scheme. The method matches the records of the two sources. Several variants are proposed in order to create a directly usable file integrating data from both information sources.


Introduction
Integrating data from different sources represents a major challenge in statistics. Yang and Kim (2020) and Kim and Tam (2020) discussed a set of methods for data integration. Statistical matching is the field of statistics that deals with the best way to merge two different files by matching units based on a group of common variables (D'Orazio et al., 2006; D'Orazio, 2019). Renssen (1998) distinguishes between two types of analysis. The macro approach focuses on estimating a full joint distribution between the respective variables of interest in the two samples. This can be, for example, a covariance matrix, a correlation or a contingency table. The micro approach aims to complement one file with information from the other; imputation to correct non-response is an example related to this approach (Haziza, 2009; Chen and Haziza, 2019). Kim et al. (2016) use the technique of fractional imputation to perform statistical matching.
In this paper, we propose an efficient method for statistical matching. Units are matched based on the proximity of a group of variables measured in both surveys. Moreover, the two sources can either have common units or have an empty intersection. One of the two sources may even contain the entire population. In addition, we impose a set of constraints in order to take advantage of all the information available in the two sources. This method can also be used to perform imputation and is therefore suitable for micro-approach analyses.
Both sources of information may contain a weighting system that allows the files to be extrapolated to the full population. These weights are usually calculated to take into account inclusion probabilities, non-response treatment and calibration. In official statistics, calibration methods have been proposed by Deville and Särndal (1992) and Deville et al. (1993) to adjust survey data on census data or a register. Calibration can also be used to adjust or harmonize several surveys from different sources (see Guandalini and Tillé, 2017, and references therein). In an unpublished paper, Dudoignon (2018) proposed using the idea of optimal transport on internet audience data.
We have set out a series of recommendations that a matching method should follow:
- The method should match common units as a priority.
- The result of the matching must integrate information from both sources.
- The matching should also take into account the weighting system.
- After matching, the estimated totals of the variables common to both sources must be identical to the totals before matching.
- An optimal matching should take advantage of all the information available in both sources.
The proposed methods therefore allow the matching of two data files but also the imputation of one file on another. First, calibration theory is used to harmonise the two samples. Then a linear program is used to perform an optimal matching while taking into account the weights. This program can be written as an optimal transport problem. Finally, the values to be matched can be selected using a balanced stratified sampling technique as presented in Jauslin et al. (2021b) and implemented in the R package 'StratifiedSampling' (Jauslin et al., 2021a). The methods either perform matching based on a division of weights, produce a prediction, or impute a value from one source to another.

Problem and notation
Consider a population U = {1, . . . , k, . . . , N} of size N from which two random samples S_1 and S_2, of sizes n_1 and n_2 respectively, have been selected. It is assumed that three groups of variables can be measured on the population units. The vectors of variables x_k ∈ R^p, k ∈ U, are measured on the units selected in both S_1 and S_2. The vectors of variables y_k ∈ R^q, k ∈ U, are measured only on the units selected in S_1. The vectors of variables z_k ∈ R^r, k ∈ U, are measured only on the units selected in S_2.
Generally in survey sampling, samples come with complex weighting systems v_{1k}, k ∈ S_1, and v_{2ℓ}, ℓ ∈ S_2. These weights can take into account the inverses of the inclusion probabilities, a possible re-weighting to compensate for questionnaire non-response, and a possible calibration. Frequently, a statistical match is made where one of the two files is seen as the recipient file while the other is seen as the donor file. Throughout this manuscript, we suppose, without loss of generality, that the sample S_1 is the recipient file while S_2 is the donor file. Generally n_2 > n_1; note that one of the two samples might be the whole population.
The population totals of the common auxiliary variables are equal to
X = \sum_{k \in U} x_k.
This total can be estimated from either the sample S_1 or the sample S_2 by means of the weighted estimators
\hat{X}_{v1} = \sum_{k \in S_1} v_{1k} x_k and \hat{X}_{v2} = \sum_{\ell \in S_2} v_{2\ell} x_\ell.   (1)
Following the same idea for the variables of interest, the totals Y = \sum_{k \in U} y_k and Z = \sum_{k \in U} z_k can be estimated using the estimators
\hat{Y}_{v1} = \sum_{k \in S_1} v_{1k} y_k and \hat{Z}_{v2} = \sum_{\ell \in S_2} v_{2\ell} z_\ell.
In the micro approach, the two samples S_1 and S_2 are merged into a single usable file, while the macro approach focuses on the joint distribution of the variables of interest. Under the usual hypothesis that, conditionally on the variables x_k, the variables y_k and z_k are independent, the relationships between the variables y_k and z_k can be analyzed. For example, if the variables y_k and z_k are dummy variables with respectively q and r categories, we could be interested in estimating the contingency table T_{yz} = \sum_{k \in U} y_k z_k^\top. If the variables of interest are continuous, we could be interested in computing the covariance matrix of the totals, \Sigma_{yz} = cov(Y, Z).
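For concreteness, the two estimators in (1) can be sketched with toy data; all sample sizes, weights and variable values below are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: n1 = 4 recipient units, n2 = 6 donor units,
# p = 2 matching variables measured in both samples.
x1 = rng.normal(size=(4, 2))   # x_k, k in S_1
x2 = rng.normal(size=(6, 2))   # x_l, l in S_2
v1 = np.full(4, 25.0)          # design weights v_{1k}
v2 = np.full(6, 50.0)          # design weights v_{2l}

# Weighted estimators of the common total X, as in Equation (1):
X_hat_v1 = x1.T @ v1           # sum_{k in S_1} v_{1k} x_k
X_hat_v2 = x2.T @ v2           # sum_{l in S_2} v_{2l} x_l
```

The two vectors estimate the same population total X but generally differ, which is precisely what the harmonization step of the next section resolves.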

Harmonization by calibration
Since we are working with two different samples, we first harmonize the sampling weights v_{1k}, k ∈ S_1, and v_{2ℓ}, ℓ ∈ S_2, so that they reproduce the same totals as in Equations (1). That means we are looking for a new weighting system w_{1k}, k ∈ S_1, and w_{2ℓ}, ℓ ∈ S_2, such that
\sum_{k \in S_1} w_{1k} x_k = \sum_{\ell \in S_2} w_{2\ell} x_\ell.   (2)
One aspect that must be taken into account is the intersection between S_1 and S_2. Several cases can occur: the sample S_1 can be included in S_2 or vice versa, and the intersection can also be empty. Let n_{12} = #(S_1 ∩ S_2) denote the size of the intersection of the two samples. Guandalini and Tillé (2017) analyse estimators of the form \hat{X}_\alpha = \alpha \hat{X}_{v1} + (1 - \alpha) \hat{X}_{v2}. They showed that, when p = 1, the value of α that best estimates X using both S_1 and S_2 is
\alpha_{opt} = \frac{var(\hat{X}_{v2}) - cov(\hat{X}_{v1}, \hat{X}_{v2})}{var(\hat{X}_{v1}) + var(\hat{X}_{v2}) - 2\, cov(\hat{X}_{v1}, \hat{X}_{v2})}.
This optimal value minimizes the variance of \hat{X}_\alpha. However, it depends on unknown variances and on a covariance that must be estimated. Since variance estimators are particularly unstable, we may find ourselves far from the optimal estimator. Guandalini and Tillé (2017) suggest using a proxy value for \alpha_{opt} that only depends on the sample sizes and the size of the overlap:
\alpha^* = \frac{n_1 - n_{12}}{n_1 + n_2 - 2 n_{12}}.   (3)
One can then construct the estimator \hat{X}^* = \alpha^* \hat{X}_{v1} + (1 - \alpha^*) \hat{X}_{v2}. In particular, if S_2 ⊂ S_1, then \alpha^* = 1 and \hat{X}^* = \hat{X}_{v1}. Moreover, if S_1 ∩ S_2 = ∅, then \alpha^* = n_1/(n_1 + n_2). In order to compute two new weighting systems satisfying (2) that remain close to v_{1k}, k ∈ S_1, and v_{2ℓ}, ℓ ∈ S_2, the two samples are calibrated on \hat{X}^*. If G_k(w_{1k}, v_{1k}) is one of the pseudo-distances defined in Deville and Särndal (1992), one can search for the weighting systems that solve
\min \sum_{k \in S_1} G_k(w_{1k}, v_{1k}) subject to \sum_{k \in S_1} w_{1k} x_k = \hat{X}^*,
and similarly for S_2. The calibration problem must ensure that the new weights remain positive. This can be obtained, for example, by taking as pseudo-distance the Kullback-Leibler divergence, i.e. G_k(w_{1k}, v_{1k}) = w_{1k} \log(w_{1k}/v_{1k}).
Thus, the new weights have the same sums:
\sum_{k \in S_1} w_{1k} x_k = \sum_{\ell \in S_2} w_{2\ell} x_\ell = \hat{X}^*.
They also allow us to define new and more coherent estimators for X, Y and Z:
\hat{X}_w = \hat{X}^*, \quad \hat{Y}_w = \sum_{k \in S_1} w_{1k} y_k, \quad \hat{Z}_w = \sum_{\ell \in S_2} w_{2\ell} z_\ell.
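The Kullback-Leibler calibration above can be sketched with a small Newton iteration; `calibrate_kl` is a hypothetical helper written for illustration, and production code would use dedicated software (e.g. a calibration routine such as `calib` in the R package sampling):

```python
import numpy as np

def calibrate_kl(v, x, target, n_iter=50):
    """Raking-type calibration: new weights w_k = v_k * exp(x_k @ lam),
    with lam chosen by Newton's method so that sum_k w_k x_k equals
    `target` (Kullback-Leibler pseudo-distance; weights stay positive
    by construction)."""
    lam = np.zeros(x.shape[1])
    for _ in range(n_iter):
        w = v * np.exp(x @ lam)
        resid = target - x.T @ w          # calibration residual
        H = x.T @ (w[:, None] * x)        # Jacobian of the constraints
        lam += np.linalg.solve(H, resid)  # Newton step
    return v * np.exp(x @ lam)

# Toy example: 4 units, design weights 10, intercept + one x variable,
# calibrated on a slightly shifted target total.
w1_new = calibrate_kl(np.full(4, 10.0),
                      np.array([[1.0, 0.5], [1.0, 1.5], [1.0, 2.5], [1.0, 3.5]]),
                      np.array([42.0, 85.0]))
```

After convergence the new weights reproduce the target totals exactly while remaining strictly positive.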

Renssen's methods
A method for estimating a contingency table is developed in Renssen (1998) and more recently presented in D'Orazio et al. (2006). The general idea consists of harmonizing the weighting systems as explained in Section 3 and then using the matching variables x_k to fit linear models from which an estimated contingency table is obtained. In a first step, regression coefficients \hat{\beta}_{yx} and \hat{\beta}_{zx} are computed from the sample S_1 (respectively S_2) by using the weights w_{1k}, k ∈ S_1 (respectively w_{2ℓ}, ℓ ∈ S_2). Using a weighted linear model, the following coefficients are obtained:
\hat{\beta}_{yx} = \Big(\sum_{k \in S_1} w_{1k} x_k x_k^\top\Big)^{-1} \sum_{k \in S_1} w_{1k} x_k y_k^\top, \quad \hat{\beta}_{zx} = \Big(\sum_{\ell \in S_2} w_{2\ell} x_\ell x_\ell^\top\Big)^{-1} \sum_{\ell \in S_2} w_{2\ell} x_\ell z_\ell^\top.
The contingency table is then estimated by the matrix product
\hat{T}_{yz} = \hat{\beta}_{yx}^\top \Big[\alpha^* \sum_{k \in S_1} w_{1k} x_k x_k^\top + (1 - \alpha^*) \sum_{\ell \in S_2} w_{2\ell} x_\ell x_\ell^\top\Big] \hat{\beta}_{zx},
where \alpha^* is the coefficient (3) that depends on the value n_{12}. Renssen's method can easily be generalized to the continuous case, but some assumptions must be satisfied on the variables y_k, k ∈ S_1, and z_ℓ, ℓ ∈ S_2. For more details, we refer the reader to the article of Renssen (1998) and the book by D'Orazio et al. (2006).
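The regression step of Renssen's approach is just weighted least squares. A minimal sketch (the helper name `wls_coef` and the toy data are hypothetical):

```python
import numpy as np

def wls_coef(x, y, w):
    """Weighted least-squares coefficient matrix
    B = (sum_k w_k x_k x_k')^{-1} sum_k w_k x_k y_k'."""
    return np.linalg.solve(x.T @ (w[:, None] * x),
                           x.T @ (w[:, None] * y))

# Sanity check on exactly linear data: the coefficients are recovered
# whatever the (positive) weights are.
rng = np.random.default_rng(2)
x = rng.normal(size=(8, 3))
B_true = np.array([[1.0, 0.0], [2.0, -1.0], [0.5, 3.0]])
y = x @ B_true
B_hat = wls_coef(x, y, rng.uniform(1, 2, size=8))
```

The same function, called on (x, z, w_2), gives the second coefficient matrix of the method.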

Matching by optimal transport
The main idea of our method is to use optimal transport to perform statistical matching. Optimal transport is an old mathematical problem that consists of finding the best way to minimize the cost of transporting given quantities of goods from a set of origins to a set of destinations. In its simplest form, the optimal transport problem can be solved with a linear program. It has been a very fruitful topic in statistics for the past 10 years and is strongly related to the notion of Wasserstein distance. We refer the reader to Panaretos and Zemel (2020) for more information on optimal transport and the Wasserstein distance.
In the particular case of statistical matching, we use the optimal transport problem to match the units of two different samples. Recalling that S_1 denotes the recipient sample while S_2 stands for the donor sample, the idea is to determine which units of S_2 can be associated with a particular unit k ∈ S_1. We start by computing an n_1 × n_2 matrix D containing the distances between the units of S_1 and the units of S_2. We can, for example, use the usual Euclidean distance or the Mahalanobis distance defined as
D_{k\ell} = (x_k - x_\ell)^\top \hat{\Sigma}_{xx}^{-1} (x_k - x_\ell),
where \hat{\Sigma}_{xx} is an estimate of the covariance matrix of the matching variables. Then, we search for weights W_{kℓ} for each couple k ∈ S_1, ℓ ∈ S_2, by solving the following linear program:
\min \sum_{k \in S_1} \sum_{\ell \in S_2} D_{k\ell} W_{k\ell}
subject to
\sum_{\ell \in S_2} W_{k\ell} = w_{1k} for all k ∈ S_1,
\sum_{k \in S_1} W_{k\ell} = w_{2\ell} for all ℓ ∈ S_2,
W_{k\ell} ≥ 0 for all k ∈ S_1 and ℓ ∈ S_2,
where W_{kk} = min(w_{1k}, w_{2k}) for all couples of identical units in S_1 and S_2. These constraints force the matching of identical units that have been selected in both samples. This linear program is nothing more than an optimal transport problem, for which many efficient implementations exist. Most of the W_{kℓ} weights are zero, so it is not necessary to manipulate a large matrix of data. The calibration achieved in Section 3 is not adversely affected by the linear program. Indeed,
\sum_{k \in S_1} \sum_{\ell \in S_2} W_{k\ell} x_\ell = \sum_{\ell \in S_2} w_{2\ell} x_\ell = \hat{X}^*.
The linear program returns a matrix of weights W of size n_1 × n_2. The non-zero entries in the kth row of W contain the weights of the units of S_2 matched to unit k. We generally do not have a one-to-one match, which means that each unit k in S_1 is matched to more than one unit with non-zero weight in S_2. The next two sections propose two different ways to obtain, from the output of the optimal transport, a file where each unit from S_1 has only one imputed unit from S_2. Without loss of generality, in the following developments, we suppose that sample S_1 is completed with a prediction from S_2.
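The linear program above can be solved with a generic LP solver. A minimal sketch using `scipy.optimize.linprog` on a hypothetical 2 × 3 example (real applications would use a dedicated transport solver, and the extra constraint fixing W_kk for common units is omitted here for simplicity):

```python
import numpy as np
from scipy.optimize import linprog

# Harmonized weights with equal sums (both sum to 12), as required by
# the transport problem.
w1 = np.array([5.0, 7.0])            # recipient S_1
w2 = np.array([4.0, 5.0, 3.0])       # donor S_2
D = np.array([[1.0, 2.0, 3.0],
              [2.0, 1.0, 4.0]])      # distance matrix (n1 x n2)

n1, n2 = D.shape
# Equality constraints on the flattened plan W (row-major):
# row sums of W equal w1, column sums equal w2.
A_eq = []
for k in range(n1):                  # sum_l W_kl = w_1k
    row = np.zeros(n1 * n2); row[k * n2:(k + 1) * n2] = 1; A_eq.append(row)
for l in range(n2):                  # sum_k W_kl = w_2l
    col = np.zeros(n1 * n2); col[l::n2] = 1; A_eq.append(col)
b_eq = np.concatenate([w1, w2])

res = linprog(D.ravel(), A_eq=np.array(A_eq), b_eq=b_eq,
              bounds=(0, None), method="highs")
W = res.x.reshape(n1, n2)            # optimal transport plan
```

The solution has the prescribed marginals, so the calibrated totals are preserved by construction.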
[Figure: matching with optimal transport; output of the optimal transport.]

Matching by using prediction
We can make a prediction by computing weighted averages of the x_ℓ and z_ℓ of S_2. Formally, we compute the quantities
q_{k\ell} = \frac{W_{k\ell}}{w_{1k}}, for all k ∈ S_1, ℓ ∈ S_2.
Using these new weights, we can then compute predictions of the x_k and the z_k, k ∈ S_1:
\hat{x}_k = \sum_{\ell \in S_2} q_{k\ell} x_\ell and \hat{z}_k = \sum_{\ell \in S_2} q_{k\ell} z_\ell, for all k ∈ S_1.
The matching quality can be evaluated by comparing the x_k with the predictions \hat{x}_k. For the predicted values \hat{x}_k, the calibration remains valid. Indeed,
\sum_{k \in S_1} w_{1k} \hat{x}_k = \sum_{k \in S_1} \sum_{\ell \in S_2} W_{k\ell} x_\ell = \sum_{\ell \in S_2} w_{2\ell} x_\ell = \hat{X}^*.
The interest of the procedure, however, is that we now have predicted values \hat{z}_k for each unit of S_1, whereas these variables were only measured on S_2.
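The prediction step can be sketched directly from a transport plan; the plan W and the donor values below are hypothetical toy numbers:

```python
import numpy as np

# Toy transport plan W (n1 x n2) and donor values z_l measured on S_2.
W = np.array([[2.0, 0.0, 3.0],
              [2.0, 5.0, 0.0]])
z2 = np.array([1.0, 4.0, 2.0])       # z_l, l in S_2

w1 = W.sum(axis=1)                   # row sums recover w_{1k}
q = W / w1[:, None]                  # q_kl = W_kl / w_1k, rows sum to 1
z_pred = q @ z2                      # z_hat_k = sum_l q_kl z_l
```

The weighted total of the predictions equals the donor-side weighted total, which is the calibration-preservation identity above.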

Matching by using stratified balanced sampling
As explained in Section 5, the output of the optimal transport generally contains repetitions of some units. In some cases, we are interested in obtaining a synthetic file without any repetition. In this section, we propose an imputation method based on the optimal transport result. We propose to use a balanced sampling method to select, among the units of S_2 matched to a unit k of S_1, only one unit. This is in line with the hypothesis that S_2 is the donor file while S_1 is the recipient file. To do this, we randomly generate a matrix of Bernoulli random variables a_{kℓ}, k ∈ S_1, ℓ ∈ S_2, where a_{kℓ} equals 1 if unit ℓ ∈ S_2 is imputed to unit k ∈ S_1. Since each unit k of S_1 can only receive one imputation, we must have
\sum_{\ell \in S_2} a_{k\ell} = 1, for all k ∈ S_1.
We now want to generate the random matrix of a_{kℓ} with expectations E(a_{kℓ}) = q_{kℓ} in such a way that the following system of equations is satisfied as closely as possible:
\sum_{k \in S_1} w_{1k} \sum_{\ell \in S_2} a_{k\ell} x_\ell = \sum_{k \in S_1} w_{1k} \sum_{\ell \in S_2} q_{k\ell} x_\ell.
This sampling problem is known as 'stratified balanced sampling' (see Jauslin et al., 2021b; Hasler and Tillé, 2014; Jauslin et al., 2021a). Indeed, each unit k of S_1 can be seen as a stratum for which a unit ℓ of S_2 must be selected.
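A deliberately simplified stand-in for this selection step draws, for each stratum k, one donor ℓ with probability q_kℓ. This honours the expectations E(a_kℓ) = q_kℓ but, unlike the stratified balanced sampling of Jauslin et al. (2021b), makes no attempt to satisfy the balancing equations; the helper name is hypothetical:

```python
import numpy as np

def impute_one_per_stratum(q, rng):
    """For each recipient unit k (seen as a stratum), draw exactly one
    donor l with P(a_kl = 1) = q_kl.  Simplified illustration only:
    the balancing constraints of the actual method are ignored."""
    n1, n2 = q.shape
    a = np.zeros((n1, n2), dtype=int)
    for k in range(n1):
        a[k, rng.choice(n2, p=q[k])] = 1
    return a
```

Each row of the result contains exactly one 1, placed only where q_kℓ > 0.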
The imputed values are then
\check{x}_k = \sum_{\ell \in S_2} a_{k\ell} x_\ell and \check{z}_k = \sum_{\ell \in S_2} a_{k\ell} z_\ell, for all k ∈ S_1.
The interest of the procedure is that we now have values \check{z}_k for each unit of S_1, whereas these variables were only measured on S_2. If E_q(·) denotes the expectation with respect to the a_{kℓ} conditionally on S_1 and S_2, then, for all k ∈ S_1,
E_q(\check{z}_k) = \sum_{\ell \in S_2} q_{k\ell} z_\ell = \hat{z}_k.

Analysis of the data
Once the optimal transport is performed, different analyses are possible. We can either work directly with the optimal transport result or use the prediction or imputation methods. This gives us five possibilities to analyse the data.

1. Use the output of the optimal transport directly (W_{kℓ}, k ∈ S_1, ℓ ∈ S_2).
2. Use the predicted values (x_k, \hat{x}_k, y_k, \hat{z}_k, w_{1k}, k ∈ S_1), and symmetrically the predictions on S_2.
3. Use the imputed values (x_k, \check{x}_k, y_k, \check{z}_k, w_{1k}, k ∈ S_1) obtained by imputing the values of ℓ ∈ S_2 to k ∈ S_1, and symmetrically the imputations on S_2.
For all five possibilities, the estimations of the means are completely consistent. Indeed, for the common variables we obtain
\sum_{k \in S_1} \sum_{\ell \in S_2} W_{k\ell} x_\ell = \sum_{k \in S_1} w_{1k} \hat{x}_k = \hat{X}^*.
If the variables are categorical, we can estimate a contingency table using the results of the optimal transport matching, or the prediction on S_1 (respectively on S_2):
\hat{T}_{yz} = \sum_{k \in S_1} \sum_{\ell \in S_2} W_{k\ell} y_k z_\ell^\top = \sum_{k \in S_1} w_{1k} y_k \hat{z}_k^\top.
We can also use the imputed values, which give slightly different results:
\check{T}_{yz} = \sum_{k \in S_1} w_{1k} y_k \check{z}_k^\top.
If the variables are continuous, we can estimate the covariances between the y_k and the z_ℓ variables, and we can work indifferently from S_1 × S_2, S_1 or S_2, since the weighted cross-products coincide:
\sum_{k \in S_1} \sum_{\ell \in S_2} W_{k\ell} y_k z_\ell^\top = \sum_{k \in S_1} w_{1k} y_k \hat{z}_k^\top.
As previously, it is also possible to use the imputed values, which give slightly different results. The three estimators (from S_1 × S_2, S_1 or S_2) are thus very close to each other, and one can use them indifferently.
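The contingency-table estimator from the transport output is a plain matrix product; a sketch with hypothetical dummy-coded data:

```python
import numpy as np

# Dummy-coded variables of interest: y_k (q = 2 categories) on S_1,
# z_l (r = 2 categories) on S_2, and a toy transport plan W.
Y = np.array([[1, 0],
              [0, 1]])               # y_k, k in S_1
Z = np.array([[1, 0],
              [0, 1],
              [1, 0]])               # z_l, l in S_2
W = np.array([[2.0, 0.0, 3.0],
              [2.0, 5.0, 0.0]])

# Estimated contingency table from the transport output:
# T_yz = sum_k sum_l W_kl y_k z_l^T
T = Y.T @ W @ Z
```

The entries of T sum to the total transported weight, so the table is automatically consistent with the harmonized margins.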

Simulations
In this section we propose two simulation studies: the first on a simulated Gaussian dataset where the conditional independence assumption (CIA) holds, and the second on the Austrian EU-SILC data (European Union Statistics on Income and Living Conditions).

Gaussian example
In this section we discuss a generated dataset for which the CIA is satisfied. Let U be a population of 10 000 units generated such that the matching variables x_k ∈ R^p, k ∈ U, the variables recorded only in S_1, y_k ∈ R^q, k ∈ U, and the variables recorded only in S_2, z_k ∈ R^r, k ∈ U, are jointly normally distributed; the parameters of the joint distribution and of the matching variables are fixed in advance. To ensure that the CIA holds, the quantity \Sigma_{yz} must satisfy the following equality (D'Orazio et al., 2006):
\Sigma_{yz} = \Sigma_{yx} \Sigma_{xx}^{-1} \Sigma_{xz}.
Suppose that p = 3, q = 2 and r = 2. We simulate 10 000 pairs of recipient samples S_1 and donor samples S_2 by simple random sampling without replacement with sizes fixed to n_1 = 600 and n_2 = 3000. By the properties of the Gaussian distribution, if y_k = (y_{k1}, y_{k2})^\top ∈ R^2, k ∈ U, and z_k = (z_{k1}, z_{k2})^\top ∈ R^2, k ∈ U, are linear combinations of the matching variables plus an error term \varepsilon_k ∼ N(0, 1), the CIA holds and we can estimate the covariance matrix \Sigma_{yz}, given by
\Sigma_{yz} = \begin{pmatrix} -3.667 & -5.368 \\ 2.627 & -9.772 \end{pmatrix}.
Table 1 shows the mean squared errors of the three different methods over the simulations. Among the proposed variants, the optimal transport procedure and the balanced imputation method are the better ones, although the optimal transport is slightly biased. Optimal transport and balanced imputation do not use a model prediction to create the match. In this particular case, where everything is linear, Renssen's method is very powerful in creating a statistical match, and its mean squared error is better than that of the optimal transport methods. In general, with a real dataset, the CIA does not hold and, furthermore, it is not even testable. The next section presents an example where the CIA does not hold and where the optimal transport methods are better.
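The CIA identity \Sigma_{yz} = \Sigma_{yx} \Sigma_{xx}^{-1} \Sigma_{xz} can be illustrated numerically: when y and z depend on x only through linear combinations plus independent noise, the implied cross-covariance matches the empirical one. All parameter values below are hypothetical, not those of the simulation study:

```python
import numpy as np

rng = np.random.default_rng(1)
p, n = 3, 200_000

# Hypothetical covariance of the matching variables.
Sxx = np.array([[1.0, 0.3, 0.1],
                [0.3, 1.0, 0.2],
                [0.1, 0.2, 1.0]])
x = rng.multivariate_normal(np.zeros(p), Sxx, size=n)

# y and z are linear in x plus independent noise, so the CIA holds
# by construction.
By = np.array([[1.0, -2.0], [0.5, 1.0], [0.0, 3.0]])
Bz = np.array([[2.0, 0.0], [-1.0, 1.0], [1.0, 1.0]])
y = x @ By + rng.normal(size=(n, 2))
z = x @ Bz + rng.normal(size=(n, 2))

# Cross-covariance implied by the CIA identity:
Syx = By.T @ Sxx                     # cov(y, x)
Sxz = Sxx @ Bz                       # cov(x, z)
Syz_cia = Syx @ np.linalg.solve(Sxx, Sxz)
```

Here the identity collapses to Syz_cia = By' Sxx Bz, and the empirical cross-covariance of the simulated y and z approaches it as n grows.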

EU-SILC example
This section proposes a simulation study to compare the proposed method with the method of Renssen (1998) on the dataset eusilc available in the R package laeken (Alfons and Templ, 2013). This dataset contains 14 827 observations and 28 variables. It is based on real Austrian EU-SILC data (European Union Statistics on Income and Living Conditions). We slightly modified the dataset to remove the missing values, leaving a dataset of 12 107 observations. Table 2 shows a summary of the different variables used for the simulations. In particular, pl030 is the categorical variable representing the economic status, while eqIncome is the continuous variable of household income. pl030 is the variable of interest recorded only in S_1, while eqIncome is recorded only in S_2. Figure 3 shows the household income by economic status; each simulation estimates the average income by category.
We run simulations using a stratified balanced sampling design (Jauslin et al., 2021b) with sample sizes n_1 = 1000 for each sample S_1 and n_2 = 4000 for each sample S_2. Inclusion probabilities are chosen such that the design respects an optimal stratification, i.e., the number of units selected in each stratum is proportional to the product of the stratum size and the standard deviation of eqIncome within the stratum. In addition, the samples are balanced on the totals of the matching variables.
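The allocation rule just described is a Neyman-type allocation; a minimal sketch (the helper name and the stratum figures are hypothetical):

```python
import numpy as np

def optimal_allocation(n, stratum_sizes, stratum_sds):
    """Neyman-type allocation: the expected number of units drawn in
    stratum h is proportional to N_h * S_h, i.e. the stratum size
    times the within-stratum standard deviation of the study variable
    (here, eqIncome)."""
    prod = np.asarray(stratum_sizes, dtype=float) * np.asarray(stratum_sds, dtype=float)
    return n * prod / prod.sum()

# Two toy strata: the larger-but-less-variable stratum and the
# smaller-but-more-variable one receive equal allocations here.
alloc = optimal_allocation(100, [100, 200], [10.0, 5.0])
```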
To measure the effectiveness of our proposed estimators, we use the mean squared error (MSE). Table 3 shows the mean squared errors of the estimation of the average income within each category based on 10 000 simulations. The table also displays the bias-variance decomposition. We can observe that the matching based on optimal transport is the best in terms of mean squared error.

Table 2: Variables of the eusilc dataset of Alfons and Templ (2013). The first five variables are the ones used for the matching, while the last two are the variables of interest.
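The bias-variance decomposition reported in the simulation tables is the standard Monte Carlo one; a sketch with a hypothetical helper:

```python
import numpy as np

def mse_decomposition(estimates, truth):
    """Monte Carlo decomposition MSE = bias^2 + variance, computed
    over a set of simulated estimates of a known true value.  Using
    the population variance (ddof=0) makes the identity exact."""
    e = np.asarray(estimates, dtype=float)
    bias2 = (e.mean() - truth) ** 2
    var = e.var()
    mse = ((e - truth) ** 2).mean()
    return mse, bias2, var
```

For example, three estimates 1, 2, 3 of a true value 2.5 give a squared bias of 0.25, a variance of 2/3, and an MSE equal to their sum.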

Matching variables:
- hsize: the number of persons in the household.
- db040: the federal state in which the household is located.
- age: the person's age.
- rb090: the person's gender (male or female).
- pb220a: the person's citizenship (AT, EU and Other).
Variables of interest:
- pl030: the person's economic status.
- eqIncome: the household income.

Conclusion
Statistical matching is set to become a valuable tool with the increasing amount of data created in this century. In this article, we have proposed new methods for matching two complex surveys. The proposed statistical matching methods are flexible depending on the type of analysis we want to perform. We can either obtain a one-to-one unit matching using stratified balanced sampling, use the output of the linear program directly, or use a prediction based on weighted averages. Based on simulations, we observe that the proposed methods have a lower cumulative mean squared error. A major assumption that persists in statistical matching is conditional independence. Since in most cases this assumption is not satisfied, it is generally difficult to do anything other than assume it. Our methods return a smaller mean squared error whether or not the conditional independence assumption is satisfied. Thus, our results show that the proposed methods are less sensitive to a failure of the conditional independence assumption. This suggests that they are more efficient and give a better quality statistical match.