Statistical Data Integration in Survey Sampling: A Review

Finite population inference is a central goal in survey sampling. Probability sampling is the main statistical approach to finite population inference. Challenges arise due to high cost and increasing non-response rates. Data integration provides a timely solution by leveraging multiple data sources to provide more robust and efficient inference than using any single data source alone. The technique for data integration varies depending on types of samples and available information to be combined. This article provides a systematic review of data integration techniques for combining probability samples, probability and non-probability samples, and probability and big data samples. We discuss a wide range of integration methods such as generalized least squares, calibration weighting, inverse probability weighting, mass imputation and doubly robust methods. Finally, we highlight important questions for future research.

1 Introduction a wide range of integration methods including calibration weighting, inverse probability weighting, mass imputation and doubly robust methods.
We then consider data integration methods for combining probability and big nonprobability samples. Depending on the roles in statistical inference, there are two types of big data: one with large sample sizes (large n) and the other with rich covariates (large p).
In the first type, the non-probability sample can be large in sample size. How to leverage the rich information in the big data to improve the finite population inference is an important research. In the second type, there are a large number of variables. There is a large literature on variable selection methods for prediction, but little work on variable selection for data integration that can successfully recognize the strengths and the limitations of each data source and utilize all information captured for finite population inference. Section 4 presents robust data integration and variable selection methods in this context. To summarize, Section 5 describes the direction of future research along the line of data integration including sensitivity analysis to assess the robustness of study conclusions to unverifiable assumptions, hierarchical modeling, and some cautionary remarks.
2 Combining probability samples

Multiple probability samples and missingness patterns
Combining two or more independent survey probability samples is a problem frequently encountered in the practice of survey sampling. For simplicity of exposition, let U = {1, . . . , N} be the index set of N units for the finite population, with N being the known population size. Let (x T i , y i ) T be the realized value of a vector of random variables (X T , Y ) T for unit i. Let I i be the sample indicator such that I i = 1 indicates the selection of unit i into the sample and I i = 0 otherwise. The probability π i = P (I i = 1 | i ∈ U) is called the first-order inclusion probability and is known by the sampling design. The design weight is d i = π −1 i . The joint probability π ij = P (I i I j = 1 | i, j ∈ U) is called the secondorder inclusion probability and is often used for variance estimation of the design-weighted estimator. The sample size is n = N i=1 I i . The main advantage of probability sampling is to ensure design-based inference. For example, the Horvitz-Thompson (HT) estimator of the population mean of y, denoted by µ y , is µ HT = N −1 i:I i =1 π −1 i y i , and the design-variance estimator is (π ij − π i π j ) π ij y i π i y j π j .
We consider multiple sources of probability data. For multiple datasets, we use the subscript letter to indicate the respective sample; for example, we use d A,i as the design weight of unit i in sample A.
Depending on the available information from multiple data sources, each sample has planned missingness by design. As illustrated in Table 1, the combined sample exhibits different missingness patterns: monotone and non-monotone. For monotone missingness, our framework covers two common types of studies. First, we have a large main dataset, and then collect more information on important variables for a subset of units, e.g., using a two-phase sampling design (Neyman;1938;Cochran;Wang et al.;2009). Consider the U.S. Census of housing and population as an example. The short form consists of 100% sample, for which basic demographic information was obtained. The long form consists about 16% sample, for which other social and economic information as well as demographic information were obtained. Deming and Stephan (1940) considered this setup as a classical two-phase sampling problem and use calibration weighting for demographic variable to match the known population counts from the short form.
Second, we have a smaller and carefully designed validation dataset with rich covariates, and then link it to a larger main dataset with fewer covariates. The setup of two independent samples with common items is often called non-nested two-phase sampling. Consider the U.S. consumer expenditure survey as an example. Two independent samples were selected from the same finite population, including a diary survey sample, referred to as sample A, and a face-to-face survey sample, referred to as sample B. In sample A, observe auxiliary information X and outcome Y ; whereas in sample B, observe common auxiliary information X. Zieschang (1990) considered using sample weighting to estimate detailed expenditure and income items combining sample A and sample B. Another example is the Canadian Survey of Employment, Payrolls and Hours considered by Hidiroglou (2001). Sample A is a small sample from Statistics Canada Business Register, in which the study variables Y , number of hours worked by employees and summarized earnings, were observed. Sample B is a large sample drawn from a Canadian Customs and Revenue Agency administrative data, in which auxiliary variables X were observed.
Finally, we will consider combining two independent surveys with non-monotone missing patterns. Statistical matching technique will be introduced in Section 2.3 as a general statistical tool under this setup. Table 1: Missingness patterns in the combined samples: " " means "is measured" d is the sampling weight, where the subscript indicates the sample, X is the vector of auxiliary variables, Y , Y 1 and Y 2 are scalar outcome variables.

Two approaches for probability data integration
We classify probability data integration methods based on the level of information to be combined: a macro approach and a micro approach. In the macro approach, we obtain summary information such as the point and variance estimates from multiple data sources and combine those to obtain a more efficient estimator of the parameter of interest. In the micro approach, we create a single synthetic data that contains all available information from all data sources.
2.2.1 Macro approach: Generalized least squares (GLS) estimation Renssen and Nieuwenbroek (1997), Hidiroglou (2001), Merkouris (2004), Wu (2004), Ybarra and Lohr (2008) and Merkouris (2010) considered the problem of combining data from two independent probability samples to estimate totals at the population and domain levels. Merkouris (2004) and Merkouris (2010) provided a rigorous treatment of the survey integration through the generalized method of moments. We focus on the monotone missingness pattern. The same discussion applies to the other patterns. From each probability sample, we obtain different estimators for the means of common items. The GLS approach combines those estimates as an optimal estimator. Let µ x,A and µ x,B be unbiased estimators of µ x from sample A and sample B, respectively. Let µ B be an unbiased estimator of µ y from sample B.
To combine the multiple estimates, we may build a linear model of three estimates with two parameters as follows: where (e 1 , e 2 , e 3 ) T has mean (0, 0, 0) T , variance-covariance and var(·) and cov(·) are the variance and covariance induced by the sampling probability.
Based on model (1), treat ( µ x,A , µ x,B , µ B ) as observations and define a sum of squared error term The optimal estimator of (µ x , µ y ) that minimizes Q(µ x , µ y ) is and .
To see the efficiency gain of µ GLS over µ B , using (2), we express which is not larger than var( µ B ). The GLS estimator for non-monotone missingness can be constructed similarly. See Fuller and Breidt (1999) for an application in the National Resource Inventory.

Micro approach: mass imputation
Mass imputation (also called synthetic data imputation) is a technique of creating imputed values for items not observed in the current survey by incorporating information from other surveys. Breidt et al. (1996) discussed mass imputation for two-phase sampling. Rivers (2007) proposed a mass imputation approach using nearest neighbor imputation but the theory is not fully developed. Schenker and Raghunathan (2007) reported several applications of synthetic data imputation, using a model-based method to estimate totals and other parameters associated with variables not observed in a larger survey but observed in a much smaller survey. Legg and Fuller (2009) and Kim and Rao (2012) developed synthetic imputation approaches to combining two surveys. Chipperfield et al. (2012) discussed composite estimation when one of the surveys is mass imputed. Bethlehem (2016) discussed practical issues in sample matching for mass imputation.
The primary goal is to create a single synthetic dataset of proxy values y i for the unobserved y i in sample B and then use the proxy data together with the associated design weights of sample A to produce projection estimators of the population mean µ y . This is particularly useful when sample B is a large scale survey and item Y is very expensive to measure. The proxy values y i are generated by first fitting a working model relating Y to X, E(Y | X) = m(X; β 0 ) based on the data {(x i , y i ) : i ∈ A} from sample A. Then the synthetic values of Y can be created by y i = m(x i ; β) for i ∈ B. Thus, sample A is used as a training sample for predicting Y in sample B. The mass imputation estimator of µ y is µ I = N −1 i∈B d B,i y i . Kim and Rao (2012) showed that µ I is asymptotically design-unbiased if β satisfies With (4), and var( µ I ) = var( P B ) + var( Q A ).
The asymptotic unbiasedness holds regardless of whether the regression model is true or not. However, a good regression model will reduce the variance of µ I . For variance estimation, either linearization or replication-based sampling (Kim and Rao; can be used.

Mass imputation with non-monotone missingness
For non-monotone missingness, the mass imputation method of Kim and Rao (2012) is not directly applicable as the sample with partial observations may contain additional information for parameter estimation. Often, one can consider a joint model of all variables and use the EM algorithm to estimate the model parameters. The joint model deduces the conditional distribution of the missing variables given the observed values for imputation.
For illustration, consider the non-monotone missingness I structure in Table 1. The goal is to develop mass imputation for both Y 2 in sample B and Y 1 in sample C. It is attempting to specify the conditional distribution of Y 2 given (X, Y 1 ) to impute Y 2 in sample B and the conditional distribution of Y 1 given (X, Y 2 ) to impute Y 1 in sample C. However, this approach may result in model incompatiability. That is, there does not exist a joint model of (Y 1 , Y 2 ) given X that leads to the corresponding conditional distributions. To avoid model incompatibility, we use a joint model for (Y 1 , Y 2 ) given X for prediction though specifying the sequential conditional distribution where θ = (θ T 1 , θ T 2 ) T , θ 1 and θ 2 are unknown parameters. For parameter estimation, it suffices to use observations in sample A; however, this approach ignores the partial information in sample B and sample C and therefore is not efficient. Let the joint set of sampling indexes be S = A ∪ B ∪ C. Assuming no overlap between the samples, we define and let d i be the design weight for unit i ∈ S without specifying which sample it belongs to. That is, To incorporate all available information, the EM algorithm can be used as follows.
[E-step] Let θ (t) be the parameter estimate at iteration t. Compute the conditional expec-tation of the pseudo log-likelihood functions: where y i,obs is the observed part of (y 1i , y 2i ).
The E-step and M-step can be iteratively computed until convergence, leading to the pseudo maximum likelihood estimator θ. Given θ, mass imputation can be done for both Y 2 in sample B and Y 1 in sample C. The imputation model for To generate imputed values from (6), one may use Markov Chain Monte Carlo methods or the parametric fractional imputation of Kim (2011). We now consider the non-monotone missingness II structure in Table 1. Sample A and sample B are probability samples were selected from the same finite population. In sample A, observe (X, Y 1 ) and in sample B, observe (X, Y 2 ). The question of interest is the associational relationship of Y 1 and (X, Y 2 ). If (X, Y 1 , Y 2 ) were jointly observed, one can fit a simple regression model of Y 2 on (X, Y 2 ). However, based on the available data, Y 1 and Y 2 not available simultaneously.
This problem fits into the statistical matching framework (D'Orazio et al.;. In statistical matching, the goal is to create Y 1 for each unit in sample B by finding a "statistical twin" from the sample A. Typically, one assumes the conditional independence assumption that Y 1 and Y 2 are conditionally independent given X, or equivalently, Then, the "statistical twin" is solely determined by "how close" they are in terms of X's. However, in a regression model of Y 1 on (X, Y 2 ), (7) sets the regression coefficient associated with Y 2 to be zero a priori, which is contrary to the study question of interest.
For a joint modeling of (X, Y 1 , Y 2 ) without assuming (7), identification is an important issue. Consider the following joint model of (Y 1 , Y 2 ) given X, where cov(e 1 , e 2 ) = 0.
In general, adding non-linear terms in the parametric assumption can help achieve identification. For example, consider adding X 2 in (8) such that Again, (α 0 , α 1 , α 2 ) is identifiable from sample A. Coupling (9) and (10) leads to Thus, β 0 + α 0 β 2 , β 1 + α 1 β 2 and α 2 β 2 are identifiable from sample B. As long as α 2 = 0, (β 0 , β 1 , β 2 ) is then identifiable. For an identifiable model, parameter estimation can be implemented either using the EM algorithm or GLS. Other assumptions can be invoked to achieve model identification. Kim et al. (2016) used an instrumental variable assumption for model identification and develop fractional imputation methods for statistical matching. Park et al. (2016) presented an application of the statistical matching technique using fractional imputation in the context of handling mixed-mode surveys. Park et al. (2017) applied the method to combine two surveys with measurement errors.
3 Combining probability and non-probability samples

Combining a probability sample with a non-probability sample
Statistical analysis of non-probability survey samples faces many challenges as documented by Baker et al. (2013). Non-probability samples have unknown selection/inclusion mechanisms and are typically biased, and they do not represent the target population. A popular framework in dealing with the biased non-probability samples is to assume that auxiliary variable information on the same population is available from an existing probability survey sample. This framework was first used by Rivers (2007) and followed by a number of other authors including Vavreck and Rivers (2008), Lee and Valliant (2009a), Valliant and Dever (2011a), Brick (2015), Elliott and Valliant (2017) and Chen, Li and Wu (2018), among others. Combining the up-to-date information from a non-probability sample and auxiliary information from a probability sample can be viewed as data integration, which is an emerging area of research in survey sampling (Lohr and Raghunathan;. Data integration for finite population inference is similar to the problem of combining randomized experiments and non-randomized real-world evidence studies for causal inference of treatment effects (Keiding and Louis;2016). In randomized clinical trial, the treatment assignment mechanism is known and therefore treatment effect evaluation based on randomized clinical trial is unconfounded. However, due to restrictive inclusion and exclusion criteria, the trial sample may be narrowly defined and can not represent the real-world patient population. On the other hand, by the real-world data collection mechanism, the real-world evidence study is often representative of the target population. Combining trial and real-world evidence studies can achieve more robust and efficient inference of treatment effect for a target patient population. Table 2 draws a parallel comparison of data sources between data integration in survey sampling and that in treatment effect evaluation. Real world evidence study Non-probability sample Randomized experiment * In survey sampling, some probability samples may not observe the study variable of interest; for treatment effect evaluation, randomized experiments provide unbiased estimation of treatment effect due to treatment randomization.
Survey statisticians and biostatisticians have provided different methods for combining information from multiple data sources. Lohr and Raghunathan (2017) and Rao (2020) provided comprehensive reviews of statistical methods for finite population inference. Existing methods for data integration of a probability sample and a non-probability sample can be categorized into three types as follows. The first type is the so-called propensity score adjustment (Rosenbaum and Rubin;1983). In this approach, the probability of a unit being selected into the non-probability sample, which is referred to as the propensity or sampling score, is modeled and estimated for all units in the non-probability sample. The subsequent adjustments, such as propensity score weighting or stratification, can then be used to adjust for selection biases; see, e.g., Lee and Valliant (2009b); Valliant and Dever (2011b); Elliott and Valliant (2017) and Chen, Li and Wu (2018). Stuart et al. (2011Stuart et al. ( , 2015 and Buchanan et al. (2018) used propensity score weighting to generalize results from randomized trials to a target population. O'Muircheartaigh and Hedges (2014) proposed propensity score stratification for analyzing a non-randomized social experiment. One notable disadvantage of the propensity score methods is that they rely on an explicit propensity score model and are biased and highly variable if the model is misspecified (Kang and Schafer;. The second type uses calibration weighting (Deville and Särndal;1992;Kott;. This technique calibrates auxiliary information in the non-probability sample with that in the probability sample, so that after calibration the weighted distribution of the non-probability sample is similar to that of the target population (DiSogra et al.;2011). The third type is mass imputation, which imputes the missing values for all units in the probability sample.
In the usual imputation for missing data analysis, the respondents in the sample constitute a training dataset for developing an imputation model. In the mass imputation, an independent non-probability sample is used as a training dataset, and imputation is applied to all units in the probability sample; see, e.g., Breidt et al. (1996); Rivers (2007); Kim and Rao (2012); Chipperfield et al. (2012); Bethlehem (2016); and Yang and Kim (2018).

Setup and assumptions
Non-probability samples become increasingly popular in survey statistics but may suffer from selection bias that limits the generalizability of results to the target population. We consider integrating a non-probability sample with a carefully designed probability sample which provides the representative covariate information of the target population. Let X ∈ R p be a vector of auxiliary variables (including an intercept) that are available from two data sources, and let Y ∈ R be the study variable of interest. We consider combining a probability sample with X, referred to as sample A, and a non-probability sample with (X, Y ), referred to as sample B, to estimate µ y the population mean of Y . We focus on the case when the study variable Y is observed in sample B only, but the other auxiliary variables are commonly observed in both data. Although the big data source has a large sample size, the sampling mechanism is often unknown, and we cannot compute the first-order inclusion probability for Horvitz-Thompson estimation. The naive estimators without adjusting for the sampling process are subject to selection biases, as illustrated in Table 3. On the other hand, although the probability sample with sampling weights represents the finite population, it does not observe the study variable. The complementary features of probability samples and non-probability samples raise the question of whether it is possible to develop data integration methods that leverage the advantages of both sources.  S Y is the population variance of Y .
Because the sampling mechanism of a non-probability sample is unknown, the target population quantity is not identifiable in general. Unlike the previous case in Section 2, the sampling mechanism of sample B is unknown and therefore µ y is not identifiable in general. Let f (Y | X) be the conditional distribution of Y given X in the superpopulation model ζ that generates the finite population. We make the following assumption.
Assumption 1 (i) The sampling indicator I B of sample B and the study variable Y is independent given X; i.e. P (I B = 1 | X, Y ) = P (I B = 1 | X), referred to as the sampling score π B (X); and (ii) π B (X) > 0 for all X.
Assumption 1 (i) and (ii) constitute the strong ignorability condition (Rosenbaum and Rubin;1983). This assumption holds if the set of covariates contains all predictors for the outcome that affect the possibility of being selected in sample B. This setup has previously been used by several authors; see, e.g., Rivers (2007) and Vavreck and Rivers (2008). Assumption 1 (i) states the ignorability of the selection mechanism to sample B conditional upon the co-

Propensity score weighting
Under Assumption (i) and (ii), we can build a model for π B (X) = P (I B = 1 | X) and use it to adjust for the selection bias in sample B. In practice, the propensity score function π B (X) is unknown and needs to be estimated from the data. Let π B (X; α) be the posited models for π B (X), where α is the unknown parameter. Several authors have proposed different estimation strategies. For example, α can be obtained by a weighted regression of I B,i on x i combining sample A and sample B (I B,i = 0 for i ∈ A and I B,i = 1 for i ∈ B), weighted by the design weights from sample A, which is valid if the size of sample B is relatively small 2011b). Chen, Li and Wu (2018) proposed estimating α by solving which is a sample version of the population estimating equation S(α) = i∈U {I B,i − π(x i ; α)} x i = 0. Instead of using (11), one can also use which is closely related to the calibration weighting approach for nonresponse adjustment. Given α, the inverse probability of sampling weighting estimator of µ y is Variance estimation of µ IPW can be obtained by the standard M-estimation theory. One of the notable disadvantages of the propensity score methods is that they rely on an explicit propensity score model and are biased if the model is mis-specified (Kang and Schafer;. Moreover, if the estimated propensity score is close to zero, µ IPW will be highly unstable.

Calibration weighting
The second weighting strategy is calibration weighting, or bench marking weighting (Deville and Särndal;1992;Kott;. This technique can be used to calibrate auxiliary information in the non-probability sample with that in the probability sample, so that after calibration the non-probability sample is similar to the target population (DiSogra et al.; 2011).
Instead of estimating the propensity score model and inverting the propensity score to correct for the selection bias of the non-probability sample, the calibration strategy estimates the weights directly. Toward this end, we assign a weight ω B,i to each unit i in the sample B so that where i∈A d A,i x i is a design-weighted estimate of the population total of X from the probability sample. Constraint (13) is referred to as the covariate balancing constraint (Imai and Ratkovic;2014), and weights Q B = {ω B,i : i ∈ B} are the calibration weights. The balancing constraint calibrates the covariate distribution of the non-probability sample to the target population in terms of X. Instead of calibrating each X, one can calibrate model-based calibration (McConville et al.;. In this approach, one can posit a parametric model for E(Y | X) = m(X; β) and estimate the unknown parameter β based on sample B. The model-based calibration specifies the constraints for Q B as We estimate Q B by solving the following optimization problem: subject to ω B,i ≥ 0, for all i ∈ B; i∈B ω B,i = N, and the balancing constraint (13) or (14). The objective function in (15) is the entropy of the calibration weights; thus, minimizing this criteria ensures that the empirical distribution of calibration weights are not too far away from the uniform, such that it minimizes the variability due to heterogeneous weights.
This optimization problem can be solved using convex optimization with Lagrange multiplier. Other objective functions, such as L(Q B ) = i∈B ω 2 B,i , can also be considered. This optimization problem can be solved using convex optimization with Lagrange multiplier. By introducing Lagrange multiplier λ, the objective function becomes Thus by minimizing (16), the estimated weights are and λ solves the equation which is the dual problem to the optimization problem (15).
The calibration weighting estimator is Variance estimation of µ cal can be obtained by the standard M-estimation theory by treating λ as the nuisance parameter and (17) as the corresponding estimating equation.
The justification for µ cal subject to constraint (13) relies on the linearity of the outcome model, i.e., m(X) = X T β * for some β * , or the linearity of the inverse probability of sampling weight, i.e., {π B (X)} −1 = X T α * for some α * (Fuller; 2009; Theorem 5.1). The linearity conditions are unlikely to hold for non-continuous variables. In these cases, µ cal may be biased. The justification for µ cal subject to constraint (14) relies on a correct specification of m(X; β) in the data integration problem. Chan et al. (2016) generalize this idea further to develop a general calibration weighting method that satisfies the covariance balancing property with increasing dimensions of the control variables m(x). Zhao (2019) developed a unified approach of covariate balancing PS method using Tailored loss functions. The regularization techniques using penalty terms into the loss function can be naturally incorporated into the framework and machine learning methods, such as boosting, can be used. The covariate balancing condition, or calibration condition, in (13), can be relaxed. Zubizarreta (2015) relaxed the exact balancing constraints to some tolerance level. Wong et al. (2019) used the theory of reproducing Kernel Hilbert space to develop an uniform approximate balance for covariate functions.

Mass imputation approach
The third type is mass imputation, where the imputed values are created for the whole elements in the probability sample. In the usual imputation for missing data analysis, the respondents in the sample provide a training dataset for developing an imputation model. In the mass imputation, an independent big data sample is used as a training dataset, and imputation is applied to all units in the probability sample. While the mass imputation idea for incorporating information from big data is very natural, the literature on mass imputation itself is very sparse.
In a parametric approach, let m(X; β) be the posited model for m(X), where β ∈ R p is the unknown parameter. Under Assumption 1, β can be obtained by fitting the model to sample B. We assume that β is the unique solution to for some p-dimensional vector h(x i ; β). Thus, we use the observations in sample B to obtain β and use it to construct y i = m(x i ; β) for all i ∈ A.
Under some regularity conditions, the mass imputation estimator where β 0 is the true value of β andṁ(x; β) = ∂m(x; β)/∂β. Also, where e i = y i − m(x i ; β 0 ). The justification for µ I relies on a correct specification of m(X; β) and the consistency of β. If m(X; β) is misspecified or β is inconsistent, µ I can be biased. For variance estimation, either linearization method or bootstrap method can be used. See  for more details.

Doubly robust estimation
To improve the robustness against model misspecification, one can consider combining the weighting and imputation approaches (Kim and Wang;. The doubly robust estimator employs both the propensity score and the outcome models, which is given by The estimator µ dr is doubly robust in the sense that it is consistent if either the propensity score model or the outcome model is correctly specified, not necessarily both. Moreover, it is locally efficient if both models are correctly specified (Bang and Robins;2005;Cao et al.;2009). Let µ HT = N −1 i∈A d A,i y i be the Horvitz-Thompson estimator that could be used if y i were observed in sample A. We express µ dr − µ HT = − i∈A d A,i e i + i∈B {π B (x i ; α)} −1 e i where e i = y i − y i . To show the double robustness of µ dr , we consider two scenarios. In the first scenario, if π B (X; α) is correctly specified, then which is design-unbiased of zero. In the second scenario, if m(X; β) is correctly specified, then E( e i ) ∼ = 0. In both cases, µ dr − µ HT is unbiased of zero and therefore µ dr is unbiased of µ y .
If either π B (X T α) or m(X T β) is correctly specified, To estimate V 1 , we can use the design-based variance estimator applied to m(X T i β) as To estimate V 2 , we further express V 2 as Let σ 2 (X T i β * ) = E {Y i − m(X T i β * )} 2 , and let σ 2 (X i ) be a consistent estimator of σ 2 (X T i β * ).
We can then estimate V 2 by By the law of large numbers, V 2 is consistent for V 2 regardless of whether one of π B,i (X T i α) or π B,i (X T i β) is misspecified, and therefore it is doubly robust.
4 Combining probability and big data

Big data sample
To meet the new challenges in the probability sampling, statistical offices face the increasing pressure to utilize convenient but often uncontrolled big data sources, such as satellite information (McRoberts et al.;, mobile sensor data (Palmer et al.;, and web survey panels . Couper (2013), Citro (2014), Tam and Clarke (2015), and Pfeffermann et al. (2015) articulated the promise of harnessing big data for official and survey statistics but also raised many issues regarding big data sources. While such data sources provide timely data for a large number of variables and population elements, they are non-probability samples and often fail to represent the target population of interest because of inherent selection biases. Tam and Kim (2018) also covered some ethical challenges of big data for official statisticians and discuss some preliminary methods of correcting for selection bias in big data. Combining information from several sources to improve estimates for population parameters is an important practical problem in survey sampling. In the past decade, more and more auxiliary information became available, including large administrative record datasets and remote sensing data derived from satellite images. How to combine such information with survey data to provide better estimates for population parameters is a new challenge that survey statisticians face today. Tam and Clarke (2015) presented an overview of some initiatives of big data applications in official statistics of the Australian Bureau of Statistics. Such big data are becoming increasingly popular and they come from a variety of sources such as remote sensing data, administrative data such as tax data, so on. Suppose that there are two data sources, one from a probability sample, referred to as sample A, and the other from a big data source, referred to as sample B. Table 4 illustrates the observed data structure. 4.2 Scenario 1: leverage auxiliary information in big data to improve efficiency In Scenario 1, the probability sample contains Y observations. Therefore, µ y is identifiable and can be estimated by the commonly-used estimator solely from sample A, denoted by µ A . We can leverage the X information in the big data sample to improve the sample A estimator. We consider the case when additionally the membership to the big data can be determined throughout the probability sample. The key insight is that the subsample of units in sample A with the big data membership constitutes a second-phase sample from the big data sample, which acts as a new population. We calibrate the information in the second-phase sample to be the same as the new acting population. The calibration process in turn improves the accuracy of the mass imputation estimator without specifying any model assumptions. Let h = (I B , 1 − I B , I B X). Following Yang and Ding (2018), we can consider a class of estimators satisfying Heuristically, if (23) holds exactly rather than asymptotically, by the multivariate normal theory, we have the following the conditional distribution Let V yy,A , Γ and V be consistent estimators for V yy,A , Γ and V . We set n 1/2 A ( µ A − µ y ) to equal its estimated conditional mean n 1/2 , leading to an estimating equation for µ y : n 1/2 Solving this equation for µ y , we obtain the estimator Under certain regularity conditions, if (23) holds, then Y is consistent forȲ N , and in distribution, as n A → ∞. Given a nonzero Γ, the asymptotic variance, V yy,A − Γ T V −1 Γ, is smaller than the asymptotic variance of µ A , V yy,A .
The asymptotic variance of µ can be estimated by Kim and Tam (2018) also explored similar ideas. They develop a calibration weighting method to incorporate the big data auxiliary information and apply the method to the official statistics in Australian Bureau of Statistics. In this application, the big data is the Australian Agricultural Census with 85% response rate and the probability sample is the Rural Environment and Agricultural Commodities Survey used for calibration. In this application, the measurement from Census data is the auxiliary variable used for calibration.

Scenario 2: leverage probability sampling designs to correct for selection bias
In Scenario 2, we have a similar setup as in Section 3. Depending on the roles in statistical inference, there are two types of big data: one with large sample sizes (large n) and the other with rich covariates (large p). We review methods for the two types of big data.

Robust mass imputation estimation
In the first type, the non-probability sample can be large in sample size. How to leverage the rich information in the big data to improve the finite population inference is an important research. We review robust mass imputation methods.
When the sample size of the big data is large, mass imputation is more desirable. In mass imputation, we can train a predictive model from the big data and impute the missing y i in sample A. Instead of a parametric approach, we can also consider nonparametric approaches. To find suitable imputed values, we consider nearest neighbor imputation; that is, find the closest matching unit from sample B based on the X values and use the corresponding Y value from this unit as the imputed value.
Using sample B (big data) as a training data, find the nearest neighbor of each unit i ∈ A using a distance measure d(x i , x j ). Let i(1) be the index of its nearest neighbor, which satisfies The nearest neighbor imputation estimator of µ is Yang and  showed that under some regularity conditions, µ nni has the same asymptotic distribution as µ HT = N −1 i∈A d A,i y i . Therefore, the variance of µ nni is the same as the variance of µ HT . This implies that the standard point estimator can be applied to the imputed data{(x i , y i(1) ) : i ∈ A} as if the y i(1) 's were observed values. Let π A,ij be the joint inclusion probability for units i and j. They showed that the direct variable estimator based on the imputed data Yang and Kim (2018) also considered two strategies for improving the nearest neighbor imputation estimator, one using K-nearest neighbor imputation (Mack and Rosenblatt;1979) and the other using generalized additive models (Wood;. In K-nearest neighbor imputation, instead of using one nearest neighbor, they identify multiple nearest neighbors in the big data sample and use the average response as the imputed value. This method is popular in the international forest inventory community for combining ground-based observations with imagines from remote sensors (McRoberts et al.;. In the second strategy, they investigated modern techniques of prediction for mass imputation with flexible models. They used generalized additive models (Wood; to learn the relationship of the outcome and covariates from the big data and create predictions for the probability samples. We note that this strategy can apply to a wider class of semi-and non-parametric estimators such as single index models, Lasso estimators (Belloni et al.;, and machine learning methods such as random forests (Breiman et al.;1984).

Variable selection in the presence of a large number of covariates
In the second type, when there are a large number of variables, there is a large literature on variable selection methods for prediction, but little work on variable selection for data integration that can successfully recognize the strengths and the limitations of each data source and utilize all information captured for finite population inference.
In practice, subject matter experts recommend a rich set of potentially useful variables but typically will not identify the set of variables to adjust for. In the presence of a large number of auxiliary variables, variable selection is important, because existing methods may become unstable or even infeasible, and irrelevant auxiliary variables can introduce a large variability in estimation. Gao and Carroll (2017) proposed a pseudo-likelihood approach for combining multiple non-survey data with high dimensionality; this approach requires all likelihoods be correctly specified and therefore is sensitive to model misspecification. Chen, Valliant and Elliott (2018) proposed a model-based calibration approach using LASSO; this approach relies on a correctly specified outcome model. Yang et al. (2019) proposed a doubly robust variable selection and estimation strategy.
In the first step, it selects a set of variables that are important predictors of either the sampling score or the outcome model using penalized estimating equations. In the second step, it re-estimates the nuisance parameter (α, β) based on the joint set of covariates selected from the first step and considers a doubly robust estimator of µ, µ dr ( α, β) in (19), where the estimating functions are J(α, β) = J 1 (α, β) Importantly, the two-step estimator allows model misspecification of either the sampling score or the outcome model. In the existing high-dimensional causal inference literature, the doubly robust estimators have been shown to be robust to selection errors using penalization (Farrell; or approximation errors using machine learning . However, this double robustness feature requires both nuisance models to be correctly specified. Using (27) relaxes this requirement by allowing one of the nuisance models to be misspecified. This also enables one to construct a simple and consistent variance estimator (20)+(22) allowing for doubly robust inferences.

Concluding remarks
Data integration is an emerging area of research with many potential research topics. We have reviewed statistical techniques and applications for data integration in survey sampling context. Probability sampling remains as the gold standard to obtain a representative sample, but the measurement of the study variable can be obtained from an independent non-probability sample or big data. In this case, assumptions about the sampling model or the outcome model are required. Most data integration methods are based on the unverifiable assumption that the sampling mechanism for the non-probability sample (or big data) is non-informative (corresponding to the missingness at random in the missing data literature).
If the sampling mechanism is informative, imputation techniques can be developed under the strong model assumptions for the sampling mechanism (e.g., Riddles et al.;2016;Morikawa and Kim;. Like the non-informative sampling case, the informative sampling assumption is unverifiable. In such settings, sensitivity analysis is recommended to assess the robustness of the study conclusions to unverifiable assumptions. This recommendation echoes Recommendation 15 of the National Research Council (NRC) report entitled "The Prevention and Treatment of Missing Data in Clinical Trials" (National Research Council;. Chapter 5 of the NRC Report describes "global" sensitivity analysis procedures that rigorously evaluate the robustness of study findings to untestable assumptions about how missingness might be related to the unobserved outcome. When the training dataset has a hierarchical structure, multi-level or hierarchical models can be used to develop mass imputation. This is closely related to unit-level small area estimation in survey sampling (Rao and Molina;. The small area estimation is particularly promising when we apply data integration using big data. That is, when we use big data as a training sample for prediction, the multi-level model can be used to reflect the possible correlation structure among observations. The parameter estimates for the multi-level model computed from the big data can be used for predicting unobserved study variables in the survey sample if the same multi-level model can be made. Further research in this direction, including the mean squared error estimation for this small area estimation, will be a topic of future research. Finally, the uncertainty due to errors in record linkage and statistical matching is also an important problem. The matched sample using recond linkage techniques (Fellegi and Sunter;1969) is subject to linakge errors. Zhang and Chambers (2019) covers several research topics in the statistical analysis of combined or fused data.