Using machine learning to impute legal status of immigrants in the National Health Interview Survey

We describe a novel machine learning method of imputing legal status for immigrants using nationally representative survey data from the Survey of Income and Program Participation (SIPP) and the National Health Interview Survey (NHIS). We present the K-nearest Neighbor (KNN) classifier and the Random Forest (RF) Algorithm as novel machine learning imputation methods and compare them to established regression-based imputation. After validating the imputation methods using sensitivity, specificity, positive predictive value (PPV) and accuracy statistics, we found that the Random Forest Algorithm was more accurate in identifying undocumented immigrants and minimized bias in both the socio-demographic variables included in the imputation and unobserved health variables, relative to regression-based imputation and KNN.
• We developed a new machine learning method of imputing legal status for immigrants that can be used with nationally representative, publicly available data.
• Our findings indicate that using machine learning to impute legal status of immigrants, specifically the Random Forest Algorithm, was more accurate in identifying undocumented immigrants and minimized bias relative to other imputation methods.


Method details
No national health survey in the U.S. captures information on the legal status of foreign-born respondents. In the absence of direct measurement, researchers studying the undocumented population have relied on proxy measures and sub-national data sources [15]. One way to derive quantitative evidence on the undocumented U.S. population that has been underutilized in health research is legal status imputation. In this paper we introduce a novel approach to conducting imputation in the National Health Interview Survey (NHIS). The paper is divided into three parts. The first part provides an overview of legal status imputation methods and challenges. Next, we present a novel machine learning-based imputation approach and evaluate its performance under the sub-optimal conditions imposed by the available national health data. We accomplished this by running multiple simulations in the Survey of Income and Program Participation (SIPP). Finally, we demonstrate how the machine learning method can be applied to the NHIS and present data from this imputation on the socio-demographic composition of the undocumented population.

Imputation methods for legal status
Rather than only using information that is given in a survey such as Green Card status to derive a measure of legal status, imputation approaches use information about the undocumented population that is external to the "target sample" to predict respondents' legal status. This external information typically takes the form of a "donor sample", which includes either a direct or a reliable proxy measure of legal status but lacks the size or variables of interest that are included in the target sample [13] . Rather than explicitly matching observations in the donor and target samples, which is usually not possible as respondents likely differ between the two and are anonymized, the imputation methods described in this paper predict which respondents in the target sample are most likely undocumented, based on the population characteristics derived from the donor sample.
This approach to legal status imputation allows researchers more freedom to choose a target sample that is best suited to their research question and allows for inference at a national level, but this freedom comes at a cost. Even the most sophisticated imputation approach will lack the accuracy of a good proxy or direct measure of legal status. If respondents incorrectly classified as undocumented differ systematically from the truly undocumented, legal status imputation increases the risk of introducing bias into subsequent analyses. As Van Hook and colleagues argue, this risk is particularly high if the donor sample does not include the outcome variable of interest (joint observation condition) or the donor and target sample are not derived from the same universe (same-universe condition) [13]. In our case, the outcome of interest was the multi-dimensional health status of the U.S. undocumented population. The non-existence of a national health survey that captures respondents' legal status makes the violation of at least one of these conditions inevitable. Therefore, we chose an imputation approach that minimized the risk of introducing systematic bias that could lead to incorrect estimates of the health of the undocumented population.
This risk of bias and the computational challenges have likely contributed to the limited use of legal status imputation in health research. Demographers, on the other hand, have long used imputation methods to derive information about the socio-demographic characteristics of the undocumented population. The most commonly cited source for information on the size and make-up of the undocumented population in the U.S. is the Pew Research Center, which uses information from a number of public and administrative datasets (i.e., multiple donor samples) to impute legal status in their target samples: the American Community Survey (ACS) and the Current Population Survey (CPS) [9]. Similarly, the Migration Policy Institute uses the ACS as the target sample of their analysis of the U.S. undocumented population and the SIPP as the donor sample for their legal status imputation [6]. In a rare example of legal status imputation in health research, Wilson et al. (2020) used the Los Angeles Family and Neighborhood Survey (LAFANS) as the donor sample to impute legal status in the target sample, the Medical Expenditure Panel Survey (MEPS).
Building on well-established methods used in the field of demography, this paper will explore a new method of legal status imputation for research on the health of the U.S. undocumented population. NHIS is the nation's largest health survey, making it an ideal target sample for health-focused legal status imputation. NHIS provides the size and scope necessary to study the diverse yet small population of undocumented immigrants by interviewing roughly 35,000 households every year using a nation-wide stratified sampling strategy.
We focus on the risk of bias introduced by the violation of the joint observation condition. To date, no national survey elicits both legal status as well as detailed information on health outcomes and healthcare access. The use of legal status imputation for health research will inevitably violate the joint observation condition. Therefore, it is critical to identify an imputation method that minimizes bias under the suboptimal conditions imposed by publicly available data.
Evaluating the quality of an imputation method requires a data source capable of measuring an immigrant's true legal status. The second wave of the 2008 Survey of Income and Program Participation (SIPP 2008) provides a commonly used proxy for adult immigrants' legal status by asking whether foreign-born respondents entered the U.S. as Lawful Permanent Residents (LPR) and whether their status has since been adjusted to LPR [13]. Following these common practices, any non-citizen SIPP respondent who entered the U.S. after 1981 without having, or since adjusting to, LPR status and without other indicators of legal status (see Logical Imputation below) will be treated as a truly undocumented immigrant for the purpose of this study. This binary legal status indicator will function as the target classification variable of the imputation approaches tested in this paper.
We focused on three factors to evaluate which imputation approach is best suited for imputing legal status in the NHIS. First, we evaluated the imputation method's ability to accurately assign undocumented status. Second, we evaluated the ability of the method to characterize the socio-demographic profile of the undocumented population on which the imputation is based. Third, we evaluated the method's ability to accurately assess the relationship between legal status and health-related characteristics that are not included in the donor sample, which simulates the violation of the joint-observation condition.
Borrowing from machine learning practices, the accuracy of legal status classification (other than for logical imputation) is evaluated using performance metrics based on cross-validation of the imputed and truly undocumented survey respondents. These metrics include the probability of a truly undocumented respondent being classified as undocumented (sensitivity), the probability of a truly documented individual being correctly classified as documented (specificity), the probability that a respondent classified as undocumented is actually undocumented (Positive Predictive Value), and the overall percentage of cases being correctly classified (accuracy).
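The four metrics can be computed directly from the counts of true and false positives and negatives. The following is a minimal sketch using the standard definitions, assuming legal status is coded 1 for undocumented and 0 for documented; the function name and the coding are illustrative assumptions, not taken from the original analysis.

```python
def classification_metrics(truth, predicted):
    """Compare true and imputed status (1 = undocumented, 0 = documented)."""
    pairs = list(zip(truth, predicted))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # undocumented, classified undocumented
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # documented, classified documented
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # documented, classified undocumented
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # undocumented, classified documented

    def ratio(numerator, denominator):
        # Return NaN instead of raising when a group is empty.
        return numerator / denominator if denominator else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),
        "accuracy": ratio(tp + tn, len(pairs)),
    }


# Hypothetical toy data: 3 truly undocumented and 7 truly documented respondents.
truth = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
imputed = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
metrics = classification_metrics(truth, imputed)
```

In this toy example two of the three undocumented respondents are found (sensitivity 2/3) and six of the seven documented respondents are retained (specificity 6/7).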
In a second step, we investigated whether the imputation methods lead to bias in estimating the relationship between undocumented status and health-related variables in the target sample. To capture both the domain of individual health, as well as healthcare access, we measured self-rated health (poor or fair) and private health insurance status as binary indicators. We tested for bias by calculating Pearson's correlation coefficients between undocumented status and the two binary health variables. Self-rated health is only asked in the fourth wave of the 2008 SIPP; therefore the correlation analysis is restricted to those foreign-born adult individuals that responded to both the second and fourth wave of the SIPP (N = 7998).
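As an illustration of this bias check, the sketch below computes Pearson's correlation coefficient for two binary indicators (for 0/1 variables this equals the phi coefficient) and compares the coefficient obtained with imputed status to the one obtained with true status. The variable names and toy values are hypothetical and only show the mechanics.

```python
import math


def pearson_r(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical toy data: 1 = undocumented / privately insured, 0 = otherwise.
true_status = [1, 1, 1, 0, 0, 0, 0, 0]
imputed_status = [1, 1, 0, 1, 0, 0, 0, 0]
private_insurance = [0, 0, 1, 0, 1, 1, 1, 1]

r_true = pearson_r(true_status, private_insurance)
r_imputed = pearson_r(imputed_status, private_insurance)
bias = r_imputed - r_true  # negative values mean the negative relationship is overstated
```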

Logical imputation
Logical imputation is arguably the simplest legal status imputation method applied in survey research, as it does not require the use of a secondary "donor" sample. Instead, the external information used to assign legal status is a list of individual characteristics that are mutually exclusive with undocumented status. In the specification of the logical imputation approach used here, this list includes citizenship, Medicare coverage, veterans and active-duty military status, and receipt of public assistance, supplemental or social security income. Any survey respondent reporting one or more of these characteristics was logically determined to be documented. The residual, those respondents who cannot be logically determined to be documented based on their survey responses, were classified as undocumented.
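The exclusion rule described above can be sketched as a simple filter. The field names below are hypothetical placeholders for the corresponding survey items, and each respondent is assumed to be a dictionary of boolean flags.

```python
# Hypothetical field names; each marks a characteristic treated as
# incompatible with undocumented status in the specification above.
DOCUMENTED_MARKERS = (
    "citizen",
    "medicare",
    "veteran_or_active_duty",
    "public_assistance",
    "ssi_or_social_security_income",
)


def logically_documented(respondent):
    """True if the respondent reports at least one exclusion characteristic."""
    return any(respondent.get(marker, False) for marker in DOCUMENTED_MARKERS)


def possibly_undocumented(respondents):
    """The residual group that cannot be logically determined to be documented."""
    return [r for r in respondents if not logically_documented(r)]


sample = [
    {"id": 1, "citizen": True},
    {"id": 2, "medicare": True, "citizen": False},
    {"id": 3, "citizen": False},
]
residual_ids = [r["id"] for r in possibly_undocumented(sample)]
```

Only respondent 3 reports none of the listed characteristics, so only that respondent stays in the residual group.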
The main drawback of this approach is that many truly documented individuals will remain in the undocumented sample. In our case only 33.4% (N = 1672) of the individuals classified as undocumented by logical imputation (N = 4924) were truly undocumented. The selective inclusion of many documented immigrants in the undocumented group can lead to misleading conclusions about the relationship between undocumented status and health outcomes. As the results in Table 2 illustrate, logical imputation leads to an overestimate of the negative relationship between undocumented status and both poor/fair self-rated health and private health insurance, relative to the true relationship observed in the SIPP.
The number of documented immigrants falsely assigned undocumented status could be decreased by employing additional exclusion criteria such as employment in a federally licensed occupation or Medicaid coverage. However, any expansion beyond the strictly logical criteria carries with it the risk of systematic bias. Borjas and Cassidy [1], for instance, treated individuals who reported being covered under Medicaid as definitively documented. The association between legal status and Medicaid receipt is a strong correlation rather than a logical certainty, because a small number of undocumented immigrants reported Medicaid coverage in several surveys. These individuals might receive Medicaid coverage through state-level provisions that cover pregnant women, mothers, and children, or misreport their coverage status due to confusion (e.g., due to previous "Emergency Medicaid" coverage) or fear of disclosing their legal status. The strict exclusion of these individuals can in turn result in misleading conclusions due to the substantial correlation between Medicaid coverage and other socio-economic indicators, most notably poverty [12].
While the low specificity of the cautious logical imputation employed should deter its use on its own, limiting the logical imputation to logical exclusion criteria maximizes sensitivity, i.e., the excluded population contains no truly undocumented immigrants. Logical imputation can thus be employed to reduce the foreign-born sample prior to applying further imputation methods without losing any truly undocumented observations. For the remainder of this paper, we used this two-step approach.

Logistic regression imputation
One way to improve on the results of the Logical Imputation approach is by using statistical methods to identify members of the "possibly undocumented" group that are likely undocumented based on their socio-demographic characteristics. The common statistical method used to facilitate this prediction is logistic regression modeling, which can be applied in either a single or multiple imputation framework. We will first consider the simpler Single Logistic Imputation (see [11] for an example).
Establishing a relationship between respondents' socio-demographic characteristics and their probability of being undocumented requires a "donor sample" that includes both socio-demographic variables and an indicator of legal status in the form of a direct or reliable proxy measure. The information gained from the "donor sample" can then be used to predict undocumented status in the "target sample". To simulate this donor-target relationship within the SIPP, we follow common machine learning practices by randomly splitting the SIPP into a training and a test sample. Our test sample consists of 20% of the initial SIPP sample and the undocumented identifier is muted, hence simulating a "target sample" that is missing this information. The remaining 80% of the SIPP that make up the training sample remain unchanged, representing the "donor sample" with the full set of information. As described above, both samples are subset to include only the "possibly undocumented" identified by the Logical Imputation. The procedure is repeated for ten different random splits of training and test data, and the results presented in Tables 1 and 2 are averaged across the ten iterations to ensure that they are not driven by any one random split.
Table 2 Average correlations between (Imputed) legal status and health variables using bootstrapped cross-validation.
The predictors used to build the logistic regression model are years lived in the U.S., educational attainment, poverty status, region of birth, marital status, difficulties speaking English, Medicaid coverage, household size, spousal citizenship, age, number of children, employment status, race and Hispanic ethnicity. Rather than imputing missing values (using the Census-provided imputations) or deleting them case-wise, we coded them as an additional level for categorical predictors, because nonresponse, specifically to immigration-related questions, cannot be expected to be missing-at-random.
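The repeated 80/20 split and the coding of missing values as their own category can be sketched as follows. This is an illustrative outline under stated assumptions, not the original implementation: `evaluate` is a hypothetical callback that fits an imputation model on the training records and returns one performance number for the test records, and the `"missing"` label is an arbitrary choice.

```python
import random


def code_missing(value, missing_label="missing"):
    """Treat nonresponse on a categorical predictor as an additional level."""
    return missing_label if value is None else value


def repeated_split_average(records, evaluate, n_repeats=10, test_share=0.2, seed=0):
    """Average a performance score over several random train/test splits."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_repeats):
        shuffled = list(records)
        rng.shuffle(shuffled)
        n_test = round(len(shuffled) * test_share)
        test, train = shuffled[:n_test], shuffled[n_test:]  # 20% "target", 80% "donor"
        scores.append(evaluate(train, test))
    return sum(scores) / len(scores)


# Usage with a dummy evaluator that only reports the share of test records.
records = list(range(100))
average_share = repeated_split_average(records, lambda train, test: len(test) / 100)
```

Averaging over several splits keeps a single unusually easy or hard test sample from driving the reported result.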
An alternative specification that includes self-rated health as a predictor was considered, and results for this alternative specification are reported in Tables 3 and 4. We opted for the final predictor set presented here because self-rated health is only available for those respondents retained in the SIPP's fourth wave, and the model performance does not indicate substantial improvements in prediction performance warranting this loss of observations.
Model coefficients were derived from running the logistic regression on the training sample. After predicting the probability of being undocumented among respondents in the test sample, all those with a predicted probability greater than 50% are determined to be undocumented. Model performance indicators are reported in Table 1. Unlike the Logical Imputation, the Logistic Imputation leads to some undocumented immigrants being falsely assigned documented status, resulting in a sensitivity of 0.69. With a PPV of 0.35, it only presents a minor improvement over the Logical Imputation in the share of truly undocumented individuals in the imputed undocumented group. As the results in Table 2 illustrate, Logistic Imputation also results in a significant overestimation of the negative relationship between undocumented status and health insurance coverage relative to the true relationship in the test sample (-0.27 vs. -0.17). Both the imputed correlation between undocumented status and self-rated health and the true correlation in the test SIPP were insignificant.
The main benefit of using a Multiple rather than a Single Imputation framework is the ability to capture the uncertainty inherent in any attempt to predict legal status in a sample that lacks this information. In practice, any analysis using legal status based on multiple imputations will yield higher Standard Errors to account for this uncertainty [13]. In extensive testing, Van Hook et al. [13] also show that Multiple Imputation (MI) based on logistic regression yields unbiased estimates of the relationship between legal status and insurance coverage. But this result only holds if the joint observation condition is met, i.e., if legal and health insurance status are both observed in the donor sample. As described above, research on the health of the U.S. undocumented population must inevitably violate this condition when using any form of cross-survey imputation.
Thus, we must evaluate whether MI can reduce the resulting bias relative to the Single Imputation approach tested above, even under the sub-optimal conditions imposed by publicly available data.
Following common practice, the MI approach builds on Logistic Imputation using chained equations facilitated by the mice package in R [4]. We used the same logistic specification as outlined above, but instead of predicting legal status once in the test data, the MI approach creates ten separate test datasets that are identical except for differences in the respondents' imputed legal status. All subsequent analysis is then performed in each of the ten datasets separately, and the results are pooled to account for the uncertainty in imputing legal status.
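A standard way to pool results across multiply imputed datasets is Rubin's rules, which combine the average within-imputation variance with the variance of the estimates between imputations. The sketch below illustrates this general rule in Python; it is an assumption that the pooling in this analysis followed the same rule, and the numbers in the usage example are made up.

```python
import statistics


def pool_rubin(estimates, variances):
    """Pool m point estimates and their squared standard errors (Rubin's rules)."""
    m = len(estimates)
    pooled_estimate = statistics.mean(estimates)
    within = statistics.mean(variances)       # average sampling variance
    between = statistics.variance(estimates)  # variance across imputations (m - 1 denominator)
    total_variance = within + (1 + 1 / m) * between
    return pooled_estimate, total_variance


# Hypothetical estimates from three imputed datasets.
estimate, total_variance = pool_rubin([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
pooled_standard_error = total_variance ** 0.5
```

The between-imputation term is what raises the pooled standard error above that of any single imputed dataset.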
Unlike Single Imputation, MI treats cross-survey legal status imputation as a missing-data problem rather than a prediction problem. Because of this philosophical difference, MI is not designed to provide a definitive classification for each observation. Instead, it has been designed to impute missing values in datasets while retaining the statistical relationship between all variables in the model. This poses several practical constraints with regard to the ultimate analysis in the target sample following legal status imputation. Because MI does not assign a definitive legal status, the results cannot be easily combined with additional variables in the target sample for use in the final analysis. All variables used in the final analysis must instead be included in the imputation itself, regardless of their predictive power.
Moreover, cross-validation and socio-demographic summary statistics of the imputed undocumented population are not meaningful measures when evaluating the MI approach. Instead, we relied on the Pearson correlation coefficient between legal status and the health variable to assess the ability of MI to reduce bias. The results in Table 2 show that MI leads to a small positive bias in the relationship between imputed legal and private health insurance status and reproduces the insignificant correlation between undocumented status and self-rated health found in the test sample.

Machine learning imputation
Unlike traditional regression models, non-parametric machine learning algorithms do not require any prior assumptions about the functional form that is underlying the relationship between sociodemographic predictors and undocumented status. With this increased flexibility, non-parametric models can account for more complex relationships and can reduce the bias observed in Logistic Regression Imputation.
One of the most popular non-parametric machine learning classification algorithms is the K-nearest Neighbor (KNN) classifier. The basic idea underlying the KNN approach is to identify the k observations that are most similar to the observation for which classification is required. Whichever class the majority of these "nearest neighbors" belong to is assigned to the observation in question. Similarity or "nearness" between different observations is established via Euclidean Distance in an n-dimensional space, with n being equal to the number of selected predictors [7]. To account for differences in units of measurement and possible maximum and minimum values, predictors are typically normalized.
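The following minimal sketch illustrates the mechanics just described: min-max normalization of the predictors, Euclidean distance, and a majority vote among the k nearest training observations. It is a plain-Python illustration with made-up numeric predictors, not the caret-based procedure used in the analysis.

```python
import math
from collections import Counter


def fit_min_max(rows):
    """Return a function that rescales each predictor to [0, 1] using training ranges."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    spans = [(max(c) - min(c)) or 1.0 for c in columns]  # avoid division by zero
    return lambda row: [(v - lo) / span for v, lo, span in zip(row, lows, spans)]


def knn_predict(train_rows, train_labels, query, k):
    """Assign the majority class among the k nearest training observations."""
    scale = fit_min_max(train_rows)
    scaled_query = scale(query)
    distances = sorted(
        (math.dist(scale(row), scaled_query), label)
        for row, label in zip(train_rows, train_labels)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


# Hypothetical predictors on very different scales (e.g., years and income).
train_rows = [[1, 100], [2, 110], [1.5, 105], [8, 900], [9, 950], [8.5, 920]]
train_labels = [1, 1, 1, 0, 0, 0]  # 1 = undocumented, 0 = documented
near_first_group = knn_predict(train_rows, train_labels, [1.2, 102], k=3)
near_second_group = knn_predict(train_rows, train_labels, [8.8, 930], k=3)
```

Without the rescaling step, the second predictor would dominate the distance simply because its values are larger.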
To compare the performance of the KNN algorithm to the Logistic Imputation approach we used the same test and training samples as above, as well as the same 15 predictors. The optimal value for k is determined to be 31 based on repeated 10-fold cross-validation using the caret package in R [8]. As the results in Table 1 show, the KNN Imputation results in a lower sensitivity and slightly higher specificity than the Logistic Regression Imputation, resulting in a lower PPV of 0.29 and a slightly higher accuracy of 88.6. Like the Single Logit Imputation, KNN overestimates the negative relationship between legal status and private health insurance, but with a correlation coefficient of -0.207, it does so to a smaller extent. The correlation coefficient between KNN-imputed legal status and self-rated health also shows a slight negative bias relative to the true relationship in the test sample but remains insignificant. Overall, the KNN Imputation only offers a minor improvement over the Logistic Regression imputation in terms of bias, at the expense of lower sensitivity in identifying undocumented respondents.
Another non-parametric machine learning algorithm is the Random Forest (RF) Algorithm [2]. It builds on the concept of the decision tree, where each node of the tree represents a split on a predictor, and each branch ends in an assignment to a class group. The RF Algorithm grows a large ensemble of decision trees, each based on a random subsample of both the training sample (drawn with replacement) and the predictors (drawn without replacement). The algorithm chooses node-splits that maximize homogeneity in the resulting split groups. RF is referred to as an ensemble method, as each tree is grown independently and produces a prediction for each observation in the test sample. These "votes" are then aggregated across trees, leading to the final categorization of each observation based on the majority vote. This approach reduces the risk of overfitting that often occurs in simple decision tree models, i.e., on average the RF algorithm performs better in unknown data than comparable models [3]. With the large number of interactions between predictors introduced by the tree design, the RF algorithm also has the potential to account for more complex relationships between socio-demographic characteristics and legal status.
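To make the ensemble logic concrete, the sketch below grows a small forest in plain Python: each tree sees a bootstrap sample of the training rows and a random subset of predictors, splits are chosen to minimize Gini impurity, and the final class is the majority vote. This is a simplified illustration under stated assumptions (numeric predictors, shallow trees, thresholds at observed values, toy data), not the tuned model used in the analysis; note also that many RF implementations redraw the predictor subset at every split rather than once per tree.

```python
import random
from collections import Counter


def gini(labels):
    """Gini impurity: 0 when a group is perfectly homogeneous."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(rows, labels, features):
    """Find the (feature, threshold) pair giving the most homogeneous groups."""
    best = None
    for f in features:
        for threshold in sorted({row[f] for row in rows}):
            left = [y for row, y in zip(rows, labels) if row[f] <= threshold]
            right = [y for row, y in zip(rows, labels) if row[f] > threshold]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if best is None or score < best[0]:
                best = (score, f, threshold)
    return best


def grow_tree(rows, labels, features, depth=0, max_depth=3):
    """Return a class label (leaf) or a (feature, threshold, left, right) node."""
    majority = Counter(labels).most_common(1)[0][0]
    if depth == max_depth or len(set(labels)) == 1:
        return majority
    split = best_split(rows, labels, features)
    if split is None:
        return majority
    _, f, threshold = split
    go_left = [row[f] <= threshold for row in rows]
    left_rows = [r for r, g in zip(rows, go_left) if g]
    left_labels = [y for y, g in zip(labels, go_left) if g]
    right_rows = [r for r, g in zip(rows, go_left) if not g]
    right_labels = [y for y, g in zip(labels, go_left) if not g]
    return (f, threshold,
            grow_tree(left_rows, left_labels, features, depth + 1, max_depth),
            grow_tree(right_rows, right_labels, features, depth + 1, max_depth))


def predict_tree(tree, row):
    while isinstance(tree, tuple):
        f, threshold, left, right = tree
        tree = left if row[f] <= threshold else right
    return tree


def grow_forest(rows, labels, n_trees=25, n_features=2, seed=0):
    rng = random.Random(seed)
    n = len(rows)
    forest = []
    for _ in range(n_trees):
        sample = [rng.randrange(n) for _ in range(n)]             # rows, with replacement
        features = rng.sample(range(len(rows[0])), n_features)    # predictors, without replacement
        forest.append(grow_tree([rows[i] for i in sample],
                                [labels[i] for i in sample], features))
    return forest


def predict_forest(forest, row):
    """Majority vote across all trees."""
    return Counter(predict_tree(tree, row) for tree in forest).most_common(1)[0][0]


# Hypothetical, well-separated toy data (1 = undocumented, 0 = documented).
rows = [[1, 1, 1], [2, 1, 2], [1, 2, 2], [2, 2, 1], [8, 9, 8], [9, 8, 9], [8, 8, 9], [9, 9, 8]]
labels = [1, 1, 1, 1, 0, 0, 0, 0]
forest = grow_forest(rows, labels)
low_case = predict_forest(forest, [1, 1, 1])
high_case = predict_forest(forest, [9, 9, 9])
```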
As with the KNN Imputation, we tuned the RF algorithm on the same training data using repeated 10-fold cross-validation. The results, based on a forest of 500 trees, show 12 to be the optimal number of predictors randomly drawn for each tree from the same predictors as described above. The results of running the test data through the tuned model are reported in Table 1. The RF outperforms both logistic and KNN imputation in terms of sensitivity and specificity and yields the highest PPV among the three, with a value of 0.44. With a correlation coefficient between RF-imputed undocumented status and private health insurance coverage of -0.173, the RF had the best reproduction of the true relationship in the test data among all tested approaches (Table 2). Like the other approaches, and in line with the results in the test data, RF produces an insignificant correlation between undocumented status and self-rated health.
In summary, non-parametric Machine Learning approaches provide a viable alternative to existing strategies in legal status imputation [14]. In particular, the Random Forest Algorithm shows superior performance compared to traditional approaches, as it is more accurate in identifying undocumented immigrants and minimizes bias in both the socio-demographic variables included in the imputation and unobserved health variables, relative to regression-based imputation.

Application to the National Health Interview Survey
Having identified the RF approach as the best performing imputation method under the suboptimal conditions imposed by the availability of suitable national health survey data, we applied it to the NHIS. The NHIS is a stratified random sample of the non-institutionalized U.S. population conducted by the National Center for Health Statistics (NCHS). While the basic socio-demographic information needed for legal status imputation is available for all household members, detailed health information is only captured for one adult respondent per household. We thus restrict the imputation to this Adult Sample.
Despite the large, nationally representative sample of the NHIS, the small share of undocumented immigrants in the U.S. can still yield small cell-sizes each year when stratifying the final analysis by factors such as years lived in the U.S., region of origin or healthcare access status. Sample sizes can be increased by pooling multiple years of the NHIS. However, the composition of the U.S. undocumented population has changed markedly over time, as is evident in the descriptive statistics for the 2004, 2008 and 2014 SIPP presented in Table 5, as well as in the results presented by the Pew Research Center [10].
To account for the changing composition of the U.S. undocumented population, we grow separate RF models, following the approach presented above, in the 2004, 2008 and 2014 cohorts of the SIPP and apply them to NHIS cross-sections from 2000 to 2006, 2007 to 2012 and 2013 to 2018, respectively. A limitation of this approach is the substantial change in the SIPP's survey design between the years 2008 and 2014. Most notably, the question whether respondents have changed their status to "permanent" since arriving in the U.S. is dropped from the survey entirely, making the legal status proxy, and thus any subsequent imputation based on it, less accurate than in previous SIPP cohorts. This inconsistency can be addressed by either using only the 2004 and 2008 cohorts of the SIPP for imputation across all years of the NHIS, risking inaccuracy for the later years, or by dropping the later observations of the NHIS entirely, restricting the final analysis to more historic data. Alternatively, one can capture any systematic differences in the characteristics of the imputed undocumented population between years by including year fixed effects in the final statistical analysis in the NHIS, thus avoiding confounding bias resulting from the different donor samples but rendering a longitudinal analysis of the NHIS data impractical.
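The assignment of NHIS survey years to SIPP donor cohorts described above amounts to a simple lookup, sketched here. The function and constant names are illustrative.

```python
# SIPP donor cohort -> NHIS survey years it is applied to (inclusive ranges).
NHIS_YEARS_BY_SIPP_COHORT = {
    2004: range(2000, 2007),  # NHIS 2000 to 2006
    2008: range(2007, 2013),  # NHIS 2007 to 2012
    2014: range(2013, 2019),  # NHIS 2013 to 2018
}


def sipp_cohort_for(nhis_year):
    """Return the SIPP cohort whose RF model is applied to a given NHIS year."""
    for cohort, years in NHIS_YEARS_BY_SIPP_COHORT.items():
        if nhis_year in years:
            return cohort
    raise ValueError(f"NHIS year {nhis_year} is outside the 2000 to 2018 window")


cohort_2010 = sipp_cohort_for(2010)
```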
The presented imputation approach also faces challenges from differences between the SIPP and NHIS. While both surveys ostensibly sample randomly from the same universe, i.e., the non-institutionalized U.S. population, at roughly the same time, differences in sampling strategies between SIPP and NHIS, and thus in sample selection, are unavoidable, especially when surveys are conducted by different organizations, as is the case here. One organization might, for example, have more translators available, resulting in a higher probability of non-English speaking individuals being selected into the survey. There are multiple approaches to account for such differences in the probability of being sampled. One approach is to assign a propensity score to each respondent representing the probability of being selected into the SIPP, based on observable characteristics, including the predictors used in the imputation model. This propensity score is then included as an additional predictor in the imputation, thus reducing possible bias introduced into the imputation by systematic differences in the sampling probability between the SIPP and the NHIS [5]. In the specification presented here, in addition to the predictors included in the logistic regression model, individuals' region of residence and occupation were added to calculate the propensity score.
The results of applying the discussed RF imputation approach to the pooled cross-section of the NHIS are presented in Table 5 . The NHIS respondents defined as "documented" in the table include both those excluded from the undocumented population via logical edits, as well as the "possibly undocumented" that were excluded based on the imputation model. Consistent with previous research  and SIPP data, the imputed undocumented population in the NHIS is on average younger, more likely to be Hispanic and of lower socio-economic status than their documented counterparts. The main difference between the imputed NHIS sample and the SIPP sample is a smaller proportion of undocumented immigrants originating from Central and South America and, correspondingly, a larger proportion from Asia. Future research that uses non-publicly available data with detailed information on country of origin should consider stratifying the legal status imputation by region of origin to account for this discrepancy.

Application to other data sources and questions
The use of the multi-survey Random Forest imputation approach is not dependent on health data and could be applied to other fields and topics where quantitative data sources that identify undocumented immigrants are scarce or unavailable. Quantitative research in critical, yet under-researched dimensions of undocumented immigrants' wellbeing in the U.S., such as discrimination, integration, or job market experience, could thus be possible using the methodology advanced in this paper.
Legal status imputation also provides an avenue for junior researchers to engage in quantitative research that concerns undocumented immigrants. This group of scientists, which includes many individuals with close ties to the communities involved, often lacks access to the resources necessary to conduct primary data collection. By enabling the use of publicly available secondary data, advances in legal status imputation could thus be a means to promote diversity in research concerning the U.S. undocumented population. In such efforts to democratize data access, it remains imperative that the anonymity of survey respondents is ensured and that imputation methods are used for scientific inquiry only.

Ethics statements
Our work used publicly available data and does not meet the definition of human subjects research.

Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data Availability
Publicly available data were used that are available to download from the internet for free.