Healthcare resource utilisation and mortality outcomes in international migrants to the UK: analysis protocol for a linked population-based cohort study using Clinical Practice Research Datalink (CPRD), Hospital Episode Statistics (HES) and the Office for National Statistics (ONS) [version 2; peer review: 1 approved with reservations, 1 not approved]

An estimated 14.2% (9.34 million people) of people living in the UK in 2019 were international migrants. Despite this, there are no large-scale national studies of their healthcare resource utilisation and little is known about how migrants access and use healthcare services. One ongoing study of migration health in the UK, the Million Migrants study, links electronic health records (EHRs) from hospital-based data, national death records and Public Health England migrant and refugee data. However, the Million Migrants study cannot provide a complete picture of migration health resource utilisation as it lacks data on migrants from Europe and utilisation of primary care for all international migrants. Our study seeks to address this limitation by


Amendments from Version 1
The revised protocol clarifies the definition of international migrants that will be used in this study. Edits have been made to address minor errors with referencing and the reporting of findings from previous studies. The planned sensitivity analyses have also been updated to reflect the consideration of additional methods to assess comparability between migrant and non-migrant groups. The strengths and limitations of the study have been revised to include additional considerations raised by reviewers.

Introduction
An estimated 14.3% (9.4 million people) of people living in the UK in 2019 were international migrants 1 . Despite this, little is known about how migrants access and use healthcare services in the UK. A systematic review of migrant healthcare in Europe showed high emergency care service use but low uptake of preventive services including outpatient care and screening 2 . One study in Scotland also showed that people of South Asian ethnicities, including those born outside of the UK, had higher rates of avoidable hospital admissions compared to the white Scottish population 3 . However, existing studies of migrant healthcare utilisation in the UK are mostly limited to outpatient, hospital and emergency care. In addition, some have used proxy measures of migration which are unable to provide a true estimate of the impact of migration. For example, a study in England using registration with a GP after the age of 15 as a proxy for migration estimated hospital admission rates to be half the rate of the general population 4 .
The Million Migrants study is an ongoing population-based linked cohort study examining secondary healthcare utilisation and mortality in 1.5 million non-European Union (EU) migrants to England 5 . It will link Public Health England (PHE) records of non-EU migrants and refugees to secondary care electronic health records (EHRs) and death registration records. The novel record linkage and cohort size mean the Million Migrants study will be able to examine in detail the health needs of migrants in England in all hospital-based services (emergency, inpatient and outpatient care) without relying on proxy measures of migration. However, information governance restrictions prevent linkage of PHE migrant and refugee records to UK EHRs from primary care, often the first point of contact in the UK health system and a central part of the NHS Long Term Plan for preventive care 6 . The Million Migrants study is also limited to individuals migrating from outside of the EU. These two factors mean it cannot provide a complete picture of migration health.
To use UK primary care EHRs to study migration health without linking to PHE records, a valid migration phenotype is necessary: a transparent reproducible algorithm using clinical terminology codes to determine migration status 7 . A valid migration phenotype is one that determines the migration status for a large number of individuals with high certainty and who are representative of migrants in the general population. A phenotype that is poorly defined or lacks comprehensiveness leads to selection bias and reduces the validity of any findings 7 .
A recent study using Clinical Practice Research Datalink (CPRD), one of the largest UK primary care EHRs, described phenotypes for social factors amongst older individuals including migration status 9 . The study estimated that 1.3% of individuals aged ≥ 65 years in CPRD GOLD were international migrants. However, the study did not evaluate the migration phenotype in CPRD Aurum. As 81.3% of migrants in England are aged between 16 and 64 years old 1 , it is likely that applying a migration phenotype to individuals of any age in CPRD GOLD and Aurum will identify a higher proportion of international migrants. If this phenotype is then found to be broadly representative of the UK migrant population, it will be possible to use CPRD and datasets linked to CPRD to describe primary care and hospital-based healthcare resource utilisation and mortality in migrants from EU and non-EU countries compared to non-migrants across the UK.
This protocol describes the planned methods of a feasibility study and a main study to describe healthcare resource utilisation and mortality for migrants in the UK using CPRD. This will generate evidence to address the gaps outlined in migration health research and inform policy aimed at increasing equitable healthcare for international migrants attending UK primary care. The definition of migrants used in this study will reflect that of the International Organization for Migration, where a migrant is an individual who "moves away from his or her place of usual residence, whether within a country or across an international border, temporarily or permanently, and for a variety of reasons" 10 .

Aims and objectives
The feasibility study aims to assess the validity of a migration phenotype in CPRD. Specific objectives are:
1. To develop a migration phenotype.
2. To assess the completeness of recording of migration status using the migration phenotype.
3. To assess the representativeness of recording of migration status using the migration phenotype.
The main study will be completed if the phenotype is found to be valid and aims to describe healthcare resource utilisation and mortality in migrants to the UK who have registered with primary care. Specific objectives are:
1. To describe patterns of primary care and hospital-based healthcare resource utilisation by migrants compared to non-migrants.
2. To describe the costs of primary care and hospital-based healthcare resource utilisation by migrants compared to non-migrants.
3. To estimate total healthcare resource utilisation patterns across primary and secondary care and investigate whether distinct groups of patients exist based on degree of utilisation.
4. To describe mortality outcomes in migrants compared to non-migrants.

Ethical approvals
The feasibility and main study were approved by the MHRA (UK) Independent Scientific Advisory Committee (ISAC protocol 19_062R), under Section 251 (NHS Social Care Act 2006). This study will be carried out as part of the CALIBER programme. CALIBER, led from the UCL Institute of Health Informatics, is a research resource consisting of anonymised, coded variables extracted from linked electronic health records, methods and tools, specialised infrastructure, and training and support 11,12 .
Feasibility study
Study design. An observational, retrospective longitudinal population-based cohort study.
Data resource and processing. Data will be extracted from CPRD using the CALIBER resource. CPRD collects de-identified data of patients registered with a network of GP practices across the UK. The data encompass 45 million patients, including 13 million currently registered patients, across two datasets: CPRD GOLD and CPRD Aurum 13 . CPRD GOLD contains data contributed by practices using Vision® electronic patient record system software and is broadly representative of the UK general population with respect to age, sex and ethnicity 14 . CPRD Aurum contains data from practices using EMIS Web® electronic patient record system software and is broadly representative of the UK general population with respect to age, sex, geographical spread and deprivation 16 .

Study population.
Individuals of all ages listed in CPRD whose individual record is of 'acceptable' research quality as verified by CPRD, and whose general practice has been deemed to be contributing 'up-to-standard' (UTS) data at the study start date 14 .
The study start date is 1st January 1997.

Development of phenotype.
Previously established methods by CALIBER will be used for the development of a migration phenotype 12 . The CPRD code browsers will be searched for diagnostic terms relating to migration using the following search terms: *migrant*, *migrat*, *countr*, *asylum*, *refugee*, *visa*, *abroad*, *born in*, *origin*, *illegal*, *language*. This initial phenotype will then be reviewed and refined by migration health experts and experts in using CPRD from the CALIBER team. Finally, each diagnostic term will be assigned a category based on the type of term (visa status, language, country of birth, origin) and a category based on the certainty of migration status ("definite", "probable", "possible"). We have found 434 diagnostic terms in an initial search (see Extended data 15 ).
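The initial wildcard search described above can be sketched as follows. This is a minimal illustration in Python, not the CALIBER tooling: the code dictionary, its codes and its descriptions are invented, and only the search patterns come from the protocol.

```python
from fnmatch import fnmatchcase

# Search patterns copied from the protocol; "*" matches any run of characters.
SEARCH_PATTERNS = [
    "*migrant*", "*migrat*", "*countr*", "*asylum*", "*refugee*", "*visa*",
    "*abroad*", "*born in*", "*origin*", "*illegal*", "*language*",
]

def matching_patterns(description):
    """Return the patterns that match a code description, ignoring case."""
    text = description.lower()
    return [p for p in SEARCH_PATTERNS if fnmatchcase(text, p)]

def screen_terms(code_dictionary):
    """Keep the codes whose description matches at least one pattern."""
    hits = {}
    for code, description in code_dictionary.items():
        matched = matching_patterns(description)
        if matched:
            hits[code] = {"description": description, "matched": matched}
    return hits

# Hypothetical extract of a code browser: codes and descriptions are invented.
example_dictionary = {
    "X001": "Refugee",
    "X002": "Born in Poland",
    "X003": "Main spoken language Urdu",
    "X004": "Type 2 diabetes mellitus",
    "X005": "Advisable dietary change",
}

candidates = screen_terms(example_dictionary)
```

In this invented example the last entry is picked up only because "advisable" contains "visa", which is the kind of false positive that the expert review step is meant to remove before categories are assigned.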

Analysis plan.
Previously developed methodology to assess the validity of phenotypes in CPRD will be used to achieve objectives 2 and 3, including:
Completeness: the percentage of recorded migrants in CPRD throughout the study period, per year and at the time of the 2011 English Census will be calculated by dividing the number of individuals identified as migrants by our phenotype by the total number of individuals in the CPRD dataset. This will be done for all migrants and for sub-groups according to type of migration term and certainty of migration status. Distribution by sex, age and geographical region of birth will be estimated.
Representativeness: we will compare the percentage of recorded migrants in CPRD with the percentage of migrants in ONS country of birth statistics per year (examined visually and using a chi-squared test of proportions; calculating the ratio of the proportion in CPRD to the proportion in ONS) 17 . We will also compare recorded migrants in CPRD living in England on the date of the 2011 English Census with 2011 English Census data on country of birth, stratified by sex, age and geographical region of origin (examined visually and using a chi-squared test of proportions; calculating the ratio of the proportion in CPRD to the proportion in ONS).
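As a rough illustration of the completeness and representativeness calculations, the sketch below computes a proportion, the CPRD-to-ONS ratio of proportions and a Pearson chi-squared test for two proportions. It is written in Python with the standard library only, it treats the ONS figure as if it were a second sample, and all counts are invented; none of this is taken from the study's own analysis code.

```python
import math

def proportion(count, total):
    """Share of individuals identified as migrants."""
    return count / total

def chi_squared_two_proportions(x1, n1, x2, n2):
    """Pearson chi-squared test (1 degree of freedom, no continuity
    correction) comparing x1/n1 with x2/n2."""
    pooled = (x1 + x2) / (n1 + n2)
    observed = [x1, n1 - x1, x2, n2 - x2]
    expected = [n1 * pooled, n1 * (1 - pooled), n2 * pooled, n2 * (1 - pooled)]
    statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Upper tail of a chi-squared distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value

# Invented counts for illustration only.
cprd_migrants, cprd_total = 2_000, 100_000
ons_migrants, ons_total = 13_000, 100_000

completeness = proportion(cprd_migrants, cprd_total)
ratio_cprd_to_ons = completeness / proportion(ons_migrants, ons_total)
statistic, p_value = chi_squared_two_proportions(
    cprd_migrants, cprd_total, ons_migrants, ons_total
)
```

A ratio well below 1, as in these invented numbers, would indicate that the phenotype identifies a smaller share of migrants than ONS reports for the same year.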
Main study
Study design. An observational, retrospective longitudinal population-based cohort record linkage study.
Data resources, processing and linkage. Data will be extracted from the CPRD GOLD and Aurum datasets and linked to Hospital Episode Statistics (HES) datasets, death registration data and Index of Multiple Deprivation (IMD) records obtained through the CALIBER resource 11,18 . CPRD GOLD and Aurum have been described earlier in this paper in the feasibility study section of the methods. For patients in English practices that have consented to take part in the CPRD linkage schemes, a subset of CPRD data is linked to HES, ONS mortality records and patient and practice-level IMD records. We describe the linked records that will be used for our study below. Data linkage in England is carried out by the Trusted Third Party NHS Digital 19 . Data is provided as quintiles or deciles of the deprivation score to prevent disclosure of patient location. Access is provided by CPRD subject to ISAC approval. This dataset will only be used if patient-level IMD data is not available for an individual.

Study population.
Individuals of all ages listed in CPRD whose individual record is of 'acceptable' research quality as verified by CPRD, and whose general practice has been deemed to be contributing 'up-to-standard' (UTS) data at the study start date.
The study start date is 1st January 1997, although the exact start date will be informed by the feasibility study, taking the representativeness of the migration phenotype over time into account. For primary care analyses, the end of the study period is limited by the most recent data available: December 2018 for CPRD GOLD and September 2018 for CPRD Aurum. An individual will stop contributing to active follow up at the earliest of: the date the patient's care was transferred out of a CPRD practice, the practice's last collection date for GOLD/Aurum data extraction, the patient's date of death or the last date of the study.
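The end-of-follow-up rule can be sketched as taking the earliest of the available dates. In this Python illustration the function name, argument names and example dates are hypothetical; `None` stands for an event that has not occurred.

```python
from datetime import date

def end_of_follow_up(transfer_out, last_collection, death, study_end):
    """Return the earliest non-missing date at which follow up stops.

    study_end is always known, so the list of candidates is never empty.
    """
    candidates = [
        d for d in (transfer_out, last_collection, death, study_end)
        if d is not None
    ]
    return min(candidates)

# Invented patient: never transferred out, died before the practice's
# last collection date and before the end of the study.
example_end = end_of_follow_up(
    transfer_out=None,
    last_collection=date(2018, 6, 30),
    death=date(2017, 3, 1),
    study_end=date(2018, 12, 31),
)
```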
Exposure. Migration to the UK is the exposure of interest. This will be defined using the migration phenotype developed and validated as outlined previously in the feasibility study section.
Comparator population. The non-exposed cohort: individuals with no evidence of migration to the UK as defined by the migration phenotype.

Outcomes.
We have selected outcomes that are important to researchers and policy-makers as well as migrants and refugees who have attended our public engagement workshops. Where possible, outcomes are in alignment with the Million Migrants study to facilitate triangulation of results 5 . Outcomes fall into one of three categories: primary care, hospital-based care and mortality. Table 1 summarises the clinical and statistical definition of these outcomes. All outcomes will be explored by subgroup conditions where appropriate.

Table 1.

| Outcome | Clinical definition | Statistical definition | Analysis |
|---|---|---|---|
| Primary care outcomes | | | |
| Consultations | | Numerical indicator for number of consultations. | Poisson regression |
| Prescriptions | Prescription for any medication issued in primary care. | Numerical indicator for number of prescriptions. | Poisson regression |
| Referrals to secondary care | Referral made from primary care to hospital-based services. | Numerical indicator for number of referrals. | Poisson regression |
| Missed appointments | Appointments in primary care that were not attended. | Numerical indicator for number of appointments coded as did not attend. | |
| Diagnosis of existing health conditions | Presence of a health condition from one of the sub-groups outlined in Table 2. | Binary indicator for presence of health condition (yes/no) from which a numerical indicator for number of people with a condition can be estimated. | |
| Hospital-based outcomes | | | |
| Hospital attendances | Hospital attendances in inpatient, outpatient, or A&E. | Numerical indicator for number of attendances. | Poisson regression |
| Hospital admissions | Admission into the hospital as an inpatient. | Numerical indicator for number of admissions. Numerical indicator for number of emergency readmissions recorded within 30 days of the index admission discharge date. | |
| Missed outpatient appointments | Outpatient appointments that were not attended. | Numerical indicator for number of outpatient appointments coded as did not attend. | Poisson regression |
| Missed procedures | Procedures that were not attended. | Numerical indicator for number of appointments for procedures coded as did not attend. | Poisson regression |
| Diagnosis of existing health conditions | Presence of health conditions by subgroups of conditions outlined in Table 2. | Binary indicator for presence of health condition (yes/no) from which a numerical indicator for number of people with a condition can be estimated. | |
| Mortality outcomes | | | |
| Death from all causes | Deaths in England from any cause. | Binary indicator for presence of death due to any cause (yes/no). | |
| Death from specific conditions | Deaths in England from conditions within sub-groups outlined in Table 2. | Binary indicator for presence of death due to any cause (yes/no). | Cox proportional hazards model |

subgroups (e.g. international migrants from Poland or India) to all non-migrants at the 5% significance level.
After completion of the feasibility study, we will use the results to update our sample size calculation with the number of individuals with diagnostic terms indicating migration. We will use the results of this updated sample size calculation to assess whether to proceed to the full study or not in conjunction with the overall representativeness compared to aggregate ONS data on migration as demonstrated by the feasibility study.
If the feasibility study finds completeness or representativeness is worse than the 2017 study of social factors including migration in older people 9 or the updated sample size calculation means that the study does not have the level of statistical power required, we will not proceed with the main study.
Analysis plan. All statistical analyses will be carried out using the latest available versions of R software.
Patterns of healthcare resource utilisation: Annual incidence rates and incidence rate ratios will be calculated for all primary and hospital-based care outcomes presented in Table 1 and subgrouped by outcomes in Table 2. Poisson regression will be used to generate rate ratios, with robust standard errors to produce 95% confidence intervals.
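To make the rate calculations concrete, the sketch below computes a crude (unadjusted) incidence rate ratio with a Wald 95% confidence interval on the log scale. It is a simplification, assuming a single binary exposure: the interval is the usual model-based one rather than the robust standard errors the protocol plans, no covariates are included, and all counts are invented. It is in Python for illustration, whereas the protocol's analyses will be run in R.

```python
import math

def incidence_rate(events, person_years):
    """Events per person-year of follow up."""
    return events / person_years

def crude_rate_ratio(events_exposed, py_exposed, events_unexposed, py_unexposed, z=1.96):
    """Unadjusted incidence rate ratio with a Wald interval on the log scale."""
    irr = incidence_rate(events_exposed, py_exposed) / incidence_rate(
        events_unexposed, py_unexposed
    )
    # Standard error of log(IRR) under a Poisson assumption.
    se_log = math.sqrt(1 / events_exposed + 1 / events_unexposed)
    lower = math.exp(math.log(irr) - z * se_log)
    upper = math.exp(math.log(irr) + z * se_log)
    return irr, lower, upper

# Invented counts: consultations and person-years in migrants vs non-migrants.
irr, ci_lower, ci_upper = crude_rate_ratio(300, 1_000, 4_000, 10_000)
```

With these invented counts the rates are 0.30 and 0.40 consultations per person-year, giving a rate ratio of 0.75.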
Costs of healthcare resource utilisation: Methods previously used to study this in patients with irritable bowel syndrome in linked CPRD and HES data 23 will be replicated. Absolute costs will be calculated as total mean individual annual costs with 95% confidence intervals. The costs of health services in primary care will be obtained from nationally calculated unit costs as NHS reference costs 24 and costs of medications from the British National Formulary 25 . The cost of secondary healthcare utilisation will be calculated according to national tariff prices based on the national average unit costs of providing each service; this is published as the National Schedule of Reference Costs 24 .
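A minimal sketch of the costing step is given below, assuming that each person's annual cost is the sum of activity counts multiplied by unit costs and that the confidence interval uses a normal approximation. The unit costs and activity counts are invented placeholders, not values from NHS reference costs or the British National Formulary.

```python
import math
import statistics

# Hypothetical unit costs in GBP; real values would come from the cited sources.
UNIT_COSTS = {"gp_consultation": 30.0, "outpatient_attendance": 120.0}

def annual_cost(activity_counts):
    """Total annual cost for one person: count multiplied by unit cost, summed."""
    return sum(UNIT_COSTS[item] * n for item, n in activity_counts.items())

def mean_with_ci(values, z=1.96):
    """Mean with a normal-approximation 95% confidence interval."""
    mean = statistics.mean(values)
    se = statistics.stdev(values) / math.sqrt(len(values))
    return mean, mean - z * se, mean + z * se

# Invented annual activity for three people.
patients = [
    {"gp_consultation": 4, "outpatient_attendance": 1},
    {"gp_consultation": 2, "outpatient_attendance": 0},
    {"gp_consultation": 6, "outpatient_attendance": 2},
]
costs = [annual_cost(p) for p in patients]
mean_cost, ci_lower, ci_upper = mean_with_ci(costs)
```

Cost data are usually right-skewed, so a real analysis might prefer a bootstrap interval; the normal approximation is used here only to keep the example short.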
Total healthcare utilisation patterns: Markers of total healthcare utilisation within primary and secondary care will be identified and patients will be classified according to total healthcare utilisation defined by their chronological sequence of clinical events in all healthcare settings. An exploratory multivariate statistical technique such as cluster analysis (k-means clustering or hierarchical clustering) will be applied to determine whether separable groups of patients who have missed opportunities for preventive healthcare exist.
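The k-means option mentioned above can be illustrated with a small standard-library implementation. This is a sketch under stated assumptions: the two features (annual primary care contacts and annual hospital contacts), the data points and the starting centroids are all invented, and a real analysis would use an established package and would need to choose the number of clusters and the markers of utilisation.

```python
import math

def assign(points, centroids):
    """Index of the nearest centroid (Euclidean distance) for each point."""
    return [
        min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
        for p in points
    ]

def update(points, labels, centroids):
    """Mean of the points in each cluster; an empty cluster keeps its centroid."""
    new_centroids = []
    for k, old in enumerate(centroids):
        members = [p for p, label in zip(points, labels) if label == k]
        if members:
            new_centroids.append(tuple(sum(dim) / len(members) for dim in zip(*members)))
        else:
            new_centroids.append(old)
    return new_centroids

def k_means(points, initial_centroids, max_iter=100):
    """Alternate assignment and update steps until the centroids stop moving."""
    centroids = list(initial_centroids)
    labels = assign(points, centroids)
    for _ in range(max_iter):
        updated = update(points, labels, centroids)
        if updated == centroids:
            break
        centroids = updated
        labels = assign(points, centroids)
    return labels, centroids

# Invented data: (primary care contacts, hospital contacts) per person per year.
points = [(1, 0), (2, 0), (1, 1), (12, 3), (14, 4), (13, 5)]
labels, centroids = k_means(points, initial_centroids=[(1, 0), (12, 3)])
```

Here the first three invented patients form a low-utilisation cluster and the last three a high-utilisation cluster.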
Mortality outcomes: Standardised mortality ratios (SMR) using ONS death data will be summarised by age and gender. For deaths due to specific conditions, an appropriate regression model will be used. Suicide rates will be based on the ONS definition of suicide, which includes deaths with an underlying cause of intentional self-harm, as well as those with an underlying cause of undetermined intent.
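A standardised mortality ratio divides observed deaths by the deaths expected if reference rates applied to the cohort's person-time. The sketch below shows that calculation with indirect standardisation; the strata, person-years, reference rates and observed deaths are invented and do not come from ONS data.

```python
def expected_deaths(person_years_by_stratum, reference_rates):
    """Deaths expected if each stratum experienced the reference mortality rate."""
    return sum(
        person_years * reference_rates[stratum]
        for stratum, person_years in person_years_by_stratum.items()
    )

def standardised_mortality_ratio(observed, expected):
    """Observed deaths divided by expected deaths."""
    return observed / expected

# Invented strata keyed by (sex, age band).
cohort_person_years = {("F", "40-64"): 1_000, ("M", "40-64"): 2_000}
reference_rates = {("F", "40-64"): 0.002, ("M", "40-64"): 0.003}  # deaths per person-year

expected = expected_deaths(cohort_person_years, reference_rates)
smr = standardised_mortality_ratio(observed=6, expected=expected)
```

In this invented example 8 deaths are expected and 6 observed, so the ratio is 0.75, meaning fewer deaths than the reference rates would predict.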
Covariates. The following covariates will be included in the analysis model for all outcomes and sub-conditions: age, sex, deprivation level (Index of Multiple Deprivation quintile), and ethnicity. Additional lists of covariates will be developed where relevant to specific conditions in the sub-groups outlined in Table 2.

Sensitivity analyses.
Where possible, stratified measures will be calculated according to: sex, age, socioeconomic status, ethnicity, migrant visa type, geographical region of birth, general practice consultation type (e.g. face to face versus telephone-based), staff type (e.g. role, gender), method of hospital admission and hospital specialty.
CPRD practices may not be representative of all practices in the UK or of practices serving international migrants to the UK.
To mitigate this, proportions of migrants will be described regionally; if there is a large amount of variation, analyses will be weighted to account for this using methods previously described by Aldridge et al. 26 .
The distribution of covariates across migrant and non-migrant groups will be assessed. Additional methods to achieve comparability between groups will be considered in sensitivity analysis where the uneven distribution of covariates is likely to introduce significant bias.

Information governance
All analyses will be completed on the UCL Data Safe Haven (DSH), an information technology infrastructure certified to national and international information governance standards. The dataset will be securely destroyed after 20 years, in line with UCL's record retention policy. There may be small numbers with specific outcomes or of specific migrant types and in line with CPRD policy, we will not report any data with a cell containing <5 events and, where necessary, we will 'protect' these counts with secondary suppression.
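The small-number rule can be sketched as follows. The primary threshold of 5 comes from the text; the secondary suppression rule shown here (hide the smallest remaining cell when exactly one cell in a row is hidden, so it cannot be recovered from the row total) is one simple assumed approach and is not taken from CPRD guidance. Category names and counts are invented.

```python
THRESHOLD = 5
SUPPRESSED = "<suppressed>"

def suppress_row(counts):
    """Apply primary and simple secondary suppression to one row of counts.

    counts maps category -> number of events; the row total is assumed
    to be published alongside the cells.
    """
    hidden = {category for category, n in counts.items() if n < THRESHOLD}
    if len(hidden) == 1:
        # One hidden cell could be derived by subtraction from the total,
        # so also hide the smallest of the remaining cells.
        remaining = {c: n for c, n in counts.items() if c not in hidden}
        if remaining:
            hidden.add(min(remaining, key=remaining.get))
    return {c: (SUPPRESSED if c in hidden else n) for c, n in counts.items()}

# Invented counts of an outcome by certainty of migration status.
row = {"definite": 120, "probable": 3, "possible": 40}
published = suppress_row(row)
```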

Dissemination of results
We will disseminate research findings to a variety of stakeholders, including patients, healthcare professionals, voluntary organisations, policy-makers, politicians and the public. We will achieve this through the co-creation of research dissemination materials (e.g. lay reports and videos) as well as research engagement stands and workshops in patient and public settings.

Study status
At the time of submission, CPRD GOLD data have been extracted, cleaned and prepared for validation, and validation has started, with ongoing refinements. Data have been prepared and explored for subsequent analyses in GOLD. A request for linkage to IMD data has been completed and the data provided by CPRD. A linkage request for HES and ONS data is being prepared. Analyses using Aurum data have not yet started.

Discussion
This protocol describes a method of creating and validating an EHR phenotype to describe the healthcare utilisation, morbidity and mortality of international migrants to the UK across primary and secondary care.
Many of the strengths of this study are shared with the Million Migrants study 5 . These include the large size of the cohort and extensive stakeholder engagement. We have collaborated with migrants, refugees and advocacy groups as well as a range of clinical, research and policy stakeholders to ensure ethical and efficient data use and optimise the impact of our research findings. It will also be possible to triangulate secondary care and mortality outcomes for non-EU migrants in the present study with the results of the Million Migrants study.
Unique strengths of the present study include the methods used to develop the migration phenotype, specifically the involvement of migration health experts and clinicians. The study includes primary care data and imposes no restrictions on country of birth or visa types. This means that our study addresses important limitations of the Million Migrants study and profiles a larger part of the patient journey. Another unique strength is the cluster analyses: these will focus on identifying clusters of patients attending GP services who have missed opportunities for care or have lower resource utilisation, and so may not be benefiting from preventive services largely delivered in primary care. These findings can then be used to inform development and evaluation of interventions to improve care for underserved groups.
Nonetheless, there are some important sources of bias that must be considered when interpreting any results, relating to the fact that determining migration status is dependent on clinician coding. First, clinician coding may be incorrect, resulting in misclassification bias. Second, clinician coding may be incomplete, resulting in missing data, and therefore there may be under-recording of migration and the presence of migrants in the comparator population. Third, language coding was incentivised between 2008 and 2011, so representativeness may be better during that period and the cohort may be skewed towards non-English speaking migrants (selection bias) 27 . This could also be a unique strength of the study, as the cohort could be particularly useful for understanding healthcare access and use by non-English speaking migrants who may face additional barriers to care. Fourth, this study only captures the healthcare utilisation of migrants who are known to the NHS, rather than those who do not use healthcare or face significant access barriers that prevent them from accessing care. Findings are unlikely to be representative of migrant subgroups like asylum seekers, undocumented migrants and others who are unable to access care without fear of being charged for NHS services 28 . Fifth, CPRD does not provide routine linkages to data on individual-level deprivation, and the study's use of area-level deprivation does not account for the individual-level measures of socioeconomic position that play a role in the association between migration and healthcare utilisation.

Conclusion
In summary, this study has been designed as a novel linkage study to complement the Million Migrants study by including data from primary care and on EU migrants. The findings of this study will address important gaps in migration health research and inform policy aimed at increasing equitable healthcare for international migrants attending UK primary care.

Data availability
Underlying data
No data are associated with this article.

Laurence Gruer
Usher Institute, University of Edinburgh, Edinburgh, UK
In their Author response 2, Pathak states: Thank you for highlighting that the figure 1.6% should read 1.3% in the introduction. We have corrected the figure to 1.3% in the introduction, in line with Jain et al's calculation that approx. 81% of the 1.6% of individuals with immigration status codes (country of birth and language) were international migrants.
The authors have thus modified the text in the Introduction to read "The study estimated that 1.3% of individuals aged ≥ 65 years in CPRD GOLD were international migrants." This continues to be an incorrect expression of what Jain et al. report. They found that among people aged over 65 in the CPRD database only 1.6% had their immigration status recorded at all (0.7% country of birth, 0.9% first language status). First language is not a reliable indicator of migration status as many children born in the UK of migrant parents (and therefore not migrants themselves) learn their parents' language first and only subsequently learn English at school. This means that place of birth, the key datum upon which to build the "immigration phenotype", is not available for 99.3% of the individuals in the CPRD database. As Jain et al. state, "The most incompletely recorded social factor was immigration status". Whilst Jain et al. only focus on people aged 65 and over, there is no reason to think that the completeness of the data for this variable would be much better for younger people. It is also apparent that the 1.6% of people in the CPRD database aged over 65 with "immigration status" are not a representative sample of the population as a whole, as 81% of them were classified as "international migrants" when the actual proportion, as stated in the first line of the paper, is estimated by ONS as 14.2%. If the authors had correctly understood from the Jain paper that the only reliable piece of information upon which to base an "immigration phenotype" was available for less than 1% of the people in the CPRD database and that those had been recorded in a non-random manner, they would have realised that the CPRD database is a wholly inappropriate source of data for comparing the healthcare resource utilisation and mortality of migrants versus non-migrants. They would therefore not have embarked on the feasibility study in the first place and a lot of time and money would have been saved.
Given the immense heterogeneity of "migrants", any study of healthcare resource utilisation and mortality, even if it could accurately distinguish between "migrants" and "non-migrants", would then have to be able to categorise the "migrants" in various different ways such as by sex, age, different country of birth, duration of time in the UK and socio-economic circumstances, as well as the range of outcomes the authors state they would like to examine. All these variables also need to be recorded accurately for the analysis to provide a good approximation of reality. I played a key part in enabling ethnic group to be requested at death registration in Scotland from 1 January 2012. We had high hopes that this would allow us to compare the mortality rates of different ethnic groups. These hopes rose when we found there was around 95% completeness of the ethnic group recording (i.e. about 100 times more complete than recording of place of birth in the CPRD database!). However, more detailed analysis revealed that the deceased in some ethnic groups were significantly less likely to be recorded at all than the White Scottish majority, and, of those that were recorded, the ethnic group was significantly more likely to be wrong (when compared to how that person had self-reported their ethnic group in the census). This has meant that after almost 10 years of recording ethnic group at death registration, National Records of Scotland have been unable to publish any analyses of death rates by ethnicity. 1
I have also been involved in considerable efforts made in Scotland to increase the recording of ethnic group of hospital in-patients and out-patients in the hope of being able to study health service resource utilisation etc. by ethnic group. Over several years, this raised the completeness rate to about 80% (similar to the rate of recording ethnic group in the CPRD). However, a detailed analysis revealed that some ethnic groups such as Irish, Arabs and Gypsy travellers had such low apparent admission and attendance rates that it was clear they were not being consistently recorded, whereas some other rates looked unrealistically high. Consequently, no routine analysis of hospital utilisation by ethnic group in Scotland using these data has been published. 2 The only source of ethnic group and place of birth that has proved reliable is the census. By linking the census to various health care databases in the Scottish Health and Ethnicity Linkage Study, we have been able to publish a wide range of studies of mortality, morbidity and health service utilisation by ethnic group in Scotland. 3

The "migration phenotype" would be generated by developing "a transparent reproducible algorithm using clinical terminology codes to determine migration status". This would entail the following: "The CPRD code browsers will be searched for diagnostic terms relating to migration using the following search terms: *migrant*, *migrat*, *countr*, *asylum*, *refugee*, *visa*, *abroad*, *born in*, *origin*, *illegal*, *language*. This initial phenotype will then be reviewed and refined by migration health experts and experts in using CPRD from the CALIBER team. Finally, each diagnostic term will be assigned a category based on the type of term (visa status, language, country of birth, origin) and a category based on the certainty of migration status ("definite", "probable", "possible")." This could work if the relevant terms are recorded in the CPRD with a high degree of completeness and accuracy. Are they?
In the introduction, it is stated "An estimated 14.3% (9.4 million people) of people living in the UK in 2019 were international migrants." It later states: "A recent study using Clinical Practice Research Datalink (CPRD), the largest UK primary care EHR, described phenotypes for social factors amongst older individuals including migration status. The study estimated that 1.6% of individuals aged ≥ 65 years in CPRD were international migrants." As 1.6% seemed very low, I read the paper about this study (Jain et al. 2017). I discovered the above statement was incorrect. In fact, the study found that data completeness for immigration status was only 1.6%. This comprised 0.7% where country of birth was recorded and 0.9% where a "first language" code was recorded. Indeed, the paper explicitly said: the most incompletely recorded social factor was immigration status. It added, "among those with data on immigrant status, there was marked over-representation of immigrants (n = 7,866, ~81% of the total) among those with recorded data but under-representation when immigrant status was considered as a binary variable (1.3% of the total study population compared to 9.9% non-UK born individuals in the English Census)." This means that country of birth, the main indicator of immigration status, was not recorded at all for 99.3% of individuals in the CPRD database to be used to develop the phenotype. Even if the less reliable term "first language" is added, the two key terms for determining immigrant status were missing for 98.4% of the individuals in the database. None of the other search terms proposed for the phenotype algorithm could compensate for this vast amount of missing data. While completeness of recording of "ethnicity" was much higher at about 80%, this is not an adequate proxy for "migrant" as many people with a non-White British ethnicity were born in the UK and are thus non-immigrants. Whilst the study by Jain et al. was limited to people over 65, there is no reason to believe the recording of immigrant status in the CPRD database would be any better for younger people.
In the Discussion, the study team acknowledge the risk of missing data, stating "clinician coding may be incomplete resulting in missing data, and therefore, there may be under-recording of migration and the presence of migrants in the comparator population". What they don't seem to have appreciated is that over 98% of the relevant data are missing! Assuming the findings in the paper by Jain et al. are correct, they indicate that the CPRD cannot be used to develop a reliable immigrant phenotype.
From our experience in Scotland, the only reliable source of country of birth data is the Census. It was for this reason that the Scottish Health and Ethnicity Study was made possible through the successful linkage of the Census, with self-reported ethnic group and country of birth, to health and death records.
I thus respectfully invite the authors to reconsider whether either the feasibility study or the main study are viable.

Are sufficient details of the methods provided to allow replication by others? Yes
Are the datasets clearly presented in a useable and accessible format? Not applicable

Thanks very much for your helpful comments. We appreciate the time you have taken to read and provide constructive feedback on our protocol. We have addressed all of your points below, and included your comments alongside our responses in bold italics. We have numbered our responses in order to cross-reference them with the review.
******

Reviewer comment 1:
The viability of the study therefore hinges on being able to create a reliable "migration phenotype". This phenotype would be based on relevant data items in individuals' electronic health records in the Clinical Practice Research Datalink (CPRD) which could clearly differentiate "migrants" from "non-migrants". If reliable, the phenotype could then be used as the basis for the main study, in which migrants could be compared with non-migrants across a wide range of health and health service indicators. The "migration phenotype" would be generated by developing "a transparent reproducible algorithm using clinical terminology codes to determine migration status". This would entail the following: "The CPRD code browsers will be searched for diagnostic terms relating to migration using the following search terms: *migrant*, *migrat*, *countr*, *asylum*, *refugee*, *visa*, *abroad*, *born in*, *origin*, *illegal*, *language*. This initial phenotype will then be reviewed and refined by migration health experts and experts in using CPRD from the CALIBER team. Finally, each diagnostic term will be assigned a category based on the type of term (visa status, language, country of birth, origin) and a category based on the certainty of migration status ("definite", "probable", "possible")." This could work if the relevant terms are recorded in the CPRD with a high degree of completeness and accuracy. Are they?
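
For illustration only, the wildcard search described in the quoted passage can be sketched as follows. This is a minimal sketch and not the study team's algorithm: the code-list rows, the codes X001 to X005 and the function names are hypothetical, and matching is assumed to be case-insensitive, implemented here with Python's standard fnmatch module.

```python
import fnmatch

# Search terms as quoted from the protocol (wildcards as written there).
SEARCH_TERMS = [
    "*migrant*", "*migrat*", "*countr*", "*asylum*", "*refugee*", "*visa*",
    "*abroad*", "*born in*", "*origin*", "*illegal*", "*language*",
]

def matching_terms(description):
    """Return the search patterns that match a code description (case-insensitive)."""
    text = description.lower()
    return [p for p in SEARCH_TERMS if fnmatch.fnmatchcase(text, p)]

def screen_code_list(code_list):
    """Keep only the (code, description) rows that match at least one pattern."""
    hits = {}
    for code, description in code_list:
        patterns = matching_terms(description)
        if patterns:
            hits[code] = {"description": description, "matched": patterns}
    return hits

# Hypothetical rows standing in for a code browser export (not real CPRD codes).
EXAMPLE_CODES = [
    ("X001", "Born in Poland"),
    ("X002", "Asylum seeker"),
    ("X003", "Main spoken language Urdu"),
    ("X004", "Type 2 diabetes mellitus"),
    ("X005", "Exercise advisable"),
]

if __name__ == "__main__":
    for code, info in screen_code_list(EXAMPLE_CODES).items():
        print(code, info["description"], info["matched"])
```

In this toy list the unrelated term "Exercise advisable" is picked up by *visa*, which shows why the quoted protocol follows the automated search with expert review and a manual certainty category. The sketch says nothing about how completely such terms are recorded, which is the question raised above.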

Author response 1:
We agree that this study hinges on a migration phenotype and for that reason have included a feasibility study of the migration phenotype as part of the study protocol outlining how we would assess completeness and representativeness.
Since submitting this study protocol, we have completed the feasibility study, which shows that while migrants are under-recorded in CPRD GOLD compared to ONS migrant population estimates, the cohort's demographic characteristics are largely representative of the wider migrant population according to ONS. Sufficient power can also be achieved with the present migrant cohort to examine a variety of primary care outcomes. However, as this is a protocol paper we do not believe that these results should be included in this paper.
Publishing study protocols is important for transparency in research which is why we have written this protocol including the feasibility study.

Reviewer comment 2:
In the introduction, it is stated "An estimated 14.3% (9.4 million people) of people living in the UK in 2019 were international migrants." It later states: "A recent study using Clinical Practice Research Datalink (CPRD), the largest UK primary care EHR, described phenotypes for social factors amongst older individuals including migration status. The study estimated that 1.6% of individuals aged ≥ 65 years in CPRD were international migrants." As 1.6% seemed very low, I read the paper about this study (Jain et al. 2017). I discovered the above statement was incorrect. In fact, the study found that data completeness for immigration status was only 1.6%. This comprised 0.7% where country of birth was recorded and 0.9% where a "first language" code was recorded. Indeed, the paper explicitly said: the most incompletely recorded social factor was immigration status. It added, "among those with data on immigrant status, there was marked over-representation of immigrants (n = 7,866, ~81% of the total) among those with recorded data but under-representation when immigrant status was considered as a binary variable (1.3% of the total study population compared to 9.9% non-UK born individuals in the English Census)." This means that country of birth, the main indicator of immigration status, was not recorded at all for 99.3% of individuals in the CPRD database to be used to develop the phenotype. Even if the less reliable term "first language" is added, the two key terms for determining immigrant status were missing for 98.4% of the individuals in the database. None of the other search terms proposed for the phenotype algorithm could compensate for this vast amount of missing data. While completeness of recording of "ethnicity" was much higher at about 80%, this is not an adequate proxy for "migrant" as many people with a non-White British ethnicity were born in the UK and are thus non-immigrants. Whilst the study by Jain et al. was limited to people over 65, there is no reason to believe the recording of immigrant status in the CPRD database would be any better for younger people.
In the Discussion, the study team acknowledge the risk of missing data, stating "clinician coding may be incomplete resulting in missing data, and therefore, there may be under-recording of migration and the presence of migrants in the comparator population". What they don't seem to have appreciated is that over 98% of the relevant data are missing! Assuming the findings in the paper by Jain et al. are correct, they indicate that the CPRD cannot be used to develop a reliable immigrant phenotype.

Author response 2:
Thank you for highlighting that the figure 1.6% should read 1.3% in the introduction. We have corrected the figure to 1.3% in the introduction, in line with Jain et al.'s calculation that approximately 81% of the 1.6% of individuals with immigration status codes (country of birth and language) were international migrants.
We do not agree that Jain et al.'s paper indicates that the CPRD cannot be used to develop and test a reliable phenotype. Rather, Jain et al.'s study shows that a phenotype can be developed, but that it has not been evaluated in the whole CPRD population: it did not evaluate the phenotype in people aged under 65 years in CPRD GOLD, and did not evaluate it at all in CPRD Aurum. According to ONS 2011 census estimates, those aged 65 years and over make up only 27% of the migrant population, so Jain et al.'s study has not evaluated a migration phenotype in approximately three-quarters of the migrant population. We have outlined why we think it is reasonable to pursue completing a feasibility study of the phenotype across all age ranges in paragraph four of the introduction and, in giving us approval to complete this study, the Independent Scientific Advisory Committee for CPRD agree that it is reasonable to proceed with the feasibility study.
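
The two figures quoted in this response can be checked with simple arithmetic. The sketch below is illustrative only: the inputs are the percentages exactly as quoted above, and the variable names are our own.

```python
# Inputs as quoted in the author response (Jain et al. and ONS 2011 census estimates).
share_with_status_code = 0.016       # 1.6% of individuals had an immigration status code
share_migrant_if_coded = 0.81        # approximately 81% of those were international migrants
share_migrants_aged_65_plus = 0.27   # 27% of the migrant population is aged 65 years and over

# Share of the aged 65+ cohort identified as international migrants.
migrant_share_of_cohort = share_with_status_code * share_migrant_if_coded

# Share of the migrant population outside the age range studied by Jain et al.
share_migrants_under_65 = 1 - share_migrants_aged_65_plus

print(f"Migrants in the aged 65+ cohort: {migrant_share_of_cohort:.1%}")   # 1.3%
print(f"Migrant population aged under 65: {share_migrants_under_65:.0%}")  # 73%
```

The first result reproduces the corrected 1.3% figure, and the second corresponds to the "approximately three-quarters" of the migrant population referred to above.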

Reviewer comment 3:
From our experience in Scotland, the only reliable source of country of birth data is the Census. It was for this reason that the Scottish Health and Ethnicity Study was made possible through the successful linkage of the Census, with self-reported ethnic group and country of birth, to health and death records.

Laurence Lessard-Phillips
Institute for Research into Superdiversity, University of Birmingham, Birmingham, UK

This is a fascinating research protocol for looking into migrants' healthcare utilisation and mortality, with the aim of developing, and applying, a migration phenotype in two stages: first through a feasibility study, and then through a main study. The main data source is the CPRD. This is highly relevant and timely, with the potential to explore a relatively unexplored/limited area of research, often due to the lack of available data.
In order to make the research protocol clearer, I would suggest that the authors consider the following points:

○ It is unclear what migration proxies are being criticised; some of those proxies are also used in the data that will be used to assess the phenotype. It would be interesting to hear a bit more about how this will be dealt with and what types of assessment criteria will be used in the feasibility study, as these were not entirely clear to me.
○ On the point above: would it be worth looking into getting access to the APS (or LFS), which are used to produce the ONS 'estimates' (with their own source of biases) so that there could be a more detailed comparison?
○ Is there a possibility to also consider the difference between country of birth and nationality? Could there be instances where individuals are misclassified if nationality/citizenship is not looked into (nationality may also grant specific rights to individuals, regardless of where they, or their parents, were born).
○ Out of curiosity (and linked to issues of healthcare charges): will the data risk including visitors as well as migrants?
○ Is the comparator population in the main study too heterogeneous? This stems from my interest in migrant generations, but I assume that heterogeneity within the non-migrant population could be established from some of the included covariates.
○ It would be good to emphasise the fact that expert input has been used for the development of the phenotype (as it seems to be alluded to toward the end of the protocol).
○ It was good to mention that clinician coding could lead to bias. Could another source of bias also be that the data source, as relevant as it might be, may also exclude those who 1) do not use healthcare and/or 2) are being barred from using healthcare? Given the link to the DOTW/MdM reports, it may be relevant to mention.
○ What is the level of certainty that EU 'migrants' will be captured within the CPRD data? Could there be different terms being used? What if they are not? This is probably where expert input will be useful in the assessment of the phenotype.
○ The covariates included are important, but is there a danger of associating area-level IMD to individual deprivation/SES? Is there a way to capture this?
○ On the IMD: given the longitudinal aspect of the study, would it be worth also using earlier measures (if available)?
○ One minor point: is the reference in the third introductory sentence the correct one, as it seems to deal with Scotland when the text refers to a European systematic review... Maybe use the original reference within that article?

Of course, the points above are quite minor, and most are meant to be points to reflect on as the study develops and be dealt with relatively quickly. As mentioned at the start of this review, this study, and the data that it will generate, has great potential for our understanding of migrant health in the UK.
Is the rationale for, and objectives of, the study clearly described? Yes

Are sufficient details of the methods provided to allow replication by others? Yes
Are the datasets clearly presented in a useable and accessible format? Not applicable

Thanks very much for your helpful comments. We appreciate the time you have taken to read and provide constructive feedback on our protocol. We have addressed all of your points below, and included your comments alongside our responses in bold italics. We have numbered our responses in order to cross-reference them with the review.
******

Reviewer comment 1:
It is unclear what migration proxies are being criticised; some of those proxies are also used in the data that will be used to assess the phenotype. It would be interesting to hear a bit more about how this will be dealt with and what types of assessment criteria will be used in the feasibility study, as these were not entirely clear to me. On the point above: would it be worth looking into getting access to the APS (or LFS), which are used to produce the ONS 'estimates' (with their own source of biases) so that there could be a more detailed comparison?

Author response 1:
We have separated out "proxies" from the original sentence to make the examples easier to follow: "In addition, some have used proxy measures of migration which are unable to provide a true estimate of the impact of migration. For example, a study in England using registration with a GP after the age of 15 as a proxy for migration estimated hospital admission rates to be half the rate of the general population." The assessment criteria for the feasibility study include completeness and representativeness of the resultant CPRD GOLD migrant cohort, compared to ONS aggregate data. We have chosen not to compare results to the Labour Force Survey because the ONS states "The Labour Force Survey (LFS) is not designed to measure changes in the levels of population or long-term international migration... levels and changes in levels should be used with caution".

Reviewer comment 2:
Is there a possibility to also consider the difference between country of birth and nationality? Could there be instances where individuals are misclassified if nationality/citizenship is not looked into (nationality may also grant specific rights to individuals, regardless of where they, or their parents, were born).

Author response 2:
We agree that the conflation of country of birth and nationality through miscoding by clinicians is likely to affect the summary outputs by region of birth provided as part of the feasibility study and any sensitivity analysis conducted using region of birth. However, it is not possible to study nationality using CPRD electronic health record codes as it is difficult to ascertain whether non-UK origin terms definitively indicate nationality. The potential for misclassification bias, and other sources of bias, has therefore been addressed in the Discussion: "Nonetheless, there are some important sources of bias that must be considered when interpreting any results relating to the fact that determining migration status is dependent on clinician coding. First, clinician coding may be incorrect resulting in misclassification bias. Second, clinician coding may be incomplete resulting in missing data, and therefore, there may be under-recording of migration and the presence of migrants in the comparator population. Third, language coding was incentivised between 2008 to 2011