Million Migrants study of healthcare and mortality outcomes in non-EU migrants and refugees to England: Analysis protocol for a linked population-based cohort study of 1.5 million migrants

Background: In 2017, 15.6% of the people living in England were born abroad, yet we have a limited understanding of their use of health services and subsequent health conditions. This linked population-based cohort study aims to describe the hospital-based healthcare and mortality outcomes of 1.5 million non-European Union (EU) migrants and refugees in England. Methods and analysis: We will link four data sources: first, non-EU migrant tuberculosis pre-entry screening data; second, refugee pre-entry health assessment data; third, national hospital episode statistics; and fourth, Office for National Statistics death records. Using this linked dataset, we will then generate a population-based cohort to examine hospital-based events and mortality outcomes in England between Jan 1, 2006, and Dec 31, 2017. We will compare outcomes across three groups in our analyses: 1) non-EU international migrants, 2) refugees, and 3) general population of England. Ethics and dissemination: We will obtain approval to use unconsented patient identifiable data from the Secretary of State for Health through the Confidentiality Advisory Group and the National Health Service Research Ethics Committee. After data linkage, we will destroy identifying data and undertake all analyses using the pseudonymised dataset. The results will provide policy makers and civil society with detailed information about the health needs of non-EU international migrants and refugees in England.


Introduction
In 2017, 8.6 million (15.6%) people living in England were born abroad, with 5.3 million (9.2%) born outside of the European Union (EU) 1 . However, little is known about how these international migrants use England's National Health Service (NHS) or their subsequent health needs and mortality outcomes.
Here we define international migrants as people born outside of England 2 . This may include, for example, people who either have chosen to migrate (e.g. work, study, or join families) or those who may have been forced to migrate due to conflict, persecution or environmental disasters (e.g. refugees and asylum seekers).
A previous systematic review of healthcare usage in Europe showed that usage of accident and emergency (A&E) services by international migrants was high, but that screening and outpatient care usage was low 3 . Similarly, a recent study from Scotland showed that some ethnic minorities, whether migrants or their offspring, have higher levels of avoidable hospital admissions when compared to the non-migrant white Scottish population 4 . These findings suggest that migrants may be receiving poor quality primary and preventative healthcare or face barriers to accessing health services.
In England, no large-scale studies to date have been able to examine migrants' usage of hospital-based healthcare services. One primary care based study in England found that individuals who registered with a general practitioner (GP) for the first time when over the age of 15 (used as a proxy for international migrants) had half the rate of hospital admissions of the general population 5 . However, another study found that only one-third of new migrants had registered with a GP, complicating the use of this proxy measure as well as highlighting poor uptake of primary healthcare registration 6 . Other studies in England 7,8 have used country of birth to examine migration, but did not have information on the date of migration, the country from which an individual was migrating, or the visa category under which they entered England. These studies illustrate the methodological challenges of accurately identifying migrants within national data sources and the lack of available information on their migratory history, both of which limit the interpretation of previous research.
Despite poor access to preventative healthcare, there is evidence that migrants have a mortality advantage compared to host populations in the high-income countries to which they migrate. We recently conducted a systematic review and meta-analysis on the global patterns of mortality data in international migrants 9 . Our review showed that the levels of mortality in international migrants, measured using standardised mortality ratios, were lower compared to the host population in the countries of destination for most disease causes. The two exceptions were higher mortality due to infectious disease and due to external causes 9 . As we found very little data on refugees, asylum seekers and other forced migrants, our data are most representative of international migrants in high-income countries who are studying, working or have joined family members. Although our findings supported the healthy migrant hypothesis (an empirically observed mortality advantage of migrants relative to the host population 10 ), there is evidence from migrants residing in England and Wales that suggests that this advantage declines with age 7,8 . However, these studies were not able to assess whether this advantage also changed with duration of residence in England.
The evaluation of morbidity and mortality outcomes of migrants has been limited by difficulties in their identification in national data sources. To date, we have no data in England linking migrant and refugee health service usage and subsequent health conditions and mortality outcomes. There are existing disparate data sources on the health of international migrants; however, despite their existence, these data have not been integrated or analysed systematically. These data contain information that would enable us to comprehensively evaluate the health needs in migrants and refugees and develop evidence on how to improve access to hospital-based health services and preventative healthcare in England. The Million Migrant study will generate this evidence for the first time by linking records that contain data on the health outcomes for 1.5 million non-EU migrants and refugees.

Aim and objectives
The Million Migrant study will be a population-based cohort study that aims to examine secondary healthcare performance (e.g. quality and accessibility) and mortality in 1.5 million non-EU migrants and refugees in England. There are two main objectives of the study. First, to profile hospital-based healthcare performance by identifying existing health conditions and examining hospital admissions, readmissions and duration of admission of non-EU migrants and refugees compared to the general population in England. Second, to investigate mortality outcomes by health condition for non-EU migrants and refugees in comparison to the general population. This objective will examine whether or not our data replicates the mortality advantage of international migrants found in the literature. The study includes analyses that acknowledge the wider determinants influencing the health of international migrants such as the legal, social, economic and health structures and systems, health service access and support, exposures and behaviours, and epidemiological changes associated with population mobility (Figure 1).

Protocol
Data collection, processing, and linkage
Our study will link four data sources as outlined in Figure 2. First, non-EU migrant tuberculosis pre-entry screening data. This dataset contains records on international non-EU migrants resident in a country where tuberculosis is common (40 cases per 100,000 people), and who are planning to come and live in the United Kingdom (UK) for more than 6 months 11 . Individuals in this dataset were screened by the UK pre-entry tuberculosis screening programme between Jan 1, 2006, and Dec 31, 2017. UK pre-entry tuberculosis screening was conducted either by the International Organization for Migration (IOM) or by international clinics recognised by the UK Home Office and quality assessed by Public Health England (PHE). Second, refugee pre-entry health assessment data. Refugees undergo a health assessment that allows pre-departure information to be shared with local authorities and health services in the UK. Refugee health assessments were conducted by IOM between Mar 1, 2013, and Dec 31, 2017. Our study will therefore include two cohorts of international migrants to the UK (non-EU migrants and refugees) that will be linked to the final two datasets covering Jan 1, 2006, to Dec 31, 2017. Third, national hospital episode statistics (HES), including hospital admissions and attendances. Fourth, Office for National Statistics (ONS) death records, which contain cause of death information on all deaths between Jan 1, 2006, and Dec 31, 2017. We focus on non-EU migrants because of their availability in this dataset, but also because EU migrants are more likely to dual-use health systems and less likely to need adaptation in terms of cultural competence 13 . The analysis will be limited to England as the HES data will only be obtainable for English hospitals.
We will obtain identifying variables (forename, surname, aliases, date of birth, sex, country of origin, country of departure, date of tuberculosis pre-entry screening or refugee pre-entry health assessment, and visa category) from all non-EU migrants and refugees under appropriate legal and ethical approvals. We will then use the Personal Demographics Service (PDS), the national electronic database of NHS patient information such as name, address, and date of birth, to identify NHS numbers by matching on these identifying variables and add them to the tuberculosis pre-entry screening and refugee health assessment records. Where available, NHS numbers that have already been added by PHE will be integrated. Once the non-EU migrants and refugees have been matched to their NHS number, we will undertake deterministic linkage using NHS number to identify HES and ONS mortality records. Personal identifiers will then be removed, resulting in a pseudonymised linked dataset which will be securely transferred to UCL for data analysis, as outlined in Figure 2. We will compare matching levels (e.g. whether an individual record is linked to their NHS number or not) across groups defined by age, sex, year and age of entry to the UK, and country of origin. The non-EU migrant tuberculosis pre-entry screening dataset might contain duplicate records for some individuals who require repeat tuberculosis screening. Duplicate records will be analysed on the basis of whether they occurred within 12 months of each other or not, as per previously defined rules 14 provided in full in Extended data file 7. Non-EU migrant and refugee data are cleaned by the IOM epidemiology unit in coordination with clinics to ensure that records include all results on individuals screened and that any duplicate entries resulting from administrative error are removed or consolidated into one record.
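As a rough illustration of the steps described above (exact matching on NHS number, removal of identifiers, and comparison of matching levels across groups), the following minimal Python sketch may help. All field names (`nhs_number`, `forename`, `sex`, `pseudo_id` and so on) are hypothetical, the PDS tracing step is assumed to have already happened, and this is not the study's actual linkage pipeline.

```python
import secrets
from collections import defaultdict

# Hypothetical identifier field names; the real datasets will differ.
IDENTIFIERS = {"nhs_number", "forename", "surname", "date_of_birth"}


def strip_identifiers(row):
    """Return a copy of a record without personal identifiers."""
    return {k: v for k, v in row.items() if k not in IDENTIFIERS}


def link_and_pseudonymise(cohort, hes_episodes, ons_deaths):
    """Exact-match cohort members to HES and ONS rows on NHS number,
    then replace identifiers with a random study ID."""
    hes_by_nhs = defaultdict(list)
    for episode in hes_episodes:
        hes_by_nhs[episode["nhs_number"]].append(strip_identifiers(episode))
    death_by_nhs = {d["nhs_number"]: strip_identifiers(d) for d in ons_deaths}

    linked = []
    for person in cohort:
        nhs = person.get("nhs_number")  # None if tracing found no NHS number
        record = strip_identifiers(person)
        record["pseudo_id"] = secrets.token_hex(8)  # not derived from identifiers
        record["matched"] = nhs is not None
        record["episodes"] = hes_by_nhs.get(nhs, []) if nhs is not None else []
        record["death"] = death_by_nhs.get(nhs) if nhs is not None else None
        linked.append(record)
    return linked


def match_rate_by(linked, key):
    """Proportion of records with an NHS number, per level of `key`."""
    totals, matched = defaultdict(int), defaultdict(int)
    for record in linked:
        totals[record[key]] += 1
        matched[record[key]] += record["matched"]
    return {level: matched[level] / totals[level] for level in totals}
```

Calling `match_rate_by(linked, "sex")` on the output would give the share of records with an NHS number for each sex; the same function could be reused for other grouping variables such as year of entry.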
Cleaning and consistency checking of the final dataset will be undertaken by examining the distribution of variables, the range of individual variables, and missing data.
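A minimal sketch of such a consistency check is given below. It summarises one variable at a time by counting missing values, reporting the observed range, and flagging values outside a plausible range. The function name, the field name `age` and the plausible range are illustrative assumptions rather than part of the protocol.

```python
def summarise_variable(records, field, valid_range=None):
    """Summarise missingness and range for one field across records."""
    values = [r.get(field) for r in records]
    present = [v for v in values if v is not None]
    out_of_range = 0
    if valid_range is not None:
        low, high = valid_range
        out_of_range = sum(1 for v in present if not low <= v <= high)
    return {
        "n": len(values),
        "missing": len(values) - len(present),
        "min": min(present) if present else None,
        "max": max(present) if present else None,
        "out_of_range": out_of_range,
    }
```

For example, `summarise_variable(records, "age", valid_range=(0, 110))` would flag implausible ages for manual review.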

Ethics and information governance
To undertake this work, we require access to patient identifiable data without individual consent. We will apply to the Secretary of State for Health through the Confidentiality Advisory Group (CAG) to obtain approval for this work. After data linkage (see earlier section) we will destroy all identifying data and undertake all analyses using the pseudonymised dataset outlined in Figure 2. We will also seek ethical approval for the study from an NHS Research Ethics Committee. We believe that the benefits of this study for the migrant population outweigh any risks. The primary risk that could be anticipated is a data breach of sensitive information; we are developing a comprehensive data management and data sharing plan to minimise this risk. The dataset will be created and analysed by a team with extensive experience of handling large datasets securely, and the organisations involved in this work (UCL and PHE) have extensive Information Security and Governance procedures in place.

Comparator groups
We will compare three groups in our analyses: 1) non-EU international migrants, 2) refugees, and 3) general population in England. We aim to disaggregate the non-EU migrants and refugees by: 1) age at migration, 2) sex, 3) ethnicity, 4) visa category, 5) country of origin, and 6) date of screening. We will examine each group over the follow-up time period. For refugees, we will also examine country of departure as it is often different from the country of origin. Due to their smaller numbers, we will likely examine refugees by World Health Organisation (WHO) region of origin instead of the country they migrated from. Final geographical categorisations used for migrant groups will be chosen to minimise the risk of disclosure, and migrants may therefore be grouped into WHO sub-regions. Secondly, we will compare the non-EU migrants and refugees to the general population in England by deprivation level (Index of Multiple Deprivation, IMD, quintile). This will be done using an anonymous HES sample. Here the general population will be composed mostly of England-born residents, along with other types of international migrants who did not take part in one of the two screening programmes. These include EU migrants, international migrants who are not required to have pre-entry tuberculosis screening (e.g. migrants from low tuberculosis incidence countries such as the United States), irregular migrants (e.g. undocumented), and migrants and refugees who arrived before the start of their respective screening programmes.

Outcomes
We present a series of outcomes for our healthcare and mortality analyses. We chose outcomes to ensure our analysis is consistent with those used in previous published analyses 4,7,8 , in addition to outcomes that are of high interest to researchers but that previous studies were not powered to collect. Moreover, we have chosen a range of outcomes that collectively reflect the priorities of health policy makers as well as migrants and refugees who attended our patient engagement workshops on consent process, data linkage and analysis.
Hospital-based healthcare outcomes. We will profile the following hospital-based healthcare outcomes in non-EU migrants and refugees: 1) hospital attendances (inpatient, outpatient, and A&E), 2) hospital admissions (inpatient), 3) duration of hospital admission, and 4) 30 day emergency readmissions. We will explore these four outcomes by sub-conditions where appropriate. The clinical definitions and methodological approaches for each outcome are provided in Table 1. Full clinical definitions of each outcome's subgroup are provided in Table 2. HES currently uses ICD-10, the 10th version of the international classification of diseases and related health problems, to code conditions, and OPCS-4, the classification of interventions and procedures, to code all interventions and procedures. ICD-10 and OPCS-4 code lists for each are provided in Extended data files 2-5.
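To make the fourth outcome concrete, the sketch below derives a yes/no indicator for emergency readmission within 30 days of the index admission's discharge date. It assumes that the index admission is a person's first recorded admission and that each admission record already carries an `emergency` flag; how that flag is derived from HES admission method codes is left to the caller. The field names are hypothetical.

```python
from datetime import date


def readmitted_within_30_days(admissions):
    """Return True if any emergency admission starts within 30 days
    after discharge from the index (earliest) admission.

    admissions: list of dicts with 'admitted' and 'discharged' (date)
    and 'emergency' (bool). Field names are illustrative.
    """
    ordered = sorted(admissions, key=lambda a: a["admitted"])
    if not ordered:
        return False
    index = ordered[0]  # assumption: index admission = first admission
    for later in ordered[1:]:
        gap = (later["admitted"] - index["discharged"]).days
        if later["emergency"] and 0 <= gap <= 30:
            return True
    return False
```

Because the outcome is only defined for people with an initial hospitalisation, the function would be applied to that subgroup rather than to the whole cohort.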
Mortality outcomes. We will examine the following mortality outcomes in non-EU migrants and refugees: 1) death and 2) death due to a specific condition. The clinical definition and the methodological approaches for each outcome are provided in Table 1. Full clinical definitions of each outcome's condition are provided in Table 2. ONS datasets use ICD-10 codes to code the health condition registered at death. ICD-10 code lists for each are provided in Extended data files 2-5. Lastly, we aim to examine whether the mortality advantage found in the literature can be replicated in this migrant and refugee population in England.

Entry and exit from cohort
We will include all visa applicants (non-EU migrants) from 101 countries (see Extended data file 1) with pre-entry tuberculosis screening in this study and all refugees who underwent pre-entry health assessment.
Individuals will enter the cohort on the date of their tuberculosis pre-entry screening or refugee pre-entry health assessment, whichever is later. Individuals will be followed up until the earliest of: end of the follow-up period (31st December 2017), emigration, or death. Individuals found to have tuberculosis at pre-entry screening will not enter the cohort at that point, because they are not given a certificate of clearance for tuberculosis and their visa process is put on hold until they have been treated and a certificate has been issued. Only then will they enter the cohort.
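The exit rule above (earliest of death, emigration, or end of follow-up) can be sketched as follows. The function and argument names are illustrative, the end-of-follow-up date is passed in explicitly rather than fixed, and unknown dates are represented as `None`.

```python
from datetime import date


def follow_up_days(entry, study_end, death=None, emigration=None):
    """Days at risk from cohort entry to the earliest of death,
    emigration, or the end of follow-up."""
    exit_date = min(d for d in (death, emigration, study_end) if d is not None)
    return max((exit_date - entry).days, 0)  # never negative
```

For example, someone entering on 1 January 2010 who died on 1 January 2012 would contribute 730 days, regardless of when follow-up ends.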
Individuals in our cohort will be at risk of hospital admission or death from the date of entry until the first of the following events: death, emigration, or the end of the follow-up period. Because no data are available to indicate whether an individual migrant is living in Scotland, Wales or Northern Ireland (or was resettled to these countries, in the case of refugees) or has emigrated, these events will be accounted for probabilistically by multiple imputation, building on previously described methods 14,22 .

Table 1 (excerpt). Clinical definitions and methodological approaches for each outcome.

30 day emergency readmission
Methodological approach: Binary indicator for emergency readmission (yes/no) recorded within 30 days of the index admission discharge date. Emergency admissions are defined as those where the admission method is waiting list, booked or planned (11, 12 or 13). To be explored in the subgroup of people with an initial hospitalisation.

Mortality outcomes

Death from all causes
Clinical definition: Deaths in England from any cause.
Methodological approach: Binary indicator for death due to all causes (yes/no). Deaths will primarily be identified through linkage to ONS deaths registration data, but also through HES (where the Method of Discharge field is coded as "dead" (4)), as the latter method may better ascertain information on recent deaths where there is a delay in death registration (e.g. because a coroner's report is required).

Table 2 (excerpt). Clinical definitions of outcome subgroups.

Conditions where hospital admission or death is for a common mental and behavioural disorder.

Multimorbidity: Co-existence of two or more chronic conditions, each one of which is either: (1) a physical noncommunicable disease of long duration, such as a cardiovascular disease or cancer; (2) a mental health condition of long duration, such as a mood disorder or dementia; or (3) an infectious disease of long duration, such as HIV or hepatitis C 18 .

All causes*: Death due to any cause.

ICD-10 chapter* 19 : Death due to a specific condition, such as infectious disease, disease of the blood, cardiovascular disease, digestive disease, genitourinary disease, musculoskeletal disease, nervous disease, respiratory disease, endocrine disease, injury or external causes, mental and behavioural disorders, or neoplasms 20 .

Maternal deaths*: Death of a woman while pregnant or within 42 days of termination of pregnancy, irrespective of the duration and site of the pregnancy, from any cause related to or aggravated by the pregnancy or its management but not from accidental or incidental causes 21 .

*Subgroup only applies to the mortality outcomes.
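The death ascertainment rule in Table 1 (ONS deaths registration first, with a HES discharge coded as "dead" (4) as a fallback when registration is delayed) might be sketched as below. The field names `discharged` and `discharge_method` are hypothetical stand-ins for the corresponding HES fields, and taking the earliest qualifying discharge is an assumption made here for illustration.

```python
from datetime import date

HES_DISCHARGE_DEAD = 4  # Method of Discharge code for "dead", as stated in Table 1


def ascertain_death(ons_death_date, hes_episodes):
    """Return (date_of_death, source), preferring the ONS registration.

    Falls back to a HES episode whose discharge method is coded as dead.
    Returns (None, None) when no death is recorded in either source.
    """
    if ons_death_date is not None:
        return ons_death_date, "ONS"
    dead = [e["discharged"] for e in hes_episodes
            if e.get("discharge_method") == HES_DISCHARGE_DEAD]
    if dead:
        return min(dead), "HES"  # assumption: use the earliest such discharge
    return None, None
```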
Sample size
A total of 1,700,000 records of non-EU migrants who underwent pre-entry tuberculosis screening to enter the UK between 2005-2017 will be included. After removal of duplicate records there will be just over 1,500,000 unique individuals. Approximately 10% of migrants move to Scotland and Northern Ireland and will not be linked to HES and ONS data. Therefore, linkage to HES and ONS will be on approximately 1,380,000 non-EU migrants. Table 3 provides examples of the precision with which prevalence will be estimated and the changes in relative risk detectable between migrant subgroups. The study therefore has sufficient statistical power (80%) to detect changes in common outcomes (e.g. admission due to ambulatory care sensitive condition) and rare outcomes (e.g. mood (affective) disorders), at the 5% significance level.
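To give a sense of the precision available at this sample size, the sketch below computes the half-width of a 95% confidence interval for a prevalence estimate using the normal approximation. This is an illustrative calculation only, not necessarily the method behind Table 3; the prevalence of 8.4 per 1,000 is borrowed from the Table 3 footnote purely as an example value.

```python
import math


def prevalence_ci_half_width(prevalence, n, z=1.96):
    """Half-width of a normal-approximation confidence interval
    for a proportion estimated from n individuals."""
    return z * math.sqrt(prevalence * (1 - prevalence) / n)


# Illustrative inputs: approximately 1,380,000 linked individuals and an
# outcome occurring in 8.4 per 1,000 people.
half_width = prevalence_ci_half_width(0.0084, 1_380_000)
print(f"95% CI half-width: {half_width * 1000:.3f} per 1,000")
```

The same function applied to a much smaller subgroup shows how quickly precision degrades, which is the reason subgroup sizes matter for the rarer outcomes.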

Analysis plan
Our analysis plan has been designed to meet our two main objectives of profiling hospital-based healthcare and mortality outcomes for non-EU migrants and refugees. To achieve this, we will undertake the analysis in three phases. In the first phase we will summarise and compare baseline characteristics (see Table 4) between the non-EU migrant group, the refugee group and the general population in England. With the exception of ethnicity, all baseline characteristics are anticipated to be fully recorded (chronic disease is presumed to be absent unless recorded). Missing values for ethnicity will be analysed as a separate "not recorded" group. In the second phase, we will summarise the hospital-based healthcare and mortality outcomes of the three study populations, using the general population of England as a reference. We will estimate the crude association between each of the outcomes and these study population groups. We will then re-estimate the association between each of the outcomes and the study population group after adjusting for characteristics at the time of hospital admission or time of death: age, sex, chronic disease, calendar time period, and reason for hospital admission or cause of death. Finally, an appropriate statistical model (selected on the basis of meeting assumptions such as proportional hazards for Cox regression) will be used to analyse the relationship between the study comparison groups and each of the outcomes. Crude models will be fitted prior to adjustment for "baseline" measurements at or before the index admission. We will write up the analysis in accordance with the REporting of studies Conducted using Observational Routinely-collected health Data (RECORD) statement.

**This column indicates the hazard ratio detectable for the subgroup using the largest group as baseline (e.g. 100,000 Bangladesh) and assuming this baseline group has the same rate as the overall population level (e.g. 8.4 per 1,000 for ACS conditions).
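As one simple illustration of a crude comparison between a study group and the reference population, the sketch below computes an incidence rate ratio with a 95% confidence interval on the log scale. This is a stand-in for illustration rather than the Cox regression or other model that will eventually be selected, and the event counts and person-years in the usage note are invented.

```python
import math


def rate_ratio(events_a, person_years_a, events_b, person_years_b, z=1.96):
    """Crude incidence rate ratio (group A vs reference group B)
    with a Wald confidence interval computed on the log scale."""
    ratio = (events_a / person_years_a) / (events_b / person_years_b)
    se_log = math.sqrt(1 / events_a + 1 / events_b)
    lower = math.exp(math.log(ratio) - z * se_log)
    upper = math.exp(math.log(ratio) + z * se_log)
    return ratio, lower, upper
```

With made-up inputs such as `rate_ratio(50, 10_000, 100, 10_000)`, the ratio is 0.5 and the interval excludes 1, which is the kind of crude result that would then be re-examined after adjustment.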

Table 4. Baseline characteristics.

Variable: Description
Age at migration (in years): as recorded at the non-EU migrant pre-entry tuberculosis screening or refugee pre-entry health assessment
Sex: as recorded at the non-EU migrant pre-entry tuberculosis screening or refugee pre-entry health assessment
Ethnicity: as recorded at the non-EU migrant pre-entry tuberculosis screening or refugee pre-entry health assessment
Visa category: as recorded at the non-EU migrant pre-entry tuberculosis screening or refugee pre-entry health assessment
Country of origin: as recorded at the non-EU migrant pre-entry tuberculosis screening or refugee pre-entry health assessment
Country of departure: as recorded at the refugee pre-entry health assessment
Date of pre-entry tuberculosis screening or refugee pre-entry health assessment: as recorded at the non-EU migrant pre-entry tuberculosis screening or refugee pre-entry health assessment

Sensitivity analyses
To determine the robustness of our final results and to quantitatively account for any uncertainty, sensitivity analyses will be conducted to examine the extent to which our findings are affected by changes in methods or values of unmeasured variables. We are uncertain about the length of time non-EU migrants remain in England following their arrival. In a first sensitivity analysis, all migrants will be assumed to stay for one and a half years, the median time of stay for an international migrant, providing a lower estimate of person time at risk. In a second sensitivity analysis, all migrants will be assumed to stay until the end of the study period of 31st December 2017. This is the more conservative assumption, and whilst it unrealistically inflates the denominator, it provides a lower bound for the estimates of incidence and prevalence.
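The two scenarios can be sketched as alternative ways of computing person-time at risk. In the hypothetical function below, passing `assumed_stay_years=1.5` caps each person's follow-up at one and a half years after entry (the first scenario), while leaving it as `None` follows everyone to the end of the study period (the second scenario). The names and the use of 365.25 days per year are illustrative choices.

```python
from datetime import date, timedelta


def person_years(entries, study_end, assumed_stay_years=None):
    """Total person-years at risk for a list of cohort entry dates.

    If assumed_stay_years is given, each person exits after that many
    years or at study_end, whichever comes first.
    """
    total_days = 0
    for entry in entries:
        exit_date = study_end
        if assumed_stay_years is not None:
            stay = timedelta(days=round(assumed_stay_years * 365.25))
            exit_date = min(study_end, entry + stay)
        total_days += max((exit_date - entry).days, 0)
    return total_days / 365.25
```

Dividing the same number of events by each of the two denominators gives an upper and a lower estimate of the rate, which is how the two sensitivity analyses bracket the main result.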

Dissemination of results
At the end of the study we will convene a group of policy makers, non-governmental organizations, migrants and refugees and the public to feed back the results of our study and seek suggestions on ways to take this work forward. We will also disseminate our findings to relevant policy makers and schemes through a series of regional workshops.

Study status
We are currently in the process of seeking ethical and Confidentiality Advisory Group (CAG) approvals for the study.

Discussion
We describe a novel record linkage study that will use routine data from multiple sources to generate the Million Migrant study. The study has several advantages including the opportunity to accurately identify migrants in UK routine health records and their subsequent health needs. The strengths of this approach are the creation of a highly-powered cohort study that harnesses existing data to uncover health patterns in this often difficult to identify or invisible population. The Million Migrant study aims to improve evidence on hospital-based events and mortality and will better position the scientific community to inform policy makers and civil society with rigorous data about the health of migrants in England.
There are several limitations to our study. We will be unable to include data from primary care due to the lack of a national dataset available for this purpose. As a result, we will be unable to examine any contribution to the health and care of the individuals from primary and community or social care. Designing a new linkage of the Million Migrant cohort to primary care data would provide a more comprehensive understanding of primary care usage within this population and allow us to identify new opportunities for community-level interventions, and this may become a possibility in the coming years. We also do not include data from all migrant sub-groups. Irregular migrants (e.g. entrants who enter, stay or work in a country without the necessary authorization such as undocumented entrants, failed asylum seekers, visa overstayers, children born to irregular migrant couples), migrants entering on a temporary visa (e.g. tourist visa), EU and EEA migrants, international migrants from low-incidence tuberculosis countries who consequently do not have pre-entry tuberculosis screening (e.g. United States of America, Chile, and Egypt), and international migrants who migrated to the UK before the start of either health screening programme will not be captured. As such, the study findings will not be generalisable to these groups. Additionally, although smaller in number, these groups could potentially be included in our randomly generated sample from the general population residing in England. This is important to consider in our interpretation of the findings. Our proposed study will not be able to assess whether the healthcare and mortality outcomes were affected by frequency of travel abroad, health service usage abroad, uncertainty in length of residence in the UK, movement within the UK outside of England to access healthcare, or wider socio-environmental determinants of health.
These factors will be later examined through the creation of an electronic longitudinal cohort study using a mobile phone application to collect data on the health of migrants after moving to the UK (part of RWA's Wellcome Trust Fellowship, Public health data science to investigate and improve migrant health in the UK).
To help ensure impact from the work, we have engaged with policy makers, non-governmental organizations, migrants and refugees and the public throughout the design and conduct of the study to ensure relevance to them and prepare a pathway for impact, and will continue to do this through to the end of the project. In designing this study, we held a workshop with international migrants and refugees to understand their views on the consent process, data linkage and analysis. We have also involved national and international policy makers in the design stage.
In England today, 15.6% of the population are international migrants, constituting an important and large group of people. However, there is still a limited understanding of their health needs and use of secondary care in the NHS. This study will fill an important gap in the literature and provide local, regional and national policy makers with detailed information about the health needs of this population. Our results will include information about how and where healthcare services can be improved to prevent hospital admissions, and data on the causes of death in this large and important group of people. Whilst maintaining the highest scientific and ethical standards on the use of existing data, or big data, we will set forth an avenue to advance knowledge and good practices in the field of migration and health, for the UK and internationally.

Data availability

Underlying data
No data is associated with this article.

Extended data
All extended data is publicly available on Open Science Framework: Million Migrants study of healthcare and mortality outcomes in non-EU migrants and refugees to England: Analysis

Reporting guidelines
To review the study's RECORD checklist, please see Extended data file 8.

The protocol describes the authors' intention to use routinely collected data for a subset of the total migrant population: those who receive a refugee health check, and those who are screened for TB because they come from countries with high TB rates. It is not possible to analyse all migrants; only those who enter as refugees and those who undergo screening because they originate from countries with high TB rates. This is the basis of their migrant cohort, which, as the authors point out, omits migrants from developed countries with few TB cases.

Grant information
Data linkage
I concentrate here on data linkage aspects because two reviewers have previously covered other aspects. The data linkage relies on the Personal Demographics Service (PDS) dataset. This dataset is key because they will use it to add NHS numbers to the migrant/refugee datasets and so enable deterministic joining, using the NHS number, of the migrant cohort to their corresponding records in the HES and ONS mortality datasets.
While there is an extended file on the de-duplication of refugee/migrant records to create the migrant cohort, there are no additional files on the linkage. I suggest one is added. As it currently stands, there are insufficient details in the methods to allow replication. Some additional detail on the datasets they will be using would be useful. Their explanation would benefit from details: What is the percentage of NHS numbers added by PHE? What will they do with the cohort members who do not have an NHS number?
The completeness of the PDS dataset, and allocation of an NHS number relies on the person having utilised health care services. What is the percentage of the cohort they expect to have an NHS number?
What will they do if the PDS contains only a third of the migrants who have registered with a GP, as highlighted by Stagg et al.? I remain sceptical about the PDS having data added through all the sources mentioned on the NHS Digital website (pharmacies, child and mental health, secondary care providers) and believe the main source of data is the patients themselves when they register with a new GP.
Checking thoroughly with NHS Digital at this stage may save time later. Or adding such reassurances to the protocol would help.
Will they simply assume no NHS number means no utilisation, thereby including these cohort migrants in the no healthcare utilisation denominator? They are using data up until end 2017 and it is likely that the proportion of migrants with an NHS number increases the longer they spend in the UK. Adding time since entry to the UK would also be a useful variable by which to stratify their comparisons. A young healthy migrant is unlikely to need any healthcare, particularly if they are male, due to lack of screening and reproductive care needs, and I note they will use sex in comparing matching levels.
Outcomes and analysis plan
I suggest that unplanned (emergency) admissions are the type that should be avoided ideally, and are also an indicator that there is a lack of preventative care. Can this be added as an additional group for analysis? These then would be separate from the planned admissions that have a prevention goal.
There seems to be an opportunity to model several outcomes together in a combined model, for instance using multivariate generalized linear mixed models. This would place the relative importance of the different outcomes in the same model and account for the multiple testing they would otherwise have to correct for (e.g. with Bonferroni). Without a combined model, they should use a smaller significance level to account for multiple hypothesis testing; there are four outcomes for healthcare utilisation, for instance. The justification of the use of Cox regression seems a little confused. They should rephrase the outcome as time to event (e.g. hospital attendance) rather than specify the follow-up time as the event itself. Their note about confirming statistical modelling approaches following review of data is usual in the case of routine data when the distribution of each outcome variable is unknown. Why not use Cox regression for the all-cause death outcome also? Overall, their analysis plan seems underspecified.
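To make the multiple testing point concrete, the following is a minimal sketch of a Bonferroni adjustment for the four healthcare utilisation outcomes. The function names and the p-values are hypothetical and purely illustrative; they are not taken from the protocol, and the authors may well prefer a different correction.

```python
def bonferroni_alpha(alpha: float, n_tests: int) -> float:
    """Per-test significance level that keeps the family-wise error rate at or below alpha."""
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")
    return alpha / n_tests


def bonferroni_reject(p_values, alpha=0.05):
    """Return, for each p-value, whether it passes the Bonferroni-adjusted threshold."""
    threshold = bonferroni_alpha(alpha, len(p_values))
    return [p <= threshold for p in p_values]


# Four healthcare utilisation outcomes, as counted in the review.
# These p-values are invented for illustration only.
illustrative_p = [0.030, 0.004, 0.012, 0.200]

per_test_alpha = bonferroni_alpha(0.05, len(illustrative_p))
decisions = bonferroni_reject(illustrative_p)
```

With four tests and an overall level of 0.05, each outcome would be tested at 0.0125, so a result such as p = 0.03 would no longer count as significant.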
Sample size
I note that they mention loss of 10% from migration to Scotland and Northern Ireland. Although the proportion moving to Wales will be small, it is incorrect to include Wales because these data are held in the Patient Episode Dataset for Wales (PEDW) and are not included in the HES dataset. This needs adjusting.
Discussion/additional points
I agree with the first reviewer that it is particularly important the authors remove the conflicting statements about the analytical potential to examine the wider determinants of health. Additionally, it is important to note that other national data sources exist in which one can identify migrants and examine these outcomes, contrary to that suggested by the authors. For example, see the Office for National Statistics Longitudinal Study (LS), which combines data from censuses, National Health Service systems and civil registers. With the LS data, it is possible to identify migrants and examine all-cause and cause-specific mortality, cancer registrations, self-reported health and long-term illness.

The above issues stem from how the study is framed: around all international immigrants in England. Indeed, the introduction defines international migrants in the broadest possible terms. However, the Million Migrants Study excludes a significant proportion of all international immigrants (at least 40%). With the study focusing mainly on refugees and non-EU migrants from high TB prevalence countries, the introduction should explicitly focus on these groups. Then it would be accurate to state that there is a lack of studies in England on refugees (I think just one, see: Swerdlow, 1991). Similarly, it would then be accurate to state that there is a lack of national data sources in England that have the relevant information to be able to identify refugees (which would require information on country of birth, reason for arrival, visa type, etc.). I recommend the authors re-frame their introduction around the migrant subgroups they are studying, place their study more specifically in the context of the previous literature on migrant mortality in England, and be more explicit in their phrasing.

Aims
As alluded to above, the study sample is selective: refugees and non-EU migrants from countries with high tuberculosis prevalence. This is fine and explicitly acknowledged by the authors in their discussion as a limitation. However, it conflicts with a repeatedly stated objective to investigate whether or not a migrant mortality advantage can be replicated in England (this also relates to how the introduction is framed). Of course, a migrant mortality advantage has already been observed in England (and Wales). Observing, or indeed not observing, lower mortality among the migrants in this sample relative to the general population would not provide evidence of an overall migrant mortality advantage in England. I suggest the authors qualify such statements to reflect this.

Data collection, processing, and linkage
Are data on EU migrants simply not available or is it a choice by the authors to exclude them? If the data is not available, then the authors can simply state this. Better to do this than to over-generalize about culture and use of health care systems among EU migrants. If it is a restriction of the data, the exclusion does not need to be further justified and cannot be avoided. If, on the other hand, it was a choice, then the authors should consider including EU migrants to provide a broader appeal of this new data source for its potential users.
Will the study cover immigrants arriving at all ages?
Exactly what variables will be available in this data source? It might be nice if the authors were to provide a preliminary list to give some indication of the scope of the data.

Comparator groups
Is it not possible to remove other types of migrants who did not undergo screening from the general population? Further, it is important to note that the general population will include children of migrants, who form a small but growing proportion of England's population. A growing body of research 11-14 suggests that children of migrants have different mortality patterns to majority populations of high-income host countries too. As a basic approximation, if we take the 6.4% of EU migrants, some proportion of the 9.2% of non-EU migrants who are not subjected to screening, and the 9.2% of the population who are children of migrants (from Eurostat, 2014), then approximately 16% of the general population is formed of individuals who should be excluded (or form other categories). Is there any additional information (country of birth or ethnicity) the authors can use to remove these individuals from the general population?
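The approximation above can be written out as a short calculation. This is only a sketch: the three percentages are those quoted in this review, while the fraction of non-EU migrants who are not screened is unknown to me and is treated here as a hypothetical parameter.

```python
def excluded_share(eu_pct: float, non_eu_pct: float, children_pct: float,
                   unscreened_fraction: float) -> float:
    """Percentage of the general population that would ideally be excluded.

    unscreened_fraction is the (assumed, unknown) share of non-EU migrants
    who did not undergo pre-entry screening and so remain in the comparator.
    """
    if not 0.0 <= unscreened_fraction <= 1.0:
        raise ValueError("unscreened_fraction must be between 0 and 1")
    return eu_pct + unscreened_fraction * non_eu_pct + children_pct


# Percentages quoted in the review; the unscreened fractions are illustrative.
lower = excluded_share(6.4, 9.2, 9.2, unscreened_fraction=0.0)
upper = excluded_share(6.4, 9.2, 9.2, unscreened_fraction=1.0)
```

Under these figures the share ranges from roughly 15.6% (no unscreened non-EU migrants) to roughly 24.8% (all unscreened), so the figure of about 16% is best read as a lower bound.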
More information on exactly how the general population is sampled would also be very useful. Knowing this is as important as knowing migrants are accurately defined.
Please expand upon the rationale behind comparing migrants, who have just arrived and have very little exposure to conditions in England, to the general population by IMD score.

Entry and exit from cohort
Accurately calculating exposures for mobile populations such as migrants is one of the most challenging tasks to achieve in the studies of migrant populations (see previous studies on under-coverage, over-coverage, censoring bias etc…). With this in mind, it would be informative if the authors devoted a little more space to telling us exactly how immigrant entries and exits are captured in their data and discussing the potential biases incurred from the information they have used.
For example, entries are defined as the latest date of two pre-entry screenings. Could this lead to cases where individuals are considered "at risk" when they are still living in their origin countries? Are delays between pre-entry screening and arrival in England common? How long do they tend to be? Are there people who undertake screening, but never arrive in England? What happens to these individuals? Could they be incorrectly included in the data source? Some clarification on the way emigrations are imputed would also be helpful (rather than linking to previous studies). Is there any information in the four data sources that could provide some (in)direct information on (r)emigration? These events are unlikely to be negligible among the foreign-born, and a lack of knowledge about exactly when people leave can contribute to the downward biasing of mortality rates. I appreciate the sensitivity analysis that the authors have done in response to this issue. Nonetheless, a good deal of uncertainty remains around when exactly individuals leave the country.
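A small sketch may help show why the choice of entry date matters for the rates. The records below are entirely hypothetical (three invented individuals with a screening date, an arrival date and an exit date) and the function names are my own; the point is only that counting person-time from screening rather than from arrival inflates the denominator and so lowers the estimated rate.

```python
from datetime import date


def person_years(entry: date, exit_: date) -> float:
    """Person-years at risk between entry and exit (zero if exit precedes entry)."""
    return max((exit_ - entry).days, 0) / 365.25


def rate_per_1000(events: int, total_person_years: float) -> float:
    """Crude event rate per 1,000 person-years."""
    return 1000 * events / total_person_years


# Hypothetical records: (pre-entry screening date, arrival date, exit date).
records = [
    (date(2010, 1, 1), date(2010, 7, 1), date(2015, 1, 1)),
    (date(2011, 3, 1), date(2011, 4, 1), date(2017, 12, 31)),
    (date(2012, 6, 1), date(2013, 6, 1), date(2016, 6, 1)),
]
deaths = 1  # illustrative count, not a real figure

py_from_screening = sum(person_years(s, x) for s, _, x in records)
py_from_arrival = sum(person_years(a, x) for _, a, x in records)

rate_screening = rate_per_1000(deaths, py_from_screening)
rate_arrival = rate_per_1000(deaths, py_from_arrival)
```

The same logic applies at the other end of follow-up: if unrecorded emigration leaves people in the denominator after they have left, the rate is biased downwards in the same direction.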

Sample size
I am not sure if the sample size of the general population is stated anywhere (forgive me if I missed it). We know that the study will contain around 1.3 million migrants. It would make sense to state the general population figure here too, to provide the overall sample size.