An Individual Level Method for Improved Estimation of Ethnic Characteristics

This paper develops an improved method for estimating the ethnicity of individuals based on individual level pairings of given and family names. It builds upon previous research by using a global database of names from c. 1.7 billion living individuals, supplemented by individual level historical census data. In focusing upon Great Britain, these resources enable, respectively, greater precision in estimating probable global origins and better estimation of self-identification amongst long-established family groups such as the Irish Diaspora. We report on geographic issues in adjusting the weighting of groups that are systematically under- or over-predicted using other methods. Our individual level estimates are evaluated using both small area Great Britain census data for 2011 and individual level data for asylum seekers in Canada between 1995 and 2012. Our conclusions assess the value of such estimates in the conduct of social equity audits and in depicting the social mobility outcomes of residential mobility and migration across Great Britain.


Introduction
Ethnicity is a salient characteristic of individual identity. Of relevance to regional science, it has underpinned research into residential differentiation and social segregation (e.g. Finney and Simpson 2009; Lan et al. 2020), labour market recruitment (Yemane and Fernández-Reino 2021), inter-generational social mobility (Clark and Cummins 2015), innovation processes (Wilson et al. 2018), and health outcomes (Petersen et al. 2021). It is also of policy interest to provide timely inter-census estimates of population characteristics (Office for National Statistics 2017), as demonstrated during the 2020 COVID pandemic and following Brexit. Related work has documented the correspondence between individual naming practices and ethnicity, and consequently, the ways in which given (forename) and family (sur-)names may be used to indicate ethnicity (Mateos et al. 2009; Parameshwaran and Engzell 2015). As such, names-based classification of ethnicity is of wide applicability to many issues of relevance to regional scientists in studies of migration, urban structure and regional functioning, issues that we return to in our conclusions.
Names-based ethnic classification methods typically develop algorithms to identify significant forename-surname associations and assign labels to the resulting cultural, ethnic, and linguistic groups at different levels of aggregation. A recent development of these approaches is the Ethnicity Estimator software (Kandt and Longley 2018), which was developed in collaboration with the Office for National Statistics (ONS). A novel aspect of this latter approach is the evaluation of estimates with respect to survey respondent self-identifications: such procedures are of particular value where names span different ethnic groups (as with members of the Black Caribbean and White British UK Census groups) or where long-settled groups may no longer identify with their ancestral origins (as with some White Irish individuals in Britain). Kandt and Longley's (2018) software and the derivative small area estimates of annual changes in local ethnic group composition have been used in circa 60 research projects to date (CDRC, personal communication). The free availability of this classification software for research purposes and the peer-reviewed documentation of its predictive success mark this software as a basis for the further evaluation and improvements developed in this paper. Kandt and Longley (2018) use the ONS Secure Research Service, previously the Virtual Microdata Laboratory (Ritchie 2008), for names classification by first using a names dictionary and queries to a secure census database to calculate the probabilities of membership of each of the 11 census groups (see Table 1) used in the 2011 Census. Summed scores for each group can be calculated for every forename and surname pair that occurs in their names dictionary by summing these (equally weighted) probabilities. Using secure access to individual 2011 Census records, Kandt and Longley (2018) reweight the resulting assignments to match the pattern of self-reported assignments in the Census records.
The authors demonstrate that their approach results in greater predictive success than a previous algorithmic approach ('Onomap': Mateos et al. 2011) and that their weighting factors are optimised within the constraints of secure research facility access. However, it is apparent that the White Irish group is consistently under-estimated, that there are systematic mis-assignments between individuals identifying with Indian subcontinent countries, and that there are failures in predictions of occurrences of the Other Asian, Black Caribbean and Other groups. It is also desirable for ethnicity audits to be able to disaggregate the 'White Other' and 'Other Asian' categories into constituent countries that may typically confer quite different human and social capital upon their citizens and, by extension, different migration outcomes in migration destinations such as Great Britain.
Our research objectives are to improve or refine estimates of membership of: (a) the long-established White British majority population that was actually present in the 19th century; (b) the long-established White Irish population that continues to identify with this group; (c) the Black Caribbean population that shares naming conventions with white ethnic groups; (d) groups originating in the Indian sub-continent; and (e) the 'catch all' Black African, Black Caribbean, White Other and Other Asian groups, which may be attributed to particular countries that confer quite different circumstances upon migrants from them. Details of development and SQL code used to develop the software, Onomap3, can be found on the cdrc.ac.uk website, for access for research purposes upon successful application.

Data Sources
Our approach is to use the near-complete Linked Consumer Register (LCR) of all adult individual names and addresses in Great Britain in 2011 (see Lansley et al. 2019; Van Dijk et al. 2021) as a frame to estimate ethnicities. The 2011 LCR provides an annual snapshot of the UK adult population created and curated by the ESRC Consumer Data Research Centre (CDRC), as part of a corpus of such data initially covering the period 1997-2016. The LCRs are individual level data compiled from the public version of the UK Electoral Register and other consumer data sources. Lansley et al. (2019) describe the data cleaning, triangulation, imputation and validation processes that are intrinsic to the LCRs. Here, we estimate the ethnicity of every individual on the 2011 LCR. By georeferencing each record we are then able to compare our estimates with Census figures for the same year at the level of the Lower layer Super Output Area (LSOA, a small area geography in England and Wales with a typical population of 1500). We use these initial results to adjust the weights assigned to forenames and surnames for different ethnic groups. For the specific case of the White Irish population, we also refer to individual level 1881 Census records to evaluate the merit of deeming a contemporary bearer to self-identify with the 'White Irish' Census category. The digitised versions of the GB Censuses for 1851-1911 are curated by the I-CeM project (Higgs and Schurer 2019), and individual level records including names, addresses and birthplaces were made available to us by the UK Data Service under special licence. We use the individual level data for 1881, based on our exploratory findings that the data capture process for this year appears to have been particularly effective.
We also use the WorldNames2 (WN2) database that arises from an ongoing project to assemble a representative range of forenames and surnames for every country of the world. O'Brien and  detail the various sources used, including public electoral registers, telephone directories and professional or school registers. The database currently comprises circa 1.7 billion individuals' names, or about one fifth of the world's population (calculated on the basis of a UN estimate of 7.9 billion as of 2021), each with country attribution. Based on the sampled names in each country and the country's total population, frequencies per million (FPMs) of family name occurrences and their estimated population sizes are derived in the WN2 database.
Aggregate 2011 Census adult population counts classified into 11 ethnicity categories (listed with their abbreviations in Table 1) provide a benchmark for evaluation of the ethnicity estimates developed using the LCRs. The ethnicity categorisations recorded in the 2011 Census questionnaires differ slightly between the different constituent countries of the UK but can be harmonised into the 11 categories. Table 1 also compares the GB population breakdown by ethnic groups estimated by applying Kandt and Longley's publicly available software to the 2011 LCR and the corresponding 2011 Census figures. Both over-estimation and under-estimation are observed amongst the LCR group assignments.

Methods and Outcomes of Reassignments or Enhancements
The 2011 classifications of ethnicity used by the UK ONS are the outcome of extensive consultation with stakeholders with regard to the end uses of statistical sources so classified (Office for National Statistics 2009), which is reflected in the subtle variations among the ethnic categories adopted by Northern Ireland, Scotland, and England and Wales. The outcome is, inevitably, a snapshot of policy concerns that resonate with the governments of the constituent countries of the United Kingdom. The resultant classes also manifest a long sweep of British history that accommodates Irish and New Commonwealth migration, but not the specific consequences of successive EU enlargements during the UK's period of EU membership or refugee migration. Our dual purpose is to improve the efficacy of Kandt and Longley's assignments to the harmonised classes used in Table 1 while also extending it to differentiate between other nations, membership of which might also affect the circumstances of migrants to Britain.
As such, our aim is to extend the granularity of ethnic classification while also retaining sensitivity to the issues of self-identification developed in Kandt and Longley's (2018) work. We use their Ethnicity Estimator (EE) as a baseline model for our proposed improvements and extensions. The core process of the EE, summarised in equation (1), is to assign each forename-surname pairing a probability of assignment to each of the Census ethnic categories $E$, as detailed in Table 1. For any name pairing, $p_{E,f}$ and $p_{E,s}$ denote the probabilities of assignment to each ethnic group $E$ for the forename and surname respectively, as defined in two EE name-ethnicity lookup tables. Two weighting factors that sum to unity, $w_f$ and $w_s$, are used to specify the relative contributions of forename and surname to the estimated outcome score $S_E$:

$$S_E = w_f \, p_{E,f} + w_s \, p_{E,s} \tag{1}$$

In the original EE algorithm, these weights are each set equal to 0.5. After calculating the score $S_E$ for every one of the 11 ethnicity categories, the name pair is assigned to the ethnic group with the highest composite score.

In developing and extending this approach to classify Great Britain residents, we use additional individuals' names obtained from the 1881 Great Britain Census and from WN2. We validate the results using aggregate 2011 Census small area statistics for the same year as the 2011 LCR. Ethnicity classification of the 2011 LCR follows a chronology of steps (see Table 1 for abbreviations used), for reasons set out in our discussion below:
1) The EE classifications are assigned as provisional estimates.
2) Family names classified as White British (WBR) but that are not recorded at all in the 1881 Great Britain Census are reassigned to their second highest predicted category amongst the remaining 10 census ethnic groups.
3) Individuals classified as WBR or White Irish (WIR) are then pooled. Reassignments between them are made using Bayes' Theorem and WN2 data as detailed below.
4) Individuals classified as Asian Indian (AIN), Asian Pakistani (APK) or Other Asian (AAO) are pooled and reassigned using re-weightings as detailed below.
5) Individuals classified as Black Caribbean (BCA), WBR or All Other (OXX) are pooled and reassigned using rules as detailed below.
6) WN2 data are used to assign most probable countries to records assigned to the AAO, BAF, BCA and WAO groups.

Kandt and Longley (2018) identify the WIR group as systematically under-estimated, attributing this to self-identification of descendants of previous generations of Irish migrants with the WBR group. We take the explicit decision to define WIR in terms of being long settled in the Irish Republic and WBR as conveying establishment in the United Kingdom. Our approach to accommodating this tendency is threefold: (a) we constrain WBR assignments by filtering out family names not present in the 1881 Great Britain Census; (b) we adjust the forename and surname relative probabilities $p_{E,f}$ and $p_{E,s}$ between WBR and WIR in the name-ethnicity lookup tables using data relating to the relative frequencies of each in the UK and Ireland as recorded in the WN2 population estimates; and (c) we tune the two weighting factors $w_f$ and $w_s$ in equation (1) in order to align our estimates with the total size of the WIR population in the 2011 Census (Table 1) and with its geographic distribution.
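The scoring rule of equation (1) can be illustrated with a minimal Python sketch. The lookup tables, their probabilities and the function names below are hypothetical and invented purely for illustration; they are not taken from the EE or Onomap3 software.

```python
from typing import Dict, Optional

# Hypothetical lookup tables: name -> {ethnic group: probability}.
# All values are invented for illustration only.
FORENAME_PROBS: Dict[str, Dict[str, float]] = {
    "james": {"WBR": 0.80, "WIR": 0.15, "BCA": 0.05},
}
SURNAME_PROBS: Dict[str, Dict[str, float]] = {
    "murphy": {"WBR": 0.40, "WIR": 0.55, "BCA": 0.05},
}


def composite_scores(forename: str, surname: str,
                     w_f: float = 0.5, w_s: float = 0.5) -> Dict[str, float]:
    """Return S_E = w_f * p_{E,f} + w_s * p_{E,s} for every group E."""
    if abs(w_f + w_s - 1.0) > 1e-9:
        raise ValueError("weights must sum to one")
    p_f = FORENAME_PROBS.get(forename.lower(), {})
    p_s = SURNAME_PROBS.get(surname.lower(), {})
    groups = set(p_f) | set(p_s)
    return {g: w_f * p_f.get(g, 0.0) + w_s * p_s.get(g, 0.0) for g in groups}


def assign_group(forename: str, surname: str,
                 w_f: float = 0.5, w_s: float = 0.5) -> Optional[str]:
    """Assign the name pair to the group with the highest composite score."""
    scores = composite_scores(forename, surname, w_f, w_s)
    if not scores:
        return None  # neither name appears in the (toy) lookup tables
    return max(scores, key=scores.get)


equal_weights = assign_group("James", "Murphy")
surname_heavy = assign_group("James", "Murphy", w_f=0.16, w_s=0.84)
```

With these invented probabilities, equal weights favour WBR, whereas a surname-heavy weighting favours WIR, which illustrates why the choice of weights matters for groups that share forenames.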

The White British and White Irish Groups
Reassigning White British names. There are ambiguities in ascribing the label 'White British' to any individual whose name does not indicate ancestry beyond Great Britain within historic periods (e.g. see the genetic study of Winney et al. 2012). In refining the EE approach to reduce the over-prediction of the WBR, we choose 1881 (for which well-curated digital Census records are available) as a convenient threshold date for inclusion of any family name as long-established 'White British'. We begin by filtering out family names that were not present in the 1881 Census and assigning them to their second highest EE category. A total of 1,284,829 bearers of names classified as White British by EE are thus reassigned to their second highest class. The results shown in Table 2 indicate that almost all such names are reclassified as White Other or White Irish.
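The 1881 filter described above might be sketched as follows. The set of 1881 surnames and the score values are hypothetical placeholders; the sketch assumes that composite scores for all groups are already available for each name pair.

```python
from typing import Dict, Set


def apply_1881_filter(scores: Dict[str, float], surname: str,
                      surnames_1881: Set[str]) -> str:
    """Return the assigned group after applying the 1881 surname filter.

    scores: group -> composite score for one name pair (hypothetical input).
    surnames_1881: lower-cased family names recorded in the 1881 Census.
    """
    ranked = sorted(scores, key=scores.get, reverse=True)
    top = ranked[0]
    absent_in_1881 = surname.lower() not in surnames_1881
    if top == "WBR" and absent_in_1881 and len(ranked) > 1:
        return ranked[1]  # fall back to the second highest category
    return top


# Toy data, invented for illustration only.
SURNAMES_1881 = {"smith", "jones"}
toy_scores = {"WBR": 0.5, "WAO": 0.3, "WIR": 0.2}

kept = apply_1881_filter(toy_scores, "Smith", SURNAMES_1881)
moved = apply_1881_filter(toy_scores, "Kowalczyk", SURNAMES_1881)
```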
Adjusting the name-ethnicity lookup tables. We next adjust the forename and surname probabilities $p_{E,f}$ and $p_{E,s}$ between WBR and WIR in the name-ethnicity lookup tables by calculating conditional probabilities of belonging to either group based upon forename-surname pairings. Estimates of the bearers of different UK and Irish Republic forenames and surnames are provided by the WN2 data. Bayes' Theorem is then used to calculate the conditional probabilities of belonging to either WBR or WIR. Table 3 illustrates the steps taken to derive the conditional posterior probabilities, taking the forename 'James' as an example. The final two rows of the Table present the conditional probability based upon the estimated populations of name bearers, independent of the total populations of the host countries. The probabilities of WBR and WIR membership for each forename or surname are thus recalculated and replaced in the look-up tables using the conditional probabilities derived in Table 3.
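The Bayes' Theorem calculation can be sketched in Python using the notation of Table 3 (G, I, g, i). The population figures in the example call are invented for illustration and are not WN2 estimates. The sketch also shows why the posterior is independent of the host country totals: algebraically, P(A|Y) reduces to g / (g + i).

```python
from typing import Tuple


def posterior_wbr_wir(g: float, i: float, G: float, I: float) -> Tuple[float, float]:
    """Return (P(WBR | name), P(WIR | name)) via Bayes' Theorem.

    G, I: populations of Great Britain and Ireland.
    g, i: estimated bearers of the name in the UK and in Ireland.
    """
    p_a = G / (G + I)            # prior P(A): belonging to WBR
    p_b = I / (G + I)            # prior P(B): belonging to WIR
    p_y_given_a = g / G          # likelihood of the name given British
    p_y_given_b = i / I          # likelihood of the name given Irish
    p_y = p_y_given_a * p_a + p_y_given_b * p_b
    return p_y_given_a * p_a / p_y, p_y_given_b * p_b / p_y


# Invented figures for illustration only.
p_wbr, p_wir = posterior_wbr_wir(g=300_000, i=50_000, G=60_000_000, I=5_000_000)
```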
Table 3. Conditional Probability of Belonging to the WBR or WIR Using the Name 'James' as an Example, According to Bayes' Theorem.

| Variable | Notation |
| --- | --- |
| Population of Great Britain | $G$ |
| Population of Ireland | $I$ |
| Estimated population of name 'James' in the UK | $g$ |
| Estimated population of name 'James' in Ireland | $i$ |
| Probability of belonging to WBR | $P(A) = G / (G + I)$ |
| Probability of belonging to WIR | $P(B) = I / (G + I)$ |
| Probability of being named 'James' given one is British | $P(Y \mid A) = g / G$ |
| Probability of being named 'James' given one is Irish | $P(Y \mid B) = i / I$ |
| Probability of being named 'James' in the UK or Ireland | $P(Y) = P(Y \mid A) P(A) + P(Y \mid B) P(B)$ |
| Probability of belonging to WBR given the name is 'James' | $P(A \mid Y) = P(Y \mid A) P(A) / P(Y)$ |
| Probability of belonging to the WIR given the name is 'James' | $P(B \mid Y) = P(Y \mid B) P(B) / P(Y)$ |

Tuning the weighting factors. In equation (1), the original EE adopts equally weighted contributions from a forename and a surname ($w_s = w_f = 0.5$). We postulate, however, that members of long-established migrant Irish family groups (as identified by surnames) may be less likely to self-identify as WIR. We also postulate, as a lesser consideration, that forename may be a useful indicator of recent migration from the Irish Republic or lingering affinity to the island amongst long-settled migrant families. Accordingly, we downweight the importance of forenames and, consistent with replicating the number of individuals identifying as WIR in the 2011 Census, experiment with a range of values for $w_s$ from 0.76 to 0.85. We compare the numbers and spatial distributions of predicted WIR to the WIR population identified in the 2011 Census. There are tensions in this approach, since prediction success is not spatially invariant, and fine-tuning of weights may improve urban predictions at the expense of rural predictions, and vice versa. Ethnic minorities remain concentrated in towns and cities (albeit decreasingly so), with distinctive regional patterning of different ethnic groups. There is no obvious analytical solution to this issue, particularly given that mis-assignments between some categories may have less severe implications in (some) applications than others. In what follows, we rely upon a visual comparison of observed (census) versus predicted distributions, in the context of aggregate numerical comparisons. Figure 1 suggests that a surname weight of 0.84 gives the closest predictions to the Census. Table 4 presents the transition matrix of the reassignment between the WBR and WIR after the lookup table adjustments with the selected surname weight of 0.84. Together with the reassignment to WIR in the previous step, we predict 546,743 White Irish at this stage, which accounts for 99% of the 2011 Census observations.
Figure 2 shows the observed and estimated 2011 populations of White Irish by LSOA, where our method correctly picks up the concentration of Irish in urban areas such as London, Birmingham, Liverpool, Manchester, and Glasgow, albeit with modest underestimation. This sensitivity analysis is finely balanced, with the global solution required to balance prediction success in rural and urban areas: in particular, it is apparent from sensitivity analysis that Scottish WBR rural names bear more than passing similarities to urban WIR ones.
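The aggregate part of the weight tuning described in this section could be sketched as a simple sweep over candidate surname weights. This is an illustrative simplification: it matches only the predicted total against a target count, whereas the procedure reported here also relies on visual comparison of mapped distributions. The records and target are toy values, and the sketch assumes that, within the pooled WBR/WIR pair, the WBR probability is one minus the WIR probability.

```python
from typing import Dict, List, Sequence, Tuple

# Each toy record is (P(WIR | forename), P(WIR | surname)); values are invented.
Record = Tuple[float, float]


def count_predicted_wir(records: Sequence[Record], w_s: float) -> int:
    """Count name pairs whose weighted WIR score exceeds 0.5."""
    w_f = 1.0 - w_s
    return sum(1 for p_f, p_s in records if w_f * p_f + w_s * p_s > 0.5)


def sweep_surname_weight(records: Sequence[Record], census_total: int,
                         weights: List[float]) -> Tuple[float, Dict[float, int]]:
    """Return the first weight whose predicted WIR count is closest to the target."""
    counts = {w: count_predicted_wir(records, w) for w in weights}
    best = min(weights, key=lambda w: abs(counts[w] - census_total))
    return best, counts


RECORDS: List[Record] = [(0.90, 0.45), (0.10, 0.62), (0.10, 0.58), (0.20, 0.30)]
WEIGHTS = [w / 100 for w in range(76, 86)]  # 0.76, 0.77, ..., 0.85
best_weight, counts_by_weight = sweep_surname_weight(RECORDS, census_total=2,
                                                     weights=WEIGHTS)
```

A fuller treatment would evaluate the fit separately for urban and rural small areas, since the text notes that prediction success is not spatially invariant.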

Indian Sub-continent and Other Asian Groups
Among the Indian sub-continent groups shown in Table 1, the aggregate predictions of Bangladeshis (ABD) are very close to observations from the Census. However, predictions of Indians (AIN) and (especially) Pakistanis (APK) are overestimated while Any Other Asian (AAO) occurrences are substantially underestimated. The principal 'Any Other Asian' countries are listed in Table 5. We aim to improve estimation by reallocating individuals from AIN and APK to AAO. To this end, we first adjust the name probabilities in the name-ethnicity lookup tables relating to the three groups by using estimated populations of bearers of different names across these countries and Bayes' Theorem, as in Section 3.1.2. Additionally, since the EE predicts 136%, 152% and 37% of the observed AIN, APK and AAO Census figures, respectively, all of the adjusted name probabilities relating to the three groups are further reweighted by multiplying by the corresponding (rounded) reciprocal factors: 0.7 (AIN), 0.7 (APK) and 2.7 (AAO). The ABD estimates, which approximate the Census figures, are not included in this reweighting. With the above modified name relative probabilities $p_{E,f}$ and $p_{E,s}$ for the AIN, APK and AAO groups, we explore a range of relative forename and surname weighting factors. Surname weights $w_s$ for this heterogeneous group ranging from 0.25 to 0.75 are applied to names from the LCR classified by EE as Indian, Pakistani or Any Other Asian, to improve the correspondence between ethnicity estimates and 2011 Census figures (see Table 6). The closest predictions of each group to the Census observations are highlighted in bold in this Table. The comparison between predictions and census observations suggests that a surname weight of 0.3 and a forename weight of 0.7 are the overall best combination, although the Indian group is still over-predicted. Future improvements could consider exploring separate surname weights for the four groups.
In so doing, we reallocate predictions among the AAO, AIN and APK groups from the provisional EE categories. Table 7 presents a confusion matrix of ethnic group transitions between the EE predictions and our revision following the adjustments.
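The reciprocal reweighting of the AIN, APK and AAO probabilities can be illustrated as below. The ratios are the EE-to-Census percentages quoted above; the example probabilities are invented, and the sketch assumes that no renormalisation is applied after reweighting, which the text does not specify.

```python
from typing import Dict

# EE predictions as a share of observed Census counts (from the text).
EE_TO_CENSUS_RATIO: Dict[str, float] = {"AIN": 1.36, "APK": 1.52, "AAO": 0.37}

# Reciprocal factors rounded to one decimal place.
FACTORS: Dict[str, float] = {g: round(1.0 / r, 1)
                             for g, r in EE_TO_CENSUS_RATIO.items()}


def reweight(probs: Dict[str, float]) -> Dict[str, float]:
    """Scale AIN/APK/AAO probabilities; other groups (e.g. ABD) are untouched."""
    return {g: p * FACTORS.get(g, 1.0) for g, p in probs.items()}


# Invented lookup probabilities for a single name.
before = {"AIN": 0.5, "APK": 0.3, "AAO": 0.2, "ABD": 0.1}
after = reweight(before)
top_before = max(before, key=before.get)
top_after = max(after, key=after.get)
```

In this toy case the most probable group shifts from AIN to AAO, which is the direction of reallocation that the reweighting is intended to encourage.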

Black Caribbean Groups
Members of the BCA group share both forenames and surnames with the White British and, as a minority population, are under-enumerated in names-based ethnicity estimators. Although the BCA group is underestimated by the EE in terms of the total population, it is nevertheless overestimated by the EE in some parts of Great Britain, where its members are possibly confounded with members of the Any Other (OXX) group. We seek to accommodate this by comparing the frequencies per million (FPM) of forenames and surnames in the UK with those for Caribbean countries with British colonial history. The FPMs of forenames and surnames in available relevant Caribbean countries (Table 8) are extracted from the WN2 database and the highest FPM of a name in any single Caribbean jurisdiction is retained as the FPM of that name in the Caribbean. After experimentation and sensitivity analysis, we alight upon a multiplicative index to measure the likelihood of a name being assigned to the BCA group (equation (2)):

$$\text{Index} = \frac{\text{Caribbean forename FPM}}{\text{UK forename FPM}} \times \frac{\text{Caribbean surname FPM}}{\text{UK surname FPM}} \tag{2}$$

The first component of the index records how many times more popular a forename is in the Caribbean than in the UK. The second component records the corresponding multiplier for a surname. The product of the two terms is used as an indicator of the likelihood of belonging to the Black Caribbean group. Making use of the index, Figure 6 illustrates the logic of assigning possible 'WBR' and 'OXX' to 'BCA'. For individuals classified as WBR, BCA or OXX, the multiplicative index is calculated and compared with a group-specific empirical threshold: 1.5 for 'BCA', 4.9 for 'WBR' and 15 for 'OXX'. The outcomes determine whether the original classifications are retained or reassigned to another group among BCA, WBR and OXX. Table 9 shows the confusion matrix of reassignments for the LCR following the adjustments to allocations between the WBR, OXX and BCA groups.
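The multiplicative index of equation (2) and the threshold comparison might be sketched as follows. The index calculation follows the text, but the direction of the reassignment rules is an assumption, since the full decision logic is given only in Figure 6: here WBR and OXX records move to BCA when their index exceeds the group threshold, and BCA records below their threshold move to a hypothetical fallback group. All FPM values are invented.

```python
from typing import Dict

THRESHOLDS: Dict[str, float] = {"BCA": 1.5, "WBR": 4.9, "OXX": 15.0}


def caribbean_fpm(fpm_by_jurisdiction: Dict[str, float]) -> float:
    """Retain the highest FPM of a name across Caribbean jurisdictions."""
    return max(fpm_by_jurisdiction.values())


def caribbean_index(fore_carib: float, fore_uk: float,
                    sur_carib: float, sur_uk: float) -> float:
    """Equation (2): product of the forename and surname popularity ratios."""
    if fore_uk <= 0 or sur_uk <= 0:
        raise ValueError("UK FPMs must be positive in this sketch")
    return (fore_carib / fore_uk) * (sur_carib / sur_uk)


def reassign(current_group: str, index: float, fallback: str = "WBR") -> str:
    """ASSUMED rule set; the paper's actual logic is shown in its Figure 6."""
    if current_group in ("WBR", "OXX"):
        return "BCA" if index > THRESHOLDS[current_group] else current_group
    if current_group == "BCA":
        # `fallback` is a hypothetical choice of destination group.
        return "BCA" if index >= THRESHOLDS["BCA"] else fallback
    return current_group


# Invented FPMs for one forename and one surname.
fore_c = caribbean_fpm({"Jamaica": 300.0, "Barbados": 250.0})
sur_c = caribbean_fpm({"Jamaica": 380.0, "Barbados": 400.0})
idx = caribbean_index(fore_c, 100.0, sur_c, 200.0)
```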
The 377,245 adult BCA assignments after all of the previous adjustments compare with 496,195 recorded in the Census, and the estimated 133,803 Caribbean Londoners compare with a Census figure of 268,014. It should be noted that there are 1793 WIR estimated by EE that are reassigned to WBR in the previous steps but are returned to BCA in this step. Figure 7 illustrates the general geographic correspondence between our estimates and the observed incidence in the Census. There is a dilemma posed by adjusting classification thresholds since under-prediction in London and in Birmingham is partially offset by over-prediction elsewhere in predominantly rural areas. There is scope, however, for further improving estimates for urban areas for applications in which rural areas are not of primary concern.

However, the flows of individuals from over-represented to under-represented groups are very encouraging, as shown in Table 11. The 235,088 increase in the size of the White Irish group improves capture of WIR estimates from 54% to 99% of the recorded Census total, achieved by transfers from the over-represented White British

Enhanced Estimation of Countries of Origin
Census categories such as the White Other Group (WAO) have been agreed by the ONS over time through consultation for policy purposes and they inevitably cannot include all groups. Blanket categorisation masks within-group variation, potentially straining any assumption of within-group homogeneity in research applications: for example, study of UK residential segregation would likely benefit were it possible to differentiate between different groups within the ONS 'catch all' categories. We therefore use the WN2 data to apportion the WAO, AAO, BAF and BCA categories to probable countries of ancestral origins. We evaluate each name pair's relative probabilities of assignment to a specific country using similar procedures to those underpinning equation (1). We replace the name-ethnicity lookup probabilities $p_{E,f}$ and $p_{E,s}$ with the normalised frequencies per million (FPMs) for each individual's forename and surname in the assignment process. Following extensive sensitivity analysis, we adopt 0.65 and 0.35 as the surname and forename weighting factors $w_s$ and $w_f$. We retain the three most probable countries of origin: in deference to subjective self-assignments in Britain, where the most probable country estimate is inconsistent with the EE classification, we defer to the second highest country and, if necessary, the third highest. If no consistent estimate can be found the observation is assigned to the 'Any Other' (OXX) category.
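The country assignment rule can be sketched as below. The weights (0.65 for surname, 0.35 for forename) and the top-three fallback follow the text; the normalisation of FPMs by their sum across countries, the country-to-group mapping and all FPM values are assumptions made for illustration.

```python
from typing import Dict


def normalise(fpm: Dict[str, float]) -> Dict[str, float]:
    """ASSUMED normalisation: divide each country's FPM by the sum over countries."""
    total = sum(fpm.values())
    return {c: v / total for c, v in fpm.items()} if total else {}


def country_scores(fore_fpm: Dict[str, float], sur_fpm: Dict[str, float],
                   w_s: float = 0.65, w_f: float = 0.35) -> Dict[str, float]:
    """Weighted score per country, analogous to equation (1)."""
    nf, ns = normalise(fore_fpm), normalise(sur_fpm)
    return {c: w_f * nf.get(c, 0.0) + w_s * ns.get(c, 0.0)
            for c in set(nf) | set(ns)}


def assign_country(scores: Dict[str, float], ee_group: str,
                   country_to_group: Dict[str, str]) -> str:
    """Pick the highest ranked of the top three countries consistent with the EE group."""
    top3 = sorted(scores, key=scores.get, reverse=True)[:3]
    for country in top3:
        if country_to_group.get(country) == ee_group:
            return country
    return "OXX"  # no consistent estimate among the three candidates


# Hypothetical mapping and invented FPMs.
COUNTRY_TO_GROUP = {"Poland": "WAO", "Germany": "WAO", "UK": "WBR"}
scores = country_scores({"Poland": 600, "Germany": 300, "UK": 100},
                        {"Poland": 900, "Germany": 100})
consistent = assign_country(scores, "WAO", COUNTRY_TO_GROUP)
inconsistent = assign_country(scores, "AAO", COUNTRY_TO_GROUP)
```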
Following these rules, we further disaggregate the blanket groups including AAO, BAF, BCA and WAO into countries of origin. Table 12 lists the largest populations in the 2011 LCR by country of origin within each of the four groups. We take the largest WAO group in London, the Polish community, as an example and map their geographic distribution across Greater London in 2011 in Figure 8. They were mainly concentrated in West and North London, particularly in Ealing, Brent and Waltham Forest.

Validation and Discussion
It is usually difficult to obtain ground truth ethnicity data at individual level to validate the results of ethnic classification. Here we use data obtained under Freedom of Information requests pertaining to 47,979 seekers of asylum in Canada (Norris 2019), which record individuals' names and self-reported countries of origin. Reported countries of asylum seeker origins may be imprecise (e.g. sub-continent rather than specific country) or inaccurate, particularly in instances of chain migration. Such data are thus inherently ambiguous, and also do not pertain to the UK, where strictures of the General Data Protection Regulation (GDPR) make it particularly difficult to obtain names data classified by ethnicity, a sensitive personal characteristic under GDPR. With these caveats, we assign stated countries in the Canadian data to the 11 ONS Census groups used in EE and use our procedures to estimate group and most probable country of origin.  Caribbean group successfully predicted. The majority of misclassifications of the BCA are assigned to the WBR group.
We have mixed reflections on these results. Migrating and asylum-seeking are heavily selective, and the phenomenon of chain migration likely renders the dataset very noisy. Asylum seekers may be more likely to be of mixed heritage (best represented by the OXX category), something that names-based classification finds very difficult to discern. Asylum seekers may perceive their chances of success to be increased by identification with white groups, with our predictions of many 'Other Asian' group members as 'White Other' providing a prominent example. There are also ambiguities in the assignment of countries to EE groups, such as classifying South African asylum seekers uniformly as 'Black African'.
In some respects, data pertaining to Canadian asylum seekers present an unreasonable challenge: the ONS ethnicity classification is designed to fulfil UK needs and the prominence of the White British and White Irish groups is an irrelevant distraction in this context. In the global context, our enhancements to predictions of origins within the Indian sub-continent appear to be robust. But in other instances, the results confirm global challenges to names classification, with the inherent ambiguity of Black Caribbean names presenting a prominent example. Our own analysis of geographic variation in prediction success within Great Britain also testifies that this problem occurs across different geographic scales, and it may also be affected by changing fashions for particular forenames.
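The kind of comparison used in this validation, between groups derived from stated countries of origin and groups predicted from names, can be sketched with a per-group success rate and a tally of misclassification destinations. The pairs below are invented and do not reproduce any result from the Canadian data.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple

Pair = Tuple[str, str]  # (observed group, predicted group)


def per_group_success(pairs: Iterable[Pair]) -> Dict[str, float]:
    """Share of each observed group that is predicted correctly."""
    pairs = list(pairs)
    totals = Counter(obs for obs, _ in pairs)
    hits = Counter(obs for obs, pred in pairs if obs == pred)
    return {g: hits[g] / totals[g] for g in totals}


def misclassification_targets(pairs: Iterable[Pair], group: str) -> Counter:
    """Count the groups to which members of `group` are wrongly assigned."""
    return Counter(pred for obs, pred in pairs if obs == group and pred != group)


# Invented toy pairs for illustration only.
TOY_PAIRS: List[Pair] = [
    ("BCA", "WBR"), ("BCA", "BCA"), ("BCA", "WBR"), ("BCA", "OXX"),
    ("AIN", "AIN"), ("AIN", "AIN"),
]
success = per_group_success(TOY_PAIRS)
bca_errors = misclassification_targets(TOY_PAIRS, "BCA")
```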

Conclusion
Issues of ethnicity underpin our understanding of population diversity and the regional patterning of population characteristics in the wake of recent and historic waves of migration. Elsewhere we have argued that regional origins in 'Old World' countries have enduring inter-generational consequences for social mobility outcomes, and one of our motivations for improving the efficacy of names-based classification is to describe and evaluate the relative social circumstances of citizens who can trace their origins through any of a succession of waves of migration to the UK. As such, the creation of Onomap3 has several methodological and substantive touchpoints with research previously reported in this journal, as well as with regional science investigations more generally. Most fundamentally, the work is consistent with the view that data pertaining to human individuals, rather than aggregations of them, provide the most secure foundations for regional analysis. The advent of new sources of georeferenced data at highly disaggregate scales enables new methods of conducting migration research that go far beyond early aggregate formulations in regional analysis (Greenwood and Hunt 2003). It also has potential implications for the conduct of input-output analysis (Miller and Blair 1981). Such detail and flexibility enable a much more robust and transparent definition of the urban structures that are arranged in urban hierarchies (Broitman et al. 2020), while names-based classifications enable the variegated social mixing of established populations and more recent migrants to be described and analysed. Our use of asylum seekers to validate the research is integral to the case for using names to identify and appraise migrant characteristics in regional analysis more generally (e.g. Lozano-Gracia et al. 2010).
In other respects, names-based classification is of strategic importance in synthesising data that are not routinely collected. Ethnicity is a sensitive personal characteristic under the General Data Protection Regulation (GDPR), and our experience is that names classifications become essential when data collection about ethnicity has not been considered proportionate in service delivery, but subsequently becomes essential in unforeseen social equity audits or health care studies. Our own involvement in auditing the rehousing decisions made after the Grenfell Tower disaster and in evaluating hospitalisation outcomes during the COVID-19 pandemic (Thomas et al. 2021) provides prominent examples. In future, the development of trusted research environments (TREs, see Chalstrey 2021) may provide data linkage solutions, but in the meantime, names-based classification provides the only expedient solution, particularly in emergency situations.
In methodological terms, the research reported here provides several lessons to guide this quest. It is widely understood that the heterogeneity of ethnic groups varies geographically, and our work highlights that names-based classification should be cognisant of context: our prediction success is better for Great Britain (the territory for which it was intended) than for Canada, yet this focus allows issues of self-assignment in particular cultural contexts to be incorporated, analysed and evaluated. The WN2 data present global evidence of the need to reweight the relative importance of forenames and surnames for some origin jurisdictions and we acknowledge that there is scope for further empirical refinement of the procedures developed here. Our sensitivity analysis and evaluation of results rely upon visual interpretation of mapped results alongside aggregate numerical comparisons. This approach might be supplemented in future research by the use of optimisation criteria and weightings to prioritise assignments (or 'near misses') to particular groups of interest. Future research might also address issues arising from transliteration of names (O'Brien and Longley 2018), homonymic family names, the mutation of family names over time and following migration over space, and cultural practices in assembling unique forenames or surnames.
Our approach is guided by the virtue of retaining self-assignments of census respondents in England and Wales while expanding and future-proofing the dictionary of names to include current popular forenames as well as new names imported into Britain from abroad. The classification is thus data led but also guided by GB cultural conventions. Issues of self-assignment may reinforce apparent inequalities of outcome or (as in COVID-19) set researchers on a search for physiological sources to societal problems. Yet our own view is that these issues are best addressed through classifications that are robust, transparent and open to scrutiny and that evaluations such as ours are instructive to minimise risks of misuse or misinterpretation.
Our own motivation for this work is to develop tools to understand the processes that underpin inter-generational inequalities of social mobility outcomes in Great Britain, at geographical and ethnic granularities that range from the effects of local ancestral origins of long-established populations through the inter-generational outcomes experienced by Irish migrants through to the outcomes of global migration in the 20th and 21st centuries. We intend this paper as a contribution to justify the approaches we are taking in this endeavour but hope that it stimulates wider debate about the value and veracity of names-based classification in the widest range of investigations into issues of social equity.

Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the Engineering and Physical Sciences Research Council [EP/M023583/1]; Economic and Social Research Council [ES/L011840/1].