Recommendations on the use and reporting of race, ethnicity, and ancestry in genetic research: Experiences from the NHLBI TOPMed program

SUMMARY
How race, ethnicity, and ancestry are used in genomic research has wide-ranging implications for how research is translated into clinical care and incorporated into public understanding. Correlation between race and genetic ancestry contributes to unresolved complexity for the scientific community, as illustrated by heterogeneous definitions and applications of these variables. Here, we offer commentary and recommendations on the use of race, ethnicity, and ancestry across the arc of genetic research, including data harmonization, analysis, and reporting. While informed by our experiences as researchers affiliated with the NHLBI Trans-Omics for Precision Medicine (TOPMed) program, these recommendations are applicable to basic and translational genomic research in diverse populations with genome-wide data. Moving forward, considerable collaborative effort will be required to ensure that race, ethnicity, and ancestry are described and used appropriately to generate scientific knowledge that yields broad and equitable benefit.


INTRODUCTION
Heeding the well-founded calls to increase diversity in genomic research 1,2 requires researchers to appropriately conceptualize, use, and report on race, ethnicity, and ancestry. Indeed, the role of race in genomic and other biomedical research is a widely discussed and historically fraught issue. [3][4][5][6][7][8] The National Heart, Lung, and Blood Institute (NHLBI) Trans-Omics for Precision Medicine (TOPMed) program provides a compelling and concrete use case to grapple with such issues, comprising over 80 contributing studies with diversity in terms of populations, geographic locations, genetic ancestries, and areas of phenotypic focus. 9 Below, we elaborate on the challenges and opportunities for the genomics research community in analyzing diverse and heterogeneous datasets and our approach in the TOPMed program.
While the field of human genetics may have reached consensus that race is a socio-political rather than biological construct, 10 the correlation between race and genetic ancestry (in that racial categories are often enriched for specific ancestries 11 ) continues to complicate scientific and public discourse. Studies show that genomics professionals use heterogeneous definitions and applications of race and ancestry in research and practice [12][13][14][15] and that such scientific uses evolve in broader social and political contexts. 16 In addition, the tendency to categorize ancestry at the continental level leads to conflation with the concept of biological race. 17 Race and ethnicity are still misused to avoid confounding due to genetic ancestry, 16,18 despite the availability of alternative approaches. 19,20 Overall, lack of agreement in the genomics research community has led to an ad hoc collection of research practices, with negative implications including reification of race as a biological construct [21][22][23] and over-attribution of health disparities to genetic rather than social and structural causes. [24][25][26][27]
To address the challenges noted above, investigators affiliated with the TOPMed program created a set of recommendations on the use of race, ethnicity, and ancestry when analyzing genome-wide data. These recommendations are organized by chronology of a standard research process: terminology (assessing what data are available for analysis and the population nomenclature), harmonization (combining and standardizing race, ethnicity, and ancestry variables across datasets), analysis (conducting and interpreting association analyses), and reporting (communicating the findings). We do not address prospective data collection, as TOPMed utilizes pre-existing phenotype data. We discuss below the common applications of race, ethnicity, and ancestry in each stage of research, the challenges we observed, and recommendations for how to move forward.

BACKGROUND
We are researchers affiliated with the NHLBI TOPMed program motivated to conduct scientifically robust and ethically responsible genetic research that leads to equitable benefit. Our prior experiences working with human genomics consortia and discussion of relevant literature and media (see details in supplemental information) led us to establish recommendations for TOPMed researchers that address the challenges of working with diverse data and incorporate anti-racist principles 8 into the research process. Here, we present recommendations developed for the use and reporting of race, ethnicity, and ancestry in TOPMed, which are broadly applicable to genetic research in diverse populations (described below and summarized in Box 1).
TOPMed as a motivating use case
TOPMed is a large consortium of ongoing ''omic'' (i.e., genomic, transcriptomic, proteomic, metabolomic, and methylomic) studies that encompass people of different races, ethnicities, geographic locations, and ancestries. 9 TOPMed comprises >80 studies based within and outside of the US, including founder populations such as Samoan and Amish. Broadly, TOPMed participants are 41% European ancestry (European, European American), 31% African ancestry (African, African American, African Caribbean), 15% Hispanic/Latino (including Mexican, Mexican American, Central American, South American, Cuban, Dominican, Puerto Rican), 9% Asian ancestry (Chinese, Taiwanese, Asian American, Pakistani), and 4% ''other'' (Samoan, Native American, multiple, or unknown). 9 This diversity enables the expansion of knowledge of genetic variation and an improved understanding of disease. 28 For example, 78.7% of 400 million variants observed in TOPMed were not previously deposited in dbSNP. 9

Establishing recommendations for TOPMed
We created recommendations for TOPMed investigators to encourage researchers to make well-founded and responsible analytical and methodological decisions when using race, ethnicity, and ancestry variables and to communicate these concepts in an informed, transparent, and respectful manner. These recommendations were discussed in relevant TOPMed Committees (Ethical, Legal, and Social Issues [ELSI] and Analysis), approved by the TOPMed Executive Committee, and presented at consortium-wide meetings. However, they do not represent official TOPMed policy or a consensus view of the over 1,000 TOPMed investigators. We solicited examples from study investigators of study-specific considerations and preferences, e.g., for population labels, and incorporated diverse expertise and experiences to make the recommendations practical, robust, and compelling for a wide audience of genetics researchers.
Ultimately, these recommendations guide investigators through challenges of using socially and genetically defined groups in scientific discussions by presenting an overview of commonly used terminology, highlighting considerations for data harmonization and analysis, and providing guidance on how to report results. While developed in the context of the TOPMed program, we contend that these recommendations are relevant for genetic and biomedical researchers working in other contexts, especially those involving diverse populations and/or the genetic study of conditions that suggest health disparities.

RECOMMENDATIONS
Terminology
When presenting information on the race, ethnicity, or ancestry of participants in a study, it is essential to be clear about whether the labels used refer to reported or genetically inferred information. ''Race'' and ''ethnicity'' generally refer to social, not biological, categories, and they are often used interchangeably. In contrast, ''ancestry'' is generally used in genetic research to refer to one's biological ancestors from whom their DNA was inherited or to imply something about a person's genetic origins; for example, the continental origin of the majority of their ancestors (sometimes referred to as ''continental ancestry''). 29,30 Ancestry can also refer to having ancestors from specific countries or geographic regions and is often how ancestry is used colloquially. Here, we use the terms race and ethnicity to refer to non-biological social categories, and we use the term genetic ancestry to describe genetic origins. Because reported race or ethnicity and genetic ancestry may all be used analytically and appear in scientific discussions and communications, care must be taken to describe exactly what is being presented and why.
Recommendations for investigators include the following:
1. Explicitly distinguish between variables that derive from non-genetic, reported information versus genetically inferred information.
2. Avoid using terms that are historically linked to hierarchical, racial typologies. For example, ''Caucasian'' should not be used; 31,32 instead, use ''White'' when referring to race and ''European ancestry'' when referring to genetic ancestry.
3. Follow standards from publishers, including the APA's guidelines on bias-free language regarding racial and ethnic identity 33 and the AMA Manual of Style. 34

Harmonization of race and ethnicity across studies
Race and/or ethnicity are commonly collected by having study participants fill out a form, which leads to ''self-reported'' values.
Other collection methods include designation by a third party (healthcare provider or study data collector) who typically infers the participant's ascriptive race or through study documents that describe the recruitment population but do not ask whether the self-reported race and/or ethnicity of specific individuals differs from the target population. Race and/or ethnicity may also be collected multiple times, for example, in a longitudinal study, which can lead to multiple values for the same participant if their self-identification changes over time. However collected, the race and/or ethnicity of a participant is almost always a function of the specific options provided in study instruments, which will often vary by location or the research interests of investigators. The diversity in data-collection methods presents a challenge for investigators attempting to combine data from multiple studies. Unlike quantitative phenotypes that can be transformed to a single scale during data harmonization, there is often no straightforward method to convert one set of race or ethnicity categories into another. This is particularly the case when study cohorts include individuals sampled from distinct national contexts where socio-cultural understandings of racial and/or ethnic identity differ, when working with studies over different recruitment periods, or when different studies provide different options for race and ethnicity categories (such as offering the descriptor Asian on a form versus offering more specific identifiers, like ''East Asian'' or ''South Asian''). Thus, it is important to keep in mind the complexities and nuances of social identity when attempting to harmonize race and ethnicity variables across studies.
Recommendations for investigators include the following:
1. Clearly describe the source data for race and ethnicity information from each study when using harmonized variables. Include details such as whether source information is self-reported or ascribed and whether multiple categories are collapsed. Be aware that cross-study harmonized variables may represent a simplification of more complex sources of information that may not translate between different studies and jurisdictions.
2. Avoid assuming that non-genetic, reported variables are by self-report. Study- or cohort-specific documentation may help determine whether variables (e.g., race or ethnicity) were self-reported versus recorded by study personnel without soliciting self-report from the participant.
3. Avoid applying US race categories to participants of studies based outside of the US. Concepts of racial and/or ethnic identity differ across countries, and approaches to capturing this information vary across geographic location and over time. 35 For example, the racial category ''Black'' is used by many countries but with different meanings in each country (e.g., the US and   36,37 ), so combining those categories is inappropriate. Some countries do not collect race information at all; for example, Australia abandoned the use of racial classification in 1974 and instead collects information on ethnicity. 35
4. Preserve specific population information when possible rather than prematurely collapsing different populations into broader categories. We encourage retaining information that is as granular as is practical during harmonization to allow flexible tailoring of downstream analysis and accurate reporting. For example, preserve detailed population descriptors such as ''Chinese American'' and ''Pakistani'' rather than harmonizing into a single Asian group.
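As a rough illustration of the harmonization recommendations above, the following Python sketch shows one way a harmonized record could carry its provenance along with it. The record fields, the `harmonize` helper, the study names, and the `US_BROAD_CATEGORIES` mapping are all hypothetical and are not part of any TOPMed data model; the sketch only assumes that each study documents the label as recorded, how it was collected, and where.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ReportedDescriptor:
    """One participant's reported (non-genetic) descriptor, with provenance."""
    study: str              # contributing study (hypothetical names below)
    source_label: str       # label exactly as recorded by the study
    collection_method: str  # "self_reported", "ascribed", or "unknown"
    country: str            # national context in which the label was collected


# Hypothetical, illustrative mapping to a broader US category.
US_BROAD_CATEGORIES = {"Chinese American": "Asian"}


def harmonize(d: ReportedDescriptor, allow_collapse: bool = False) -> dict:
    """Return a harmonized record that never discards the granular label."""
    broad: Optional[str] = None
    # Collapse only on request, and only for labels collected in the US,
    # so that US categories are not imposed on non-US studies.
    if allow_collapse and d.country == "US":
        broad = US_BROAD_CATEGORIES.get(d.source_label)
    return {
        "study": d.study,
        "source_label": d.source_label,            # granular label preserved
        "collection_method": d.collection_method,  # not assumed self-report
        "broad_label": broad,                      # optional and documented
    }


us_record = harmonize(
    ReportedDescriptor("StudyA", "Chinese American", "self_reported", "US"),
    allow_collapse=True,
)
non_us_record = harmonize(
    ReportedDescriptor("StudyB", "Pakistani", "unknown", "PK"),
    allow_collapse=True,
)
print(us_record)
print(non_us_record)
```

Keeping the collection method and source label on every record means downstream analysts can decide for themselves whether, and how, to collapse categories, and can report exactly what was collected.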

Analysis
When considering how to use race, ethnicity, and/or genetic ancestry information in an analysis, analysts should first assess the goals of the study and the intended purpose of including those variables in models. In a genome-wide association study (GWAS), the goal is to identify genetic variants that are associated with a particular trait or disease. Race and ethnicity are often tied to social and environmental factors influencing health [38][39][40][41][42] and, in such cases, may explain variation in the trait or disease of interest that is dependent on aspects of social identity (e.g., that may result from systemic or individual racial discrimination) rather than genetic ancestry. For example, African Americans with a high proportion of European ancestry may suffer the same lack of access to adequate health care as African Americans with little to no European ancestry. While race and ethnicity can be, and often are, included as covariates in association models to proxy such effects, 16 this approach may inadvertently reinforce harmful stereotypes. Therefore, it is preferable to include relevant environmental or socioeconomic variables (e.g., measures of healthcare, diet, or neighborhood disadvantage) directly in association models as covariates when available. However, adjustment for covariates that explain variation independent of genotype may either increase or decrease precision of genotype effect estimates and in turn affect statistical power to detect association. 43,44 Whether and how to integrate social factors into GWASs is an evolving and unresolved discussion in the genomics community.
On the other hand, adjusting for genetic ancestry is widely accepted practice in GWASs because it reduces false positives when populations have different trait values or disease prevalences as well as different allele frequencies and patterns of linkage disequilibrium, i.e., when there is confounding due to population stratification. 20,[45][46][47] One approach to adjust for this confounding is to perform a pooled analysis (i.e., an analysis including all study samples) and include genetic ancestry measures derived from sample genotype data as covariates. A distinct benefit of this approach is that it does not require arbitrarily clustering participants into groups or cross-study harmonization of demographic variables. Further, this approach allows for inclusion of all participants in the analysis, including those with either missing or underrepresented race or ethnicity. 48
A popular method to measure genetic ancestry is principal-component analysis (PCA), which generates eigenvectors that represent the genetic ancestry variation among participants as a continuous, multidimensional distribution, 20 in which those with ancestors from the same geographical area often cluster together. 49 Alternatively, admixture analysis estimates the proportion of each participant's genome descended from pre-specified reference populations of known ancestry. 19 Adjusting for either of these measures in a pooled analysis can effectively control for confounding due to genetic ancestry. The continuous nature of these measures illustrates the heterogeneity in genetic ancestry among individuals who may identify as the same race or ethnicity, particularly admixed individuals. For example, those who identify as Hispanic/Latino in the US represent a wide variety of genetic ancestries, with different proportions of ancestry admixture from Africa, the Americas, Asia, and Europe. [50][51][52] This highlights that simply using race and/or ethnicity as a proxy for genetic ancestry, or vice versa, is problematic in that it falsely equates the two correlated, albeit distinct, concepts.
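The confounding described above can be illustrated with a small simulation. The sketch below is not the TOPMed analysis pipeline: it assumes a single simulated continuous ancestry measure (standing in for a principal component or an admixture proportion), a variant whose allele frequency varies with that measure, and a trait that varies with ancestry-correlated factors but has no true genotype effect. All parameter values are invented for illustration.

```python
import random


def ols(y, covariates):
    """Least-squares coefficients for y ~ intercept + covariates."""
    n = len(y)
    rows = [[1.0] + [c[i] for c in covariates] for i in range(n)]
    p = len(rows[0])
    # Normal equations: (X'X) beta = X'y
    xtx = [[sum(r[a] * r[b] for r in rows) for b in range(p)] for a in range(p)]
    xty = [sum(r[a] * yi for r, yi in zip(rows, y)) for a in range(p)]
    # Gaussian elimination with partial pivoting, then back-substitution.
    for col in range(p):
        piv = max(range(col, p), key=lambda i: abs(xtx[i][col]))
        xtx[col], xtx[piv] = xtx[piv], xtx[col]
        xty[col], xty[piv] = xty[piv], xty[col]
        for i in range(col + 1, p):
            f = xtx[i][col] / xtx[col][col]
            for k in range(col, p):
                xtx[i][k] -= f * xtx[col][k]
            xty[i] -= f * xty[col]
    beta = [0.0] * p
    for col in reversed(range(p)):
        tail = sum(xtx[col][k] * beta[k] for k in range(col + 1, p))
        beta[col] = (xty[col] - tail) / xtx[col][col]
    return beta


rng = random.Random(0)
n = 2000
# Continuous ancestry measure in [0, 1] (assumed stand-in for a PC).
ancestry = [rng.random() for _ in range(n)]
# Allele frequency depends on ancestry; genotype counts alleles (0, 1, or 2).
genotype = []
for a in ancestry:
    freq = 0.1 + 0.6 * a
    genotype.append(sum(rng.random() < freq for _ in range(2)))
# Trait varies with ancestry-correlated factors; the variant has NO effect.
trait = [2.0 * a + rng.gauss(0.0, 1.0) for a in ancestry]

unadjusted = ols(trait, [genotype])[1]           # spurious association
adjusted = ols(trait, [genotype, ancestry])[1]   # ancestry as a covariate
print(f"unadjusted genotype effect: {unadjusted:.3f}")
print(f"adjusted genotype effect:   {adjusted:.3f}")
```

In this setup the unadjusted genotype coefficient is clearly non-zero even though the variant has no effect, while the estimate shrinks toward zero once the ancestry measure is included. No grouping of participants is needed at any point, which is the practical appeal of the pooled approach.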
Association tests are often conducted via meta-analysis, where different racial, ethnic, or ancestry groups are stratified and analyzed separately, and summary statistics from each group are subsequently combined. The motivations for performing meta-analysis may be logistical, e.g., the inability to combine participant data due to technical or data-sharing constraints, and/or analytical, such as the desire to adjust for genetic ancestry, environmental, or socioeconomic factors separately by group. Indeed, meta-analysis can be a useful tool, but it requires careful consideration of how groups are constructed and interpreted, an issue avoided in pooled analysis. We encourage investigators who take this approach to focus on the final meta-analysis results and exercise caution when interpreting the group-specific results.
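For readers less familiar with how group-level summary statistics are combined, the sketch below shows an inverse-variance-weighted fixed-effect meta-analysis. This is one common choice and is an assumption here; the text does not specify which meta-analysis method is used, and the effect sizes and standard errors are made up.

```python
import math


def fixed_effect_meta(betas, ses):
    """Inverse-variance-weighted fixed-effect combination of group estimates."""
    weights = [1.0 / se ** 2 for se in ses]  # more precise groups weigh more
    total = sum(weights)
    beta = sum(w * b for w, b in zip(weights, betas)) / total
    se = math.sqrt(1.0 / total)
    return beta, se, beta / se               # combined effect, SE, z-score


# Hypothetical per-group effect estimates and standard errors for one variant.
group_betas = [0.10, 0.20]
group_ses = [0.05, 0.05]
meta_beta, meta_se, meta_z = fixed_effect_meta(group_betas, group_ses)
print(f"beta={meta_beta:.3f} se={meta_se:.4f} z={meta_z:.2f}")
```

The combined estimate depends on how the groups were defined in the first place, which is why the construction of those groups deserves explicit description.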
A commonly referenced motivation for stratifying and interpreting group-specific results is to determine whether participants of a particular group are ''driving'' the observed association signal. While a statistically significant association may be observed in one group and not another, in our experience this is more likely due to differences in statistical power to detect an association (e.g., due to sample size or allele frequency differences) than to fundamental differences in the underlying biological impact of the same variant in different groups of people. For example, when analyzing TOPMed data, we typically have not found additional signals from group-specific analyses that were not also identified by pooled analysis including the same individuals. On the other hand, population-specific results of previously understudied populations may provide actionable findings. Therefore, it is critical to engage with study participants or representatives on whether it is appropriate to pursue population-specific analysis and how best to represent them in the study. Researchers must earn the trust of the communities involved in their research, especially in the case of minority groups who have been historically exploited in biomedical research studies and the scientific community. 53,54 Ultimately, it is important to recognize the various technical and contextual factors that influence analytical decisions and to be transparent about which approach was taken and why.
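The point about statistical power can be made concrete with a back-of-the-envelope calculation. The sketch below uses a standard normal approximation for a single-variant test of a quantitative trait and assumes an additive model, Hardy-Weinberg proportions, a trait standardized to unit variance, and a small per-allele effect; the sample sizes, allele frequencies, and effect size are invented for illustration.

```python
from statistics import NormalDist


def gwas_power(n, maf, beta, alpha=5e-8):
    """Approximate power of a two-sided single-variant test.

    n: sample size; maf: minor allele frequency;
    beta: per-allele effect in trait standard-deviation units.
    """
    norm = NormalDist()
    z_crit = norm.inv_cdf(1.0 - alpha / 2.0)
    # Expected z-score: effect size divided by its approximate standard error.
    expected_z = (n * 2.0 * maf * (1.0 - maf)) ** 0.5 * abs(beta)
    return norm.cdf(expected_z - z_crit) + norm.cdf(-expected_z - z_crit)


# The same variant effect (beta = 0.05) under three hypothetical scenarios.
large_group = gwas_power(n=50_000, maf=0.30, beta=0.05)
small_group = gwas_power(n=5_000, maf=0.30, beta=0.05)
rare_allele = gwas_power(n=50_000, maf=0.01, beta=0.05)
print(f"large group:          {large_group:.3f}")
print(f"small group:          {small_group:.4f}")
print(f"large group, rare MAF: {rare_allele:.5f}")
```

With an identical underlying effect, the variant is very likely to reach genome-wide significance in the large group and very unlikely to do so in the smaller group or where the allele is rare, so a group-specific "hit" on its own says little about differences in biology.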
Recommendations for investigators include the following:
1. Articulate and justify why race, ethnicity, or ancestry variables were used in a given analysis. Explain the reasoning behind analytical decisions to use non-genetic and/or genetically inferred variables in the methods section. Analytical decisions are nuanced and often reflect a weighing of various pros and cons to different approaches.
2. Consider that while using race or ethnicity as a covariate may explain trait variation due to social factors, it may also reinforce harmful stereotypes. Race or ethnicity may correlate with non-genetic, social factors, but the effects of such factors can be better accounted for when used directly, if the data are available. Whether or not including such variables is statistically beneficial is nuanced and requires careful consideration.
3. Avoid using reported race or ethnicity as a proxy for genetic ancestry or using genetic ancestry to represent race or ethnicity. Race and ethnicity can be correlated with genetic ancestry, but they are not the same. Individuals who identify as the same race or ethnicity can have a wide variety of genetic ancestries, and individuals with similar genetic ancestry may identify as different races or ethnicities.
4. Focus attention on pooled or meta-analysis results of all participants. Whether a pooled or a meta-analysis was used may depend on logistical and/or analytical reasons. Describe which approach was taken, why, and what the limitations may be.
5. Consider potential benefits versus potential harms when thinking about whether and how to conduct a population-specific analysis. Consult with study representatives or documentation to understand if their study participants would find it acceptable, or even preferred, to acknowledge their unique population history and evolution. For some understudied populations, population-specific results may provide actionable findings for that population. 55,56 However, in some instances, participants may not wish to associate membership in their population with a specific trait that could be considered stigmatizing. 57

Reporting
Reporting on race, ethnicity, and ancestry is typically necessary to describe methods, justify approach, and interpret results. Reviews of human genetic studies identified inadequate descriptions of race, ethnicity, and ancestry variables, which hinders transparency, replicability, and interpretability. 58,59 We offer guidance on the reporting of race, ethnicity, and ancestry variables to augment existing and emerging reporting recommendations (e.g., Brothers et al., 8 American Psychological Association, 33 and Flanagin et al. 34 ).
Recommendations for authors or presenters include the following:
1. Acknowledge the broader social context of health and healthcare disparities when invoking these disparities as a justification for genomic research. Health disparities are differences in health ''closely linked with economic, social, or environmental disadvantage.'' 60 While health disparities often disproportionately affect minority racial and ethnic groups, the underlying reasons are typically due to social and structural determinants of health rather than genetic factors. 24,61,62 Genetic research may be part of the solution to address health disparities but should be integrated into ''social models of disease and interdisciplinary research methods.'' 25
2. Avoid reinforcing the idea that race and ethnicity are genetic concepts when presenting genetically derived data. When presenting figures or summary statistics, be clear about how labels were defined, use terms that represent the source of the information, and justify their use in the given context. For example, if labeling participants in PC plots by race and/or ethnicity, it is important to state why this was done and use the original racial or ethnic designations rather than re-labeling with (proxy) ancestry terms. As another example, do not assume that allele frequencies from a reference population apply to a particular racial or ethnic group, or vice versa.
3. Describe participants in alignment with their communities' preferences and study-specific reporting guidelines. Given the number and complexity of studies with diverse data, and the potential for conflicting study-specific recommendations in cross-study analyses, we encourage authors to discuss these issues with study investigators or participant representatives (e.g., via a community advisory board 63,64 ). Where direct access to these stakeholders is infeasible, identify and follow reporting standards or precedents in the study.
4. Avoid generalizing from a single population to represent another, broader population. Keep in mind the limitations of population identifiers and generalizability to larger population groups. 65 For example, if a study includes Samoans but no other Pacific Islander populations, do not generalize the Samoan people to represent all Pacific Islanders.

CONCLUSION
Conducting genetic research in the context of large-scale, diverse consortia presents both challenges and opportunities, as illustrated by our experiences in the TOPMed program. Genetics researchers need to make structural changes to the research process and within the scientific community to realize the benefits of diversifying genetics research. We should critically evaluate each research step to ensure that race, ethnicity, and ancestry are described and used appropriately. This includes hypothesis generation, study design, data collection, harmonization, analysis, and reporting. For example, when we set out to identify genetic associations with disease and explore whether differences in association between racial groups exist, it can be easy to conclude that genetic differences rather than social or structural determinants of health are driving observed outcomes. Instead, by incorporating non-genetic factors into an explicit hypothesis up front, 66 we can further address their influence on health disparities. 8 Additionally, measuring and integrating key social and structural factors into genetic analyses  [67][68][69] It is important to counteract, rather than reinforce, racialized thinking when studying differential health outcomes or group differences. 6 We recognize our recommendations as part of a broader conversation in the scientific community about refining terminology, strengthening reporting guidelines, and advancing statistical and other research methodologies needed to strive for an anti-racist science. 6,70 Establishing new standards for terminology and incorporating updated publication requirements that demand clear and rigorous definitions of race, ethnicity, and ancestry variables are crucial in extinguishing racialized thinking from genetics research and literature. 8,34,58 These measures encourage investigators to be more critical when applying these concepts in the design, development, and conduct of their research. 
In addition to changes in language and reporting, methodological advancements that accommodate analyses of diverse populations and a re-evaluation of existing methodologies are necessary. 71 For example, systematic investigation of a stratified versus pooled approach to association testing will provide empirical evidence for whether, and when, stratifying participants is necessary. This work is needed because, if used indiscriminately, stratification by race may reify race as genetic and obscure the non-genetic, ''fundamental causes'' of health inequities. 72 We should also critically examine the use of continental ancestry in genetic research. 17 The selection of reference populations with ancestry from specific geographic areas is somewhat arbitrary, yet these samples are widely used to represent entire continents. 73 For example, despite early guidance against such oversimplification, the HapMap Yoruba in Ibadan, Nigeria (YRI) are often used to represent all of Africa; however, this population represents only a small amount of the diversity present across African genomes. 65 Further, the usual classification of people as having European, Asian, American, or African ancestry makes reference to a specific time period, i.e., after the global geographic dispersal of Homo sapiens from Africa and prior to the European colonization, especially of the Americas, that accompanied the so-called Age of Discovery. We could just as easily define continental ancestry based on a different time period, such as current human geography. 73 While no more right or wrong, this approach would lead to a very different understanding of, for example, American ancestry. While categorizing ancestry components by continent can be a useful model of the data, we must keep in mind that it is only a model, and one that obscures genetic heterogeneity within continents and the complex, dynamic political, social, and migratory histories of those regions. 74
Scientists are trained to evaluate new data to see if they match expectations, but this training can work against us when it intersects with our social biases because we view results that reflect those biases as more likely to be ''true'' than other results. This can lead to a belief in the correspondence of continental ancestry with historical races rather than recognizing the practice of clustering genomes in more or fewer population groups as a modeling choice. 75 Allele frequencies and patterns of linkage disequilibrium differ across populations, but these differences are a result of processes including mutation, genetic drift, selective pressure, and population bottlenecks and expansion, reflecting rich population history and migration 73 rather than static genetic differences between a fixed number of population groups.
Averting and correcting misuses of race and ancestry in genetics research now is critical before they potentially get ''baked into'' emerging applications. One example is the development of polygenic risk scores (PRSs), which provide estimates of an individual's genetic risk for a clinically relevant outcome. 76 PRSs are typically based on summary statistics derived from GWAS data, which to date have been heavily biased toward European populations. This bias has led to poorer predictive performance in non-European and admixed individuals, which could exacerbate health and healthcare disparities. 77 Diversifying study populations in GWASs and developing PRS methods applicable to diverse and admixed populations is of prime importance, but first we need to critically evaluate the roles that race, ethnicity, and ancestry play in these efforts. [78][79][80][81][82][83] Further, we contend that the recommendations presented here are relevant to PRS development and application, as well as other clinical and translational genomics efforts. 13,15 Ultimately, awareness, transparency, and sensitivity among researchers are needed to encourage thoughtful data stewardship, foster collaboration, and work toward expanding the diversity and representation needed to further translational genomic research. 1,2,84 As genetic scientists, we should promote meaningful genomic knowledge and scientific advancements with equitable benefit. We should commit to recruiting, supporting, and amplifying the voices of underrepresented scientists in academia and the genetics community more broadly, including internationally. 85 We recognize that addressing race, ethnicity, and ancestry in genetics research is a nuanced practice with changing perspectives. There is much to learn on how best to appropriately consider social factors in genetics research and translation and ensure that we dismantle any remnants of racialized thinking from this work. 
In order to tackle these issues successfully, we must be open to new and evolving ideas and approach this work with ongoing reflection and humility.

DECLARATION OF INTERESTS
The authors declare no competing interests.

INCLUSION AND DIVERSITY
One or more of the authors of this paper self-identifies as an underrepresented ethnic minority in science. One or more of the authors of this paper self-identifies as a member of the LGBTQ+ community. One or more of the authors of this paper self-identifies as living with a disability.