Introduction

Hypertension (HTN) is a major public health concern and remains a leading risk factor for stroke and cardiovascular disease1,2,3,4. The diagnosis and treatment of HTN are straightforward, but lack of control is commonplace: only about 40% of treated patients in the United States achieve blood pressure targets5. Precision rule-based algorithms are a promising tool for developing hypertension treatment and prevention strategies6, and incorporating multi-dimensional data that include genetics, nutrition, environment, and other biomarkers expands the potential prevention and intervention targets. The All of Us (AoU) Research Program allows communities to participate in data collection, further enriching the available data. Our rationale for this study was to validate the definition of HTN7 in this new resource using rule-based algorithms. The validity of this definition based on electronic health record (EHR) data in underrepresented populations is unknown.

The AoU Research Program, a component of the National Institutes of Health Precision Medicine Initiative, is a longitudinal cohort study in which participants are asked to play an active role in collecting and sharing their unique health information, including EHR data, for use in precision medicine studies8. The aim is to enroll over a million participants who represent the diversity of the United States.

AoU demonstration project teams were charged with replicating known associations from published literature to demonstrate the utility of the data and to test the Researcher Workbench interface prior to release. Our aim was to use published methods7 to replicate known differences in HTN prevalence in groups underrepresented in biomedical research (UBR) and illustrate variation in HTN prevalence in geographic regions of the U.S. We compared our results to the 2015–2016 National Health and Nutrition Examination Survey (NHANES) HTN prevalence results9. Our findings may inform the use of AoU data to develop rule-based algorithms based on EHR data for prevention and treatment of hypertension in clinical practice.

Methods

All of Us demonstration projects

The goals, recruitment methods and sites, and scientific rationale for AoU have been described previously8. Demonstration projects were designed to describe the cohort, replicate previous findings for validation, and avoid novel discovery, in line with the program value of ensuring equal access to the data by researchers. The work described here was proposed by Consortium members, reviewed and overseen by the program’s Science Committee, and confirmed as meeting criteria for non-human subject research by the AoU Institutional Review Board (IRB). All methods were carried out in accordance with relevant guidelines and regulations. Informed consent was obtained from all participants. All experimental protocols involving human participants were approved by the AoU IRB.

The initial release of data and tools used in this work was published recently10. Results are reported in compliance with the AoU Data and Statistics Dissemination Policy, which disallows disclosure of group counts under 20. AoU enrollment started in May 2018, and the program currently enrolls participants 18 years of age or older from a network of more than 340 recruitment sites11. From October 2019 to February 2020, 38 demonstration projects were performed using the AoU Research Program Curated Data Repository (CDR) on a secure server through the Researcher Workbench interface. The Researcher Workbench included 188,781 participants.

All of Us research hub

This work was performed on data collected by the previously described AoU Research Program8 using the AoU Researcher Workbench, a cloud-based platform where approved researchers can access and analyze data. The data currently include surveys, EHR data, and physical measurements (PM). The details of the surveys are available in the Survey Explorer found in the Research Hub, a website designed to support researchers12. Participants could choose not to answer specific questions. PM recorded at enrollment include systolic and diastolic blood pressure, height, weight, heart rate, waist and hip measurement, wheelchair use, and current pregnancy status. EHR data were linked for participants who consented. All three data types (survey, PM, and EHR) are mapped to the Observational Medical Outcomes Partnership (OMOP) common data model v5.2 maintained by the Observational Health Data Sciences and Informatics (OHDSI) collaborative. To protect participant privacy, a series of data transformations was applied. These included suppression of codes with a high risk of identification, such as military status; generalization of categories, including age, sex at birth, gender identity, sexual orientation, and race; and date shifting by a random number of days (less than one year), implemented consistently across each participant record. Documentation on privacy implementation and creation of the CDR is available in the AoU Registered Tier CDR Data Dictionary13. The Researcher Workbench currently offers tools with a user interface (UI) built for selecting groups of participants (Cohort Builder), creating datasets for analysis (Dataset Builder), and Workspaces with Jupyter Notebooks (Notebooks) to analyze data. The Notebooks enable use of saved datasets and direct query using the R and Python 3 programming languages10. We used R version 4.0.3 to perform the analyses and Microsoft Excel to create figures displaying the hypertension prevalence and 95% confidence intervals.

Participants completed informed consent, provided consent for sharing of electronic health record data with the Data and Research Center (DRC), and provided survey responses on demographics, health status and behaviors including cigarette smoking, alcohol use, and illicit drug use at baseline.

Definition of HTN

We defined HTN using the published electronic Medical Records and Genomics Network (eMERGE) algorithm (https://phekb.org/phenotype/resistant-HTN), developed for a study of resistant HTN cases versus controls with treated HTN14. The eMERGE definition of HTN required individuals to have an outpatient measurement of systolic blood pressure greater than 140 mm Hg or diastolic blood pressure greater than 90 mm Hg prior to meeting medication criteria, or an International Classification of Diseases, Ninth Revision, Clinical Modification (ICD-9-CM) code of 401.* (essential HTN) or International Classification of Diseases, 10th Revision, Clinical Modification (ICD-10-CM) code of I10 (essential HTN) at any time, and at least one medication from the HTN medication classes. The eMERGE network has published evidence of the improved positive predictive value (PPV) of using 2 instances of diagnosis/billing codes for phenotype algorithms in EHR data15. Since we did not have complete data on systolic and diastolic blood pressures from the EHR across all sites, we adapted the eMERGE definition to require at least 2 diagnosis/billing codes on separate dates in the EHR data AND at least one HTN medication. We defined the index date for newly diagnosed HTN cases as the date of the first HTN medication code, and we defined age for HTN cases at that index date. Females and males were identified as participants with female or male sex assigned at birth, respectively.
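The adapted rule can be written down compactly. The following Python sketch is illustrative only and is not the code used in this study (the analyses were run in R); the function names and the per-participant lists of dates are assumptions introduced for the example.

```python
from datetime import date


def meets_htn_definition(dx_dates, med_dates):
    """Adapted rule from the text: HTN diagnosis codes on at least two
    distinct dates AND at least one antihypertensive medication record.
    Both arguments are hypothetical per-participant lists of dates."""
    return len(set(dx_dates)) >= 2 and len(med_dates) >= 1


def index_date(dx_dates, med_dates):
    """Index date = date of the first HTN medication record, or None when
    the participant does not meet the adapted definition."""
    if not meets_htn_definition(dx_dates, med_dates):
        return None
    return min(med_dates)


def age_at(birth_date, on_date):
    """Age in completed years on a given date (e.g., the index date)."""
    years = on_date.year - birth_date.year
    if (on_date.month, on_date.day) < (birth_date.month, birth_date.day):
        years -= 1
    return years


if __name__ == "__main__":
    dx = [date(2018, 6, 1), date(2018, 9, 15)]    # codes on two distinct dates
    meds = [date(2018, 10, 1), date(2018, 7, 1)]  # medication records
    print(meets_htn_definition(dx, meds), index_date(dx, meds))
```

Repeated codes recorded on the same date count once, which is why the diagnosis dates are de-duplicated before counting.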

Data collection from in-person study visit and EHRs

Blood pressure was measured according to study protocols at each site during in-person “Physical Measurement” (PM) visits. Clinical data on blood pressure collected for routine patient care and recorded in participant EHRs were extracted and transformed into OMOP tables at each enrollment site. Data were transferred securely to the Data and Research Center at Vanderbilt University. PM visit and EHR data were used to identify blood pressure measurements for each data source. Survey data were used to collect data on demographics, including sex and gender identity, income, education, race/ethnicity, age, and geography (U.S. state of residence).

EHR data extraction

We extracted SNOMED codes for essential HTN and, for each participant, identified the date of the first code and of a second code recorded on a distinct date. A participant met the diagnosis-code criterion for HTN if SNOMED codes for HTN were identified on two distinct dates. For the 48,289 participants with the SNOMED code for essential HTN (59621000) on any date, we extracted each participant’s detailed dates of the SNOMED code for essential HTN from the Researcher Workbench table ‘cb_search_all_events’. We found 39,779 participants with the SNOMED code for essential HTN on at least two distinct dates.

Extraction of medication treatment history for anti-hypertensive medications

We selected medications from the following six classes based on RxNorm codes in the Researcher Workbench: peripheral vasodilators, agents acting on the renin-angiotensin system, beta blocking agents, antihypertensives, calcium channel blockers, and diuretics. The Researcher Workbench table ‘concept_ancestor’ was used to extract all medications within the six medication classes.
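In the OMOP common data model, the ‘concept_ancestor’ table links a higher-level concept (such as a medication class) to all of its descendant concepts, so a class can be expanded to individual drugs with a join. The sketch below illustrates that idea on a toy in-memory SQLite database; it is not the query used in this study, the concept IDs (100, 101, ...) are made up, and only a minimal subset of the OMOP columns is created.

```python
import sqlite3


def persons_with_class_drugs(conn, class_concept_ids):
    """Return person_ids with any drug exposure whose concept descends from
    one of the given medication-class concepts (via concept_ancestor)."""
    ids = list(class_concept_ids)
    if not ids:
        return []
    placeholders = ",".join("?" for _ in ids)
    sql = f"""
        SELECT DISTINCT de.person_id
        FROM drug_exposure AS de
        JOIN concept_ancestor AS ca
          ON ca.descendant_concept_id = de.drug_concept_id
        WHERE ca.ancestor_concept_id IN ({placeholders})
        ORDER BY de.person_id
    """
    return [row[0] for row in conn.execute(sql, ids)]


def build_toy_db():
    """Toy data with hypothetical concept IDs: class 100 -> drugs 101, 102;
    class 200 -> drug 201; drug 999 belongs to no selected class."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE concept_ancestor (
            ancestor_concept_id INTEGER,
            descendant_concept_id INTEGER
        );
        CREATE TABLE drug_exposure (
            person_id INTEGER,
            drug_concept_id INTEGER,
            drug_exposure_start_date TEXT
        );
        """
    )
    conn.executemany(
        "INSERT INTO concept_ancestor VALUES (?, ?)",
        [(100, 101), (100, 102), (200, 201)],
    )
    conn.executemany(
        "INSERT INTO drug_exposure VALUES (?, ?, ?)",
        [(1, 101, "2019-01-05"), (2, 201, "2019-02-10"), (3, 999, "2019-03-15")],
    )
    return conn


if __name__ == "__main__":
    conn = build_toy_db()
    print(persons_with_class_drugs(conn, [100, 200]))
```

In the Researcher Workbench the same join would run against the program's own tables rather than SQLite; the table and column names above follow the OMOP convention, but the surrounding setup is an assumption for illustration.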

Statistical analysis

Participants who had at least one Systematized Nomenclature of Medicine (SNOMED) code for HTN in their EHR were considered for the analysis. SNOMED codes are standardized terms for medical conditions used by healthcare providers for uniformity in diagnostics, billing, and documentation. After considering multiple potential definitions, we decided to use the EHR data (SNOMED codes for HTN on 2 distinct dates and at least one HTN medication) as the primary definition of HTN14. Diagnosis dates and antihypertensive medications were extracted as described above. We excluded SNOMED essential HTN codes (59621000) recorded on the same date as SNOMED pregnancy codes (24898207). There were 13,481 pregnant participants based on SNOMED pregnancy codes (24898207) and 1,665 with HTN and SNOMED pregnancy codes on the same date.
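The pregnancy exclusion operates on code dates rather than on participants. A minimal Python sketch of one way to apply it is shown here; the function name and inputs are hypothetical, and the study's own implementation may differ.

```python
from datetime import date


def htn_dates_excluding_pregnancy(htn_dates, pregnancy_dates):
    """Drop HTN code dates that coincide with a pregnancy code date and
    return the remaining distinct dates in chronological order."""
    return sorted(set(htn_dates) - set(pregnancy_dates))


if __name__ == "__main__":
    htn = [date(2019, 1, 10), date(2019, 4, 2), date(2019, 4, 2)]
    preg = [date(2019, 4, 2)]
    # Only the HTN code date not shared with a pregnancy code remains.
    print(htn_dates_excluding_pregnancy(htn, preg))
```

Under this reading, a participant whose only second HTN code date coincides with a pregnancy code would no longer have codes on two distinct dates.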

We calculated crude and age-adjusted prevalence of HTN, standardized by age using U.S. Census data as in Crim et al.7 Following their methods, we classified age at date of enrollment (e.g., PPI date) into 3 groups (18–39, 40–59, ≥ 60), 4 groups (18–39, 40–59, 60–74, ≥ 75), and 5 groups (18–49, 50–59, 60–69, 70–79, ≥ 80), and calculated an age-standardized HTN prevalence for each grouping according to the age distribution of the U.S. Census. The census population size in each age group is as of July 1, 2018 and based on https://www.census.gov/newsroom/press-kits/2019/detailed-estimates.html. Race/ethnicity was coded into 6 groups based on AoU race and ethnicity variables in the Researcher Workbench: Non-Hispanic White race, Non-Hispanic Black race, Non-Hispanic Asian race, more than one race, other race (including Native Hawaiian and Other Pacific Islander, Middle Eastern and North African), and Hispanic ethnicity. The confidence interval for hypertension prevalence was computed using the Normal approximation interval based on the central limit theorem. We also tested for a difference in HTN prevalence between males and females with a Chi-square test.

Socioeconomic status (SES) was classified from the income and education variables as a binary variable, with low SES defined as low income (≤ $25,000) OR low education (< high school degree or GED) vs. not low in either category. Individuals with missing values for education or income were included in the high income/high education group, based on the assumption that individuals with income higher than $25,000 might be more likely to have missing values for income and education than individuals with income less than $25,000. We assessed the agreement between the income and education variables by looking at the percent overlap of high income and high education versus low income and low education, and we tested for significance of the overlap with a Chi-square test. For education and income, we did sensitivity analyses for crude HTN prevalence stratified by the education and income variables: low education (< high school degree or GED) versus high education (above high school or GED) and low income (≤ $25,000) versus high income (> $25,000). We reported the frequency of missing values for education and income.

Geographic division of the U.S. was based on the 9 U.S. Census Geographic divisions (https://www.cdc.gov/nchs/products/databriefs/db289.htm): Division 1: New England (Maine, Vermont, New Hampshire, Massachusetts, Connecticut, and Rhode Island); Division 2: Middle Atlantic (New Jersey, New York, and Pennsylvania); Division 3: East North Central (Wisconsin, Michigan, Ohio, Indiana, and Illinois); Division 4: West North Central (North Dakota, South Dakota, Nebraska, Kansas, Missouri, Iowa, and Minnesota); Division 5: South Atlantic (Maryland, Delaware, West Virginia, Virginia, North Carolina, South Carolina, Georgia, Florida, and District of Columbia); Division 6: East South Central (Kentucky, Tennessee, Alabama, and Mississippi); Division 7: West South Central (Oklahoma, Arkansas, Texas, and Louisiana); Division 8: Mountain (Idaho, Montana, Wyoming, Colorado, Utah, Nevada, Arizona, and New Mexico); Division 9: Pacific (Washington, Oregon, California, Alaska, and Hawaii). The West North Central Division (n = 1) and West South Central Division (n = 0) were excluded from analyses due to extremely low sample size. The Census division information for participants was derived from the PPI data.
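The calculations described above (direct age standardization, the Normal approximation confidence interval, the Chi-square test for a 2 × 2 table, and the binary SES rule) can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the study code: the analyses were run in R, the function names are invented for the example, the census counts in the usage example are placeholders rather than the July 1, 2018 estimates, and income is treated as a dollar amount although the survey may record it in categories.

```python
import math


def wald_ci(p, n, z=1.96):
    """Normal-approximation (Wald) interval for a proportion p estimated
    from n participants; z = 1.96 gives an approximate 95% interval."""
    half_width = z * math.sqrt(p * (1.0 - p) / n)
    return max(0.0, p - half_width), min(1.0, p + half_width)


def direct_age_standardized(stratum_prevalence, standard_population):
    """Direct standardization: weight each age-specific prevalence by that
    age group's share of the standard (census) population."""
    total = sum(standard_population.values())
    return sum(
        stratum_prevalence[group] * standard_population[group] / total
        for group in standard_population
    )


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic (1 degree of freedom, no continuity
    correction) for the 2 x 2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


def is_low_ses(income_usd, below_high_school):
    """Low SES = low income (<= $25,000) OR low education. Missing values
    (None) are treated as not low, mirroring the rule in the text."""
    low_income = income_usd is not None and income_usd <= 25_000
    low_education = below_high_school is not None and bool(below_high_school)
    return low_income or low_education


if __name__ == "__main__":
    # Age-specific crude prevalences as reported for the 3-group split.
    crude = {"18-39": 0.077, "40-59": 0.32, "60+": 0.504}
    # Placeholder standard population counts (NOT the actual census values).
    census = {"18-39": 100, "40-59": 80, "60+": 70}
    print(direct_age_standardized(crude, census))
    print(wald_ci(0.30, 1000))
```

Because the weights come from the standard population rather than from the cohort, an older-than-average cohort yields an age-adjusted prevalence below its crude prevalence.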

Results

Researcher Workbench EHR and medication data were available on 104,047 participants, SNOMED codes were available on 112,468 participants, and 103,270 participants had both medication and SNOMED data. Thus, 103,270 was the denominator for prevalence calculations. Sociodemographic differences for individuals with and without HTN are shown in Table 1.

Table 1 Sociodemographic Characteristics of the All of Us Research Program Electronic Health Record Dataset in January 2020 Among Participants with HTN and Without HTN.

The total number of persons with SNOMED codes on at least two distinct dates and at least one antihypertensive medication was 33,310 for a crude prevalence of HTN of 32.2%. The crude prevalence was 7.7% among ages 18–39, 32% among ages 40–59, and 50.4% among ages ≥ 60 (Table 2). The census population size for each age group as of July 1, 2018 is shown in Table 2.

Table 2 Crude HTN Prevalence in All of Us Research Program with Age Distribution in All of Us Compared with the U.S. Census (2018).

Crude HTN prevalence in AoU for each age group by gender is shown in Table 3.

Table 3 Crude HTN prevalence in All of Us research program with age distribution in All of Us by gender.

All of Us data are skewed towards older age groups. Using the methods of Crim et al.7, we calculated age-adjusted HTN prevalence based on the 2018 U.S. Census data. Age-adjusted HTN prevalence was 27.8% using 3 groups, 28.2% using 4 groups, and 28.5% using 5 groups. In comparison, age-adjusted prevalence in NHANES 2007–2008 was 29.6% for 3 groups and 29.8% for 4 groups7. Figures 1 and 2 display the prevalence of HTN calculated using AoU data (Fig. 1) and data from NHANES 2015–20169 (Fig. 2).

Figure 1

Prevalence of HTN among adults aged 18 and over, by age: All of Us, 2018–2019; ages 18 and over (blue), 18 to 39 (red), 40 to 59 (green), and 60 and over (purple) years. All estimates are age adjusted using the census population size at each age group as of July 1, 2018, based on https://www.census.gov/newsroom/press-kits/2019/detailed-estimates.html. Error bars show 95% confidence intervals for HTN prevalence estimates. Figure was created with Microsoft Excel for Mac, Version 16.46.

Figure 2

Prevalence of HTN among adults aged 18 and over, by age: United States, 2015–2016; ages 18 and over (blue), 18 to 39 (red), 40 to 59 (green), and 60 and over (purple) years. All estimates are age adjusted by the direct method using computed weights based on the subpopulation of persons with HTN in the 2007–2008 National Health and Nutrition Examination Survey, using age groups 18–39, 40–59, and 60 and over. Access data table for Fig. 2 at: https://www.cdc.gov/nchs/data/databriefs/db289_table.pdf#4. SOURCE: NCHS, National Health and Nutrition Examination Survey, 2015–2016. Figure was created with Microsoft Excel for Mac, Version 16.46.

Both figures show HTN prevalence in the 3 age groups (red, green, and purple bars) and the overall age-adjusted prevalence (blue bar). Stratified by sex, age-adjusted prevalence (95% CI) in AoU was 28.7% (28.7–28.8) in males and 27.6% (27.57–27.58) in females, vs. 30.2% in males and 27.7% in females in NHANES9. Table 4 shows the crude and age-adjusted HTN prevalence among race categories (as defined in U.S. Census data), where American Indian and Alaska Native and Native Hawaiian and Other Pacific Islander are combined as ‘Other’.

Table 4 HTN prevalence in the All of Us Research Program among race/ethnic groups adjusted for age based on U.S. Census data for age distribution of the population in 4 groups, 18–39, 40–59, 60–74, ≥ 75.

Figure 3 shows crude HTN prevalence by socioeconomic status (SES) in AoU, 2018–2019. U.S. Census data are not available for age distribution by SES categories. With respect to missing data, we noted that 28.1% (n = 29,024) did not report income and 2.2% (n = 2,312) did not report education. HTN prevalence (95% CI) stratified by income ≤ $25,000 versus > $25,000 was 39.9% (39.1–40.7) versus 30.4% (30.1–30.8), respectively. For individuals who did not report income, HTN prevalence was 37.0% (35.0–38.9). HTN prevalence (95% CI) stratified by education < high school/GED versus one or more years of college was 34.9% (34.-35.4) versus 30.8% (30.4–31.1), respectively. For individuals who did not report education, HTN prevalence was 37.0% (35.0–38.9).

Figure 3

Crude HTN prevalence by SES in All of Us, 2018–2019. Error bars show 95% confidence intervals for HTN prevalence estimates. U.S. Census data is not available for age-distribution by SES categories. Figure was created with Microsoft Excel for Mac, Version 16.46.

Figure 4 shows crude HTN prevalence in All of Us by geographic region, 2018–2019.

Figure 4

Crude HTN prevalence in All of Us by geographic region, 2018–2019. Error bars show 95% confidence intervals for HTN prevalence estimates. U.S. Census data is not available for age-distribution by geographic region. Figure was created with Microsoft Excel for Mac, Version 16.46.

U.S. Census data is not available for age-distribution by geographic region. HTN prevalence was higher among those who live in the Middle Atlantic, South Atlantic, and East South Central regions of the U.S. Prevalence was lower among those who live in the Mountain region of the U.S.

Discussion

We completed the first analysis of HTN using data from the AoU Research Program Researcher Workbench. We reproduced known associations between race, SES, and geographic region and HTN9. The prevalence of HTN in the United States (U.S.) varies by age, sex, and socioeconomic status9, 16. AoU age-adjusted HTN prevalence using three age groups was 27.9% compared to 29.6% in NHANES. Using four age groups, age-adjusted prevalence was 28.2% in AoU compared to 29.8% in NHANES7. Fryar et al. studied temporal trends in NHANES HTN prevalence, age-adjusted to four groups, in two-year periods (2009–2016), with relatively stable rates of 28.6%, 28.7%, 29.3%, and, for 2015–2016, 29.0%9. Thus, AoU HTN prevalence is about 1 percentage point lower than reported prevalence in NHANES9. NHANES is considered a primary source of HTN statistics (e.g., prevalence and control) that informs public health and clinical care. We have shown that AoU data provide very similar prevalence estimates, which supports the data’s validity.

For more than 15 years, the U.S. saw a rise in blood pressure (BP) control from 31.8% to 53.8%17. However, BP control dropped to 43.7% between 2013–2014 and 2017–201817. A greater proportion of Americans, particularly those in marginalized communities, are living with uncontrolled HTN18,19,20. The drop in BP control highlights the need for healthcare providers to recommit to prioritizing BP control. Evidence suggests that computerized clinical decision support systems may be a promising tool for reducing the burden of HTN6,21,22. AoU may serve as a strategic platform to develop diversity-by-design rule-based algorithms for treatment and prevention of HTN that are generalizable to various populations. Researchers, clinicians, patients, community stakeholders, and analytics professionals (and possibly others) are all needed to ensure that the right additional checks and balances are in place for responsible algorithm deployment. The AoU data are available to everyone. The open-access nature of AoU data may address inherent bias problems caused by the lack of diversity among the individuals who have access to data.

NHANES, another open-access cohort, captures data on a nationally-representative sample of approximately 5,000 participants annually. NHANES includes data from survey interviews and in-person physical measurements. NHANES defined HTN for participants by (a) systolic blood pressure ≥ 140 or diastolic blood pressure ≥ 90 mm Hg, (b) if the subject said “yes” to taking antihypertensive medication, or (c) if the subject was told on two occasions that the subject had HTN. For AoU data, we chose an EHR-based definition of hypertension23,24,25 instead of using a clinical definition such as the ACC/AHA Guidelines published in 201726. Once the clinical diagnosis of HTN is made, clinicians and insurers make decisions using the EHR-based definition27. Thus, our EHR-based HTN findings that replicate NHANES’ HTN prevalence9 have important real-world implications for improving the management of HTN.

We demonstrated some modest differences in sex-stratified HTN prevalence: age-adjusted male prevalence was 28.7% in AoU compared to 30.2% in NHANES, and age-adjusted female prevalence was 27.6% in AoU vs. 27.7% in NHANES9. These differences could be due to the inclusion of HTN medication use in our HTN definition. In prior work, Geldsetzer et al. reported that among those with HTN, 39.2% were aware of their diagnosis, 29.9% had received treatment, and 10.3% had control of their HTN28. They also reported that being older, female, or a non-smoker, and having higher levels of education and income, were associated with greater progression through the cascade of HTN care28. HTN can often be treated successfully with medication29,30,31,32 and prevented or delayed with lifestyle modifications32,33,34. Even with these established HTN intervention and prevention strategies, the prevalence of HTN remains at levels of public health concern1.

Limitations

EHR data were limited to data collected within a single healthcare network and thus may not capture out-of-network care. In theory, AoU will ultimately include EHR data from individuals across multiple institutions. Some AoU recruitment sites are still in the process of EHR data extraction and transfer to the Data and Research Center. We currently do not have information on data completeness from each recruitment site in the AoU Research Program. Thus, our preliminary findings may underestimate HTN prevalence in the U.S. The geographic representation in the AoU Research Program is currently weighted towards regions with healthcare provider organizations that are funded for large-scale recruitment. As more direct volunteers are recruited in the future, we expect the geographic representation to improve.

Strengths

The AoU dataset provides advantages over datasets like NHANES. AoU has more covariates such as EHR data and genetic information for broader analyses. Data from AoU may contribute additional value to existing national resources used to study HTN through the scale at which measured data are available. Using the entire EHR allowed us to extract coded data on HTN diagnoses and medications, a method that has been shown to be valid by the eMERGE consortium15. To avoid a racially biased algorithm35, the diagnostic algorithm for hypertension did not use race or ethnicity data. Additionally, the diversity within AoU may provide insight into factors relevant to HTN prevention and treatments in a variety of social and geographic contexts and population strata in the U.S. given that over 80% of AoU participants have been historically underrepresented in biomedical research from the perspectives of age, race/ethnicity, sexual orientation and gender identity, geography or other dimensions.

In summary, the AoU Research Program data capture known differences in the prevalence of HTN by demographic7 and geographic characteristics. AoU has great potential to contribute to the vision of precision medicine for hypertension and to improve clinical outcomes in patients with and at risk for HTN. Future research that takes advantage of the rich data in AoU (including social determinants of health, genomics, and biomarkers) may lead to novel insights into differences among underrepresented groups. This cohort presents the opportunity to analyze data streams derived from genomics combined with clinical and geographical data to discover mechanisms and potential target molecules from which drugs or treatments can be developed.