Color Data v2: a user-friendly, open-access database with hereditary cancer and hereditary cardiovascular conditions datasets

Abstract
Publicly available genetic databases promote data sharing and fuel scientific discoveries for the prevention, treatment and management of disease. In 2018, we built Color Data, a user-friendly, open access database containing genotypic and self-reported phenotypic information from 50 000 individuals who were sequenced for 30 genes associated with hereditary cancer. In a continued effort to promote access to these types of data, we launched Color Data v2, an updated version of the Color Data database. This new release includes additional clinical genetic testing results from more than 18 000 individuals who were sequenced for 30 genes associated with hereditary cardiovascular conditions as well as polygenic risk scores for breast cancer, coronary artery disease and atrial fibrillation. In addition, we used self-reported phenotypic information to implement the following four clinical risk models: Gail Model for 5-year risk of breast cancer, Claus Model for lifetime risk of breast cancer, simple office-based Framingham Coronary Heart Disease Risk Score for 10-year risk of coronary heart disease and CHARGE-AF simple score for 5-year risk of atrial fibrillation. These new features and capabilities are highlighted through two sample queries in the database. We hope that the broad dissemination of these data will help researchers continue to explore genotype–phenotype correlations and identify novel variants for functional analysis, enabling scientific discoveries in the field of population genomics.
Database URL: https://data.color.com/


Introduction
The use of next-generation sequencing (NGS) technologies in research and clinical laboratories has led to a rapid increase of genetic data. However, there is a lack of publicly available genetic data, especially paired genotypic–phenotypic data. In 2018, we launched Color Data, an open-access, cloud-based database containing genotypic and self-reported phenotypic information from 50 000 individuals who were sequenced for 30 genes associated with hereditary cancer (1). Color Data has already made an impact on the scientific community, being utilized in peer-reviewed publications (2,3) and presented as a resource to educators (4). Its user-friendly interface enables researchers to easily execute their own queries with self-serve filters and displays the results as text, tables and graphs. Results can also be downloaded in different file formats for further analyses and shared via email or social media.
Another important consideration when designing and implementing the database was scalability for volume and integration of different data points. For the second release of Color Data ('Color Data v2'), we added new features to the existing hereditary cancer dataset as well as a new dataset of genotypic and self-reported phenotypic information related to hereditary cardiovascular conditions from more than 18 000 individuals. Importantly, the hereditary cardiovascular conditions dataset retains the same user-friendly interface, making it easy for researchers and scientists to explore a new disease area.
Here we describe updates made to the database, including changes to the cohort and the addition of new query filters and results such as family health history. We also added clinical risk models to Color Data v2: the Gail Model (5) and the Claus Model (6) for breast cancer, simple office-based Framingham Coronary Heart Disease Risk Score (7) for coronary heart disease and CHARGE-AF simple score (8) for atrial fibrillation. These risk models are commonly used by healthcare providers in the clinic and are important tools to estimate risk. Recent work has demonstrated that polygenic scores can also accurately predict and stratify risk for common, complex diseases and can identify individuals who have a magnitude of risk for disease similar to that of individuals with pathogenic or likely pathogenic variants (9,10). As such, Color Data v2 also includes polygenic scores for breast cancer, coronary artery disease and atrial fibrillation. We highlight the addition of the hereditary cardiovascular conditions dataset, clinical risk models and polygenic scores through two sample queries in the database. To our knowledge, Color Data is the first database to include pre-calculated scores from clinical risk models and for polygenic risk, which can help researchers investigate the relationship between different types of risk factors, both genetic and non-genetic, for disease.

Materials and methods
Design and implementation of the database were previously described in the flagship publication by Barrett, Neben et al. (1). All individuals included in Color Data v2 received a multi-gene NGS panel test from Color Genomics, Inc. ('Color', Burlingame, CA) for 30 genes associated with hereditary cancer. In addition, a subset of individuals also received multi-gene NGS panel testing for 30 genes associated with hereditary cardiovascular conditions. All individuals consented to have their genetic and self-reported phenotypic information appear in Color's research database.
Laboratory procedures, bioinformatics analysis and variant interpretation for the multi-gene panel tests were performed at Color (Burlingame, CA) under Clinical Laboratory Improvement Amendments (#05D2081492) and College of American Pathologists (#8975161) compliance as previously described (11). Bioinformatics analysis included the previously described 30 genes associated with hereditary cancer and was updated to include an additional 30 genes associated with hereditary cardiovascular conditions.
Laboratory procedures and imputation for low coverage whole genome sequencing were performed at Color as previously described (12,13). Data from low coverage whole genome sequencing were used to calculate previously published polygenic scores for breast cancer (10), coronary artery disease (9) and atrial fibrillation (9). Each polygenic score was normalized using principal components analysis to account for the effects of population stratification. While polygenic scores have the highest performance in people of European ancestry, recent studies have demonstrated that they have stratification ability across global populations as well (14,15). Note that users who would like to view polygenic risk score results for a given query must select 'Calculated' in the polygenic risk score filter because only a subset of the individuals in the database have a calculated polygenic risk score. Individuals who do not have polygenic risk scores calculated are captured under the filter value 'Unknown'. Other self-reported phenotypic and genotypic information from 'Calculated' and 'Unknown' individuals is included in other query results by default, unless otherwise selected.
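The normalization step can be illustrated with a minimal sketch. It assumes one common approach, namely regressing the raw polygenic score on ancestry principal components and standardizing the residuals; the exact procedure used for Color Data v2 is described in the cited references and may differ. The function name, the number of components and the simulated data are all hypothetical.

```python
import numpy as np

def normalize_polygenic_score(raw_scores, principal_components):
    """Residualize raw scores on ancestry PCs, then scale to unit variance.

    Illustrative assumption only: this is not necessarily the procedure
    used to produce the scores in Color Data v2.
    """
    raw = np.asarray(raw_scores, dtype=float)
    pcs = np.asarray(principal_components, dtype=float)
    # Design matrix: an intercept column plus one column per principal component.
    design = np.column_stack([np.ones(len(raw)), pcs])
    # Ordinary least squares fit of the raw score on the principal components.
    coef, *_ = np.linalg.lstsq(design, raw, rcond=None)
    residuals = raw - design @ coef
    # Residuals already have zero mean because the model includes an intercept.
    return residuals / residuals.std(ddof=1)

# Simulated (hypothetical) data: 500 individuals, 4 principal components.
rng = np.random.default_rng(0)
pcs = rng.normal(size=(500, 4))
raw = 0.8 * pcs[:, 0] - 0.3 * pcs[:, 1] + rng.normal(size=500)
normalized = normalize_polygenic_score(raw, pcs)
```

After this adjustment the normalized score is uncorrelated with each principal component in the sample, so differences in score are less likely to reflect ancestry alone.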
Genotypic and self-reported phenotypic information were used in the following clinical risk models: Gail Model for 5-year risk of breast cancer (5), Claus Model for lifetime risk of breast cancer (6), simple office-based Framingham Coronary Heart Disease Risk Score for 10-year risk of coronary heart disease (7) and CHARGE-AF simple score for 5-year risk of atrial fibrillation (8). Note that only a subset of individuals have a risk score calculated. Individuals who do not have a risk score calculated are labeled as 'Unknown' if not enough information was provided to calculate a risk score or 'Ineligible' if they did not meet the model criteria. The risk model filter categories and filter values are listed for hereditary cancer in Table 1 and for hereditary cardiovascular conditions in Table 2.

Web interface
The Color Data home page (https://data.color.com/) links to two new query/result pages (hereafter referred to as 'dashboards'): one for hereditary cancer and one for hereditary cardiovascular conditions. The links to three sample queries on the home page have been updated to demonstrate to users potential use cases of these dashboards as well as new query filters and results.
On the hereditary cancer dashboard (https://data.color.com/v2/cancer.html), users can apply the new query filters for family health history, risk models and a polygenic risk score. These new filter categories and filter values are listed in Table 1. Polygenic risk score results will only be displayed if a user selects the filter value 'Calculated' because only a subset of individuals have polygenic risk scores calculated. The filter categories and filter values for the hereditary cardiovascular conditions dashboard are listed in Table 2. The same 'AND' logic and 'OR' logic apply as on the hereditary cancer dashboard.
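The sketch below illustrates how combined filters of this kind can behave. It assumes that values selected within one filter category are combined with 'OR' and that different categories are combined with 'AND'; this reading is an assumption, and the records, field names and `matches` helper are hypothetical rather than part of the database.

```python
def matches(record, filters):
    """Return True if the record satisfies every selected filter category.

    Assumed semantics: values within a category are OR'd together,
    and the categories themselves are AND'd together.
    """
    return all(
        record.get(category) in allowed_values
        for category, allowed_values in filters.items()
    )

# Hypothetical toy records, not taken from Color Data.
records = [
    {"gene": "APOB", "classification": "Pathogenic", "sex": "Female"},
    {"gene": "LDLR", "classification": "Likely pathogenic", "sex": "Male"},
    {"gene": "APOB", "classification": "Benign", "sex": "Female"},
]

# (classification is Pathogenic OR Likely pathogenic) AND (sex is Female)
filters = {
    "classification": {"Pathogenic", "Likely pathogenic"},
    "sex": {"Female"},
}
selected = [record for record in records if matches(record, filters)]
```

Under these assumed semantics only the first record is selected: the second fails the sex filter and the third fails the classification filter.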

Population characteristics
There are 9727 unique variants, over half of which are benign or likely benign (54.4%). The frequency of pathogenic and likely pathogenic variants in the total population is 1.4%. Finally, 0.3% of individuals are categorized as being at high risk for atrial fibrillation using the CHARGE-AF simple score, and 7.5% are estimated to have a high 10-year risk for coronary heart disease using the simple office-based Framingham Coronary Heart Disease Risk Score.

Sample query 1: frequency of pathogenic and likely pathogenic variants in genes associated with hereditary cardiovascular conditions
Cardiovascular disease is a leading cause of death in the USA, accounting for one-third of deaths worldwide (16). Many individuals with hereditary cardiovascular conditions progress asymptomatically, and as a result, go undiagnosed until they present with a sudden cardiac event. Users can investigate the frequency of pathogenic and likely pathogenic variants in genes associated with hereditary cardiovascular conditions in the database by filtering 'Classification: Pathogenic or Likely pathogenic' (https://data.color.com/v2/cardio.html#classification=Likely%20pathogenic&classification=Pathogenic). A total of 223 individuals have a pathogenic or likely pathogenic variant, the majority of whom are female (67.7%) (Figure 1A) and Caucasian (76.2%) (Figure 1B). The average age at testing was 45.0 years (Figure 1A), and the majority of individuals reported no personal history of cardiovascular disease and/or events (66.8%) (Figure 1C). Nearly one-fifth of variants were identified in LDLR (19.6%), followed by MYBPC3 (18.1%), KCNQ1 (9.7%) and PKP2 (9.7%), among others (Figure 1D). Of the 223 pathogenic or likely pathogenic variants identified, the most common result was a heterozygous APOB c.10580G>A (p.Arg3527Gln) (n = 16) (Figure 1E), which is associated with familial hypercholesterolemia and found in approximately 0.06% of individuals of European, non-Finnish ancestry (17,18). Familial hypercholesterolemia is characterized by elevated levels of low-density lipoprotein (LDL) and an increased risk of premature coronary artery disease, with 50% of men and 30% of women developing coronary artery disease by the age of 55, if left untreated (19). To investigate the subpopulation of individuals with this variant, users can filter by 'Gene: APOB' and 'Variant: c.10580G>A' (https://data.color.com/v2/cardio.html#gene=APOB&variant=c.10580G%3EA). In this subpopulation of 16 individuals, the majority were female (68.8%) and of Caucasian ethnicity (87.5%) (Figure 1F). On average, individuals in this subpopulation were 45.2 years old at the time of testing, and one individual reported a personal history of bundle branch block or heart block (Figure 1G). Taken together, researchers could use these data to better characterize the prevalence of hereditary cardiovascular disorders in a younger, unaffected population to identify asymptomatic individuals who are at risk for future cardiovascular disease and/or events.
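The query links above encode each filter as a key and value pair in the URL fragment, so such links can be assembled programmatically. The following is a minimal sketch based only on the URL pattern visible in this article; the `dashboard_url` helper is hypothetical and is not an official Color Data API, and filter keys other than those appearing in the published links are not guaranteed to exist.

```python
from urllib.parse import quote

BASE_URL = "https://data.color.com/v2"

def dashboard_url(dashboard, filters):
    """Build a dashboard link from a mapping of filter key -> list of values.

    Hypothetical helper: it mirrors the fragment pattern of the published
    query links and percent-encodes each value (space -> %20, '>' -> %3E).
    """
    pairs = []
    for key, values in filters.items():
        for value in values:
            # A key is repeated once per selected value, as in the sample links.
            pairs.append(f"{key}={quote(value, safe='')}")
    return f"{BASE_URL}/{dashboard}.html#" + "&".join(pairs)

pathogenic_url = dashboard_url(
    "cardio", {"classification": ["Likely pathogenic", "Pathogenic"]}
)
apob_url = dashboard_url(
    "cardio", {"gene": ["APOB"], "variant": ["c.10580G>A"]}
)
```

Here `pathogenic_url` and `apob_url` reproduce the two sample query 1 links, which makes it straightforward to generate a series of related queries, for example one link per gene of interest.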

Sample query 2: monogenic and polygenic breast cancer risk in women with a personal history of breast cancer
Breast cancer is a common, complex disease that is associated with rare pathogenic and likely pathogenic variants ('monogenic risk') and the cumulative effect of many common changes across the genome ('polygenic risk') (9,10). Recent work has suggested that monogenic risk and polygenic risk interact to modify an individual's overall risk for disease (13,20–22). To investigate monogenic and polygenic risk in women with a personal history of breast cancer, users can filter by 'Sex: Female', 'Personal health history: Breast' and 'BC Polygenic Risk Score: Calculated' (https://data.color.com/v2/cancer.html#sex=Female&personal_health_history=Breast&bc_polygenic_risk_score=Calculated). A total of 1443 females in the dataset reported a personal history of breast cancer and had a polygenic risk score for breast cancer risk calculated (Figure 2A). The majority of individuals are Caucasian (73.9%) (Figure 2B), with an average age of 59.9 years at the time of genetic testing (Figure 2A). The average age of diagnosis for breast cancer was 53.2 ± 11.2 years (standard deviation), and 39.2% of females were <50 years old at the time of diagnosis (Figure 2C). The pathogenic frequency in this population was 13.3% (Figure 2A). A total of 82 677 variants were identified, with the majority of variants in BRCA2 (14.6%), BRCA1 (12.0%) and APC (11.3%) (Figure 2D). Compared with the normal distribution of risk scores among all individuals with a polygenic score, the distribution in women with a personal history of breast cancer (individuals filtered by query) is left-skewed (Figure 2E). Taken together, users could use these data to investigate the risk conferred by monogenic and polygenic risk factors in women with a history of breast cancer.

Discussion
In Color Data v2, we added new query filters and results such as family health history as well as clinical risk models and a polygenic score for breast cancer to the existing hereditary cancer dataset. Overall, the total number of individuals in this dataset increased to 54 000; however, the individuals in Color Data v1 are not a strict subset of the individuals in Color Data v2. This is due to a difference in inclusion criteria between the two versions. In Color Data v1, individuals were included in the database if they had received clinical genetic testing for all or any subset of the 30 hereditary cancer genes. In Color Data v2, individuals were only included if they received genetic testing for all 30 genes. This change in cohort composition likely contributed to the observed change in frequency of pathogenic and likely pathogenic variants in the total population and the increase in the number of total and unique variants.
Database users can also now explore a new disease area with self-reported phenotypic information and genetic data for 30 genes associated with hereditary cardiovascular conditions from 18 738 individuals. The frequency of pathogenic and likely pathogenic variants in our dataset was higher than previously reported estimates (23,24). This could be due to the generally younger age of individuals in the cohort and/or reduced penetrance in asymptomatic carriers. Genetic testing for hereditary cardiovascular conditions at population scale has only recently begun, and sharing results through genetic databases such as Color Data will help rapidly advance our understanding of cardiovascular disease risk. Coronary artery disease may be of particular interest given the influence lifestyle modifications have been shown to have on lowering risk for disease. In a prospective study of more than 55 000 individuals, it was found that a healthy lifestyle was associated with significantly reduced risk of cardiovascular events across all genetic risk groups (25).
Similar to Color Data v1, Color Data v2 may be limited by selection bias for Caucasians and women as well as by self-reported phenotypic information. Not all individuals in the database provided enough information to calculate risk for the newly included clinical risk models or had polygenic risk scores calculated, resulting in incomplete datasets. As the field continues to evaluate the personal and clinical utility of polygenic risk scores, it will be important to consider their predictive power in light of other risk factors. In addition, the clinical risk models and polygenic scores shown may change over time as more evidence emerges and novel models are developed.