The penetrance of rare variants in cardiomyopathy-associated genes: A cross-sectional approach to estimating penetrance for secondary findings

Summary Understanding the penetrance of pathogenic variants identified as secondary findings (SFs) is of paramount importance with the growing availability of genetic testing. We estimated penetrance through large-scale analyses of individuals referred for diagnostic sequencing for hypertrophic cardiomyopathy (HCM; 10,400 affected individuals, 1,332 variants) and dilated cardiomyopathy (DCM; 2,564 affected individuals, 663 variants), using a cross-sectional approach comparing allele frequencies against reference populations (293,226 participants from UK Biobank and gnomAD). We generated updated prevalence estimates for HCM (1:543) and DCM (1:220). In aggregate, the penetrance by late adulthood of rare, pathogenic variants (23% for HCM, 35% for DCM) and likely pathogenic variants (7% for HCM, 10% for DCM) was substantial for dominant cardiomyopathy (CM). Penetrance was significantly higher for variant subgroups annotated as loss of function or ultra-rare and for males compared to females for variants in HCM-associated genes. We estimated variant-specific penetrance for 316 recurrent variants most likely to be identified as SFs (found in 51% of HCM- and 17% of DCM-affected individuals). 49 variants were observed at least ten times (14% of affected individuals) in HCM-associated genes. Median penetrance was 14.6% (±14.4% SD). We explore estimates of penetrance by age, sex, and ancestry and simulate the impact of including future cohorts. This dataset reports penetrance of individual variants at scale and will inform the management of individuals undergoing genetic screening for SFs. While most variants had low penetrance and the costs and harms of screening are unclear, some individuals with highly penetrant variants may benefit from SFs.


Introduction
Cardiomyopathies (CMs) are diseases of the heart muscle, characterized by abnormal cardiac structure and function that is not due to coronary disease, hypertension, valve disease, or congenital heart disease. Many affected individuals have a monogenic etiology with autosomal dominant inheritance. Penetrance is incomplete and age related, and expressivity is highly variable. These features present huge challenges for disease management. In particular, the penetrance of variants in CM-associated genes is incompletely characterized and poorly understood, especially when identified in an asymptomatic individual without family history of CM. With the growing availability of exome and genome sequencing in wider clinical settings and consumer-initiated elective genomic testing, 1 the importance of estimating the penetrance of individual variants identified as secondary findings (SFs) to guide intervention is ever increasing.
SFs are genetic variants that are actively sought out (as opposed to incidental findings) but that are unrelated to the clinical indication for genetic testing and can therefore be considered as opportunistic genetic screening. Genes associated with inherited CMs make up one-fifth of the 78 genes recommended by the American College of Medical Genetics and Genomics (ACMG SF v.3.1) for reporting SFs during clinical sequencing. 2 It is recommended to return variants that would be classified as pathogenic or likely pathogenic in an affected individual with >90% confidence that the variant is causing the observed disease. This is independent of the probability that an individual carrying the variant will develop disease (penetrance). The ACMG SF guidelines have not yet been adopted globally; the European Society of Human Genetics recommends a cautious approach but is responsive to accumulating evidence. 3,4 We are concerned that the costs, harms, and benefits have not been fully characterized. We have previously discussed issues with the recommendations based on the lack of estimates of the harms and cost of this approach for variants in specific genes. 5 These estimates are required to conform to the ninth rule of Wilson and Jungner's principles of screening. 6 The burden of the implementation of reporting SFs in specific healthcare systems remains unassessed. There is little evidence for clinical utility and limited justification for use of resources. 4 Research is beginning to become available on implementation frameworks 7 and the perspectives of and impact on individuals with disease. [8][9][10][11][12] Subclinical phenotypic expressivity of rare variants in CM-associated genes has been demonstrated in the UK Biobank (UKBB) population cohort. [13][14][15]
Causes of variability in penetrance may include (1) genetic and allelic heterogeneity, as different alleles have different consequences on protein function; (2) environmental modifiers altering genetic influence (e.g., age, sex, hypertension, lifestyle); and (3) additional genetic modifiers with additive or epistatic interactions with the variant of interest (other variants or combinations of genetic factors, e.g., polygenic risk, variants in cis that drive allelic imbalance, imprinting, epigenetic regulation, compensation, threshold model, and transcript isoform expression). [16][17][18][19][20][21][22] Variant-specific estimates of penetrance are required to appropriately inform clinical practice and to fully utilize genetics as a tool to individualize the risk of developing disease in asymptomatic heterozygotes. 5,23 It is challenging to estimate the penetrance of individual rare variants through other study methods, as longitudinal population studies require very large sample sizes and long-term follow-up is required if penetrance is age related. Where data are available for rare variants in CM-associated genes, reported penetrance is mostly estimated from family-based studies. These may be affected by ascertainment biases and secondary genetic and environmental factors 24 and thus less applicable to SFs. Penetrance has been estimated in aggregate by gene and by disease. 13,25,26 Variant-specific penetrance in the general adult population for rare variants in CM-associated genes is unknown.
Here, we apply a cross-sectional approach using a method 26 that compares the allele frequency of individual rare variants in large cohorts of phenotypically affected individuals with the background frequency of the same variants in the phenotype-agnostic population to estimate penetrance. As well as providing aggregate penetrance estimates for groups of rare variants (e.g., those curated as pathogenic), this approach can estimate the penetrance of individual rare alleles. Importantly, these estimates represent variants in the general population rather than in families ascertained for disease.

Case cohort
Sequencing data for 10,400 individuals referred for hypertrophic cardiomyopathy (HCM) gene panel sequencing and 2,564 individuals referred for dilated cardiomyopathy (DCM) gene panel sequencing were collected from seven international testing centers: three UK-based centers (the NIHR Royal Brompton Biobank, Oxford Molecular Genetics Laboratory, and Belfast Regional Genetics Laboratory); two US-based centers (the Partners Laboratory of Molecular Medicine and GeneDx); the National Heart Centre, Singapore; and Aswan Heart Centre, Egypt. Although the diagnosis cannot directly be reconfirmed, given genetic testing guidelines (e.g., Wilde et al., 27 Ackerman et al. 28 ), a clinical diagnosis of CM is implicit. For information on DNA sequencing and data obtained for analyses, see the supplemental information.
For each variant observed in one or more individuals referred for CM sequencing, we calculated the allele count (AC) and allele number (AN) and further stratified by reported age, sex, and ancestry where the data allowed. All research participants provided written informed consent, and the studies were reviewed and approved by the relevant research ethics committee (Aswan Heart Centre: FWA00019142, research ethics committee code 20130405MYFAHC_CMR_20130330; NIHR Royal Brompton Biobank: South Central - Hampshire B Research Ethics Committee, 09/H0504/104+5, 19/SC/0257; National Heart Centre Singapore: Singhealth Centralised Institutional Review Board 2020/2353 and Singhealth Biobank Research Scientific Advisory Executive Committee SBRSA 2019/001v1; UK Biobank: National Research Ethics Service 11/NW/0382, 21/NW/0157, under terms of access approval number 47602).
In addition, diagnostic laboratories (Oxford Molecular Genetics Laboratory, Belfast Regional Genetics Laboratory, the Partners Laboratory of Molecular Medicine, and GeneDx) provided aggregated (and therefore fully anonymous) cohort-level summaries of variant data collected for clinical purposes during routine healthcare. Secondary use of this data did not require research consent from individuals, and approval for public release of the data followed local governance procedures. Data are publicly available through DECIPHER (https://www.deciphergenomics.org/). Analyses of these data do not require research ethics committee approval.

Population cohort
167,478 participants of the UK Biobank (UKBB) with whole-exome sequencing data available for analyses and 125,748 exome-sequenced participants of the Genome Aggregation Database (gnomAD; v.2.1.1) were included in this study.
Briefly, the UKBB recruited participants aged 40-69 years old from across the UK between 2006 and 2010. 29 Of the 200,571 individuals in the exome tranche, those who had not withdrawn were included in this study. 30 The maximal subset of unrelated participants was used, identified as those included in the UKBB principal-component analysis (PCA) (S3.3.2 29 ). gnomAD contains sequencing information for unrelated individuals sequenced as part of various disease-specific and population genetic studies. 31 The version 2 short variant dataset spans 125,748 exomes. We used Ensembl Variant Effect Predictor 32 (VEP, version 105) to incorporate the variant-specific summary counts. Variants flagged by gnomAD as AC0 were excluded from gnomAD counts. For more information on the incorporation of these datasets, please see the supplemental information.
Protein-altering variants, defined with respect to MANE transcripts, that were annotated as high or moderate impact by Sequence Ontology and Ensembl were included in the analysis. We restricted the analysis to genes with strong or definitive evidence of causing CM following ClinGen guidance 37,38 and expert curation 39 (Table S4). Variants with consequences consistent with the known disease-causing mechanism were retained.
Further manual annotation was undertaken following ACMG guidelines with ClinVar 35 and Cardioclassifier, 40 as previously published. 13 For analyses of variants in aggregate, the UKBB data were filtered following the same thresholds and used to estimate aggregate penetrance.

Statistical analysis
Estimation of penetrance and 95% confidence interval
Penetrance, the probability of a disease given a risk allele, is expressed as a probability function on a scale of 0-1 or as a percentage. Penetrance was estimated from case-population data in a binomial framework following Bayes' theorem: 26

$$P(D \mid A) = \frac{P(D)\,P(A \mid D)}{P(A)}$$

$$\text{penetrance} = \frac{\text{population prevalence} \times \text{case allele frequency}}{\text{population allele frequency}}$$

where D, disease; A, allele; P, probability; $P(D \mid A)$ = penetrance (probability of disease given a risk allele); $P(D)$ = prevalence, the population baseline risk of disease (probability of disease); $P(A \mid D)$ = allele frequency in the case cohort (probability of the allele given disease); and $P(A)$ = allele frequency in the population cohort (probability of the allele).
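As a minimal sketch of the point estimate only, the Bayes relation can be written as a short function. The function name, the choice of denominators, and the assumption that allele number equals twice the number of participants are illustrative and not taken from the study's code. The counts are those quoted in the Results for MYBPC3 c.26-2A>G (four case observations, 20 population observations), with the full HCM case cohort and the pooled reference population assumed as denominators.

```python
def penetrance(prevalence, case_ac, case_an, pop_ac, pop_an):
    """Point estimate of P(D|A) = P(D) * P(A|D) / P(A)."""
    case_af = case_ac / case_an  # P(A|D): allele frequency in affected individuals
    pop_af = pop_ac / pop_an     # P(A): allele frequency in the reference population
    return prevalence * case_af / pop_af


# Illustrative inputs (assumption: AN = 2 x number of participants).
hcm_prevalence = 1 / 543   # HCM prevalence estimate reported in this study
estimate = penetrance(
    hcm_prevalence,
    case_ac=4, case_an=2 * 10_400,    # HCM case cohort
    pop_ac=20, pop_an=2 * 293_226,    # UKBB + gnomAD reference participants
)
print(f"{estimate:.1%}")  # prints 1.0%
```

Under these assumptions the result is close to the 1.0% reported for this variant; the published estimate additionally uses an improved mean approximation, so small differences are expected for other variants.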
We define penetrance in this setting as the probability of dominant CM by late adulthood (UKBB had a mean age of 56 years old at recruitment). We assume the independence of the random variables in the penetrance equation above to derive the 95% confidence interval for penetrance as the product and ratio of binomial proportions. We used the delta method, a specialized version of the central limit theorem, on the log-transformed random variable

$$\log P(D \mid A) = \log P(D) + \log P(A \mid D) - \log P(A)$$

with an improved mean approximation and adjustment for degeneracy (as allele frequency tends to 0 for rare variants). Please see additional methods and alternative approaches considered (supplemental methods, Table S3; Figures S4 and S5).
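The following is a simplified sketch of a log-scale delta-method interval, assuming independent binomial proportions. It uses only the plain first-order approximation and omits the improved mean approximation and the degeneracy adjustment described above, so it requires all counts to be greater than zero. The function name is hypothetical, the prevalence counts (84 of 45,612) are hypothetical effective counts chosen to give a prevalence of 1 in 543, and allele number is assumed to be twice the number of participants. The variant counts are those quoted in the Results for MYH7 c.3158G>A (seven case and 17 population observations).

```python
import math


def log_delta_ci(x_d, n_d, x_case, n_case, x_pop, n_pop, z=1.96):
    """First-order delta-method CI for P(D) * P(A|D) / P(A) on the log scale."""
    p_d, p_case, p_pop = x_d / n_d, x_case / n_case, x_pop / n_pop
    point = p_d * p_case / p_pop
    # For a binomial proportion, Var[log(p_hat)] ~ (1 - p) / (n * p) = (1 - p) / x.
    # The three variances add because the terms are assumed independent.
    var_log = (1 - p_d) / x_d + (1 - p_case) / x_case + (1 - p_pop) / x_pop
    half_width = z * math.sqrt(var_log)
    lower = point * math.exp(-half_width)
    upper = min(1.0, point * math.exp(half_width))  # penetrance cannot exceed 1
    return point, lower, upper


# Hypothetical effective prevalence counts; AN = 2 x participants is an assumption.
point, lower, upper = log_delta_ci(
    x_d=84, n_d=45_612,
    x_case=7, n_case=2 * 10_400,
    x_pop=17, n_pop=2 * 293_226,
)
print(f"{point:.1%} ({lower:.1%}-{upper:.1%})")
```

With these inputs the sketch gives an interval of roughly 0.9% to 5.3% around a point estimate near 2.1%, broadly in line with the 2.2% (0.9%-5.2%) reported for this variant.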
For estimates of penetrance by sex, we adjusted all terms of the penetrance equation by values for sex-specific parameters. For estimates of penetrance by ancestry, we kept $P(D)$ as estimated for CM (there are few estimates of the prevalence of CM in specific ancestries) and proportioned $P(A \mid D)$ and $P(A)$ by reported ancestry. For estimates of penetrance by age, we normalized $P(D)$ by the number diagnosed in the case cohort by a particular age in a cumulative fashion, with $P(A \mid D)$ by a particular age and $P(A)$ fixed as total population allele frequency (supplemental methods).

Estimated cardiomyopathy prevalence
To incorporate $P(D)$ in our penetrance analysis, we estimated the uncertainty surrounding the reported prevalence of CM (Tables S1 and S2; Figures S1-S3). For HCM, we meta-analyzed four imaging-based prevalence estimates 13,[41][42][43] excluding studies with potential selection biases. From the meta-analysis estimate ($p_D \equiv P(D)$) and its confidence interval, we derived values of allele count, $x_D$, and allele number, $n_D$ (where $p = x/n$). A literature review was also completed for DCM, but there were not enough imaging-based prevalence estimates in literature, so we used 39,003 participants of the UKBB imaging cohort to estimate phenotypic DCM [44][45][46] (supplemental methods). Using the same methods and included studies, we derived estimates for male- and female-specific HCM and DCM prevalence.
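One possible way to turn a prevalence estimate and its confidence interval into effective counts is sketched below. This is an assumption for illustration, not the procedure documented in the supplemental methods: it matches the width of the reported interval to a log-scale normal approximation for a binomial proportion. The function name `effective_counts` is hypothetical, and the inputs are the rounded HCM values reported in this study (0.18%, 0.15%-0.23%).

```python
import math


def effective_counts(p, ci_low, ci_high, z=1.96):
    """Effective count x and number n such that p = x / n and the
    log-scale Wald interval has the same width as the reported CI."""
    se_log = (math.log(ci_high) - math.log(ci_low)) / (2 * z)
    # For a binomial proportion, Var[log(p_hat)] ~ (1 - p) / x, so solve for x.
    x = (1 - p) / se_log ** 2
    n = x / p
    return x, n


x_d, n_d = effective_counts(0.0018, 0.0015, 0.0023)
print(round(x_d), round(n_d))
```

The returned values are not integers; whether to round them, and in which direction, is a design choice that slightly changes the implied interval.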

Results
Case cohort summary information
Sequencing data for 10,400 individuals referred for HCM genetic panel sequencing and 2,564 individuals referred for DCM genetic panel sequencing were included in the analysis. Aggregate frequency of rare protein-altering variants in well-established disease-associated genes was 41% for HCM and 32% for DCM in the respective case cohorts (Tables S6 and S7). Of the cohorts with age, sex, and ancestry information available (20% of HCM-affected individuals, 42% of DCM-affected individuals), 35% and 32% were female, 93% and 91% were of EUR ancestry, and mean age was 48 and 49 years old, for HCM and DCM, respectively (Table S5).

Estimates of the prevalence of CMs
To estimate the prevalence of CMs, we undertook a literature review and meta-analysis (Tables S1 and S2; Figures S1-S3). Prevalence is underestimated when derived from national cohorts using coding systems such as ICD codes because of incomplete ascertainment through diagnostic and procedure coding. 47 We would therefore expect the most accurate estimates of the prevalence of CM to come from imaging studies in populations, where echocardiogram or cardiac magnetic resonance imaging was used to identify CM within a population sample that is representative. The estimates are not generalizable if the prevalence is estimated for selected subgroups of individuals, such as young, elderly, or athletic cohorts. We therefore meta-analyzed four imaging-based prevalence estimates, which resulted in an HCM population prevalence estimate of 1 in 543 individuals ($p_D$ = 0.18% [95% CI$_D$ = 0.15%-0.23%]). 13,[41][42][43] The well-reported estimate of 1 in 500 individuals for HCM prevalence (0.20%) is within this confidence interval.
A literature review revealed insufficient imaging-based estimates to undertake a direct meta-analysis of the prevalence of DCM. Instead, we used 39,003 participants of the UKBB imaging cohort to estimate phenotypic DCM. [44][45][46] This derived a DCM population prevalence of 1 in 220 individuals ($p_D$ = 0.45% [95% CI$_D$ = 0.39%-0.53%]), which includes the well-reported estimate of 1 in 250 (0.40%) 48 within the confidence interval.
We also estimated sex-specific CM prevalence. This resulted in an HCM population prevalence of ~1 in 1,300 females [...]

Estimated penetrance of rare variants in aggregate
In individuals with cardiomyopathy referred for diagnostic sequencing, we identified 1,332 rare (inclusive population allele frequency of <0.1%) variants in HCM-associated genes (4,305 observations, case frequency 41%) and 663 rare variants in DCM-associated genes (831 observations, case frequency 32%) (Tables S6-S9). The UKBB dataset was filtered following the same pipeline. We used 1,719 rare variants in HCM-associated genes (9,152 observations, 5.5% population frequency) and 4,568 rare variants in DCM-associated genes (22,177 [...] Figure 1, Table S15). An estimate of the aggregate penetrance of both pathogenic and likely pathogenic variants in HCM was 10.7% (8.7%-13.3%) with this approach, concordant with a recent estimate derived via direct assessment of cardiac imaging in UKBB (10.8%; individuals with variants and left ventricular hypertrophy (LVH) ≥13 mm without hypertension or valve disease; binomial 95% confidence interval of 3.0%-25.4%; n = 4/37). 10 This concordance was also observed for other variants in the same paper (e.g., VUSs), for which we estimated penetrance as 0.55% (0.45%-0.68%) compared to 0.57% (0.07%-2.03%, n = 2/353). 10 The aggregate penetrance of pathogenic and likely pathogenic variants in DCM was 11.3% (9.3%-13.6%). Population penetrance of rare variants in DCM-associated genes in UKBB has been previously estimated as ≤30% 49 for a clinical or subclinical diagnosis in an analysis of 44 DCM-associated genes and in the range of 5%-6% for truncating variants in TTN (TTNtvs, 1.9%-12.8%; 877 individuals with variants) 5 depending on the definition used. We report a concordant penetrance estimate from our analysis of strong and definitive evidence DCM-associated genes only and 9.8% (8.0%-12.1%) for all TTNtvs (Figures 2 and S12).
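The binomial intervals quoted for the imaging-based comparison (n = 4/37 and n = 2/353) are consistent with an exact (Clopper-Pearson) interval. Whether the cited study used exactly this method is an assumption; the sketch below simply shows how such an interval can be computed with the standard library. The helper names are hypothetical.

```python
from math import comb


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def _root_of_decreasing(f, lo=0.0, hi=1.0, iters=100):
    """Bisection for the root of a function that decreases on [lo, hi]."""
    for _ in range(iters):
        mid = (lo + hi) / 2
        if f(mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def clopper_pearson(x, n, alpha=0.05):
    """Exact two-sided binomial confidence interval for x successes in n trials."""
    # Lower limit: p at which P(X >= x) = alpha / 2, i.e. cdf(x - 1) = 1 - alpha / 2.
    lower = 0.0 if x == 0 else _root_of_decreasing(
        lambda p: binom_cdf(x - 1, n, p) - (1 - alpha / 2))
    # Upper limit: p at which P(X <= x) = alpha / 2.
    upper = 1.0 if x == n else _root_of_decreasing(
        lambda p: binom_cdf(x, n, p) - alpha / 2)
    return lower, upper


lvh_low, lvh_high = clopper_pearson(4, 37)
vus_low, vus_high = clopper_pearson(2, 353)
print(f"4/37:  {lvh_low:.1%}-{lvh_high:.1%}")
print(f"2/353: {vus_low:.2%}-{vus_high:.2%}")
```

The width of these intervals illustrates why direct imaging-based estimates in small numbers of individuals with variants are much less precise than the case-population estimates.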
Variants predicted to result in premature termination codons (PTCs; nonsense-mediated decay competent or incompetent 50 ) in MYBPC3, BAG3, DSP, and LMNA were the most penetrant. Inframe deletions in TNNT2 were highly penetrant for both HCM and DCM. TTNtvs and missense variants predicted to be damaging in TPM1 and TNNC1 had moderate penetrance (Figures 2 and S12; Tables S13, S14, S18, and S19).
Stratification by variant rarity showed that variants absent from gnomAD were the most penetrant subgroup (Figure 1, Table S16). Stratification of penetrance by sex identified increased penetrance for males compared to females for rare variants in HCM-associated genes (Figures 1 and S13; Table S20). We estimated penetrance as <20% up to 50 years of age by modeling the penetrance of CM as an age-related cumulative frequency by using the proportion of affected individuals referred at each age decile (Figure 1; Table S17).
Group 2 included 316 variants found multiple times in both affected individuals and population reference datasets (case AC ≥ 2, pop AC ≥ 2). This group is expected to include variants with intermediate penetrance, including founder effect variants. For this group, we can estimate AF in both populations and therefore can estimate penetrance (Figure 4, Interactive Figure S15; Tables S10 and S11). These account for more than half of all variants identified in HCM-associated genes and include those most likely to be identified as SFs, as they are identified multiple times in the population. For HCM, 257 variants were identified a total of 2,203 times (21% case frequency, 51% observations). 11% were P (n = 29, 37% HCM group 2 observations), 25% LP (n = 64, 31% observations), 59% VUSs (n = 151, 29% observations), and 5% likely benign (LB, n = 13, 3% observations). 49 of these variants were recurrent at least ten times and described a large portion of observations (case AC ≥ 10; found 1,424 times, 33.0% of case cohort observations, case frequency of 13.7%). The median penetrance of these was 14.6% (±14.4% SD). For DCM, 59 variants [...]

Figure 2. The aggregate estimates of penetrance of loss-of-function variants are high for specific genes
The plot depicts estimated penetrance and 95% confidence interval of HCM-associated (A) and DCM-associated (B) rare variants. Predicted loss-of-function (pLoF) and non-pLoF variant groups are plotted in green and blue, respectively. *, TTNtvs that are PSI > 90%. Pathogenic TNNT2 inframe deletions caused an increased penetrance signal for inframe deletions for both HCM and DCM (see Figure S12). PTC, premature termination codon; PAV, protein-altering variant; NMDc/NMDi, nonsense mediated decay competent/incompetent.
The impact of age, sex, and ancestry on variant-specific penetrance estimates
For group 2, where age-related penetrance could be derived, we estimated the penetrance of specific variants by decade of age (e.g., Figure 5). For some variants (e.g., MYBPC3 c.1624G>C [p.Glu542Gln] [GenBank: NM_000256.3]), the age-related penetrance curve shows infrequent onset before middle age. These curves may inform surveillance strategies in individuals with variants unaffected at first assessment.
We identified rare variants in HCM-associated genes where estimated penetrance for males was significantly increased compared to females (Figure S13). Identification of such variants allows for future investigations of modifiers protecting females with variants from disease.
For estimates of penetrance by ancestry, variants that were nominally more common in AFR, EAS, or SAS ancestries compared to EUR ancestry were identified (Table S12). We interpret these as more consistent with an inaccurate penetrance estimation arising from ancestries where the variant is sparsely observed rather than true differences in penetrance on different ancestral background. For example, MYBPC3 c.1544A>G (p.Asn515Ser) (GenBank: NM_000256.3) was identified 5/492 times in AFR affected individuals (AF = 0.005) and 33/10,655 times in AFR population participants (AF = 0.0016; penetrance of 0.6% [0.2%-1.5%]) compared to 1/9,692 times in EUR affected individuals (AF = 0.00005) and not observed in 211,532 EUR population participants. Even when ancestry is nominally matched, broad continental groupings hide great diversity and results may be misleading due to stratification between case datasets (mostly North AFR from Egypt) and population reference datasets (e.g., UKBB participants from the Caribbean) (Box 1).

Figure 3. Penetrance of individual variants could be estimated for 316 recurrently observed rare variants from group 2
(A) The figure shows variant counts and subgroups for rare variants in HCM-associated (left) and DCM-associated (right) genes. (B) The pie charts plot the proportion of all variant observations in each subgroup (also denoted as "G+"). The observations approximate to the number of individuals with variants, although a small number of individuals may carry more than one variant. All, denotes frequency of the variant in affected individuals; obs, denotes observations of allele count. Group 1: variants observed recurrently in affected individuals and absent or singleton in the population; penetrance estimates are unreliable as the population frequency is uncertain. This group is expected to include most definitively pathogenic, high-penetrance variants. Group 2*: variants observed recurrently in affected individuals and the wider population; these are the variants most likely to be observed as secondary findings. *Penetrance can be estimated. Group 3: variants observed once in affected individuals and recurrently in the population; penetrance estimates are unreliable, as the case frequency is uncertain. Variants in this group are likely either not pathogenic or have low penetrance. Group 4: variants are singleton in affected individuals and absent or singleton in the population; current data is too sparse to estimate penetrance.
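The four variant groups described in the Figure 3 legend can be expressed as a small decision rule on case and population allele counts. This is a sketch under the assumption that "recurrently" means an allele count of at least two (as stated for group 2) and "absent or singleton" means an allele count of at most one; the function name is hypothetical.

```python
def variant_group(case_ac: int, pop_ac: int) -> int:
    """Assign a variant to groups 1-4 from its case and population allele counts."""
    if case_ac < 1:
        raise ValueError("variant must be observed at least once in affected individuals")
    if pop_ac < 0:
        raise ValueError("population allele count cannot be negative")
    recurrent_in_cases = case_ac >= 2
    recurrent_in_population = pop_ac >= 2
    if recurrent_in_cases and not recurrent_in_population:
        return 1  # population frequency uncertain; likely high penetrance
    if recurrent_in_cases and recurrent_in_population:
        return 2  # both frequencies estimable, so penetrance can be estimated
    if recurrent_in_population:
        return 3  # case frequency uncertain; likely benign or low penetrance
    return 4      # too sparse in both datasets


print(variant_group(case_ac=7, pop_ac=17))  # prints 2
```

Only group 2 has enough observations on both sides of the ratio for a variant-specific penetrance estimate, which is why the 316 group 2 variants are the focus of the variant-level results.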
Clinical impact of specific variants now shown to have low penetrance
We can define the upper bound of the penetrance estimate for some variants. 162 rare variants in HCM-associated genes (63% of variants, observed 745 times [7% case frequency; 17% of observations]) have a penetrance of ≤10%, according to the upper limit (UCI) of the 95% CI for our estimate. These included two variants previously curated as definitively pathogenic and 25 variants curated as likely pathogenic.
One of the pathogenic variants is splice acceptor MYBPC3 c.26-2A>G (GenBank: NM_000256.3), which has an estimated penetrance of 1.0% (0.4%-2.8%) or 0.9% (0.3%-2.5%) in EUR ancestry, as it was identified four times in EUR affected individuals and 20 times in population participants (90% were EUR). The potential for this variant to have incomplete penetrance has been noted previously through identified asymptomatic individuals with variants (see ClinVar ID 42644). There is in silico evidence of an alternate splice site downstream that could result in an in-frame deletion of two amino acids.
The second pathogenic variant identified with a UCI of ≤10% is the missense variant MYH7 c.3158G>A (p.Arg1053Gln) (GenBank: NM_000257.4), which is a Finnish founder mutation. This variant had an estimated penetrance of 2.2% (0.9%-5.2%), as it was identified seven times in EUR affected individuals and 17 times in the population cohort (16 Finnish from gnomAD, one NWE from UKBB). Estimates of penetrance are sensitive to allele frequency differences across ancestries. Analysis of founder mutations in the population they derive from would provide additional confidence in their penetrance estimates.

Figure 4. Variant-specific estimates of penetrance for the 316 recurrently observed rare variants in CM-associated genes from group 2
An interactive widget is available for browsing the individual variants in this figure (see Figure S15). The variants depicted (HCM n = 257, A; DCM n = 59, B) were identified multiple times in affected individuals and population reference datasets and penetrance could therefore be estimated. Presented is the estimated penetrance and 95% confidence interval. The x axis denotes the number of times the variant was observed in each case cohort. AC, allele count; B/LB, benign/likely benign; VUS, variant of uncertain significance; LP, likely pathogenic; P, pathogenic.
For DCM, 17 rare variants (29% of variants) observed 45 times (2% case frequency; 5% of observations) met this criterion. None of the 17 variants were curated as P/LP.

Penetrance estimate simulations of increased cohort sizes
We anticipate two benefits to estimating the penetrance of rare variants from increasing cohort sizes: (1) there will be more variants that are observed recurrently in affected individuals and populations, permitting AF estimates and hence penetrance estimates, and (2) the precision of our penetrance estimates will increase as AF of rare variants is ascertained with greater precision.
We sought to understand whether it would be more valuable to focus resources on aggregating data from larger numbers of affected individuals (~100,000 plausible affected individuals with global collaboration efforts), and/or from larger numbers of population participants with near-term publicly available population datasets (~5,000,000 participants).
Efforts to increase reference population sample size will provide additional confidence in penetrance estimates once case aggregation to 10,000 affected individuals is reached (Figure S6). There is substantial confidence to be gained by increasing the population cohort size: we found that increasing the population dataset from 300,000 participants to 4.5 million participants could provide ~20% certainty, depending on the penetrance of the variant (Figures S7-S11). The increase in confidence gained from increasing the case cohort sample size from 10,000 affected individuals to 100,000 affected individuals was limited (with the caveat that more variants will be identified).

Discussion
We show that some subgroups of rare variants in the population are penetrant and that it may be reasonable to return these as SFs. These include ultra-rare variants, predicted PTCs in certain genes where loss of function is a known disease mechanism, and variants with enough evidence to have been classified previously as definitively pathogenic.
There is still uncertainty regarding the penetrance of individual ultra-rare variants, and the implications of returning SFs in healthcare systems have yet to be estimated. While we have previously attempted to assess the burden of long-term surveillance for DCM, 5 cost-effect analyses are vital to fully understand the risks and benefits of reporting SFs in different healthcare systems. For variant types with low penetrance, it is very uncertain that the benefit of returning SFs will outweigh harms and justify costs.
Here, we provide at-scale estimates of variant-specific penetrance for variants in CM-associated genes that include those likely to be most frequently identified as SFs. Most have low estimated penetrance, where an asymptomatic individual without family history of disease may choose no or less-frequent surveillance depending on the healthcare system and follow-up cost.
Population penetrance estimates derived from unselected individuals (with certain caveats 54 ) that are agnostic to personal or family history of disease should provide a better estimate of the probability of manifesting disease when a variant is identified as an SF. Importantly, the penetrance of variants found in individuals with CM and relatives in a clinical setting is increased compared to the penetrance of variants estimated for those identified through SFs (e.g., MYBPC3 c.1504C>T [p.Arg502Trp] [GenBank: NM_000256.3] with estimated penetrance of 50% in individuals with HCM and 6% here in the population).
While published data are sparse and heterogeneous, overall estimates of penetrance by adulthood in the general population are lower than those from family-based studies. We used unpublished data to assess penetrance in asymptomatic individuals with variants referred to hospital for predictive testing after identification of a genotype- and CM-positive relative. For HCM, 17 of 65 individuals with variants (26.2%) were diagnosed with HCM (ten on first clinical evaluation, seven during 2 years of follow up). For DCM, two of 22 individuals with variants (9.1%) were diagnosed with DCM (two on first clinical evaluation, 0 during 2 years of follow up [excluding five with hypokinetic non-dilated cardiomyopathy and four with isolated left ventricular dilatation]). Additionally, a study of individuals with variants identified during family screening who did not fulfill diagnostic criteria for HCM at first evaluation identified HCM or an abnormal ECG in 127 of 285 individuals with variants (44.6%; 82 at baseline, 45 over a median of 8 years follow-up). 25 First degree relatives in the same household may be at increased risk of disease due to shared environment and other genetic factors.

Box 1
The variant MYBPC3 c.1504C>T (p.Arg502Trp) (GenBank: NM_000256.3) was found in our cohort 159 times in individuals referred for HCM genetic panel sequencing (3.7% of total observations; 1.5% total case frequency). To date, the variant has been classified on ClinVar 15 times as pathogenic (ClinVar ID 42540). Penetrance has been previously estimated as ~50% (increased relative risk of 340) by 45 years old in a clinical setting, and major adverse clinical events in heterozygotes are significantly more likely when another sarcomeric variant is present. 52 In our case cohort, heterozygotes of this variant were reported as broadly European ancestry (Oxford, n = 59; London, n = 11; Belfast, n = 30; LMM, n = 45; GDX, n = 14). In gnomAD, the variant was identified ten times, of which seven heterozygotes were non-Finnish Northwestern Europeans (NWE; plus one African; one South Asian, and one other), and in the UK Biobank, the variant was found 77 times, of which 68 heterozygotes were NWE (plus eight other Europeans and one other). The population frequency of the variant in Ensembl population genetics showed that the variant (rs375882485) is only found multiple times in NWE ancestry sub-cohorts. Thus, the variant is most common in NWE populations: the UK, Ireland, Belgium, the Netherlands, Luxembourg, Northern France, Germany, Denmark, Norway, Sweden, and Iceland.
We use this relatively common variant to highlight the effect of ancestry on estimated variant penetrance (see related figure in this text box): we estimated the penetrance as 6.4% (4.6%-9.0%) with the UK Biobank cohort (93% European) and this is inflated to 35.1% (18.2%-67.5%) when we estimated the penetrance with the gnomAD dataset (45% European) as a result of the difference in the proportion of individuals with NWE ancestry. In individuals of NWE ancestry only, the penetrance of this variant is 6.4% (4.6%-9.0%). Penetrance estimates from the NWE subsets of gnomAD and UKBB do not differ significantly.
As access to larger genomic datasets becomes available, including more diverse ancestries, we can increase the precision of these variant-specific penetrance estimates by gaining further confidence in maximum population allele frequencies. 53

Penetrance estimates are inflated with underestimated population frequency
(A) The map of the world emphasizes the large proportion of observations of MYBPC3 c.1504C>T (p.Arg502Trp) in HCM-affected individuals of Northwestern European (NWE) ancestry. The numbers on the map are the counts of rare-variant-genotype-positive observations (n ≈ cohort participants) from each cohort with the specified ancestry, and the percentages derive the proportion of observations that are due to the MYBPC3 c.1504C>T (p.Arg502Trp) variant. (B) The graph shows the estimated penetrance and 95% confidence interval for the variant on the basis of subgroups of reference dataset participants included. The penetrance is inflated when estimated with gnomAD because the variant is most common in participants with NWE ancestry (which dominates the UKBB dataset). Population frequency of gnomAD, UK Biobank, and Ensembl population genetics showed that this variant (rs375882485) is only found multiple times in NWE ancestry sub-cohorts. The map excludes Antarctica for figure clarity. A limitation is the low sample sizes for AFR, SAS, and EAS ancestries. 26
The ACMG guidelines for reporting "medically actionable" variants in 78 genes come with the caution that evaluating SFs requires an increased amount of supportive evidence of pathogenicity given the low prior likelihood that variants unrelated to the indication are pathogenic. 55 Here, we show that variants with a definitive pathogenic assertion in ClinVar had the highest penetrance estimates. This may be because penetrant variants are more likely to yield sufficient evidence for confident interpretations, especially family segregation data.
Genetic laboratories communicate their confidence on whether a variant has a role in disease (i.e., pathogenicity) but do not consistently indicate the penetrance. Pathogenicity addresses whether a variant explains the etiology of disease in an affected individual. In comparison, penetrance addresses the probability of future disease in individuals with variants. The ClinGen consortium Low-Penetrance/Risk Allele Working Group recommends providing penetrance estimates on clinical reports (aggregate gene-level or individual variants) and noting when penetrance is assumed or where current information is limited or unavailable.
Individually rare TTNtvs are collectively common in the general population (~1 in 250 for variants in exons constitutively expressed in the adult heart; likely due to the size of TTN and only moderate constraint [loss-of-function observed/expected upper bound fraction (LOEUF) of 0.35 in gnomAD]), and we show that the penetrance in aggregate of TTNtvs is reduced compared to predicted loss-of-function variants in other CM-associated, haploinsufficient genes. While recent work has increased our understanding of the functional mechanisms of TTNtvs in disease, 56,57 future work is required to identify modifiers of TTNtvs to understand this reduced penetrance in the population.
The penetrance of a variant may depend on characteristics of the variant itself and modulating effects of genetic background and environment. This study characterizes individual variants, while ongoing work is dissecting the role of secondary genetic influences. Polygenic scores may identify individuals at particular risk of disease, modifying the estimated penetrance of a single dominant variant.
We present two dimensions to estimates of penetrance: the penetrance in the general population and variant-specific penetrance. As described, the results of this method are concordant with previous population estimates of aggregate penetrance in the UKBB population derived with independent approaches, providing confidence in the methods. In addition, we provide updated estimates for the population prevalence of HCM and DCM and stratify by sex. The addition of future, publicly available, large-scale, global population datasets and biobanks will aid this area of research by allowing for increased confidence in ancestry-specific population allele frequencies and CM prevalence. We provide the summary counts for each variant via an online browser and the function to estimate penetrance in R for transferability and use in other diseases and datasets.

Limitations
This study has several limitations. The method cannot quantify the penetrance of pathogenic variants that are absent or singleton in the population, although in aggregate the penetrance of this group of variants is significant.
Comparisons of case and control allele frequency are vulnerable to confounding by population stratification, and we have explored some examples in this manuscript. We do not have genome-wide variation data to directly assess genetic ancestry for the case cohort, so this is based on data reported by the referring clinician. As the EUR participants dominate our case and population datasets, greater representation of diverse ancestral backgrounds is essential for equitable access to genomic medicine. Estimates of the penetrance of variants and the prevalence of cardiomyopathies in more ancestral groups are required. The current data for both come from UKBB, which has limitations. 54 In the absence of genome-wide data, we cannot exclude the possibility of unrecognized or cryptic relatedness within the case cohort. As described by Minikel et al., 26 when a variant is highly penetrant, cryptically related individuals are likely included in case series and, if a disease is fatal, population cohorts are likely depleted of causal variants.
Case allele frequency in unrelated affected individuals may not be a fair estimate of the case allele frequency in all cases observed in the clinic. Our estimate of case allele frequency, and therefore of penetrance, is influenced by genetic testing referral practice. If clinicians are cautious and only refer selected high confidence affected individuals for testing, case allele frequency and estimated penetrance will be high, whereas if clinicians were to test widely and indiscriminately, then our apparent case allele frequency would be lower, resulting in lower penetrance estimates. 18 Current diagnostic data assume that the testing center obtained complete coverage of the gene. Limited data were available on age and sex for large portions of the case cohorts. Our DCM-referred cohort was only moderate in size, and thus increases in sample size here through global collaboration would aid our estimates of penetrance for variants in DCM-associated genes. We have estimated penetrance for rare variants that are reported by diagnostic laboratories and have not estimated penetrance for more common variants of smaller effect that may contribute to risk in combination.
Finally, the UKBB volunteer population cohort is healthier than the average individual, 54 and the gnomAD consortium includes some individuals with severe disease but likely at a frequency equivalent to or lower than the general population. 31 The proposed penetrance model is an approximation since in reality the three parameters used on the right-hand side of the penetrance equation share some degree of dependence.

Conclusion
We present an evaluation of the penetrance of individual rare variants in CM-associated genes at scale. These recurrent variants are those that are likely to generate SFs. Variants previously annotated as pathogenic, loss-of-function variants in specific genes susceptible to haploinsufficiency, and those that are the rarest in the population, have high penetrance, similar to observations from family studies. This initial attempt at estimating the penetrance of rare variants has highlighted the requirement for large case and population datasets with known genetic ancestry. We are now able to start putting bounds on the estimate of penetrance for a specific variant identified as a secondary finding: for some, including those expected to be most penetrant, we do not currently have enough data; for others, we can provide asymptomatic individuals with variants with an estimated probability of manifesting disease.
Table S11 Penetrance estimates for 59 rare variants in DCM-associated genes.
Table S12 Estimated penetrance of eleven variants more common in non-EUR ancestry.

Table of figures
Figure S1 Meta-analysis of population prevalence estimated for DCM in literature.
Figure S4 Assessment of nine methods to estimate the 95% confidence interval of penetrance.
Figure S5 A fully Bayesian approach is not suitable for estimating penetrance.
Figure S6 With 10,000 cases, increasing population participants aids penetrance estimates.
Figure S7 Negligible gains in confidence will be provided by increasing case sample size, while substantial gains will be observed by incorporation of future large-scale population datasets.
Figure S8 Simulation of the gain in confidence of the penetrance estimate with increasing sample size.
Figure S9 As the probability of the allele increases, precision increases, and the estimate of penetrance decreases.
Figure S10 As the probability of the allele given disease increases, penetrance increases, and the precision of the estimate of penetrance has less confidence.
Figure S11 Simulations of the expected penetrance estimates in the range of the probability of the allele and the probability of the allele given disease, observed in this study.
Figure S12 Aggregated penetrance of loss of function variants is highest.
Figure S13 Variants with significantly decreased penetrance in females compared to males from Group 2.
Figure S14 Aggregate penetrance of variants in CM-associated genes grouped by rarity and consequence.

Simulations
The simulations showed that penetrance estimates for highly penetrant variants (e.g., >50% penetrance) have large confidence intervals. However, if a variant has at least 10% population penetrance (via the lower bound of the confidence interval), it is unlikely that the carrier will be released from future clinical follow up. For variants with a more modest estimated penetrance (e.g., <50%), we show that we are now able to estimate penetrance more confidently for variants likely to be identified as secondary findings.
The rate of change of the "error" to the limit confirmed that the gain in confidence from increasing case samples is negligible (the plot plateaus) but increases in future population participants would provide a substantial gain in confidence surrounding the penetrance estimates.
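The contrast between adding cases and adding population participants can be sketched numerically. The following is a minimal illustration, not the simulation code used for the figures: it assumes a first-order log-scale approximation in which the three proportions are independent binomials, so that the variance of each log-proportion is roughly 1/x - 1/n. The function names are hypothetical, and the inputs are the HCM prevalence counts (97/52,660) and the ~10% penetrance example from the Figure S7 legend (popAF = 0.000013, caseAF = 0.0008).

```python
import math


def log_scale_se(x_d, n_d, case_af, n_case, pop_af, n_pop):
    """First-order standard error of log(penetrance).

    Assumes three independent binomial proportions, each contributing
    Var(log p_hat) ~ 1/x - 1/n (an approximation, not the full method).
    """
    x_case = case_af * n_case  # expected allele count in cases
    x_pop = pop_af * n_pop     # expected allele count in the population
    variance = (
        (1 / x_d - 1 / n_d)
        + (1 / x_case - 1 / n_case)
        + (1 / x_pop - 1 / n_pop)
    )
    return math.sqrt(variance)


def fold_width(se, z=1.96):
    """Ratio of the upper to the lower 95% confidence limit."""
    return math.exp(2 * z * se)


X_D, N_D = 97, 52_660                # HCM prevalence counts
CASE_AF, POP_AF = 0.0008, 0.000013   # ~10% penetrance example

baseline = fold_width(log_scale_se(X_D, N_D, CASE_AF, 10_000, POP_AF, 300_000))
ten_times_cases = fold_width(log_scale_se(X_D, N_D, CASE_AF, 100_000, POP_AF, 300_000))
ten_times_population = fold_width(log_scale_se(X_D, N_D, CASE_AF, 10_000, POP_AF, 3_000_000))

for label, value in [
    ("baseline", baseline),
    ("10x cases", ten_times_cases),
    ("10x population", ten_times_population),
]:
    print(f"{label:>15}: upper/lower limit ratio = {value:.1f}")
```

Under these assumptions the population term dominates the variance because the variant is observed only a handful of times in the reference cohort, which is why enlarging the population dataset narrows the interval more than enlarging the case cohort.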
We assessed the size of the confidence interval when varying population allele frequency and case allele frequency. As described by the penetrance equation through the ratio $P(A|D)/P(A)$ and observed in the simulations, the rarer the variant is in the population (e.g., observed twice in 300,000 participants) and the more common the variant is in the case cohort, the larger the confidence interval. The penetrance equation widens the confidence interval in such cases, when the penetrance is high, because of the imbalance in allele frequency between the smaller case cohort and the very large population cohort. In addition, through assessment of simulations within the allele frequency ranges of the variants observed in this study, variants with a very high penetrance can have an estimated penetrance of >100%. While theoretically this could be the case, we did not observe any real variants in our dataset that had a combination of case and population allele frequencies that resulted in an estimated penetrance of >100% (maximum penetrance was 66.8% for HCM, 78.6% for DCM). Such variants are unlikely to be observed several times in the population reference cohort.

Supplementary figures

Figure S1
Meta-analysis of population prevalence estimated for DCM in literature.
(Left panel) Forest plot depicting the prevalence and associated binomial confidence interval for each literature reference. (Right panel, zoom) The same forest plot with the x-axis shortened to between 0 and 0.008. Coding system, prevalence estimates that were derived using large population datasets with International Classification of Diseases (ICD) or other coding systems and have decreased prevalence estimates; Imaging, prevalence estimates that were derived using imaging data such as cardiac MRI or echocardiography and provide estimates that better reflect the true DCM prevalence; Selection bias, patients referred for imaging measures based on previous symptoms and have increased prevalence, or, participants are active, selected for being young or athletic and have decreased prevalence, or, participants are elderly and the prevalence estimate is substantially increased.

Figure S2
Meta-analysis of population prevalence estimated for HCM in literature.
(Left panel) Forest plot depicting the prevalence and associated binomial confidence interval for each literature reference. (Right panel, zoom) The same forest plot with the x-axis shortened to between 0 and 0.005. Coding system, prevalence estimates that were derived using large population datasets with International Classification of Diseases (ICD) or other coding systems and have decreased prevalence estimates; Imaging, prevalence estimates that were derived using imaging data such as cardiac MRI or echocardiography and provide estimates that better reflect the true HCM prevalence; Selection bias, patients referred for imaging measures based on previous symptoms and have increased prevalence, or, participants are active, selected for being young or athletic and have decreased prevalence, or, participants are elderly and the prevalence estimate is substantially increased. References 12-15, 20, 23-37, 78, 79. Four studies were included that had assessed for the prevalence of HCM using imaging for population screening. The heterogeneity indexes are not significant (P>0.05).

[Forest plot: common effect model and random effects model. Heterogeneity: I² = 0%, τ² = 0, p = 0.87]

Figure S4
Assessment of nine methods to estimate the 95% confidence interval of penetrance.

Figure S5
A fully Bayesian approach is not suitable for estimating penetrance.
A) Based on real data parameter specifications, the Beta distribution of the prevalence $D$ and the posterior Beta distribution of the penetrance $D|A$ have marginal overlap. B) The known distribution of $A$ (Beta-Binomial), once $D$ and $A|D$ are specified (Beta and Binomial densities, respectively), and the Binomial distribution of $A$ (independent from $D$ and $A|D$) are very different.

Figure S6
With 10,000 cases, increasing population participants aids penetrance estimates.
Efforts to increase reference population sample size will provide more confidence (i.e., narrower confidence intervals) than further case aggregation once 10,000 cases are reached (with the caveat that more variants will be identified). The graph shows the results of a simulation of a variant with 10% estimated penetrance and a variant with 55% estimated penetrance. The x-axis (log scale) varies population reference cohort size, and the legend varies case cohort size. Black line, 100% penetrance; pink line, penetrance estimate. Panel label: population AF, 5.3 × 10⁻⁵; penetrance, 55%.

Figure S7
Negligible gains in confidence will be provided by increasing case sample size, while substantial gains will be observed by incorporation of future large-scale population datasets.
Example variants had a penetrance of ~10%, ~20%, ~55%, and ~75% (popAF=0.000013, caseAF=0.0008, 0.0016, 0.004, 0.0056, respectively). The penetrance estimate is shown as a black line; the UCI is coloured above the penetrance estimate and the LCI below. The grey horizontal line depicts a penetrance of 1.0 (100%) for assessment of the UCI. The grey vertical line denotes the sample size used in this study. The sizes of population reference cohorts are depicted as coloured points. The x-axis describes case cohort samples, and the legend describes the number of gnomAD and UKB participants.

The maximum $P(A|D)$ observed was 0.008, the median was 0.0003, and the minimum was 0.0001 for the variants included in this study. Y-axis, estimates of penetrance; X-axis, four tested. The coloured points represent population reference sample size. The UCI is above the estimate of penetrance (black points) and the LCI below. The grey horizontal lines depict an estimated penetrance of 1.0 (100%). While the simulations show that, theoretically, variants with a very high penetrance can have an estimated penetrance of >100%, we did not observe any real variants in our dataset that had a combination of case and population allele frequencies that resulted in an estimated penetrance of >100% (maximum penetrance was 66.8% for HCM, 78.6% for DCM).

The plot depicts estimated penetrance of rare variants in HCM-associated (left) and DCM-associated (right) genes. LoF and non-LoF variant groups are plotted in green and blue, respectively. LoF, predicted loss of function variants; *, TTNtv that are PSI>90%. This plot provides additional stratification for missense variants predicted as deleterious (using REVEL).

Figure S13
Variants with significantly decreased penetrance in females compared to males from Group 2.
The plot depicts the sex-specific estimates of penetrance of seven rare variants in HCM-associated genes with decreased penetrance in females. The variants are more common in females in our data. The variants on the right side of the plot were variants observed in male cases but not in males of the population reference datasets. Overlapping confidence intervals were observed for the sex-specific penetrance estimates of all other variants.

Variant-specific estimates of penetrance for the 316 recurrently observed rare variants in CM-associated genes from group 2. The variants depicted (HCM n=257 (top), DCM n=59 (bottom)) were identified multiple times in cases and population reference datasets and penetrance could therefore be estimated. The x-axis denotes the number of times the variant was observed in each case cohort. AC, allele count; B/LB, benign/likely benign; VUS, variant of uncertain significance; LP, likely pathogenic; P, pathogenic.

See excel file.

Table S3
Selection of the Agresti-Coull method and comparison of binomial proportion methods for deriving parameters from the meta-analysis results.
$\hat{x}_D$ and $\hat{n}_D$ were derived from the meta-analysis results $\hat{p}_D$, $\mathrm{LL}(\hat{p}_D)$ and $\mathrm{UL}(\hat{p}_D)$. $\hat{\hat{p}}_D = \hat{x}_D/\hat{n}_D$; LL and UL are the estimated lower and upper limits of the 95% confidence interval obtained by each method, using the corresponding estimated values of $\hat{x}_D$ and $\hat{n}_D$. The signed relative error (%) associated with each method is defined as the meta-analysis value minus the corresponding estimated value, divided by the meta-analysis value.

Overview of the estimation of penetrance and its confidence interval
In this study, we adapted the estimate of penetrance from Minikel et al. (2016) 1 :
$$P(D|A) = \frac{P(D)\,P(A|D)}{P(A)}, \quad \text{(Eq. S1)}$$
where $P(D|A)$ is the penetrance of the variant (by adulthood), i.e., the probability of disease given a risk allele; $P(D)$ is the prevalence of the disease, i.e., the baseline risk in the general population; $P(A|D)$ is the frequency of individuals with the disease who have the allele, i.e., the allele frequency in cases; and $P(A)$ is the frequency of the allele in the general population, i.e., the population allele frequency.
An alternative approach would be to estimate penetrance via a likelihood ratio test, i.e., the probability of disease given a risk allele divided by a positive test. However, this requires identified healthy controls rather than population cohorts, and treating population participants as healthy controls is erroneous when their cardiac status is unknown.
In the following, we indicate with $D|A$, $D$, $A|D$ and $A$ the random variables (r.v.s) for the penetrance of the variant, the prevalence of the disease, the allele frequency as a proportion in cases, and the allele frequency as a proportion in the general population, respectively. We indicate with $p_{D|A} \equiv P(D|A)$, $p_D \equiv P(D)$, $p_{A|D} \equiv P(A|D)$ and $p_A \equiv P(A)$ the probability of the corresponding events. Finally, we specify with $f(D|A)$, $f(D)$, $f(A|D)$ and $f(A)$ the distribution of the corresponding (discrete or continuous) r.v.s.
To estimate the confidence interval surrounding the estimate of penetrance, we assessed several methods:
• Minikel et al. (2016) 1 used binomial confidence intervals to estimate the uncertainty regarding the penetrance. The authors estimated the binomial proportion $(1-\alpha)\%$ confidence interval for $p_{A|D}$ and independently for $p_A$, divided separately the lower limits (LL) and the upper limits (UL) of the confidence intervals, and multiplied them by the estimated $p_D$. In this framework, the penetrance confidence interval could be outside the interval [0,1] ("overshooting" 2 ) and was therefore truncated to the interval [0,1].
• We considered using the above estimate of uncertainty and tested other methods proposed in literature for the confidence interval of binomial proportions (e.g., simple asymptotic or Wald method, Wilson score method, etc.; see for instance 3,4 and references therein) and adjusted the nominal level of significance such that the coverage probability aligns with the $(1-\alpha)\%$ nominal level 5 .
• We also wanted to fully estimate the uncertainty surrounding the penetrance estimate. To do this, we aimed to undertake a fully Bayesian approach to estimate the confidence interval for penetrance, including an estimate of uncertainty regarding the prevalence of cardiomyopathy described in the literature. In our framework, this was not possible. When a joint beta-binomial model is specified for $f(D|A) \propto f(A|D)\,f(D)$, where $A|D$ follows a binomial distribution with a probability of success that depends on $D$, and $D$ is distributed as a beta density, the marginal distribution $f(A)$ is given 6 . Thus, a Bayesian approach cannot be used to quantify the uncertainty of penetrance. In our cross-sectional approach, $A$ is assumed independent from $A|D$ and follows a binomial distribution, whereas from a Bayesian perspective, $f(A)$ is derived by marginalizing $D$ out from the joint distribution $f(A, D)$, i.e., $f(A) = \int f(A|D)\,f(D)\,dD$. For comparison, we plotted (see Figure S5) the beta-binomial distribution derived from the marginalization of the joint distribution against the corresponding binomial distribution assuming $A$ and $A|D$ are independent.
• We also tested a Monte Carlo approach to overcome the problem of the fully Bayesian formulation by using an inverse logit transformation of a normal distribution as the prior density for $D$, while retaining the above specification for $f(A|D)$ (binomial distribution with a probability of success that depends on $D$) and $f(A)$ (binomial distribution), and sampled independent realisations from $f(D, A|D)$ and $f(A)$ to derive the $(1-\alpha)\%$ Monte Carlo confidence interval for penetrance. To avoid overshooting, each realisation of the Monte Carlo simulation was checked and, if necessary, truncated to the interval [0,1].
• Our final approach, and the approach used here, was to assume the independence of the r.v.s $D$, $A|D$, and $A$ to derive the $(1-\alpha)\%$ confidence interval for penetrance as the product and ratio of binomial proportions. Our method of choice used the specialised version of the Central Limit Theorem, the Delta method 7 , on the log-transformed random variable $\log(D|A) = \log(D) + \log(A|D) - \log(A)$ with an improved mean approximation and adjustment for degeneracy 3 . The parameterisation of the binomial distribution $f(D)$ was derived from a meta-analysis of literature-based estimates of the prevalence of HCM, while the UK Biobank CMR-derived estimate of $p_D$ was used in the penetrance equation for DCM, where few published studies were available for inclusion in the meta-analysis.
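The difference between a beta-binomial marginal and an independent binomial, which underlies the Figure S5 comparison, can be illustrated with a small numerical sketch. This is not the authors' analysis: the trial number and Beta parameters below are hypothetical values chosen only to make the contrast visible, and the helper names are assumptions.

```python
import math


def log_beta(a, b):
    """Logarithm of the Beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def log_choose(n, k):
    """Logarithm of the binomial coefficient C(n, k)."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)


def binomial_pmf(k, n, p):
    """Binomial probability mass with fixed success probability p."""
    return math.exp(log_choose(n, k) + k * math.log(p) + (n - k) * math.log(1 - p))


def beta_binomial_pmf(k, n, a, b):
    """Marginal mass when the success probability follows Beta(a, b)."""
    return math.exp(log_choose(n, k) + log_beta(k + a, n - k + b) - log_beta(a, b))


def variance(pmf_values):
    """Variance of a distribution on 0..n given as a list of masses."""
    mean = sum(k * w for k, w in enumerate(pmf_values))
    return sum((k - mean) ** 2 * w for k, w in enumerate(pmf_values))


# Hypothetical parameters, for illustration only
N_TRIALS, A, B = 200, 2.0, 38.0
P_MEAN = A / (A + B)  # binomial success probability matched to the Beta mean

binom = [binomial_pmf(k, N_TRIALS, P_MEAN) for k in range(N_TRIALS + 1)]
betabin = [beta_binomial_pmf(k, N_TRIALS, A, B) for k in range(N_TRIALS + 1)]

print(f"binomial variance:      {variance(binom):.2f}")
print(f"beta-binomial variance: {variance(betabin):.2f}")
```

With the same mean, the beta-binomial is considerably more dispersed than the binomial, so the two specifications of the allele-count distribution are not interchangeable.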

Estimation of penetrance
Following Minikel et al. 1 , penetrance is defined as the probability of developing disease given a risk allele, $p_{D|A}$, and can be estimated by Bayes' rule (Eq. S1). Three parameters were used to define penetrance by adulthood: i) the prevalence $p_D$ of the disease, i.e., the baseline lifetime risk in the general population, ii) the proportion $p_{A|D}$ of individuals with the disease who have the allele, i.e., the allele frequency in cases, and iii) the frequency $p_A$ of the allele in the general population, i.e., the population allele frequency. The allele frequency is used in $p_{A|D}$ and $p_A$, and it is estimated as the probability of success in a binomial experiment by using the allele count $x$, i.e., the binomial number of successes, and the allele number $n$, i.e., the binomial number of trials. We estimate the penetrance of an allele under a dominant genetic model 8 as
$$\hat{p}_{D|A} = \frac{x_D}{n_D} \cdot \frac{x_{A|D}/n_{A|D}}{x_A/n_A} \quad \text{(Eq. S2)}$$
The penetrance $p_{D|A}$ of an allele is estimated using three parameters: $p_D$, the fixed probability of disease calculated by meta-analysis of reported prevalence of disease from literature (with $x_D$, allele count, and $n_D$, the allele measure, both estimated, see below); $p_{A|D}$, the probability of the allele given disease, estimated from allele frequency in cases (with $x_{A|D}$ and $n_{A|D}$ observed); and $p_A$, the probability of the allele, estimated from the allele frequency in population cohorts (with $x_A$ and $n_A$ observed).
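As a worked illustration of the point estimate, the sketch below applies Bayes' rule to count data. The function name and argument names are hypothetical, and the inputs are the example variant parameters quoted in the confidence interval section of this supplement (97/52,660 for prevalence, 10/20,000 in cases, 3/600,000 in the population); no degeneracy adjustment is applied here.

```python
def penetrance_point_estimate(x_d, n_d, x_case, n_case, x_pop, n_pop):
    """Plug-in estimate of P(D|A) = P(D) * P(A|D) / P(A) from counts.

    x_d / n_d       : prevalence counts (cases / population size)
    x_case / n_case : allele count / allele number in the case cohort
    x_pop / n_pop   : allele count / allele number in the population cohort
    """
    if min(n_d, n_case, n_pop) <= 0:
        raise ValueError("denominators must be positive")
    if min(x_d, x_case, x_pop) < 0:
        raise ValueError("counts must be non-negative")
    if x_pop == 0:
        # The ratio is undefined when the allele is absent from the population
        raise ValueError("population allele count is zero; estimate undefined")
    p_d = x_d / n_d
    p_a_given_d = x_case / n_case
    p_a = x_pop / n_pop
    return p_d * p_a_given_d / p_a


estimate = penetrance_point_estimate(97, 52_660, 10, 20_000, 3, 600_000)
print(f"estimated penetrance: {estimate:.1%}")
```

For this example the allele is 100 times more frequent in cases than in the population, so the estimate is 100 times the prevalence, roughly 18%. The explicit error for a zero population count mirrors the limitation that penetrance cannot be quantified for variants absent from the reference population.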

Probability of the disease: cardiomyopathy prevalence estimates
The prevalence of cardiomyopathy has been previously estimated and reported as the simplified ratio of 1 in 500 for hypertrophic cardiomyopathy (HCM) and 1 in 250 for dilated cardiomyopathy (DCM) 9 . To quantify the uncertainty in current knowledge of the prevalence of cardiomyopathy, a literature review was undertaken to identify population-based prevalence estimates of cardiomyopathy (Table S1).
We have previously found the use of cardiac imaging to have higher sensitivity in estimating cardiomyopathy prevalence than ICD codes 20 . Only one article used imaging in identifying DCM prevalence. We therefore assessed the prevalence of clinical DCM (LVEDV > 232 ml in males and > 175 ml in females, plus LVEF < 50%, in the absence of a record of CAD or HCM) in the imaging tranche of the UK Biobank 21 . These criteria were adapted from Mestroni et al. with UK Biobank imaging reference ranges 21 . 177 DCM cases were identified from cardiac imaging of 39,003 participants ($p_D$ = 0.45% (binomial 95% CI = 0.39%-0.53%) or 1 in 220) 22 . As a meta-analysis cannot be undertaken with only two cohorts, we were restricted to using the UK Biobank estimate only, which is similar to the expected DCM prevalence of 1 in 250 9 .
From the meta-analysis estimate of $p_D$ and its confidence interval, we derived the values for $x_D$ and $n_D$ (solving for two unknown values in two equations, one describing the estimation of $p_D$ and the other, its confidence interval). However, since several ways to estimate the confidence interval for binomial proportions have been proposed in literature 3,4 , different values of $x_D$ and $n_D$ can be obtained. We assessed four popular methods: the Wald method, based on a simple asymptotic normal approximation (Eq. S4); the Arcsine method, based on the Delta method for variance stabilization using the $\sin^{-1}\sqrt{p}$ transformation (Eq. S5); the Agresti-Coull method 40 , which relies on the asymptotic normal approximation centred at $(\hat{p}_D n_D + z_{1-\alpha/2}^2/2)/(n_D + z_{1-\alpha/2}^2)$, where $\hat{p}_D$ is the meta-analysis estimate of $p_D$ and $z_{1-\alpha/2}$ is the $1-\alpha/2$ quantile of the standard normal distribution (Eq. S6); and the Clopper-Pearson method, an exact method for the confidence interval of a binomial proportion (Eq. S7).
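To make the differences between the interval methods concrete, the sketch below implements the textbook forms of three of them (Wald, arcsine, Agresti-Coull) and applies them to the UK Biobank DCM counts (177 of 39,003). The function names are hypothetical, the formulas are the standard ones rather than a transcription of Eq. S4 to S6, and the exact Clopper-Pearson interval is omitted because it needs the inverse Beta distribution function, which the Python standard library does not provide.

```python
import math
from statistics import NormalDist


def z_quantile(alpha):
    """The 1 - alpha/2 quantile of the standard normal distribution."""
    return NormalDist().inv_cdf(1 - alpha / 2)


def wald_ci(x, n, alpha=0.05):
    """Simple asymptotic normal interval around x/n."""
    p = x / n
    half_width = z_quantile(alpha) * math.sqrt(p * (1 - p) / n)
    return max(p - half_width, 0.0), min(p + half_width, 1.0)


def arcsine_ci(x, n, alpha=0.05):
    """Interval on the variance-stabilised asin(sqrt(p)) scale."""
    centre = math.asin(math.sqrt(x / n))
    half_width = z_quantile(alpha) / (2 * math.sqrt(n))
    lower = math.sin(max(centre - half_width, 0.0)) ** 2
    upper = math.sin(min(centre + half_width, math.pi / 2)) ** 2
    return lower, upper


def agresti_coull_ci(x, n, alpha=0.05):
    """Normal interval centred at the adjusted proportion."""
    z = z_quantile(alpha)
    n_adj = n + z * z
    p_adj = (x + z * z / 2) / n_adj
    half_width = z * math.sqrt(p_adj * (1 - p_adj) / n_adj)
    return max(p_adj - half_width, 0.0), min(p_adj + half_width, 1.0)


X_DCM, N_DCM = 177, 39_003  # UK Biobank imaging-derived DCM counts
intervals = {
    "Wald": wald_ci(X_DCM, N_DCM),
    "Arcsine": arcsine_ci(X_DCM, N_DCM),
    "Agresti-Coull": agresti_coull_ci(X_DCM, N_DCM),
}
for name, (lower, upper) in intervals.items():
    print(f"{name:>14}: {lower:.4%} to {upper:.4%}")
```

With counts this large the three intervals are close to one another, but they are not identical, which is why the choice of method changes the back-calculated counts.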
The method chosen was the one that minimizes the Euclidean distance between the meta-analysis estimates $(\hat{p}_D, \mathrm{LL}(\hat{p}_D), \mathrm{UL}(\hat{p}_D))$, where LL and UL are the lower and upper limits of the $(1-\alpha)\%$ confidence interval for the prevalence of the disease, and $(\hat{\hat{p}}_D, \mathrm{LL}(\hat{\hat{p}}_D), \mathrm{UL}(\hat{\hat{p}}_D))$, i.e., the same quantities estimated by each method after the corresponding estimates of $x_D$ and $n_D$ are obtained. At $\alpha = 0.05$, given the results of the meta-analysis, the Agresti-Coull method performed the best, with the lowest L2 norm and with low relative errors, defined as the relative difference between the meta-analysis values and their estimated values calculated by each method (Table S3). See List S1 for details. This derived $x_D = 97$ and $n_D = 52{,}660$ for HCM.
List S1 Selection of the Agresti-Coull method and other methods assessed to estimate the number of cases and the population size for the disease prevalence.
For each method considered, the estimated values of $x_D$ and $n_D$ are derived as shown below. The best method was selected by assessment of the Euclidean distance between $(\hat{p}_D, \mathrm{LL}(\hat{p}_D), \mathrm{UL}(\hat{p}_D))$, the estimated value and the lower and upper limits of the 95% confidence interval of the prevalence obtained from the meta-analysis, and $(\hat{\hat{p}}_D, \mathrm{LL}(\hat{\hat{p}}_D), \mathrm{UL}(\hat{\hat{p}}_D))$ obtained by each method, using the corresponding estimated values of $\hat{x}_D$ and $\hat{n}_D$.

In Eq. S7, LL and UL are the theoretical exact lower and upper limits of the 95% confidence interval and $\mathrm{Beta}^{-1}(\cdot)$ is the inverse of the cumulative distribution function of the Beta density. The solution for $x_D$ and $n_D$, with $x_D \le n_D$, is obtained numerically as the values that minimise the Euclidean distance between (LL, UL) defined in Eq. S7 and $(\mathrm{LL}(\hat{p}_D), \mathrm{UL}(\hat{p}_D))$, the lower and upper limits of the $1-\alpha$ confidence interval of the prevalence obtained from the meta-analysis, respectively. To reduce the computational cost of the exhaustive search, we also assume $\hat{x}_D < \lceil \hat{n}_D/2 \rceil$ as in the Agresti-Coull method.
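The back-calculation of counts from an estimate and its interval can be sketched as an exhaustive search. The example below is an illustration under stated assumptions rather than the procedure in List S1: it uses the textbook Agresti-Coull interval, sets the count to the rounded product of the estimate and each candidate population size, and searches a hypothetical range of population sizes. Because the meta-analysis interval is not reproduced in this section, the target is a synthetic one generated from the reported counts (97 of 52,660), so the search is only checked for self-consistency.

```python
import math
from statistics import NormalDist

Z_95 = NormalDist().inv_cdf(0.975)


def agresti_coull(x, n, z=Z_95):
    """Textbook Agresti-Coull interval for x successes in n trials."""
    n_adj = n + z * z
    p_adj = (x + z * z / 2) / n_adj
    half_width = z * math.sqrt(p_adj * (1 - p_adj) / n_adj)
    return max(p_adj - half_width, 0.0), min(p_adj + half_width, 1.0)


def recover_counts(p_hat, lower, upper, n_values):
    """Find (x, n) whose estimate and interval lie closest to the target.

    Assumption for this sketch: x is taken as round(p_hat * n), and closeness
    is the Euclidean distance between (estimate, lower, upper) triples.
    """
    best = None
    for n in n_values:
        x = round(p_hat * n)
        if x < 1:
            continue
        lo, hi = agresti_coull(x, n)
        distance = math.dist((p_hat, lower, upper), (x / n, lo, hi))
        if best is None or distance < best[0]:
            best = (distance, x, n)
    return best[1], best[2]


# Synthetic target built from the reported HCM counts (round-trip check only)
P_HAT = 97 / 52_660
TARGET_LOWER, TARGET_UPPER = agresti_coull(97, 52_660)

recovered = recover_counts(P_HAT, TARGET_LOWER, TARGET_UPPER, range(1_000, 100_001))
print(f"recovered counts: x = {recovered[0]}, n = {recovered[1]}")
```

Repeating the search with a different interval function in place of `agresti_coull` would give different counts for the same target, which is the reason the methods are compared by their distance to the meta-analysis values.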

Probability of the allele given disease: allele frequency in the case cohort
The allele frequency of variants in the case cohort was used in the penetrance calculation for $p_{A|D}$. The allele count and allele number were used for $x_{A|D}$ and $n_{A|D}$, respectively. See Section 4.10 for further information.

Probability of the allele: allele frequency in the population reference datasets
The allele frequency of variants in the combined population cohort of UK Biobank and gnomAD was used in the penetrance calculation for $p_A$. The allele count and allele number were used for $x_A$ and $n_A$, respectively. It is assumed that the population datasets include individuals who will later die of cardiac disease, thus enabling direct use of the gnomAD and UK Biobank allele frequencies combined as $p_A$. See Section 4.9 for further information.

Confidence intervals
Since it is not possible to undertake a fully Bayesian analysis to estimate the confidence interval for penetrance, we used a different approach: the specialised version of the Central Limit Theorem, the Delta method 7 , on the log-transformed random variable $\log(D|A) = \log(D) + \log(A|D) - \log(A)$ (Eq. S1), assuming independence between the binomial random variables $A|D$, $D$ and $A$, with an improved mean approximation and adjustment for degeneracy 3 . The Delta method concerns the approximate distribution of a function of random variables, which is asymptotically normal, where the mean and variance are obtained by a first-order Taylor approximation expanded around the means. The improved mean involves a better approximation of the first moment of the asymptotic normal distribution by using a second-order Taylor expansion. To address the problem of degeneracy, i.e., the confidence interval's width is 0 when the probability of success is 0, we added a constant of 0.5 to all allele counts $x$ and allele numbers $n$ 41,42 , as the allele frequencies $p_A$ and $p_{A|D}$ of rare variants will always tend towards zero.
We compared this approach with seven other methods for deriving the confidence intervals of penetrance (List S2). In the first group of methods (G1), we derived the confidence interval for penetrance as the $(1-\alpha)\%$ confidence interval of the ratio of binomial proportions $p_{A|D}$ and $p_A$, similar to the derivation of confidence intervals for the relative risk 43 , and multiplied it by the estimated value of $p_D$. In the second group (G2), we considered $D$ as a random variable subject to uncertainty. We assessed the methods in groups G1 and G2 using an example variant with the following parameters: 97/52,660 for $x_D/n_D$ obtained from the HCM meta-analysis, 10/20,000 for $x_{A|D}/n_{A|D}$ and 3/600,000 for $x_A/n_A$. For all methods, we tested with degeneracy adjustment or by adding a continuity correction 3 . Our method of choice (part F in List S2) fully encompasses the uncertainty regarding $p_D$ (Figure S4). An example of a Bayesian approach where the prior and posterior have little overlap is depicted (Figure S5).
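A simplified version of the log-scale Delta method interval can be sketched for the example variant above. This is a first-order illustration only: it adds 0.5 to every count and denominator as the degeneracy adjustment, treats the three proportions as independent binomials, and omits the second-order improved mean approximation, so its output should not be expected to match the values reported for the method of choice. The function name and signature are hypothetical.

```python
import math
from statistics import NormalDist


def delta_log_ci(x_d, n_d, x_case, n_case, x_pop, n_pop, alpha=0.05, c=0.5):
    """First-order Delta method interval for penetrance on the log scale.

    Returns (estimate, lower, upper). The constant c is added to each count
    and denominator to avoid a zero-width interval when a count is zero.
    """
    z = NormalDist().inv_cdf(1 - alpha / 2)
    adjusted = [
        (x_d + c, n_d + c),        # prevalence
        (x_case + c, n_case + c),  # allele frequency in cases
        (x_pop + c, n_pop + c),    # allele frequency in the population
    ]
    p_d, p_case, p_pop = (x / n for x, n in adjusted)
    log_estimate = math.log(p_d) + math.log(p_case) - math.log(p_pop)
    # Var(log p_hat) ~ (1 - p) / (n p) = 1/x - 1/n for each binomial proportion
    variance = sum(1 / x - 1 / n for x, n in adjusted)
    half_width = z * math.sqrt(variance)
    estimate = math.exp(log_estimate)
    lower = math.exp(log_estimate - half_width)
    upper = min(math.exp(log_estimate + half_width), 1.0)  # truncate overshoot
    return estimate, lower, upper


estimate, lower, upper = delta_log_ci(97, 52_660, 10, 20_000, 3, 600_000)
print(f"penetrance {estimate:.1%} (95% CI {lower:.1%} to {upper:.1%})")
```

Working on the log scale keeps the lower limit above zero by construction, so only the upper limit needs truncating at 1. Most of the interval width here comes from the population term, where the allele is observed only three times.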

List S2
Methods assessed to derive the confidence intervals of penetrance.
We consider two groups of methods to assess the confidence interval of the penetrance. In the first group (G1), the confidence interval for the ratio of the random variables P(A|D) and P(A) (Eq. S1) is obtained similarly to the derivation of the confidence interval for relative risk 5 . Most of the methods are readily available in the R package ratesci 71 unless stated otherwise. The upper and lower limits of the (1 − α)% confidence interval are then multiplied by the estimated value of the prevalence P(D).
The second group (G2) of methods considers the prevalence P(D) as a random variable, and the confidence interval is derived assuming the independence of all the quantities involved. The second group of methods relies on the Delta method applied to the log-transformed random variable P(D|A) or directly to P(D|A) (Eq. S1) with/without improved mean approximation.
To address the problem of degeneracy, i.e., the confidence interval's width is 0 when the probability of success is either 0 or 1, we added a constant of 0.5 to each of x and n in the binomial random variables P(A|D), P(D), and P(A) 41,42 or added a continuity correction 3 to the confidence interval. To avoid "overshooting" 2 , i.e., a confidence interval of penetrance extending outside the interval [0,1], the results are truncated to the interval [0,1]. The CI(P(D|A)) is obtained as follows:

[...] genetic studies and lifted over to GRCh38. 57,787 were female and 67,961 were male. Ancestry is provided for global super-populations, i.e., African/African American (AFR), American Admixed/Latino (AMR), East Asian (EAS), Non-Finnish European (NFE), and South Asian (SAS), and some subpopulations such as Northwestern Europeans (NWE). Individuals known to be affected by severe paediatric disease have been removed, as well as their first-degree relatives; however, some individuals with severe disease may still be included in the data sets, albeit likely at a frequency equivalent to or lower than that seen in the general population. The data released by gnomAD are available free of restrictions under the Creative Commons Zero Public Domain Dedication. The aggregation and release of summary data from the exomes collected by the Genome Aggregation Database has been approved by the Partners IRB (protocol 2013P001339, "Large-scale aggregation of human genomic data"). The gnomAD dataset was incorporated into the analysis through the Ensembl Variant Effect Predictor 49 plugin.

Cardiomyopathy case cohort summary information
Datasets created in closely collaborating centres (described below: RBHT, NHCS, AHCE), for which access to the sequencing BAM files was granted, are denoted "internal datasets". Datasets summarised and aggregated by external sequencing centres (described below: OMGL, LMM, BRGL, GDx), for which only summary counts were provided, are denoted "external datasets".

Internal datasets
Royal Brompton and Harefield NHS Foundation Trust, London, UK (RBHT) provided panel sequencing on patients diagnosed with HCM and DCM, as previously published 50-52 . The patients were identified by consecutive referrals to the imaging unit from the dedicated cardiomyopathy service and a network of 30 regional hospitals, forming the National Institute for Health Research Biobank. Patients were referred for diagnostic evaluation, family screening, or assessment of CM severity. All patients were prospectively enrolled for research purposes and underwent cardiac phenotyping with either cardiovascular magnetic resonance (CMR) or trans-thoracic echocardiography, with CM diagnosed according to standard criteria 50 . Further information regarding the inclusion criteria of the patients, targeted sequencing protocol, and data quality control can be found in previously published articles 50 . All participants gave written informed consent, and the study was approved by the relevant regional research ethics committees. Samples were sequenced on the NextSeq 500, the MiSeq and the HiSeq Illumina platforms using the TruSight Cardio Sequencing Kit from Illumina (which includes 174 genes associated with inherited cardiac conditions (ICCs)). Additional samples were sequenced on the 5500xl SOLiD platform (SLD) from Life Technologies using a custom Agilent SureSelect panel of genes associated with ICCs.
National Heart Centre Singapore (NHCS), Singapore, provided panel sequencing on HCM and DCM patients via the NHCS Biobank, as previously published 50,51,53 . Patients were sequenced using the Illumina TruSight Cardio targeted panel. All patients were prospectively enrolled for research purposes and underwent cardiac phenotyping with either cardiovascular magnetic resonance (CMR) or transthoracic echocardiography, with cardiomyopathy diagnosed according to standard criteria.
Aswan Heart Centre, Egypt (AHCE) provided panel sequencing on HCM and DCM patients 54,55 . A series of Egyptian patients with CM were assessed at Aswan Heart Centre (AHC) by echocardiography and/or magnetic resonance imaging. Patients were sequenced using the Illumina TruSight Cardio targeted panel on the Illumina MiSeq or NextSeq platforms.
All samples included in the internal datasets were consolidated and joint-genotyped using GATK v4.1.9 GenomicsDBImport and GenotypeGVCFs. Variant calls were hard filtered using GATK Best Practices guidelines for germline short variant discovery. In particular, variants with quality-by-depth (QD) <3 and read depth <10x were not included in our counts due to the high likelihood of being false positives. All variants were converted to biallelic using bcftools v1.10.2 (htslib 1.10.2), and variants with AC=0 and star (*) alternative alleles were discarded.
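The hard-filtering step can be illustrated with a small sketch over already-parsed variant records. The dictionary keys (`qd`, `dp`, `ac`, `alt`) and the function name are hypothetical, and the sketch assumes that failing either the QD or the depth threshold excludes a variant; the text does not state whether the two conditions were applied jointly or separately.

```python
def passes_hard_filter(variant, min_qd=3.0, min_depth=10):
    """Return True if a biallelic variant record survives the hard filters.

    `variant` is a plain dict with hypothetical keys:
      qd  - quality-by-depth, dp - read depth,
      ac  - alternate allele count, alt - alternate allele string.
    """
    # Discard spanning-deletion (star) alleles and alleles with no carriers.
    if variant["alt"] == "*" or variant["ac"] == 0:
        return False
    # Assumption: a variant is excluded if it fails EITHER threshold.
    return variant["qd"] >= min_qd and variant["dp"] >= min_depth


records = [
    {"alt": "T", "ac": 2, "qd": 12.5, "dp": 40},  # kept
    {"alt": "*", "ac": 1, "qd": 20.0, "dp": 55},  # star allele
    {"alt": "G", "ac": 0, "qd": 18.0, "dp": 30},  # AC=0
    {"alt": "A", "ac": 1, "qd": 2.1, "dp": 60},   # low QD
    {"alt": "C", "ac": 1, "qd": 9.0, "dp": 6},    # low depth
]
kept = [r for r in records if passes_hard_filter(r)]
```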

External datasets
Laboratory of Molecular Medicine, Partners HealthCare, Massachusetts, US (LMM) provided aggregated summary sequencing information on patients with reported cardiomyopathy and consecutive diagnostic referrals for clinical genetic testing, i.e., HCM and DCM (no phenotypic confirmation), as previously published 50,51,56-58 . The LMM HCM cohort comprised unrelated probands referred for HCM clinical genetic testing 59 . Any individuals with an unclear clinical diagnosis of HCM, or with left ventricular hypertrophy due to an identified syndrome such as Fabry or Danon disease, or unaffected individuals with a family history of HCM were excluded. The LMM DCM cohort comprised individual probands referred for DCM clinical genetic testing. According to the published report, all patients had DCM or clinical features consistent with DCM based on the medical and family history information provided by ordering providers. Additionally, any cases with confirmed diagnoses of other cardiomyopathies, structural heart disease, congenital heart disease or syndromic or environmental causes were not included in the study. Only rare variants were included in the aggregated data. Briefly, various sequencing technologies were used over time (Sanger; targeted next-generation sequencing) but with complete coverage (Sanger used to fill gaps in NGS). The LMM2 dataset is a small subset of the LMM cohort that contains ancestry information for the reported variants.
Oxford Molecular Genetics Laboratory, Oxford University Hospitals NHS Foundation Trust, Oxford, UK (OMGL), provided aggregated summary sequencing data on apparently unrelated patients with HCM and DCM who were referred from Clinical Genetics centres across the UK for clinical genetic testing, with an initial clinical diagnosis of HCM or DCM made by a consultant cardiologist. The data included in this analysis have been previously published 51,58 . All samples received for diagnostic genetic testing of HCM or DCM genes were eligible, and analysis was undertaken in a routine clinical setting using clinical consent.
Belfast Regional Genetics Laboratory, Belfast, UK (BRGL), provided aggregated summary sequencing data on patients diagnosed with HCM who had been referred for a Sanger screen. They provided information on four genes, including TNNI3, for which information was provided only on exons 7 and 8.
GeneDx, Maryland, US (GDx), provided aggregated summary sequencing data on patients diagnosed with HCM using panel data between 2016 and 2017. The data included information on referrals for full panel sequencing. To our knowledge, GDx do not perform further analysis to rule out unrecognised relatedness.
See summary information for the number of participants analysed for each gene of interest (Table S4). Actual numbers of samples included in the case cohort vary by gene. The number reported represents the maximum number of samples sequenced for any gene. Institutional review board-approved protocols were used in this study, and all included patients provided written, informed consent for their data to be included in research.

Ancestry, age at scan and sex
For the internal cohorts of RBHT, NHCS, AHCE and the LMM2 cohort, ancestry was determined via self-report at sample recruitment (Table S5). Local ancestry codes were assigned to one of the eight population codes used in gnomAD to allow ancestry matching across all cohorts. Age at scan was recorded and used in all age-based analyses for the case cohorts. Sex was self-reported at recruitment for all internal datasets, except NHCS. GDx provided age at scan and sex information only for the variants that were reported.

Data merging

Technical differences and curation of aggregated datasets
The datasets included in this study have intra-sequencing-technology differences (e.g., Illumina and SOLiD technology have separate filtering); inter-sequencing-technology differences (e.g., NextSeq has higher resolution and depth than HiSeq and MiSeq); and intra-panel differences (e.g., WES or targeted panels, which vary in depth (i.e., WES has lower depth) and coverage). The NHCS provided data that were pre-filtered at the BAM level to a conservative read quality, which reduced the number of reads.
The external data were shared in multiple different formats (e.g., Excel, text, tab- or comma-separated values) with different variant identifiers (HGVSc or genomic position). All variants were confirmed and harmonised to variant call format (VCF) genomic coordinates using VEP v104, and bcftools v1.10.2 (htslib 1.10.2) was used to normalize variants (left-aligned and parsimonious). Any quality control or pre-filtering applied to the reported variants of the external datasets prior to this was at the discretion of the genetic centres.

Variant curation
All data (case cohort aggregated data, gnomAD, and UKB) were analysed in GRCh38. The aggregated data of the case cohorts were lifted over from GRCh37 using Picard Tools (version 2.23.1). The resulting VCF file was annotated using Ensembl Variant Effect Predictor (VEP; version 105) 49 with plugins and additional data for ClinVar (version 20220115) 60 , gnomAD (version r2.1) 48 , SpliceAI (1.3.1) 61 , REVEL 62 , and LOFTEE 48 . The VEP output was analysed using R (version 4.1.2) and RStudio. The UKBB WES data were incorporated into the analysis using the .frq and .frq.counts file formats from PLINK (version 1.9) 63 . Variants identified in the gnomAD data as AC0 (AC=0) were set as missing in the analyses and therefore could only be assessed using the UKBB WES data. The aggregate frequency and count data from gnomAD and UKB were summarised in an additive manner.
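One way to read "summarised in an additive manner" is that allele counts and allele numbers from the two reference datasets are summed before a frequency is computed, with gnomAD AC=0 entries treated as missing. The sketch below illustrates that reading only; the function name and the `(AC, AN)` tuple representation are assumptions, not the authors' code.

```python
def combine_reference(gnomad, ukb):
    """Pool reference counts from gnomAD and UK Biobank.

    Each argument is an (allele_count, allele_number) tuple, or None if
    the variant is absent from that dataset (hypothetical representation).
    Returns (AC, AN, AF) for the pooled reference, or None if no data.
    """
    # gnomAD entries flagged AC=0 are set to missing, so such variants
    # are assessed using the UK Biobank data alone.
    if gnomad is not None and gnomad[0] == 0:
        gnomad = None

    sources = [s for s in (gnomad, ukb) if s is not None]
    if not sources:
        return None

    ac = sum(s[0] for s in sources)
    an = sum(s[1] for s in sources)
    return ac, an, ac / an


pooled = combine_reference((3, 250_000), (2, 400_000))
ukb_only = combine_reference((0, 250_000), (2, 400_000))
```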
Variants identified in the case cohorts were analysed. Protein-altering variants (PAVs) on MANE transcripts of the genes of interest that had a MAF of < 0.1% in gnomAD and UKBB were identified. Protein-altering variants were included if specified as high or moderate impact by Sequence Ontology 64 and ENSEMBL 65 , with the addition of splice region variants for further curation. The genes of interest represent a list of 8 sarcomere-encoding genes with definitive evidence of an association with HCM (MYBPC3, MYH7, MYL2, MYL3, TNNI3, TNNT2, TPM1, ACTC1) 66 and 11 genes with definitive or strong evidence of an association with DCM (BAG3, DES, LMNA, MYH7, PLN, RBM20, SCN5A, TNNC1, TNNT2, TTN, DSP) 67 . FLNC was not included in this study as it was not present on the clinical panels analysed in the case cohort. Analysis was restricted to robustly disease-associated variant classes for each gene: all PAVs of MYBPC3; non-truncating variants (non-tvs; inframe indels, missense variants, start/stop lost variants, and nonsense-mediated decay incompetent premature termination codons (NMDi-PTCs)) for the other 7 HCM-associated genes 20 (MYH7, MYL2, MYL3, TNNI3, TNNT2, TPM1, ACTC1); all PAVs for BAG3, LMNA, PLN, RBM20, SCN5A, and DSP; TTNtvs (cardiac PSI >90% 52 ); non-tvs in DES, MYH7, TNNC1, and TNNT2.
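The per-gene restriction to disease-associated variant classes can be expressed as a lookup table. The sketch below is illustrative: the constant and function names are hypothetical, and it assumes each variant has already been labelled truncating or non-truncating (with NMD-incompetent PTCs counted as non-truncating, as described above).

```python
ALL_PAV, NON_TV, TTN_TV = "all_pav", "non_truncating", "ttn_truncating"

# Gene -> permitted variant class, per disease, as listed in the text.
RULES = {
    "HCM": {
        "MYBPC3": ALL_PAV,
        **dict.fromkeys(
            ["MYH7", "MYL2", "MYL3", "TNNI3", "TNNT2", "TPM1", "ACTC1"],
            NON_TV),
    },
    "DCM": {
        **dict.fromkeys(
            ["BAG3", "LMNA", "PLN", "RBM20", "SCN5A", "DSP"], ALL_PAV),
        **dict.fromkeys(["DES", "MYH7", "TNNC1", "TNNT2"], NON_TV),
        "TTN": TTN_TV,
    },
}


def variant_class_included(disease, gene, is_truncating, cardiac_psi=None):
    """Return True if the variant class is analysed for this gene/disease.

    `cardiac_psi` is the percent-spliced-in of the exon in cardiac
    transcripts (0-100); it is only consulted for TTN.
    """
    rule = RULES.get(disease, {}).get(gene)
    if rule is None:
        return False          # gene not analysed for this disease
    if rule == ALL_PAV:
        return True
    if rule == NON_TV:
        return not is_truncating
    # TTN: truncating variants in exons with cardiac PSI > 90% only.
    return is_truncating and cardiac_psi is not None and cardiac_psi > 90
```

For example, `variant_class_included("DCM", "TTN", True, cardiac_psi=100)` is accepted, whereas a truncating MYH7 variant is rejected for either disease.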
Splice region variants (in the region of the canonical splice donor and acceptor sites, within 1-3 bases of the exon or 3-8 bases of the intron) with a non-protein altering flag (i.e., synonymous and intron variants) that would otherwise be excluded were assessed in two ways. Via ClinVar report: those found pathogenic or likely pathogenic with at least 2 star evidence for HCM and DCM in ClinVar and with reported functional evidence for splicing were termed "splice confirmed", or, if the functional evidence for splicing was unclear, were termed "splice likely". Via prediction threshold: the remaining variants were included in the analysis if they met a recommended SpliceAI threshold for "high precision" of > 0.8. For TTN, splice region, missense variants were analysed by SpliceAI to identify those variants predicted to alter splicing that would otherwise be excluded.
LOFTEE was incorporated in the analysis to exclude loss of function (LoF) variants that were flagged as "low confidence" (LC) such as "NAGNAG site" requiring reannotation to non-LoF variant status and removal of 5'UTR and 3'UTR splice variants. Essential splice variant LoF occurs in the UTR of the transcript. Additional positional annotation included nonsense-mediated decay (NMD), to identify variants that introduce premature termination codons (PTCs) that are insensitive to NMD: (i) < 50 coding bases 68 from a final splice boundary (final coding exon or 3'UTR exon), (ii) in the final exon, or (iii) in the first 100 coding bases of the transcript. For single coding exon PLN, all LoF variants were denoted as NMD escaping. Furthermore, variants flagged "coding sequence variant" or "protein altering variant" were manually curated, as were "stop_lost" and "start_lost", which were examined via ENSEMBL sequence and UCSC Genome Browser 69 to identify in-frame rescues nearby. Where there was no obvious rescue to assess, the variant was denoted as "inframe insertion".
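The three positional NMD-escape rules can be sketched as a small function. This is a simplified illustration under stated assumptions: positions are 1-based coordinates in the coding sequence, only coding exons are modelled (so a final splice boundary lying in a 3'UTR exon is not handled), and the function and argument names are hypothetical.

```python
def is_nmd_escaping(ptc_cds_pos, coding_exon_lengths):
    """Flag a premature termination codon (PTC) as likely NMD-insensitive.

    ptc_cds_pos         : 1-based position of the PTC in the coding sequence
    coding_exon_lengths : coding bases contributed by each exon, 5' to 3'
    """
    # Single coding exon (e.g., PLN): every LoF variant is NMD escaping.
    if len(coding_exon_lengths) == 1:
        return True

    # Rule (iii): within the first 100 coding bases of the transcript.
    if ptc_cds_pos <= 100:
        return True

    # Coding position of the last base before the final exon-exon junction.
    last_junction = sum(coding_exon_lengths[:-1])

    # Rule (ii): PTC lies in the final exon.
    if ptc_cds_pos > last_junction:
        return True

    # Rule (i): fewer than 50 coding bases upstream of the final boundary.
    return (last_junction - ptc_cds_pos) < 50


exons = [300, 200, 150]  # hypothetical three-exon coding sequence
flags = [is_nmd_escaping(pos, exons) for pos in (50, 200, 450, 460, 520)]
```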
Variants were classified as pathogenic/likely pathogenic (P/LP) if reported as P/LP for the correct CM multiple times in ClinVar and confirmed by manual review, or if annotated as P/LP according to ACMG criteria, using the semi-automated CardioClassifier decision support tool 70 (similar curation previously published 20 ). The primary ACMG classification was derived from ClinVar via VEP. All P/LP annotations and variants flagged as "conflicting interpretations of pathogenicity" were manually assessed via the ClinVar website to confirm curation for the specific cardiomyopathy and assess the date of reports, the evidence in comments, and the number of agreeing reports. CardioClassifier was used as a support tool for determining curations for variants not reported in ClinVar (i.e., UK Biobank variants). We note the duplication of definitive evidence for MYH7 and TNNT2 for both HCM and DCM; variants in these genes were treated as having a role in either HCM or DCM.
We did not manually adjudicate all variant classifications for this analysis. Of 2,005 variants observed in cases with HCM or DCM, 1,578 had a ClinVar accession, and 427 did not. Variants with no ClinVar accession were annotated using the CardioClassifier decision support software, following the ACMG framework. 168 rare loss-of-function variants in genes where LoF is a mechanism of disease for the presenting phenotype were annotated as LP for the purposes of this analysis (PVS1 + PM2). Two further variants were prioritised as potentially P/LP by CardioClassifier (both missense variants in MYH7). These were manually adjudicated, and both were confirmed as fulfilling ACMG criteria for LP. The remaining variants without ClinVar accessions did not have sufficient available evidence for us to formally recurate, and were grouped with the VUS for this analysis. An equivalent approach was applied to UKB. Of 6,321 variants, 3,603 had a ClinVar accession, and 2,717 did not. 306 were rare LoF variants where PVS1 & PM2 would be applicable, and they would be reported as LP if observed in a patient with disease. While we would not formally label these as P/LP, since this requires them to be observed at least once in an individual with disease, for the purpose of this analysis they were grouped with the LP variants.
Additional allele frequency filtering was used to adjust for potential pre-filtering undertaken for the external datasets: the HCM and DCM cohorts (case and population) were filtered to include variants that have a MAF less than the maximum population AF (gnomAD and UKBB) of the external datasets (of which GDx and OMGL had the most filtering, and lowest maximum population allele frequency, for HCM and DCM, respectively). This was a MAF <0.00036598 in gnomAD and MAF <0.0007344 in UKBB for HCM (via GDx) and a MAF <0.000552987 in gnomAD and MAF <0.0006031 in UKBB for DCM (via OMGL). This dataset made up the total variants depicted in this study (Table S8, Table S9). To estimate penetrance, only variants that were observed more than once in both the case cohort and population reference dataset were included in the analysis (Table S10, Table S11).
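The frequency caps and the recurrence requirement can be combined into two small predicates, sketched below. The thresholds are those quoted above; the function names, the dictionary layout, and the treatment of a missing frequency (`None`) as passing the cap are assumptions made for illustration.

```python
# Maximum population allele frequencies quoted in the text.
MAF_CAPS = {
    "HCM": {"gnomad": 0.00036598, "ukbb": 0.0007344},   # via GDx
    "DCM": {"gnomad": 0.000552987, "ukbb": 0.0006031},  # via OMGL
}


def passes_maf_cap(disease, gnomad_af=None, ukbb_af=None):
    """True if the variant is below both reference caps for the disease.

    Assumption: a frequency of None (variant absent or set to missing
    in that dataset) is treated as passing that dataset's cap.
    """
    caps = MAF_CAPS[disease]
    gnomad_ok = gnomad_af is None or gnomad_af < caps["gnomad"]
    ukbb_ok = ukbb_af is None or ukbb_af < caps["ukbb"]
    return gnomad_ok and ukbb_ok


def eligible_for_penetrance(case_count, reference_count):
    """Variant-level penetrance requires more than one observation in
    both the case cohort and the population reference dataset."""
    return case_count > 1 and reference_count > 1
```

A variant with a gnomAD frequency of 0.0004 would therefore be retained for DCM but removed for HCM.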
For aggregate penetrance estimates of all rare cardiomyopathy variants by subgroup, the UKBB WES data underwent the same variant curation pipeline and filtering thresholds.