Identification of PCSK9-like human gene knockouts using metabolomics, proteomics, and whole-genome sequencing in a consanguineous population

Summary Natural human knockouts of genes associated with desirable outcomes, such as PCSK9 with low levels of LDL-cholesterol, can lead to the discovery of new drug targets and treatments. Rare loss-of-function variants are more likely to be found in the homozygous state in consanguineous populations, and deep molecular phenotyping of blood samples from homozygous carriers can help to discriminate between silent and functional variants. Here, we combined whole-genome sequencing with proteomics and metabolomics for 2,935 individuals from the Qatar Biobank (QBB) to evaluate the power of this approach for finding genes of clinical and pharmaceutical interest. As proof-of-concept, we identified a homozygous carrier of a very rare PCSK9 variant with extremely low circulating PCSK9 levels and low LDL. Our study demonstrates that the chances of finding such variants are about 168 times higher in QBB compared with GnomAD and emphasizes the potential of consanguineous populations for drug discovery.


In brief
Belkadi et al. combined whole-genome sequencing with proteomics and metabolomics in a highly consanguineous Arab population to discover rare homozygous proteinchanging variants coinciding with extreme protein and metabolite levels. We found that the chance of identifying such variants is 168 times higher than in a non-consanguineous population.

INTRODUCTION
Cholesterol-lowering drugs that target PCSK9 are a well-documented example of how drug target selection based on genetic evidence from human knockouts can contribute to technical and regulatory success, providing a strong rationale for further investment in the field. 1 PCSK9 inhibitors, such as alirocumab and evolocumab, 2,3 were developed following the identification of healthy individuals with low levels of low-density lipoprotein cholesterol (LDL-C) carrying PCSK9 protein-changing variants (PCVs). 4 The success of this approach led to the widespread implementation of attempts to identify drug targets from healthy homozygous PCV carriers with extreme protein or metabolite levels and/or extreme biochemical findings in well-phenotyped cohorts. 5 Other successes achieved with similar strategies include LPA for lowering plasma lipoprotein levels, which was identified with biochemical assays, 6 APOC3 for lowering plasma triglyceride concentration, which was identified from proteomics and clinical biochemistry data, 7 and more recently HAO1 as a therapeutic target for primary hyperoxaluria type 1, identified through metabolomics. 8 The efficacy of studies of large genotyped or sequenced population cohorts for identifying PCVs for drug discovery was recently highlighted by a study of 141,456 whole-exome sequences from the GnomAD project 9 and 200,000 whole-exome sequences from the UK Biobank (UKB). 10 The availability of electronic health records and the continually decreasing costs of DNA sequencing have played an important role in this success. However, despite rigorous automatic filtering and manual curation to remove common model errors, 11 many false positives remain for rare PCVs, and it is difficult to distinguish true PCVs from processing artifacts in these sets of data from non-phenotyped subjects. 12 Rare PCVs differ considerably between populations with different structures. 13 In consanguineous populations with high rates of homozygosity, such as that of Qatar, the expected frequency of homozygotes for PCVs for the median gene is estimated at five per million, whereas this frequency is estimated at six per billion in non-consanguineous populations. 9 The sequencing of 7 billion non-consanguineous subjects would not be sufficient to find a natural knockout for every human gene, whereas predictions suggest that this target could be achieved with only a few million consanguineous individuals. 9 Consanguineous populations therefore provide a unique opportunity to identify rare homozygous PCVs, potentially leading to new PCKS9-like drug target discoveries.
Here we used deep (30x) whole-genome sequencing data from 2,935 subjects included in the Qatar Biobank (QBB) 14,15 to identify PCVs coinciding with extreme blood circulating levels of proteins and metabolites. We identified 98 in-cis protein associations, for which the PCV was in the gene encoding the protein measured in blood, and 105 metabolite associations for which the affected gene was biochemically related to the blood metabolite, which could also be viewed as a biochemical in-cis association. We manually curated the identified associations and selected 12 cases in which the mutated genes affected the corresponding protein and metabolite levels for further in-depth investigation as potential drug targets (PCSK9, BHMT, ACY1, PLG, ACSM2A, ABCG5, ABCC2, PAOX, AFMID, UPB1, AOX1, and ALOX15). The remaining PCV-extreme phenotype associations are presented in the supplementary tables.
Interestingly, we identified two homozygous PCSK9 variants associated with low PCSK9 protein and LDL-C levels: the first was the original rs11591147 variant that led to the development of PCSK9 inhibitors, 16 whereas the second (rs746442570) is reported here for the first time in the homozygous state. We also discovered two ACY1 variants affecting both protein (ACY1) and related metabolite (acetylated amino acids) levels and one PLG variant possibly associated with excessive coagulation of the blood, which was identified in an individual already on warfarin treatment according to the responses given on medical questionnaires. The PCV-extreme phenotype associations reported here, thus, shed new light on the functions of the proteins affected and may support further drug target development.

RESULTS
The population of Qatar displays enrichment in rare PCVs We consider homozygous gene variants that alter the encoded protein and are rare in the population to be candidate human knockouts. After quality control filtering, we identified 32,868 exonic and 5 0 UTR variants (PCVs) of 12,466 different genes that (1) had a moderate or high impact on the encoded protein, (2) were present in the homozygous state in at least one, but no more than five individuals from the 2,935 subjects included in QBB, and (3) had a minor allele frequency in the GnomAD populations <0.05 ( Figure 1A). These PCVs belong to 15 classes, with missense variants the largest group, followed by frameshift variants, and then in-frame deletions ( Figure 1B). Almost half the 12,466 genes (5,072, 46%) carried a single PCV ( Figure 1C). The density of PCVs per gene was correlated with gene length, such that the largest genes carried the largest numbers of PCV. For example, two large human genes, TTN and MUC4, carried 111 and 493 variants, respectively.
Almost half (15,600, 47%) of the PCVs identified in QBB were not found in the homozygous state in 125,748 subjects from the different populations of the GnomAD project ( Figure 1D). In total, 9,505 (61%) of these PCVs were present exclusively in the heterozygous state in GnomAD, and the rest (6,095, 39%) were completely undetected. We then split the identified PCVs into two groups: (1) 2,440 high-impact PCVs, and (2) 31,292 moderate-impact PCVs. We observed a high enrichment for the highimpact PCVs in those that were absent from the GnomAD project ( Figure S1). Most of the high-impact PCVs (92%) were not detected in the GnomAD project from which only 19% were detected in the heterozygote state in the GnomAD project. Only 43% of the moderate-impact PCVs were absent from the GnomAD project.
We estimated the excess homozygosity in QBB by comparing the inbreeding coefficient F for QBB cohort with that for the Europeans of the 1000 Genomes Project (Figure 2A). The F coefficient describes the probability that two alleles at a locus are identical by descent and can be used to estimate the excess homozygosity in a consanguineous population relative to a nonconsanguineous ancestor. The individuals included in QBB had an F value six times higher, on average, than that for nonconsanguineous populations (0.033 versus 0.0052, p < 2.2e-16).
Long runs of homozygosity (ROH) generated by recent consanguinity enable rare deleterious variants to exist in the homozygous form. 17, 18 We identified long ROH exceeding 10 kb in QBB and determined the fraction of PCVs lying in these long ROH. We found that 37,093 (65%) of the 57,063 homozygous PCVs were located in long ROH. The mean proportion of genomes in long ROH (PGROH) in QBB was 3.52% ( Figure 2B). For comparison, the mean PGROH in QBB was 40% higher than that for the Europeans of the 1000 Genomes Project (2.5%, Figure 2B). In total, 593 (20%) QBB participants have a PGROH above 5.8%, the PGROH observed in self-reported second cousins or closer parents. 19 PGROH reaches this level in fewer than 1% of Europeans ( Figure 2B). PGROH was also correlated with the number of homozygous PCVs carried by individuals included in QBB (Pearson's coefficient = 0.66, p < 2.2e-16, Figure 2C).
The expected homozygote frequency (EHF) in QBB was estimated at seven per million for the median gene, a value 168 times higher than that for the non-consanguineous populations of GnomAD ( Figure 2D). Based on this estimate of EHF, the targeted QBB sample size of 60,000 participants will make it possible to identify at least one homozygote for 5,709 genes, a result comparable to that obtained by sequencing five million non-consanguineous individuals ( Figure 2D). Thus, the high level of homozygosity in QBB, due to recent common ancestors, provides a much higher relative power for identifying rare PCVs than can be attained with non-consanguineous populations.

Extreme metabolomic and proteomic phenotypes identify functional variants
We characterized the 32,868 PCVs that were homozygous in one to five individuals in QBB further, by determining the levels of 1,305 blood-circulating proteins using the SomaLogic (Boulder, CO) affinity proteomics platform, and of 1,159 metabolites in plasma using the Metabolon (Durham, NC) non-targeted metabolomics platform. We also included data from 71 clinical biochemistry assays (Table S1) and the self-reported medication taken by these QBB participants. We assessed the association of homozygous variants with extreme protein and metabolite levels ( Figure 3). We retained all variants for which all homozygous individuals had protein or metabolite levels ranking in the 20 highest or 20 lowest values for the 2,935 QBB participants.
In other words, all individuals homozygous for these rare PCVs had extreme phenotypic traits. We identified 378,572 variantextreme protein-level associations (grouped in 377,249 variantprotein pairs, Table S2) and 312,899 variant-extreme metabolite-level associations (grouped in 311,637 variant-metabolite pairs, Table S3).
We estimated the false discovery rate (FDR) of our approach by sampling, by repeating the analysis 100 times with randomized sample ids. The mean number of trait-variant pairs found by chance was 370,384 for proteins and 296,643 for metabolites, suggesting that more than 95% of such associations (for either proteins or metabolites) would be expected to be found on average by chance ( Figures S2A and S2B).
As a means of reducing the FDR and focusing on the varianttrait pairs most likely to be of clinical relevance, we therefore limited our analysis to variants of genes functionally linked to the trait. For proteins, we retained only in-cis protein associations (pPCVs), that is, variants affecting the protein determined. We identified 95 pPCVs for 63 proteins associated with 72 variants (grouped in 73 variant-protein pairs, Table S4). The FDR for these pPCVs, determined by sampling, was 15% ( Figure S2C).
For metabolites, we limited the analysis to metabolites biochemically linked to the affected gene (mPCVs), by retaining only genemetabolite pairs reported as metabolite quantitative trait loci (mQTLs) in genome-wide association studies (GWASs). We assumed that mQTLs indicate a functional link between the gene and the metabolite. 20 We identified 103 mPCVs for 65 metabolites associated with 59 variants (grouped in 88 variant-metabolite pairs, Table S5). The FDR for mPCVs was 25% ( Figure S2D). We assessed sensitivity, by varying the thresholds from the 20 most extreme value to the 10 most extreme, the five most extreme, and the mean ±3 standard deviations for both proteins and metabolites. The fractions of pPCVs (in-cis protein associations) and mPCVs (metabolite association reported in the mGWAS) reported in each dataset were similar, regardless of the threshold used ( Figure S3). The similarity in the fractions of pPCV and mPCV suggests that more conservative thresholds do not exclude more false positive associations.
Clinical outcome for 12 PCVs for which multiple data sources are available We further dissected the potential effect of the association of PCVs with extreme protein and metabolite levels, by manually curating the pPCVs and mPCVs. We focus on 12 cases, 10 of which are the following: PCSK9, BHMT, ACY1, PLG, ACSM2A, ABCG5, ABCC2, PAOX, AFMID, and UPB1. We also included two other associations (AOX1 and ALOX15) that were reported in the Human Metabolite database (HMDB, Table 1). These PCVs are of particular interest, as they have (1) multiple sources of evidence, based on proteomics, metabolomics, and laboratory data, and/or (2) more than one homozygote in QBB displaying the extreme phenotype and therefore allow comparison of multiple cases for consistency. As per comparison, we ran two types of rare variant association with metabolites and laboratory data analyses: (1) a burden test using the combined multivariate and collapsing method (CMC), 21 and (2) the sequence kernel association test (SKAT). 22 The burden test aggregates PCVs that impact the metabolite and the lab data levels in the same direction, whereas SKAT considers PCVs in opposite directions. We identified eight significant gene-laboratory phenotype associations and 59 significant gene-metabolite associations with the burden test. All samples were collected exclusively from Qatari participants of the QBB study. Variants detected from whole-genome sequencing data were annotated with the Variant Effect Predictor (VEP), and variants annotated as high-impact, moderate-impact, and as creating new upstream open reading frames (ORFs) were retained as functional variants. Common variants in the GnomAD project (MAF > 5%) were excluded. Only rare homozygous variants present in at least one but no more than five participants were included. Associations between homozygous variants and extreme (in the top 20 or the bottom 20 values) protein and metabolite levels were identified. For each variant, all homozygotes had to be ranked in the top 20 or the bottom 20 values to be retained for the analysis. We retained only incis protein associations: 95 protein-changing variants affecting protein levels (pPCVs) and 103 gene-metabolite pairs reported in genome-wide association studies on metabolites (mPCVs).
Cell Genomics 3, 100218, January 11, 2023 5 Article ll OPEN ACCESS Using SKAT, we detected 14 gene-laboratory phenotype associations and 121 gene-metabolite associations (Table S6). Five of the 10 cases involving extreme metabolite levels (ACY1, ABCG5, ABCC2, PAOX, and UPB1) could also be identified in a burden test approach while the other five (genes associated with extreme betaine levels, ACSM2A, AFMID, ALOX15, and AOX1) were only found using the variant-extreme phenotype approach (Table S6). The two cases involving extreme laboratory data, PCSK9 and PLG, were also only found using the variant-extreme phenotype approach.
For two of the 12 cases-PCSK9 and PLG-extreme associations for proteomics and laboratory data were identified. For the ACY1 case, both protein (ACY1) and metabolite (acetylated amino acids) levels were extreme. For the betaine case, five homozygotes for PCVs of various genes (BHMT, SLC6A12, CBS, and SLC6A5) were associated with high betaine levels and different levels of dimethylglycine: high in individuals homozygous for the BHMT and SLC6A12 variants and low in individuals homozygous for the CBS and SLC6A5 variants. The eight remaining cases involved various homozygotes for a single gene that were found to be associated with one or several metabolites, such as ACSM2A with indolepropionic acid and phenylpropanoic acid; ABCC2 with the glycolic acid sulfate; PAOX with polyamine and polyamine metabolites, acetylspermidine, acisoga, and acetylisoputreanine; AFMID with formylanthranilic acid; UPB1 with beta-ureidopropionic acid and beta-aminoisobutyric acid; AOX1 with pyridoxic acid, methylnicotinamide, and N1methyl-2-pyridone-5-carboxamide; and ALOX15 with linoleic acid and arachidonic acid.
We describe below three cases that are well-established in the literature (PCSK9, ACY1, and Plasmin). We discuss the nine other cases in detail in the Supplementary Text.
We identified one individual homozygous for the extensively studied PCSK9 missense variant rs11591147 in QBB. This variant has been reported to lower the levels of LDL in the blood and to reduce the risk of coronary heart disease. Confirming this finding, QBB homozygote for rs11591147 had a low level of LDL-C ( Figure 4A). Our proteomics data also show that the homozygote of rs11591147 had lower levels of PCSK9 protein in the blood too ( Figure 4B). The rs11591147 missense variant Article ll OPEN ACCESS appears to have an additive effect on both PCSK9 ( Figure 4C) and LDL-C levels ( Figure 4D). We identified a second homozygous PCSK9 missense variant, rs746442570, carried by a single QBB participant. Unlike rs11591147, only two heterozygotes in $400,000 individuals were identified for rs746442570 in UKB. We identified two ACY1 missense variants, rs121912698 and rs2229152, each carried by a single homozygote in QBB. Both proteomics and metabolomics data provided evidence for an effect of these two missense variants of ACY1 on blood circulating protein and metabolite levels: the levels of ACY1 protein in the rs121912698 and rs2229152 homozygotes were the lowest of the 2,935 QBB participants (Figure 5A). ACY1 is required to remove the acetyl group from acetylated amino acids. Therefore, in metabolomics analyses, the levels of various acetylated amino acids, including acetylmethionine, acetylalanine, acetylisoleucine, acetylleucine, acetylvaline, acetylglutamate, acetylhistidine, acetylserine, and acetylthreonine, were extremely high and ranked in the 20 highest levels in both rs121912698 and rs2229152 homozygotes ( Figure 5C and Table S7). ACY1 deficiency is a neurological disorder caused by mutations in ACY1 and characterized by high levels of acetylated amino acids. rs121912698 is an established causal variant of ACY1 deficiency 23-26 where the replacement of Arg353 by a Cys residue could create a perturbation in the vicinity of the ACY1 active site. 27 Our data suggest that carriers of the rs2229152 variant are likely to present with the same pathophysiology. High levels of ACY1 expression are associated with a risk of type 2 diabetes (T2D). 28-30 Free amino acid levels generated by overexpression of ACY1 play a role in insulin secretion and glucose homeostasis and could eventually lead to T2D with impaired ß-cell function and insulin resistance. 30 Further investigations are, therefore, required to assess the potential of lowering free amino acid levels by ACY1 blockade as a potential treatment for T2D.
We identified one homozygote for a PLG missense variant rs4252129 in QBB. The rs4252129 homozygote has low levels of plasminogen (plasmin zymogen) and angiostatin (the plasmin cleavage product) but normal level of active plasmin ( Figure 6B). Individuals heterozygous for rs4252129 have lower levels of plasminogen and angiostatin but normal levels of active plasmin compared with the wild-type individuals, confirming the potential effect of rs4252129 on plasminogen and angiostatin rather than active plasmin ( Figure 6C). Plasmin plays a role in fibrin clot degradation, and mutations in PLG were associated with severe thrombosis. 31 The medical questionnaire completed by QBB participants indicated that the individual homozygous for rs4252129 was on warfarin treatment possibly for a high blood-clotting problem. This treatment inhibits the vitamin K-dependent synthesis of biologically active forms of the CFs F2, F7, F9, and F10. Our proteomics data determined the levels of six coagulation factors (CFs), four of which-F2, F7, F9, and F10-were extremely low in the rs4252129 homozygote ( Figure 6E). Furthermore, (C) Overall LDL-C level for the PCSK9 heterozygotes was lower than that in wild-type individuals. (D) Overall PCSK9 level for the heterozygotes was lower than that in wild-type individuals. Het, heterozygous; wt, wild-type homozygous. All the protein and clinical biochemistry data are presented on plots with a normalized scale (Z score, mean = 0, SD = 1).
Cell Genomics 3, 100218, January 11, 2023 7 Article ll OPEN ACCESS our laboratory data showed that the rs4252129 homozygote has a prolonged activated partial thromboplastin time (APTT) and prothrombin time (PT) and a high international normalized ratio (INR) ( Figure 6D). The low levels of four CFs, the prolonged APTT and PT, and high INR are in the rs4252129 homozygote probably a result of the warfarin treatment. The variant rs4252129 seems to extremely decrease both plasminogen and angiostatin levels in the blood, possibly resulting in a high blood-clotting problem. Further investigations are therefore required to determine whether extremely low levels of plasminogen and angiostatin combined with normal levels of active plasmin favor excessive blood clotting.
A detailed description of the three cases discussed here (PCSK9, ACY1, and PLG) and of the other nine cases is provided in the form of vignettes in the supplementary text.

DISCUSSION
We identified homozygous PCVs causing extreme protein and metabolite levels in QBB. Our data show that the population included in QBB is highly consanguineous. 32 Almost half the PCVs identified in QBB have no homozygote in the largest available DNA sequencing catalog. Furthermore, most of the PCVs identified were detected in large ROH enriched in deleterious variants. 18,33 In addition, previous studies to identify PCVs in cohorts well-characterized phenotypically used exome sequencing 7,34 or imputed SNP arrays, 35 but genome sequencing has been shown to outperform exome sequencing for identifying exonic variants 36 and it is difficult to tag rare variants with the available imputation methods. 37  Table S7.
(D) Example of free amino acid levels for methionine. Only the individual heterozygous for an ACY1 PCV had high methionine levels. The data for the other amino acids are presented in Table S7.
(E) ACY1 levels were correlated with the ratio of methionine to acetylmethionine levels. All the protein and metabolite data are presented on plots with a normalized scale (Z score, mean = 0, SD = 1). In our extreme metabolite associations, we excluded individuals with undetected metabolite levels from the analysis. However, a PCV can disrupt the gene function related to the production of a specific metabolite resulting in undetected metabolite levels. 38 Metabolites not detected in participants might therefore bias the mPCVs identified, making it impossible to determine whether the lack of detection is a random effect or due to an extreme level of the metabolite concerned. For undetected metabolite levels, we performed the Fisher test for randomness (STAR Methods) to detect whether the missing metabolite values from certain genotype groups were missing by chance. We also repeated the analysis using a value of zero for undetected metabolites, but no additional detections were achieved.
We identified PCVs causing extreme levels of proteins and metabolites, and extreme biochemical data, reflecting a disruption of gene function. The PCVs identified were PCSK9, BHMT, ACY1, PLG, ACSM2A, ABCG5, ABCC2, PAOX, AFMID, UPB1, AOX1, and ALOX15. PCSK9 provides proof-of-concept for our strategy for identifying true associations. 39,40 Pharmacological inhibitors of PCSK9 lowering LDL-C levels and, thus, reducing the risk of coronary heart disease, were developed after the identification of healthy PCSK9 carriers, a process that could have been initiated by our analysis.
In summary, we report here evidence for both new and previously reported associations between genetic variants and extreme protein and metabolite levels in a consanguineous population. The protein (Table S2) and metabolite (Table S3) associations identified are presented in the supplementary tables. We believe that our findings will be of great utility for the screening of diseases and traits associated with the identified genes. 41 Furthermore, this research should provide new hypotheses concerning potential targets for drug target investigations.

Limitations of the study
Our study also has some limitations. First, genetic variation affecting the protein sequence may lead to changes in the higher order structure of the protein and, thus, its aptamer binding affinity. 42 This may lead in some cases to extreme protein-level readouts where in reality only the aptamer binding affinity is affected by the variant. Additional pPCV findings concerning functional disruption in the metabolite and/or biochemical data can help to address this epitope effect. Second, many pPCVs and mPCVs, particularly those identified in a single participant, are false positives. 7,34,43 We addressed this issue by manually curating the pPCVs and mPCVs. A larger sample size, such as the target size of QBB (60,000 participants), would also help to resolve this issue. Finally, the available proteomics and metabolomics platforms target a limited number of proteins and metabolites, respectively. The expected improvements to these platforms in the future will increase the number of proteins and metabolites that can be targeted. 44

Variant annotation
The Variant Effect Predictor (VEP) was used to annotate the 100 million variants identified. 110 The UTRannotator plugin was also used to annotate additional variants to create new upstream open reading frames (ORFs). 117 In total, VEP annotated 610,493 autosomal variants as ORFs or as having a high or moderate impact. Common functional alleles are less likely to exert strong functional effects as they are less constrained by purifying selection. We, thus, retained 35,567 protein-changing variants (PCVs) for which one to five homozygotes were present in QBB for analysis. To identify rare PCV that were not present in the non-consanguineous population, we excluded 2,699 PCVs with a minor allele frequency (MAF) R 0.05 in the GnomAD populations. We finally retained 32,868 PCVs in the analysis and we annotated the identified PCVs with both the Combined Annotation Dependent Depletion (CADD) 111 and the rare exome variant ensemble learner (REVEL) 112 scores.

Estimation of the inbreeding coefficient F
The coefficient of inbreeding distributions of 2,935 QBB individuals were compared with those of 503 Europeans from the 1,000 genomes project. 104 We extracted a total of 869,201 polymorphic SNPs present on the Affymetrix 6.0 SNP array that passed the quality control check-up in QBB and the 1,000 Genomes project (1KG). PLINK (v2.0) was used to estimate the coefficient of inbreeding separately for each ethnic group. 109 The coefficient of inbreeding was estimated by dividing the observed degree of homozygosity by the expected homozygosity based on an estimated common ancestor. 118 Two-sided Wilcoxon tests were used to compare the F coefficients of QBB population and the non-consanguineous 1KG population.

Runs of homozygosity (ROH)
We estimated the mean proportion of the genome covered by ROH in QBB participants, as previously described for the East London Genes & Health (ELGH) study 34 : a hidden Markov model implemented in bcftools 113 was applied to the exonic regions targeted by the Illumina Trueseq exome regions (v1.2): bcftools roh -R exonic_regions_chr{CHROM).bed -G30 -a1e-8 -H1e-8 -V1e-10 -m genetic_ map_chr{CHROM}_combined_b37.txt. We targeted exonic regions only, as ELGH is a collection of exome-sequencing data. Furthermore, ROH length does not differ significantly between exome-and genome-sequencing data obtained with bcftools. 113 The proportion of the genome covered by ROH for each QBB participant was calculated with an in-house python script.

Expected homozygote frequency (EHF)
The EHF in the GnomAD consanguineous population was estimated as EHF = (1 -a)p 2 + ap, where p is the cumulative allele frequency (CAF) estimated for all GnomAD individuals and a is the mean proportion of the genome covered by ROH in individuals self-reporting having second-cousin or more closely related parents in the ELGH study 5419 . In GnomAD, the CAF was calculated as p = 1 -sqrt(q), where q is the fraction of GnomAD participants without loss of function. This calculation therefore requires the individual genotyping of GnomAD participants and is time-consuming. Here, we calculated the 'classic' CAF, as described by Minikel et al. 9 in Extended Data Figure 3. The 'classic' CAF for each gene is the sum of allele frequencies for all rare variants (MAF <5%) annotated as ''high-confidence loss-of-function'' by Loftee 12 in 125,748 GnomAD exomes.
We applied the method used on the GnomAD consanguineous population to estimate the EHF in QBB. The EHF for the GnomAD non-consanguineous population was estimated at four per hundred million for the median gene, a value different from the EHF reported by Minikel et al. 9 (six per billion). This difference is probably due to the 'classic' CAF used here to estimate the EHF in the GnomAD non-consanguineous population.
Cell Genomics 3, 100218, January 11, 2023 e2 Article ll OPEN ACCESS PCVs associated with extreme protein and metabolite levels The levels of 1,305 proteins were measured with the Somascan aptamer-based proteomics platform for 2,935 QBB participants. Six aptamers targeting human virus proteins were excluded. The Uniprot identifier for each protein was used in the BiomaRt package to identify the associated protein-coding genes. 119 Fifty-eight aptamers were assigned to multiple Uniprot identifiers and included different protein isoforms and complexes. We treated each Uniprot identifier as a unique record, to avoid missing any interesting findings. In total, 1,301 unique protein-coding genes were linked to the 1,299 proteins analyzed.
For the identification of variants associated with extreme protein levels in QBB, we first browsed the 32,868 PCVs and identified homozygotes. For each PCV, there were one to five homozygotes. We restricted analyses to proteins for which all homozygotes ranked in the top 20 or the bottom 20 for protein levels. In total, 378,572 associations of 20,909 PCVs from 10,025 protein-coding genes with 1,301 proteins were identified.
The levels of 1,159 metabolites were determined for 2,935 QBB participants with the non-targeted HD4 metabolomics platform from Metabolon. For the identification of PCVs associated with extreme metabolite levels, we excluded participants with missing metabolite determinations. As for proteins, analyses of metabolite associations were restricted to situations in which all homozygotes ranked in the top 20 or the bottom 20 for metabolite levels. Only metabolites with non-missing measures for more than 1,000 QBB participants were kept. In total, we identified 312,899 associations of 29,999 PCVs from 11,967 protein-coding genes with 989 metabolites.
Estimation of the false discovery rate (FDR) A knowledge of the distribution of a test statistic under the null hypothesis is important for FDR estimation. Permutation methods have become popular for estimating null distributions, due to their flexibility and generalizability. For both proteins and metabolites, we performed 100 permutations to estimate the null distribution. We randomly permuted the participant identifiers 100 times and determined the extreme associations for each permuted dataset, as for the unpermuted data. The FDR was estimated by dividing the mean number of associations in the permuted data by the number of associations identified in the unpermuted data.

Correction for multiple testing
We corrected for multiple testing and decreased the FDR for protein associations, by retaining associations for which the PCVs causing extreme protein level were variants of the gene encoding the protein concerned, that is, in cis associations (pPCVs).
For metabolites, we decreased the FDR by extracting pairs of genes and metabolites from metabolite associations already reported in the available mGWAS (mQTLs) (mPCVs). We extracted the mQTLs from the mGWAS reported in the GWAS catalog. 106 mQTLs were fine mapped and manually curated to identify causal genes. 120 mQTLs from 21 mGWAS in the GWAS catalog were fine mapped and manually curated to identify causal genes (Table S9). We also used the supplementary data from two recent mGWAS that were not included in the GWAS catalog. 120,121 The XML version of the HMDB 107 was downloaded from the HMDB website (https://hmdb.ca/) to identify metabolite associations reported in the HMDB. An in-house python script was used to extract metabolite-enzyme and metabolite-transporter associations. We linked the metabolites with a known HMDB identifier to enzymes, transporters and carriers as reported in the HMDB. The complete list of all associations is also provided in Table S9.
Fisher's exact test for the randomness of missing values Missing data can be classified in three patterns: missing completely at random (MCAR), missing at random (MAR), or missing not at random (MNAR). The missingness is MCAR, if the probability of being missing is the same for all cases; MAR, if the probability of being missing is the same only within groups defined by the observed data; MNAR, if the probability of being missing depends on both observed and non-observed quantities. The missingness in most of metabolites is due to the limits of detection (LOD). 122 This missingness is assumed left-censoring, a variant of MNAR.
We used Fisher's exact test to determine whether the missing metabolite or laboratory data values for homozygotes were missing by chance, in analyses of the effects of PCVs on the metabolite level and laboratory data. The two inputs for Fisher's exact test were: (a) the number of missing values for each genotype group (wild-type, heterozygous and homozygous), and (b) the total number of samples in each genotype group. We used fisher.test method in R for this calculation. In case of missingness due to LOD, low p values indicate a low probability of these values being missing by chance.

Burden test for gene-metabolite associations
In gene-based association analyses, the effects of various variants on a gene unit are aggregated in burden and SKAT tests. We used the 'FamCMC' and 'FamSKAT' functions implemented in rvtests 114 on the 610,493 autosomal variants as ORFs or as having a high or moderate impact. All 2,935 QBB participants were included, as both 'FamCMC' and 'FamSKAT' collapses and combines rare variants in related individuals through a kinship matrix. Variants with MAF >1% were excluded from the analysis. Sex, age, the three first principal components on genetics data were used as covariates. Metabolites and laboratory data with more than 300 missing values were excluded. The exome-wide significance was set at p < 2.39e-09 after Bonferroni correction for the maximum number of genes tested (18,798 genes) and the total number of phenotypes (71 laboratory phenotypes +1,040 metabolites). Additional evidence for associations with extreme protein and metabolite levels We collected additional evidence for pPCVs and mPCVs, by identifying homozygous participants. The extreme ranks for homozygotes were obtained for all phenotypes: 71 for laboratory data, 1,159 for metabolites and 1,299 for proteins. An approximate p value was calculated by dividing the product of extreme ranks for homozygotes by the total number of QBB participants to the power of the number of homozygotes: Where k is the number of homozygotes, N is the number of non-missing data points divided by 2 (division by 2 because we looked at both ends of the extreme) and R i is the extreme rank of homozygote i. After correction for multi-testing, a significant p-value will be: P < 0:05 M Where M is the total number of phenotypes: M = 2,529 (71 laboratory data phenotypes +1,159 metabolites +1,299 proteins). A quantile-quantile plot ( Figure S4) confirms that this p value is not inflated.
A comparison of the observed association statistics against the expected distribution suggests no systematic over-dispersion of the association statistics ( Figure S4).
Note that the same extreme ranks for high and low levels give the same P. For example, if three homozygotes for one or many PCVs for the same gene have the levels of a specific phenotype ranked 2,935/2,935, 2,934/2.935 and 2,933/2,935, then the corresponding extreme ranks are 1/2,935, 2/2,935 and 3/2,935 and P = (1 * 2 * 3)/2,935 3 = 2.37e-10. A low p value for a phenotype indicates a statistically significant association.
Genome-wide association study hits, expression quantitative trait loci (QTL), protein QTL, metabolite QTL, and methylation QTL displaying strong linkage disequilibrium with pPCVs and mPCVs in the European population (r 2 > 0.8) were extracted with Phenoscanner. 115 We used the R package VarfromPDB 123 to identify diseases associated with pPCVs and mPCVs in Clinvar. 124 We downloaded the DrugBank database (https://go.drugbank.com) and used the lxmx toolkit implemented in Python (v.2.7) to identify genes carrying pPCVs and mPCVs associated with drugs. We included only gene-drug combinations with a known action of the drugs on the genes.

QUANTIFICATION AND STATISTICAL ANALYSIS
The quantitative and statistical analyses are described in the relevant sections of the method details or in the table and figure legends.