The 100,000 Genomes Pilot on Rare Disease Diagnosis in Healthcare − A Preliminary Report

Background The UK 100,000 Genomes Project is investigating the role of genome sequencing in patients with undiagnosed rare disease following usual care, and the alignment of research with healthcare implementation in the UK's National Health Service. (Other parts of this Project focus on patients with cancer and infection.)
Methods We enrolled participants, collected clinical features with human phenotype ontology terms, undertook genome sequencing and applied automated variant prioritization based on virtual gene panels (PanelApp) and phenotypes (Exomiser), alongside identification of novel pathogenic variants through research analysis. We report results from a pilot study of 4660 participants from 2183 families with 161 disorders covering a broad spectrum of rare disease.
Results Diagnostic yields varied by family structure and were highest in trios and larger pedigrees. Likely monogenic disorders had much higher diagnostic yields (35%), with intellectual disability, hearing and vision disorders achieving yields between 40 and 55%. Those with more complex etiologies had an overall 25% yield. Combining research and automated approaches was critical to 14% of diagnoses in which we found etiologic non-coding, structural and mitochondrial genome variants and coding variants poorly covered by exome sequencing. Cohort-wide burden testing across 57,000 genomes enabled discovery of 3 new disease genes and 19 novel associations. Of the genetic diagnoses that we made, 24% had immediate ramifications for the clinical decision-making for the patient or their relatives.
Conclusion Our pilot study of genome sequencing in a national health care system demonstrates diagnostic uplift across a range of rare diseases. (Funded by National Institute for Health Research and others)

diagnostic testing. [3][4][5] To address this, the UK Government launched the 100,000 Genomes Project (100KGP) in 2013 to apply whole genome sequencing (WGS) to rare disease, cancer and infection in national healthcare. 6 To assess impact of this WGS approach on the genetic diagnosis of rare disease in the UK's National Health Service, we carried out a pilot study in which we enrolled families and undertook detailed clinical phenotyping of the proband. 4 We collected electronic health records from all participants in a multi-petabyte research environment. 5 When necessary, we carried out wet bench orthogonal tests and in-silico approaches.

Patients
Following ethical approval, participants identified by healthcare professionals and researchers were recruited by nine English hospitals and consented through the National Institute for Health Research (NIHR) BioResource for Rare Diseases. They had a broad range of rare diseases that remained undiagnosed after usual care in the NHS, which ranged from no available test to approved tests that did not include genome sequencing. To test the broad applicability of genome sequencing, participants were eligible if they had a rare disease (defined in the UK as a disorder affecting 1 in 2000 people or fewer), were likely to have a single-gene or oligogenic aetiology, and had no genomic diagnosis. Data on prior proband testing were collected where possible, including single-gene tests, karyotyping, single nucleotide polymorphism (SNP) arrays, next generation sequencing panels, and exomes. Probands and, where feasible, parents and/or other family members were enrolled by multiple clinical specialties in the NHS. Standardized baseline clinical data were recorded using the Human Phenotype Ontology (HPO) 7 against disease-specific data models 8 and whole blood was drawn for DNA extraction. The participants are followed over their life course using electronic health records (all hospital episodes, registries and cause of death).

Genome Sequencing
Genome sequencing 9 was performed using the Illumina TruSeq DNA PCR-Free sample preparation kit by Illumina Laboratory Sciences, Cambridge UK on a HiSeq 2500 sequencer, generating a mean depth of 32× (range 27× to 54×) and greater than 15× for at least 95% of the reference human genome. WGS reads were aligned to the Genome Reference Consortium human genome build 37 (GRCh37) using Isaac Genome Alignment Software. Family-based variant calling of single nucleotide variants and insertions/deletions (indels) for chromosomes 1 to 22, X, and the mitochondrial genome (mean 2814× coverage, range 142-16581) was performed using the Platypus variant caller. 10
HPO terms (applied virtual panels). To address the issue of which genes have sufficient evidence to attribute causation and to include in these virtual gene panels, we used our PanelApp software to enable expert, crowd-sourced review and curation of genes with diagnostic-grade evidence for each of our disease categories, for example, evidence in at least three unrelated families. 11 Loss-of-function (LoF) or de novo protein-altering variants affecting genes in the applied virtual panels were classified as tier 1, other variant types such as missense variants affecting these genes were classified as tier 2, and all other filtered variants were classified as tier 3 (Figure S1 in the Supplementary Appendix). To further reduce the possibility of missed or inefficiently prioritized diagnoses, we ran Exomiser, 12 a phenotype-based approach that looks across all genes in the genome for a diagnosis. Exomiser prioritizes rare, segregating, predicted pathogenic variants in genes where the patient phenotypes match previous reference knowledge from human disease or model organism databases. The ontology-driven phenotype matching can detect patients with an atypical profile for a disease.
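The tier definitions above can be illustrated with a minimal sketch. It assumes that variants have already passed the pipeline's filtering steps; the `Variant` fields, the consequence labels and the gene names are hypothetical and are not taken from the actual pipeline, whose full rules are given in Figure S1.

```python
from dataclasses import dataclass

# Hypothetical consequence vocabulary, for illustration only.
LOF_CONSEQUENCES = {"stop_gained", "frameshift", "splice_donor", "splice_acceptor"}
PROTEIN_ALTERING = LOF_CONSEQUENCES | {"missense", "inframe_indel"}


@dataclass(frozen=True)
class Variant:
    gene: str
    consequence: str
    de_novo: bool = False


def assign_tier(variant, panel_genes):
    """Assign tier 1, 2 or 3 to an already-filtered variant."""
    if variant.gene not in panel_genes:
        return 3  # filtered variant outside the applied virtual panels
    if variant.consequence in LOF_CONSEQUENCES:
        return 1  # loss-of-function variant in a panel gene
    if variant.de_novo and variant.consequence in PROTEIN_ALTERING:
        return 1  # de novo protein-altering variant in a panel gene
    return 2      # other variant types (e.g. inherited missense) in a panel gene


panel = {"GENE_A", "GENE_B"}  # hypothetical applied virtual panel
tiers = [
    assign_tier(Variant("GENE_A", "frameshift"), panel),
    assign_tier(Variant("GENE_B", "missense", de_novo=True), panel),
    assign_tier(Variant("GENE_B", "missense"), panel),
    assign_tier(Variant("GENE_C", "stop_gained"), panel),
]
print(tiers)  # [1, 1, 2, 3]
```

Because Exomiser looks across all genes, a tier 3 variant in this sketch could still be prioritized by the phenotype-based route.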
Decision support systems and clinical genetics teams provided by Congenica Ltd and Fabric Genomics 13,14 assisted us in variant prioritization and return of candidate variants to the 13 NHS Genomic Medicine Centres (GMCs). These variants were reviewed by NHS clinical scientists and clinicians using the American College of Medical Genetics and Genomics guidelines and a diagnostic report was issued for each proband. 15 Final clinical outcomes included whether a genetic diagnosis was obtained, the variant(s) involved, whether they explained all or some of the phenotypes, and whether an intervention was deployed.
The pilot participants were recruited and sequenced throughout 2014-2016, while the infrastructure to collect, QC, process and return data was being established. Results were returned to the GMCs from May 2016 to April 2019. In our post-pilot phase with an established pipeline, we now return results to the GMCs within 6 weeks of sample collection.

Novel Pathogenic Variants
Researchers investigated coding and non-coding regions for novel diagnoses in genes matching the patients' phenotypes, including the presence of de novo variants in highly constrained coding regions 16 with 95% confidence. We used a novel methodology for mitochondrial DNA that accounts for heteroplasmy, 17 Genomiser, 18 and ExpansionHunter for simple tandem repeat expansions. 19 Finally we employed a novel random forest method to analyse Canvas 20 and Manta 21 calls and identify potentially pathogenic copy number and structural variants.
Gene-based burden testing to detect enrichment of rare, predicted pathogenic, segregating variants in novel genes in specific disease cohorts relative to controls was performed on the pilot genomes as well as additional genomes from the rest of the 100KGP to increase power (57,002 genomes; see Supplementary Methods).
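As a rough illustration of gene-based burden testing, the sketch below tests one gene for enrichment of qualifying variants in cases relative to controls and then adjusts a set of p-values for multiple testing. The one-sided Fisher's exact test and the Benjamini-Hochberg adjustment are assumptions made for this example; the actual statistical method is described in the Supplementary Methods and may differ. All counts are invented.

```python
from math import comb


def fisher_one_sided(case_hits, n_cases, control_hits, n_controls):
    """P(at least case_hits carriers among cases) under the hypergeometric null."""
    hits = case_hits + control_hits
    denom = comb(n_cases + n_controls, hits)
    upper = min(hits, n_cases)
    numer = sum(
        comb(n_cases, k) * comb(n_controls, hits - k)
        for k in range(case_hits, upper + 1)
    )
    return numer / denom


def bh_qvalues(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    q = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        q[i] = running_min
    return q


# Invented example: 5 of 10 cases carry a qualifying variant, 0 of 10 controls.
p_gene = fisher_one_sided(5, 10, 0, 10)
q = bh_qvalues([p_gene, 0.2, 0.6, 0.9])
significant = [value < 0.1 for value in q]  # threshold used in this study
```

A permutation check of the kind reported in the Results would repeat this procedure with case and control labels shuffled and count how many associations still pass the threshold.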
The pilot genomic and clinical data are freely accessible to members of a Genomics England Clinical Interpretation Partnership (GeCIP) domain (https://www.genomicsengland.co.uk/about-gecip/).

Statistical Analysis
Testing was performed using the R (version 3.6.0) and Stata (version 16) statistical packages. Further detail on individual methods is given in the Supplementary Appendix.

Patients
We enrolled 4660 participants (2183 probands and 2477 family members) from 161 broad categories across rare disease (Table 1), with neurologic, ophthalmologic and tumor syndromes commonly represented. Participants were recruited with varying numbers of affected and unaffected family members. We aimed, with varying degrees of success, to recruit trios or larger family structures to facilitate more effective variant prioritization. Of the probands with multiple bowel polyps whom we recruited, 93% were singletons. In contrast, 12% of probands with intellectual disability were singletons. Adult probands were more commonly enrolled than pediatric probands (age at recruitment 18 years or younger) (74% vs. 26%), in line with the general population (79% vs. 21%; 2011 census of England and Wales). The preponderance of adults is unusual compared to previous sequencing projects and reflects an eligibility criterion: probands had already undergone usual care, which in many cases involved standard genetic testing (mostly single-gene or panel-based). A lower percentage of female probands were recruited across most disease categories, especially for pediatric cases, where the difference was significant (232 female vs. 339 male; P<0.001, based on the expected female proportion of 51% from the 2011 census of England and Wales). The increased susceptibility of males to recessive X-linked conditions may account for this sex bias: over 6% of total diagnoses involved variants on the X chromosome (which represents approximately 5% of the genome). The inferred ancestry of the probands (see Supplementary Appendix) was in line with that expected from the population (86% white, 7.5% Asian, 3.3% black, 2.2% mixed, 1% other: 2011 census of England and Wales). However, significantly more pediatric probands were of South Asian ancestry compared to adult probands (16% vs. 4%, P<0.001); our results indicated potential consanguinity in 43% of pediatric South Asian probands and 1% for the other pediatric probands (Table 1).
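The sex-ratio comparison for pediatric probands can be reproduced approximately from the reported counts. The text does not state which test was used, so the exact two-sided binomial test below is an assumption; it compares 232 females among 571 pediatric probands with the expected female proportion of 51%.

```python
from math import comb


def binom_two_sided(k, n, p):
    """Exact two-sided binomial p-value.

    Sums the probabilities of all outcomes that are no more likely than
    the observed count k under the null proportion p.
    """
    pmf = [comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(n + 1)]
    threshold = pmf[k] * (1 + 1e-7)  # small tolerance for floating-point ties
    return min(1.0, sum(x for x in pmf if x <= threshold))


females, males = 232, 339          # pediatric probands, as reported
n_pediatric = females + males      # 571
p_value = binom_two_sided(females, n_pediatric, 0.51)
is_significant = p_value < 0.001   # threshold reported in the text
```

Under these assumptions the deviation from the census proportion is significant at the reported threshold.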

Clinical Data and Sequencing
We collected HPO terms for each participant: a median of 4 present terms (range 1-61) and a median of 4 absent terms, that is, phenotypes not exhibited by the proband (range 0-144). We then carried out genome sequencing followed by quality assurance to check coverage, sequence quality, presence of repeat sample submissions or sample swaps, and consistency with reported family structures (see Supplementary Appendix).

The Diagnostic Yield
We obtained genetic diagnoses for 25% of probands and deposited the genotypes into the ClinVar repository (accession numbers XXXX to YYYY). Of these diagnoses, 60% were made on the basis of coding SNV/indels in the applied virtual panels, 26% from coding SNV/indels affecting well-established disease genes outside the virtual panels using phenotype-based prioritization and/or expert review by the clinicians, Congenica Ltd, or Fabric Genomics, and 14% from genome-wide, phenotype-agnostic research analysis looking beyond SNV/indels, coding regions, and disease genes in the virtual panels (Figure 1). Following international guidelines, 15 a further 10% of probands were classified with variants of unknown significance in genes consistent with the phenotype by clinical review at the site, but with further functional validation required. Fewer candidate variants were returned after filtering in larger family structures (Table 3), making it easier to identify causative variants and in turn leading to higher diagnostic rates for trios, quads and more complex family structures (Figure 2a). This held even within a disorder; for example, for hereditary ataxia the diagnostic rate increased from 21% for singletons to 32% for trios (Table S4 in the Supplementary Materials).
Unsurprisingly, we obtained a higher diagnostic yield for diseases that were considered more likely to have a monogenic cause (Table S4 in the Supplementary Appendix) than for those we considered more likely to have complex etiology (35% vs 11%) (Figure 2a). We defined likely monogenic diseases as those present in OMIM for which genetic testing is part of the standard diagnostic workup, based on the consensus blinded review of three clinical geneticists. Diagnostic yield was highly variable by disease (Figure 2b, Table S3 in the Supplementary Appendix), ranging from 40-55% for intellectual disability and various vision and hearing disorders to 6% for tumor syndromes.
We obtained data on the presence or absence of prior genetic testing for a subset (1177) of the participants. The number of tests per proband ranged from 0 to 16 with a median of 1 (IQR 0-2), and approximately half of the probands in this subset had been tested at least once. The overall diagnostic uplift from genome sequencing in this subset was 32% with only a slight difference depending on whether prior testing had been performed (33%), or not (31%). However, many of these prior tests were not recent. The diagnostic yield provided by genome sequencing varied from 28% to 45% depending on the type of prior testing (Figure 2c, Table S5 in the Supplementary Appendix) which, for the most part, involved targeted single gene and panel testing (Table S6 in the Supplementary Appendix).

Diagnostic Pipeline
The aim of the automated diagnostic pipeline is to identify a few potentially causative candidate variants from the millions in a whole genome, through removal of extremely unlikely candidates (filtering) and identification of the most likely in the remainder (prioritization). This allows the GMCs to efficiently perform manual clinical interpretation and issue a diagnostic report.

Europe PMC Funders Author Manuscripts
the variants with only a median of 1 (IQR 0-2) candidate variant in panels returned to the clinicians at the GMCs per case (Table 3). Ongoing evolution of the virtual panels with new disease genes is expected to continue increasing the yield from this approach.
Phenotype-based prioritization using Exomiser detected 77%, 86%, and 88% of these diagnoses in the top, top 3 and top 5 ranked candidates respectively (Figure 2d). Exomiser and the virtual panels were complementary, with 92% of these diagnoses recalled when the two approaches were combined (last blue bar in Figure 2d). Precision phenotyping of our patients was essential both for Exomiser and for the selection of additional virtual panels, without which only 54% of these diagnoses would have been prioritized in the recruited disease virtual panel and presented to the GMCs as a likely candidate (first blue bar in Figure 2d).
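The top-ranked recall figures above are proportions of diagnosed cases whose causal variant appears within the first k prioritized candidates. A minimal sketch of that calculation follows; the ranks are invented for illustration and do not come from the study data, and `None` marks a diagnosis that the prioritization did not return at all.

```python
def top_k_recall(ranks, k):
    """Fraction of diagnosed cases whose causal variant is ranked within the top k."""
    hits = sum(1 for r in ranks if r is not None and r <= k)
    return hits / len(ranks)


# Hypothetical rank of the diagnosed variant in each of 8 solved cases.
example_ranks = [1, 1, 2, 4, None, 1, 3, 7]

recall_by_k = {k: top_k_recall(example_ranks, k) for k in (1, 3, 5)}
print(recall_by_k)  # {1: 0.375, 3: 0.625, 5: 0.75}
```

Recall can only rise or stay flat as k grows, which is why the reported values increase from the top candidate to the top 5.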

Research-based Diagnoses
Fourteen percent of the genetic diagnoses required research outside the diagnostic pipeline (Figure 1). This research involved comparisons with the genome sequences and clinical data in our research environment, with validation using wet bench orthogonal tests and in-silico approaches (Table S7 in the Supplementary Appendix). Additional diagnoses were made by screening for the presence of de novo variants in highly constrained coding regions. 16 These diagnoses included a de novo EBF3 missense variant in a patient with hereditary ataxia. Mitochondrial genome analysis, taking into account heteroplasmy, detected 4 new diagnoses (as well as the 9 that had already been detected by the main pipeline). Twelve probands had intronic splicing variants prioritized by Exomiser due to the known pathogenic status of these variants in ClinVar. 23 Nine novel non-coding diagnoses involving previously undescribed variants required exploration of the whole genome and in vitro functional validation via reverse transcription polymerase chain reaction, mini-gene, or luciferase assays. 24,25,26 Here, unsolved probands were queried for non-coding variants affecting genes in the applied virtual panels, either alone or in compound heterozygosity with loss-of-function variants. These were identified using either Genomiser or, for retinal disorder probands, systematic analysis of the untranslated regions, promoter or introns. A further 43 probands were fully or partially explained by structural variants or simple tandem repeat expansions in the genes HTT or FXN in probands with hereditary spastic paraplegia.

Novel Disease Gene Associations
We performed burden testing to discover novel Mendelian disease gene associations and potential genetic diagnoses for unsolved probands; 828 significant disease-gene associations (q value < 0.1) were identified, including 249 known and 579 novel genes (novel with respect to their association with disease), with only 0.03 ± 0.2 (range 0-3) associations from 10,000 permutations where cases and controls were assigned randomly. Twenty-two candidates represent the most likely new, fully penetrant, Mendelian disease genes (Table S8 in the Supplementary Appendix and ClinVar accession numbers SCV001759972 - SCV001760540), three of which have recently been independently confirmed: UBAP1 in hereditary spastic paraplegia, 27 FOXJ1 in non-CF bronchiectasis, 28 and SORD in Charcot-Marie-Tooth disease. 29 Diagnostic reports were issued for three probands with these genes (Figure 1) and we are investigating others in GeneMatcher and by functional validation studies in model organisms.
this study which enabled predictive testing to be offered to the younger brother within one week of birth. The younger child, who received a positive result, received weekly hydroxocobalamin injections to prevent metabolic decompensation. A 10-year-old girl was admitted to intensive care with life-threatening chickenpox. She had endured a diagnostic odyssey over seven years at a total cost of £356,571 across 307 secondary care episodes (Table S11 in Supplementary Appendix). We were able to diagnose CTPS1 deficiency due to a homozygous, known pathogenic splice acceptor variant. The diagnosis enabled a curative bone marrow transplant (cost £70,000), and predictive testing of her siblings showed no further family members to be at risk. One proband had waited until his sixth decade for a genomic diagnosis of an INF2 mutation causing focal segmental glomerulosclerosis. His father, brother and uncle had all died of renal failure. He had received two kidney transplants, had transmitted the condition to his daughter and was concerned about whether his 15-year-old granddaughter, who was under surveillance, was at risk. After he received his genetic diagnosis, the granddaughter was tested, found to be negative, and discharged from regular medical surveillance.

Discussion
Our findings demonstrate a substantial uplift in genomic diagnoses achieved for patients by genome sequencing across a broad spectrum of rare disease. The enhanced diagnostic benefit was observed regardless of whether participants had undergone prior genetic testing (31% in those who had received testing and 33% in those who had not). For 25% of those who received a genetic diagnosis, there was immediate clinical actionability. Standardizing procedures, from enrolment of patients to the return of NHS-validated results to clinicians, was critical to our success. For example, clinical data collection using disease-specific data models and HPO terms enabled diagnoses, confirming the value of standardization through ontologies and clinical annotation in precision medicine. 30 These additional diagnoses, beyond the 264 (49% of total diagnoses) observed in the single disease virtual panel, came from Exomiser and additional applied virtual panels. The diagnostic discoveries derived by combining research, decision support and clinical validation and assessment yielded an additional 72 diagnoses.
Diagnostic yield was influenced by family structure, and for disorders with a likely Mendelian inheritance and a single gene etiology our yield increased to 35%: ophthalmological, metabolic and neurologic disorders yielded the greatest percentage of diagnoses. The scale of our dataset enabled cohort-wide burden testing which identified numerous novel disease-gene associations including three that have now been confirmed and 19 with compelling evidence that are likely to be confirmed in independent datasets.
Of the diseases we diagnosed through genome sequencing, 13% were caused by mutations in non-coding sequence or mitochondrial genomes, tandem repeat expansions in Huntington disease, and a wide range of structural variants with nucleotide resolution of breakpoints using a novel random forest method. An additional 2% of diagnoses involved coding variants in regions of low coverage on exome sequencing. Our results provide new evidence of the value of genome sequencing and mirror previous studies where 53% of participants who received new diagnoses from genome sequencing had previously received testing by exome sequencing. 5
Previous studies have demonstrated how next-generation sequencing can reveal diagnoses with yields of between 25% and 29% from exome sequencing in persons who had received no prior genetic testing. [32][33][34] The Undiagnosed Disease Network reported a 26% yield from a mixture of exome and genome sequence analysis of 382 patients 5 and another genome sequencing study gave a 42% yield in 50 families with intellectual disability in whom prior testing had previously been carried out. 35 We obtained similar results with a broad range of disorders (160) with unmet diagnostic need.
Our approach is limited to diagnoses that are readily made through short-read genome sequencing. Fully phased, long-read sequencing better detects structural variation and delivers sequence from parts of the genome that are poorly captured by short-read sequencing. 31
This pilot has underpinned the case for genome sequencing in the diagnosis of certain specific rare diseases in the new NHS National Genomic Test Directory. 36 For patients in the National Health Service with specific disorders, such as intellectual disability, genome sequencing will now be the first-line test (

Supplementary Material
Refer to Web version on PubMed Central for supplementary material.

Figure 1.
Overview of the diagnostic and research pipeline and source of diagnoses. Results were returned to the Genomic Medicine Centres (GMCs) of the recruiting hospitals on 2183 pilot probands: 25% received a positive diagnosis and 10% had variant(s) of unknown significance (VUS) in genes consistent with the phenotype according to clinical geneticists at the recruiting site, but with further functional validation required. The remaining 65% received a negative report at the time but will be reanalysed. The numbers and sources of these positive diagnoses are shown at each stage of the automated diagnostic pipeline and additional research where a clear diagnosis was not immediately obvious.

panel-based and Exomiser prioritization for identifying the diagnoses. Virtual disease panel only: a single panel for the recruited disease category. Applied panels: all applied virtual panels used in the pipeline, including the recruited disease associated panel as well as 0 or more additionally selected panels based on the patient phenotypes (HPO terms). The proportion of diagnoses detected is shown in blue (sensitivity), along with the proportion of prioritized variants leading to a positive diagnosis in orange (positive predictive value). Proportions are also shown on bars. Here, diagnosed variant(s) are true positives and other returned candidate variants are false positives.

In virtual panels 1 (0-2) 1 (0-2) 1 (0-3) 1 (0-2) 0 (0-1)