Genetic diagnosis of developmental disorders in the DDD study: a scalable analysis of genome-wide research data

Summary
Background: Human genome sequencing has transformed our understanding of genomic variation and its relevance to health and disease, and is now starting to enter clinical practice for the diagnosis of rare diseases. The question of whether and how some categories of genomic findings should be shared with individual research participants is currently a topic of international debate, and development of robust analytical workflows to identify and communicate clinically relevant variants is paramount.
Methods: The Deciphering Developmental Disorders (DDD) study has developed a UK-wide patient recruitment network involving over 180 clinicians across all 24 regional genetics services, and has performed genome-wide microarray and whole exome sequencing on children with undiagnosed developmental disorders and their parents. After data analysis, pertinent genomic variants were returned to individual research participants via their local clinical genetics team.
Findings: Around 80 000 genomic variants were identified from exome sequencing and microarray analysis in each individual, of which on average 400 were rare and predicted to be protein altering. By focusing only on de novo and segregating variants in known developmental disorder genes, we achieved a diagnostic yield of 27% among 1133 previously investigated yet undiagnosed children with developmental disorders, whilst minimising incidental findings. In families with developmentally normal parents, whole exome sequencing of the child and both parents resulted in a 10-fold reduction in the number of potential causal variants that needed clinical evaluation compared to sequencing only the child. Most diagnostic variants identified in known genes were novel and not present in current databases of known disease variation.
Interpretation: Implementation of a robust translational genomics workflow is achievable within a large-scale rare disease research study to allow feedback of potentially diagnostic findings to clinicians and research participants. Systematic recording of relevant clinical data, curation of a gene–phenotype knowledge base, and development of clinical decision support software are needed in addition to automated exclusion of almost all variants, which is crucial for scalable prioritisation and review of possible diagnostic variants. However, the resource requirements of development and maintenance of a clinical reporting system within a research setting are substantial.
Funding: Health Innovation Challenge Fund, a parallel funding partnership between the Wellcome Trust and the UK Department of Health.


Sample Collection
Saliva samples were taken from patients and their parents using barcoded Oragene® DNA collection kits (DNA Genotek), and sent to the Wellcome Trust Sanger Institute (WTSI), where genomic DNA was extracted using a QIAsymphony instrument; blood-extracted genomic DNA from the proband was also provided by regional molecular genetics laboratories. A bespoke Laboratory Information Management System (LIMS) was developed, which interacted with DECIPHER to facilitate sample tracking. Following gel electrophoresis, an automated volume check (BioMicroLab) and assessment of concentration via a PicoGreen assay (Beckman FX, NX-96, Molecular Devices DTX reader), 58 unique single nucleotide polymorphisms (SNPs) were genotyped by the WTSI core sample logistics facility using Sequenom® mass spectrometry to provide a 'molecular barcode' for sample/data tracking. Any inconsistent family relationships were identified and resolved prior to genomic analysis, either by correcting tube swaps or by failing individual samples.
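The use of a SNP 'molecular barcode' to detect inconsistent family relationships can be illustrated with a simple Mendelian consistency check. The following is a minimal sketch only, not the pipeline used in the study: it assumes biallelic autosomal SNPs coded as alternate-allele counts (0, 1 or 2), and the function names are hypothetical.

```python
from typing import Optional, Sequence

# Alleles (0 = reference, 1 = alternate) that a parent with a given
# genotype (alternate-allele count) can transmit to a child.
TRANSMISSIBLE = {0: (0,), 1: (0, 1), 2: (1,)}


def is_mendelian_consistent(child: int, mother: int, father: int) -> bool:
    """True if the child genotype can arise from one allele of each parent."""
    return any(
        m + f == child
        for m in TRANSMISSIBLE[mother]
        for f in TRANSMISSIBLE[father]
    )


def mendelian_error_rate(
    child: Sequence[Optional[int]],
    mother: Sequence[Optional[int]],
    father: Sequence[Optional[int]],
) -> float:
    """Fraction of SNPs called in all three samples that are inconsistent.

    A genotype of None denotes a failed call and the SNP is skipped.
    """
    called = [
        (c, m, f)
        for c, m, f in zip(child, mother, father)
        if c is not None and m is not None and f is not None
    ]
    if not called:
        raise ValueError("no SNP was called in all three samples")
    errors = sum(not is_mendelian_consistent(c, m, f) for c, m, f in called)
    return errors / len(called)
```

A trio with a high error rate across the barcode SNPs would be a candidate for a tube swap or sample failure; the threshold at which a trio is flagged is a design choice not specified here.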

Genomic Analysis
Copy number variation was initially assessed in the probands by the DDD microarray team using a custom Agilent 2x1M CGH array (AMADID Nos. 031220/031221), with 5 probes per exon and a mean backbone spacing of ~4 kb. Patient samples were processed in batches of 95 using an Agilent BRAVO robot, labelled with Cy5 with Agilent SureTag labelling reagents, and hybridised against a pool of 500 developmentally normal males labelled with Cy3. All laboratory procedures were supported by our bespoke LIMS. Arrays were scanned in an Agilent G2565CA scanner at a 3 µm resolution, and processed using Feature Extraction v10.5.1.1. CNVs were then called using a novel in-house analysis package (CNsolidate, manuscript in preparation) that combines 12 weighted change point detection algorithms. Patient data were annotated with allele frequencies based on overlap with a set of controls analysed on the same platform (570 population controls from the UK Blood Service and 455 developmentally normal individuals from the Scottish Family Health Study). The WTSI genotyping team also genotyped the first ~1000 families using the Illumina 700K OmniExpress SNP array (SangerDDD_OmniExPlusv1_15019773_A) with an additional 100K custom content to fill any large gaps. The inheritance of rare CNVs identified in the proband using aCGH was subsequently assessed on the SNP array using a Bayesian framework.
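Annotating a patient CNV with a frequency based on overlap with control CNVs can be sketched as follows. This is an illustration under stated assumptions rather than the method used by CNsolidate: it assumes that a control CNV matches when it is of the same type and shares at least 50% reciprocal overlap with the patient call, and it reports a carrier frequency (matching control individuals divided by all controls) as a simplification. The `CNV` class, the function names and the 50% threshold are all hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class CNV:
    sample: str
    chrom: str
    start: int  # 0-based, inclusive
    end: int    # exclusive
    kind: str   # "gain" or "loss"


def reciprocal_overlap(a: CNV, b: CNV) -> float:
    """Shared length as a fraction of the longer CNV (0.0 if disjoint)."""
    if a.chrom != b.chrom:
        return 0.0
    shared = min(a.end, b.end) - max(a.start, b.start)
    if shared <= 0:
        return 0.0
    return min(shared / (a.end - a.start), shared / (b.end - b.start))


def control_frequency(
    call: CNV,
    control_cnvs: Iterable[CNV],
    n_controls: int,
    min_overlap: float = 0.5,  # assumed threshold, not taken from the study
) -> float:
    """Fraction of control individuals carrying a matching CNV."""
    carriers = {
        c.sample
        for c in control_cnvs
        if c.kind == call.kind and reciprocal_overlap(call, c) >= min_overlap
    }
    return len(carriers) / n_controls
```

With the control set described above, `n_controls` would be 570 + 455 = 1025. Counting each control individual once, even when several of their calls overlap the patient CNV, avoids inflating the frequency when a control CNV has been fragmented into multiple calls.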
Exome sequencing of family trios was performed by the WTSI core pull-down and sequencing teams, using a custom Agilent SureSelect Exome bait design [...] baits) of regulatory regions were targeted using custom baits. The median (n=3,399) average sequencing depth (ASD = bases sequenced/bases targeted) was 90X across the whole targeted sequence or 93X across autosomal targets only. 95% of all samples had an average sequencing depth higher than 63X. At least 90% of all targeted regions had a median ASD higher than 15X. Only 16,026 baits showed a median ASD smaller than 10X, comprising 800 kb of protein coding sequence. More than 85% of all probes were consistently covered (ASD >10X) across the three samples of the trio in at least 90% of the 1133 trios. Alignment was performed using BWA 1 and variants were called from BAM files using GATK 2, SAMtools 3 and Dindel 4. Putative de novo variants were identified from trio BAM files using DeNovoGear 5 and CNVs were called using a novel in-house algorithm (CoNVex, manuscript in preparation). Variant Call Format (VCF) files from each variant calling pipeline were then merged and annotated with the most severe consequence predicted by Ensembl Variant Effect Predictor (VEP version 2.6) 6, and with minor allele frequencies from a combination of the 1000 Genomes project (www.1000genomes.org), UK10K (www.uk10k.org), the NHLBI Exome Sequencing Project (esp.gs.washington.edu), Scottish Family Health Study (www.generationscotland.org), UK Blood Service and unaffected DDD parents. Putative de novo SNVs and indels were validated in-house using whole genome amplified DNA, PCR and capillary sequencing.
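The coverage statistics above follow directly from the definition ASD = bases sequenced/bases targeted. As a minimal sketch (the function names are hypothetical and the inputs are assumed to be per-sample base counts), the median ASD and the fraction of samples above a depth threshold could be computed as:

```python
from statistics import median
from typing import Sequence, Tuple


def average_sequencing_depth(bases_sequenced: int, bases_targeted: int) -> float:
    """ASD as defined in the text: bases sequenced / bases targeted."""
    if bases_targeted <= 0:
        raise ValueError("bases_targeted must be positive")
    return bases_sequenced / bases_targeted


def summarise_depth(asd_per_sample: Sequence[float], threshold: float) -> Tuple[float, float]:
    """Return (median ASD, fraction of samples with ASD above threshold)."""
    if not asd_per_sample:
        raise ValueError("at least one sample is required")
    above = sum(asd > threshold for asd in asd_per_sample)
    return median(asd_per_sample), above / len(asd_per_sample)
```

Calling `summarise_depth` with a threshold of 63 on the per-sample ASD values corresponds to the statement that 95% of samples exceeded 63X.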

Data Filtering and Sharing
Rare (minor allele frequency <0.01%) and protein-altering coding variants (defined by the sequence ontology terms: transcript ablation, splice donor variant, splice acceptor variant, stop gained, frameshift variant, stop lost, initiator codon variant, inframe insertion, inframe deletion, missense variant, transcript amplification and coding sequence variant) in individual probands were compared against an in-house developmental disorder genotype-to-phenotype database (DDG2P) and the patient's clinical features. Flagged variants were manually reviewed and likely diagnostic variants were fed back to the referring clinical geneticist via the patient's record in DECIPHER, where they can be viewed in an interactive genome browser by all members of the regional genetics service to enable local clinical evaluation, diagnostic laboratory validation and discussion with the family as appropriate. These variants are made publicly accessible after a short holding period (to ensure families are informed). Full genomic datasets are also deposited in EGA (www.ebi.ac.uk/ega) for future research.
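The automated part of this filter (rarity, predicted consequence and membership of DDG2P) can be sketched as below. This is an illustrative simplification, not the study's software: it does not model inheritance, the comparison with clinical features or manual review. The `Variant` class and `flag_variants` function are hypothetical, the frequency threshold is the one quoted above expressed as a fraction, and treating a variant absent from all reference panels as rare is an assumption.

```python
from dataclasses import dataclass
from typing import AbstractSet, Iterable, List, Optional

# Sequence Ontology consequence terms listed in the text.
PROTEIN_ALTERING = frozenset({
    "transcript_ablation", "splice_donor_variant", "splice_acceptor_variant",
    "stop_gained", "frameshift_variant", "stop_lost",
    "initiator_codon_variant", "inframe_insertion", "inframe_deletion",
    "missense_variant", "transcript_amplification", "coding_sequence_variant",
})

MAX_MAF = 0.0001  # 0.01% as quoted in the text, expressed as a fraction


@dataclass(frozen=True)
class Variant:
    gene: str
    consequence: str      # most severe predicted consequence
    maf: Optional[float]  # None if absent from all reference panels


def flag_variants(
    variants: Iterable[Variant],
    ddg2p_genes: AbstractSet[str],
    max_maf: float = MAX_MAF,
) -> List[Variant]:
    """Keep rare, protein-altering variants in genes listed in DDG2P."""
    return [
        v for v in variants
        if (v.maf is None or v.maf < max_maf)  # assumption: unseen means rare
        and v.consequence in PROTEIN_ALTERING
        and v.gene in ddg2p_genes
    ]
```

Applying the cheap set-membership tests before any manual step reflects the point made in the Summary that automated exclusion of almost all variants is what makes review scalable.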

Overview
The DDG2P dataset integrates data on genes, variants and phenotypes relating to developmental disorders. It is constructed entirely from published literature, and is primarily an inclusion list to allow targeted filtering of genome-wide data for diagnostic purposes in the DDD study. The database was compiled with respect to published genes, and annotated with types of disease-causing gene variants. Each row of the database associates a gene with a disease phenotype via an evidence level, inheritance mechanism and mutation consequence. Some genes therefore appear in the database more than once, where different genetic mechanisms result in different phenotypes. DDG2P is produced and curated by UK consultant clinical geneticists. It is regularly updated and the dataset used here is from 7th November 2013 (available in Appendix 2).
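The row structure described above, in which a gene may appear more than once, can be represented as a simple record type indexed by gene. The sketch below is hypothetical: the field names are chosen for illustration and the example values in the usage note are placeholders, since the actual categorical terms are defined in Appendix 2.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class DDG2PEntry:
    """One row: a gene linked to a disease phenotype (field names assumed)."""
    gene: str
    disease: str
    evidence_level: str
    inheritance: str   # allele requirement / inheritance mechanism
    consequence: str   # mutation consequence


def index_by_gene(entries: Iterable[DDG2PEntry]) -> Dict[str, List[DDG2PEntry]]:
    """Group rows by gene; a gene maps to one entry per distinct mechanism."""
    index: Dict[str, List[DDG2PEntry]] = defaultdict(list)
    for entry in entries:
        index[entry.gene].append(entry)
    return dict(index)
```

For example, two rows for a placeholder gene `GENE_A`, one with inheritance `"mechanism_1"` and one with `"mechanism_2"`, would both be returned by `index_by_gene(rows)["GENE_A"]`, so that a variant in that gene can be checked against each gene-disease relationship separately.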

Establishing and updating DDG2P
DDG2P is dependent on the RefSeq list of human genes downloaded from the UCSC Genome Informatics Table Browser interface (genome.ucsc.edu). The initial assignment of developmental disease genes, allele requirements and mutational consequence within DDG2P was achieved using three main sources: a systematic review of all articles in the previous 5 years of Nature Genetics and American Journal of Human Genetics by HVF and DRF; a review of all diagnostic tests currently offered by NHS DNA diagnostic laboratories in the UK; and text searching of the "Involvement in Disease" topic within the "General Annotation" field of UniProt (www.uniprot.org) using multiple different Boolean combinations of terms related to developmental disorders. The database interfaces with OMIM to inform the curation process. DDG2P is currently maintained as a FileMaker Pro relational database and is updated via regular review of the literature and from presentations at scientific meetings attended by DRF and HVF. All new gene-disease entities are added to the database by DRF and are made public via Twitter using @DDG2P, and regular releases are available via DECIPHER (https://decipher.sanger.ac.uk).

Definitions of terms and evidence base used in DDG2P
The definitions of the categorical terms used for assigning the DDG2P status of a gene, the allele requirements of each disease linked to a gene and the mutational consequence associated with each allele requirement are given in Appendix 2, Tables S1a, S1b and S1c respectively. The main evidence source used in DDG2P is the PMID of peer-reviewed articles that have been reviewed in order to categorise each gene-disease (one-to-many) relationship. The November 2013 version of DDG2P is provided in Table S1d of Appendix 2.