Noninvasive prenatal diagnosis of monogenic disorders based on direct haplotype phasing through targeted linked-read sequencing

Though massively parallel sequencing has been widely applied to noninvasive prenatal screen for common trisomy, the clinical use of massively parallel sequencing to noninvasive prenatal diagnose monogenic disorders is limited. This study was to develop a method for directly determining paternal haplotypes for noninvasive prenatal diagnosis of monogenic disorders without requiring proband’s samples. The study recruited 40 families at high risk for autosomal recessive diseases. The targeted linked-read sequencing was performed on high molecular weight (HMW) DNA of parents using customized probes designed to capture targeted genes and single-nucleotide polymorphisms (SNPs) distributed within 1Mb flanking region of targeted genes. Plasma DNA from pregnant mothers also underwent targeted sequencing using the same probes to determine fetal haplotypes according to parental haplotypes. The results were further confirmed by invasive prenatal diagnosis. Seventy-eight parental haplotypes of targeted gene were successfully determined by targeted linked-read sequencing. The predicted fetal inheritance of variant was correctly deduced in 38 families in which the variants had been confirmed by invasive prenatal diagnosis. Two families were determined to be no-call. Targeted linked-read sequencing method demonstrated to be an effective means to phase personal haplotype for noninvasive prenatal diagnosis of monogenic disorders.


Background
The discovery of cell-free fetal circulating DNA (cff-DNA) in maternal blood and the rapid advances of massively parallel sequencing (MPS) have provided an unprecedented opportunity to perform the prenatal genetic testing of common fetal aneuploidies and singlegene diseases. Though MPS has been widely applied to screen for fetal trisomy 21, 18 and 13 [1], the clinical use of MPS to diagnose monogenic disorders is limited [2]. Several studies have been conducted to develop noninvasive prenatal diagnosis (NIPD) for monogenic disease using various technologies such as real-time polymerase chain reaction (PCR), amplification at lower denaturation temperature-PCR, digital PCR, circulating single-molecule amplification and resequencing technology [3,4] and MPS. These studies were confined to exclude paternally inherited [5] and detect de novo variants [6] based on variant-specific assays due to the strong interference of maternal background signal. The relative haplotype dosage approach has been demonstrated to detect parental inherited variants at the same time. Our group has employed a proband-based method for resolving parental haplotypes and successfully applied this method to NIPD of Duchenne muscular dystrophy (DMD) [7], congenital adrenal hyperplasia (CAH) [8], maple syrup urine disease (MSUD) [9], hyperphenylalaninemia [10] and spinal muscular atrophy (SMA) [11]. This phasing information makes it possible to measure the haplotype dosage imbalance in maternal plasma DNA. The advantage of relative haplotype dosage approach is that analysis is independent of variant types. While, the method needs proband's samples to phase parental haplotypes, which hampers the application of NIPD to monogenic diseases in clinical practice. The haplotype phasing is a critical step for haplotype-based NIPD of monogenic disorders. Serval studies have reported specific haplotype building methods such as clone pool dilution sequencing [12], contiguity-preserving transposition sequencing [13], targeted locus amplification (TLA) [14], HaploSeq [15] and long fragment read (LFR) technology [16]. These approaches need complex experimental operations and are time consuming and associated with a low success rate. These limitations can be problematic for identifying single gene disorders. Population data-based personal haplotype phasing overcomes the above drawbacks. The population-based method is based on reference population with genotyping data of unrelated individuals and the accuracy of NIPD is only 80%, which is lower than the experimental methods [17]. In order to further improve the success rate and accuracy of haplotype phasing, microfluidics-based linked-read sequencing technology and TLA-based phasing were utilized to phase parental DNA directly [18,19]. The former approach combined the whole-genome sequencing (WGS) and linked-read sequencing method and succeeded in predicting fetal inherited variants in 12 of 13 pregnancies. The informative sequencing depth (40x) of WGS and the expensive experimental reagents restricted its clinical practice for NIPD [18]. Targeted TLA-based phasing approach is also subject to the complex acquisition of TLA template and customized target kit for NIPD which is inconvenient. A customized probe which covers dozens of common single gene disorders in China is used for haplotype-based NIPD. Therefore, we speculated that the linked-read sequencing combined with targeted sequencing using the above probes would expand the list of single gene disorders and reduce the cost compared with the wholegenome sequencing.
In our previous study, we demonstrated direct haplotyping of NIPD based on linked-read sequencing is accurate for the prediction of fetal pathogenic variants of DMD [20]. The objectives of this study are to investigate the feasibility and accuracy of targeted linked-read sequencing in six different types of autosomal recessive diseases. We analyzed 40 families at high risk for six kinds of autosomal recessive diseases and showed that direct haplotype phasing of parental high molecular weight (HMW) DNA is feasible using targeted linkedread sequencing of target genes. Targeted sequencing of maternal plasma DNA combined with the parental haplotype information were interpreted to determine the inherited variants in fetus. Our approach might be a costeffective and applicable method for NIPD of autosomal recessive monogenic disorders in clinical settings.

Sample collection
We recruited 40 families at high risk for autosomal recessive diseases, including 13 methylmalonic acidemia (MMA) families, 12 β-thalassemia families, 8 phenylketonuria (PKU) families, 5 α-thalassemia families, 1 autosomal recessive polycystic kidney disease (ARPKD) family and 1 autosomal recessive deafness-1A (DFNB1A) family caused by pathogenic variants of GJB2 gene. The variants have been identified in all families (Table 1). All participants provided written informed consent to join in the study. The ethics committee of the participating hospitals and the Institutional Review Board of BGI approved the conduct of this study (BGI-IRB No 17080-T1).

Target capture probe design
The targeted enrichment of DNA libraries was performed according to the custom-designed SeqCap EZ Choice Library (NimbleGen, Roche) protocol. The capture probes (NimbleGen, Roche) targeting the whole genes of HBB, HBA1, HBA2, and highly heterozygous SNPs within 1Mb flanking region of target genes were designed for NIPD of β-thalassemia and α-thalassemia. Another set of target capture probe was designed to cover the coding region and SNPs within 1Mb upstream and downstream regions of the interested genes, including MMACHC (MMA), PAH (PKU), PKHD1 (ARPKD) and GJB2 (DFNB1A).

Targeted linked-read sequencing
HMW genomic DNA (gDNA) was extracted from stored blood using the Mag Attract HMW Kit (Qiagen, Germany). The size of HMW gDNA should be more than 50kb according to the pulse electrophoresis results. Then gDNA was processed with Chromium ™ Genome v2 libraries (10x Genomics, USA). Long gDNA strands were partitioned in barcoded gel beads through a microfluidic device. Barcoded oligonucleotides in a gel bead bind randomly onto the long molecules and generate short fragments with the same barcode. The chance that two molecules were covering the same genomic locus on each gel bead is low, and the short fragments with the same barcode were considered to come from the same long molecule. Libraries of the barcoded fragments were prepared and captured using the customized probe. The prepared DNA library was then sequenced using an Illumina HiSeq2500 sequencer with a paired-end format of 101 bp or 150 bp.

Variant calling and direct haplotype phasing
The barcoded libraries read were then processed with the Long Ranger pipeline (v.2.2.2) provided by 10x Genomics [21]. Reads associated with valid barcodes were aligned against the human genome 19 (Hg19) by using the Burrows-Wheeler Aligner (BWA) software [22]. Output files annotated with barcode and phasing information were generated and served as the reference haplotypes of the family for downstream analysis. The maternal plasma DNA sequencing reads were aligned against the reference hg19 using BWA. After duplicated reads were marked by the Picard Mark Duplicates tool, the GATK tools were applied to perform local realignment and base quality score recalibration [23].
The free Long Ranger (v.2.2.2) software was utilized to determine the parental haplotype in the interested region. Barcode information provides the clue to associate short reads to the original long input molecules. Variant-linked haplotype referred to those reads whose barcodes were consistent with the ones with variant alleles. In contrast, wild-linked haplotype denoted the reads carrying same barcode with the ones with wild-type alleles. The different haplotype blocks were linked with identified SNPs using the overlapping region. SNPs associated with the same haplotypes carrying the wild-type and variant alleles were used for the maternal plasma DNA analysis.

The estimation of fetal fraction and NIPD of monogenic disorders
The evaluation of fetal fraction could be conducted according to the procedure reported before [8]. The haplotype related to variant and wild alleles was constructed based on targeted linked-read sequencing. The informative SNPs that were heterozygous in the mother but homozygous in the father were analyzed for maternal inheritance. On the contrary, the paternal inheritance analysis followed the opposite strategy with maternal inheritance analysis. We used hidden Markov model (HMM) to predict the most likely inherited haplotype using our previously reported algorithm [24]. The probabilities that the fetus inherited the pathogenic and non-pathogenic alleles were evaluated using the number of reads in maternal plasma and then considered as the HMM emission probabilities. The genetic map from the National Center for Biotechnology Information provided the genetic position of the SNPs in centimorgan (cM) and recombination rates between SNPs, these probabilities were regarded as HMM transition probabilities. Lastly, the Viterbi algorithm was utilized to predict the inherited haplotype in the fetus.

Validation of NIPD
The samples obtained through invasive procedures including chorionic villus sampling (CVS) and amniocentesis were used for prenatal genetic diagnosis. After DNA extraction, Sanger sequencing, gap-PCR and reverse dot blot PCR for target variations were performed in a blind manner to further validate the accuracy of NIPD.  Table 1.

Targeted linked-reads sequencing
Targeted sequencing on the interested gene region was performed in plasma DNA samples from 40 pregnant women at different gestational weeks. The fetal fraction varied from 5.9 to 27.7%, with a mean fetal fraction of 13.2%, showing significant differences between individuals ( Table 1). The targeted sequencing of gDNA samples showed the coverage was relatively consistent in the targeted genes, with a mean read depth of 402× (Additional file 1: Table S1). After data pre-processing and alignment, over 98% of the linked-reads were aligned to the hg19, an average of 50% of the bases were on-target (Additional file 1: Table S1). The summary statistics of alignment are presented in detail in Additional file 1: Table S1.

Direct haplotype phasing
The 10x genomics barcoding technology allowed us to obtain long-range information by linking the short sequencing reads produced. There were two haplotypes, the pathogenic haplotype (P) and normal haplotype (N). The former referred to the reads whose alleles or barcodes were in consistence with variant-supporting reads at heterozygous SNP positions. While the latter represented those reads whose alleles were opposite to the variant-supporting reads at heterozygous SNP positions. The two haplotypes of were directly determined by linking the haplotype blocks assembled by the barcoded reads for all parental gDNA. N50 phase-block length represents the contiguity achieved in the experimental  Table 2 and Additional file 1: Table S1. The number of SNPs in the phase blocks used for phasing ranged from 3 to 2418 SNPs, with a mean of 1006 (Table 2). All variants carried by family members were initially detected by the targeted linkedread sequencing and verified to be concordant with those from the MPS data. The paternal haplotypes phasing of F27 and F36 failed, because the haplotype block cannot cover the pathogenic variants. Therefore, the NIPD analysis is not required for failed phasing individuals (pF27 and pF36).

Noninvasive prenatal diagnosis
As shown in the NIPD flowchart (Figure 1), maternal and paternal haplotypes were first established using targetregion sequencing data and the HMM and Viterbi algorithm was then applied to predict fetal haplotypes. Our goal was to precisely infer the fetal genotypes at pathogenic sites, not to correctly infer the haplotypes of all SNP markers flanking the target gene. Therefore, the specific rules [25] were set to determine the fetal genotype at the pathogenic site after obtaining the optimal path of the fetal haplotype block via the Viterbi algorithm. If the path contains only one halotype block (pathogenic or normal) and the block spans the target gene, the fetal genotype at the pathogenic site is the state of the haplotype block that spans the target gene. If the path contains two haplotype blocks (pathogenic and normal) and only one haplotype block spans the target gene, the fetal genotype at the pathogenic site is the state of the haplotype block that spans the target gene (for example, mF04 and mF06). If two haplotype block (pathogenic and normal) exists inside the target gene, the fetal genotype at the pathogenic site is determined as no-call (for example, mF36). A confidence score (CS) [25] was introduced into our algorithm to quantify the probability of obtaining the correct results for NIPD. The CS was calculated using the fetal fraction, sequencing depth of maternal plasma and number of parental informative SNPs as inputs for computational simulation. The detailed method can be referred to the published literature [25]. The condition that the CS was less than 0.99 was defined as no-call. The NIPD results exhibited that 38 fetuses had both alleles detected; of these 38 fetuses, 11 were affected, 15 were carriers and 12 were normal. (Table 3, Additional file 2: Figure S1, Additional file 3: Figure S2 and Additional file 4: Figure S3). For F27, only one normal haplotype inherited from mother can be inferred by NIPD. For F36, we cannot predict fetal haplotypes inherited from parents.
The fetal genotypes inferred by NIPD were compared with direct sequencing results of fetal gDNA extracted from CVS or amniotic fluid cells to further validate the accuracy of NIPD. The results of NIPD were in concordant with invasive diagnosis and the standard genotype of captured sequencing (Table 3).

Discussion
In our study, we applied the targeted linked-read sequencing method to resolve the parental haplotypes across a range of disease loci and successfully determined the fetal genotypes in 38 families, at risk for various single  In recent years, several studies have utilized the direct haplotyping method to perform NIPD of single gene disorders [18,23]. Hui et al conducted whole genome haplotyping method and resolved the parental haplotypes with the use of linked-read sequencing technology. They correctly deduced the fetal variant profiles in 12 out of 13 families at risk for a number of autosomal and X-linked diseases. However, the cost of whole genome haplotyping method is relatively high, which might limit its wide use in clinical settings. Vermeulen et al established the targeted locus amplification approach and phased heterozygous variants in selected genes, the method reduced the cost of whole genome haplotyping method and predicted fetal variant status with a high accuracy. Michael Parks utilized targeted capture enrichment of SNPs across a 6 Mb genomic window on chromosome 5 containing the SMN1 gene and successfully deduced fetal variants by relative haplotype dosage with 100% accuracy [11]. However, customizing the targeted region might be a complex task, due to population frequency difference of SNPs across different ethnicities [26]. Our method is advantageous to the above-mentioned 2 direct phasing methods with respect to the cost-effectiveness and recombination prediction. The current NIPD practically requires maternal, paternal DNA and proband's DNA samples, therefore, the cost of the current proband-dependent method is approximately $830. The major advantage of our method is that it bypassed the availability of the proband's DNA which considerably reduced the cost to $700. Moreover, multiplexing of a barcoded library further reduces the cost of linked-read sequencing. The turnaround time of linked-read sequencing is 3 weeks, that is more time-consuming than that of the probandbased method but is still affordable for noninvasive prenatal diagnosis. One potential application of our method is NIPD of cystic fibrosis variants which are more relevant to other ethnicity. As demonstrated in this study, the capture probes should cover the whole CF transmembrane regulator (CFTR) gene and highly heterozygous SNPs within 1Mb flanking region of CFTR. With reduced cost, the targeted linked-read sequencing method is capable of NIPD of a wide range of monogenic disorders independently of proband sample.
Despite the advantages as mentioned above, our method still has certain limitations. First, the average percentage of bases on target is approximately 50%, the low on-target rate is a potential limitation of this linked-read target sequencing and may increase the sequencing cost. However, as compared to two other studies, in which the authors reported mean on-target rates of 30.7% and 32% [7,19], our linked-read target sequencing outperformed the previously published methods. Second, the design of target region and capture probe is critical to successfully conduct targeted linked-read sequencing. There is no existent recommended guideline on the design of capture probes. Additionally, it's essential to evaluate recombination hot spots surrounding the target region and include the results in the recombination adjustment [27]. Given the clinical applicability of linked-read sequencing hasn't fully characterized, more researches are required to validate the readiness and effectiveness of this technique in the future.

Conclusions
In summary, we have provided solid evidence that targeted linked-read sequencing method could be applied to the noninvasive assessment of a variety of fetal single gene diseases. The method is a cost-effective and could be widely adopted in clinical practice.