Noninvasive prenatal testing of α-thalassemia and β-thalassemia through population-based parental haplotyping

Noninvasive prenatal testing (NIPT) of recessive monogenic diseases depends heavily on knowing the correct parental haplotypes. However, the currently used family-based haplotyping method requires pedigrees, and molecular haplotyping is highly challenging due to its high cost, long turnaround time, and complexity. Here, we proposed a new two-step approach, population-based haplotyping-NIPT (PBH-NIPT), using α-thalassemia and β-thalassemia as prototypes. First, we deduced parental haplotypes using Beagle 4.0, trained on a large retrospective carrier screening dataset (4356 thalassemia carrier screening-positive cases). Second, we inferred fetal haplotypes using a parental haplotype-assisted hidden Markov model (HMM) and the Viterbi algorithm. With this approach, we enrolled 59 couples at risk of having a fetus with thalassemia and successfully inferred 94.1% (111/118) of fetal alleles. We confirmed these alleles by invasive prenatal diagnosis, with 99.1% (110/111) accuracy (95% CI, 95.1–100%). These results demonstrate that PBH-NIPT is a sensitive, fast, and inexpensive strategy for NIPT of thalassemia.

Current approaches for NIPT of recessive diseases are typically classified into two categories [12,13]: relative mutation dosage (RMD) analysis [14] and relative haplotype dosage (RHDO) analysis [15]. The RMD approach focuses on quantitative comparisons between variant and wild-type alleles present in cfDNA and has relatively high sensitivity and specificity [14,16]. This approach is powerful for detecting single nucleotide variants (SNVs) and small insertions/deletions (InDels) but usually cannot detect large InDels and copy number variants (CNVs) [17-19]. Its performance is also affected by sequencing errors and amplification bias of low-abundance fetal variants in cfDNA [20]. Unlike the RMD approach, the RHDO approach determines the relative proportions of variant and normal haplotypes in maternal plasma [21] and can theoretically detect most types of variants, including large InDels and CNVs, in one test [16,22]. However, RHDO analysis requires parental haplotype information [23]. Molecular phasing approaches to determine parental haplotypes, including linked-read sequencing [24,25] and targeted locus amplification (TLA) [26], have not been widely used in clinical settings due to their high cost and complex procedures [27-32]. Population-based parental haplotyping provides an alternative because of its rapid turnaround and its inexpensive and relatively simple procedures. However, the use of this method has been limited to a founder variant (GBA gene, c.1226A>G) [33].
In the present study, we proposed a novel population-based haplotyping-NIPT method (PBH-NIPT) for α-thalassemia and β-thalassemia in which nonfounder variants were detected when the sample size of the reference panel (population data used to infer parental haplotypes) was sufficiently large for accurate deduction of parental haplotypes. The PBH-NIPT model was trained on a large retrospective carrier screening dataset, and its accuracy was verified via invasive prenatal diagnosis. In addition, we assessed the effect of the reference panel sample size on the outcomes of PBH-NIPT.

Patients and samples
The ethics committees of Guangzhou Women and Children's Medical Center and BGI approved this study (approval numbers: 2017102408 and BGI-IRB 18043). Fifty-nine couples at risk of having a fetus with thalassemia provided written informed consent. The clinical features of the participants are provided in the supplement (Additional file 2: Table S1). We collected 5 ml of blood from each parent. We promptly isolated maternal plasma using a two-step centrifugation method [37]. We used 10 ml of amniotic fluid (AF) or 5 mg of chorionic villus sample (CVS) for invasive prenatal diagnosis.

Sequencing library preparation
We extracted cfDNA from maternal plasma using a QIAamp Circulating Nucleic Acid Kit (Qiagen, Dusseldorf, Germany) and extracted parental gDNA from peripheral blood and fetal DNA from CVS or AF using a QIAamp DNA Mini Kit (Qiagen).
We used gDNA (500 ng) for library construction and fragmented it ultrasonically with a Bioruptor Pico (Diagenode, Liege, Belgium), yielding 300-700-bp fragments. We then performed end repair, phosphorylation, and A-tailing reactions on the sheared DNA and ligated BGI-SEQ adaptors with specific barcodes to the A-tailed products. We performed 4-6 cycles of polymerase chain reaction (PCR) amplification to enrich the target regions and performed hybridization capture according to the NimbleGen protocols after pooling twenty barcoded gDNA libraries in equal amounts. Finally, we performed circularization of the post-capture library to generate circular single-stranded DNA (ssDNA). We prepared the maternal plasma DNA library using the same method except without fragmentation and pooled eight cfDNA libraries in equal amounts. After quantitation using Qubit 3.0 (Thermo Fisher, Waltham, USA), we used rolling circle replication to form DNA nanoballs (DNBs) from the ssDNA and loaded each DNB into 1 lane to be processed for 100-bp paired-end sequencing on the BGISEQ-500 and MGISEQ-2000 platforms (BGI, Shenzhen, China).

Reference panel construction
We generated the reference panel from 4356 thalassemia carrier screening-positive cases. Of the total 4356 cases, 3867 were obtained from our previously published paper [35], and 489 were obtained from unpublished in-house data.
We first used our previously published algorithm [35] to call SNPs from the 4356 positive carriers. We then filtered out SNPs that had a sequencing depth of less than 20-fold in more than 2% of the population or an allele ratio between 5 and 40% in more than 70% of heterozygous individuals in the population. We used the publicly available software Beagle (version 4.0) to construct haplotypes for the 4356 individuals and used these data as the reference panel for the next step. Since Beagle accepts SNPs and InDels as input, we treated CNVs as SNPs in the phasing procedure. Each CNV was represented as a SNP record in the Beagle input file (VCF format), where the genomic position is the start position of the CNV and the genotypes "0/1" and "1/1" represent heterozygous and homozygous CNVs, respectively.
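The two SNP filters and the CNV encoding described above can be sketched as follows. This is an illustrative sketch, not the authors' pipeline: the function names, the per-sample record layout, the inclusive reading of the 5-40% allele ratio bounds, and the placeholder REF/ALT bases used for CNV records are all assumptions.

```python
def keep_snp(samples, min_depth=20, max_low_depth_frac=0.02,
             ratio_range=(0.05, 0.40), max_skewed_het_frac=0.70):
    """Return True if a SNP passes both population-level filters.

    `samples` is a list of dicts with keys 'depth' (int), 'genotype'
    ('0/0', '0/1' or '1/1') and 'alt_ratio' (ALT reads / total reads).
    This record layout is hypothetical.
    """
    if not samples:
        return False
    # Filter 1: too many individuals covered at less than 20-fold.
    low_depth = sum(s["depth"] < min_depth for s in samples)
    if low_depth / len(samples) > max_low_depth_frac:
        return False
    # Filter 2: too many heterozygotes with a skewed allele ratio.
    hets = [s for s in samples if s["genotype"] == "0/1"]
    if hets:
        lo, hi = ratio_range
        skewed = sum(lo <= s["alt_ratio"] <= hi for s in hets)
        if skewed / len(hets) > max_skewed_het_frac:
            return False
    return True


def cnv_to_vcf_fields(chrom, cnv_start, copies_affected):
    """Encode a CNV as the fields of a SNP-style VCF data line.

    `copies_affected` is 1 (heterozygous) or 2 (homozygous). REF and ALT
    are placeholder bases chosen for illustration; the source does not
    say which symbols were used.
    """
    genotype = {1: "0/1", 2: "1/1"}[copies_affected]
    return [chrom, str(cnv_start), ".", "A", "T", ".", "PASS", ".", "GT", genotype]
```

Joining the returned fields with tab characters gives one data line that a VCF reader would treat like any other biallelic site, which is what allows a phasing tool to place the CNV on a haplotype alongside the flanking SNPs.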

Construction of parental haplotypes by PBH
We aligned the sequence reads from parental gDNA and maternal plasma DNA to the reference human genome (hg19) using BWA version 0.7.12. We marked duplicate reads with Picard version 1.87 and performed variant calling as previously described [35]. We also treated CNVs as SNPs in the phasing procedure. We used the haplotypes of the reference panel and the genotypes of the parents as inputs to deduce parental haplotypes with Beagle 4.0. Finally, we used only heterozygous SNPs to represent parental haplotypes.

NIPT of thalassemia
We calculated the fetal fraction (FF) as described in Additional file 3 and inferred fetal haplotypes inherited from the father and mother separately. First, we determined paternal inheritance using paternal informative SNPs, which were heterozygous in the father but homozygous in the mother. Second, we determined maternal inheritance using maternal informative SNPs, which included two types of SNPs: (1) SNPs heterozygous in the mother but homozygous in the father and (2) SNPs heterozygous in both parents within blocks where the first step had already inferred the haplotype inherited from the father. Because informative SNPs linked to the inherited haplotype are overrepresented in maternal plasma, we applied the hidden Markov model (HMM) and Viterbi algorithm [38] to determine the fetal genotypes of pathogenic sites (Additional file 3: Supplementary Methods). For samples with CNVs, we excluded all SNPs in the CNV region from the informative SNPs used for Viterbi decoding.
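The following minimal sketch illustrates the two ideas in this paragraph: classifying SNPs by parental genotype, and decoding maternal inheritance with a two-state HMM and the Viterbi algorithm. It is not the authors' model, which is specified in Additional file 3. The binomial emission based on RHDO-style allelic imbalance, the `switch_prob` value, and all function names are assumptions made for illustration, and only type (1) maternal informative SNPs are decoded.

```python
import math


def classify_snp(father_gt, mother_gt):
    """Label a SNP by parental genotypes ('0/0', '0/1' or '1/1')."""
    father_het = father_gt == "0/1"
    mother_het = mother_gt == "0/1"
    if father_het and not mother_het:
        return "paternal_informative"
    if mother_het and not father_het:
        return "maternal_informative"
    if father_het and mother_het:
        return "both_heterozygous"  # usable once paternal inheritance is known
    return "uninformative"


def expected_hap0_fraction(state, paternal_match, fetal_fraction):
    """Expected plasma fraction of the allele on maternal haplotype 0.

    `state` is the maternal haplotype (0 or 1) inherited by the fetus;
    `paternal_match` is the maternal haplotype whose allele the
    homozygous father shares.
    """
    fetal_copies = (state == 0) + (paternal_match == 0)
    return 0.5 * (1 - fetal_fraction) + 0.5 * fetal_fraction * fetal_copies


def viterbi_maternal(snps, fetal_fraction, switch_prob=0.01):
    """Most likely inherited maternal haplotype at each SNP.

    `snps` is a list of (hap0_reads, total_reads, paternal_match) tuples
    ordered by genomic position.
    """
    def log_emit(state, snp):
        k, n, match = snp
        p = expected_hap0_fraction(state, match, fetal_fraction)
        # Binomial log-likelihood; the binomial coefficient is identical
        # for both states and is therefore omitted.
        return k * math.log(p) + (n - k) * math.log(1 - p)

    log_stay, log_switch = math.log(1 - switch_prob), math.log(switch_prob)
    score = [math.log(0.5) + log_emit(s, snps[0]) for s in (0, 1)]
    back = []
    for snp in snps[1:]:
        new_score, pointers = [], []
        for s in (0, 1):
            cands = [score[prev] + (log_stay if prev == s else log_switch)
                     for prev in (0, 1)]
            best_prev = 0 if cands[0] >= cands[1] else 1
            pointers.append(best_prev)
            new_score.append(cands[best_prev] + log_emit(s, snp))
        score = new_score
        back.append(pointers)
    # Trace back from the best final state.
    state = 0 if score[0] >= score[1] else 1
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]
```

With a fetal fraction of 0.2, a SNP where the father shares the haplotype-0 allele is expected to show about 60% haplotype-0 reads if the fetus inherited haplotype 0 and about 50% otherwise. The transition penalty lets the decoder tolerate a noisy SNP without switching haplotypes, which is the practical reason for using an HMM rather than per-SNP calls.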

Invasive prenatal diagnosis of thalassemia
We performed invasive prenatal diagnosis via chorionic villus sampling or amniocentesis in accordance with standard protocols. We determined fetal genotypes through gap-PCR and reverse dot blot PCR (RDB-PCR).

The effect of the reference panel sample size on the outcomes of PBH-NIPT
To assess the effect of the reference panel sample size on the outcomes of PBH-NIPT, we randomly selected one-half, one-quarter, one-sixth, one-eighth, one-twelfth, and 50 samples from the total reference panel and performed three independent tests.
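The subsampling design can be sketched as below. The function name, the fixed seed, and rounding fractional sizes down are assumptions; the source does not state how the random draws were made or how non-integer sizes were handled.

```python
import random


def subsample_panel(sample_ids, n_replicates=3, seed=0):
    """Draw random subsets of the reference panel at each tested size."""
    rng = random.Random(seed)
    total = len(sample_ids)
    # Fractions named in the text, plus the fixed subset of 50 samples.
    sizes = [total // d for d in (2, 4, 6, 8, 12)] + [50]
    return {size: [rng.sample(sample_ids, size) for _ in range(n_replicates)]
            for size in sizes}
```

For a panel of 4356 samples, this yields subsets of 2178, 1089, 726, 544, 363, and 50 samples, each drawn three times without replacement.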

Results
As shown in Fig. 1, the PBH-NIPT workflow involves the following steps. First, we generated the reference panel from 4356 thalassemia carrier screening-positive cases. Of the total 4356 cases, 3867 were obtained from our previously published paper [35], and 489 were obtained from unpublished in-house data. Second, we enrolled 59 couples in whom both partners carried at least one of the 10 aforementioned variants and were at risk of having a fetus with thalassemia major or intermedia [39] (Additional file 2: Table S1). The average gestational age at the time of collection was 12.6 +3 weeks (range 10 +1 -22 weeks), and the average FF was 15.4% (range 6.0-26.1%) (Additional file 4: Table S2). We subjected genomic DNA (gDNA) of the couples and fetuses as well as maternal cfDNA to hybridization-based capture and sequencing using a strategy previously described for thalassemia carrier screening [35]. We obtained an average target region coverage of 177-fold (range 56-678) in maternal plasma and 203-fold (range 85-360) in parental gDNA. Third, we inferred parental haplotypes by PBH (see the "Methods" section and Fig. 1). To evaluate the reliability of PBH, we also constructed parental haplotypes by family-based haplotyping (FBH) and calculated the percentage of concordant single-nucleotide polymorphisms (SNPs) phased by these two methods (Additional file 3: Supplementary Methods). The average concordance rates of phased SNPs in the maternal and paternal haplotypes were 98.7% (range 87.5-100%) and 95.7% (range 59.2-100%), respectively (Additional file 5: Fig. S2; Additional file 6: Table S3).
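As a rough illustration of how a concordance rate between two phasing methods could be computed, the sketch below compares PBH and FBH phase assignments at shared heterozygous SNPs. The exact definition used in the study is given in Additional file 3; the label-swap handling and the function name here are assumptions.

```python
def phasing_concordance(pbh_phase, fbh_phase):
    """Fraction of heterozygous SNPs phased identically by two methods.

    Each argument lists, per SNP, which haplotype (0 or 1) carries the
    ALT allele. Haplotype labels are arbitrary, so the comparison is
    made for both labelings and the better one is reported.
    """
    if len(pbh_phase) != len(fbh_phase) or not pbh_phase:
        raise ValueError("phase lists must be non-empty and equal in length")
    same = sum(a == b for a, b in zip(pbh_phase, fbh_phase))
    return max(same, len(pbh_phase) - same) / len(pbh_phase)
```

If the haplotypes are anchored to the pathogenic variant rather than labeled arbitrarily, the `max` step would be dropped and `same / len(pbh_phase)` reported directly.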
To correctly infer fetal genotypes of pathogenic sites (rather than all SNPs), we developed a hidden Markov model (HMM) and used the Viterbi algorithm. We calculated a confidence score (CS), defined as the probability of obtaining the correct NIPT result, to evaluate the reliability of each prediction. A "no-call" condition was defined when (1) the CS was less than 0.99 or (2) the inferred haplotype contained two haplotype blocks (pathogenic and normal), and neither block spanned the target gene (HBB or HBA) (Additional file 3: Supplementary Methods). Accordingly, NIPT successfully inferred 111/118 (94.1%) alleles, and invasive prenatal diagnosis confirmed these alleles, with 99.1% (110/111 alleles) accuracy (95% CI, 95.1-100%) (Table 1, Fig. 2, and Additional file 7: Fig. S3). Among these 59 fetuses, 52 had both alleles detected; of these 52 fetuses, 15 were normal, 25 were carriers, and 12 were affected. Seven fetuses had only one allele successfully detected; inference of the other allele failed, with a CS of less than 0.99 (Table 1 and Fig. 2). Among the 7 fetuses with only one allele inferred by NIPT, 6 inherited the pathogenic allele. These 6 fetuses therefore required invasive prenatal diagnosis, which showed that 4 were affected and 2 were carriers.
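The reported interval can be checked with an exact binomial calculation. The source does not state which interval method was used; the sketch below assumes a two-sided Clopper-Pearson interval, which reproduces a lower bound of about 95.1% for 110/111. The function names are illustrative.

```python
from math import comb


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))


def clopper_pearson(successes, n, alpha=0.05):
    """Exact two-sided confidence interval for a binomial proportion."""
    def solve(f, target):
        # Bisection on p in [0, 1]; f is decreasing in p.
        lo, hi = 0.0, 1.0
        for _ in range(100):
            mid = (lo + hi) / 2
            if f(mid) > target:
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2

    # Lower bound: P(X >= successes | p) = alpha / 2.
    lower = 0.0 if successes == 0 else solve(
        lambda p: binom_cdf(successes - 1, n, p), 1 - alpha / 2)
    # Upper bound: P(X <= successes | p) = alpha / 2.
    upper = 1.0 if successes == n else solve(
        lambda p: binom_cdf(successes, n, p), alpha / 2)
    return lower, upper
```

Calling `clopper_pearson(110, 111)` gives a lower bound near 0.951 and an upper bound above 0.999, which rounds to the 95.1-100% reported in the text.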
To evaluate the relationship between the accuracy of NIPT and the reference panel sample size, we randomly selected one-half, one-quarter, one-sixth, one-eighth, one-twelfth, and 50 samples from the total reference panel and performed three independent tests. As expected, in the 52 fetuses in whom NIPT inferred both alleles, the NIPT outcome improved as the reference panel sample size increased (Fig. 3). Reduction of the sample size to one-half of the total reference panel yielded accuracies of NIPT of approximately 89.3% for β-thalassemia and 95.1% for α-thalassemia relative to the invasive prenatal diagnosis results.

Discussion
This study demonstrated the feasibility of PBH-NIPT for thalassemia. PBH-NIPT can be used after carrier screening for thalassemia. For high-risk couples reluctant to undergo an invasive procedure, PBH-NIPT is a more attractive option, requiring only a simple blood draw from the pregnant woman. In most cases, namely when NIPT infers that the fetus is a carrier or normal, no further confirmation is needed. When NIPT detects an affected fetus (12 cases) or detects only one pathogenic allele (6 cases), invasive prenatal diagnosis is recommended. In our study, PBH-NIPT reduced the number of invasive prenatal diagnoses required by approximately 69.5% (from 59 to 18 fetuses).
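The 69.5% figure follows directly from the case counts above, as this short calculation shows. The function name is illustrative.

```python
def invasive_reduction(total_fetuses, affected_calls, single_pathogenic_calls):
    """Fraction of fetuses that avoid invasive diagnosis after PBH-NIPT."""
    still_invasive = affected_calls + single_pathogenic_calls
    return (total_fetuses - still_invasive) / total_fetuses


reduction = invasive_reduction(59, 12, 6)
print(f"{reduction:.1%}")  # 69.5%
```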
Here, NIPT successfully inferred 94.1% of the fetal alleles (111/118) from the 59 fetuses. In all 7 no-call cases, fewer than 3 informative SNPs were available. This problem can be resolved by increasing the number of informative SNPs flanking the target gene through expansion of the target region [33]. Moreover, this study demonstrated that the reference panel size could affect the performance of NIPT. However, the reference panel size is not a limiting factor since large-scale expanded carrier screening for recessive monogenic disorders is common in clinical practice [40].
This study aimed to evaluate and provide a simple, fast, and inexpensive NIPT method for thalassemia. Compared with the current linked-read sequencing-based NIPT method, which requires 15-20 days (10 days ...) [25], PBH-NIPT requires only 5-7 days (4-5 days of wet lab work and 1-2 days of data analysis). Training PBH on a large reference panel requires only a few minutes [41]. The PBH-NIPT method costs approximately 80 dollars, as estimated in the supplement (Additional file 8: Table S4). After including the cost of invasive prenatal diagnosis (~1000 dollars/sample [42]) in 6 cases, the actual cost of PBH-NIPT per sample is 174 dollars, which is substantially less than the costs of molecular haplotyping (1500 dollars/sample [25]) and invasive prenatal diagnosis (~1000 dollars/sample [42]).
This study has two limitations. First, since all 59 families and the training reference data were from southern China, the test cannot identify individuals whose ethnic backgrounds differ from that of the training population. Currently, we can only consult ethnic information based on self-reports before testing. A potentially good solution would be to add a quantifiable QC parameter to provide guidance for the reliability of the test. Therefore, we will consider including SNPs that are able to distinguish ethnic information when designing the next version of the probe [43]. Second, the population frequency of these 10 variants was 0.15-2.66% in our dataset [35], and more data are needed to validate whether PBH-NIPT is able to detect variants with lower frequencies.
Table 1 footnotes: CS mat, confidence score for fetal inheritance from maternal haplotype; CS pat, confidence score for fetal inheritance from paternal haplotype; M p, maternal pathogenic haplotype; P p, paternal pathogenic haplotype; M n, maternal normal haplotype; P n, paternal normal haplotype; SNPs for M p/P p, the number of informative SNPs that supported fetal inheritance from parental pathogenic haplotypes; SNPs for M n/P n, the number of informative SNPs that supported fetal inheritance from parental normal haplotypes. *No-call: confidence score less than 0.99. **The NIPT result of maternal inheritance for F55 was inconsistent with the invasive prenatal diagnosis result.

Conclusions
In summary, we developed and verified PBH-NIPT, a novel method for prenatal testing of α-thalassemia and β-thalassemia. Compared with invasive prenatal diagnosis, this method achieved 99.1% accuracy (95% CI, 95.1-100%). Therefore, we propose that this strategy might be extended to detect variants in addition to single-haplotype founder variants in other recessive monogenic disorders.
Additional studies with larger sample sizes are required to confirm the application and performance of PBH-NIPT for other populations and variants with lower frequencies.
Fig. 2 Outcomes of PBH-NIPT. We performed PBH-NIPT on 59 couples. Fifty-nine fetuses had 111 total alleles confirmed by invasive prenatal diagnosis, with 99.1% accuracy (95% CI, 95.1-100%). Fifty-two fetuses had both alleles detected for a total of 104 alleles; 7 fetuses had only one allele detected.
Fig. 3 Effect of the reference panel sample size on the outcomes of PBH-NIPT. We randomly extracted one-half, one-quarter, one-sixth, one-eighth, one-twelfth, and 50 samples from the total reference panel and performed three independent tests for (a) β-thalassemia and (b) α-thalassemia. In the 52 fetuses in whom NIPT successfully inferred both alleles, the outcome of NIPT improved as the sample size of the reference panel increased.