A new sequencing-based women’s health assay combining self-sampling, HPV detection and genotyping, STI detection, and vaginal microbiome analysis

The composition of the vaginal microbiome, including both the presence of pathogens involved in sexually transmitted infections (STI) as well as commensal microbiota, has been shown to have important associations for a woman’s reproductive and general health. Currently, healthcare providers cannot offer comprehensive vaginal microbiome screening, but are limited to the detection of individual pathogens, such as high-risk human papillomavirus (hrHPV), the predominant cause of cervical cancer. There is no single test on the market that combines HPV, STI, and microbiome screening. Here, we describe a novel inclusive women’s health assay that combines self-sampling with sequencing-based HPV detection and genotyping, vaginal microbiome analysis, and STI-associated pathogen detection. The assay includes genotyping and detection of 14 hrHPV types, 5 low-risk HPV types (lrHPV), as well as the relative abundance of 32 bacterial taxa of clinical importance, including Lactobacillus, Sneathia, Gardnerella, and 4 pathogens involved in STI, with high sensitivity, specificity, and reproducibility. For each of these taxa, healthy ranges were determined in a group of 50 self-reported healthy women. The hrHPV portion of the test was evaluated against the Digene High-Risk HPV HC2 DNA test with vaginal samples obtained from 185 women. Results were concordant for 181/185 of the samples (overall agreement of 97.83%, Cohen’s kappa = 0.93), with sensitivity and specificity values of 94.74% and 98.64%, respectively. Two discrepancies were caused by the Digene assay’s known cross-reactivity with low-risk HPV types, while two additional samples were found to contain hrHPV not detected by Digene. This novel assay could be used to complement conventional cervical cancer screening, because its self-sampling format can expand access among women who would otherwise not participate, and because of its additional information about the composition of the vaginal microbiome and the presence of pathogens.

denatured with sodium hydroxide, denatured viral DNA is hybridized with specific RNA 115 probes, and RNA:DNA hybrids are subsequently detected with antibodies (Lörincz 116 1996). The HC2 test detects 13 hrHPV types, but does not report which specific type is 117 present. Other HPV detection assays involve the amplification of viral DNA by 118 Polymerase Chain Reaction (PCR). The most widely used primer pairs for HPV PCR 119 detection are the GP5+/6+ primers (de Roda Husman et al. 1995) and the degenerate 120 MY09/11 primers (Manos et al. 1989)  In this study, we tested the feasibility of a novel assay, that combines the 162 detection and identification of HPV DNA, STI-associated pathogens, and microbiome 163 analysis on samples obtained through self-sampling. We validated the performance of 164 marker gene amplification and sequencing to detect the presence and relative 165 abundance of 32 clinically important bacterial targets with high precision and accuracy. 166 In addition to detecting Lactobacillus, Sneathia, and Gardnerella spp., this test detects 167 STI-associated pathogens including Chlamydia trachomatis, Mycoplasma genitalium, 168 Neisseria gonorrhoeae, and Treponema pallidum, which cause chlamydia, genital tract 169 infections, gonorrhea, and syphilis, respectively. The performance of a novel 170 amplification and sequencing-based strategy for HPV detection and type-specific 171 identification was compared to that of the most widely used test for HPV detection in 172 cervicovaginal specimens, the Digene HC2 test. This assay is intended to complement, 173 rather than replace, current healthcare guidelines for in-clinic cervical cancer screening.  Table 1 in the Supplementary Materials for more detailed   241 information about e.g. the HPV genotypes included and a list of references.

242
10 Then, assuming amplification with up to two mismatches with the primers used, 243 we identified for each taxa the sequences that would produce an amplicon, and 244 evaluated whether that amplicon is unique to the taxon of interest (ti) or also shared by 245 sequences from different taxa (dt). The number of true positives (TP), true negatives 246 (TN), false positives (FP) and false negatives (FN) was computed for different tolerance 247 ratios for the quotient dt/ti, and subsequently in silico performance metrics were 248 assessed. Of the 73 bacterial targets initially selected, the 32 targets selected for the 249 assay had all four in silico performance metrics above 90% (Supplementary Figure 2).  In order to evaluate the performance metrics for identification of the HPV targets, 265 sequences of the L1 segment of HPV genomes from the NCBI database were used. 266 The search was filtered to sequences with length in the range 1,500-10,000 bp and with 267 correct assignment of the type of the HPV (4177 sequences). These sequences were 268 amplified in silico using the primers described below. Following these steps, we 269 generated 161,398 amplicons. These sequences were mapped using Vsearch (Rognes 270 2016) at 95% of identity against an HPV amplicon reference database consisting of the 271 amplicons produced by the reference genomes in PaVe for the 19 HPV types included 272 in our assay. The performance metrics were calculated in a manner analogous to how 273 11 they were calculated for the 16S rRNA gene targets. Briefly, the correct assignment of 274 an NCBI amplicon against the reference was counted as a true positive and an incorrect 275 assignment was considered as a false negative. Also, we considered as a false 276 negatives the genomes from NCBI that our primers could not amplify. According to this, 277 the 19 HPV types obtained values for sensitivity, specificity, positive predictive value 278 (PPV) and negative predictive values (NPV) above 90% (Supplementary Table 2 To create these mixes, each bacterial sDNA was randomly assigned to one of two 307 pools, A and B, that each contained sDNAs in equimolar amount. Each pool was serially 308 diluted in PCR grade water. Pool A dilutions were mixed 1:1 with undiluted Pool B and 309 vice versa. All pool A/B combinations were used in triplicate for DNA extraction, 310 amplification, and sequencing as described below. For each target, the LOD was 311 defined as the lowest concentration of sDNA where at least two of the three replicates 312 contained at least 2 reads for that target in a sample with 10,000 reads or more. Using 313 this LOD, we calculated a lower threshold for detection for each taxa at its LOD as the 314 LOB (48.27) plus the standard deviation of the taxa at LOD * 1.65. This threshold is 315 used to correctly assign a taxa as identified in a sample at or above its LOD. 316 For targets that had both a species and a genus level sDNA present in the mixed 317 pools A and B, a bioinformatic correction was applied. The total reads for a genus-level 318 target for which a species within that genus was also present in the mixed pools, was 319 defined as the total measured reads for the genus and subtracting all those reads 320 corresponding to species-level targets belonging to that genus in the same pool mix, 321 i.e., only reads that match to a genus and not to a species level were finally assigned to 322 the genus. 323 324

In vitro validation of HPV targets 325
To test the ability of our assay to detect and genotype HPV targets, fragments of included in this study, and DNA was extracted from each spiked vaginal pool (see 334 below). Subsequently, the spiked HPV targets were detected by amplification using the 335 13 PCR targeting the L1 gene and bioinformatics pipeline described below. Each spike-in 336 experiment was performed in triplicate. Each HPV target was detected above the LOD 337 (see below) in each of the triplicate spiked-in amplification reactions performed on the 338 extracted DNA from the vaginal pool (not shown). Each target had a ratio > 0.1 for the 339 number of HPV-assigned reads divided by the total number of normalized reads 340 assigned to an internal spike-in control (see below). 341 To determine the LOD of HPV targets, 10-fold serial dilutions of the sDNAs 342 representing HPV targets were made in nuclease-free water, ranging from 10 5 to 10 2 343 molecules per µl. Dilutions of one target were inversely combined with dilutions of 344 another target, forming different pairs of HPV sDNAs. Each dilution pair was used 345 directly as template for PCR in triplicate as described below. After sequencing, demultiplexing of reads according to sample-specific barcodes 374 was performed using Illumina's BCL2FASTQ algorithm. Reads were filtered using an 375 average Q-score > 30. Forward and reverse 16S rRNA gene reads were appended 376 together after removal of primers and any leading bases, and clustered using version 377 2.1.5 of the Swarm algorithm (Mahe 2014) using a distance of one nucleotide and the 378 "fastidious" and "usearch-abundance" flags. The most abundant sequence per cluster 379 was considered the real biological sequence and was assigned the count of all reads in 380 the cluster. The representative reads from all clusters were subjected to chimera 381 removal using the VSEARCH algorithm (Rognes 2016). Reads passing all above filters 382 (filtered reads) were aligned using 100% identity over 100% of the length against the 32 383 target 16S rRNA gene sequences described above (Supplementary Table 4). The 384 relative abundance of each taxon was determined by dividing the count linked to that 385 taxa by the total number of filtered reads. 386 387

Sequence analysis and taxonomic annotation for HPV targets 388
Raw sequencing reads were demultiplexed using BCL2FASTQ. Primers were 389 removed using cutadapt (Martin 2011). Trimmomatic (Bolger 2014) was used to remove 390 reads with a length less than 125 bp, and a mean quality score below 30. After that, 391 forward and reverse paired reads were joined using custom in-house scripts and 392 converted to a fasta file. Identical sequences were merged and written to a file in fasta 393 format and sorted by decreasing abundance using --derep_fulllength option in 394 VSEARCH (Rognes 2016). Target sequences in the fasta files were compared to the 395 fasta-formatted query database sequences (19 HPV target sequences) using the global 396 pairwise alignment option with VSEARCH, using 95 percent sequence identity, to obtain 397 15 the counts for each HPV type within a different sample. 398 The HPV portion of the assay was considered positive if the number of sequence 399 reads assigned to the specific HPV types was above the threshold at the limit of 400 detection, and greater than a previously defined cutoff. To set this cutoff, two 401 normalization steps were employed. First, according to in silico PCR amplification, a 402 different number of combinations of primers amplify different HPV targets (e.g. HPV16 403 is amplified using 66 different combinations, while HPV43 is amplified with just 10 404 combinations), reflecting the sequence variability within the primer binding site among 405 HPVs. This also means that the spiked-in internal control and the target HPV have 406 different amplification efficiencies. To avoid this bias, the internal control (which has the 407 primer sites for HPV16) is normalized for the amplification factor (number of primer 408 combinations that generate an amplicon) of each HPV type. The number of HPV-409 assigned reads was divided by the total number of normalized reads assigned to the 410 spike, and a sample was considered HPV-positive if that ratio was above 0.1, which 411 was obtained from the concordance study, and corresponds to approximately 500 target 412 molecules. 413

Relative abundance of 32 bacterial targets in healthy vaginal samples 536
To determine healthy reference ranges for the bacterial targets in the assay, we 537 selected a set of 50 samples from our database. These represent self-reported healthy 538 individuals from the uBiome microbiome research study. In addition to health status, 539 additional selection criteria included no usage of antibiotics six months prior, and no 540 current urinary tract or vaginal infections, including the presence of STDs. The 50 541 samples were processed with the target database, and the relative abundance ranges 542 for each bacterial target in the cohort are shown ( Figure 5). 543 544 545 546 for HPV in both tests) ( Table 1). Three sample pairs that were positive with the Digene 592 brush were HPV negative when the corresponding test was performed on extracted 593 DNA. These three samples had an average Digene RLU ratio of 1.94, suggesting that 594 these contained low levels of HPV (Figure 7). 595 596 Table 1

Performance of the HPV sequencing test on clinical samples 614
Using 185 vaginal specimens, the performance of the assay to detect hrHPV was 615 compared to that of the Digene HC2 hrHPV assay. The Digene test was considered 616 positive if the measured RLU was equal to or greater than the assay's cutoff (RLU ratio 617 of 1 or higher), as per the manufacturer's instructions. The assay was considered to be 618 positive for hrHPV if the normalized number of reads assigned to hrHPV types divided 619 by the number of reads assigned to a spiked-in control was greater than 0.1. Of the 185 620 samples, 145 were negative in both tests, while 36 were positive in both tests (Table 2) Excellent correlation was found between the number of normalized hrHPV 639 sequencing reads and the Digene HC2 hrHPV RLU ratios, confirming that the PCR and 640 sequencing hrHPV assay described here can not only detect hrHPV types but also 641 assess their relative abundance (Figure 8). Here, we describe a novel women's health assay combining vaginal microbiome 658 analysis, STI-associated pathogen detection, and HPV detection and identification in a 659 self-sampling format. Although each of these components have been described before, 660 to our knowledge, this assay is the first to combine all of these parts, thus offering 661 women a unique opportunity to gain a broad perspective into their vaginal and 662 reproductive health. 663 The detection of hrHPV in combination with self-sampling has been proposed as 664 an effective method for cervical cancer risk screening. Although the sensitivity of signal-665 based hrHPV detection, such as the Digene assay, in self-obtained vaginal swabs has 666 been found to be slightly lower than that in clinician-obtained cervical specimens, hrHPV 667 detection based on PCR was shown to be equally sensitive in self-sampled specimens 668 (Arbyn et al. 2014). While this assay is not intended to replace regular cervical cancer 669 screening programs, offering women the opportunity to self-collect vaginal specimens 670 poses fewer barriers for women to be screened, and thus could lead to increased 671 participation rates. Therefore, encouraging women to self-collect vaginal samples for 672 hrHPV screening, as already implemented in numerous countries, and recommending 673 them to seek further physician examination in case of a positive result, while 674 encouraging them to participate in regular screening programs may have a positive 675 impact on rates of detection of cervical cancer, and potentially save lives (Wong et al. 676 2016). 677 Unlike most currently available clinical assays, this assay not only detects 678 whether hrHPV is present in a sample, but also identifies the presence of specific 679 type(s) by using sequencing analysis. This is of particular importance because the 680 prevalence of hrHPV types might shift in the setting of recently introduced HPV 681 vaccines. In the last decade, many countries, including the US, have implemented HPV 682 vaccination programs (Harper and DeMars 2017)  positive sample will test negative at retesting, and the majority of those samples will 717 29 have a low RLU ratio between 1.00 and 3.00 in the positive test. Reproducibility testing 718 using the Digene assay therefore is expected to be lowest in samples with an RLU ratio 719 near the cutoff value, and a cutoff ratio of 2 or 3 instead of 1 has been proposed to 720 serve as a better indicator for reproducible positive results (Carozzi et Moss et al. 2015). In our study, all specimens with 722 RLU ratios of 2 or higher in the direct Digene test on STM tubes were also correctly 723 identified as positive when the test was performed on extracted DNA, suggesting that 724 the Digene assay can be applied to extracted DNA as well. 725 Using extracted DNA from 185 vaginal specimens as the template, the 726 performance of the hrHPV sequencing assay was compared to that of the Digene HC2 727 hrHPV assay. The hrHPV sequencing assay had excellent correlation with the Digene 728 assay, with a sensitivity and specificity of 94.74% and 98.64%, respectively, and a 729 Cohen's kappa of 0.93. In addition, the two tests were in good general agreement about 730 the relative amount of HPV molecules detected. Of the two samples that were reported 731 positive by the Digene assay but that did not contain hrHPV sequences as determined 732 by our assay, both of them were found to contain lrHPV types. Cross-reactivity of the 733 Digene HC2 hrHPV probe mix with lrHPV sequences such as 30  In conclusion, we here present a women's health assay that for the first time 765 combines the detection of the most important bacterial and viral indicators of vaginal 766 health and disease. We envision that this test will greatly help women to learn about 767 their vaginal microbiome, encourage them to participate in existing cervical screening 768 programs because it allows for self-sampling, and assist their doctors to more 769 accurately diagnose and treat diseases of the genital tract. 770 771