PCR diagnostics: In silico validation by an automated tool using freely available software programs

Highlights • In silico validation of PCR tests using exponentially expanding databases.• The need of regular in silico validation of PCR tests by expanding databases.• Fulfilling quality standards of in silico validation of molecular diagnostics.


Introduction
Pathogens exhibit genetic variation as a result of genetic drift, adaptation and evolution, but also by random variation. Since the late nineties of the 20 th century, due to the improved sequencing techniques and high throughput sequencing machines, the number of sequences submitted to databases like GenBank ® has increased exponentially. This results in an enormous increase of identified variants and quasi-species as well as sequences of newly discovered pathogens from all over the world. A few examples are the discovery of coronaviruses causing Severe Acute Respiratory Syndrome (SARS) and Middle East Respiratory Syndrome (MERS), Nipah and Hendra viruses, atypical pestiviruses, atypical and new serotypes of bluetongue virus, Schmallenberg virus and new variants of avian influenza viruses (Chua et al., 2000;Demmler and Ligon, 2003;Drosten et al., 2003;Hoffmann et al., 2012;Hofmann et al., 2008;Maan et al., 2011;Marcacci et al., 2018;Schirrmeier et al., 2004;van Boheemen et al., 2012;Wang, 2011;Zientara et al., 2014).
Currently, in many countries, the first line of pathogen detection is real-time PCR diagnostics. Favourably, PCR tests can be highly sensitive and specific, and are often designed to detect all variants of a defined family, genus or species, while not detecting closely related pathogens. In addition, PCRv can also be used to validate in silico PCR assays that differentiate between lineages, serotypes or variants. Therefore, PCR targets must be unique, and highly conserved. Nonetheless, false negative results can arise by genetic drifting or by emergence of new variants, while false positive results can be caused by new variants of closely related pathogens. It is therefore important to frequently reevaluate and, if necessary, redesign PCR tests taking sequences of newly discovered pathogen variants into account.
Validation of PCR diagnostics should be organized in three stages, in silico, in vitro and in vivo validation. In silico validation covers the study on inventory of matching and non-matching sequences of the PCR target sequence in a nucleotide database. Matching sequences enable in silico sensitivity (detection of all variants), while non-matching sequences support in silico specificity (selective detection of variants of the respective group of pathogens). In vitro and in vivo validation include testing of cultured pathogens, and field samples of defined positive and negative status. In vitro and in vivo validation for all virus variants is practically impossible and extremely costly. Even more, not every pathogen variant has been cultured or isolated, and transport and handling of pathogens could imply safety issues. In contrast, sequences are rapidly becoming available by high throughput and deep sequencing, even without culturing of pathogens. Therefore, in silico re-evaluation of validated PCR diagnostics is and will be an attractive alternative to obtain detailed insight in detection of circulating and (re-) emerging virus variants, and should be frequently executed. It will however become an increasing task due to the rapid increase of available sequences and full genome sequences of numerous species.
We developed a software tool named PCRv to facilitate in silico validation of PCR tests entirely based on freely available software programs. PCRv links freely available software programs to automate the whole process, reduces labour, and generates a validation report that includes a brief summary as well as a list of detailed results.

Methods
The software tool PCRv is written in the Python programming language. PCRv consists of a user friendly graphical user interface and coordinates the use of software programs ClustalW2.1 (Larkin et al., 2007;Thompson et al., 2002) and SSEARCH (Brenner et al., 1998;Pearson, 1991;Pearson et al., 2017; to perform in silico validation. PCRv is suitable to determine the in silico sensitivity (conservation of sequences) and in silico specificity (selectivity) of different PCR formats. To monitor the performance of PCRv, a set of flagged internal control sequences (FICS) are randomly added to the sequence database. PCRv processes data and analyses results, and generates a validation report that includes a summarizing table as well as a list of detailed results for an easy check of potential false positives and false negatives. An overview of all actions executed by PCRv is shown in Fig. 2.

In silico sensitivity
The sequences of a target organism are downloaded from the National Center for Biotechnology Information (NCBI) database (https://www.ncbi.nlm.nih.gov/nuccore/) by using the respective taxonomy ID number as search query. This guarantees that all available sequences of the defined taxon in the database are downloaded. To generate a multiple sequence alignment (MSA) of these sequences, a full genome sequence was selected as a reference sequence. Genome segments of pathogens with a segmented genome were concatenated to serve as an artificial full length genome. If a full genome sequence was not available, a representative large sequence of the taxon was selected as a reference sequence. A prerequisite is that this partial sequence contained the full target of the PCR test being validated. In order to drastically reduce computing time, pairwise alignments were calculated for each downloaded sequence to the reference sequence by using software program ClustalW 2.1 (Larkin et al., 2007). To correct for orientation errors in the database sequences, alignment in the reverse complement orientation was also attempted. A score was calculated using a scoring scheme as follows: match (+1), mismatch (-2), point deletion or gap (-3), every next adjacent point deletion (-2). The aligned orientation with the highest score was selected. To enable efficient alignment of large sequences, these large sequences were segmented in fragments of 10,000 nucleotides in length and individually aligned to the reference sequence and subsequently combined into one pairwise alignment. PCRv combined all individual pairwise alignments into one multiple sequence alignment (MSA), including the pairwise alignments of primers and probes. The calculation of the MSA was performed by a computer with an Intel ® Xeon(R) CPU E5-1650 v2 @ 3.50 GHz processor and 16 Gb of internal computer memory. The regions corresponding to primers and probes were selected from the MSA to construct a conservation plot sorted in decreasing total number of mismatches. The in silico sensitivity was expressed as the percentage of hits with a cut-off value of a maximum of one mismatch per primer or probe.

In silico specificity
The entire nucleotide sequence database (compressed gzip file: nt.gz) was downloaded from the NCBI FTP-website (ftp:// ftp.ncbi.nlm.nih.gov/blast/db/) using PCRv. The integrity of the download was confirmed by calculation of the MD5 checksum and subsequent comparison with the checksum published on the FTP-website (file nt.gz.md5). PCRv processed the data stream during download by several optimizations to improve the analysis. Nucleotide code 'N' was replaced by the meaningless code 'Z', which prevents infinite number of hits by the alignment search. The data stream was unpacked and subdivided into multiple fasta formatted text files. Fasta files with a maximum size of 500 MB were sequentially numbered and stored because the NCBI nucleotide database is too large to be analysed all at once. To increase the accuracy of the alignment search (see Discussion), large sequences were fragmented in sequences of maximal 3000 nucleotides with an overlap of 50 nucleotides to prevent the loss of hits of primer or probe sequences spanning the split site. Fragmented sequences were tagged with a unique code allowing reconstruction of the original sequence. Any nucleotide database in fasta format is compatible and could be added. Flagged internal control sequences (FICS) were added to enable validation of the alignment search. FICS consisted of randomly generated sequences of 3000 nucleotides in length containing primer and probe sequences of the PCR test being validated. Primer and probe sequences were inserted in all possible combinations and orientations potentially initiating amplification ( Fig. 1). Multiple copies of each combination were inserted with an increasing number of randomly introduced mismatches from 0-10 in each primer and probe sequence ( Fig. 1). In total, ten copies of each control sequence per number of mismatches were linearly spread in each 500 MB fasta file. An alignment search was performed with the default expectancy threshold value on all fasta files using primers and probes of the PCR test as search queries and the program SSEARCH available in the FASTA sequence analysis package (Brenner et al., 1998;Pearson, 1991;Pearson et al., 2017;. PCRv produced a list of hits of the alignment search of all possible primer/probe combinations potentially leading to detectable amplicons. Hits of FICS were stored separately. The percentage of returned hits of control sequences with an increasing number of mismatches was indicative for the sensitivity and accuracy of the alignment search per 500 MB fasta file. The maximum number of returned mismatches in the control sequences was determined by use of the Spearman-Kärber method and demonstrated the validity of the computing process (Wulff et al., 2012). An aborted search caused by an unknown error was visible by the incompleteness of returned FICS. If the accuracy of the alignment search was not acceptable, the alignment search was repeated with a higher expectancy threshold value, which usually resulted in a longer analysis time. The specificity check was limited to a maximum of 5000 nucleotides in amplicon length and up to four mismatches per primer or probe. This limitation was however not applied to the FICS in order to

Fig. 1. Flagged internal control sequences (FICS). A)
FICS consist of randomly generated sequences of 3000 nucleotides in length containing the primer and probe sequence of the PCR test being validated. Multiple copies were inserted with an increasing number of randomly introduced mismatches from 0-10 in each primer and probe sequence. Ten copies of each FICS per number of mismatches were linearly spread in each 500 MB fasta file. B) Overview of all eight possible combinations of positional orientations of forward primer (FWD), reverse (REV) primer and probe used as FICS which are all capable of initiating an (nonspecific) amplification reaction in combination with a detectable probe signal. Combinations of primers and probes according to other PCR formats (e.g. nested PCR, PCR using hybridisation probes or hydrolysis probe) are also supported by PCRv but are not shown. fully ascertain the validity of the executed alignment search. Hits were interpreted as specific or nonspecific according to the taxonomy classified sequences as used to generate the MSA. The in silico specificity is expressed as the percentage of specific hits of taxonomy classified sequences with a maximum of one mismatch per primer or probe as these are considered to be detected with the respective PCR test.

Results
To demonstrate the suitability of our in-house developed software tool PCRv, we determined the in silico sensitivity and specificity of three PCR tests for West Nile virus (WNV) recommended by the World Organisation for Animal Health (OIE) (Eiden et al., 2010;Johnson et al., 2001). These WNV PCR tests represented three different formats; a real-time PCR test, a conventional PCR test and a nested PCR test (Table 1).

In silico sensitivity
Available West Nile virus nucleotide sequences were downloaded from the NCBI website using taxonomy ID 11,082 (search query NCBI:txid11082 on January 15 th , 2019). In total, the download contained 20,964 WNV sequences. A MSA was calculated using the full genome sequence with accession number NC_009942 as a reference sequence (Borisevich et al., 2006). Primer and probe sequences were included in the alignment. The calculation of the MSA with PCRv was completed in about 4.5 h. A limited number of 10-15% of the aligned sequences encompassed the locations of primers or probes of the selected OIE-recommended WNV PCR tests. The regions corresponding to primers and probes were taken from the alignment in order to construct a conservation plot. Detailed results were sorted according to the number of mismatches to easily select individual sequences with > 1 mismatch in order to check their origin (Supplemented data A). Note, sequences incorrectly classified as WNV as well as synthetically derived sequences should be discarded as these are irrelevant. Results of the conservation plot were summarized according to the number of mismatches to a maximum of four mismatches per primer or probe ( Table 2).The overall in silico sensitivity of each PCR test was calculated and expressed as the percentage of sequences with a maximum of one mismatch per primer or probe. The real time PCR test for WNV showed the highest in silico sensitivity of 98.8% (83.3%+15.47%). The conventional and nested PCR tests showed an in silico sensitivity of 87.1% and 86.5%, respectively.

In silico specificity
The entire nucleotide sequence database from the NCBI FTP-website was downloaded as a compressed gzip file (nt.gz) of 502 GB on January 7 th , 2019. The download was valid according to the calculated MD5  (Johnson et al., 2001), and the real time PCR test have been described (Eiden et al., 2010).  sequences. An alignment search with primer and probe sequences was performed with a cut-off expectation value E of 5000. The search per PCR test was completed in less than two hours. About 3.7-6.9 million individual primer and probe alignment hits were found and processed by PCRv as described (Fig. 2). FICS were found homogeneously in all 371 database files indicating that the alignment search was completed properly. FICS for each PCR test were returned with a mean of 3.7-4.3 mismatches per primer or probe demonstrating completeness and acceptable accuracy of the alignment search (Table 2). Potential amplicons were interpreted as specific or non-specific according to the presence of its NCBI accession number in the list of sequences as used for the in silico sensitivity check (Table 2). We noticed that the number of specific hits differed from the numbers as scored by the in silico sensitivity check (Table 2). However, several reasons for this apparent inconsistency can be considered, see Discussion. In summary, using WNV PCR tests as an example, PCRv easily determined the in silico sensitivity and specificity of these PCR tests of different formats in a highly automated manner. All results are included in the validation report generated by PCRv, such as a summarizing table of results, conservation plot and a list of nonspecific hits. The summarizing table clearly demonstrates the differences of the in silico sensitivity and specificity between these PCR tests (Table 2). In addition, the detailed conservation plot (Supplemented data A) and detailed list of nonspecific hits up to 4 mismatches per primer or probe (Supplemented data B) support manual check of individual sequences on correctness, background, submission details, and other information.

Conclusion and discussion
Validation of diagnostics by testing all variants of a target pathogen in cultured or field samples, named in vitro and in vivo validation, respectively, is hardly feasible. Because of the availability of sequences of pathogens in databases, checking conservation and uniqueness of primer and probe sequences, so-called in silico validation, has become an attractive and reliable alternative to (re-) evaluate specificity and sensitivity of molecular diagnostics. Exponential expansion of available sequences, genetic drift of pathogens, and discovery of new pathogens drive the need to frequently validate established PCR tests. This, however, will also become an increasing significant effort. We automated the in silico validation process by integrating freely available software programs into a single tool named PCRv. Public databases, such as NCBI as well as other available databases and sequences formatted in single sequence fasta files are compatible with PCRv.
PCRv generates a multiple sequence alignment (MSA) using a selected reference sequence, which is preferably a full length genome but at least a partially large sequence encompassing the PCR target. Software program ClustalW2.1 (Larkin et al., 2007) is used to calculate pair-wise alignments of each sequence to the reference sequence, and subsequently a MSA is generated using these pair-wise alignments. This strategy exponentially reduces calculation time, in particular for large numbers of sequences. Additionally, more than one reference sequence could be used to improve the generation of a MSA in case of extreme variability among a group of pathogens. The MSA is used to determine the in silico sensitivity, since this is less prone to mismatches in primers or probes (not shown). For example, sequences with numerous mismatches in one of the primers or probes will not be found by an alignment search using these primer or probe sequences as search queries. However, such sequences will be present in the MSA, see conservation plots of WNV PCR tests. Supplemented data A shows the summarised -without accession numbers -conservation plots of the three WNV PCR tests. PCRv generates a conservation plot listing all hits according to decreasing number of mismatches. Hits with the most mismatches needs attention as these could lead to false negative PCR results. We calculated and defined the in silico sensitivity as the percentage of hits with a maximum of one mismatch per primer or probe as these are assumed to be detected with the respective PCR test.
The software program SSEARCH that is available in the FASTA sequence analysis package from the University of Virginia (Pearson, 1991) uses a calculated expectation value E in combination with a supplied threshold value to determine whether a hit is returned. The expectation value E depends on the number and length of sequences in the database. Consequently, the E value of a search hit depends on the location of the found sequence in the database. Large sequences are therefore segmented into fragments of maximal 3000 nucleotides in length. This reduces the variability in sequence length leading to a more homogenous sensitivity of SSEARCH across the database and improves the overall sensitivity of SSEARCH.
The sensitivity of the well-known and commonly used BlastN alignment search program was compared to that of SSEARCH (Fig. 3). Clearly, SSEARCH returns 100% of the primers up to six mismatches. In contrast, the percentage of returns with BlastN is slightly less than 100% for three mismatches and rapidly declines by an increasing number of mismatches. We conclude that SSEARCH is much more accurate, and thus more suitable than BlastN to determine the in silico  Fig. 3. Comparison of the accuracy of an alignment search performed by the BlastN and the SSEARCH software programs. A test database of randomly generated nucleotide sequences was generated containing 10,000 sequences of 3000 nucleotides in length. 875 sequences contained a primer sequence of 24 nucleotides in length. Each primer contained randomly 0-10 mismatches. The cut-off expectation value E used in both programs was 1000. The inserted primer with up to 2 mismatches completely returned with BlastN, whereas SSEARCH completely returned the primer with up to 6 mismatches.
specificity. We also noticed that BlastN tends to find partial/fractional nucleotide alignment hits which is not desirable for primers and probes. In addition, PCRv using SSEARCH is suitable for use in a laboratory quality control system, since the search process is monitored per 500 MB fasta file for completeness and accuracy/sensitivity by returned hits of flagged internal control sequences (FICS). An overview of this monitoring is added to the validation report. Examples of incomplete, inaccurate or alignment searches with a low sensitivity are presented (Supplemented data C). In case alignment search results are not sufficient, the threshold value can be changed to increase the sensitivity but the calculation time will also increase.
Here, we showed in silico validation results of WNV PCR tests of different formats as an example. PCRv was also used to validate real time PCR tests at WBVR (Fig. 4). SSEARCH quantifies hits for any combination of primers and probes potentially leading to detectable amplicons, see Fig. 2. This can result in more hits for the in silico specificity check by SSEARCH than for the in silico sensitivity check by ClustalW 2.1. For example, sequences partially overlapping with the PCR target sequence will not be found by the in silico specificity check, since this check only finds complete amplicons. Further, NCBI only stores unique nucleotide sequences in its downloadable database export file "nt.gz". Identical sequences are combined as one sequence with the sequence name as a concatenation of all individual sequence names separated by the ASCII code 1. PCRv does not recognize merged names as multiple sequences, resulting in less hits by SSEARCH.
Detailed analysis of in silico validation results enables a focus on specific test problems, as shown for the PCR test for peste-des-petits ruminants virus (PPRV) of WBVR that presumably does not detect PPRV strain Ghana2010 because of three mismatches in the probe sequence. Indeed, the PCR target of this PPRV strain was amplified but was not detected by the Taqman probe (van Rijn et al., 2018a). We used PCRv to analyse OIE-recommended and published PCR tests for other pathogens in order to select the best option for implementation in laboratory diagnostics. Upon preparedness on incursions, frequent in silico (re-)validation could also show the need for adaptation of operational PCR tests to emerging epidemics caused by new variants in other parts of the world.
PCRv depends on compatible and reliable nucleotide databases. For example, in silico validation by PCRv depends on submission of accurately determined sequences which are coded with the correct taxonomy ID number. For example, classical swine fever virus (CSFV) sequences that are taxonomy classified as bovine viral diarrhoea virus type 2 (BVDV II) were consequently interpreted as false positives in the CSFV PCR test and as false negatives in the BVDV PCR test. Further, in our example of WNV PCR tests, five nonspecific hits appeared to be sequences without taxonomy ID. Still, these sequences are definitely WNV sequences, although 2 out of 5 nonspecific hits have been synthetically derived (Supplemented data B). On the other hand, a more specific taxonomy classification or labelling of sequences in databases could be used for the development of PCR tests specific for subspecies, serotypes or lineages.
Considering the expected rapid expansion of available sequences, PCRv will be further improved by allowing incremental analyses in which only newly submitted sequences with respect to the previously analysed sequences are processed. This will keep the required analysis time manageable for in silico re-validation of PCR tests. The number of hits for the in silico sensitivity and specificity are not representative for the field situation but represents that of the sequences in the database. In other words, the percentages could be skewed by a small number of sequences in the database, or by a large number of very closely related sequences caused by a huge effort during one epidemic.
Submitted sequences are sometimes not trimmed for synthetic adaptors like PCR primers causing misleading positive analysis results. Synthetic or optimized genes of pathogens can lead to misleading negative PCRv results. Synthetic and genetically modified sequences should be labelled as 'nonnatural' in databases to prevent misleading results of in silico validation efforts. Finally, negative PCRv results can be created on purpose by development of DIVA (Differentiating Infected from Vaccinated) vaccine viruses with a deleted or mutated DIVA target, like gE deletion mutants of bovine herpes virus type 1 and pseudorabies virus (Kaashoek et al., 1994;van Oirschot et al., 1990), NS3 deletion mutants of bluetongue virus and African horse sickness virus (Feenstra et al., 2014;van Rijn et al., 2018bvan Rijn et al., , 2013, and liveattenuated lumpy skin disease (LSD) vaccine (Agianniotaki et al., 2017).
Viral pathogens belonging to the same taxon showing an extreme variation in their sequence cannot be aggregated in one MSA using one reference sequence. Further, large scale genomic rearrangements, such as duplication, deletion, insertion, inversion, and translocation, are very common in genomes of bacterial pathogens, and will undoubtedly challenge the calculation of a MSA, if this is even possible. Currently, we are investigating alignment-free analysis methods to address these challenges. Even more, we foresee the development of a next generation in silico tool, partially based on PCRv, to find highly conserved targets for new or confirmatory PCR tests. Fig. 4. Overview of the in silico sensitivity and specificity of several real time PCR tests at WBVR as determined by PCRv. The in silico sensitivity of PCR tests is expressed as the percentage of hits with a maximum of one mismatch per primer or probe (squares, line). The in silico specificity is expressed as the percentage of specific hits with 0 mismatches (black) and 1 mismatch per primer of probe (grey). Real time PCR tests are indicated: WNV; West-Nile virus (Eiden et al., 2010;Johnson et al., 2001), BTV; bluetongue virus (van Rijn et al., 2012;2013), PPRV; peste des petits ruminants virus (van Rijn et al., 2018a), ASHV_S4; African horse sickness virus segment 4 (van Rijn et al., 2018b), ASHV_S5; African horse sickness virus segment 5 (van Rijn et al., 2018b); in-house developed assays: RVFV; Rift Valley fever virus, SGPV; sheepand-goat pox virus, EHDV-a; epizootic haemorrhagic disease virus test a, EHDV-b; epizootic haemorrhagic disease virus test b, EAV; equine arteritis virus, EBLV-1; European bat lyssa virus type 1, CSFV; classical swine fever virus, ASFV; African swine fever virus, PRV-gB; pseudorabies virus glycoprotein gene gB, PRV-gE; pseudorabies virus glycoprotein gene gE. Results of PCRv could demonstrate the need to optimize or redesign a PCR test, like for EHDV-a and AHSV_S4. Note: hits of non-natural sequences were not discarded. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article).