Creation of Universal Primers Targeting Nonconserved, Horizontally Mobile Genes: Lessons and Considerations

Increasing use of molecular detection methods, specifically PCR and quantitative PCR (qPCR), requires utmost confidence in the results while minimizing false positives and negatives due to poor primer designs. Frequently, these detection methods are focused on conserved core genes, which limits their applications.

IMPORTANCE Increasing use of molecular detection methods, specifically PCR and quantitative PCR (qPCR), requires utmost confidence in the results while minimizing false positives and false negatives due to poor primer designs. Frequently, these detection methods are focused on conserved core genes, which limits their applications. These screening methods are being used in various industries for specific genetic targets or key organisms, such as viral or infectious strains, or characteristic genes indicating the presence of key metabolic processes. The significance of this work is to improve primer design approaches to broaden the scope of detectable genes. The use of the techniques explored here will improve detection of nonconserved genes through unique primer design approaches. Additionally, the approaches here highlight additional, important information which can be gleaned during the in silico phase of primer design and will improve our gene annotations based on percent identities.
KEYWORDS PCR, genetics, mobile genetic elements, molecular methods, multidrug resistance, primer design, qPCR
PCR-based diagnostic approaches are being widely used for rapid screening of microbes and pathogens in environmental and clinical settings (1-4). Correct primer design is a critical factor in assessing the accuracy of the diagnostic approach to avoid false positives and false negatives. Many clinical studies focus on a single infectious strain or species (5-7) and thus the primers can be designed to focus on specific, characteristic genes present in the pathogenic strains and not in the benign strains, simplifying the primer design. However, this restricts the scope of use for the primers.
Some of the most successful attempts at designing "universal" primers are the 16S rRNA sets, of which there are many subsets, each targeting different variable regions of the 16S rRNA gene; they are reviewed elsewhere (8-10). These universal primer sets are mixed batches of primers with differing degrees of base variability at each position within the primer as denoted by the degenerate bases, where each possible combination is present and intended to target specific sequences, each representing a different species or groups of species. This approach attempts to cover the depth of diversity within the target location and provide unbiased detection of each species present before the first PCR cycle. Over the subsequent years, it has been shown that different primer sets have various detection levels for the species, both within different bacterial clades and between Bacteria and Archaea (8,11,12). Thus, the idea of "universal" is difficult even for an expectedly highly conserved core gene.
Alternatives to PCR-based primer detection are primer-probe assays such as fluorescence in situ hybridization (FISH), Southern blots, or microarrays, which all use primer binding to detect genes of interest but through different methodology approaches. While only PCR-based approaches are discussed here due to the ease of in silico analysis, it should be kept in mind that the approach to how primers are developed can be applied in these other techniques to answer other questions or for interpretation and conclusions from primer "hit" specificities.
Here, we describe an approach to design universal primer sets targeting multidrug resistance efflux pumps (MDREPs). MDREPs are nonconserved genes that encode a vast range of different proteins, from specific metabolite efflux transporters to those targeting specific groups of antiseptics and/or antibiotics (13-15). They are a group of integral membrane transport protein systems subdivided into six superfamilies: small multidrug resistance (SMR), major facilitator superfamily (MFS), multidrug and toxic (compound) extrusion (MATE), ATP-binding cassette (ABC), resistance-nodulation-cell division (RND), and proteobacterial antimicrobial compound efflux (PACE). A review of the superfamilies is available elsewhere (13). Due to the narrow range of annotated PACE genes (namely, aceI) and the absence of any annotated PACE genes present in the genomes of the model community members, the PACE genes were omitted from this study.
As their name suggests, MDREPs were originally classified based on their ability to confer resistance to antibiotics, although a single MDREP can have a diverse substrate range, sharing few structural, size, or ionic properties (16-18). To date, it is unclear how the specificity of the proteins is determined; however, recent research suggests it may arise from different entrance channels allowing transport of chemicals with similar physicochemical properties or through the use of weaker hydrophobic interactions between substrate and efflux pump compared to the specific hydrogen bonds used by more specific transporters (19,20).
MDREPs are often found on mobile genetic elements, including plasmids, transposons, integrons, integrative conjugative elements, and genomic islands (15,21-24). Thus, they are of particular and increasing importance due to their horizontal mobility and their contribution to the growing problem of global antibiotic resistance (24-26). The phenomenon of antibiotic resistance is well studied in medical environments but has only recently begun to be investigated in other environments, such as water treatment plants and activated sludges (27-29). Several research groups have begun using metagenomics to track and monitor the migration and abundance of different MDREP genes in wastewater following various treatment methods, with mixed results (30,31). Our interest is to follow specific MDREPs in the context of biocide resistance in a model community of six members designed to resemble a microbiologically influenced corrosion environment.
Many detailed reviews of the different efflux pump superfamilies have been published (32-37); they have been used for the selection of targets in this study and are shown in Table 1. Here, gene targets have been specifically selected for their published substrate compounds, focusing on antiseptics/biocides. While there are some conserved regions and motifs within the MDREP superfamilies, these traits are typically limited to very short regions, as in the conserved N terminus region of AcrB (38), specific residues, as in the proton relay components of transmembrane helices (39), or short motifs, as in the case of conserved residues in the motif C of members of the MFS superfamily (40). There are no known residues or motifs conserved across all superfamilies and those within a superfamily are less conserved than one would assume (41).
We explore the possibility of designing universal primers for nonconserved, mobile gene targets in the fashion of the multiple universal 16S rRNA primer sets (2,8). We will discuss difficulties and challenges encountered and describe a novel approach which can be applied to any desired gene target for improved detection. This novel primer design approach was tested on environmental water samples treated with various biocides.

RESULTS AND DISCUSSION
The goal of this study was to go through a primer design workflow, then test the efficacy of the designed primer sets in silico to evaluate the success or failure of the design. The approach is to evaluate the primers before using them experimentally, which consumes resources and time in optimization. The output of our in silico primer annealing experiment is represented over four figures that illustrate all the binding locations for each of the primers within the model community genomes, separated into MDREP superfamilies (Fig. 1 to Fig. 4). The results illustrate the direct hits toward the intended MDREP gene but also the unintended binding locations where primers would anneal under different annealing temperatures and with different sequence homologies (see Table 2) [...] encoding a "conserved protein of unknown function," "hypothetical protein," "conserved hypothetical protein," "conserved protein of unknown function," or "conserved exported protein of unknown function." This work illustrates the difficulty of making a universal primer set for accessory/character genes compared to core genes. The challenge is further highlighted when working with genes that frequent mobile genetic elements, thus providing the opportunity for increased divergence. To illustrate the issues we observed and our findings, we discuss a few key examples highlighting certain trends and difficulties which were discovered as a result of this primer design work.
Off-target unintended primer binding. To begin, we discuss an example showing the potential for a primer designed for one target to identify other MDREPs in another species. The upstream norM primer targeting Thauera aromatica has a relatively high annealing temperature on its target sequence (65.6°C), owing to its high GC content of 75%. This primer has five unintended binding targets, all with a percent identity of 90.0 to 99.9% (Fig. 1). Two of these sites are located within T. aromatica and target dctM (Tmz1t_0544) and yfdV (Tmz1t_0790). dctM encodes a tripartite ATP-independent periplasmic transporter which falls under the C4-dicarboxylate transport system classification according to KEGG, while yfdV encodes an auxin efflux carrier that is only a hypothetical protein with general function predicted and thus could be a more general transporter. The other unintended targets include genes encoding a conserved membrane protein of unknown function in Pseudomonas putida (PP_2935) that is provisionally in the MFS superfamily, an ethanolamine ammonia lyase large subunit (DvMF_1253), and an NAD synthase (GSUB_13065). The annealing temperatures of these locations are at or just below the intended annealing temperature. This is a clear example of a primer targeting a sequence of nucleotides in what may be a conserved region of certain transporter genes. It is important to note that although these primers are detecting these unintended targets in silico, in each case only a single primer is binding and therefore no double-stranded PCR product would be formed in vitro. In practice, these unintended single primer interactions will only affect PCR amplification by reducing the availability of the primers (i.e., primer efficacy) for finding the intended sequence. Significant amounts of off-target binding will therefore decrease the amount of PCR amplicon produced. This reduction in primer availability has additional implications for interpreting qPCR data, as it may result in an underestimation of gene copies. This issue may also be addressed in wet lab work through an iterative screening process to determine the ideal primer concentration range for each primer to account for unintended binding. However, using the approach here should cut down on such labor.
A consideration is that this ratio may vary depending on the genomic template being used, e.g., pure culture DNA compared to complex environments.
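The kind of in silico search behind these hit lists can be sketched in a few lines. The following is a minimal illustration, not the tool used in this work: it assumes an ungapped, single-strand sliding-window comparison, and the function names, the example sequences, and the 80% cutoff are all hypothetical choices made here. A real screen would also scan the reverse complement and estimate annealing temperatures.

```python
def percent_identity(primer: str, site: str) -> float:
    """Percentage of primer positions that match the candidate site."""
    matches = sum(p == s for p, s in zip(primer, site))
    return 100.0 * matches / len(primer)


def find_binding_sites(primer: str, template: str, min_identity: float = 80.0):
    """Slide the primer along one strand of the template (no gaps allowed).

    Returns (start, identity) for every window at or above min_identity.
    """
    primer, template = primer.upper(), template.upper()
    k = len(primer)
    hits = []
    for start in range(len(template) - k + 1):
        identity = percent_identity(primer, template[start:start + k])
        if identity >= min_identity:
            hits.append((start, identity))
    return hits


def identity_bin(identity: float) -> str:
    """Assign a hit to the identity ranges used in the figures."""
    if identity == 100.0:
        return "100%"
    if identity >= 90.0:
        return "90.0-99.9%"
    return "80.0-89.9%"


# Hypothetical sequences: one perfect site and one site with a single mismatch.
primer = "GATTACAGGC"
template = "TTTT" + "GATTACAGGC" + "AAAA" + "GATTACAGGT" + "CCCC"
for start, identity in find_binding_sites(primer, template):
    print(start, identity, identity_bin(identity))
```

Binning hits this way makes it straightforward to separate the intended 100% matches from the 90.0 to 99.9% and 80.0 to 89.9% classes discussed in the examples.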
A more complicated example of a primer with unintended binding interactions is the downstream qacA primer for Desulfovibrio vulgaris (Fig. 2). In this example, the downstream qacA primer has an annealing temperature of 55.4°C and three unintended binding sites, none of which are in the D. vulgaris genome. The primer binds at the same or a lower annealing temperature to pcrA (QU35_03810), encoding an ATP-dependent DNA helicase, and a gene for the succinyl-CoA synthetase alpha subunit (QU35_08905), both in Bacillus subtilis, as well as to a cupin gene (GSUB_03070) in Geoalkalibacter subterraneus. This example illustrates undesired targeting that does not provide any sort of additional information, such as potentially conserved sequences (domains) of efflux pumps or identifying unannotated/misannotated genes of similar function. It is unclear which trait(s) of these genes contributes to being targeted by this primer.
To illustrate less-desirable unintended primer binding, we discuss the primers targeting genes of the SMR superfamily (Fig. 3). This superfamily is represented by genes from P. putida, G. subterraneus, Acetobacterium woodii, and B. subtilis for a total of eight distinct target genes. Out of these eight targets, the primers for B. subtilis qacE1 (QU35_18255) and ebrA (QU35_09520) have no unintended binding locations. The upstream primers for B. subtilis emrE (QU35_06845) and P. putida emrE (PP_4930) each have a single unintended binding location with 90.0 to 99.9% identity, targeting a noncoding region in G. subterraneus (GSUB:834814) and a betA gene in P. putida (PP_3383), respectively.
Six of the remaining primers have unintended targets with 80.0 to 89.9% identity, targeting 24 different genes encoding proteins ranging from hypothetical proteins (QU35_15730, Tmz1t_0375, and Tmz1t_0977) to transporter permeases (QU35_06380, QU35_21320, and QU35_18115), a putative symporter (PP_3247), polymerases (Awo_c25450 and Tmz1t_3813), isomerases (DvMF_3022 and QU35_19985), and regulators and stress proteins (QU35_03275, DvMF_0255, PP_1269, and PP_3526). The lower percent identity matches (80.0 to 89.9%) do not target genes of a predictable or specific function but, rather, the targets share little similarity despite some membrane-associated proteins being detected. The unintended efflux genes detected are not of the same superfamily as the intended targets but rather are of the ABC superfamily. Interestingly, primers designed as part of this work to target the ABC superfamily, for which there is only a single target (lmrA from B. subtilis; QU35_01610), have zero unintended binding locations.
Finally, we discuss the unintended targeting of three different primers, two of which were constructed with degenerate bases and intended to target multiple sequences in different species. The two degenerate primers are the upstream acrB/mexD targeting both genes in P. putida and the downstream primer targeting acrB in T. aromatica and mexD in P. putida. The single target primer is the downstream acrB for P. putida, which complements the P. putida acrB/mexD primer. For the intended targets, all primers anneal with 100% identity. Due to the degenerate nature of two of these primers, there is a significant increase in the unintended binding targets compared to other primer sets (Fig. 4). Of note is the complexity of this figure, which has been intentionally retained to highlight the issues of the unintended primer binding locations as the gene targets become more complicated. Interestingly, the unintended binding locations of the upstream acrB/mexD primer are exclusively between 90.0 and 99.9% sequence identity, while the P. putida downstream acrB primer has a mix of high percent identity and low percent identity binding sites. Some of the recurring unintended targets are the other mex genes, specifically mexB and mexF, both of which are detected by the acrB/mexD upstream and downstream primers with identities of 90.0 to 99.9% (with the exception of mexB, which unintentionally has 100% identity with the T. aromatica/P. putida mexD downstream primer). The T. aromatica/P. putida mexD downstream primer also has 100% identity matches with mexF (PP_3426) and the gene encoding a putative RND transporter (PP_0906) (Fig. 4). Many of the unintended targets of this degenerate downstream primer are genes coding for hypothetical proteins and all have annealing temperatures of 50.0 to 59.9°C (a minimum of 5°C below the melting temperature of the intended target). Of the unintended targets of the upstream P. putida acrB/mexD primer, two are hydrophobe/amphiphile efflux-1 (HAE1) genes (DvMF_0036 and Tmz1t_0505), one encodes an uncharacterized RND transporter in G. subterraneus (GSUB_04250), and the final target is the ttgB gene (PP_1385) in P. putida, which codes for a probable membrane efflux pump transporter protein.
This primer set illustrates the ability of the primers designed from a multiple-sequence alignment (MSA) of multiple genes (11 annotations in total) to target and locate other, similar genes both within the intended target species and in other, unrelated species owing partially to the degenerate bases present. Most hits are located within the two intended species, with only two hits occurring outside, genes encoding HAE1 in D. vulgaris and an RND transporter in G. subterraneus. It is important to note that RND primers are the only primers to have both upstream and downstream primers binding simultaneously to the same target, potentially resulting in unintended PCR amplicons. This occurs in four different genes: mexF (PP_3426), ttgB (PP_1385), HAE1 (Tmz1t_0505), and mexB (PP_3456) (Fig. 4). A fifth gene (encoding a putative RND transporter, PP_0906) is targeted by multiple primers; however, this gene is only targeted by downstream primers which bind to the identical location and thus cannot produce a PCR amplicon. Perhaps unsurprisingly, the primers targeting mexD in P. putida also bind to and would produce a PCR amplicon on mexB (PP_3456) and mexF (PP_3426), suggesting that these genes have a homologous domain which influenced the MSA of the mexD genes. The other two unintended targets with multiple primer attachments are the ttgB gene (PP_1385) in P. putida, which encodes a probable efflux pump, and an HAE1 gene in T. aromatica (Tmz1t_0505), encoding a protein that belongs to the acriflavine resistance protein B family (efflux pump) according to KEGG. It becomes clear from this example that the degenerate bases allow for an increase in unintended target locations and, unlike the other examples, these primers will produce actual PCR amplicons, disrupting any potential qPCR or downstream analyses.
Although these unintended amplicons can be accounted for and removed using bioinformatics during sequencing applications, in qPCR applications these products, especially if they are of similar size to the intended product (or more specifically produce amplicons with similar melting points), can produce false positives without being able to distinguish between the intended and unintended products. Though the focus here was to design primers for PCR to determine presence and abundance of target genes  in a community, the primers could be used in a parallel workflow for sequencing.
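The distinction drawn above, between a lone off-target primer and an upstream/downstream pair landing on the same gene, can be checked mechanically once the hit list exists. The sketch below is an illustration under stated assumptions: the `Hit` record, the coordinates, the primer labels, and the size cutoff are hypothetical, and amplicon size is simplified to the distance between binding start coordinates.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    primer: str      # primer name (hypothetical label)
    direction: str   # "upstream" or "downstream"
    gene: str        # locus tag of the bound gene
    position: int    # binding start coordinate (invented for illustration)


def potential_amplicons(hits, max_size=3000):
    """List genes on which an upstream and a downstream primer both bind.

    Returns (gene, upstream primer, downstream primer, approximate size)
    whenever the downstream site lies within max_size bases after the
    upstream site on the same gene.
    """
    products = []
    for up in hits:
        if up.direction != "upstream":
            continue
        for down in hits:
            if down.direction != "downstream" or down.gene != up.gene:
                continue
            size = down.position - up.position
            if 0 < size <= max_size:
                products.append((up.gene, up.primer, down.primer, size))
    return products


# Locus tags come from the text; all coordinates are invented.
hits = [
    Hit("acrB/mexD_U", "upstream", "PP_3426", 100),
    Hit("mexD_D", "downstream", "PP_3426", 287),
    Hit("mexD_D", "downstream", "PP_0906", 450),
    Hit("acrB_D", "downstream", "PP_0906", 450),
]
print(potential_amplicons(hits))
```

In this toy input, only PP_3426 is flagged; PP_0906 is bound solely by downstream primers and so yields no product, mirroring the fifth gene described above.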
To address an issue resulting from our primer design A approach (i.e., designing with degenerate bases), we developed an alternative method (primer design B) to create primer sets which preferentially target different locations (i.e., potentially unconserved locations) but prioritize maintaining identical amplicon sizes across all intended targets. As illustrated with the RND primers, design A can lead to an increase in unintended binding locations due to the exponential increase in primer sequences. Each degenerate base increases the number of unique sequences present in the primer mix and may create a primer which has no intended target sequence, meaning the chance of unintended binding increases. To illustrate, take the example outlined in Table 3 for the upstream primer targeting the acrB and mexD genes in P. putida.
From this simple example using two degenerate bases coding for only two nucleotides each, the resulting degenerate primer consists of a mixture of four primers, two of which do not code for any intended sequence. Rather, a more precise approach is to treat each gene as its own template, design primers to create the same amplicon size, and have similar melting temperatures and subsequently combine all the upstream primers into a single mixture in equal proportions. In this way, the number of primers is kept to a minimum and every primer has a desired target. An alternative to degenerate bases is the use of inosine; however, this will not always improve primer function    (42). Based on this previous study, the use of degenerate bases improves the broad range detection of a target gene, but the primers suffer from nonspecific amplification, dimerization, and primer slippage. The use of inosine in place of degenerate bases on occasion improved primer detection, but this was not universally the case across all their primer sets (42). Correct use of degenerate bases or inosine must be assessed on a case-by-case basis. The unintended specificity of these primers is of note when considering that all the primers are 18 to 23 nucleotides in length and therefore target a sequence of six or seven amino acids. It must be understood that not all MDREP genes would be detected using this approach, but it could provide a means of improving our understanding of conserved regions and, as this work illustrates, the conserved amino acid region may be as short as six or seven amino acids and still provide relatively high accuracy for gene identification. It is clear from this that even a six-amino-acid sequence can have evolutionary pressure to convey similarity in overall protein structure and/or function.
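The expansion of a degenerate primer into its constituent sequences can be made explicit. The sketch below uses the standard IUPAC nucleotide codes; the two target sequences are short hypothetical stand-ins, not the acrB/mexD sequences of Table 3, and the function names are chosen here for illustration.

```python
from itertools import product

# Standard IUPAC nucleotide ambiguity codes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def expand_degenerate(primer: str) -> list:
    """Enumerate every unique sequence present in a degenerate primer mix."""
    choices = (IUPAC[base] for base in primer.upper())
    return ["".join(bases) for bases in product(*choices)]


def untargeted_variants(primer: str, targets) -> list:
    """Variants in the mix that match none of the intended target sequences."""
    wanted = {t.upper() for t in targets}
    return [v for v in expand_degenerate(primer) if v not in wanted]


# Two hypothetical targets differing at two positions (A/G and C/T).
targets = ["ACGATC", "ACGGTT"]
degenerate = "ACGRTY"
print(expand_degenerate(degenerate))             # four sequences in the mix
print(untargeted_variants(degenerate, targets))  # two match no intended target
```

With two twofold-degenerate positions the mix contains four sequences, of which two correspond to no intended target, which is the situation described for design A.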
Using approach B (i.e., mixing each unique primer together without degenerate bases) allows the "universal" primer mix to always be in flux and be improved as new primers get added, until such point as the number of primers present negatively affects a specific primer's ability to bind the correct sequence. A consideration with approach B is that because the primers do not always target conserved regions, the primers may have unintended targets elsewhere in the genome(s), entirely unrelated to the target sequence. As a result, and as is good practice, primer sequences developed using either approach should be tested against the target genome(s) and all unintended binding locations should be identified and accounted for during PCR protocol development.
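The dilution effect mentioned here is simple arithmetic, sketched below under the assumption of an equimolar mix in which every sequence is present in equal proportion; the function name and the example counts are illustrative only.

```python
def mix_summary(n_variants: int, n_targeted: int) -> dict:
    """Shares of an equimolar primer mix.

    per_primer: fraction of the mix made up by any one sequence.
    on_target:  fraction of the mix that matches an intended target.
    """
    if n_variants <= 0 or not 0 <= n_targeted <= n_variants:
        raise ValueError("need 0 <= n_targeted <= n_variants and n_variants > 0")
    return {
        "per_primer": 1.0 / n_variants,
        "on_target": n_targeted / n_variants,
    }


# Design A: one degenerate primer with two twofold positions (4 sequences,
# 2 of them matching an intended target).
print(mix_summary(n_variants=4, n_targeted=2))
# Design B: two unique primers, each with an intended target.
print(mix_summary(n_variants=2, n_targeted=2))
```

Each primer added under approach B lowers the share held by every other primer, which is why the mix can only grow until individual primers become too dilute to bind reliably.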
MDREP primers compared to universal 16S rRNA primers. In contrast to the successful universal primer designs targeting the 16S rRNA gene, the issues discussed here highlight the difference between targeting core genes (essential for replication), character genes (those defining the type of metabolism), and accessory genes (genes which may provide improved fitness under specific conditions). As we improve techniques and apply genetic screening to more and more health (e.g., infection, disease, etc.) and economically significant (e.g., agriculture, bioremediation, etc.) issues, the more relevant genetic targets we will discover. Logically, the more specific the target, the more meaningful the presence/absence and quantity become but the further away from the core genes (moving toward characteristic and accessory genes) we must move. Accessory genes (and to a lesser extent characteristic genes) have less evolutionary pressure on them, which allows for higher rates of mutation and variability on the nucleotide level. Additionally, as many accessory genes are or can be located on mobile genetic elements, they become subject to the nucleotide biases and codon usages present in their current host. While conserved genes may have variable regions flanked by conserved regions (e.g., 16S rRNA genes) which can be targeted to facilitate primer design, nonconserved genes may not reliably have conserved regions, forcing primers to target less-conserved regions more susceptible to mutations and variations. Furthermore, the mechanism of the movement may affect the gene's availability or expression levels. The fitness of an accessory gene is dependent on environmental pressures and is subject to pulses of challenge, such as short periods of exposure to biocide, as is typical for pipeline antimicrobial treatments or antibiotic courses.
For comparison, the percent identities of the target genes were calculated using the multiple-sequence alignment (MSA) for each respective gene. The scores were calculated using the equation [...]. Each MSA was calculated using the conditions described in Materials and Methods. Only a single representative gene for each superfamily was selected as an indication of the variability within that superfamily. The average scores of the sequence identities are listed in Table 4, and a complete list of each gene's scores is provided in Tables S2 to S6 in the supplemental material. The ABC superfamily has been omitted due to low gene counts across the six representative species. From the percent identity scores, it is clear that the 16S rRNA gene is more conserved than any of the efflux pump genes. Among the efflux pump genes, the highest score belongs to the norM gene (67.63% across its annotated copies), while acrB and emrE had low scores (43.99 and 49.99%, respectively). The acrB score is skewed by two annotated copies being significantly shorter than the MSA, producing percent identity scores of 5.26% and 4.24%. Removing these two copies produced a score of 52.71% ± 5.71%, which brings the score for acrB more in line with the qacA/emrB and emrE scores.
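As a rough illustration of how such scores behave, the sketch below assumes one common definition, which may differ from the equation used in this work: identical, non-gap columns shared with a reference row, divided by the full alignment length. The sequences and function names are hypothetical. Under this assumption a copy much shorter than the alignment scores very low, consistent with the skew described for acrB.

```python
from statistics import mean, stdev


def percent_identity_to_reference(aligned_seq: str, aligned_ref: str) -> float:
    """Identical non-gap columns as a percentage of alignment length.

    Assumed definition for illustration; gaps ('-') count as mismatches.
    """
    if len(aligned_seq) != len(aligned_ref):
        raise ValueError("aligned sequences must have equal length")
    pairs = zip(aligned_seq.upper(), aligned_ref.upper())
    identical = sum(a == b and a != "-" for a, b in pairs)
    return 100.0 * identical / len(aligned_ref)


def summarize(scores):
    """Mean and sample standard deviation of a list of identity scores."""
    return mean(scores), stdev(scores)


# Hypothetical aligned rows: one near-identical copy and one truncated copy.
reference = "ACGTACGTAC"
copies = ["ACGAACGTAC", "ACGTAC----"]
scores = [percent_identity_to_reference(c, reference) for c in copies]
print(scores, summarize(scores))
```

Reporting the mean together with the standard deviation makes it easy to see when a few truncated annotations are dragging down the score for a gene.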
These scores reflect the relative simplicity of designing primers for highly conserved core genes, while nonessential, accessory genes are significantly more difficult due to their more plastic nature. While they are less homologous, the identity scores suggest it is possible to design universal primers for these peripheral genes. However, these genes do require a different approach for primer design. Using approach A, which employs multiple-sequence alignments, more conserved regions can be identified to target, whereas using approach B, which employs unique primer sequences (avoiding the use of degenerate bases), results in a primer mixture with higher accuracy and specificity. Where approach A may have a higher universality to the genes detected, the use of degenerate bases requires a more in-depth investigation into unintended hits. In contrast, approach B will provide more specific, targeted results with higher confidence, but the creation of universal mixes through the mixture of targeted primers detects less-diverse targets.
The plasticity of nucleotide sequences of the accessory genes results from the improved fitness benefits occurring only under specific conditions that are not always present. This allows for mutations to occur that would otherwise be impossible in core or character genes. Due to the nonconserved nature of MDREP genes and their mobility through and across genomes, the abundance and variance on the genetic level are erratic and unpredictable. The presence of the same gene in two different species does not allow for the determination of the direction of flow or origin of the gene. The in silico approach described here has the potential to be exploited to investigate evolutionary branching in these genes and in the case of MDREP genes, potentially shedding light on how these genes are selected for and allow for the identification of other clinically relevant efflux pumps yet to be discovered.
As with the 16S rRNA example, successful primer sets may be used for sequencing the intergenic sequences. The degree to which this sequencing will allow for more accurate gene annotation or phylogenetic assignment will be dependent on the depth of sequencing of a given target gene. The primers then also become useful for other, related wet-lab techniques, such as reverse transcriptase PCR (RT-PCR) or FISH.
Control group binding. The primers used as a control to assess the primer design methods target the gene encoding 3-isopropylmalate dehydratase subunit LeuC (ECK0074), which has been characterized as nonessential according to the Profiling of E. coli Genome (PEC) database (http://www.shigen.nig.ac.jp/ecoli/pec/). This gene represents an example of a characteristic gene, which is one susceptible to mutations but not to the same degree as the MDREP genes, nor is it expected to be as mobile.
Upon searching the six genomes for the terms "LeuC" and "3-isopropylmalate," seven total annotated genes were collected and used to create an MSA (other genes were found but attributed to the small subunit, leuD). Details of the primers used in this example are shown in Table 5. Based on the MAFFT default settings of Benchling, the MSA was split into two sequences to improve overall alignment scores. MSA1 consisted of both A. woodii sequences and those of D. vulgaris and G. subterraneus, while MSA2 comprised sequences of B. subtilis, P. putida, and T. aromatica. A summary of the number of gene hits using these primers is in Table 6, and a full list of gene hits can be found in Table S7. Table 6 shows that the primers containing degenerate bases have more unintended primer binding locations, especially the MSA1 upstream primer. A significant amount of the unintended hits for all leuC primers in the 90.0 to 99.9% identity range are attributable to leuC genes from the other strains (17/27), indicating that both MSAs were successful in identifying highly conserved regions within the leuC gene. At the 80.0 to 89.9% identity range, the primer hits become far less specific, detecting noncoding regions (5/41) or unannotated/hypothetical/unknown functional genes (4/41) as frequently as leuC (5/41). The degenerate primers detected fewer hypothetical proteins than the MDREP primers, which we attribute to the higher rate of correct leuC annotation, owing to its more conserved status compared to the MDREP primers.
Applications in metagenomic data sets. A major consideration of primer design is application to metagenomic data sets. Due to the diversity and size of these data sets, they are typically poorly annotated, with thousands of predicted genes with no known function or hypothetical proteins; thus, searching for a specific gene using a name or annotation has the potential to return false-negative results, or even false positives  due to misannotation. Thus, it is suggested that primers be designed based not on any annotations from a metagenomic data set but rather by using a type strain of a species expected to be present in the metagenomic environment of interest. This will improve the primer accuracy but has the limitation of targeting a narrower subset of the desired target gene. For highly conserved genes, this is less of an issue, but for highly mobile, nonconserved genes such as the MDREPs used here, this may further reduce the number of primer-detected genes. The more "universal" a primer set is, i.e., the more sequences used to compile the MSA used in primer design, the more potent the detection rate would become.
Considerations and limitations. While this approach endeavors to reduce the amount of wet-lab troubleshooting by anticipating pitfalls such as loss of primer efficiency through off-target binding, there will always be some degree of benchtop troubleshooting required. This work is not meant to replace or eliminate wet-lab work but rather to improve primer design and reduce the well-known issue of dependence on annotation databases (43). It is important to keep in mind the specific end goal of the primers (e.g., qPCR, amplicon sequencing, key pathogen or metabolism potential identification) and how that will affect parameters of the primer design. Even when targeting the same gene, a different end goal may require separate primers.
Here, initial in vitro validation of some of the more complicated primers employing degenerate bases has been done against P. putida and T. aromatica pure-culture DNA and DNA collected from a selection of field samples of fresh surface water with various biocide treatments (untreated, bronopol, glutaraldehyde, DBNPA, and quaternary ammonium compounds). The PCR products were separated using 1.5% agarose gels run at 100 V for 45 min (Fig. S.1 to Fig. S.4 in the supplemental material). The primer sets chosen were 16S rRNA for field template validation (amplicon size: 292 bp) (Fig. S.1), qacA from P. putida (P.puti qacA1/3_U and P.puti qacA1_D; amplicon size: 198 bp) (Fig. S.2), mexD from P. putida (P.puti acrB/mexD_U and T.aro acrB2/P.puti mexD_D; amplicon size: 187 bp), acrB2 from T. aromatica (T.aro acrB2_U and T.aro acrB2/P.puti mexD_D; amplicon size: 187 bp) (Fig. S.3), and acrB from P. putida (P.puti acrB/mexD_U and P.puti acrB_D; amplicon size: 187 bp) (Fig. S.4). These gels show that field DNA, regardless of biocide treatment, is suitable for PCR as the 16S rRNA primers were all successful (Fig. S.1) but that qacA was not detectable in these samples (Fig. S.2). The qacA1 primers have two unintended products against T. aromatica (~400 and ~750 bp) but the single intended product against P. putida. The primer sets targeting mexD and acrB2 (Fig. S.3) have identical amplicon products when used against P. putida and T. aromatica, suggesting that these primers, which employ degenerate bases and share a downstream primer (T.aro acrB2/P.puti mexD_D), are unable to distinguish between the two intended targets at the given annealing temperature (58°C). The primer set targeting acrB from P. putida illustrates how the PCR cycling conditions must still be properly tuned, as there is a single product when used with an annealing temperature of 63°C against P. putida and T. aromatica (Fig. S.4, lane 5) but many unintended products against P. putida when used with an annealing temperature of 58°C (Fig. S.4, lane 13). All these gels indicate that none of the field samples contain P. putida or T. aromatica, or any target sequences of high enough similarity to be detected by these primer sets. From the pure-culture DNA templates, it is clear the primers work as intended (when used with appropriate annealing temperatures) but were unable to detect these targets in the field samples. To further validate these primers, the PCR assays were performed on the field DNA samples spiked with both P. putida and T. aromatica genomic DNA at 1% of the total DNA concentration (Fig. S.5 and S.6). When the samples were spiked with the genomic DNA of the pure cultures, the expected bands were produced for every sample (with the exception of the NTC), showing that these primers are still able to function in these field samples, detecting sequences added at 1% of the total DNA concentration.
Lessons learned. An overarching issue with primer design attempts such as this one is the often-poor annotation quality of our genomic data libraries, particularly for environmental (or more generally any nonclinical) species. The abundance of putative, predicted, or hypothetical proteins severely limits the ability to find related genes accurately when designing primers for specific genes. To alleviate this issue, primers should be designed using a well-annotated genome which is likely to be present in the environment where the primers will ultimately be used. In this way, the primers can be designed with higher confidence and, as shown here, can be used to probe genomes with lower annotation quality and aid in identifying the desired targets there.
As with all scientific endeavors, it behooves the scientist to keep in mind the ultimate goal of the primers. Here, we attempted to design primers for the same genes across six different genomes to eventually combine them into a "universal" primer mix for the desired target. In these situations, unintended binding becomes a larger issue because the amount of each specific primer is already reduced with respect to the final primer concentration, and any off-target binding could result in false negatives during PCR amplification. If the objective is to probe a mixed sample for the presence of a specific gene (e.g., for clinical or environmental screening), the use of degenerate primers becomes risky, as it may lead to false positives, or to false negatives should the mixture of primers be so complex that competitive binding prevents correct primer binding. Alternatively, in exploratory science such as the attempt explained here, the degenerate primers can increase the ability of the primers to detect additional genes not identified in the annotations. Many of the primer-selected genes coding for hypothetical proteins may actually represent efflux pumps of some nature, and thus the primer binding results can add evidence toward the function of these predicted proteins. This would then require a more targeted investigation of the genes identified in this manner, such as comparing the amino acid sequences of the predicted genes to known proteins and determining whether they truly would encode efflux pumps.
Unlike for the 16S rRNA gene family, designing universal primers for mobile, accessory genes is particularly difficult. A notable outcome of this approach is the utility of in silico analysis of primers designed for less-conserved genes and its potential to aid or facilitate improved annotation in less-studied organisms.
The choice between the two primer design approaches depends on the percent identity or conservation of the nucleotide sequences of the target genes. To facilitate the decision, percent identity scores should be calculated for all annotated copies of the desired target gene. From the 16S rRNA and MDREP gene examples shown here, we suggest a cutoff value of 75%: targets with identity scores above 75% should use primer design A (employing MSA and degenerate bases), while targets with identity scores below 75% would be served more effectively by primers designed using design B (unique sequences with variable binding locations). This should reduce the number of unintended binding locations and improve the overall efficacy of primer binding in PCR and qPCR applications.
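The cutoff rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, the input is assumed to be a set of already-aligned sequences of equal length, and summarizing the pairwise scores by their minimum (and sending a score of exactly 75% to design A) are choices made for this sketch rather than details given in the text.

```python
from itertools import combinations


def percent_identity(a: str, b: str) -> float:
    """Percent identity between two aligned sequences of equal length.

    Columns where both sequences have a gap are skipped; a gap opposite
    a base counts as a mismatch (an assumption of this sketch).
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    compared = matches = 0
    for x, y in zip(a.upper(), b.upper()):
        if x == "-" and y == "-":
            continue
        compared += 1
        if x == y:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


def choose_design(aligned: dict[str, str], cutoff: float = 75.0) -> tuple[str, float]:
    """Return ("A" or "B", lowest pairwise identity) for a set of gene copies.

    Assumption: the lowest pairwise score decides, and a score equal to
    the cutoff is assigned to design A.
    """
    if len(aligned) < 2:
        raise ValueError("need at least two aligned gene copies")
    scores = [percent_identity(a, b) for a, b in combinations(aligned.values(), 2)]
    lowest = min(scores)
    return ("A" if lowest >= cutoff else "B"), lowest


if __name__ == "__main__":
    copies = {"copy1": "ACGTACGTAC", "copy2": "ACGTACGTAT"}  # toy sequences
    print(choose_design(copies))
```

Using the minimum pairwise score is the conservative choice, since a single divergent copy is enough to force many degenerate positions into a shared primer.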
In conclusion, this work illustrates the benefits and shortcomings of two different primer design approaches. First, the use of multiple-sequence alignments (MSAs) to locate conserved regions of the nucleic acid lends itself toward creating primers with degenerate bases and ensures uniform PCR amplicon size. With the degenerate bases, these primers are more likely to have unintended binding and lower primer efficiency during thermocycling. However, these primers are more likely to reveal the presence of the desired gene(s) in poorly annotated genomes when using an in silico method.
The second primer design approach creates primers from individual genes without targeting conserved regions of the MSA while controlling for melting temperature and amplicon size to ensure all primers designed in this way are compatible. This approach allows for more-stringent thermocycling conditions and reduces the amount of unintended primer binding locations (thus improving detection rates in vitro), but this approach is less likely to reveal similar or identical genes in mixed environments.
Overall, this work highlights the importance of defining the goal of the primers before designing them, of appreciating the false discovery rates that result from the chosen primer design approach, and of accounting for those rates when interpreting the results.
The full genome sequences were imported from JGI to a Web-hosted sequence alignment tool (Benchling) for gene sequence alignments. Multidrug resistance efflux pump genes were identified through searching directly for their gene names and abbreviations or key words, including (but not exclusively) multidrug, efflux, transporter, outer membrane, inner membrane, and resistance. A recognized challenge was that not all genomes were equally annotated. For example, the genome of P. putida has a more complete annotation than a more environmentally relevant species such as D. vulgaris. Once identified, all copies of each gene were clustered and a nucleic acid multiple-sequence alignment (MSA) was performed with no template sequence and using the MAFFT alignment algorithm with default conditions (maximum iterations: 0; tree rebuilding number: 2; gap open penalty: 1.53; gap extension penalty: 0.0; and no adjust direction). MSAs were used to identify regions of high nucleotide percent identities across all annotated genes of the same name. These regions were preferentially used for primer design as described below. No specific percent identity was used to identify regions to target for primer design; rather, regions were selected to minimize the number of mismatches while still maintaining amplicon sizes, as described below.
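As an illustration of how regions with few mismatches might be located in an MSA, the following Python sketch counts the variable columns in every window of a given width. It is not the procedure used in Benchling; the function name, the window width, and the mismatch threshold are all hypothetical, and the alignment is assumed to be supplied as equal-length strings with "-" for gaps.

```python
def conserved_windows(msa: list[str], window: int = 20,
                      max_variable: int = 3) -> list[tuple[int, int]]:
    """Return (start, n_variable_columns) for each window of the alignment
    that contains at most `max_variable` non-identical columns.

    A column is "variable" if the aligned sequences do not all share the
    same character; a gap is treated as just another character.
    """
    if not msa:
        raise ValueError("empty alignment")
    length = len(msa[0])
    if any(len(seq) != length for seq in msa):
        raise ValueError("all aligned sequences must have equal length")

    # 1 where the column differs between sequences, 0 where it is identical.
    variable = [int(len({seq[i].upper() for seq in msa}) > 1)
                for i in range(length)]

    hits = []
    for start in range(length - window + 1):
        n_variable = sum(variable[start:start + window])
        if n_variable <= max_variable:
            hits.append((start, n_variable))
    return hits


if __name__ == "__main__":
    toy_msa = ["ACGTACGTAA", "ACGTACGTCC", "ACGTTCGTGG"]  # toy alignment
    print(conserved_windows(toy_msa, window=4, max_variable=0))
```

Sorting the returned windows by their variable-column count gives a simple ranking of candidate primer sites, in the spirit of minimizing mismatches rather than applying a fixed percent identity.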
Primer design approaches. (i) Primer design approach A (traditional). Using the MSA, homologous regions of the nucleotide sequences were identified and used as targets for primer binding. All primer sequences and details are reported in Table 2. Using the Benchling platform, primers were all designed to be 18 to 23 bp in length and create PCR amplicons of 180 to 240 bp in length to facilitate quantitative PCR (qPCR) analysis as a downstream application. The GC content was targeted to be 50% but exceptions were made, allowing the GC content to reach a maximum of 80% (Table 2) to accommodate the target regions. Although efforts were made to maintain the same melting temperature (65°C) between the upstream (forward) and downstream (reverse) primers for all primer pairs of a specific gene, priority was given to optimizing the melting temperatures of a specific pairing (albeit this rule had to be stretched on occasion as well). The position along the MSA which matched all these conditions was used to create primers and, when required, primers were designed with degenerate bases to ensure they targeted the desired locations of the MSA.
(ii) Primer design approach B (novel). As a result of the limitations of primer design approach A, a unique approach was developed to more efficiently target genes that share an annotation but have a lower percent identity. Again, using the Benchling platform, primers were designed to be 18 to 23 nucleotides in length, produce an amplicon of 180 to 240 bp, and have the same melting temperatures (65°C) between the upstream and downstream primer pairs. To avoid the use of degenerate bases, the primer positioning against the MSA was more flexible, allowing the binding locations for the primer pairs to drift while maintaining identical amplicon size. Using this approach, the amplicon size was a higher priority than annealing location, thereby preventing issues with downstream analysis across different primer pairs targeting the same genes in different genomes.
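The length, amplicon size, and GC content limits shared by the two approaches lend themselves to a simple automated check. The Python sketch below is a hypothetical helper, not part of the Benchling workflow: it assumes nondegenerate primers written with A, C, G, and T only, takes the amplicon length as an input rather than computing it, and does not attempt to estimate melting temperature.

```python
def gc_percent(primer: str) -> float:
    """GC content of a nondegenerate primer, in percent."""
    p = primer.upper()
    return 100.0 * (p.count("G") + p.count("C")) / len(p)


def check_primer_pair(forward: str, reverse: str, amplicon_len: int,
                      length_range: tuple[int, int] = (18, 23),
                      amplicon_range: tuple[int, int] = (180, 240),
                      max_gc: float = 80.0) -> list[str]:
    """Return a list of constraint violations (empty if the pair passes).

    Defaults follow the limits stated in the text: primers of 18 to 23
    nucleotides, amplicons of 180 to 240 bp, GC content of at most 80%.
    """
    problems = []
    for name, primer in (("forward", forward), ("reverse", reverse)):
        if not length_range[0] <= len(primer) <= length_range[1]:
            problems.append(f"{name} primer length {len(primer)} is outside {length_range}")
        gc = gc_percent(primer)
        if gc > max_gc:
            problems.append(f"{name} primer GC content {gc:.1f}% exceeds {max_gc}%")
    if not amplicon_range[0] <= amplicon_len <= amplicon_range[1]:
        problems.append(f"amplicon length {amplicon_len} is outside {amplicon_range}")
    return problems


if __name__ == "__main__":
    # Toy sequences, not primers from Table 2.
    print(check_primer_pair("ACGTACGTACGTACGTACGT", "TGCATGCATGCATGCATGCA", 200))
```

Returning a list of violations rather than a single pass/fail value makes it easier to see which exception (for example, an elevated GC content) was granted for a given target region.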
Primer binding testing. Intended and unintended primer binding locations were identified using the following in silico conditions on the Benchling platform. The SantaLucia 1999 algorithm was selected by default; the primer binding parameters were a minimum of 18 matched bases and a maximum of three mismatches in total, with no consecutive mismatches allowed; and annealing temperatures were between 30 and 100°C. The three allowed mismatches result in a minimum percent identity of >80%. Due to the binding algorithm of Benchling's primer binding, primers with mismatched 5′ ends had to be manually removed from the pool (no mismatches were allowed on the 3′ end by default). A full list of the binding locations for all MDREP primers is available in Table S1. To identify primer binding locations, the locus tag of the gene was collected from Benchling and the NCBI GenBank files were queried to identify the product name. In cases where no product name was provided for a specific locus tag, the protein ID was investigated and a gene identity was assigned from the region name.
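The mismatch rules described above can be expressed as a small filter. The following Python sketch is an approximation written for illustration and does not reproduce Benchling's binding algorithm: the function names are hypothetical, bases are compared character by character (so degenerate codes are not interpreted), only one strand of the template is scanned, and no thermodynamic calculation is performed.

```python
def binding_ok(primer: str, site: str,
               min_matches: int = 18, max_mismatches: int = 3) -> bool:
    """Apply the stated rules to a primer and an equal-length candidate site.

    Both strings are written 5' to 3' in the primer's orientation.
    Rules: at least `min_matches` matched bases, at most `max_mismatches`
    mismatches, no two consecutive mismatches, and no mismatch at either
    the 5' or the 3' terminal base.
    """
    if len(primer) != len(site):
        raise ValueError("primer and site must have equal length")
    mismatches = [i for i, (p, s) in enumerate(zip(primer.upper(), site.upper()))
                  if p != s]
    if len(primer) - len(mismatches) < min_matches:
        return False
    if len(mismatches) > max_mismatches:
        return False
    if any(b - a == 1 for a, b in zip(mismatches, mismatches[1:])):
        return False
    if mismatches and (mismatches[0] == 0 or mismatches[-1] == len(primer) - 1):
        return False
    return True


def find_sites(primer: str, template: str, **rules) -> list[int]:
    """Return 0-based start positions on `template` that pass `binding_ok`."""
    n = len(primer)
    return [i for i in range(len(template) - n + 1)
            if binding_ok(primer, template[i:i + n], **rules)]


if __name__ == "__main__":
    toy_primer = "ACGTACGTACGTACGTACGTA"  # toy 21-base sequence
    print(find_sites(toy_primer, "GG" + toy_primer + "TT"))
```

Under these rules a 21-base primer with three mismatches retains 18 of 21 matched bases (about 86% identity), which is consistent with the minimum identity above 80% stated in the text.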
Control targets. To further validate these approaches, primers were designed against a control group distinct from the MDREP class of genes used here. The control gene chosen was 3-isopropylmalate dehydratase large subunit (leuC), of which there are two annotated copies in A. woodii and one copy each in B. subtilis, D. vulgaris, G. subterraneus, P. putida, and T. aromatica. Primers were designed using the MSAs of the seven annotated copies, which were subsequently split into two distinct MSAs based on overall alignments. From these MSAs, primers with degenerate bases were designed, followed by unique primers for identical locations but with no degenerate bases, so that each individual sequence has a unique primer set. The same primer-binding conditions were used as before, with the exception that the minimum number of matched bases was lowered to 17 while the maximum of three mismatches was maintained, which still produces a minimum of 80% identity on primers of 20 bases in length.
16S comparison. A primer set targeting 16S rRNA has been selected as a tool for comparison against well-established universal primer sets (2). This primer set employs degenerate bases and consists of the upstream primer (5′-GTGCCAGCMGCCGCGGTAA) and the downstream primer (5′-GGACTACHVGGGTWTCTAAT), with annealing temperatures of 62.6°C and 48.9°C, respectively. These primers are designed to target the conserved nucleotide regions flanking the V4 region of the 16S rRNA gene and produce an amplicon of 292 nucleotides (2).
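To make the effect of degenerate bases concrete, the short Python sketch below expands a degenerate primer into the individual sequences it represents, using the standard IUPAC nucleotide codes. The function name is hypothetical, and the sketch is offered only as an illustration of why each specific sequence is present at a reduced share of the total primer concentration.

```python
from itertools import product

# Standard IUPAC nucleotide codes and the bases each one represents.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def expand_degenerate(primer: str) -> list[str]:
    """Return every nondegenerate sequence encoded by a degenerate primer."""
    choices = [IUPAC[base] for base in primer.upper()]
    return ["".join(combo) for combo in product(*choices)]


if __name__ == "__main__":
    upstream = "GTGCCAGCMGCCGCGGTAA"      # 16S upstream primer from the text
    downstream = "GGACTACHVGGGTWTCTAAT"   # 16S downstream primer from the text
    print(len(expand_degenerate(upstream)))    # M encodes 2 bases
    print(len(expand_degenerate(downstream)))  # H, V, W encode 3 x 3 x 2 bases
```

For the 16S set, the upstream primer corresponds to 2 sequences and the downstream primer to 18, so each individual downstream sequence makes up roughly one-eighteenth of that primer pool if the variants are synthesized in equal proportions.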

SUPPLEMENTAL MATERIAL
Supplemental material is available online only. SUPPLEMENTAL FILE 1, PDF file, 0.6 MB.