Sequence alignments and validation of PCR primers used to detect phylogenetically diverse nrfA genes associated with dissimilatory nitrate reduction to ammonium (DNRA)

PCR primer sets were designed to target nrfA, the gene encoding the pentaheme nitrite reductase NrfA, which catalyzes the nitrite ammonification step of dissimilatory nitrate reduction to ammonium (DNRA). Nucleotide alignments of the primer target regions of 271 nrfA sequences from reference genomes representing 18 distinct NrfA clades are shown here, along with validation of the primers in PCR-based methods, including amplified fragment length polymorphism (AFLP) profiling and Illumina amplicon sequencing of environmental samples and selected reference strains. Summary data tables illustrate the specificity of forward primers nrfAF2awMOD and nrfAF2awMODgeo, when paired with the new reverse primer nrfAR1MOD, in relation to consensus target reference sequences from members of the 18 NrfA clades. Specificity of the new primers for nrfA sequences in environmental samples is shown by AFLP analysis and by amino acid-translated amplicon sequences obtained with the new primer sets. We also provide sequence alignment files of the full-length nrfA genes, the PCR reference amplicons, the NrfA amino acid sequences, and the translated NrfA PCR amplicons. The full nucleotide and protein alignments contain 271 reference sequences representing the 18 identified NrfA clades and are intended as a tool to help practitioners examine new sequences corresponding to the primer target regions and to modify the primer design further where a specific study requires it. A more comprehensive analysis of these data is available in Cannon et al., 2019 ("Optimization of PCR primers to detect phylogenetically diverse nrfA genes associated with nitrite ammonification").

Published by Elsevier Inc. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).

Value of the data
Provides extensive nucleotide alignments of reference nrfA sequences in PCR-targeted regions corresponding to highly conserved motifs that are diagnostic for the pentaheme nitrite reductase protein NrfA, the key enzyme in the N-cycling process dissimilatory nitrate reduction to ammonium (DNRA).
Provides a detailed view of the sequence coverage by our primers for a majority of the known nrfA diversity, and of the methodological approach we took to validate the primer design and demonstrate efficient amplification of nrfA from different environments. This type of methodology is still an indispensable tool for the study of microbial communities and relies on updated primer sets for genes like nrfA that are harbored by highly diverse taxa.
The availability of an extensive sequence alignment of nrfA will provide users with a highly useful bioinformatics tool and a starting scaffold for assessing newly obtained sequences and evaluating the effectiveness of the primers presented in this study.
DNRA includes the key step of nitrite ammonification and is a microbial process that is more prevalent than previously thought in wide-ranging terrestrial and aquatic environments. Having updated molecular tools is paramount for researchers to assess the potential for this process in mixed community systems.

Data
Summary percent coverage of each NrfA clade by the forward primers nrfAF2awMOD and nrfAF2awMODgeo and by the reverse primer nrfAR1MOD is shown in Table 1. WebLogos depicting the consensus of the primers aligned to reference genomes in the target regions for forward primers nrfAF2awMOD and nrfAF2awMODgeo and reverse primer nrfAR1MOD are shown in Fig. 1. Alignments of the primer target regions from reference genomes to primers nrfAF2awMOD and nrfAR1MOD, grouped by NrfA clade, are shown in Fig. 2. Alignments of forward primer nrfAF2awMODgeo to its corresponding region in specific reference genomes that are not covered by nrfAF2awMOD are shown in Fig. 3. The primer sequence alignments to target regions of the nrfA genes are derived from full-length and partial nrfA sequence alignments made available in FASTA format (Supplemental nrfA and nrfA-amplicon FASTA files: "NrfA-gene-complete-Nucleotidealignment.fasta"; "NrfA-Gene-Amplicon-alignment.fasta"; and "NrfA-protein-alignment.fasta"). Graphic representations of AFLP data demonstrating the utility of the primer pairs, used individually or multiplexed together, are shown in Figs. 4 and 5 for reference genomic DNA and soil DNA, respectively. Amplicon sequencing of different soil and groundwater samples using the Fluidigm amplicon array followed by high-throughput sequencing yielded sequences of the expected size range (230–300 bp) from multiple clades (Fig. 6). Selected sequences were translated and aligned to reveal common amino acids conserved among both reference amplicons and environmental DNA-derived amplicons (Fig. 7 and Supplemental file "NrfA-Environ-amplicon-Translation-AA-alignment.fasta"). The data demonstrate the utility of the new primer sets for the detection of nrfA genes, and the sequence alignment data available here provide a reference tool and starting point for data analysis by researchers conducting DNRA studies.
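A per-clade coverage summary of the kind reported in Table 1 can be read as the percentage of reference sequences in each clade whose target region a primer matches. The following is a minimal Python sketch of that tally, not the procedure used in this study: the function name `percent_coverage_by_clade`, the clade labels, and the match flags are illustrative assumptions, and the criterion for a "match" (for example, zero mismatches) is left to the user.

```python
from collections import defaultdict


def percent_coverage_by_clade(records):
    """Return {clade: percent of sequences flagged as matched}.

    records: iterable of (clade_label, matched) pairs, where `matched`
    is True when the primer is considered to cover that sequence.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for clade, matched in records:
        totals[clade] += 1
        if matched:
            hits[clade] += 1
    return {clade: 100.0 * hits[clade] / totals[clade] for clade in totals}


# Hypothetical input: made-up match flags, not data from Table 1.
example_records = [
    ("A", True), ("A", True), ("A", False),
    ("I", True), ("I", False),
]
example_coverage = percent_coverage_by_clade(example_records)
```

Keeping the match criterion outside the function makes it possible to recompute the summary under a stricter or looser definition of coverage without changing the tally itself.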

NrfA sequence selection
A previous phylogenetic analysis of 272 full-length NrfA protein sequences, based on Bayesian inference, distinguished 18 clades possessing conserved features diagnostic of pentaheme NrfA proteins [2]. The final sets of new primers were tested in silico against a library of 271 aligned nrfA sequences assembled here (Figs. 1 and 2). NrfA sequences from three metagenome-assembled genomes (European Nucleotide Archive # PRJEB20068) belonging to Clades K and N and derived from the Illinois agricultural soils used in this study (described below) were included in this analysis [3].

Sequence alignment and primer design
All sequence alignments, mismatch identification, and analyses of temperature characteristics were made in silico using tools in MacVector software (v. 16.0.8, MacVector, Inc.). The resulting primer sequences were further analyzed for consensus alignment in silico against reference sequences grouped by clade membership.
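The mismatch identification described above was carried out in MacVector. As a rough illustration of the underlying idea only, the Python sketch below counts mismatches between a degenerate primer and an aligned target site using the standard IUPAC nucleotide ambiguity codes. The function name `count_mismatches` and the example sequences are hypothetical and are not the primers or targets from this study; the target is assumed to contain only unambiguous bases.

```python
# Standard IUPAC nucleotide ambiguity codes mapped to the bases they allow.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def count_mismatches(primer, target):
    """Count positions where the target base is not allowed by the primer code.

    Both strings must be the same length and in the same 5'->3' orientation.
    Any target character other than A/C/G/T (e.g. a gap) counts as a mismatch.
    """
    if len(primer) != len(target):
        raise ValueError("primer and target site must have equal length")
    mismatches = 0
    for p, t in zip(primer.upper(), target.upper()):
        if t not in IUPAC[p]:
            mismatches += 1
    return mismatches


# Hypothetical degenerate primer and two hypothetical target sites.
example_primer = "CARTGYCAYGT"
perfect_site = "CAATGCCACGT"
one_off_site = "CACTGCCATGT"
```

For a reverse primer, the comparison would be made against the reverse complement of the target strand so that both sequences are in the same orientation.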

Validation of primers
DNA extracts from reference strains representing different NrfA clades and originating from a variety of environments were used to test new primer pair candidates. The subset of accessible reference DNA included Serratia fonticola strain HAc5 (Clade A) (GenBank #JX293824.1), Shewanella oneidensis MR-1 (Clade C), Geobacter bemidjiensis Bem (Clade I), and Anaeromyxobacter dehalogenans st. 2CP-1 (Clades J and K). Full nrfA sequences were obtained from the Functional Gene Pipeline and Repository (FUNGENE) (http://fungene.cme.msu.edu/) database, version 9.5 (February 2018). S. fonticola strain HAc-5 was previously isolated from agricultural soils and a draft genome was previously obtained (Chee-Sanford, unpublished). DNA was extracted from reference cultures and soil using a phenol:chloroform extraction method [4]. The soil extraction was modified by the addition of glycogen (20 mg/mL) to enhance the recovery of DNA during precipitation. Soil DNA samples consisted of equal volumes of DNA pooled from extracts of soil taken in April 2012 and November 2012 from depths of 0–5 cm, 5–20 cm, and 20–30 cm at agricultural sites near Havana, Illinois (HW) and Urbana, Illinois (UM). DNA from additional soil and groundwater samples used specifically for amplicon sequencing was extracted using an abbreviated phenol:chloroform protocol [5], followed by glycogen-enhanced recovery as described above. Final DNA concentrations (~8–10 ng/mL) were measured using Qubit 2.0 fluorometry (Invitrogen) and by estimating DNA band intensities against quantitative DNA ladders following gel electrophoresis.

Optimization of PCR
All primers were HPLC-purified and obtained from IDT (Integrated DNA Technologies, Skokie, IL, USA). Stock concentrations (100 mM) of each primer were made by adding Invitrogen™ UltraPure™ for 10 min. DNA from four nrfA-containing organisms served as positive controls to test the efficacy of the primers (Fig. 4). PCR products were resolved by gel electrophoresis using 2.

Amplified fragment length polymorphism (AFLP) analysis
Amplified fragment length polymorphism (AFLP) analysis was used to assess the amplification efficiencies from a pool of different reference nrfA genes and to further corroborate the specificity of the forward primers nrfAF2awMOD and nrfAF2awMODgeo when paired with the reverse primer nrfAR1MOD. AFLP analysis was performed on amplicons generated from a mixed DNA pool (1 ng each) of reference DNA from S. fonticola HAc-5, S. oneidensis MR-1, A. dehalogenans 2CP-1, and G. bemidjiensis Bem (Fig. 4). A combined pool of both forward primers with the reverse primer was also tested against the same reference DNA to assess any inhibition that could result from competing reactions. The primer pair combinations used were 5'-(6-FAM)-nrfAF2awMOD/nrfAR1MOD, 5'-(6-FAM)-nrfAF2awMODgeo/nrfAR1MOD, and the combined forward primers 5'-(6-FAM)-nrfAF2awMOD + 5'-(6-FAM)-nrfAF2awMODgeo/nrfAR1MOD. All PCR products were diluted 50-fold with ultrapure water before being submitted for fragment size analysis (Roy J. Carver Biotechnology Center, University of Illinois, Urbana, IL). Fragments were sized following calibration against a MapMarker 1000 size standard and expected product sizes were accounted for in the resulting profiles. To test the application of AFLP to an
[Fig. 4 caption, partial: Table 2 in [1]) due to migration characteristics during column separation. *Cross-specificity of the primer set to another heme-binding sequence homolog that is not nrfA from G. bemidjiensis yields an additional product.]

Fluidigm array and amplicon-based sequencing
To verify that the designed primers yielded actual nrfA gene fragments, we included the redesigned primer pair in amplicon sequencing analysis of soil, groundwater, and reference genomic DNA pools. The reference pool consisted of equal masses of genomic DNA from the known DNRA taxa Desulfovibrio vulgaris st. Hildenborough, Anaeromyxobacter dehalogenans st. 2CP-1, Shewanella oneidensis st. MR-1, Geobacter bemidjiensis st. Bem, Serratia fonticola st. HAc5, and Bacillus sp. UAAc-7. Using the Fluidigm Access Array at the University of Illinois Carver Biotechnology Center, DNA from different samples (up to 48) was amplified using up to 48 primer pairs, one of which consisted of primers nrfAF2awMOD and nrfAR1MOD. The nrfA primers were one of 14 gene-specific primer sets evaluated in the Fluidigm array, which allowed us both to assess their application in multiplex PCR technology and to test the efficacy of the primer set in detecting nrfA genes in different environmental samples. Data from the other 13 gene amplicon sets collected from this array are not relevant to this paper. A standard annealing temperature of 55 °C was used during PCR amplification to generate a pool of amplicons. The Fluidigm-generated pooled amplicons from all PCR reactions were purified using a Qiagen™ QIAquick Gel Extraction Kit (Qiagen™, Valencia, CA, United States) according to the manufacturer's instructions. The DNA from the entire Fluidigm array was quantified and sequenced on one MiSeq flowcell for 301 cycles from each end of the fragments using a MiSeq 600-cycle sequencing kit version 3. Fastq files were generated and demultiplexed with the bcl2fastq v2.20 Conversion Software (Illumina). PhiX DNA was used as a spike-in control and removed during data processing. Read lengths were 300 nucleotides. The raw data were sorted by the PCR-specific primers, and paired-end reads were obtained and demultiplexed by sample index.
Sequence data were processed only for the nrfA gene amplicons in the reference genomic sample (no amplicons were obtained for G. bemidjiensis), one soil DNA sample, and one groundwater DNA sample. Briefly, paired-end reads were stitched together and filtered to the expected amplicon length using mothur [6]. The resulting FASTA files for each sample were reduced to a maximum of 1000 sequences and aligned using MacVector software. Any sequences outside the forward and reverse primer target regions were trimmed manually. Representative OTUs of clearly different taxa were selectively translated using MacVector, starting with the 5' end of the forward primer, which is known to be in-frame. The resulting amino acid sequences were separately aligned using MacVector to evaluate the predicted protein fragment for diagnostic residues expected in NrfA between the third and fourth heme-binding domains [2]. Aligned amino acids from reference NrfA and representative translated nrfA gene amplicons from soil and groundwater used in this study are shown in Fig. 7. A summary alignment of the amino acid sequences represented in this figure is included in the supplemental material as a FASTA alignment file, "NrfA-Environ-amplicon-Translation-AA-alignment.fasta".
Fig. 7. Aligned amino acids from reference NrfA and representative translated nrfA gene amplicons from soil (Soil-) and groundwater (GWNrfA-) DNA using the new primer pair design. Highlighted regions in sequence indicate expected conserved residues associated specifically with NrfA as identified by [2]. Clade designations are shown in parentheses.
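The length filtering, subsampling, and in-frame translation steps described above were performed with mothur and MacVector. The Python sketch below only illustrates the logic of those steps under stated assumptions: the 230–300 bp window is the expected amplicon size range reported in this article, the cap of 1000 sequences is applied by keeping the first 1000 (the source does not say how sequences were chosen), and translation uses the standard genetic code starting at the first base of the sequence. The function names `filter_by_length` and `translate_in_frame` are hypothetical.

```python
from itertools import product

# Standard genetic code in the conventional TCAG codon order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def filter_by_length(seqs, min_len=230, max_len=300, max_seqs=1000):
    """Keep sequences within the expected amplicon length, then cap the count.

    Assumption: the cap keeps the first `max_seqs` sequences in input order.
    """
    kept = [s for s in seqs if min_len <= len(s) <= max_len]
    return kept[:max_seqs]


def translate_in_frame(seq):
    """Translate from the first base (assumed in-frame), ignoring trailing bases.

    Codons containing characters other than A/C/G/T are reported as 'X';
    stop codons are reported as '*'.
    """
    seq = seq.upper()
    usable = len(seq) - len(seq) % 3
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, usable, 3)
    )
```

In practice an unexpected '*' or a run of 'X' in a translated amplicon is a quick signal that the read is out of frame or of poor quality and should be checked before it is added to an amino acid alignment.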