Genomic Resources for the North American Water Vole (Microtus richardsoni) and the Montane Vole (Microtus montanus)

Voles of the genus Microtus are important research organisms, yet genomic resources are lacking. Such resources would benefit future studies of immunology, phylogeography, cryptic diversity, and more. We sequenced and assembled nuclear genomes from two subspecies of water vole (Microtus richardsoni) and from the montane vole (Microtus montanus). The water vole genomes were sequenced with Illumina and 10× Chromium plus Illumina sequencing, resulting in assemblies with ∼1600,000 and ∼30,000 scaffolds, respectively. The montane vole was also assembled into ∼13,000 scaffolds using Illumina sequencing. Mitochondrial genome assemblies were also performed for both species. Structural and functional annotation for the best water vole nuclear genome resulted in ∼24,500 annotated genes, with 83% of these having functional annotations. Assembly quality statistics for our nuclear assemblies fall within the range of genomes previously published in the genus Microtus, making the water vole and montane vole genomes useful additions to currently available genomic resources.


Sequencing and nuclear genome assembly
Frozen tissue from a single M. r. arvicoloides individual collected from the southern Cascades Mountain range (JMS_292; 44.016667N, −121.750000E; [22]) was sent to Hudson Alpha (Huntsville, AL, USA) for high molecular weight DNA extraction and 10× Chromium library preparation [25]. In the 10× method, each extracted DNA fragment receives a different barcode before the fragment is sheared for library preparation. After sequencing, these barcodes are used to connect sequencing reads for a more contiguous assembly. After sequencing with a single run on an Illumina HiSeqX, the resulting 150-bp paired-end reads were input into Supernova v. 2.2.1 for de novo genome assembly with -maxreads=all [26].
To provide the most contiguous assembly for M. richardsoni, a hybrid assembly was performed using the ARCS+LINKS pipeline [32,33]. The ARCS+LINKS pipeline uses barcoding information from the 10× Chromium reads to scaffold the contigs from a separate genome assembly. Barcoded reads from M. r. arvicoloides were mapped to the M. r. macropus Discovar assembly with bwa mem v. 0.7.17 [34] before converting the mapped reads to BAM format and sorting with SAMTools v. 1.8 (SAMTOOLS, RRID:SCR_002105) [35]. ARCS v. 1.0.5 and LINKS v. 1.8.6 were then run with settings -s 98 -c 5 -l 0 -z 500 -d 0 -r 0.05 -m 50-10000 -e 30000 and -d 4000 -k 20 -l 5 -t 2 -a 0.3 -o 0 -a 0.3 -z 500, respectively.
As part of a separate project, a single M. montanus individual from Utah (UMNH:Mamm:30891; 38.19381N, −111.5824E) was misidentified as M. richardsoni. DNA was extracted from the sample using a Qiagen DNeasy Blood and Tissue Kit before being sent to the University of California Davis Genome Center for library preparation and sequencing. Then, 150-bp paired-end sequences were collected with a single shared run on an Illumina NovaSeq. Species identity was confirmed using the Barcode of Life Database (BOLD [36]). Reads were checked and trimmed for quality with fastQC and Trimmomatic as above, before mapping reads to the mitochondrial cytochrome oxidase I (COI) sequence of M. r. macropus [37] using bwa mem. The resulting mapped reads were converted to BAM format, sorted, and indexed with SAMTools. PCR duplicates were identified and removed with Picard v. 2.3.0 (Picard, RRID:SCR_006525) [38]. Resulting reads were then piled with SAMTools mpileup using base and mapping quality scores of 30, consensus sequences were generated with bcftools v. 1.3.1 (SAMtools/BCFtools, RRID:SCR_005227) [39], and consensus sequences were converted to fastq format using vcfutils with a minimum depth filter of 5 and maximum depth filter of 10,000 [35]. The resulting sequence was input into BOLD. Owing to low sequencing coverage, de novo genome assembly was not appropriate for M. montanus. To provide a preliminary genome sequence, a reference-guided genome assembly was performed with RaGOO v. 1.1 [40]. Raw reads were input into Discovar to generate an initial genome assembly. Misassembly correction was performed with RaGOO, using reads trimmed with the same settings as the M. r. macropus reads, and RaGOO was then used to scaffold the Discovar contigs onto the M. r. arvicoloides assembly; this is more closely related to M. montanus than the other available Microtus genome assemblies [3]. Since M. montanus has less than half the chromosomes of M. richardsoni (2n = 22-24 in M. montanus versus 56 in M. richardsoni [41]), the possibility of structural errors in the M. montanus assembly was examined by calculating the percentage of reads that mapped back to the assembly using bwa mem and bamtools v. 2.2.2 [42].
The final assemblies were submitted to GenBank [43], where screening was performed to identify any contamination, and contaminated scaffolds were removed. All assemblies were evaluated with QUAST v. 5.0.1 (QUAST, RRID:SCR_001228) [44], bbmap v. 38.35 (BBmap, RRID:SCR_016965) [45], custom Python scripts [46], and BUSCO v. 5.0.0 (BUSCO, RRID:SCR_015008) using the Euarchontoglires reference set [47]. After comparing assembly statistics from the different assemblies of M. r. macropus, the Discovar assembly was selected as the best one because it had less fragmentation, higher N50 and lower L50 values, and a higher BUSCO score than the SOAPdenovo assemblies (Table 1). Genome sequencing of M. r. arvicoloides produced over 800 million reads and 47× genome sequencing coverage. The final genome assembly consisted of ∼32,000 scaffolds, with an N50 of 2.3 Mb (megabase pairs), 1.3% missing data (N), and a BUSCO score of 85.8%. Supernova estimated the length of the genome assembled to be ∼2.4 Gb (gigabase pairs) and the total genome size to be ∼2.6 Gb. M. r. macropus sequencing produced over 600 million reads and 35× coverage. Genome assembly with Discovar resulted in ∼1.6 million scaffolds, with an N50 of 16 Kb (kilobase pairs), 0.06% N, and a BUSCO score of 54.5%. Given the many programs available to perform de novo genome assembly from short reads, another program might have produced a more contiguous M. r. macropus assembly, but previous studies have shown that Discovar performs well compared to other programs [48,49]. The hybrid assembly produced with the ARCS+LINKS pipeline had ∼1.6 million scaffolds, an N50 of 38 Kb, 0.09% N, and a BUSCO score of 59.8%. Because the hybrid assembly was of poor quality, it was not used for further analyses, and the M. richardsoni subspecies assemblies were kept separate. High fragmentation of the hybrid assembly is probably associated with fragmentation of the Discovar input assembly. Published results with this hybrid pipeline often include a much higher sequencing coverage of the input contigs to produce a better starting point for the pipeline. Therefore, in the future, additional Illumina sequencing with M. r. macropus might substantially improve the hybrid assembly. To produce the preliminary M. montanus genome, 108 million reads (13× coverage) were used, resulting in ∼13,000 scaffolds, an N50 of ∼3.1 Mb, 8.8% N, and a BUSCO score of 82.6%. Additionally, 89.3% of reads mapped back to the M. montanus assembly.

Mitochondrial genomes
The complete mitochondrial genomes of M. r. arvicoloides and M. montanus were assembled using the genomic sequencing reads. Mitochondrial genomes were assembled by To compare mitochondrial assemblies between methods, the assemblies were aligned RepeatMasker v. 4.0.9 (RepeatMasker, RRID:SCR_012954) was then used to further identify repeats using a combined repeat library that included the repeats identified from RepeatModeler and those from the RepeatMasker Rodentia database [60]. The percentage of the genome comprising each type of repeat element was extracted from the RepeatMasker log file for each genome assembly.
All genome assemblies used some form of Illumina sequencing (  Figure 1). Repeat content did not appear to be associated with phylogenetic relatedness, since repeats between the two subspecies of M. richardsoni were not more similar to each other than to other Microtus species. However, it is possible that the repeat content is affected by the continuity of the genome assemblies. Further research is needed to confirm this relationship.

Genome annotation
The M. r. arvicoloides genome assembly was annotated with the MAKER v. 2.31.10 (MAKER, RRID:SCR_005309) pipeline [61], loosely following the tutorial by Daren Card [62]. Briefly,  Assemblies with a * were produced by the present study. Note: in-depth methods for M. agrestis are not available, and it is possible that the assembly includes additional sequencing and/or methods.
the pipeline comprises masking repeats followed by multiple rounds of annotation, with both evidence-based and ab initio gene models. Repeats were identified as described above.
Complex repeats were then extracted from RepeatMasker results using grep with keywords "Satellite" and "rich". Within Maker, the model_org argument was set to "simple" so Maker Augustus was used to train the HMM using the -long option in BUSCO and the Euarchontoglires reference set. MAKER was then run again with the previously annotated gene models, and the HMM models from SNAP and Augustus. After the initial MAKER run, two cycles of ab initio gene prediction and annotation with MAKER were performed. To prevent overfitting, results were compared after each round of MAKER. Because the increase in AED score was minimal between the first and second rounds of ab initio gene prediction, further analysis was conducted on the results after the first round only. This round annotated ∼24,500 genes, with a mean gene length of 7445 bp (

REUSE POTENTIAL
The current study details the assembly and annotation of three nuclear and two

ETHICAL APPROVAL
Not applicable.

CONSENT FOR PUBLICATION
Not applicable.

COMPETING INTERESTS
SP is the director of Iridian Genomes, Inc. The other authors declare that they have no competing interests.

AUTHORS' CONTRIBUTIONS
DD, JS, and BC conceived the study. JS, SP, and BC provided funding for sequencing. DD performed DNA extractions, assembled genomes, and annotated genomes with input from SP. DD and BC wrote the manuscript with input from JS and SP. DD and SP submitted the resources to GenBank. All authors approved the final version.