The Genome of C57BL/6J “Eve”, the Mother of the Laboratory Mouse Genome Reference Strain

Isogenic laboratory mouse strains enhance reproducibility because individual animals are genetically identical. For the most widely used isogenic strain, C57BL/6, there exists a wealth of genetic, phenotypic, and genomic data, including a high-quality reference genome (GRCm38.p6). Now 20 years after the first release of the mouse reference genome, C57BL/6J mice are at least 26 inbreeding generations removed from GRCm38 and the strain is now maintained with periodic reintroduction of cryorecovered mice derived from a single breeder pair, aptly named Adam and Eve. To provide an update to the mouse reference genome that more accurately represents the genome of today’s C57BL/6J mice, we took advantage of long read, short read, and optical mapping technologies to generate a de novo assembly of the C57BL/6J Eve genome (B6Eve). Using these data, we have addressed recurring variants observed in previous mouse genomic studies. We have also identified structural variations, closed gaps in the mouse reference assembly, and revealed previously unannotated coding sequences. This B6Eve assembly explains discrepant observations that have been associated with GRCm38-based analyses, and will inform a reference genome that is more representative of the C57BL/6J mice that are in use today.

Despite being one of the best-assembled and curated mammalian reference genomes, GRCm38 still contains 523 gaps within chromosome sequences, and there are nearly 300 unresolved issues that have been reported to the GRC (https://genomereference.org). In addition to gaps, these issues include reports of localized sequence mis-assembly, missing genic and non-genic sequences, sequencing errors and suspect variation. These types of assembly issues inflate false positive rates in reference-based variant calling. For example, we reported an analysis of systemic exome variants called across a wide variety of mouse strains (including C57BL/6J) and showed that a significant fraction of these overlap with regions with annotated reference assembly issues and/or gaps (Fairfield et al. 2015). While a small fraction of these are expected due to private variation in the reference genome, the vast majority are likely technical artifacts resulting from unreported issues in the reference genome assembly, or regions where paralogous gene copies are not fully represented.
In an effort to close gaps and resolve other issues in the current mouse reference genome, to minimize variant calls associated with GRCm38private variation (i.e., to bring the mouse reference genome sequence closer to the C57BL/6J mice that are currently in use), to provide a de novo assembly representing a single individual, and to identify additional data to support unannotated genes, we used high coverage, long-read sequencing, optical mapping and short-read data to generate a de novo genome assembly from C57BL/6J Eve.

Sample preparation and sequencing -PacBio
Genomic DNA samples were extracted using both kidney and brain samples from the C57BL/6J Eve female (MouseID 03-03685 (The Jackson Laboratory), Strain ID 00664, born 8/27/2003, generation F223) by phenol-chloroform extraction of a nuclei-enriched pellet. DNA samples were resuspended in TE buffer to a final concentration of 300-400 ng/ul, 260/280 1.8-1.9 (Taylor and Rowe 1984). The PacBio data were generated from three libraries prepared using the Pacific Biosciences SMRTbell Template Prep Kit 1.0 (Pacific Biosciences, Menlo Park, CA, USA) using the "20-kb Template Preparation Using BluePippin Size-Selection System (15-kb Size cutoff)" protocol obtained from PacBio SampleNet. The BluePippin (Sage Science, Inc, Beverly, MA, USA) was set to collect from 7-50 kb. After sequence length QC, the resulting sized libraries were repaired using the "Procedure & Checklist-10 kb Template Preparation and Sequencing" protocol. All libraries were sequenced using 294 SMRT cells on Pacific Biosciences RS II platform (P6C4 chemistry). One of these libraries was generated and sequenced (10X) by Pacific Biosciences using the same protocols and chemistries. The other two libraries and the remaining coverage were generated and sequenced at The Jackson Laboratory.
For IsoSeq cDNA sequencing, RNA was extracted from archived whole brain samples from C57BL/6J Eve. 1 microgram of input RNA was used to generate cDNA (Clontech SMARTER cDNA synthesis kit), cDNA was size selected (3-6 kb) by BluePippin, and 400 ng of SMRT-Bell library was prepared as above. The library was sequenced on the Pacific Biosciences RSII platform (P6v2 chemistry), 532,941 reads were generated with a mean insert length of 3,004 bp. Quiver (Chin et al. 2013) was used to predict consensus isoforms and for polishing. There were 31,076 high-quality isoforms, and 11,661 lowquality isoforms with average consensus read length of 3,042 bp.
Sample preparation and sequencing -Illumina short read Genomic DNA was fragmented and Illumina whole genome libraries were constructed using the methods described in Hodges et al. (2009).
Steps 1-28 were followed to produce a whole genome library that was then sequenced rather than used in the enrichment portion of the protocol. The library was quantified by QPCR and sequenced on six lanes of an Illumina HiSeq GAIIX (Illumina, San Diego, CA, USA) using a 100 base paired end sequencing protocol.
Sample preparation and sequencing -Bionano optical mapping DNA Isolation: High-molecular weight DNA was extracted from mouse spleen. 70 mg of mouse frozen spleen tissue was place on a Petri dish over ice and chopped with a razor blade into approximately 2 mm chunks. The tissue was then transferred into a 15 mL conical. 1 mL of fixing solution (2% (v/v) formaldehyde, 10 mM Tris, 10 mM EDTA, 100 mM NaCl, pH 7.5) and left on ice for 30 min. Fixing solution was pipetted out and discarded. Tissue was washed 3 times by adding 2 MB Buffer (10 mM Tris, 10 mM EDTA, 100 mM NaCl, pH 9.4), swirling tube, and pipetting off MB Buffer. 2 mL of MB Buffer was added after the third wash. The tissue was blended using a fixed rotor-stator homogenizer (TissueRuptor, Qiagen #9001271) on high speed for 10 sec. The homogenate was transferred to a 2 mL microfuge tube and spun down at 2000 rcf for 5 min. at 4°. Supernatant was removed, pellet was resuspended in 1.5 mL of MB Buffer, and spin was repeated. Supernatant was removed and final pellet was resuspended in MB Buffer.
Resuspended cells were embedded into low-melting point agarose gel plugs, using the CHEF Mammalian Genomic DNA Plug Kit (BioRad #170-3591), following the manufacturer's instructions. Plugs were made with 10 mg, 15, mg, and 20 mg equivalents of the original starting material (mass equivalents calculated based on volume of final resuspended pellet). The plugs were incubated with Lysis Buffer (Bionano #20270) and Puregene Proteinase K (Qiagen #1588920) overnight at 50°, then again the following morning for 2 hr. (using new buffer and Proteinase K). The plug was washed, melted, and solubilized with GELase (Epicentre #G09200). The purified DNA was subjected to 4 hr. of drop dialysis (Millipore, #VCWP04700) and quantified using the Quant-iT PicoGreen dsDNA Assay Kit (Invitrogen/Molecular Probes #P11496). The plug made with 10 mg equivalents of starting material had a concentration of 286 ng/mL and was clear and viscous, so it was selected for further processing.
DNA Fluorescent Labeling: DNA was labeled according to commercial protocols using the NLRS kit (Bionano Genomics, #80001). Briefly, 300 ng of purified genomic DNA was nicked with 7 U nicking endonuclease Nt.BspQI (New England BioLabs (NEB), #R0644) at 37°for two hrs. in NEBuffer3. The nicked DNA was labeled with a fluorescent-dUTP nucleotide analog using Taq DNA Polymerase (NEB, #M0267) for one hr. at 72°. After labeling, the nicks were ligated with Taq DNA Ligase (NEB, #M0208) in the presence of dNTPs. The backbone of fluorescently labeled DNA was counterstained using the DNA Stain from the NLRS DNA Labeling Kit.
Data Collection: The labeled DNA was loaded onto Irys chips (Bionano, #20247) and inserted into the Irys instrument. The instrument automated the electrophoresis of the DNA into nanochannels, thereby linearizing them with uniform stretch throughout the molecule. The stationary molecules were then imaged, and the automated process of electrophoresis followed by imaging was repeated for multiple cycles until the desired amount of data were collected. The stained DNA molecule backbones and locations of fluorescent labels along each molecule were automatically detected using the in-house software package, IrysView. A total of 244 Gbp ($80X coverage depth) of data were generated, using 3 Irys chips.

Processing raw data -PacBio
Quality trimming of sequenced reads: Raw data from 294 SMRT cells were imported into the SMRT portal (https://bit.ly/2S63Tav) and subreads were extracted from the raw h5 files within PacBio SMRT portal. Subreads with polymerase read were further filtered according to the following criterion: quality , 75, read length , 50, and polymerase read length , 50. Finally, after filtering, 32,210,376 subreads (mean subread length of 5,753) were extracted from 20,081,751 raw reads, providing theoretical coverage of 66X for de novo assembly.
Error correction: Error correction of reads was accomplished with the MinHash Alignment Process (MHAP) (Berlin et al. 2015) within PBcR. Continuous benchmarking of correction parameters (k-mer size, hash size, min-mer size, error rate) was done to obtain the best possible set of corrected subreads. Our analysis indicated that usage of more sensitive parameters (MhapSensitivity = highOvlErrorRate = 0.05) significantly increased run time but overall improved the quality of corrected reads.

Sequence assembly -PacBio
Corrected reads from MHAP were assembled using the Celera assembler (CA 8.3) (default parameters) (Myers et al. 2000), which requires $60-70 gigabases of corrected sequence and consists of overlapper, unitigger, scaffolder and consensus steps to reconstruct genomes from corrected long reads.

Hybrid scaffolding
Single molecule high-resolution maps of the B6Eve genome were obtained using the Bionano Irys System (Das et al. 2010). Label positions captured in images and molecule map lengths were stored in CMAP format files (consensus map). The hybrid scaffold tool from Bionano genomics was used to further extend the scaffold size by combining the PacBio de novo assembly and genome map data of B6Eve. The hybrid scaffold pipeline created an alignment between the datasets and constructed super scaffolds excluding the conflicting alignments 33 .
Polishing and assembly evaluation The hybrid scaffolded assembly was polished using Quiver (Chin et al. 2013) to improve consensus accuracies in the range of Q60 and to reduce the high indel errors that are expected in the PacBio sequencing data (Zuo et al. 2012) . pbalign (https://github.com/PacificBiosciences/ pbalign) was used to create a bam file from all h5 files of SMRT cells and Quiver trained on P6-C4 chemistries were used to obtain the consensus corrected assembly. This assembly was further improved by using Pilon (Walker et al. 2014) with default parameters, that corrects bases, fixes mis-assemblies and fills gaps provided a draft assembly and paired-end Illumina sequencing data. Nearly 32X of Illumina data from B6Eve was used as an input to Pilon to fix the bases in B6Eve assembly. The GATK variant calling pipeline (following best practices) was used to call variant using Illumina data on the B6Eve assembly to judge the improvement in overall quality at each step of polishing. Finally, the QUAST tool (Gurevich et al. 2013) was used (-split-scaffolds) to compare the final assembly with GRCm38. The split-scaffolds option breaks the assembly and performs reconstruction of "contigs" which were used to build the scaffolds to compare the effectiveness of scaffolding.

RefSeq transcript alignments
Murine "known" RefSeq transcripts (those with NM and NR prefixes) were queried from NCBI Entrez on September 11, 2017 and aligned to the Pilon-corrected B6Eve assembly and GRCm38 full assembly (GCF_ 000001635.20). From these analyses, the counts of transcripts with low quality alignments, split alignments or no alignment to the GRCm38 primary assembly unit and B6Eve were determined, as were the counts of transcripts dropped for co-location, (Taylor and Rowe 1984). From the same set of RefSeq transcripts, we additionally identified alignments to GRCm38 and to the Pilon-corrected B6Eve containing frameshifting and non-frameshifting indels in CDS (Schneider et al. 2017). The frameshift analysis of the pre-polished/corrected assembly used a set of known RefSeq transcripts queried on February 28, 2016.

LiftOver construction
LiftOver was performed between the Eve assembly and GRCm38 reference genome using the same species lift over construction procedure 9 outlined by University of California Santa Cruz Genome Bioinformatics Group. Same species lift over construction contained two steps a) BLAT alignments and b) chaining and netting to obtain the lift over file. Genome loci of the Eve assembly were further converted into GRCm38 coordinates using the LiftOver 8 tool from UCSC utilities.

Repeat content assessment
The repeat elements in the GRCm38.p6 (excluding ChrY and alternate loci scaffolds) and B6Eve assembly were determined by RepeatMasker (Smit et al. 2013(Smit et al. -2015 trained on the mouse model by excluding RNA elements (-norna). A chi-square test was performed to identify the repeat classes that are enriched in one genome over the other.

Repeat analysis of unaligned B6Eve sequences
There were unaligned B6Eve sequences from three different alignment methods a) Cactus based alignment using UCSC Comparative Annotation Toolkit which was used for the B6Eve annotation, b) NCBI BLAST-based alignment of B6Eve to GRCm38, and c) QUAST (minimap2 aligner). A consensus set (common among all three set), was constructed using BEDTools (Matera et al. 2008). RepeatMasker was used to identify the repeat content in the common unaligned region. A chi-square test was used to test the differences in repeat content for each of the repeat classes of common unaligned sequence against GRCm38.

Resolving recurring variants
Recurring variants present in $ 75% of strains, detected in previous whole exome sequencing efforts 10 were extracted. Fixed (homozygous) variants from the mouse Collaborative Cross genome 11 project, as well as fixed variants from 24 C57BL/6J pedigrees descendent from Eve 12 were also obtained (G. Ananda and G. Churchill, unpublished data). We used LiftOver to remap the genomic coordinates, and a recurring variant was said to be resolved if the ALT allele of a recurring variant in exome data matches the REF allele in the B6Eve assembly. The analysis was restricted to only homozygous variant calls.

B6Eve annotation
B6Eve was annotated using the Comparative Annotation Toolkit ) (https://github.com/ComparativeGenomicsToolkit/ Comparative-Annotation-Toolkit commit c7852b4). As input, CAT was given a progressive Cactus alignment generated with rat rn6 (GCA_000001895.4) and human GRCh38 as outgroups as well as the GENCODE VM11 annotation on mouse GRCm38. CAT was provided with extrinsic transcript information from RNA-seq as well as IsoSeq. For mouse GRCm38, the same RNA-seq used in the CAT publication was used. RNA-seq data were generated from whole brain of a female C57BL/6J Eve descendent and aligned to the B6Eve assembly. To guide AugustusPB in detecting novel isoforms, a total of 26,188 IsoSeq full length cDNA reads were aligned to the B6Eve assembly.

Novel isoform detection
To detect novel isoforms, homGeneMapping (Konig et al. 2016) was used to map GENCODE VM11 annotation coordinates onto B6Eve, and these splice junction coordinates compared to AugustusPB and AugustusCGP transcript predictions filtered for IsoSeq support. Transcripts with annotation support were filtered out. The remaining candidate novel isoforms then were checked to see if they overlapped a comparatively annotated locus and if they contained either a fully novel exon or a splice site shift based on bedtools (Quinlan et al. 2010) intersections.

SV detection & gap filling
PacBio read alignment: Raw PacBio reads were aligned to GRCm38 using the long-read aligner NGMLR version 0.2.6. CoNvex Gap-cost alignments for Long Reads (NGMLR) (Beal et al. 2012a) is a long-read aligner designed to align PacBio reads with the focus on identifying structural variations. Stringent alignment requirements were used for identifying SVs: -i 0.85 argument to disregard alignments with identity with less than 85% and -R 0.5 option to ignore alignments containing less than 50% of the read length.
Structural variant calling: Sniffles (Beal et al. 2012a) (default parameters) was used to call SVs from the alignments produced by NGMLR. GATK was used to process Illumina WGS data from B6Eve. Best practices were used to generate a BAM file and SVS were called using Delly v0.7.7. SURVIVOR-1.0.3 (Buac et al. 2008) package was used to perform the integration of PacBio and Illumina calls.
Gap filling: We extracted the coordinates of the gaps from the GRCm38 chromosomes and further extended this to include the 50 Kb flanking both sides of the gap. We aligned B6Eve scaffolds to these padded regions using minimap2. We filtered the candidate alignments according to the following criterion: a) Must be the reciprocal best hit, b) Total alignment length .= 80KB, and c) Align to one unique location to the reference (extracted this information from assembly-assembly alignments). The retained alignments were further visualized in Integrated Genomic Viewer (IGV) to inspect insertion/deletion patterns around the gap region. To confirm and extract the gap spanning a B6Eve scaffold, we performed the reciprocal alignments, aligning the padded gap regions to B6Eve scaffolds, using minimap2. We filtered out candidate alignments not satisfying criteria mentioned above and visualized the retained alignments in IGV to inspect whether we observe the opposite of previously found insertion/deletion pattern. The sequence and locus of confirmed gap spanning B6Eve scaffolds were extracted and subjected to GRC internal curation.

Data availability
C57BL/6J mice and breeding generation information are available through the Jackson Laboratory. This Whole Genome Shotgun project has been deposited at DDBJ/ENA/GenBank under the accession LXEJ00000000. The version described in this paper is version LXEJ02000000. The raw PacBio, Illumina and Bionano data used were deposited at NCBI BioProject under accession PRJNA318985. The B6Eve assembly along with annotation and an assembly hub are available at ftp link (ftp://ftp.jax.org/b6eve). Visualization of the assembly can be found at https://genome.ucsc.edu / MyData /Track Hubs / My Hubs with the following URL: ftp://ftp.jax.org/b6eve/assemblyhub/hub.txt. All supplementary figure,

RESULTS AND DISCUSSION
The Jackson Laboratory manages the rate of genetic drift through periodic replenishment of foundation breeding colonies from pedigreed, cryopreserved embryo stock (Wiles and Taft 2010) that are three generations removed from a single brother-sister breeder pair, "Adam" and "Eve" (Figure 1). This process introduces a controlled bottleneck that minimizes the accumulation of genetic change. These two individual mice capture an evolutionary snapshot in time at inbreeding generation F223 (Figure 1). Therefore, any C57BL/6J individual obtained from the production colonies at The Jackson Laboratory today is limited to a maximum of 24 inbreeding generations removed from the mice whose DNA was used to generate the C57BL/6J reference assembly, GRCm38 (Figure 1). Under the highly selective breeding paradigms employed for inbred laboratory strains, this genetic distance is sufficient for rapid fixation of 98.7% of variants, such that today's C57BL/6J mice are by definition a sub-strain of the animals from which GRCm38 is derived (Green 1981). Therefore, we chose DNA isolated from one of these individuals as the material for our de novo assembly with the goal of providing a genome sequence from a single individual that is not more than eight generations removed from C57BL/6J mice sourced from The Jackson Laboratory today, and that might also be used to improve the current C57BL/6J reference assembly (GRCm38). We chose C57BL/6J Eve (B6Eve) to get balanced representation of the X chromosome and the autosomes. Future efforts are focused on a de novo assembly of Adam, where de novo assembly the Y chromosome will require more specialize approaches.

Sequence assembly and evaluation
To generate data for our de novo assembly of the B6Eve genome, we used a range of technologies, including Pacific BioSciences (PacBio) long read technology at 66X whole genome coverage (Table S1), Illumina short read at 32X whole genome coverage, and Bionano Genomics (BNG) optical maps. The overall assembly procedure involved 1) correction of PacBio reads, 2) creation of contigs from PacBio, 3) extension of contigs to scaffolds using optical maps, 4) polishing of the assembly, and 5) further correction of the assembly using Illumina data ( Figure 2). To assess base-pair level improvements in assembly quality afforded by each of these steps, we mapped the Illumina reads of B6Eve and called variants at each step using the GATK HaplotypeCaller (Depristo et al. 2011) (Table S2). The scaffolded assembly yielded 1,664,599 variants, the majority of which were insertions ($70%), followed by SNPs and small deletions. This pattern is reminiscent of the error profile generated from PacBio technology (Zuo et al. 2012). To improve base pair accuracy, we used Quiver software (Chin et al. 2013) to polish the assembly and reduced the total number of variants to 505,782. We found that compared to the unpolished assembly, Quiver reduced the number of insertions to just 22%. Finally, we used B6Eve Illumina data itself to correct the polished assembly with Pilon (Walker et al. 2014), a tool that improves the quality of the draft assemblies using read alignment analysis. After Pilon correction, the number of variants was narrowed down to 310,205; of which 227,523 were considered high quality (PASS) by GATK HaplotypeCaller (File S3).
Since these sequencing data sets were generated from the same individual female, we surmised that most of these high-quality variant calls were due to technical artifacts arising from errors in the B6Eve genome assembly, alignment errors, or errors in the Illumina data. Although we couldn't exclude the possibility of true variant calls resulting from somatic variation since different tissues were used to source DNA. To explore these variants in more detail we plotted the distributions of alternate allele frequency, genotype called, and total depth relationship for these (File S4). We found that the vast majority of these variant calls were low frequency calls, which are more commonly due to technical artifacts or somatic mosaicism. To place these variants onto the mouse reference genome, we lifted them over to GRCm38 where roughly 95% (210,571 [SNP:152,486 and Indels:58,080]) could be placed onto a chromosome (File S5). We found that 62% percent of these variants mapped repeat elements making it likely that the variants are technical artifacts due to alignment issues. Many of these variants also mapped to a prominent cluster on Chr12 $17.5 MB containing 10,464 variants within a 1.5 Mb region including exonic, intronic, and intergenic sequence (File S6). The non-random distribution of these variant calls also supports the conclusion that they were technical artifacts due to misalignment of reads in a poorly assembled region in B6Eve. A closer look at the GRCm38 assembly revealed the presence of segmental duplications in this region, which may have caused a collapse in the B6 Eve assembly.
We also called variants using Eve Illumina data aligned to GRCm38 and found significantly fewer variants (69,089 [SNP: 50,143 and Indel: 18,946]), about 10% of which were shared with the variants resulted from the mapping of Eve Illumina data to the Eve assembly (File S7). Except for a few minor clusters, including one on Chr1, these shared variants were uniformly distributed. The shared Chr1 cluster had 560 variants which were present in both variants call sets (Eve Illumina-Eve PacBio Assembly and Eve Illumina to GRCm38). Of these variants, there were two prominent clusters on Chr1: 85062206-85279614, Chr1: 88212297-88310615 with 100 and 110 variants, respectively. The first region includes the Csprs gene and second includes a known GRCm38 assembly issue which was also unresolved in the B6Eve assembly. Similar to the Chr12 region above, these regions on Chr1 are flanked by segmental duplications in GRCm38.
We used an automatic assembly quality evaluation tool, QUAST (Gurevich et al. 2013), to assess the overall quality of the Pilon corrected assembly. We found that during assembly-assembly alignment, 96.8% of the B6Eve assembly aligned to 97% of the GRCm38 (excluding ChrY and alternate sequences) reference genome chromosomal sequences and 54% of unplaced & unlocalized sequences ( Figure S1). Only 166 and 2,456 B6Eve components comprising of 3.7 Mb and 7.2 Mb of sequences remained wholly or partially unaligned to the reference genome, respectively. We also found that the K-mer based completeness of B6Eve was very high at 97.4%, suggesting high coverage and per-base quality. The detailed QUAST report for complete and broken-down (breaks the assembly by continuous fragments of N's of length $ 10) versions of the assembly is found in File S8. Taken together, our resulting PacBio-only de novo B6Eve genome assembly was 2.53 Gb consisting of 14,551 contigs (longest contig = 4,574,471 and 2.3% of total contigs exceeding 1Mb) with an N50 size of 401,294 bp. Our complete PacBio-Bionano hybrid assembly yielded an N50 of 1,290,032 bp with a total assembly size of 2.79 Gb, which was a 3X improvement (in N50) over the PacBio-only assembly (Table 1).

Gene Content Analysis
We also evaluated the gene content of the B6Eve assembly as another measure of assembly quality. Akin to previous analyses (Taylor and Rowe 1984;Schneider et al. 2017) we aligned 36,009 RefSeq transcripts to the GRCm38 primary assembly (also excluding unplaced sequences and alternate loci scaffolds from other mouse strains that are a part of the full assembly) and to the "Piloned" B6Eve assembly versions (Table 2 and  Table S3). We observed that B6Eve provides a comparable representation for total gene content, as compared to GRCm38, with only 19 nonchromosome and ChrY associated sequences having no alignment. Consistent with the more fragmented nature of the B6Eve assembly, however, a greater number of aligned RefSeq transcripts exhibited partial alignments or alignments split over multiple scaffolds than in GRCm38. We examined the co-placement of transcripts representing different genes as a proxy for measuring the collapse of segmental duplications. Although B6Eve showed a greater number of co-placed transcripts than GRCm38, these numbers were consistent with those seen in other high-quality long-read derived WGS assemblies, demonstrating the utility of this mouse assembly. To gauge the impact of the Illumina-read correction step on the quality of protein representation in the B6Eve assembly, we looked at the incidence of frameshifting indels in aligned RefSeq transcripts prior to and after this step (Florea et al. 2011) (Table S3). Although the Pilon corrected assembly still exhibited more frameshifts than GRCm38, we found that this step resulted in a substantial improvement in functional representation (protein coding sequence).

Reference Assembly Gap Filling
The GRCm38 chromosome assemblies contain 440 gaps (excluding centromere, short arm, and telomere gaps). We assessed whether sequences in the B6Eve assembly could resolve these gaps. Based on Figure 1 Origin of the inbred strain C57BL/6J. Inbred laboratory mouse strains are maintained by brother x sister mating. Filial (F) generations from which mice contributing to the reference assembly clone libraries and from which the B6Eve mouse were derived are shown. Cryopreserved embryo stock is represented by blue snowflakes at F226, 3 generations from Adam and Eve at F223. Generations subsequent to the cryopreservation event are F226p###, e.g., F226p230, which means embryos cryopreserved at F226 were recovered and there were an additional 4 generations of subsequent inbreeding. our gap-filling methodology (see Methods), the B6Eve assembly spanned 23 gaps in the GRCm38 chromosomes (Table S4 and Figure S2). In several instances, we observed discrepancies between the gap length reported in GRCm38 and the amount of sequence provided by the B6Eve assembly. For example, the B6Eve assembly spanned a 1,760 bp intra-scaffold gap located at Chr2:172,624,657-172,626,416 bp in GRCm38, with 1,620 bp (a 140 bp relative deletion) ( Figure S2a). In other cases, B6Eve spanning sequences were longer than assembly gaps. For example, a B6Eve assembly scaffold spanned the 100 bp intra-scaffold gap at Chr1:183,334,907-183,335,006 bp in GRCm38, with 595 bp sequence ( Figure S2b). These discrepancies were not unexpected, however, as the methods used to estimate reference assembly gap sizes do not always offer base-pair level resolution, and also because the GRC assigns default gap lengths when no sizing estimates are available (https://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/ TPF_Specification_v1.8_20131106.docx). Consistent with prior reports that remaining reference assembly gaps are in complex genomic regions (Church et al. 2011), we observed that our gap spanning sequences had a repeat content of 45.9% (vs. 42.5% of GRCm38 total sequence) repeats, with simple repeats accounting for 18.1% (vs 2.6% in GRCm38 total sequence).

Variant analysis
Previously, we reported whole-exome sequences from of a collection of nearly 200 unique strains of spontaneous mutant mice maintained at The Jackson Laboratory (Fairfield et al. 2015). In our analysis of these exomes, we found that there were 855 coding variants (SNPs) common across 75% or more of the samples, which we attributed to errors with the reference genome itself due to their significant inclusion within the component-mapped boundaries of GRC incident features. We investigated the subset of these exome variants (n = 126) that were homozygous across all strains (100% allele frequency). We found that 10 (7.9%) matched the B6Eve assembly allele rather than the GRCm38 allele, supporting the assertion that high frequency alleles are putative indicators of reference assembly error (Figure 3, Table S5).
To extend this analysis to the whole genome, we performed a similar analysis using variant calls from sixty-nine multi-parent, recombinant n  (Figure 3, Table S5) matched the B6Eve assembly allele rather than GRCm38. Finally, we analyzed variant calls from whole genome short read sequencing of 24 recent descendants of B6Eve, representing multiple inbred lineages. We found 3,203 homozygous variants (SNPs) common across these samples; of these, 2,194 (68.5%) met minimum alignment criteria for remapping to B6Eve. Of these 2,194, we found 393 cases (12.6% of the total variation and 17.9% of net variation) (Figure 3, Table  S5) where the reported alternate alleles matched the B6Eve assembly.
Taken together, our analyses identified 503 single nucleotide positions in GRCm38 (excluding variants common in three datasets) that are not representative of today's C57BL/6J mice. Two of these are nonsynonymous SNPs in Akap9 and Sfi1. Akap9 (A kinase anchoring protein 9) is a protein that is responsible for cytoskeletal organization and is required for formation and maintenance of the blood-testis barrier, and male fertility (Schimenti et al. 2013;Venkatesh et al. 2016). Using allele specific PCR, we confirmed the presence of the alternate Akap9 allele in B6Eve and in randomly selected descendants of Eve, and not in an ancestor of B6Eve. Therefore, Akap9 represents a variant that arose in and is now stably maintained in C57BL/6J mice in the years since the sequencing GRCm38. Sfi1 is predicted to be a spindle assembly associated protein on the basis of homology with a yeast cytoskeletal protein of known function, however no phenotypic alleles have been reported in mice (Salisbury 2004). When we attempted to validate this variant, however, we discovered that the alternate allele is indeed represented in the mouse reference genome, though it aligned to an unplaced scaffold (JH584304.1: 14,038-14,339). Flanking sequence variation between this scaffold and Chr11 allowed us to design allele specific primers, with which we confirmed that both alleles are present in DNA samples from B6Eve and from randomly selected descendants of B6Eve, supporting the idea that the unplaced scaffold indeed represents C57BL/6J sequence. Previously published SV data for C57BL/6J showed that the mouse genome potentially harbors 20-30 copies of this gene (Quinlan et al. 2010). Therefore, the recurrent "variation" observed in this gene is likely not allelic, but due to mis-mapping of reads from paralogous gene copies to the Sfi locus that is currently represented on GRCm38 Chromosome 11. Paralogous gene variation may be a previously underappreciated source of variation, since we observed a relative enrichment of variants within certain genes (e.g., Tulp4, Table S5). Previous studies have shown that GRCm38 is missing paralogous copies of many genes (Church et al. 2009;Church et al. 2011), some of which may be represented on unplaced scaffolds as we found for Sfi1.

Structural Variation
We aligned raw B6Eve PacBio reads to GRCm38 using NGMLR (Beal et al. 2012a) and called structural variants (SVs) with Sniffles (Beal et al. 2012a). We also aligned Illumina WGS data from B6Eve and called SVs with Delly (Beal et al. 2012b) (Table 3). The median size of detected duplication, deletion and inversion events from Delly were 901, 2,610 and 12,362 bp, respectively. Similarly, from PacBio data, the median size of duplication, deletion, inversion and insertion events were 432, 77, 1352, and 92 bp, respectively.
We found 12 deletion, 43 duplication and 4 inversion calls that were common in both Illumina and PacBio data (Table S6). Of these common SVs, 8 deletions, 30 duplications, and 4 inversions overlapped genes (Table S6), though mostly within noncoding intronic regions. We used DGVa to further investigate the SVs overlapping genes. Each of these were associated with multiple (21-124) DGVa entries representing germ line SV across genetically diverse inbred strains from multiple strain surveys of SV (Cutler et al. 2007;Graubert et al. 2007;Keane et al. 2011). Some of these regions contain genes that have been previously been shown to be subject to positive selection of copy number variants in inbred laboratory mouse strains. Our data showed that even within a strain, we detected SV in these regions, which suggests that these regions are by their very nature susceptible to rearrangements, i.e., through suppressed recombination. Alternatively, recurrent SV calls could reflect either private SVs in the reference assembly, or mis-assembly of these regions.

Repeat analysis
Repetitive sequences present challenges to assembly, as highly identical repeat sequences from different genomic regions are often incorrectly assembled together. This is a particular concern for data generated from short-read technologies, which are too short to span longer repeats. One advantage of long read sequencing reads are their ability to span a greater range of genomic repeats into unique sequence, enabling resolution of repetitive regions that cannot be resolved in unlinked short read n  (Table S7).
We also used RepeatMasker analysis to assess repeat content in scaffolds from our B6Eve assembly that failed to align to GRCm38, as we surmised these scaffolds might contain repetitive sequences that could not be resolved with genomic clones. To do this we focused on sequences which are unaligned by all of three of the following methods a) Cactus based alignment using UCSC Comparative Annotation Toolkit b) NCBI assembly-assembly alignments and c) QUAST evaluation. The common unaligned sequences (total 6.12 MB) (File S9) between CACTUS, NCBI, and QUAST had significant enrichment for repeats relative to aligned sequences. The repeat content accounted to 77.6% (4,754,330 out of 6,128,602 bp) with the microsatellite repeat class showed significantly enrichment when compared with GRCm38 (59.9% vs. 0.1%, chi-square test: X 2 = 80,332,000, p-value , 2.2e-1 (Table 4). While more work is needed to determine the underlying cause of failed alignment, the enrichment of microsatellite repeats in Figure 3 Ideogram of GRCm38 assembly annotated to highlight resolved gaps (vs. current reference), structural variants, and fixed variation using B6Eve data.
n these scaffolds is compelling. Microsatellite repeats are prone to slippage during DNA replication, and as a result their copy number is highly polymorphic in eukaryotic genomes; a phenomenon known as microsatellite instability (MSI). In inbred laboratory mouse strains and in the human population, mutations that change copy number occur at rates that are up to 10,000 times higher than single nucleotide mutation rates (1-3 · 10 24 (Beal et al. 2012a;Beal et al. 2012b;Zuo et al. 2012) per repeat per generation for microsatellite sequences vs. 2-4 · 10 29 (Milholland et al. 2017) per nucleotide per generation for SNV in C57BL/6J). Similarly, in the human population, CNV are estimated to occur at rates that are 100-10,000 times higher than the point mutation rate (Hu et al. 2017). Taken together, CNV are a major source of intrastrain variation and divergence from isogenicity (Watkins-Chow and Pavan 2008; Locke et al. 2015). Therefore, failed alignment of these microsatellite containing scaffolds could be due to repeat polymorphisms that have arisen over the intervening years in C57BL/6J. Alternatively, failed alignment could be due to assembly issues in either genome.

Gene prediction
Long read sequencing of cDNAs (IsoSeq) provides full-length transcript sequences and highly accurate representations of splice junctions and isoforms. To determine if long read sequencing data of B6Eve cDNAs could support more accurate gene prediction for the mouse reference genome, we generated IsoSeq data from RNA extracted from archived B6Eve brain. We used the Comparative Annotation Toolkit (CAT) (Fiddes et al. 2018) to identify 107,192 transcripts (82,187 protein-coding) representing 41,669 gene loci (20,182 protein-coding). 2,426 transcript predictions had splice junctions that were novel relative to GENCODE VM11, and we found additional support for these junctions in RNA-Seq data generated from the brain of a female C57BL/6J descendent of Eve. Analysis of the transcript predictions produced by AugustusPB and AugustusCGP revealed 206 exons with splice site shifts relative to GRCm38, nine putatively novel exons and ten putatively novel loci (Table S8). Three of the novel exons detected in the IsoSeq data reveal deletions in GRCm38: (1) 640 bp in Mia3 (Figure 4), (2) Traf5, and Slc26a6 ( Figure S3 and Figure S4). In support of these data, there are GRC incident reports describing deletions at each of these loci in GRCm38. Contiguity analysis in the B6Eve assembly showed that 616 genes mapped across two or more scaffolds and four genes had projections split on the same scaffold. A total of 258 protein-coding genes exhibited signs of gene family collapse, with 156 pairs of genes being resolved to the same locus.

Conclusions
The value of isogenic mouse strain backgrounds in biomedical research was recognized by geneticists in the early 20 th century leading to the creation and description of the over 450 unique inbred mouse strains to date (Silver 1995;Beck et al. 2000). Twenty generations of sibling intercrosses are required for the generation of a new inbred strain; a breeding method that creates genomes in which more than 98% of loci are homozygous. Therefore, individuals within a generation, within the same vivarium are essentially, genetically identical. The remarkable genetic architecture of inbred laboratory mouse strains is shaped by the frequent bottlenecks required for the on-going maintenance of these strains. This accelerates genetic drift and is a major source of the often unexpected, genetic variation that can be observed across generations and/or between vivaria. For example, a reference-based alignment of the inbred laboratory strain C57BL/6J yields approximately 900 raw variant calls (SNPs/Indels), despite it being the same inbred strain as the mouse reference genome (Fairfield et al. 2015). While a subset of these variant calls are the expected result of genetic drift, we previously found that a significant percentage of these variants are located in regions of the reference genome where there are reported assembly issues (Fairfield et al. 2015) and regions that contain missing paralogs, which are a known source of false positive variation due to mis-mapping of reads (Church et al. 2011;Hu et al. 2017). A major goal of this de novo assembly was to generate long read sequencing data that could potentially be used resolve these regions and to provide an updated representation of the variation that is present in the most recent inbreeding generations of C57BL/6J. The generation of the B6Eve assembly also provides important insights into the relative merits and limitations of different sequencing and assembly approaches. As we demonstrate, long-read WGS assemblies can resolve regions in which there is no coverage in clone-based assemblies. As some genomic regions are recalcitrant to cloning in vectors, alternate technologies like long-read WGS are critical to completing a gapless assembly. However, even long-read WGS assemblies of mammalian genomes are prone to collapse of highly repetitive or segmentally duplicated regions where these lengths are greater than the average read length. The impact of these collapses manifest as false positive variant calls in regions where paralogous variants were miscalled. As sequencing technologies improve and read lengths get even longer, we may reach a point at which the need for genomic clones in assembly is obviated. However, to generate very high-quality reference assemblies, it will also be important to further reduce the error rates associated with long reads, even beyond the corrections achieved with n short reads. Currently, generation of the highest quality assemblies requires a mix of techniques. Using variant data from our B6Eve assembly, as well as data from several other large sequencing efforts (Srivastava et al. 2017), we provide a "truth" call for over 500 high quality, recurring variants (SNP/Indels) that can be used to update the mouse reference genome (GRCm38.p6). We also found evidence for over 40 structural variants (inversions, deletions, and duplications) involving protein-coding genes in our B6Eve assembly compared to the reference genome. The majority of these SV calls were found in DGVa across a variety of strains, suggesting that they are likely recurrent SV calls that, similar to recurrent variants, are due to mis-assembly of paralogous sequences or reference specific SVs. Further, our data fill 23 gaps of varying length in the mouse reference genome which will be used to inform the upcoming release of GRCm39.
Our IsoSeq data provided improved/more accurate gene models with previously unrecognized splice junctions for over 2,000 genes. This is likely an underrepresentation since our analysis is limited to only those genes expressed in brain. We also found evidence for novel exons, as well as evidence for novel loci (expressed regions that lack gene annotation). This demonstrates that even in a well-curated reference genome assembly, gene annotation remains subject to change as new technologies provide improved representation of transcribed sequences and access to more highly specialized cell types.
Overall, our de novo assembly of Eve is not as polished as GRCm38, a clone-based assembly that benefits from more than 20 years of on-going curation and annotation, but it does provide key enhancements and a full picture of the types and sources of technical error in re-sequencing. Whole genome sequencing data are now available for hundreds of standardized laboratory inbred mouse strains (Keane et al. 2011;Srivastava et al. 2017;Lilue et al. 2018). These data reveal the remarkable architecture of inbred genomes, and provide a stark reminder that isogenic mouse strains are subject to genetic drift; a feature that directly conflicts with the idea of a 'reagent-grade' laboratory mouse. Careful breeding practices, cryoarchiving, and routine sequencing are key steps toward maximizing reproducibility of studies that rely on these living reagents. Ultimately, de novo assembly captures the full spectrum of genetic variation resident in inbred strains, some of which harbor significantly more variation than distantly related human populations. Recently, genome graphs have been used to represent "population reference genomes" as a means to improve read mapping and to minimize false positive variant calls (Rosen et al. 2017;Garrison et al. 2018). As applied to mouse genomes, this approach would ideally provide a framework for future representation of the laboratory mouse reference genome as a graph of many inbred strains upon which emergent variation can be more accurately discovered and used to guide experimental research involving laboratory mouse strains.

ACKNOWLEDGMENTS
LGR, VKS, AS, NR, and OZ were supported in part by NIH R24 OD02135 awarded to LGR and The Jackson Laboratory. The work of VAS and F. T-N was supported by the intramural research program of the National Library of Medicine, National Institutes of Health. We are grateful to the services provided by The Jackson Laboratory Genome Technologies and Computational Sciences Core, which are supported by a grant to The Jackson Laboratory Cancer Center, NCI P30 CA034196. Figure 4 The Mia3 locus from the perspective of both the B6Eve assembly (top) and the GRCm38 mouse reference (bottom). CAT annotation of B6Eve identified three isoforms with an IsoSeq supported exon not found in the reference. The cactus alignments (blue bars) show that there are 43 bp of reference sequence that does not align to B6Eve, and that there are 638 bp of B6Eve not seen in the reference. These 638 bp contain the extra exon. This result is confirmed in the B6Eve IsoSeq GRCm38 alignment, which shows an insertion (white blocks between gray exon alignments).