New de novo assembly of the Atlantic bottlenose dolphin (Tursiops truncatus) improves genome completeness and provides haplotype phasing

Abstract High-quality genomes are essential to resolve challenges in breeding, comparative biology, medicine, and conservation planning. New library preparation techniques along with better assembly algorithms result in continued improvements in assemblies for non-model organisms, moving them toward reference-quality genomes. We report on the latest genome assembly of the Atlantic bottlenose dolphin, leveraging Illumina sequencing data coupled with a combination of several library preparation techniques. These include Linked-Reads (Chromium, 10x Genomics), mate pairs (MP), long insert paired ends, and standard paired end. Data were assembled with the commercial DeNovoMAGIC assembly software, resulting in two assemblies, a traditional “haploid” assembly (Tur_tru_Illumina_hap_v1) that is a mosaic of the two parental haplotypes and a phased assembly (Tur_tru_Illumina_phased_v1) where each scaffold has sequence from a single homologous chromosome. We show that Tur_tru_Illumina_hap_v1 is more complete and more accurate compared to the current best reference based on the amount and composition of sequence, the consistency of the MP alignments to the assembled scaffolds, and on the analysis of conserved single-copy mammalian orthologs. The phased de novo assembly Tur_tru_Illumina_phased_v1 is the first publicly available for this species and provides the community with novel and accurate ways to explore the heterozygous nature of the dolphin genome.


Introduction
Technical advances in the past decade have reduced sequencing costs and improved access to sequencing data. Subsequent improvements in DNA extraction, preparation, and assembly algorithms facilitate low-cost accurate de novo genome assemblies. Such assemblies are essential for constructing haplotype diversity databases for breeding, comparative biology, medicine, and conservation planning. Even highly complex genomes now benefit from higher contiguity and improved protein coding coverage [1][2][3][4]. Consortium efforts to catalogue biodiversity of pivotal species of comparative evolutionary significance will continue to drive novel low-cost approaches toward reference-quality assemblies with chromosome-level resolution [5][6][7]. Here, we use a combination of methods to drive improvements in assembly structure for the Atlantic bottlenose dolphin (Tursiops truncatus; National Center for Biotechnology Information [NCBI]: txid9739). This genome assembly, like those of the Hawaiian Monk seal and African wild dog, is being published with the goal to facilitate research on comparative genomics, provide structure for cataloging biodiversity, and ultimately support decisions around species conservation and management [8,9].
The bottlenose dolphin is one of the most widely studied marine mammals; however, the taxonomy of the Tursiops genus remains unresolved. Numerous species designations have been suggested but not adopted due to a lack of resolution afforded by available data [10]. Even with new molecular genetic markers, we have reached a limitation on resolution from genetic data available to delineate species, subspecies, and populations [11]. To usher this species into the era of genomics, a high-quality reference genome is essential. It provides structure to catalogue diversity within and between species at the whole-genome level. In addition, the parallel molecular trajectory between dolphin and other mammalian species [12] makes the bottlenose dolphin a useful model to understand aspects of human health such as metabolic processes/diabetes [13][14][15], proteomics [16,17], and aging [18].
With the collection of data from multiple sources including Linked-Reads (Chromium, 10x Genomics; [21]), mate pairs (MP), long insert paired ends, and standard paired ends, and using DeNovoMAGIC assembly tool (NRGene), we provide an improved haploid reference-quality dolphin genome assembly as well as the first haplotype phased diploid assembly. We refer to our unphased assembly as Tur tru Illumina hap v1 and to the phased assembly as Tur tru Illumina phased v1. Using Tur tru v1 for comparison, our assembly shows increased contiguity and completeness with high consistency to the MP data and orthologous mammalian protein alignments. Additionally, by aligning Tur tru Illumina hap v1 to the human reference genome, we illustrate the synteny of the dolphin scaffolds to human chromosome 1 [22,23].

Coverage
We generated sequence data for a total coverage of approximately 450X, the majority from PCR Free and Chromium 10X Genomics Linked-Read libraries (Table 1). Coverage was computed using a 2.4 Gbp estimated genome size. Genome assembly was conducted using DeNovoMAGIC software (NRGene). More detail about the library preparation and the assembly process are found in the Methods section.

Haploid and diploid assemblies
Here, we report on two assemblies, one traditional haploid consensus assembly Tur tru Illumina hap v1 that represents a mosaic of the maternal and paternal haplotypes and the other haplotype-phased (i.e., diploid) assembly where each scaffold represents sequence corresponding to a single haplotype, Tur tru Illumina phased v1. The quantitative statistics for both assemblies are listed in Table 2. The phased or diploid genome assembly, which was made possible using Illumina sequencing data by leveraging the combination of library prep methods including Linked-Reads, is a significant advance and will provide the community with a powerful genomic tool for the downstream analysis in the context of the true heterozygous dolphin genome.

Genome assembly comparison
Both assemblies were compared to the best available assembly Tur tru v1 (NCBI accession GCA 001922835.1; [16]). We did not use the Ttru 1.4 assembly (NCBI accession GCA 000151865.3) because the contiguity statistics of the Ttru 1.4 are vastly inferior to the Tur tru v1 with a contig N50 3 times smaller than Tur tru v1 and scaffold N50 over 200 times smaller.
The statistics for the Tur tru Illumina hap v1 assembly show bigger scaffolds but slightly smaller contigs, with about 13% more sequence in the scaffolds compared to Tur tru v1 (Table 2). More sequence does not necessarily make for a better assembly considering that the extra sequence may be duplicated haplotypes or contaminants that do not belong to the original organism. To characterize the extra sequence, we first aligned the Tur tru Illumina hap v1 to the Tur tru v1 assembly using the Nucmer aligner, which is part of MUM-mer4 package [24]. We used default settings for generating the alignments. We then analyzed the alignments using the dnadiff package included with MUMmer4. We found that 87.5% of Tur   The total sequence listed excludes Ns (ambiguous nucleotides). Ns were also squeezed out from the scaffolds for N50 computations. We used genome size of 2,383,130,043 bp, equal to the total amount of sequence in the scaffolds of the bigger haploid assembly, for comparison of the N50 contig and scaffold sizes between the two assemblies. The Tur tru Illumina hap v1 and Tur tru v1 assemblies have comparable scaffold N50 sizes, and Tur tru v1 has bigger contigs. The Tur tru Illumina hap v1 assembly has more sequence and our Benchmarking Universal Single-Copy Orthologs (BUSCO) analysis (  The table shows that the Tur tru Illumina hap v1 assembly is more complete, with 110 fewer missing single-copy orthologs compared to the Tru tru v1 assembly. The Tur tru Illumina hap v1 assembly has 43 extra duplicated orthologs, which possibly points to incomplete filtering of redundant haplotypes. While the Tur tru v1 assembly has bigger contigs, the Tur tru Illumina hap v1 assembly has many fewer fragmented BUSCOs. The haplotype-resolved Tur tru Illumina phased v1 assembly is less contiguous and less complete. As expected, more than half of the complete BUSCOs are duplicated, corresponding to the two resolved haplotypes. were 105 BUSCOS missing from both assemblies. We examined the locations of the 165 BUSCOs that are only found in the Tur tru Illumina hap v1, and all of them fully or partially aligned to locations in the sequences that were missing in Tur tru v1 assembly. Figure 1 shows the Venn diagram of BUSCOs aligned to both assemblies, showing that there are 165 BUSCOS that are only present in Tur tru Illumina hap v1 and 55 that are only present in Tur tru v1, with 3,779 present in both assemblies. The haplotype-resolved assembly is more fragmented, and it is missing 266 BUSCOs. As expected, most of the complete BUSCOs that were found (3537) are duplicated 2227, since they are found in different haplotypes.

Assembly validation through MP consistency
Since both Tur tru v1 and Tur tru Illumina hap v1 reference the same species, we expect few rearrangements between the assemblies. To examine this, we compared the absolute and relative correctness of the scaffolds of Tur tru Illumina hap v1 assembly by aligning the Illumina data from the 5-7 Kbp MP library to the scaffolds of Tur tru Illumina hap v1, Tur tru Illumina phased v1, and Tur tru v1 assemblies using the Bowtie2 (Bowtie 2, RRID:SCR 016368) tool [25]. We chose this library because it contained the largest number of valid 5-7 Kbp MPs. We then used only high-quality uniquely aligning mated reads (both mates had to align uniquely with quality score 42 in the SAM file) and classified the alignments of the MPs into the following categories (Table 4): 1. Same scaffold happy-number of MPs where both mates aligned to the same scaffold in the correct orientation with mate separation within 3 standard deviations of the library mean. 2. Same scaffold short-number of MPs where both mates aligned to the same scaffold in the opposite orientation with mate separation of less than 1,000 bp; these MPs are not in-  The alignments were done with Bowtie2. Only the reads that mapped uniquely were used for this computation, thus the number of MPs uniquely mapping to haplotype resolved assembly is much smaller. Same scaffold means that both mates mapped to the same scaffold; happy mates aligned in the correct orientation with mate distance within 5 standard deviations from the mean; misoriented mates aligned in the wrong orientation; long mates aligned with the distance between the mates exceeding 5 standard deviations; short mates aligned with the distance of less than 1,000 bp. Same scaffold mate ALL is the total number of all MPs where both mates aligned to the same scaffold.
dicative of scaffolding misassemblies, they are simply a byproduct of the MP library preparation process as they are MPs that are missing the circularization junction site between the mates. 3. Same scaffold long-number of MPs where both mates aligned to the same scaffold in the correct orientation, but the mate separation exceeded 3 standard deviations of the library mean. 4. Same scaffold misoriented-number of MPs where both mates aligned to the same scaffold in the opposite orientation with mate separation of more than 1,000 bp. 5. Mates aligned to different scaffolds-number of MPs where the two mates aligned to different scaffolds. 6. Only one mate in the pair aligned-number of MPs where only one read aligned to the assembly.
The "Same scaffold mate ALL" category in Table 4 is the sum of all mates in categories 1 to 4 (happy, short, long, and misoriented); it is listed for completeness.
Comparing Tur tru Illumina hap v1 with Tur tru v1, the total number of reads uniquely aligning to both "haploid" assemblies is similar; 295.2 M reads aligned to Tur tru Illumina hap v1 vs 293.6 M reads aligned to Tur tru v1. The total number of MPs aligning to the same scaffold is larger for Tur tru Illumina hap v1. Of the MPs aligning to the same scaffold, the number of MPs in the "Same scaffold happy" cate-gory is similar between the two assemblies. The differences that stand out are the much larger (7.3 times more) number of mates that aligned to the same scaffold in the wrong orientation and the much larger (3.3 times more) number of the same scaffold long pairs in Tur tru v1 compared to Tur tru Illumina hap v1 ( Table 4). Of course, some level of discrepancy is expected because the two assemblies represent two different individuals with an unknown level of structural variation between them. However, in concert, the two different categories may also suggest a possibility of a relatively higher number of locally mis-ordered or mis-oriented contigs in the scaffolds of the Tur tru v1 assembly. This may be due to the scaffolding process used to create the Tur tru v1 assembly. The assembly was created with the HiRise assembler [26] using proximity ligation Hi-C data for scaffolding. The proximity ligation data provide MPs of all possible sizes; however, the MP distances and mate orientations are unknown. Since there are more shorter pairs than longer pairs due to the 3D structure of the DNA, it is much more likely to ligate parts of DNA that are closer to each other than the ones that are far apart. This property enables one to use these data for scaffolding. By mapping the pairs to the assembled scaffolds, one can measure how the distance between the mates in a pair varies with the number of pairs whose ends map to the same location in the assembly. However, the dependence is weak on the short end, meaning that the number of pairs of about 10 Kbp in length is not much different from the number of links of 12-13 Kbp in length. This frequently results in mis-orientations and shuffling of scaffold positions for contigs or scaffolds that are smaller than 10-20 Kbp in the scaffolding process.
The haplotype phased assembly is much more fragmented compared to both haploid assemblies, resulting in higher relative numbers of MPs mapping to different scaffolds. However, when looking at the "internal" MPs, i.e., where both mates map at least 10 Kb away from the scaffold ends, we see remarkable consistency, with less than 0.5% of the mates mapped to the wrong scaffold (see next section). Since for this analysis we only used mates mapping uniquely to the assembly, and there are two copies of the genome in the assembly, the total number of mapped mates is much lower.

Haplotype resolution
To Illustrate the resolution of the haplotypes in Tur tru Illumina phased v1, we aligned it to Tur tru Illumina hap v1 using the Nucmer tool. In Fig. 2 we show the mummerplot of alignments of the phased assembly to the haploid one. The circles represent contig ends, with lines joining them representing aligned sequence. The color indicates direction of the alignments. We display the alignments to an arbitrarily chosen scaffold314 of the haploid assembly. Only alignments longer than 5 Kb are shown. Figure 2 shows that most of the "haploid" assembly aligns to two phased scaffolds, i.e., for each location on the x-axis there are two corresponding alignments on the y-axis. The regions that are covered by a single haplotype (rather than 2) are most probably homozygous regions of this genome. In cases where the homozygous region is long, it is more difficult to phase its heterozygous ends. Thus, in some cases, the homozygous regions are represented only once in the phased assembly (instead of twice). This is the cause for some of the single copy BUSCOs in the phased assembly. In our experience, this issue is more pronounced in mammalian phased assemblies due to the relatively lower heterozygosity level and the way it is distributed along the genome.
In haplotype phasing, it is easy to phase small regions. For example, a single isolated single-nucleotide polymorphism (SNP) with no haplotype differences within 100 bp in both directions can be trivially phased into two 201 bp (or longer) contigs different by one base in the middle. It gets more difficult for larger contigs/scaffolds, where one must make sure that the contig/scaffold represents single haplotype and not a "mosaic" of haplotypes and that the SNPs and other bigger haplotype differences are correctly "phased." To do that, we mapped the MPs from the 5-7 Kb MP library to all phased scaffolds using Bowtie2 [25] and then examined the "internal" MPs where both reads in each pair mapped to the assembly and one read mapped within 10 Kb away from the ends of the scaffold. This would imply that if haplotype phasing is done properly, the other mate must map to the same scaffold and not its haplotype. If it does not, then it indicates an apparent mis-assembly or failure to phase haplotypes. By measuring the number of "properly" aligned internal mates, where both mates aligned to the same scaffold vs "improper" internal mates where the mates aligned to different scaffolds, one can measure the efficacy of the haplotype phasing. There were 35,697,369 pairs where both mates mapped properly to the same scaffold, while only 169,244 mapped improperly, i.e., to two different scaffolds. The percentage of improperly mapping MPs is only 0.5%, indicating that haplotype resolution was of high quality.

Synteny between human and dolphin
Dolphin is a mammal, and currently the best mammalian reference genome is the human genome. To understand similarities between dolphin and human on the DNA level, we aligned the Tur tru Illumina hap v1 assembly to the primary chromosomes of the current haploid human reference genome GRCh38 [27]. Since human and dolphin are fairly distant species, we did not expect to find long DNA sequence alignments; instead, we were looking for synteny where relatively short DNA fragments of scaffolds align in the same order and orientation between the two assemblies. We used the MUMmer4 package for producing the alignments using the default settings. The alignment mummerplot (Fig. 3) shows a striking synteny between the dolphin assembled scaffolds and human chromosome 1, visible even on the large-scale chromosome plot (Fig. 3a). No largescale synteny to the other human chromosomes can be readily observed. The synteny observation is possible due to large scaffold sizes in Tur tru Illumina hap v1. In Fig. 3b, we show 22 scaffolds that have 50% or more of their sequence in syntenic alignments. The syntenic alignments of these 22 scaffolds span nearly the entire human chromosome 1 sequence. The synteny is not a new finding, it was first identified by Bielec et al [22] and was later extended to many other placental mammals [23]. The Tur tru Illumina hap v1 assembly clearly illustrates and confirms the expected synteny.

Sample collection and DNA extraction
The sample for this study came from a female Atlantic bottlenose dolphin (sample ID 04329), captive born at SeaWorld of Orlando, Florida, from wild male and female Atlantic bottlenose dolphins. The animal was 36 years old at blood collection with a healthy medical history. Blood was collected using PAXgene Blood DNA Tubes (Qiagen). High-molecular-weight genomic DNA was isolated using the MasterPure DNA Purification Kit (Illumina) and subsequently quantified and qualified using Quant-iT dsDNA Kit and E-Gel EX Agarose Gel (ThermoFisher).

Paired end libraries
We generated the 450 bp and 800 bp paired-end (PE) libraries using the TruSeq PCR-free DNA Sample Prep kit (Illumina). The protocol was slightly modified at fragmentation and double-size selection steps by adjusting the DNA shearing protocols (Covaris) and by empirically titrating the ratios of SPRI magnetic beads over DNA to obtain insert sizes around 450 bp and 800 bp. We then evaluated the libraries for insert size and yield using Bioanalyzer (Agilent) and real-time qPCR assay, using Illumina DNA Standards and primer master mix qPCR kit (KAPA Biosystems, Roche), then normalized to 2 nM prior to clustering and sequencing. Both the 450 bp and 800 bp libraries were then denatured and diluted to 8 pM and 12 pM, respectively. The 800 bp PE library was clustered and sequenced on the HiSeq 2000, using the HiSeq Cluster and SBS v4 kits for PE 2 × 160 bp reads (Illumina). The 450 bp PE library was clustered and sequenced on the HiSeq 2500 v2 Rapid Run mode using the HiSeq Rapid Cluster and SBS v2 kits for PE 2 × 250 bp reads.

Mate pair libraries
To maximize sequence diversity and genome coverage, three separate MP libraries were constructed corresponding to 2-5 Kb, Figure 2: An example mummerplot of the alignments of the phased assembly to the "haploid" one, spanning about 15 Mbp of sequence of scaffold314. The circles represent contig ends, with lines joining them representing aligned sequence. The color indicates the direction of the alignment, with red and blue forward and reverse, respectively. We show that for most locations on the x-axis (haploid assembly coordinates) there are two alignments on the y-axis corresponding to the two phased haplotypes. The small number of regions with a single contig aligning represent long homozygous regions of the genome that we were unable to phase. 5-7 Kb, and 7-10 Kb insert sizes using the Nextera MP Library Preparation Kit according to the manufacturer's instructions (Illumina). All three libraries were generated from a single input of 4ug of genomic DNA size-selected on a 0.8% E-gel (Invitrogen). Proper sizing of gel-extracted products was confirmed using the Bioanalyzer High Sensitivity chip (Agilent), and 600 ng was subsequently used as input for circularization. Following library preparation, the Bioanalyzer was used to confirm library quality. Each of the three libraries were quantified by qPCR (KAPA Biosystems Library Quantification Kit, Roche), denatured, and diluted to 200 pM after size-adjustment, according to Bioanalyzer results, and clustered on the cBot according to the manufacturer's instructions (Illumina). Then, 2 × 150 bp of Illumina PE sequencing was performed on the HiSeq 4000 using the HiSeq 3000/4000 Cluster and SBS kits.

i. Chromium library
Genomic DNA quality was assessed by pulsed-field gel electrophoresis to determine suitability for 10x Chromium library preparation (10x Genomics). A total of 1.125 ng of input was used for library preparation according to the manufacturer's instructions without size-selection. Final library concentration was de-termined by qPCR (KAPA Library Quantification Kit, Roche) and size-adjusted according to Bioanalyzer DNA 100 chip (Agilent) results. Next, 2 × 150 bp of Illumina PE sequencing with an 8base index read was performed on the HiSeq 4000 using the HiSeq 3000/4000 Cluster and SBS kits.

Genome assembly
Genome assembly was completed using the DeNovoMAGIC software platform (NRGene). This is a proprietary DeBruijn-graphbased assembler that was used to produce assemblies of several challenging plant genomes such as corn [1] and ancestral wheat Aegilops tauschii [3]. The following outlines design of the assembler and steps of the assembly process.

Reads pre-processing
In the pre-processing step, we first removed PCR duplicate reads and trimmed Illumina adaptor AGATCGGAAGAGC and Nextera linker (for MP library) sequences. We then merged the PE 450 bp 2 × 250 bp overlapping reads with minimal required overlap of 10 bp to create stitched reads using the approach similar to the one implemented in the Flash software [28].

Error correction
We scanned through all merged reads to detect and filter out reads with apparent sequencing errors by examining k-mers (k = 24) in the reads and looking for low abundance k-mers. We have high coverage data (∼450x), with each read yielding 127 (150-24+1) to 227 (250-24+1) k-mers. Thus, average 24-mer coverage is at least 300x. The 24-mers that only appear fewer than 10 times in the set of reads likely contain errors. We did not use the reads that contain these low abundance k-mers for building initial contigs.

Contig assembly
The first step of the assembly consists of building a DeBruijn graph (k-mer = 127 bp) of contigs from all filtered reads. Next, PE and MP reads are used to find reliable paths in the graph between contigs for repeat resolving and contig extension.The 10x barcoded reads were mapped to contigs to ensure that adjacent contigs were connected only when there is evidence that those contigs originate from a single stretch of genomic sequence (reads from the same two or more barcodes were mapped to the same contigs).

Split phased/un-phased assembly processes
Two parallel assemblies take place to complete the phased and un-phased assembly result. The phased assembly process utilizes the complete set of contigs. In the un-phased assembly process, the homologous contigs are identified and one of the homologs is filtered out, leaving a subset of the homozygous and one of the homologous contigs in heterozygous regions. The linking information of both homologous contigs is kept through the assembly process of the un-phased assembly, usually enabling longer un-phased scaffolds.

Scaffolding
All of the following steps are done in parallel for both the phased and un-phased assemblies. Contigs were linked into scaffolds with PE and MP information, estimating gaps between the con-tigs according to the distance of PE and MP links. In addition, for the phased assembly, 10x data were used to validate and support correct phasing during scaffolding.

Gap filling
A final gap fill step used PE and MP links and DeBruijn graph information to locally construct a unique path through the graph connecting the gap edges. The path was used to close the gap if it was unique and its length was consistent with the gap size estimate.

Scaffold split/merge
We used 10x barcoded reads to refine and merge scaffolds. All barcoded 10x reads were mapped to the assembled scaffolds. Clusters of reads with the same barcode mapped to adjacent contigs in the scaffolds were identified to be part of a single long molecule. Next, each scaffold was scanned with a 20 kb length window to ensure that the number of distinct clusters that cover the entire window (indicating a support for this 20 kb connection by several long molecules) is statistically significant with respect to the number of clusters that span the left and the right edge of the window. If there was a statistically significant disagreement in the coverage by the clusters over the window, we broke the scaffold at the two edges of the window. Finally, the barcodes that were mapped to the scaffold edges (first and last 20 kb sequences) were compared to generate a graph of scaffolds. The scaffolds are nodes and the edges are links connecting nodes with more than two common barcodes on the ends. We broke the links to the nodes that had more than two links and output the resulting linear paths in the scaffold graph as final scaffolds.

Summary
We show that Tur tru Illumina hap v1 is more complete and more accurate compared to the current best reference Tur tru v1, based on the amount and composition of sequence, the con-sistency of the MP alignments to the assembled scaffolds, and on the analysis of conserved single-copy mammalian orthologs. The additional 12.5% of sequence data identified and assembled here was found to contain 165 additional BUSCO alignments as compared to the latest published assembly Tur tru v1. The large scaffolds represented by Tur tru Illumina hap v1 enabled and confirmed expected synteny to human chromosome 1. The phased de novo assembly Tur tru Illumina phased v1 is of the first publicly available, and it provides the community with novel ways to explore the heterozygous nature of the dolphin genome. These findings illustrate the impact of improved sample preparation and improved de novo assembly methods on progress toward more complete and accurate reference quality genomes. Better-quality assemblies will improve our understanding of gene structure, function, and evolution in mammalian species.

Availability of supporting data
The dolphin assembly Tur tru Illumina hap v1 has been deposited at NCBI under BioProject PRJNA476133, accession QMGA00000000. The dolphin assembly Tur tru Illumina phased v1 has been deposited at NCBI under BioProject PRJNA478376, accession QUXD00000000. All data are also available from the GigaScience GigaDB repository [29].