Leveraging evolutionary relationships to improve Anopheles genome assemblies

While new sequencing technologies have lowered financial barriers to whole genome sequencing, resulting assemblies are often fragmented and far from ‘finished’. Subsequent improvements towards chromosomal-level status can be achieved by both experimental and computational approaches. Requiring only annotated assemblies and gene orthology data, comparative genomics approaches that aim to capture evolutionary signals to predict scaffold neighbours (adjacencies) offer potentially substantive improvements without the costs associated with experimental scaffolding or re-sequencing. We leverage the combined detection power of three such gene synteny-based methods applied to 21 Anopheles mosquito assemblies with variable contiguity levels to produce consensus sets of scaffold adjacency predictions. Three complementary validations were performed on subsets of assemblies with additional supporting data: six with physical mapping data; 13 with paired-end RNA sequencing (RNAseq) data; and three with new assemblies based on re-scaffolding or incorporating Pacific Biosciences (PacBio) sequencing data. Improved assemblies were built by integrating the consensus adjacency predictions with supporting experimental data, resulting in 20 new reference assemblies with improved contiguities. Combined with physical mapping data for six anophelines, chromosomal positioning of scaffolds improved assembly anchoring by 47% for A. funestus and 38% A. stephensi. Reconciling an A. funestus PacBio assembly with synteny-based and RNAseq-based adjacencies and physical mapping data resulted in a new 81.5% chromosomally mapped reference assembly and cytogenetic photomap. While complementary experimental data are clearly key to achieving high-quality chromosomal-level assemblies, our assessments and validations of gene synteny-based computational methods highlight the utility of applying comparative genomics approaches to improve community genomic resources.


Introduction
Reduced costs of new sequencing technologies have facilitated the rapid growth of draft genome assemblies from all kingdoms of life. Nevertheless, the often painstaking process of progressing from draft status to that of a 'finished' reference genome-a near-complete and near-contiguous chromosomal-level assembly-remains the exclusive accomplishment of relatively few species.
Chromosomal ordering and orienting of contigs or scaffolds may be achieved by experimental approaches including fluorescence in situ hybridization (FISH) (Bauman et al. 1980), genetic linkage mapping (Hahn et al. 2014;Fierst 2015), optical (restriction site) mapping (Levy-Sakin and Ebenstein 2013), or analysis of chromatin interaction frequency data (Kaplan and Dekker 2013;Burton et al. 2013).
While many research applications do not strictly require such high-quality assemblies, improvements in completeness, contiguity, and chromosomal anchoring can substantially add to the power and breadth of biological and evolutionary inferences from comparative genomics or population genetics analyses. For example, extensive contiguity and chromosomal-level anchoring are clearly important when addressing questions concerning karyotype evolution or smaller-scale inversions and translocations, re-sequencing analyses of population-level samples, reconstructing rearrangement-based phylogenies, identifying and characterising genes that localise within quantitative trait loci (QTLs), examining genomic sexual conflicts, or tracing drivers of speciation. In many such studies, assembly improvements were critical to enable more robust analyses, e.g. QTL analysis with rape mustard flowering-time phenotypes (Markelz et al. 2017); contrasting genomic patterns of diversity between barley cultivars (Mascher et al. 2017); defining rearrangements of the typical avian karyotype (Damas et al. 2017); detecting chromosome fusion events during butterfly evolution (Davey et al. 2016); characterising the ancestral lepidopteran karyotype (Ahola et al. 2014); identifying the chromosomal position and structure of the male determining locus in Ae. aegypti (Matthews et al. 2017); and characterising a melon fly genetic sexing strain as well as localising the sexing trait (Sim and Geib 2017).
Available genome assemblies for anopheline mosquitoes vary considerably in contiguity and levels of chromosomal anchoring. Sequencing the first mosquito genome produced an assembly for the A. gambiae PEST strain with 8'987 scaffolds spanning 278 megabasepairs (Mbp), where 303 scaffolds spanned 91% of the assembly and physical mapping assigned 84% of the genome to chromosomal arms (Holt et al. 2002). Additional FISH mapping and orienting of 28 scaffolds and bioinformatics analyses later facilitated . CC-BY 4.0 International license was not certified by peer review) is the author/funder. It is made available under a The copyright holder for this preprint (which this version posted November 6, 2018. . https://doi.org/10.1101/434670 doi: bioRxiv preprint an assembly update by removing haplotype scaffolds and bacterial sequences and anchoring a third of previously unmapped scaffolds to chromosomes (Sharakhova et al. 2007). Since then, more than 20 new assemblies have been built for the anophelines, several with mapping efforts that enabled at least partial chromosomal anchoring. Sequencing of the A. gambiae Pimperena S form and A. coluzzii (formerly A. gambiae M form) produced assemblies with 13'050 and 10'525 scaffolds, respectively, with 89% of each of these assemblies alignable to the closely related PEST assembly (Lawniczak et al. 2010). The much smaller 174 Mbp assembly of the more distantly related neotropical vector, A. darlingi, comprised 8'233 scaffolds, but they remained unanchored (Marinotti et al. 2013). Physical mapping assigned 62% of the 221 Mbp A. stephensi Indian strain assembly (23'371 scaffolds) (Jiang et al. 2014) and 36% of the A. sinensis Chinese strain assembly (9'597 scaffolds) (Zhou et al. 2014;Wei et al. 2017) to polytene chromosomes.
For a genus such as Anopheles with already more than 20 genome assemblies available (Ruzzante et al. 2018), contiguity can be improved by leveraging information from cross-species comparisons to exploit patterns of conservation and identify potential scaffold adjacencies. While genome rearrangements can and do occur, multiple homologous genomic regions with conserved orders and orientations, i.e. regions with maintained synteny, offer an evolutionarily guided approach for assembly improvement. Specifically, employing orthologous genes as conserved markers allows for the delineation of maintained syntenic blocks that provide support for putative scaffold adjacencies. Here we present results from applying three computational approaches, ADSEQ (Anselmetti et al. 2018), GOS-ASM (Aganezov and Alekseyev 2016), and ORTHOSTITCH (this study), to assess the performance of evolutionarily guided assembly improvements of multiple anopheline genomes. Consensus predictions offer well-supported sets of scaffold adjacencies that lead to the improved contiguity of draft assemblies without the associated costs or time-investments required for experimental support. Validations of these predictions exploiting experimental data for subsets of the anophelines supported many adjacencies and highlighted the complementarity of experimental and computational approaches. Thus, whether employed as supporting data for experimentally based assembly improvement approaches, as complementary data to further enhance such improvements, or as stand-alone evidence as part of an assembly building pipeline, these evolutionarily guided methods offer a handy new set of utensils in any genome assembly toolbox. These comparative genomics approaches will help to propel the draft assemblies from similar species-clusters along the journey towards becoming 'finished' reference genomes.
. CC-BY 4.0 International license was not certified by peer review) is the author/funder. It is made available under a The copyright holder for this preprint (which this version posted November 6, 2018. . https://doi.org/10.1101/434670 doi: bioRxiv preprint

Synteny-improved contiguities of Anopheles genome assemblies
The paucity of supporting experimental data for many species means that improving the contiguity of current draft assemblies must often rely solely on comparative genomics approaches. The anophelines, as a genus where numerous assemblies are available and a few are already highly contiguous, present an ideal opportunity to apply and assess the performance of such approaches. Using orthologues delineated across 21 anopheline gene sets (Supplemental Table S1) and combining the results from three synteny-  Tables S2, S3), two-way consensus sets of well-supported predicted scaffold adjacencies resulted in substantial improvements for several assemblies (Fig. 1). The two-way consensus adjacencies were required to be predicted by at least two of the approaches with no third-method conflicts (see Methods). Improvements were quantified in terms of the absolute (Fig. 1A) and relative (Fig. 1B) increases in scaffold N50 values (a median-like metric where half the genome is assembled into scaffolds of length N50 or longer) and decreases in scaffold counts, considering only scaffolds with annotated orthologous genes used as input data for the scaffold adjacency predictions.
The greatest absolute increases in scaffold N50 values were achieved for A. dirus and A. minimus, while the greatest absolute reductions in scaffold counts were achieved for A. christyi, A. culicifacies, A. maculatus, and A. melas (Fig. 1A), reflecting the different levels of contiguity of their input assemblies. Reductions in the numbers of scaffolds that comprise each assembly varied from 1'890 fewer for the rather fragmented A. melas assembly to just one fewer for the already relatively contiguous A. albimanus assembly. Even without large reductions in the numbers of scaffolds, when a few adjacencies bring together relatively long scaffolds then they can lead to marked improvements in N50 values. For example, A. dirus and A. minimus improved with N50 increases of 5.1 Mbp and 4.8 Mbp and only 36 and 12 fewer scaffolds, respectively.
The general trend indicates that reducing the number of scaffolds by about a third leads to a doubling of the N50 value (Fig. 1B). Exemplifying this trend, A. epiroticus showed the greatest relative reduction in the number of scaffolds (40%) and achieved a 2.1-fold N50 increase. Notable exceptions include A.
farauti, which showed a 1.4-fold N50 increase with a 30% reduction in the number of scaffolds, while A. dirus and A. stephensi (Indian) achieved 1.66-fold and 2.08-fold N50 increases with only 14% and 19% reductions in the number of scaffolds, respectively. Using only three-way consensus adjacencies led to . CC-BY 4.0 International license was not certified by peer review) is the author/funder. It is made available under a The copyright holder for this preprint (which this version posted November 6, 2018. . https://doi.org/10.1101/434670 doi: bioRxiv preprint more conservative improvements, while employing a liberal union of all non-conflicting adjacencies resulted in a trend of a ~30% scaffold reduction to double the N50 value (Supplemental Figs. S2, S3).
The enhanced contiguities of these anopheline assemblies based on predicted scaffold adjacencies demonstrate that while the results clearly depend on the quality of the input assemblies, applying syntenybased approaches can achieve substantial improvements.
. CC-BY 4.0 International license was not certified by peer review) is the author/funder. It is made available under a The copyright holder for this preprint (which this version posted November 6, 2018. . https://doi.org/10.1101/434670 doi: bioRxiv preprint Figure 1. Improved genome assemblies for 20 anophelines from synteny-based scaffold adjacency predictions. Results from ADSEQ, GOS-ASM, and ORTHOSTITCH predictions were compared to define two-way consensus adjacencies predicted by at least two of the three approaches, where the third approach did not conflict. These adjacencies were used to build new assemblies with improved contiguities, quantified by comparing before and after scaffold counts and N50 values (half the total assembly length comprises scaffolds of length N50 or longer).
The counts, values, and ratios represent only scaffolds with annotated orthologous genes used as the input dataset for the scaffold adjacency predictions. (A) Scaffold counts (blues, bottom axis) and N50 values (red/orange, top axis) are shown before (dots) and after (arrowheads) synteny-based improvements were applied. The 20 anopheline assemblies are ordered from the greatest N50 improvement at the top for Anopheles dirus to the smallest at the bottom for Anopheles albimanus. Note axis scale changes for improved visibility after N50 of 5 Mbp and scaffold count of 6'000. (B) Plotting before to after ratios of scaffold counts versus N50 values (counts or N50 after / counts or N50 before superscaffolding of the adjacencies) reveals a general trend of a ~33% reduction in scaffold numbers resulting in a ~2-fold increase of N50 values. The line shows the linear regression with a 95% confidence interval in grey. Results for two strains are shown for Anopheles sinensis, SINENSIS and Chinese (C), and Anopheles stephensi, SDA-500 and Indian (I).

Consensus adjacencies from complementary synteny-based methods
Although each of the computational methods aims to predict scaffold adjacencies based on gene collinearity, they differ in some of their underlying assumptions and in their implementations that identify, score, and infer the most likely scaffold neighbours. ADSEQ uses reconciled gene trees to reconstruct ancestral genomes and to delineate extant gene adjacencies in a duplication-aware parsimonious evolutionary scenario of adjacency gains and breaks that also identifies extant adjacencies between genes at scaffold extremities (Anselmetti et al. 2018). GOS-ASM employs the concept of the breakpoint graph and utilizes the topology of the species phylogeny to perform evolutionary rearrangement analysis of orthologous genes from multiple genomes from which scaffold adjacencies can then be inferred (Aganezov and Alekseyev 2016). ORTHOSTITCH, a new approach developed as part of this study, interrogates gene orthology data from cross-species comparisons to evaluate synteny evidence that supports putative scaffold adjacencies (see Methods). Similar to traditional meta-assembly-like methods that leverage such differences to identify well-supported consensus predictions, we compared the results of all scaffold adjacencies predicted by each method using the Comparative Analysis and Merging of Scaffold Assemblies (CAMSA) tool (Aganezov and Alekseyev 2017)  For the full set of 20 assemblies, GOS-ASM and ORTHOSTITCH predicted about 10'000 oriented adjacencies each, with just over twice as many predictions from ADSEQ. Comparing all predictions identified almost 30'000 distinct scaffold adjacencies, 36% of which were supported by at least two methods; this fraction comprised 10% that were in three-way agreement and a further 20% that were in two-way agreement with no conflicts with the third method ( Fig. 2; Supplemental Fig. S4). The larger total number of predictions from ADSEQ resulted in much higher proportions of unique adjacencies ( Fig. 2). Adjacencies in three-way agreement constituted 30% of GOS-ASM and 27% of ORTHOSTITCH predictions, and just 13% of the much more numerous ADSEQ predictions. In pairwise comparisons, ADSEQ supported almost two-thirds of each of the other prediction sets, while about a third of ADSEQ and GOS-ASM adjacencies agreed with ORTHOSTITCH, and slightly less than a third of ADSEQ and ORTHOSTITCH predictions were supported by GOS-ASM.
From the liberal union sets of all non-conflicting adjacencies for all assemblies, the adjacencies in threeway agreement made up 17% of the total, 46% of GOS-ASM, 39% of ORTHOSTITCH, and 19% of ADSEQ predictions (Fig. 2B). Considering only the supported predictions that were used to build the two-way consensus sets of adjacencies for the synteny-based assembly improvements presented in Fig. 1, i.e. excluding adjacencies predicted by only one method, the three-way consensus adjacencies made up 33% of the total, 54% of GOS-ASM, 44% of ORTHOSTITCH, and 33% of ADSEQ predictions (Fig. 2B). A third of these two-way supported consensus adjacencies that were employed to build the new superscaffolded assemblies were predicted by all three methods, with 98% supported by ADSEQ, 74% by ORTHOSTITCH, and 61% by GOS-ASM. Thus, comparing the results from the three methods and employing a two-way agreement with no third-method conflict filter resulted in an improved level of three-way adjacency agreements from a tenth to a third.
For the individual assemblies, more than half of the distinct scaffold adjacencies were in agreement for A. epiroticus, A. merus, and both the A. stephensi assemblies, with A. funestus achieving the highest consistency at 58% ( Fig. 2C; Supplemental Fig. S5). Some of the most fragmented input assemblies produced some of the largest sets of distinct adjacency predictions but the agreement amongst these predictions was generally lower than the other assemblies. For example, A. maculatus was the least contiguous input assembly and produced more than 8'000 distinct predictions, of which only 18% showed at least two-way agreement with no conflicts ( Fig. 2C; Supplemental Fig. S5).
. CC-BY 4.0 International license was not certified by peer review) is the author/funder. It is made available under a The copyright holder for this preprint (which this version posted November 6, 2018. . https://doi.org/10.1101/434670 doi: bioRxiv preprint four with more than 50% agreement (top row), and four with lower levels of agreement (bottom row). Colours for each fraction are the same as in panel A, y-axes vary for each assembly with maxima of 120 for Anopheles coluzzii to 5'000 for Anopheles maculatus. Results for Anopheles stephensi are for the SDA-500 strain.
. CC-BY 4.0 International license was not certified by peer review) is the author/funder. It is made available under a The copyright holder for this preprint (which this version posted November 6, 2018. . https://doi.org/10.1101/434670 doi: bioRxiv preprint Validated adjacencies with physical mapping and RNA sequencing data Physical mapping data generated from a subset of the anophelines considered here allowed for independent, quantitative validations of the synteny-based predictions and their consensus sets. Building cytogenetic photomaps and conducting extensive FISH experiments mapped 31 A. albimanus scaffolds (Artemov et al. 2017), 46 A. atroparvus scaffolds Neafsey et al. 2015;Artemov et al. 2018), 204 A. funestus scaffolds (Sharakhov et al. 2002(Sharakhov et al. , 2004Xia et al. 2010;Neafsey et al. 2015 The scaffold adjacencies identified from these physical mapping data, i.e. pairs of neighbouring mapped scaffolds, were compared with adjacencies predicted by each of the three methods and the CAMSA-generated two-way consensus sets, as well as the conservative three-way consensus sets and the liberal union sets of all non-conflicting adjacencies (Supplemental Table S6). With 85 physically mapped scaffold adjacencies, A. funestus validations confirmed 12-17% of the different sets of synteny-based adjacencies and highlighted conflicts with just 4-8% (Fig. 3A). Five of the 15 two-way consensus synteny-based predictions were confirmed by physical mapping of A. atroparvus scaffolds and only one conflict was identified (Fig. 3A). Examining the identified conflicts in detail revealed that most were resolvable. As not all scaffolds were targeted for physical mapping, neighbouring scaffolds on the physical maps could have shorter unmapped scaffolds between them that were identified by the synteny-based approaches. For A. funestus, five conflicts were resolved because the synteny-based neighbour was short and not used for physical mapping and an additional four conflicts were resolved by switching the orientation of physically mapped scaffolds, which were anchored by only a single FISH probe and therefore their orientations had not been confidently determined.
. CC-BY 4.0 International license was not certified by peer review) is the author/funder. It is made available under a The copyright holder for this preprint (which this version posted November 6, 2018. . https://doi.org/10.1101/434670 doi: bioRxiv preprint Transcriptome data from RNAseq experiments can provide additional information about putative scaffold adjacencies when individual transcripts (or paired-end reads) reliably map to scaffold extremities.
The Annotated Genome Optimization Using Transcriptome Information (AGOUTI) tool (Zhang et al. 2016) employs RNAseq data to identify such adjacencies as well as correcting any fragmented gene models at the ends of scaffolds. Using available paired-end RNAseq mapping data from VECTORBASE . CC-BY 4.0 International license was not certified by peer review) is the author/funder. It is made available under a The copyright holder for this preprint (which this version posted November 6, 2018. . https://doi.org/10.1101/434670 doi: bioRxiv preprint Supplemental Table S7). These AGOUTI-based scaffold adjacencies were compared with the adjacencies predicted by each of the three methods and the CAMSA-generated two-way consensus sets, as well as the conservative three-way consensus sets and the liberal union sets of all non-conflicting adjacencies ( Fig. 3B; Supplemental Table S8). Across all 13 assemblies, 18% of AGOUTI-based scaffold adjacencies supported the two-way consensus synteny-based adjacencies, 75% were unique to the AGOUTI sets, and only 7% were in conflict. Nearly 200 AGOUTI-based scaffold adjacencies for A. stephensi (Indian) confirmed only eight and conflicted with 14 of the two-way consensus set adjacencies (Fig. 3B).
In contrast, about half as many AGOUTI-based scaffold adjacencies each for A. stephensi (SDA-500) and A. funestus confirmed four to five times as many two-way consensus set adjacencies and conflicted with only five and six, respectively. Notably, 68% of the AGOUTI-based scaffold adjacencies that produced conflicts with the two-way consensus set adjacencies comprised scaffolds with no annotated orthologues.
These cases can be resolved by noting that only scaffolds with orthologous genes were used for syntenybased predictions: therefore, the inferred neighbouring scaffolds could have shorter un-annotated scaffolds between them that were identified by AGOUTI. Such un-annotated scaffolds were also numerous amongst the adjacencies that were unique to AGOUTI where for 66% either one or both scaffolds had no annotated orthologues.

Validated adjacencies with new genome assemblies
A new A. funestus assembly, designated AfunIP, was generated as part of this study by merging approximately 70X of PacBio sequencing data with the reference assembly (AfunF1) Table S9). This AfunIP genome assembly for A.
funestus enabled the validation of the scaffold adjacency predictions for the AfunF1 assembly by examining collinearity between the two assemblies. AfunF1 scaffolds were ordered and oriented based on their alignments to AfunIP scaffolds and the resulting 321 alignment-based scaffold adjacencies were then compared with the synteny-based and AGOUTI predictions as well as with the physical mapping adjacencies to identify supported, unique, and conflicting adjacencies ( Fig. 4; Supplemental Fig. S8; Supplemental Table S10). Each of the three synteny method prediction sets, as well as the two-way consensus and liberal union sets, had 14-17.5% in common with the alignment-based scaffold adjacencies, fewer than a quarter in conflict, and almost two thirds that were neither supported nor in conflict (Supplemental Table S10). The physical mapping adjacencies had generally more support, but also more conflicts as about half disagreed with the alignment-based adjacencies. Several disagreements were easily resolved by comparing these conflicts with those identified from the synteny-based . CC-BY 4.0 International license was not certified by peer review) is the author/funder. It is made available under a The copyright holder for this preprint (which this version posted November 6, 2018. . https://doi.org/10.1101/434670 doi: bioRxiv preprint adjacencies (those shown in Fig. 3A) and confirming that switching the orientation of physically mapped scaffolds corrected the relative placements of these scaffolds, e.g. Fig. 4

inset (i).
Similarly to the validations with the physical mapping and RNAseq data presented above, apparent conflicts with the alignment-based adjacencies can also arise because using genome alignment data considered all alignable scaffolds while physical mapping targeted only large scaffolds and synteny methods did not consider scaffolds with no annotated orthologues (i.e. short scaffolds). This is exemplified in Fig. 4

inset (ii)
where the alignment data placed a short scaffold between two scaffolds predicted to be neighbours by ADSEQ, ORTHOSTITCH, and physical mapping data. Skipping such short scaffolds (<5 Kbp) to define a smaller set of alignment-based adjacencies considering only the longer scaffolds resulted in increased support for the synteny-based sets of 19-23%, and most notably up to 39% for the physical mapping adjacencies, while only marginally increasing support for AGOUTI predictions from 15% to 17% (Supplemental Table S10). Re-scaffolding of the initial A. farauti (AfarF1) and A. merus (AmerM1) assemblies employed large-insert 'Fosill' sequencing libraries and reduced the numbers of scaffolds from 550 to 310 and 2'753 to 2'027 and increased N50 values from 1'197 Kbp to 12'895 Kbp and 342 Kbp to 1'490 Kbp, respectively (Neafsey et al. 2015). The availability of these re-scaffolded assemblies enabled the validation of the synteny-based and AGOUTI-based scaffold adjacency predictions for the AfarF1 and AmerM1 assemblies by examining corresponding scaffolds from the AfarF2 and AmerM2 assemblies (see Methods; Supplementary Online Material; Supplemental Fig. S9). The comparisons identified full support for the majority (87% and 82%) of the two-way synteny consensus set adjacencies and unresolvable conflicts for just 5% and 10%, while the AGOUTI-based adjacencies achieved similarly high levels of full support (81% and 67%), but with slightly greater proportions of conflicts (Supplemental Table S11).

New Anopheles funestus cytogenetic photomap and physical genome map
The collated data for A. funestus allowed for a comprehensive update of the previously published chromosomal photomap from ovarian nurse cells (Sharakhov et al. 2002). The existing images of polytene chromosomes of the five arms common to all anophelines (X, 2R, 2L, 3R, and 3L) were further straightened to facilitate linear placements of the genomic scaffolds on the photomap. Major structural updates for the cytogenetic map included reversal of the order of divisions and subdivisions within the 3La inversion to follow the standard 3L+ a arrangement, and merging of two small subdivisions with larger neighbouring subdivisions: 5D to 6 and 34D to 34C (Fig. 5). The extensive additional physical mapping performed for A. funestus, together with the new AfunIP assembly and sequence alignmentbased comparisons with the AfunF1 assembly, enabled a physical genome map to be built (Fig. 5). The 126 previously FISH-mapped (Sharakhov et al. 2002(Sharakhov et al. , 2004Xia et al. 2010) and 66 newly FISH-mapped DNA markers (Supplemental Fig. S6) were located with BLAST searches to 139 AfunF1 scaffolds and then compared with AfunIP scaffolds using whole genome pairwise alignments (see Methods; Supplementary Online Material). The placement of scaffolds along the photomap took advantage of comparisons with the synteny-based scaffold adjacency predictions and with the AfunF1-AfunIP whole genome pairwise alignments. Synteny-or alignment-based scaffold neighbours were added to the genome map when they were short and thus had not been used for physical mapping. Additionally, scaffolds which were anchored with only a single FISH probe (i.e. with undetermined orientations) were reoriented when synteny-or alignment-based scaffold adjacencies provided supporting evidence to correct their relative placements on the map. The resulting physical genome map for A. funestus includes 204 AfunF1 scaffolds (Supplemental Table S5), with a further 99 neighbouring scaffolds after incorporating the synteny-based and AGOUTI-based adjacencies.

New reference genome assemblies for 20 anophelines
The consensus sets of synteny-based adjacencies for all species, validated with and complemented by physical mapping and/or RNAseq data for subsets of the anophelines, enabled improved genome assemblies to be generated by building superscaffolds from all scaffold neighbours (Fig. 6). This involved the design and implementation of a reconciliation workflow to integrate the different sets of scaffold adjacencies from synteny, physical mapping, AGOUTI, or alignment data for each assembly (see Fig. S10). The total span of scaffolds that now form part of superscaffolds is more than 150 Mbp for seven assemblies and between 80 and 150

Methods; Supplementary Online Material; Supplemental
Mbps for another seven assemblies, although their actual contiguity levels remain variable (e.g. A. atroparvus in 34 superscaffolds and A. funestus in 114 superscaffolds). The greatest reductions in the total numbers of scaffolds were achieved for A. christyi, A, culicifacies, A. maculatus, and A. melas, with the largest improvements in scaffold N50 values observed for A. atroparvus, A, dirus, and A. minimus (Table 1). Given the heterogeneity of the input assemblies the relative changes highlight some of the most dramatic improvements, e.g. the A. funestus and A. stephensi (SDA-500) scaffold counts both dropped by almost 22% and N50s increased 4.0-fold for A. atroparvus, 3.1-fold for A. funestus, and 2.4-fold for A. stephensi (Indian) ( Table 1). The largest relative improvements seen for An. farauti and An. merus are mainly from the additional "Fosill"-based scaffolded version 2 assemblies.
For the six anophelines with physical mapping data, the contributions of the synteny-based and/or AGOUTI-based adjacencies to the numbers and genomic spans of anchored scaffolds are largest for A.
stephensi (SDA-500) and A. funestus, but negligible or low for the recently updated A. albimanus (Artemov et al. 2017), A. atroparvus (Artemov et al. 2018), and A. sinensis (Chinese) (Wei et al. 2017) assemblies (Table 2). For A. albimanus, reconciling the AalbS2 assembly (Artemov et al. 2017) with the AalbS1 . CC-BY 4.0 International license was not certified by peer review) is the author/funder. It is made available under a The copyright holder for this preprint (which this version posted November 6, 2018. . https://doi.org/10.1101/434670 doi: bioRxiv preprint adjacencies showed that the single neighbouring pair from the two-way consensus (and two of the three pairs unique to ORTHOSTITICH) agreed with the physical mapping data and were thus already incorporated into the new assembly, so only two short scaffolds from AGOUTI were added. Reconciling the A. atroparvus AatrE3 assembly (Artemov et al. 2018) with the AatrE1 adjacencies showed that three AGOUTI-based adjacencies were already incorporated, as were five two-way consensus synteny-based adjacencies, as well as two, one, and one, unique to the ADSEQ, GOS-ASM, and ORTHOSTITCH sets, respectively. These, and the 41 other adjacencies that did not involve any of the 46 physically mapped scaffolds, together led to a reduction of the initial AatrE1 scaffold number by 5.3% and a 4.0-fold increase in scaffold N50 for the new superscaffolded assembly. For the other anophelines with physical mapping data, the two A. stephensi assemblies achieved total assembly anchoring of 79% (improvements of 17% and 38%) and A. funestus more than doubled to reach 82% ( Table 2). These chromosomally-anchored scaffolds and superscaffolds, together with all of the new improved genome assemblies have been submitted to VECTORBASE for processing and incorporation as new reference assemblies for the benefit of the entire research community.

Discussion
Applying synteny-based scaffold adjacency predictors across 21 Anopheles genome assemblies resulted in almost 43'000 predicted adjacencies. The identification of consensus predictions produced supported subsets that were used to build improved assemblies for which the general trend showed that a reduction in the total number of orthologue-bearing scaffolds of about a third could double the scaffold N50 (Fig.   1). Notably, when the scaffolds involved were long, even a handful of adjacencies could greatly increase the N50 value, however, the numerous adjacencies for the rather fragmented input assemblies improved their contiguity but led to only minor improvements in N50 values. For the six assemblies with starting N50 values of between 340 Kbp and 840 Kbp (considering all scaffolds, not only those with orthologues), the average improvement was just less than 400 Kbp, demonstrating what can be achieved using only synteny-based approaches. By way of comparison, the honeybee genome assembly upgrade relied on millions of reads from ~20X SOLiD and ~5X Roche 454 sequencing to improve the scaffold N50 by 638 Kbp from 359 Kbp to 997 Kbp (Elsik et al. 2014). Thus, while the Anopheles results varied considerably depending on the input assemblies, using only synteny-based adjacencies from a combined analysis of the results from three methods achieved substantial contiguity improvements for many assemblies.
ADSEQ predicted about twice as many adjacencies as GOS-ASM and ORTHOSTITCH, providing support for about two thirds of each, while ORTHOSTITCH supported about a third and GOS-ASM supported slightly less than a third of each of the other prediction sets (Fig. 2). Thus, the more conservative GOS-ASM and ORTHOSTITCH results do not show a substantially greater overlap than either of them do with ADSEQ. Only 10% of all distinct scaffold adjacencies were predicted by all three methods, but building the two-way consensus adjacency sets increased this agreement to 33%, where almost all were supported by ADSEQ, nearly three-quarters by ORTHOSTITCH, and three-fifths by GOS-ASM. Consensus building can therefore achieve the goal of identifying a subset of well-supported adjacencies. Considering each of the assemblies individually, these two-way consensus adjacencies made up at least half of the distinct predictions for eight assemblies, with a maximum of 58% for A. funestus. While levels of agreement were generally lower for fragmented assemblies with many predicted adjacencies, some others with many fewer predicted adjacencies also showed below-average agreements. Thus the number of predicted adjacencies is not necessarily a reliable indicator of the level of agreement that can be achieved.
These results highlight the challenge of inferring accurate adjacencies as well as the importance of employing multiple approaches. Synteny block delineation, which then allows for scaffold adjacencies to be predicted, is itself a complex task where results from different anchor-based approaches can vary . CC-BY 4.0 International license was not certified by peer review) is the author/funder. It is made available under a The copyright holder for this preprint (which this version posted November 6, 2018. . https://doi.org/10.1101/434670 doi: bioRxiv preprint considerably (Liu et al. 2018). Several key differences distinguish the three methods applied to the Anopheles assemblies, for example, GOS-ASM employs only single-copy orthologues as anchors so any gene duplications are excluded from the ancestral genome reconstructions, whereas the other two methods do take paralogues into account. Furthermore, both GOS-ASM and ADSEQ are 'phylogenyaware' algorithms as they use the species tree topology, and ADSEQ additionally employs individual gene trees for each orthologous group. In contrast, ORTHOSTITCH does not take the species tree or gene phylogenies into account and instead relies on enumerating levels of support across the dataset to score putative adjacencies. These differences affect the sensitivity and specificity of each method, reflected by the more numerous predictions from ADSEQ that can explore complex gene evolutionary histories within the species tree topology (e.g. identifying adjacencies supported only within sub-clades of the phylogeny), versus the smaller sets of adjacencies from GOS-ASM, which excludes complexities introduced by gene duplications, and ORTHOSTITCH that simplifies the search by not imposing any evolutionary model. So while the consensus approach applied to the predictions across all the anophelines results in reduced sensitivities, it takes advantage of the different underlying assumptions and algorithmic implementations of each method to identify well-supported sets of scaffold adjacencies.
Another factor that may influence the number of predicted adjacencies, the level of agreement amongst different methods, and the resulting contiguity improvements, is the number of scaffolds with annotated orthologues, i.e. scaffolds used as input rather than the total number of scaffolds in an assembly (e.g. for A. stephensi (Indian) there were 23'371 scaffolds but only 660 with orthologues). The quality of the gene annotations themselves also likely plays a role, as fragmented gene models make orthology delineation difficult and hence reduce the ability to identify synteny blocks. Nevertheless, substantial improvements can be achieved even for relatively fragmented assemblies so long as gene models are nonetheless mostly complete. The evolutionary divergence of the set of species, as well as the total number of species, to which these methods are applied would also impact their ability to recover reliable adjacencies, because the complexity of the task of inferring synteny blocks is greatly reduced if the input orthology dataset consists mainly of near-universal single-copy orthologues. As gene duplications and losses accumulate over time the proportion of near-universal single-copy orthologues will shrink, and even amongst those that are maintained translocations and genomic shuffling events will add to the steady erosion of the evolutionary signals on which these methods rely. Shuffling rates may also vary in different lineagese.g. lepidopteran genomes appear to have reduced levels of gene rearrangements (Kanost et al. 2016)so seemingly equally divergent (in terms of time to last common ancestor) sets of species may be differentially amenable to synteny delineation.
The availability of alternative datasets with scaffold adjacency information for subsets of the anophelines provided the opportunity to validate the predictions based solely on synteny inferences. These included . CC-BY 4.0 International license was not certified by peer review) is the author/funder. It is made available under a The copyright holder for this preprint (which this version posted November 6, 2018. . https://doi.org/10.1101/434670 doi: bioRxiv preprint physical mapping data, RNAseq data, a re-assembly incorporating PacBio sequencing data, and two rescaffolded assemblies using extra 'Fosill' sequencing libraries. Although generally few scaffold adjacencies were obtained from the physical mapping data (because only larger scaffolds were selected for mapping they may not be direct neighbours) the comparisons were able to identify support for many syntenybased adjacencies (Fig. 3A). Several conflicts were also identified; however, most of these were due to the fact that the synteny-based neighbour was a short scaffold that had not been targeted for physical mapping and could be positioned between the two much larger physically mapped scaffolds, thus, they are not truly conflicts. Importantly, other conflicts involved only the relative orientation of neighbouring scaffolds and occurred with scaffolds that were anchored with only a single FISH probe and whose orientations had thus not been confidently determined. In these cases the synteny-based adjacencies therefore provided key complementary information and helped to correct the orientations of the physically mapped scaffolds.
Validations with RNAseq-based AGOUTI-predicted adjacencies also provided support for many syntenybased predictions (Fig. 3B). Two-thirds of the adjacencies unique to AGOUTI were between scaffolds where one or both scaffolds had no annotated orthologues. As AGOUTI is not restricted to large scaffolds preferred for physical mapping or scaffolds with annotated orthologues required for synteny-based approaches, it can provide complementary predictions that capture shorter unannotated scaffolds that would otherwise not be recovered. While this would not substantially improve N50 values it is nonetheless important for improving gene annotations as correcting such assembly breaks could allow for more complete gene models to be correctly identified.
The A. funestus PacBio-based AfunIP assembly scaffolds allowed for the alignment-based ordering and orientation of AfunF1 scaffolds for comparisons with the adjacency predictions and physical mapping data (Fig. 4, Fig. 5). These supported up to almost a quarter of A. funestus two-way consensus synteny adjacencies and about 40% of the physical mapping adjacencies. Importantly, most were neither supported nor in conflict, and conflicts generally occurred when the alignment-based adjacencies included short scaffolds that were not considered by the synteny-based or physical mapping approaches, and thus could be resolved. For A. farauti and A. merus, the genome-alignment-based comparisons of their initial assemblies with the re-scaffolded AfarF2 and AmerM2 assemblies provided much higher levels of support for the two-way consensus synteny adjacencies, with very few conflicts. This reflects the radically different approaches between re-scaffolding, where the additional 'Fosill' library data served to build longer scaffolds from the initial scaffolds, versus the re-assembly of A. funestus where PacBio sequencing data were first merged with the initial assembly before re-scaffolding with the original Illumina data to produce the hybrid AfunIP assembly. These comparisons therefore validate many of the . CC-BY 4.0 International license was not certified by peer review) is the author/funder. It is made available under a The copyright holder for this preprint (which this version posted November 6, 2018. . https://doi.org/10.1101 synteny-based adjacency predictions while conceding that short intervening scaffolds may be overlooked due to the limitations of having to rely on scaffolds with annotated orthologues. As modern long-read sequencing technologies are capable of producing more contiguous assemblies than those based mainly on assembling short-reads, it is conceivable that many fragmented draft genomes will be completely superseded by new independently built high-quality reference assemblies. For example, single-molecule sequencing technologies were recently employed to produce assemblies of 15 Drosophila species, 14 of which already had previously reported sequenced genomes (Miller et al. 2018).
Alternatively, re-sequencing to obtain proximity data to use in conjunction with contigs from draft assemblies can also achieve high-quality references to replace the fragmented initial versions, e.g. (Putnam et al. 2016;Dudchenko et al. 2017). Furthermore, although reference-assisted assembly approaches may mask true genomic rearrangements (Liu et al. 2018), high-quality chromosomal-level genomes of very close relatives can be used to improve draft assemblies, often employing alignment-based comparisons such as assisted assembly tools (Gnerre et al. 2009), reference-assisted chromosome assembly (Kim et al. 2013), CHROMOSOMER (Tamazian et al. 2016), or the RAGOUT 2 reference-assisted assembly tool (Kolmogorov et al. 2018). What role then is there for comparative genomics approaches that use evolutionary signals to predict scaffold adjacencies in draft assemblies?
Firstly, while recognising that downward trending costs of many new technologies are making sequencing-based approaches more accessible to even the smallest of research communities, the costs associated with experimental finishing or re-sequencing efforts remain non-trivial and acquired expertise is required for high-quality sample preparation and library building. Furthermore, the disappointing reality is that re-sequencing and re-scaffolding does not always lead to vastly improved assemblies, albeit an anecdotal reality because failures are not reported in the published literature. Secondly, hybrid assembly approaches benefit from the complementarity of the different types of input data that they employ, and our validations and comparisons show that synteny-based adjacencies can further complement the experimental data. In this regard, even if synteny-based results are not directly included in such hybrid approaches, they can nevertheless serve as a benchmark against which to quantify the effectiveness of different combinations of approaches (or different parameters used) and help guide re-assembly procedures towards producing the best possible improved assemblies. Thirdly, our results show how physical mapping datasets can be augmented or even corrected through comparisons with synteny-based scaffold adjacency predictions. Where subsets of scaffolds have already been mapped to chromosomes, adding neighbouring scaffolds from synteny-based predictions can add to the overall total proportion anchored without more labour-intensive experimental work. Additionally, as physical and genetic mapping of a scaffold requires the presence of at least one marker, and confirming the correct orientation of a localised scaffold requires at least two markers, draft assemblies with many short scaffolds require . CC-BY 4.0 International license was not certified by peer review) is the author/funder. It is made available under a The copyright holder for this preprint (which this version posted November 6, 2018. . https://doi.org/10.1101/434670 doi: bioRxiv preprint much higher densities of markers in order to achieve near-complete chromosomal-level anchoring. Thus synteny-based predictions of scaffold adjacencies which reduce the total numbers of scaffolds to be mapped will allow for greater proportions of fragmented genome assemblies to be anchored using fewer markers. Our three-method synteny-based scaffold adjacency prediction workflow is relatively easily implemented and may flexibly include results from additional adjacency predictors or, as evidenced with our various types of validation datasets, alternative sources of adjacency information (Supplemental Fig. S10).
Rather than prescribing a panacea to cure all assembly ailments, we conclude that the components of this workflow may be adapted, substituted, extended, or simplified according to the needs and resources of any given draft genome assembly improvement project. Assessing the performance of three comparative genomics approaches and comparing their results with available experimental data demonstrates their utility as part of assembly improvement initiatives, as well as highlighting their complementarity to experimental approaches. The consensus predicted scaffold adjacencies can lead to substantial improvements of draft assemblies ( Fig. 1; Fig. 6; Table 1) without requiring additional sequencing-based support, and they can add to and improve physical mapping efforts ( Table 2). These evolutionarily guided methods therefore augment the capabilities of any genome assembly toolbox with approaches to assembly improvements or validations that will help draft assemblies mature into high-quality reference genomes.
. CC-BY 4.0 International license was not certified by peer review) is the author/funder. It is made available under a The copyright holder for this preprint (which this version posted November 6, 2018. . https://doi.org/10.1101/434670 doi: bioRxiv preprint

Synteny-based scaffold adjacency predictions
The synteny-based prediction tools require as input both delineated orthology and genomic location data for the annotated genes from each assembly. All gene annotations were retrieved from VECTORBASE (Giraldo-Calderón et al. 2015) and orthology data was retrieved from ORTHODB V9 (Zdobnov et al. 2017): versions of the genome assemblies and their annotated gene sets are detailed in Supplemental   Table S1, along with counts of scaffolds, genes, and orthologues. The complete 'frozen' input datasets of orthology relationships and genomic locations of the annotated genes for each of the 21 assemblies are presented in Supplementary Dataset SD1. ADSEQ analysis first builds reconciled gene trees for each orthologous group (gene family), then for pairs of gene families for which extant genomic adjacencies are observed, or suggested by sequencing data, a duplication-aware parsimonious evolutionary scenario is computed, via Dynamic Programming (DP), that also predicts extant adjacencies between genes at the extremities of contigs or scaffolds. This DP algorithm also accounts for scaffolding scores obtained from paired-end reads mapped onto contigs and provides a probabilistic score for each predicted extant adjacency, based on sampling optimal solutions (Anselmetti et al. 2018). ADSEQ was applied across the full anopheline input dataset to predict scaffold adjacencies (Supplemental Table   S2). GOS-ASM employs an evolutionary rearrangement analysis strategy on multiple genomes utilizing the topology of the species phylogenetic tree and the concept of the breakpoint graph (Aganezov and Alekseyev 2016). Fragmented genomes with missing assembly 'links' between assembled regions are modelled as resulting from artificial 'fissions' caused by technological fragmentation that breaks longer contiguous genomic regions (chromosomes) into scaffolds. Assembling these scaffolds is therefore reduced to a search for technological 'fusions' that revert non-evolutionary 'fissions' and glue scaffolds back into chromosomes. GOS-ASM was applied to the full anopheline input dataset to predict such scaffold 'fusions' (Supplemental Table S2). The ORTHOSTITCH approach was first prototyped as part of the investigation of greater synteny conservation in lepidopteran genomes (Kanost et al. 2016), and subsequently further developed as part of this study to include a scoring system and additional consistency checks. Searches are performed to identify orthologues (both single-copy and multi-copy orthologues are considered) at contig or scaffold extremities in a given assembly that form neighbouring pairs in the other compared assemblies, thereby supporting the hypothesis that these scaffolds should themselves be neighbours. ORTHOSTITCH was applied to the full anopheline input dataset to predict scaffold adjacencies (Supplemental Table S2; Supplemental Fig. S1). The CAMSA tool (Aganezov and Alekseyev 2017) was used to compare and merge scaffold assemblies produced by the three methods by identifying adjacencies in three-way and two-way agreement (with no third method conflict) (Supplemental Table S3). CAMSA was also used to build merged assemblies using only conservative . CC-BY 4.0 International license was not certified by peer review) is the author/funder. It is made available under a The copyright holder for this preprint (which this version posted November 6, 2018. . https://doi.org/10.1101/434670 doi: bioRxiv preprint three-way consensus adjacencies, and using liberal unions of all non-conflicting adjacencies. Quantifications of assembly improvements considered only scaffolds with annotated orthologous genes (because the synteny-based methods rely on orthology data) to count the numbers of scaffolds and compute scaffold N50 values before and after merging ( Fig. 1; Supplemental Figs. S2, S3). The results of the CAMSA merging procedure were used to quantify all agreements and conflicts amongst the different sets of predicted adjacencies ( Fig. 2; Supplemental Table S3; Supplemental Figs. S4, S5). A DOCKER container is provided that packages ADSEQ, GOS-ASM, ORTHOSTITCH, and CAMSA, as well as their dependencies, in a virtual environment that can run on a Linux server. See Supplementary Online Material for additional details for all synteny-based predictions and their comparisons, and the DOCKER container.

Validations with physical mapping and RNA sequencing data
Methods for chromosomal mapping of scaffolds are detailed for A. albimanus (Artemov et al. 2017), A.
atroparvus Neafsey et al. 2015;Artemov et al. 2018), A. stephensi (SDA-500) (Neafsey et al. 2015), A. stephensi (Indian) (Jiang et al. 2014), and A. sinensis (Chinese) (Wei et al. 2017). A. funestus mapping built on previous results (Sharakhov et al. 2002(Sharakhov et al. , 2004Xia et al. 2010) with additional FISH mapping (Supplemental Fig. S6) to further develop the physical map by considering several different types of mapping results. The complete 'frozen' input datasets of the physically mapped scaffolds for each of the six assemblies are presented in Supplementary Dataset SD2, with usable scaffold pair adjacencies in Supplemental Table S4, and the final mapped A. funestus scaffolds in Supplemental   Table S5. These adjacencies were compared with the CAMSA-generated two-way consensus assemblies, as well as the predictions from each method and the conservative and liberal consensus assemblies ( Fig.   3A; Supplemental Table S6). The RNAseq validations used genome-mapped paired-end sequencing data for 13 of the anophelines available from VECTORBASE (Giraldo-Calderón et al. 2015) (Release VB-2017-02), including those from the Anopheles 16 Genomes Project (Neafsey et al. 2015) and an A. stephensi (Indian) male/female study (Jiang et al. 2015). AGOUTI (Zhang et al. 2016) analyses were performed to identify transcript-supported scaffold adjacencies for these 13 anophelines (Supplemental Table S7).
These adjacencies were compared with the CAMSA-generated two-way consensus assemblies, as well as the predictions from each method and the conservative and liberal consensus assemblies ( Fig. 3B; Supplemental Table S8). See Supplementary Online Material for additional details for physical mapping and AGOUTI adjacencies and their comparisons.

Building the new assemblies
The new A. funestus assembly generated as part of this study was based on approximately 70X of PacBio sequencing data polished with QUIVER (from PacBio's SMRT Analysis software suite). This was . CC-BY 4.0 International license was not certified by peer review) is the author/funder. It is made available under a The copyright holder for this preprint (which this version posted November 6, 2018. . https://doi.org/10.1101/434670 doi: bioRxiv preprint combined with the reference assembly (AfunF1) using METASSEMBLER (Wences and Schatz 2015) to generate a merged assembly, and this merged assembly was then scaffolded with SSPACE (Boetzer et al. 2011) using the original Illumina sequencing data, and designated the A. funestus AfunIP assembly. The AfunIP assembly improves on the reference AfunF1 assembly at contig level but not at scaffold level (Supplemental Fig. S7; Supplemental Table S9). Where AfunIP scaffolds span the ends of AfunF1 scaffolds they provide support for AfunF1 scaffold adjacencies. Thus, whole genome alignments of the two assemblies were performed using LASTZ (Harris 2007) and used to identify corresponding genomic regions that enabled the alignment-based ordering and orientation of AfunF1 scaffolds, which were then compared with the synteny-based, physical mapping based, and AGOUTI-based, adjacencies (Fig. 4,   Supplemental Fig. S8; Supplemental Table S10). Using the AfunF1 assembly as the basis, and incorporating evidence from the AfunIP assembly through scaffold correspondences established from the whole genome alignments, the physical mapping data and the synteny-based and AGOUTI-based adjacency predictions were integrated to build the new reference assembly for A. funestus. The comprehensive update to the photomap employed BLAST searches to identify positions of the physically mapped DNA markers within the AfunF1 and AfunIP assemblies, and whole genome pairwise alignments to reconcile these two assemblies with the new photomap. Whole genome alignments of versions 1 and 2 assemblies for A. farauti, and A. merus were used to delineate corresponding scaffolds and identify supported, unsupported, and conflicting adjacencies (Supplemental Fig. S9; Supplemental Table S11). The new assemblies were built using the different datasets available for each of the anophelines (Supplemental Fig. S10): synteny data only for six, A. christyi, A. coluzzii, A. culicifacies, A. darlingi, A. maculatus, & A. melas; synteny and AGOUTI data for eight, A. arabiensis, A. dirus, A. epiroticus, A. farauti, A. merus, A. minimus, A. quadriannulatus, and A. sinsensis (SINENSIS); synteny and physical mapping data for A. sinensis (Chinese); synteny, AGOUTI, and physical mapping data for four, A. albimanus, A. atroparvus, A. stephensi (SDA-500), A. stephensi (Indian); and synteny, AGOUTI, physical mapping data, and the new PacBio-based assembly for A. funestus. See Supplementary Online Material for additional details on the PacBio assembly generation, the genome alignment based comparisons of the AfunF1 and AfunIP assemblies, and the workflow to integrate different adjacency predictions and build the new assemblies.

Data Access
The updated assemblies of 20 anophelines have been submitted to VECTORBASE (www.vectorbase.org) and will become available as the new reference assemblies. The input data for the synteny analyses of orthology relationships and genomic locations of the annotated genes are presented in Supplementary Dataset SD1. The complete input datasets of the physically mapped scaffolds for each of the six . CC-BY 4.0 International license was not certified by peer review) is the author/funder. It is made available under a The copyright holder for this preprint (which this version posted November 6, 2018. . https://doi.org/10.1101/434670 doi: bioRxiv preprint assemblies are presented in Supplementary Dataset SD2. The final sets of scaffold adjacencies and superscaffolds for all assemblies are presented in Supplementary Dataset SD3.