Evolutionary superscaffolding and chromosome anchoring to improve Anopheles genome assemblies

Background New sequencing technologies have lowered financial barriers to whole genome sequencing, but resulting assemblies are often fragmented and far from ‘finished’. Updating multi-scaffold drafts to chromosome-level status can be achieved through experimental mapping or re-sequencing efforts. Avoiding the costs associated with such approaches, comparative genomic analysis of gene order conservation (synteny) to predict scaffold neighbours (adjacencies) offers a potentially useful complementary method for improving draft assemblies. Results We evaluated and employed 3 gene synteny-based methods applied to 21 Anopheles mosquito assemblies to produce consensus sets of scaffold adjacencies. For subsets of the assemblies, we integrated these with additional supporting data to confirm and complement the synteny-based adjacencies: 6 with physical mapping data that anchor scaffolds to chromosome locations, 13 with paired-end RNA sequencing (RNAseq) data, and 3 with new assemblies based on re-scaffolding or long-read data. Our combined analyses produced 20 new superscaffolded assemblies with improved contiguities: 7 for which assignments of non-anchored scaffolds to chromosome arms span more than 75% of the assemblies, and a further 7 with chromosome anchoring including an 88% anchored Anopheles arabiensis assembly and, respectively, 73% and 84% anchored assemblies with comprehensively updated cytogenetic photomaps for Anopheles funestus and Anopheles stephensi. Conclusions Experimental data from probe mapping, RNAseq, or long-read technologies, where available, all contribute to successful upgrading of draft assemblies. Our evaluations show that gene synteny-based computational methods represent a valuable alternative or complementary approach. Our improved Anopheles reference assemblies highlight the utility of applying comparative genomics approaches to improve community genomic resources.


Contents
Details of all steps are presented in the following sections; here we provide an overview of the production of different sets of scaffold adjacencies for each of the anophelines and the different workflows that were followed to reconcile all the data to build the new assemblies (Figure S1). The simplest workflow (A, six assemblies) was used for A. christyi, A. coluzzii, A. culicifacies, A. darlingi, A. maculatus, and A. melas, for which only consensus synteny predictions were produced. Workflow B (eight assemblies) reconciled the synteny-based two-way consensus sets with the adjacency predictions from RNA sequencing (RNAseq) data using the AGOUTI and RASCAF (Song et al. 2016) tools to build new assemblies for A. arabiensis, A. dirus, A. epiroticus, A. farauti, A. merus, A. minimus, A. quadriannulatus, and A. sinensis (SINENSIS). Workflow C (four assemblies) additionally incorporated reconciliations with the available physical mapping data for A. albimanus, A. atroparvus, A. stephensi (SDA-500), and A. stephensi (Indian).
Workflow D was applied to A. funestus to also incorporate reconciliations with the adjacencies produced from comparing the reference assembly (AfunF1) with the new Pacific Biosciences (PacBio) assembly (AfunF2-IP). Workflow E was adopted for A. sinensis (Chinese) and employed just the synteny-based two-way consensus set and the physical mapping data. Finally, chromosome mapping data from A. arabiensis were combined with the workflow B results to produce the new chromosome-anchored assembly.
We employed gene orthology data delineated using ORTHODB (Zdobnov et al. 2017), but alternative methodologies may be used to define orthologous relations amongst the annotated gene sets of the species to be analysed. With gene orthology data and genomic location data from VECTORBASE (Giraldo-Calderón et al. 2015) prepared, we performed adjacency predictions with GOS-ASM and ORTHOSTITCH (this study) directly, while ADSEQ (Anselmetti et al. 2015) first required building sequence alignments and reconciled trees before scaffold neighbours were predicted (see the following sections for details). We then employed the CAMSA tool (Aganezov and Alekseyev 2017) for comparative analyses of the results from our different scaffold adjacency predictions to automatically build the most confident merged-scaffold assembly, and we used CAMSA's interactive visualisation framework to inspect conflicts in the assembly graph. For the species with no validation datasets we employed a simple two-way consensus approach with no third-method conflicts to define the final adjacencies. For the other species, all conflicts identified between the two-way consensus adjacencies and the alternative sources of adjacency information were manually resolved, the most complex being for A. funestus with the reconciliation of synteny, RNAseq (AGOUTI & RASCAF), PacBio-AfunF2-IP-alignment, and physical mapping data, and the construction of a new cytogenetic photomap.

Figure S1. Workflows applied to upgrade the 20 anopheline assemblies
A: two-way synteny only. B: two-way synteny and AGOUTI. C: two-way synteny, AGOUTI, and physical mapping data. D: two-way synteny, AGOUTI, physical mapping data, and PacBio sequencing data. E: two-way synteny and physical mapping data. Asterisks (*) indicate additional reconciliation with version 2 assemblies (V2 reconc.) for a subset of species.
[2] Superscaffolding and chromosome arm assignments

Robert M. Waterhouse, Livio Ruzzante, Maarten J.M.F. Reijnders, Romain Feron
The integrated approach to reconciling the different sources of scaffold adjacencies with available experimental data outlined above and detailed in the sections below improved assembly contiguity through building well-supported superscaffolds (Table 1, main text). For several assemblies, the superscaffolding also resulted in the recovery of additional 'complete' Benchmarking Universal Single-Copy Orthologues (BUSCOs) (Simão et al. 2015; Waterhouse et al. 2018, 2019) (Table S1), indicating that superscaffolding helped to recover some genes that previously appeared to be fragmented or missing.
Large increases in the numbers of recoverable BUSCOs are not expected as superscaffolding does not add new genomic sequence to the assemblies, but at least some partial genes at scaffold extremities now appear to be recoverable as 'complete' gene models. The superscaffolded assemblies also allowed for enhancing the anchoring of ordered and oriented scaffolds to chromosome arms (Table 2, main text), and the assignment of non-anchored scaffolds and superscaffolds to chromosome arms (Table S2; Additional File 2). The resulting superscaffolds had total spans ranging from more than 200 Mbp for A. arabiensis to less than 20 Mbp for A. maculatus, reflecting the contiguity of the input assemblies and the availability of complementary datasets to support superscaffolding (Figure S2). For ten assemblies the total span of superscaffolds comprised more than half the total assembly size, and they made up more than a quarter of a further seven assemblies (Figure S2). The enhanced chromosome anchoring for a subset of the anophelines (Table 2, main text) and the chromosomal-level assembly for A. gambiae PEST together allowed for the assignment of non-anchored scaffolds and superscaffolds to chromosome arms. Enumerating shared orthologues between non-anchored scaffolds and the eight species with chromosome-anchored scaffolds (see section [12] below for details) enabled assignments with support from multiple species (Table S2; Additional File 2).

Figure S2. Superscaffolding genomic spans of 20 anopheline genome assemblies
Superscaffolds are shown as stacked bars of alternating dark and light colours with lines within each superscaffold indicating the sizes (y-axis, basepairs) of their constituent scaffolds, and with superscaffolds and scaffolds ordered from the largest (left) to the smallest (right). The stacked bars continue with scaffolds that are not part of superscaffolds in grey, again ordered from the largest to the smallest. The assemblies are grouped and coloured according to the types of data and approaches used to perform the superscaffolding as presented in the legend and in main text Table 1. Approaches: synteny-based (SYN), and/or RNAseq AGOUTI-based (AGO), and/or alignment-based (ALN), and/or physical mapping-based (PHY), and/or PacBio sequencing-based (PB). Results for two strains are shown for Anopheles sinensis, SINENSIS and Chinese (C), and Anopheles stephensi, SDA-500 and Indian (I).
The local impact of superscaffolding on improving the ability to identify syntenic orthologues between pairs of assemblies was assessed by enumerating pairs and trios of collinear orthologues before and after superscaffolding (Figure S3). From the full set of orthologous groups delineated across the 21 Anopheles assemblies (detailed in section [3] below), a subset of 10'657 groups were selected with orthologues in more than half of the assemblies and with more than half of these being single-copy orthologues. Being widely present and mostly single-copy, these orthologous groups represent a relatively evolutionarily stable set of genes with which to assess local synteny. For each assembly, neighbouring pairs and trios of these genes with orthologues that were maintained as neighbours in the other assemblies were counted before and after superscaffolding. Comparing each superscaffolded assembly with its input assembly showed the greatest gains of almost 3'000 pairs and about 2'000 trios for A. culicifacies, A. christyi, and A. melas, all of which were built following workflow A (i.e. only synteny-based adjacencies). The global impact of superscaffolding is exemplified by comparing orthologue locations in the A. gambiae (PEST) genome and the new A. arabiensis assembly to reveal large-scale structural variants (Figure S4). These confirm the rearrangements identified from the previous scaffold-level assembly for A. arabiensis, which was used to explore patterns of introgression in the species complex (Fontaine et al. 2015), and those known from previous polytene chromosome studies (Coluzzi et al. 2002).
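The counting of neighbouring pairs and trios described above can be sketched as follows. This is a minimal illustration and not the code used in the study: the function names, the toy orthologous group identifiers (OG1 to OG4), and the representation of an assembly as scaffold-ordered lists of single-copy orthologous groups are all assumptions made for the example.

```python
def collinear_blocks(gene_order, size):
    """Return the set of runs of `size` neighbouring orthologous groups.

    gene_order maps a scaffold name to its ordered list of orthologous
    group identifiers (hypothetical representation; assumes single-copy
    groups so that each identifier occurs once per assembly).
    """
    blocks = set()
    for groups in gene_order.values():
        for i in range(len(groups) - size + 1):
            window = tuple(groups[i:i + size])
            # Store each run in a canonical direction so that reversed
            # scaffolds still count as the same neighbourhood.
            blocks.add(min(window, window[::-1]))
    return blocks


def shared_blocks(assembly, other, size):
    """Count runs of neighbours that are maintained in both assemblies."""
    return len(collinear_blocks(assembly, size) & collinear_blocks(other, size))


# Toy data: two input scaffolds joined into one superscaffold.
reference = {"chr": ["OG1", "OG2", "OG3", "OG4"]}
before = {"s1": ["OG1", "OG2"], "s2": ["OG3", "OG4"]}
after = {"ss1": ["OG1", "OG2", "OG3", "OG4"]}

gained_pairs = shared_blocks(after, reference, 2) - shared_blocks(before, reference, 2)
gained_trios = shared_blocks(after, reference, 3) - shared_blocks(before, reference, 3)
```

In this toy case joining the two scaffolds gains one pair and two trios relative to the reference, which is the kind of difference summarised along the diagonal of Figure S3.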

Figure S3. Increases in pairs and trios of syntenic orthologues after superscaffolding
Heatmaps of counts of additional neighbouring pairs (below the diagonal, from blue=low to yellow=high) and trios (above the diagonal, from purple=low to red=high) of genes with orthologues maintained as neighbours in pairs of assemblies after superscaffolding. The outlined cells along the diagonal present gained pairs and trios for each superscaffolded assembly compared with its input assembly. See Table S3 for the species that corresponds to each assembly abbreviation.
These comparisons confirm the structural variants identified from the previous scaffold-level assembly for A. arabiensis that was used to explore patterns of introgression in the species complex (Fontaine et al. 2015).

[3] Sources of input data for predicting adjacencies

Robert M. Waterhouse
The orthology data used as inputs for each of the three synteny-based methods were retrieved from ORTHODB V9.1 (www.orthodb.org) (Zdobnov et al. 2017). These orthologous groups included all the anophelines apart from A. sinensis SINENSIS strain and A. stephensi Indian strain, so proteins from the gene sets of these two anophelines were mapped to the ORTHODB anopheline orthologous groups using the complete species mapping approach of ORTHODB. The protein sequences used by ORTHODB, and the gene annotations required for the adjacency predictions, were retrieved from VECTORBASE (Giraldo-Calderón et al. 2015). The versions of the genome assemblies and their annotated gene sets are detailed in Table S3, along with counts of scaffolds, genes, and orthologues.

ADSEQ, GOS-ASM, and ORTHOSTITCH share the overarching goal of identifying blocks of collinear orthologues across several species that can be used to infer scaffold adjacencies in species where this collinearity has been broken due to assembly fragmentation. They operate in a framework where multiple rearrangements over the course of evolution have gradually eroded the collinearity of extant genomes with the ancestral organisation into shorter synteny blocks. Within these synteny blocks, broken collinearity in one or more species delineates putative rearrangement breakpoints, which may range in age from events that occurred early in the species radiation to younger lineage- or species-specific rearrangement events.
Once these breakpoints have been identified, the methods then attempt to decide whether an observed breakpoint in an extant genome is the result of a true genomic rearrangement event or the result of assembly fragmentation, considering breakpoints at the extremities of contigs/scaffolds to be more likely due to assembly fragmentation than to true genomic rearrangement events.

Yoann Anselmetti, Sèverine Bérard, Eric Tannier, Cedric Chauve
Full descriptions of the algorithms implemented, underlying assumptions, and performance of ADSEQ are detailed in (Duchemin et al. 2017; Anselmetti et al. 2015, 2018). ADSEQ implements extensions to a group of approaches that aim to reconstruct evolutionary histories of gene adjacencies, based on the DECO algorithm (Bérard et al. 2012). ADSEQ computes ancestral genome segments and extant scaffolding adjacencies, taking advantage of sequencing data (e.g. paired-end reads) if available, and enabling inferences of various evolutionary events including gene duplications/losses/translocations along each branch of the provided species phylogeny. Previous simulations of assembly fragmentation using a subset of Anopheles genomes have detailed the performance of ADSEQ in terms of precision and recall statistics, including comparisons with the scaffolder BESST. Similarly, ART-DECO analyses, detailed in (Anselmetti et al. 2015), simulated fragmentation of tetrapod genomes to evaluate the ability to recover broken scaffold adjacencies.
Gene trees. Gene trees contain the information about how genes, and the traits they are related to, evolve along the history of the species. They give access to information about adaptations by substitutions, gene gains and losses, duplications, and transfers. Gene trees can also be used to detect coevolutionary elements in genomes. Moreover, gene trees are useful to reconstruct ancestral genomes and provide better assemblies for extant species, as shown in (Duchemin et al. 2017; Anselmetti et al. 2015, 2018). However, the quality of the results depends strongly on the quality of the gene trees. For the Anopheles genomes, genes were clustered into ORTHODB orthologous groups (Table S3), and multiple alignments were computed for each group using MUSCLE v.3.8.425 (Edgar 2004). These were then used as input for RAXML (Stamatakis 2014) phylogenetic tree estimations, in a large-scale automatic effort. A substantial number of branches are probably incorrect, since multiple sequence alignments often do not contain enough signal to fully resolve the gene tree. We applied the gene tree correction program TREERECS (https://gitlab.inria.fr/Phylophile/Treerecs) to correct these gene trees and our preliminary analysis shows that the corrected trees are of better quality (in terms of ancestral gene content) than the original ones.
Scaffolding extant and ancestral genomes. In a second step we used these improved reconciled gene trees to jointly reconstruct ancestral and extant gene adjacencies. The general approach is described in our recent papers (Anselmetti et al. 2015): we consider pairs of gene families for which extant adjacencies (synteny) are observed and compute, from the reconciled gene trees, a duplication-aware parsimonious evolutionary scenario in terms of adjacency gains/breaks that can also create extant adjacencies between genes at the extremities of contigs/scaffolds. The method has been modified to include sequencing data for the inference of potential extant scaffolding adjacencies; it is thus based on a combination of evolutionary signal and sequence data. We used all sequencing data available for the 21 anophelines to associate a prior score to potential extant scaffolding adjacencies with the scaffolder BESST. The new method, using both evolutionary signal and sequencing data, is called ADSEQ; it includes a probabilistic version of the algorithm that allows optimal solutions to be sampled uniformly and a posterior score to be associated with each potential scaffolding adjacency (both extant and ancestral), defined as the frequency of observing the adjacency in this sample. Finally, if adjacency conflicts are observed (e.g. the same contig extremity is deemed to be adjacent to more than one other contig extremity), we use a Maximum Weight Matching algorithm to resolve these conflicts, using the posterior score of the adjacencies as edge weights. Resulting counts of predicted scaffold adjacencies for each of the anopheline assemblies are presented in Table S4.
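The conflict-resolution step can be illustrated on a toy example. The sketch below is not the ADSEQ implementation: it replaces a proper Maximum Weight Matching algorithm with an exhaustive search, which finds the same optimum but is only practical for a handful of candidate adjacencies. The function name, the extremity labels, and the scores are hypothetical.

```python
def resolve_conflicts(candidates):
    """Pick the non-conflicting adjacencies with the largest total score.

    candidates is a list of (extremity_a, extremity_b, posterior_score).
    Exhaustive search stands in for a matching algorithm here, so this is
    only suitable for tiny illustrative inputs.
    """
    best_score, best_set = 0.0, []

    def search(index, used, score, chosen):
        nonlocal best_score, best_set
        if index == len(candidates):
            if score > best_score:
                best_score, best_set = score, list(chosen)
            return
        a, b, weight = candidates[index]
        # Option 1: leave this adjacency out.
        search(index + 1, used, score, chosen)
        # Option 2: keep it, if both extremities are still free.
        if a not in used and b not in used:
            chosen.append((a, b))
            search(index + 1, used | {a, b}, score + weight, chosen)
            chosen.pop()

    search(0, frozenset(), 0.0, [])
    return best_set, best_score


# Hypothetical contig extremities and posterior scores.
candidates = [
    ("c1:end", "c2:start", 0.9),
    ("c1:end", "c3:start", 0.6),  # conflicts with the first (c1:end reused)
    ("c4:end", "c3:start", 0.5),  # conflicts with the second (c3:start reused)
]
kept, total = resolve_conflicts(candidates)
```

Here the first and third candidates are kept: together they outweigh the second, which conflicts with both. Each extremity ends up with at most one partner, which is the property that the matching enforces.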
Data and code availability. Input data and results obtained with ADSEQ are available from the GitHub repository https://github.com/YoannAnselmetti/DeCoSTAR_pipeline in the directory named "21Anopheles_dataset". This contains a pipeline written in Snakemake, a Python workflow management system (Köster and Rahmann 2012), allowing users to generate the input data required for ADSEQ and execute it from standard genomic format files. Input gene trees and adjacencies were produced from ORTHODB orthologous groups and gene locations available in Additional File 3.

Sergey Aganezov, Max A. Alekseyev
Full descriptions of the algorithms implemented, underlying assumptions, and performance of GOS-ASM are detailed in (Aganezov and Alekseyev 2016; Avdeyev et al. 2016). Technological fragmentation is modelled by artificial "fissions" that break genomic chromosomes into scaffolds. Scaffold assembly can therefore be reduced to the search for "fusions" that revert technological "fissions" and glue scaffolds back into chromosomes. This observation inspired us to employ genome rearrangement analysis techniques for scaffolding purposes. Rearrangement analysis of multiple genomes relies on the concept of the breakpoint graph and utilizes the topology of the organisms' phylogenetic tree. While traditionally the breakpoint graph is constructed for complete genomes, it can also be constructed for fragmented genomes, where we treat scaffolds as "chromosomes". We demonstrate that the breakpoint graph of multiple genomes possesses an important property: its connected components are robust with respect to technological genome fragmentation. In other words, connected components of the breakpoint graph mostly retain information about the complete genomes, even when the breakpoint graph is constructed on their scaffolds. We thus use the topology of the species phylogenetic tree and the structure of the connected components in the corresponding breakpoint graph to reconstruct the "reverse evolution" of the input genomes along the branches of the phylogenetic tree, distinguishing between signatures of evolutionary and technological fissions. Identified technological fissions are then used as guidance for the gluing of input scaffolds back into complete chromosomes.
Resulting counts of predicted scaffold adjacencies from applying GOS-ASM to the full set of anopheline assemblies are presented in Table S4.

Robert M. Waterhouse
Using gene orthology data from cross-species comparisons, ORTHOSTITCH identifies genes located at scaffold extremities and evaluates the evidence from the locations of orthologous genes from other species to predict likely scaffold adjacencies. The analysis proceeds in a stepwise manner, first identifying the most likely neighbour for each scaffold end and then requiring best neighbours to be reciprocal in order to identify putative adjacencies. The evaluations are not limited to single-copy orthologues as analyses of all paralogues are performed such that all possible neighbour relationships are examined.
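The reciprocal requirement in the stepwise analysis described above can be sketched as follows. This is an illustration only: the function name and the scaffold-end labels are hypothetical, and the selection of each end's most likely neighbour is assumed to have been done already.

```python
def reciprocal_best(best_neighbour):
    """Keep only pairs of scaffold ends that choose each other.

    best_neighbour maps each scaffold end to the end identified as its
    most likely neighbour (hypothetical input for this sketch).
    """
    pairs = set()
    for end, partner in best_neighbour.items():
        if best_neighbour.get(partner) == end:
            # Sort so that each reciprocal pair is reported only once.
            pairs.add(tuple(sorted((end, partner))))
    return sorted(pairs)


best_neighbour = {
    "A:right": "B:left",
    "B:left": "A:right",   # reciprocal with A:right
    "C:right": "B:left",   # not reciprocated: B:left prefers A:right
    "D:left": "C:right",   # not reciprocated: C:right prefers B:left
}
putative = reciprocal_best(best_neighbour)
```

Only the A and B ends form a putative adjacency here, since the choices made for the C and D ends are not returned by their chosen partners.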
Putative neighbours at scaffold extremities are scored by how many of the species with orthologues show the same neighbour relationship (Figure S5), requiring at least two species to do so. ORTHOSTITCH was developed as part of the synteny-focused comparative analyses of the Manduca sexta genome (Kanost et al. 2016); it is described in detail below and the code is available from the GitLab project page: https://gitlab.com/rmwaterhouse/OrthoStitch
ORTHOSTITCH requires as input an anchor groups file and an anchor locations file. The anchor groups file may be generated from any orthology delineation procedure, and consists of just three columns of data: the orthologous group identifier, the gene identifier, and the species identifier. The anchor locations file may be generated from general feature format (GFF) or general transfer format (GTF) files that indicate the genomic locations of annotated features (genes) for each assembly. Like GFF or GTF files, the anchor locations file consists of nine columns, with only the coding sequence (CDS) lines selected from GFF or GTF files, with the 'source' column (2nd column) containing the species identifier, and with the 'attribute' column (9th column) containing only the gene identifier. The gene and species identifiers used in both the groups file and the locations file must match exactly, and gene identifiers must be unique across the complete dataset of all species. The anchor locations file may contain the locations of genes that are not present in the anchor groups file, i.e. some genes with known locations may not have been assigned to any orthologous group. However, the anchor groups file must not contain any genes that are absent from the anchor locations file, i.e. all genes in orthologous groups must have known locations.
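The constraints on the two input files can be checked before running an analysis. The sketch below is not part of ORTHOSTITCH: the function names are hypothetical, tab separation of the groups file is an assumption (GFF and GTF are tab-separated, but the delimiter of the groups file is not stated here), and the gene identifiers and coordinates are toy values.

```python
import csv
import io


def read_rows(text, n_columns):
    """Parse tab-separated text and check the column count of every row."""
    rows = [row for row in csv.reader(io.StringIO(text), delimiter="\t") if row]
    for row in rows:
        if len(row) != n_columns:
            raise ValueError(f"expected {n_columns} columns, found {len(row)}: {row}")
    return rows


def check_anchor_inputs(groups_text, locations_text):
    """Return a list of problems found when cross-checking the two inputs."""
    problems = []
    species_of = {}
    for row in read_rows(locations_text, 9):
        species, gene = row[1], row[8]  # 2nd column: species, 9th column: gene
        if species_of.setdefault(gene, species) != species:
            problems.append(f"{gene}: gene identifier used by more than one species")
    for _group, gene, species in read_rows(groups_text, 3):
        if gene not in species_of:
            problems.append(f"{gene}: in groups file but has no location")
        elif species_of[gene] != species:
            problems.append(f"{gene}: species differs between the two files")
    return problems


# Toy inputs: geneA2 has a location but no group (allowed);
# geneA9 has a group but no location (not allowed).
groups_text = "OG1\tgeneA1\tAFUNE\nOG1\tgeneB1\tAGAMB\nOG2\tgeneA9\tAFUNE\n"
locations_text = (
    "KB668690\tAFUNE\tCDS\t100\t400\t.\t+\t0\tgeneA1\n"
    "KB668690\tAFUNE\tCDS\t900\t1200\t.\t-\t0\tgeneA2\n"
    "2L\tAGAMB\tCDS\t5000\t5600\t.\t+\t0\tgeneB1\n"
)
problems = check_anchor_inputs(groups_text, locations_text)
```

Several CDS lines for the same gene are accepted, since a gene normally has more than one coding exon; only a gene identifier reused by a different species is reported.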

Figure S5. Example of ORTHOSTITCH adjacency evidence
This putative adjacency is identified in A. funestus (AFUNE, blue) with both scaffolds in the forward orientation, where orthologous genes from 12 other anophelines support the neighbour relationship (green). In three other anophelines one or more intervening genes disrupt the neighbour relationship of these pairs of orthologues (orange). In the remaining five anophelines there are no orthologues or the orthologues have no neighbouring genes and thus they offer neither support nor evidence against the putative neighbour relationship (yellow), or there are orthologues with neighbours but they do not support the putative adjacency (purple). So this adjacency is supported by evidence from 12 species out of a possible 16 for scaffold KB668690 and out of a possible 18 for scaffold KB668920, giving a synteny score of 0.71 and a universality score of 0.85 with a final adjacency score of 0.60.
ORTHOSTITCH options allow for the genomic location of each anchor gene to be set as the start, middle, or end of the input coding sequence genomic coordinates, and the analyses can be run using only genes with orthologues or with all genes in the locations file. All predicted adjacencies are further classified into confident and superconfident subsets. Confident adjacencies require more than a third of comparison species to have orthologues and more than a third of those that do have orthologues to support the predicted scaffold adjacency. Superconfident adjacencies additionally require the same of their upstream or downstream neighbours. The adjacency score for each pair of putatively neighbouring scaffolds is computed as the product of a synteny score (S) and a universality score (U), based on the numbers of species with orthologues that support the adjacency, where Sup = the number of supporting species, Pos = the number of possible species, and Tot = the total number of species.
Orthology data from ORTHODB v9 (Zdobnov et al. 2017) were used to produce the input anchor groups file and the anchor locations were produced from GFF files from VECTORBASE (Giraldo-Calderón et al. 2015) (see Table S3). The ORTHOSTITCH (v1.6) analysis was run using data from all 21 available anophelines with the options of anchor locations set to 'middle' and using all annotated genes, and the resulting adjacency counts are presented in Table S4.
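The adjacency score can be worked through with the values from the Figure S5 example. The expressions for S and U are not reproduced in this text, so the pooled forms used below (S as total Sup over total Pos across the two scaffolds, U as total Pos over twice Tot) are an assumption, chosen because they reproduce the reported scores; averaging the per-scaffold ratios would give the same rounded values. Tot = 20 is likewise assumed, taken as the 21 anophelines less the species being analysed.

```python
def adjacency_score(sup_a, pos_a, sup_b, pos_b, tot):
    """Return (S, U, S * U) for a putative adjacency between two scaffolds.

    Pooled forms of S and U are an assumption of this sketch; see the
    accompanying text.
    """
    synteny = (sup_a + sup_b) / (pos_a + pos_b)      # S: support among possible species
    universality = (pos_a + pos_b) / (2 * tot)       # U: possible species among all species
    return synteny, universality, synteny * universality


# Figure S5 example: 12 supporting species, out of a possible 16 for
# scaffold KB668690 and a possible 18 for scaffold KB668920.
s, u, score = adjacency_score(12, 16, 12, 18, 20)
```

Rounded to two decimal places these give S = 0.71, U = 0.85, and an adjacency score of 0.60, matching the values quoted for Figure S5.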
The performance of ORTHOSTITCH in terms of the ability to recover true adjacencies versus false adjacencies was assessed using the same input dataset from the 21 anophelines with the introduction of artificial scaffold/chromosome breaks. Four different types of randomly positioned scaffold/chromosome-splitting breaks were introduced and analysed separately: (i) between any (ANY) neighbouring pair of orthologues; (ii) between neighbouring orthologue pairs both from orthologous groups containing at least a third (1/3) of the 21 species; (iii) between neighbouring orthologue pairs both from orthologous groups with more than half (1/2) of the 21 species, a gene-to-species ratio of no more than 1.5 (i.e. limiting the numbers of duplicated copies), and restricted to scaffolds/chromosomes with at least 25 orthologues in total (i.e. avoiding splitting shorter scaffolds); and (iv) the same as (iii) but also requiring the neighbouring pair to have been part of the supporting sets that defined the superconfident adjacencies (1/2+SYN) in Table S4 (i.e. known to provide synteny support). A total of 100 random scaffold/chromosome breaks were introduced and then analysed to predict putative adjacencies and assess how many of the artificially introduced breaks were correctly recovered as predicted adjacencies and how many were incorrectly recovered; this was repeated 100 times for each of the four different types of neighbouring orthologues. True adjacencies are those that correctly predict the split pairs of orthologues as neighbours; false adjacencies are those that incorrectly predict a different neighbour for either or both of the split orthologues. These were assessed for the 'all' and 'confident' sets of adjacencies predicted by ORTHOSTITCH.
ORTHOSTITCH options were selected as for the complete analysis above, with anchor locations set to 'middle' and using all annotated genes. Median true recoveries for the sets of all adjacencies were 74%, 82%, 87%, and 96% for the four split types, ANY, 1/3, 1/2, 1/2+SYN, respectively, versus median false recoveries for the same sets of 2, 2, 2, and 1 (Figure S6). True recoveries increased according to split type from ANY to 1/3 to 1/2 to 1/2+SYN, as more orthologues and more syntenic orthologues at breakpoints allow for better predictions. True recoveries decreased for the confident datasets as the more stringent prediction criteria filter out real adjacencies. False recoveries were very low across all analysed datasets, with a few more from the all versus the confident predictions. Thus for similar datasets ORTHOSTITCH is expected to be able to recover about three quarters of true adjacencies. When the genes at the scaffold extremities have orthologues in more than a third or more than half the species, recovery levels are expected to increase, and when these orthologues provide synteny support the adjacencies are almost always recovered.
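The break-and-recover evaluation described above can be sketched in a simplified form. This is not the evaluation code used in the study: the function names and gene identifiers are hypothetical, breaks are placed between any neighbouring genes (corresponding only to the ANY split type), and the predictions passed to the scoring function are supplied by hand rather than by ORTHOSTITCH.

```python
import random


def introduce_breaks(scaffolds, n_breaks, rng):
    """Split scaffolds at randomly chosen positions between neighbouring genes.

    scaffolds maps a name to an ordered gene list. Returns the fragmented
    scaffolds and the list of gene pairs separated by the breaks.
    """
    candidates = [(name, i) for name, genes in scaffolds.items()
                  for i in range(1, len(genes))]
    chosen = set(rng.sample(candidates, n_breaks))
    fragments, split_pairs = {}, []
    for name, genes in scaffolds.items():
        start, part = 0, 1
        for i in range(1, len(genes) + 1):
            if i == len(genes) or (name, i) in chosen:
                fragments[f"{name}.{part}"] = genes[start:i]
                if i < len(genes):
                    split_pairs.append((genes[i - 1], genes[i]))
                start, part = i, part + 1
    return fragments, split_pairs


def recovery(split_pairs, predicted_pairs):
    """Count true and false recoveries among predicted neighbour pairs."""
    truth = {frozenset(pair) for pair in split_pairs}
    predicted = {frozenset(pair) for pair in predicted_pairs}
    split_genes = set().union(*truth)
    true_hits = truth & predicted
    # False: a split gene is predicted to neighbour a different gene.
    false_hits = {pair for pair in predicted - truth if pair & split_genes}
    return len(true_hits), len(false_hits)


scaffolds = {"chr": ["g1", "g2", "g3", "g4", "g5", "g6"]}
fragments, split_pairs = introduce_breaks(scaffolds, 2, random.Random(0))
# Hand-made predictions: one correct, one pairing a split gene wrongly.
predicted = [split_pairs[0], (split_pairs[1][0], "gX")]
n_true, n_false = recovery(split_pairs, predicted)
```

Repeating such a procedure over many random replicates, and expressing true recoveries as a proportion of the introduced breaks, gives summaries of the kind reported above.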

Figure S6. Performance of ORTHOSTITCH adjacency recovery
For each of four different types of neighbouring orthologues (ANY, 1/3, 1/2, 1/2+SYN, see text for details), a total of 100 random scaffold/chromosome breaks were introduced into the gene locations data. These were then analysed to predict putative adjacencies and assess how many introduced breaks were recovered as predicted adjacencies. This was repeated 100 times for each of the four different types of neighbouring orthologues.
Results were assessed for two levels of confidence estimated by ORTHOSTITCH, namely all (blue) and confident (orange) adjacencies to enumerate true recovered adjacencies (left panel), i.e. those that correctly predict the split pairs of orthologues as neighbours, and false recovered adjacencies (right panel), i.e. those that incorrectly predict a different neighbour for either or both of the split orthologues.

The CAMSA tool automates the process of comparing and merging scaffold assemblies produced by alternative methods as well as providing interactive visualisations that enable detailed manual inspections of the scaffold adjacency agreements and conflicts identified during the merging process (Aganezov and Alekseyev 2017). CAMSA allows working with both oriented and (partially) un-oriented scaffold assemblies under the same unifying framework, thus greatly simplifying the downstream analysis process when working with data produced by both computational and wet-lab based methods. CAMSA (version 1.1.0b14, https://github.com/compbiol/CAMSA) was applied to the predicted adjacencies from each of the three synteny-based methods to produce three consensus sets for each of the 20 anopheline assemblies: conservative three-way consensus adjacency sets, two-way consensus adjacency sets with no third-method conflicts, and liberal union sets of all non-conflicting adjacencies. Pre-filtering of the predicted adjacencies first removed any pairs of scaffolds where one or both remained un-oriented (i.e., semi-un-oriented assembly pairs were removed). Thus common adjacencies must agree both at the level of being predicted neighbours and their relative orientations. Conflicting adjacencies occur when one or both scaffolds in a pair predicted by one method are predicted to be paired with a different scaffold (or the same scaffold but the opposite orientation) by another method. The remaining unique and non-conflicting adjacencies from each method formed part of the liberal union sets.
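The three consensus sets can be illustrated with a simplified sketch. This is not CAMSA's algorithm or interface: the function names are hypothetical, each adjacency is represented as a pair of oriented scaffold extremities (so that the same two scaffolds joined in a different orientation count as a conflict), and the toy predictions are invented for the example.

```python
def adjacency(end_a, end_b):
    """An adjacency is an unordered pair of oriented scaffold extremities."""
    return frozenset((end_a, end_b))


def conflicts(adj, others):
    """True if another adjacency reuses an extremity with a different partner."""
    return any(adj != other and adj & other for other in others)


def consensus_sets(predictions):
    """Return (three-way, two-way with no third-method conflict, union) sets.

    predictions maps a method name to its set of predicted adjacencies.
    """
    all_adjacencies = set().union(*predictions.values())
    three_way, two_way, union = set(), set(), set()
    for adj in all_adjacencies:
        supporters = [m for m, predicted in predictions.items() if adj in predicted]
        others = set().union(*(p for m, p in predictions.items() if m not in supporters))
        if len(supporters) == len(predictions):
            three_way.add(adj)
        if len(supporters) >= 2 and not conflicts(adj, others):
            two_way.add(adj)
        if not conflicts(adj, all_adjacencies):
            union.add(adj)
    return three_way, two_way, union


a = adjacency("s1:tail", "s2:head")
b = adjacency("s2:tail", "s3:head")
c = adjacency("s4:tail", "s5:head")
d = adjacency("s2:tail", "s6:head")   # conflicts with b
e = adjacency("s7:tail", "s8:head")
f = adjacency("s9:tail", "s10:head")
predictions = {
    "ADSEQ": {a, b, c, f},
    "GOS-ASM": {a, b},
    "ORTHOSTITCH": {a, d, e, f},
}
three_way, two_way, union = consensus_sets(predictions)
```

In this toy case only `a` is in three-way agreement, the two-way set additionally keeps `f`, and `b` is dropped from it because the third method predicts the conflicting `d`. The union keeps every adjacency except the conflicting `b` and `d`.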
Adjacencies in three-way and two-way agreement in the resulting CAMSA-produced consensus sets (Table S5) were used to build the synteny-improved assemblies and compute scaffold N50 values and counts before and after merging. As the synteny-based methods rely on orthologous anchors as their input data they cannot predict adjacencies for scaffolds with no annotated orthologous genes, thus N50 values and counts were computed based only on scaffolds with annotated orthologues (Fig. 2, main text; Figures S7 and S8). Linear regressions were plotted with 95% confidence intervals using the geom_smooth() function from the R package ggplot2, specifying the 'lm' method.

Figure S7. Assembly improvements based on conservative set synteny predictions
For details see Fig. 2, main text.

Figure S8. Assembly improvements based on liberal union set synteny predictions
For details see Fig. 2, main text.
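The scaffold N50 values and counts before and after merging can be computed as in the sketch below. It is an illustration under stated assumptions rather than the code used in the study: the function names and scaffold lengths are hypothetical, and a superscaffold is given the summed length of its members with no allowance for gaps.

```python
def n50(lengths):
    """Length of the scaffold at which half the total span is reached."""
    lengths = sorted(lengths, reverse=True)
    total = sum(lengths)
    running = 0
    for length in lengths:
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("no scaffold lengths given")


def merge_scaffolds(lengths, superscaffolds):
    """Replace the members of each superscaffold with one merged entry."""
    merged = dict(lengths)
    for members in superscaffolds:
        merged["+".join(members)] = sum(merged.pop(name) for name in members)
    return merged


# Toy assembly: five scaffolds, two predicted superscaffolds.
lengths = {"s1": 50, "s2": 40, "s3": 30, "s4": 20, "s5": 10}
merged = merge_scaffolds(lengths, [["s3", "s4"], ["s2", "s5"]])
n50_before, count_before = n50(lengths.values()), len(lengths)
n50_after, count_after = n50(merged.values()), len(merged)
```

For these toy lengths the N50 rises from 40 to 50 while the scaffold count falls from five to three, the same two quantities that are compared before and after merging in Fig. 2 and Figures S7 and S8.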

Synteny-based method comparisons
Comparing the CAMSA-produced two-way consensus sets with the input adjacencies from each of the three methods quantified agreements (Table S5) as well as conflicting and unique adjacencies predicted by each method for each assembly (Fig. 3, main text; Figure S9). A total of 29'418 distinct scaffold adjacencies were identified from the combined results of all 42'923 predictions from the three methods.
These were classified according to whether they were in three-way agreement, in two-way agreement with no third-method conflict, in two-way agreement but with conflict(s), unique to an individual method with no conflict(s) with the other methods, or unique to an individual method but with conflict(s).
Comparing all 42'923 predictions identified 29'418 distinct scaffold adjacencies, 36% of which were supported by at least two methods. Overall, 10% of the distinct adjacencies were predicted by all three methods, and a further 26% were predicted by two methods but this was reduced to 20% when adjacencies that conflicted with the third method were removed. These 8'878 supported predictions were used to build the two-way consensus sets of scaffold adjacencies for synteny-based assembly improvements presented in Fig. 2. Main text Fig. 3B shows the overlaps amongst the three methods, plotted as an area-proportional Euler diagram with EULERAPE v3.0.0 (Micallef and Rodgers 2014).
Adjacencies in three-way agreement made up 30% of GOS-ASM and 27% of ORTHOSTITCH predictions, and 13% of ADSEQ predictions (as there were about double the number of ADSEQ predictions compared with the other two methods). The much larger total number of ADSEQ predictions resulted in a higher proportion of unique adjacencies (54%) compared with GOS-ASM (35%) and ORTHOSTITCH (31%).
Considering only the liberal union sets of all non-conflicting adjacencies, the adjacencies in three-way agreement made up 16.5% of the total, 45.6% of GOS-ASM, 39.1% of ORTHOSTITCH, and 18.6% of ADSEQ predictions (Fig. 3B, main text). From the two-way consensus adjacency sets with no third-method conflicts, three-way consensus adjacencies made up 32.8% of the total, 53.8% of GOS-ASM, 44.4% of ORTHOSTITCH, and 33.4% of ADSEQ predictions (Fig. 3B, main text). These two-way consensus adjacencies that were employed to build the new superscaffolded assemblies were therefore supported by ADSEQ (98.1%), and/or ORTHOSTITCH (73.7%), and/or GOS-ASM (60.9%), with a third being supported by all three methods. Thus, comparing the results from the three methods and employing a filter of two-way agreement with no third-method conflict improved the overall level of three-way agreement from a tenth to a third.

... shared amongst all three methods (green), or two methods without (blues) and with (purple) third method conflicts, or that are unique to a single method and do not conflict (yellow) or do conflict with predictions from one (orange) or both (red) of the other methods. Note variable maxima for y-axes.
Examining the results from each individual assembly (selected assemblies shown in Fig. 3C, main text; all assemblies shown in Figures S9 and S10) showed generally good agreement for at least eight of the assemblies (more than 48% of distinct adjacencies were found to be in at least two-way agreement with no third-method conflict), with A. funestus achieving the highest consistency at 58%. Some of the most fragmented input assemblies produced some of the largest sets of distinct adjacency predictions, but the agreement amongst these predictions was generally lower than for the other assemblies, e.g. A. maculatus with 8'179 distinct adjacencies, of which only 18% showed at least two-way agreement with no conflicts (Figure S10). A. albimanus showed a very low level of agreement (16.7%), but this is primarily because of the very few predicted adjacencies: just six distinct adjacencies with only one being shared between two of the methods.

Figure S10. Proportions of synteny-based adjacencies in agreement for each assembly
Comparisons of the number of distinct adjacencies and the proportion of these that were common to at least two methods with no third method conflict. Two-way consensus adjacencies made up 48% or more of the distinct predictions for eight assemblies, while some of the most fragmented assemblies with the most predicted adjacencies showed lower levels of agreement. For AalbS1 (Anopheles albimanus), only one of the six distinct adjacency predictions was in the two-way consensus set. See Table S3 for the species that corresponds to each assembly identifier.

... (Artemov et al. 2017), A. atroparvus (Neafsey et al. 2015; Artemov et al. 2018a), A. sinensis Chinese strain (Wei et al. 2017), A. stephensi SDA-500 strain, and A. stephensi Indian strain (Jiang et al. 2014). A. stephensi mapping added to existing mapping data (Sharakhova et al. 2006), and A. funestus mapping built on previous results (Sharakhov et al. 2002, 2004; Xia et al. 2010) to further develop the physical map as described in detail below. Counts of mapped scaffolds and the resulting scaffold adjacencies, i.e. pairs of neighbouring mapped scaffolds, are summarised in Table S6.

Mosquitoes were raised in a growth chamber at 27°C, with a 12-hour cycle of light and darkness.
Approximately 20-21 hours post-blood feeding, ovaries of adult females were pulled out and fixed in Carnoy's solution (3 : 1 ethanol : glacial acetic acid by volume). Ovaries were preserved in fixative solution from 24 h up to 1 month at -20°C.

Chromosome preparation:
Isolated ovaries were bathed in a drop of 50% propionic acid for 5 minutes and squashed as previously described. The quality of the preparation was assessed with an Olympus CX41 phase contrast microscope (Olympus America Inc., Melville, NY). High-quality chromosome preparations were then flash frozen in liquid nitrogen and immediately placed in cold 50% ethanol. After that, preparations were dehydrated in an ethanol series (50%, 70%, 90%, and 100%) and air-dried.
Unstained chromosomes were observed using an Olympus BX41 phase contrast microscope with attached CCD camera Qcolor5 (Olympus America Inc., Melville, NY).

Probe preparation and fluorescence in situ hybridization:
Gene-specific primers were designed to amplify unique exon sequences from the beginning and end of each scaffold using the primer-BLAST program (Ye et al. 2012 Figure S11). CIRCOLETTO (Darzentas 2010) was used to visualize sequence similarity between linked Illumina scaffolds and merged PacBio scaffolds, as well as their order and orientation. Illumina scaffolds were ordered and oriented within large PacBio contigs and merged PacBio scaffolds, and the resulting arrangements were anchored to chromosomes by FISH as described above.

Figure S11. Fluorescence in situ hybridization (FISH) mapping in Anopheles funestus.
Multicolour FISH of four DNA probes designed based on gene sequences. Polytene chromosomes are from ovarian nurse cells of A. funestus.

Chromosome mapping:
Illumina scaffolds and merged Illumina-PacBio arrangements were anchored to chromosomes in several different ways.
(1) Scaffolds without adjacency and orientation were placed on chromosomes with only one FISH probe.
(2) Oriented scaffolds without adjacency were placed on chromosomes with at least two FISH probes, but they did not have any neighbours.
(3) Scaffolds with adjacency but without orientation consisted of two or several neighbouring scaffolds mapped with one FISH probe each. Alternatively, several Illumina scaffolds were predicted by BLAST to be adjacent within a PacBio contig or PacBio-merged scaffold, but the whole assembly was anchored to the chromosome by only one FISH probe.
(4) Ordered and oriented scaffolds were placed on chromosomes by multiple FISH probes (Figure S11), or their adjacency and orientation were inferred from the alignment to a mapped and oriented PacBio contig or PacBio-merged scaffold.
The resulting physical genome map for A. funestus includes 202 AfunF1 scaffolds (Table S7).
As for the comparisons of the synteny-based results, CAMSA was used to compare the two-way consensus sets, as well as the conservative three-way consensus sets and the liberal union sets of all non-conflicting adjacencies, with the physical mapping adjacencies from each of the six assemblies and quantify agreements as well as conflicting and unique adjacencies (Table S8).
For A. albimanus, the two-way consensus synteny-based predictions produced only a single adjacency, and this was confirmed by the physical mapping data. Five of the 15 two-way consensus synteny-based predictions were confirmed by physical mapping of A. atroparvus scaffolds and only one conflict (resolved) was identified (Fig. 4A, main text). The mapped scaffolds for the A. stephensi assemblies resulted in very few adjacencies: the three SDA-500 strain adjacencies were all in conflict with synteny-based predictions, and of the six Indian strain adjacencies three were shared and one was in conflict with the two-way consensus synteny-based predictions. These conflicts were resolved by correcting the orientations of the physically mapped scaffolds, as the probe designs meant that mapping misorientations were possible.
Comparing the 20 A. sinensis (Chinese) mapped scaffolds confirmed three of the synteny-based adjacencies, but none of these were in the consensus sets, and identified conflicts with just two of the 92 two-way consensus adjacencies, both of which were resolved as they involved scaffolds that had not been selected for physical mapping. Finally, A. funestus presented the most adjacencies from both physical mapping and the synteny-based predictions, where 12-17% of the different sets of synteny-based adjacencies were confirmed and just 4-8% were in conflict (Fig. 4A, main text). Amongst the 14 physically mapped neighbouring pairs that conflicted with 13 synteny-based adjacencies from the two-way consensus set, five conflicts were resolved because the synteny-based neighbour was short and not used for physical mapping. An additional four conflicts were resolved by switching the orientation of physically mapped scaffolds, which were anchored by only a single FISH probe and therefore their orientations were not confidently determined. All but one of these adjacency conflicts were resolved either because the scaffolds involved had not been selected for physical mapping or because the orientation determined by physical mapping was not confident and was thus inverted.

... (Nystedt et al. 2013), and transcript-based scaffolding of the Loblolly pine genome linked together 31'231 scaffolds into 9'170 larger scaffolds (Zimin et al. 2014). Although large introns could potentially result in scaffold skipping and introduce large gaps, the Anopheles genomes are all relatively small (as shown in Figure 1, main text), and long introns are rare: e.g. the best annotated, An. gambiae, has a mean intron length of 1'577 bp and only ~1.5% are longer than 20 Kbp (across assemblies: average of mean lengths, 776 bp; with an average of 101 introns per assembly longer than 20 Kbp).
The presence of highly similar paralogues could also lead to incorrect read mapping that can hinder the correct identification of scaffold-spanning transcripts, but confident adjacencies can be identified by using uniquely-mapping reads with good coverage.
The Annotated Genome Optimization Using Transcriptome Information (AGOUTI) tool employs RNAseq data to identify such adjacencies as well as correcting any fragmented gene models at the ends of scaffolds. AGOUTI identifies pairs of reads that are mapped to different contigs/scaffolds (joining-pairs) and uses only those joining-pairs that are uniquely mapped with a default minimum coverage of five reads. The performance of AGOUTI was previously evaluated by randomly fragmenting the genome of Caenorhabditis elegans (N2 strain) at six different levels of fragmentation and comparing the results with those of another RNAseq-based scaffolder, RNAPATH (Mortazavi et al. 2010).

... including those from the Anopheles 16 Genomes Project and an A. stephensi (Indian) male/female study. These data were downloaded from VECTORBASE in the form of pre-computed BAM files: RNAseq reads aligned to the assemblies using HISAT2 version 2.0.4 (Kim et al. 2015). All BAM files were sorted by read name (required by AGOUTI), and where more than one BAM file was available for a given assembly they were first merged; both sorting and merging were performed using SAMTOOLS version 0.1.19-44428cd (Li et al. 2009). AGOUTI was run in scaffold mode with default parameters, e.g. for A. dirus 'python2 agouti.py scaffold -assembly anopheles-dirus.fa -bam AdirW1.sorted.bam -gff anopheles-dirus.gff3 -outdir ADIRU'. The numbers of resulting predicted adjacencies ranged from just two for A. albimanus to more than 200 for A. sinensis (SINENSIS) (Table S9).
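
The joining-pair criterion described above can be illustrated with a minimal sketch. It is not AGOUTI's code: the input format (one tuple per read pair, already reduced to the scaffold of each mate and a flag for unique mapping) and the function name are assumptions, and the threshold of five is interpreted here as the number of supporting read pairs.

```python
from collections import Counter


def joining_pairs(read_pairs, min_support=5):
    """Count uniquely mapped read pairs whose mates lie on different scaffolds.

    read_pairs: iterable of (scaffold_of_mate1, scaffold_of_mate2, uniquely_mapped),
    a hypothetical pre-parsed representation of the alignments.
    Returns {(scaffold_a, scaffold_b): n_pairs} for pairs with enough support.
    """
    counts = Counter()
    for scaffold_1, scaffold_2, uniquely_mapped in read_pairs:
        if uniquely_mapped and scaffold_1 != scaffold_2:
            # Sort names so that A-B and B-A are counted together.
            counts[tuple(sorted((scaffold_1, scaffold_2)))] += 1
    return {pair: n for pair, n in counts.items() if n >= min_support}


toy_pairs = (
    [("A", "B", True)] * 3 + [("B", "A", True)] * 2   # five supporting pairs
    + [("B", "C", True)] * 4                          # below the threshold
    + [("A", "C", False)] * 6                         # not uniquely mapped
    + [("A", "A", True)] * 10                         # both mates on one scaffold
)
print(joining_pairs(toy_pairs))
```

In practice the orientation of each mate would also be needed to orient the joined scaffolds; that step is omitted from this sketch.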
Validation of the AGOUTI-predicted adjacencies was performed using the alternative RNAseq-based approach of RASCAF (Song et al. 2016), GitHub version 10.07.2018, with minimum support for connecting two contigs of five and the coordinate-sorted alignment BAM files. RASCAF consistently predicted more adjacencies than AGOUTI, and full support for the AGOUTI-predicted adjacencies ranged from 2/2 for An. albimanus to just 5/39 for An. atroparvus (Table S9). Adjacencies predicted by both methods were given priority during reconciliation with the scaffold adjacencies from synteny and physical mapping data. As for the comparisons of the physical mapping results with the synteny-based results, CAMSA was used to compare the two-way consensus sets, as well as the conservative three-way consensus sets and the liberal union sets of all non-conflicting adjacencies, with the AGOUTI-based adjacencies from each of the 13 assemblies and quantify agreements as well as conflicting and unique adjacencies (Table S10). The AGOUTI-based scaffold adjacencies supported up to 17-20% of two-way consensus synteny-based adjacencies in some species, with generally few conflicts but up to 11% and 14% conflicting for A. stephensi (Indian) and A. sinensis (SINENSIS), respectively, which had the most AGOUTI-based scaffold adjacencies. Across all 13 assemblies, 18% of AGOUTI-based scaffold adjacencies supported the two-way consensus synteny-based adjacencies, only 7% were in conflict, and 75% were unique to the AGOUTI sets.
At the contig level the new AfunF2-IP assembly is an improvement over the reference AfunF1, e.g. the number of contigs is reduced from 9'880 to 4'170 and the NG50 increases from 47 Kbp to 194 Kbp.
However, longer-range scaffolding of these contigs unfortunately failed to produce a better quality scaffold-level assembly. In terms of gene content, analysis with 2'799 dipteran Benchmarking Universal Single-Copy Orthologues (BUSCOs) (Simão et al. 2015; Waterhouse et al. 2018, 2019) indicates that despite the better contigs fewer BUSCOs are found as complete genes in the AfunF2-IP assembly (Table S11). For comparison, the new chromosomal-level assembly for A. funestus (Ghurye et al. 2019a) (AfunF3) achieves slightly lower BUSCO completeness with 96.0% 'complete' (Table S1).
The AfunF1 assembly has a very high level of N's, 15.63% compared with just 0.90% for the AfunF2-IP assembly, reflecting how scaffolding improves N50 measures but mainly by joining contigs with stretches of unknown nucleotides (N's). When the scaffolds are artificially de-scaffolded by splitting them at consecutive runs of 3, 300, and 1'000 Ns, the new AfunF2-IP assembly is clearly much better (Figure S12). The stringent splitting at N>=3 also indicates the greater integrity of the sequence quality of the AfunF2-IP assembly, as this does not result in high fragmentation levels as it does for AfunF1 (i.e. from 3'772 scaffolds to 4'186 contigs for AfunF2-IP but from 1'391 scaffolds to 9'878 contigs for AfunF1).

263,192,532 260,811,631
NG50 244,910 194,030
Maximum 7,451,746 3,313,857

Figure S12. Cumulative scaffold lengths for Anopheles funestus AfunF1 and AfunF2-IP assemblies
Cumulative assembly length plots for the reference AfunF1 and the new AfunF2-IP Anopheles funestus scaffold-level assemblies. Lengths are summed and plotted from the longest to the shortest scaffold for each assembly. These are replotted for each assembly after splitting scaffolds at consecutive runs of 3, 300, and 1'000 Ns, i.e. effectively de-scaffolding them and slicing at ambiguous or low-quality regions.
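
The de-scaffolding at runs of Ns and the NG50 measure referred to above can be sketched as follows. This is an illustrative sketch rather than the script used for Figure S12; the function names are hypothetical and the genome size passed to the NG50 function is an assumed input.

```python
import re


def split_at_n_runs(sequence, min_run):
    """Split a scaffold at every run of at least min_run consecutive Ns.

    Shorter runs of Ns are left inside the resulting pieces.
    """
    pattern = re.compile(r"[Nn]{%d,}" % min_run)
    return [piece for piece in pattern.split(sequence) if piece]


def ng50(lengths, genome_size):
    """Length L such that pieces of length >= L cover half of genome_size.

    Unlike N50, the reference total is an (assumed) genome size rather
    than the assembly length. Returns 0 if half is never reached.
    """
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= genome_size:
            return length
    return 0


toy_scaffold = "ACGT" + "N" * 3 + "ACGTAC" + "N" * 300 + "AC"
for threshold in (3, 300, 1000):
    pieces = split_at_n_runs(toy_scaffold, threshold)
    print(threshold, [len(piece) for piece in pieces])
```

Applying the same split to every scaffold of an assembly and sorting the resulting lengths from longest to shortest gives the input for a cumulative length plot.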

Robert M. Waterhouse, Livio Ruzzante, Romain Feron
Despite the lack of longer-range scaffolding information from the AfunF2-IP assembly, the scaffolds are nonetheless useful for the purposes of identifying potential adjacencies of the AfunF1 scaffolds through whole genome alignment analyses. The first step towards delineating the order and orientation of A. funestus AfunF1 scaffolds along those of the AfunF2-IP assembly was to mask each assembly with a library of anopheline repeats using REPEATMASKER (Smit et al. 2015) and then perform a pairwise LASTZ (Harris 2007) whole genome alignment with default parameters. The resulting alignment blocks were then interrogated with a custom Perl script to define alignment blocks of more than 10 basepairs (bps) from AfunF1, allowing for insertions or deletions of no more than 10 bps in either assembly and requiring AfunF1 genomic regions to be unique (basepairs falling in regions that appeared in more than one alignment block were ignored unless the second-best scoring block scored less than 75% of the best-scoring block, in which case only the best-scoring block was considered). This identified a total of 124'926 links connecting 1'098 AfunF1 scaffolds to 2'845 AfunF2-IP scaffolds with a mean length of 1'234 bps, median of 650 bps, and maximum of 31'044 bps.
Links were then bundled into larger link-regions allowing a maximum of 30 Kbps between links from the same pairs of scaffolds with the same orientations. The largest bundle (by genomic span of the bundled links) for each AfunF1 scaffold was used to define the corresponding AfunF2-IP scaffold, and its mapping location was set at the midpoint of the bundle's genomic span on the AfunF2-IP scaffold, thereby ordering and orientating A. funestus AfunF1 scaffolds along their corresponding AfunF2-IP scaffolds and producing a final set of 321 alignment-based scaffold adjacencies. Each set of predicted adjacencies, the consensus adjacencies, the physical mapping adjacencies, and the AGOUTI adjacencies were compared with the set of alignment-based scaffold adjacencies (Table S12). As the alignments consider scaffolds regardless of whether they were targeted for physical mapping or whether they have any annotated orthologues, short un-annotated scaffolds may be ordered and oriented, which can then result in conflicts with the synteny-based or physical mapping-based adjacencies that do not consider such scaffolds. Ignoring short scaffolds (<5 Kbps) or scaffolds with less than 30% aligned sequence reduces the total number of alignment-based scaffold adjacencies by about half to just 154, but this results in additional supported adjacencies being recovered for all the comparison sets, increasing support for the synteny-based sets from 14-17.5% to 19-23% and for AGOUTI predictions from 15% to 17% (Table S12). The ordered and oriented scaffolds were visualised using CIRCOS (Krzywinski et al. 2009) to display alignments greater than 100 bps and bundled links greater than 3 Kbps, and to examine the concordance between the different adjacency predictions (Figure 5, main text; Figure S13).
The recent availability of a new chromosomal-level assembly for A. funestus (Ghurye et al. 2019a) (AfunF3), which used long-reads and Hi-C data from the same A. funestus FUMOZ colony, enabled structural comparisons of the original AfunF1 assembly and the AfunF2 superscaffolded assembly with the AfunF3 as a high-quality reference genome. Comparisons were performed with the QUality ASsessment Tool for large genomes (QUAST-LG v5.0.2), which measures completeness and correctness of an assembly against a high-quality reference genome (Mikheenko et al. 2018): 'quast.py AfunF2.fa AfunF1.fa -r AfunF3.fa -o Afun_QUAST -e -t 6 --large --circos -u -m 1'. QUAST-LG aligns query assemblies to a reference assembly and reports differences as misassemblies including relocations (same chromosome), translocations (different chromosomes), and inversions (Table S13). QUAST-LG reported totals of 1'980 differences for AfunF1 and an additional 211 differences for AfunF2, with the same proportion of scaffold differences being relocations (both 94%), i.e. mostly putative local rearrangements.
Regions that are neighbours on the y-axis but not on the x-axis indicate putative translocations in the AfunF2 scaffolds and superscaffolds with respect to the AfunF3 chromosomes.
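
The bundling of alignment links into link-regions and the placement of each AfunF1 scaffold at the midpoint of its largest bundle, as described above, can be sketched as follows. This is an illustration and not the custom script used in the study: the link representation is hypothetical, and the 30 Kbps limit is assumed here to be measured on the AfunF2-IP coordinates.

```python
from collections import defaultdict

MAX_GAP = 30_000  # maximum distance between links in one bundle (30 Kbps)


def bundle_links(links, max_gap=MAX_GAP):
    """links: (f1_scaffold, f2_scaffold, orientation, f2_start, f2_end) tuples.

    Links from the same pair of scaffolds with the same orientation are
    merged when they lie within max_gap of each other on the F2 scaffold.
    """
    groups = defaultdict(list)
    for f1, f2, orientation, start, end in links:
        groups[(f1, f2, orientation)].append((start, end))
    bundles = []
    for (f1, f2, orientation), intervals in groups.items():
        intervals.sort()
        cur_start, cur_end = intervals[0]
        for start, end in intervals[1:]:
            if start - cur_end <= max_gap:
                cur_end = max(cur_end, end)
            else:
                bundles.append((f1, f2, orientation, cur_start, cur_end))
                cur_start, cur_end = start, end
        bundles.append((f1, f2, orientation, cur_start, cur_end))
    return bundles


def place_scaffolds(bundles):
    """Keep the bundle with the largest span for each F1 scaffold and order
    F1 scaffolds along each F2 scaffold by the midpoint of that span."""
    best = {}
    for f1, f2, orientation, start, end in bundles:
        if f1 not in best or end - start > best[f1][3] - best[f1][2]:
            best[f1] = (f2, orientation, start, end)
    placed = defaultdict(list)
    for f1, (f2, orientation, start, end) in best.items():
        placed[f2].append(((start + end) / 2, f1, orientation))
    return {f2: [(f1, o) for _, f1, o in sorted(items)] for f2, items in placed.items()}


toy_links = [
    ("A", "X", "+", 0, 1_000), ("A", "X", "+", 20_000, 21_000),
    ("A", "X", "+", 100_000, 100_500), ("B", "X", "-", 50_000, 60_000),
    ("A", "Y", "+", 0, 500),
]
print(place_scaffolds(bundle_links(toy_links)))
```

Consecutive entries in each ordered list correspond to alignment-based scaffold adjacencies; filters such as ignoring short scaffolds would be applied before this step.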
[12] Reconciliation to build the new assemblies
Robert M. Waterhouse, Jiyoung Lee, Livio Ruzzante, Maarten J.M.F. Reijnders, Romain Feron, Daniel Lawson, Gareth Maslen, Igor V. Sharakhov
In order to build the new assemblies for A. albimanus, A. atroparvus, A. farauti, A. melas, and A. merus, results from the two-way consensus synteny predictions, and AGOUTI and physical mapping data (where available), had to be compared and reconciled with their version 2 reference assemblies. For the published A. albimanus AalbS2 assembly, new physical mapping data (also used in this study) were used to improve the assembly by correcting nine misassemblies and anchoring 98% to chromosomes (Artemov et al. 2017). This splitting of the misassembled scaffolds resulted in an increase from 204 AalbS1 scaffolds to 236 AalbS2 scaffolds. The single synteny-based prediction from the two-way consensus set was in agreement with the physical mapping data, as were two of the three adjacencies unique to ORTHOSTITCH, and were therefore already present in the upgraded AalbS2 chromosomal assembly. AGOUTI predicted only two adjacencies, both of which were between very short scaffolds (1'148 bp and 1'012 bp) with no gene annotations and much longer already anchored scaffolds (Table 2, main text).
For the published A. atroparvus AatrE2 assembly, and later AatrE3, additional physical mapping data (also used in this study) were used to anchor 56 scaffolds (201 Mbps, 89.6% of the assembly) to chromosomes, leaving 1'315 scaffolds unmapped (Artemov et al. 2018a). The A. melas AmelC2 assembly was produced from the AmelC1 assembly following the removal of several duplicated scaffolds and regions of scaffolds, thereby reducing the number of scaffolds by 52 to 20'229 scaffolds with an unchanged scaffold N50 of 18 Kbps. This affected only 112 scaffolds that were part of 121 adjacencies; these adjacencies were retained where the removed regions made up less than 25% of the original scaffold and were removed from scaffold ends not involved in any adjacencies. Thus 95% of AmelC1 adjacencies (97% of scaffolds) were reconciled with the AmelC2 assembly and were used to build the AmelC3 assembly.
The version 2 assemblies for A. farauti (AfarF2) and A. merus (AmerM2) were derived from re-scaffolding efforts that included the addition of a large-insert 'fosill' sequencing library constructed from high molecular weight DNA, which reduced the numbers of scaffolds from 550 to 310 and 2'753 to 2'027 and increased N50 values from 1'197 Kbps to 12'895 Kbps and 342 Kbps to 1'490 Kbps, respectively. The version 1 assemblies were aligned to the version 2 assemblies using BLAST+ (Camacho et al. 2009) and all scaffolds involved in the synteny-based or AGOUTI-based adjacency predictions were visualised with their corresponding version 2 scaffolds using CIRCOS (Krzywinski et al. 2009). In this way, the predicted adjacencies from version 1 assemblies were assessed to identify adjacencies fully supported by alignments to version 2 scaffolds, e.g. seven A. farauti synteny-based two-way consensus set adjacencies confirmed by the alignment with a single AfarF2 scaffold (Figure S15). These assessments also identified adjacencies without support from the version 2 assemblies but which were nonetheless not in conflict (i.e. predicted neighbouring scaffolds that were not joined during the re-scaffolding process), supported neighbours but conflicting orientations, and adjacencies where the arrangements in corresponding version 2 scaffolds precluded the possibility of being neighbours (Table S14). The comparisons identified full support for the majority (87% and 82%) of the two-way synteny consensus set adjacencies and unresolvable conflicts for just 5% and 10%, while the AGOUTI-based adjacencies achieved similarly high levels of full support (81% and 67%), but with slightly greater proportions of conflicts.

Figure S15. Collinearity between Anopheles farauti AfarF1 and AfarF2 scaffolds
Anopheles farauti AfarF1 scaffold adjacencies supported by collinearity with the subsequent AfarF2 assembly. Seven adjacencies from the A. farauti synteny-based two-way consensus set predicted the order and orientation of eight AfarF1 scaffolds that are fully supported by the alignment with a single AfarF2 scaffold. Scaffold lengths are shown in increments of 0.1 Mbps. AfarF2 KI915049 aligned with AfarF1 KI421600, KI421705, KI421694, KI421638, KI421757, KI421658, KI421727, KI421610.

New assembly FASTA files and annotation 'lift-over' details
The final lists of pairwise adjacencies and the superscaffolds (Additional File 6), with superscaffolds presented in a GRIMM-like format (http://grimm.ucsd.edu/GRIMM/grimm_instr.html), were combined with the VECTORBASE (Release VB-2019-06) assembly sequence data (FASTA format) and assembly annotation data (GFF3 and GTF formats) to produce the new updated assemblies and their corresponding annotations. The adjacencies defined the neighbouring scaffolds that were fused together with an insertion of a stretch of 100 N's to indicate a sequence gap, and with scaffold orientations reversed as required by the relative orientations of the pairwise adjacencies and superscaffolds.
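
The fusing of neighbouring scaffolds with 100 N's and the corresponding annotation 'lift-over' can be illustrated with a minimal sketch. This is not the pipeline used to produce the released files: the input representation (an ordered list of scaffold names with '+' or '-' orientations), the function names, and the use of 1-based inclusive coordinates as in GFF3 are assumptions for illustration.

```python
GAP = "N" * 100  # stretch of 100 N's inserted between fused scaffolds
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Reverse-complement a sequence (other IUPAC codes are left unchanged)."""
    return seq.translate(_COMPLEMENT)[::-1]


def build_superscaffold(ordered, sequences, gap=GAP):
    """ordered: list of (scaffold_name, '+' or '-'); sequences: name -> sequence."""
    parts = []
    for name, orientation in ordered:
        seq = sequences[name]
        parts.append(reverse_complement(seq) if orientation == "-" else seq)
    return gap.join(parts)


def lift_feature(start, end, strand, scaffold_len, offset, flipped):
    """Convert 1-based inclusive feature coordinates on a scaffold into
    superscaffold coordinates; offset is the number of bases preceding
    the scaffold in the superscaffold."""
    if flipped:
        start, end = scaffold_len - end + 1, scaffold_len - start + 1
        strand = {"+": "-", "-": "+"}.get(strand, strand)
    return start + offset, end + offset, strand


toy_sequences = {"s1": "AAC", "s2": "GGT"}
superscaffold = build_superscaffold([("s1", "+"), ("s2", "-")], toy_sequences)
# s2 follows s1 (3 bp) and one gap (100 bp), and is reversed.
lifted = lift_feature(1, 2, "+", scaffold_len=3, offset=103, flipped=True)
print(len(superscaffold), lifted)
```

Parsing of the GRIMM-like superscaffold files is deliberately left out, as their exact layout is not described here.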

Chromosome arm assignment using updated assemblies and annotations
Several whole-arm translocations in the anophelines  mean that the five chromosomal elements that make up the X chromosome and the two autosomes correspond to different named chromosome arms in different species (Table S15), and thus results are presented as assignments to elements one to five rather than named chromosome arms. Combining orthology data delineated for genes from all 21 assemblies (see section [3] above) and chromosome arm locations for genes from the eight assemblies with chromosomal anchoring data, orthologues of genes on each scaffold were enumerated for each element from each of the eight chromosome-anchored assemblies (Additional File 2). To be considered for assignment, the scaffold was required to have a minimum of ten genes with annotated orthologues. The scaffold was then assigned to an element when at least 75% of these orthologues were located on a single element. Confident assignments reported in Table S2 and main text Fig. 1 were required to be confirmed by data from at least two species, and conflicting assignments were excluded as they could represent translocation events (assignments with only single-species support or with conflicting species support are reported in Additional File 2 but flagged as not assigned).
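
The assignment rules above can be sketched as follows. This is an illustrative reading of the criteria and not the code used for the study: it assumes that the thresholds of ten genes and 75% are applied separately for each chromosome-anchored reference species, and the function names and input format are hypothetical.

```python
from collections import Counter

MIN_GENES = 10       # minimum genes with annotated orthologues on the scaffold
MIN_FRACTION = 0.75  # minimum fraction of orthologues on a single element


def assign_from_species(elements):
    """elements: element labels (1 to 5) of the orthologues, in one anchored
    species, of the genes on a scaffold. Returns an element or None."""
    if len(elements) < MIN_GENES:
        return None
    element, count = Counter(elements).most_common(1)[0]
    return element if count / len(elements) >= MIN_FRACTION else None


def confident_assignment(per_species):
    """per_species: anchored species name -> list of element labels.

    A confident assignment needs support from at least two species and
    no conflicting assignment from any other species."""
    calls = [assign_from_species(elements) for elements in per_species.values()]
    calls = [call for call in calls if call is not None]
    if len(calls) >= 2 and len(set(calls)) == 1:
        return calls[0]
    return None


toy_scaffold = {
    "species_1": [1] * 9 + [2],      # 90% on element 1
    "species_2": [1] * 8 + [3] * 2,  # 80% on element 1
    "species_3": [1] * 5,            # too few genes to be considered
}
print(confident_assignment(toy_scaffold))
```

Scaffolds with only single-species support or with conflicting species support return no assignment here, mirroring the cases flagged as not assigned in Additional File 2.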

DOCKER container
A DOCKER container is provided that packages ADSEQ, GOS-ASM, ORTHOSTITCH, and CAMSA, as well as their dependencies, in a virtual environment that can run on a Linux server. It is available from: https://hub.docker.com/r/mreijnders/synteny/