Chromosome-scale assemblies reveal the structural evolution of African cichlid genomes

Abstract Background African cichlid fishes are well known for their rapid radiations and are a model system for studying evolutionary processes. Here we compare multiple, high-quality, chromosome-scale genome assemblies to elucidate the genetic mechanisms underlying cichlid diversification and study how genome structure evolves in rapidly radiating lineages. Results We re-anchored our recent assembly of the Nile tilapia (Oreochromis niloticus) genome using a new high-density genetic map. We also developed a new de novo genome assembly of the Lake Malawi cichlid, Metriaclima zebra, using high-coverage Pacific Biosciences sequencing, and anchored contigs to linkage groups (LGs) using 4 different genetic maps. These new anchored assemblies allow the first chromosome-scale comparisons of African cichlid genomes. Large intra-chromosomal structural differences (∼2–28 megabase pairs) among species are common, while inter-chromosomal differences are rare (<10 megabase pairs total). Placement of the centromeres within the chromosome-scale assemblies identifies large structural differences that explain many of the karyotype differences among species. Structural differences are also associated with unique patterns of recombination on sex chromosomes. Structural differences on LG9, LG11, and LG20 are associated with reduced recombination, indicative of inversions between the rock- and sand-dwelling clades of Lake Malawi cichlids. M. zebra has a larger number of recent transposable element insertions compared with O. niloticus, suggesting that several transposable element families have a higher rate of insertion in the haplochromine cichlid lineage. Conclusion This study identifies novel structural variation among East African cichlid genomes and provides a new set of genomic resources to support research on the mechanisms driving cichlid adaptation and speciation.


Conclusion
This study identifies novel structural variation among East African cichlid genomes and provides a new set of genomic resources to support research on the mechanisms driving cichlid adaptation and speciation.

Keywords
Genome assembly; African cichlids; comparative genomics; genome rearrangements; chromosome evolution; karyotype; inversion; recombination; transposable elements; genetic maps 4 Background African cichlid fishes, due to their phenotypic diversity and rapid speciation over the last several million years, are a model system for studying the mechanisms of evolution [1]. Many studies of cichlid speciation have used short read data to perform genome scans of SNPs and small indels in order to identify genomic regions under selection [2][3][4]. However, there are numerous other ways that genomes can evolve, including the accumulation of larger indels, as well as intra-and interchromosomal rearrangements. Identification of these types of mutation requires high quality, nearly complete genome sequences.
Draft genomes of five African cichlid species were previously generated using Illumina short-read sequencing and used in an initial analysis exploring some of the forces at play in African cichlid speciation [5]. The draft genome assembly of the Lake Malawi cichlid, Metriaclima zebra, was one of the most continuous and accurate genomes assembled from short reads, as revealed in the Assemblathon 2 competition [6]. However, these five draft genome assemblies still contain a large number of gaps, and only the assembly of the Nile tilapia, Oreochromis niloticus, has been anchored to linkage groups (LGs), making it difficult to compare the structure of cichlid genomes at chromosomal scales.
To improve these cichlid genome resources, we have employed long-read Pacific Bioscience SMRT sequencing. Long-read DNA sequencing technology has made it much easier to create accurate and contiguous genome assemblies [7][8][9][10][11]. In particular, long-read technologies have allowed the assembly of repetitive sequences, and the identification of structural variants. We previously improved the genome assembly for the Lake Malawi cichlid, M. zebra, by sequencing 16.5X coverage of PacBio reads to fill in gaps and characterize repetitive sequences [12]. We also produced a new high-quality genome assembly of O. niloticus, using 44X coverage PacBio sequencing. We were able to anchor 86.9% of the assembly to linkage groups, which allowed us to characterize the structure of two sex determination regions in tilapias [13].
Cichlid karyotypes are similar to other perciform fish. The diploid chromosome number (2n) varies from 32-60, but more than 60% of species have a diploid number of 2n = 48 [14]. Most of the chromosomes are acrocentric, but between 0 and 9 metacentric pairs are present in each species [15,16]. These karyotypic changes may have played an important role in the evolution and speciation of African cichlids. Classical cytogenetic techniques are able to characterize differences in chromosome number and large fusion or translocation events, which are easily seen under the microscope. They are less suited to studying smaller genome rearrangements, including inversions smaller than several megabases. Comparisons of chromosome scale assemblies in other vertebrate groups have begun to identify extensive structural differences at both the cytogenetic and the sequence assembly level [17,18], but the role of chromosome rearrangements in recent adaptive radiations has not been well studied.
Chromosome-scale assemblies can be achieved either by physical mapping techniques [19], or by anchoring the contigs of the sequence assembly to genetic linkage maps. Genetic maps have the advantage of reflecting another important feature of genomes, namely variation in recombination rate, which has manifold impacts on the levels of genetic polymorphism [20] and the efficiency of genome scans [21].
Here we describe chromosome-scale assemblies of two cichlid genomes.
First, we re-anchor our previously published PacBio assembly of the O. niloticus genome [13] using a new high-density genetic map [22]. Second, we present a new assembly of M. zebra based on 65X coverage of long PacBio sequence reads.
Finally, we anchor the M. zebra assembly with several recombination maps produced from hybrid crosses among closely related species from Lake Malawi. The anchored genome assemblies of these two species allow for the first chromosome-scale comparison of African cichlid genomes. We focus our analyses on three aspects of genome evolution that are revealed by these new chromosome-scale assemblies: variation in recombination rate across the genome, structural variation among cichlid lineages, and the landscape of transposable elements.
First, we describe the pattern of recombination along each chromosome.
Spatial variation in recombination rate has implications for patterns of genetic variation [23,24], the evolution of sex chromosomes [25], and the analysis of genome-wide associations between phenotypes and genotypes [21]. Despite the importance of recombination in shaping genome architecture [26], it is only beginning to be studied in cichlids [27]. A great diversity of sex chromosomes have evolved in East African cichlids, likely the result of sexual genetic conflict [28].
Rapid changes in sex determination mechanism, which are frequently variable even within species, may play an important role in cichlid speciation [1]. The evolution of new sex chromosomes often involves chromosomal inversions, which change the pattern of recombination [29][30][31][32][33]. Studies of these changing patterns of recombination, and their effects on genetic variation, have been hampered by the incomplete nature of the previous draft genome assemblies.
Second, we characterize the patterns of chromosome rearrangement among species. It has been suggested that teleost karyotypes have remained largely stable since the fish-specific whole genome duplication more than 300 million years ago [34]. This is in contrast to recent reports of chromosomal fusions among closely related cichlid species [35][36][37], and a large number of putative inversions associated with the evolution of sex chromosomes in various species [13,31,38]. Chromosomescale assemblies of cichlids allow us to quantify the levels of synteny among teleost lineages, and the rate of intra-chromosomal rearrangement among cichlid lineages in East Africa. To further explore these distinct patterns of recombination and structural changes in cichlids, we also compared the cichlid genomes to the detailed genomic history of the medaka (Oryzias latipes). Previous studies in medaka have shown that, subsequent to the teleost-specific whole-genome duplication 320-350 million years ago, one subset of medaka chromosomes remained stable while another subset underwent more extensive fusion and translocation events [34,39]. Related comparisons using additional teleost species have shown that the diploid number of chromosomes is relatively stable (24-25 chromosomes pairs in 58% of teleosts) and that when the chromosome number is lower in a particular species or group it is due to chromosome fusion events [40].
Finally, we quantify the abundance and distribution of various transposable element (TEs) families in each genome. Several studies have documented the expansion of particular transposon families in East African cichlids (AFC TEs) [41,42]. Transposable elements may play an important role in shaping genome architecture, particularly the divergence of sex chromosomes. Transposable elements may also be an important source of regulatory mutations [43]. Insertion of an AFC-SINE into a gene promoter is associated with the evolution of a novel egg-spot coloration pattern in haplochromine cichlids [44]. Similar promoter element re-wiring events have been shown to control cichlid opsin visual sensitivity [45]. Since transposons may have been involved in the evolution of many other phenotypes, it is important that these sequences be well-represented in genome assemblies.
Unfortunately, transposable elements are not well-represented in genome assemblies that are based on short Illumina sequence reads. Our previous work has shown that long-read sequencing greatly improves both the length and quantity of TE repeats in cichlid genome assemblies [12,13]. A comparative analysis of transposable elements will improve our understanding of the patterns of transposon insertion and deletion during the radiation of East African cichlids.
In this paper, we integrate the whole genome alignments of these two African cichlid species with the location of centromeres, and the patterns of recombination and linkage disequilibrium (LD). This panoramic view allows us to see how the structure of African cichlid genomes has evolved over the last several million years. It also raises new questions about the evolution of cichlid genomes, particularly about the evolution of sex chromosomes and the adaptive radiation of East African cichlids.

Data Description
To begin this study of chromosome-scale comparisons of African cichlid genomes, we used a new high-density map of O. niloticus [22] to improve the anchoring of our recent genome assembly [13]. We also generated a high-quality M. zebra genome assembly from a single male caught on Mazinzi Reef in Lake Malawi.
Single-molecule PacBio sequencing was performed to 65X coverage and a de novo assembly of the reads was constructed. Two new, and two previously published, genetic maps were used to quality check the assembly, break misassembled contigs, and anchor the sequence contigs to chromosomes.

Anchoring the O. niloticus assembly to a high-density linkage map
The recently assembled O. niloticus genome [13] was re-anchored in this study using a new high-density map that includes 40,190 SNP markers, see Methods and [22]. This new map identified 22 additional misassemblies not identified by previous maps. Table 1 provides a comparison of the previous O_niloticus_UMD1 assembly with this newly anchored O_niloticus_UMD_NMBU assembly. niloticus [15], and is a sex chromosome in the closely related species, O. aureus [46].
The anchored assembly of LG3 is 54.7% repetitive, compared to repeat rate of 37% assembly. This is due to the fact that misassembled contigs that have been broken by the new map are now assigned to a different LG.

Diploid sequence assembly of Metriaclima zebra
We assembled 65X coverage PacBio reads using FALCON/FALCON-unzip [7] to generate the new diploid M. zebra assembly, "M_zebra_UMD2". FALCON first assembles the PacBio reads into primary contigs (p-contigs) and unphased associate contigs (a-contigs) that correspond to alternate alleles. During the FALCON-unzip step, reads are assigned to haplotypes by phasing of heterozygous SNPs and then a final set of p-contigs and phased haplotigs are produced. Table 2 provides the assembly summary statistics for each of these assembly parts. The length of the p-contigs (total size 957Mb), compared to the estimated cichlid genome size of 1Gbp [47], suggests the assembly is relatively complete. To measure the completeness of the haplotigs, the theoretical sizes of heterozygous regions under null expectations of recombination rates and effective population sizes were compared to the size distribution of the haplotigs. Additional File A shows the size distribution of the assembled haplotigs and how it relates to the theoretical recombination rate for several different effective population sizes (Ne). The shape of this haplotig size distribution is closest to the curves representing effective population sizes of 1,000-2,500, which closely matches a recent estimate of the effective population size in M.
zebra [48].Variance in recombination rate across the genome may bias this estimate.  Prior to the final anchoring, these four maps were also used to detect and confirm potential misassemblies in the FALCON contigs. Additional File B lists the FALCON p-contigs for which markers from two or more different LGs aligned, an indicator of potential inter-LG misassembly. Each of these potential misassemblies was further evaluated using alignments of a 40kb Illumina mate-pair library [5], RefSeq gene annotations [51], and repeat annotations (see Methods). In some cases, it was determined that the map marker sequences were repetitive, giving a false signal niloticus), than did the other three maps. This is likely due to the fact that the M. mbenjii x A. koningsi map had both more F2 individuals and more map markers than the other three Lake Malawi cichlid maps, giving it the highest resolution. Anchoring with the other three maps resulted in anchoring of more contigs on LG2, LG9, LG18, LG20 (see Table 3). However, the map that produced the longest anchored LG did not always appear to be the most accurate. To determine this accuracy, each M. zebra LG (anchored with each of the four maps) was aligned to the anchored O. niloticus LGs have slightly different overall sizes than when the assembly was anchored with just a single map (e.g. LG3 changed from 37,717,154bp to 37,309,556bp, Table 2). This is due to the fact that several small contigs are assigned to different LGs by the four different maps.
An anchoring analysis that sequentially chained together the anchored assemblies from all four Lake Malawi cichlid maps resulted in a slightly more anchored assembly (833Mbp total compared to 760Mbp for M_zebra_UMD2).
However, the ordering of contigs in this combined anchored assembly was far less accurate (when aligned to O. niloticus) and so it was not used. There was only a single contig longer than 1Mbp ("000254F") that was not anchored by at least one map.

Structural differences among Lake Malawi cichlid genomes
The process of anchoring the M_zebra_UMD2 assembly to the four maps also allowed us to look for large structural differences among the six species used to generate the maps. Specifically, we looked for p-contigs that were assigned to different LGs in any of the four maps. Table 4 provides the list of the 9 contigs that were assigned to different LGs by at least two maps and which represent putative inter-chromosomal rearrangements.

Localization of centromeric repeats
The location of centromeres is key to understanding structural rearrangements in the karyotype. Figure 1 shows the karyotype of O. niloticus and Metriaclima lombardoi (a species closely related to M. zebra). The O. niloticus SATA consensus repeat [52] is common to the centromeres of many East African cichlid [15], and closely matches the satellite repeats identified in a recent analysis of centromeres across many taxa [53]. modified with permission from [33].
Oreochromis and Metriaclima diverged 17-28 million years ago [54]. Their karyotypes each have 22 chromosome pairs, as do the majority of African cichlids, but O. niloticus has 1 to 3 meta-submetacentric and 19 to 21 subtelo-acrocentric chromosomes according to two previous karyotypes [15,55], whereas M. zebra has six meta-submetacentric and 16 subtelo-acrocentric chromosomes. The chromosomes in Figure 1 have been ordered by type and then by size but only a few have been previously assigned to LGs. BAC and additional marker sequences have been used for specific labeling of chromosomes [36,56].
In order to understand the structural basis for these differences in karyotype, where there is a ~23Mbp rearrangement between M. zebra and O. niloticus. A similar ~20Mbp rearrangement is present on LG20. There is an ~11Mbp rearrangement at one end of LG22 that may be associated with another change in centromere location, although the centromere was not localized on LG22 in either assembly.
Perhaps the most diverged chromosome in terms of size, structure and repeat content is LG3. The karyotype of O. niloticus LG3 is much larger and more repetitive than the corresponding LG3 in Lake Malawi cichlids ( Figure 1 and [15,55]

Variation in recombination rate among species
To compare the rates and patterns of recombination across the chromosomes, LG9, LG16, LG20, and LG22 show higher recombination in males than females.
LG3 and LG23 are both known sex determination chromosomes in tilapias [46,57], and each deviates from the normal recombination patterns. On LG3, the largest chromosome in O. niloticus (Figure 1), there is very low recombination for ~70Mbp.
On LG23 there is a ~28Mbp region of greatly reduced recombination.
Likewise, the markers in the four Lake Malawi genetic recombination maps were aligned to the final M_zebra_UMD2 assembly and their recombination map positions were plotted against physical distance. Figure 3 shows  LG11, LG20 and LG23. All LG maps are provided in Additional File G.

Major structural rearrangements of ancient cichlid chromosomes
We also aligned the O_niloticus_UMD_NMBU assembly to the recently published "HSOK" O. latipes medaka assembly [39]. O. niloticus has 22 chromosome pairs, while the medaka HSOK genome has 24 chromosome pairs. Table 5 is a comparison of cichlid chromosomes and medaka HSOK chromosomes. We identified several large chromosome rearrangements that occurred in a cichlid ancestor. Tilapia LG7, the second largest chromosome (Table 1), is comprised of medaka chromosomes 6 and 12 in their entirety ( Figure 4). This indicates a fusion of these ancestral chromosomes in cichlids relative to medaka, as had been previously suggested [37]. Tilapia LG23, the third largest chromosome (Table 1) LG3 is the largest tilapia chromosome (Table 1)  ( Figure 5).

Repeat landscape of the Metriaclima zebra assembly
The M_zebra_UMD2 assembly is 35% repetitive, similar to the O_niloticus_UMD1 assembly which is 37% repetitive [13]. Figure 6 shows the repeat landscape for the M. zebra and O. niloticus assemblies. While the O. niloticus genome assembly does have a slightly larger total quantity of repeats, the M. zebra genome assembly has a noticeably larger amount of recent TE insertions (sequence divergence < 2%). To further test that this difference was not an artifact of the two assembly processes, we assembled the M. zebra PacBio reads at the same coverage, with the same parameters, using the same software versions and on the same compute cluster as was performed for the O_niloticus_UMD1 assembly. RepeatMasker was subsequently run on this assembly and the pattern of more recent insertion in M.
zebra relative to O. niloticus was even more pronounced (Additional File H).
The M_zebra_UMD2 assembly was annotated using the NCBI RefSeq annotation pipeline for eukaryotic genomes [51]. Table 6 shows the improvement in gene annotation for the new M_zebra_UMD2 assembly relative to the previous version of the M. zebra assembly [5,12].

Anchoring to produce chromosome-scale assemblies
The genetic maps and whole genome alignment comparisons to the O. niloticus assembly were very useful for identifying large and mostly inter-chromosomal misassemblies in the new M. zebra assembly. A 40kb Illumina jumping library was also used in this process to determine if disagreements between the maps and the assembly were true misassemblies, errors in the maps, or structural differences between samples. It is likely that several misassemblies still remain in the final M_zebra_UMD2 anchoring. However, these potential misassemblies are probably only present on smaller contigs where there were not enough markers to detect misassembly events.
Only one contig longer than 1Mbp was not able to be anchored by two or more markers from one of the four Lake Malawi maps. Therefore, any possible remaining misassemblies are likely to involve smaller contigs. The M_zebra_UMD2 anchoring was primarily anchored with the M. mbenjii x A. koningsi map, but for several chromosomes (Table 3), use of other Lake Malawi maps were used to arrive at a more complete anchoring. Therefore, the M_zebra_UMD2 anchoring does not represent a single species. A high-density map of M. zebra would be a useful resource for future studies.

Patterns of continuity in genome assemblies
The longest contigs tend to be anchored in the middle of chromosomes and in regions where there is greater recombination. The ends of chromosomes, typically in regions of lower recombination, tend to have smaller contigs. Perhaps the clearest example of this is on LG13 (Additional File D and Additional File G). On LG7, smaller contigs appear in the middle of the chromosome where there is also a reduction in recombination uncharacteristic of most other chromosomes. Smaller contigs likely correspond to regions with a large fraction of repetitive sequence that lead to a more fragmented assembly. These regions have likely accumulated large TE arrays that are not spanned by even the longest of the reads in our datasets. It is known that TEs accumulate in regions of suppressed recombination [64]. These chromosomal regions with smaller contigs also tend to have more structural rearrangements relative to O. niloticus, which suggests an important role for transposable elements in formation of the rearrangements. This pattern could also be caused by ambiguities in the maps due to there being fewer recombination events and therefore less map resolution in these regions. There are also fewer markers used to anchor smaller contigs that may also contribute to this pattern. Orthogonal mapping technologies that do not rely on recombination, such as optical mapping, may be needed to resolve the structure of these regions in finer detail.

Patterns of recombination in O. niloticus
Several patterns are evident in the recombination maps for O. niloticus. First, though the pattern of recombination is generally similar in males and females, the level of recombination in females is generally higher than in males. The total female map length is 1,641 cM, while the male map is only 1,321 cM. The sex differences in recombination rate of O. niloticus are smaller than observed in salmonids [65][66][67][68], stickleback [69], Japanese flounder [70], and zebrafish [71]. tend to accumulate repetitive transposable elements [64]. These regions are also likely to experience episodes of genetic hitchhiking, which will alter the pattern of genetic differentiation across the genome, as shown in stickleback [69,72]. The extent of LD impacts the probability of fixation of adaptive variants and may affect the probability that a given chromosomal segment can evolve into a new sex chromosome [69]. Interestingly, extensive LD is present on LG3 in O.
niloticus, where a sex determination locus is located in a sister species, O. aureus. One evolutionary interpretation of this finding is that high LD at LG3 predated, and facilitated evolution of, the O. aureus sex chromosome [46]. Alternatively, recombination suppression may have evolved as a result of sex-chromosome-associated evolution at LG3; in this scenario, the lineage leading to O. niloticus may have had, and subsequently lost, the dominant LG3 sex determination allele, but the traces of sex chromosome evolution remains in the genome.

Patterns of recombination in Lake Malawi cichlids
The

Patterns of recombination on sex chromosomes
Sex chromosomes typically accumulate inversions that reduce recombination between the sex determining gene and linked sexually antagonistic alleles [74]. The strain of O. niloticus used to generate the genome assembly contigs [13] has an XY sex determination locus on LG1 [31,75]. The strain of O. niloticus used to generate the map [22] and anchor those contigs to chromosomes has an XY sex determination locus on [76]. We observed reduced recombination in males relative to females adjacent to the sex locus at 34.5Mbp on O. niloticus LG23 (Additional File F). We also observed significant differences in recombination between the sexes on LG7, LG11, LG14 and LG15. An XY sex locus has been identified on LG14 in O.
mossambicus (Gammerdinger, in review), and XY sex loci have been identified on LG7 [29,58] and LG11 (unpublished) in Lake Malawi cichlids. Notably, three of the four Lake Malawi crosses segregate the LG7 XY sex determination system (the M. mbenjii x A. baenschi cross is unknown) (50,76 and unpublishsed results). However, LG7 shows relatively low recombination suppression compared to some other chromosomes. Recombination is reduced in the middle of LG7, centered at ~32Mbp, but this is not associated with the centromere (located at 61Mbp).
While this region is near the LG7 XY sex determination interval, the overall shape of recombination on LG7 may also be the result of the chromosome fusion event that happened in the cichlid ancestor ( Figure 4 and discussed below). As discussed for Oreochromis above, it is unclear whether recombination suppression or sex determination evolved first at this locus. It should also be noted that there is a single marker in this region that appears out of order in the M. zebra x M. mbenjii map, perhaps indicating a structural difference (Additional File D and Figure   3). Further investigation will be needed to determine if other regions of the genome that display large differences in sex-specific recombination are associated with previously identified and/or novel sex determination loci.
The new anchoring provides the most complete assembly to date of LG3, the largest chromosome in the O. niloticus karyotype (Figure 1). This chromosome carries a ZW sex locus in several species of Oreochromis [13,46], but not in the O. niloticus line studied here. The first ~30Mbp of LG3 shows a standard rate and pattern of recombination. However, the remaining ~70Mbp exhibits almost no recombination in either males or females. This region of highly reduced recombination contains the ZW sex locus in other Oreochromis species, and a very high density of repetitive elements.

Conservation of ancient synteny
Synteny is remarkably conserved among even distantly related teleosts [40,78]. Medaka show few inter-chromosomal rearrangements since shortly after the fish-specific whole genome duplication more than 300 MY ago [34]. Our whole genome alignment of tilapia to medaka supports the previously reported findings that the syntenic organization of teleost genomes is largely stable. The ancestral teleost chromosome number was 24, and contraction of diploid chromosome numbers is usually the result of chromosome fusion and/or translocation events [40]. In cichlids, where the most common chromosome number is 22 [16], we find evidence for two large fusion events on LG7 and LG23 and additional translocations on LG15 and LG17.
Cichlid LG7 corresponds to a fusion of medaka chromosomes 6 and 12 (Figure 4), while cichlid LG23 is a fusion of medaka chromosomes 2 and 4 ( Figure 5). Clearly, the variation in diploid number observed in other cichlid species implies that there have been additional interchromosomal rearrangements, but we predict these will be simple fission/fusion events and not the result of scrambling of these ancient syntenic relationships.
The patterns of recombination across these particular chromosomes provide additional evidence of fusion and translocation events (Figure 4 and Figure 5). There are large deviations from the slope of the recombination curves located precisely where these fusion and translocation events have occurred. This also suggests that the pattern of recombination evolves slowly, as these oddly shaped recombination patterns have persisted for at least 15 million years since the divergence of the common ancestor of O. niloticus and the Lake Malawi species.
Interestingly, the odd pattern of recombination on LG3 does not seem to be the result of a chromosome fusion event. This lends support to the hypothesis that LG3 accumulated repetitive sequences after it became a sex chromosome, and that this sex chromosome signature and associated recombination suppression persists in O. niloticus even following loss of the LG3 sex determination system.
There are many examples of large-scale (>2Mbp) intra-chromosomal rearrangements between O. niloticus and Lake Malawi cichlids, as well as rearrangements evident among the Lake Malawi species. In some cases, the anchoring of the M. zebra assembly using each map showed the same large structural rearrangement relative to O. niloticus for each map (see LG2, LG19, LG20 in Additional File D). This suggests that these rearrangements happened prior to the Lake Malawi radiation, or are specific to O. niloticus. In other cases, there are large structural koningsi map (Table 3). Additional work is needed to better define the structure of these chromosomes in each lineage.

Evolution of centromere position and sequence
Long-read sequencing has made it possible to assemble centromere repeats [79][80][81]. A recent study of centromere evolution in medaka provides an example of the role of centromere evolution in speciation [39]. The study showed that the centromere position of many medaka chromosomes has remained unchanged among Oryzias species in both acrocentric and nonacrocentric chromosomes. In other chromosomes, the position of centromeres did change and sometimes these chromosomes underwent major structural rearrangements involving other chromosomes. Alignment of the O_niloticus_UMD_NMBU assembly to these new medaka assemblies showed that cichlids have a different set of conserved and variable chromosomes compared to medaka. Additionally, the medaka study showed that centromere sequence repeats were more conserved in the chromosomes that remained acrocentric than in chromosomes that switched between acro-and non-acrocentric or that were non-acrocentric. Assembly and placement of cichlid centromere repeats in multiple species will allow refinement of previous karyotype studies in the context of whole genome assembly comparisons and will also provide insight into centromere evolution at the sequence level. Are there differences in centromere sequence/rate of evolution between acrocentric and non-acrocentric chromosomes? Are these differences great enough to create meiotic incompatibilities in hybrids? Are the positions of centromeres conserved across many species? This study provides a starting point to answer these questions.

Evolutionary patterns of African cichlid karyotypes
The karyotypes of O. niloticus and M. zebra in Figure 1 show that there have been at least 5 or 6 changes from subtelo-acrocentric to meta-submetacentric chromosomes. The clearest example of this in the new genome assemblies is the 15Mbp rearrangement on LG23 (Figure 2).
Additionally, three similar centromere location changes have occurred, on LG3, LG4 and LG16 (Additional File D). We were able to identify centromere-containing repeats on both the M.
zebra and O. niloticus assemblies in just over half of the chromosomes (LGs 3, 4, 5,7,8,9,11,13,14,16,17,19,23). The ONSATA (Oreochromis niloticus satellite A repeat) and the TZSAT (Tilapia zillii satellite repeat) satellite sequences [82] have not been explicitly shown to be centromeric binding sequences, but rather are highly associated with the centromeres via in situ hybridization [15]. They are also the only two well-characterized centromere-associated repeat sequences in cichlids. It is possible that these ONSATA and TZSAT repeat sequences may be present in other portions of the chromosome, or that some of them have been assembled Two of the chromosomes where we have identified karyotype changes have also been shown to harbor sex-determining loci in African cichlids. The first was the previously mentioned XY sex determination region in O. niloticus on LG23 [76]. On LG3, a WZ sex determination region was previously identified [46] and characterized [13] in the congener species O. aureus, and now reanalyzed on the O_niloticus_UMD_NMBU assembly (Additional File E). There is a very wide region of sex-patterned differentiation from ~40Mbp to 85Mbp on LG3 in O. aureus (13). This region corresponds to the low recombination region in male and female O. niloticus.
The largest chromosome in the O. niloticus karyotype is LG3, and the largest M. zebra chromosome is LG7 (Figure 1). Each of these chromosomes is known to be a sex chromosome in their respective lineage. The assembled and anchored chromosomes support these karyotypes since the largest O. niloticus assembled chromosome is LG3 and the largest M. zebra chromosome is LG7 (Table 1 and Table 3). We suggest that LG3 expanded from the ancestral state in the O. niloticus lineage by the accumulation of a large amount of TEs and segmental duplications, likely while linked to sex determination in a basal Oreochromis [13]. It is not clear if this apparent runaway elongation of LG3 in Oreochromis is due to a sex-determination locus, or if recombination was suppressed first due to some other process. Additional genome assemblies of similar quality in related Oreochromis species should allow for further refinement of the evolutionary history of this large sex chromosome in the Oreochromini.
There is also a large (~28Mbp) region of greatly reduced recombination on LG23 in the O. niloticus map, as well as in each of the four Lake Malawi maps.
LG23 is also the second largest anchored chromosome in the M. zebra assembly and third largest chromosome in the O. niloticus assembly. It is possible that this arm of LG23 is accumulating TEs similar to LG3, but at an earlier stage. There is an XY sex determination locus on LG23 in O. niloticus [57,76], and in at least one species of Lake Victoria cichlid [83], which may be contributing to changes in the size and rate of recombination on this chromosome. Three scenarios may explain these observations: 1) LG23 is an ancient sex chromosome, and though lost in the Malawi lineage, associated recombination suppression remains in Lake Malawi cichlids; 2) The LG23 sex determination locus is indeed segregating in Lake Malawi cichlids but has yet to be identified and described; 3) The recombination pattern on LG23 is not due to sex-chromosome-associated evolution but has been maintained by unknown factors in both lineages.

Recent transposable element expansion in M. zebra
Recent evidence has shown that AFC-SINE indels in cis-regulatory regions of genes are highly associated with innovative cichlid phenotypes such as egg-spots [44]. A deletion that may be TE-mediated is responsible for controlling the expression of the SWS2A opsin [45]. It is likely that other AFC-specific and other TE-mediated mutations have also contributed to the diverse phenotypes of African cichlids. Therefore, it is important that these TE insertion events are well represented in the genome assemblies.
M. zebra has a higher number of recent TE insertions (sequence divergence < 2%) than O. niloticus ( Figure 6). Since the O. niloticus assembly is 43.4Mbp longer than the M. zebra assembly, it is possible that the difference in the number of recent TE insertions is even greater than we have quantified here. Each new version of the M. zebra genome assembly has increased the total assembled length of many TE super-families (Additional File I), including the AFCspecific families.
We present this finding with several caveats. It is possible that the two species have divergent patterns of insertion across the genome. We previously suggested O. niloticus contains larger clusters of repeat arrays that are experiencing recent insertions [13]. These very long arrays do not seem to be present at the same frequency in the M. zebra genome. It is possible that many of the recent TE insertions in O. niloticus were not assembled and remain hidden in these large arrays. Differences in effective population size (Ne) between the two species may also account for differences in rate of TE accumulation as larger populations will able to purge deleterious insertions more efficiently. Additionally, the DNA of the two species was extracted and sequenced at different times. The O. niloticus genome was assembled from 44X coverage of PacBio reads using the P6-C4 chemistry, while the M. zebra genome was assembled from 48.5X coverage using the P6-C4 chemistry and 16.5X coverage using a different PacBio chemistry (P5-C3). Other unknown technical factors may also have contributed to the difference that we have described. Future comparisons of additional samples and species assembled using the same sequencing coverage and assembly software/parameters will help to more accurately quantify the recent TE expansion in African Great Lake cichlids.

Diploid assembly
We present the new M. zebra assembly in both haploid and diploid representations. The majority of current genomics tools assume a haploid reference assembly and all subsequent analyses are based on this haploid representation. The use of multiple diploid assemblies will be required to capture population-level patterns of heterozygosity and complex structural variation.
The genome assemblies reported here should therefore be considered the beginning of a larger effort to properly represent cichlid genomes. A study of Arabidopsis thaliana and Vitis vinifera (Cabernet Sauvignon) showed that the phased diploid assemblies produced by FALCON-unzip improved identification of haplotype structure and heterozygous structural variation [7].
Sequencing and assembly of F1 in cattle has also been shown to recover these complex regions better and may be the way forward for assembly of diploid genomes [84]. Graph genome representations [85,86] have been shown to improve variant calling in complex regions such as the human leukocyte antigen (HLA) [87], major histocompatibility complex (MHC) [88] and centromeres [89]. Additional long-read diploid assemblies will be able to better represent genetic variation, particularly in regions of complex variation which current long read assemblies are beginning to span [80].

Potential implications
This study highlights the evolutionary insights that can be gained using a comparison of high-quality chromosome-scale genome assemblies, genetic recombination maps and cytogenetics across multiple related and, in this case, rapidly evolving species. It further illustrates the need for high-quality, chromosome-scale genome assemblies for answering many basic biological questions. This study illustrates the structural changes that can occur in the genomes of a rapidly evolving clade. It will be interesting to make comparisons to other radiations in the tree of life, both large and small. This study provides a wide-angle view of African cichlid genome history and demonstrates how these high-quality resources can be used for many different types of evolutionary genomic analyses. As additional high-quality cichlid genomes are generated, this study will provide the foundation for comparisons of structural variation, recombination, cytogenetics, and repetitive sequences across the cichlid phylogeny.
Many new questions have been generated here. How do the structural changes of African cichlid genomes compare to other groups? Is the pattern of few inter-chromosomal, but many intrachromosomal differences seen here found in additional Lake Malawi genera as well as other radiations in Lake Tanganyika and Lake Victoria? Are these patterns of recombination observed across the majority of cichlids? Are any deviations from these typical recombination patterns related to specific phenotypic traits or sex chromosome history? How have these chromosomes evolved structurally? We look forward to the new dawn in cichlid genomics.

O. niloticus SNP array map, misassembly detection and new anchoring
Offspring (n=689) and parents from 41 full-sib families belonging to the 20 th , 24 th and 25 th generations of the GST® strain were analyzed using a custom 57K SNP Axiom ® Nile Tilapia Genotyping Array [22]. SNPs classified as "PolyHighRes" or "No-MinorHom" by Axiom Analysis Suite (Affymetrix, Santa Clara, USA), and having a minor-allele frequency ≥ 0.05, and call rate ≥ 0.85 were used in genetic map construction (n= 40,548). Lep-MAP2 [90] was used to order these SNPs into linkage groups in a stepwise process beginning with SNPs being assigned to linkage groups using the 'SeparateChromosomes' command. LOD thresholds were adjusted until 22 linkage groups, which correspond with the O. niloticus karyotype.
Unassigned SNPs were subsequently added to linkage groups using the 'JoinSingles' command and a more relaxed LOD threshold, and ordered within each linkage group using the 'OrderMarkers' command.
Sequence flanking each SNP (2 x 35nt) was used to precisely position 40,190  LD results (r 2 > 0.97) presented in Figure 4, Figure. 5 and Additional file F, were produced in PLINK2 version 1.90b3w [92] using the pedigree described above and SNPpositions given in [92].

PacBio Sequencing of M. zebra
The previous version of the M. zebra assembly, M_zebra_UMD1 [12], included 16.5X PacBio sequencing (25 SMRT cells using the P5-C3 chemistry) on an PacBio RS II machine [12]. An additional library was prepared using the same Qiagen MagAttract HMW DNA extraction and Blue Pippin pulse-field gel electrophoresis size selection. An additional 60 SMRT cells (using the P6-C4 chemistry) were sequenced on the same PacBio RS II at the University of Maryland Genomics Resource Center as the previous 16.5X P5-C3 data. These P6-C4 SMRT cells comprised ~48.5X coverage to bring combined total to ~65X coverage.

M. zebra diploid genome assembly
The 65X PacBio reads were assembled using FALCON-integrate/FALCON_unzip (version 0.4.0) [7]. The following parameters were used for the 'fc_run.py' assembly step: This was followed by the unzip step ('fc_unzip.py') and quiver polishing of the diploid assembly with the 'fc_quiver.py' assembly step.

Polishing of the M. zebra diploid genome assembly
The diploid assembly described above includes a PacBio polishing (quiver) step.
However, there were also Illumina reads available for M. zebra from the first version of the assembly [5]. Trimming and filtering of the raw M. zebra Illumina reads are described for the previous version of the assembly [12]. The markers for each of the four maps were aligned to the intermediate polished assembly using BWA mem [93] (version 0.7.12-r1044) and a separate SAM file was generated.
Chromonomer [96] (version 1.05) was run for each map using these respective SAM files and map information as input. Chromonomer detected contigs in the intermediate assembly that were mapped to multiple linkage groups.
To narrow the location of these identified misassemblies, the Illumina 40kb mate-pair library from the first M. zebra assembly [5] was aligned to the intermediate assembly. The raw PacBio reads were aligned using BLASR [97] (version 1.3.1.127046) with the following parameters: '-minMatch 8 -minPctI-dentity 70 -bestn 1 -nCandidates 10 -maxScore -500 -nproc 40 -noSplitSubreads -sam'. Regions of abnormal coverage in the PacBio read alignments as well as abnormal clone coverage in the 40kb mate-pair were identified for most potential misassemblies identified by the genetic maps. These misassembly regions were manually inspected using these alignments in IGV [98]. Additionally, RefSeq [51] (release 76) M. zebra transcripts were aligned to the intermediate assembly using GMAP [99] (version 2015-07- 23) and RepeatMasker [100] repeat annotations were considered when defining the exact location of a misassembly break.
One additional misassembly was identified during the comparison of linkage maps (next section) and was subsequently broken using the same process as above.

M. zebra assembly anchoring
The same four genetics maps used above for misassembly detection were also used for anchoring the assembly contigs (after breaking) into the final set of linkage groups.
Chromonomer [96] (version 1.05) was run on each of these four genetic maps to anchor the polished and misassembly corrected contigs. BWA mem (version 0.7.12-r1044) was used to create the input SAM file by aligning each respective map marker sequences to these contigs.
Gaps of 100bp were placed between anchored contigs. The final M_zebra_UMD2 anchoring was generated by anchoring LG9 and LG11 with the M. zebra and M. mbenjii map [49], LG20 with the M. mbenjii and A. baenschi map and the remaining 19 LGs with the M. mbenjii and A.
koningsi map. To accomplish this anchoring, the markers for each of those respective maps and LGs were used with Chromonomer as described above.

M. zebra repeat annotation
RepeatModeler [101] (version open-1.0.8) was first used to identify and classify de novo repeat families present in the final anchored assembly. These de novo repeats were combined with the RepBase-derived RepeatMasker libraries [102]. RepeatMasker [100] (version open-4.0.5) was run on the final anchored assembly using NCBI BLAST+ (version 2.3.0+) as the engine ('-e ncbi') and specifying the combined repeat library ('-lib'). The more sensitive slow search mode ('-s') was used. The repeat landscape was generated with the RepeatMasker 'calcDivergenceFromAlign.pl' and 'createRepeatLandscape.pl' utility scripts.

M. zebra BUSCO genome-completeness analysis
BUSCO (version 3.0.2) was run on the M_zebra_UMD2 anchored assembly in the genome mode (-m geno) and compared against the vertebrate BUSCO set ('vertebrata_odb9').

Whole genome alignment of M. zebra to O. niloticus
The final anchored M_zebra_UMD2 assembly was aligned to the O_niloticus_UMD_NMBU assembly using the 'nucmer' program of the MUMmer package [103] (version 3.1). The default nucmer parameters were used and the raw nucmer alignments were filtered using the 'delta-filter' program with the following options: '-o 50 -l 50 -1 -i 10 -u 10'. These filtered alignments were converted to a tab-delimited set of coordinates using the 'show-coords' program with the following options: '-I 10 -L 5000 -l -T -H'. This set of coordinates was then visualized using Ribbon [104] and used to generate the images in Additional File D.

Whole genome alignment of M. zebra to medaka
The HSOK medaka genome assembly version 2.2.4 was downloaded from http://utgenome.org/medaka_v2/#!Assembly.md and corresponds to NCBI accession (GCA_002234695.1). Similar to the M_zebra_UMD2 comparison, O_niloticus_UMD_NMBU was aligned to the medaka HSOK genome with nucmer. The 'delta-filter' settings were adjusted to '-1 -l 50 -i 50 -u 50' to account for the increased divergence between the two more distantly related species. The 'show-coords' settings were also adjusted to '-I 50 -L 50 -l -T -H'.
Alignments were again viewed with Ribbon to identify putative chromosome fusion and translocation events and used to generate the part of the images in Figure 4 and Figure 5.

List of abbreviations
AFC -African cichlid specific repetitive element.
LINE -Long interspersed nuclear element.
NG50 -Shortest contig/scaffold/read/sequence length at 50% of the estimated genome/read set size.
map data for anchoring. MAC and TDK wrote the manuscript. All authors read and approved the manuscript.