Long-read sequence capture elucidates the evolution of the hemoglobin gene clusters in codfishes

Combining high-throughput sequencing with targeted sequence capture has become an attractive tool to study specific genomic regions of interest. Most studies have so far focused on the exome using short-read technology. This approach does not capture intergenic regions needed to reconstruct genomic organization, including regulatory regions and gene synteny. In this study, we demonstrate the power of combining targeted sequence capture with long-read sequencing technology, leading to the successful sequencing and assembling of the two hemoglobin (Hb) gene clusters LA (~100kb) and MN (~200kb) across eight species of codfishes that are separated by up to 70 million years of evolution. The highly continuous assemblies – capturing both intergenic and coding sequences – revealed overall conserved genetic organization and synteny of the Hb genes within this lineage, yet with several, lineage-specific gene duplications. Moreover, for some of the species examined we identified amino acid substitutions at two sites in the Hbb1 gene as well as length polymorphisms in its regulatory region, which has previously been linked to temperature adaptation in Atlantic cod populations. Taken together, our study highlights the efficiency of targeted long-read capture for comparative genomic studies by shedding light on the evolutionary history of the Hb gene family across the highly divergent group of codfishes.


50
The rapid advancement of high-throughput sequencing has over the last decade 51 revolutionized genomic research with the increasing numbers of whole genome 52 resources available for multiple vertebrate species, including the diverse group of 53 teleost fishes (Volff 2005 Our study demonstrates that this approach, combining sequence capture 138 technology with long-read sequencing is a highly efficient and versatile method to 139 investigate specific genomic regions of interest -with respect to micro-synteny, 140 regulatory regions and genetic organization -across distantly related species where 141 genome sequences are lacking. 142 . CC-BY 4.0 International license under a certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available The copyright holder for this preprint (which was not this version posted April 9, 2018. ; https://doi.org/10.1101/297796 doi: bioRxiv preprint

143
Capture and de novo assembly of the target regions 144 The capture probe design (workflow schematically shown in Figure 1)  was variable across the target region for all species (Figure 2 and 3). Because of the 157 skewed distribution of mapping depth, we also calculated median depth, which was, 158 as expected, the highest for Atlantic cod at 242x (Table S1). The median mapping 159 depth was consistently high for most of the other species as well, with the lowest for 160 roughhead grenadier (12x). Both median and average depths for the MN region are 161 persistently higher than for the LA region for all species, with the exception of silvery 162 pout (Table S1). Furthermore, positions with high degree of mapping corresponded 163 to the location of the genes used in the design of the capture probes across all species 164 ( Figure 2 and Figure 3). The percentage of reads mapping to the target region ranged 165 from 25-43%, however, the percentage of the target region covered by reads ranged 166 from 53-100% with five species having more than 90% of the target region covered by 167 reads (Figure 4c and Table S1). 168 To address factors influencing capture success we compared various capture 169 statistics to overall genomic divergence between the Atlantic cod genome and 170 independent WGS data for each species from (Malmstrøm et al. 2017) (Table S1). We 171 found a strong negative correlation between genomic divergence to Atlantic cod and 172 median mapping depth against the target region (r=-0.90, Figure 4a), percent of reads 173 mapped to the target region (r=-0.90, Figure 4b), and percentage of reads mapped to 174 the target region (r=-0.84, Figure 4c). 175 We constructed de novo assemblies with quite consistent assembly statistics 176 across species. Contig N50 ranged from 8055 bp in burbot to 6523 bp in European 177 hake and the total number of contigs varied from 205 in burbot to 455 in marbled 178 . CC-BY 4.0 International license under a certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available The copyright holder for this preprint (which was not this version posted April 9, 2018. ; https://doi.org/10.1101/297796 doi: bioRxiv preprint moray cod. However, there was some variation in the size of the largest contig, 179 which ranged from 79 kbp in Atlantic cod to 30 kbp in marbled moray cod (Table 1). 180 To evaluate whether the assemblies represent the actual target regions we mapped 181 the de novo assemblies for each species to the target region in Atlantic cod, for which 182 the capture design is largely based upon (Figure 2 and 3). As expected, the 183 assemblies corresponded to the regions with high coverage of reads, i.e. the areas of 184 the target region containing genes included in the probe design. 185 Synteny of the Hb gene regions  (Table S2). 205 Target region in the haddock and Atlantic cod genome assemblies certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available The copyright holder for this preprint (which was not this version posted April 9, 2018. ; https://doi.org/10.1101/297796 doi: bioRxiv preprint genes in the genome assembly of haddock is conserved compared to Atlantic cod 215 with the exception of Hbb4 in the MN region, which is absent in haddock ( Figure 5). 216 Repetitive sequences in the in the Hb gene regions 217 Quantifying the amount of repetitive sequences in the target region(s) was only 218 possible for Atlantic cod (gadMor2) and haddock (melAeg), for which high-quality 219 genome assemblies exist. The amount of repetitive sequences in the target region 220 differed between the MN cluster and the LA cluster in Atlantic cod. The MN region 221 (214 kb) contained a total of 10.7% repeated sequences, including 1.0% retro-222 elements, 1.3% transposons, 5.8% simple repeats, and 2.6% of various low complexity 223 and unclassified repeated sequences (Table S3). In comparison, in the smaller LA 224 region (123 kb) the proportion of repeated sequences was twice as high (20.3%), 225 which comprised of 2.8% retro-elements, 2.4% transposons, 13.8% simple repeats, 226 and 1.3% of various low complexity and unclassified repeated sequences. 227 Furthermore, the orthologous target regions in haddock followed the same pattern. 228 The MN region contained 16.3 % repeated sequences, in contrast to 19.8 % found in 229 the LA region (Table S3). , which we suspect may be due to collapse of paralogous Hb genes, as they may 280 have as high as 98% sequence identity (Table S2). 281 Length variation in the bi-directional Hba1-Hbb1 promoter within the certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available The copyright holder for this preprint (which was not this version posted April 9, 2018. ; https://doi.org/10.1101/297796 doi: bioRxiv preprint migratory NEAC population has been shown to harbor the 73 bp longer variant at a 287 higher frequency compared to coastal cod populations (see Figure 6 and (Star et al. 288 2011)). Interestingly, we found relatively long promoters with high sequence 289 similarity to the NEAC indel in haddock and silvery pout. In contrast, cusk 290 displayed a relatively short promoter, however, still 17 bp longer than in NCC 291 ( Figure 6). Furthermore, we found the amino acid positions 55 and 62 in Hbb1, 292 known to be polymorphic in Atlantic cod, to be variable across all codfishes included 293 in this study ( Figure 6). Investigations of the same positions in a number additional 294 codfishes for which we have available gene sequences (Baalsrud et al. 2017), revealed 295 that these positions are highly variable across this linage (Table S2) promoter. Scenario 2: The long promoter is the ancestral state with independent 304 deletions of variable lengths in cusk, silvery pout, haddock and costal/southern 305 Atlantic cod (Hbb1-1). To disentangle this, we would need to obtain promoter 306 sequences from additional gadiform species. Regardless, the short-long promoter 307 polymorphism has been maintained throughout speciation events based on the 308 presence of both variants in Atlantic cod. Moreover, in both scenarios, cusk and 309 Atlantic cod (Hbb1-1) have maintained the ancestral Met55Lys62, while silvery pout, 310 haddock and Atlantic cod (Hbb1-2) have acquired substitutions at these positions due 311 to similar selection pressures or genetic drift. In this regard, it could be mentioned 312 that the NEAC, haddock and silvery pout display migratory behavior (e.g. diurnally 313 feeding movements as well as seasonal spawning migrations) compared to the more 314 stationary cusk and coastal cod (Eschemeyer and Fricke 2017) which could mean that 315 they have a higher O2 demand and are exposed to greater temperature variation, 316 which in turn has selected for a temperature-dependent long promoter.  Assembly success affected by probe design and repeat content 321 In some species, nearly the complete target region is assembled in large contigs 322 containing multiple genes including cusk, whereas in other species such as the more 323 . CC-BY 4.0 International license under a certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available The copyright holder for this preprint (which was not this version posted April 9, 2018. ; https://doi.org/10.1101/297796 doi: bioRxiv preprint distantly related roughhead grenadier, the cluster is more fragmented (Figure 5). In 324 all species, the areas of the target regions that harbor genes of which probes are 325 designed for, as well as any areas containing repeated sequences, have very high 326 depths in comparison to the areas of intergenic sequences (Figure 2 and 3). This 327 poses a challenge for the assembly software, which is based in the assumption of 328 uniform depth over the sequencing data (Miller et al. 2010). 329 Overall, the MN cluster seems to be more successfully assembled than the LA 330 cluster, which is more fragmented ( Furthermore, we were able to cover up to 98% of the target region with >10 reads 357 across species (Table S1)  certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available The copyright holder for this preprint (which was not this version posted April 9, 2018. ; https://doi.org/10.1101/297796 doi: bioRxiv preprint We were able to capture complete genes for species with 70 My divergence 361 time from the Atlantic cod ( Figure 5). As expected, we found that capture success 362 declines with increased sequence divergence between the reference genome of which 363 we chiefly based our capture probes and the genomes of the included codfishes 364 (Figure 4). It has been reported that orthologous exons were successfully captured in 365 highly divergent frog species (with 200 My of separation), nevertheless the capture 366 success greatly decreased with increased evolutionary distance (Hedtke et al. 2013). 367 Similarly, it has been demonstrated that it is possible to capture >97% of orthologous 368 sequences in four species of primates that diverged from humans 40 My ago, using 369 probes entirely based on the human exome (George et al. 2011 Figure 5). We argue, in line with (Schott et al. 2017), that sequence divergence 379 may be a more exact predictor of capture success than evolutionary distance, as the 380 sequence capture process is mainly influenced by the difference between the probe 381 sequence and the target sequence. European hake, marbled moray cod and 382 roughhead grenadier all diverged from cod about 70 My ago, however, the European 383 hake Hb regions was more successfully captured and assembled (Table 1; Figure 2). 384 This could be due to European hake having a lower genome-wide divergence to 385 Atlantic cod than marbled moray cod and roughhead grenadier (809k vs 879k and 386 907k SNPs; Table S1). 387 Finally, it should be mentioned that cusk -which diverged from Atlantic cod 388 39 My ago -was added to the experimental design after the species-specific probes 389 were generated. Thus, the successful capture of cusk was therefore solely based on 390 cross-species target enrichment, and could most likely been further improved if 391 species-specific probes for this species have been included. 392 Concluding remarks and future perspectives 393 Here, we have successfully demonstrated that combining targeted sequence capture 394 with long-read sequencing technology is as an efficient approach to obtain high 395 quality sequence data of a specific genomic region, including both coding and 396 . CC-BY 4.0 International license under a certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available The copyright holder for this preprint (which was not this version posted April 9, 2018. ; https://doi.org/10.1101/297796 doi: bioRxiv preprint noncoding sequences, across evolutionary distant species. We show that genome-397 wide divergence is of importance for capture success across species. Furthermore, the 398 use of long-read sequencing augmented the de novo assembly of regions containing 399 repeated sequences that would otherwise fragment assemblies based on short-read 400 sequencing. This is crucial for capturing complete intergenic sequences that may be 401 highly divergent compared to genic regions even among fairly closely related 402 species. Given the rapid development in sequencing technologies future methods 403 will enable read-through of repeated regions and thus further increase the 404 completeness of assemblies. Moreover, a less stringent hybridization protocol should 405 make it possible to capture sequences across even deeper evolutionary time. In sum, 406 our approach has the potential of enhancing comparative genomic studies of 407 continuous genic and intergenic regions between any eukaryotic species-group 408 where genomic resources are scarce. 409 . CC-BY 4.0 International license under a certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available The copyright holder for this preprint (which was not this version posted April 9, 2018. with an E-value threshold of <0.1 against the genome assembly data of all ten species. 424 In total, 5604 sequences from the chosen species were supplied to NimbleGen 425 probe design. Protein coding genes from the ENSEMBL database were used to define 426 the regions to be tiled in the probe design (Table S5)  Target-area read depth for all the species based on mapping against gadMor2, were 506 calculated using Samtools v1.3.1 ). We calculated both average and 507 median mapping depth against the target region as a whole and for the MN and LA 508 region separately. We also calculated percentage of reads that mapped to the target 509 region, and the percentage of the target regions covered by reads to a minimum 510 depth of 10x. To compare assemblies to the target region we additionally mapped the 511 assemblies to the target region. In order to verify the sequence capture process, 512 sequence data for Atlantic cod and haddock were mapped back to their reference 513 genomes using BWA-mem v0.7.10 (Li and Durbin 2009). The results were visualized 514 using Integrative Genome Viewer (Robinson et al. 2011). 515 . CC-BY 4.0 International license under a certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available The copyright holder for this preprint (which was not this version posted April 9, 2018. ; https://doi.org/10.1101/297796 doi: bioRxiv preprint To obtain an independent measure of divergence between species in the 516 capture experiment we calculated genome wide level of divergence of each species to 517 the reference genome of Atlantic cod using low-coverage whole-genome sequence 518 data from (Malmstrøm et al. 2017). We mapped raw reads to Atlantic cod using 519 BWA-MEM (Li and Durbin 2009) and called SNPs using the Freebayes variant caller 520 (Garrison and Marth 2017). Some species are more closely related to Atlantic cod 521 than others, which could introduce a bias in mapping. To avoid this, we only looked 522 at genomic regions where all species mapped. The number of SNPs was then used as 523 an estimate of genome-wide divergence of each species to Atlantic cod. We also 524 mapped a low-coverage genome of Atlantic cod to the Atlantic cod reference genome 525 as a control. 526 In pursuance of factors explaining capture success we tested for correlations 527 and plotted the relationship between the genome wide level of divergence and the 528 following variables; median mapping depth against the target region (for total, LA 529 and MN, respectively); percentage of reads that mapped to the target region; and the 530 percentage of the target region covered by reads. All tests and plots were done using 531 R version 3.2.5 (Team 2013). 532 Assembly continuity is very often hampered by the presence of repeats, which 533 create gaps. We therefore quantified repeat-content in the target region extracted 534 from gadMor2 and orthologous regions in haddock using Repeatmasker Open 3.0 535 (Smit et al. 2010) for the MN region and the LA region separately. 536 Identifying gene location and synteny 537 In order to identify the genes of interest and their location in the assembly we used 538 local sequence alignment algorithm BLAST v2.4.0 (Altschul et al. 1990) with protein 539 sequences of the genes of interest (Table S5)  Additionally, we estimated sequence identity using EMBOSS Needle (Rice et 547 al. 2000) with default settings, between Hbb gene sequences from (Baalsrud et al. 548 2017) that where missing and present in the de novo assemblies to evaluate similarity 549 (Table S2). 550 . CC-BY 4.0 International license under a certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available The copyright holder for this preprint (which was not this version posted April 9, 2018. certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available The copyright holder for this preprint (which was not this version posted April 9, 2018. ; https://doi.org/10.1101/297796 doi: bioRxiv preprint . CC-BY 4.0 International license under a certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available The copyright holder for this preprint (which was not this version posted April 9, 2018. ; https://doi.org/10.1101/297796 doi: bioRxiv preprint   Genes highlighted in bold are missing from the assemblies in figure 5. certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available The copyright holder for this preprint (which was not this version posted April 9, 2018. ; https://doi.org/10.1101/297796 doi: bioRxiv preprint  . CC-BY 4.0 International license under a certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available The copyright holder for this preprint (which was not this version posted April 9, 2018. ; https://doi.org/10.1101/297796 doi: bioRxiv preprint Figure 2 823 824 . CC-BY 4.0 International license under a certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available The copyright holder for this preprint (which was not this version posted April 9, 2018. ; https://doi.org/10.1101/297796 doi: bioRxiv preprint

825
Figure 3 826 827 . CC-BY 4.0 International license under a certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available The copyright holder for this preprint (which was not this version posted April 9, 2018. ; https://doi.org/10.1101/297796 doi: bioRxiv preprint Figure 4   . CC-BY 4.0 International license under a certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available The copyright holder for this preprint (which was not this version posted April 9, 2018. ; https://doi.org/10.1101/297796 doi: bioRxiv preprint

830
Figure 5 831 832 . CC-BY 4.0 International license under a certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available The copyright holder for this preprint (which was not this version posted April 9, 2018. ; https://doi.org/10.1101/297796 doi: bioRxiv preprint

833
Figure 6 834 835 . CC-BY 4.0 International license under a certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available The copyright holder for this preprint (which was not this version posted April 9, 2018. ; https://doi.org/10.1101/297796 doi: bioRxiv preprint