Spatial proximity of homologous centromere DNA sequences facilitated karyotype diversity and seeding of evolutionary new centromeres

Aneuploidy is associated with drug resistance in fungal pathogens. In tropical countries, Candida tropicalis is the Candida species most frequently isolated from patients. To facilitate the study of genomic rearrangements in C. tropicalis, we assembled its genome in seven gapless chromosomes by combining next-generation sequencing (NGS) technologies with chromosome conformation capture sequencing (3C-seq). Our 3C-seq data revealed interchromosomal centromeric and telomeric interactions in C. tropicalis, similar to those in the closely related fungal pathogen Candida albicans. By performing a genome-wide synteny analysis between C. tropicalis and C. albicans, we identified 39 interchromosomal synteny breakpoints (ICSBs), which are relics of ancient translocations. The majority of ICSBs map within 100 kb of homogenized inverted repeat-associated (HIR) centromeres (17/39) or to telomere-proximal regions (7/39) in C. tropicalis. Further, we developed a genome assembly of Candida sojae and used the available genome assembly of Candida viswanathii, two species closely related to C. tropicalis, to identify their putative centromeres. In both species, we identified the putative centromeres as HIR-associated loci, syntenic to the centromeres of C. tropicalis. Strikingly, a centromere-specific motif is conserved in these three species. The presence of similar HIR-associated putative centromeres in the early-diverging Candida parapsilosis indicates that the ancestral CUG-Ser1 clade species possessed HIR-associated centromeres. We propose that homology and spatial proximity-aided translocations among the ancestral centromeres and loss of HIR-associated centromere DNA sequences led to the emergence of evolutionary new centromeres (ENCs) on unique DNA sequences. These events might have facilitated karyotype evolution and centromere-type transition in closely related CUG-Ser1 clade species.

invasion-induced rearrangements (MIR) involving more than one template donor has recently been shown to be influenced by physical proximity and homology (6). Therefore, the outcome of the genomic rearrangements is largely dependent on the nature of the spatial genome organization. In yeasts, apicomplexans, and certain plants, centromeres cluster inside the nucleus (7), which may facilitate translocations between two chromosomes through their pericentromeric loci.

The centromere, one of the guardians of genome stability, assembles a large DNA-protein complex to form the kinetochore, which ensures the fidelity of chromosome segregation by correctly attaching every chromosome to the spindle machinery. Paradoxically, this conserved process of centromere function is carried out by highly diverged, species-specific centromere DNA sequences. For example, the length of the functional centromere DNA is ~125 bp in the budding yeast S. cerevisiae (8), but it can be as long as a few megabases in humans (9). The only

assemble the nuclear genome of C. tropicalis in seven chromosomes, we combined short-read Illumina sequencing and long-read single-molecule real-time sequencing (SMRT-seq) approaches together with a high-throughput 3C-seq (simplified Hi-C) experiment (Figure 1A, S1A-D). We started from the publicly available genome assembly of C. tropicalis strain MYA-3404 in 23 nuclear contigs (ASM633v3, Assembly A) (25) and used Illumina sequence reads to scaffold them into 16 contigs to get Assembly B (Figure 1A). Next, we used the SMRT-seq long reads to join these contigs, which resulted in an assembly of 12 contigs (Assembly C, Table S1).

Table S2). Finally, we used the 3C-seq reads to polish the complete genome assembly of C. tropicalis, comprising 14,609,527 bp in seven gapless, telomere-to-telomere chromosomes (Figure 1B). We call this new assembly Assembly2020.
We then named the chromosomes in the order of their length from chromosome 1 (Chr1) through chromosome 6 (Chr6), and the chromosome containing the rDNA locus is named chromosome R (ChrR) (Figure 1C). Accordingly, the centromere on each chromosome is named after the respective chromosome number. Additionally, we assembled the sequence of each chromosome so that the short arm is consistently at the 5′ end. The statistics of the intermediate and final genome assemblies are summarized in Table S3. In Assembly2020, 1278 out of 1315 Ascomycota-specific BUSCO gene sets could be identified, compared to 1255 identified using Assembly A (Table S4, Methods). The inclusion of 23 additional BUSCO gene sets compared to Assembly A suggests improved contiguity and completeness of Assembly2020.

Previously, using centromere-proximal probes, we could distinctly identify five chromosomes (Chr1, Chr2, Chr3, Chr5, Chr6) in chromoblot analysis (22). However, the lengths of Chr4 and ChrR remained unknown. To validate the correct assembly of these two chromosomes (Chr4 and ChrR), we performed chromoblot analysis. We observed that the Chr4 homologs differ in size (Figure S4A). Analysis of the sequence coverage across Chr4 identified an internal duplication of a ~235 kb region, which explains the size difference between the homologs Chr4A and Chr4B (Figure 1C, S4B). We named this duplicated locus DUP4. Subsequently, we scanned the entire genome for the presence of copy number variations (CNVs), which led to the identification of two additional large duplication events: one each on Chr5 (DUP5, ~23 kb) and ChrR (DUPR, ~80 kb) (Figure 1C, S4B). Additionally, we detected a balanced heterozygous translocation between Chr1 and Chr4 (Figure S4C) through analyses of 3C-seq data and the de novo contigs (Figure S4D).
This translocation was validated using chromoblot analysis (Figure S4E) and Illumina and SMRT-seq read mapping (Figure S4F). A chromoblot analysis for ChrR revealed that the actual length of ChrR is ~2.8 Mb, while the assembled length is 2.1 Mb (Figure 1C, S4G). Considering that the length of the rDNA locus is ~700 kb in C. albicans (26), we reason that the difference between the assembled length and the actual length (derived from the chromoblot analysis) of ChrR in C. tropicalis can be attributed to the presence of ~700 kb of repetitive rDNA, which is not completely assembled in Assembly2020.

Next, we phased the diploid genome of C. tropicalis using our SMRT-seq and 3C-seq data to identify the homolog-specific variations (Methods). This analysis produced 16 nuclear contigs, which were colinear with the chromosomes of Assembly2020, except for the previously validated heterozygous translocation between Chr1 and Chr4 (Figure S4H). To characterize the sequence variations in the diploid genome of C. tropicalis, we identified the single nucleotide polymorphisms (SNPs) and insertions-deletions (indels) (Methods). Intriguingly, we detected a long chromosomal region depleted of SNPs and indels on the left arm of ChrR (Figure 1D). We refer to this region with loss of heterozygosity on ChrR as LOHR. Strikingly, we found parts of the syntenic regions of LOHR to be SNP and indel depleted in the Candida sojae strain NCYC-2607, a species closely related to C. tropicalis, as well as in the C. albicans reference strain SC5314 (Figure S5). We also identified the genome-wide distribution of transposons and simple repeats but could not detect preferential enrichment of these sequence elements at any specific genomic location in C. tropicalis (Figure 1D).
Together, we identified multiple long CNVs, long-tract LOH, and heterozygous translocation events in the diploid genome of C. tropicalis. Possible implications of these events for virulence and drug resistance of this successful human fungal pathogen need to be explored.

Indirect immunofluorescence imaging of a C. tropicalis strain expressing Protein A-tagged Cse4 suggests the clustering of the centromere-kinetochore complex, which is localized at the periphery of the DAPI-stained nuclear DNA mass as a single punctum (Figure 2A-B). We re-aligned the 3C-seq data to Assembly2020 to generate the genome-wide chromatin contact map of C. tropicalis. The resultant heatmap shows high signal intensity along the diagonal, indicating that intra-chromosomal interactions are generally stronger than interchromosomal interactions (Figure 2C). However, the most striking feature of the heatmap is the presence of conspicuous puncta in the interchromosomal areas, which signify strong spatial proximity between centromeres (Figure 2C-D). Aggregate signal analysis further confirms the enrichment of centromere-centromere interactions (Figure 2E). All these observations suggest the clustering of centromeres and conservation of the Rabl configuration in C. tropicalis, a well-known feature of higher-order genome organization in yeasts (27-29). Strikingly, we also noted enrichment of interactions between telomeres of different chromosomes (Figure 2E). These interchromosomal telomeric interactions were significantly greater than the average interchromosomal interaction (Mann-Whitney U test, P value = 6.547 × 10^-7) (Figure S6A). We also observed enhanced cis interaction between the two telomeres of an individual chromosome compared to the average intra-chromosomal long-range (≥100 kb) interaction (Mann-Whitney U test, P value = 1.091 × 10^-9) (Figure S6B).
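The comparison of telomere-telomere contacts against the background reported above relies on a Mann-Whitney U test. The following is a minimal sketch of that test in plain Python, using the normal approximation without tie or continuity correction; it is not the authors' analysis code, and the function names and the toy contact values are assumptions made purely for illustration.

```python
import math

def average_ranks(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def mann_whitney_u(x, y):
    """Two-sided Mann-Whitney U test (normal approximation, no corrections)."""
    n1, n2 = len(x), len(y)
    ranks = average_ranks(list(x) + list(y))
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2  # U statistic for sample x
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u1 - mean_u) / sd_u
    p_two_sided = math.erfc(abs(z) / math.sqrt(2))
    return u1, p_two_sided

# Hypothetical normalized contact values, NOT data from the study.
telomere_pairs = [9, 11, 12, 14, 15]       # telomere-telomere bins (toy)
background = [1, 2, 3, 4, 5, 6, 7, 8]      # other interchromosomal bins (toy)
u_stat, p_value = mann_whitney_u(telomere_pairs, background)
```

For real contact maps with many tied bins, a library implementation that applies tie correction would be the safer choice; the sketch only shows what the statistic compares.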
Using the chromosome-level Assembly2020 of C. tropicalis and the publicly available chromosome-level assembly of the C. albicans reference genome of strain SC5314 (ASM18296v3), we performed a detailed genome-wide synteny analysis following four different approaches. We used two published analysis tools, Symap (32) and Satsuma synteny (33), and a custom approach to identify the ICSBs based on the synteny of the conserved orthologs (Figure 3A). Next, we compared and validated the results obtained from our custom approach with another published tool, Synchro (Figure S7A-B) (34). All four methods detected that six out of seven centromeres of C. tropicalis (all except CEN6) are located proximal to multiple ICSBs, which are rare at the chromosomal arms (Figure 3A). The ORF-level synteny analysis detected four out of seven centromeres (CEN2, CEN3, CEN5, CENR) in C. tropicalis to be located precisely at ICSBs, while multiple ICSBs are located within ~100 kb of the other two centromeres (Figure 3B). However, no ICSB could be identified on Chr6. Additionally, we found a convergence of orthoblocks from as many as four different chromosomes within 100 kb of centromeres (Figure 3B).
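The idea of an ortholog-based ICSB can be illustrated with a short sketch. It assumes a simplified definition, namely that a breakpoint lies between two adjacent C. tropicalis orthologs whose C. albicans counterparts sit on different chromosomes. This is not the custom pipeline used in the study, which is not described in this section; the function names, the 100 kb default window, and the toy coordinates are illustrative assumptions, and no filtering of singleton orthologs is attempted.

```python
from typing import List, Tuple

# (position in bp on a C. tropicalis chromosome, C. albicans chromosome of the ortholog)
Ortholog = Tuple[int, str]
Breakpoint = Tuple[int, int, str, str]

def find_icsbs(orthologs: List[Ortholog]) -> List[Breakpoint]:
    """Report (left_pos, right_pos, left_chr, right_chr) wherever two
    adjacent orthologs map to different C. albicans chromosomes."""
    ordered = sorted(orthologs)
    breaks = []
    for (pos_a, chr_a), (pos_b, chr_b) in zip(ordered, ordered[1:]):
        if chr_a != chr_b:
            breaks.append((pos_a, pos_b, chr_a, chr_b))
    return breaks

def count_near(breaks: List[Breakpoint], centromere_pos: int,
               window: int = 100_000) -> int:
    """Count breakpoints with either flanking ortholog within `window` bp
    of the centromere position."""
    return sum(
        1
        for left, right, _, _ in breaks
        if min(abs(left - centromere_pos), abs(right - centromere_pos)) <= window
    )

# Hypothetical ortholog map for one chromosome, NOT data from the study.
toy_orthologs = [
    (50_000, "Ca1"), (120_000, "Ca1"), (480_000, "Ca1"),
    (510_000, "Ca4"), (560_000, "Ca2"), (900_000, "Ca2"),
]
toy_breaks = find_icsbs(toy_orthologs)
near_centromere = count_near(toy_breaks, centromere_pos=500_000)
```

In this toy chromosome, both breakpoints fall next to the assumed centromere position, mirroring the pattern of orthoblocks from several chromosomes converging near a centromere.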

To correlate the frequency of translocations with the spatial genome organization, we quantified ICSB density (the number of ICSBs per 100 kb of the genome) at different zones across the chromosome for all chromosomes except Chr6 (Figure 3C). Since no ICSBs were mapped on Chr6, it was excluded from the analysis. This analysis revealed that the ICSB density is highest at the centromere-proximal zones for all six chromosomes but drops sharply at the chromosomal arms. However, the ICSB density near the telomere-proximal zone for Chr2, Chr4, and ChrR showed an increase over the chromosomal arms, albeit at a lower magnitude than at the centromeres. We also compared the length of the orthoblocks across three different genomic zones: the centromere-proximal (within 300 kb from the centromere on both sides), centromere-distal (beyond 300 kb from the centromere to 200 kb from the telomeres), and telomere-proximal (within 200 kb from the telomeres) zones. This analysis revealed that the orthoblocks located proximal to centromeres and telomeres are significantly shorter than the orthoblocks located at the centromere-distal zone (Figure 3D).

the HIR-associated centromeres in C. tropicalis (Figure 4A). To study the presence or absence of HIRs in C. sojae, a sister species of C. tropicalis (24), we assembled its genome into 42 contigs, including seven chromosome-length contigs (Methods). Using this assembly, we identified seven putative centromeres in C. sojae as intergenic and HIR-associated loci syntenic to the centromeres in C. tropicalis (Figure S9A-B, D). Each of these seven centromeres in C. sojae consists of a ~2 kb long central core (CC) region flanked by 3-12 kb long inverted repeats (Table S5).
Using a similar approach, we identified six HIR-associated centromeres in the publicly available genome assembly (ASM332773v1) of Candida viswanathii, another species closely related to C. tropicalis (Figure S9C, E, Table S6) (35). A dot-plot analysis found extensive homology shared across the IRs but not among the CC elements (Figure 4A) of the HIR-associated centromeres present in C. tropicalis and the putative centromeres of C. sojae and C. viswanathii (Table S7). Moreover, we detected extensive structural conservation in CEN DNA elements, especially among IRs within an individual species (Figure S10A). This structural feature of IRs is also significantly conserved across the three species with HIR-associated centromeres, C. tropicalis, C. sojae, and C. viswanathii (Figure S10B).

found to be specifically concentrated on the IRs but not at the mid-core region in HIR-associated centromeres present in C. tropicalis (Figure 4E) and at the putative centromeres in C. sojae and C. viswanathii (Figure S10D). Additionally, we detected that the IR-motif is oriented away from the central core of the centromeres in C. tropicalis (Figure S10E), and this pattern is conserved in the other two species as well (Figure S10F). The conserved structure and organization of the IR-motif sequences in the HIR-associated centromeres of the three Candida species suggest an inter-species conserved function of the IR DNA sequence, although the clusters of IR-motifs are located at a variable distance from the CC in these three species (Figure S10G). The importance of this 12-bp conserved motif for centromere function is yet to be

In this study, we improved the current genome assembly of the human fungal pathogen C. tropicalis by employing SMRT-seq, 3C-seq, and CHEF-chromoblot experiments, and present Assembly2020, the first chromosome-level gapless genome assembly of this organism. We identified three long duplication events in its genome, phased the diploid genome of C. tropicalis, and mapped the SNPs and indels. We constructed genome-wide contact maps and identified centromere-centromere as well as telomere-telomere spatial interactions. A comparative genome analysis between C. albicans and C. tropicalis revealed that six out of seven centromeres of C. tropicalis are mapped precisely at or proximal to

identified in this study that is present in multiple copies on the centromeric IR sequences across these three species. Some centromeres in C. albicans carry chromosome-specific IRs, which lack IR-motifs. In addition, CaCEN5 IRs could not functionally complement the centromere function in C. tropicalis for the de novo recruitment of CENP-A(Cse4). This indicates a possible role of the conserved IR-motifs in species-specific centromere function (22). Therefore, the loss of HIR-associated centromeres in C. albicans, whose centromeres are only epigenetically propagated (23), shows how the ability to establish kinetochore assembly de novo in an ancestral lineage can be lost in a derived lineage. However, the details of the mechanism through which IR-motifs may regulate centromere identity remain to be explored.
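The relationship between inverted repeats and motif orientation discussed above can be made concrete with a small sketch. It assumes that the two IR arms are exact reverse complements of each other and that "orientation" simply means the strand on which a motif occurs. The 6-bp motif and the toy locus below are hypothetical; they are not the 12-bp IR-motif or any real centromere sequence from the study.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]

def motif_hits(seq: str, motif: str):
    """Return (start, strand) for every exact motif occurrence on either strand."""
    rc = reverse_complement(motif)
    hits = []
    for i in range(len(seq) - len(motif) + 1):
        window = seq[i:i + len(motif)]
        if window == motif:
            hits.append((i, "+"))
        if window == rc and rc != motif:  # skip palindromic double counting
            hits.append((i, "-"))
    return hits

# Toy locus: left IR + central core + right IR (all sequences hypothetical).
toy_motif = "GACCAG"
right_ir = "AAGACCAGTT"
left_ir = reverse_complement(right_ir)   # arms form a perfect inverted repeat
core = "ACACACAC"
toy_locus = left_ir + core + right_ir
toy_hits = motif_hits(toy_locus, toy_motif)
```

Here the motif lies on the minus strand in the left arm and on the plus strand in the right arm, so under the strand-based convention assumed in this sketch both copies point away from the core, which is the kind of mirrored arrangement an inverted repeat produces by construction.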

Loss of HIR-associated centromeres during inter-centromeric translocations or MIR must have been catastrophic for the cell, and the survivor needed to activate another centromere at an alternative locus. How is such a location determined? Artificial removal of a native centromere in C. albicans leads to the activation of a neocentromere (55, 56), which then becomes part of the centromere cluster (27). This evidence supports the existence of a spatial determinant, known as the CENP-A cloud or CENP-A-rich zone (55, 57), that favors the formation of a neocentromere at loci proximal to the native centromere (55, 58). We found that the unique and different centromeres of C. albicans are located proximal to ORFs that are also proximal to the centromeres in C. tropicalis. This observation indicates that the formation of the new centromeres in C. albicans may have been influenced by spatial proximity to the ancestral CEN cluster. However, the new centromeres of C. albicans are formed on loci with completely unique and different DNA sequences. For these reasons, it may be logical to consider the centromeres of C. albicans as ENCs (Figure 5B). Intriguingly, even after the catastrophic chromosomal rearrangements, the ENCs in C. albicans remain clustered, similar to those of C. tropicalis (Figure 5C). This observation identifies spatial clustering of centromeres as a matter of cardinal importance for fungal genome organization.

Materials and Methods
The strains, primers, and plasmids used in this study are listed in SI Appendix, Tables S8, S9,

respectively are marked and shown using black mesh. The CNVs for which the correct homolog-wise distribution of the duplicated copy is unknown are marked with asterisks. Homolog-specific differences for Chr1 and Chr4, which arose from an exchange of chromosomal parts in a balanced heterozygous translocation between Chr1B and Chr4B, are highlighted with black borders (also see Figure S4C). C. An ethidium bromide (EtBr)-stained CHEF gel picture where the chromosomes of the C.

between CtChr3 and CtChr4, as representative chromosomes, which can be mapped proximal to the centromere on CaChrR (as shown in Figure 3F). C. A cartoon representing the conservation of spatial genomic organization during inter-centromeric translocation that mediated centromere-type transition.

Figure 5