Full-length Genome of the Ogataea polymorpha strain HU-11 reveals large duplicated segments in subtelomeric regions


 Background: Currently, methylotrophic yeasts (e.g., Pichia pastoris, Hansenula polymorpha, and Candida boindii) are subjects of intense genomics studies in basic research and industrial applications. In the genus Ogataea, most research is focused on three basic O. polymorpha strains—CBS4732, NCYC495, and DL-1. However, these three strains are of independent origin and unclear relationship. As a high-yield engineered O. polymorpha strain, HU-11 can be regarded as identical to CBS4732, because the only difference between them is a 5-bp insertion. Results: In the present study, we have assembled the full-length genome of O. polymorpha HU-11 using high-depth PacBio and Illumina data. Long terminal repeat (LTR) retrotransposons, rDNA, 5' and 3' telomeric, subtelomeric, low complexity and other repeat regions were curated to improve the genome quality. We took advantage of the full-length HU-11 genome sequence for the genome annotation and comparison. Particularly, we determined the exact location of the rDNA genes and LTR retrotransposons in seven chromosomes and detected large duplicated segments in the subtelomeric regions. Three novel findings are: (1) O. polymorpha NCYC495 is so phylogenetically close to CBS4732/HU-11 that the syntenic regions covers nearly 100% of their genomes with a nucleotide identity of 99.5%, while NCYC495 is significantly distinct from DL-1; (2) large segment duplication in subtelomeric regions is the main reason for genome expansion in yeasts; and (3) the duplicated segments in subtelomeric regions may be integrated at telomeric tandem repeats (TRs) through a molecular mechanism, which can be used to develop a simple and highly efficient genome editing system to integrate or cleave large segments into yeast genomes. Conclusions: Our findings provide new opportunities for in-depth understanding of genome evolution in methylotrophic yeasts and lay the foundations for the industrial applications of O. polymorpha HU-11 and CBS4732. The full-length genome of the O. polymorpha strain HU-11 should be included into the NCBI RefSeq database for future studies of O. polymorpha CBS4732, NCYC495, and their derivative strains.

Introduction 64 database is the O. polymorpha DL-1 mitochondrial genome (RefSeq: NC_014805). More high-quality 100 complete genome sequences of Ogataea spp. need to be sequenced to bridge the gap in Ogataea basic 101 research and industrial applications. 102 In the present study, we have assembled the full-length genome of O. polymorpha  depth PacBio and Illumina data and conducted the annotation and analysis to achieve the following research 104 goals: (1) to determine the relationship between CBS4732/HU-11, NCYC495, and DL-1; (2) to provide a 105 high-quality and well-curated reference genome for future studies of O. polymorpha CBS4732 and its 106 derivatives-LR9, RB11,and (3) to discover important genomic features (e.g., high yield) of 107 Ogataea spp. for basic research (e.g., synthetic biology) and industrial applications. 108 109 Results and Discussion 110 Genome sequencing, assembly and annotation 111 One 500 bp and one 10 Kbp DNA library each were prepared using fresh cells of O. polymorpha  11 and sequenced on the Illumina HiSeq X Ten and PacBio Sequel platforms, respectively, to obtain the 113 high-quality genome sequences. Firstly, 18,319,084,791 bp cleaned PacBio DNA-seq data were used to 114 assembled the complete genome, except for the rDNA region (analyzed in further detail in subsequent 115 sections), with an extremely high depth of ~1800X. However, the assembled genome using high-depth 116 PacBio data still contained two types of error in the low complexity ( Figure 1A) and the short tandem repeat 117 (STR) regions ( Figure 1B). Then, 6,628,480,424 bp cleaned Illumina DNA-seq data were used to polish the 118 complete genome of HU-11 to remove the two types of error. However, Illumina DNA-seq data contained 119 errors in the long (>10 copy numbers) poly(GC) regions. Following this, the poly(GC) regions, polished 120 using Illumina DNA-seq data, were curated using PacBio subreads ( Figure 1C). Finally, Long Terminal 121 Repeat (LTR) retrotransposons, rDNA (analysed in more details in following sections), 5' and 3' telomeric, 122 subtelomeric, low complexity, and other repeat regions were curated to obtain the full-length genome using 123 103,345 long (> 20 Kbp) PacBio subreads (Supplementary file 1).

124
Ogataea polymorpha HU-11 has a nuclear genome (Figure 2A) with a summed sequence length of 125 9.1 Mbp and a mitochondrial (mt) genome ( Figure 2B) with a sequence length of 59,496 bp (Table 1). For 126 the data submission to the GenBank database, the sequence of circular mt genome was linearized, starting at 127 the first codon of the ORF3 coding sequence (CDS). Analysis of long PacBio subreads revealed that the 128 telomeric regions at 5' and 3' ends of each chromosome consist of tendam repeats (TRs) [ACCCCGCC] n and 129 [GGCGGGGT] n (n is the copy number) with average lengths of 166 bp and 168 bp (~20 copy numbers), 130 respectively. As these TRs vary in lengths, the 5' and 3' telomeric regions were not included into the seven 131 linear chromosomes of HU-11, which were named as 1 to 7 from the smallest to the largest, respectively 132 ( Table 1). The full-length O. polymorpha HU-11 genome includes the complete sequences of all seven 133 (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. chromosomes, while the 5' and 3' ends of NCYC495 or DL-1 chromosomes have many errors 134 (Supplementary file 1). Therefore, we recommend the inclusion of our genome sequences into the NCBI 135 RefSeq database to facilitate future studies on O. polymorpha CBS4732 and its derivatives-LR9, RB11, and 136 HU-11. 137 The 9. rDNA TRs was estimated as 20 in the HU-11 genome ( Figure 3A), while that was estimated as 6 and 25 in 155 NCYC495 and DL-1, respectively [1]. TRs of HU-11 and NCYC495 rDNAs share a very high nucleotide 156 (nt) sequence identity of 99.5% (8,115/8,152), while those of HU-11 and DL-1 rDNAs share a 157 comparatively low nt sequence identity of 97% (7,530/7,765). As the largest TR region (~162 Kbp) in the 158 HU-11 genome, the only rDNA locus is located in chromosome 2 and the organization of rDNA genes with 159 different copy numbers may be conserved in the Ogataea genus. A TR of rDNAs in Saccharomyces 160 cerevisiae also contains 5S, 18S, 5.8S, and 25S rDNAs as a quadruple, repeating two times on the 161 chromosome 7 of its genome ( Figure 3B). Four other 5S rDNAs are located separately away from the rDNA 162 quadruples in S. cerevisiae. Compared to O. polymorpha or S. cerevisiae with only one rDNA locus, Pichia 163 pastoris GS115 carries several rDNA loci, which are interspersed in three of its four chromosomes. Since 164 the genome of P. pastoris GS115 (Genbank assembly: GCA_001708105) is incomplete and poorly 165 annotated, we only estimated the copy number of its rDNAs as three. In eukaryotes, rDNAs encoding 18S, 166 5.8S, and 28S rRNAs that are transcribed into a single RNA precursor by RNA polymerase I are also 167 organized in TRs. For example, there are approximately 200-600 rDNA copies ( Figure 3C) distributed in 168 (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
The copyright holder for this preprint this version posted April 19, 2021. ; https://doi.org/10.1101/2021.04.17.440260 doi: bioRxiv preprint short arms of the five acrocentric chromosomes (chromosomes 13, 14, 15, 21, and 22) of human. [10]. In 169 prokaryotic cells, 5S, 23S, and 16S rRNA genes are typically organized as a co-transcribed operon. There 170 may be one or more copies of the operon dispersed in the genome and the copy numbers typically range 171 from 1 to 15 in bacteria. For example, there are four copies at two rDNA loci in the chromosome 1 172 (GenBank: CP022603) and 2 (GenBank: CP022604) of Ochrobactrum quorumnocens ( Figure 3D). for their detection, identification, classification, and phylogenetic analysis. 209 In the previous study, 50,000 fragments of 13 Hemiascomycetes species were used to identify LTR 210 retrotransposons. However, the analysis was probably biased as it was based on only random sequences of 211 approximately 1 kb on an average and not the complete genome sequences [11]. In the present study, seven 212 copies of intact LTR retrotransposons (Supplementary file 1) were discovered and accurately positioned in 213 the HU-11 genome (Figure 2A); five of them were located on the sense strands of chromosome 1, 2, 3, 6, 214 and 7 (named LTR-rt1, 2, 3, 6, and 7), while the other two were located on the antisense strands of 215 chromosome 5 and 7 (named LTR-rt5R and 7R). LTR-rt1, 3, and 6 share very high nt identities of 99.9% 216 with each other. LTR-rt1 or 3 contains a single ORF encoding a polyprotein with the same aa sequence, 217 while LTR-rt6 contains a single ORF with a 42-bp insertion (encoding RSSLFDVPCSPTVD), compared to 218 LTR-rt1 and 3. LTR-rt2, 5R, 7, and 7R contain several single nt substitutions and small Insertion/Deletions 219 (InDels), which break the single ORFs into several ORFs. The homologs of LTR-rt2, 3, and 5R in HU-11 220 are present in the NCYC495 genome with very high nt identities of 99.9%. The homologs of LTR-rt1, 7, and 221 7R, however, were not detected in the NCYC495 genome. Further analysis suggested that their absence in 222 the NCYC495 genome was resultant from assembly errors (described in more detail subsequently).

224
Structural variation and large segment duplication

225
Sequence comparison between the NCYC495 and HU-11 genomes showed that they share a nt identity 226 of 99.5% through the whole genomes, including the rDNA region and LTR retrotransposons. However, the 227 DL-1 and HU-11 genomes share a comparatively low nt identity (< 95%) through the whole genomes. 228 Syntenic comparison revealed that NCYC495 is so phylogenetically similar to HU11 that a nearly 100% of 229 their genomes is covered by their syntenic regions, whereas NCYC495 is significantly distinct from DL-1 230 (As shown in preceding sections). Subsequently, the detection of structural variations (SVs) was performed 231 between the NCYC495 and HU11 genomes. Further analysis revealed that all the large deletion or insertions 232 (two types of SVs) are errors in the assembly of NCYC495 genome ( Figure 4C): (1) LTR-rt1, 7, and 7R 233 (lost in NCYC495) should have been included in the NCYC495 genome; (2) two large deletions (lost in 234 NCYC495) are large duplicated segments at 5' and 3' ends of chromosome 2; and (3) a large insertion 235 (acquired by NCYC495) is a large duplicated segment at 3' end of chromosome 6 236 (NW_017264698:1509870-1541475), the upstream and downstream parts of which are identical to segments 237 from chromosome 2 and 3, respectively. In addition, telomeric TRs [GGCGGGGT] n 238 (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
The copyright holder for this preprint this version posted April 19, 2021. ; https://doi.org/10.1101/2021.04.17.440260 doi: bioRxiv preprint (NW_017264698:1509840-1509869) were discovered at the 5' ends of this large insertion, confirming that it 239 resulted from assembly errors. 240 All large duplicated segments were correctly assembled in the subtelomeric regions of the HU-11 241 genome. Human curation was performed to verify their locations using long (> 30 Kb) PacBio subreads. A 242 27,850 bp duplicated segment is present at 3' ends of chromosome 1 and 2 in the HU-11 genome. There are 243 only four mismatches and one 1-bp gap between these two sequences. The homolog of this 27,850 bp 244 duplicated segment is present at the 3' end of chromosome 1 in the NCYC495 genome. The downstream part 245 of it, however, is absent at 3' end of chromosome 2 in the NCYC495 genome (Figure 4C), which 246 corresponds to one 14,090 bp large deletion of NCYC495. An approximately 5,100 bp duplicated segment is 247 present at 5' ends of both chromosomes 2 and 5 in the HU-11 genome. The homolog of this ~5,100 bp 248 duplicated segment is present at 5' end of chromosome 5, but absent at 5' end of chromosome 2 in the 249 NCYC495 genome (Figure 4C), which corresponds to the other large deletion of NCYC495. An 250 approximately 2,500 bp duplicated segment is present at 5' ends of chromosome 7 and downstream of the 251 ~5,100 bp duplicated segment of chromosome 2 in the HU-11 genome. Different from homologs of the 252 27,850-bp and ~5,100 bp duplicated segments, the homolog of this ~2,500 bp duplicated segment is 253 correctly assembled in the NCYC495 genome ( Figure 4C). As an important finding, telomeric TRs 254 [ACCCCGCC] n (n > 2) were discovered at 3' end of the ~5,100 bp duplicated segment located in both 255 chromosomes 2 and 5, and at 3' end of the ~2,500 bp duplicated segment located in chromosome 7; this 256 finding indicated that these duplicated segments were integrated at telomeric TRs. 257 258 259 Conclusions 260 HU-11 is a nutritionally deficient mutant derived from CBS4732 by a 5 bp insertion of "GAAGT" 261 into the 31 st position of the URA3 CDS; this insertion causes a frame-shift mutation of the URA3 CDS, 262 resulting in the loss of the URA3 functions. Since the difference between the genomes of CBS4732 and HU-263 11 is only five nts, CBS4732 and HU-11 can be regarded as identical strains. As the syntenic regions cover 264 nearly 100% of their genomes, NCYC495 is considered to be phylogenetically very close to HU11. This fact 265 clearly reveals the origin and relationship of NCYC495 and CBS4732/HU11. Therefore, the full-length 266 genome of the O. polymorpha strain HU-11 can be used as a reference for future studies of CBS4732, 267 NCYC495, and their derivative strains. For example, we corrected assembly errors in the NCYC495 genome 268 using the full-length genome of HU-11, which facilitated in obtaining the full-length genome of NCYC495. 269 Using the high-quality full-length HU11 genome, large duplicated segments in subtelomeric regions 270 were first comprehensively discovered within a methylotrophic yeast, which was overlooked in the previous 271 studies due to the incomplete genomes. A computational study showed that subtelomeric families are 272 evolving and expanding much faster than that are the families that do not contain subtelomeric genes in the 273 (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
The copyright holder for this preprint this version posted April 19, 2021. ;https://doi.org/10.1101https://doi.org/10. /2021 yeasts. These findings thus, indicate that the extraordinary instability of eukaryotic subtelomeres supports 274 rapid adaptation to novel niches by promoting gene recombination and duplication followed by functional 275 divergence of the alleles [9]. Our results suggest that large segment duplication in subtelomeric regions is 276 the main reason for genome expansion in yeasts, also suggesting that subtelomeric families are evolving and 277 expanding much faster. The discovery of telomeric TRs in these segments indicate that the duplicated 278 segments in subtelomeric regions were integrated at telomeric TRs. However, the underlying molecular 279 mechanism is still unknown. The duplicated segments in different regions are nearly identical; suggesting 280 that this level of divergence was acquired shortly after the duplication. Such unknown molecular mechanism 281 may still function in O. polymorpha and the exogenous genes may get integrated into the subtelomeric 282 regions, which is extremely important for the industrial applications of O. polymorpha. Based on the 283 molecular mechanism, a simple and highly efficient genome editing system can be developed to integrate or 284 cleave large segments at the telomeric TRs. 285 286 Methods and Materials

287
The Ogataea polymorpha strain HU-11 was obtained from Tianjin Hemu Health Biotechnological Co., 288 Ltd. DNA extraction and quality control were performed as described in our previous study [12]. A 500 bp 289 DNA library was constructed as described in our previous study [12] and sequenced on the Illumina HiSeq 290 X Ten platform. A 10 Kb DNA library was constructed and sequenced on the PacBio Sequel platforms, 291 according to the manufacturer's instruction. The software SMRTlink v5.0 (--minLength=50, --292 minReadScore=0.8) was used for PacBio data cleaning and quality control, while the software Fastq_clean 293 v2.0 [13] was used for Illumina data cleaning and quality control. The software MECAT v1.2 was used to 294 assemble the HU-11 draft genome using PacBio data. To polish the HU-11 genome, Illumina data was 295 aligned to the HU-11 draft genome using the software BWA. Then, samtools was used to obtain the BAM 296 and pileup files from the alignment results. A Perl script was used to extract the consensus sequence from 297 the pileup file. This procedure was repeatedly performed to obtain the final genome sequence. The curation 298 of genome and genes was performed using the software IGV. Statistical computation and plotting were 299 performed using the software R v2.15.3 with the Bioconductor packages [14]. 300 Gene annotation and SV detection were performed using the software blast v2.9.0. Syntenic 301 comparison were performed using the CoGe website (https://genomevolution.org/CoGe). 5,155 protein-302 coding genes of NCYC495 were annotated in the previous study. Among them, two genes were removed as 303 they are ITS sequences. 17 genes in the large insertion in the chromosome 6 of NCYC495 were removed as 304 they are assembly errors. Three and six genes in two large deletions (lost in NCYC495) were added, as the 305 losses of them in chromosome 2 resulted from assembly errors. Finally, the corrected 5,145 genes of 306 NCYC495 were aligned to the HU-11 genome using the software blast v2.9.0 to annotate protein-coding 307 genes of HU-11. 308 (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
The copyright holder for this preprint this version posted April 19, 2021. ;https://doi.org/10.1101https://doi.org/10. /2021 Supplementary information 310 Additional file 1: Figure S1. A workflow to generate full-length genome sequence of an RNA virus. 311 Table S1. Collection of ticks.  Errors in PacBio data and Illumina data 388 The errors in the low complexity and short tandem repeat (STR) regions can be corrected during the genome 389 polishment using Illumina data, while the error in the long (>10 copy numbers) poly(GC) regions need be 390 curated after the genome polishment. A. A example to show that the assembled genomes using high-depth 391 PacBio data still contain errors in the low complexity regions. B. A example to show that the assembled 392 genomes using high-depth PacBio data still contain errors in the STR regions. C. A example to show that the 393 genome polishment using Illumina data causes error in the long poly(GC) regions. 394 395 Figure 2 Full-length genome of the Ogataea polymorpha strain HU-11 396 A. The full-length O. polymorpha HU-11 genome includes the complete sequences of seven linear 397 chromosomes, which were named as 1 to 7 from the smallest to the largest. The 5' and 3' telomeric regions 398 were not included. For the data submission to the GenBank database, the genome sequence of circular mitochondrion was 405 linearized, starting at the first codon (indicated by a red arrow) of the ORF3 coding sequence (CDS). color) and 7 (in blue color). NCYC495 is so phylogenetically close to HU11 that the syntenic regions cover 428 (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.