A chromosome-level genome assembly for the paramylon-producing microalga Euglena gracilis

Euglena gracilis (E. gracilis), pivotal in the study of photosynthesis, endosymbiosis, and chloroplast development, is also an industrial microalga for paramylon production. Despite its importance, E. gracilis genome exploration faces challenges due to its intricate nature. In this study, we achieved a chromosome-level de novo assembly (2.37 Gb) using Illumina, PacBio, Bionano, and Hi-C data. The assembly exhibited a contig N50 of 619 Kb and scaffold N50 of 1.12 Mb, indicating superior continuity. Approximately 99.83% of the genome was anchored to 46 chromosomes, revealing structural insights. Repetitive elements constituted 58.84% of the sequences. Functional annotations were assigned to 39,362 proteins, enhancing interpretative power. BUSCO analysis confirmed assembly completeness at 80.39%. This first high-quality E. gracilis genome offers insights for genetics and genomics studies, overcoming previous limitations. The impact extends to academic and industrial research, providing a foundational resource.


Background & Summary
Euglena, a genus of single-celled flagellate eukaryotes, is widely distributed in both freshwater and saltwater environments. Possessing photosynthetic chloroplasts, Euglena exhibits autotrophic characteristics akin to plants while also displaying heterotrophic attributes similar to animals [1][2][3]. E. gracilis, a prominent species within the genus, is a widely used model organism in both academic and industrial research because it is rich in valuable compounds, including pigments, unsaturated fatty acids, vitamins, amino acids, and the distinctive β-1,3-glucan paramylon, an advantageous functional food ingredient [4][5][6]. Notably, recent studies, such as the pilot-scale fermentation by Wu et al. that achieved maximal biomass and paramylon content 7, underscore the industrial potential of E. gracilis.
Despite substantial advances in genetic modification [8][9][10][11][12][13], genetic engineering tools and applications for E. gracilis remain limited, hindered by the absence of a high-quality genome. In 2019, Ebenezer et al. presented an initial genome assembly of E. gracilis (1.43 Gb), which, though informative, proved highly fragmented 14. Consequently, researchers have resorted to omics approaches, including de novo transcriptome assembly 14,15 and proteomic analysis 1,14, to explore physiological and genomic aspects. Nevertheless, a high-quality genome assembly remains a critical prerequisite for advancing genetic engineering and synthetic biology applications in E. gracilis 6.
This study addresses this gap by presenting a chromosome-level genome assembly of E. gracilis that integrates Illumina, PacBio, Bionano, and Hi-C data (Table 1). The resulting assembly spans 2.37 Gb, with a contig N50 of 619 Kb and a scaffold N50 of 1.12 Mb, and shows improved continuity over the previously published assembly (Table 2). In total, 99.83% of the assembly was anchored to 46 chromosomes (Fig. 1a). Repetitive elements constitute 58.84% of the genome and contribute to its complexity. The annotation of 39,362 protein-coding gene models and a gene completeness of 80.39% attest to the quality of this genome. This assembly provides a genetic foundation for both experimental and computational studies of E. gracilis.
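For readers unfamiliar with the continuity metric quoted above, the following is a minimal sketch of how an N50 value is conventionally computed from a list of contig or scaffold lengths. The function name `n50` and the toy lengths are illustrative assumptions; this is not the pipeline used to produce the reported 619 Kb and 1.12 Mb values.

```python
def n50(lengths):
    """Return the N50: the length L such that sequences of length >= L
    together contain at least half of the total assembled bases."""
    if not lengths:
        raise ValueError("need at least one sequence length")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Hypothetical toy assembly (lengths in bp), not data from this study.
toy_contigs = [100, 80, 60, 40, 20]
print(n50(toy_contigs))  # prints 80
```

In the toy case the total is 300 bp, and the two longest contigs (100 + 80) are the first to reach half of it, so the N50 is 80.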

Methods
Sample collection and sequencing. Sample preparation. The E. gracilis Z strain (CCAP 1224/5Z) was purchased from CCAP (Culture Collection of Algae and Protozoa, United Kingdom) and cultivated in our laboratory under autotrophic conditions in CM medium at 26 °C under continuous white light at an intensity of 80 μmol photons·m⁻²·s⁻¹. Cells were harvested during the mid-log phase, rapidly frozen in liquid nitrogen, and stored at −80 °C until sequencing library preparation.
Library preparation and sequencing. High-quality genomic DNA was extracted using the CTAB method. Paired-end libraries were constructed using the NEBNext Ultra II DNA Library Prep Kit for Illumina (NEB, USA) and sequenced on an Illumina HiSeq2500 platform (Illumina, USA), generating a total of 264.2 Gb of Illumina data, or approximately 111-fold coverage of the genome (Table 1). A total of 50 mg of DNA was used to construct the PacBio Sequel sequencing libraries, which were then sequenced to produce raw reads. For Bionano sequencing, high molecular weight DNA with a fragment distribution greater than 150 kb was isolated and nicked using Nb.BssSI (NEB). The nicks were labelled, and the DNA was loaded onto the Saphyr Chip nanochannel array (Bionano Genomics) and imaged using the Saphyr system and associated software (Bionano Genomics) according to the Saphyr System User Guide. The PacBio Sequel and Bionano platforms contributed 377.5 Gb and 306.6 Gb of data, corresponding to coverages of approximately 159X and 129X, respectively (Table 1). Hi-C libraries were prepared following the standard procedure. After the genomic DNA was digested with the restriction enzyme MboI, the sticky ends of the digested fragments were biotinylated, diluted, and then ligated to each other randomly. The prepared sequencing library was sequenced on a NovaSeq platform (Illumina, USA), which yielded a total of 402.3 Gb of data (Table 1). Library preparation and sequencing of the Illumina survey libraries, PacBio Sequel libraries, Bionano libraries, and all transcriptome libraries were performed by Nowbio Biotechnology Company (Yunnan, China). Frasergen Bioinformatics Co., Ltd (Wuhan, China) prepared and sequenced the Hi-C libraries on their sequencing platform.
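The fold-coverage figures above can be checked with simple arithmetic. The sketch below assumes that coverage is the data yield divided by the 2.37 Gb assembly size; the text does not state which genome size was used as the denominator, but this choice reproduces the reported values. The function and variable names are illustrative.

```python
# Assumption: coverage = data yield / genome size, using the 2.37 Gb assembly size.
ASSEMBLY_SIZE_GB = 2.37

# Data yields (Gb) as reported in the text.
yields_gb = {
    "Illumina": 264.2,
    "PacBio Sequel": 377.5,
    "Bionano": 306.6,
}


def fold_coverage(yield_gb, genome_size_gb=ASSEMBLY_SIZE_GB):
    """Return sequencing depth as total bases divided by genome size."""
    return yield_gb / genome_size_gb


for platform, gb in yields_gb.items():
    print(f"{platform}: {fold_coverage(gb):.1f}x")
```

Rounded to whole numbers, the three ratios come out at 111, 159, and 129, matching the approximate coverages quoted for the Illumina, PacBio Sequel, and Bionano data.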
Transcript integration and ab initio prediction. The two distinct sets of transcripts were amalgamated via PASA 35

Fig. 1 Chromosome-level assembly of the E. gracilis genome. (a) Genome landscape of E. gracilis. From the outer ring to the inner ring: chromosome length, gene density, transposable element (TE) density, tandem repeat (TR) density, and GC content, with densities calculated in 1 Mb windows. (b) Estimation based on the 19-mer distribution. (c) Estimation based on flow cytometry. (d) Hi-C interaction heatmap illustrating the genomic interactions within the E. gracilis genome. The colour bar indicates contact density, ranging from red (high) to white (low).
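As an illustration of the windowed densities shown in panel (a), the sketch below counts features per 1 Mb window along one chromosome. It assumes that density means the number of feature start positions falling in each window and that positions are 0-based; the caption does not specify either convention, and the function name and toy coordinates are hypothetical.

```python
WINDOW = 1_000_000  # 1 Mb window, as stated in the caption


def window_density(starts, chrom_length, window=WINDOW):
    """Count features per fixed-size window along one chromosome.

    starts: 0-based start positions of features (e.g. genes or TEs).
    Returns a list with one count per window; the last window may be shorter.
    """
    n_windows = (chrom_length + window - 1) // window  # ceiling division
    counts = [0] * n_windows
    for pos in starts:
        if not 0 <= pos < chrom_length:
            raise ValueError(f"position {pos} lies outside the chromosome")
        counts[pos // window] += 1
    return counts


# Hypothetical 2.5 Mb chromosome with four features, not data from this study.
print(window_density([10, 999_999, 1_000_000, 2_400_000], 2_500_000))  # prints [2, 1, 1]
```

Counting covered bases per window rather than feature starts would be an equally plausible definition and would need interval clipping at window boundaries.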

Table 1.
Statistical analysis of sequencing reads from Illumina, PacBio, Bionano and Hi-C.

Table 2.
Assembly statistics and comparison with previously published data.

Table 3.
Lengths of the assembled chromosomes of the E. gracilis genome.

Table 4.
Classification of the TE sequences in the E. gracilis genome.

(v3.01.02). The ensuing analysis revealed a total of 32,806 genes and 39,362 coding DNA sequences (CDSs) within the E. gracilis genome, with an average CDS length of 1,149 bp and an average of 8 exons per gene.
Functional annotation. For functional annotation, blastp 30 (v2.2.26) was applied to align protein-coding genes against the KEGG 39 database. Gene Ontology 40 (GO) and InterPro 41 annotations were obtained using InterProScan. Functional annotation covered 28.2%, 40.6%, and 50.2% of CDSs in the GO, InterPro, and KEGG databases, respectively, and 57.3% of CDSs were annotated in at least one database.
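The "at least one database" figure is a set union rather than a sum of the per-database percentages, because a CDS annotated in several databases is counted once. A minimal sketch of that bookkeeping follows, assuming annotation hits are available as sets of CDS identifiers; the function name, identifiers, and toy data are hypothetical and do not reproduce the reported percentages.

```python
def annotation_summary(hits_by_db, all_cds_ids):
    """Return per-database coverage (%) and the coverage (%) of CDSs
    annotated in at least one database."""
    total = len(all_cds_ids)
    per_db = {
        db: 100 * len(ids & all_cds_ids) / total
        for db, ids in hits_by_db.items()
    }
    # Union of all hit sets: each CDS is counted once, however many databases hit it.
    annotated_any = set().union(*hits_by_db.values()) & all_cds_ids
    return per_db, 100 * len(annotated_any) / total


# Hypothetical toy data: 10 CDSs and three annotation sources.
all_cds = {f"cds{i}" for i in range(1, 11)}
hits = {
    "GO": {"cds1", "cds2", "cds3"},
    "InterPro": {"cds2", "cds3", "cds4", "cds5"},
    "KEGG": {"cds1", "cds5", "cds6"},
}
per_db, any_db = annotation_summary(hits, all_cds)
print(per_db, any_db)
```

In the toy data the per-database values sum to 100%, yet only 60% of CDSs carry any annotation, which is the same overlap effect that makes the reported 57.3% lower than the sum of the three database coverages.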