A comparative analysis of planarian genomes reveals regulatory conservation in the face of rapid structural divergence

The planarian Schmidtea mediterranea is being studied as a model species for regeneration, but the assembly of planarian genomes remains challenging. Here, we report a high-quality haplotype-phased, chromosome-scale genome assembly of the sexual S2 strain of S. mediterranea and high-quality chromosome-scale assemblies of its three close relatives, S. polychroa, S. nova, and S. lugubris. Using hybrid gene annotations and optimized ATAC-seq and ChIP-seq protocols for regulatory element annotation, we provide valuable genome resources for the planarian research community and a first comparative perspective on planarian genome evolution. Our analyses reveal substantial divergence in protein-coding sequences and regulatory regions but considerable conservation within promoter and enhancer annotations. We also find frequent retrotransposon-associated chromosomal inversions and interchromosomal translocations within the genus Schmidtea and, remarkably, independent and nearly complete losses of ancestral metazoan synteny in Schmidtea and two other flatworm groups. Overall, our results suggest that platyhelminth genomes can evolve without syntenic constraints.


Assembly quality assessment
Table 1 Summary statistics for genome assemblies of the S. mediterranea sexual strain.dd_Smes_g4: Our previous diploid consensus assembly [1]; schMedS2: chromosome-scale scaffolding of most contigs of the dd_Smes_g4 assembly [2], schMedS3h1 and schMedS3h2: haplotype phased assemblies (this study).Figure 1 Representative 'cn-spectra' plots from a merqury k-mer analysis of the short-read dataset SRR959588.Density plots show the coverage of distinct k-mers and their copy number variation.A: schMedS3h1; B: schMedS3h2; C: schMedS3BH.Read-only k-mers with low multiplicity (peak on the left of each plot) represent errors in the short-read data.Peaks with high multiplicity represent genuine k-mers present in the assemblies.A and B show a shoulder of read-only peaks with higher than error multiplicity representing haplotype-specific k-mers absent in the respective assembly.Panel C shows that these k-mers are well represented in schMedS3BH, leading to the increase in completeness of the diploid assembly.
Figure 2 BUSCO scores of our previous assembly dd_Smes_g4, the schMedS2 assembly, and the two haplotypes of the new phased assembly, demonstrating increased completeness in the new assembly.Table 3 Genomic location of ATAC-seq peaks depending on their CHIP-intersect classification.

ATAC-seq peak classification
Annotations are based on the closest gene.Application of a Chi-square test showed that the annotations were not equally distributed across the categories, with putative promoter annotations enriched within 1 kb of the TSS, putative enhancers in introns, exons and distal intergenic regions and uncharacterized peaks in introns and distal intergenic regions.The broad match with expectations in the former two cases confirms the utility of our approach.A two-sided Z-score-based post-hoc tests with a Bonferroni correction for multiple testing to determine the enrichment of element types.
Corrected p-values < 10 -16 are indicated accordingly.Right side: Same plot but split by species.

Orthofinder based synteny analysis
As described in the main text, we used single-copy orthologs derived using Orthofinder to test for synteny conservation using dotplots and Chi-square tests.The results for the comparison of S.
mediterranea haplotype 1 against the other Schmidtea and the parasites as well as Schistosoma mansoni against the same are shown below (Table 9-10).Additionally, we compared Schmidtea and the parasites to Amphioxus (Branchiostoma lancelatum), a representative of ancestral vertebrate linkage groups.Although this analysis recapitulated the previously described conservation of synteny between the cnidarian Nematostella vectensis and Amphioxus [4], synteny conservation was lost in comparison with members of the genus Schmidtea (Table 11).Interestingly, macrosynteny between Amphioxus and S. mansoni and the other parasites was degraded with all effect sizes <= 0.22 (Table 11), but some synteny appeared conserved for specific chromosomes (e.g., Amphioxus chromosomes 3 and 4).
Table 9 Test for macrosynteny using a chi-square test of the distribution of one-to-one orthologs between the target and query genome assemblies.Given is the number of orthologs, the degrees of freedom, the chi-squared statistic, the effect size Cramer's V which is 0 for a random distribution and 1 for a perfect correlation and the p-value calculated assuming the chi-square distribution, or 100,000 permutations of the data.P-values < 10 -16 for the chi-square test are noted as such.For the permutation test, p-values below its precision of 10  10 Test for macrosynteny using a chi-square test of the distribution of one-to-one orthologs between the target and query genome assemblies.Given is the number of orthologs, the degrees of freedom, the chi-squared statistic, the effect size Cramer's V which is 0 for a random distribution and 1 for a perfect correlation and the p-value calculated assuming the chi-square distribution, or 100,000 permutations of the data.P-values < 10 -16 for the chi-square test are noted as such.For the permutation test, p-values below its precision of 10 Table 11 Test for macrosynteny using a chi-square test of the distribution of one-to-one orthologs between the target and query genome assemblies.Given is the number of orthologs, the degrees of freedom, the chi-squared statistic, the effect size Cramer's V which is 0 for a random distribution and 1 for a perfect correlation and the p-value calculated assuming the chi-square distribution, or 100,000 permutations of the data.P-values < 10 -16 for the chi-square test are noted as such.For the permutation test, p-values below its precision of 10

ODP based synteny analysis
To evaluate the synteny among the investigated genomes, we utilized the ODP tool.In the comparisons among the parasites, there was a marked conservation of synteny, as evidenced by significant p-values from Fisher's exact tests (Table 12).This conservation was evident in the pronounced clustering of orthologs by chromosomes, and even within chromosomal segments.Similarly, within Schmidtea, we observed substantial synteny conservation, with by significant p-values from Fisher's exact tests (Table 13) and the distinct clustering of orthologs by chromosome and gene order.
However, in pairwise comparisons between Schmidtea and the parasites, there was no evident clustering of orthologs by chromosomes (Table 14).The results of Fisher's exact tests indicated no association for cloSin.For taeMul, an association was observed between LG6 and chromosome 2 of both schNov1 and schLug1, though this was based on a limited gene set and was accompanied by a high p-value (Table 14).For hymMic, a more pronounced association was found: scaffold HMN_01_pilon aligned with a chromosome in schMedS3h1, schNov1, and schPol2, while HMN_06_pilon was associated with chr4 in schMedS3h1.Despite these associations involving a greater number of genes, the p-values remained elevated, consistent with the observed lack of clear clustering in the dotplot.For schMan a similar picture emerges, but with statistically significant associations of at least one chromosome with each Schmidtea species (Table 14).Lastly, there was no significant association of any Machtxv2 scaffold with a chromosome of any other species included in this study.

MALG conservation
Here we describe the conservation of the Metazoan ancestral linkage groups (MALG) defined by [5].
Following their notation, we use ⊗ to denote the fusion and mixing of two MALG and describe the equivalence of chromosomes.Note, that none of the MALG were conserved in S. mediterranea, S. polychroa, S. nova, S. lugubris (Figure 17A-D), or Macrostomum hystrix (Figure 14I) and therefore they are not included in this description.Table 15 gives the conserved MALGs for each chromosome based on the ODP results visualized in Figure 17E-H.

Part I. Transcriptome assembly pipeline
This section describes the genome annotation pipeline developed for the de novo annotation of all Schmidtea genomes reported in this study.In general, it comprises an evidence-based (i.e. based on RNA sequencing data), genome-guided approach, which is summarised in the following workflow.A detailed block diagram of the pipeline is included at the end of Part I.

Overall workflow
5.1 Nanopore read pre-processing

Basecalling
ONT reads are obtained by basecalling fast5 files using guppy (6.2.1): The following state-of-the art models are used for basecalling: • Super High Accuracy mode for cDNA reads. MODEL="dna_r9.4.1_450bps_sup_prom.cfg" • High Accuracy mode for direct RNA reads. MODEL="rna_r9.4.1_70bps_hac_prom.cfg" Only the pass reads of each run (i.e.Q-score>10 for cDNA and >7 for direct RNA) are retained and merged into a single fastq file.

Read trimming and filtering
Oriented reads > 150 nt are retained, then poly-A tails are trimmed and low complexity reads While strongly ameliorated by the latest library preparation strategies, it is still possible that a fraction of the mapping reads arise from an internal priming event (i.e.oligo-dT cDNA primer annealing to an A-rich sequence within the transcript).Such events are discarded by filtering according to the genomic content around the 3' end of the read using seqkit (v0.15.0): samtools sort -@ $THREADS -o $SPECIES"_np.sort.bam"$SPECIES"_np.bam"samtools index $SPECIES"_np.sort.bam"5.2 Illumina short read pre-processing

PCR duplicate removal
In order to maximise the information content of short reads and avoid redundant mappings, we remove putative PCR duplicates (i.e.identical reads) using BBmap.

Read merging by library type
Most of the datasets include: • ISR (Inward paired-end, stranded:first-strand)

Short read genome mapping
Short reads are mapped to the genome using HISAT2 (2.2.1), after generating indexes: hisat2-build -p $THREADS $SPECIES.genome.fa$SPECIES.HISAT2_index Genomic alignment is performed by allowing a maximum intron length of 100kb (identical to long reads), and only single mapping reads are retained in the final bam files.Finally, alignment files from all library types of the same species are merged into a single file, sorted and indexed: The former approach is better suited for reconstructing an overall robust (albeit slightly less sensitive) exon chaining, the latter for fine-grained gene structures at the expenses of a slightly higher proportion of artifacts (chimaeras, spurious transcripts, etc.).The threshold parameters

Model refinement
The low confidence set of annotation is refined using FLAIR (v1.5): this program is able to remove spurious antisenses, to trim chimaeras, refine the splice junctions and filter out potential artifacts retaining transcripts with a higher confidence.All these steps are performed by relying on ONT read evidence only: This time, as a last argument, we specify with the suffix "c" or "n" whether the transcript belongs to the coding or non-coding set, respectively (e.g.h1Smedc, Slugn).

Chimaeric transcript filtering
While this hybrid transcriptome assembly approach results in a lower proportion of transcript fusions, with the subsequent step we attempt to fix different kinds of remaining errors: • Chimaeric Loci i.e.Bundle of physically different, but overlapping transcripts that are mistakenly assigned to the same Gene.This is a common event.
• Chimaeric Transcripts Transcripts that encode two distinct ORFs but are assembled as a single fusion transcript.

Mix of both
The script ChimaeraDetect.R works as follow: • For each locus: -Create genomic ranges with the min and max coordinates of all the predicted open reading frames; -Reduce these coordinates by locus, creating range_blocks ; -If the locus has 0 or 1 range_block, this is a normal locus and it can be discarded from further analysis; • For each chimaeric locus: -Map each range_block(s) to its overlapping transcript(s); The resulting $SPECIES.ENCODE_hybrid_annot.tmp.gff3annotation is then parsed using the IDpatch.R script.This command polishes the GFF3 file by resolving format inconsistencies (e.g.adds explicit gene entries, renames the file source, fixes the type field layout): Rscript scripts/IDpatch.R $SPECIES.ENCODE_hybrid_annot.tmp.gff3\ $SPECIES.ENCODE_hybrid_annot.gff3 The following figure represents a description the overall naming scheme used in this annotation:

Gene naming scheme
In conclusion the file is sorted: This results in the final annotation file ENCODE_hybrid_annot.sort.gff3 .

Figure 3 Figure 4
Figure 3 Dotplot alignment between schMedS3h1 from this study and schMedS2 from [2].Dots represent alignments of at least 1000bp and the color indicates alignment quality.

Figure 5
Figure 5 Details on the 19 transcript sequences deposited in NCBI that were not present in the S3BH annotation.A Mapping characteristic of the missing transcripts.B Details on whether the 14 uniquely mapped transcripts were present in some of the previous annotations (SMEST, dd_Smed_v6, or dd_Smes_v1).Due to the large fraction of non-characterized EST sequences in this data set, the transcripts with little-or no prior evidence likely represent non-coding transcripts that were not a specific focus of the current annotation effort.Note that the mitochondrial genome is not included in the schMedS3 assemblies.

Figure 6
Figure 6 UCSC genome browser view showing the chimeric merger of the gene annotations of a frz receptor (delimited by green bounding lines) and the Activin inhibitor follistatin immediately downstream in the new annotations (h1SMcT*; purple, bottom tracks), but not in the previous dd_v6 transcriptome (blue, above).Note the extremely close proximity, same strandedness and similar RNA-seq coverage of both genes/transcripts, making this a particularly challenging annotation problem.

Figure 8 A
Figure 8 A Identification of optimal sonication settings for chromatin fragmentation using the Covaris S2 sonicator with AFA tubes.(5% Duty Cycle, Intensity 4, 200 Cycles/Burst, 30 sec/cycle).Displayed are DNA samples from crosslinked nuclei without sonication (0), 20 cycles, 30 cycles and 40 cycles.B: Western blot validation of H3K4me3 ChIP-seq pulldown.H3K4me3 pulldown enriched for protein of interest (left), while IgG control did not show any H3K4me3 enrichment (right), confirming the specificity of the assay.The blot was probed with the same H3K4me3 AB as used for the pull-down, which is why the IgG bands are visible (intense bands in all pull-down lanes).C Fingerprint plot displaying the degree of ChIP signal enrichment in the indicated published data sets and this study.A diagonal line indicates uniform read distribution along the genome (no enrichment/background), while a steep curve indicates strong enrichment.Accordingly, the signal/noise ratios of this study's ChIP-seq data is at least en par with the published data.

Figure 9
Figure 9 Uncropped and unedited gel images supporting Figure 8.

Figure 10
Figure 10 Heatmap of ChIP signals at the summits of ATAC-seq peaks.A: H3K4me3 and H3K27ac signal at the summits of putative promoters.B: H3K4me3 and H3K27ac signal at the summits of putative enhancers.C: H3K4me3 and H3K27ac signal at the summits of the remaining uncharacterized ATAC-seq peaks.

Figure 11
Figure 11 Histogram of the distance (from top to bottom) of ATAC-seq peaks, H3K27ac peaks, and H3K4me3 peaks to the closest TSS in S. mediterranea.The median distance to the TSS were 3599 bp, 1693 bp, and 0 bp for ATAC-seq peaks, H3K27ac peaks, and H3K4me3 peaks, respectively.

Figure 12
Figure 12 Size distribution of S. mediterranea ATAC-seq peak liftover depending on the conservation classification.Left side: Cumulative analysis of all three liftover species.Liftovers categorized as "not conserved" (region no summit and peak no summit; top), were shorter than conserved peaks (bottom).

Figure 13
Figure 13 Example of a highly conserved regulatory elements at the nou-darake (ndk) gene locus, showing (top to bottom) the H3K4me3 ChIP, H3K27ac ChIP, and ATAC-seq signal tracks for S. mediterranea, the of mapping of conserved non-coding elements (CNE) previously inferred on basis of the draft genome of Dugesia japonica [3] , the CNEs predicted by this study, the exon/intron gene annotation of S. mediterranea ndk, followed by the ATAC-seq signal, CNE predictions and ndk gene annotations in the genomes of S. polychroa (schPol2), S. nova (schNov1), and S. lugubris (schLug1), all aligned by the ndk TSS.Dark blue, blue or yellow vertical bars indicate highly conserved, conserved and non-conserved ATAC-seq peaks.

Figure 14
Figure 14 Oxford dotplot showing orthologous genes between Amphioxus as representative of vertebrate ancestral linkage groups and the cnidarian Nematostella vectensis, Schmidtea mediterranea haplotype 1, and the parasitic flatworm Schistosoma mansoni.

Figure 16
Figure 16 Oxford dotplots between the Schmidtea species included in this study.A S. lugubris vs S. mediterranea haplotype 1. B S. lugubris vs S. nova.C S. lugubris vs S. polychroa.D S. mediterranea vs S. nova.E, S. mediterranea vs S. polychroa.F S. nova vs S. polychroa.

Table 2
Summary of the 19 gene sequences deposited in NCBI that were not detected in the S3BH annotation.Two genes could not be mapped, likely due to assembly gaps.The other 14 mapped uniquely to the assembly and eight of them were present in some of our previous annotations (SMEST, dd_Smed_v6, or dd_Smes_v1), indicating errors in the S3BH gene preditions.

Table 4
Conservation of ATAC-seq peaks depending on their ChIP-intersect characterization.

Table 5
Size distribution of the cumulative schMedS3h1 liftover by category from Supplemental Figure 12-left.

Table 6
Size distributions of schMedS3h1 liftover by category and species from Supplemental Figure

Table 7
Syntenic blocks identified between schMedS3h1 and the other assemblies using GENESPACE.Given is the number of syntenic blocks with their median, standard deviation, minimum size, maximum size, and the genome coverage in Mb and as a percentage.

Table 8
Enrichment analysis of 10 kb windows flanking synteny breakpoints inferred using GENESPACE.Tests are two-sided and based on comparisons to 1000 iterations of random placement of an equal number of 10 kb windows in the reference.Given are results for all transposable elements followed by the LTR/Gypsy and LINE/R2 elements since they returned the most relevant results.For a full table including all tested elements, see Additional File 2: TableS12.P values are adjusted for multiple testing using the Benjamini-Hochberg procedure and printed in bold when <0.05.

Table 12
Results of one-sided test for the hypergeometric distribution via Fisher's exact tests for synteny conservation between the parasite genomes in the study.P-values have been adjusted for multiple testing using the Bonferroni method.

Table 13
Results of one-sided test for the hypergeometric distribution via Fisher's exact tests for synteny conservation between the Schmidtea genomes in the study.schMedS3h2 was omitted for readability since it had qualitatively identical results to schMedS3h1.P-values have been adjusted for multiple testing using the Bonferroni method.

Table 14
Results of one-sided test for the hypergeometric distribution via Fisher's exact tests for synteny conservation between the Schmidtea genomes and the parasite genomes in the study.schMedS3h2 was omitted for readability since it had qualitatively identical results to schMedS3h1.Pvalues have been adjusted for multiple testing using the Bonferroni method.

Table 15
Description of conservation of metazoan Ancestral Linkage Groups in four parasitic flatworms.Fusion and mixing of Ancestral Linkage Groups are denoted by ⨂.
It is important to notice that this pipeline strictly requires the input fast files to be uncompressed and to have a .fastqextension (i.e..fqor other suffixes are not accepted).Furthermore, if the chromosome names contain underscore symbols, the resulting 3P- seq_processed_filtered.bedcount output files will be formatted incorrectly (the program replaces the _ with a TAB ), therefore it is necessary to restore the right format using sed, as shown below.Briefly, short reads are mapped to the genome using Bowtie (1.3.1), and a series of Python scripts extract the 3'-end positions.Only sites with at least 3 supporting reads are considered a valid TSS.