Characterization of the Polycomb-Group Mark H3K27me3 in Unicellular Algae

Polycomb Group (PcG) proteins mediate chromatin repression in plants and animals by catalyzing H3K27 methylation and H2AK118/119 mono-ubiquitination through the activity of the Polycomb repressive complex 2 (PRC2) and PRC1, respectively. PcG proteins were extensively studied in higher plants, but their function and target genes in unicellular branches of the green lineage remain largely unknown. To shed light on PcG function and modus operandi in a broad evolutionary context, we demonstrate phylogenetic relationship of core PRC1 and PRC2 proteins and H3K27me3 biochemical presence in several unicellular algae of different phylogenetic subclades. We focus then on one of the species, the model red alga Cyanidioschizon merolae, and show that H3K27me3 occupies both, genes and repetitive elements, and mediates the strength of repression depending on the differential occupancy over gene bodies. Furthermore, we report that H3K27me3 in C. merolae is enriched in telomeric and subtelomeric regions of the chromosomes and has unique preferential binding toward intein-containing genes involved in protein splicing. Thus, our study gives important insight for Polycomb-mediated repression in lower eukaryotes, uncovering a previously unknown link between H3K27me3 targets and protein splicing.


INTRODUCTION
In the eukaryotic cells, transcription state is dependent on the underlying chromatin state. The fundamental structure of chromatin is based on a nucleosome, a complex of 147bp-long fragments of DNA wrapped around the core histone proteins (H2A, H2B, H3, H4). Chromatin state can be influenced by post-translational modifications deposited on the histones, either by direct structural changes in the nucleosomes or recruitment/displacement of secondary proteins involved in chromatin remodeling or transcription. The presence of particular histone modifications often determines the type of the chromatin and correlates with transcriptional activity of the target DNA. Among those, tri-methylation of lysine 27 on histone H3 (H3K27me3) and mono-ubiquitination of histone H2AK118/H2AK119 (H2AK118ub/H2AK119ub) are commonly associated with transcriptionally silent facultative heterochromatin.
Deposition of H3K27me3 and H2AK118ub/H2AK119ub is mediated by Polycomb group (PcG) proteins, whose function was initially shown to control developmentally regulated processes and maintain cell identity (Papp and Müller, 2006;Schuettengruber et al., 2007) in both, animals [reviewed in Schwartz and Pirrotta (2008)] and plants [reviewed in Köhler and Villar (2008)]. PcG proteins form distinct complexes, like Polycomb Repressive Complex 2 (PRC2), involved in H3K27 trimethylation [in metazoans Pcl-PRC2 complex (Nekrasov et al., 2007;Müller and Verrijzer, 2009)], and Polycomb Repressive Complex 1 (PRC1), responsible for H2A mono-ubiquitination. PRC2 in Drosophila melanogaster, where it was initially discovered, consists of four core subunits: Enhancer of zeste [E(z)], a catalytic component needed for methylation of H3K27; Extra Sex Combs (ESC), a WD40 motif-containing protein that scaffolds interactions within the complex; Suppressor of zeste (Su(z)12), a Zinc Finger subunit essential for binding of PRC2 to nucleosomes and p55, a nucleosome remodeling factor (Schwartz and Pirrotta, 2013). In turn, PRC1 is formed by: dRING/Sex Combs Extra (Sce); Posterior Sex Combs (Psc), both responsible for mono-ubiquitination activity; Polyhomeotic (Ph), essential for maintaining protein-protein interactions; and Polycomb (Pc), involved in recruitment of the complex to chromatin; and Sex comb on midleg (Scm), important for spreading of PcG silencing (Schwartz and Pirrotta, 2013). Both complexes are functionally connected with each other. In the canonical hierarchical model, initially introduced H3K27me3 mark is recognized by PRC1 and followed by H2A mono-ubiquitination to maintain the repression. However, it was shown that the mechanism of PcGmediated repression can happen as well in the opposite order, with PRC1 introducing its modification prior to PRC2 activity (Yang et al., 2013;Del Prete et al., 2015).
Polycomb-group complexes were identified also in the plant kingdom. Core PRC1 complex in model flowering plant Arabidopsis thaliana consist of AtRING1a/b (equivalent to dRING/Sce), AtBMI1a/b/c and EMF1 (equivalent to Psc), and LHP1 (equivalent to Pc). Although a wide range of PRC1associated proteins exists , it is unclear how the combinations of different subunits depend on specific temporal and spatial conditions. In contrast to PRC1, composition of the Arabidopsis PRC2 complex is better characterized. Arabidopsis PRC2 subunits underwent gene duplication, resulting in the presence of different, partially redundant or independently acting PRC2 complexes (Derkacheva and Hennig, 2014). In the Arabidopsis genome the following PRC2 components can be identified: three E(z)-homologs -CURLY LEAF (CLF), SWINGER (SWN), MEDEA (MEA); three Su(z)12-homologs -EMBRYONIC FLOWER 2 (EMF2), VERNALIZATION 2 (VRN2), FERTILIZATION INDEPENDENT SEED 2 (FIS2); as well as single gene subunits: ESC-homolog -FERTILIZATION INDEPENDENT ENDOSPERM (FIE) and p55-homologs -MULTICOPY SUPPRESSOR OF IRA 1-5 (MSI1-5).
One of the aspects of PRC2 biology concerns its origins and the abundance in lower eukaryotes. Due to the absence of PRC2 in the model unicellular species Saccharomyces cerevisiae and S. pombe (Lachner et al., 2004;Veerappan et al., 2008), the PRC2 appearance was previously thought to co-occur with the emergence of multicellularity. However, high conservation of PRC2 subunits in both, animal and plant lineages implies that PcG genes appeared already in the last common unicellular ancestor (Bowman et al., 2007), before those two kingdoms diverged. In an elegant study, Shaver et al (Shaver et al., 2010) identified the novel computational PRC2-homologs in several unicellular species and showed that E(z)-homolog in unicellular green algae Chlamydomonas reinhardtii is responsible for mono-and di-methylation of lysine 27 on histone H3. As the tri-methylation activity of Chlamydomonas E(z) homolog was not found, this study highlights important differences in the functional conservation of PRC2 components. For instance, although H3K27 tri-methylation is considered to be a prominent histone mark in PRC2-mediated repression, there is a range of H3K27 methylation levels catalyzed by PRC2 that varies between plant and Drosophila or mammals. In Arabidopsis, PRC2 controls H3K27 tri-methylation and H3K27 di-methylation in the euchromatin (Bastow et al., 2004;Lindroth et al., 2004;Sung et al., 2006), with mono-methylation being catalyzed by ATXR5 and ATXR6 (Jacob and Michaels, 2009;. However, in metazoans homologs of ATXR5/6 were not found and their PRC2 complexes mediate H3K27 methylation in all contexts (Müller et al., 2002;Ebert et al., 2004;Ferrari et al., 2014). Another difference between species is the type of DNA elements targeted by PRC2. Drosophila PRC2 targets both, genes and transposable elements/repetitive sequences (Yin et al., 2011), whereas in Arabidopsis the H3K27me3 mark is excluded from the majority of transposable elements/repetitive sequences (Lafos et al., 2011;Deleris et al., 2012;Park et al., 2012).
Even though PRC2-mediated repression has been extensively studied in higher plants and animals, its characterization in lower eukaryotes gathered much less attention. The existence of H3K27me3 in lower eukaryotes is largely unknown (Shaver et al., 2010) and distribution of the mark in the genome and its function were reported only for a handful of species, such as Neurospora crassa, Phaeodactylum tricornutum, or Tetrahymena thermophila. In N. crassa H3K27me3 is arranged into broad domains covering 774 genes connected with a full spectrum of functions (Jamieson et al., 2013). In T. thermophila, the mark associates with the developmentally regulated genome rearrangements as it occupies sequences eliminated during differentiation of the macronucleus (Liu et al., 2007). In P. tricornutum, H3K27me3 covers transposable elements and genes, and its targets are involved in, i.e., signal transduction, development and cell cycle control (Veluchamy et al., 2015).
Given the distinct degrees of phylogenetic conservation of PRC2 and PRC1, and poor characterization of H3K27me3 in lower eukaryotes, we sought to investigate the presence of Polycomb-group homologs and H3K27me3 abundance in several representative unicellular, photosynthetically active eukaryotes. Our scope included the representatives of red algae, green algae and Glaucophyta that vary in genome size, genome architecture, ecological niche and metabolism (Matsuzaki et al., 2004;Palenik et al., 2007;Worden et al., 2009;Merchant et al., 2010;Price et al., 2012). After an initial small-scale screen on H3K27me3 presence, we focused on one of the species, Cyanidioschyzon merolae, to study genome-wide occupation of H3K27me3 and characterize its targets. C. merolae is a unicellular red alga, living in highly acidic environment with high temperatures. It contains a small (16 Mb) genome which was fully sequenced as the first algal genome (Matsuzaki et al., 2004) and assembled as first 100% complete eukaryotic genome (Nozaki et al., 2007). Interestingly, the genome shows extremely simplified structure that contains almost exclusively intron-lacking genes (only 26 genes contain an intron), very low percentage (0,7%) of transposable elements and a novel class of a repetitive element, corresponding to the truncated ORF from White spot syndrome virus (WSV repeat) (Nozaki et al., 2007). Given its unicellularity, evolutionary ancestry, primitive architecture of the genome and availability of sequencing data, we argue that C. merolae is a suitable model for studying chromatin repression in an evolutionary context.
Our results confirmed the high conservation of PRC2 core members in lower eukaryotes and the widespread presence of H3K27me3 modification. We report that in C. merolae H3K27me3 mark targets both, genes and repetitive elements and is anti-correlated with transcriptional activity. We show that H3K27me3 distribution over its targets is not uniform, but can be grouped into several different clusters that correlate with different levels of repression. Furthermore, we demonstrate that H3K27me3 has a preferential chromosomal localization toward telomeric and subtelomeric regions. For the first time, we also reveal that H3K27me3 target genes are enriched in the functional class of intein-mediated protein splicing. Moreover, by deposition of RNA-and ChIP-sequencing data, we provide a resource for studies on the chromatin and transcriptome in the model red alga C. merolae.
The highest scoring candidate proteins were used for BLAST searches against the TAIR10 genome annotation to confirm a reciprocal match to the protein used as an initial query. The sequences were aligned using ClustalXv2.0 (Larkin et al., 2007) and conserved blocks were selected by gBlocks v0.91b (Talavera and Castresana, 2007). For substitution model estimation, ProtTestv3.4 was employed (Abascal et al., 2005). Phylogenetic trees were created using a Bayesian approach in Mr Bayesv3.2 (Ronquist et al., 2012), self-compiled developer version r1067 with implemented LG model. The runs were done with 20 mln generations. Sequences from D. melanogaster were selected as an out group. Consensus trees and alignments were visualized with Figtree v1.4.2 (Rambaut, 2009) and Jalviewv2 (Waterhouse et al., 2009), respectively.

Algae Growth
Cyanidioschyzon merolae cells were grown at 42 • C with shaking at 150 rpm. The cultures were kept under continuous light, with light intensity of 70 µmol.

Western Blot
Around 50.000 algae cells were diluted in Laemmli's sample buffer, and 3-7 µl of total protein/lane was subjected to sodium dodecyl sulfate-polyacrylamide gel electrophoresis (SDS-PAGE) using 12% SDS-polyacrylamide gel. The resolved proteins were transferred onto polyvinylidenedifluoride (PVDF) membranes (Immobilon-P membrane, Milipore) using a Vertical Electrophoresis Cell (Bio-Rad). Following transfer, membranes were washed with phosphate buffered saline containing 0.05% Tween-20 (PBST) and then blocked in 5% milk diluted in PBST for 1 h at room temperature. The membranes were then incubated overnight with anti_H3K27me3 (N 07-449, Millipore or C15410195, Diagenode) and anti H3 (ab1791-100, Abcam) rabbit polyclonal antibodies 1:2000 in PBST containing 5% milk at 4 • C. After washing three times for 10 min in PBST, membranes were incubated with peroxidaselabeled secondary antibody for 1 h at room temperature. The membranes were washed three times for 10 min in PBST, incubated with SuperSignal West Femto Chemiluminescence Substrate (Thermo Scientific) for 1 min, and exposed for 1-2 min. Commercial histone extract from calf thymus (Sigma, #9064-47-5) was used as a positive control for H3K27me3 detection.

Competition Assay
Pre-incubation of anti-H3K27me3 antibody (C15410195, Diagenode) and H3K27me3 peptide (Intavis) was done for 2 h in RT in PBST, with occasional mixing. The peptide-to-antibody molar ratio was 50:1. A control solution -PBST with just antibody at 2X final concentration -was used at the same time. At the end of the pre-incubation, 4% BSA blocking solution was added to the peptide/antibody mixture to a final concentration of 3% BSA, mixed briefly and added to the membranes. After washing three times for 10 min in PBST, membranes were incubated with peroxidase-labeled secondary antibody for 1 h at room temperature. The membranes were washed three times for 10 min in PBST, incubated with SuperSignal West Femto Chemiluminescence Substrate (Thermo Scientific) for 1 min, and exposed for 1-2 min.

Chromatin Immunoprecipitation
Cyanidioschyzon merolae cells were grown until late log-phase. 40 ml samples were fixed [Formaldehyde, 1% (v/v)] for 10 min, until addition of glycine (to final concentration of 125 mM, 5 min incubation). Superfluous formaldehyde was removed by three washes with ice cold PBS buffer and the remaining cell pellet was resuspended in 4 ml of Extraction Buffer (50 mM Tris-HCl pH 8, 10 mM EDTA, 1% SDS) with Protease Inhibitor Cocktail (Roche). The samples were sonicated for 5 min with 30 s ON/30 s OFF cycle using Bioruptor Plus device (Diagenode) and cleared by two rounds of centrifugation (13000 rpm, 4 • C, 10 min). Subsequent steps were performed as in the Plant ChIPseq kit protocol (Diagenode) with higher volume of sample taken aside as an input (1:5 of chromatin for IP). Immunoprecipitation was done using anti-H3K27me3 Polyclonal Premium antibody (C15410195, Diagenode) and, as a negative control, IgG fraction from rabbit (C15410206, Diagenode). Quality and fragment size of immunoprecipitated DNA and input samples were measured using agarose gel electrophoresis and Bioanalyzer 2100 (Agilent Technologies).

Real Time PCR
DNA samples obtained from ChIP were used for H3K27me3 enrichment analysis for several target genes by real-time quantitative PCR. Reactions were prepared using KAPA SYBR FAST qPCR Mastermix (KapaBiosystems), according to the manufacturer's protocol, and run on an iQ5 detection system (Biorad) using a 2-step program. Differences in H3K27me3 enrichment on target genes were scored by comparison of % recovery of input and standard error values. 5 -> 3 sequences of oligonucleotides used for amplification were as follows: CmMADS-forward: GGATGAGAAAGCGAGAAATACGA and reverse: TCACAATGCCGATCTCACAG; CmEIF-4A -forward: TGTACGATATGATCCAGAGAAGAG and reverse: TGTAGAT TTGCTCCTTGAAACC; Cm60S -forward: AAGTTTCGCTGT ACGCTTGG and reverse: TAACCAGGACCATATCGCCG.

ChIP-seq
DNA library was prepared with MicroPlex Library Preparation Kit (Diagenode) and sequenced on HiSeq 2000 sequencer (Illumina Inc.). Quality of paired-end raw reads was assessed using FastQC v0.11.4. The reads were trimmed according to the quality and mapped to the reference in Bowtie v2.2.6, with standard options. Peaks were called using MACS v2.1 with exclusion of clonal reads. The new annotation from the C. merolae Genome Project v3 1 was employed as a reference. The correlation between ChIP-seq replicates was scored using Pearson correlation and shown in Supplementary Figure S3A. The data was deposited in the Gene Expression Omnibus under the series GSE93913.

RNA Extraction and RNA-seq
Cyanidioschyzon merolae cells were grown until late log-phase. RNA was isolated using RNeasy Plant Mini Kit (Qiagen), following a standard protocol. cDNA synthesis and DNase treatment were performed using RevertAid First Strand cDNA Synthesis Kit (Thermo Scientific) with oligo(dT) primers, according to a standard manual. The quality and concentration of samples were measured spectrometrically using Nanodrop 1000 (Thermo Scientific) and electrophoretically using Bioanalyzer 2100 (Agilent Technologies). Libraries were constructed using TruSeq RNA Library Prep kit (Illumina) with gel-free library purification based on Agencourt AMPure XP beads (Beckman Coulter, Inc.). The samples were sequenced on HiSeq 2000 device (Illumina Inc.). Paired-end raw sequencing reads were analyzed using Galaxy implementations (usegalaxy.org) of 1 czon.jp/download/annot_list.txt FastQC program and Tuxedo protocol (Trapnell et al., 2012). The new annotation from C. merolae Genome Project v3 was employed as a reference. The correlation between replicates in RNA-seq experiment was scored using Pearson correlation and shown in Supplementary Figure S4. The data was deposited in the Gene Expression Omnibus, under the series GSE93912.

Bioinformatic Secondary Analysis of NGS Data
Identified H3K27me3 peaks were annotated by intersection with the reference using Bedtools v2.17 (Quinlan and Hall, 2010). Sequences from unannotated peaks were translated in silico to ORF in HMMER2GO v0.17 software and the longest ORF was used for homology searches by BLASTP/Pfam against respective protein databases. Alternatively, unannotated peak sequences were used for homology searches by BLASTN against NCBI nucleotide databases: nr/nt and ref_seq. Homology searches were performed in BLAST+ standalone package v2.2.28+ (Camacho et al., 2009). Alignment threshold was set as follows: >50% alignment length and <0.5 E-value. Complete genome records in nucleotide databases were used to form negative GI list and excluded from the BLASTN searches. Only the top alignment hit was included in the further analyses. Distance between peak and annotated genomic feature was obtained using 'closest' command from Bedtoolsv2.17. Mapping of the reads was visualized in IGV v2.3 (Thorvaldsdóttir et al., 2013). Clustering of H3K27me3 enrichment was done on deepTools2 (Ramirez et al., 2016). Gene ontology was inferred by using Singular Enrichment Analysis on the AgriGO server (Du et al., 2010) against complete GO list. Protein annotations were extracted from UniProtKB database and used as reference. To select significantly enriched GO terms, Fisher test with Yekutieli adjustment was used a statistical method. P-value was set to 0.05 and minimal number of entries kept at 5.

Conservation of PRC1 and PRC2 Homologs
In order to assess evolutionary conservation of Polycomb complexes, we performed phylogenetic analysis in the species from various groups of lower plants. We focused on the representatives from subclades: Chlorophyta (green algae), Rhodophyta (red algae), Glaucophyta and Embryophyta. For comparison, a representative from Metazoa, D. melanogaster, was included as well. Using reciprocal BLAST searches with full length proteins from A. thaliana or D. melanogaster, we identified homologs of PRC1 and PRC2 members in several species (Table 1).
We found a general absence of PRC1 components: Psc/BMI1, Pc/LHP1, EMF1 and Scm in the algal genomes analyzed, with an exception of Psc/BMI1 homolog in C. reinhardtii. In contrast, the RING1 subunit is widely conserved, suggesting a monoubiquitylation activity unrelated to PRC1 in algae and protein moonlighting (Jeffery, 2003). In general, an abundance of RING1 homologs in chlorophytes, but the absence of other PRC1 components in lower branches of Archaeplastida used here is consistent with the previous studies (Hennig and Derkacheva, 2009;Berke and Snel, 2015;Chen et al., 2016) and agrees with the notion of lower PRC1 conservation and potential loss of the complex in several phylogenetic branches. Homology searches on PRC2 revealed broad distribution of complex components: E(z), ESC and p55, consistently with results published elsewhere (Shaver et al., 2010;Butenko and Ohad, 2011;Kim et al., 2013). Importantly, p55 in higher eukaryotes has various functions and participates also in other complexes than PRC2 (i.e., chromatin assembly factor 1 [CAF1)]. Therefore it still remains to be proven whether a role of p55 homologs is Polycomb-related in lower plants. Moreover, we observed an absence of Suz12 from C. reinhardtii and C. paradoxa. Similarly to the other studies (Shaver et al., 2010), the lower abundance of this subunit shows that Suz12 was lost in several species, despite the presence of the other PRC2 components. Interestingly, we did not detect any core PRC2 members in the Cyanophora representative. Given that E(z) and ESC homologs were present in the representatives of Rhodophyta and Chlorophyta, it suggests the existence of PRC2 in the common algal ancestor and subsequent loss of the complex from Cyanophora. However, we could not exclude technical issues coming from incomplete annotation and incorrectly predicated protein models in the Cyanophora genome. In addition, we identified mostly one homolog per complex component in all screened species, apart from Arabidopsis, which adds another piece of evidence to the frequent gene duplication occurring in flowering plants.
Overall, our results agree with the ancient presence of core PRC2 components and the frequent losses of the Suz12 subunit and PRC1 complex members.

Phylogenetic Relationship of PRC2 Homologs
Phylogenetic analyses on the conserved domains or full sequences of core PRC2 homologs in lower and higher plants with the creation of Neighbor-Joining (NJ) trees revealed grouping into several distinct classes (Shaver et al., 2010;Kim et al., 2013;Huang et al., 2016). In a different approach, we extracted amino acid sequences from the most abundant PRC2 members, aligned them and automatically selected only the conserved blocks in the multiple alignment (see Material and Methods). With such prepared sequences we created Bayesian trees for E(z)z, ESC and Su(z)12 homologs. Consistently with the published data, the homologs formed defined clades. Bayesian trees for E(z) and ESC sequences generally resolved the evolutionary distance between groups (Figures 1A,B). In those cases, we found separation of representatives of Rhodophyta and Viridiplantae, with species from Chlorophyta and Embryophyta forming separate subclades within the latter. The exceptions included one of the ESC homologs from C. reinhardtii (CrESC.1), which clustered together with the C. merolae homolog, rather than the other green algae. In turn, our analysis on Su(z)12 detected distinct clades for the organisms from Embryophyta, Chlorophyta and Rhodophyta, without grouping of Arabidopsis and green algae sequences ( Figure 1C). Moreover, we noted a phylogenetic distance between C. merolae and any other species for all the three proteins.
In summary, our results on Bayesian phylogenetic trees on conserved sequence blocks are in agreement with the published NJ trees using the full length (Kim et al., 2013) or domainonly (Shaver et al., 2010) sequences. Given high conservation of domain architecture in core PRC2 components between the subclades (Shaver et al., 2010;Kim et al., 2013), we conclude that PRC2 is widely distributed and already evolved in a common unicellular ancestor.

Conservation of Histone H3 Sequences in Various Eukaryotic Organisms
As PRC2 has the canonical function to methylate lysine 27 in histone H3, we decided to analyze protein sequence conservation of histone H3. Amino acid sequences of algal histone H3 were retrieved from the databases of respective genome sequencing projects (see materials and methods) or NCBI, using the canonical histone H3.1 from Arabidopsis as a query (Okada et al., 2005). The results revealed a presence of multiple H3 genes in all of the species studied (Table 1). Noteworthy, we detected an equal amount of histone H3 gene copies as in the published reports for M. pusilla and C. reinhardtii (Cui et al., 2015). The number of H3 gene copies in D. was taken from the published results (McKay et al., 2015).
Next, the closest H3 homologs between the species were aligned using ClustalX and visualized with Jalview. The alignment revealed overall high amino acid conservation and a presence of lysine-27 in all studied species (Figure 2). Moreover, we noted that the sequences around lysine-27, including the underlying motif ARKS, does not show any sequence divergence with an exception of C. reinhardtii, which contains threonine-28 instead of serine-28. The results suggested that there is the potential for introduction of the H3K27me3 modification in most of species analyzed. However, in C. reinhardtii the presence of threonine-28 next to lysine-27 might not permit detection using commercial antibodies against the methylated H3K27. Moreover, a mass spectrometry analysis on Chlamydomonas histone H3 did not reveal the presence of the H3K27me3 modification (Shaver et al., 2010).

Detection of H3K27me3
Given the phylogenetic conservation of PRC2 core components and high similarity of amino acid composition of histone H3 in the vast majority of the species, we sought to examine H3K27me3 presence in total protein extracts from selected organisms. Isolated proteins from crude extract were separated on a SDS-PAGE gel and detected using two independently raised anti-H3K27me3 antibodies. Our results revealed the presence of the H3K27me3 modification in M. pusilla, O. tauri, C. paradoxa, and C. merolae (Figures 3A,B). In order to decipher H3K27me3 relative abundance, we performed western blotting for histone H3 ( Figure 3D) and calculated H3K27me3/H3 ratios based on band intensity quantification. We detected different relative amounts of the modification in the studied species (Figure 3E), with the lowest H3K27me3/H3 ratios for M. pusilla and O. tauri, intermediate for C. merolae and the highest for C. paradoxa.
Due to the fact that one of the antibodies detected proteins in the sizes not corresponding to histones, we performed protein competition assay for the confirmation of specific band reactivity. FIGURE 2 | H3 sequence conservation. AtH3.1 was used a query for BLAST searches against databases of respective species. The closest homologs were selected and used for multiple alignment in ClustalX, visualized afterward in Jalview v2.9. Color-shading marks non-conserved residues. Red frame correspond to lysine-27, a residue typically modified by PRC2. After preincubation with H3K27me3 peptide, we did not detect any band corresponding to this modification for a commercial histone extract from calf thymus (positive control), as well as for M. pusilla, O. tauri, and C. merolae (Figure 3C), suggesting a specificity of the antibody. Intensity decrease of the band corresponding to H3K27me3 for C. paradoxa was accompanied by the overall signal loss, including the bands of the other molecular weight proteins. Therefore, the presence of H3K27me3 in C. paradoxa remains inconclusive.
In summary, we were able to biochemically identify and confirm a presence of H3K27me3 in M. pusilla, O. tauri and C. merolae. This result is consistent with our phylogenetic analyses and suggests that the PRC2 complex is active in these species. For C. paradoxa, the biochemical results are unclear: potential absence of H3K27me3 mark would correspond to the lack of PRC2 homologs identified in its genome. Differences in relative H3K27me3 amounts between species imply, e.g., a distinct abundance of genomic targets of H3K27me3 or different H3K27me3 enrichment levels per genomic region. Further work should decipher an impact of such differences in the evolutionary context.

Characterization of H3K27me3 Target Genes
Based on the broad distribution of PRC2 genes, high sequence conservation of histone H3 and specific signal detected with the anti-H3K27me3 antibody on a protein blot, we selected one of the species, C. merolae, to investigate the H3K27me3 abundance further. Firstly, we asked whether the modification has similar targets as in the other organisms. Reciprocal BLAST searches between protein databases of C. merolae and A. thaliana were used to detect homologous genes of known A. thaliana PRC2 targets (Supplementary Figure S1). We selected a MADS boxcontaining gene CMA095C (CmMADS) as nearly all Arabidopsis MADS box genes carry H3K27me3. For negative controls, estimated not to be targeted by H3K27me3, homologous genes for EUKARYOTIC ELONGATION FACTOR EIF4A (CMK028C, CmEIF4A) and 60S RIBOSOMAL PROTEIN L23 (CMS262C, Cm60S) were taken.
Next, we performed Chromatin-immunoprecipitation (ChIP) using anti-H3K27me3 antibody and analyzed the expression of candidate genes by RT-PCR. As a result, we were able to show significant enrichment of H3K27me3 on CmMADS and low abundance of the mark on negative targets: CmEIF4A and Cm60S (Figure 4). Overall, we were able to identify positive target of H3K27me3 and show high enrichment of the mark comparing to negative loci.

H3K27me3 Genome-wide Distribution -Peak Identification
In order to characterize the targets of H3K27me3 and the distribution of H3K27me3 peaks in a genome-wide scale, we created a sequencing library and performed chromatinimmunoprecipitation coupled with sequencing (ChIP-seq) for the samples: H3K27me3-bound DNA, H3-bound DNA (both in triplicate) and input.
FIGURE 4 | H3K27me3 abundance on selected genes from C. merolae. Chromatin immunoprecipitation was performed using H3K27me3 antibody (#C15410195, Diagenode) and IgG (#C15410206, Diagenode) for negative control, followed by RT-PCR analysis on a putative H3K27me3 target (CmMADS) and housekeeping genes as the putative non-H3K27me3 targets (Cm60S, CmEIF4A). Selected genes relate to following locus IDs from C. merolae genome project: CmMADS -CMA095C, Cm60S -CMS262C, CmEIF4A -CMK028C. Error bars correspond to the standard deviation from three biological replicates. Significance was calculated using Student's t-test for the comparison between CmMADS and either, Cm60S, or CmEIF4A. A triple asterisk indicates the significance level of p < 0.001.
Cleaning raw sequencing reads, mapping to the reference genome and peak calling with input or H3 sample for normalization, let us identify more than 1300 H3K27me3 peaks per replicate (Figure 5A). The peaks cover 14% of total nuclear genome length, with an average peak size of 1792 bp. Good quality of the data was confirmed by a high number of H3K27me3 peaks in the overlaps between the biological replicates, as well as the high correlation values of the reads' occupancy (Supplementary Figure S3).
Next, we compared H3K27me3-peak coordinates with the loci in C. merolae reference genome. We applied a threshold of 50% overlap of the locus length and successfully annotated ca. Six hundred peaks ('high overlap annotated peaks') ( Figure 5A). The remaining peaks corresponded to the reference loci covered by H3K27me3 in less than 50% of their length ('low overlap annotated peaks') and to the unannotated peaks without any overlap with reference loci. We concluded that the unannotated peaks are the intergenic regions and/or uncharacterized genomic features missing in the current annotation. Considering the second possibility, we used several strategies to find de novo annotation of unknown peaks.
Firstly, we in silico translated sequences from the unannotated peaks and searched for open reading frames (ORFs) within them. We found ORFs for all of the unannotated peaks and used them for BLASTP/Pfam searches against the NCBI nonredundant protein sequence database (nr database) or Pfam protein database. Using a specific alignment threshold (alignment length > 50%; E-value < 0.5), we could annotate further 178, 164, and 195 peaks for biological replicate 1, 2, and 3, respectively ( Figure 5B). In the second approach, we extracted DNA FIGURE 5 | H3K27me3 genome-wide distribution and peak annotation. (A,B) -Number of peaks identified in three different biological replicates from chromatin immunoprecipitation experiment (IP1-3) in 'high overlap annotated peaks,' 'low overlap annotated peaks' category, 'de novo-annotated' and 'unannotated' categories (see main text). (C,D) -Characterization of H3K27me3 peaks according to genomic element type. High overlap H3K27me3 peaks cover 242 genes, which amounts to 4% of total gene number (C) or 172 repetitive elements, 50% of total repetitive element number (D). Both, loci of hypothetical proteins and transcripts, were included in total gene number.
sequences of the unknown peaks and used BLASTN searches against the NCBI nucleotide collection (nr/nt database) and NCBI transcript reference sequences (refseq_rna database). The hits surpassing the alignment threshold (alignment length > 50%, E-value < 0.5) were compared to BLASTP/Pfam output, which allowed to annotate next 4 to 8 peaks, depending on the biological replicate ( Figure 5B). In general, our sequence homology searches helped to detect de novo annotation of 13-15% peaks from their total number, depending on the biological replicate ( Figure 5A).
We reasoned that the remaining unknown peaks after de novo annotation corresponded to the regions containing promoters and the other regulatory elements of neighboring annotated elements. Given the high density of genes along the chromosomes in C. merolae (Matsuzaki et al., 2004), such regions would reside close to 5 or 3 gene ends, but would not be present in the annotation. In order to validate this idea, we calculated average distances between gene locations and unannotated peaks. We compared distributions between peaks successfully aligned to BLASTP database (annotated peaks) and the ones showing no significant alignment (unannotated peaks), assuming that BLASTP-aligned peaks truly correspond to novel genes missing in our reference, rather than uncharacterized cis elements. However, we did not observe major differences between distances calculated for annotated and unannotated peaks (Supplementary Table S1), suggesting that unannotated peaks do not exclusively correspond to cis regulatory elements of known genomic elements. Further development of C. merolae genome annotation should deepen our understanding about these remaining uncharacterized H3K27me3 peaks.

H3K27me3 Genome-wide Distribution -Genomic Feature
Next, we sought to investigate the genomic feature profile underlying H3K27me3 peaks. We picked a stringent group of high overlap annotated targets, removed duplicates (two peaks spanning one genomic element) for each replicate and selected a consensus set of features that appear in any two out of three replicates. As a result, we found that H3K27me3 covers 242 genes, which amounts to 4% of C. merolae total gene number ( Figure 5C). On the other hand, we also detected 172 H3K27me3-bound repetitive elements. Given very low repetitive element occupancy in the C. merolae genome (0,7% transposable elements + 5% WSV repeats) (Nozaki et al., 2007), H3K27me3 seems to be predominantly present on repetitive elements, covering as much as 50% of their total number ( Figure 5D). We noted that the enrichment on both, genes and repetitive elements, is in concordance with the studies in Drosophila (Yin et al., 2011). Interestingly, the predominant enrichment of H3K27me3 on repetitive elements was found also in diatom P. tricornutum (Veluchamy et al., 2015), suggesting an ancestral role of Polycomb-mediated gene regulation in guarding the genome.
H3K27me3 Is a Silencing Mark in C. merolae Next, we wanted to characterize the correlation of H3K27me3 enrichment with the expression of its targets. In order to decipher expression level on a genome-wide scale, we performed a RNAseq experiment on reverse-transcribed RNA extracted from C. merolae cultures. The sequencing reads were mapped to the reference genome and expression level was assessed based on the FPKM values. We intersected the FPKM data with H3K27me3binding profile and distinguished genes and repetitive elements for the analyses.
We found that both, H3K27me3-covered genes and H3K27me3-covered repetitive elements are on average significantly less expressed than non-H3K27me3 targets (Figure 6), confirming that the H3K27me3 modification in C. merolae is highly correlated with gene repression. Moreover, we noted that the general level of repetitive element expression is lower than gene expression, irrespectively of the H3K27me3 status. These results suggest the existence of additional mechanisms involved in repression of the repetitive elements.

H3K27me3 Gene-body Distribution
H3K27me3 enrichment has a broad distribution over gene bodies in plants (Zhang et al., 2007;Yang et al., 2014) or gene bodies and flanking regions in animals (Pauler et al., 2009;Cerase et al., 2014) and diatoms (Veluchamy et al., 2015). To inspect the distribution of the mark in C. merolae, we FIGURE 6 | Relationship between expression and H3K27me3 presence. H3K27me3 occupancy is anti-correlated with expression level in both, genes and TEs. Expression was estimated based on FPKM values obtained from RNA-seq experiment. Student t-test was used to calculate significance. * * p-value < 0.01, * p-value < 0.05. Error bars represent standard error (SE). calculated average H3K27me3 enrichment over genic coordinates and observed that H3K27me3 covers gene bodies, 0.5 kb upstream and 0.5 kb downstream regions (Supplementary Figure S2). This general approach obscured whether all targets show similar, broad distribution of H3K27me3 or whether the differential enrichment subgroups are present among the targets. We therefore performed k-means clustering on the results. Our analysis identified four clusters with the preferential presence of H3K27me3 at gene bodies (cluster 1), the 0.5 kb downstream region (cluster 2), and the 0.5 kb upstream region (cluster 3) ( Figure 7A). Cluster 4 contains non-H3K27me3 targets.
As distinct H3K27me3 enrichment clusters impact target gene activity differently in some other species (Young et al., 2011;Veluchamy et al., 2015), we asked whether gene expression level varies between the H3K27me3 clusters in C. merolae. H3K27me3 differential enrichment compared with RNA-seq data revealed that clusters 1-3 correlate with the gene repression, albeit to an uneven extent. Cluster 1 correlated with the strongest repression level, whereas clusters 2 and 3 correlated with a similar, milder repression ( Figure 8A). Thus, the strongest silencing effect was seen for H3K27me3 distribution at the gene bodies, similarly to the diatom P. tricornutum (Veluchamy et al., 2015), and the H3K27me3 enrichment at the flanking regions is correlated with the milder repression.
We performed similar clustering analysis to characterize H3K27me3 enrichment also on the repetitive elements. We could distinguish two clusters with histone mark presence on 0.5 kb FIGURE 7 | Clustering of H3K27me3 gene body enrichment. Coverage of ChIP-seq H3K27me3 reads normalized to input was calculated and used to generate matrix with gene (A) or repetitive element (B) locations as reference. The reads were scaled to 500 bp windows and flanking regions set to 500 bp. Matrix files served as an input for heatmap generation and K-means clustering in deepTools2.
upstream region and gene bodies (cluster 1) or gene bodies only (cluster 2) ( Figure 7B). Cluster 3 corresponds to non-H3K27me3 targets. In contrast to the genes, we found that cluster 1 and 2 correlate with reduced gene expression to the same extent ( Figure 8B), suggesting that gene body H3K27me3 distribution has a predominant effect on gene repression. On the other hand, clusters present at repetitive elements are not fully analogous to those at genes as for the former we did not identify a cluster with enrichment exclusively at the flanking regions.

H3K27me3 Location on Chromosomes
Looking at the whole-chromosome level of H3K27me3 distribution, we found a preferential enrichment of the mark on the chromosome ends (Figure 9). Such observation concerns 36 out 40 ends from 20 chromosomes. The exceptions, in which the closest H3K27me3 domain was detected only 4-5 kb away from the chromosome end, include: 5 end of chromosome 3 and 3 end of chromosomes 7, 14, 16 (Supplementary Table S2 distance). Telomeric repeats in plants span regions from 0.5 kb in Chlorella vulgaris to 150 kb in Nicotiana tabacum (Fajkus et al., 1995;Higashiyama et al., 1995). C. merolae telomere length is relatively short and varies from 400bp to 700bp (Nozaki et al., 2007). Hence, the H3K27me3-domains at the chromosome ends detected in C. merolae cover the regions of telomeres and subtelomeres, similarly to what has been shown for the fungi: N. crassa (Jamieson et al., 2013) and Cryptococcus neoformans (Dumesic et al., 2014).
In order to decipher functional classification of H3K27me3 targets, we performed gene ontology analysis with Singular Enrichment Analysis from the AgriGO toolkit (Du et al., 2010). GO classes of H3K27me3-targets and full lists of C. merolae genes used as a reference were extracted from the UniProtKBdatabase. We found a GO annotation for 3085 genes (50,5% from total number of 6108, including hypothetical proteins and transcripts) in the reference with 144 GO-annotated genes among H3K27me3-targets (62,6% from total number of 230). Statistical significance was determined based on the adjusted p-value with Yekutieli test for FDR. For statistical tests, we selected only those GO classes that were represented by at least five entries (see material and methods). All three different GO sub-ontologies were taken into account: molecular function (GO:0003674), biological process (GO::0008150), and cellular component (GO::0008372).
We did not reveal any significantly (adjusted p-value < 0.05) enriched GO terms in cellular component sub-ontology. In contrast, we detected three GO term branches in biological process and one in molecular function sub-ontologies (Figure 10). Consistently with the known H3K27me3 function in the repression of developmental programs, we detected significantly enriched GO terms related to organismal process and development. Surprisingly, among enriched terms we found also those corresponding to protein maturation and intein-mediated protein splicing.
Protein splicing is a protein maturation mechanism based on the excision of a protein fragment (intein) from a precursor and ligation of the flanking polypeptides (exteins) FIGURE 9 | H3K27me3 domain enrichment on chromosome ends. 5 end of chromosome 13 is used as an example. Results from three biological replicates (IP1-3) taken for sequencing are shown. Bars below H3K27me3 tracks correspond to peaks scored by MACS2 with input (top bar) or H3 (bottom bar) used for normalization. The lowest track correspond to C. merolae genome annotation (merolae.biol.s.u-tokyo.ac.jp). The data was visualized in IGV v2.3. (Topilina and Mills, 2014). Comparative analyses of amino acid sequences revealed a homology of protein splice-junction motif from inteins to the C-terminal Hog domains in Hedgehog proteins, the secretory proteins controlling developmental processes in Metazoa (Jiang and Hui, 2008).
In concordance to the protein splicing mechanism and the function of intein-containing genes, we observed that the majority of protein entries present in significantly enriched GO terms (organismal process and development, cell communication and peptidase function) overlap with the protein splicing group (Supplementary Table S3). Moreover, we found that all of the H3K27me3 targets present in the protein splicing term indeed possess a Hog domain as the Hedgehog proteins.
Interestingly, the genes encoding Hedgehog proteins in C. merolae reside in the telomeric and subtelomeric regions of the chromosomes (Nozaki et al., 2007), which is in agreement to our results on the preferential H3K27me3 enrichment at the chromosome ends (Figure 9). Moreover, comparing with clustering data, we observed that Hedgehog loci were enriched in cluster 3 (Supplementary Table S4), bound by H3K27me3 preferentially to the upstream regions and moderately repressed.

DISCUSSION
Polycomb-group mediated regulation of gene expression in multicellular organisms is intensively studied in plants and has important roles in stress responses and developmental phase transitions (Köhler and Villar, 2008;Kleinmanns and Schubert, 2014). However, a model for Pc-G function in unicellular, photosynthetic organisms is currently lacking. We therefore performed homology searches and Bayesian phylogenetic tree analyses on the conserved alignment blocks. Our results confirmed a frequent absence of the PRC1 complex, ancient origin of PRC2 complex and the widespread distribution of core components of PRC2 in the green lineage [see also: (Shaver et al., 2010;Kim et al., 2013)]. We detected the PcG mark H3K27me3 in prasinophytes and red algae, arguing for the functionality of PRC2 in these organisms. Subsequent analyses of H3K27me3 occupancy in the model red alga, C. merolae, and correlative analyses with transcriptomic data revealed several important observations in this unicellular organism.
We observed that H3K27me3 in C. merolae is present on both, genes and repetitive elements, covering 4% of total gene number and 50% of total repetitive element number. The occupancy on both genomic types is in the agreement with the results from Drosophila and Phaeodactylum, suggesting an ancestral function of PcG in repetitive element silencing. As Arabidopsis H3K27me3 occupies preferentially genes, our results suggest also a partial divergence from PcG-mediated repetitive element repression in higher plants.
In Arabidopsis endosperm, the vicinity of repetitive elements was shown to impact H3K27me3-mediated gene imprinting (Weinhofer et al., 2010;Wolff et al., 2011). In contrast, the H3K27me3-covered genes in C. merolae are not in the vicinity of repetitive elements. However, we cannot exclude the possibility FIGURE 10 | Gene ontology analysis on H3K27me3 target genes. Significantly enriched terms from molecular function and biological process subontologies (no significant hits found for cellular component subontology). Color coding represents significance level (1-9) below adjusted p-value threshold (p < 0.05). Adjusted p-values are shown next to GO IDs for each GO term. Relationships between GO terms are reflected in the arrow types. Red frames highlight GO terms connected with intein-mediated protein splicing. Numbers below subcategory name correspond to: GO subcategory gene number in query/total gene number in query | GO subcategory gene number in reference (TAIR10)/total gene number in reference.
that the genome annotation is incomplete, with unknown types of repetitive elements being not reported.
We showed that H3K27me3 occupancy anti-correlates with the expression of its targets, consistently with its function in the multicellular organisms. As the other lysine-27 methylation marks, H3K27me1 and H3K27me2, have a repressive function in higher plants (Bastow et al., 2004;, similarly to the H3K27me3, the future steps should include profiling of those modifications in C. merolae genome. We also observed that the degree of repression is tightly correlated with the H3K27me3 profile at the genic level. The lowest expression was found for the genes covered by H3K27me3 over the gene body, whereas the targets with H3K27me3 occupancy on 5 and 3 flanking regions associate with intermediate repression. Furthermore, H3K27me3 shows an enrichment on chromosome ends and covers a broad region including telomeric repeats and subtelomeric genes in C. merolae. Interestingly, several other species were reported to show an enrichment of H3K27me3 at the subtelomeric and telomeric regions (Smith et al., 2008;Vaquero-sedas et al., 2012). Moreover, Polycomb proteins were proved to be associated with telomere-binding factors (Zhou et al., 2013) and telobox motifs demonstrated to be enriched at PcG-peaks (Deng et al., 2013), arguing for a functional connection between PcG-mediated repression and chromosomal ends.
Our gene ontology analysis showed that H3K27me3 covers genes associated with intein-mediated protein splicing. Inteins are considered as ancient mobile elements present in all three domains of life (eukaryotes, archaea, bacteria) and viruses (Pietrokovski, 2001;Novikova et al., 2014). Hence, our results suggest that the H3K27me3-mediated repression might function in guarding the C. merolae genome. Such notion would be consistent with the predominant enrichment of H3K27me3 on the repetitive elements ( Figure 5D). On the other hand, inteins may also develop to become posttranslational regulatory elements in the course of evolution, as seen for the conditional protein splicing of Hog domains in animals (Shah and Muir, 2014). Therefore, it is of special importance to decipher the function of intein-containing Hedgehog genes and the conditions for their de-repression in C. merolae.
Interestingly, we also observed that the intein-containing genes reside in the subtelomeric regions and are moderately repressed by H3K27me3 that occupies upstream regions from their TSS. It is currently unclear how protein splicing, moderate H3K27me3-mediated repression and subtelomeric localization are inter-connected. It is feasible that the integration of intein-containing genes appeared relatively recently and that these 'young' H3K27me3 targets display H3K27me3 enrichment at regions upstream of their TSS, in agreement with their moderate repression. Comparative genomics between C. merolae and related species should be used to validate this idea. In turn, subtelomeres are dynamic regions and the objects of ectopic recombination events, which facilitate gene diversification, phenotypic diversity and adaptation to different environments, exemplified by an olfactory receptor gene family in humans (Linardopoulou et al., 2001), contingency systems in various pathogens (Barry et al., 2003) and carbon-source metabolism genes in S. cerevisiae (Carlson and Botstein, 1983;Louis et al., 1994;Brown et al., 2010). Hence, it is possible that the intein-containing subtelomeric H3K27me3-covered genes are the targets of rapid adaptive evolution. Characterization of their duplication frequency and diversification within their gene families should validate such notion.
Last but not the least, our results on H3K27me3 presence in various algae and genome-wide characterization of expression status and H3K27me3 occupancy in the model red alga, C. merolae, provide a resource for chromatin and transcriptomic studies. An existence of genetic tools, optimized growth conditions and the sequenced genome in C. merolae offer a possibility to decipher genome regulation in a broad evolutionary context, including the organisms with super-reduced genomes.

AUTHOR CONTRIBUTIONS
PM, OK, and DS designed the research. FF and AW designed C. merolae growth scheme. FF handled C. merolae cell cultures. PM performed bioinformatic analyses, chromatin immunoprecipitations and sequencing library preparations. OK performed competition assay and western blot experiments. PM and DS performed homology searches for H3K27me3 targets used in ChIP-RT-qPCR. The manuscript was written by PM and revised by the other authors.
FUNDING PM, OK, and DS were supported by the DFG-funded SFB973, the Marie Curie International Training Network "EpiTRAITS" and by Freie Universität Berlin within the Excellence Initiative of the German Research Foundation.