Genome-Wide Analysis of Polyadenylation Events in Schmidtea mediterranea

In eukaryotes, 3′ untranslated regions (UTRs) play important roles in regulating posttranscriptional gene expression. The 3′UTR is defined by regulated cleavage/polyadenylation of the pre-mRNA. The advent of next-generation sequencing technology has now enabled us to identify these events on a genome-wide scale. In this study, we used poly(A)-position profiling by sequencing (3P-Seq) to capture all poly(A) sites across the genome of the freshwater planarian, Schmidtea mediterranea, an ideal model system for exploring the process of regeneration and stem cell function. We identified the 3′UTRs for ∼14,000 transcripts and thus improved the existing gene annotations. We found 97 transcripts, which are polyadenylated within an internal exon, resulting in the shrinking of the ORF and loss of a predicted protein domain. Around 40% of the transcripts in planaria were alternatively polyadenylated (ApA), resulting either in an altered 3′UTR or a change in coding sequence. We identified specific ApA transcript isoforms that were subjected to miRNA mediated gene regulation using degradome sequencing. In this study, we also confirmed a tissue-specific expression pattern for alternate polyadenylated transcripts. The insights from this study highlight the potential role of ApA in regulating the gene expression essential for planarian regeneration.

during mouse embryonic development, differentiation of monocytes, and in mammalian neurons Shepard et al. 2011). In contrast, the average 39UTR length was observed to decrease during the developmental stages in Caenorhabditis elegans (Mangone et al. 2010). Further, ApA was shown to contribute to oncogene activation through the loss of 39UTR suppressor elements (Mayr and Bartel 2009). Recently, ApA was also implicated in the regulation of membrane protein localization (Berkovits and Mayr 2015).
The planarian Schmidtea mediterranea serves as an ideal model system to study stem cell function and regeneration (Oviedo et al. 2008). The unique property of planaria to regenerate any organ, including the brain, is attributed to the presence of adult pluripotent somatic stem cells called neoblasts (Sanchez Alvarado 2003). Various genes and gene regulatory mechanisms have been implicated in neoblast function (Rossi et al. 2008;Pearson and Sanchez Alvarado 2010;Onal et al. 2012;Hubert et al. 2013;Sasidharan et al. 2013;Solana et al. 2013;Zeng et al. 2013). Recent efforts have identified several RNA binding proteins, mostly involved in translation regulation and mRNA stability, that are enriched in neoblasts. These were shown to be essential for stem cell maintenance and differentiation (Rouhana et al. 2014(Rouhana et al. , 2010. At present, the study of posttranscriptional gene regulation in the planarian model system is limited by the lack of well-annotated 39UTRs. In order to address this issue, we carried out poly(A) position profiling ) on the transcriptome of S. mediterranea. Using the 31,377 poly(A) sites that were identified, we could reliably annotate the 39UTRs for 44% of the protein coding genes. Around 40% of the annotated transcripts were associated with more than one poly(A) site, hinting at the predominant role played by ApA in regulating gene expression. We identified around 97 transcripts in planaria that are polyadenylated within their ORFs. Furthermore, we explored the functional association between miRNA-mediated gene regulation and ApA, which revealed that miRNA binding sites are either clustered around the proximal or distal polyadenylation sites. We confirmed tissue-specific expression of 39UTR isoforms arising from alternate polyadenylation. Thus, the work presented here highlights the potential role played by ApA in planarian regeneration. This study will also help to identify the regulators essential for the ApA mechanism in planarians.

Identification of polyadenylation machinery in planaria
We used a polyadenylation machinery protein sequence from humans as a query and blasted it against the PlanMine database to determine the corresponding homologous in planaria (Brandl et al. 2015). For 12 of the proteins (CFIm25, CPSF100, CPSF30, CPSF160, CPSF73, CstF50, CstF77, CstF64, ClpI, PAPOA, PAPOB, and PAPOC) the homolog could be easily determined. The corresponding IDs from the Ox_Smed_V1 version of the transcriptome could be identified and the presence of these were experimentally verified through RT-PCR. For the four proteins from the polyadenylation machinery (CFIm59, CFIm68, hFIP1, and Pcf11), the sequence search through the PlanMine database did not result in a reliable hit. To rule out the possibility that these could be missed due to poor genome assembly/annotation, we searched for similar sequences in the recent version of the Smed genome (Robb et al. 2015). The query through BLAST resulted in hits that had partial similarity to the human homologs. All the orthologous sequences for these proteins were downloaded from the gene tree available at the Ensemble database. The gene tree was reconstructed by including the hits from the Smed genome along with orthologs from other organisms using ete3 (Huerta-Cepas et al. 2010). The percentage identity matrix (PIM) was also constructed through multiple sequence alignment using clustal omega (Sievers et al. 2011). Both the constructed gene tree and the PIM confirmed the sequence divergence of these four proteins in planaria (Supplemental Material, Figure S1, Figure S2, Figure S3, and Figure S4).

Preparation of 3P-Seq library
The 3P libraries were prepared as per the standard protocol previously described in the literature ) with some modifications. The total RNA amounting to 200 mg was extracted using the trizol method from both sexual and asexual planaria, S. mediterranea, to prepare 3P libraries. The extracted RNA was subjected to RNase H treatment for 25 min. The cDNA was prepared from the library using Superscript III RT (Catalog no. 18080044) and the derived product was directly amplified using PCR (25 cycles) without desalting. The adaptor (Illumina genomic sequencing adapter) ligated product was sequenced using an Illumina GAIIx platform.

Analysis of 3P-Seq reads
In order to determine the polyadenylation sites in S. mediterranea, poly(A) RNAs were isolated from the whole organism (both sexual and asexual forms separately) and then subjected to 3P-Seq. Around 165 million and 103 million reads were obtained for the sexual and asexual forms, respectively. These reads were combined, processed, and mapped to the SxlV3.1 genome using bowtie (Langmead et al. 2009). The raw reads obtained from the 3P-Seq protocol  were reverse complemented. The reads ending with a poly(A) stretch were trimmed down to contain a maximum of two A's at the end. These were then selectively mapped onto the genome and the reads were considered as 3P-tags if: a) the alignment ended with the mismatch in the 39, end as it would signify the untemplated adenosine nucleotide added by poly(A) machinery; and b) the minimum length of read was above 20 nucleotides (nt). Raw numbers are shown in Figure S5. The reads were aligned to the S. mediterranea genome using bowtie with -m 4 and -v 2 options. Around 77 million reads aligned to the genome, among which 34 million reads qualified as 3P-tags. The density profile of 3P-tags across the whole genome revealed that 81% of the genomic locus is covered with a minimum depth of three 3P-Seq reads ( Figure  S5). Hence, we used a minimum cut-off of three 3P-tags to identify the 3P-peaks. The 3P-peak has a minimum coverage of three 3P-tags associated across the complete length. This stringent scan across the entire genome involving the above cut-offs, resulted in 31,377 unique polyadenylation sites. The identified cleavage sites have a median length of 50 nt ( Figure S5) and were highly AT-rich ( Figure S5). The in-house python scripts used to process the raw FASTQ data can be obtained from github (https://github.com/VairavanL/3PSeq_analysis).

Ion torrent sequencing
The total RNA, amounting to 7 mg extracted from both the sexual and asexual worms, was subjected to cDNA conversion using OligodT primers from an AffinityScript QPCR cDNA Synthesis Kit. The cDNA was fragmented using Covaris. The fragmented DNA was quantified and subsequently utilized for library preparation. End-repair and adapter ligation was carried out according to the protocol (Ion plus fragment library kit, # 4471252). The samples were bar-coded at this step and then cleaned using Ampure XP beads. The library was amplified using eight cycles of PCR for enrichment of adapter-ligated fragments. The amplified product was again cleaned using Ampure XP beads. The prepared library was quantified using Qubit and validated for quality by running an aliquot on a High Sensitivity Bioanalyzer Chip (Agilent). Ion PGM sequencing was performed on this sample. Around 2.3 million reads were derived from the transcriptome through ion torrent sequencing after poly(A) pulldown. Approximately 70% of these sequenced reads were . 50 nt in length ( Figure S12). The reads containing poly(A/T) (28.6%) were selected as these would represent the polyadenylated mRNAs. The selected reads were subsequently trimmed to remove poly(A). We used blast (Altschul et al. 1990) and bowtie to align the trimmed reads to the genome. Around 0.37 million reads were successfully aligned and these covered 18,834 of 31,377 3P-peaks.

Scanning for poly(A) signals (PAS)
In order to identify a statistically significant hexamer that is enriched across the identified 31,377 peaks, we adopted a pipeline depicted in Figure S6. A random dataset was generated by shuffling the 3P-peak sequences using the Knuth-shuffle algorithm. The probability of hexamer occurrences [4096 (4 6 )] is calculated across all positions in both shuffled and unshuffled sequence datasets. The z-score is calculated from the occurrence value for each position. The z-scores are then converted into P-values using the pnorm function in R and the P-value is adjusted using Bonferroni corrections. A particular signal was considered as an enriched hexamer motif at a specific position if it had an adjusted P-value , 0.01 and at least a twofold enriched occurrence compared to a random sequence. We identified 737 enriched hexamer signals across various positions in the peak. These 737 hexamers are position-dependent enriched motifs. Among these, the canonical signal 5'-AAUAAA-3' was present in 14,183 peaks. A randomized dataset of 3P-peaks was generated to test for the significance of these hexamers and the whole pipeline was repeated on this dataset. We did not find any significant hexamer enrichment. This shows the robustness of our methodology. We also looked for pentamer, heptamer, and octamer sequence enrichment in the identified peaks. We did not find any enriched motifs of these sizes.
To identify other signals (minor) apart from 5'-AAUAAA-3' that could also act as PAS, we adopted a method as illustrated in Figure S6. We performed Fisher's exact test to identify minor signals. Total peaks were divided into two categories: one that housed the canonical PAS (14,183 peaks) and another that did not contain canonical PAS (17,194 peaks). We considered 736 signals (apart from the canonical signal) that were enriched and performed a two-category association test to determine in which category they were enriched. Based on the association value (i.e., odds ratio , 0.75 and P-value , 0.01), 82 signals were identified that could likely represent the minor signals. In order to test the robustness of this protocol, the whole analysis was also repeated with different cut-offs of 3P-tags (ranging from four 3P-tags up to 100 3P-tags). Although the number of 3P-peaks reduced with the increase in cut-off, the positional enrichment of the major PAS signal, along with the secondary ($ 58% retained) and the minor signal ($ 44% retained), did not change ( Figure S7).
A directed search was also performed to identify the PAS associated with each of 31,377 3P-peaks by scanning the sequence within the distance of 10-40 nt from the cleavage site (39 end of the sequence). The first priority was given to the canonical signal 5'-AAUAAA-3', and around 40.8% peaks were found to house this signal within 40 nt from the cleavage site. The second priority was given to all hexameric sequences that had just one mismatch to the canonical signal 5'-AAUAAA-3'. The one closest to the cleavage site was considered as PAS for the peak. Additionally, of the remaining peaks, around 40.4% could now be associated with PAS containing a variant of the canonical signal with one mismatch. Similarly, the same process was repeated with PAS containing two mismatches. Finally, around 97.5% (30,994) of the polyadenylation sites were assigned with a PAS signal ( Figure S6).

Identification of hexameric motifs using MEME
We used MEME 4.10.2 (Bailey et al. 2006) to predict PAS along with two other strategies. We used 39UTR sequences from zebrafish, Drosophila, humans, C. elegans, and mouse as positive background datasets for MEME. We categorized the identified polyadenylation sites into three bins based on its coverage as top, median, and last 1000 sites. When we individually ran MEME on these three datasets, we identified 5'-AAUAAA-3' as a major enriched signal. Along with 5'-AAUAAA-3', we got other variant signals, which overlapped with 737 signals that we identified before ( Figure S6).

Transcript and peak statistics
There are 43,209 contigs in the smed genome (Version 3.1). In total, 31,067 maker transcripts are spread across 16,053 contigs. Among these, 7219 contigs (9449 maker transcripts) have only maker-annotated transcripts and no identified 3P-peaks falls into these contigs. On the contrary, there are 3793 contigs (5988 3P-peaks) that have only 3P-peaks but no annotated transcripts that fall within them. There are 8834 contigs that have both annotated transcripts and 3P-peaks associated with them. These 8834 contigs constitute 21,577 transcripts and 29,134 3P-peaks. Similarly, in the Ox_Smed_v1 transcriptome, 2432 contigs (3175 transcripts) have only annotated transcripts. 5145 contigs (7916 3P-peaks) have only peaks associated without containing any transcripts. In the dd_Smed_v4 transcriptome, 3888 contigs (6459 transcripts) have only transcripts and 4341 contigs (6566 3P-peaks) have only 3P-peaks coming from them.

Stitching of transcripts
We could classify the association of polyadenylation sites to the transcript into five different categories. Polyadenylation sites that falls inside the transcript were excluded from this analysis, except for the ones that are present in the last exon and end after the annotated transcript model. The transcripts that have a 3P-peak outside the annotated end were extended to get the complete gene model by utilizing the genomic information ( Figure S9). We used four datasets (Genscan, Genpred, Cufflinks from RNA-seq, and the EST database) to fill the gap between known transcript model ends and the polyadenylation site start. At least two of the four datasets should support a region to extend the existing transcript model. If a transcript is associated with more than one polyadenylation site, we considered it as a different 39UTR isoform and stitched each polyadenylation site individually. For comparison of the 39UTR lengths across different organisms, we derived the information from biomart (Ensemble Genes 84 version). The longest transcript isoform was chosen for each gene to calculate the 39UTR length.
Using RNA-seq data as evidence for stitch and validation of cleavage site We considered those transcripts that have one cleavage site and filtered them by genomic length. A cut-off of 1 kb between transcript end and peak start was used. Transcripts that satisfied these two conditions were taken for analysis. We divided these transcripts into two bins. One was the Upstream bin, which had sequences corresponding to the region between the transcript end and peak start. We also took exactly the same genomic length sequence as the Upstream bin after the cleavage site and called it the Downstream bin. We made sure that there was no transcript that started in the Downstream bin. To calculate RNA-seq coverage we used the following formula.
The total numbers of reads are reduced by half as it represents pairedend data. Coverage is normalized to the total length of the transcript.

Validation of the candidates through RT-PCR
The forward primers were designed from the region within the annotated transcripts and the reverse primers were designed from the region spanning the derived 3P-peak. Candidates were chosen based on the strength of the 3P-peaks (number of 3P-tags within the 3P-peak region). Candidates having a similar number of 3P-tags on both the proximal and distal sites were selected. As the 3P-Seq was performed from the RNA isolated from the whole organism, this would ensure that both the isoforms of the transcripts are expressed. These were further filtered to ensure that the distance between the proximal and distal peaks was $2 kb. 1 mg of RNA was converted to cDNA using SSII RT (Invitrogen). The PCR primers for cDNA were designed as shown in Figure 4E ( Table S9). The RT-PCR was performed using LA Taq (TAKARA). Amplicons were run on agarose gel and subsequently sequenced.

Degradome sequencing
We adopted parallel analysis of RNA ends (PARE), a specific sequencing method, to identify degraded RNA products (German et al. 2008;Karginov et al. 2010;Cass et al. 2016). These degraded RNAs are the by-products of AGO-DROSHA processing. Total RNA was isolated from sexual and asexual strains of S. mediterranea. Around 500 ng of RNA was directly adapter ligated using T4 RNA ligase, to select only for degraded mRNA that would have a free 59 phosphate. The ligated RNA was then reverse transcribed using oligo dT (with an attached adapter sequence) and PCR amplified using primers from the TruSeq Illumina Small RNA library. The library was purified using Ampure XP beads and sequencing was performed on an Illumina HiSeq-1000. We got 61 and 72 million reads from the sexual and asexual strains, respectively. These reads were adapter trimmed and the first 15 nt of the reads were used for alignment to the genome and transcriptome using bowtie allowing no mismatches. To find miRNA correspondence for each tag, we used the FASTA v36 program using the parameters -n -H -Q -f -16r +15/-10 -g -10 -w 100 -W 25 -E 100 -i -U and scored the alignments using a match-mismatch profile in the seed and guide region. The analysis pipeline used here has been published previously in the literature (Moran et al. 2014). In metazoans, the requirement for complementarity between the miRNA and its corresponding target for AGO cleavage is still debatable. The previously published literature hints at the requirement of complementarity at the ninth, 10th, and 11th nucleotides, along with the seed region for AGO-mediated mRNA degradation (Shin et al. 2010;Elkayam et al. 2012;Addo-Quaye et al. 2009). Hence, we selected for sequence complementarity between the degradome tag and the miRNA, even at the ninth, 10th, and 11th positions. This would ensure that the derived degradome tags were a result of AGO cleavage. We then specifically analyzed these degradome tags based upon their location on the proximal or distal 39UTR region.
miRNA predictions using miRanda All of the transcripts that had two cleavage sites associated with them were chosen for prediction of miRNA targets. Transcripts that had both the cleavage sites within them were not considered further for the analysis. We extracted the sequences corresponding to the proximal and distal UTRs from the transcripts. These sequences were specifically scanned for miRNA binding sites using a tool called miRanda (John et al. 2004). We downloaded all the known miRNAs of S. mediterranea from miRbase (Griffiths-Jones et al. 2008). The predicted binding sites from miRanda were filtered based on the complementarity score (score $ 140) calculated from an algorithm, as well as the free energy of optimal complementary base pair interaction between the 39UTR and miRNA sequence (DG # 214Kcal=mol). These filtered binding sites were normalized to the length of the 39UTR region.
Control dataset for distribution of miRNA binding sites A randomized dataset was generated to determine the statistical significance for bimodal distribution of miRNA binding sites in transcripts with two 3P-peak candidates. The dataset was generated by randomizing the association of a second 3P-peak to single 3P-peak-containing transcripts. Around 1550 (same size as the actual dataset) single 3Ppeaks were also randomly chosen 100 times. In order to capture the native distance distribution between two 3P-peaks, only the distal 3Ppeak from two 3P-peak candidates were randomly chosen, along with the region between the peaks and stitched to single 3P-peak candidates. We predicted miRNA binding sites and normalized the dataset as described above. We performed Hartigan's dip test for multi-modality for the randomized 100 datasets to see how many of the randomized datasets followed a multi-modal distribution as an actual dataset. Significant P-values and a higher Dip score suggested multimodal distribution. Randomized datasets with P-values , 0.05 were considered to be multimodal ( Figure S14).

Data availability
The data generated in this study has been deposited in the National Center for Biotechnology Information Sequence read archive (SRA) database. Accession ID: SRP070102. All the NGS data used in this study can also be visualized through JBrowse applet (Westesson et al. 2013) at http://bugbears.ncbs.res.in/planaria3p.

Identification of polyadenylation machinery in planaria
The process of polyadenylation involves two major reactions. The first reaction is the endonucleolytic cleavage of pre-mRNA at a specific site. This is followed by the polymerase reaction that adds the polyadenylated tail to the cleaved product. These reactions require various protein factors that are specifically guided to the cleavage site, dictated by the ciselements present within the pre-mRNA (Millevoi and Vagner 2010;Colgan and Manley 1997;Zhao et al. 1999;Davila Lopez and Samuelsson 2008;Tian et al. 2005;Barabino and Keller 1999). The polyadenylation machinery comprises four multi-protein complexes: CPSF (cleavage and polyadenylation specificity factor), CstF (cleavage stimulation factor), CFI, and CFII (cleavage factor I and II, respectively). The components of the polyadenylation machinery in planaria were computationally identified based on their sequence similarity to the known components in the human genome. Of the 16 protein subunits that comprise the polyadenylation machinery, 12 had homologs in planaria ( Figure 1A). For the four proteins [hFIP1 (CPSF subunit), CFIm 59/68 (components of CFI), and Pcf11 (component of CFIIm)], confident planarian homologs could not be identified (due to low sequence identity to human homologs; Figure S1, Figure  S2, Figure S3, and Figure S4). The presence of transcripts coding for these polyadenylation machinery components were subsequently confirmed through PCR, followed by sequencing ( Figure 1B). This suggests that the classical eukaryotic mRNA 39end-processing machinery is well-conserved in planarians.

Characterization of polyadenylation sites reveals a conserved mechanism in planarians
In order to determine polyadenylation sites in S. mediterranea, poly(A) RNAs were isolated from the whole organism (both sexual and asexual forms separately) and then subjected to 3P-Seq. Sequencing reads thus obtained were mapped to the SxlV3.1 genome using bowtie (Langmead et al. 2009). A read was considered as a 3P-tag if the alignment to the genome resulted in a mismatch of "A" at the 39 end, which would be an untemplated adenosine from the poly(A) tail ( Figure S5). The alignment thus obtained was used to derive 3P-peaks, defined as discrete regions on the genome covered by the 3P-tags. Each of these peaks could represent an individual polyadenylation event. In total, 31,377 polyadenylated sites were identified from this analysis (Table S1).
Next, we looked for cis-regulatory elements associated with the 3Ppeaks identified in our study. We found an enrichment of "U" around the cleavage site, a property which is conserved across eukaryotes (Figure 2A). These U-rich regions are specifically recognized by the CFI and CFII complexes. The A-rich region, corresponding to PAS recognized by CPSF, was also observed 10-30 nt upstream of the cleavage site. The CPSF complex cleaves adjacent to a "CA" dinucleotide, which is conserved in eukaryotes (Mandel et al. 2006). In planarians, in addition to "CA," we also observed an enrichment of "UA" and "UU" near the cleavage site ( Figure 2B). The CFIm-59/68 (CFI subunit), which recognizes the "UGUA" site, could not be detected in planarians using a sequence homolog search. This is also further supported by the lack of a "UGUA" motif upstream of the PAS in planaria.
In higher eukaryotes, the canonical hexameric sequence of "AAUAAA" is located 10-30 nt upstream of the cleavage site ( Figure  2C) (Colgan and Manley 1997). This motif is specifically recognized by the CPSF160 subunit (Schonemann et al. 2014). The machinery also tolerates a single/double nucleotide variation within this canonical hexamer. In planaria, we found "AAUAAA" and its four single-nucleotide variants (AAUAUA, AUUAAA, AAUACA, and AAUAGA) enriched 10-30 nt upstream of the cleavage site ( Figure 2D). These four hexameric variants have also been reported to act as PAS in other metazoans (You et al. 2015) (Figure S6). This is consistent with the results of a motif search using the MEME software ( Figure S6). The PAS hexameric signal, including the single and double-nucleotide variants, were found in 98.8% of 3P-peaks. Around 41% of these 3P-peaks contained the canonical "AAUAAA" signal; another 40.4% contained a single-nucleotide variant of the canonical signal; and the remaining 16.3% peaks had two nucleotide variations ( Figure S6 and Table S1).
We also determined the resolution of the polyadenylation sites by measuring the genomic distances between adjacent 3P-peaks. The distance is defined by the number of nucleotides between the end of one 3P-peak and the beginning of the neighboring peak. We observed 1390 peaks that were separated by a distance of only 100 nt. The gap between a majority of these 3P-peak pairs was between 10-35 nt ( Figure 2E and Figure S8). To rule out the possibility that these short distances are not merely single unresolved polyadenylation sites, we tested for the presence of polyadenylation signals within them. For this analysis, 150 bp spanning both upstream and downstream of the distal cleavage site was used. The enrichment of the U-rich signal (Figure 2, F and G) along with the PAS (determined by fixing the proximal cleavage site, Figure S8) confirmed these to be genuine 3P-peaks.

Annotation of 39UTRs in planaria
We associated each transcript to its nearest 3P-peak to determine its polyadenylation site. For this, we used three different transcriptome models of planaria: a) Maker, a computational pipeline that performs abinitio gene prediction and annotation of the genome (Cantarel et al. 2008); b) Ox_Smed_v1 (Blythe et al. 2010); and c) dd_Smed_v4 (Liu et al. 2013), both of which are derived from de novo transcriptome assemblies of RNA-Seq data. The polyadenylation sites obtained could be categorized depending upon its position with respect to the transcript model ( Figure 3A, and Table S2). Around 92% of these polyadenylation sites were present on the terminal exon (Category 4, Figure S9). We also observed polyadenylation sites (43.51%) outside the transcript models (Category 1 in Figure 3A), emphasizing that these transcripts could have longer 39UTRs than previously annotated. We found that $40% of transcripts were associated with more than one polyadenylation site ( Figure 3B), which is close to the estimated ApA in other eukaryotes (Derti et al. 2012). For those transcripts with a polyadenylation site outside the annotated transcript, we extended the transcript to include the nearest 3P-peak. Various sources of evidence such as RNA-Seq reads, Genscan (Burge and Karlin 1997), and Genpred were used to confirm the extended transcripts ( Figure 3C and Figure S9). Around 70% of the extended transcripts, across the three transcriptomes, had a length , 2 kb ( Figure 3D). For 213 transcripts, we were able to extend the protein coding region by more than 15 amino acids ( Figure S9). For 21 of these transcripts, we were able to assign pfam domains to the extended region ( Figure S9). Around 9972 unique 3P-peaks that were mapped to genomic contigs had no transcript annotations (Category 6, Figure S9). These could be putative long noncoding RNAs (Ulitsky et al. 2011), which needs to be further characterized. We also observed cleavage sites that were in the opposite orientation to a neighboring transcript, which suggests the presence of antisense transcripts in these loci (Category 7, Figure  S9). To remove the redundancy arising from the pooling of transcripts from three different datasets, we clustered all protein sequences (coded by longest ORFs) based on sequence similarity using Cd-hit (Li and Godzik 2006). A high sequence identity cutoff of 90% was used to retain the paralogs. This resulted in 31,990 gene clusters ( Figure S10 and Table S3). The longest protein sequence from each cluster was chosen as representative of that cluster to identify the length of 39UTRs. The median length of the 39UTR was found to be around 176 bp, as compared to 138 in C. elegans, 235 in Drosophila, 518 in zebrafish, and 1198 in humans ( Figure 3E).
We found that 3639 transcripts contained two polyadenylation sites. We analyzed the occurrence of hexameric sequence motifs in these 3Ppeaks. The canonical hexameric sequence "AAUAAA" was enriched in the distal compared to the proximal site (Figure 3, F and G; t-test unpaired: P-value , 2 x 10 216 ). The overall abundance of this canonical "AAUAAA" hexameric signal decreases with the increase in the number of polyadenylation sites per transcript ( Figure 3H, t-test unpaired: P-value , 2 x 10 216 ). This suggests that most of the transcripts associated with multiple 3P-peaks have a combination of strong (canonical), and a variant of the canonical, signals, which could have evolved to play a regulatory role. Further, we also explored the association of the canonical signal and its variants with the expression levels of the transcripts, but we did not find any correlation between the expression levels of the transcripts and the corresponding hexameric PAS signals associated with them ( Figure S11).

Validation and confirmation of cleavage sites and polyadenylation signals in planarians
The association of transcripts with polyadenylated sites derived from 3P-Seq were validated using various approaches. Transcripts with a single 3P-peak were specifically selected and their 39 UTR expression probed using publicly available RNA-Seq data (Resch et al. 2012). We found that the region upstream of the polyadenylation site has more RNA-Seq coverage in comparison to its downstream region (Figure 4A and Figure S12; Wilcox test, paired: P-value , 2.2 · 10 216 ). This difference in RNA-Seq coverage increases with the strength (number of 3P-tags) of the 3P-peak ( Figure S12). Although we do see a few outliers with more reads in the downstream, most of these outliers may code for putative transcripts (as determined by Genscan, Genpred, and EST databases) ( Figure S12). The other possible explanation for these outliers could be the presence of a downstream polyadenylation signal that could not be captured by the current depth of 3P-Seq.
The 39UTRs derived from 3P-Seq were further confirmed through ion-torrent sequencing (Rothberg et al. 2011) of polyadenylated mRNAs. This technology adopts a different chemistry optimized to derive reads of longer length (100-300 bp, Figure S12). The reads derived from this ion-torrent sequencing were trimmed to remove the polyadenylated tail and subsequently mapped to the genome. The alignment covered around 42% of the 3P-peaks derived from 3P-Seq, and the majority of these reads mapped upstream of the cleavage site, thus validating the 3P-seq data ( Figure 4B). Thus, RNA-Seq reads and ion-torrent reads coupled to 3P-Seq data from this study can help to obtain complete gene models in planaria ( Figure 4C).
Some of the transcripts associated with polyadenylation sites were also experimentally validated using RT-PCR and Sanger sequencing. The validation ( Figure 4D) clearly suggests that existing transcript models coding for TNF2, Rabl2, and Ppp2r1b can be extended to capture the complete 39UTR. Some transcripts with two polyadenylation sites were also experimentally validated through RT-PCR. We found that genes involved in the regulation of cell fate and patterning like smed-wnt, smed-noggin, and smed-ybox-4 undergo ApA.

Biological functions mediated by alternate polyadenylation
Around 40% of the transcripts in planaria were associated with more than one polyadenylation site, which emphasizes the need to explore the biological significance of ApA. A Gene Ontology (GO) enrichment analysis was carried out for these transcripts. The following biological processes were significantly enriched: hemophilic cell adhesion via plasma membrane (GO:0007156), transmembrane transport (GO:0055085), the G-protein coupled receptor-signaling pathway (GO:0007186), and ion transport (GO:0006811). We also observed the enrichment of genes involved in cell cycle arrest (GO:0007050) and the regulation of sequence-specific DNA-binding transcription factors (Table S4). The molecular functions that were enriched included: zinc-ion binding (GO:0008270), DNA/nucleic acid binding (GO:0003677, GO:0003676), protein serine/threonine kinase activity (GO:0004674), protein dimerization activity (GO:0046983). and ribonuclease activity (GO:0004540) ( Figure S13 and Table S4).
Candidate-based approaches have previously reported that ApA plays a role in regulating biological processes such as the production of membrane-bound and secreted immunoglobins (Alt et al. 1980;Rogers et al. 1980;Early et al. 1980), T-cell activation (Sandberg et al. 2008), hippocampal neuron stimulation (Flavell and Greenberg 2008), embryonic development Mangone et al. 2010;Hilgers et al. 2011), lymphomas (Singh et al. 2009), and breast cancer (Fu et al. 2011). We performed a literature survey to assemble a list of genes known to be regulated by ApA (Mayr and Bartel 2009;Smibert et al. 2012;Sun et al. 2012). Around 70 genes were identified by this survey (Table S5) and homologs for 22 of them could be found in planaria. Of these, 10 were associated with more than one polyadenylation signal. Functional consequences of alternate polyadenylation (ApA) For transcripts containing two 3P-peaks, we investigated their location relative to the known transcript model. These peaks could either be located on the 39UTR, giving rise to tandem 39UTR ApA, or on the coding regions resulting in coding region polyadenylation (CR-ApA) (Elkon et al. 2013). For a majority of transcripts ($80.1%, 1213 transcripts), the multiple ApA sites were all located on the last exon resulting in tandem 39 UTR ApA ( Figure 5A). This tandem 39UTR ApA results in mRNA with an altered 39UTR sequence and thus, could regulate posttranscriptional gene expression. The altered 39UTR produced by ApA can result in a loss or gain of miRNA binding sites Figure 4 Validation of polyadenylation sites and transcript association. (A) A schematic highlighting the coverage of ion-torrent reads (black) and RNA-seq reads in the regions upstream (red) and downstream (blue) of polyadenylation sites determined from 3P-Seq (i). The bar plot shows the comparison between the number of transcripts associated with one 3P-peak and those transcripts that were also mapped by the RNA-seq reads (ii). The scatter plot depicts the transcripts with the coverage of RNA-seq reads in the inner (blue) and outer (red) region of the polyadenylation site (iii). The bar plot depicts comparison of the number of transcripts associated with one 3P-peak and those transcripts that were mapped by iontorrent reads across different transcriptomes (iv). The scatter plot depicts the transcripts with the coverage derived from the ion-torrent reads aligned to the upstream and downstream regions of the cleavage site (v). (B) The coverage of ion-torrent reads for the region spanning the 3Ppeaks. (C) A coverage plot showing the coverage of RNA-seq reads and ion-torrent reads across different transcript models for ARS2 domaincontaining protein. The corresponding 3P-peak derived for it supports the ox_Smed_V1 and dd_Smed_V4 models. (D) Single peak candidates that were experimentally validated through PCR amplification. A schematic highlighting the strategy for designing forward (blue) and reverse primers (red). (E) The transcripts with two 3P-peaks were experimentally validated using PCR. The upper band represents the amplification of longer transcript isoform and the lower band represents the shorter form of the transcript. 3P-Seq, poly(A)-position profiling by sequencing; PCR, polymerase chain reaction; RNA-Seq, RNA sequencing.
( Figure 6A). Hence, a systematic search was performed to predict miRNA binding sites within the 39UTRs of the transcripts using miRanda (John et al. 2004). This is the first comprehensive prediction of miRNA binding sites in the planarian genome ( Figure 6B). We defined the proximal 39UTR to include the region from the end of the last exon to the proximal cleavage site. The distal 39UTR involved the region between the proximal and the distal cleavage site. Thus, we ensured that both these regions had no overlap for miRNA binding site prediction. For transcripts with two polyadenylated sites, we observed that miRNA binding sites are preferentially located either at the proximal or the distal 39UTR. This is reflected by the bimodal distribution of the occurrence of these sites along the transcript. This trend was rarely observed in a randomized control dataset ( Figure  6C and Figure S14). Around 14 out of 100 randomized datasets displayed a bimodal distribution (as reported by Hartigan's dip test with P-value , 0.05), but seven of these bimodal distributions were qualitatively different from the actual profile ( Figure S14).
We further performed degradome sequencing to experimentally identify the miRNA targets in the planarian transcriptome. This method has been previously used to identify miRNA targets in other metazoans (Cass et al. 2016;Karginov et al. 2010). The reads that map to the miRNA targets are those that are cleaved by the AGO machinery because of the complementarity at the ninth, 10th, and 11th nucleotides. The targets that were translationally repressed will not be captured by this method. The mapped reads that represent the miRNA binding sites on the transcript were annotated as degradome tags (Table  S6). We identified around 219 tags that mapped to 259 transcripts having two polyadenylation sites associated with them. These tags mapped exclusively either to the proximal (135 transcripts) or the distal (104 transcripts) 39UTRs of the transcripts. Very few transcripts (20 transcripts) had an equal number of degradome tags on both the proximal and the distal 39UTR ( Figure 6D).
Later, we also investigated ApA events that alter the protein coding sequence and subsequently its biological function. Around 19% of the transcripts had polyadenylation sites located within introns/exons; these are called coding region alternate polyadenylation (CR-ApA) sites. We found ApA sites on more than one exon for around 270 transcripts ( Figure 5B and Table S7). Of these, 97 could potentially result in truncated proteins (Table S7). We found that truncation could result in the loss of certain domains from these proteins. These domains include an EF-hand domain pair, a WD domain, tetracopeptide repeats, dynein heavy chain region D6 of dynein motor, a myosin tail, Ras family, and a methyltransferase domain.
We performed whole mount in situ hybridization (WISH) to study the expression pattern of alternatively polyadenylated transcripts dd_smed_V4_650_0_1 (p25a) and dd_smed_V4_3757_0_1 (slmb). The in situ probe designed from the region upstream of the proximal PAS signal will recognize both the isoforms, whereas the probe designed from the region immediately upstream of the distal PAS signal will specifically hybridize to the longer isoform of the transcript. The whole mount in situ hybridization using the probe, which recognizes both the isoforms of the transcript p25a showed expression in the anterior-most region of the head, with characteristic expression at the dorsal tip and around the pharynx. The longer isoform of the transcript was expressed in the anterior-most region of the head but its expression was not seen at the dorsal tip. Weak expression of the longer isoform was seen around the pharyngeal region. The other transcript, dd_smed_V4_3757_0_1 (slmb), has also been reported to be alternatively polyadenylated in Drosophila (Smibert et al. 2012). The in situ hybridization for dd_smed_V4_3757_0_1 using the probe designed to recognize both the isoforms showed its expression in the pharynx, photoreceptor, and cephalic ganglion, whereas the probe designed to recognize the longer isoform revealed its expression only within the pharyngeal region ( Figure 5D). These results demonstrate the tissue-specific expression of ApA isoforms, which implies the role of ApA in tissue organization and function in planaria.

DISCUSSION
This study is the first to report genome-scale polyadenylation events in S. mediterranea using poly(A) position profiling (3P-Seq). An in-house computational pipeline was developed to reliably identify the cleavage site and to determine the PAS associated with it. Nearly 14,000 transcripts could be annotated with their 39UTRs. The analysis presented here was done using the older version of the genome. Recently, a newer version of the Smed genome (SxlV4 and AsxlVs1) was released that contains 15,334 scaffolds in comparison to the 43,294 contigs in the previous SxlV3.1 genome (Robb et al. 2015). We repeated the analysis using the new genome to determine the extent of overlap of 3P-peaks between the new and old genome. Around 83% of the 3P-peaks determined using the new genome overlapped ($ 50% overlap) with the previously determined polyadenylation sites using the SxlV3.1 genome ( Figure S15 and Table S8), which lends support to our approach of annotating the 39 UTR from the 3P-seq data.
Our study also identified most of the proteins involved in the cleavage/polyadenylation process with high confidence, except for FIP-1, Pcf11, CFI, and CFII. The best hits obtained for these four genes, along with 12 other genes of the polyadenylation machinery, showed the evidence for their expression in the transcriptome databases (Onal et al. 2012;Labbé et al. 2012;Resch et al. 2012). This suggests that they are not merely hypothetical proteins and are actively expressed within the planarian system. Interestingly, the cell type-specific expression data from these transcriptomes revealed the enrichment (. twofold) of almost all the transcripts of the polyadenylation machinery excluding PAPg (no change), in the proliferating neoblast (X1) when compared to differentiated cells (Xins). Further experiments would be required to decipher the biological significance of the enrichment. In other metazoans, FIP-1 typically binds to U-rich elements (URE) spanning the cleavage/polyadenylated site (Kaufmann et al. 2004). Though we identified URE in the 39 UTR regions of the planarians, the inability to identify the FIP-1 homolog could be due to the high sequence divergence, which cannot be detected using standard homology based searches. The proteins CFIm and Pcf11 (Noble et al. 2007), subunits of the CFII complex, recognize a "UGUA" signal in other eukaryotes. In planarians, lack of a "UGUA" signal at the cleavage site in 39 UTR regions could potentially explain the absence/divergence of CFIm and Pcf11 protein subunits.
In planarians, the canonical hexameric signal "AAUAAA" is the most prevalent polyadenylation signal. We identified several variants of this canonical signal, which were also reported in other metazoans. They function as secondary signals, which were less preferred for polyadenylation compared to the canonical signals (You et al. 2015). In planarians, for those transcripts with more than one 3P-peak, the distal 3P-peaks were mostly associated with a canonical hexameric signal, whereas the proximal 3P-peaks were associated with the secondary signal variants. This trend was also observed in other metazoans, which supports the hypothesis proposed by Shi (2012) that distal PAS may be recognized as default polyadenylation site by the machinery as it is highly conserved, whereas the proximal PAS could mostly play a regulatory role and is less conserved.
Here, we also showed that . 40% of the planarian transcripts were alternately polyadenylated. We identified transcripts that contained the polyadenylation signal in the internal exon that could lead to altered protein products. One such candidate, p25a, was experimentally validated using in situ hybridization. This protein is reported to stimulate the process of tubulin polymerization and also play a key role in a-synuclein aggregation. However, most of the transcripts that undergo ApA have alternate PAS in the 39 UTR regions, which will result in mRNAs with distinct 39 UTRs. We validated the presence of altered 39UTR for slmb (supernumerary limb gene) through in situ hybridization (ISH). The slmb gene is a member of the F-box/WD40 protein family of the ubiquitin ligase SCF complex that targets phosphorylated proteins for degradation. In addition to these transcripts that undergo tissue-specific alternate polyadenylation, we also identified 1606 ApA transcripts that were enriched by at least twofold in the neoblast population by analyzing the previously published RNA-Seq data (Onal et al. 2012) (Table S10). However, the ApA status of these candidates in the neoblast population needs to be validated. This observation suggests that ApA could potentially play an important role in the regulation of gene expression in the neoblast population. The 39UTR is one of the key regulatory elements that determine the fate of the mRNA within the cell. By altering the length of the 39 UTR in a given cell type, the transcript may either get destabilized leading to translational repression or undergo active translation, mediated by RNA-binding proteins that bind to the 39 UTR. The microRNAs, which are translational repressors, have their binding sites on the 39 UTRs. microRNA binding could be altered by changing the length of the 39 UTR. In planarians, we observed that miRNA binding sites on the transcripts with two 3P-peaks were either present upstream of the proximal 3P-peak or the distal 3P-peak in a mutually exclusive manner. Together, these results support the observation that miRNA-mediated regulation of gene expression could be altered by ApA events.
In summary, our study demonstrates 3P-Seq as a powerful tool to identify alternate polyadenylation events on a genome-wide scale. Our analysis is one of the first to provide a detailed account of the 39UTR landscape in planaria. This study will facilitate the exploration of various posttranscriptional regulatory events mediated by the 39UTRs. More detailed study of polyadenylation events across tissue types and regenerating tissues will provide better insights into the role of ApA in the regulation of tissue homeostasis, stem cell function, and regeneration in planarians. This study will also provide a platform to explore the posttranscriptional regulatory mechanisms mediated through 39UTRs in planaria.