The PARA-suite: PAR-CLIP specific sequence read simulation and processing

Background Next-generation sequencing technologies have profoundly impacted biology over recent years. Experimental protocols, such as photoactivatable ribonucleoside-enhanced cross-linking and immunoprecipitation (PAR-CLIP), which identifies protein–RNA interactions on a genome-wide scale, commonly employ deep sequencing. With PAR-CLIP, the incorporation of photoactivatable nucleosides into nascent transcripts leads to high rates of specific nucleotide conversions during reverse transcription. So far, the specific properties of PAR-CLIP-derived sequencing reads have not been assessed in depth. Methods We here compared PAR-CLIP sequencing reads to regular transcriptome sequencing reads (RNA-Seq) to identify distinctive properties that are relevant for reference-based read alignment of PAR-CLIP datasets. We developed a set of freely available tools for PAR-CLIP data analysis, called the PAR-CLIP analyzer suite (PARA-suite). The PARA-suite includes error model inference, PAR-CLIP read simulation based on PAR-CLIP specific properties, a full read alignment pipeline with a modified Burrows–Wheeler Aligner algorithm and CLIP read clustering for binding site detection. Results We show that differences in the error profiles of PAR-CLIP reads relative to regular transcriptome sequencing reads (RNA-Seq) make a distinct processing advantageous. We examine the alignment accuracy of commonly applied read aligners on 10 simulated PAR-CLIP datasets using different parameter settings and identified the most accurate setup among those read aligners. We demonstrate the performance of the PARA-suite in conjunction with different binding site detection algorithms on several real PAR-CLIP and HITS-CLIP datasets. Our processing pipeline allowed the improvement of both alignment and binding site detection accuracy. Availability The PARA-suite toolkit and the PARA-suite aligner are available at https://github.com/akloetgen/PARA-suite and https://github.com/akloetgen/PARA-suite_aligner, respectively, under the GNU GPLv3 license.


Supplementary Methods
For an overview of the performance of different read aligners and binding site detection algorithms on 10 simulated PAR-CLIP datasets, we calculated the precision, recall and accuracy for each. We considered all reads originating from simulated RBP-binding sites (with T-C conversions) as positives and those originating from other areas of the reference (simulated contaminations) as negatives. True positive and negative reads are those which are aligned correctly, whereas false positive and negative reads are those which are wrongly or not aligned (Table 1; Supplementary Table 3). We used BMix, PARalyzer and our hierarchical clustering to obtain the read clusters. Filtering of the clusters generated with the hierarchical clustering was performed as described in Section 2.2. A correctly reported binding site was considered a true positive, a falsely reported cluster (simulated contamination without elevated T-C conversions) as a false positive, an unreported binding site as a false negative and an unreported cluster (without T-C conversions) as a true negative (Supplementary Table 4). Unfortunately, BMix does not report false negative clusters (contaminations) and thus we were not able to calculate the recall nor the accuracy, but only the precision.

Simulation of uridylate-rich and homopolymeric PAR-CLIP reads
To measure the accuracy of the PARA-suite aligner for special types of data (uridylate-rich sequences, which are common in PAR-CLIP and homopolymeric sequences), we generated subsets of our simulated data that contained either >35% T (uridylate-rich sequences) or homopolymeric sequences with stretches of five or more bases of a particular nucleotide.
For the uridylate-rich PAR-CLIP reads, we observed an increase of 1.37% for PARA-suite alignments and an increase of 2.35% in the accuracy for BWA PSSM alignments compared to our basic simulated data (Supplementary Table 5). The accuracy for the PARA-suite decreased by 1.53% but the accuracy was unchanged for BWA PSSM when the PARA-suite was applied to the homopolymeric PAR-CLIP reads (Supplementary Table 5).

Application of the PARA-suite to HITS-CLIP data
Besides PAR-CLIP, other CLIP protocols are also used widely. Therefore, we chose a previously published Argonaute protein HITS-CLIP dataset generated from mouse brain samples (Chi, Zang et al. 2009) to assess the PARA-suite on a different type of CLIP data. To allow a comparison to previous results on the same dataset, we excluded all sequencing reads that were shorter than 25 bases after quality trimming using cutadapt. Next, we determined the error profile for the pooled replicates of the HITS-CLIP dataset using the respective PARA-suite tool to train its alignment pipeline. Here, we could already verify the high rate of deletions in contrast to insertions or single nucleotide substitutions compared to the mouse reference genome sequence GRCm38 (Chinwalla, Cook et al. 2002). Next, we applied the alignment pipeline to the pooled sequencing reads to align them against GRCm38 and against the transcript database of Ensembl genes Version 77 for the mouse genome assembly, and combined the results.
Again, the transcriptomic mapping step revealed 79,658 additional aligned reads spanning exon-exon junctions out of 15,145,095 aligned reads in total (0.526 %). To achieve comparable results for RBPbound transcribed regions in the mouse genome, we used PIPE-CLIP (Chen, Yun et al. 2014), which is a web-based program for cluster enrichment analysis of CLIP sequencing data. We compared our results with the number of cross-linked regions reported in the PIPE-CLIP publication analyzing the same dataset. The filtering criteria were the same as those in the PIPE-CLIP publication with an enriched cluster length of ≥25 bases and exclusion of duplicated sequencing reads by mapping position. After filtering the entire list of cross-linked regions for those that were supported by deletions in the crosslinked sites, we found 1450 significantly enriched regions by applying false discovery rate (FDR) ≤0.01 filtering. This number was substantially larger than what was found by the initial PIPE-CLIP analysis based on read alignments using Novoalign (http://www.novocraft.com) with 1232 cross-linked regions that were supported by deletions, an increase of 17.69% identified regions in total.
We also applied FDR ≤0.001 filtering to compare our results with the first in-depth analysis of the same data (Zhang and Darnell 2011), which used a cross-linking-induced mutation sites (CIMS) analysis. We  Supplementary Table S1: Statistics of FET PAR-CLIP reads (Hoell, Larsson et al. 2011) Table S3: Average performance of short read aligners on 10 simulated PAR-CLIP datasets sorted by accuracy. The runtime for BWA PARA was determined without error profile estimation, whereas the runtime for the entire PARA-suite pipeline includes error profile estimation, and alignment against genomic and transcriptomic reference sequences and both of these in combination. The results for "PARAsuite pipeline" refer to an execution where the parameter X was automatically evaluated (default). The results for "PARAsuite X1", "X2" and "X3" refer to executions with fixed values for X (i.e. X = 1, X = 2 and X = 3; see section "execution commands" for further information).  Figure S1: (A) T-C conversion frequencies (α) in real PAR-CLIP data (summarized over all FET PAR-CLIPs (Hoell, Larsson et al. 2011)) and sorted by T-C sites within highly confident clusters. (B)

Aligner
Probabilities (β) for the preferred read positions of T-C conversion sites within confident clusters. This graph shows a peak at the beginning of the clusters where the majority of T-C conversions occurred.