CONCUR: quick and robust calculation of codon usage from ribosome profiling data

Abstract Summary CONCUR is a standalone tool for codon usage analysis in ribosome profiling experiments. CONCUR uses the aligned reads in BAM format to estimate codon counts at the ribosome E-, P- and A-sites and at flanking positions. Availability and implementation CONCUR is written in Perl and is freely available at https://github.com/susbo/concur. Supplementary information Supplementary data are available at Bioinformatics online.

1 Running CONCUR on published data 1.1 Codon usage changes in differentiating human cells

Data and processing
Self-renewing human stem cells show a specific codon usage profile, which changes during differentiation (Bornelöv et al., 2019). Specifically, codons that rely on the adenine to inosine modification at position 34 (A34I)-the wobble positionin tRNA anticodon loops, are more efficiently translated in the self-renewing state. This was previously measured as a reduction of the A34I modification levels in the differentiated state and an enrichment of ribosome profiling reads corresponding to those codons at the ribosome A-site (Bornelöv et al., 2019).
To confirm that CONCUR detected those enriched codons, we used the published ribosome profiling data (data accessible at NCBI GEO database, accession GSE123611). Trim galore! [https://www.bioinformatics.babraham.ac.uk/ projects/trim_galore] was used to trim small RNA adapters and trimmed reads shorter than 20 nt were discarded. Alignment was done using TopHat2 (v2.1.0). First, the reads were aligned to a set of known rRNA and tRNA genes (selected from the UCSC RepeatMasker tracks), this was followed by alignment of all unmapped reads to the human reference genome (hg38). An index with known transcripts based on Gencode v23 annotations were used for the genome alignment, novel splice junctions were permitted, and multi-mapping reads were discarded. The resulting BAM files were used as input to CONCUR.

Results
CONCUR was used to select read lengths and frames (Table S1), and to calculate the best shifts. The codon counts derived from the selected reads displayed a strong correlation with each other at both the P-and the A-sites (Fig. S1).  Figure S1: Correlation between selected read sets for the Differentiating 1 sample across all read positions.
An example of the resulting codon counts across all positions is shown in Fig.  S2. These two types of figures ( Fig. S1 and Fig. S2) suitable as a quality control are automatically created by CONCUR unless the "--noR" option is used.
This further suggest that considering all read lengths and frames-as done by CONCUR-was more successful than the previoulsy used method to select the dominating frame for the read lengths that showed the strongest periodicity.

Data and processing
In Eukaryotes, the U 34 base in the tRNA anticodon loop carries a 5-methoxycarbonylmethyl (mcm 5 ) or a 5-carbamoylmethyl (ncm 5 ) group, modifications that require the Elongator (Elp) complex. The mcm 5 group is further modified by the addition of a 2-thio group (s 2 ) in tRNA Lys AAA , tRNA Gln CAA and tRNA Glu GAA to produce mcm 5 s 2 U 34 , a modification that is universally conserved (Grosjean et al., 2014). Thiolation requires the combined action of Uba4, Urm1, Ncs2, and Ncs6 and the disruption of thiolation by knockdown of these proteins causes the translating ribosome to stall in Saccharomyces cerevisiae and Caenorhabditis elegans (Nedialkova and Leidel, 2015). This effect is detectable using ribosome profiling as an enrichment of reads with these codons occupying the ribosome A-site.
To validate that CONCUR could be used to detect those differences, we used the ribosome profiling data for yeast from Nedialkova and Leidel (2015) (data accessible at NCBI GEO database (Edgar et al., 2002), accession GSE67387). The trimmed reads were first aligned to a set of known rRNA and tRNAs (selected from the UCSC RepeatMasker tracks) and this was followed by alignment of all unmapped reads to the the yeast (sc3) reference genome using TopHat2 (v2.1.0) (Kim et al., 2013). No novel junctions were permitted in the alignment to rRNAs and tRNAs. An index with known transcripts based on Ensembl Release 40 annotations were used for the genomeic alignment, novel splice junctions were permitted, and multi-mapping reads were discarded. The resulting BAM files were used as input to CONCUR.

Results
CONCUR was used to select read lengths and frames (Table S2), and to calculate the best shifts. The codon counts derived from the selected reads displayed a strong correlation with each other at both the P-and the A-sites (Fig. S4). An example of the resulting codon counts across all positions is shown in Fig.  S5. Table S2: Selection of reads in yeast

Runtime
Each instance of CONCUR is running as a single-threaded process. The runtime for the yeast samples listed in Table S2 was measured to 26-45 minutes (at 3.6GHz) per sample. The runtime was highly correlated to the number of reads, with approximately 2.7 minutes required per million reads (Fig. S8). A typical ribosome profiling sample with 10 millions uniquely aligned reads would therefore require around 27 minutes (0.45 core-hours) to process.

Overview of methods for P-site offset estimation
Identifying P-site offsets is the first step required to estimate codon usage from Ribo-seq data. While there are several Ribo-seq pipelines available dedicated for tasks such as quality control or ORF detection, most of them work as "black boxes" with respect to any P-site offset calculation. Some pipelines rely on external tools for offset calculation, typically Plastid or Ribowaltz (Dunn and Weissman, 2016;Lauria et al., 2018), or have implemented heuristics similar to Plastid, such as identifying the 5' end read count maxima. Other approaches use a pre-defined position for the P-site, such as RiboVIEW (Legrand and Tuorto, 2020) that uses a fixed -12 offset and SPECtre (Chun et al., 2016) that uses the midpoint of the read. Table S3 provides an overview of existing tools developed for P-site offset calculations as well as general Ribo-seq pipelines that have implemented their own offset calculation. Most tools, including CONCUR, require sequenceindependent cleavage of the ribosome-protected fragments with an RNase. Ri-boProP (Zhao et al., 2019) is the only tool listed that is aimed towards MNase digested Ribo-seq and would be the method of choice for such data. Table S3: Dedicated tools and general Ribo-seq pipelines that can perform P-site offset Section 2.2 provides an overview of some key differences between CONCUR and existing tools. Furthermore, in Section 2.3, we compared CONCUR to Plastid (Dunn and Weissman, 2016), Shoelaces (Birkeland et al., 2018) and ORFik [https://github.com/Roleren/ORFik] for P-site offset calculation and evaluated the results by the consistency in the estimated codon usage.

Simple processing and input format
While there are some existing tools dedicated to P-site offset calculations (Dunn and Weissman, 2016;Lauria et al., 2018;Wang et al., 2017;Birkeland et al., 2018), most other tools perform P-site offset calculations as part of a pipeline and are often not able to run offset calculations separately. Like other dedicated tools, CONCUR relies solely on a BAM file with reads aligned to the genome, which makes it fast, easy to run and suitable to both run independently or to be included in a pipeline.
CONCUR is currenly based on alignments to a reference genome, but support for transcriptome alignment can be added by running concur install genome.pl with a custom GTF file representing the coding sequences in the transcriptome.

Focus on codon usage estimates
CONCUR was built with the primary focus not only to offset and filter reads correctly, but more importantly, to calculate and return codon usage at the E-, P-, and A-site as well as at the flanking positions. Existing tools that are able to calculate offsets do not automatically remove reads that are not in frame with those offsets, makes codon usage measurements impossible without additional work. Furthermore, the stringent filtering in CONCUR makes the final read sets more consistent with each other and suitable for codon usage calculations.
There is a need for Ribo-seq tools aimed at specific tasks in a well-described and rigorous manner. CONCUR is filling this gap by providing a methodology aimed specifically at codon usage calculations.

Correct inclusion of multiple frames per read length
A notable difference between CONCUR and other tools performing P-site offset calculation is that CONCUR does not assume all reads of the same length to have the same offset. The first step of CONCUR is to identify reads that intersect with coding sequences to determine the reading frame of their 5' ends. Reads of the same length-but different frames-are identified and treated as separate read sets in all subsequent analyses.
To illustrate this point, we can consider the yeast study in Section 1.2. CONCUR determined reads of 30 nt in length in the ncs2.YPD.rep1 replicate to have an offset of -12, -14, and -13 for frame 0, 1, and 2, respectively. All three frames gave similar codon usage signals as shown in the pairwise correlation plot (Fig. S4). Notably, the most common frame for length 30 was frame 2, with only 50% of the reads. In other words, half of the reads would either have been processed using an incorrect offset, or been excluded, in case a single frame had been used. All tools shown in Table S3 except CONCUR and Ribodeblur assumes a single frame.

Running CONCUR, Plastid, ORFik, and Shoelaces
We compared the P-site offsets estimated by CONCUR, to those estimated by Plastid, ORFik, and Shoelaces. We selected these three tools as they all have a well documented offset estimation procedure, they are compatible with genomic alignments, they work for all species, and they did not require us to run a whole pipeline to calculate the P-site offsets. We did not find any other tool fulfilling all these criteria.
Plastid (v0.4.7) was run with "--min length 22", "--max length 36" and "-default 100" to allow for non-conclusive read lengths to be identified and omitted from the comparison. CONCUR (v0.9), ORFik (v1.6.9) and Shoelaces (v1.6) were run with default read length settings and any read length that passed filters were included in the results. None of the tools returned any offsets for reads of length <25 or >36, with the exception of a single sample that had an offset estimated for length 38 by ORFik. We therefore focused on P-site offsets for reads of length 25-36.
Plastid, ORFik and Shoelaces report up to one offset per read length, but do not report any reading frame information for that offset. When comparing these tools to CONCUR, we calculated the frame manually as offset modulo 3. Only reads with the expected frame were included in the codon usage estimation that was used to validate the offsets. Excluding reads from other frames is neccessary for any codon usage analysis, and omitting this step would have strongly reduced the performance compared with CONCUR. Table S4 and Table S5 show all P-site offsets reported by CONCUR, Plastid, ORFik, and Shoelaces, respectively, calculated for six yeast samples previously covered in Section 1.2. To validate the offsets we compared the resulting Psite codon usage across different read lengths (and frames, when applicable) to each other. Read sets with P-site codon usage highly correlated to P-site codon usage from other reads sets were considered to be consistent. Read sets with very different P-site codon usage profiles were considered inconsistent, whereas those with only partially different profiles were considered partially consistent. The number of consistent, partially consistent and inconsistent read sets for each tool is summarized in Table S6 (top). CONCUR performed slightly better than Plastid and Shoelaces, reporting on average 10.3 consistent reads sets, compared with 8.7 for both Plastid and Shoelaces. ORFik performed worse with only 3.8 consistent read sets reported per sample. Moreover, CONCUR selected the fewest partially consistent or inconsistent read sets, presumably due to its stringent filtering and validation steps. It should be noted that the samples analyzed here had a low 3nt periodicity. This may explain why ORFik, which uses a Fourier transform, performed less well than the other tools.

Benchmarking results
The number of reads per read set varies in Ribo-seq data, making it more important to estimate offsets correctly for more abundant read sets than for less abundant ones. To evaluate how many reads would have a correct offset for codon usage analysis, we compared the number of reads that were selected from the consistent, partially consistent, and inconsistent read sets, respectively (Table S6, bottom). CONCUR retained more reads from consistent read sets than any other tool (9.2 million reads, compared with 6.8 million reads for Plastid and 6.7 million reads for Shoelaces). This difference is likely due to the use of multiple frames by CONCUR, thereby avoiding the uneccessary exclusion of read sets in alternative frames of abundant read lengths.