XPRESSyourself: Enhancing, Standardizing, and Automating Ribosome Profiling Computational Analyses Yields Improved Insight into Data

Ribosome profiling, an application of nucleic acid sequencing for monitoring ribosome activity, has revolutionized our understanding of protein translation dynamics. This technique has been available for a decade, yet the current state and standardization of publicly available computational tools for these data is bleak. We introduce XPRESSyourself, an analytical toolkit that eliminates barriers and bottlenecks associated with this specialized data type by filling gaps in the computational toolset for both experts and non-experts of ribosome profiling. XPRESSyourself automates and standardizes analysis procedures, decreasing time-to-discovery and increasing reproducibility. This toolkit acts as a reference implementation of current best practices in ribosome profiling analysis. We demonstrate this toolkit’s performance on publicly available ribosome profiling data by rapidly identifying hypothetical mechanisms related to neurodegenerative phenotypes and neuroprotective mechanisms of the small-molecule ISRIB during acute cellular stress. XPRESSyourself brings robust, rapid analysis of ribosome-profiling data to a broad and ever-expanding audience and will lead to more reproducible and accessible measurements of translation regulation. XPRESSyourself software is perpetually open-source under the GPL-3.0 license and is hosted at https://github.com/XPRESSyourself, where users can access additional documentation and report software issues.

libraries that are sequenced, and aligning the sequenced reads to a reference genome or transcriptome to assess relative transcript abundance, differential splice variants, sequence polymorphisms, and more [1]. High-throughput sequencing technologies have been developed or adapted for a variety of applications such as DNA sequencing, ChIP-seq, single-cell RNA-Seq, and ribosome profiling [2].
Although vast strides have been made to implement and perfect these technologies, many bottlenecks still exist. For example, while a basic bioinformatic understanding is more commonplace amongst scientists, the intricacies of processing RNA-Seq data remain challenging for many. Moreover, many users are often not aware of the most up-to-date tools or the appropriate settings for their application [3,4]. Even for the experienced user, developing robust automated pipelines that accurately process and assess the quality of these datasets can be laborious. The variability that inevitably arises with each lab or core facility designing and using distinct pipelines is also a challenge to the field.
Though RNA-Seq is a matured technology, there remains an abundance of biases and peculiarities associated with each analytical method or tool, which are often obscured to a user. Additionally, few if any extant pipelines or toolkits offer a thorough set of integrated tools for assessing standard quality control metrics or performing reference curation. This is particularly true with primary data from ribosome profiling, which in many aspects is still maturing [5]. For example, a common bias in ribosome profiling libraries is 5 -and 3 -read pile-ups [6][7][8] due to slower kinetics associated with the initiation and termination steps of translation compared to the elongation step.
Experts generally recommend that these pile-up-prone regions be excluded when quantifying ribosome profiling alignments [5,9]; however, no publicly available computational tools currently exist to facilitate the automated adjustments to reference transcripts.
Several computational pipelines for RNA sequencing have emerged that intend to tackle various aspects of these bottlenecks, but many suffer from usability issues, are not easily modifiable, or sacrifice quality for speed. For example, a simple internet search for RNA-Seq pipelines reveals several classes of pipelines. These range from simple tutorials and walkthroughs [10,11], to semi-automated pipelines that require extensive manual configuration [12][13][14][15], to automated, more user-friendly software packages [16][17][18]. However, in all the above cases, they may use out-dated software, use methods that sacrifice quality for speed, and/or be missing key quality control measures integrated into their pipeline design. This is particularly true in the case of ribosome profiling.
In response to these issues surrounding the automation of sequencing technology, we built the XPRESSyourself bioinformatics suite for processing and analyzing high-throughput expression data. This suite was architecturally designed from the ground up to be computationally efficient, without sacrificing quality for speed.
Each step of the pipeline utilizes the best performing software package for that task, having been previously vetted by peer-reviewed benchmarking studies where such studies exist for a given tool. Additionally, the pipeline is designed such that updating and testing of a new module are facile tasks for a trained bioinformatician. This enables XPRESSyourself packages to continuously offer the best options available to the entire community, regardless of expertise.
Currently, XPRESSyourself is partitioned into two main software packages. With the XPRESSpipe package, the user is provided with a complete suite of software to handle pre-processing, aligning, and quantifying of sequencing reads, performing quality control via various meta-analyses of pre-and post-processed reads. We also provide access to key quality control measures useful for assessing ribosome profiling and other RNA-Seq experiments.
These include read length distribution plots that are particularly helpful for ribosome profiling experiments due to the unique characteristics of the ribosome footprint-sized libraries (usually around 17-33 nucleotides) [19], and a periodicity sub-module that tracks the P-site of ribosome footprints to assess effective capture of the characteristic one codon step of the ribosome. Ribosome profiling also faces the challenge of efficiently depleting ribosomal RNA (rRNA) from samples as commercial depletion kits are frequently not adequately selective due to the sheer variety of rRNA fragments created during ribosome footprinting via RNase digestion. XPRESSpipe provides a feature that identifies the most abundant rRNA species to target for depletion during library preparation. XPRESSpipe also includes a metagene analysis sub-module that shows the distribution of the relative position of all aligned reads across a representative transcript to help identify any 5 -or 3 -biases in RNA fragment capture during library preparation. Currently, few current computational tools exist for performing this analysis. Additionally, XPRESSpipe includes a module for plotting gene coverage, similar to interactive genome browser programs, such as IGV [20], but where introns are collapsed to more clearly visualize read coverage across exons or coding space. As PCR-based duplicate biases can arise during sequence library preparation, a library complexity visualization sub-module is included in the pipeline to assess the frequency of PCR amplification artifacts in the library and ensure appropriately broad coverage of gene population was captured during sequence library creation. While XPRESSpipe summarizes hundreds to thousands of lines of code to one or two lines, we also provide the user with a guided command builder module.
The second package currently available within XPRESSyourself is XPRESSplot, which provides tools to perform the bulk of the sequence analysis and generation of figures for publication, where many plot generation protocols that frequently require several hundred lines of code are condensed to a single line with minimal input from the user. XPRESSyourself suite packages are coded in Python and R, the current linguae francae of computational biology and bioinformatics, which allows for easy modification and improvement by the sequencing community.
XPRESSyourself provides an accessible, simple-to-use, automated analysis platform for RNA-Seq.
Considerations with the design of this software will diminish various bottlenecks and increase the speed at which community-wide scientific discoveries can be made. XPRESSyourself additionally updates several useful quality control software tools essential in the RNA-Seq community. Previously unavailable methods for handling the complexities of ribosome profiling analysis are also automated and included in this software. All of these tools aim to help the user take their analysis into their own hands, guiding them through the necessary considerations and automating essential steps while helping standardize the analysis protocol for better reproducibility between studies.

XPRESSpipe
XPRESSpipe contains automated pipelines for both single-end and paired-end RNA-Seq. The pipeline was designed based upon many characteristics of The Cancer Genome Atlas (TCGA) (https://www.cancer.gov/tcga) alignment standards with appropriate modifications or updates depending on the use. The pipeline handles pre-processing, alignment, and quantification of sequencing reads, after which it will perform essential quality control analyses of each sequence library. In the case of ribosome profiling libraries, default parameters are optimized for this type of data. Within this manuscript, we will focus on ribosome profiling examples to demonstrate the utility of XPRESSpipe, while the majority of statements are also applicable to general single-or paired-end RNA-Seq. Pipelines and individual sub-modules are capable of operating in a parallel manner for each input file, thus accelerating the speed at which XPRESSpipe is able to process data. More details can be found in the documentation that will be updated as features are added, updated, or modified (https://xpresspipe.readthedocs.io/en/latest/). Table 1 outlines the parameters a user would need to consider modifying based on their sequencing setup or desired output. In the majority of use cases, the default parameters for each module will be sufficient.

Arguments
Description Path to output directory -r, --reference Path to parent organism reference directory -g, --gtf Path and file name to GTF used for alignment quantification -e, --experiment Experiment name Optional --two_pass Include option to perform a two-step alignment to map for unannotated splice-junctions -a, --adaptors Specify adaptor as string --if "None" is provided, software will attempt to auto-detect adaptors --if "POLYX" is provided as a single string in the list, polyX adaptors will be trimmed -q, --quality PHRED read quality threshold (default: 28) --min_length Minimum read length threshold to keep for reads (default: 18) --umi_location Provide parameter to process UMIs --provide the location (see fastp documentation for more details) --umi_length Provide parameter to process UMIs --provide UMI length (must provide the --umi_location argument) --deduplicate Include option to quantify alignment files with de-duplication --output_bed Include option to output BED files for each aligned file -c, --quantification_method Specify quantification method (default: HTSeq [21]) --feature_type Specify feature type (3rd column in GFF file) to be used if quantifying with HTSeq (default: CDS) --stranded Specify stranded library preparation method (Varies based on quantification method, see documentation for more information) --method Provide parameter and method to perform library normalization on samples (options: "RPM", "TPM", "RPKM", "FPKM") --vcf Provide full path and file name to VCF file if you would like to detect personal variants overlapping alignments, otherwise not considered --batch Include path and filename of dataframe with batch normalization parameters --sjdbOverhang Sequencing read-length -1 parameter used during reference curation (see STAR documentation for more information) --mismatchRatio Alignment ratio of mismatches to mapped length is less than this value (see STAR documentation for more information) --seedSearchStartLmax Adjust this parameter by providing a lower number to improve mapping sensitivity (recommended value = 15 for reads 25 nts) (see STAR documentation for more information) --genome_size Change if parameter is provided during reference building and using a two-pass alignment -m, --max_processors Specify number of max processors to use for tasks (default: No limit)

User Aids
Several tools and resources are provided to aid in making XPRESSyourself accessible to all users. One such tool is an integrated command builder for reference curation and sample analysis, accessed by running xpresspipe build. This command builder will walk the user through potential considerations based on their library preparation method and build the appropriate command for execution on their personal computer or a supercomputing cluster. If running the command builder on a personal machine, the user can then have XPRESSpipe execute the command automatically. In addition to this resource, the XPRESSyourself suite provides thorough documentation for each module and tool, along with video walkthroughs (accessible through the README files) and interactive notebooks (found in the home directory of a package.

Inputs
While inputs will vary between sub-modules, a few points of guidance are important to consider.

Automated Reference Curation
One of the first steps of RNA-Seq alignment is curating an organism reference to which the alignment software will map reads. For the current version of XPRESSpipe, a STAR [22] reference is automatically curated simply by providing the appropriate GTF file saved as transcripts.gtf and directory path to the genomic FASTA file(s).
Currently, STAR is used within the pipeline as it has been shown consistently to be the best performing read aligner for RNA-Seq data [23]. Additional modifications are occasionally recommended to this file, which can be performed within this sub-module or separately, as discussed in more detail in the next section. As this can be a time-consuming process, we will leave the --max_processors parameter as default in this example to utilize all cores available to the computing unit. This entire process is automatically handled with the curateReference sub-module for ease of use. More on GTF modification arguments used in this code block follows in the section below.

GTF Modification
As ribosomal RNAs and other non-coding RNAs can be highly abundant in RNA-Seq experiments, it is often recommended to not include these sequences for quantification. By providing the --protein_coding argument, only protein-coding genes are retained in the GTF file, which acts as a masking step of reads aligning to non-coding regions of the genome.
In most eukaryotes, mRNAs undergo alternative splicing of exons to generate the mature mRNA. However, some tools consider reads that align to multiple annotated splice variants of a gene as a multi-mapping read since they map to a location where several isoforms, recognized as separate records, of the same gene overlap. These reads are either penalized or discarded. By providing the --longest_transcript argument, the longest Ensembl canonical transcript [24] is retained for each gene in the GTF file. However, if using HTSeq with default XPRESSpipe parameters or Cufflinks to quantify reads, this is not necessary as the software is optimized to quantify abundances of the different isoforms of each gene [25].
For ribosome profiling, we observe frequent read pile-ups at the 5 -and 3 -ends of an open reading frame which are largely uninformative as to the translational efficiency of the gene. Therefore, the 5 -and 3 -ends of each transcript's total coding region should be truncated as to be excluded from consideration during read quantification [5,9]. By providing the --truncate argument, the 5 -and 3 -ends of each coding region will be trimmed by the specified amounts. These values are set to defaults of 45 nt for 5 -truncation and 15 nt for 3truncation, as is the convention within the ribosome profiling field [5], but these can be modified using the --truncate_5prime or --truncate_3prime parameters. If generating a GTF for use with general RNA-Seq datasets not associated with a ribosome profiling dataset (i.e., an RNA-Seq library that originates from the same sample as a ribosome profiling library), the --truncate argument should not be provided.

Trimming
Reads generally need to be cleaned of artifacts from library creation. These include adaptors, unique molecular identifier (UMI) sequences, and technical errors in the form of low-quality base calls. By doing so, non-native sequences are removed and reads can align properly to the reference. XPRESSpipe uses fastp, a faster, more accurate trimming package that has improved alignable read output compared to its predecessors [26]. Adaptor sequence, base quality, and read length are all adjustable parameters available to the user. Additionally, feature details, such as those for UMIs, can be specified. PCR artifacts will then be identified and grouped during pre-processing, then removed in post-alignment processing [27,28].

Alignment
Reads are aligned to a reference genome. XPRESSpipe uses STAR, which, despite being more memory-intensive, is relatively fast and one of the most accurate sequence alignment options currently available [22,29]. XPRESSpipe is capable of performing a single-pass, splice-aware, GTF-guided alignment or a two-pass alignment of reads wherein novel splice junctions are determined and built into the reference, followed by alignment of reads to the new reference. A coordinate-sorted and indexed BAM file is output by STAR. We abstain from rRNA negative alignment at this step as downstream analysis of these mapped reads could be of interest to some users.

Post-alignment Processing
XPRESSpipe further processes alignment files by optionally parsing for unique alignments that are then passed on to the next steps. PCR duplicates are detected and marked or removed for downstream processing; however, these files are only used for relevant downstream steps (such as library complexity quality control) or if the user specifies to use these de-duplicated files in downstream steps such as read quantification. Use of de-duplicated alignment files may be advisable in situations where the library complexity profiles (discussed below) exhibit high duplication frequencies. However, generally the abundance of PCR-duplicates is low in properly-prepared sequencing libraries; thus, doing so may be overly stringent and unnecessary [27]. These steps are performed using samtools [30].
Optionally, BED coverage files can also be output. These conversions are handled by bedtools [31].

Read Quantification
XPRESSpipe quantifies read alignments for each input file using HTSeq with the intersection-nonempty method by default [21,32]. Our rationale for including this quantification method is that it conforms to the current default TCGA standards and is favorable in most applications. If masking of non-coding RNAs is desired, a protein_coding modified GTF file should be provided for the --gtf argument. HTSeq is recommended for processing ribosome profiling data as it allows selection of feature type across which to quantify, thus allowing for quantification across the CDSs of a transcript instead of the exons. Additionally, if a user is interested in quantifying ribosome occupancy of transcript uORFs for ribosome footprint samples, they could provide five_prime_utr or three_prime_utr for the --feature_type parameter if such annotations exist for the organism of interest. If the user is interested in isoform abundance estimation, Cufflinks is available to perform this method of quantification instead [25,32].

Normalization
Methods for count normalization are available within XPRESSpipe by way of the XPRESSplot package. For normalizations correcting for transcript length, the appropriate GTF must be provided. Current sample normalization methods available include reads-per-million (RPM), Reads-per-kilobase-million (RPKM) or Fragments-per-kilobase-million (FPKM), and transcripts per million (TPM) normalization [33]. For samples sequenced on different flow cells, prepared by different individuals, or on different days, the --batch argument should be provided along with the appropriate metadata matrix, which is then processed by way of XPRESSplot using the ComBat package [34].

Read Length Distribution
The lengths of all reads are analyzed after trimming. By assessing the read distribution of each sample, the user can ensure the expected read size was sequenced. This is particularly helpful in ribosome profiling experiments for verifying the requisite 17-33 nt ribosome footprints were selectively captured during library preparation [5,19].
Metrics here, as in all other quality control sub-modules, are then compiled into summary figures by XPRESSpipe for assessment of the overall experiment by the user.

Library Complexity
Measuring library complexity is an effective method for analyzing the robustness of a sequencing experiment in capturing various, unique RNA species. As the majority of RNA-Seq preparation methods involve a PCR step, sometimes particular fragments will be favored and over-amplified in contrast to others. By plotting the number of PCR replicates versus expression level for each gene, one can monitor any effects of limited transcript capture diversity and/or high estimated PCR duplication rate on the robustness of their libraries. This analysis is performed using dupRadar [35] where inputs are PCR duplicate-tagged BAM files output by XPRESSpipe by way of samtools [30]. Metrics are then compiled and plotted by XPRESSpipe.

Metagene Estimation Profile
To identify any general biases for the preferential capture of the 5 -or 3 -ends of transcripts, metagene profiles can be generated for each sample. This is performed by determining the meta-genomic coordinate for each aligned read in exon space. Required inputs are an indexed BAM file and an un-modified GTF reference file. Outputs include metagene metrics, individual plots, and summary plots. If desired, a meta-profile across a representative CDS can be performed.

Gene Coverage Profile
Extending the metagene estimation analysis, the user can focus on the coverage profile across a single gene.
Although traditional tools like IGV [20] offer the ability to perform such tasks, XPRESSpipe provides the ability to collapse the introns to observe coverage over exon space only. This is helpful in situations where massive introns spread out exons and make it difficult to visualize exon coverage for the entire transcript in a concise manner. When running an XPRESSpipe pipeline, a housekeeping gene will be processed and output for the user's reference.
Figure S1 provides a comparison with the output of IGV [20] and XPRESSpipe's geneCoverage module over a similar region for two genes to demonstrate the compatibility between the methods.

Codon Phasing/Periodicity Estimation Profile
In ribosome profiling, a useful measure of a successful experiment comes by investigating the codon phasing of ribosome footprints [5]. To do so, the P-site positions relative to the start codon of each ribosome footprint that mapped to a transcript are calculated using riboWaltz [36]. The same inputs are required as in the metagene sub-module.

Identify Problematic rRNA Fragments from Ribosome Footprinting for Depletion
Ribosomal RNA (rRNA) contamination is common in RNA-Seq library preparation as the bulk of RNA in a cell is dedicated to rRNA. The sequencing of these RNAs becomes highly repetitive, wasteful, and typically biologically uninteresting in the context of gene expression and translation efficiency. The depletion of these sequences is therefore desired to increase the depth of coverage of ribosome footprints. To facilitate this depletion, many commercial kits are available that target specific rRNA sequences for depletion or that enrich for poly(A)-tailed mRNAs. However, and especially in the case of ribosome profiling experiments, where RNA is digested by an RNase to create ribosome footprints, many commercial depletion kits will not target the most abundant rRNA fragment species produces during the footprinting step of ribosome profiling. Poly(A)-selection kits are inappropriate as footprints will not have the requisite poly(A) tail. To this end, custom rRNA-depletion probes are recommended [2,5]. rrnaProbe analyzes the over-represented sequences within a collection of footprint libraries that have already undergone adaptor and quality trimming, compiles conserved sequences across the overall experiment, and outputs a rank-ordered list of these sequences for probe design.

Differential Expression Analysis
XPRESSpipe incorporates DESeq2 for performing differential expression analysis of count data. We refer users to the original publication for more information about uses and methodology [37]. In this module, the user provides the count table output by XPRESSpipe, along with a sample summary table and design formula (as explained in the DESeq2 documentation).

Outputs
While outputs will vary between sub-modules, generally the user will specify a parent output directory and the necessary sub-directories will be created based on the step in the pipeline. Further information can be found in the documentation (https://xpresspipe.readthedocs.io/en/latest/) or by entering xpresspipe <sub-module name> --help in the command line. Figure 1 provides an example of the output file scheme for XPRESSpipe. For a complete pipeline run, the user can expect BAM alignment files, a collated count table of all samples in the experiment, and quality control figures and metrics. Additionally, broad level reports are collected by MultiQC [38], which include general sequence read statistics generated by FastQC [39]. For almost all sub-modules, a log file will also be written to summarize provided user parameters, track performance, and report errors. An additional log file will be written summarizing the versions of the different dependency software used during the execution of the pipeline or sub-module. Users are encouraged to provide these files as supplements within publications when presenting XPRESSpipe-processed data to aid in documentation.

XPRESSplot
Further analysis and presentation of ribosome profiling or RNA-Seq data are accomplished within XPRESSplot.
XPRESSplot is a Python library of analysis and plotting tools that build upon existing packages, such as Matplotlib [40] and Seaborn [41] to generate flexible, specific analyses and creates plots frequently used by biological researchers that can each be executed in a single line of code rather than tens to hundreds. Additionally, many included features are currently available in an R or other programming language package but not in a Python package. Brief summaries of key components of this package, as well as descriptions of new or more automated tools are provided below and methods are discussed in subsequent sections. We refer the reader to the documentation (https://xpressplot.readthedocs.io/en/latest) for more detailed instructions and descriptions of other features within the toolkit. Although XPRESSplot is designed for handling transcriptomic datasets, it is also typically capable of handling other -omics datasets, such as microarray, proteomics, and metabolomics data.

Input Data
Generally, two inputs are required for all functions within XPRESSplot: 1. Expression Matrix: It is assumed that the input data matrix = i * j where i (rows) are genes or other analytes and j (columns) are samples. Table: It is assumed that the metadata table is a two-column, header-less data matrix where column 0 is the sample ID (as specified in j column names of the expression matrix) and column 1 is the sample group (for example, genotype or treatment group).

Normalization
RNA-Seq experiments can be normalized by the user using the reads-per-million (RPM), Reads-per-kilobase-million (RPKM) or Fragments-per-kilobase-million (FPKM), or transcripts per million (TPM) methods [33], as outlined in Equations 1-4 in the Methods section of this manuscript. Other normalizations, such as mean centering of i features (i.e., genes or other analytes) by scikit-learn's preprocessing module [42] are also available. Count thresholds can also be set to remove lowly expressed genes from an analysis that may be less reliable due to poor sequencing ability.

Principal Components Analysis
Principal components analysis (PCA) for the data matrix is computed using Python's scikit-learn package [42] and desired principal components are plotted over a scatter plot via Matplotlib [40] and Seaborn [41]. In the XPRESSplot PCA module, as in many other analysis modules within XPRESSplot, samples are color-coded for easy visualization of sample groups when verifying proper grouping of treatment types, for example. Confidence intervals are plotted over the scatterplot using NumPy [43,44], a feature currently missing from Pythonic PCA packages.

Volcano Plot
Volcano plots are an efficient method for plotting the magnitude, direction, and significance of changes in expression or other data types between two conditions with multiple replicates. By providing the categorical names for samples of two conditions in the metadata matrix, XPRESSplot will automate the calculation and plotting. When plotting gene expression values, the RNA-Seq-specific volcano plot method should be used which requires a DESeq2-output data table as input. This is essential as RNA-Seq datasets follow a negative-binomial distribution rather than a normal distribution [37].
For other uses, such as for proteomics or metabolomics datasets where the data is normally distributed, the general volcano plot method can be used, which will average the measurements for each analyte between the two conditions and calculate the log 2 (fold change). Additionally, for each gene, the p-value between the two conditions is calculated using SciPy's Individual T-test function [45]. The log 2 (fold change) and -log 10 (p-value) is then plotted for each gene between the two conditions. Additional features available are the ability to plot threshold lines, highlight subsets of genes within the plot, and label specific genes by name To quickly extract key data from the figure.

Validation
To evaluate the ability of XPRESSpipe to provide the user with reliable results, we processed publicly available raw sequence files using this automated pipeline. We chose to highlight a ribosome profiling dataset to showcase the utility of XPRESSpipe for rapidly extracting potentially interesting biological insights from sequence data. To further validate the performance of XPRESSpipe, we additionally chose a small subset of TCGA samples, processed their raw read data through XPRESSpipe, and compared the counts to the publicly available TCGA-processed count tables.

New Insights from Published Ribosome Profiling Data
The integrated stress response (ISR) is a signaling mechanism used by cells and organisms in response to a variety of cellular stresses [46]. Although acute ISR activation is essential for cells to properly respond to stresses, long periods of sustained ISR activity can be damaging. These prolonged episodes contribute to a variety of diseases, including many that result in neurological decline [47]. A recently discovered small-molecule inhibitor of the ISR, ISRIB, has been demonstrated to potentially be a safe and effective therapeutic for traumatic brain injury and other neurological diseases. Interestingly, ISRIB can suppress the damaging chronic low activation of the ISR, while it does not interfere with a cytoprotective acute, high-grade ISR. It has also been shown to be neuroprotective in mouse models of traumatic brain injury, adding to its wide pharmacological interest [48][49][50][51][52][53][54].
A recent study (data available under Gene Expression Omnibus accession number GSE65778) utilized ribosome profiling to better define the mechanisms of ISRIB action on the ISR, modeled by 1-hour tunicamycin (Tm) treatment in HEK293T cells [50]. A key finding of this study is that a specific subset of stress-related transcription factor mRNAs exhibit increased translational efficiency (TE) compared to untreated cells during the tunicamycin-induced ISR. However, when cells were co-treated with tunicamycin and ISRIB, the TE of these stress-related mRNAs showed no significant increase compared to untreated cells, which indicates that ISRIB can counteract the translational responses associated with the ISR.
To showcase the utility of XPRESSpipe in analyzing ribosome profiling and sequencing datasets, we re-processed and analyzed this dataset using the more current in silico techniques included in the XPRESSpipe package to further query the translational mechanisms of the ISR and ISRIB. All XPRESSpipe-processed biological replicate samples exhibited a strong correlation between read counts per gene when thresholded similarly to count data available with the original publication (Spearman ρ values 0.991-0.997) (Figure 2A; Figure S2B shows the corresponding plots using the count data provided with the original publication for reference).
Compared to the raw count data made available in the original manuscript, when XPRESSpipe-processed samples were thresholded as in the original published raw count data, samples showed generally comparable read counts per gene between the two analytical regimes (Spearman ρ values 0.937-0.950) ( Figure 2B). This is in spite of the fact that the methods section of the original publication employed software that was current at the time but is now outdated, such as TopHat2 [55], which has a documented higher false positive alignment rate, generally lower recall, and lower precision at correctly aligning multi-mapping reads compared to STAR [22,23]. Many of the genes over-represented in the original count data as compared to data processed by XPRESSpipe are genes that have pseudogenes or other paralogs ( Figure S2A highlights a sampling of some extreme cases). As these genes share high sequence similarity with each other, reads mapping to these regions are difficult to attribute to a specific genomic locus and are often excluded from further analyses due to their multi-mapping nature. The benchmarking study [23] that examined these and other aligners described how TopHat2 had a disproportionally high rate of incorrectly aligned bases, or bases that were aligned uniquely when they should have been aligned ambiguously, at least partially explaining the observed overcounted effect with TopHat2. Had TopHat2 marked problematic reads as ambiguous, they would have been excluded from later quantification.
Additionally, when TopHat2 and STAR were tested using multi-mapper simulated test data of varying complexity, TopHat2 consistently suffered in precision and recall. These calls are increasingly more difficult to make with smaller reads as well, and this is evident from Figure 2B, where ribosome footprint samples consistently showed more over-counted genes than the corresponding RNA-Seq samples. When dealing with a ribosome footprint library of about 50-100 million reads, and with TopHat2's simulated likelihood of not marking an ambiguous read as such being about 0.5% higher than STAR, this would lead to around 250,000 to 500,000 spuriously aligned reads, which is in line with our observations (all benchmarking details were derived from [23]).
An additional potential contributor to this divergence is that the alignment and quantification within XPRESSpipe use a current human transcriptome reference, which no doubt contains updates and modifications to annotated canonical transcripts and so forth when compared to the version used in the original study. However, in practice, these effects are modest for this dataset ( Figure S3). While differences in processing between the outdated and current methods may not always create broad differences in output, key biological insights may be missed. The analysis that follows is exploratory and only meant to suggest putative targets identifiable by re-analyzing pre-existing, publicly available data.
Similar canonical targets of translation regulation during ISR were identified in the XPRESSpipe-processed data as were identified in the original study. These targets include ATF4, ATF5, PPP1R15A, and DDIT3 ( Figure   3A-C, highlighted in purple) [50]. Of note, the fold-change in ribosome occupancy of ATF4 (6.83) from XPRESSpipe-processed samples closely mirrored the estimate from the original publication (6.44). Other targets highlighted in the original study [50], such as ATF5, PPP1R15A, and DDIT3 also demonstrated comparable increases in their ribosome occupancy fold-changes to the original publication count data (XPRESSpipe: 5.89, 2.47, and 3.93; respectively. Original: 7.50, 2.70, and 3.89; respectively) ( Figure 3A). Similar to the originally processed Figure 2: Comparison between processed data produced by XPRESSpipe and original study. Genes were eliminated from analysis if any RNA-Seq sample for that gene had fewer than 10 counts. A) Comparison of biological replicate read counts processed by XPRESSpipe. B) Comparison of read counts per gene between count data from the original study and the same raw data processed and quantified by XPRESSpipe. RPF, ribosome-protected fragments. Tm, tunicamycin. All ρ values reported are Spearman correlation coefficients. XPRESSpipe-processed read alignments were quantified to Homo sapiens build CRCh38v96 using a protein-coding only, truncated GTF. Figure 3: Analysis of previously published ISR TE data using XPRESSpipe. A-C) Log 2 (Fold Change) for each drug condition compared to untreated for the ribosome profiling and RNA-Seq data. Purple, ISR canonical targets highlighted in the original study. Green, genes with uORFs affected by ISR as highlighted in the original study. Orange, genes fitting a strict thresholding paradigm to identify genes that display a 2-fold or greater increase in TE in Tm + ISRIB treatment compared to Tm treatment. Black, genes with statistically significant changes in TE. Grey, all genes. Changes in ribo-seq and mRNA-Seq were calculated using DESeq2. TE was calculated using DESeq2. Points falling outside of the plotted range are not included. D) Changes in log 2 (TE) for each drug condition compared to untreated control. Grey, all genes. Purple, ISR targets identified in the original study. Orange, genes fitting a strict thresholding paradigm to identify genes that display a 2-fold or greater increase in TE in Tm + ISRIB treatment compared to Tm treatment. XPRESSpipe-processed read alignments were quantified to Homo sapiens build CRCh38v96 using a protein-coding only, truncated GTF. data, all of these notable changes in ribosome occupancy return to untreated levels during Tm + ISRIB co-treatment ( Figure 3B). Additional ISR targets containing micro-ORFs described in the study (highlighted in green in Figure   3A-C) were also similar in translational and transcriptional regulation across conditions between the two analyses.
Both the original study and our XPRESSpipe-based re-analysis show that ISRIB can counteract the significant increase in TE for a set of genes during ISR. To further explore TE regulation during ISR, we asked if ISRIB has a similar muting effect on genes with significant decreases in TE induced by the ISR. In the original study, genes with significant decreases in TE were reported in a source-data table but not a focus in the study. However, re-analysis of these data with the updated XPRESSpipe methodology identifies genes whose translational down-regulation may play a role in the neurodegenerative effects of ISR and the neuroprotective properties of ISRIB [51][52][53][54]. Importantly, several of these genes were not identified as having significantly down-regulated TEs in the original analysis, which suggests why a focus on translational downregulation may have been foregone. In all, we identified eight genes with the regulatory paradigm of interest: significant decreases in TE during tunicamycin-induced ISR that are rescued in the ISR + ISRIB condition (  Figure 3D). RNA-Seq and ribosome-footprint coverage across these genes show that the significant changes in their TE are due to neither spurious, high-abundance fragments differentially present across libraries nor variance from an especially small number of mapped reads ( Figure S4). This is an important consideration as the commonly suggested use of the CircLigase enzyme in published ribosome profiling library preparation protocols, which circularizes template cDNA before sequencing, can bias certain molecules' incorporation into sequencing libraries based on read-end base content alone [56]. neurotoxic levels of glutamate [57]. Finally, down-regulation of SLC1A1 has already been implicated in diseases such as neurodegenerative diseases caused by mutations in the eukaryotic translation initiation factor 2B subunit epsilon (eIF2B5) that mimic the effects of phosphorylated eIF2α on cellular stress response and

Gene Name Relevant Description POMGNT1
Participates in O-mannosyl glycosylation. Mutations have been associated with muscle-eye-brain diseases and congenital muscular dystrophies. Expressed especially in astrocytes, as well as in immature and mature neurons. Expressed across brain stem cells. MYO5B* May be involved in plasma membrane recycling. No related neurological annotations. PABPC1 Binds the poly(A) tail of mRNA. Promotes ribosome recruitment and translation initiation. May contribute to mRNA stability. No related neurological annotations. RPL12 Ribosomal subunit. No related neurological annotations. SLC1A1 Dense expression in substantia nigra, red nucleus, hippocampus, and cerebral cortical layers. Member of high-affinity glutamate transporter. In the brain, crucial for terminating postsynaptic action of the neurotransmitter glutamate. Responsible for maintaining glutamate concentrations below neurotoxic levels. MAP3K10* Functions in JNK signaling, reportedly involved in nerve growth factor induces neuronal apoptosis. Expressed in the cerebral cortex. Activates NEUROD1, which promotes neuronal differentiation.

RPLP1
Ribosome subunit. Evidence for stem cell and embryonic expression in the cerebral cortex. TSPAN33 Plays a role in normal erythropoiesis and regulates maturation and trafficking of ADAM10, a metalloprotease. Negatively regulates Notch activity by way of its regulation of ADAM10. Notch signaling is vital to neurogenesis.
translation [49,50,58]. This suggests that TE regulation of SLC1A1 abundance by translation initiation factors might be important in the neurodegeneration observed in prolonged ISR conditions. ISRIB's neuroprotective effects may, therefore, stem from a recovery of SLC1A1 protein expression to wild-type levels, which in turn helps restore normal glutamate regulation. Though speculative, these ISRIB-responsive neuronal targets act as interesting cases for further validation and study in a model more representative of neurotoxic injury and disease than the HEK-293T model used in the original study. In all, this comparison demonstrates the utility of XPRESSpipe for rapid, user-friendly analysis and re-analysis of ribosome-profiling experiments in the pursuit of biological insights and hypothesis generation.

Performance Validation Using TCGA Data
To further validate the design, reliability, and versatility of the XPRESSpipe pipeline, we processed raw TCGA sequence data using XPRESSpipe and compared the output count values to those publicly available through TCGA Another source of dissimilarity in data processing appears to arise if an Ensembl canonical transcripts-only reference is used during quantification. TCGA-processed data used an un-modified transcriptome reference file (all transcripts); therefore, the use of this modified (Ensembl canonical transcripts only) GTF will produce varied quantification for some genes as quantifications are constrained to a single transcript version of a given gene and a read will not be quantified if mapping to an exon not used by the canonical transcript. Even using XPRESSpipe settings closest to the TCGA pipeline and using the same genome and transcriptome version resulted in some variation ( Figure S5, plot enclosed in maroon). By performing a more detailed analysis of these differences, it is clear that virtually all genes exhibiting variance between the processing methods are pseudogenes, with the TCGA pipeline accepting and quantifying more pseudogenes at the time of initial analysis of this dataset. This can be indicative of the difficulty surrounding the recognition of these reads as multi-mapping to both the original gene and pseudogene ( Figure S6, S7, S8; interactive plots accompanying Figure S8 can be accessed at https://github.com/XPRESSyourself/xpressyourself_manuscript/tree/master/supplemental_files).

Discussion
We have described a new software suite, XPRESSyourself, which includes a set of tools to aid in the processing and analysis of expression data, and other applications of RNA-Seq such as ribosome profiling. Although RNA-Seq technologies are quite advanced, standardized computational protocols are much less established for some applications. This is problematic when individuals or groups may not be using the most up-to-date methods or may not be aware of particular biases or measures of quality control required to produce a reliable, high-quality sequencing study. XPRESSpipe handles these issues through on-going curation of benchmarked software tools and by simplifying the required user input. It also outputs all necessary quality control metrics so that the user can quickly assess the quality of their data and identify any systematic problems or technical biases that may compromise their analysis.
One particular benefit of XPRESSyourself is that it consolidates, streamlines, and/or introduces many tools Figure 4: Pipeline validation using publicly available TCGA count data. Correlations were calculated between publicly available count data from TCGA samples and the count data processed by XPRESSpipe. Pseudogenes were excluded from the analysis. All reported ρ values are Spearman correlation coefficients. XPRESSpipe-processed read alignments were quantified to Homo sapiens build CRCh38v96 using an unmodified GTF.
specific to ribosome profiling processing and analysis. This includes producing GTF files with 5 -and 3 -truncated CDS annotations, rRNA probe design for subtractive hybridization of abundant rRNA contaminants, and automated quality-control analyses to report on ribosome footprint periodicity and metagene coverage. These tools will help to democratize aspects of ribosome profiling for which software have not been previously publicly available.
An additional problem XPRESSpipe addresses is the incorrect use of these software tools, which is especially important for those coming from a non-computational background, but often applies to experienced users who simply cannot keep up with the plethora of advances and benchmarking in the field. XPRESSyourself will remove this barrier-to-entry for most users so that they can process and analyze their data immediately upon receipt of the raw data and only requires simple programming knowledge that is included in video walkthroughs, example scripts, and interactive command builders within this software suite.
We demonstrated the utility of the XPRESSyourself toolkit by re-analyzing a publicly available ribosome profiling dataset. From this analysis, we identified putative translational regulatory targets of the integrated stress response (ISR) that may contribute to its neurodegenerative effects and their rescue by the small-molecule ISR inhibitor, ISRIB. This highlights the importance of re-analyzing published datasets with more current methods, as improved analysis methodologies and updated organism genome references may result in new interpretations and hypotheses.
XPRESSyourself will enable individuals and labs to process and analyze their own data, which will result in quicker turnaround times of experiments and financial savings. XPRESSyourself will also put missing or incomplete computational tools required for ribosome profiling and RNA-Seq into the hands of the user. Additionally, the inclusion of detailed log reports, summaries of software dependencies used during runtime, and containerized versions of the pipeline where dependencies are archived and self-contained will aid in reproducibility and make transparent methods easy to incorporate into the resulting publications.

Conclusions
With the adoption of this flexible pipeline, the field of high-throughput sequencing, particularly ribosome profiling, can continue to standardize the processing protocol for associated sequence data and eliminate the variability that comes from the availability of a variety of software packages for various steps during sequence read processing.
Additionally, XPRESSpipe consolidates various tools used by the ribosome profiling and RNA-Seq communities.
With these tools, genome reference formatting and curation is automated and accessible to the public. Other tools, like those for GTF CDS truncation, rRNA depletion, and intron-less gene coverage plotting, are introduced within this software suite to enhance the current RNA-Seq toolkit. Further, by using this pipeline on publicly available data, we highlight the utility of XPRESSpipe to process publicly available or personal data to uncover novel biological patterns quickly. Adoption of this tool will allow scientists to quickly process and access their data independently, guide them in understanding key considerations in processing their data, and standardize protocols for applications of RNA-Seq, thus increasing the reproducibility of sequencing analyses.

Materials and Methods
Methods described in this manuscript apply to the software packages at the time of writing. To obtain the most current methods, please refer to the documentation or source code for a given module. Perform differential expression analysis [37]

Software Dependencies
A list of dependencies required for XPRESSpipe at the time of writing is listed in Table 3. Dependencies for XPRESSplot at the time of writing are listed in Table 4.

GTF Modification
To parallelize GTF modification, a GTF file is split into approximately proportional chunks equal to the specified number of threads. To avoid an incomplete gene record being included in a chunk and being inappropriately processed, a given chunk endpoint is determined by calculating the size of the GTF, dividing by the number of threads, and advancing to that endpoint, then advancing line by line until the last line of the gene record encountered at the endpoint. This is performed for each subsequent chunk. If creating the last chunk, the end of the chunk is the last line of the GTF record.
Ensembl canonical transcripts are determined according to the Ensembl glossary definition of a canonical transcript [24]. For cases where a tie exists between equal priority transcripts, the longest is chosen. When there are multiple transcripts that tie for equal priority and longest length, the first listed record is retained. Exon or CDS lengths are calculated by taking the sum of each exon or CDS, not including intron or other space in the calculation.
Protein-coding records are retained by performing a simple string search for the "protein_coding" annotation in the attribute column of a GTF file.
Truncation of records is performed by identifying the 5 -and 3 -end of each transcript and modifying the given coordinates to reflect the given truncation amounts. Suggested truncation amounts are 45 nt from the 5 -end and 15 nt from the 3 -end, both of which are set as the default truncation amount parameters for the function and do not need to be modified unless the user desires [5]. As a given CDS portion of a given exon may be less than a truncation amount, the function will perform a strand-aware recursive search CDS by CDS per transcript until the full truncation amount has been fully removed for each end. Any record smaller than the sum of the 5 -and 3truncation amounts is removed entirely from the output file.

Flattened GTF Records
Flattened transcriptome references are created via a modified version of the annotation curation module available in riboWaltz [36]. Vectorized expressions in Pandas [59] are performed to quickly parse out pertinent meta-information for each transcript for the given analysis. Intermediate files are created for retrieval by each process when parallelizing analysis of each alignment file. This allows for fast processing of each BAM file, where the bottleneck in speed arises from the decompression and import of the binary alignment data. Flat files are automatically destroyed after sub-module completion.

Normalization
Equations 1-4 reflect the design of the normalization functions within XPRESSplot, where g is gene n, ge is cumulative exon space for gene n, r is total reads, f is total fragments, and l is length.

Gene Coverage Plotting
Gene coverage calculations are performed by determining the exon space of the gene of interest and mapping any read for a given sample to this space. Each nucleotide of a read that maps to a nucleotide within these exon regions is counted. During plotting, a rolling window of 20 nucleotides is used to smoothen the plotted coordinates' read coverage. Required inputs are a transcriptome-aligned BAM file (as output by STAR within XPRESSpipe) and a GTF reference file, which is then curated into its longest-transcript, protein-coding-only flattened form, as discussed above. If a longest-transcript, protein-coding-only modified GTF has already been curated, this can alternatively be provided as input, with which the module will flatten (file suffix must be LC.gtf).

Periodicity
Ribosome p-site periodicity is calculated using riboWaltz [36]. Required inputs are the path to a directory containing transcriptome-aligned BAM files (as output by STAR within XPRESSpipe) and the path and file name of the appropriate un-modified GTF.

rRNA Probe
rrnaProbe works on a directory containing FastQC [39] zip compressed files to detect over-represented sequences for each sample. These sequences are then collated to create consensus fragments. One caveat of FastQC is that it collates on exact matching strings, but these strings, or sequences, can be 1 nt steps from each other and a single rRNA probe could be used to effectively pull out all these sequences. To handle this situation, XPRESSpipe will combine these near matches. A rank-ordered list of over-represented fragments within the appropriate length range to target for depletion is then output. A BLAST [63] search on a consensus sequence intended for probe usage can then be performed to verify the fragment maps to an rRNA sequence and is thus suitable for rRNA depletion.

Confidence Interval Plotting
Confidence intervals within PCA scatterplots generated by XRESSplot are calculated as follows: 1. Compute the covariance of the two principal component arrays, x and y using the numpy.cov() function.
2. Compute the eigenvalues and normalized eigenvectors of the covariance matrix using the numpy.linalg.eig() function.
3. Compute the θ of the normalized eigenvectors using the numpy.arctan2() function and converting the output from radians to degrees using numpy.deg().
4. Compute the λ of the eigenvalues by taking the square root of the eigenvalues.
5. Plot the confidence intervals over the scatter plot: The center point of the confidence interval is determined from the means of the x and y arrays. The angle is set equal to θ. The width of the confidence interval is calculated by where ci is equal to the corresponding confidence level (i.e., 68% = 1, 95% = 2, 99% = 3 Only gene names in common between the original data file and XPRESSpipe output were used for the method comparisons. Correlation between methods or replicates were calculated using a Spearman rank correlation coefficient, performed using the scipy.stats.spearman() function [64]. The associated script can be accessed at https://github.com/j-berg/xpressyourself_manuscript/blob/master/isrib_analysis/isrib_analysis.py.
Differential expression analyses were performed using all genes, but with a minimum count of 10 or greater per gene across samples, as recommended by the DESeq2 documentation [37]. Differential expression for ribo-seq and RNA-Seq was performed as detailed in the associated scripts (https://github.com/j-berg/xpressyourself_manuscript/blob/master/isrib_analysis/isrib_analysis.py, https://github.com/j-berg/xpressyourself_manuscript/blob/master/isrib_analysis/isrib_de/isrib_de_analysis.py, https://github.com/j-berg/xpressyourself_manuscript/blob/master/isrib_analysis/isrib_de/run_de.sh). For these analyses, the design formula was such that comparisons were designed as "treated" factor level over "untreated" factor level. Differential expression of translation efficiencies between conditions used the additional incorporation of the "ribosome footprint" factor level over "RNA-Seq" factor level in the design formula [5,37,50]. Adjusted p-values (FDRs) in the associated figures were calculated from the differential expression of the translation efficiencies of each gene for a given condition. Those passing an adjusted p-value threshold of less than or equal to 0.1 are highlighted in black.
Intron-agnostic gene coverage profiles were generated using XPRESSpipe's geneCoverage module.

Consent for publication
Protected TCGA data were obtained through dbGaP project number 21674 and utilized according to the associated policies and guidelines.

Availability of data and materials
The source code for these packages is perpetually open source and protected under the GPL-3.0 license.    Figure S1: Comparison between IGV browser and geneCoverage output. A) Gene coverage from IGV (above) and XPRESSpipe (below) for SLC1A1. B) Gene coverage from IGV (above) and XPRESSpipe (below) for TSPAN33. Introns collapsed by XPRESSpipe. Green box, region shown in corresponding IGV window comparing outputs between the two programs. Figure S2: Original ISRIB count data plotted against XPRESSpipe-processed data reveals systematic differences between the analytical regimes. A) Selected highlighted genes show consistent differences between processing methods. B) Spearman correlation plots using the data table provided as supplementary data with the original ISRIB manuscript comparing biological replicates. RPF, ribosome-protected footprint. Tm, tunicamycin. All ρ values reported are Spearman correlation coefficients. Figure S3: Original ISRIB count data plotted against XPRESSpipe-processed data quantifying with same reference version reveals negligible improvement in comparability between the analytical regimes. Original samples were processed using Ensembl human build GRCh38 v72, as in the original manuscript, and compared with the original count data provided with the manuscript. XPRESSpipe-prepared counts were thresholded similarly as the original data (each gene needed to have at least 10 counts across all mRNA samples). RepA, biological replicate A. RepB, biological replicate B. RPF, ribosome-protected footprint. Tm, tunicamycin. All ρ values reported are Spearman correlation coefficients. Figure S4: Gene coverage plots for neurologically annotated genes passing strict thresholding. Coverage plots were generated using XPRESSpipe's geneCoverage module, which collapses introns within the representation. Figure S5: Sample RNA-Seq count data compared between TCGA count data and various conformations of the XPRESSpipe pipeline. An overview of how different conformations of the XPRESSpipe peRNAseq pipeline compared to the published TCGA sample TCGA-06-0132-01A count data. The x-axis data in the plot enclosed in maroon most closely mirrors the settings used in the published TCGA RNA-Seq pipeline. The x-axis data in the plot enclosed in green used XPRESSpipe default settings and the most current reference transcriptome at the time of writing. All ρ values reported are Spearman correlation coefficients. Figure S6: Effect of pseudogene inclusion on comparability between processing regimes. Spearman correlations between XPRESSpipe and TCGA-processed count data with pseudogene counts included. All ρ values reported are Spearman correlation coefficients. Figure S7: Removal of pseudogenes counts improve comparability between analytical regimes. An overview of how different conformations of the XPRESSpipe peRNAseq pipeline compared to the published TCGA sample TCGA-06-0132-01A count data with pseudogenes collapsed. The x-axis data in the plot enclosed in maroon most closely mirrors the settings used in the published TCGA RNA-Seq pipeline. The x-axis data in the plot enclosed in green used XPRESSpipe default settings and a current reference transcriptome. All ρ values reported are Spearman correlation coefficients. Figure S8: Pseudogenes counts are over-represented in TCGA-processed data. An overview of gene-type distributions between transcriptome reference versions. The plots above used GRCh38v79 and the bottom plot used GRCh38v96. Purple points, protein-coding genes. Orange points, pseudogenes. Green points, other gene records. All plots represent sample TCGA-06-0132-01A and were processed the same way except for transcriptome reference used during read quantification.