XPRESSyourself: Enhancing, standardizing, and automating ribosome profiling computational analyses yields improved insight into data

Ribosome profiling, an application of nucleic acid sequencing for monitoring ribosome activity, has revolutionized our understanding of protein translation dynamics. This technique has been available for a decade, yet the current state and standardization of publicly available computational tools for these data is bleak. We introduce XPRESSyourself, an analytical toolkit that eliminates barriers and bottlenecks associated with this specialized data type by filling gaps in the computational toolset for both experts and non-experts of ribosome profiling. XPRESSyourself automates and standardizes analysis procedures, decreasing time-to-discovery and increasing reproducibility. This toolkit acts as a reference implementation of current best practices in ribosome profiling analysis. We demonstrate this toolkit’s performance on publicly available ribosome profiling data by rapidly identifying hypothetical mechanisms related to neurodegenerative phenotypes and neuroprotective mechanisms of the small-molecule ISRIB during acute cellular stress. XPRESSyourself brings robust, rapid analysis of ribosome-profiling data to a broad and ever-expanding audience and will lead to more reproducible and accessible measurements of translation regulation. XPRESSyourself software is perpetually open-source under the GPL-3.0 license and is hosted at https://github.com/XPRESSyourself, where users can access additional documentation and report software issues.



Wait! I've never programmed before!
If you don't have any programming experience and find this all very daunting, this is the documentation for you! We will walk through installation and usage step by step, and explain what we are doing along the way.

Installation
Installation requires the use of the command line interface (CLI). If you would like some background on how this programming environment works, you can try the Codecademy module, which will familiarize you with this environment.
To begin, you will need to install the software package, xpressplot. To do so, we will use a package manager, which eases the overhead involved in installing software and the other software packages it relies on.
1. We will need to use the command line interface (CLI, also known as the Terminal) to install and begin using the software
-Linux: Press Ctrl + Alt + T on your keyboard and the Terminal will open
-Mac: Click on the Spotlight icon (a magnifying glass) at the top right corner of your Desktop, type in Terminal, and double-click the corresponding result
2. We recommend using Python3 as Python2 is being deprecated (will no longer be updated, debugged, etc.)
-You can check the version of Python you have by typing python --version in the command line
-If you only have Python2 installed and want to use Python3, you can download this here
-Now we need the computer to recognize Python3 as the default Python by typing echo "alias python=python3" >> ~/.bash_aliases; source ~/.bash_aliases
-Now we can test this by executing python --version again
3. With newer versions of Python, the package installer pip should already be installed. We can install xpressplot from PyPI by executing the following: pip install xpressplot. This should install xpressplot and all dependencies.
4. Let's test that the installation worked:

$ python
This will open the python interactive mode. Next, type the following:

>>> import xpressplot
If the command executes without error, xpressplot and all dependencies have been successfully installed.
NOTE: If any installation step up to this point fails due to privileges warnings, you should run the pip install or other command using sudo. You will prepend this to the command (for example, sudo pip install xpressplot) and will likely be asked to provide the account password for your machine. sudo means "substitute user do", which essentially tells your computer that you are an authorized user who may install software on the system.
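The two checks above (Python version and package import) can also be run from a single script. This is a minimal sketch using only the Python standard library; the helper names python_major_version and is_installed are made up for this illustration and are not part of xpressplot.

```python
import importlib.util
import sys


def python_major_version():
    """Return the major version of the interpreter running this script."""
    return sys.version_info.major


def is_installed(package_name):
    """Return True if a top-level package with this name can be imported."""
    return importlib.util.find_spec(package_name) is not None


# Hypothetical helpers: report the interpreter version and whether
# xpressplot is visible to it, without raising an error if it is missing.
print("Python major version:", python_major_version())
print("xpressplot installed:", is_installed("xpressplot"))
```

If the second line prints False, the package was most likely installed for a different Python interpreter than the one you are running.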

Use
Assuming you are a beginner user, you will likely want to run the interactive notebook. This has many example functions you can run with a toy dataset, which can be easily modified for your use. Instructions are provided with each block of code. In order to run this interactive notebook, we will need Jupyter Notebook, which is automatically installed with the Anaconda package manager.
1. Let's install Anaconda
-The version you install depends on the version of Python you are using
-Follow this link to install the appropriate version of Anaconda
2. Let's check that Anaconda installed correctly:

...
Installing xpressplot script to /Users/$USERNAME/anaconda3/bin
Installed /Users/$USERNAME/anaconda3/lib/python3.6/site-packages/xpressplot-0.0.1b0-py3.6.egg
Processing dependencies for xpressplot==0.0.1b0
Finished processing dependencies for xpressplot==0.0.1b0
$ echo "export PATH='/Users/$USERNAME/anaconda3/bin:$PATH'" >> ~/.bash_profile

General Usage
xpressplot is intended as an all-in-one toolkit and interface for analysis of sequencing data.

Sequence Data
Required format for all functions (unless otherwise noted in documentation).

Metadata
Required format for all functions (unless otherwise noted in documentation).

RNAseq Datasets
A module will be added in the future to automate this conversion and import from GEO.
1. Download the csv or tsv file provided in the supplement and ensure its formatting follows xpressplot standards
-Sometimes the delimiter is formatted incorrectly. If so, a simple find/replace can be used to replace the incorrect delimiter with a \t (tab)
-Remove the gene name column header, but keep the trailing tab
2. Create a metadata file following xpressplot standards
3. Import data
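The two clean-up steps for a downloaded table (fixing the delimiter and removing the gene name column header while keeping its tab) can be sketched with the Python standard library. This is an illustration only, not an xpressplot function; the helper name normalize_table and the toy gene and sample names are assumptions made for the example.

```python
import csv
import io


def normalize_table(text, wrong_delimiter=","):
    """Re-delimit a table with tabs and blank the first header cell.

    Blanking (rather than deleting) the first header cell keeps the
    tab that follows it, so the header row still lines up with the data.
    """
    rows = list(csv.reader(io.StringIO(text), delimiter=wrong_delimiter))
    rows[0][0] = ""  # remove the gene name column header, keep its tab
    return "".join("\t".join(row) + "\n" for row in rows)


# Toy comma-delimited table (hypothetical gene and sample names)
raw = "gene,sample_A,sample_B\nGENE1,10,12\nGENE2,0,3\n"
cleaned = normalize_table(raw)
print(repr(cleaned))
```

The cleaned text can then be written to a .tsv file and imported as described above.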
-Position is relative to the delimiter in the final field (usually a ";"), so if the new name is in the third position, new_name_location=2, etc.
-This function pulls original and new gene name information from any row where the third field is "gene". You can run cat transcripts.gtf | awk '$3 == "gene"' | less -S from the command line on your reference file to identify the positions of the required text fields
Parameters:
data: Dataframe whose row names will be converted
gtf: Path and name of GTF reference file
orig_name_label: Label of original name (usually "gene_id ")
orig_name_location: Position in last column of GTF where relevant data is found (i.e. 0 would be the first sub-string, before the first delimiter; 2 would be the third sub-string, after the second delimiter and before the third)
new_name_label: Label of new name (usually "gene_name ")
new_name_location: Position in last column of GTF where relevant data is found (i.e. 0 would be the first sub-string, before the first delimiter; 2 would be the third sub-string, after the second delimiter and before the third)
refill: In some cases, where common gene names are unavailable, the dataframe will fill the gene name with the improper field of the GTF. In this case, specify this improper string and these values will be replaced with the original name
sep: GTF field delimiter (usually a tab)
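To see how the label positions are counted, the following sketch splits the last field of one GTF row on ";" and reports the 0-based position of each label. It is not the xpressplot implementation; the helper name attribute_positions and the toy GTF record are assumptions for illustration.

```python
def attribute_positions(gtf_line, sep="\t", attr_delimiter=";"):
    """Map each label in the last GTF field to its 0-based position."""
    last_field = gtf_line.rstrip("\n").split(sep)[-1]
    # Split on the attribute delimiter and drop the empty trailing piece
    parts = [p.strip() for p in last_field.split(attr_delimiter) if p.strip()]
    # The label is the text before the first space, e.g. gene_id
    return {part.split(" ", 1)[0]: index for index, part in enumerate(parts)}


# Toy GTF row (hypothetical record) whose third field is "gene"
line = (
    "chr1\ttoy_source\tgene\t100\t900\t.\t+\t.\t"
    'gene_id "GENE0001"; gene_version "1"; gene_name "ABC1";'
)
positions = attribute_positions(line)
print(positions)
```

For a reference file laid out like this toy row, one would pass orig_name_location=0 and new_name_location=2.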

R/FPKM (Reads/Fragments per Kilobase per Million Mapped Reads)
xpressplot.r_fpkm(data, gtf, feature_type='exon', identifier='gene_name', sep='\t')
Purpose: Perform reads/fragments per kilobase per million mapped reads sample normalization on RNAseq data
Assumptions:
-Dataframe contains raw count data, where samples are along columns and genes across rows
-As FPKM was developed for paired-end sequencing, it accounts for two reads being able to map to one fragment. Therefore, input counts should have accounted for this in the counting step of sequence quantification. If specifying a paired-end alignment in XPRESSpipe, this will have been accounted for.
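The arithmetic behind R/FPKM can be written out in plain Python. This is a sketch of the standard formula, count x 10^9 / (gene length in bp x total mapped reads), and not the xpressplot implementation. It assumes gene lengths are already known (r_fpkm takes a GTF instead, and how it derives lengths from the feature_type records is not shown here) and it approximates total mapped reads by the sum of counts in each sample. The function name fpkm and the toy values are hypothetical.

```python
def fpkm(counts, lengths):
    """Normalize raw counts per sample.

    counts:  {sample: {gene: raw_count}}
    lengths: {gene: length in base pairs}
    """
    normalized = {}
    for sample, gene_counts in counts.items():
        # Assumption: total mapped reads = sum of counts in this sample
        total = sum(gene_counts.values())
        normalized[sample] = {
            gene: count * 1e9 / (lengths[gene] * total)
            for gene, count in gene_counts.items()
        }
    return normalized


# Toy data (hypothetical genes, lengths, and counts)
counts = {"sample_A": {"GENE1": 100, "GENE2": 300}}
lengths = {"GENE1": 1000, "GENE2": 2000}
result = fpkm(counts, lengths)
print(result)
```

GENE2 has three times the raw counts of GENE1 but is twice as long, so after normalization its value is only 1.5 times larger.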

Analysis
The following commands rely heavily on the matplotlib (DOI:10.5281/zenodo.2577644) and seaborn (DOI:10.5281/zenodo.883859) libraries, but have been modified in many cases for ease of plotting given the formatting of xpressplot datasets.