A Computational Workflow for Analysis of 3′ Tag‐Seq Data

RNA‐sequencing (RNA‐seq) is a gold‐standard method to profile genome‐wide changes in gene expression. RNA‐seq uses high‐throughput sequencing technology to quantify the amount of RNA in a biological sample. With the increasing popularity of RNA‐seq, many variations on the protocol have been proposed to extract unique and relevant information from biological samples. 3′ Tag‐Seq (also called TagSeq, 3′ Tag‐RNA‐Seq, and Quant‐Seq 3′ mRNA‐Seq) is one RNA‐seq variation where the 3′ end of the transcript is selected and amplified to yield one copy of cDNA from each transcript in the biological sample. We present a simple, easy‐to‐use, and publicly available computational workflow to analyze 3′ Tag‐Seq data. The workflow begins by trimming sequence adapters from raw FASTQ files. The trimmed sequence reads are checked for quality using FastQC and aligned to the reference genome, and then read counts are obtained using STAR. Differential gene expression analysis is performed using DESeq2, based on differential analysis of gene count data. The outputs of this workflow are MA plots, tables of differentially expressed genes, and UpSet plots. This protocol is intended for users specifically interested in analyzing 3′ Tag‐Seq data, and thus normalizations based on transcript length are not performed within the workflow. Future updates to this workflow could include custom analyses based on the gene counts table as well as data visualization enhancements. © 2023 The Authors. Current Protocols published by Wiley Periodicals LLC.


INTRODUCTION
RNA sequencing (RNA-seq) is a widely used method to detect genome-wide changes in gene expression (Wang, Gerstein, & Snyder, 2009). It was first used in Saccharomyces cerevisiae to identify the gene expression patterns of all genes, exons, and their boundaries across the yeast genome (Nagalakshmi et al., 2008). For humans as well as many model organisms including yeast, fruit flies, and mice, RNA-seq has been instrumental in providing high-resolution, functionally relevant genome annotations (Cherry et al., 2012;Gnerre et al., 2011;IHGSC, 2004;Matthews et al., 2015). Briefly, the RNA-seq protocol begins with the isolation of total RNA, which is typically enriched/selected for polyA + RNA or alternatively depleted for rRNA (Zhao, Zhang, Gamini, Zhang, & von Schack, 2018). After this step, double-stranded cDNA is synthesized via reverse transcription from the RNA, resulting in cDNA libraries. The cDNA molecules are fragmented and sequence adapters are added to the cDNA fragments. Then, the cDNA fragments are subjected to high-throughput sequencing to generate short sequence reads that are used for downstream analyses.
Whole-transcript RNA-seq workflows produce data providing information on quantifying gene expression, novel transcripts, alternatively spliced genes, and allele-specific expression (Wang et al., 2009). Although this information is valuable, the goal of many biological research projects is simply to identify changes in gene expression patterns between conditions of interest. Given this simplified goal, classical RNA-seq protocols are overly complex and more costly than is necessary to obtain genome-wide gene expression changes (Ma et al., 2019;Wang et al., 2009). To simplify classical RNA-seq protocols, recent advances in the field have provided alternative RNA-seq methods that are used to address specific biological questions (Moll, Ante, Seitz, & Reda, 2014;Morrissy et al., 2011). One of these alternative methods is 3 Tag-Seq (also called TagSeq, 3 Tag-RNA-Seq, and Quant-Seq 3 mRNA-Seq). In this method, cDNA libraries are reversetranscribed from only the 3 end of the mRNA, resulting in a single copy of cDNA from each transcript (Ma et al., 2019;Moll et al., 2014;Torres, Metta, Ottenwälder, & Schlötterer, 2008). Compared to classical RNA-seq methods, 3 Tag-Seq is simpler, quicker, and less costly while providing sufficient sequencing depth for differential gene expression analysis (Ma et al., 2019). These benefits make 3 Tag-Seq an ideal choice for researchers whose goal is to identify changes in patterns of gene expression between two or more conditions. Analysis of sequencing datasets from classical RNA-seq data was complex when this technology was first introduced (Wang et al., 2009). Even today, analysis of RNA-seq data can be challenging because it is highly dependent on the experimental design used to create the sequencing libraries (Conesa et al., 2016). Consequently, there is no onesize-fits-all workflow for analysis of RNA-seq output reads. Here, we present a simple, easy-to-use, and publicly available computational workflow to analyze 3 Tag-Seq data, which allows the user to use default analysis parameters or adjust parameters to fit their unique experimental design considerations. The workflow begins by trimming sequence adapters from raw FASTQ files (Cock, Fields, Goto, Heuer, & Rice, 2010;Conesa et al., 2016). The trimmed sequence reads are checked for quality using FastQC and aligned to the reference genome, and read counts are obtained using STAR (Dobin et al., 2013). Differential gene expression analysis is then performed using DESeq2, based on differential analysis of gene count data (Love, Huber, & Anders, 2014). The outputs of this workflow are MA plots, tables of differentially expressed genes, and UpSet plots. This workflow is well-suited for users with minimal computational expertise. (i) Adapters added to raw RNA-seq reads are trimmed using BBDuk. (ii) A quality control report is generated for trimmed reads using FastQC. (iii) Reads passing the QC check are aligned to the reference genome using STAR and a gene count table is created. (iv) The gene count table is used to run differential gene expression analysis using DESeq2, and the DESeq2 output is saved as a table to a file for downstream usage.

RUNNING THE 3 Tag-Seq WORKFLOW
In the following section, we describe in detail our 3 Tag-Seq analysis workflow. Figure 1 provides a summary of the computational workflow. Briefly the pipeline begins with the processing of raw RNA-seq FASTQ files and ends with a table output of differential gene expression (Love et al., 2014).

Necessary Resources Hardware
An internet-connected machine running Linux 64-bit Ubuntu version 20.04.1 with at least 32 GB RAM. The number of threads can be provided by the user, but 16 threads is recommended.

Software
The workflow uses the Conda command line tool environment to install all required software and tools. Conda software can be accessed at https:// docs.conda.io/ en/ latest/ miniconda.html. The workflow is saved in a bash Paropkari et al.

of 9
script file called pipeline.sh. The source code and documentation can be found on GitHub at https:// github.com/ akshayparopkari/ RNAseq/ wiki. The Conda environment file is provided in Supplemental file 1 (see Supporting Information).

Other requirements
Access to a computational cluster and login information Basic knowledge of Unix Raw FASTQ sequencing data Sample metadata

Download the RNA-seq workflow on a local machine
In Linux and MacOS, use the built-in Terminal application. In Windows, download and use Git Bash (https:// gitforwindows.org/ ).

Load the Conda virtual environment
Conda enables virtual environments that contain the required software packages/libraries to be installed and set up. In this instance, the RNA-seq Conda environment contains the BBMap suite, STAR alignment software, and FASTQC tool (Bushnell, 2014;Dobin et al., 2013). Additionally, required Python and R libraries and their dependencies are also installed.
conda env create -f Supplemental_file_1.yml -n RNAseq The environment only needs to be created once. For subsequent analysis, the environment can be activated to run the analysis using the command: conda activate RNAseq Create an input data folder 4. The main script of 3 Tag-Seq is the pipeline.sh file. This single bash script contains all the preprocessing steps: QC filtering with BBDuk, generating QC summaries with FastQC, and alignment and gene counting with STAR. The pipeline.sh script takes in a single input, which is a folder/directory containing: all raw FASTQ sequence files the sample metadata Excel file The raw FASTQ sequence files may be either compressed (using gzip) or uncompressed. The file names must start with the sample ID, followed by an underscore and the rest of the file name. For example, projectname_date_L001.fastq.gz should be named sampleid_projectname_date_L001.fastq.gz. The first part of the file name before the first underscore dictates the sample the script is processing. The sample metadata file contains all metadata associated with the input samples including sample ID, genotype, condition, treatment, time, etc. For this repository, the sample metadata file must contain at least two columns: Sam-pleID and Condition. Table 1 is an example of a sample metadata file, where the first two columns (SampleID and Condition) are required and the third and subsequent columns (e.g., FASTQ_file) are optional but highly recommended. A comprehensive metadata file also enables convenient sample submission to a sequence read archive (SRA) once your manuscript is published.

Transfer data to/from a cloud computing resource to a local machine via command line
5. Below is a common usage of the secure copy scp function, which is one of the commands used to transfer files to/from a cloud computing resource. The other command is a secure file transfer protocol sftp. Please refer to the cloud computing resource wiki for detailed instructions on sftp function.

scp FROM TO
where FROM is the source location and TO is the destination location.

Run the RNA-seq pipeline
The RNA-seq Conda environment must be activated before attempting to execute the pipeline.
INPUTFOLDER="path/to/your/input/folder" # enter your data folder with FASTQ files here bash pipeline.sh "$INPUTFOLDER" <num_og_threads_to_use> > "$INPUTFOLDER"/preprocess.log Output files: pipeline.sh creates multiple output files, which can be useful to gain insights into specific samples to address any discrepancy in the data. The three important files to check are: gene_raw_counts.txt: a tab-separated file of raw gene counts for all samples, with gene names as rows and samples as columns deseq2_lfc.txt: a tab-separated file from the DESeq2 analysis MA_plot.pdf: a .pdf file depicting volcano plots of log fold changes against mean gene expression Paropkari et al.

of 9
Additional information about other output files: All files ending "_trimmed.fastq" are trimmed sequences from BBmap and are saved in the trim_log directory All files ending ".bam" are alignment files generated by STAR and are saved in the STAR_log directory All files ending "ReadsPerGene.out.tab" are gene count files for each sample generated by STAR and are saved in the STAR_log directory All files ending "Log.out", "Log.final.out", and "Log.progress.
out" are intermediary alignment files generated by STAR and are saved in the STAR_log directory Visualize overlaps in multiple experimental conditions 7. Use the overlap_upsetR.R script to visualize overlaps in genes for multiple experimental conditions. The overlap is represented as an UpSet plot (Lex, Gehlenborg, Strobelt, Vuillemot, & Pfister, 2014). UpSet plots are an extension of Venn diagrams and are useful when there are more than three categories/sets of conditions/samples to consider. The overlap_upsetR.R takes one input (either "up" or "down") to calculate overlap between various samples/conditions. Users need to supply an input directory in the code on line 38 and run the following command to get the output UpSet plot: To visualize genes upregulated in multiple conditions/samples:

overlap_upsetR.R up
To visualize genes downregulated in multiple conditions/samples: overlap_upsetR.R down Figure 2 is an example of an UpSet plot output of Candida albicans RNA-seq data showing upregulated genes overlapping various conditions.

GENERATING GENOME INDICES
During the alignment step, STAR utilizes genome index files for mapping sequenced reads to a reference genome. This protocol describes how to generate genome indices for the C. albicans assembly 21 genome as an example. These steps can be used to generate genome indices for your reference genome of choice.

Necessary Resources
See Basic Protocol.

Background Information
We present a simple and user-friendly workflow to analyze 3 Tag-Seq experimental data. This workflow allows one to determine differentially expressed genes in the experimental conditions of interest. The workflow also includes a useful UpSet plot script to allow further analysis of differential gene expression data.
Given that the experimental methodology behind 3 Tag-Seq is focused on the 3 end of the transcript, information related to splicing events is not captured, and this is a limitation of this methodology.
We note that this workflow is distinct from other pipelines that analyze traditional RNAseq experimental data in terms of gene counting. In traditional RNA-seq analyses, gene counts are normalized by the length of the transcript or fragment to obtain comparable count values across genes and across sampling conditions. For traditional RNA-seq experiments, transcripts are fragmented and reversetranscribed into cDNA. As a result, longer transcripts generate more cDNA, necessitating length-based normalization procedures during analysis. For 3 Tag-Seq experiments, only the 3 -end of the transcript is reverse-transcribed, Paropkari et al.

Critical Parameters and Troubleshooting
The script only takes in a single input-a folder with raw FastQ files. The FastQ files can be either compressed or uncompressed.
See Table 2 for troubleshooting tips.

Understanding Results
The outputs of this pipeline are MA plots, tables of differentially expressed genes, and UpSet plots. The output files are saved in the input folder supplied with the script. The DGE output table consists of six numerical columns: baseMean, log2FoldChange, lfcSE, stat, pvalue, and padj. The details of these columns are explained in "Analyzing RNA-seq data with DESeq2" (https:// www.bioconductor.org/ packages/ devel/ bioc/ vignettes/ DESeq2/ inst/ doc/ DESeq2.html). To obtain a list of genes that are differentially regulated between two conditions, users can focus on the log2FoldChange and padj columns. The log2FoldChange indicates the magnitude of change in expression values between the two experimental conditions and the padj value provides statistical significance towards the calculated magnitude change. The availability of these output files including the MA-plot indicates a successful run of this pipeline.
improved the utility of the workflow. This work was supported by the National Institutes of Health (NIH) National Institute of General Medical Sciences (NIGMS) award number R35GM124594 and by the Kamangar family in the form of an endowed chair to C.J.N. The content is the sole responsibility of the authors and does not represent the views of the funders. The funders had no role in the design of the study; the collection, analysis, or interpretation of the data; the writing of the manuscript; or the decision to publish the results.