The seeker R package: simplified fetching and processing of transcriptome data

Transcriptome data have become invaluable for interrogating biological systems. Preparing a transcriptome dataset for analysis, particularly an RNA-seq dataset, entails multiple steps and software programs, each with its own command-line interface (CLI). Although these CLIs are powerful, they often require shell scripting for automation and parallelization, which can have a high learning curve, especially when the details of the CLIs vary from one tool to another. However, many individuals working with transcriptome data are already familiar with R due to the plethora and popularity of R-based tools for analyzing biological data. Thus, we developed an R package called seeker for simplified fetching and processing of RNA-seq and microarray data. Seeker is a wrapper around various existing tools, and provides a standard interface, simple parallelization, and detailed logging. Seeker’s primary output—sample metadata and gene expression values based on Entrez or Ensembl Gene IDs—can be directly plugged into a differential expression analysis. To maximize reproducibility, seeker is available as a standalone R package and in a Docker image that includes all dependencies, both of which are accessible at https://seeker.hugheylab.org.


INTRODUCTION
Measuring the transcriptome has become a standard approach for understanding biological systems. However, even after a transcriptome dataset is generated, it must go through multiple processing steps before it is ready for analysis. For RNA-seq data in particular, each step typically involves a different software program with its own command-line interface (CLI). Thus, complete processing of transcriptome data requires chaining together the output of one program with the input of another. Automating and parallelizing the entire process has often involved shell scripting, which can have a steep learning curve, limiting reproducibility and portability. To address this issue and simplify RNA-seq processing pipelines, two programs-pyrpipe (Singh et al., 2021) and nf-core/ rnaseq (Ewels et al., 2020)-have recently been developed that wrap around various RNA-seq processing programs. pyrpipe uses Python, whereas nf-core/rnaseq uses Nextflow. However, many individuals working with transcriptome data are already familiar with R, given the popularity of R-based tools for biological data analysis, especially

METHODS
The seeker package's website (https://seeker.hugheylab.org) includes instructions for installing the package, multiple vignettes, and detailed documentation for every function.
The seeker package has no hardware requirements other than those of its system dependencies and of R itself. In general, processing RNA-seq data is more memory-and compute-intensive than processing microarray data. However, the memory requirements and computation time to process a given RNA-seq dataset will depend on one's hardware, the number of files to process and their size, the extent of parallelization, and which processing steps are being run. For specific estimates, we refer readers to the papers and documentation of the underlying tools.

Microarray data
To fetch and process microarray data, the package includes a function called seekerArray (Fig. 1A), which depends on the GEOquery and ArrayExpress R packages. seekerArray can process data from NCBI GEO or Array Express, or raw Affymetrix data stored locally. The main inputs to seekerArray are a study accession and a preferred gene ID type (Entrez or Ensembl). The main outputs are a table of sample metadata and a matrix of log 2transformed gene expression measurements. seekerArray automatically detects the microarray platform and maps probes to the appropriate gene IDs. If the study includes raw Affymetrix data, seekerArray performs RMA (Irizarry, 2003) and uses custom CDFs from Brainarray (Dai, 2005), otherwise seekerArray uses the study's processed data.

RNA-seq data
To fetch and process RNA-seq data, the package includes multiple functions (described below), each of which handles single-end or paired end reads, includes sensible defaults, and allows the user to pass custom arguments to the respective APIs or CLIs. fetchMetadata: Uses the API of the European Nucleotide Archive (ENA) or the Sequence Read Archive (SRA) to fetch a study's sample metadata. The main input is a Bioproject accession. fetch: Uses srafetch, fasterq-dump, and pigz to download files from SRA and convert them to gzipped fastq files in parallel. The main input is a vector of SRA run accessions. trimgalore: Uses Trim Galore to perform standard adapter and quality trimming in parallel. The main input is a vector of paths to fastq files. fastqc: Uses FastQC to perform quality control checks in parallel. The main input is a vector of paths to fastq files. salmon: Uses Salmon to quantify transcript abundances in parallel. The main inputs are a vector of paths to fastq files, corresponding sample names, and a directory containing the salmon transcriptome index (which can be fetched by refgenie via installSysDeps). multiqc: Uses MultiQC to aggregate the results of various processing steps (including Trim Galore, FastQC, and Salmon) into a single report. The main input is a directory containing the results. getTx2gene: Uses the biomaRt package (Durinck et al., 2009) to create a mapping of transcripts to genes based on Ensembl IDs. The main input is an organism name.  tximport: Uses the tximport R package (Soneson, Love & Robinson, 2015) to summarize Salmon's transcript-level quantifications for gene-level analyses. The main inputs are a directory containing quantification directories from Salmon and a mapping of transcripts to genes.
The most computationally intensive functions (fetch, trimgalore, fastqc, and salmon) output their progress as tab-delimited text files.
In addition, the package includes a function called seeker, which can perform any of the above steps and pass the output of one step as input to the next step (Fig. 1B). The main input to the seeker function is a list of parameters, which can be derived from a yaml file, that specifies which steps to perform and how to perform them. Depending on the steps specified, the seeker function can fetch and process publicly available RNA-seq data or process locally stored data. The function also includes a "dry run" option to check the parameters' validity without fetching or processing any data.

Reproducibility
To help ensure that the output of seekerArray and seeker is reproducible, both functions save a yaml file of the user-defined parameters and a text file containing the R session information. The latter is provided by the sessioninfo R package and includes version numbers and sources of all loaded packages. In addition, the seeker function saves a text file containing paths and version numbers of its system dependencies, as well as a yaml file of the conda environment, which includes version numbers of the miniconda-based dependencies.
To provide an even higher level of reproducibility, the seeker package and its dependencies are available in a Docker image called socker (https://github.com/hugheylab/ socker), which is based on rocker/rstudio (https://hub.docker.com/r/rocker/rstudio). Using socker thus ensures that the output is not dependent on details of the local computing environment (Nüst et al., 2020).

RESULTS
To illustrate the utility of the seeker package, we used it to fetch and process multiple publicly available transcriptome datasets. We first used the seekerArray function to fetch two datasets related to circadian metabolism in mouse liver (GSE34018 and GSE67964) (Cho et al., 2012;Zhang et al., 2015). The main output for GSE34018, which is based on an Illumina beadchip, consisted of 24 samples and 17,142 Entrez genes. The main output for GSE67964, which is based on an Affymetrix array, consisted of eight samples and 27,352 Entrez genes. We used these outputs to perform principal components analysis and to calculate differential expression using limma (Ritchie et al., 2015) between the wild-type and knockout samples in each dataset (Fig. 2). The log 2 fold-changes for Entrez genes measured in the two datasets were weakly negatively correlated (Spearman's rho −0.092), consistent with the opposing roles of the genes knocked out in each dataset (Nr1d1 and Nr1d2 in GSE34018; Rora and Rorc in GSE67964). We next used the seeker function to fetch and process one sample from each of two RNA-seq datasets related to circadian rhythms and feeding in mouse liver (PRJNA600892 and PRJNA667743) (Guan et al., 2020;Manella et al., 2021). We chose two samples that had the same genotype (wild-type), feeding regimen (ad libitum), and time of day of acquisition (4 h after lights on). PRJNA600892 is based on paired-end sequencing and Illumina TruSeq Stranded library prep, whereas PRJNA667743 is based on single-end sequencing and bulk MARS-Seq (3′-tagged) library prep (Jaitin et al., 2014). The seeker function detected and handled the paired-end and single-end reads automatically. To account for the 3′-tagged reads in PRJNA667743, we specified the countsFromAbundance argument to tximport as "no" in the input parameters to seeker. For simplicity and speed, we disabled the trimgalore step and used a salmon index based on partial selective alignment. The main outputs were the sample metadata and the tximport object including counts and abundances for 35,494 Ensembl genes. As expected, the gene-level abundances between the samples from the two datasets were highly correlated ( Fig. 3; Spearman's rho 0.877).

DISCUSSION
The seeker package complements existing software packages for fetching and processing transcriptome data. Whereas pyrpipe and nf-core/rnaseq are focused on RNA-seq data, seeker can also process microarray data. Here seeker builds on the GEOquery and ArrayExpress R packages and our previous package metapredict (Hughey & Butte, 2015) by providing a simple interface to map microarray probe sets to standard gene IDs. In addition, unlike pyrpipe, seeker can run in parallel natively, without requiring a workflow manager such as Nextflow. seeker also allows the user to specify all processing parameters for a given dataset in a single yaml file. Perhaps most importantly, seeker uses the R language and computing environment, and thus could appeal to a wider group of researchers already using R for bioinformatic analyses. The seeker package does have limitations. First, the current implementation focuses on processing bulk transcriptome data using a relatively small set of tools. Although these should suffice for the majority of use cases, seeker currently supports a more limited set than either pyrpipe or nf-core/rnaseq. For example, it supports only salmon for quantifying transcript-level abundances, which is a standard in the field and is both fast and accurate. In the future, we plan to extend seeker to accommodate other types of genomic data, including from single cells, and a wider range of tools. Second, although seeker thoroughly documents its computing environment, recreating that environment on another machine (if not using the socker Docker image) would still be a manual process. We welcome contributions from the community to improve seeker in these and any other ways. By providing a straightforward, R-based approach to small-or large-scale processing of transcriptome data, we hope seeker contributes to improved reproducibility and transparency of genomic analyses.