Selective transcriptional regulation by Myc: Experimental design and computational analysis of high-throughput sequencing data

The gene expression programs regulated by the Myc transcription factor were evaluated by integrated genome-wide profiling of Myc binding sites, chromatin marks and RNA expression in several biological models. Our results indicate that Myc directly drives selective transcriptional regulation, which in certain physiological conditions may indirectly lead to RNA amplification. Here, we illustrate in detail the experimental design concerning the high-throughput sequencing data associated with our study (Sabò et al., Nature. (2014) 511:488–492) and the R scripts used for their computational analysis.


a b s t r a c t
The gene expression programs regulated by the Myc transcription factor were evaluated by integrated genome-wide profiling of Myc binding sites, chromatin marks and RNA expression in several biological models. Our results indicate that Myc directly drives selective transcriptional regulation, which in certain physiological conditions may indirectly lead to RNA amplification. Here, we illustrate in detail the experimental design concerning the highthroughput sequencing data associated with our study (Sabò et

Experimental design
The Gene Expression Omnibus (GEO) Series GSE51011 contains 94 high-throughput sequencing samples associated with the Sabò et al. study [1]. These samples cover different -omics datasets (ChIP-Seq for transcription factors and post-translational histone modifications, RNA-Seq, 4sU-Seq and DNAseI-Seq) produced in different organisms and biological systems. Some of the samples have to be used as references for other samples: for example, as inputs in the ChIP-seq peak calling procedure, or as baselines for the identification of differentially expressed genes. To help navigating through these data we collected the most relevant associated metadata in Table 1.
The different biological systems used in the study allowed the analysis of the effects of modulation of Myc levels in vitro and in vivo, both at physiological and pathological levels. In the Eμ-myc model [2], Myc overexpression was achieved in vivo specifically in the mouse B-cell compartment, where it causes lymphoma development. This model system gave access to primary wild type B-cells (Control, "C"), Em-myc transgenic B-cells not yet transformed (Pre-tumoral, "P"), and lymphoma tumoral cells (Tumor, "T"). Modulation of Myc expression in human B-cells was obtained in a timecontrolled manner in vitro in the cell line P493-6 [3], harboring a tet-regulated Myc transgene. A line of mouse 3T9 fibroblasts was also used in which endogenous c-myc was modulated from low basal levels (in conditions of serum starvation) to mitogen-induced levels (upon serum stimulation). In the same cells, we expressed a conditionally active MycER chimaera, allowing us to induce active Myc at supra-physiological levels through administration of OHT to the culture medium.

Data analysis: Source code design and installation
In addition to the methods in the original publication [1], the source code used for the computational analysis of the high-throughput sequencing data is available as supplemental material of this manuscript.

R/Bioconductor and the compEpiTools package
The source code is entirely written using R, an open-source language and environment for statistical computing and graphics. In particular, several of the scripts developed for this study take advantage of R packages developed within the Bioconductor project [4], which currently counts more than 700 packages contributed from the scientific community, mostly dedicated to the analysis of highthroughput biological data. In the Bioconductor spirit, most of the scripts developed for this study were included in an R package (compEpiTools), which was recently approved as part of that project and is available on the Bioconductor website at the following URL: http://www.bioconductor.org/packages/ release/bioc/html/compEpiTools.html. To ensure complete reproducibility, we include here the original version of the compEpiTools package (v0.1) preceding the submission to Bioconductor, which was the one actually used for the computational analysis of the published data. Importantly, compEpiTools (both v0.1 and following versions) is totally compliant with the Bioconductor computational infrastructures, and therefore the results generated here are highly compatible with the other tools offered by Bioconductor.
From here on, R commands will be indicated enclosed within quotes (e.g. 'load'), while file and folder names will be indicated in italic (e.g. file1.txt).

Description of source code files
The source code (saboEtAl2014_sourceCode.zip) is composed by 6 files and two folders:   ) and Bioconductor (http://www.bioconductor.org/) web sites for documentation and tutorials on the R language and common Bioconductor infrastructure (e.g. the GRanges object) and methods.

filemapping_GEO.R
Contains virtual links to the sequencing data indicated in Table 1 and processing steps used to transform peak lists in GRanges R objects, which were saved in the R binary file peaksRef.rda in the data folder.

analysisEnvironment.R
An R script, including a number of additional compEpiTools functions and methods not contained in compEpiTools_0.1.tar.gz. This script can be called with the 'source' R command when initializing the R session.

saboEtAl2014_Figures.R
An R script containing the code used to generate the main figures resulting from computational analyses in the published paper [1].

saboEtAl2014_ExtData.R
An R script containing the code used to generate the extended figures resulting from computational analyses in the published paper [1].

saboEtAl2014_ExtData10.R
An R script containing the code used to generate extended figure 10 in the published paper [1].
data folder A folder containing the input and output data, formatted as R objects or text files.
figures folder A folder containing the figures resulting from the computational data analysis, which were used as panels to assemble the main and extended figures in the published paper [1].
This file is also available at the following URL: http://genomics.iit.it/supplementalData/SaboNa ture2014. In case updated versions will be necessary they will be released there, while the original zip file will always be available.

Getting started
The file filemapping_GEO.R allows matching of the GEO samples listed in Table 1 with the corresponding computational objects. In particular, the code shows how genomic regions such as ChIP-Seq peaks and DNAseI-Seq hypersensitive sites were stored as GRange objects. A GRange is a basic Bioconductor infrastructure that minimally contains the chromosome assignments as well as the start and end nucleotide positions for a set of genomic regions. filemapping_GEO.R reports how the ChIP-seq peaks were processed, i.e. considering the filters on the associated p-values and the pooling of replicated experiments (see [1] methods section for a description of the actual peak calling procedure). The final lists of peaks were saved in the peaksRef.rda file available in the data folder. This is a binary file containing an R object (list of GRanges) and can be loaded into R using the 'load' R command.
The analysisEnvironment.R file contains a set of R commands needed for the setup of the working environment before starting to reproduce the analyses contained in the saboEtAl2014_* files. The 'source' R command can be used to execute the R commands included in analysisEnvironment.R.
This uploads a number of data and additional functions into the R memory and executes processing steps described at the bottom of the analysisEnvironment.R file.
The saboEtAl2014_Figures.R file contains the R commands used to generate individual figures (or panels) as indicated throughout the code. The code can be copied and pasted in the R GUI (or in the command line shell) to obtain the resulting data or figure. Please note that some steps might depend on the execution of previous steps reported in the same file. The results were already incorporated in the saboEtAl2014_Figures.R file itself (in case of numbers) or included in the figures folder (in case of figures or figure panels).
The same logic applies to saboEtAl2014_ExtData.R and saboEtAl2014_ExtData10.R for the results reported in the extended figures.
The original FASTQ sequencing data were submitted to GEO (GSE51011 series) and are available there as SRA files. SRA files can be transformed back to FASTQ files using the fastq-dump tool (http:// www.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=toolkit_doc&f=fastq-dump). Finally, BAM files can be obtained by aligning the FASTQ files the reference genome indicated in Table 1, using BWA and TopHat aligners using default parameters for ChIP-seq and RNA-seq data, respectively.

RNAPII stalling index analysis
The RNAPII stalling index (SI) was determined relating the RNAPII ChIP-seq reads density in the region around the transcription start site (tss) to the density in the genebody (GB). Specifically, the SI was obtained as a ratio of the number of reads counted on RNAPII alignment files in the interval [tss À 300 bp, tssþ300] (TSS) and [tss þ300, transcription end siteþ3000] (GB): SI ¼TSS/GB [5]. In literature, this quantity was referred alternatively as travelling ratio (TR ¼GB/TSS) [6]; yet, some studies refer as travelling ratio the ratio TSS/GB [7,8]. The stalling index was computed considering all transcripts whose length was above 600 bp having an RNAPII ChIP-seq peak on their TSS. Since the stalling index reflects the balance between two different effects (the amount of RNAPII loaded on the TSS of a gene, and the amount of RNAPII travelling on the genebody), the code in the R source file saboEtAl2014_ExtData10.R clarifies this point by separately plotting the TSS and GB read distributions along with the SI.