RNA-Seq profiling in peripheral blood mononuclear cells of amyotrophic lateral sclerosis patients and controls

Coding and long non-coding RNA (lncRNA) metabolism is now revealing its crucial role in Amyotrophic Lateral Sclerosis (ALS) pathogenesis. In this work, we present a dataset obtained via Illumina RNA-seq analysis on Peripheral Blood Mononuclear Cells (PBMCs) from sporadic and mutated ALS patients (mutations in FUS, TARDBP, SOD1 and VCP genes) and healthy controls. This dataset allows the whole-transcriptome characterization of PBMCs content, both in terms of coding and non-coding RNAs, in order to compare the disease state to the healthy controls, both for sporadic patients and for mutated patients. Our dataset is a starting point for the omni-comprehensive analysis of coding and lncRNAs, from an easy to withdraw, manage and store tissue that shows to be a suitable model for RNA profiling in ALS.


Background & Summary
Deep sequencing technologies allow high-throughput massive-parallel RNA sequencing, permitting an extensive characterization of transcriptomic profile of cell populations and tissues. RNA-seq analyses are the gold standard for in depth characterization of global changes in gene expression levels between different conditions (e.g.: disease and healthy subjects or treated and untreated samples) 1 .
Such comprehensive analysis are currently bringing the light on the expression of largely unexplored RNA classes. While coding RNA involvement in a wide plethora of disorders has been subject of intense research, the more recently revealed class of lncRNAs is only starting to be characterized. LncRNAs are RNA transcripts greater than 200 nucleotides that lack an open reading frame and therefore do not encode proteins 2 . The heterogeneity of lncRNAs, both in biological type and function, is a major obstacle that makes it difficult to extract information about lncRNAs function and to enrich our knowledge 3 . While coding genes are widely annotated, high-quality catalogues of lncRNAs and information on the tissues in which they are expressed are recently being constructed [4][5][6] . Recent efforts are directed to characterize this largely unexplored functional component of the genome. Deep sequencing is a precious instrument to obtain information suitable to investigate both coding and lncRNAs simultaneously in the same sample, and this comprehensive RNA analysis can elucidate fundamental mechanisms in gene expression regulation at post-transcriptional level 7,8 .
There is mounting evidence that altered RNA metabolism, both involving coding and non-coding RNAs, plays an important role in ALS pathogenesis [9][10][11][12][13] . ALS is an adult-onset, progressive and fatal neurodegenerative disease, caused by the selective loss of both upper and lower motor neurons in the cerebral cortex, brainstem and spinal cord. The pathogenesis of the disease is still unknown. Moreover, the investigation of ALS mechanism is challenging due to the difficulty to obtain a sufficient amount of biological material from ALS patients: it is indeed impossible to access biological samples from the most affected areas, such as spinal cord. PBMCs have been shown to be a convincing and realistic model for ALS, since many pathways, typically located in neurons, are also activated in these cells 8,14,15 .
Furthermore, low ALS incidence (ranging between 1 and 2.5/100'000) 16 and low percentage of familial ALS cases (5-10%) impose the requirement of data sharing and data aggregation among different hospitals and laboratories in order to obtain a sufficient sample size. Detailed curation of these datasets is paramount for accurate interpretation, widespread dissemination, and repurposing of data.
In this paper, we present a dataset suitable for the analysis for both coding RNAs and lncRNAs in PBMCs of 15 sporadic ALS (sALS) patients, 9 ALS patients with mutation in classical ALS-genes (SOD1, FUS, TARDBP and VCP) and 7 healthy controls. More in detail, these data are suitable for antisense lncRNAs detection and characterization because of the chemistry of the protocol properly design to maintain strandness information. Strand-specific RNA-Seq permits a more complete understanding of the transcriptome, enhancing resolution for sense/antisense profiling 17 .
Part of this dataset was used to perform an analysis of differentially expressed transcripts (coding RNAs and lncRNAs) 8 aiming at investigating the importance of sense and antisense RNA regulation in central and peripheral systems in ALS disease.
To our knowledge, no dataset of RNA expression in PBMCs from ALS patients is publicly available. Our dataset is a starting point for the omni-comprehensive analysis of RNA classes, from an easy to withdraw, manage and store tissue that shows to be a suitable model for RNA profiling in ALS 8 .
It was shown that in RNA-seq studies, increasing sample size is the best choice to enhance statistical power 18 . For these reasons, the re-usability of this dataset is related to the possibility to expand studies with smaller sample sizes, especially for mutated patients, to identify new sense and antisense transcripts with altered expression in both sporadic and mutated ALS patients compared to healthy controls and to provide an open-angle point of view for a broad-spectrum characterization.

Methods
Isolation of human PBMCs from ALS patients and healthy controls Subject enlistment. 24 ALS patients and 7 age-and sex-matched healthy controls (CTRL) were recruited after obtaining written informed consent (Table 1). ALS patients underwent clinical and neurologic examination at IRCCS National Neurological Institute "C. Mondino" (Pavia, Italy). All patients were diagnosed with ALS as defined by El Escorial criteria. All ALS patients were analysed for causative mutations in SOD1, TARDBP, FUS, C9orf72, ANG and VCP genes: 15 patients resulted to be sporadic (sALS) and are indicated as ALS-s1, ALS-s2 and so on. Three patients were mutated in SOD1 gene (SOD1-m1, SOD1-m2 and SOD1-m3), three in FUS gene (FUS-m1, FUS-m2 and FUS-m3), two in TARDBP gene (TARDBP-m1 and TARDBP-m2) and one in VCP gene (VCP-m1). The seven control subjects (indicatetd as CTRL1-CTRL7) have been recruited at the Transfusional Service and Centre of Transplantation Immunology, Foundation San Matteo, IRCCS (Pavia, Italy). All details are reported in Table 1.
The study protocol to obtain PBMCs from patients and controls was approved by the Ethical Committee of the National Neurological Institute "C. Mondino", IRCCS (Pavia, Italy). Before being enrolled, the subjects participating in the study signed an informed consent form (Protocol n°375/04version 07/01/2004). All experiments were performed in accordance with relevant guidelines and regulations. These methods and following methods are expanded versions of descriptions in our related work 8 . Isolation of human PBMCs. PBMCs were prepared by centrifugation. Peripheral blood was layered (density = 1.077) and centrifuged at 950 g for 30 min. After isolation on a Ficoll-Histopaque layer (Sigma Aldrich, Italy), cell viability was assayed by a trypan blue exclusion test and cells were then used for RNA extraction.
RNA extraction. Samples were homogenized and total RNA was isolated by Trizol ® reagent (Life Science Technologies, Italy) following the manufacturer's protocol. RNAs were quantified using a Nanodrop ND-100 Spectrophotometer (Nanodrop Technologies, Wilmington, USA) and a 2100 Bioanalyzer (Agilent RNA 6000 Nano Kit, Waldbronn, Germany); RNAs with a 260:280 ratio of ≥1.5 and an RNA integrity number of ≥8 were deep sequenced (Supplementary Figure 1).
RNA-seq library preparation, sequencing and analysis. Sequencing libraries were prepared with the Illumina TruSeq Stranded RNA Library Prep, version 2, Protocol D, using 500-ng total RNA (Illumina, USA). The qualities of the libraries were assessed by 2100 Bioanalyzer with a DNA1000 assay. Libraries were quantified by qPCR using the KAPA Library Quantification kit for Illumina sequencing platforms (KAPA Biosystems); RNA processing has been carried out using Illumina NextSeq 500 Sequencing, using 8 samples for each run mixing samples and controls in each flowcell to avoid not manageable batch-effects.  Table 1. RNA-seq profiling to evaluate differential gene expression. All samples had the same source (blood sample) and were processed with the following steps: PBMCs isolation, RNA extraction and RNA-seq. All samples included in this study were Italian. All ALS patients had spinal onset and sample collection was performed after 6 months from exordium. FastQ files were generated via llumina bcl2fastq2 (Version 2.17.1.14 -http://support.illumina.com/ downloads/bcl2fastq-conversion-software-v217.html) starting from .bcl files produced by Illumina NextSeq sequencer (see Table 2).

Quality validation and read mapping
Quality of individual sequences were evaluated using FastQC software (see Code Availability 1) after adapter trimming with cutadapt software. Per base sequence quality plots, showing the Phred quality score distribution among all sample reads are shown in Figure 1. Gene and transcript intensities were computed using STAR/RSEM software 19 using the "--strandness forward" option (see Code Availability 2 and 3 and Figure 2). Human genome reference used for the alignment was GRCh38 (Gencode release 27): in a rapidly evolving field like the one of non coding RNA analysis it is indeed fundamental to use up-to-date reference versions containing all the available information about annotated coding and non coding RNAs.
Mapping results are summarized in Table 2. Percentage of uniquely mapped reads is 21.8% on average, with standard deviation 10.35%. Remaining reads belong to ribosomal RNA because of the non perfect ribosomal RNA depletion. Since 10 to 25 M reads are suggested for RNA-Seq experiments 20 , this dataset is suitable for differential expression analysis, since about 60% of samples (21/36) have at least 10 M reads. Furthermore, biological replicates are the most effective strategy to improve statistical power   in such experiments. For this reason, we decided to share our dataset with the scientific community, to have the possibility of integrating our data with similar data obtained in different laboratories. Expressed transcripts per sample were evaluated imposing a minimum threshold of 5 counts per gene to consider it as expressed. Figure 3 shows the amount of coding and non coding transcripts detected in each sample. With the exception of 4 samples which have a low number of detected transcripts (probably due to a lower number of available reads), 7000 to 14000 expressed coding genes and 500 to 3000 non coding genes were detected per sample. This is in accordance with the currently available knowledge about non coding transcripts that result to be globally less ex-pressed than coding ones within the cell 21 .

Downstream analysis
Differential expression analysis for all transcripts (coding and non coding RNAs) was performed with the R package DESeq2 22 (see Code Availability 4). This tool was selected because of its superior performance in identifying isoforms differential expression 23 .  Genes were considered differentially expressed and retained for further analysis with |log 2 (disease sample/healthy control)| ≥ 1 and a FDR ≤ 0.1. Analysis workflow is summarized in Figure 2.
For each sample, we evaluated the number of detected transcripts for coding and lncRNAs, separately. Transcripts were considered if covered by at least 5 reads 24 . Results are summarized in Figure 3.

Code availability
The following software and versions were used for quality control and data analysis as described in the main text:

RNA integrity assessment
RNAs were quantified using a Nanodrop ND-100 Spectrophotometer (Nanodrop Technologies, Wilmington, USA) and a 2100 Bioanalyzer (Agilent RNA 6000 Nano Kit, Waldbronn, Germany); RNAs with a 260:280 ratio of ≥1.5 and an RNA integrity number of ≥8 were subjected to deep sequencing.

RNA-seq data qyality assessment
We applied FastQC v0.11.5 software to verify that per base quality scores are suitable for downstream analysis (Figure 1). Mapping percentage have been computed and are reported in Table 2. PCA plot, distance matrix and dispersion estimates are also shown in Figure 2.