Biomarker potential of repetitive-element transcriptome in lung cancer

Since repetitive elements (REs) account for nearly 53% of the human genome, profiling its transcription after an oncogenic change might help in the search for new biomarkers. Lung cancer was selected as target since it is the most frequent cause of cancer death. A bioinformatic workflow based on well-established bioinformatic tools (such as RepEnrich, RepBase, SAMTools, edgeR and DESeq2) has been developed to identify differentially expressed RNAs from REs. It was trained and tested with public RNA-seq data from matched sequencing of tumour and healthy lung tissues from the same patient to reveal differential expression within the RE transcriptome. Healthy lung tissues express a specific set of REs whose expression, after an oncogenic process, is strictly and specifically changed. Discrete sets of differentially expressed REs were found for lung adenocarcinoma, for small-cell lung cancer, and for both cancers. Differential expression affects more HERV-than LINE-derived REs and seems biased towards down-regulation in cancer cells. REs behaving consistently in all patients were tested in a different patient cohort to validate the proposed biomarkers. Down-regulation of AluYg6 and LTR18B was confirmed as potential lung cancer biomarkers, while up-regulation of HERVK11D-Int is specific for lung adenocarcinoma and up-regulation of UCON88 is specific for small cell lung cancer. Hence, the study of RE transcriptome might be considered another research target in cancer, making REs a promising source of lung cancer biomarkers.

Architecture, software and databases 151 The bioinformatic workflow has been tested, implemented and executed on a SUSE Linux Enterprise  Venn diagrams were obtained at http://bioinformatics.psb.ugent.be/webtools/Venn.

159
Human genome version hg38.p1 (GCA 000001405.16) was used as reference for mapping. Following  Read pre-processing 165 The automated workflow designed in this study can be divided into three main blocks: data pre-processing, 166 quantification and differential expression (Fig. 1). Contaminating sequences, adaptors, low quality 167 regions, etc., were removed from raw reads (Raw reads in Fig. 1 Figure 1. Schematic workflow for assessing differential expression of REs from matched-sample raw reads developed in this study. Each main block is explained in detail in the main text. Grey rectangles indicate bioinformatic tools and filters; lilac boxes correspond to R functions; green and yellow boxes are relevant input or output data to be saved; and dark green cylinders refer to databases. using SAMtools, to obtain one sorted BAM file and one FastQ file containing the multi-mapping reads for 171 each SAMPLE (Fig. 1).

173
In the second block of the workflow (Fig. 1), quantification of RE expression was obtained using 174 RepEnrich with the following files obtained in the previous step: (1) FastQ files containing paired-reads 175 mapping more than once on the human genome, (2) sorted, indexed BAM files with reads mapping only 176 once on the human genome, (3) the human genome sequences in the appropriate format (Hg38 RE), and 177 (4) the RepBase human subset of REs in the appropriate format (hRB RE). As a result, three files per 178 sample were obtained: one with the expression of each RE, another with the expression level for each RE 179 class, and another for each RE family. Throughout this study, only data from RE file were used, since  Tables S3 and S6).

181
All snRNA, rRNA and tRNA elements were removed from output because they can provide spurious 182 differences due to the way each laboratory extracted the RNAs. This resulted in 1 190 analysed REs.

183
REs with low expression level (less than 10 counts per million in at least 2/3 of the samples) were also 184 discarded. Finally, filtered RE expression data for all samples were gathered in a single matrix y (Fig. 1).

185
Differential expression 186 The expression matrix y for each SAMPLE was analysed in two parallel ways: (1) to determine the  The logFC pP was based on the cmp function of edgeR package taking into account that samples were where rows in the lfc matrix contained logFC pP for each RE, and columns corresponded to patients.

199
The logFC OE was necessary to determine whether the expression change of a RE was statistically 200 significant or not. Data contained in y were conveniently processed using function DESeq from the 201 DESeq2 package taking into account the normal/tumour samples belonging to the same patient to obtain 202 the object dds that contained the logFC OE values and their significance together with, among other data.

204
The workflow finally provided graphical representations of the analysed data, stored in several PDFs 205 for convenience (bottom of Fig. 1). with tumour samples (Fig. 2). Since misclassification could be assigned to some kind of mishandling 212 during RNA preparation, this sample was discarded together with its matched tumour sample S585270. 213 Therefore, samples from the 50 LUAD patients and only 16 SCLC patients were then pre-processed.

214
The overall pre-processing results of samples grouped by cancer types is shown in Table 1, where it 215 can be observed that only 12.29% of LUAD reads and 7.29% of SCLC reads were discarded. The level 216 of multi-mapping reads (corresponding to REs and other repetitive sequences) is only 6.91% and 6.32% 217 in LUAD and SCLC, respectively. The reasonable number of biological replicates, the high number of 218 useful reads, the low rate of rejection, and a reasonable rate of multi-mapping reads ensured that results 219 derived from these data could be representative of lung cancer. 220 Table 1. Summary of read pre-processing and mapping against hg38.p1 grouped by lung cancer type. Manuscript to be reviewed Methods.

232
Overall RE expression after oncogenic reprogramming was also analysed per patient in each lung 233 cancer. It revealed that the median values of logFC pP for all REs was statistically equal to zero in both 234 LUAD and SCLC (Supplemental Fig. S1). This demonstrated that there was no broad RE deregulation 235 after an oncogenic change in lung cells, which led to the analysis of overall RE differential expression by  Table S2). The advantage of this kind of plot is that it provides an overview 241 of the logFC OE used for differential expression (triangles in Fig. 3), together with the distribution of Manuscript to be reviewed Transposon Cancer_type Repetitive element  Table S3). In conclusion, it seems that ERVs of the LTR class were the 257 most affected REs after LUAD-specific expression changes.

259
The 16 patient samples of SCLC which met our criteria provided 32 matched samples whose analysis 260 revealed 71 differentially expressed REs (Fig. 4), all of them with FDR < 10 4 (Supplemental Tables S4 261 and S5). As occurred in LUAD, logFC pP medians and logFC OE values were also consistent in SCLC 262 (Fig. 4) Manuscript to be reviewed Transposon logFC Repetitive element Differentially expressed REs for SCLC (Fig. 4), LUAD (Fig. 3) and normal lung (Supplemental Table S1) 283 were compared to obtain REs that could be disease-specific and those shared by diseases (Fig. 6). Most  It became evident that MER65-int, the RE common to the three samples in Fig. 6, presented statistically

319
The same filtering criteria described above were applied to SCLC-specific and LUAD-specific differentially-320 expressed REs in Fig. 6. The four LUAD-specific REs (ALR Alpha, HERV3-Int, HERVK11D-Int and 321 LTR4) were all up-regulated in tumour cells, but the logFC = 0 value fell within their logFC pP ± 1.5 ⇥

322
IQR range (Fig. 3). Moreover, the Venn diagram in Fig. 8 (left) confirmed that three were also expressed-  Supplemental Table S4), while 7 were up-regulated (marked with an asterisk in Supplemental Table S5).

330
Interestingly, the Venn diagram of Fig. 8 (right) shows that among the 55 SCLC-specific REs, only 331 UCON88 (marked with a bullet (•) in Supplemental Table S5) is not expressed at all in any LUAD 332 cells but is up-regulated in SCLC, suggesting that it is a good candidate for SCLC-specific biomarker.  (Table 1) provided good quality control analyses, despite the removal of one patient from the 340 SCLC dataset (Fig. 2). This is illustrated by the reasonable number of matched biological replicates and 341 the high number of reads per sample longer than 40 nt that remained useful. The number of multi-mapping 342 reads (6.91% in LUAD and 6.32% in SCLC, Table 1) is lower than described for prostate cancer (12.43%   Table S1). Unexpectedly, healthy (normal) lung samples from LUAD and not present a neat MER126 down-regulation (Fig. 7B). Hence, only AluYg6 and LTR18B retained an 420 intact potential biomarker capability in all patients that might be of use in future theranostic testings for 421 lung cancer.

422
The biomarker capability of AluYg6 and LTR18B is also supported by some reports in the literature. it is also expressed in LUAD cells from MRH patients (right Venn diagram of Fig. 8), it does not meet our 448 proposed properties for a biomarker.

449
The consistent down-regulation of AluYg6 and LTR18B in all patients analysed, and the exclusive 450 differential up-regulation of HERVK11D-Int and UCON88 in LUAD and SCLC cells, respectively, 451 strengthen the idea that studies in the search for new biomarkers based on RE profiling merit the effort.

453
We have described an automated workflow using established bioinformatic tools to study expression  Table S1: Name, class and family of differentially expressed REs when normal lung from SCLC is  Table S2: Name, class and family of differentially expressed REs in LUAD together with their 670 logFC OE and their statistical significance as an FDR. These data were used in Fig. 3.    and class are marked with an asterisk (*).