Differentially expressed microRNAs in lung adenocarcinoma invert effects of copy number aberrations of prognostic genes

In many cancers, significantly down- or upregulated genes are found within chromosomal regions with DNA copy number alteration opposite to the expression changes. Generally, this paradox has been overlooked as noise, but can potentially be a consequence of interference of epigenetic regulatory mechanisms, including microRNA-mediated control of mRNA levels. To explore potential associations between microRNAs and paradoxes in non-small-cell lung cancer (NSCLC) we curated and analyzed lung adenocarcinoma (LUAD) data, comprising gene expressions, copy number aberrations (CNAs) and microRNA expressions. We integrated data from 1,062 tumor samples and 241 normal lung samples, including newly-generated array comparative genomic hybridization (aCGH) data from 63 LUAD samples. We identified 85 “paradoxical” genes whose differential expression consistently contrasted with aberrations of their copy numbers. Paradoxical status of 70 out of 85 genes was validated on sample-wise basis using The Cancer Genome Atlas (TCGA) LUAD data. Of these, 41 genes are prognostic and form a clinically relevant signature, which we validated on three independent datasets. By meta-analysis of results from 9 LUAD microRNA expression studies we identified 24 consistently-deregulated microRNAs. Using TCGA-LUAD data we showed that deregulation of 19 of these microRNAs explains differential expression of the paradoxical genes. Our results show that deregulation of paradoxical genes is crucial in LUAD and their expression pattern is maintained epigenetically, defying gene copy number status.


INTRODUCTION
Integration of array comparative genomic hybridization (aCGH) with mRNA microarray data has revealed significant associations between occurrence of copy number aberrations (CNAs) and differential gene expression in diverse cancers [1][2][3][4][5][6][7]. However, significantly downregulated genes have been often found to reside within chromosomal regions with increased number of copies (gains) and vice versa, creating a paradoxical signal. For example, Phillips et al. reported that 14% of the genes downregulated in prostate cancer reside within regions of DNA copy number gains, and approximately 9% of upregulated ones reside in regions of DNA copy number loss [1]. Usually, this paradox is ignored as a noise, but can potentially be a consequence of interference of other regulatory mechanisms controlling mRNA transcription [8].
In recent years, the cancer research community has investigated how epigenetic regulators, known as microRNAs (miRNAs), form prognostic signatures and affect regulatory pathways that can lead to tumorigenesis. miRNAs are short non-coding RNAs that regulate the translation of mRNA by serving as guide molecules in mRNA silencing, mediated by various associated proteins [9]. Targeting most protein-coding transcripts [10], miRNAs are involved in diverse biological processes, including development and homeostasis [11,12]. Moreover, growing evidence implicates miRNAs as factors associated with major human pathologies, including cancer [13][14][15][16].
In 2012 approximately 13% of all new cancer cases worldwide were cancers of lung (and bronchus), making lung cancer one of the most frequent cancer type (surpassed only by breast cancer in women) [17]. Despite smoking cessation, and advances in detection and treatment, lung cancer remains the main cause of cancerrelated death worldwide for both men and women [18]. With nearly 160,000 deaths annually it kills more people than other common cancers combined, including colon, breast and prostate [18]. The most common type of lung cancer is lung adenocarcinoma (LUAD), comprising approximately 45% of all lung cancer cases [19,20].
In this paper, we integratively analyzed gene expression and CNA data from 12 publicly available LUAD datasets, and new CNA data obtained from 63 LUAD samples profiled at our institution ( Figure 1). By combining and analyzing data from 1,062 tumor tissue samples and 241 normal samples we identified genes whose differential expression consistently was in contrast to aberrations of their copy numbers. Paradoxical status of these genes was then validated on the sample level, using TCGA LUAD gene expression and CNA data. Furthermore, to assess whether the paradoxical expression patterns were caused by epigenetic disruptions in lung tumors, we compiled miRNA expression data from 406 LUAD samples and 321 normal lung samples. Using miRNA:gene associations from mirDIP [21] and the measure of co-expression, we showed that this paradox can be explained by 19 miRNAs consistently deregulated in LUAD.

Identification of the paradoxical genes
First, we examined the frequency and statistical significance of autosomal CNAs across 3 LUAD aCGH datasets, including our new data and two publicly available datasets (see Materials and Methods). We identified multiple chromosomal regions with significantly (p < 0.05, randomization test) high frequency of gains (more than two copies) or losses (less than two copies); frequencies and corresponding p-values are listed in Supplementary Data 1. The most extensive positive aberrations identified occur on the q-arm of chromosomes 1, 7 and 8 as well the p-arm of chromosomes 5 and 7 ( Figure 2A). The most significant copy number losses occurred on the q-arm of chromosomes 6, 9, 13, 15, and 18, along with the p-arm of chromosomes 8 and 9.
To identify genes whose differential expression remained consistent across patient cohorts, we performed integrative analysis of 10 publicly available gene expression datasets, comprising 740 LUAD samples and 241 normal tissue samples. Among 15,323 genes that were subjected to the robust rank analysis, we identified 1,309 genes that were significantly deregulated across the datasets (p < 0.01, robust rank aggregation), where 701 of these were downregulated genes, and 608 are upregulated. Excluding 9 non-protein-coding genes, reduced the numbers to 600 upregulated and 700 downregulated genes (see Supplementary Data 2). Non-protein coding genes involve downregulated C17orf91, and eight upregulated miRNA sequences: MIR7112, MIR6847, MIR7113, MIR671, MIR4647, MIR93, MIR25, MIR4721, all of which are intragenic miRNAs residing within upregulated genes. In subsequent meta-analysis of miRNA expression in LUAD, we found none of these miRNAs to be significantly deregulated.
We identified 132 (14.6%) downregulated genes residing in regions with decreased number of copies and 102 (22%) upregulated ones residing in regions with increased number of copies (p < 2.2E-16, Chi-squared test). Importantly, 63 consistently downregulated and 22 consistently upregulated genes reside on chromosomal regions with opposite direction of aberration -gains and losses, respectively ( Figure 2). Hereafter, we refer to these 85 genes as paradoxical genes (Supplementary Table 1).

Validation of the CNAs and differential expression of the paradoxical genes
While we identified paradoxical genes using data from diverse patient cohorts, we sought to validate our findings in an independent, homogeneous datasets. We used data from TCGA comprising: CNA, mRNA-seq and miRNA-seq LUAD data from 514 LUAD and 57 normal samples. This dataset was selected for validation solely on the basis of the CNA, mRNA and miRNA expression data availability, without considering clinicopathological characteristics of the data. Five of the 85 paradoxical genes could not be evaluated due to the missing CNA or expression data. From the remaining 80, we successfully validated 70 genes (p < 1E-4, randomization test), whose copy aberration status and differential expression confirmed results from the integrative analysis (Supplementary Figure 1, Supplementary Table 1). Further analysis only considers these 70 validated paradoxical genes.
To test whether paradoxical deregulation occurs in the individual samples, we measured frequencies of paradoxical co-occurrence of up-/downregulation (expression z-score >/< +/-1.647) and losses/gains (log2 CNA >/< +/-0.2) of the paradoxical genes across individual TCGA LUAD samples ( Figure 3). We found that frequency of paradoxical deregulation ranges from 9% (NFKBIA) to 74% (NPR1) of samples, median frequency equal to 46% and mean 44%. For all the 70 genes frequency of paradoxical deregulation exceeds the frequency of regular (non-paradoxical) deregulation.

Association between deregulated miRNAs and paradoxical genes
According to mirDIP [21], 46 of the 70 paradoxical genes (65.7%) are targeted by subsets of the 19 deregulated miRNAs (p = 4.8E-3, hypergeometric test; Figure 5). Moreover, for 35 of these 46 targeted paradoxical genes, we found at least one deregulated miRNA that targets the given gene with its expression status contrasting the expression status of the given gene, implying that paradoxical expression of this gene could be explained by the miRNA deregulation.
To further asses the association between expression of individual miRNAs and paradoxical genes we calculated partial correlation between them [22], using copy number status of the paradoxical genes as a controlling variable. We found 369 significantly coexpressed miRNA:gene pairs (27% of all miRNA:gene combinations), 362 (98.1%) of which are explanatory, i.e., there is a positive correlation between miRNAs and genes deregulated in the same direction, or negative correlation between inversely deregulated ones ( Figure 6A). We found 64 paradoxical genes (91.4%) whose correlation with at least one of the 19 validated miRNAs is among the top 5% of the correlations measured between 10E+4 random pairs of genes and miRNAs, and whose paradoxical expression can be explained by deregulation of the upstream miRNA.
As downstream effects of the deregulation of individual miRNAs may be combined, we aimed to evaluate potential associations between expression of individual paradoxical genes and en bloc expression of the deregulated miRNAs. We calculated coefficients of multiple (multivariate) correlation (CMC) between the paradoxical genes and deregulated miRNAs. The value of CMC can be interpreted as the correlation between dependent variable (gene expression) and its best prediction that can be computed linearly from the set of independent variables (expression of miRNAs). We found 23 paradoxical genes (32.9%) with CMCs in the top 5% of the values of the same measure as calculated across 17,745 genes covered by TCGA-LUAD RNA-seq data (p = 1.7E-14, hypergeometric test; Figure 6B).

Clinical significance of the paradoxical genes
Using KMplot [23], we assessed the association of the 70 paradoxical genes with patient diseasefree survival. We found 41 (58.6%) of these genes as significantly associated with survival (FDR < 0.05). We assume that down-regulation of significantly positive genes (HR < 1, FDR < 0.05) as well as the up-regulation of significantly negative ones (HR > 1, FDR < 0.05) worsens the survival prognosis. Under this assumption, with the exception of three genes (ELMO1, DENND3,  Table 3). This implies that the gene expression paradoxes we identified here mostly worsen the prognosis of LUAD patients.
Using TCGA-LUAD RNA-seq and matching clinical data, we constructed a multivariate Cox prognostic model, where expressions of the paradoxical genes served as prognostic variables. The model was validated using three independent publicly available gene expression datasets and associated clinical data: Botling et al. [24], Okayama et al. [25] and Der et al. [26] (Figures 7B-7D). The resulting concordance index, area under ROC curve, hazard ratio between risk groups and associated P-value, demonstrated robust prognostic potential of paradoxical genes signature.

Pathway enrichment analysis of the paradoxical genes
To elucidate biological functions of the paradoxical genes we performed a comprehensive pathway enrichment analysis. Using Pathway Data Integration Portal (pathDIP) [27] we identified 22 pathways significantly enriched by the 70 paradoxical genes (FDR < 0.05, hypergeometric test). A list of all pathways and respective gene memberships is provided in Supplementary Data 4. Interestingly, several of the enriched pathways are related to lipid metabolism and signaling (adipogenesis, LPA receptor mediated events, regulation of lypolysis) are key players in carcinogenesis [28]. Moreover, several enriched guidance molecule pathways (ephrin signaling, semaphorin interactions, integrin, DCC-mediated attractive signaling) are noted as cancer-drug targets [29]. PTEN-dependent cell cycle arrest, apoptosis pathway as well as hemostasis are known to play a role in cancer [30][31][32].

DISCUSSION
Integrating three LUAD aCGH datasets, we identified several chromosomal regions with extensive copy number aberrations. Weir and colleagues [33] reported similar profile of copy number aberrations in 371 LUAD samples, confirming gains on 1q, 5p, 7p, 7q and 8q as well as deletions on 6q, 8p, 9p, 9q, 13q, 18q (Table 1). Lee et al. [34] obtained similar results using Molecular Inversion Probe assays on 12 LUAD samples, confirming gains on 1q, 5p, 7p, 7q and 8q and losses on 6q, 8p, 18q (but not losses on 9p, 9q, 13q and 15q). While the individual aberrations vary greatly among individuals, as even the most frequent aberrations appear only in less than 50% of samples, the overall CNA profile of LUAD is conserved across the patient cohorts.
By integrative analysis of multiple gene expression and copy number datasets, we found significant association between CNA status and differential expression of genes. Similar associations were previously reported in other cancer types [2][3][4][5]7]. However, we also discovered 85 paradoxical genes whose expression was in opposite direction to their CNA. Seventy of these genes were subsequently validated across a homogeneous  LUAD data cohort from TCGA. Paradoxical expression of these genes was validated even on the individual samples, proving that such paradoxical gene expression is a well preserved feature of the molecular profile of LUAD.
Expression of paradoxical genes is associated with miRNAs consistently deregulated across multiple LUAD patient cohorts. Deregulation of these miRNAs inverts the effects of the genomic aberrations occurring in tumors. This is demonstrated by significant overlap between paradoxical genes and targets of these miRNAs, as well as by the two methods we applied here to measure correlation between the paradoxical genes and given miRNAs. Although correlation does not imply causality, the results of our analysis strongly suggest that differential expression of paradoxical genes is caused by deregulation of miRNAs.
We tested prognostic relevance of the paradoxical genes and found 41 (58.6%) paradoxical genes significantly associated with patient survival. Paradoxical expression of 39 genes has negative impact on prognosis. We also developed paradoxical gene signature that was validated on a three independent validation datasets [24][25][26].
Genomic instability is one of the cancer hallmarks [70] that results in CNAs and differential expression of various genes. However, paradoxical genes with expression patterns opposite to gene dosage status often are dismissed as noise and overlooked in genome-wide cancer gene discovery efforts [8]. Prognostic significance of the paradoxical genes suggests that their deregulation in LUAD is crucial for cancer progression and is maintained by the cancer cells despite the CNAs affecting expression of these genes in an inverse manner. We found that deregulation of the paradoxical genes is maintained at the epigenetic level by a group of deregulated miRNAs. These

Copy number aberrations: sample collection, preparation and data processing
The samples used in this study were from the banked resected tumors collected in the BR.10 adjuvant chemotherapy trial [71]. The study has received approval by the Institutional Research Ethics Board. A total of 142 formalin-fixed paraffin embedded (FFPE) and 16 snapfrozen samples were included. Haematoxylin and eosin stained slides from FFPE blocks first were reviewed by a lung pathologist to locate tumor rich areas (tumor cellularity > 60%), and then the block was cored at this area. Cored specimens were de-paraffinized by incubation in xylene overnight, and then washed with ethanol and air dried. Qiagen ATL buffer (QIAamp © DNA extraction kit cat. 51306, Germantown, MD) were added. Specimens were digested by proteinase K at 55°C overnight at 450rpm (Eppendorf © Thermomixer R, Fisher Scientific). DNA isolation followed the manufacturer's protocol (Qiagen, Cat. 51306, Germantown, MD). Samples of isolated genomic DNA were quantified by Nanodrop 1000 (Thermo Scientific, Wilmington, DE) and electrophoresed in 0.8% agarose gel to visualize DNA size distribution. Severely degraded samples (80% of DNA fragments with size < 20bp) were excluded. Eight out of 142 FFPE and none of the snap-frozen samples were excluded. The final cohort contains 134 FFPE and 16 snap-frozen samples.
Test and reference DNA were labeled using Cy3 and Cy5 dCTPs respectively; 200 ng of genomic DNA was labeled using the BioPrime DNA labeling system (Invitrogen). Prior to hybridization, test and reference labeled DNA were combined and purified using a ProbeQuant Sephadex G-50 Column (Amersham, GE Healthcare Life Sciences, Chicago, IL) to remove unincorporated nucleotides. Then 100 μg of Human Cot-1 DNA (Invitrogen) was added to the labeled sample prior to precipitation with 0.1 volume 3M sodium acetate and 2.5 volumes of ethanol. The DNA pellet was resuspended in 20 μl DIG Easy hybridization solution (Roche, Indianapolis, IN), 2.5 μl (20 μg/μl) sheared herring sperm DNA and 2.5 μl (100μg/μl) yeast tRNA (Calbiochem, San Diego, CA). DNA was denatured at 85°C for 10 minutes and repetitive sequences were blocked at 37°C for one hour prior to hybridization.
Each array was then rinsed 5 times in a clean slide box containing 0.1 x SSC with agitation. Slides were then dried with (oil free) nitrogen air stream and stored in the dark until imaging.
Array image capture and data normalization were performed as previously described [72]. Briefly, posthybridization arrays were scanned using a CCD-based imaging system (Virtek ChipReader), and quantitated using Soft-Worx Tracker spot analysis software (Applied Precision, Issaquah, WA).
Data were log2 transformed, and replicate clones having standard deviations > 0.075 or signal-to-noise ratios in each dye channel of < 3 were filtered out. A multi-step normalization was then performed to control for biases caused by the array (e.g., spatial biases or differences in background signal), the dyes used for labeling, or the DNA sample quality [73,74]. The amount of "copycat'" correction required for each sample was plotted in a histogram of all samples; those that required too much correction and did not lie within a normal distribution were deemed to be poor quality DNA, and were eliminated from analysis. By these criteria, 35 samples were eliminated, leaving 115 samples (including 63 LUAD samples used here) from 113 patients for further analysis. Data from all 115 samples are publicly accessible through: http://ophid.utoronto.ca/aCGH/.

Analysis of the copy number aberrations data
In addition to our newly-produced CNA data, we analyzed two publicly available aCGH datasets acquired from LUAD tumor samples (see Table 3). Each of the public datasets was first normalized, segmented and additionally underwent post-segmentation normalization using methods provided by Bioconductor package CGHcall (v2.22.0) [75]. All three datasets then underwent a "calling" process using the CGHcall method from the same package, converting the continuous log-ratios on each probe, to one of the three discrete values (calls) corresponding to: (i) decreased number of copies (loss), (ii) normal copy and (iii) increased number of copies (gain) [75].
As the individual datasets come with different probesets, obtained copy numbers calls correspond to different chromosomal segments and cannot be compared directly. We then integrated results acquired from individual datasets by assigning a set of chromosomal positioning "anchors" that comprised the starts and ends of chromosomal locations of all the probes in the four datasets, as described by Guo et al. [76]. Then for each anchor, if the anchor was within the chromosomal location of the probe from any of the datasets, acquired vector of states corresponding to the probe were assigned to this anchor. Conversely, if an anchor was outside of any of the probes of the given dataset, a vector of missing values was created and assigned to the anchor. Anchors with missing values from more than one datasets were removed. Number of losses and gains were calculated across the anchors and their statistical significance was evaluated by p-values, calculated by comparing actual gains/losses counts to those obtained from 10 6 permutations.

Analysis of gene expression datasets
We analyzed 10 publicly available gene expression datasets (Table 1) which satisfied the following criteria: (i) were originally from studies on tissue samples from surgically resected human LUAD tumors, (ii) contained at least one sample of noncancerous normal tissue for comparison, and (iii) were produced by using Affymetrix platforms to enable uniform processing and analysis of all the datasets. We first normalized and summarized each dataset by Gene Chip Robust Multiarray Averaging (gcrma, v2.38.0) [77]. For each individual dataset, we then evaluated differential expression of the genes using Bioconductor package limma (v3.32.7) [78]. Based on the expression fold change, genes were classified as either up-or downregulated, and then ranked according to statistical significance, which was evaluated by FDRadjusted p-value. Analyzing 10 datasets resulted in 10 rankings for upregulated genes and 10 for downregulated ones. To identify consistently deregulated genes, obtained rankings were subjected to robust rank aggregation analysis implemented in R package RobustRankAggreg (v1.1) [79]. This analysis detects genes that are ranked consistently better than expected under the null hypothesis of uncorrelated inputs, and assigns a p-value as a significance score for each gene. The stability of the resulting significance scores was assessed by the leave-one-out correction, in which the same analysis was repeated 10 times, each time excluding one of the rankings. Acquired p-values from each round were averaged into a corrected p-value. Genes whose significance score was smaller than chosen threshold (corrected p < 0.01) were further considered as consistently significantly deregulated genes.

Meta-analysis of the miRNA expression
Compared to gene expression studies, fewer miRNA expression profiles from LUAD are available, and various platforms are used, often including custom arrays. Therefore, to provide analysis of miRNA expression in LUAD, instead of acquiring and processing expression data, we summarized reported results of 9 published miRNA expression studies (Table 1). Full text and (if applicable) supplementary data of each of the studies were carefully examined, and miRNAs with significantly altered expression were selected for further analysis. miRNA names were standardized according to miRBase (release 21) [80]. All miRNAs were classified as either up-or downregulated and ranked according to their reported statistical significance. If this was not reported, expression fold change was used instead. Examining 9 studies we obtained 9 rankings for upregulated miRNAs and 9 for downregulated ones. Analogously to gene expression analysis described in the previous section, obtained miRNA rankings subsequently were subjected to the robust rank aggregation analysis and leave-oneout correction of the obtained p-values. miRNAs whose significance score was smaller than a chosen threshold (corrected p < 0.05), comprised the resulting list of consistently significantly deregulated miRNAs.

Acquiring chromosomal locations and copy number status of the deregulated genes
Using Bioconductor package biomaRt (v2.22) [81] we determined the chromosomal locations of the deregulated genes from the Ensembl (v75, Feb. 2014) database and compared these locations with chromosomal coordinates of the aberrant regions. Deregulated genes whose chromosomal locations overlapped with aberrant regions were counted and statistical significance of the association between the aberrations and differential gene expression was then evaluated using Chi-square test.

Identification of the miRNA-target pairs
We used microRNA Data Integration Portal, v2.3.2.0 (mirDIP; http://ophid.utoronto.ca/mirDIP) [21] to acquire data on human miRNAs and their respective targets. mirDIP integrates data from 14 miRNA resources and supports a search for miRNA-target pairs under user-defined filters, including a number of independent confirmations of given pairs, confidence criteria, etc. We restricted our search to only miRNA-target pairs that fell among the top third of the most confident predictions from at least two different sources. miRNA names were standardized as described above, and symbols of their gene targets were standardized by HGNC symbol checker (http://www.genenames.org, version from September 2015). To assess the significance of overlap between targets of deregulated miRNAs and paradoxical genes, we performed hypergeometric testing, using 15,323 genes that were subjected to robust rank analysis as a population and 7,836 of these genes that, according to mirDIP are targeted by at least one of the deregulated miRNAs as a number of "successes" in the population.

Calculation of the miRNA:target partial correlation and multiple correlation coefficients
Partial correlation coefficients between gene and miRNA expressions were calculated using R package ppcor v1.1, using copy number of the given gene as a third -controlling variable. Statistical significance of the obtained values was calculated by two-sided comparison www.impactjournals.com/oncotarget with distribution of the same measure obtained across 10 4 random miRNA:gene pairs. To distinguish whether correlations between miRNA and given genes may also imply causal association, we compared the sign of the correlation and copy number status of the miRNA and gene. If the miRNA and gene were deregulated in the same direction, mutual correlation must be positive to indicate a causal association. If the miRNA and genes were inversely deregulated, mutual correlation must be negative to indicate causal association.
Multiple correlation coefficients between gene and en bloc miRNA expression C was calculated as follows: where c denotes a vector of Pearson coefficients of correlation between a given genes and miRNA expressions, R denotes a matrix of Pearson coefficients of correlation between miRNA expressions.

Evaluation of the prognostic significance of the paradoxical genes
Prognostic properties of the individual genes were evaluated by KMplot (http://kmplot.com/analysis/) [23], version 2015, using only LUAD patient data and corresponding disease-free survival censored at 10 years. If multiple probe sets were mapped to the same gene, we used only JetSet probes mapping to a given gene. Obtained hazard ratios (HR) and associated p-values were then summarized and multiple testing adjustment of the p-values was subsequently computed using false discovery rate (FDR) method.
To evaluate the multivariate prognostic potential of the paradoxical genes we developed a Cox proportional hazards model, where expressions of 70 validated paradoxical genes served as covariates. The model was derived using R package glmnet [82] (v2.0.2), applying LASSO (L1) regularization to prevent over-fitting. TCGA-LUAD RNA-seq data were standardized by converting to z-scores and along with the corresponding clinical data were used as "training data". The resulting model was then validated on three independent datasets, and its predictive performance was first evaluated by a concordance index (function survConcordance from R package survival [83], v2.38.3), and an area under receiver operating characteristics curve (AUC), measured at the fifth year after initial time point (function AUC.cd from the R survAUC package, v 1.0.5). Patients were then separated into two groups based on the predicted risk score, using its 40 th percentile as a threshold. This threshold was selected based on ROC analysis of the model using training data. Validated HR between these two groups, as well as associated statistical significance (log-rank test) were calculated (function survdiff from the survival package) and Kaplan-Meier survival curves of both groups were plotted (for more details see [84]).

Pathway enrichment analysis
Using Pathway Data Integration Portal v2.5.1.2 (http://ophid.utoronto.ca/pathDIP), we performed comprehensive pathway enrichment analysis across 20 major pathway databases [27]. We considered literature curated gene:pathway memberships as well as those predicted according to experimentally detected proteinprotein interactions (including interactions experimentally detected between orthologues plus FpClass interactions with minimum confidence level for predicted associations equal 0.95; for more details see pathDIP documentation).

Author contributions
TT -conception and design of the work, data collection, data analysis and interpretation, data visualization, drafting and critical revision of the article, final approval of the version to be published; CPconception and design of the work, critical revision of the article; VR -conception and design of the work; CZ -data collection, data analysis and interpretation, critical revision of the article; KC -data collection, critical revision of the article; LP -data collection, critical revision of the article; EV -data collection, critical revision of the article; SV -data analysis and interpretation, critical revision of the article; FS -project management; critical revision of the article; MT -project management; critical revision of the article; WL -project management; critical revision of the article; IJ -conception and design of the work, data visualization, drafting and critical revision of the article, project management, final approval of the version to be published.

ACKNOWLEDGMENTS
SV would like to acknowledge support from travel bursary covering visit to Jurisica Lab from Faculty of Mathematics, Physics and Informatics, Comenius University, Bratislava. The authors would like to thank Richard Lu, Wing Xie, Adrian M. Teisanu and Dan Strumpf for the original development of mirDIP and CDIP portals.

CONFLICTS OF INTEREST
The authors declare no conflicts of interest.