Putative effectors for prognosis in lung adenocarcinoma are ethnic and gender specific

Lung adenocarcinoma possesses distinct patterns of EGFR/KRAS mutations between East Asian and Western, male and female patients. However, beyond the well-known EGFR/KRAS distinction, gender and ethnic specific molecular aberrations and their effects on prognosis remain largely unexplored. Association modules capture the dependency of an effector molecular aberration and target gene expressions. We established association modules from the copy number variation (CNV), DNA methylation and mRNA expression data of a Taiwanese female cohort. The inferred modules were validated in four external datasets of East Asian and Caucasian patients by examining the coherence of the target gene expressions and their associations with prognostic outcomes. Modules 1 (cis-acting effects with chromosome 7 CNV) and 3 (DNA methylations of UBIAD1 and VAV1) possessed significantly negative associations with survival times among two East Asian patient cohorts. Module 2 (cis-acting effects with chromosome 18 CNV) possessed significantly negative associations with survival times among the East Asian female subpopulation alone. By examining the genomic locations and functions of the target genes, we identified several putative effectors of the two cis-acting CNV modules: RAC1, EGFR, CDK5 and RALBP1. Furthermore, module 3 targets were enriched with genes involved in cell proliferation and division and hence were consistent with the negative associations with survival times. We demonstrated that association modules in lung adenocarcinoma with significant links of prognostic outcomes were ethnic and/or gender specific. This discovery has profound implications in diagnosis and treatment of lung adenocarcinoma and echoes the fundamental principles of the personalized medicine paradigm.


Data normalization
The Taiwanese training set consists of 3 types of data: copy number variations, DNA methylations and mRNA expressions. To incorporate the heterogeneous sources of data in the same integrated model, we had to convert them into the same format with compatible scales. We treated each feature as a discrete random variable. For such features with numerical values, simple quantization with hard thresholds results in substantial information loss. To avoid this problem we adopted a probabilistic quantization procedure to convert a measurement outcome into a probabilistic vector over the discrete states. This procedure hypothesizes that the underlying molecular states are discrete with uncertainty and/or different mixture coefficients in a population of cells, and the uncertainties or mixture coefficients are reflected in the magnitudes of measurement outcomes. Therefore, information pertaining to magnitudes of measurement outcomes are preserved in the probability vectors of the discrete states.
The data of mRNA and DNA methylations constitute continuous measurements from microarrays. We converted these continuous values into probability vectors of trinary states -up, down regulation or no change. For each dataset, denote z ij the observed value of probe i on sample j, and x ij its discrete hidden state. The following procedures convert each z ij into a probability vector (P(x ij =−1), P(x ij =0), P(x ij =1)).
2. Rank-transform z ij into the cumulative distribution function (CDF) value y ij ∈ 30, 14 . For the datasets reporting relative values (Agilent arrays), rank transformation is applied to the entire matrix. For the datasets reporting absolute values (Affymetrix arrays), each feature is rank-transformed separately. This is because we want to capture the relative variation of features across different samples instead of comparing the values of distinct features. DNA methylation data are scaled in 30, 14 thus need not be rank-transformed.
3. Convert y ij into a probability vector (P(x ij =−1), P(x ij =0), P(x ij =1)) with a specific quantization function. Intuitively, a data point with a low CDF value is more likely to be down-regulated (P(x ij =−1) is high), and a data point with a high CDF value is more likely to be up-regulated (P(x ij = 1) is high). This intuition is translated into the requirements that a quantization function is monotonic and maps y ij =0 into P(x ij =−1) =1 and y ij =1 into P(x ij =1) =1. We chose polynomial functions f γ and f γ as the quantization curves.
4. Quantization results are sensitive to γ values. To reduce the bias induced by a specific quantization function we assigned weights (prior) on fγ functions and integrate the transformed values over a family of quantization functions. In this work we chose an exponential prior e −(γ−1) and restricted γ ∈ 31, ∞ ) . The averaged quantization outputs are:

Oncotarget, Supplementary Materials 2015
The exponential prior e −(γ−1) was chosen for the following reasons. First, large γ values are penalized because they assign the probability mass to x ij =0 for most y ij values. An exponential prior naturally penalizes large γ values. Second, it ensures the existence of the integrals in equation 2. Third, the requirements that P(x ij =1|y ij =1) =1 and P(x ij =−1|y=0) = 1 are satisfied. Fourth, the most justified single value of γ is γ = log 3/ log 2 because it assigns an equal probability (1/3) for each state when the input CDF y ij =0.5. The marginal quantization curves indeed resemble the quantization curves generated by γ.
These procedures converted the measurement value of one probe in one sample into a probability vector. Genes are the elementary units in our analysis. The expression and DNA methylation of a gene are often measured by multiple probes. The probe-level data were first ranktransformed into CDF values. We generated gene-level data by merging the probe data corresponding to the same genes. The entry of each gene in each sample was the average over all probe values of the corresponding gene and sample.
To examine gender specific effects, we required the expression data of the 4 validation sets to be split by gender. The data were first rank-transformed into CDF values, and subsequently split into separate male and female subdata.

CNV data processing
The aforementioned procedures of data normalization need to be modified on CNV data as the underlying assumption mismatches the empirical characteristics of CNV measurements. Equation 2 gives the probability of each trinary state x given the CDF value of its measurement outcome y. For each specific probe (gene), y is uniformly distributed in 30, 14 . Thus the total probability of encountering the up-regulation state (x=1) is Similarly, P(x=−1) ≈ 0.3663 and P(x= 0) ≈ 0.2674 . This distribution stipulates that the fractions of amplification/deletion/no change entries in the CNV data are approximately comparable. In reality, the vast majority of the CNV data entries do not deviate from normal values (0s for two-channel microarrays). Only about 1% of the data points have log2 ratios ≥ 1 or ≤ −1. Therefore, the quantization function in equation 2 severely distorts the global characteristic of the CGH array data.
To reduce this distortion we introduced an extra parameter β to the quantization function: The new P(x ij =1|y ij ) and P(x ij =−1|y ij ) shrink with increasing β values. We adjusted β to make the global distribution (P(x=1), P(x=0), P(x=−1)) obtained from equation 4 close to the empirical distribution. We counted the fractions of entries exceeding log 2 1 3/22 =0.585 (f 1 ) and below log 2 1 1/22 =−1 (f 2 ). For simplicity we set the global empirical probability of amplification and deletion to be equal: We then found the parameter value β^ that fit the following equality: The estimated parameter β^ was substituted in equation 4 in probabilistic quantization.
Unlike mRNA expressions or DNA methylations, the elementary subunits of CNV data are segments bounded by amplification and deletion events instead of genes. In an earlier version of the module-finding algorithm [1], we partitioned each chromosome into smaller segments according to their CNV data. In this particular dataset, however, we observed that CNV data were mostly coherent within each chromosomal arm. To simplify computation, a decision was therefore made to use chromosomal arms as natural segmentation boundaries. We aggregated the measurements of probes on each chromosomal arm and used the median over the corresponding probe values as the proxy CNV value of a chromosome segment.

Assessment of purity and ploidy of the tumors
We employed the ABSOLUTE algorithm to estimate both ploidy and purity of samples in the training set. The inferred results are reported in S5 Table. The inferred purity indicates most samples retain a significant fraction of cancer DNA. Therefore, the association modules reflect information in cancer genomes. Inferred ploidy indicates many cancer samples undergo copy number amplifications in most spots in the genomes. The fluctuations of CNVs provide a necessary condition to build associations between CNV and mRNA data.

Clustering DNA methylation, and mRNA expression data
Correlated effector aberrations tend to fit the same set of target genes. It is therefore more efficient to cluster the effector aberrations and build association modules with the proxy data derived from clustered effector aberrations. We propose a graph-based method to cluster DNA methylation and mRNA expression data. In brief, by setting a threshold on correlation coefficients we could build a graph with genes as nodes and edges connecting correlated genes. By gradually lowering the threshold we extracted cliques -maximally and completely connected components -with increasing sizes. Highly connected cliques were then merged to form clusters.

Pairwise associations
For each pair of molecular aberration (segment CNV, DNA methylation) and mRNA gene expression, we evaluated their Pearson correlation coefficient, loglikelihood ratio of the logistic regression model, and p-value obtained from both χ 2 approximation of log-likelihood ratio and permutation tests. Denote x a candidate effector and y a target gene expression. We express the conditional probability P(y|x) with a logistic regression model: where either f(x) =x (the effector activates the target) or f(x) =−x (the effector represses the target). λ is a nonnegative parameter. Z 1 x2 =1 + eλ + e −λ is the partition function that normalizes the conditional probabilities.
Denote D ≡ (x 1 , y 1 ), · · · , (x m , y m ) the observed instantiations of x and y over m samples. The loglikelihood function of the observed data D is where C x and C y stand for configurations of x and y, and N(C x ), N(C x , C y ) the fractional counts for configurations C x and (C x , C y ) over all samples. Using probabilistic quantization each entry x ij was converted into (P(x ij =−1), P(x ij =0), P(x ij =1)) , where i and j are gene and sample indices. The fractional count for a state configuration (C x , C y ) is The maximum likelihood parameter λ^ was numerically estimated using the Newton-Raphson method. Denote F(λ) ≡ L ' (D;λ) and G(λ) ≡ F ' (λ). Set λ 0 =1 as the initial value of λ. Iteratively execute the following updates until either λ t converges or λ t ≤ 0: Given the observed data D and two nested models M 0 , M 1 ⊇ M 0 , we incurred a standard hypothesis testing procedure to calculate the log-likelihood ratio and χ 2 p-value: where χ 2 1 is the χ 2 CDF function with one degree of freedom.
The χ 2 p-values tend to over-estimate the significance of the testing results. Thus we also evaluated the p-values of permutation tests and reported the supremum of χ 2 and permutation p-values. Permutation p-values were calculated by the following procedures: 1. Quantize the aberration and expression CDF values into binary or trinary states. For trinary variables we choose 0.4 and 0.6 as thresholds. The reported p-value for each pairwise association is the maximum of the p-values derived from χ 2 approximation and permutations.

Building association models for individual genes
Not all types of molecular aberrations are equally likely to drive gene expressions. Some candidate effectors provide direct explanations for gene expressions without requiring many mechanistic assumptions underlying gene regulation (e.g., cis-acting effects with CNVs). Others have a greater number of features thus are likely to introduce spurious associations. We proposed a layered modeling framework to prioritize molecular aberrations differently and incrementally incorporating candidate effectors to the model according to their priorities. Molecular aberrations are categorized with the following priorities.
1. Level 1: segment CNVs on the same chromosomal arm as the target (CNV cis-acting effects) and DNA methylations on the target. 2. Level 2: positive associations with segment CNVs on distinct chromosomes from the target (CNV transacting effects) and negative associations with DNA methylations of non-target genes. 3. Level 3: negative associations with CNV trans-acting effects when the target is not explained by level 1-2 models.
In addition, we placed an extra requirement for CNV trans-acting effects. To build an association between a segment CNV and a target gene on another chromosome, the segment must accommodate at least one intermediate regulator that likely modulates the expression of the target. The regulator gene expressions are positively associated with the segment CNVs, and the functional direction of the association between the regulator and the target gene expressions coincides with that of the association between the segment CNV and the target gene expression. 727 candidate regulators were pulled from three sources: human transcription factors from TRANSFAC [2], human transcription factors and signaling proteins from FanTom [3], and genes pertaining to cancer from the OMIM database [4].
Pairwise associations were filtered with different thresholds of log-likelihood ratios and correlation coefficients according to the levels and types of effectors. The thresholds applied in the analysis are reported in S6 Table. We again applied logistic regressions to build association models with multiple effectors. The procedures of calculating the maximum log-likelihood of a model are analogous to those for pairwise associations (equations 7-10).
The log-likelihood ratio, χ 2 p-value and permutation p-value between two nested models M 0 , M 1 ⊇ M 0 were calculated analogous to equation 11. Here the degree of freedom d of the χ 2 function χ 2 d is the number of additional features in M 1 compared to M 0 .
The hypothesis testing procedures were applied in a model selection algorithm as follows. Suppose M 0 is a null model containing no effectors, M 1 is an association model at the current level. Consider adding a candidate effector x to the current association model M 1 . We define the following model selection procedures as selection (M 1 , x, θ, θ θ and θ' stand for the p-value thresholds for testing a single-effector model (M 2 ) against an independent model (M 0 ) and testing a multi-effector model (M 12 ) against a null model by removing one effector (M 1 or M 2 ). The test is significant if the p-value is not greater than the threshold.
To build association models to explain a gene expression data y, we employed the layered modeling framework to incrementally add effector molecular aberrations to the existing models with the following procedures. Initially set the association model M as empty.
1. At level 1, implement the following steps. To prevent including all these correlated DNA methylation effectors in the model selection process, we solicited one representative from each cluster of DNA methylation data. The representative DNA methylation profile possesses the highest log-likelihood ratio among the members in the same cluster. When incorporating higher-layered associations, only the representative DNA methylations will be included in the association model of the current level. Non-representative DNA methylations that pass the model selection procedures will be reported as effectors but not used in evaluating log-likelihood ratios. Notice model selection still operates at the DNA methylation profiles of single genes, and nonrepresentative DNA methylations are still included as effectors. Clustering results are used only to reduce model complexity in model selection. 3. At level 3, implement the following steps.
(a) Filter out candidate effectors according to the thresholds of log-likelihood ratios and correlation coefficients.

Assembling association models of individual genes to modules
Molecular aberrations on effector genes or genomic components often mis-regulate many downstream targets. We define an association module as a tuple consisting of three components: (1) observed effector molecular aberrations, (2) target genes, (3) regulators that mediate the effects between effectors and targets. From the association models of individual genes we incurred the following procedures to construct modules.
1. Group gene expressions by each candidate effector molecular aberration. Assign a gene to the targets of a effector molecular aberration if the latter appears in the association model(s) of the former.

VALIDATION OF ASSOCIATION MODULES FDR evaluation of pairwise associations
False discovery rates (FDRs) quantify the expected fraction of false positives among the positive calls from a multiple hypothesis testing problem [5]. To simplify the testing procedures, we considered a positive call as a significant pairwise association between a effector and a target. The thresholds of log-likelihood ratios and correlation coefficients for determining positive calls are reported in S6 Table. Significant pairwise associations arising from the permuted data are false positives according to the null model. Lacking specific information about the distribution of noise, we created a simple null model by randomly permuting the effector aberration and target expression 100 times. Such permutation tests preserve the marginal distributions of individual components thus are widely used in evaluating the significance of detected signals. The empirical distribution of the number of significant pairwise associations arising from permuted data provides a reasonable measure for false positive numbers. A standard formula of FDR is E {# false positives according to the null model}/# positive calls from the data [6] The expected number of false positives can be directly calculated using its distribution from the permuted data. FDR rates are reported in S7 Table.

Reproducibility verification of associations in external datasets
We verified the reproducibility of the association modules by checking whether the associations extracted from each module are robust in East Asian and Caucasian external datasets. For each module, we incurred the following tests to verify the reproducibility of associations.
We computed the distributions of correlation coefficients between the target gene expressions in all external datasets. To quantify the strength of coherence, we compared the distribution of correlation coefficients with the background distribution derived from a reference set of genes. A reference set was generated by calculating correlation coefficients of 1,000 randomly selected pairs of genes in the dataset. We evaluated the p-value of the onesided, two-sample Kolmogorov-Smirnov test between the two distributions of correlation coefficients.
Modules containing a large number of target genes are more likely to be found significant using the Kolmogorov-Smirnov test. To account for the effects of module size we generated 1,000 modules of randomly selected genes of the same size as the inferred modules and compared them to the previously generated background distributions. The p-value of the inferred module is ranked amongst the p-values of the randomly generated modules. The rank over the number of randomly generated modules provides an adjusted p-value.

Assessment of the prognostic power of association modules
We selected datasets for survival analysis according to the following criteria: (1) patient survival information was available, and (2) exact survival durations (in terms of days or weeks) instead of broad ranges were labeled. The Taiwan training data were excluded from the module selection due to a shortage of death events. The four external datasets all met these criteria. The prognostic power of biomarkers is typically measured by Cox regression coefficients. In survival analysis, Cox regression coefficients quantify the association of a set of independent variables (e.g., biomarker gene expression levels) with the hazard function of the population.
In the first stage, we treated the mRNA expression of each gene as an independent variable in the Cox regression model and evaluated its regression coefficient. To assess the prognostic power of an association module, we evaluated the distribution of Cox regression coefficients of its targets and compared this distribution with the background distribution of all genes. Both one-sided and two-sided, two-sample Kolmogorov-Smirnov tests were applied to calculate the p-values. Deviations from the background distribution in both directions provide useful prognostic information. Strong positive coefficients indicate negative associations with survival durations (higher expression levels are associated with shorter survival durations), and strong negative coefficients indicate positive associations with survival durations (higher expression levels are associated with longer survival durations).

Oncotarget, Supplementary Materials 2015
Adjusted p-values were calculated once again to account for the effects of module size on the significance of the test statistic. We generated 1,000 modules of randomly selected genes of the same size as the inferred modules. The Cox regression coefficients were calculated for each of the genes in the generated modules and compared to the background distribution previously generated. The p-values of the Kolmogorov-Smirnov statistic for the inferred modules are ranked amongst the p-values from the randomly generated modules. The rank over the number of randomly generated modules provides an adjusted p-value accounting for module size.
In the second stage, we intended to demonstrate that an aggregate index derived from the data of an association module could quantify its prognostic power. We used the median over the expression profiles of target genes as the aggregate biomarker. Patients were divided into low and high expression groups according to whether their aggregate biomarker exceeded the average expression over all the genes in the data. The log-rank p-value was used to measure prognostic power.

Validation of segmentation boundaries
To validate the decision to chromosomal arms as segment partitions we employed the Circular Binary Segmentation (CBS) algorithm [7] to partition the patient data of each chromosome into segments with coherent CNV profiles. We visually inspected the segment boundaries on individual samples to identify candidate boundary partitions across all samples. Any full CNV segments generated from the CBS algorithm that differed from the chromosomal arm segmentation were entered into the module-building algorithm separately and verified to ensure that the results were consistent regardless of segmentation approach.

FDR evaluation of validation tests
As we are testing a large number of modules in the passenger coherence and prognostic power validation tests, we must consider the chance generation of false positive results. For a module to be deemed significant it must pass the passenger coherence and the two stages of the prognostic power tests. We can consider these tests individually and as a collective. To calculate an FDR statistic we randomly assign genes to modules the same sizes as the inferred modules. This process is repeated 200 times and each of the validation tests applied at each repetition. The number of significant modules are counted for each individual test and also for any randomly generated modules that pass both the passenger coherence and prognostic power tests.

Partial least squares to assess the dependency between association modules
It is possible that association modules identify different elements of a larger biological mechanism and therefore they may be interrelated. To investigate possible dependencies between association modules we employ a partial least squares (PLS) analysis [8]. Similar to principal components analysis (PCA), PLS is a dimension reduction method that allows the user to project high dimensional data onto a lower dimensional space, often capturing a high proportion of variance with only a few orthogonal latent variables. Unlike PCA, PLS incorporates information of the response, building latent variables that maximize the covariance between the predictors X and the response Y . A bilinear decomposition of X can be formed such that where T M =t 1 , t 2 , …, t M are the m latent variables (or scores), P M =p 1 , p 2 , …, p M are the weights (or loadings) on the components and E is a residual error term. PLS is particularly beneficial for the current study as we can plot the target genes in each module as points in an object space representation. This allows the current analysis to be extended by considering the modules as a complete set of genes, rather than being limited to just pairwise associations between module members. By considering two modules as X and Y blocks of genes, PLS can build latent variables that maximize covariance (i.e. X T Y) between the two sets of target genes.
We can visualize the associations of the module members by projecting them onto a two dimensional space, labeled a 'correlation circle'. The cosine of the angle between the gene locations in the plot indicates the association between the gene expressions and the distance from the center represents the proportion of variance explained by the first two PLS components. A cumulative R 2 can also be calculated to provide a quantitative measure of the dependency between a pair of association modules.

Co-citation analysis on PubMed and OMIM database
We incurred a search on the PubMed database to find targets from all modules that were co-cited in publications with variations of the key term "lung cancer". The citation counts for each target gene were then ranked, with the top-ranking 5% of genes selected as significant.
Co-citation count is susceptible to false negative driver genes that have been more recently discovered. To relieve this problem we consider the overlap of target genes in significant modules with the NCBI OMIM database [4] of cancer related genes. Whilst this database is not designed specifically for lung adenocarcinoma, overlapped genes can be flagged as potential module effectors.

QIAGEN's Ingenuity pathway analysis
A pathway enrichment can be performed on the target genes of inferred modules in the study. This step is intended primarily to assess trans-acting CNV and methylation modules, as targets are not restricted to a single chromosome. We employed QIAGEN's Ingenuity pathway analysis (IPA) [9] to search for pathways amongst target genes. The software uses a right tailed Fisher's exact test to assess whether any identified overlap of genes between module targets and pathways is statistically significant (not generated by chance). As a large number of pathways are present and tested for in the IPA software, an adjusted p-value is calculated using the Benjamini-Hochberg method.