Investigating Multi-cancer Biomarkers and Their Cross-predictability in the Expression Profiles of Multiple Cancer Types.

Microarray technology has been widely applied to the analysis of many malignancies; however, integrative analyses across multiple studies are rarely performed. In this study we performed a meta-analysis on the expression profiles of four published studies analyzing organ donor tissues, benign tissues adjacent to tumor, and tumor tissues from liver, prostate, lung and bladder samples. We identified 99 distinct multi-cancer biomarkers in the comparison of all three tissue types in liver and prostate, and 44 in the comparison of normal versus tumor in liver, prostate and lung. The bladder samples appeared to have a different list of biomarkers from the other three cancer types. The identified multi-cancer biomarkers achieved high accuracy, similar to that obtained using the whole genome, in within-cancer-type prediction. They also outperformed the whole-genome model in inter-cancer-type prediction. To test the validity of the multi-cancer biomarkers, 23 independent prostate cancer samples were evaluated, and 96% accuracy was achieved in inter-study prediction from each of the original prostate, liver and lung cancer data sets. The results suggest that the compact lists of multi-cancer biomarkers are important in cancer development and represent common signatures of malignancies across multiple cancer types. Pathway analysis revealed functional categories important in tumorigenesis.


Introduction
Human malignancies remain one of the leading causes of mortality in the United States. Uncontrolled growth, reduced ability to undergo apoptosis and the ability to metastasize are some of the important features of malignancies, regardless of the tissue of origin. There are multiple mechanisms underlying the phenotype of cancer. Alterations of cell growth and cell death signaling pathways, due to mutation and inactivation of tumor suppressor genes and/or amplification and activation of proto-oncogenes, have been thought to be the primary causes of carcinogenesis. 1 Abnormalities of the same signaling pathways can be found in multiple types of human cancers, while a single tumor may contain multiple abnormalities in signaling. Identifying the abnormalities shared among multiple types of tumors may shed light on some key alterations of carcinogenesis.
Prostate cancer is second only to skin cancer as the most commonly diagnosed malignancy in American men: at current rates of diagnosis, one man in six will be diagnosed with the disease during his lifetime. 2 Even though nutritional and environmental etiologies have been implicated in prostate cancer development, such a link has yet to be firmly established in the general population. Some studies suggested that up to 80% of men older than 80 were found to have pathologically recognizable prostate cancer, while men younger than 40 rarely developed the disease. This argues against any single specific etiology for prostate cancer other than aging. Histologically, prostate cancer cells closely interact with their neighboring stromal cells to form distinctive architectural patterns that make up the basis of Gleason's grading. 3 The clinical courses of most prostate cancers are long, and some are life-threatening. Hepatocellular carcinoma, on the other hand, is quite the opposite. It is not age related, and is tightly linked to cancer etiologies such as alcohol, hepatitis B or C virus or certain toxins. Hepatocellular carcinoma is distinctive in its well-confined nodular architecture. The clinical courses of most hepatocellular carcinomas are short and the fatality is high. Most lung cancers, with the exception of small cell carcinoma, are also associated with distinctive etiologies, such as smoking or chronic exposure to certain types of carcinogens. Urothelial carcinoma of the urinary bladder, however, is primarily idiopathic or viral related. Since these four types of cancer are so far apart in etiology, morphology and clinical course, any common ground between these tumors could be interpreted as a likely common pathway of carcinogenesis.
In the literature, microarray technology has been widely applied to the analysis of many malignancies, including the four cancer types mentioned above. However, meta-analysis to integrate multiple studies has rarely been investigated. Segel et al. 4 proposed a systematic approach to incorporate 1,975 arrays in 22 tumor types and constructed a large gene module map. The resulting module map was, however, too complex to follow up, and the modules were based on 2,849 known biologically meaningful gene sets rather than on learning new sets of multi-cancer biomarkers. The gene matching across heterogeneous array types may also have degraded the accuracy of the analysis. In this report, we performed a meta-analysis on 455 arrays collected from four microarray studies on the Affymetrix U95Av2 platform: 94 samples of liver tissue 5 (43 liver cancer, 30 hepatic tissues adjacent to liver cancer, 21 normal liver from organ donors), 148 samples of prostate tissue 6 (66 prostate cancer, 59 prostate tissues adjacent to prostate cancer and 23 organ donors), 151 samples of lung tissue 7 (134 tumors and 17 normal lung tissues) and 62 urinary bladder tissues 8 (5 normal and 57 tumors). The use of a common array platform avoided the problem of incorrect gene matching and gene annotation, a common cause of degraded performance in microarray meta-analysis. 9 We performed two batches of analyses. In batch I, all three tissue types in liver and prostate were analyzed using an analysis of variance (ANOVA) model. In batch II, normal and tumor tissues in all four cancer types were included and a t-test was used to identify multi-cancer biomarkers (see Table 1 for data description). The identified biomarkers were found to have high predictability in both within-cancer-type prediction (i.e. cross-validation within a single cancer type) and inter-cancer-type prediction (i.e. a prediction model trained in one cancer type and used to predict another cancer type) via leave-one-out cross validation.
Further pathway enrichment analysis identified statistically significant functional categories of the biomarkers. Validation of the 47 batch II multi-cancer biomarkers on 23 independent prostate tissues yielded 96% accuracy in inter-study prediction from each of the original prostate, liver and lung cancer data sets, showing the robustness of the multi-cancer biomarkers and their implications for common carcinogenesis across multiple cancer types.

Data and preprocessing
We collected four published microarray data sets [5][6][7][8] to perform a meta-analysis on prostate, liver, lung and bladder samples. A total of 455 U95Av2 arrays were analyzed (94 liver, 148 prostate, 151 lung and 62 bladder tissues), each covering 12,625 genes and EST sequences. The common array platform eliminated technical difficulties including gene matching and inter-platform discrepancies. In the liver and prostate data sets, three types of samples were collected: organ donor (N), normal tissues adjacent to tumor (A) and tumor tissues (T). In the lung and bladder data sets, only organ donor and tumor tissues were available. We analyzed the data in two batches. In the first batch, both the liver and prostate data sets with all three tissue types were included. The expression patterns across the three types of samples were the major targets for investigation. In the second batch, data of all four organ types were included and only normal and tumor samples were compared. For details see Table 1.
The raw data (CEL files) were preprocessed in each cancer type separately using dChip software for array quality assessment, normalization, expression intensity extraction and log-transformation (base 2). Genes of low information content were filtered in each data set separately, and the union gene set of the four data sets was retrieved for further analysis. Specifically, in each data set, the top 50% of genes with the largest average intensities were first selected. Among them, the top 50% of genes with the largest standard deviations were further identified, resulting in 25% of the genes (3,156 genes) being selected in each data set. The union of these most informative 25% of genes in the four data sets (a total of 5,917 genes) was used for downstream analysis. The expression intensities in each sample column of each data set were standardized to have zero mean and unit variance so that data sets of different cancer types are comparable.
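The filtering and standardization steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names and the dictionary layout (gene name mapped to a list of log2 intensities) are hypothetical, ties are broken arbitrarily, and the sample standard deviation (n - 1 denominator) is assumed because the text does not say which estimator was used. With floor division, 12,625 genes give 6,312 after the first cut and 3,156 after the second, matching the count quoted above.

```python
import statistics

def informative_genes(expr):
    """expr: dict mapping gene -> list of log2 intensities in ONE data set.
    Keep the top 50% of genes by mean intensity, then the top 50% of those
    by standard deviation (about 25% of the genes overall)."""
    by_mean = sorted(expr, key=lambda g: statistics.mean(expr[g]), reverse=True)
    top_mean = by_mean[: len(by_mean) // 2]
    by_sd = sorted(top_mean, key=lambda g: statistics.stdev(expr[g]), reverse=True)
    return set(by_sd[: len(by_sd) // 2])

def union_gene_set(datasets):
    """Union of the informative genes selected separately in each data set."""
    selected = set()
    for expr in datasets:
        selected |= informative_genes(expr)
    return selected

def standardize_sample(values):
    """Scale one sample column to zero mean and unit (sample) variance."""
    m = statistics.mean(values)
    s = statistics.stdev(values)  # assumption: n - 1 denominator
    return [(v - m) / s for v in values]

# Toy data with hypothetical gene names, three samples per data set.
liver = {"g1": [10, 12, 11], "g2": [9, 9.1, 9.2], "g3": [2, 3, 2], "g4": [1, 1, 1.2]}
prostate = {"g1": [1, 1, 1.1], "g2": [8, 12, 10], "g3": [9, 9.5, 9], "g4": [2, 2, 2.1]}
genes = union_gene_set([liver, prostate])
```

In the toy data each data set contributes one gene (g1 from liver, g2 from prostate), so the union contains both even though neither gene passes the filter in the other data set.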

Biomarker selection by ANOVA and t-test
In batch I analysis, an ANOVA model was fitted for the organ donor (N), adjacent to tumor (A) and tumor (T) samples, with a β parameter for field effect and a γ parameter for tumor effect. A stepwise algorithm was used to select the best regression model. The ANOVA model is
$$Y_{in} = \mu_i + \beta_i F_{in} + \gamma_i T_{in} + \varepsilon_{in},$$
where Y_in is the expression intensity of gene i in sample n, μ_i is the baseline (N group) level, ε_in is the error term, i = 1, …, 5917 for all the genes, n = 1, …, 94 for liver samples and n = 1, …, 148 for prostate samples. The field effect binary covariate F_in = 1 for the A or T group and F_in = 0 for the N group. The tumor effect covariate T_in = 1 for the T group and T_in = 0 for the N or A group. Field effect is defined as the expression difference between normal tissues (N) and both tissues adjacent to tumor (A) and tumor tissues (T). Tumor effect is defined as a further difference between A and T. Genes satisfying the following criteria were selected: (a) statistical significance: the adjusted q-value of the final stepwise-selected ANOVA model after Benjamini-Hochberg correction is less than 0.05 (i.e. the false discovery rate is controlled below 0.05); (b) biological significance: the field effect or tumor effect is larger than 0.4 (corresponding to a fold change of about 32%). The field effect and tumor effect parameters β and γ each have three possibilities (positive, negative and no change), resulting in eight patterns as described in Figure 1A. Figures 1B and 1C show the number of genes selected in liver and prostate samples respectively and their distribution in the eight pattern categories. The intersection of the selected ANOVA genes in liver and prostate with concordant pattern categories was used to construct prediction models for within-cancer-type (Liv→Liv and Pro→Pro) and inter-cancer-type (Liv→Pro and Pro→Liv) analysis. To summarize a list of gene markers in batch I for further analysis, genes selected in more than 70% of the leave-one-out cross validation runs (see section below for more detail) in the above procedure were identified as the "batch I multi-cancer biomarkers" (batchI-MBs).
In the batch II analysis, a similar gene selection procedure was performed. Instead of ANOVA, a simple t-test was performed to distinguish normal from tumor. For the comparison of a given pair of cancer types (e.g. liver vs. lung), genes satisfying the two criteria used in batch I were first selected in each cancer type, and the intersection of the two gene lists was identified. Among them, genes with a concordant direction of differential expression (up- or down-regulation) were used to construct prediction models for within-cancer-type (Liv→Liv and Lun→Lun) and inter-cancer-type (Liv→Lun and Lun→Liv) analysis. Leave-one-out cross validation was similarly performed. For each pairwise comparison of cancer types, genes appearing in more than 70% of the leave-one-out cross validation signatures were identified and denoted as "liv-pro-MBs" (i.e. multi-cancer biomarkers in the liver-prostate comparison), "liv-lun-MBs" etc. The intersection of "liv-pro-MBs", "liv-lun-MBs" and "pro-lun-MBs" is denoted as "batchII-MBs" (see Fig. 4; the bladder cancer data appear to generate a very different biomarker list from that of the liver, prostate and lung data, as will be described later).
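A minimal Python sketch of the batch I selection logic follows, for a model of the form y = mu + beta*F + gamma*T with the field and tumor indicators defined above. It is not the authors' implementation: the function names are hypothetical, the stepwise model selection and the computation of the ANOVA p-values are omitted (p-values are taken as given), and applying the 0.4 cutoff to the absolute value of each effect is an assumption. Because the full model with three groups is saturated, its least-squares estimates reduce to differences of group means, which is what the sketch uses. A log2 difference of 0.4 corresponds to a ratio of 2^0.4, about 1.32.

```python
import statistics

def covariates(group):
    """Field (F) and tumor (T) indicators for a sample in group 'N', 'A' or 'T'."""
    field = 1 if group in ("A", "T") else 0
    tumor = 1 if group == "T" else 0
    return field, tumor

def effect_estimates(values, groups):
    """Least-squares estimates (mu, beta, gamma) of the full three-group model.
    The model is saturated, so they are differences of group means."""
    means = {g: statistics.mean(v for v, gg in zip(values, groups) if gg == g)
             for g in "NAT"}
    return means["N"], means["A"] - means["N"], means["T"] - means["A"]

def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted q-values, returned in input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    q = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):          # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        q[i] = running_min
    return q

def passes_criteria(q_value, beta, gamma, q_cut=0.05, effect_cut=0.4):
    """Criterion (a) on the q-value and criterion (b) on either effect size.
    Assumption: the cutoff applies to the absolute value of each effect."""
    return q_value < q_cut and (abs(beta) > effect_cut or abs(gamma) > effect_cut)

# Toy gene: two samples per group (hypothetical values on the log2 scale).
mu, beta, gamma = effect_estimates([1.0, 1.0, 1.5, 1.5, 2.5, 2.5],
                                   ["N", "N", "A", "A", "T", "T"])
q_values = benjamini_hochberg([0.01, 0.04, 0.03, 0.5])  # hypothetical p-values
```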
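The two set operations described here (keeping genes with a concordant direction in both cancer types, then keeping genes that recur in more than 70% of the leave-one-out signatures) can be illustrated as follows. The function names, the +1/-1 coding of direction and the toy gene names are hypothetical and serve only to make the logic concrete.

```python
from collections import Counter

def concordant_intersection(direction_a, direction_b):
    """Genes selected in both cancer types with the same direction.
    Each argument maps gene -> +1 (up in tumor) or -1 (down in tumor)."""
    shared = direction_a.keys() & direction_b.keys()
    return {g for g in shared if direction_a[g] == direction_b[g]}

def frequent_markers(fold_gene_sets, min_fraction=0.7):
    """Genes appearing in MORE than min_fraction of the leave-one-out folds."""
    counts = Counter(g for genes in fold_gene_sets for g in genes)
    n_folds = len(fold_gene_sets)
    return {g for g, c in counts.items() if c / n_folds > min_fraction}

# Toy example with hypothetical gene names.
liver = {"geneA": 1, "geneB": -1, "geneC": 1}
lung = {"geneA": 1, "geneB": -1, "geneC": -1, "geneD": 1}
common = concordant_intersection(liver, lung)

# Ten folds: geneA is selected in all, geneB in 8, geneC in exactly 7.
folds = [{"geneA", "geneB", "geneC"}] * 7 + [{"geneA", "geneB"}] + [{"geneA"}] * 2
markers = frequent_markers(folds)
```

Note that a gene selected in exactly 70% of the folds (geneC above) is excluded, since the text requires "more than 70%".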
Gene-specific scaling in inter-cancer-type classification
Figure 2 demonstrates expression patterns of one selected gene for each of the eight pattern categories (the category (N = T) > A had no gene and is omitted). We observed that gene-specific scaling was needed for many of the biomarkers so that the prediction information could be carried across organs. For example, in "APBA2BP", the expression of group A is consistently greater than that of N, and that of group T is greater still than that of A, in both liver and prostate samples. However, the expression intensities in liver and prostate are on different scales even though all the liver and prostate samples were preprocessed and properly normalized across data sets. This phenomenon may be due to differences in sample preparation, tissue physiology and/or hybridization conditions in different studies. As a result, we conducted gene-specific scaling in all inter-cancer-type classification. Conceptually, the scaling parameters are estimated so that the gene vectors in each study are standardized to mean 0 and standard deviation 1. However, since each study has a different ratio of normal to tumor samples, we performed bootstrap sampling before scaling so that the gene vectors were standardized under a synthetic condition in which the groups (N, A and T) are of equal sample size in each study (see Appendix for more details).
Classification method and leave-one-out cross validation
PAM (Prediction Analysis of Microarray) was used to construct the prediction models in this paper. 10 The method has been found effective in many microarray prediction analyses and has the merit that gene selection is embedded in the method. When "all genes" are used, the predictive genes are automatically chosen from the total of 5,917 genes to construct the prediction model. When "common signatures" are used, the common biomarkers are selected as described in the section "Biomarker selection by ANOVA and t-test" and no further gene selection is performed in PAM. Results of both gene selection procedures are reported and compared.
To avoid over-fitting in the evaluation of cross-predictability of the multi-cancer biomarkers, we conducted rigorous leave-one-out cross validation (see the prediction scheme in Figs. 3A and 3B), i.e. the left-out sample does not participate in the selection of marker genes.
Figure 2. Global sample normalization has been performed across the prostate and liver data sets. Although all these biomarkers demonstrate concordant patterns across prostate and liver, many of them (APBA2BP, SLC39A14, AGT, TOP2A and B2M) are at different expression levels, and direct application of a prediction model developed in one data set will likely perform poorly in the other data set.
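The key point of the leave-one-out scheme, that marker selection is repeated inside every fold without the held-out sample, can be sketched as below. This is an illustration only: a plain nearest-centroid rule stands in for PAM (which additionally shrinks the centroids), the ranking by difference of class means stands in for the paper's selection criteria, and all function names are hypothetical.

```python
import statistics

def select_genes(X, y, n_keep):
    """Rank genes by absolute difference of class means, using ONLY the
    training samples passed in (a stand-in for the paper's criteria)."""
    def score(j):
        m0 = statistics.mean(x[j] for x, c in zip(X, y) if c == 0)
        m1 = statistics.mean(x[j] for x, c in zip(X, y) if c == 1)
        return abs(m1 - m0)
    return sorted(range(len(X[0])), key=score, reverse=True)[:n_keep]

def nearest_centroid_predict(X, y, genes, x_new):
    """Assign x_new to the class whose centroid (on the selected genes) is
    closest in squared Euclidean distance. Simplified stand-in for PAM."""
    best_label, best_dist = None, float("inf")
    for label in sorted(set(y)):
        rows = [x for x, c in zip(X, y) if c == label]
        dist = sum((x_new[j] - statistics.mean(r[j] for r in rows)) ** 2
                   for j in genes)
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label

def loocv_predictions(X, y, n_keep):
    """Leave-one-out predictions with gene selection redone in every fold."""
    preds = []
    for i in range(len(X)):
        X_train, y_train = X[:i] + X[i + 1:], y[:i] + y[i + 1:]
        genes = select_genes(X_train, y_train, n_keep)  # held-out sample unused
        preds.append(nearest_centroid_predict(X_train, y_train, genes, X[i]))
    return preds

# Toy data: rows are samples, columns are two hypothetical genes;
# only the first gene separates class 0 (normal) from class 1 (tumor).
X = [[0.0, 5.0], [0.2, 1.0], [0.1, 3.0], [2.0, 4.0], [2.2, 2.0], [1.9, 0.5]]
y = [0, 0, 0, 1, 1, 1]
preds = loocv_predictions(X, y, n_keep=1)
```

Selecting genes once on all samples and cross-validating only the classifier would let the held-out sample influence the marker list, which is the over-fitting the text warns against.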

Confusion matrix and prediction index
Figure 3. The test sample is first set aside. The remaining samples are used for selecting multi-cancer biomarkers and constructing the prediction model used to evaluate the set-aside test sample. This scheme is used to evaluate the procedures for selecting both batchI-MBs and batchII-MBs to generate Table 2 and Table 3. (A) An example to evaluate liv→liv in Table 2. (B) An example to evaluate pro→liv in Table 2.
In the literature, the overall accuracies of different methods are usually reported to compare performance. Overall accuracy is, however, often a misleading index in practice. Supplementary Table 1 [...] groups. The two off-diagonal numbers represent the false positives and false negatives in the prediction, and their sum represents the total errors made (see Supplement Table 1). We then further summarize the prediction results by a prediction performance index (PPI), defined as the average of sensitivity and specificity, which is used throughout this paper for performance evaluation.
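To see why overall accuracy can mislead when the groups are unbalanced, the short Python sketch below computes accuracy, sensitivity, specificity and PPI from a 2 x 2 confusion matrix. The counts are those of the confusion matrix in the Supplementary Material (assumed here to be Supplement Table 1), with tumor treated as the positive class; the function name is hypothetical.

```python
def summarize(tn, fp, fn, tp):
    """Accuracy, sensitivity, specificity and PPI from a 2 x 2 confusion matrix.
    PPI is the average of sensitivity and specificity."""
    total = tn + fp + fn + tp
    accuracy = (tn + tp) / total
    sensitivity = tp / (tp + fn)   # true tumors predicted as tumor
    specificity = tn / (tn + fp)   # true normals predicted as normal
    ppi = (sensitivity + specificity) / 2
    return accuracy, sensitivity, specificity, ppi

# True normal: 1 predicted normal, 5 predicted tumor.
# True tumor:  1 predicted normal, 41 predicted tumor.
acc, sens, spec, ppi = summarize(tn=1, fp=5, fn=1, tp=41)
print(f"accuracy={acc:.3f} sensitivity={sens:.3f} specificity={spec:.3f} PPI={ppi:.3f}")
# prints: accuracy=0.875 sensitivity=0.976 specificity=0.167 PPI=0.571
```

The overall accuracy of 87.5% hides the fact that five of the six normal samples are misclassified, whereas the PPI of about 57% reflects it.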

Pathway analysis
For each gene list of multi-cancer biomarkers, the Gene Ontology (GO) database was used for pathway enrichment analysis. For each GO term, a Fisher's exact test was performed to determine the enrichment of the gene list and a p-value was generated. 11 We performed this analysis on batchI-MBs, batchII-MBs and all pairwise-comparison multi-cancer biomarkers in batch II (liv-pro-MBs, liv-lun-MBs etc.). The p-values are summarized in a heatmap (Fig. 5).
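For a single GO term, the enrichment test can be written as a one-sided Fisher's exact test, which is the upper tail of a hypergeometric distribution. The sketch below uses only the Python standard library; the function name and the choice of gene universe are assumptions (the text does not state which background gene set was used), and the toy counts are hypothetical.

```python
from math import comb

def enrichment_pvalue(n_universe, n_term, n_list, n_overlap):
    """One-sided Fisher's exact test for over-representation.
    n_universe: genes in the background set
    n_term:     background genes annotated to the GO term
    n_list:     genes in the biomarker list
    n_overlap:  biomarker genes annotated to the GO term
    Returns P(overlap >= n_overlap) under the hypergeometric null."""
    upper = min(n_term, n_list)
    tail = sum(comb(n_term, k) * comb(n_universe - n_term, n_list - k)
               for k in range(n_overlap, upper + 1))
    return tail / comb(n_universe, n_list)

# Toy example: 10 background genes, 5 annotated to the term,
# a list of 5 genes of which all 5 carry the annotation.
p = enrichment_pvalue(10, 5, 5, 5)   # equals 1/252
```

Repeating the call over all GO terms and all biomarker lists would give the matrix of p-values that a heatmap summarizes.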

External evaluation of batchII-MBs by independent prostate data
A data set of 23 prostate cancer samples generated in an independent lab 12 was used for external validation of the batchII-MBs. A total of 47 batchII-MBs were identified from the normal and tumor samples in the liver, prostate and lung data sets. To evaluate robustness and inter-cancer-type cross-predictability, a prediction model based on the 47 batchII-MBs in the normal and tumor samples of the liver data set was constructed and used to evaluate the 23 external prostate cancer samples (see "EV_liv→pro" in Fig. 3C). The evaluation of the prediction model generated from the original prostate data is denoted by "EV_pro→pro" in Figure 3D. Similarly, we also performed the "EV_lun→pro" evaluation. The 23 new samples were preprocessed in the same way as the four analyzed data sets, and simple constant normalization was applied against the original prostate data set. Additional gene-wise normalization against the original prostate data set was also applied so that the liver and lung data sets could be used to predict the 23 new prostate samples.

Results
To identify common signature genes among four types of malignancies, we started with the prostate and liver data sets in batch I analysis because of their more balanced numbers of tumor and normal samples and the availability of benign tissues adjacent to tumor. In this analysis, 1,854 genes from the liver data set and 1,139 genes from the prostate data set were found to fit the ANOVA model and meet the gene selection criteria. Among these genes, 520 were common to both organs (Venn diagram in Fig. 1B). The histogram of correlations of N vs. A vs. T patterns (average intensities of each group) across the two organs for each gene is shown in Figure 1D. The majority of the genes were highly correlated across prostate and liver, but surprisingly 113 genes presented strong negative correlation (< −0.7), which may reflect differences in tissue types. The 520 selected genes were categorized into eight patterns as demonstrated in Figure 1A. These patterns represent either tumor-specific alteration, field effect, or reactive changes. Among these 520 genes, 111 were in the same pattern categories in liver and prostate (Fig. 1C) based on our definition in Figure 1A. Further analysis of the expression of the 111 genes in both organs indicated that even though the expression patterns of these genes across N, A and T were identical in both organs, the levels of expression may vary greatly (for example, APBA2BP and SLC39A14 in Fig. 2). This suggests that direct application of a classification model constructed in one cancer type may not predict the histology of tissues in the other cancer type.
To resolve this problem, gene-specific scaling across organs was carried out for the inter-cancer-type prediction. The gene-specific scaling procedure described in the Method section and Appendix is applied in all analyses hereafter.
We performed leave-one-out cross validation throughout the prediction analyses. There are 242 samples in the liver and prostate data sets. Across the 242 leave-one-out cross validation analyses, a total of 109 common biomarkers were identified in more than 70% of the analyses, and all of them belong to the 111-gene list obtained using all liver and prostate samples described above. These 109 frequently identified biomarkers are named "batchI-MBs". Of the 109, 99 were identified as distinct multi-cancer biomarkers (Supplement Table 4). Subsequently we assessed the cross-predictability of the identified biomarkers. When using all genes, we observed high PPI in the normal versus tumor comparison (N vs. T), with 96.5% in the liver data set and 93.9% in the prostate data set, while lower accuracy was observed between adjacent and tumor tissues (79.9% in liver and 71.4% in prostate) (Table 2). When only common signature biomarkers were used, the prediction accuracy remained comparable to using all genes (N vs. T: 96.5% in liver and 98.8% in prostate; A vs. T: 75.6% in liver and 66.7% in prostate). The result suggests that the common signature biomarkers carry as much predictive information as the entire 5,917 genes. We then conducted inter-cancer-type classification analysis. We used either "all genes" (the entire 5,917 genes) or the common signatures to construct a prediction model in one cancer type and predict in another cancer type. The prediction evaluation was performed by leave-one-out cross validation. We denote by "prostate→liver" the construction of prediction models using prostate samples and the prediction of liver samples. We found that prediction with "all genes" did not perform well, with only 47.4% in liver→prostate and 66.3% in prostate→liver for the N vs. T comparison and 55.7% in liver→prostate and 51.9% in prostate→liver for the A vs. T comparison.
On the other hand, the model using common signature genes achieved much better performance, nearly as good as the within-cancer-type classification (96.3% in liver→prostate and 93% in prostate→liver for the N vs. T comparison and 65.1% in liver→prostate and 74.7% in prostate→liver for the A vs. T comparison). The results clearly demonstrate the cross-predictability of the common signatures.
Subsequently, we expanded our analysis to the prostate, lung, liver and bladder data sets (batch II analysis), using only normal and tumor tissues, to test whether common signature genes can be found across these four types of cancers. Similar analyses were performed except that ANOVA was replaced by a t-test for the two-class normal versus tumor comparison. Each pair of cancer types was analyzed. As in batch I analysis, only common signature genes with a consistent regulation direction (up-regulation or down-regulation) in both cancer types were selected. Table 3 (see also Supplement Table 3 for the entire confusion matrix results) summarizes the prediction results of batch II analysis. As in batch I analysis, we observed high prediction accuracy for within-cancer-type prediction when using all genes in PAM (96.5% for liver, 93.9% for prostate, 90.7% for lung and 88.6% for bladder). The prediction models using common signature biomarkers generated similarly high accuracy compared to using all genes (91.7%-97.7% in liver, 79.6%-95.6% in prostate, 89.4%-96.0% in lung and 97.4%-98.3% in bladder). The result confirms that the common signature biomarkers carry as much predictive information as the entire 5,917 genes. For the inter-cancer-type classification analysis, we again found that prediction with all genes did not perform well. In contrast, using common signature genes achieved much better performance (Table 4). Liver in particular seemed to be the most robust, whether used as training or test data. Bladder, however, showed slightly lower cross-predictability with the other three cancer types. The numbers of common signature genes shared by bladder with the other cancer types are also much smaller.
Following the same criterion of a selection frequency above 70% as common signatures in the cross-validations, we identified multi-cancer biomarkers for the comparison of each pair of cancer types in Table 4 (255 liv-pro-MBs, 119 liv-lun-MBs, 288 lun-pro-MBs, 53 liv-bla-MBs, 10 pro-bla-MBs and 19 lun-bla-MBs). When all possible pairs of comparisons among liver, prostate and lung were overlapped (liv-pro-lun-MBs), 47 genes were identified. After deleting replicates, 44 (out of 47) distinct multi-cancer biomarkers in liver, prostate and lung cancers were identified as batchII-MBs (Table 5). However, these common signature genes do not overlap with those from the bladder data set, indicating a lack of common signature between these cancers and bladder cancer. There are 12 overlapping genes (Fig. 4; p < 1E-10, a significantly high overlap) between batchI-MBs and batchII-MBs (marked with an asterisk in Table 5 and Supplement Table 4). Pathway analysis was performed on these multi-cancer biomarkers, indicating that fewer multi-cancer biomarkers and GO terms were identified when bladder samples were analyzed in the inter-cancer-type prediction.
To validate the robustness and cross-predictability of batchII-MBs, a data set of 23 independent prostate cancer samples obtained from another institute 12 was evaluated. The prediction model based on the 47 batchII-MBs in the 64 normal and tumor liver samples achieved 96% (22/23) accuracy in predicting the 23 independent prostate samples (the "EV_liv→pro" scheme in Fig. 3C). Evaluation of "EV_pro→pro" and "EV_lun→pro" also gave the same results (96% accuracy). Since we only have tumor samples in the external prostate data, there is a potential pitfall that the high accuracy may be an accidental result of study discrepancies between the new 23 prostate samples and the normal and tumor samples in analyzed data sets. We performed multi-dimension scaling (MDS) plots to visualize the new and old samples and excluded this possibility (Fig. 6). The new prostate tumor samples are scattered and mixed with the old tumors but separated from old normal samples. As a result, the high accuracy of the prediction on this new data set is not caused by pure "accident."

Discussion
Meta-analyses have been performed for several types of human malignancies. [13][14][15][16][17][18] However, to our knowledge, this is the first report showing that a microarray gene expression model demonstrates inter-cancer predictability between different types of cancers using the identified multi-cancer biomarkers. These results were not only evaluated in cross-validation analysis of the existing working data sets but were also validated by independent prostate tissues collected and preprocessed separately. This argues strongly in favor of the reproducibility of the multi-cancer biomarkers and the models. The 44 batchII-MBs appear to represent the common gene expression alteration among hepatocellular carcinoma, lung and prostate cancer. They follow similar patterns of differential expression in normal and tumor tissues for prostate, lung and liver cancer. Surprisingly, these gene signatures predict prostate, lung and hepatocellular carcinoma with accuracy similar to that obtained using the information of all 5,917 genes in each within-cancer-type prediction in prostate, lung or liver cancer. This suggests that the 44 genes are the major determinant of gene expression alteration in these three types of cancers.
[...] Table 5 and 109 batchI-MBs are listed in Supplement Table 4.
Table 5. The 44 batchII-MBs overlapped by pair-wise comparisons of the liver, prostate and lung data sets (liv-pro-MB, liv-lun-MB, pro-lun-MB). The first 12 genes, marked with an asterisk, overlap with batchI-MBs. The signed mean fold change shows the mean fold change of tumor versus normal when positive (up-regulation) and of normal versus tumor when negative (down-regulation).

Comparing the 44 genes with published lists of potential biomarkers yielded high overlap (28 overlapped with the 3,312-gene list generated in Bhattacharjee et al., 7 22 with the 2,413-gene list generated in Luo et al., 5 and 16 with the 726-gene list generated in Yu et al. 6 ). The high level of inter-organ cancer predictability using just 44 genes implies that the core of cancer gene alterations may actually be quite small. The alterations in the expression of these genes could represent common features of the three types of malignancies. None of these genes was, however, identified as the most significantly altered in bladder cancer, suggesting the dissimilarity of bladder cancer to these three types of cancer. Among these genes is an interferon-inducible protein, 1-8D (IFITM2, 411_i_at). This gene is a known important mediator of interferon-induced cell growth inhibition and induction of cell death. 19,20 1-8D was down-regulated in hepatocellular carcinoma, lung cancer and prostate cancer, while pro-growth genes such as cyclin B1 (CCNB1, 34736_at) were significantly up-regulated in the three types of tumor samples. Other genes involved in growth control, including growth arrest specific 6 (GAS6, 1597_at) and G0/G1 switch 2 (G0S2, 38326_at), are also abnormally expressed in these tumors. The 44-gene list also includes six metallothioneins: 1A, 1B, 1E, 1F, 1H and 2A (MT1A, 31623_f_at; MT1B, 609_f_at; MT1E, 36130_f_at; MT1F, 31622_f_at; MT1H, 39594_f_at; MT2A, 39081_at). Metallothioneins are low molecular weight zinc-binding proteins that play an important role in regulating the transcriptional activity of a variety of genes and a crucial role in zinc signaling. 21,22 Abnormal up-regulation of these genes may result in a global pattern of gene expression alteration. Up-regulation of metallothioneins was thought to have prognostic value in invasive ductal breast cancer. 23
CCNB1 and most of the metallothioneins were also identified in batchI-MBs, where adjacent tissues were included in the analysis. In the pathway analysis, we also observed many cancer-related functional categories, including "mitotic checkpoint", "apoptotic program", "copper ion binding" and "cadmium binding". Investigation into the abnormalities of these pathways may yield important insight into the common carcinogenesis mechanism of the tumors. A possible direction for future work is to study sequential biopsies during the progression of different tumors in a mouse model and analyze the expression changes of the biomarkers identified in this paper. Such rigorous validation of signature genes can help create a carcinogenic model and reduce inter-individual genetic differences.
The clinical implication of our finding is two-fold. First, if the prediction of hepatocellular carcinoma, lung cancer and prostate cancer using our 44 batchII-MBs is interchangeable, we hypothesize that the abnormalities in the expression of the 44 genes represent common features of these malignancies. Therapeutic targeting of some of these genes could be of significant value in treating these malignancies. Second, the 99 batchI-MBs predict tissues adjacent to malignancies versus completely normal organ tissues with high accuracy. This model may be able to serve as a predictor of nearby malignancies even if a biopsy misses its tumor target. This may serve as an indicator for a quick follow-up re-biopsy until the tumor(s) is identified. Alternatively, the detection of a strong cancer field effect change may argue for some prophylactic treatments before morphological cancer appears.

Supplementary Material
Bootstrap procedure for gene-wise normalization
Conceptually we standardize each gene vector to mean 0 and standard deviation 1 to accommodate the different expression ranges of a predictive biomarker across different studies (e.g. APBA2BP, SLC39A14, AGT, TOP2A and B2M in Fig. 2). Since the ratios of normal and tumor groups can vary in different studies, simple standardization can cause bias and degrade the prediction performance. Instead we use the bootstrap to sample a gene vector of B = 1,000 samples in each group and standardize the vector of 2,000 (3,000 if the N, A and T groups are all compared) bootstrapped samples to mean 0 and standard deviation 1 to estimate the standardization factors. Essentially we perform standardization under the simulated condition that normal and tumor groups have the same sample sizes.
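A minimal Python sketch of this balanced bootstrap scaling for one gene in one study is given below. The function names, the fixed random seed and the use of the population standard deviation on the pooled bootstrap sample are assumptions made for illustration; the text specifies only B = 1,000 draws per group and standardization to mean 0 and standard deviation 1.

```python
import random
import statistics

def balanced_scaling_factors(values, groups, n_boot=1000, seed=0):
    """Estimate the center and spread of one gene under equal group sizes.
    Draws n_boot values with replacement from each group, pools them, and
    returns the mean and standard deviation of the pooled sample."""
    rng = random.Random(seed)
    pooled = []
    for g in sorted(set(groups)):
        members = [v for v, gg in zip(values, groups) if gg == g]
        pooled.extend(rng.choices(members, k=n_boot))  # sampling with replacement
    return statistics.mean(pooled), statistics.pstdev(pooled)

def scale_gene(values, groups, n_boot=1000, seed=0):
    """Standardize the original gene vector with the bootstrap-based factors."""
    center, spread = balanced_scaling_factors(values, groups, n_boot, seed)
    return [(v - center) / spread for v in values]

# Toy gene in an unbalanced study: 2 normal samples at 0.0, 8 tumor samples at 2.0.
values = [0.0, 0.0] + [2.0] * 8
groups = ["N", "N"] + ["T"] * 8
center, spread = balanced_scaling_factors(values, groups)
scaled = scale_gene(values, groups)
```

In the toy data the plain mean is 1.6 because tumors dominate, whereas the balanced center is 1.0, midway between the two groups, so normal samples map to -1 and tumor samples to +1 regardless of the group ratio.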

                              True normal tissues   True tumor tissues
Predicted as normal tissues   1                     1
Predicted as tumor tissues    5                     41
Supplement Table 2. Batch I leave-one-out cross validation analysis result (confusion matrix).
Supplement Table 3. Batch II leave-one-out cross validation analysis result (confusion matrix).
Table 3 and the corresponding Table 4.
Table 4. A total of 109 biomarkers are identified in more than 70% of leave-one-out cross validation in batch I (batchI-MBs