Molecular subtyping of bladder cancer using Kohonen self-organizing maps

Kohonen self-organizing maps (SOMs) are unsupervised Artificial Neural Networks (ANNs) that are good for low-density data visualization. They easily deal with complex and nonlinear relationships between variables. We evaluated molecular events that characterize high- and low-grade bladder cancer (BC) pathways in the tumors from 104 patients. We compared the ability of statistical clustering and of a SOM to stratify tumors according to the risk of progression to more advanced disease. In univariable analysis, tumor stage (log rank P = 0.006) and grade (P < 0.001), HPV DNA (P < 0.004), chromosome 9 loss (P = 0.04), and the A148T polymorphism (rs3731249) in CDKN2A (P = 0.02) were associated with progression. Multivariable analysis of these parameters identified tumor grade (Cox regression, P = 0.001, OR 2.9 (95% CI 1.6–5.2)) and the presence of HPV DNA (P = 0.017, OR 3.8 (95% CI 1.3–11.4)) as the only independent predictors of progression. Unsupervised hierarchical clustering grouped the tumors into discrete branches but did not stratify according to progression-free survival (log rank P = 0.39). These genetic variables were presented to SOM input neurons. SOMs are suitable for complex data integration, allow easy visualization of outcomes, and may stratify BC progression more robustly than hierarchical clustering.


Introduction
Bladder cancer (BC) is a common disease for which the outcomes have not improved in the last three decades [1]. This probably reflects a lack of community-based screening for the disease, the poor response of advanced BC to chemotherapy, and the difficulty of judging the need for radical treatment in patients with non-muscle invasive (NMI) disease. The latter arises primarily from a lack of knowledge regarding the biology of this disease. Clinicopathological and molecular data suggest two distinct pathways of urothelial carcinogenesis [2,3]. Low-grade NMI cancers arise through regional deletion of chromosome 9 and mutation of FGFR3 (Fibroblast Growth Factor Receptor 3) and H-RAS [4]. High-grade tumors may present with or before the onset of muscle invasion and are best characterized by loss of p53 (Tumor Protein) function through direct (e.g., mutation or deletion of TP53) or indirect (e.g., loss of RB1-Retinoblastoma or upregulation of MDM2-Murine Double Minute) means [5]. High-grade tumors also have widespread chromosomal instability (polysomy, aneuploidy) and numerous changes to their epigenome [6,7].
While the two-pathway biology of BC is generally accepted, many tumors have aspects of both low- and high-grade biology. For example, FGFR3 mutations are not found in CIS (carcinoma in situ), but they coexist with TP53 mutations in 10-20% of invasive BCs, as do deletions of both chromosome 9 (typical of low-grade disease) and 17p (locus of TP53) in 15-74% of BCs [4,8]. Clinical phenotypes, therefore, reflect either the timing or impact of genetic events combined with patient factors (such as type and continued exposure to carcinogens) and treatment effectiveness (such as timing, appropriateness, and quality of treatment). A current challenge for translational researchers is to integrate distinct and, potentially, competing molecular events into single-phenotype predictions. In BC, this represents the ability to discriminate future tumor behavior using molecular alterations typical for low- and high-grade tumor development. Nonstatistical methods are appealing in this role as they do not rely upon data distribution, can handle large datasets automatically without supervision or prior assumptions, and do not assume that statistical proximity equates to molecular association [9]. Various structures of artificial intelligence have been developed, of which Artificial Neural Networks (ANNs) are perhaps the best evaluated (reviewed in Ref. [10]). Here, we report the use of a self-organizing map (SOM) to integrate molecular parameters in BC. SOMs are a type of unsupervised ANN that is good for low-density data visualization [11]. We selected molecular events that characterize high- and low-grade BC pathways and used progression to more advanced disease as our primary outcome.

Patients, tumors, and samples
A total of 104 patients with BC were studied in this report (data in Table 1). The tumors were chosen at random to represent the disease spectrum from three Departments of Urology located in the Lodz Macroregion, where the textile industry was prominent in the previous century. Tumors were graded according to the 1973 WHO classification and staged using the TNM criteria [12]. This study was approved by the ethics committee of the Medical University of Lodz (No: RNN/99/11/KE) and all patients gave written informed consent before entry.

RNA and DNA extraction
RNA and DNA were extracted from bladder tumors, peripheral blood, and urinary sediments. For tumors, frozen tissues were homogenized in TRI REAGENT (guanidine thiocyanate/phenol, Molecular Research Center, Inc., cat. No TR-118) using ceramic beads (Roche MagNA …).

Quantitative polymerase chain reaction
TP53 expression was measured using quantitative polymerase chain reaction (qPCR) performed on an iCycler iQ System (Bio-Rad cat. No 170-8701, 1709750) [14]. Expression was determined by SYBR Green I fluorescence and normalized with respect to the GAPDH (Glyceraldehyde-3-Phosphate Dehydrogenase) and HPRT (Hypoxanthine-guanine Phosphoribosyltransferase) genes.
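The text does not state which normalization formula was applied. As an illustration only, the sketch below assumes the common 2^-ΔCt approach with 100% amplification efficiency, in which the Ct values of the two reference genes are averaged (equivalent to a geometric mean of their quantities). The function name and the Ct values are hypothetical and are not taken from the study.

```python
def relative_expression(ct_target, ct_references):
    """Relative expression of a target gene by the 2^-dCt method.

    Assumption: 100% amplification efficiency for every gene. Averaging the
    reference Ct values is equivalent to normalizing against the geometric
    mean of the reference gene quantities.
    """
    if not ct_references:
        raise ValueError("at least one reference gene Ct is required")
    mean_reference_ct = sum(ct_references) / len(ct_references)
    delta_ct = ct_target - mean_reference_ct
    return 2.0 ** (-delta_ct)


# Hypothetical Ct values for one tumor sample (not data from the study).
tp53_ct = 26.0
gapdh_ct = 20.0
hprt_ct = 24.0

tp53_relative = relative_expression(tp53_ct, [gapdh_ct, hprt_ct])
print(tp53_relative)
```

Using two reference genes rather than one reduces the influence of sample-specific variation in either reference on the normalized value.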

UroVysion test
The UroVysion (Vysis) test consists of a four-color, four-probe mixture of DNA probe sequences homologous to specific regions on chromosomes 3, 7, 9, and 17, and was carried out according to the manufacturer's protocol.

Generation of a self-organizing map
The dataset (10 genetic variables × 104 patients) was presented to 10 input neurons seven times in the rough-training phase and 27 times in the fine-tuning phase. The number of input neurons was equal to the number of variables in the dataset. On the basis of the established link between the input and output neurons, a virtual patient (in terms of values of the genetic variables presented to the SOM) was created in each output neuron. The output neurons were arranged on a two-dimensional grid (4 × 4). To cluster the virtual patients (and the respective output neurons), hierarchical cluster analysis with the Ward linkage method and Euclidean distance measure was used [20-22]. Finally, each real patient was assigned to the best matching virtual patient and the respective output neuron. [25].
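The training procedure described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the software used in the study: the grid size (4 × 4), the number of input variables (10), and the numbers of rough-training (7) and fine-tuning (27) passes come from the text, whereas the learning rates, neighborhood radii, Gaussian neighborhood function, initialization scheme, and the synthetic binary data are all assumptions made for the example.

```python
import math
import random

ROWS, COLS = 4, 4              # output grid stated in the text
N_PATIENTS, N_VARS = 104, 10   # dataset dimensions stated in the text


def best_matching_unit(codebook, x):
    """Index of the output neuron whose weights are closest to x."""
    return min(range(len(codebook)),
               key=lambda j: sum((w - v) ** 2 for w, v in zip(codebook[j], x)))


def train_som(data, phases, seed=0):
    """Train a ROWS x COLS map; phases is a list of (epochs, rate, radius)."""
    rng = random.Random(seed)
    coords = [(r, c) for r in range(ROWS) for c in range(COLS)]
    # Initialize every output neuron with a copy of a randomly drawn sample.
    codebook = [list(rng.choice(data)) for _ in coords]
    for epochs, rate0, radius in phases:
        for epoch in range(epochs):
            rate = rate0 * (1.0 - epoch / epochs)  # linear decay within a phase
            order = list(data)
            rng.shuffle(order)
            for x in order:
                br, bc = coords[best_matching_unit(codebook, x)]
                for (r, c), weights in zip(coords, codebook):
                    # Gaussian neighborhood on the grid around the winner.
                    grid_d2 = (r - br) ** 2 + (c - bc) ** 2
                    h = math.exp(-grid_d2 / (2.0 * radius ** 2))
                    for k, value in enumerate(x):
                        weights[k] += rate * h * (value - weights[k])
    return codebook


# Synthetic binary marker data standing in for the 10 genetic variables.
rng = random.Random(42)
data = [[1.0 if rng.random() < 0.3 else 0.0 for _ in range(N_VARS)]
        for _ in range(N_PATIENTS)]

# 7 rough-training passes, then 27 fine-tuning passes (counts from the text;
# learning rates and radii are assumed values).
codebook = train_som(data, phases=[(7, 0.5, 2.0), (27, 0.05, 1.0)])

# Each codebook vector plays the role of a "virtual patient"; each real
# patient is assigned to its best matching output neuron.
assignments = [best_matching_unit(codebook, x) for x in data]
print(sorted(set(assignments)))
```

Because every update moves a weight vector part of the way toward a presented sample, each codebook vector stays within the range of the input data and can be read as the marker profile of a virtual patient.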

Statistical data analysis
The primary aim of our study was to evaluate the ability of the SOM to integrate molecular data from BC samples. To this end, we analyzed its ability to stratify tumor progression using log-rank analysis and by plotting survival using the Kaplan-Meier method (SPSS version 19.0, IBM Inc., New York, NY) (Fig. 1). Progression was defined as the development of invasion in an NMI cancer or of metastases from an invasive cancer. Both these events mark a significant deterioration in prognosis for the patient and a need to alter treatment intent. For comparison with the SOM, we used an unsupervised hierarchical approach to cluster tumors using city block distance and average linkage in Cluster 3.0 (Eisen Lab, University of California, Berkeley, CA) and TreeView.
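For readers unfamiliar with the comparison method, the following is a minimal sketch of agglomerative clustering with city block distance and average linkage. It is a naive pure-Python illustration rather than the Cluster 3.0 implementation: it stops at a requested number of clusters instead of building a full dendrogram, and the function names and marker profiles are hypothetical.

```python
def city_block(a, b):
    """City block (Manhattan) distance between two marker profiles."""
    return sum(abs(x - y) for x, y in zip(a, b))


def average_linkage(points, n_clusters):
    """Naive agglomerative clustering; returns lists of point indices."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Average linkage: mean distance over all cross-cluster pairs.
                total = sum(city_block(points[p], points[q])
                            for p in clusters[i] for q in clusters[j])
                dist = total / (len(clusters[i]) * len(clusters[j]))
                if best is None or dist < best[0]:
                    best = (dist, i, j)
        _, i, j = best
        clusters[i] = sorted(clusters[i] + clusters[j])
        del clusters[j]
    return clusters


# Hypothetical binary marker profiles for six tumors (not study data).
profiles = [
    [1, 1, 0, 0],
    [1, 1, 0, 1],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 0],
    [0, 0, 1, 1],
]

groups = average_linkage(profiles, n_clusters=2)
print(groups)  # [[0, 1, 2], [3, 4, 5]]
```

Unlike the SOM, this procedure groups tumors purely by pairwise distance and imposes no spatial ordering on the resulting clusters.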

Patients and tumors
The population studied was typical for bladder cancer. Most patients were male; the average age was 66 ± 11 years, and most had a history of cigarette smoking. Around two-thirds of tumors were NMI (Table 1) and most were of low or moderate grade. Following treatment, recurrence was observed in 24 patients (23%) and progression to invasion or metastases in 15 (14%).

Progression from genetic markers
The primary outcome for our study was disease progression to a more advanced stage. In univariable analysis, tumor stage (log rank P = 0.006) and grade (P < 0.001), HPV DNA (P < 0.004), chromosome 9 loss (P = 0.04), and the A148T polymorphism (rs3731249) in CDKN2A (P = 0.02) were associated with progression following treatment. Multivariable analysis of these parameters identified tumor grade (Cox regression, P = 0.001, OR 2.9 (95% CI 1.6-5.2)) and the presence of HPV DNA (P = 0.017, OR 3.8 (95% CI 1.3-11.4)) as the only independent predictors of progression. Unsupervised hierarchical clustering grouped the tumors into several branches (Figs. 1 and 3). This approach did not significantly stratify progression-free survival (log rank P = 0.39).
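The log-rank comparisons reported above were computed in SPSS. As an illustration of what the test measures, the sketch below implements a standard two-group log-rank test in plain Python, with the P value taken from the chi-square distribution with one degree of freedom. The function name and the follow-up times are hypothetical and do not reproduce any result from the study.

```python
import math


def logrank_test(times_a, events_a, times_b, events_b):
    """Two-group log-rank test; returns (chi_square, p_value), 1 d.f.

    times_*: follow-up times; events_*: True if progression was observed,
    False if the patient was censored at that time.
    """
    samples = ([(t, e, 0) for t, e in zip(times_a, events_a)] +
               [(t, e, 1) for t, e in zip(times_b, events_b)])
    event_times = sorted({t for t, e, _ in samples if e})
    observed_a = expected_a = variance = 0.0
    for t in event_times:
        # Patients still at risk just before time t, per group.
        n_a = sum(1 for s, _, g in samples if s >= t and g == 0)
        n_b = sum(1 for s, _, g in samples if s >= t and g == 1)
        # Events occurring exactly at time t, per group.
        d_a = sum(1 for s, e, g in samples if s == t and e and g == 0)
        d_b = sum(1 for s, e, g in samples if s == t and e and g == 1)
        n, d = n_a + n_b, d_a + d_b
        observed_a += d_a
        expected_a += d * n_a / n
        if n > 1:
            variance += d * (n_a / n) * (n_b / n) * (n - d) / (n - 1)
    if variance == 0:
        return 0.0, 1.0
    chi_square = (observed_a - expected_a) ** 2 / variance
    # Survival function of the chi-square distribution with 1 d.f.
    p_value = math.erfc(math.sqrt(chi_square / 2.0))
    return chi_square, p_value


# Hypothetical follow-up in months for two clusters of patients.
cluster_1_times = [1, 2, 3, 4, 5]
cluster_1_events = [True, True, True, True, True]
cluster_2_times = [10, 11, 12, 13, 14]
cluster_2_events = [True, True, True, True, True]

chi_square, p_value = logrank_test(cluster_1_times, cluster_1_events,
                                   cluster_2_times, cluster_2_events)
print(chi_square, p_value)
```

A grouping stratifies progression-free survival in this sense only when the observed number of events in a group departs from the number expected under identical hazards by more than chance would explain.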

Clusters
Two main clusters of SOM output neurons were distinguished, X and Y, each with a pair of subclusters: X1 and X2, and Y1 and Y2 (Fig. 4). Patients with the worst prognosis were assigned to X1 and X2 (UroVysion test positive in 100% and 93%, respectively, and a high frequency of TP53 mutations; data in Table 2 and Fig. 2). The highest frequencies of (1) abnormal TP53 expression (57%) and (2) loss of heterozygosity at loci on chromosomes 9, 13, and 17 (71%) were recorded for patients in subcluster X2. In Y1, the UroVysion test was negative for all patients, and the FGFR3 mutation frequency was relatively high (38%). In Y2, the UroVysion test was positive in 86% of patients and all of them had an FGFR3 mutation. These differences were also reflected in clinical variables (Table 2). High-grade tumors and those with a larger diameter were grouped mostly in subclusters X1 and X2. The highest rate of recurrence (29%) was observed in subcluster Y1, in which all UroVysion results were negative and no TP53 mutations were found. Significant difference in frequency

Discussion
Our knowledge of the molecular changes in BC has grown considerably over recent years [25]. Currently, a number of conventional clinicopathological factors are useful in predicting survival of bladder cancer patients. These include tumor grade, stage, type, size, the presence of concomitant carcinoma in situ, patient age, tumor location, and presence of multiple tumors [26]. As yet, there are no criteria that robustly predict the clinical outcome for individual patients with BC. Improvements in prediction may be made by the gain of information (e.g., through molecular biology) or by alternate methods of analysis. With this in mind, we undertook this study to evaluate the ability of a SOM to integrate clinical-molecular information for stratifying outcomes in BC.
Statistical techniques such as Cox's proportional hazards and logistic regression are usually employed when analyzing prognostic information. Classic statistical modeling requires the explicit assumption of certain relationships within the data that are often unproven. ANNs offer a number of theoretical advantages, including the ability to detect complex nonlinear relationships between variables, the ability to detect all possible interactions between predictor variables, and the availability of multiple training algorithms [27]. The ANN techniques described in the literature can be broadly categorized under two headings: supervised and unsupervised. A Kohonen SOM consists of a feed-forward neural network that uses unsupervised training (partitional clustering). This means that the data are directly divided into a set of clusters without any regard to the relationships between the clusters. These methods try to maximize some measure of similarity within the units (patients) of each cluster, while minimizing the similarity between clusters [28]. A SOM is a combination of partitional clustering and projection methods. It can be used at the same time both to reduce the amount of data by clustering and to construct a nonlinear projection of the data onto a low-dimensional display. In contrast to other clustering methods, the units in a SOM become organized in such a way that nearby units on the grid are similar to one another. The topology of the grid can be anything, but in practice rectangular two-dimensional grids are preferred as they are easy to display [28,29].
In managing patients with BC, one of the principal problems for the clinician is predicting tumor recurrence and progression. It is likely that a combination of clinical, pathological, and molecular data is needed to optimize these outcome predictions. The future of molecular biomarkers in BC undoubtedly lies in panels of markers that represent high- and low-grade disease. Examples of these include FGFR3 and TP53 mutations, which are associated with a better or worse prognosis, respectively [30]. Additional genetic changes that reflect underlying malignant traits, such as numerical chromosomal alterations from genetic instability, are useful as they identify global patterns within a disease rather than focusing upon specific events [31]. In this work, we included many of these changes in an attempt to genotype tumors. In 2010, Catto et al. and Kim et al. identified six and eight progression-related genes in BC from microarray and either neurofuzzy modeling or hierarchical clustering, respectively [32,33]. Of interest, the genes in these panels do not overlap, as found in other cancers [34,35].
Here, we used a SOM to explore a similar clinical scenario. We found that the SOM was easily understood by the clinician and could cluster tumors according to future clinical outcomes. SOMs appear to do this better than more traditional statistical analyses. In clinical care, this stratification could identify aggressive tumors needing early radical treatment and indolent ones suitable for less intense surveillance. The potential of SOMs lies in providing real-time help to guide patient choices. For example, in breast cancer detection, an unsupervised ANN model improved diagnostic performance when compared to classical feed-forward neural networks like multilayer perceptron (MLP), radial basis function (RBF), and probabilistic neural networks (PNN) [36,37]. Thus, it is possible that SOMs could be integrated into patient pathways and used to guide their surveillance frequency or even treatment intent.
There are a number of limitations to our work. For example, the analysis was based on a relatively low number of patients with a low event rate (number of cases with progression). However, we analyzed a large number of genetic events that are known to characterize distinct BC molecular pathways, and as such, this work represents the first in BC to integrate clinical, molecular, and environmental prognostic biomarkers.

Figure 5. The associations (stronger if brighter red) of virtual patients' features with SOM regions. The intensity of colours is scaled independently for each variable. Variables with the same pattern over the SOM are positively correlated. If the frequency of real patients with a given feature is significantly highest in any subcluster as compared to others, the symbol of the subcluster and the respective significance level (*P < 0.05; **P < 0.01; ***P < 0.001) are shown along with the variable name.

Conclusions
We have shown that a Kohonen SOM could cluster homogeneous tumors according to genotype and that this clustering stratified clinical outcomes. SOMs are easy to understand and potentially outperform traditional statistical analyses. As such, their use needs further evaluation, but they could offer a real-time solution to integrating molecular data into patient pathways.