Hormone Receptor and ERBB2 Status in Gene Expression Profiles of Human Breast Tumor Samples

The occurrence of large publically available repositories of human breast tumor gene expression profiles provides an important resource to discover new breast cancer biomarkers and therapeutic targets. For example, knowledge of the expression of the estrogen and progesterone hormone receptors (ER and PR), and that of the ERBB2 in breast tumor samples enables choice of therapies for the breast cancer patients that express these proteins. Identifying new biomarkers and therapeutic agents affecting the activity of signaling pathways regulated by the hormone receptors or ERBB2 might be accelerated by knowledge of their expression levels in large gene expression profiling data sets. Unfortunately, the status of these receptors is not invariably reported in public databases of breast tumor gene expression profiles. Attempts have been made to employ a single probe set to identify ER, PR and ERBB2 status, but the specificity or sensitivity of their prediction is low. We enquired whether estimation of ER, PR and ERBB2 status of profiled tumor samples could be improved by using multiple probe sets representing these three genes and others with related expression. We used 8 independent datasets of human breast tumor samples to define gene expression signatures comprising 24, 51 and 14 genes predictive of ER, PR and ERBB2 status respectively. These signatures, as demonstrated by sensitivity and specificity measures, reliably identified hormone receptor and ERBB2 expression in breast tumors that had been previously determined using protein and DNA based assays. Our findings demonstrate that gene signatures can be identified which reliably predict the expression status of the estrogen and progesterone hormone receptors and that of ERBB2 in publically available gene expression profiles of breast tumor samples. Using these signatures to query transcript profiles of breast tumor specimens may enable discovery of new biomarkers and therapeutic targets for particular subtypes of breast cancer.


Introduction
The accurate assessment of the expression of the estrogen and progesterone hormone receptors (ER and PR) and that of ERBB2 is essential to select the appropriate therapy for breast cancer patients [1,2,3,4,5]. Knowledge of the expression of the latter biomarkers is also advantageous to develop new therapies that may target specific subtypes of breast cancer [6,7]. ER and PR status is routinely defined by immunohistochemistry (IHC), whereas that of ERBB2 is determined by either IHC or by fluorescence in-situ hybridization (FISH) [8,9]. However, despite standardization of the methods used to define the status of the hormone receptors and ERBB2 in clinical laboratories, there is a level of subjectivity in these measurements, leading to variability among results obtained by different pathologists and laboratories [10,11,12,13]. It has been suggested that more accurate and less subjective methods would improve the classification of human breast tumors [14].
Global gene expression profiling is widely used to examine the expression of thousands of genes in biological samples [15]. Indeed, this technology has been used extensively in numerous breast cancer studies to: examine the effects of various therapies on gene transcripts [16,17]; identify differences in gene expression among different tumor tissues [18,19,20,21]; molecularly classify tumors [22,23,24]; and to predict prognosis [25,26,27] and treatment outcomes [28,29,30]. Attempts to use gene expression profiles to identify the ER, PR and ERBB2 status of human breast tumors have also been reported [14,31,32]. A single probe set representative of each gene was informative to establish ER, PR and ERBB2 expression in breast tumor samples. However, we wondered whether the specificity and/or sensitivity of this method could be improved by using probe sets representative of multiple genes (gene signatures) whose expression correlated with that of the hormone receptors and ERBB2.
Many peer-reviewed journals require authors to deposit microarray data in public depositories, such as the Gene Expression Omnibus [33] or ArrayExpress [34], thereby making them publicly available for various applications [35]. However, clinical information such as hormone receptor or ERBB2 status of breast tumor samples is not invariably provided with their global gene expression profiles. Knowledge of hormone receptor and ERBB2 status as well as the global gene expression profiles of breast tumor samples may permit more accurate prognostic tests to be developed and would strengthen the value of the many breast tumor gene expression profiles in public depositories.
Here we used 8 independent datasets containing human breast tumor samples profiled on Affymetrix GeneChips to define gene expression signatures predictive of their ER and PR status as well as that of ERBB2. These gene signatures reliably predicted the status of the hormone receptors and that of ERBB2 as assessed by protein (IHC) or DNA (FISH) based tests. Because the largest predictive signature defined in our study comprises only 51 genes, a qRT-PCR based format may be developed that could provide an objective and relatively high-throughput alternative for the IHCbased definitions of hormone receptor and ERBB2 status in patient samples. Figure 1 shows the specificity and sensitivity values for sets of genes predictive of ER status selected by using Spearman rank correlation cutoffs between 0.42 and 0.48. To find the most predictive set of genes, we selected those that yielded the highest combination (here the sum) of specificity and sensitivity values. The identified gene signature consisted of 35 probe sets, representing 24 annotated genes ( Table 1). Of these 24 genes, one is the ESR1 itself, whereas 11 are related to the expression of the ER: the latter include genes (GATA3, GFRA1, IL6ST, and STC2) whose expression correlates positively with that of the ER [36,37,38]; genes (CA12, CYP2B6, GREB1, LIV1, TFF1, and KDM4B) whose expression is positively regulated by the ER [37,39,40,41,42,43]; and a gene located in close proximity to ESR1 (C6orf97) [44], and whose expression is therefore positively correlated with that of the ER. Importantly, several of these genes are represented by multiple probe sets indicating that they robustly detect their cognate transcripts in breast tumor RNA samples (Table 1). Twelve remaining genes (ADCY9, ANXA9, AMFR,  CELSR1, CYP2B7P1, FAM176B, GAMT, KCNK15, SCCPDH,  SCUBE2, SSH3, and TBC1D9) have not been previously associated with ER status. Interestingly, SCUBE2 is reported to positively correlate with PR status [45]. Because our ER signature comprises 24 genes and one probe set for an unknown gene, we refer to the signature as the ''24-gene ER signature''. The 24-gene ER signature separated ER-positive tumors from ER-negative tumors with an accuracy of 88.66%, sensitivity of 91.18%, specificity of 88.26%, PPV (Positive Predictive Value) of 98.43% and NPV (Negative Predictive Value) of 55.36% in the 247 training samples ( Table 2; p,2.2?10 216 , Fisher's exact test). To determine whether the predictive performance of a single probe set is sufficient to determine ER status of a sample we used ''205225_at'', the probe set with the highest Spearman rank correlation in the 24-gene ER signature (Spearman rank correlation is 0.50; see Table S1), which we termed ''best probe set'' for the ER predictive signature. It is of interest, that the ''best probe set'' was the same probe set conventionally used to determine ER status (205225_at; see Table  S1). The prediction accuracy of the ''best probe set'' was 89.07%, sensitivity 89.67%, specificity 85.29%, PPV 97.45% and NPV 56.86% (Table 2; p,2.2?10 216 , Fisher's exact test). Both the sensitivity and specificity of prediction by using the ''best probe set'' were lower than were the sensitivity and the specificity of the prediction using the 24-gene ER signature, indicating that the predictive performance of the single ''best probe set'' is not as high as the performance of our signature.

ER status
We subsequently tested the predictive performance of the 24gene signature in 5 independent validation datasets ( Table 2). The first validation set (GSE2034) comprised 286 samples; the Figure 1. Selecting gene signature predictive of ER status based on sensitivity and specificity. The cutoff is based on Spearman rank correlation coefficients. The number of probe sets in each signature is marked by the number under the lowest curve. Black filled circles -specificity; gray circles -sensitivity; black line -sum of specificity and sensitivity. The optimal number of probe sets was 35, with Spearman rank correlation coefficient cutoff set at 0.43. Gray line and ''*'' indicate the sum of specificity and sensitivity of the prediction obtained by using a single ''best probe set'' (''205225_at''  Figure 2 and Table S4 depict the sensitivity and specificity levels obtained for the training and the validation sets using the 24-gene ER signature, compared to those derived by using the conventional method of employing a single probe set (205225_at). The sensitivity levels obtained by using a single probe set were relatively Each row in the coefficient column represents a probe set. Genes, whose levels of expression were previously reported to correlate with ER status are marked in bold. The rows were sorted alphabetically according to the Gene Symbol. For detailed information on the probe sets see Table S1. doi:10.1371/journal.pone.0026023.t001 high, ranging between 85.71% (GSE20271) and 98.21% (GSE2603); however, the specificity levels were significantly lower than these obtained using the 24-gene ER signature, ranging between 68.29% (GSE2603) and 85.96% (GSE20194; p,0.05, t-test). Hence the 24-gene ER signature significantly improved the specificity levels of ER status prediction (p,0.05, t-test) to range between 80.6% and 100% without adversely affecting sensitivity levels. Figure 3 shows the specificity and sensitivity values for gene sets predictive of ERBB2 status selected by using Spearman rank correlation cutoffs between 0.34 and 0.39. For the first training set (GSE2603; Figure 3, left panel), the sum of specificity and sensitivity was constant for the examined range of Spearman rank correlation cutoffs. Therefore, we used an additional set of samples for training (GSE20271; Figure 3, right panel), which led to the highest combination of specificity and sensitivity values at a cutoff of 0.35, yielding a gene signature consisting of 14 annotated genes (represented by 18 probe sets) and 1 probe set representing an unknown sequence ( Table 3). The ERBB2 gene and 5 other genes (CRK7, GRB7, PERLD1, PPARBP, and STARD3) are part of the 17q12-q21 amplicon and are reported to be co-amplified with the ERBB2 locus [46]. Several of these genes are represented by a number of probe sets indicating that they readily detect their cognate transcripts in breast tumor RNA samples ( Table 3). The remaining 8 genes comprising the candidate ERBB2 gene signature have not previously been reported to correlate with ERBB2 expression. Because our signature comprises 14 genes and one probe set representing an unannotated gene we henceforth To determine whether the predictive performance of a single probe set is sufficient to determine ERBB2 status, we used the ''203497_at'', the probe set with the highest Spearman rank correlation in the 14-gene ERBB2 signature (Spearman rank correlation is 0.45; see Table S2), which we termed the ''best probe set'' for the ERBB2 predictive signature. For the first training set (GSE2603) the predictive accuracy of the ''best probe set'' was 96.59%, sensitivity 87.5%, specificity 97.5%, PPV 77.78% and NPV 98.73% (Table 4; p,4.4?10 28 , Fisher's exact test). For the second training set (GSE20271) the predictive accuracy of the ''best probe set'' was 86.11%, sensitivity 40.91%, specificity 94.26%, PPV 56.25% and NPV 89.84% (Table 4; p,5.2?10 25 , Fisher's exact test). Although predictions by using ''best probe set'' in both training sets provided similar results, the sensitivity of prediction by using the ''best probe set'' in the second training set (GSE20271) was very low, reaching 40.91%. Therefore, we suggest that the predictive performance of the 14-gene ERBB2 signature is better than that of the single ''best probe set''.

ERBB2 status
We tested the predictive performance of the 14-gene signature in 2 validation sets ( Table 4). The first validation set (GSE20194) is composed of 278 breast tumor profiles; the prediction accuracy was 94.60%, sensitivity 76.27%, specificity 99.54%, PPV 97.83% and NPV 93.97% (Table 4; p,2.2?10 216 , Fisher's exact test). For the second validation set (GSE16446; 93 breast tumor profiles), the prediction accuracy was 93.55%, sensitivity 83.07%, specificity 98.39%, PPV 96.30% and NPV 92.42% (Table 4; p,2.2?10 216 , Fisher's exact test). Importantly, the second validation set was obtained from transcript profiles performed on a different type of GeneChip -HG-U133 Plus 2.0. We performed this last validation on data collected from HG-U133 Plus 2.0 GeneChips to determine whether the candidate 14-gene ERBB2 signature was capable of separating ERBB2-positive tumors from their ERBB2negative counterparts independent of the nature of the Affymetrix arrays to which the transcripts were hybridized. Figure 4 and Table S4 depict sensitivity and specificity levels obtained for the training and the validation sets using the 14-gene ERBB2 signature or using the method employing a single probe set (216836_s_at). The specificity levels obtained by using one probe set were relatively high, ranging between 94.94% (GSE2603) and 99.54% (GSE20194); however, the sensitivity levels were significantly lower, ranging between 54.55% (GSE20271) and 77.78% (GSE2603). Whereas the specificity levels were approximately within the same range using the 14gene ERBB2 signature, the sensitivity levels changed to range between 59.09% (GSE20271) and 77.78% (GSE2603). Impor-   Each row in the correlation coefficient column represents a probe set. Genes, within the borders of the ERBB2 amplicon are marked in bold. The list of genes is divided into genes with positive and negative (-) correlation coefficients. For detailed information on the probe sets see Table S2.  Table 5). The PR gene, PGR, and 3 other genes (GATA3, STC2 and GLI3) [47,48,49] are increased in their expression, whereas the expression of 6 genes (AURKA, BUB1, CDC20, MKI67, HJURP, and CENPA) [50,51,52,53,54] is     Tables 1 and 5). Because our signature comprised 51 genes and one probe set for an unannotated gene we refer to this signature as a ''51-gene PR signature''. The candidate 51-gene PR signature contained 2 genes (HPN and MAPT) whose expression was reported to correlate positively with ER expression [37,56]; however, these genes did not appear in the ER-predictive signature. The expression of 41 genes out of the 51 annotated genes constituting the PR-predictive signature has not been previously associated with PR status. The candidate 51-gene PR signature separated PR-positive tumors from PR-negative tumors with an accuracy of 81.27%, sensitivity of 83.68%, specificity of 73.77%, PPV of 90.86% and NPV of 59.21% in the 251 training samples (Table 6; p = 2.3?10 216 , Fisher's exact test). To determine whether the predictive performance of a single probe set is sufficient to determine PR status we used ''219197_s_at'', the probe set with the highest Spearman rank correlation in the 51-gene PR signature (Spearman rank correlation is 0.44; see Table S3), which we termed ''best probe set'' for PR predictive signature. The prediction accuracy of the ''best probe set'' was 80.48%, sensitivity 91.05%, specificity 47.54%, PPV 84.39% and NPV 63.04% (Table 6; p,3.3?10 210 , Fisher's exact test). Although the sensitivity of the prediction by using the ''best probe set'' was higher than the sensitivity of the prediction by using the 51-gene PR signature, the specificity was very low, reaching only 47.54%.
Also the prediction accuracy and PPV were lower when using only the ''best probe set''. These findings indicated that the predictive performance of the single ''best probe set'' is not as high as the performance of the signature.
We tested the predictive performance of the 51-gene PR signature in 3 validation datasets ( Table 6). The prediction accuracy was 78.47%, sensitivity 76.92%, specificity 79.75%, PPV 75.76% and NPV 80.77% in 144 samples of the first validation set (GSE20271; Table 6; p = 6.1?10 212 , Fisher's exact test). The prediction accuracy was 74.1%, sensitivity 81.82%, specificity 68.15%, PPV 66.44% and NPV 82.94% in 278 profiles of the second validation set (GSE20194; Table 6; p,2.2?10 216 , Fisher's exact test); however, in the third validation set (HG-U133 Plus 2.0 GeneChip array) the prediction accuracy was 62.03%, sensitivity 62.5%, specificity 60.0%, PPV 86.96%, and NPV 27.27% in 79 samples (GSE9195; Table 6; p = 0.1484, Fisher's exact test). Figure 6 and Table S4 depict sensitivity and specificity levels obtained for the training and the validation sets by using the candidate 51-gene PR gene signature or by using a single probe set (208305_at) to assess PR status in breast tumor specimens. The estimation was performed in the same way as was reported previously to establish PR status based on gene expression profiles. The specificity levels obtained by using a single probe set were relatively high, ranging between 77.05% (GSE3494) and 98.73% (GSE20271); however, the sensitivity levels were lower, ranging between 32.31% (GSE20271) and 65.79% (GSE3494). Whereas using the 51-gene PR signature the specificity levels did not change significantly (p = 0.134, t-test) compared to those using the single probe set, the sensitivity levels were significantly improved (p,0.05, t-test), to range between 76.92% (GSE20271) and 83.68% (GSE3494). The sensitivity (62.5%) obtained with HG-133 Plus 2.0 GeneChip (GSE9195; third validation set) lies within the 95% confidence interval for sensitivity obtained for HG-U133A GeneChip (CI95%: 61.44 -87.19%). However, the specificity (60.0%) obtained with HG-133 Plus 2.0 GeneChip was lower than the lower limit of the 95% confidence interval for specificity established with the HG-U133A GeneChip (95%CI: 72.08-88.98%). This indicates, that whereas the candidate PR Each row in the correlation coefficient column represents a probe set. Genes, whose levels of expression are reported to correlate with PR are marked in bold. Genes that occur in the signature predictive of ER status are marked in italics. Those genes whose levels of expression have been reported in literature to correlate with ER status are marked by an asterisk. The list of genes is divided into those with positive and negative correlation (-) coefficients. For detailed information on the probe sets see Table S3. doi:10.1371/journal.pone.0026023.t005

Discussion
Global gene expression profiling is widely used in cancer research and the results of these analyses are generally accessible to the scientific community in public repositories. However, these profiles rarely have accessory information concerning the clinically established status of PR, ER or that of ERBB2. Knowledge of the expression of the aforementioned markers could be used to mine publically available gene expression profiles for candidate molecular targets thus aiding efforts to expand the armamentarium of anticancer therapies targeted to these breast tumor subtypes.
Previous studies have demonstrated a correlation between mRNA levels and clinical receptor status as established by IHC, FISH and ligand-binding assays using breast tumor samples [57,58,59]. Means have also been established for statistical thresholds for ESR1, PR and ERBB2 transcript levels to assign their expression status in profiled breast tumor samples [14,31,32]. These methods use a single probe set to predict ER, PR or ERBB2 status of breast tumor samples. Whereas the latter assays provide good sensitivity for determining ER status and good specificity for those of PR and ERBB2, improvements of these parameters would be desirable to more accurately predict the status of the expression of these genes in breast tumor gene expression profiles.
Our study sought to establish a more accurate specificity for predicting ER status and increased sensitivity for predicting those of PR and ERBB2 while maintaining or improving the sensitivity to predict ER status and to similarly maintain or improve the specificity to predict PR and ERBB2 status. Predictive signatures were developed based on data collected from HG-U133A GeneChips. However, additional GeneChip arrays, HG-U133 Plus 2.0, have been developed (http://media.affymetrix.com/ support/technical/datasheets/hgu133arrays_datasheet.pdf), and  are increasingly used for global gene expression profiling. Therefore, another goal of our study was to examine the predictive capacity of our signatures using transcript profiles performed on both HG-U133A and HG-133 Plus 2.0 GeneChips to learn whether our predictive signatures perform independently of the nature of the GeneChips used to identify them.

Gene signature predictive of ER status
The gene signature reported here comprises 24 annotated genes. One of these genes is ESR1 (estrogen receptor alpha) whereas 11 others have been reported to correlate with the expression of ESR1 or to be directly regulated by ER [36,37,38,39,40,41,42,43,44]. Several of the identified genes are represented by a number of probe sets in the gene signature indicating that these genes have a stable correlation with ER status. Interestingly, one additional gene was found to be reported to positively correlate with PR status [45]. This finding is supported by reports that ER and PR status often correspond with each other [60]. However, this gene was not identified in our PR-predictive gene signature. A plausible explanation for the latter is that we used a high correlation coefficient cutoff to identify the genes belonging to the ER-predictive signature, and hence this gene might have been eliminated during the gene selection process.
Because previously reported methods used a single probe set to determine the hormone and ERBB2 status of tumors, we wished to learn whether a single probe set from the 24-gene ER signature performed as well as the whole signature. To this end we selected the probe set with the highest Spearman rank correlation to the ER status of the sample as the ''best probe set''. The best probe set thus identified is identical to that identified in previous studies to determine ER status [14,31]. The levels of sensitivity and specificity of ER status prediction by using the ''best probe set'' were lower than the sensitivity of the prediction by using the 24gene ER signature, indicating that the signature outperformed the ''best probe set''.
Previous methods [14,31] yielded high sensitivity, but a relatively low specificity for predicting ER status ( Figure 2). Therefore, we wondered whether we could improve the specificity of ER status prediction by identifying a gene signature to predict ER status. Indeed, our ER-predictive gene signature provides a significantly higher specificity, while maintaining the level of sensitivity. The ER-predictive gene signature we identified was derived by analyzing gene expression data from breast tumor RNA samples profiled on the HG-U133A GeneChip arrays. However, we were unable to find an HG-U133 Plus 2.0 dataset with accompanying clinical information concerning ER status. Future studies will examine the predictive potential of the ER gene signature on HG-U133 Plus 2.0 arrays.

Gene signature predictive of PR status
The signature predictive of PR status consists of 51 annotated genes, which include the PGR (progesterone receptor), and 9 genes (AURKA, BUB1, GATA3, GLI3, STC2, CDC20, CENPA, HJURP and MKI67) that have previously been demonstrated to correlate with PGR expression (Table 5; [47,48,49,50,51,52,61]). Interestingly, 11 genes (STC2, GATA3, CA12, FAM176B, GAMT, GFRA1, IL6ST, KDM4B, SCUBE2, LIV1, and an 'unknown' gene; Tables 1 and 5) out of the 51 genes constituting the PR-predictive signature also appear in our 24-gene ER-predictive signature. These findings are in agreement with other studies reporting that ER and PR status often correlate with each other [60]. Notably, the probe set for the only gene lacking annotation appears in both signatures predictive of PR and ER status indicating a strong connection of the gene reflected by this probe set to ER and PR status. The PR-status predictive signature comprised 2 other genes (HPN and MAPT) whose expression is positively correlated with ER expression [37,56]. However, these genes were not identified in our ER-predictive gene signature, probably due to the fact that they had a lower correlation coefficient with ER status than the cutoff established to identify the ER-predictive signature.
The ''best probe set'' selected from the PR predictive signature was ''219197_s_at'' (SCUBE2). Expression of this gene has not been reported to correlate with PR status of human, however, this gene appears also in our 24-gene ER-predictive signature, and, as has been mentioned earlier, there are studies showing that ER and PR status often show correlation with each other. Specificity of prediction using the ''best probe set'' was very low, reaching only 47.54% and prediction accuracy and PPV of the were lower than the ones obtained with the 51-gene PR-predictive signature. Therefore, we concluded, that the PR-predictive signature outperformed the single ''best probe set''.
Previous method [32] yielded high specificity, but a relatively low sensitivity for predicting PR status ( Figure 6). Therefore, we wondered whether we could improve the sensitivity of PR status prediction by identifying a gene signature to predict PR status. By using our gene signature predictive of PR status, we significantly improved the level of sensitivity, while not reducing the level of specificity, as compared to the same measures obtained with 1 probe set ( Figure 6).
When tested on data obtained from HG-U133 Plus 2.0 GeneChip arrays, the results differed from the ones obtained from datasets profiled on HG-U133A arrays ( Figure 6 and Table 6), indicating, that our candidate PR gene signature needs to be modified to predict PR status of tumor samples profiled on other array types. A plausible explanation for the lower level of performance of the predictive signature on data obtained from HG-U133 Plus 2.0 arrays could be the technical differences in the design of the arrays belonging to HG-U133A and HG-U133 Plus 2.0 types: HG-U133 Plus 2.0 arrays belong to a newer generation of GeneChip arrays, which contain improvements, that result in higher resolution, sharpness, definition and signal uniformity (http: //media.affymetrix.com/support/technical/technotes/expression_ comparison_technote.pdf). Such technical differences could affect information obtained for the probe sets that were included in our PR signature, among other probe sets.

Gene signature predictive of ERBB2 status
The ERBB2 predictive gene signature consists of 14 annotated genes, including ERBB2 and 5 genes (CRK7, GRB7, PERLD1, PPARBP, and STARD3) located within the ERBB2 17q12-q21 amplicon [46]. Several of these genes are represented by multiple probe sets in the ERBB2-predictive gene signature indicating their stable correlation with ERBB2 status.
The ''best probe set'' selected from the ERBB2 predictive signature was ''203497_at'', representing PPARBP, a gene, located within the ERBB2 17q12-q21 amplicon [46]. The performance of this ''best probe set'' was tested on two training sets (GSE2603 and GSE20271), that were used to derive the 14-gene ERBB2 signature. The first training set (GSE2603) could not provide us with a clear cutoff for the Spearman rank correlation used to determine the optimal number of genes for the signature (Figure 3). Therefore we needed to test the second training set (GSE20271) as well. The sensitivity of prediction by using the ''best probe set'' for the second training set (GSE20271) was very low, reaching 40.91%. Therefore, we concluded, that the ERBB2-predictive signature outperformed the single ''best probe set''.
A previously described method [14] yielded high specificity levels for predicting ERBB2 status from gene expression profiles using a single probe set (216836_s_at); however, the sensitivity of this method was relatively low. By contrast the specificity levels of our 14-gene signature was unchanged from that reported previously but the sensitivity levels were improved. Additionally, the ERBB2-predictive gene signature also successfully predicted ERBB2 status of gene expression profiles obtained by employing the HG-U133 Plus 2.0 GeneChip (Figure 4 and Table 4).
In summary our findings demonstrate that small gene signatures can be identified in patient breast tumor gene expression profiles that accurately predict ER, PR and ERBB2 status.
Clinical definition of hormonal receptors status and ERBB2 status Table 7 shows the sources of the samples and the methods used to obtain the clinical status of the ER and PR and that of ERBB2.

Filtering repeated samples across datasets
Samples for 2 datasets (GSE20271 and GSE20194) were contributed by the University of Texas M. D. Anderson Cancer Center (MDACC, Houston, TX, USA), and as a result, there were 34 samples that were present in both datasets. For all analyses performed in the present study, these repeated samples were removed from GSE20271, reducing the number of usable samples from 178 to 144.

Single-probe set estimations
Comparing predictive capacity of our signatures to predictive capacity of single probe sets reported to be used in the literature. For all datasets obtained from HG-U133A GeneChips, the one probe set estimation was performed by using ''205225_at'' for determining ER status [14,31] , ''216836_s_at'' for determining ERBB2 status [14,31], and ''208305_at'' for determining PR status [32]. Hormone and ERBB2 status was determined by fitting Gaussian distributions into the distribution of expression values of the examined probe set using Expectation-Maximalization (EM) algorithm [63], similar to the method described by Rody et al [31] and by Lehmann et al [32]. For GSE16446 dataset, which was obtained from HG-U133 Plus 2.0 GeneChips, we used the data on bimodal ERBB2 status supplied with the samples.
Comparing predictive capacity of our signatures to predictive capacity of single probe sets with the highest Spearman rank correlation to the hormone and ERBB2 status (''best probe set''). These comparisons were performed for the training sets used to establish the predictive signatures. The probe set with the highest Spearman rank correlation with ER status was ''205225_at'' (Spearman rank correlation = 0.50), the same probe set as the one used in literature [14,31]. We used ''203497_at'' (Spearman rank correlation = 0.45) for determining ERBB2 status and ''219197_s_at'' (Spearman rank correlation = 0.44) for determining PR status. Hormone and ERBB2 status was determined in the same way as in the previous single-probe set estimation, by fitting Gaussian distributions using Expectation-Maximalization algorithm.
Finding gene signatures predictive of ER, PR or ERBB2 clinical status of the tumor samples Figure 7 describes the algorithm used to find gene signatures predictive of ER, PR or ERBB2 clinical status of the samples. First, global gene expression profiles for the whole training dataset were examined and for each probe set Spearman rank correlation coefficient between its expression levels and clinical status of interest was calculated. The probe sets were sorted by the correlation coefficient, and several groups of genes were selected based on varying correlation cutoff. This way a group of genes selected by using a lower cutoff would contain all the genes belonging to a group selected by using a higher cutoff and additional genes that were filtered out by the higher cutoff. Each group of genes was used for k-means clustering of the samples, in order to define samples with positive and negative status. Then specificity and sensitivity were calculated. Group of genes that led to the highest combination of specificity and sensitivity was defined as a gene signature with optimal predictive ability for the clinical status of interest for the samples in the training set. The same group of genes was used on validation sets, and specificity, sensitivity, accuracy, PPV and NPV were calculated. To derive gene signature predictive of ERBB2 status 2 training sets were needed, since the first set provided constant specificity and sensitivity values for multiple correlation cutoffs.