An international reproducibility study validating quantitative determination of ERBB2, ESR1, PGR, and MKI67 mRNA in breast cancer using MammaTyper®

Accurate determination of the predictive markers human epidermal growth factor receptor 2 (HER2/ERBB2), estrogen receptor (ER/ESR1), progesterone receptor (PgR/PGR), and marker of proliferation Ki67 (MKI67) is indispensable for therapeutic decision making in early breast cancer. In this multicenter prospective study, we addressed the issue of inter- and intrasite reproducibility using the recently developed reverse transcription-quantitative real-time polymerase chain reaction-based MammaTyper® test. Ten international pathology institutions participated in this study and determined messenger RNA expression levels of ERBB2, ESR1, PGR, and MKI67 in both centrally and locally extracted RNA from formalin-fixed, paraffin-embedded breast cancer specimens with the MammaTyper® test. Samples were measured repeatedly on different days within the local laboratories, and reproducibility was assessed by means of variance component analysis, Fleiss’ kappa statistics, and interclass correlation coefficients (ICCs). Total variations in measurements of centrally and locally prepared RNA extracts were comparable; therefore, statistical analyses were performed on the complete dataset. Intersite reproducibility showed total SDs between 0.21 and 0.44 for the quantitative single-marker assessments, resulting in ICC values of 0.980–0.998, demonstrating excellent agreement of quantitative measurements. Also, the reproducibility of binary single-marker results (positive/negative), as well as the molecular subtype agreement, was almost perfect with kappa values ranging from 0.90 to 1.00. On the basis of these data, the MammaTyper® has the potential to substantially improve the current standards of breast cancer diagnostics by providing a highly precise and reproducible quantitative assessment of the established breast cancer biomarkers and molecular subtypes in a decentralized workup.


Background
In contemporary clinical management of patients with breast cancer, prognostications and therapeutic decisions are based on the assessment of clinicopathological factors as well as on the expression status of biomarkers with established clinical validity (i.e., human epidermal growth factor receptor 2 [HER2]; estrogen receptor [ER]; progesterone receptor [PgR]; and Ki67, a marker of cell proliferation) [1,2]. Currently, the most commonly applied method for the determination of these four markers is immunohistochemistry (IHC), which allows for the semiquantitative assessment of the protein expression levels on histological slides [3,4]. For HER2, an additional analysis of the amplification status of the corresponding gene ERBB2 by fluorescence in situ hybridization (FISH), chromogenic in situ hybridization (CISH), or silver in situ hybridization (SISH) can also be applied in selected cases. The quality of the determination of these markers in terms of accuracy and reproducibility is essential for effective therapeutic interventions. However, the inter- and intraobserver variability of IHC is of concern [3][4][5][6][7][8][9]. For HER2, ER, and PgR, several studies have reported discrepancies of up to 20% [5][6][7], but most prominent and challenging is the inconsistency regarding Ki67 [8,9]. Ki67 is a marker of the proliferative activity of the tumor cells and thereby carries valuable prognostic information [10][11][12]. In addition, Ki67 may have a direct impact on therapeutic decisions by assisting in the distinction between luminal A and luminal B breast cancer and therefore may aid in the selection of cytotoxic chemotherapy in addition to endocrine treatment [2,13]. The variability in Ki67 is due mainly to the subjectivity of the visual estimation method and the choice of areas of evaluation on the histological slides and, to a lesser extent, the technical variations in the IHC staining process [9,14].
Efforts to standardize Ki67 scoring resulted in considerable improvements, but interobserver agreement is still unsatisfactory [15,16]. In addition, implementation of these methodological advances in clinical routine laboratories is challenging, and clinical validity of the new methods remains to be shown. For these reasons, the Ki67 determined by IHC is not currently included in the American Society of Clinical Oncology/College of American Pathologists guidelines for routine clinical use [1,17]. There remains an urgent need for alternative, more robust, standardized, and precise assays with proven analytical and clinical validity for Ki67, HER2, ER, and PgR in routine breast cancer diagnostics [17,18].
The MammaTyper® (BioNTech Diagnostics, Mainz, Germany) is a novel CE-marked in vitro diagnostic test that quantifies the messenger RNA (mRNA) expression of the four key marker genes ERBB2, ESR1, PGR, and MKI67 on the basis of reverse transcription-quantitative real-time polymerase chain reaction (RT-qPCR), which differs from the currently applied standard of protein-based semiquantitative assessment by IHC. The main goal in using this technology is to provide a precise and reproducible assessment of the four biomarkers. Similarly to IHC, the MammaTyper® test can be integrated into the local laboratory setup because it supports analysis on widely accessible qPCR platforms using total RNA extracted from clinical routine formalin-fixed, paraffin-embedded (FFPE) breast cancer samples from resections or core needle biopsies.
In this study, we assessed the precision of the MammaTyper® test with a focus on reproducibility [19]. We adopted a multicenter design to fully evaluate the inter- and intrasite components of precision as well as other sources of imprecision, including preanalytical factors. Ten international pathology institutions, all with expert-level background in the field of breast cancer diagnostics, participated in the study. Each site carried out the same technical procedures according to a predefined study plan based on the EP05-A3 guideline for precision evaluation of quantitative measurement methods issued by the Clinical & Laboratory Standards Institute [20]. To our knowledge, a similar study has not been conducted to date.

Study objectives
The precision (reproducibility) of the MammaTyper® test was evaluated on multiple levels according to the following parameter definitions:
1. Intermediate precision, here also referred to as interrun precision, as the variability of quantitative results across repeated measurements over several days by the same operator, in the same laboratory, and using the same instrument; this parameter also included repeatability, the variance component due to simple replicates (intrarun)
2. Intersite reproducibility, as the most comprehensive demonstration of precision, including the variability introduced by different laboratories, operators, and instruments
3. Preanalytical and lot-to-lot variability
4. Agreement of binary single-marker results and subtypes
5. Interclass correlation coefficient (ICC) as the agreement of quantitative results

Study design
A prospective, two-stage study was designed with the participation of ten international pathology institutions (see authors' affiliations 1-9 and 14). Prior to the study start, one operator per site was trained on the correct use of the preanalytical RNA extraction kit RNXtract® (BioNTech Diagnostics) and the MammaTyper® test within a 2-day standard training phase carried out by the manufacturer. This training also included qualification of the local qPCR instrument for use with the MammaTyper®, which in this study was the LightCycler® 480 instrument II (Roche Molecular Diagnostics, Pleasanton, CA, USA). The training was followed by a familiarization period consisting of at least four MammaTyper® runs on 3-4 days using BioNTech Diagnostics' reference material, carried out by the operator without supervision. During the study, each site performed repeated MammaTyper® measurements on different days according to a predefined study plan using RNA extracts from clinical FFPE breast cancer tissues. The same MammaTyper® lot was used at all sites, and only one site repeated study arm 1 using a second lot of MammaTyper®. The study comprised 8 days in total (consecutive or nonconsecutive days), as illustrated in the study design (Fig. 1).

Study arm 1
RNA was extracted at a central laboratory (BioNTech Diagnostics), and eight different RNA pools, each containing RNA from a single tumor sample, were provided as single-use aliquots to the study sites (samples 1-8). Samples were measured repeatedly on 4 different days using MammaTyper®.

Study arm 2
Ten-micrometer sections of 16 FFPE tissue samples from different breast tumors (samples 9-24) were provided to each study site for local RNA extraction using RNXtract®. After extraction, each RNA eluate was split into three single-use aliquots for repeated MammaTyper® measurements on 3 different days.

Samples
The samples used in the study were prepared from clinical FFPE breast cancer tissue blocks by BioNTech Diagnostics and were distributed to study sites as RNA aliquots (samples 1-8) or 10-μm FFPE whole-tissue sections (samples 9-24). The 24 FFPE tissue samples were selected from a series of clinical routine breast cancer cases (n = 43) kindly provided by PSi. A summary of the clinicopathological characteristics of these patient samples is given in Additional file 1: Table S1 (see also Additional file 2: Figure S1).

RNA extraction
Total RNA was purified from 10-μm FFPE tissue sections using the paramagnetic particle-based RNXtract® RNA Extraction Kit (reference 90040; BioNTech Diagnostics GmbH) according to the manufacturer's instructions. The RNXtract® kit has been validated as a preanalytical RNA extraction method for the MammaTyper® by the manufacturer.

MammaTyper® test
The MammaTyper® (reference 90020; BioNTech Diagnostics GmbH) is a molecular in vitro diagnostic RT-qPCR test for the quantitative detection of the mRNA expression status of the genes ERBB2, ESR1, PGR, and MKI67 in human FFPE breast cancer tissue from resection or core needle biopsies with at least 20% tumor cell content using whole-tissue sections without macrodissection. Primary analysis outputs are the normalized, quantitative single-marker results, expressed as 40 −ΔΔCq (Cq, quantification cycle) values on a continuous scale [21]. The test also provides the status of each marker as a binary category (positive or negative) based on clinically validated marker- and device-specific cutoff values.

Statistical analysis
The results were analyzed according to a predefined statistical analysis plan using SAS version 9.4 software (SAS Institute, Cary, NC, USA). The number of measurements (sample size) of the study was determined using simulations to achieve predefined levels of uncertainty using results of a previous method validation [21]. On each study day, exported raw Cq values were directly transferred by the operator to the statistician. To reflect a realistic estimate of the test precision, statistical outliers were not excluded from the analyses. The precision of the quantitative single-marker assessments (40 −ΔΔCq values) was estimated by a random effects model II analysis of variance (ANOVA) with site as a random factor [20]. Because the variability does not depend on 40 −ΔΔCq values, the sample was also included as a (trivial) random factor in the model, which allows averaging of the variance components over the samples:

1. The intermediate precision referring to interrun/day SD is obtained as the residual SD in the ANOVA.
2. The reproducibility was calculated as the intersite SD summarizing the condition of different sites, operators, and instruments. The total SD is also presented, calculated as the square root of the sum of residual and intersite variance components. Because the total SD is the precision as experienced in clinical practice, we decided to report this parameter as the main result as a conservative approach.
3. The variance introduced by a different MammaTyper® lot obtained in a separate experiment was given as the interlot SD.
4. Agreement of the categorical marker results and the breast cancer biological subtypes across all sites was evaluated using Fleiss' kappa statistics [22]. According to the method of Landis and Koch [23], the strength of the agreement was defined as follows: kappa < 0.00 = poor, 0.00-0.20 = slight, 0.21-0.40 = fair, 0.41-0.60 = moderate, 0.61-0.80 = substantial, and 0.81-1.00 = almost perfect.
5. The ICC was estimated for the continuous scaled quantitative marker results and was used to evaluate the reproducibility and intermediate precision in relationship with the intersample variance using the approach proposed by Eliasziw et al. [24]. Thus, and different from the kappa statistic, the ICC determines the agreement of measured quantitative values over the whole measurement range, independent of any cutoff point [25]. The agreement is generally interpreted as follows: ICC < 0.40 = poor, 0.40-0.74 = fair to good, and 0.75-1.00 = excellent [26].
More stringent thresholds were recommended by Kirkegaard et al. [25] for IHC assessments, with an ICC level of 0.7 regarded as the minimum acceptable standard, 0.8 as good, and ≥0.9 as excellent. The latter thresholds were applied in this study.
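The variance decomposition described above can be illustrated with a minimal sketch. It assumes a balanced one-way random-effects layout for a single sample (site as the only random factor), which is a simplification of the study's model, where the sample was an additional random factor and the analysis was run in SAS. The function name `variance_components` and the example values are hypothetical and not taken from the study data.

```python
import math
from statistics import mean


def variance_components(by_site):
    """Balanced one-way random-effects ANOVA with site as the random factor.

    by_site maps a site label to the repeated measurements (e.g. 40 - ddCq
    values) of ONE sample at that site. Simplifying assumption: the study
    additionally averaged the components over samples.
    """
    groups = list(by_site.values())
    k = len(groups)          # number of sites
    n = len(groups[0])       # replicates per site
    if k < 2 or n < 2 or any(len(g) != n for g in groups):
        raise ValueError("need a balanced design with >= 2 sites and >= 2 runs")

    grand = mean(v for g in groups for v in g)
    site_means = [mean(g) for g in groups]

    # Mean squares between and within sites
    ms_between = n * sum((m - grand) ** 2 for m in site_means) / (k - 1)
    ms_within = sum(
        (v - m) ** 2 for g, m in zip(groups, site_means) for v in g
    ) / (k * (n - 1))

    # Method-of-moments estimates; a negative intersite estimate is set to 0
    var_site = max(0.0, (ms_between - ms_within) / n)
    var_resid = ms_within

    return {
        "intersite_sd": math.sqrt(var_site),
        "residual_sd": math.sqrt(var_resid),
        # Total SD = square root of the sum of residual and intersite variance
        "total_sd": math.sqrt(var_site + var_resid),
    }


# Hypothetical toy data: two sites, two runs each
example = variance_components({"site_A": [1.0, 3.0], "site_B": [5.0, 7.0]})
```

Truncating a negative intersite estimate to zero is one common convention; other estimators (for example, restricted maximum likelihood) handle this case differently.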
In a final analysis, kappa and ICC values were simulated in a larger sample cohort using quantitative data from 769 breast cancer cases of the FinHer trial that had been measured previously by MammaTyper® [27]. ICC values were calculated using the intersample variance of the larger cohort along with the intersite and residual variance of the present study. To estimate the kappa values for this cohort, 1000 simulated pairs of datasets were created by adding random noise to the 40 −ΔΔCq values according to the marker-specific total variance observed in this study. For each pair (2 × 769 values), kappa values for binary marker results and subtypes were calculated, resulting in 1000 kappa values, of which the median kappa as well as the 2.5% and 97.5% percentiles are reported.
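The noise-based simulation described in this paragraph might be sketched as follows. Several points are assumptions for illustration only: the noise is drawn from a normal distribution, agreement within each simulated pair is measured with Cohen's kappa on the binary marker result, percentiles use a simple nearest-rank rule, and the cohort values and cutoff are synthetic. None of the names below come from the study's analysis code.

```python
import random
from statistics import median


def cohen_kappa(a, b):
    """Cohen's kappa for two equally long lists of binary calls."""
    n = len(a)
    p_obs = sum(x == y for x, y in zip(a, b)) / n
    p_a, p_b = sum(a) / n, sum(b) / n
    p_exp = p_a * p_b + (1 - p_a) * (1 - p_b)
    if p_exp == 1:  # both lists fall into one class only; treated as agreement
        return 1.0
    return (p_obs - p_exp) / (1 - p_exp)


def simulate_kappa(values, cutoff, total_sd, n_pairs=1000, seed=0):
    """Return (median, 2.5th percentile, 97.5th percentile) of simulated kappas.

    Each iteration perturbs every value twice with independent Gaussian noise
    (SD = total_sd), dichotomizes both copies at the cutoff, and compares them.
    """
    rng = random.Random(seed)
    kappas = []
    for _ in range(n_pairs):
        first = [v + rng.gauss(0.0, total_sd) >= cutoff for v in values]
        second = [v + rng.gauss(0.0, total_sd) >= cutoff for v in values]
        kappas.append(cohen_kappa(first, second))
    kappas.sort()
    lower = kappas[int(0.025 * (n_pairs - 1))]
    upper = kappas[int(0.975 * (n_pairs - 1))]
    return median(kappas), lower, upper


# Synthetic cohort and hypothetical cutoff, not FinHer data
cohort = [30.0 + 0.1 * i for i in range(100)]
no_noise = simulate_kappa(cohort, cutoff=35.0, total_sd=0.0, n_pairs=50)
with_noise = simulate_kappa(cohort, cutoff=35.0, total_sd=0.3, n_pairs=200)
```

With zero noise both copies are identical, so every simulated kappa equals 1; increasing `total_sd` lowers agreement mainly for values close to the cutoff.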

Intermediate precision
On the basis of MammaTyper® measurements of study arm 1 (Fig. 1), quantitative single-marker results were obtained as 40 −ΔΔCq values for ERBB2, ESR1, PGR, and MKI67 and are presented in Fig. 2 as box plots for each marker, sample, and study site. The intermediate precision for each marker at the individual site was computed over all samples, presented as interrun SD (Fig. 2, graphs at the bottom).

Intersite reproducibility
As indicated by the side-by-side box plots in Fig. 2, the 40 −ΔΔCq quantitative single-marker results of each individual sample were highly consistent across all ten study sites. The total SD of the measurements of the eight centrally extracted RNA samples (samples 1-8) was as low as 0.18 Cq for PGR and 0.29 Cq for ERBB2 and MKI67 (Table 1, upper panel). As demonstrated by the variance component analysis, the factor site (intersite SD) had less impact on the total imprecision (total SD) than the interrun/day variability within one laboratory (residual SD) (Table 1, upper panel).

Intersite reproducibility including preanalytical variances
The total variance of marker results (40 −ΔΔCq values) in the self-extracted samples (study arm 2, samples 9-24) was almost identical to the variance seen for the RNA pool aliquots (samples 1-8) (Table 1, middle panel). There was no additional variance or bias introduced by RNA extraction at local sites. Therefore, the intersite reproducibility was again computed on the whole sample set (samples 1-24), leading to a similar approximation of the total variability of single-marker 40 −ΔΔCq assessments with SDs between 0.21 and 0.44 Cq (Table 1, lower panel). Performing the analysis of the eight RNA pool samples with a different MammaTyper® lot resulted in comparable quantitative values (Additional file 3: Figure S2). The interlot SD was almost completely covered by the existing interrun/day variability (residual SD), and its impact on the total variance was negligible (Additional file 4: Table S2). Individual laboratory deviations for all samples are also shown with Bland-Altman plots (Fig. 3). The average deviation at the respective site was in all cases close to zero, with values ranging from −0.13 to 0.16 Cq for ERBB2, −0.11 to 0.20 Cq for ESR1, −0.15 to 0.19 Cq for PGR, and −0.22 to 0.31 Cq for MKI67.
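The per-site average deviation reported above can be illustrated with a small Bland-Altman style sketch. It assumes that each site's result is compared with the mean of all sites for the same sample and that limits of agreement are taken as bias ± 1.96 SD; the text does not state the exact reference used, so both choices, as well as the function name and the numbers, are hypothetical.

```python
from statistics import mean, stdev


def site_bias(results):
    """Average deviation of each site from the all-site mean per sample.

    results maps site -> {sample: value}; every site must report the same
    samples (at least two). Returns site -> (bias, lower_loa, upper_loa).
    """
    samples = sorted(next(iter(results.values())))
    # Reference value per sample: mean over all sites (assumption)
    consensus = {s: mean(r[s] for r in results.values()) for s in samples}

    summary = {}
    for site, site_results in results.items():
        diffs = [site_results[s] - consensus[s] for s in samples]
        bias = mean(diffs)
        spread = stdev(diffs)
        summary[site] = (bias, bias - 1.96 * spread, bias + 1.96 * spread)
    return summary


# Hypothetical 40 - ddCq values for two samples at two sites
deviations = site_bias({
    "site_A": {"s1": 35.0, "s2": 38.0},
    "site_B": {"s1": 35.2, "s2": 38.4},
})
```

Because the reference is the all-site mean, the site biases sum to zero by construction; a bias near zero for every site is what the study reports.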

Binary single-marker and subtype agreement
The binary single-marker results (positive/negative) for all measurements at the ten sites are displayed as counts for each sample in Table 2, revealing a very high concordance. The 24 samples showed 100% concordance for ERBB2; for PGR and ESR1, an equivocal assignment occurred in only one and two samples, respectively. These cases exhibited a marker expression level near the cutoff, as indicated by the distance to cutoff value (Table 2). This also explained the divergent measurements seen for MKI67, which biologically exhibits more samples near the cutoff because of its continuous distribution [28]. Nevertheless, MKI67 showed a high agreement because for most discrepant cases only 1 of 30 determinations was classified differently (Table 2). Calculating the overall agreement of the categorical marker assessments resulted in kappa values of 1.00, 0.91, 0.94, and 0.94 for ERBB2, ESR1, PGR, and MKI67, respectively. Corresponding subtype assessments resulted in an almost perfect agreement with a kappa value of 0.90 (Table 3). Discrepancies were observed, for example, between the luminal A-like and luminal B-like (HER2-negative) subtypes, where discrimination using St. Gallen guidelines relies on MKI67 marker expression [2], which for the discrepant cases was very close to the cutoff (as described above).
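For readers unfamiliar with the statistic, the following sketch computes Fleiss' kappa from per-sample category counts of the kind shown in Table 2 and maps the result to the Landis and Koch categories quoted in the statistical analysis section. The count matrices are hypothetical toy data, not the study counts, and the function names are illustrative.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a table of category counts.

    counts[i][j] = number of determinations that placed sample i in
    category j; every sample needs the same total number of determinations.
    """
    n_subjects = len(counts)
    n_ratings = sum(counts[0])
    n_categories = len(counts[0])
    if n_ratings < 2 or any(sum(row) != n_ratings for row in counts):
        raise ValueError("each sample needs the same number (>= 2) of ratings")

    # Overall proportion of assignments to each category
    p_cat = [
        sum(row[j] for row in counts) / (n_subjects * n_ratings)
        for j in range(n_categories)
    ]
    # Observed agreement per sample
    p_obs = [
        (sum(c * c for c in row) - n_ratings) / (n_ratings * (n_ratings - 1))
        for row in counts
    ]
    p_bar = sum(p_obs) / n_subjects
    p_exp = sum(p * p for p in p_cat)
    if p_exp == 1.0:
        raise ValueError("kappa is undefined when only one category is used")
    return (p_bar - p_exp) / (1 - p_exp)


def landis_koch(kappa):
    """Strength of agreement according to Landis and Koch."""
    if kappa < 0.0:
        return "poor"
    for upper, label in [(0.20, "slight"), (0.40, "fair"),
                         (0.60, "moderate"), (0.80, "substantial")]:
        if kappa <= upper:
            return label
    return "almost perfect"


# Toy data: 3 samples, 3 determinations each, columns = (positive, negative)
toy = fleiss_kappa([[3, 0], [2, 1], [0, 3]])
perfect = fleiss_kappa([[30, 0], [0, 30]])
```

In the toy table the single split sample pulls kappa down to 0.55 ("moderate"), which shows how strongly a few near-cutoff samples can affect the statistic when the number of samples is small.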

Intra- and interclass correlation
ICC estimates of all markers were between 0.976 and 0.996 for the intralaboratory assessment (ICC_intra), and between 0.980 and 0.998 for the intersite reproducibility (ICC_inter) (Table 4, upper panel), reflecting excellent agreement of the quantitative data. To exclude any effect of the sample selection on ICC results, the ICCs were again computed using the intersample variance observed in the 769 breast cancer cases of the FinHer trial [27].
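The dependence of the ICC on the intersample variance can be made concrete with a generic sketch. It uses the common ratio form, intersample variance divided by the sum of intersample and error variance; which error components (residual, intersite, or both) enter ICC_intra and ICC_inter follows Eliasziw et al. [24] and is not reproduced here, so the caller-supplied `var_error` is an assumption. The interpretation labels follow the Kirkegaard thresholds applied in this study; the variance values are hypothetical.

```python
def icc(var_sample, var_error):
    """Generic ICC: share of the total variance attributable to the samples.

    var_error is whatever error variance applies to the setting of interest;
    choosing its components is left to the caller (assumption of this sketch).
    """
    total = var_sample + var_error
    if var_sample < 0 or var_error < 0 or total == 0:
        raise ValueError("variances must be non-negative and not both zero")
    return var_sample / total


def interpret_icc(value):
    """Thresholds recommended by Kirkegaard et al. for IHC assessments."""
    if value >= 0.9:
        return "excellent"
    if value >= 0.8:
        return "good"
    if value >= 0.7:
        return "minimum acceptable"
    return "below minimum acceptable standard"


# Hypothetical variances: the same error variance with a narrow and a wide
# spread of sample values
narrow_cohort = icc(var_sample=1.0, var_error=1.0)
wide_cohort = icc(var_sample=9.0, var_error=1.0)
```

The comparison of `narrow_cohort` and `wide_cohort` shows why recomputing the ICC with the intersample variance of a larger cohort is a useful check: for a fixed measurement error, the ICC rises with the spread of the samples.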

Discussion
This study addressed the question whether the recently developed molecular in vitro diagnostic MammaTyper® test could improve the reproducibility of the assessment of the four key routine breast cancer biomarkers ERBB2 (HER2), ESR1 (ER), PGR (PgR), and MKI67 (Ki67). The routine diagnostic assessment of these markers, as well as the corresponding subtyping, is currently performed by semiquantitative IHC and FISH, CISH, and SISH assays [1][2][3][4]. IHC assays suffer from considerable inter- and intralaboratory variability, which particularly applies to the assessment of the valuable biomarker Ki67 [8,9,15,16]. Therefore, it is of importance that new technologies carrying the potential for more accurate, reliable, and precise analysis of Ki67 expression are brought under consideration to overcome the persisting inconsistencies [17,18].
In this multicenter study, we demonstrated that MammaTyper® shows excellent inter- and intralaboratory precision, both for the continuous quantitative single-marker measurements (40 −ΔΔCq values) and for the categorical positive/negative status and the breast cancer molecular subtype classification. These data therefore confirm the high analytical performance of the MammaTyper® that was previously reported in the original technical validation of the test [21] but was shown in this study in a more comprehensive and challenging methodological setting. In our study, ten different laboratories were able to generate consistent and highly concordant test results after an initial training and a relatively short familiarization period. Overall, the test results were found to be independent of the preanalytical process and not influenced by the MammaTyper® lot. The source of imprecision of the quantitative measurements was related mainly to the general run-to-run variability rather than to the variance introduced by different laboratories. These observations were in line with the original technical validation report [21].
The ICC values for all markers were above 0.976, which signifies excellent agreement of the quantitative data and suggests improved inter- and intrasite reproducibility achieved with qPCR compared with what has been documented previously for IHC [9,16]. In studies on IHC reproducibility, the intersite agreement for Ki67 IHC displays an ICC of 0.59, which is below the minimum acceptable standard proposed by Kirkegaard et al. [9,25]. Even when standardized Ki67 scoring methods on centrally stained histological slides were tested, the interobserver reliability on resection specimens reached ICCs of only 0.40 to 0.74, with kappa values ranging from 0.29 to 0.58 [16]. Only training and precise calibration resulted in better Ki67 assessment on centrally stained tissue microarray slides (ICC 0.94) or centrally stained core-cut biopsies using a standardized scoring method (ICC 0.87), but this process is difficult to implement in routine clinical practice, and clinically important discrepancies persisted in the critical range of 10% to 20% Ki67-positive nuclei staining [15,29]. The challenges in the standardization of Ki67 assessment include the variability in the selection of the tumor areas to be assessed, the technique used for nuclei counting, and the dilemma of the numerical cutoff for positivity, especially for large tissue sections [8,9,15,16]. The highly promising reproducibility of the MammaTyper® was confirmed by a simulated analysis using MammaTyper® data obtained from 769 samples from the FinHer trial cohort [27], verifying that the study samples were representative of the whole spectrum of routine clinical samples.
The high values of the various reproducibility metrics in the present study are a result of both the underlying high degree of standardization of the MammaTyper® test, which minimizes the main sources of variability, and the adoption of a fully objective assessment method (i.e., qPCR). Thus, the MammaTyper® assay has the potential to overcome the substantial and varying rates of inter- and intraobserver variability that may occur with IHC. This applies especially in samples where high-quality IHC is not readily available for diagnostic purposes. Analytical validity, such as reproducibility, is a prerequisite for accurate diagnostics, and its formal evaluation is required along with a test's clinical performance to allow conclusions on its potential use in clinical practice [17,18]. In a previous clinical performance evaluation study, good concordance was shown between MammaTyper® single-marker assessments and IHC (or IHC/CISH for HER2), using 769 archived breast cancer cases available from the FinHer trial [27]. Only for MKI67 mRNA expression was the correlation moderate, most likely because of the analytical restriction of Ki67 IHC. The multivariable analysis revealed that MKI67 expression assessed by MammaTyper®, but not Ki67 IHC, was an independent predictor of distant disease-free survival (DDFS), indicating the superiority of MammaTyper® compared with IHC with respect to MKI67/Ki67 determination [27]. Furthermore, this clinical performance evaluation study demonstrated that the mRNA-based subtyping by MammaTyper® resulted in clinically meaningful prognostic and predictive information with regard to DDFS, overall survival, and response to taxane-based chemotherapy in the luminal B-like (HER2-negative) subtype [27]. These published data provide evidence for the clinical validity of the test, and additional clinical performance evaluation studies would help further strengthen its clinical value for routine applications.
As is true of all molecular assays that use tissue homogenates, one may also consider low tumor cell content or high lymphocytic infiltration as potential sources of error. For MammaTyper®, a minimum tumor cell content of 20% was required to generate stable test results when compared with the paired macrodissected sample [21]. Adjacent nontumor tissue had no major influence on test results, likely because of the reduced metabolic activity and low RNA content in the surrounding tissue compared with the invasive tumor [30,31]. As reported previously, nontumor components may have a stronger impact on multigene tests that analyze a recurrence score based on genes with partially notable expression in normal tissue [32,33]. Nevertheless, further validations of MammaTyper® on samples with problematic characteristics (i.e., varying amounts of ductal carcinoma in situ and lymphocytic infiltrates) should be envisaged.
The reliable quantification of single-marker expression by MammaTyper® has the potential to become part of a predictive marker panel in breast cancer diagnostics with further refined implications for clinical management. The expression of the single markers ERBB2, ESR1, PGR, and MKI67 is obtained on a continuous scale covering a much broader dynamic range (up to 5 orders of magnitude) than can be achieved by IHC (up to 2 orders of magnitude; 0-100% positive cells or H-score 0-300) [21,34]. An ongoing challenge in clinical management is the decision whether to treat patients with luminal breast cancer with systemic chemotherapy when other clinicopathological factors, such as nodal status, are not decisive [17]. Various multigene assays address exactly this diagnostic dilemma. Oncotype DX® (Genomic Health, Redwood City, CA, USA), MammaPrint® (Agendia, Irvine, CA, USA), EndoPredict® (Myriad Genetics, Salt Lake City, UT, USA), and Prosigna® (NanoString Technologies, Seattle, WA, USA) provide risk scores with prognostic information for distant recurrence in patients with luminal A-like and luminal B-like (HER2-negative) breast cancers. High and low risk scores, even though they are mainly prognostic, are frequently used to decide for or against chemotherapy in the ER/PgR-positive, HER2-negative subgroup [17]. Limitations exist, however, because the appropriate course of action remains unclear for patients with intermediate risk cancers or for node-positive patients. Prospective studies on multigene tests in breast cancer showing predictive values are limited [35][36][37]. The randomized phase III Microarray in Node-Negative and 1 to 3 Positive Lymph Node Disease May Avoid Chemotherapy (MINDACT) trial recently demonstrated that chemotherapy can be omitted in around 46% of clinically high-risk cases with low MammaPrint® scores [37].
Similarly, the recurrence score measured with Oncotype DX® in the prospective randomized phase III Plan B study demonstrated excellent 3-year survival by omitting chemotherapy in clinically high-risk but recurrence score low-risk cases [35]. Further prospective studies on Oncotype DX® testing revealed excellent 5-year survival (98%) in low-risk score cases treated with hormone therapy alone [36,38]. However, these tests come with high costs. In addition, some of these tests, such as Oncotype DX® and MammaPrint®, require sending the samples to a central laboratory. In this respect, it is interesting that a similar score, namely the immunohistochemical 4 (IHC4) score, with comparable prognostic value was generated by using just the four IHC markers HER2, ER, PgR, and Ki67 [39,40]. However, insufficient standardization and considerable interlaboratory variability of IHC suggest that the IHC4 algorithm cannot easily be transferred to other laboratories, although it was successfully validated in an independent study cohort [41]. On the basis of data obtained in this study, MammaTyper® could provide a highly reproducible and reliable assessment of these four markers. It is tempting to suggest that in the future a similar approach might be applicable for the MammaTyper® to generate additional prognostic information to guide personalized treatment options at much lower cost than multigene expression tests.

Conclusions
The MammaTyper® test has the potential to improve the quality of primary breast cancer molecular diagnostics. The test showed reliable reproducibility in the quantitative assessment of the single markers ERBB2, ESR1, PGR, and MKI67, as well as in the subtype determination, and thereby overcomes the variability known on the basis of diagnostic experience with IHC. The low intersite and intrasite variance of the MammaTyper® test enables pathology institutions to perform this assay in-house and integrate the technology into routine diagnostic services. However, additional clinical performance evaluation studies in larger cohorts are necessary to confirm the clinical utility of the test and to extract further predictive information for more personalized clinical management of patients with breast cancer.

Additional files
Additional file 1: Table S1. Clinicopathological characteristics of the breast cancer patient samples used in this study, including HER2, ER, PgR, and Ki67 marker status. (DOC 61 kb)
Additional file 2: Figure S1