A novel panel of short mononucleotide repeats linked to informative polymorphisms enabling effective high volume low cost discrimination between mismatch repair deficient and proficient tumours

Somatic mutations in mononucleotide repeats are commonly used to assess the mismatch repair status of tumours. Current tests focus on repeats with a length above 15bp, which tend to be somatically more unstable than shorter ones. These longer repeats also have a substantially higher PCR error rate, and tests that use capillary electrophoresis for fragment size analysis often require expert interpretation. In this communication, we present a panel of 17 short repeats (length 7–12bp) for sequence-based microsatellite instability (MSI) testing. Using a simple scoring procedure that incorporates the allelic distribution of the mutant repeats, and analysis of two cohorts of tumours totalling 209 samples, we show that this panel is able to discriminate between MMR proficient and deficient tumours, even when constitutional DNA is not available. In the training cohort, the method achieved 100% concordance with fragment analysis, while in the testing cohort, 4 discordant samples were observed (corresponding to 97% concordance). Of these, 2 showed discrepancies between fragment analysis and immunohistochemistry and one was reclassified after re-testing using fragment analysis. These results indicate that our approach offers the option of a reliable, scalable routine test for MSI.

Introduction
Two decades ago, Ionov et al. [1] reported that length variation in polyA mononucleotide repeats (MNR) was present in up to 12% of colorectal cancers (CRC), and was associated with a distinct pathological and molecular phenotype. Increased instability was observed for other microsatellites and this led to the designation of such tumours as microsatellite unstable. Microsatellite instability is a hallmark of tumours where the function of the mismatch repair (MMR) system is compromised [2]. MMR defects of clinical significance affect the MSH2, MLH1, MSH6 and PMS2 genes [3,4]. Germline defects lead to an inherited autosomal dominant cancer predisposition syndrome, Lynch Syndrome, that is characterised by a high risk for colon and endometrial cancer and an increased susceptibility to a variety of other malignancies including upper GI, ovarian, breast, genitourinary and kidney cancers [5]. The mismatch repair status of a cancer can be of clinical interest. Compared to other colorectal tumours, microsatellite unstable CRCs have been found to have a better prognosis [6]. In 2015, patients with MMR deficient CRCs were shown to benefit from immunotherapy with pembrolizumab [7], and a more recent trial in MMR deficient cancers of 12 different tumour types also showed benefit from immunotherapies [8]. Identification of patients with Lynch Syndrome also allows at-risk relatives to be identified [9]. This enables the implementation of monitoring regimes designed to detect cancer early, and the use of prophylactic measures such as the regular intake of aspirin, which has been shown to reduce cancer rates in patients with Lynch Syndrome by over 50% [10]. The implications for cancer treatment and management have led to recommendations to increase the proportion of CRC and endometrial tumours that are tested for MMR defects [11][12][13][14], with the UK National Institute for Health and Care Excellence (NICE) recommending MMR testing of all CRCs [15].
Testing for microsatellite instability (MSI) is one of the main methods used to identify MMR deficiency. However, somatic microsatellite mutations can also be observed in MMR proficient tumours. Thus, detection of low levels of microsatellite instability may not be indicative of mismatch repair defects [16,17], a view which is also reflected in the NICE guidance, where 3 population based studies indicated improved assay specificity when cases with low levels of MSI (MSI-low) and microsatellite stable (MSS) cases were grouped together compared to MSS cases alone [15]. MSI is commonly tested by amplification of a panel of microsatellites followed by analysis of the amplified fragments by capillary electrophoresis. A variety of panels have been recommended and current tests rely on long MNRs [18]. Long homopolymers tend to be more unstable both in vivo and in vitro, and PCR-induced errors lead to stutter peaks in electropherograms [19]. This can complicate downstream phenotype interpretation, and visual inspection of the fragment size profiles can be required. Samples can be classified according to the frequency of microsatellite mutations. For example, the Revised Bethesda Guidelines for Hereditary Nonpolyposis Colorectal Cancer (Lynch Syndrome) and Microsatellite Instability described a classification using a panel of 5 quasi monomorphic MNRs [20]. Samples showing mutations in two or more MNRs are designated as microsatellite instability high (MSI-H), samples with only one altered MNR as microsatellite instability low (MSI-L), and samples where all microsatellites appear to be stable as microsatellite stable (MSS). MSI-H status is indicative of an MMR defect.
Microsatellite instability testing assesses the function of the MMR system. An alternative is to ascertain the presence of its components by immunohistochemistry (IHC). Lack of protein can result from mutations causing premature truncation of the encoded polypeptides and nonsense-mediated decay, or from the destabilisation of protein complexes leading to accelerated degradation of their components [21]. IHC requires highly skilled personnel. Since IHC assesses the levels of MMR proteins as opposed to a consequence of MMR dysfunction, there is some discordance between the results of microsatellite instability and IHC analyses [21,22]. The reported concordance varies, but a sensitivity of 92% for IHC in predicting MSI has been reported [21].
Several groups have developed sequencing based approaches to identify microsatellite instability. These include methods utilising genome [23] or transcriptome [24] wide data, as well as sequences from target enriched libraries [25,26] and, more recently, melt-curve analysis based testing [27]. In vitro amplification errors, which lead to the presence of variant read lengths in the PCR product, can complicate sequence-based approaches. The frequency of such artefacts will differ between MNRs, but some mutant reads are expected even in the absence of mutations in the starting material. One approach to address the problem of amplification errors is to use a threshold value of the proportion of mutant molecules to discriminate between PCR artefacts and the genuine presence of MNR mutations in the starting material [26].
Short MNRs tend to be less polymorphic than longer ones [28]. Thus, the likelihood of encountering germline variants in short MNRs is reduced, suggesting that they would be suitable for assessing MSI status in tumours without requiring matched germline DNA. The lower mutation rate also means that mutant reads from shorter repeats are more likely to reflect a single mutational event and affect only one allele while recurrent artefacts will affect both alleles. As a result, assessing whether length variants are concentrated in one allele offers an additional criterion to differentiate between PCR artefacts and mutations that occur in vivo.
The aim of this study was to develop a method suitable for high throughput and automated microsatellite analysis that allows separation of samples into two classes: those with clinically significant instability designated MSI-H and those with little or no evidence of instability and designated stable or MSS. The separation of unstable samples into "high" or "low" would thus be made redundant.
This involved selecting a panel of short MNR, and developing a method to score instability based on both MNR specific variant read frequency thresholds and allelic bias. The parameters required for classification were determined in a training cohort of 139 CRC tumours where the MSI status had been previously characterised, and a testing cohort of 70 CRC tumours was used for blinded validation of the method.

Study samples
The present study utilised 3 sample cohorts for discovery, training and testing purposes. The discovery cohort consisted of 132 CRC tumour and tissue samples that were obtained, either as formalin fixed paraffin embedded (FFPE) tissues or as DNA extracted from FFPE CRC tissues, from the Northern Genetics Service, Newcastle Hospitals NHS Foundation Trust between 2014 and 2015. The MSI status of all tumours had previously been established using the MSI Analysis System Version 1.2 (Promega, Southampton, UK) and this, together with relevant clinical and pathological data, was obtained from the UK's National Health Service (NHS) database. These samples were used to identify a potentially informative set of mononucleotide repeat markers which could be taken forward to training and validation cohorts.
The training cohort consisted of 139 samples which were obtained as extracted DNA from FFPE treated CRC tumours from the Genetics Service of the Complejo Hospitalario de Navarra and the Oncogenetics and Hereditary Cancer Group, IDISNA (Biomedical Research Institute of Navarra, Spain). These samples were used to identify classification parameters for the training algorithm. They had previously been tested for MSI using the MSI Analysis System, Version 1.2 (Promega, Southampton, UK) at IDISNA, and the MSI status calls were available for each sample. IHC expression analysis was performed at IDISNA on all samples using the following antibodies and ratios: MLH1 at 1:10, MSH6 at 1:120 and PMS2 at 1:100 (BD biomedical Tech, New Jersey, USA), and MSH2 at 1:100 (Oncogene Ltd, Middlesex, UK). However, protein expression data were available for only 124 of the 139 samples.
The testing cohort consisted of 70 anonymised DNAs from FFPE treated CRC tumour samples that were obtained from the Department of Molecular Pathology, University of Edinburgh. MSI status had been tested at the University of Edinburgh using the MSI Analysis System, Version 1.2 (Promega, Southampton, UK). The study team was kept blinded to the MSI status information on these samples until the end of the study.
The study was conducted following ethical approval from the Medical Research and Ethics Committee (CEIC Navarra Government) and Newcastle Hospitals NHS Foundation Trust (REC reference 13/LO/1514).
[32] and duplicates were removed using PICARD (version 1.75) [33]. GATK (version 2.2.9) [34] was used to produce a combined BAM file for all samples, and to realign around indels. The GATK UnifiedGenotyper was used to produce a raw variant call file which was annotated using the TandemRepeatAnnotator for indel identification in mononucleotide repeats. Mononucleotide repeats of lengths 7bp-12bp were selected, and repeats encompassing common sequence variants (dbSNP version 137, hg19) [35] were removed. SNPs listed in dbSNP within 30bp of the repeats were annotated using Perl scripts. Because of the low pass nature of the sequence data, all reads from MSI tumours were combined in one group, while reads from MSS and MSI-L tumours and from normal samples were combined in a second group as controls.

MNR amplification
Primers were designed using Primer3 [36] or manually if Primer3 returned no suitable oligonucleotides. Primers designed manually had a Tm of 57-60˚C. All primers were checked for common SNPs using SNP Check (https://ngrl.manchester.ac.uk/SNPCheckV2/snpcheck.htm), off target binding using BLAST (http://blast.ncbi.nlm.nih.gov/Blast.cgi) or BLAT [37], and appropriate melting temperatures and absence of secondary structures using OligoCalc (http://www.basic.northwestern.edu/biotools/oligocalc.html) or Primer3. Primers were manufactured either by Metabion (Metabion International AG, Steinkirchen, Germany) or by Biobasic (Bio Basic Inc., Markham, Canada). Primers for all MNRs were initially designed to create amplicons of approximately 300-350bp (see S2 Table). For the final MNR panel, a second set of primers was designed to generate 100-150bp amplicons with 5' adapters (see S3 Table). Amplicons were generated using the high fidelity Pfu-based Herculase II Fusion DNA polymerase (Agilent, Santa Clara, CA, USA) at 35 PCR cycles, following the master mix and PCR cycling conditions recommended by the manufacturer.

Targeted sequencing
Amplicons were quantified using a Qiagen QIAxcel (Qiagen, Manchester, UK), then pooled at roughly equimolar concentrations. Agencourt AMPure XP beads (Beckman-Coulter Life Sciences, Indianapolis, USA) were used for PCR clean up before library preparation. For the 300-350bp amplicons, barcoding and library preparation were performed using the Nextera XT DNA Library Prep kit (Illumina, San Diego, CA, United States of America), after pooling of the amplification products for each sample, while for the 100-150bp amplicons the 16S metagenomic sample preparation protocol was followed (http://support.illumina.com/documents/documentation/chemistry_documentation/16s/16s-metagenomic-library-prep-guide-15044223-b.pdf). Sequencing was performed on the Illumina MiSeq platform to a target depth of at least 10,000 reads per amplicon per sample. Fastq files for all the samples are available from the European Nucleotide Archive (Study accession number: PRJEB27681; URL to the study: http://www.ebi.ac.uk/ena/data/view/PRJEB27681).

Variant and MNR calling
Sequences were aligned using BWA (version 0.6.2) and the hg19 assembly as the reference genome. Samtools was used to sort and index the BAM files, and realignment was done using GATK (3.1.1). Alignment files were converted to SAM format and processed using custom R scripts. Only features observed on both reads of a pair, i.e. concordant in both orientations, were used in subsequent calculations, and only amplicons where the MNR was covered by at least 20 read pairs were analysed. Flanking SNPs were considered to be heterozygous if the least common allele, i.e. the allele supported by the smallest number of reads, was present in at least 20% of all the read pairs covering the SNP position.
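The two filters described above (minimum coverage of the MNR and the heterozygosity call at a flanking SNP) can be sketched as follows. This is a minimal Python illustration, not the authors' R code: the function names are hypothetical, and the sketch assumes a biallelic SNP where only the two allele counts contribute to the total.

```python
MIN_READ_PAIRS = 20        # minimum read pairs covering the MNR (from the text)
MIN_MINOR_FRACTION = 0.20  # minor-allele fraction needed to call a heterozygote (from the text)


def mnr_is_analysable(n_read_pairs: int) -> bool:
    """True if the MNR is covered by enough read pairs to be analysed."""
    return n_read_pairs >= MIN_READ_PAIRS


def snp_is_heterozygous(count_a: int, count_b: int) -> bool:
    """Call a flanking SNP heterozygous from the read-pair counts of its two alleles.

    Assumption: the SNP is biallelic, so the total is count_a + count_b.
    """
    total = count_a + count_b
    if total == 0:
        return False
    return min(count_a, count_b) / total >= MIN_MINOR_FRACTION
```

For instance, a SNP with 80 and 20 supporting read pairs would be called heterozygous under this rule, while one with 90 and 10 would not.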

Construction of MNR specific ROC curves
For each marker, the proportion of reads representing MNR deletion alleles in MSI-H and MSS samples was analysed separately. A threshold approach to instability classification was used: Samples with a proportion of variant reads above the threshold were classified as MSI-H and below as MSS. This enabled the relative frequency of true positives (i.e. known MSI-H samples with a value above the threshold), and of false positives (i.e. known MSS samples with a value above the threshold) to be determined. For each MNR, these two values were then plotted against each other for thresholds between 0 and 1. The resulting curve represents the receiver operating characteristic (ROC) curve and the area under the curve (AUC) was used as a quantitative measure of the ability of the MNR to discriminate between MSI-H and MSS samples.
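The ROC construction described above can be illustrated with a short, dependency-free Python sketch. The function names and the example frequencies are hypothetical; the trapezoidal rule used for the area is an assumption, since the text does not state how the AUC was integrated.

```python
def roc_points(msi_h_freqs, mss_freqs):
    """Return sorted (false positive rate, true positive rate) pairs.

    A sample is called MSI-H when its deletion-read frequency is above the
    threshold; thresholds sweep every observed value between 0 and 1.
    """
    thresholds = sorted(set(msi_h_freqs) | set(mss_freqs) | {0.0, 1.0})
    points = {(0.0, 0.0), (1.0, 1.0)}  # end points of any ROC curve
    for t in thresholds:
        tpr = sum(f > t for f in msi_h_freqs) / len(msi_h_freqs)
        fpr = sum(f > t for f in mss_freqs) / len(mss_freqs)
        points.add((fpr, tpr))
    return sorted(points)


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Hypothetical deletion-read frequencies for one MNR.
msi_h = [0.30, 0.25, 0.10]
mss = [0.02, 0.01, 0.12]
print(round(auc(roc_points(msi_h, mss)), 3))
```

An AUC of 1 means that some threshold separates the two groups perfectly, whereas 0.5 corresponds to no discrimination.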

MNR based classification using deletion frequency and allelic bias
The classifier was designed to include information both on changes in MNR length, and on the distribution of the variant reads across both alleles. Since discrimination between alleles is only possible for samples heterozygous for a flanking SNP, not all samples can be assessed for biased distribution of variant reads across both alleles. However, lack of data should not favour either classification. A naïve Bayes approach for the classification procedure was used [38]. The underlying idea is to compare the probabilities of belonging to one of two classes, i.e. MSI-H or MSS, given the observations at each of the MNR markers used.
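If we consider a set of MNRs and, for a particular sample, we represent the observed frequency of reads showing deletion for each of them with $O$, the probability that the sample is microsatellite unstable with $p(\mathrm{MSI} \mid O)$, and the probability that the sample is microsatellite stable with $p(\mathrm{MSS} \mid O)$, then the ratio of the two probabilities follows from Bayes' theorem:

$$\frac{p(\mathrm{MSI} \mid O)}{p(\mathrm{MSS} \mid O)} = \frac{p(O \mid \mathrm{MSI})\, p(\mathrm{MSI})}{p(O \mid \mathrm{MSS})\, p(\mathrm{MSS})}$$

Under the naïve Bayes assumption, the observations $O_i$ at the individual MNRs are treated as independent given the class:

$$p(O \mid \mathrm{MSI}) = \prod_i p(O_i \mid \mathrm{MSI}), \qquad p(O \mid \mathrm{MSS}) = \prod_i p(O_i \mid \mathrm{MSS})$$

For each MNR $i$, the observation $O_i$ is summarised by two binary variables: $D_i = 1$ if the number of reads representing a deletion is above a pre-specified threshold and 0 otherwise, and $B_i = 1$ if significant bias was observed and 0 otherwise. Therefore:

$$p(O_i \mid \mathrm{MSI}) = p(D_i \mid \mathrm{MSI})\, p(B_i \mid D_i, \mathrm{MSI}), \qquad p(O_i \mid \mathrm{MSS}) = p(D_i \mid \mathrm{MSS})\, p(B_i \mid D_i, \mathrm{MSS})$$

To estimate $p(B_i \mid D_i, \mathrm{MSI})$ and $p(B_i \mid D_i, \mathrm{MSS})$, samples heterozygous at a flanking SNP marker, and for which the frequency of reads with deletions exceeded the MNR specific thresholds, were used. Bias was considered to be present when the association between the presence of a deletion and the genotype at the flanking SNP was significant at a p-value of 0.05 using Fisher's exact test. If there were multiple heterozygous SNPs neighbouring a repeat then the SNP with the lowest p-value was used. When the deletion frequency was below the threshold, $p(B_i \mid D_i, \mathrm{MSI})$ and $p(B_i \mid D_i, \mathrm{MSS})$ were set to 1. This is equivalent to assuming that in such cases there is insufficient evidence for an MNR mutation and therefore bias is not meaningful.
The results are presented as a score

$$S = \log_{10} \frac{p(\mathrm{MSI} \mid O)}{p(\mathrm{MSS} \mid O)}$$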
Here we used a set of samples to determine, for each MNR, the parameters used in the classification. These parameters were then used to calculate the score for each tumour in a second, independent set of samples. Samples with a score below 0 were classified as MSS and those above as MSI-H. Scripts for the MSI classification tool are available in S1 File.
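To make the scoring procedure concrete, the following Python sketch computes a log10 ratio of the two class probabilities under a naive Bayes factorisation, with one deletion term and one optional bias term per MNR. It is an illustration only and not the code in S1 File: the class and function names, the equal prior (`prior_msi=0.5`), the clamping constant `eps` and all parameter values are assumptions introduced here.

```python
import math
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass
class MnrParams:
    threshold: float   # deletion-frequency threshold for this MNR
    p_del_msi: float   # P(D=1 | MSI-H)
    p_del_mss: float   # P(D=1 | MSS)
    p_bias_msi: float  # P(B=1 | D=1, MSI-H)
    p_bias_mss: float  # P(B=1 | D=1, MSS)


def _bernoulli(p: float, outcome: bool, eps: float = 1e-3) -> float:
    """Probability of a binary outcome; p is clamped to avoid log(0) (assumption)."""
    p = min(max(p, eps), 1 - eps)
    return p if outcome else 1 - p


def msi_score(observations: Sequence[Tuple[float, Optional[bool]]],
              params: Sequence[MnrParams],
              prior_msi: float = 0.5) -> float:
    """Return S = log10(p(MSI|O) / p(MSS|O)).

    Each observation is (deletion_frequency, bias), where bias is None when
    the sample is not heterozygous at a flanking SNP.
    """
    score = math.log10(prior_msi / (1 - prior_msi))
    for (del_freq, bias), par in zip(observations, params):
        d = del_freq > par.threshold
        score += math.log10(_bernoulli(par.p_del_msi, d) / _bernoulli(par.p_del_mss, d))
        # The bias term contributes only when the deletion threshold is exceeded
        # and bias could be assessed; otherwise both factors are 1.
        if d and bias is not None:
            score += math.log10(_bernoulli(par.p_bias_msi, bias) / _bernoulli(par.p_bias_mss, bias))
    return score


# Hypothetical parameters for a two-marker panel.
panel = [MnrParams(0.05, 0.8, 0.1, 0.7, 0.2)] * 2
print(msi_score([(0.20, True), (0.15, None)], panel) > 0)   # classified MSI-H
print(msi_score([(0.01, None), (0.00, None)], panel) > 0)   # classified MSS
```

Skipping the bias term when no heterozygous flanking SNP is available reflects the requirement that missing data should not favour either class.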

Whole exome sequencing
Whole exome enrichment and library preparation was carried out on tumour DNA from sample 91 using Illumina TruSeq DNA Exome capture kit (Illumina, San Diego, CA, United States of America), according to the manufacturer's protocol. The sequencing library was pooled at an equimolar concentration and sequenced on an S2 flowcell on the NovaSeq 6000 sequencing platform, according to the manufacturer's protocol (Illumina, San Diego, CA, United States of America), with an average raw read depth of 150x. Sequences were aligned using BWA (version 0.7.17) and the hg19 assembly as the reference genome. Samtools was used to sort and index the BAM files, and realignment and variant calling were performed using GATK (version 4.0). Variant call files were annotated using variant effect predictor [40]. Annotated single nucleotide variants and indels were filtered manually for the 4 MMR genes and were assessed for their pathogenicity using CADD_Phred [41] and FATHMM [42] prediction scores and OMIM annotation. Fastq files are available from the European Nucleotide Archive (Sample accession number: ERS2623574).

Identification of an MNR panel
A total of 218,181 variable 7-12bp MNRs were identified from the TCGA CRC genome sequence data. From these, we excluded MNRs with a read depth less than 20 in either the MSI-H or the MSS group, and MNRs that did not have a SNP (dbSNP137) with a minor allele frequency larger than 20% within 30bp of the repeat. MNRs with multiple lower frequency SNPs in the flanking regions were not excluded if the probability of observing a minor allele in at least one SNP was above 20%, assuming linkage equilibrium.
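One way to read the linkage equilibrium criterion is that, if the flanking SNPs are independent, the probability that a chromosome carries a minor allele at one or more of them is one minus the product of the major allele frequencies. The sketch below follows that interpretation; the function name and the example frequencies are hypothetical, and the per-chromosome reading is an assumption.

```python
from math import prod


def prob_any_minor_allele(minor_allele_freqs):
    """P(at least one minor allele) across independent SNPs: 1 - prod(1 - q_j)."""
    return 1.0 - prod(1.0 - q for q in minor_allele_freqs)


# Two SNPs, each below the 20% cut-off on its own, pass when combined.
print(round(prob_any_minor_allele([0.12, 0.10]), 3))  # 1 - 0.88 * 0.90
```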
For MNRs with a length of 7-9bp, only those that had no length variation in the control group but where at least 10% of reads in the MSI-H group showed length variants were selected. For MNRs with a length of 10-12bp, only those where the frequency of reads showing length variation was at most 5% among controls and at least 15% among MSI-H samples were selected. In total, 529 poly A-MNRs fulfilled these criteria. No poly C-MNR fulfilled these criteria. To ensure that some polyC MNRs were included in subsequent analyses, the minimum depth and flanking SNP requirements were dropped, leading to the selection of 33 polyC MNRs. From these 562 markers, MNRs within repetitive elements and regions of low complexity (likely to be refractory to amplicon design) were also excluded, producing a final list of 120 MNRs (S2 Table).
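The length-dependent selection rule above can be written as a small predicate. This is a sketch in Python with a hypothetical function name; the fractions are the pooled proportions of reads showing length variation in the control and MSI-H groups.

```python
def passes_length_variation_filter(repeat_length: int,
                                   control_variant_fraction: float,
                                   msi_h_variant_fraction: float) -> bool:
    """Apply the read-level selection criteria for 7-12bp MNRs."""
    if 7 <= repeat_length <= 9:
        # No length variation among controls, at least 10% among MSI-H reads.
        return control_variant_fraction == 0 and msi_h_variant_fraction >= 0.10
    if 10 <= repeat_length <= 12:
        # At most 5% among controls, at least 15% among MSI-H reads.
        return control_variant_fraction <= 0.05 and msi_h_variant_fraction >= 0.15
    return False  # repeats outside 7-12bp were not considered
```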
To eliminate potentially uninformative repeats, amplicons were designed for all 120 MNRs. These were initially tested in FFPE samples from the discovery cohort: 6 tumours from patients with Lynch syndrome, 5 normal mucosa samples and 6 samples from sporadic MSS tumours. Amplicons were pooled, indexed, and sequenced to a target depth of 10,000 reads per amplicon per sample. Only results for amplicons represented by at least 100 paired end reads were analysed and representative results are shown in Fig 1. Fig 1 shows the relative frequencies of reads for two MNRs in an MMR proficient (MSS) and an MMR deficient (MSI-H) sample. A small fraction of insertion reads (+1 value in the abscissa) is observed in both MSI-H and MSS samples, but the frequency of deletions (-1, -2 and -3 values) differs between the two. However, for the longer repeat shown, reads representing deletions of more than one base pair are also observed in the MSS sample, while a second peak can be observed corresponding to a 2 bp deletion in the MSI-H sample. In all analyses, the sum of the frequencies of reads representing all deletions was used.
To illustrate levels of allelic variation observed, results from a single MNR marker (LR46) are shown in Fig 2. The read distribution for each allele is plotted separately for an MSI-H and an MSS sample that are heterozygous for the flanking SNP. While the distributions for both the G and A alleles in the MSS sample are similar, reads representing a one base pair deletion are predominantly found in the G allele of the MSI-H sample.
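The Methods assess allelic bias with Fisher's exact test on the association between deletion reads and the allele at the flanking SNP. A dependency-free Python sketch of a two-sided test on a 2x2 table is shown below. The read counts are hypothetical, and the two-sided convention used (summing all tables no more probable than the observed one) is an assumption, since the text does not state which was applied to the per-marker bias test.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact p-value for the table [[a, b], [c, d]].

    Rows: SNP allele (e.g. G, A); columns: deletion reads, other reads.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x: int) -> float:
        # Hypergeometric probability of x deletion reads on the first allele.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum over all tables that are no more likely than the observed one.
    total = sum(p for p in (prob(x) for x in range(lo, hi + 1))
                if p <= p_obs * (1 + 1e-7))
    return min(1.0, total)


# Hypothetical counts: deletions concentrated on the G allele.
p_value = fisher_exact_two_sided(40, 60, 2, 98)
print(p_value < 0.05)  # bias would be called significant at the 5% level
```

Exact binomial coefficients keep the sketch simple but become slow at depths of thousands of reads, where a library routine would be preferable.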
From this initial assessment, MNRs were retained for further analysis only if they exhibited a deletion frequency >5% in one or more MSI-H samples, and these frequencies were also >1.5 fold higher than the frequencies observed in all normal mucosa samples. Forty-one MNRs satisfied these criteria. Two previously described MNRs adjacent to SNPs (one in DEPDC2 [43] and one in the intergenic repeat AP0035322 [44]) were also added to the analysis at this stage. These 43 MNRs were each typed in a minimum of 28 MSI-H and 30 MSS tumours in the discovery cohort, and ROC curves were generated to assess the ability of each to discriminate between MSI-H and MSS samples. This was performed by estimating the area under the curve (AUC) using the frequency of reads representing MNR deletion as the classification criterion, and classifying samples with a frequency above each threshold as MSI-H and below each threshold as MSS (S2 Table).
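The retention rule stated at the start of this step can be sketched as below. The function name is hypothetical, and the sketch takes one reading of the criterion: at least one MSI-H sample must have a deletion frequency above 5% that is also more than 1.5 times the highest frequency seen in any normal mucosa sample.

```python
def retain_mnr(msi_h_del_freqs, normal_del_freqs,
               min_freq: float = 0.05, fold: float = 1.5) -> bool:
    """Decide whether an MNR is carried forward after the pilot sequencing run."""
    highest_normal = max(normal_del_freqs, default=0.0)
    return any(f > min_freq and f > fold * highest_normal
               for f in msi_h_del_freqs)
```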
Representative examples of this analysis are shown in Fig 3, which shows the ROC curves for the two poly-A MNRs, LR46 (8bp) and LR44 (12bp), used in Fig 1. The AUC for LR46 was 0.83 (95% confidence interval 0.71-0.84) and 0.99 (0.98-0.99) for LR44. Using the AUC and amplicon length as criteria, 15 poly-A MNRs were selected and, together with the two poly-C MNRs with the largest AUC, formed our final panel (S3 Table). As described in the Methods section, the primers for this panel were redesigned to produce shorter amplicons (S3 Table).

Tumour classification using the selected panel of short MNRs
To establish the parameters required by the classification procedure, the seventeen MNRs included in the final panel were typed in the training cohort consisting of 139 samples, of  Fig  4A, the right hand panel those that are below. Overall, 21 MSI-H and 4 MSS samples had values above the threshold, i.e. had a bias significant at the 5% level. This corresponds to our expectation that allelic bias will be more common among MSI-H samples.
It is noteworthy that only 2 of the 4 MSS samples above the frequency threshold in Fig 4A were heterozygous, and neither showed significant bias. In contrast, 27 out of the 32 MSI-H samples which were heterozygous showed a bias above the threshold (Fig 4B). This difference is significant (two-sided Fisher's exact p = 0.03), while the corresponding test for samples that do not reach the frequency threshold does not suggest any difference between MSS and MSI-H samples (two-sided Fisher's exact p = 0.39). This is consistent with our assumption that allelic bias can help to discriminate between MSI-H and MSS samples. For allelic bias and deletion frequencies, thresholds and relative numbers of samples above the respective threshold were determined for each of the 17 MNRs.
The parameters determined in the training cohort were then used to test the procedure in an independent testing cohort consisting of 70 CRC samples, 36 of which had previously been classified as MSI-H and 34 as MSS. Fig 5 presents the contribution made to tumour classification by MNR length variation (Fig 5A) and MNR allelic bias (Fig 5B). This illustrates that while both contribute to the separation of the groups, changes in MNR length provide the main contribution. The final combined classification (Fig 5C) is concordant with fragment analysis, achieving 100% sensitivity and specificity (95% confidence intervals 87-100% and 90-100%, respectively) when fragment analysis is used as the reference technique.
Finally, we used the data from the testing cohort to estimate the parameters and classify the samples in the training cohort. The results are represented in Fig 6. Four samples gave discordant results relative to fragment analysis (samples 63, 72, 91 and 135). Immunohistochemistry for sample 63 was checked and found to be consistent with reported MSS status. However, DNA from sample 72 was reanalysed by fragment analysis and MSI-H phenotype was detected, while IHC analysis of samples 91 and 135 revealed no alteration in expression for MSH2, MLH1, MSH6 and PMS2 genes. Since tumour DNA was available for sample 91, we carried out whole exome sequencing to screen for potential pathogenic mutations in the 4 MMR genes. The analysis revealed no pathogenic mutations, suggesting that the tumour was indeed MMR proficient, in agreement with the IHC results. This raises the possibility that IHC and fragment analysis are inconsistent for these 3 samples, with evidence available for one of the samples from direct sequencing of the MMR genes. Overall, there was a 92% concordance between fragment analysis and IHC, as assessed by staining for MSH2, MLH1, MSH6 and PMS2 proteins. For this analysis, the concordance between our results and fragment analysis is 97% and the estimates for sensitivity and specificity are both 97% (95% CI: 89-99% and 90-99%, respectively) when results from fragment analysis are used as reference. Interestingly, reclassification using the training cohort for both parameter estimation and for testing the classification resulted in misclassification of the same four samples. Combining both sets of results led to a sensitivity of 98% (95% CI: 92-99%) and specificity of 98% (95% CI: 93-99%).

In-silico assessment of limit of detection
A tumour sample can contain both MSI-H and MSS components. To assess the performance of the assay for mixtures, we investigated all combinations of one MSS and one MSI-H sample from each test set. Reads from each of the two samples were mixed at predetermined proportions. Each mixture was treated as if it represented the data from a new sample and classified. There were 1224 (34x36) pairs and, for each, 41 mixtures were generated with proportions from 0 to 1 in steps of 0.025. Fig 7 shows the fraction of mixtures classified as MSI-H for each mixing proportion. The results indicate that there is a 72% chance of classifying a mixture containing 5% reads from an MSI-H tumour as MSI-H. It should be noted, however, that the starting material is not homogeneous since the original samples themselves may contain contributions from MSI-H and MSS clones, as well as some normal tissue.
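The mixing experiment can be sketched in Python as follows. This is a simplified illustration rather than the procedure actually used: instead of pooling reads, it computes the expected per-MNR deletion frequency of each mixture, and the `classify` argument is a hypothetical stand-in for the scoring procedure.

```python
from itertools import product

# 41 mixing proportions from 0 to 1 in steps of 0.025.
PROPORTIONS = [i / 40 for i in range(41)]


def mix_profile(msi_profile, mss_profile, fraction_msi):
    """Expected per-MNR deletion frequencies when a fraction of reads is MSI-H."""
    return [fraction_msi * m + (1 - fraction_msi) * s
            for m, s in zip(msi_profile, mss_profile)]


def fraction_called_msi_h(msi_profiles, mss_profiles, fraction_msi, classify):
    """Share of all MSI-H x MSS pairs whose mixture is classified as MSI-H."""
    calls = [classify(mix_profile(m, s, fraction_msi))
             for m, s in product(msi_profiles, mss_profiles)]
    return sum(calls) / len(calls)


# Toy classifier (assumption): MSI-H if any MNR exceeds a 5% deletion frequency.
def toy_classifier(profile):
    return any(f > 0.05 for f in profile)


curve = [fraction_called_msi_h([[0.40, 0.30]], [[0.00, 0.01]], p, toy_classifier)
         for p in PROPORTIONS]
```

With 36 MSI-H and 34 MSS profiles, `product` would enumerate the 1224 pairs mentioned above for each of the 41 proportions.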

Discussion
The method presented here allows sequence-based discrimination between MSI-H and MSS tumours using a limited number of loci, without the requirement for paired germline DNA as a reference. A multi-step process was used to select a panel of MNRs involving analysis of genomic sequence data to identify the most promising markers, and two rounds of amplicon assessment. These novel MNRs are present in the intergenic regions and to our knowledge are devoid of any functional property. Although this approach does not ensure that the optimal set of MNRs was identified, the performance of the panel is comparable to that of fragment analysis.
We chose relatively short MNRs for our test to diminish the probability of PCR artefacts and to reduce the likelihood of encountering germline variation affecting MNR length, a potential confounding factor in cases where no normal material is available. However, somatic instability is also lower, meaning that genuine mutations will tend to affect only one allele. Therefore, even allowing for PCR errors, mutant reads should concentrate on one allele. We showed that this can be assessed using flanking heterozygous SNPs and can be used to improve classification. It is worth noting that even in situations where mutations have occurred in both alleles, each allele is likely to be affected in a different proportion of cells in a sample since, during clonal evolution, there will be a time interval between the occurrence of the two mutations, and this time interval is expected to be larger for shorter microsatellites.
To our knowledge, this is the first method for assessing MSI that uses allelic information. Although we only use allelic data to assess bias in the distribution of mutant reads, it can also help to distinguish between somatic and germline variation, in particular in situations where no normal material is available, but the tumour is expected to contain normal tissue contamination. In silico analysis of the limit of detection (LOD) indicated that there is a 72% chance of detecting MSI signal when the MSI-H DNA content is only 5%. This is an improvement over fragment length analysis based assays, which have a reported LOD of approximately 10% [45]. MMR deficiency is observed in over 75% of colorectal adenomas in Lynch syndrome patients; however, detection of MMR deficiency in adenomas is reported to be higher by IHC than by fragment length analysis based assays [46]. This contrasts with the higher sensitivity of fragment length analysis based assays compared to IHC in MMR deficient CRCs [46]. This discordance suggests that whilst the loss of protein expression is synchronous with the onset of MMR deficiency, MSI increases with the progression of MMR deficient cells, thus being less easy to detect in adenomas compared to cancers [46,47]. Application of our assay to detect MMR deficiency in extra-colonic tumours would, however, require further validation studies. Furthermore, MNRs showing germline variants in our assay can be excluded from the analysis, although it would also be possible to treat each allele separately. Allelic analysis is only possible for MNRs heterozygous for flanking SNPs in a particular sample. In principle, it would be feasible to restrict the score calculation to such MNRs. However, such a procedure would disregard information from many of the amplicons used, and require larger marker panels, increasing assay costs.
Here we used thresholds on the frequency of reads representing mutated MNRs because we wanted to dichotomise the data. Other approaches would be possible; however, using a threshold that is above the frequency observed in the majority of the MSS samples is consistent with the approaches followed by other authors who aim to set their thresholds so that variation reflecting PCR artefacts is excluded [26]. The formalism presented here could be used without defining thresholds, but this would require specifying the whole deletion frequency distribution. Similarly, we used a threshold, the p-value of 0.05 in Fisher's exact test, to dichotomise allelic bias. Using the statistical significance of the bias seems natural although the precise choice of the threshold is arbitrary. Since our test aims to detect MSI-H tumours, it seems reasonable to use fragment analysis as the reference technique. However, MSI detection is usually a means for assessing MMR proficiency. It is noteworthy that in 3 out of 4 cases where there were discrepancies between our results and the results from fragment analysis, there were also discrepancies between fragment analysis and IHC results.
Establishing whether tumours have resulted from a breakdown in mismatch repair is important in clinical management of the individual, and can help prevent future cancers in those families where there is a germline molecular defect. Expansion of testing to all colorectal cancers has been shown to be cost effective in the UK [48] and is soon to become standard of care on the basis of NICE guidance in the UK's National Health Service [15]. Similar decisions are being taken in other developed nations. Furthermore, with the evidence from a recent clinical trial suggesting benefit of immunotherapies in MMR deficient solid cancers regardless of the tumour location [8], the application of the current assay to non-CRC MSI-H tumours could be envisaged. Recent studies by Bonneville et al. and Middha et al. have found MSI-H signatures in over 20 tumour types, in between 1.8% and 3.8% of all cancers that were analysed, with the highest rates of MSI-H signatures found in endometrial, small bowel, gastric, colon and rectal cancers [49][50][51]. Application of MSI testing across these tumour types would help in detection of patients who are likely to benefit from immunotherapies, as well as improve the diagnostic rate of Lynch syndrome cancers [52].
In summary, we propose a novel labour and cost-efficient approach to the detection of MSI-H tumours whose main advantage is its simplicity, making it suitable for high throughput analysis without the need for control normal DNA. A scalable, modular, and reliable MSI test will have clinical utility, while modest costs and the ability to link this analysis to routine pathology assessment will help to ensure rapid adoption and facilitate further molecular approaches to tumour profiling and precision medical care.