MSIsensor-pro: fast, accurate and matched-normal-sample-free detection of microsatellite instability

We developed MSIsensor-pro (https://github.com/xjtu-omics/msisensor-pro), an open-source single sample microsatellite instability (MSI) scoring method for research and clinical applications. MSIsensor-pro introduces a multinomial distribution model to quantify polymerase slippages for each tumor sample and a discriminative sites selection method to enable MSI detection without matched normal samples. For samples of various sequencing depths and tumor purities, MSIsensor-pro significantly outperformed the current leading methods in terms of both accuracy and computational cost.

. Recently, several next-generation-sequencing (NGS)-based methods have been developed, which show much better time and cost efficiency and are highly consistent with the gold standards [3][4][5][11][12][13][14][15] . For instance, MSIsensor 11 , an FDA-approved MSI detection solution of MSK-IMPACT 16 , achieved 99.4% concordance and high sensitivity 17 . However, these NGS methods do have limitations, such as requiring matched normal samples as control (sometimes inaccessible), being computational expensive, and being affected by low sequencing depths and low tumor purities 9 .
A hallmark of MSI is the enrichment of insertions or deletions in microsatellite regions initiated by polymerase slippages 2,18 (Supplementary Fig. 1), an enzymatic process that we argue is described using a multinomial distribution (MND) model ( Supplementary Fig. 2), providing promising improvements of MSI detection efficacy on NGS data. Here, we report a novel MSI calling method, MSIsensor-pro, which addresses the limitations of current NGSbased MSI detection tools by applying MND model to capture the intrinsic properties of polymerase slippages in a single sample. We demonstrate that MSIsensor-pro is an ultrafast, accurate and normal sample-free MSI calling method. Moreover, it outperforms all current MSI detection methods and is robust for samples with various sequencing depths, tumor purities and target sequencing regions.
To quantitatively describe the polymerase slippages present in a single sample, we first examined the allele length distributions of 27,200 microsatellites in 1,532 normal samples from the Cancer Genome Atlas (TCGA) 19 (Supplementary Table 1,2, Methods). The distributions flattened (the variances became larger and the modes deviated from expectation) with the increase in the repeat length of microsatellites on the reference genome (Fig. 1a), suggesting that the polymerase slippage was an iterative process. We argue that polymerase slippages are independently accumulative in the DNA replication process and could be modeled by a MND model. Here, we used p and q to denote the probabilities of hysteresis synthesis (causing deletions) and presynthesis (causing insertions), respectively, for each replication unit ( Supplementary Fig. 2). We next estimated the parameters p and q of each microsatellite to quantify the polymerase slippage in a given allele length distribution.
To explore the characteristics of parameter p and q in MND model, we applied MND model to 1,532 TCGA normal samples. We totally obtained 11,666 microsatellites with sufficient read coverage (>20) in more than half of the samples for subsequent study (Supplementary   Table 1-2, Methods). We found that the average probability of hysteresis synthesis, p, is significantly larger (P-value <0.05, Wilcoxon rank-sum test) than the average probability of presynthesis, q (Supplementary Fig. 3) in these sites, indicating that polymerase slippages tend to cause more deletions than insertions at microsatellites, confirming previous reports 4,18 .
To evaluate the power of our MND model on describing the polymerase slippages in DNA replication, we simulated the allele length distributions at each microsatellite site with their corresponding computed p and q values and compared them with the observed ones from sequencing data. We found that the simulated allele length distributions were consistent with the observed ones at 91.97% microsatellites and the similarities of two distributions decreased with increasing repeat length (Fig. 1b [6][7][8][9]. Thus, it is conceivable that the higher incidence of polymerase slippages, and therefore, the greater instability of microsatellites in MSI instead of MSS, is attributed to more deletions, rather than insertions. Therefore, the parameter p evaluates the stability of each microsatellite site. MSIsensor-pro classifies the i-th microsatellite as unstable when its p is larger than μ i +3σ i , in which μ i and σ i are the mean and standard deviation, respectively, of p in 1,532 normal samples at the i-th microsatellite. The fraction of unstable sites in a given microsatellite set is used to score MSI in a tumor sample (Supplementary Fig. 10 and Methods).
To assess the performance of MSIsensor-pro in terms of the accuracy and computational cost, we compared this method against MSIsensor 11 , mantis 13  respectively, performances which were significantly faster than mSINGS (94 minutes) and mantis (119 minutes). In addition, MSIsensor-pro consumed much less memory than MSIsensor, mSINGS and mantis ( Fig. 2a and Supplementary Fig 11, 12).
While MSIsensor-pro exhibited satisfactory all-around performance in detecting MSI using the 11,666 preselected microsatellites, these sites seemed to have an unequal contribution to the MSI classifications. We therefore evaluated the contribution of each microsatellite based on MND parameter p and identified 7,698 sites (Supplementary Table 8 Therefore, by testing the MSIsensor-pro performance on various numbers of randomly selected DMS sites, we sought to identify small panels of DMS sites that are potentially effective enough for robust MSI calling. Indeed, we found that the AUC values of MSI detection steadily increased with growing numbers of randomly selected DMS sites. When as few as 50 random sites were used, the AUC was approximately 0.98 and remained stable. Taken together, MSIsensor-pro could be applied to various target sequencing panels with as few as 50 sites, (Fig. 2d, Supplementary Fig. 14  MSIsensor-pro's potential to score MSI with circle tumor DNA or stool DNA. We examined the properties of DMS sites and found that they were closer to splicing sites and located in genes with higher expression than the rest of the sites (Supplementary Fig. 15-17), indicating their potential roles in tumorigenesis.   Fig. 1, 2), we argued that the polymerase slippage during DNA replication is an iterative process and that each step is independently accumulative. Therefore, we use multinomial distribution to model the slippage process in microsatellite sites. We use variable x to denote hysteresis synthesis ( 0 x = ), presynthesis ( 2 x = ) and normal synthesis ( =1 x ) of each step of repeat unit synthesis, and the corresponding probabilities are denoted by p , q and 1 p q − − , respectively. Then, x is subject to a multinoulli distribution, and its probability distribution function is as follows: Thus, for a microsatellite site with n repeats on the reference genome, we assume that y is the repeat length observed from the data. Therefore, we have: Here, ND pro and NI pro denote the probability of acquiring the observed repeat length with minimum steps, while Δ is the probability of using more steps. Since Δ is much smaller and difficult to calculate, we ignore it in practice to save computational resources. For a microsatellite region spanned by m reads, we denote the observed repeat length as The values of p and q are positively correlated with the magnitude of polymerase slippages.
Validation of the MND model. To evaluate how well parameters p and q in MND mimic polymerases slippage for microsatellites with various repeat lengths, we randomly select 27,200 microsatellites from normal control samples of three cancer types in TCGA and estimate the parameters p and q of each site. Then, the calculated parameters p and q, also known as the probabilities of deletion and insertion, are used to simulate allele length distribution. The sites with no significant difference (P-value > 0.05, Kolmogorov-Smirnov test) between real and simulated distribution are defined as fitted sites. Then, the percentage of fitted sites to all test sites is used to evaluate the fitness of the MND model. To investigate the polymerase slippages in tumor samples, we estimate p and q for 1,532 TCGA tumor samples and compare the difference between MSI and MSS samples 5 . We found that p is discriminative between the MSI and MSS samples, while q is not, indicating that parameter p is an effective metric for MSI classification.
MSI calling of MSIsensor-pro. We use the parameter p (probability of deletion) in the MND model to evaluate the stability of microsatellites. To distinguish unstable sites from stable sites, we determine the mean (μ i ) and standard deviation (σ i ) of parameter p in the i-th microsatellite site in normal samples. Specifically, a microsatellite is classified as unstable when its p value is larger than μ i +3σ i . Here, 1,532 normal control samples from three cancer types are used to build the baseline. The MSI score, defined as the percentage of unstable sites within all detected sites in a sample, is used for MSI calling. MSIsensor-pro performance evaluation. To assess the performance of MSIsensor-pro, we benchmark MSIsensor-pro against MSIsensor 11 , mantis 15 and mSINGS 12 using 1,532 TCGA tumor samples. The MSI score is used for MSI classification, and AUC is used to evaluate the performance of each method (Supplementary note 1). The CPU usage, memory and runtime of these methods are tested on a TCGA sample, TCGA-AD-A5EJ, by a Linux machine running Ubuntu18.04 OS with Intel(R) Core (TM) i5-7500 CPU@3.40 GHz and 32 GB memory.
To compare the performances of the four methods on samples with low sequencing depths and low tumor purities, we use 178 CRC tumor-normal paired samples in TCGA to simulate test data. We downsample the raw sequencing data to 5x, 10x, 20x, 40x, 60x and 80x sequencing depths and mix different proportions of tumor and normal sequencing data to generate various tumor purities ranging from 5% to 80%. We call MSI for all simulated data and calculate the AUC for each method. To assess the performance of MSIsensor-pro with fewer sites, we select microsatellite sets containing the top 1, 2, 5, 10, 20, 50, 100, 200, 500, and 1,000 DMS sites for MSI calling. In addition, we randomly select various numbers of microsatellites from DMS sites for MSI calling to examine how many sites are sufficient for MSI calling by MSIsensor-pro.
Code availability. MSIsensor-pro source code is freely available at https://github.com/xjtuomics/msisensor-pro with help documentation and demo data for testing.
Data availability. Primary sequencing data, gold standard MSI status and RNA expression data are downloaded from TCGA Research Network (http://cancergenome.nih.gov/). All results generated by this study are available within the article and in the Supplementary Data, or are available from the authors upon request.