An automated approach for determining the number of components in non-negative matrix factorization with application to mutational signature learning

Non-negative matrix factorization (NMF) is a popular method for finding a low rank approximation of a matrix, thereby revealing the latent components behind it. In genomics, NMF is widely used to interpret mutation data and derive the underlying mutational processes and their activities. A key challenge in the use of NMF is determining the number of components, or rank of the factorization. Here we propose a novel method, CV2K, to choose this number automatically from the data, based on a detailed cross validation procedure combined with a parsimony consideration. We apply our method to mutational signature analysis and demonstrate its utility on both simulated and real data sets. In comparison to previous approaches, some of which involve human assessment, CV2K leads to improved predictions across a wide range of data sets.


Introduction
Non-negative matrix factorization (NMF) provides a low rank approximation of a matrix. When non-negativity constraints are applicable, NMF often provides a more meaningful structure than other low-rank approximation methods [1].
One important question concerning the use of NMF is how to determine the number of components, or rank, of the factorization (K). For example, Brunet et al [2] propose the cophenetic correlation coefficient as a measure of stability for each K, and choose the first value at which the cophenetic coefficient begins to fall. Hutchins et al [3] select the value of K at which the reconstruction error curve shows an inflection point. Another approach is a Bayesian variant of NMF [4], which chooses a K that balances the model's likelihood and complexity. It uses an automatic relevance determination technique to remove irrelevant components that do not help to explain the input data. Other approaches are based on enumerating K and testing the performance of the different solutions in a cross validation setting (cf [5]). Owen and Perry's Bi-Cross-Validation [6] generalizes Gabriel's two-dimensional holdout cross validation scheme [7], leaving out a sub-matrix on each fold and using the rest of the matrix to reconstruct it. A different cross-validation scheme by Wold [8], also adopted in [9,10], leaves out random entries from the initial matrix.
In the field of mutational signature learning, the two common methods are SigProfiler (SP) [11] and SignatureAnalyzer (SA) [12]. SigProfiler uses a method that is similar to that of Brunet et al. It clusters signatures into K clusters and uses the tightness of the clusters as a measure of stability. The number of signatures is chosen after human assessment of the stability and accuracy of the solutions for a range of K values. SignatureAnalyzer uses the above Bayesian NMF to infer the number of signatures K, but may output different values across different runs and also depends on a regularization hyper-parameter.
Here we propose an automated method, CV2K, to choose K that is applicable for the most general definition of NMF. CV2K is based on a detailed comparison of the performance of different solutions in reconstructing hidden values of the original matrix in a cross validation setting, combined with a parsimony consideration. We demonstrate its utility using both simulated and real data sets, showing that it outperforms previous approaches over a wide range of cases.

Non-negative matrix factorization
NMF receives as input a matrix V ∈ R≥0^(N×M) and a number of factors K, and outputs non-negative matrices W ∈ R≥0^(N×K) and H ∈ R≥0^(K×M) whose product approximates V. In its most general formulation, NMF solves the following minimization problem:

arg min_{W,H ≥ 0} Σ_{i,j} d(V_ij, (WH)_ij)    (1)

for some cost function d : R≥0² → R≥0. A useful class of distance functions can be obtained by using the Bregman divergence, i.e., d_ϕ(x, y) = ϕ(x) − ϕ(y) − ϕ′(y)(x − y) for a univariate convex smooth function ϕ; for example, ϕ(x) = x² yields the squared (Frobenius) loss, and ϕ(x) = x log x yields the generalized Kullback-Leibler divergence. In some settings a weight matrix O is given to us, e.g. reflecting entries with available data or the confidence in each cell, leading to the following extension:

arg min_{W,H ≥ 0} Σ_{i,j} O_ij · d(V_ij, (WH)_ij)    (2)

In order to optimize NMF, which is known to be NP-hard [13], we start from a random guess for W, H and apply iterative update rules. For the initialization we use the random scheme used in scikit-learn; for the updates we use scalar block coordinate descent (sBCD) [14], which we found to be the simplest and most efficient algorithm for a general cost function d_ϕ. We adapt both the initialization (lines 3-5) and sBCD (lines 6, 9-16) to work with a weight matrix as in (2). To check for convergence, we use a minimal relative improvement threshold ε and a maximum iteration number T (lines 7-8, 17). This is summarized in algorithm 1, which was adapted from [14]. In practice, to avoid bad local minima, we initialize the model many times, train each for a few hundred iterations, and continue training the best performing model.
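To make the weighted objective (2) concrete, the following sketch fits it under the generalized Kullback-Leibler divergence. Note this is not the paper's sBCD solver: it uses the standard multiplicative updates, extended to an entry-wise weight matrix O, as a simpler stand-in; the function name and defaults are ours.

```python
import numpy as np

def weighted_kl_nmf(V, O, K, n_iter=300, eps=1e-12, seed=0):
    """Illustrative weighted NMF under generalized KL divergence.
    Multiplicative updates with an entry-wise weight matrix O as in (2);
    the paper itself uses sBCD, which handles a general Bregman cost."""
    rng = np.random.default_rng(seed)
    N, M = V.shape
    # Random non-negative initialization (a stand-in for the scikit-learn scheme)
    W = rng.random((N, K)) + eps
    H = rng.random((K, M)) + eps
    for _ in range(n_iter):
        WH = W @ H + eps
        # Weighted multiplicative update for H
        H *= (W.T @ (O * V / WH)) / (W.T @ O + eps)
        WH = W @ H + eps
        # Weighted multiplicative update for W
        W *= ((O * V / WH) @ H.T) / (O @ H.T + eps)
    return W, H
```

With O set to all ones this reduces to the classical (unweighted) KL-NMF update; zeros in O simply exclude the corresponding entries from the fit, which is what the imputation scheme below relies on.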

Data imputation
In order to score different values of K, we hide some of the original matrix entries and test how well we can recover them. To this end, we set the weight matrix O to be binary, acting as a mask on the matrix values. To score the imputation given W, H, we first normalize W, H, V and then average the L1 error over all hidden values. The entire scheme is summarized in algorithm 2. In lines 2-4 we move the weights from the rows of H to the columns of W without changing the product WH. To reduce the variance caused by high/low values to impute, we normalize the rows of V and W to sum to one (lines 5-6, 7-8).
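The masking-and-scoring loop can be sketched as below. This is a simplified stand-in for algorithm 2: it fits with multiplicative weighted-KL updates rather than sBCD, normalizes rows of both V and the reconstruction (rather than the W/H rescaling of lines 2-8), and the function name and defaults are ours. It also does not enforce the requirement that O keep at least one visible entry per row and column, which a full implementation should check.

```python
import numpy as np

def imputation_score(V, K, frac=0.1, n_iter=300, eps=1e-12, seed=0):
    """Hide a random fraction of entries behind a binary mask O, fit NMF on
    the visible entries only, and return the mean L1 error on the hidden
    entries after row-normalization (illustrative simplification)."""
    rng = np.random.default_rng(seed)
    N, M = V.shape
    O = (rng.random((N, M)) >= frac).astype(float)  # 1 = visible, 0 = hidden
    W = rng.random((N, K)) + eps
    H = rng.random((K, M)) + eps
    for _ in range(n_iter):  # weighted multiplicative KL updates
        WH = W @ H + eps
        H *= (W.T @ (O * V / WH)) / (W.T @ O + eps)
        WH = W @ H + eps
        W *= ((O * V / WH) @ H.T) / (O @ H.T + eps)
    # Row-normalize to reduce the variance caused by high/low hidden values
    Vn = V / (V.sum(axis=1, keepdims=True) + eps)
    R = W @ H
    Rn = R / (R.sum(axis=1, keepdims=True) + eps)
    hidden = O == 0
    return np.abs(Vn[hidden] - Rn[hidden]).mean()
```

On exact low-rank data, the score at the true rank should be markedly lower than at a rank that is too small, which is the signal the rank search exploits.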

Choosing the number of components
Our CV2K scheme is described in algorithm 3 and consists of two parts: (a) (lines 3-8) evaluate all applicable values of K using imputation with a fraction p of missing values, by repeatedly running Algorithm 2 on random O matrices (line 5); note that we require O to have at least one non-zero entry in every row and column. We let K* be the value of K attaining the best median error over all runs. (b) Apply a parsimony consideration, described below, that may further decrease the chosen K. An example application of this criterion is shown in figure 1, and Table 2 summarizes the effect of this part on the performance of our method.
Figure 1. Imputation error as a function of K. Dots represent runs, with their median denoted using a dashed line. The minimum is obtained for K = 12, and the final chosen K is marked in gray (11). The true number of components for this simulation is also 11.
As we show below, the results of CV2K are quite robust to p across a wide range, yet we hypothesize that fractions between 1/N and 1/M will work best; this requires further investigation on data sets of more varied sizes. In order to efficiently identify K*, we use a binary search algorithm, starting from an arbitrary guess of 20. Once we find K*, we scan a small window (±3) around it to test for a better value. If there is a close K with a better median, we adopt it and repeat; otherwise we use a Wilcoxon rank sum test to check whether K can be decreased without a significant loss, and repeat.
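The parsimony step can be sketched as follows. The sketch assumes the per-K imputation errors have already been collected; the function name, the dictionary interface, and the significance level `alpha` are our choices, and the binary search plus ±3 window scan that locate the best median are omitted for brevity.

```python
import numpy as np
from scipy.stats import ranksums

def choose_k(score_runs, alpha=0.05):
    """Parsimony sketch: score_runs maps each candidate K to a list of
    imputation errors (one per random mask). Start from the K with the best
    median error, then repeatedly step down to K-1 as long as its errors are
    not significantly worse (one-sided Wilcoxon rank-sum test)."""
    ks = sorted(score_runs)
    k_star = min(ks, key=lambda k: np.median(score_runs[k]))
    while k_star - 1 in score_runs:
        # H1: errors at K-1 tend to be greater than at K
        _, p = ranksums(score_runs[k_star - 1], score_runs[k_star],
                        alternative='greater')
        if p < alpha:   # the smaller K is significantly worse: stop
            break
        k_star -= 1     # indistinguishable performance: prefer the smaller K
    return k_star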

CV2K-x: Eliminating the fraction hyperparameter
To eliminate the fraction hyperparameter, we propose a variant (CV2K-x) which is described in Algorithm 4 (showing only the differences from Algorithm 3). The idea is to combine cross validation on the rows of V to learn the H matrix (signatures) with cross validation on the columns to learn the W matrix (exposures; for the latter we use SciPy's non-negative least squares [15]). Finally, reconstruction errors are summed over all column and row folds to produce a score for a given rank. When applying this method with the number of folds ranging from 5 to 30, we noticed that the results were identical or deviated by at most 1 across this range. Henceforth, we used 10-fold cross validation (F = 10).
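The row-fold half of this scheme can be sketched as below; the column folds are symmetric (apply the same routine to V.T, exchanging the roles of W and H). The function names, the L1 error choice, and the pluggable `fit_nmf` callback are our assumptions, not the paper's exact interface.

```python
import numpy as np
from scipy.optimize import nnls

def heldout_row_error(V_test, H):
    """Given signatures H learned on the training rows, fit non-negative
    exposures for each held-out row with SciPy's NNLS and return the summed
    L1 reconstruction error."""
    err = 0.0
    for v in V_test:            # one held-out sample (row) at a time
        w, _ = nnls(H.T, v)     # solve v ~= w @ H with w >= 0
        err += np.abs(v - w @ H).sum()
    return err

def cv2k_x_row_score(V, K, F=10, fit_nmf=None, seed=0):
    """Score rank K by summing held-out errors over F row folds.
    fit_nmf(V_train, K) must return (W, H); any NMF solver can be plugged in."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(V.shape[0])
    total = 0.0
    for f in range(F):
        test = idx[f::F]
        train = np.setdiff1d(idx, test)
        _, H = fit_nmf(V[train], K)
        total += heldout_row_error(V[test], H)
    return total
```

Because every fold hides whole rows (or whole columns), no per-entry hiding fraction is needed, which is exactly what removes the hyperparameter p.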

Comparison to related methods
Two previous methods, Bi-CV [6] and scattered holdout [9], have explored the use of cross validation for learning the number of components in NMF. To demonstrate the utility of our new method, we also tested it against these two previous methods. We found that the Bi-CV method is not stable, producing results with no single local minimum. For example, on a sample simulated data set (SA (ii)) in which the true number of components is 39, the Bi-CV method gives multiple local minima that are far from the true value (for K = 15, 64, 76, 52, 29, in this order). To test the scattered holdout approach we used Alex Williams' implementation (http://alexhwilliams.info/itsneuronalblog/2018/02/26/crossval/). We adjusted the implementation (abbreviated scatCV below) to use median values (rather than means) to score each rank, thus decreasing the problem of outliers. Lin et al [10] also suggest using imputation to choose K. Their approach, implemented in the NNLM package, is based on masking 30% of the input matrix entries in cross validation and picking the K that minimizes the reconstruction error. See https://rdrr.io/cran/NNLM/f/inst/doc/Fast-And-Versatile-NMF.pdf for additional details.

Data description
Simulated data
The simulated data was generated and described in detail in [16] to evaluate SigProfiler and SignatureAnalyzer. Here, for each of the 12 data sets, we evaluate our method on two sets of 'realistic' synthetic data: SP-realistic, based on SP's reference signatures and attributions, and SA-realistic, based on SA's reference signatures and attributions. For each of the (i) to (x) tests, the synthetic data sets were generated based on observed statistics for each signature in each cancer type. Data sets can differ in the number of signatures, the number of active signatures per sample (sparsity), the number of mutations per sample (whole exome/genome sequencing), whether they reflect a single cancer type or multiple types, and the similarity between signatures. All these factors affect the difficulty of determining the number of components.

Real data
We analyzed breast cancer (BRCA), chronic lymphocytic leukemia (CLLE), and malignant lymphoma (MALY) mutation datasets from whole-genome sequences (WGS) from the International Cancer Genome Consortium (ICGC). We downloaded mutations from the ICGC data portal (https://docs.icgc.org/). For BRCA, we used release 22 to match Nik-Zainal et al [17], while for the other datasets we used the most recent release (release 27). These data sets had 560, 100 and 100 samples, respectively, and more than 1000 mutations for most samples.
In addition, we downloaded mutations from whole-exome-sequencing (WXS) data of 411 ovarian cancer (OV) patients from The Cancer Genome Atlas [18]. There are about 100 mutations per sample in this collection. We extracted pancreas (PANC) and prostate (PROST) mutations from the WGS PCAWG data set linked from COSMIC. These data sets have 326 and 286 samples, respectively, with more than 1000 mutations for most samples.

Evaluation
For evaluation, we use three figures of merit: (i) Average percent deviation is the average over |Truth − Predicted| / Truth × 100, emphasizing errors in data sets with a small number of signatures. (ii) Average distance is the average over |Truth − Predicted|, emphasizing large deviations from the true value, especially in data sets with a high number of signatures. (iii) Overall wins is the number of data sets where the predicted rank is closest to the true one. In addition, we use a signed version of the deviation, which is the average over (Predicted − Truth) / Truth × 100, to examine whether we over- or under-estimate the true rank.
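The three per-data-set metrics are straightforward averages; the following sketch computes them (the function name and dictionary keys are our naming; "overall wins" is omitted since it compares across methods rather than scoring one method alone).

```python
import numpy as np

def evaluate(truth, predicted):
    """Figures of merit over paired lists of true and predicted ranks."""
    t = np.asarray(truth, dtype=float)
    p = np.asarray(predicted, dtype=float)
    return {
        # Relative error, emphasizing data sets with few signatures
        "avg_percent_deviation": np.mean(np.abs(t - p) / t * 100),
        # Absolute error, emphasizing large deviations
        "avg_distance": np.mean(np.abs(t - p)),
        # Signed relative error: negative means under-estimation
        "signed_deviation": np.mean((p - t) / t * 100),
    }
```

For example, truths [10, 20] with predictions [8, 22] give an average percent deviation of 15, an average distance of 2, and a signed deviation of -5 (under-estimation on balance).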

Results
In the following we describe the application of CV2K to simulated and real mutations data sets. For brevity, we use CV2K(p) to denote the application of CV2K with a fraction p of hidden values in its cross validation procedure, where CV2K-x corresponds to the variant without the fraction hyperparameter. For CV2K, we used the Kullback-Leibler divergence (KL) variant of NMF, to produce the results mentioned below.

Simulated data
Our first set of results focuses on simulated data from [16] with different characteristics, covering different numbers of signatures and different similarity relationships between the signatures. We examine our performance on simulated data over a range of fractions p of hidden entries, as summarized in table 4. The results are relatively robust across a wide range of values for this parameter; as a general rule, our method tends to under-estimate K, with smaller fraction values leading to closer estimates than larger ones. We further observe that the use of the Wilcoxon rank sum step improves the performance of the algorithm, as summarized in table 2. We compare our performance against two widely used tools for mutational signature extraction, SigProfiler (SP) and SignatureAnalyzer (SA), reporting the results of the authors of these tools on the simulated data, also published in [16]. As SigProfiler and SignatureAnalyzer lack results on a few data sets (4 and 2, respectively), to aid the comparison we fill all missing values with the optimal value (marked with an asterisk in table 3). We further compare to a general model selection approach, the Akaike information criterion (AIC), as well as to the scatCV and NNLM methods reviewed above. The results are summarized in table 3 and depicted in figure 4 (left panel). They show the superiority of our approach (regardless of the choice of the hidden fraction p) over the existing ones.

Real data
Next, we evaluated the different methods on the six real mutation data sets described above. As the true number of components or signatures in these applications is not known, we used the number of signatures suggested in COSMIC [19] as a surrogate. For SA we report in these tests the most frequent choice across 10 runs. SP was not used in these tests due to its high running time (using the original Matlab code; we could not execute the new Python code on our servers) and its lack of automation. The results are summarized in table 5 and depicted in figure 4 (right panel), showing that our method yields numbers of signatures that are in higher agreement with the current knowledge on the corresponding data sets as represented by COSMIC.

Conclusion
In summary, we proposed a fully automated method, CV2K, to select the number of components in NMF. In the context of mutational signatures, we showed that CV2K outperforms state-of-the-art methods on simulated and real data.
The success of our method, in particular in comparison to previous methods that are based on similar cross validation principles, can be attributed to several factors: (i) a comprehensive framework for NMF optimization based on block coordinated descent coupled with multiple initializations to avoid local minima; (ii) the use of a parsimony consideration to resolve cases in which the imputation error reaches a plateau, making CV2K less prone to overestimation of K; and (iii) an exploration of a range of values for the fraction of hidden entries and a novel approach to eliminate this hyper-parameter.
Although the number of components is the only obvious hyper-parameter in NMF, there is in fact another regularization parameter that is often used. This regularization parameter can also affect the number of signatures. For future work we suggest expanding the model search to include this parameter as well.