MEMO: multi-experiment mixture model analysis of censored data

Motivation: The statistical analysis of single-cell data is a challenge in cell biological studies. Tailored statistical models and computational methods are required to resolve the subpopulation structure, i.e. to correctly identify and characterize subpopulations. These approaches also support the unraveling of sources of cell-to-cell variability. Finite mixture models have shown promise, but the available approaches are ill suited to the simultaneous consideration of data from multiple experimental conditions and to censored data. The prevalence and relevance of single-cell data and the lack of suitable computational analytics make automated methods, that are able to deal with the requirements posed by these data, necessary. Results: We present MEMO, a flexible mixture modeling framework that enables the simultaneous, automated analysis of censored and uncensored data acquired under multiple experimental conditions. MEMO is based on maximum-likelihood inference and allows for testing competing hypotheses. MEMO can be applied to a variety of different single-cell data types. We demonstrate the advantages of MEMO by analyzing right and interval censored single-cell microscopy data. Our results show that an examination of censoring and the simultaneous consideration of different experimental conditions are necessary to reveal biologically meaningful subpopulation structures. MEMO allows for a stringent analysis of single-cell data and enables researchers to avoid misinterpretation of censored data. Therefore, MEMO is a valuable asset for all fields that infer the characteristics of populations by looking at single individuals such as cell biology and medicine. Availability and Implementation: MEMO is implemented in MATLAB and freely available via github (https://github.com/MEMO-toolbox/MEMO). Contacts: eva-maria.geissen@ist.uni-stuttgart.de or nicole.radde@ist.uni-stuttgart.de Supplementary information: Supplementary data are available at Bioinformatics online.


D.1 Introduction
MEMO is an easy to use software toolbox for the statistical analysis of uncensored and censored single-cell data. It can deal with a variety of data types. The subpopulation structure and statistical properties of heterogeneous cell populations are assessed by the application of maximum likelihood inference and hypothesis testing to mixture models. MEMO 's key features are 1. support of uncensored, interval, left and right censored data, 2. symbolic definition and compilation of the mixture models, 3. support of normal, log-normal, gamma and Johnson-SU distributions, 4. integration and simultaneous analysis of multiple experimental conditions, 5. provision of gradient information for multi-start local optimization, 6. automated backward model selection, 7. integration of identifiability and uncertainty analysis (profile likelihoods and Markov chain Monte Carlo methods) using PESTO.

D.2.1 Statistical modeling of uncensored and censored data
We consider the inference of multi-experiment mixture models M, from uncensored and censored data D. The mixture model describes the distribution of the quantity of interest X in a heterogeneous population consisting of subpopulations s = 1, . . . , S with relative subpopulation sizes w s (u) and subpopulation mixture parameters ϕ s (u) for experimental condition u. The properties of the individual subpopulations are described by the probability densities φ(x|ϕ s (u)). The sum of the relative subpopulation sizes is one, S s=1 w s (u) = 1. Subpopulation sizes and mixture parameters can depend on the experimental conditions u. To infer these functional dependencies multiple experimental conditions u i , i = 1, . . . , I, are considered.
For the inference of the mixture model, uncensored data x, right or left censored data x or x, and interval censored data x are considered. The overall dataset D is a collection of the datasets D i for individual experimental conditions u i , We denote the j-th uncensored, the k-th right censored, the l-th interval censored, and the m-th leftcensored measurement under experimental condition u i by x j i , x k i , x l i , and x m i , respectively. The number of data points in experimental condition u i are n x,i , n x,i , n x,i and n x,i . In practice, not all four data types are available at the same time but there are four common combinations: (i) uncensored data; (ii) uncensored and right or left censored data; (iii) interval censored data; and (iv) interval and right or left censored data. In the following, we will introduce statistical models for the process of data collection in the presence of censoring. In this process data is generated from a data generating distribution, the distribution we are actually interested in. Due to censoring this distribution differs from distribution of observed data. By modeling these changes we will derive the likelihood function for a single experimental condition u i .

D.2.1.1 Distribution of data in the absence of censoring
For independent identically distributed (i.i.d.) measurements, uncensored data x j i provide samples from p(x i |θ, u i ). Accordingly, for experimental condition u i the dataset consists of samples Uncensored measurements provide information about the full distribution, enabling reliable reconstruction of the distribution p(x i |θ, u i ) for sufficiently large sample numbers n x,i . The reconstruction of the distribution does not ensure that the parameters θ are identifiable. For mixture models, for instance, the problem of symmetry is well-known.
In the absence of censoring, the likelihood function for data D i is given by

D.2.1.2 Distribution of data in the presence of interval censoring
Interval censored data x l i provide the information that the corresponding uncensored data point x i lies in the interval (x l i − ∆x, x l i ]. The interval length is denoted by ∆x. Accordingly, for experimental condition u i the dataset consists of samples Interval censored data provide information about the probability mass between two observation points. The precise shape of the probability density between observation points cannot be reconstructed but is merely restricted by the distribution assumption.
In the presence of interval censoring, the likelihood function for data D i is Note that we assume that the length of all intervals is identical. This can easily be generalized. Furthermore, instead of the endpoint x j i of an interval also a description in terms of an interval index, e.g. k for an interval ((k − 1)∆x, k∆x], might be used. By rewriting the likelihood in terms of observations per interval, the link to the multinomial distribution becomes apparent.

D.2.1.3 Distribution of data in the presence of right censoring
In the presence of right censoring, uncensored data x j i and right censored data x k i are observed. In order to model the data collection process, we define the indices j, k, and z. These indices enumerate the uncensored and right censored datapoints and the sum of both, respectively, and are initially set to 0. For each data point z, sample pairs x i and x i,c are drawn from their respective distributions, where p c (x i,c |θ, u i ) denotes the right censoring distribution. Then we set The minimum of these two random variables is recorded and added to the dataset, i.e.
The distributions of observed uncensored and right censored data for experimental condition u i are the conditional distributions and marginal probabilities for observing uncensored or censored data, The cumulative distributions of X i and X i,c are denoted by P (x i |θ, u i ) and P c (X i,c |θ, u i ), respectively. As analytical solutions of p(X i ≤ X i,c |θ, u i ) and p(x i,c ≤ X i |θ, u i ) are often not available, numerical integration might be necessary.
The right censoring distribution can have different shapes. In the case of distributed censoring, meaning that p c (x i,c |θ, u i ) is a smooth distribution, the likelihood function for data D i is proportional to In the case of fixed censoring at a single valuex i,c , translating into a probability density which is a Dirac delta, p c (x i,c |θ, u i ) = δ(x i,c −x i,c ), the likelihood function simplifies and we obtain This formulation exploits the tail probabilities 1 − P (x k i |θ, u i ), capturing the censoring. Sometimes this likelihood function is also used to avoid modeling of the censoring. While this still allows for inference, a visual comparison of model and data requires an estimate of the censoring distribution (see Section S.4.2.1, Model-data comparison in the Supplementary Material), since the conditional distributions above have to be evaluated for this purpose. Furthermore, this estimate is required for goodness-of-fit analysis based on bootstrapping the likelihood distribution of the objective function (see Section S.4.2.1, Goodness of fit in the Supplementary Material). Without an estimate of the censoring distribution, the expected event and censoring distribution -p(x i , x i ≤ X i,c |θ, u i ) and p(x i,c , x i,c ≤ X i |θ, u i ) -cannot be predicted, and therefore it is not possible to resample data for a bootstrapping.

D.2.1.4 Distribution of data in the presence of left censoring
In the presence of left censoring, uncensored data x j i and left censored data {x m i } are observed. In order to model the data collection process in the presence of left censoring, we proceed similarly to the case of right censoring, i. e. sample pairs x i and x i,c are drawn from their respective distributions, where p l (x i,c |θ, u i ) denotes the left censoring distribution. Counters are updated acccordingly. The maximum of these two random variables is recorded and added to the dataset, The distributions of observed uncensored and left censored data are and marginal probabilities for observing uncensored or censored data, The cumulative distributions of X i and X i,c are denoted by P (x i |θ, u i ) and P l (x i,c |θ, u i ), respectively.
The likelihood function for data D i is proportional to and can be simplified acccordingly in case of a delta distribution for the left censoring times.

D.2.1.5 Distribution of data in the presence of interval and right censoring
In the presence of interval and right censoring, interval censored data x l i and right censored data x k i are observed. In order to model the data collection process in the presence of interval and right censoring, all counters are initially set to 0, i. e. k, l, z = 0, and sample pairs x i and x i,c are drawn from their respective distributions, Then we set . The expressions floor ∆x (x) and ceil ∆x (x) indicate round up and round down of x to the next multiple of ∆x. Accordingly, ceil ∆x (x) yields the smallest multiple of ∆x, x + , which is larger than x, and floor ∆x (x c ) yields the largest multiple of ∆x, x − c , which is smaller than x c . Without loss of generality we assume that measured time points are multiples of ∆x, such that ∀i, j∃k , k such that x j i = k ∆x and x j i = k ∆x. The dataset is filled as follows The distributions of interval and right censored data for experimental condition u i are and marginal probabilities for observing uncensored or censored data, The cumulative distributions of X i and X i,c are denoted by P (x i |θ, u i ) and P c (x i,c |θ, u i ), respectively.
This statistical model of the data in the presence of interval and right censoring models (time-lapse) microscopy experiments as well as flow cytometry experiments (with finite acquisition accuracy). The aspect of censoring is however neglected in standard analysis tools (O'Neill et al., 2013, Pyne et al., 2009).
In the case of distributed censoring the likelihood function for data D i is The joint distribution is used, as we know the value of the respective measurement as well as the order of the occurence of event and censoring. In the case of fixed censoring at a pointx c , the likelihood function simplifies and we obtain The case of combining interval and left censoring can be derived accordingly.
In summary, statistical models and likelihood functions for the different types of experimental data have been introduced in Section D.2.1.1 -D.2.1.5. Based on these preliminaries, Multi-Experiment Mixture Modelling using the MATLAB toolbox MEMO will be introduced. MEMO provides a framework for the simultaneous analysis of multiple experiments for which censored and uncensored data are available.

D.2.2 Step 1: Formulation of hypotheses
Standard mixture modeling is commonly used for the quantification of experimental data as well as for the analysis of the subpopulation structures and properties (Pyne et al., 2009). These are inference and hypothesis testing problems. As MEMO allows for the simultaneous study of multiple experimental datasets, more general hypotheses can be assessed.
Hypotheses regarding the condition-dependent population structures can be encoded in the functional dependency of the subpopulation sizes w s (u) and distribution parameters ϕ s (u) on the experimental condition u and the parameters θ. These dependencies are generally unknown and can be inferred by MEMO. Parametrization of w s (u) and ϕ s (u) with meta-parameters θ is possible. MEMO allows for a flexible symbolic definition of these parameter dependent functions, w s (u) = fun(u, θ) and ϕ s (u) = fun(u, θ), and the inference of the corresponding meta-parameters θ.
The interpretation of the distribution parameters ϕ s (u) depends on the distribution assumption. In the current version of MEMO four distributions for the subpopulations are supported: Normal distribution: With distribution parameters ϕ = (µ, σ) the probability density function (pdf) and cumulative distribution function (cdf) are With Λ(y) we denote the cdf of the standard normal distribution N (0, 1) throughout this documentation.
Mechanistic models. Appropriate functions w s (u) and ϕ s (u) can often be derived from the properties of the biological system, e.g., the underlying signalling pathway. This has been demonstrated for uncensored data (Hasenauer et al., 2014) and can be generalised to the case of censored data. In this case, the mixture parameters w s (u) and ϕ s (u) relate to solutions of the model describing the signalling pathway. Possible models include reaction rate equations (Hasenauer et al., 2014), the linear noise approximations (Elf & Ehrenbarg, 2003, van Kampen, 2007, effective mesoscopic rate equations (Grima, 2010, Ramaswamy et al., 2012 or moment equations (Engblom, 2006, Lee et al., 2009). These ODE models allow for a mechanistic description of the single-cell and population dynamics, in particular the explicit consideration of intrinsic and/or extrinsic noise (Swain et al., 2002). MEMO can use functions w s (u) and ϕ s (u) derived from such mechanistic models for hypothesis testing using censored and uncensored data. An example for the use of mechanistic models is provided in Section 4.3 in the main manuscript and Section S.5 in the Supplementary Material.
Semi-mechanistic models. For many processes no mechanistic models are available. In this case, flexible parametric expressions can be used to model w s (u) and ϕ s (u). These expressions depend on the mixture component and the expected condition dependence. In case of a normal distribution with mixture parameters ϕ(u) = (µ(u), σ(u)), exemplary hypotheses regarding the condition-dependence of the mean are for example: The mean changes linearly with u: The mean increases with u in a Hill-type manner: µ(u, θ) = µmaxu n K n +u n with θ = (µ max , K, n) T .
The mean decreases with u in a Hill-type manner: µ(u, θ) = µmaxK n K n +u n with θ = (µ max , K, n) T .
Similar or more general dependencies might be assumed for any function w s (u) and ϕ s (u). In the presence of multiple inputs also products of univariate functions and/or multi-variate functions might be used. An example for the use of semi-mechanistic models is provided below in Section S.4.4 in the Supplementary Material.
We note that in addition to the subpopulation structure and properties, also the right censoring distributions can be parametrized. MEMO even allows for condition-dependent censoring.

D.2.3
Step 2: Parameterization of multi-experiment mixture models The formulated multi-experiment mixture model in general posseses several unknown parameters. These unknown parameters are collected in the parameter vector θ ∈ R n θ . MEMO implements state-of-the-art Frequentist and Bayesian methods to estimate the unknown parameters and to evaluate their uncertainties. The concepts and methods are introduced in the following.

D.2.3.1 Frequentist parameter estimation
Frequentist parameter estimation assesses the quality of the model using the likelihood of the data given the model parameters, P(D|θ). Assuming that measurements are i.i.d., the likelihood function for multiple experimental datasets is given as the product of the likelihood functions for the individual datasets, This likelihood function encodes the information about the optimal parameter values and parameter uncertainties present in the experimental data.
Maximum likelihood (ML) estimate. A maximum likelihood estimate (MLE) θ ML is a parameter vector for which the likelihood takes its maximal value, hence ∀θ ∈ Ω : P(D|θ) ≤ P(D|θ ML ). Accordingly, θ ML is a solution to the optimization problem The numerics of the optimization problem (2) are improved by using the negative logarithm of the objective function, This transformation conserves the optimum and the shape of the level sets. The reformulation yields the minimization problem (3) For more information concerning the optimization problem and the implementation in MEMO please visit MEMO's github page (https://github.com/MEMO-toolbox/MEMO).
Confidence intervals. As the measurement data are limited, the parameters can often not be determined uniquely and the corresponding estimation problem is ill-posed (Hadamard, 1902). Confidence intervals can be computed via local sensitivity-based methods, e.g., the Wald approximation (Meeker & Escobar, 1995) or the Fisher information matrix (FIM) (Murphy & van der Vaart, 2000). Alternatively, bootstrapping (Joshi et al., 2006), profile likelihoods (Murphy & van der Vaart, 2000, Raue et al., 2009) and Markov chain Monte-Carlo methods (Girolami & Calderhead, 2011, Hug et al., 2013, Wilkinson, 2007 can be used. Nowadays, Bayesian methods and profile likelihoods (Murphy & van der Vaart, 2000 become increasingly common as they yield very reliable results . The profile likelihood PL(θ i ) is the maximal likelihood value for a given value θ i . A particular value of θ i can be rejected, if the profile likelihood PL(θ i ) is low compared the likelihood P(D|θ ML ) at the globally optimal parameter point θ ML . Cut-off values for the likelihood ratio for a particular significance level can be derived from the χ 2 distribution (Meeker & Escobar, 1995). For more information on the method and the implementation in MEMO please visit MEMO's github page.

D.2.3.2 Bayesian parameter estimation
Bayesian parameter estimation employes Bayes' theorem, The posterior distribution π(θ|D) of the parameters θ given the data D is determined by the likelihood P(D|θ), the prior distribution π(θ) and the marginal probability P(D). The marginal probability is independent of the parameters. Therefore, the posterior probability is proportional to the product of likelihood and prior.
MEMO implements optimization-and sampling-based approaches for the analysis of the posterior distribution.
Maximum a posteriori (MAP) estimate. A maximum a posteriori estimate θ MAP is a parameter vector for which the posterior probability takes its maximal value, hence ∀θ ∈ Ω : π(θ|D) ≤ π(θ MAP |D). Thus, MAP estimates provide the best agreement of model and data taking the prior knowledge into account. As the posterior probability is proportional to the product of likelihood and prior probability, θ MAP is a solution to the optimization problem θ MAP = arg max θ∈Ω P(D|θ)π(θ) s.t. This optimization problem can be reformulated similar to the corresponding ML problem (2). In MEMO, multi-start local optimization is used to solve the reformulated optimization problems. The same optimization settings are used as for the maximum likelihood problem (see Section D.2.3.1), and similar visualizations are available.
Credibility intervals. In Bayesian statistics, the a posteriori uncertainty of model parameters depends on the information content of the data -encoded in the likelihood -and the prior information. Bayesian credibility intervals for the parameters (Chen & Shao, 1999) can, for instance, be computed using Laplace approximations at the MAP estimate, profile posteriors (Hug et al., 2013) and Markov chain Monte-Carlo (MCMC) methods (Girolami & Calderhead, 2011, Hug et al., 2013, Wilkinson, 2007. The profile posterior calculation is similar to the profile likelihood calculation. The same ideas and methods are exploited but the function P(D|θ)π(θ) is optimized repeatedly to determine the uncertainty of each individual parameter.
MCMC methods generate a chain of parameters, θ 1 , θ 2 , . . . , θ n S , by exploring the posterior distribution π(θ|D). After convergence of the chain, the set S = {θ j } n S j=1 provides a representative sample from the posterior distribution. This sample S reveals parameter uncertainties as well as correlations of parameters. The most common choice for the 100(1 − α)% credibility interval for a parameter θ j is the 100(1 − α)-th percentiles of the sample S (DiCiccio & Efron, 1996).

D.2.4 Step 3: Testing of hypotheses via model selection
To compare competing model hypotheses, MEMO allows for the comparison of user-defined models as well as for an automated analysis of the population structure.

D.2.4.1 Model selection methods
For model selection, three different types of criteria are implemented in MEMO: Likelihood ratio test; Akaike information criterion (AIC); and Bayesian information criterion (BIC). The likelihood ratio test requires models to be nested, which is not required by AIC and BIC. In the following, the different criteria are introduced, and the index k is used to denote a particular model: Likelihood ratio test. The likelihood ratio test is a statistical test used to compare a null model M k and an alternative model M k (Wilks, 1938). The null model has to represent a special case of the alternative model, thus the models have to be nested. The likelihood ratio P(D|θ ML k ))/P(D|θ ML k )) indicates how much more likely the data are under the alternative model compared to the null model. The logarithm of this test statistic Ξ = − log(P(D|θ ML k )) + log(P(D|θ ML k )) is approximately χ 2 distributed with degrees of freedom equal to n θ,k − n θ,k . Using the χ 2 distribution, the significance level for a rejection of the null model can be computed. In this manuscript we considered a p-value of 0.05 as substantial.
Akaike information criterion. The AIC is an information theoretical measure of the quality of a statistical model (Akaike, 1973). It accounts for fit and complexity of a model M k , The number of parameters of model M k is denoted by n θ,k . Models with lower AICs are preferable and the best model M k * (according to AIC) possesses the minimal AIC, AIC min = AIC k * with k * = arg min k AIC k . In practice, given the best model M k * , the level of empirical support for model M k is considered to be substantial for 0 < AIC k − AIC min < 2, considerably less for 4 < AIC k − AIC min < 7, and essentially none for 10 < AIC k − AIC min .
For details we refer to (Burnham & Anderson, 2002). In this manuscript we considered a difference in the AIC values of 4 to be substantial.
The application of AIC is well justified if the sample-size n D is large. In the case of small sample-sizes, the corrected AIC provides a bias adjustment (Hurvich & Tsia, 1989), In accordance with this formula, AIC and corrected AIC should only be applied for n D > n θ,k + 1.
Bayesian information criterion. The BIC is a probabilistic model selection criterion, derived using Bayesian arguments. It accounts for fit and model complexity, The weight of the model complexity depends on the sample-size n D . Similar to the AIC, models with lower BICs are preferable. The best model M k * (according to BIC) possesses the minimal BIC, BIC min = BIC k * with k * = arg min k BIC k . In practice, given the best model M k * , the evidence against model M k is considered to be not worth more than a bare mention for 0 < BIC k − BIC min < 2, positive for 2 < BIC k − BIC min < 6, strong for 6 < BIC k − BIC min < 10, and very strong for 10 < BIC k − BIC min .
For details we refer to (Kass & Raftery, 1995). In this manuscript we considered a difference in the BIC values of 6 to be substantial.
The three model selection criteria implemented in MEMO enable the selection of appropriate models based on optimization results. Parameter uncertainties as well as prior knowledge is not taken into account. To consider this, Bayes factors can be used for model selection (Kass & Raftery, 1995). As the robust and efficient computation of Bayes factors can be challenging and might require problem-specific implementations, MEMO does not provide this functionality. Any methods for Bayes factor calculation, including bridge sampling (Meng & Wong, 1996), importance sampling (Neal, 2001), nested sampling (Skilling, 2006), Laplace Metropolis (Lewis & Raftery, 1997), density estimation-based approaches (Kramer et al., 2010) and thermodynamic integration (Vyshemirsky & Girolami, 2008), can however be easily coupled with MEMO. For this purpose, MEMO provides an easy-to-use objective function interface.

D.2.4.2 Automated unraveling of subpopulation structure
The comparison of many alternative hypotheses can become cumbersome as it requires the definition of many models. Therefore, in addition to the comparison of user-defined models, MEMO provides backward model selection capabilities. This backward model selection supports the automated unraveling of the subpopulation structure. For single-and multi-experiment single-cell data, the implemented backward model selection iteratively simplifies the model by eliminating subpopulation which are not supported by the experimental data. Therefor all possible models which possess in the experimental conditions one subpopulation less than the current model are generated, compiled and fitted. Among the fitted model the best one is selected using the specified selection criteria (likelihood ratio, AIC or BIC) along with a cut-off threshold. This process is repeated until no further improvement of the model is achieved.
The backward model selection method implemented in MEMO is computationally relatively efficient. In comparison to a brute-force approach not all model alternatives are analyzed, which decreases the computational complexity significantly. Furthermore, the automatic generation reduces the work load for the user and enables the analysis of thousands of model alternatives.

D.2.5 Step 4: Interpretation and further analysis
Parameter estimation and model selection provide a set of mixture models capturing the data. MEMO provides different routines to visualize these models along with the measurement data and to compare them. Beyond the sole analysis of available data, multi-experiment mixture models enable predictions. For example, mixture models with relative subpopulation sizes w s (u) and subpopulation mixture parameters ϕ s (u) that functionally depend on the experimental condition can be used to predict the outcomes of future experiments with different experimental conditions. Therefore, the mixture properties just have to be evaluated for the respective experimental setting u. MEMO also allows for the assessment of the prediction uncertainty, e.g. by using the parameter samples obtained via MCMC sampling (Section D.2.3.2). This enables a thorough analysis of the predictive power of models and model validation.
The mixture model and its parameters can furthermore be used for subsequent analysis of the data generating process. The calculated subpopulation sizes w s (u) and subpopulation mixture parameters ϕ s (u) can, for instance, be used to unravel key properties of the underlying signaling pathways such as cooperative interactions between proteins. The processed data can even be used to inform mechanistic models. This was demonstrated in several recent publications (Duffy et al., 2012, Heinrich et al., 2013, Song et al., 2010. For the use in subsequent analysis, MEMO provides estimates and confidence or credibility intervals for all essential model quantities. Also model averaging (Ando, 2008, Burnham & Anderson, 2004 can easily be realized to account for multiple model alternatives.

D.2.6 Gradients of objective function
To facilitate the convergence and reduce the computational cost of the optimization, MEMO provides the optimization routine with the gradient of the objective function with respect to the parameter vector θ.
In the following subsections we provide all gradients needed for the case of uncensored data (x j i ), interval censored data (x j i ) and right censored data (x j i ), assuming a model for the censoring. For maximum generality the distribution parameters ϕ are assumed to be functions of θ.

D.2.6.1 Gradient of the log-likelihood function
Since we optimize the logarithm of the objective function, the gradient of this log-likelihood function is provided.
For the case of uncensored data the gradient reads with A = p(x j i |θ, u i ). For the case of interval censored data the gradient reads . For the case of right censored data the gradient reads For the case of left censored data the gradient reads . For the case of interval and right censored data the gradient reads The case of left and interval censored data can be derived accordingly.
The gradients of the mixture distributions and probability densities that are part of this gradient can be found in the subsequent sections.

D.2.6.2 Gradient of the mixture distribution
The gradient of the mixture distribution with respect to the parameter vector θ is The gradient of the cumulative mixture distribution with respect to the parameter vector θ is

D.2.6.3 Gradient of the probability densities
In the following the gradients of the probability density functions φ (see D.2.2) and cumulative probability densities (cdf) Φ with respect to the parameter vector θ are provided for different distribution types.

D.3 Implementation of MEMO
MEMO is implemented in Matlab and has been tested under Version R2014a under Mac OS, Linux and Windows 7.

D.3.1 Requirements
In addition to the provided Matlab functions, the Parameter EStimation Toolbox (PESTO) developed by Jan Hasenauer (Institute of Computational Biology, Helmholtz Zentrum München, Germany) is required. PESTO is included in the MEMO download. Furthermore the Symbolic Math Toolbox and the Statistics and Machine Learning Toolbox of Matlab are required. For MCMC sampling the MATLAB toolbox DRAM (Haario et al., 2006) is required.

D.3.2 Model formulation and data format
MEMO uses a single definition file, an m-file, for each finite mixture model. This definition file is provided by the user and contains the specification of the model M and the data D. An example is shown in Figure D1. The model setup starts by defining the struct parameters that specifies the parameters and their range for optimization, sampling and profile likelihood estimation. Since MEMO uses symbolic computation, the parameters have to be defined symbolically and then stacked in a parameter vector. Furthermore, for each parameter a name has to be specified that is used for plots and other visualizing routines and should therefore comply to a L A T E X interpretable format ( Figure D1, lines 3 -21). In addition, the definition file provides the structure of the finite mixture model for the quantity of interest. The type of distribution can be chosen and the unit of the modeled data can be specified. The latter will be used as label in visualization routines. Furthermore, M contains the specification of the different experimental conditions that are subject to analysis. For every experimental condition the subpopulation structure, namely the number of subpopulations and their distribution type, can be specified. The symbolic parameters are assigned to their respective experimental condition and subpopulations. If modeling of the distribution of the censoring quantity is chosen, the same is performed for the censoring quantity in a second struct Mc ( Figure D1, lines 23 -52). The definition file also links the data to the model. All information on the data is stored in a cell array D, whose length is the number of different experimental conditions. For every dataset a title specifying the experimental condition and the vectors containing the data, grouped by the type of data (uncensored / interval censored (Tm), right / left censored (Tc)), has to be provided. Furthermore, the type of censoring in Tc has to be specified. Possible options are 'left' or 'right'. Data for the different experimental conditions can be loaded from separate data files ( Figure D1, lines 34, 54-60) or the data and the associated information can be specified in the definition file itself. An exemplary data file is shown in Figure D2. Subsequent to the definition, model and data structure are compiled by the MEMO function getMixtureModel.m that creates functional expressions of parameters and derivatives ( Figure D1, lines 70 + 71).

D.3.3 Likelihood calculation
To calculate the likelihood for a certain model given the data, MEMO provides the functions logLikelihoodMM.m and logLikelihoodMMc.m for model setups without and with model of the censoring distribution, respectively. These functions take the parameters struct, the model struct(s) and the data struct as an input. If two outputs are specified also the gradient of the likelihood is evaluated. Examples for the function call are shown in Figure D3.

D.3.4 Optimization
To find the optimal parameters of the finite mixture model, given the data, the function optimizeMultiStart.m is called ( Figure D4). This function performs a multi-start local optimization. The first input argument is the parameters struct. The second argument is a function handle to the respective likelihood function (logLikelihoodMM.m or logLikelihoodMMc.m) and the last argument is a struct with options, e.g. for fmincon (see m-file of optimizeMultiStart.m for further information). The default options are shown   Figure D3: Function calls to obtain the likelihood and its gradient.
in Figure D5. The optimization time scales linearly with the number of data points. The number of experimental conditions, which contain these data points, has almost no effect. However with increasing number of experimental conditions, the number of parameters potential increases, which may require a higher number of starts in the multi-start optimization and result in an increased computational complexity.

D.3.5 Automated backward model selection
The automated model backward selection is performed by calling the function performModelSelection.m which takes the model with the maximum number of components per experimental condition M, the parameters, the data struct D and options and returns the most likely model consisting of M red and Mc red, and the reduced parameter struct parameters red. For details on the method please see Section D.2.4.2.

D.3.7 Uncertainty analysis
MEMO provides two different approaches to assess the uncertainty of the estimated parameters. There are inbuilt functions for profile likelihood estimation and MCMC sampling. For details we refer to Section D.2.3. These function can be called by the functions computeProfiles.m and computeMCMCsample DRAM.m included in PESTO.

D.3.8 Visualization
MEMO provides the user with a variety of visualization functions. The function printModel.m prints the specified model in the Matlab command window either with symbolic names of the parameters or, if the parameters struct with optimization results is provided as input, the parameter values of the best fit (see Figure D6 for example output). Parameter values shown by printModel.m are the parameters θ (not ξ = log(θ), see Section D.2.3). The function plotMixtureModel.m visualizes the model data fit graphically by creating a Matlab figure with subplots for every experimental condition. There are several possible options, including visualization of parameter uncertainties obtained by MCMC sampling.