Implementation of Bayesian Mixture Models in identifying subpopulations of breast cancer patients based on blood test measurements

A complete blood test is one of the initial examinations performed on cancer patients, and it is relatively easy to carry out. The individual components of blood measurements are commonly used to analyse a patient's condition. The ratios and inter-ratios of these components, however, are not, and their use is what this study proposes. The hypothesis is that the ratio and inter-ratio components of blood tests explain the condition of cancer patients better than the individual blood test components do. An analysis is also conducted to develop patient profiles based on these measurements and to identify the measurements that clearly distinguish between patient groups. The Finite Mixture Model is a method for modelling heterogeneous data that may originate from different subpopulations, where the subpopulations represent groups of patients sharing a particular latent condition. The model takes the form of a superposition of several distributions; in this study, Gaussian distributions are used. Parameters are estimated with the Bayesian method, in which a prior distribution on the model parameters is combined with the likelihood to produce a posterior distribution. The Markov Chain Monte Carlo-Gibbs Sampler is then used to draw samples of the parameters from the posterior distribution. Using blood test data of breast cancer patients from the Oncology Department of a hospital in Jakarta, with 100,000 burn-in iterations and 200,000 sampling iterations, the Deviance Information Criterion indicates that the optimal grouping is two subpopulations for the blood ratio and inter-ratio measurements. The first subpopulation is characterized by low distribution values and the second by the opposite characteristics. The explanatory factors in the ratio data are the neutrophil-to-lymphocyte ratio, the platelet-to-lymphocyte ratio, and the lymphocyte-to-monocyte ratio.


Introduction
Tests that are useful for diagnosing cancer include laboratory tests, imaging, biopsy, and genomic tests. However, obtaining a good result requires many kinds of costly check-ups. The medical field needs an alternative that makes the diagnosis process easy and cheap without loss of accuracy.
A complete blood test is one of the lower-cost laboratory checks. In the last few years, medical researchers have continually studied the effect on cancer diagnosis of blood test ratios between platelets, lymphocytes, neutrophils, and monocytes. The neutrophil-to-lymphocyte ratio (NLR) [1] and the platelet-to-lymphocyte ratio (PLR) [2] have become prognostic markers from hematologic measurements that are attractive, easy to obtain, and cost-effective. NLR and PLR reflect inflammation in breast cancer, and a higher NLR indicates a worse prognosis [2].
Platelet count, white blood cell count (WBC), mean platelet volume (MPV), NLR, and PLR have been compared as surrogate markers between breast cancer patients and healthy people, but research on cancer diagnosis using the complete blood test remains rare [3]. NLR has a significant effect on breast cancer stage, unlike PLR [4]. A PLR above a high cut-off value indicates poor overall survival and disease-free survival in breast cancer patients, so a standard PLR cut-off value is needed [5]. These findings show that results on the characteristics of blood test ratios still vary.
The significant factors of the blood test can be described by identifying the characteristics of the subpopulations of cancer patients based on the blood test. Clustering is one way to learn the characteristics of the data: it divides the data into several clusters whose members share the same characteristics.
Furthermore, this research defines another measurement that is expected to perform better: the inter-ratio, which consists of the ratios between the blood test ratios.
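As a minimal sketch of how such measurements might be computed, the Python snippet below derives NLR, PLR, and LMR from raw counts and then forms pairwise inter-ratios. The function names, the input counts, and the convention that an inter-ratio such as PLRNLR means PLR divided by NLR are assumptions made for illustration; the text does not state the exact formula.

```python
def safe_ratio(numerator, denominator):
    """Return numerator / denominator, or NaN when the denominator is zero."""
    return numerator / denominator if denominator != 0 else float("nan")


def blood_ratios(neutrophils, lymphocytes, platelets, monocytes):
    """Compute the three ratios named in the text from raw blood counts."""
    return {
        "PLR": safe_ratio(platelets, lymphocytes),
        "NLR": safe_ratio(neutrophils, lymphocytes),
        "LMR": safe_ratio(lymphocytes, monocytes),
    }


def inter_ratios(ratios):
    """Divide each ratio by every later ratio (assumed inter-ratio definition)."""
    names = list(ratios)
    return {
        a + b: safe_ratio(ratios[a], ratios[b])
        for i, a in enumerate(names)
        for b in names[i + 1:]
    }


# Hypothetical counts for a single patient, for illustration only.
ratios = blood_ratios(neutrophils=4.0, lymphocytes=2.0,
                      platelets=250.0, monocytes=0.5)
pairs = inter_ratios(ratios)
print(ratios)  # {'PLR': 125.0, 'NLR': 2.0, 'LMR': 4.0}
print(pairs)   # {'PLRNLR': 62.5, 'PLRLMR': 31.25, 'NLRLMR': 0.5}
```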

Data and Methods
For any number of subgroups $K$, the weights $\pi = (\pi_1, \dots, \pi_K)$ represent the proportion of the data that is explained by each subgroup $j = 1, \dots, K$, and are known as the component weights. Each $\pi_j$ ranges between 0 and 1, and $\sum_{j=1}^{K} \pi_j = 1$.
The membership of each observation in one of these subgroups is the main interest. To indicate this for observation $y_i$, we introduce a discrete latent variable $z_i$ taking values in $j = 1, \dots, K$. Including the latent variable in the model does not alter the likelihood of $y_i$, because summing over the possible values of $z_i$ recovers the mixture density.
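The role of the component weights and of the latent membership variable can be illustrated with a short Python sketch for a Gaussian mixture. The function names and the numerical values are hypothetical; the sketch only shows that the mixture density is a weighted sum of Gaussian densities and that membership probabilities follow from normalising the weighted component densities.

```python
import math


def normal_pdf(y, mean, sd):
    """Density of a Normal(mean, sd^2) distribution evaluated at y."""
    z = (y - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def mixture_density(y, weights, means, sds):
    """Weighted sum of the component densities (the mixture likelihood of y)."""
    return sum(w * normal_pdf(y, m, s) for w, m, s in zip(weights, means, sds))


def membership_probabilities(y, weights, means, sds):
    """Probability that y belongs to each subgroup, given the parameters."""
    parts = [w * normal_pdf(y, m, s) for w, m, s in zip(weights, means, sds)]
    total = sum(parts)  # equals mixture_density(y, ...)
    return [p / total for p in parts]


# Hypothetical two-subgroup example.
weights, means, sds = [0.6, 0.4], [2.0, 8.0], [1.0, 2.0]
probs = membership_probabilities(3.0, weights, means, sds)
print(probs)  # the first subgroup is the more probable one for y = 3.0
```

Summing the unnormalised terms over all subgroups gives back the mixture density, which is why adding the latent variable leaves the likelihood unchanged.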

MCMC-Gibbs sampler
Markov chain Monte Carlo (MCMC) is a technique employed to generate and refine estimates of the unknown parameters θ in order to approximate the desired posterior distribution. When running MCMC, we need to check whether the algorithm has converged to the desired posterior distribution [9]. MCMC offers a way to produce samples from the posterior when the normalization constant cannot be obtained analytically, as is often the case in all but very simple models. The general idea of the MCMC method is to generate a large number of samples, or iterations, from the posterior. The sequence of sampled values of each parameter provides an estimate of the exact posterior distribution, and the expectation of each unknown quantity is given by its Monte Carlo estimate. The most commonly implemented MCMC technique is the Gibbs sampler, which is mainly useful for model inference under conjugate priors.
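The Gibbs sampler decomposes the posterior, which is multivariate and complex, into a number of simpler distributions, one for each unknown parameter, that are easier to sample from. Each of these distributions is known as a full conditional. Let the parameters of interest be $\theta = (\theta_1, \dots, \theta_d)$. The full conditional of each $\theta_k$ depends on the remaining $d - 1$ unknown parameters and takes the general form $p(\theta_k \mid \theta_l, y)$, where $l \neq k$, and each full conditional has a familiar parametric form. For each $\theta_k$, an initial estimate is provided. A cyclical sampling scheme then draws each $\theta_k$ in turn from its full conditional, given the latest estimates of the other parameters. For a single iteration $t$, a generalization of this scheme is given by the equations

$$\theta_1^{(t)} \sim p(\theta_1 \mid \theta_2^{(t-1)}, \theta_3^{(t-1)}, \dots, \theta_d^{(t-1)}, y)$$
$$\theta_2^{(t)} \sim p(\theta_2 \mid \theta_1^{(t)}, \theta_3^{(t-1)}, \dots, \theta_d^{(t-1)}, y)$$
$$\vdots$$
$$\theta_d^{(t)} \sim p(\theta_d \mid \theta_1^{(t)}, \theta_2^{(t)}, \dots, \theta_{d-1}^{(t)}, y)$$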
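The cyclical scheme can be sketched in Python for a univariate Gaussian mixture. This is an illustrative sketch, not the implementation used in the study: the priors (Dirichlet on the weights, Normal on the means, Gamma on the precisions), their hyperparameter values, the initialisation, and the function name are all assumptions. Label switching is not handled.

```python
import math
import random


def gibbs_gaussian_mixture(y, k, n_iter, burn_in, seed=0,
                           alpha=1.0, m0=0.0, s0=100.0, a0=2.0, b0=1.0):
    """Gibbs sampler sketch for a k-component univariate Gaussian mixture.

    Assumed priors (illustrative, not taken from the paper):
    Dirichlet(alpha) on the weights, Normal(m0, s0^2) on each mean,
    Gamma(shape=a0, rate=b0) on each precision (inverse variance).
    Returns the (weights, means, precisions) draws kept after burn-in.
    """
    rng = random.Random(seed)
    n = len(y)
    ys = sorted(y)
    # Initial estimates: equal weights, means at evenly spaced sample quantiles.
    weights = [1.0 / k] * k
    means = [ys[(2 * j + 1) * n // (2 * k)] for j in range(k)]
    mean_y = sum(y) / n
    var_y = sum((v - mean_y) ** 2 for v in y) / n
    precs = [1.0 / var_y] * k
    draws = []
    for t in range(n_iter):
        # Step 1: draw each allocation z_i from its full conditional,
        # P(z_i = j | rest) proportional to weight_j * Normal(y_i | mean_j, 1/prec_j).
        z = []
        for yi in y:
            logp = [math.log(weights[j]) + 0.5 * math.log(precs[j])
                    - 0.5 * precs[j] * (yi - means[j]) ** 2 for j in range(k)]
            top = max(logp)  # subtract the maximum to avoid underflow
            probs = [math.exp(v - top) for v in logp]
            z.append(rng.choices(range(k), weights=probs)[0])
        counts = [z.count(j) for j in range(k)]
        sums = [sum(yi for yi, zi in zip(y, z) if zi == j) for j in range(k)]
        # Step 2: weights | z ~ Dirichlet(alpha + counts), via normalised gammas.
        g = [rng.gammavariate(alpha + counts[j], 1.0) for j in range(k)]
        weights = [v / sum(g) for v in g]
        for j in range(k):
            # Step 3: mean_j | rest ~ Normal (conjugate update).
            post_prec = 1.0 / s0 ** 2 + counts[j] * precs[j]
            post_mean = (m0 / s0 ** 2 + precs[j] * sums[j]) / post_prec
            means[j] = rng.gauss(post_mean, 1.0 / math.sqrt(post_prec))
            # Step 4: prec_j | rest ~ Gamma, using the mean just drawn.
            ss = sum((yi - means[j]) ** 2 for yi, zi in zip(y, z) if zi == j)
            rate = b0 + 0.5 * ss
            precs[j] = rng.gammavariate(a0 + 0.5 * counts[j], 1.0 / rate)
        if t >= burn_in:
            draws.append((list(weights), list(means), list(precs)))
    return draws
```

Each step uses the most recent values of the other parameters, which is the defining feature of the scheme. The study reports 100,000 burn-in and 200,000 sampling iterations; far smaller values are adequate for trying the sketch on simulated data.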

Deviance information criterion (DIC)
The DIC fits naturally within Bayesian analysis because it formally takes into account uncertainty in the model parameters. Similar to other information criteria, the DIC is the sum of a model fit term and a penalty term. Given the parameter $\theta$, the general form of the DIC is given by

$$\mathrm{DIC} = D(\theta^*) + 2 p_D \qquad (7)$$

Here $D(\theta^*)$ represents the deviance of the model given an estimate $\theta^*$ of $\theta$,

$$D(\theta^*) = -2 \log p(y \mid \theta^*) + 2 \log f(y)$$

The inclusion of $f(y)$ acts as a standardising term and is a function of $y$ only. For example, in a general linear model, this may correspond to the likelihood of the saturated model, resulting in $D(\theta^*)$ representing the scaled residual deviance. For model comparison, it is generally accepted that $f(y)$ is set to one.

To motivate the derivation of the penalty term $p_D$, consider the discrepancy between the deviance at the true parameter value and the deviance at its estimate, $D(\theta) - D(\theta^*)$. Within the Bayesian paradigm, since the true value of $\theta$ is unknown, this discrepancy may be approximated by its posterior expectation, taken with respect to the posterior of the unknown parameters, $p(\theta \mid y)$, resulting in the definition

$$p_D = E_{\theta \mid y}[D(\theta)] - D(\theta^*) = \overline{D(\theta)} - D(\theta^*)$$

In other words, $p_D$ measures the difference between the average deviance and the deviance at the estimated parameters of the model. Thus, it can be interpreted as the expected reduction in model uncertainty due to parameter estimation. In practice, $\overline{D(\theta)}$ is easy to calculate, because it is simply the average of the deviance over all MCMC iterations. Therefore, it can be seen that the DIC takes into account uncertainty in estimating the parameters through $\overline{D(\theta)}$. Referring to Equation (7), given the expressions for $D(\theta^*)$ and $p_D$, the DIC may be rewritten as follows

$$\mathrm{DIC} = \overline{D(\theta)} + p_D = 2\,\overline{D(\theta)} - D(\theta^*)$$
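As a small illustration of these quantities, the following Python sketch computes the average deviance, the penalty term, and the DIC from a list of deviance values recorded over MCMC iterations. The function name and the numbers are hypothetical, and $f(y)$ is taken to be one, so the deviance is minus twice the log-likelihood.

```python
def dic_from_deviance(deviance_draws, deviance_at_estimate):
    """Compute the average deviance, the penalty p_D, and the DIC.

    deviance_draws: deviance D(theta) evaluated at each retained MCMC draw.
    deviance_at_estimate: deviance D(theta*) at the parameter estimate.
    """
    mean_deviance = sum(deviance_draws) / len(deviance_draws)
    p_d = mean_deviance - deviance_at_estimate
    return {
        "mean_deviance": mean_deviance,
        "p_d": p_d,
        "dic": mean_deviance + p_d,  # same as deviance_at_estimate + 2 * p_d
    }


# Hypothetical deviance values, for illustration only.
result = dic_from_deviance([10.0, 12.0, 14.0], deviance_at_estimate=11.0)
print(result)  # {'mean_deviance': 12.0, 'p_d': 1.0, 'dic': 13.0}
```

The two forms of the DIC agree here: the estimate-based form gives 11.0 + 2 × 1.0 = 13.0, and the average-based form gives 12.0 + 1.0 = 13.0.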

Results
Using the ratio and inter-ratio data, the FMM can identify subpopulations in the data. We run the algorithm for 200,000 iterations to sample from the posterior distribution of the parameters. In all tables, P[] gives the weight of each subpopulation, deviance is the value used to calculate the DIC when determining the best number of subpopulations, and Lambda[] represents the mean value in each subpopulation. Table 1 shows the ratio data with four subpopulations. The table shows that adjacent subpopulations overlap: the third quantile mean value of the first subpopulation is higher than the first quantile mean value of the second subpopulation, the third quantile mean value of the second subpopulation is higher than the first quantile mean value of the third subpopulation, and the third quantile mean value of the third subpopulation is higher than the first quantile mean value of the fourth subpopulation.
The same pattern appears in Table 2 for the ratio data with three subpopulations: the third quantile mean value of the first subpopulation is higher than the first quantile mean value of the second subpopulation, and the third quantile mean value of the second subpopulation is higher than the first quantile mean value of the third subpopulation. Furthermore, in Table 3, for the ratio data with two subpopulations, the third quantile mean value of the first subpopulation is higher than the first quantile mean value of the second subpopulation.
Table 4 shows the inter-ratio data with four subpopulations. Here too, adjacent subpopulations overlap: the third quantile mean value of the first subpopulation is higher than the first quantile mean value of the second subpopulation, the third quantile mean value of the second subpopulation is higher than the first quantile mean value of the third subpopulation, and the third quantile mean value of the third subpopulation is higher than the first quantile mean value of the fourth subpopulation. The same is shown in Table 5 for the inter-ratio data with three subpopulations: the third quantile mean value of the first subpopulation is higher than the first quantile mean value of the second subpopulation, and the third quantile mean value of the second subpopulation is higher than the first quantile mean value of the third subpopulation. Furthermore, in Table 6, for the inter-ratio data with two subpopulations, the third quantile mean value of the first subpopulation is higher than the first quantile mean value of the second subpopulation.
The deviance of each model is used to calculate the DIC, where a lower DIC value is better. The DIC values are displayed in Table 7. For both the ratio and inter-ratio data, the best number of subpopulations is two, with a DIC value of 7525.8 for the ratio data and 11822.8 for the inter-ratio data.
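The two comparisons used in this section, checking whether adjacent subpopulations overlap and choosing the number of subpopulations with the lowest DIC, can be sketched in Python as follows. The function names and all numerical values are hypothetical and are not taken from Tables 1 to 7.

```python
def adjacent_overlaps(quantile_intervals):
    """Flag overlap between each pair of neighbouring subpopulations.

    quantile_intervals: list of (first_quantile, third_quantile) of the mean,
    one pair per subpopulation, ordered from lowest to highest mean.
    An overlap is flagged when the third quantile of one subpopulation
    exceeds the first quantile of the next one.
    """
    return [
        quantile_intervals[i][1] > quantile_intervals[i + 1][0]
        for i in range(len(quantile_intervals) - 1)
    ]


def best_number_of_subpopulations(dic_by_k):
    """Return the number of subpopulations with the lowest DIC."""
    return min(dic_by_k, key=dic_by_k.get)


# Hypothetical intervals and DIC values, for illustration only.
intervals = [(1.0, 3.0), (2.5, 5.0), (6.0, 8.0)]
print(adjacent_overlaps(intervals))  # [True, False]

dic_by_k = {2: 100.0, 3: 105.0, 4: 110.0}
print(best_number_of_subpopulations(dic_by_k))  # 2
```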
The factors of the ratio and inter-ratio data are described in the following figure. The ratio data has six factors, and three of them, namely PLR, NLR, and LMR, do not overlap.
Although the first subpopulation has low distribution values, it is concentrated, because the interval of its values is narrow. The second subpopulation has large distribution values and a wider interval of values, meaning that this subpopulation contains more varied values. A medical analysis is required to interpret the differing overlaps of PWR, HPR, and LWR.
The inter-ratio data, on the other hand, has ten factors, eight of which do not overlap: PLRNLR, PLRLMR, PLRPWR, NLRLMR, NLRPWR, LMRPWR, LMRLWR, and HPRLWR. As with the ratio data, although the first subpopulation has small values, it is concentrated because of its narrow interval of values. The second subpopulation has large values but a wider interval of values, meaning that this subpopulation contains more varied values. Therefore, a medical analysis is required to interpret the differing overlaps of LMRPWR and HPRPWR.