Bayesian Bernoulli Mixture Regression Model for Bidikmisi Scholarship Classification

Bidikmisi scholarship grantees are determined based on criteria related to the socioeconomic conditions of the applicant's parents. Deciding on Bidikmisi acceptance is difficult because the data on prospective applicants are large and the criteria involve many different variables. To address these problems, a new approach is proposed that determines Bidikmisi grantees using the Bayesian Bernoulli mixture regression model. The modeling procedure forms the accepted and unaccepted clusters of applicants, and each cluster is estimated by the Bernoulli mixture regression model. The model parameters are estimated with an algorithm based on the Bayesian Markov Chain Monte Carlo (MCMC) method. The accuracy of the acceptance process under the Bayesian Bernoulli mixture regression model is measured by the percentage of correctly classified acceptances, which is compared with the corresponding percentages of the dummy regression model and the polytomous regression model. The comparison shows that the Bayesian Bernoulli mixture regression model gives a higher percentage of acceptance classification accuracy than the dummy regression model and the polytomous regression model.


Introduction
In the era of the ASEAN Economic Community (AEC) in 2015, education stands at the forefront of human resource development. The enforcement of the AEC is a momentum for improving Indonesia's education sector so that it can produce highly competitive human resources. To improve people's productivity and competitiveness in the international market, the Government launched the Bidikmisi Education Cost Assistance Program [1]. Bidikmisi is intended especially for marginalized people. However, there are indications of problems in the implementation of Bidikmisi scholarship acceptance for higher education, i.e. the existence of unacceptable acceptance conditions. The Bidikmisi acceptance status is a binary response (0 and 1), so involving the confounding covariates of the scholarship recipients, namely the main criteria of parents' income and number of household members, produces data that follow a two-component Bernoulli mixture distribution. The characteristics of each component of the Bernoulli mixture can be identified through Bernoulli mixture modeling that involves these Bidikmisi covariates. The mixture
Jurnal Ilmu Komputer dan Informasi (Journal of Computer Science and Information), volume 11, issue 2, February 2018

Bernoulli Mixture Model
The Bernoulli Mixture Model (BMM) is frequently used in text mining [2]. It was first introduced by Duda and Hart [3]. Since then the BMM has been applied in various areas, e.g. image recognition and the clustering of text and words [4,5,6,7,8,9], studies of cancer and schizophrenia [10,11,12,13], and machine learning research [14,15]. The finite mixture model has the following density function [16]:
$$ f(y_i \mid \pi, \theta) = \sum_{j=1}^{L} \pi_j \, p_j(y_i \mid \theta_j), \tag{1} $$
in which $i = 1, 2, \ldots, n$, $L$ is the number of mixture components, $p_j(y_i)$ is the density of component $j$, and $\pi_j$ is a non-negative quantity that sums to one, that is:
$$ \sum_{j=1}^{L} \pi_j = 1, \qquad \pi_j \ge 0. \tag{2} $$
Suppose $Y = (y_1, \ldots, y_n)$ is a random sample of $D$-dimensional binary vectors. The goal is to divide $Y$ into $L$ (possibly unknown but finite) partitions. With a finite mixture density of $L$ components, the BMM can be written as [17]:
$$ f(y_i \mid \pi, \theta) = \sum_{j=1}^{L} \pi_j \, p(y_i \mid \theta_j), \tag{3} $$
where $p(y_i \mid \theta_j)$ is called the mixture component density and $y_i$ is a random vector of binary data whose components are assumed to be independent given $\theta_j$, so that the model can be written as follows [17]:
$$ p(y_i \mid \theta_j) = \prod_{d=1}^{D} \theta_{jd}^{\,y_{id}} \, (1 - \theta_{jd})^{1 - y_{id}}. \tag{4} $$
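To make the mixture density concrete, the following minimal sketch evaluates a two-component BMM for one binary vector. The function names, the weights, and the component probabilities are hypothetical values chosen for illustration only; they do not come from the Bidikmisi data.

```python
import math

def bernoulli_component(y, theta):
    """Component density: product over D independent Bernoulli coordinates."""
    return math.prod(t if bit == 1 else 1.0 - t for bit, t in zip(y, theta))

def bmm_density(y, weights, thetas):
    """Finite mixture density: sum over components of weight * component density."""
    if min(weights) < 0 or abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("mixing weights must be non-negative and sum to one")
    return sum(w * bernoulli_component(y, th) for w, th in zip(weights, thetas))

# Hypothetical illustration: L = 2 components, D = 3 binary coordinates.
weights = [0.4, 0.6]
thetas = [[0.9, 0.2, 0.7], [0.1, 0.8, 0.3]]
y = [1, 0, 1]
density = bmm_density(y, weights, thetas)   # 0.4 * 0.504 + 0.6 * 0.006, about 0.2052
```

Summing `bmm_density` over all eight binary vectors of length three gives one, which is a quick check that the weights and component densities are specified consistently.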

Bernoulli Mixture Regression Model
The Bernoulli mixture regression model (BMRM) is developed from the mixture of generalized linear models, also called the mixture of generalized linear regression models [18]. In the generalized linear model framework, a random variable $Y_i$, named the dependent variable, has a linear relationship with the covariates $X_1, X_2, \ldots, X_p$ as follows:
$$ \eta_i = g(\mu_i) = x_i^{\top} \beta, \tag{5} $$
where $\eta_i$ is the linear predictor, $g(\cdot)$ is the link function, $\mu_i$ is the expected value of the random variable $Y_i$, and $\beta$ is the vector of regression parameters.
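The linear relationship in equation (5) allows the distribution of the dependent variable to be any member of the exponential family (e.g. Gaussian, Poisson, Gamma, or Bernoulli). Distinct link functions can be used to express that relationship. The canonical link function is a natural link function determined by the exponential form of the response's density function. For the Bernoulli distribution, the canonical link function is the logit function, which is defined as
$$ g(\mu_i) = \ln \frac{\mu_i}{1 - \mu_i}. \tag{6} $$
Therefore equation (5) can be represented as
$$ \ln \frac{\mu_i}{1 - \mu_i} = x_i^{\top} \beta, \tag{7} $$
and equation (3) can be redefined as
$$ f(y_i \mid \pi, x_i, \beta) = \sum_{j=1}^{L} \pi_j \, p(y_i \mid x_i, \beta_j), \tag{8} $$
where $p(y_i \mid x_i, \beta_j)$ is the Bernoulli density whose mean follows equation (7) with coefficients $\beta_j$.
By Bayes' theorem, the posterior distribution of the parameters $\Theta$ is proportional to the product of the likelihood and the prior,
$$ p(\Theta \mid y) \propto f(y \mid \Theta) \, p(\Theta), $$
where $p(\Theta)$ is the prior distribution of $\Theta$ and $f(y \mid \Theta)$ is the likelihood. In the Bayesian inference approach, parameter estimation is performed by integrating the posterior distribution. The integration can be carried out numerically by a simulation procedure that is generally known as the Markov Chain Monte Carlo (MCMC) method.
Generally, the MCMC method works iteratively [19,20]: at each iteration a new set of values of $\Theta$ is generated and saved as the generated set of values at iteration $t + 1$.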
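The iteration described above can be illustrated with a small simulation. The sketch below is a random-walk Metropolis sampler, a generic MCMC scheme substituted here for illustration and not the Gibbs sampler used later in this paper. It targets the posterior of an intercept-only Bernoulli model with a logit link; the data, the normal prior, the step size, and all function names are hypothetical assumptions.

```python
import math
import random

def log_posterior(beta, ys, prior_sd=10.0):
    """Log likelihood plus log prior for an intercept-only Bernoulli logit model."""
    # With logit(mu) = beta, log p(y) = y * beta - log(1 + exp(beta)).
    log_lik = sum(y * beta - math.log1p(math.exp(beta)) for y in ys)
    log_prior = -0.5 * (beta / prior_sd) ** 2   # N(0, prior_sd^2) up to a constant (assumed prior)
    return log_lik + log_prior

def metropolis(ys, n_iter=5000, step=0.5, seed=0):
    """Random-walk Metropolis: propose, accept or reject, save the value for t + 1."""
    rng = random.Random(seed)
    chain = [0.0]                                # initial value at t = 0
    current = log_posterior(chain[0], ys)
    for _ in range(n_iter):
        proposal = chain[-1] + rng.gauss(0.0, step)
        candidate = log_posterior(proposal, ys)
        # Accept with probability min(1, posterior ratio).
        if rng.random() < math.exp(min(0.0, candidate - current)):
            chain.append(proposal)
            current = candidate
        else:
            chain.append(chain[-1])              # keep the previous value
    return chain

# Hypothetical data: 70 accepted and 30 not accepted applicants.
ys = [1] * 70 + [0] * 30
chain = metropolis(ys)
kept = chain[1000:]                              # discard an assumed burn-in of 1000 draws
posterior_mean = sum(kept) / len(kept)
accept_prob = 1.0 / (1.0 + math.exp(-posterior_mean))   # inverse logit of the estimate
```

The posterior mean is obtained here by averaging the saved draws, which is the numerical integration that MCMC replaces analytic integration with.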

Source of Data
The data used in this research were gathered from the database of the Ministry of Research, Technology and Higher Education through the Bidikmisi channel, and consist of the Bidikmisi data of all provinces in Indonesia in 2015.

Research Flowchart
The classification analysis procedures used in this paper, i.e. BBMRM, the dummy regression model, and the polytomous regression model, follow the research flow shown in Figure 1.

Research Variables
The research variables used in this study consist of the response variable (Y) and the predictor variables (X), as follows:
Y = the acceptance status of the Bidikmisi scholarship (1 = accepted, 0 = not accepted).
Three variables in the Bidikmisi enrollment, i.e. "father's income", "mother's income", and "family dependent", are used to form the Bernoulli mixture distribution of the response in BBMRM. These three variables are therefore not used as predictors in BBMRM, the dummy regression model, or the polytomous regression model. The Bidikmisi registration form contains many more variables, but these four variables selected above are the more fundamental ones in considering the acceptance of grantees, in accordance with one of the Bidikmisi acceptance rules, namely that the per capita income of the family must not exceed a certain value.

Research Design: Pre-Processing Stage
The pre-processing stage, which identifies the Bernoulli mixture distribution, uses the following steps:
Step 1. Take the response variable (Y).
Step 2. Select the covariates "father's income", "mother's income", and "family dependent".
Step 3. Create a new covariate by dividing the sum of "father's income" and "mother's income" by "the number of family dependents", and name it family income per capita.
Step 4. Code the covariate family income per capita with the following criteria: if family income per capita > Rp. 750,000, the family is categorized as a wealthy family with code of family category CFC = 0; if family income per capita < Rp. 750,000, the family is categorized as a poor family with code of family category CFC = 1.
Step 5. Match the response variable (Y) and the CFC from Step 4 to the AC (Acceptance Condition), "wrong" or "right", using the Bidikmisi acceptance classification given in Table 1.
The result of the pre-processing stage is response data that follow a Bernoulli mixture distribution with two mixture components, namely the component of wrong acceptance condition and the component of right acceptance condition.
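The pre-processing steps can be sketched as follows. Two points are assumptions rather than statements of the paper: a per capita income of exactly Rp. 750,000 is assigned to CFC = 0, and the acceptance condition is taken to be "right" when an accepted applicant is poor or an unaccepted applicant is wealthy, since Table 1 is not reproduced here. The function names and the applicant records are hypothetical.

```python
THRESHOLD = 750_000  # Rp per capita, from Step 4

def family_category(father_income, mother_income, dependents):
    """Steps 3 and 4: family income per capita coded as CFC (1 = poor, 0 = wealthy)."""
    if dependents <= 0:
        raise ValueError("number of family dependents must be positive")
    per_capita = (father_income + mother_income) / dependents
    # Assumption: the boundary value 750,000 itself is coded as wealthy (CFC = 0).
    return 1 if per_capita < THRESHOLD else 0

def acceptance_condition(y, cfc):
    """Step 5 under an assumed reading of Table 1: matching codes are 'right'."""
    return "right" if y == cfc else "wrong"

# Hypothetical applicants: (Y, father's income, mother's income, dependents).
applicants = [
    (1, 1_000_000, 500_000, 4),    # accepted, per capita 375,000
    (1, 3_000_000, 1_000_000, 2),  # accepted, per capita 2,000,000
    (0, 600_000, 0, 3),            # not accepted, per capita 200,000
]
conditions = [
    acceptance_condition(y, family_category(f, m, d)) for y, f, m, d in applicants
]
```

For these three hypothetical records the resulting conditions are right, wrong, and wrong, which gives the two-component labeling that the mixture model then works with.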

Nur Iriawan et al., Bayesian Bernoulli Mixture Regression Model for Bidikmisi Scholarship Classification

Proposed Model
Referring to equation (8), the two-component BMRM to be estimated is defined by
$$ f(y \mid \pi, x, \beta) = \pi_1 \, p(y \mid x, \beta_1) + \pi_2 \, p(y \mid x, \beta_2). \tag{11} $$
The Gibbs sampler is one of the algorithms most frequently used to generate random variables in MCMC [20]. One advantage of the Gibbs sampler is that, in each step, random values only need to be generated from univariate conditional distributions. Based on [20], the general Gibbs sampler algorithm for the BMRM is summarized by the steps in Figure 2.
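The univariate updates of a Gibbs sampler can be illustrated on a simpler model than the BMRM. The sketch below samples a two-component Bernoulli mixture without covariates, assuming Beta(1, 1) priors so that every full conditional is a Bernoulli or Beta distribution. It is not the algorithm of Figure 2: in the regression model the conditionals of the coefficients are not conjugate and need a different univariate update. All names and the simulated data are hypothetical.

```python
import random

def gibbs_bmm(data, n_iter=200, seed=0):
    """Gibbs sampler for a two-component Bernoulli mixture with Beta(1, 1) priors."""
    rng = random.Random(seed)
    n, dim = len(data), len(data[0])
    pi0 = 0.5                                    # weight of component 0
    theta = [[rng.random() for _ in range(dim)] for _ in range(2)]
    draws = []
    for _ in range(n_iter):
        # 1. Draw each component label from its Bernoulli full conditional.
        z = []
        for y in data:
            w = [pi0, 1.0 - pi0]
            for j in range(2):
                for d in range(dim):
                    w[j] *= theta[j][d] if y[d] == 1 else 1.0 - theta[j][d]
            z.append(0 if rng.random() * (w[0] + w[1]) < w[0] else 1)
        # 2. Draw the mixture weight from its Beta full conditional.
        n0 = z.count(0)
        pi0 = rng.betavariate(1 + n0, 1 + n - n0)
        # 3. Draw every component probability from its Beta full conditional.
        for j in range(2):
            size = z.count(j)
            for d in range(dim):
                ones = sum(y[d] for y, label in zip(data, z) if label == j)
                theta[j][d] = rng.betavariate(1 + ones, 1 + size - ones)
        draws.append((pi0, [row[:] for row in theta]))
    return draws

# Hypothetical simulated data: two well-separated groups of 4-dimensional binary vectors.
sim = random.Random(1)
true_theta = [[0.9] * 4, [0.1] * 4]
data = [
    [1 if sim.random() < t else 0 for t in true_theta[0 if i < 50 else 1]]
    for i in range(100)
]
draws = gibbs_bmm(data)
```

Each pass draws one quantity at a time given the current values of all the others, which is the property of the Gibbs sampler noted above. The component labels are exchangeable in this sketch, so the two components may swap roles between runs.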

Results and Analysis
In model (11), two sets of parameters, $\pi$ and $\beta$, have to be estimated from the mixture density $f(y \mid \pi, x, \beta)$. In this research, the BMRM estimated by the Bayesian approach is called BBMRM.

BBMRM Algorithm.
The Gibbs sampler algorithm for BBMRM follows the steps in Figure 3. This algorithm can be run in the OpenBUGS software [21] to estimate the BBMRM for each province.
For example, the BBMRM for the province of East Java yields the significant estimated model (12), a two-component mixture of the form (11) with the estimated weights and regression coefficients substituted. For posterior inference on the parameters to be valid, the Markov chains of the estimated parameters should be convergent, which implies that the chains have reached the posterior distribution. The MCMC convergence of an estimated parameter can be monitored with the Brooks-Gelman-Rubin method [21].
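A variance-based version of this diagnostic can be sketched as follows, assuming several chains of equal length for one parameter. The function name and the simulated chains are hypothetical, and the exact quantities computed by the software may differ from this sketch.

```python
import random
from statistics import mean, variance

def gelman_rubin(chains):
    """Return (pooled variance, average within-chain variance, their ratio)."""
    n = len(chains[0])
    chain_means = [mean(c) for c in chains]
    within = mean([variance(c) for c in chains])   # average within-chain variance
    between = n * variance(chain_means)            # between-chain variance
    pooled = (n - 1) / n * within + between / n    # pooled posterior variance estimate
    return pooled, within, pooled / within

# Hypothetical chains: four that share one target, two that disagree.
rng = random.Random(0)
mixed = [[rng.gauss(0.0, 1.0) for _ in range(1000)] for _ in range(4)]
stuck = [[rng.gauss(m, 1.0) for _ in range(1000)] for m in (0.0, 5.0)]

_, _, ratio_mixed = gelman_rubin(mixed)   # close to one
_, _, ratio_stuck = gelman_rubin(stuck)   # far above one
```

Chains that explore the same distribution have nearly equal pooled and within-chain variances, so the ratio approaches one, whereas chains settled in different regions inflate the pooled variance.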
For the estimation of model (12), Figure 4 presents the convergence of $\pi_1$, whereas Figure 5 shows the convergence of $\pi_2$. These graphs show the evolution of the pooled posterior variance as a green line, the average within-sample variance as a blue line, and their ratio as a red line. A ratio that converges to one means that the chain of the estimated parameter has converged. The same behavior is seen for $\beta_{13}$ in the first mixture component, shown in Figure 6, and for $\beta_{13}$ in the second mixture component, shown in Figure 7.
The acceptance classification percentages for all provinces are presented in Table 2. The three models are built using 70 percent of the data as the in-sample or training data. The other 30 percent, drawn at random so as to be representative, serve as the out-of-sample data for model validation. Table 2 shows that the classification accuracy of the dummy regression model, the polytomous regression model, and BBMRM ranges over about 1%-67%, 7%-65%, and 60%-98%, respectively. There is evidence that BBMRM classifies more accurately than the polytomous regression model in 31 provinces and than the dummy regression model in 27 provinces.
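The validation scheme described above can be sketched as a random 70/30 split followed by the percentage of correctly classified out-of-sample cases. The function names, the seed, and the example labels are hypothetical; the sketch does not reproduce the figures in Table 2.

```python
import random

def train_test_split(records, train_fraction=0.7, seed=0):
    """Shuffle the records and split them into in-sample and out-of-sample parts."""
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    cut = round(train_fraction * len(shuffled))
    return shuffled[:cut], shuffled[cut:]

def accuracy_percent(y_true, y_pred):
    """Percentage of cases whose predicted class equals the observed class."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return 100.0 * correct / len(y_true)

# Hypothetical usage with ten applicant identifiers and four validation labels.
train, test = train_test_split(range(10))
example_accuracy = accuracy_percent([1, 0, 1, 1], [1, 0, 0, 1])   # 75.0
```

Fixing the seed keeps the split reproducible, so that the three models can be compared on the same out-of-sample cases.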

Conclusion
BBMRM coupled with the MCMC approach gives a higher percentage of acceptance classification accuracy than the dummy regression model and the polytomous regression model. BBMRM is therefore more representative for modeling Bidikmisi acceptance. In future research, the three methods for Bidikmisi classification discussed in this paper can be compared with other existing classification methods, such as the Classification and Regression Trees (CART) approach, Neural Network, and Support Vector Machine (SVM). The validation of the three methods above has so far been done only on the 30 percent out-of-sample data from the same year of Bidikmisi selection. In the next study, BBMRM can be validated on Bidikmisi data from the following year. Further research can also examine the influence of covariates on the component weight, $\pi_j$, of BBMRM.