Modeling and inference for mixtures of simple symmetric exponential families of p-dimensional distributions for vectors with binary coordinates

We propose tractable symmetric exponential families of distributions for multivariate vectors of 0's and 1's in p dimensions, or what are referred to in this paper as binary vectors, that allow for nontrivial amounts of variation around some central value 𝝁 ∈ {0, 1}^p. We note that more or less standard asymptotics provides likelihood-based inference in the one-sample problem. We then consider mixture models where component distributions are of this form. Bayes analysis based on Dirichlet processes and Jeffreys priors for the exponential family parameters proves tractable and informative in problems where relevant distributions for a vector of binary variables are clearly not symmetric. We also extend our proposed Bayesian mixture model analysis to datasets with missing entries. Performance is illustrated through simulation studies and application to real datasets.


INTRODUCTION
This paper concerns practical modeling and inference for multivariate binary data. The term "binary" refers to the two possible values that each of the single-variate responses can assume, whether naturally occurring in two categories (e.g., presence/absence, success/failure, on/off) or formed by dichotomizing a continuous response (e.g., whether or not systolic blood pressure is <140). Multivariate binary responses are common in many fields, including the biological and social sciences. For example, diseases are often diagnosed on the basis of binary data on multiple symptoms. In psychological and educational testing, subjects are often required to give "yes" or "no" answers to multiple questions. Even in engineering fields, conclusions concerning the reliability of complex systems are often based on observation of whether various components are functioning or not. In image processing, objects are often identified based on variables such as area, shape, and so on where data arise from the use of threshold values.
We undertake the task of modeling multivariate vectors of 0's and 1's in p dimensions by first identifying a tractable symmetric family that allows for nontrivial amounts of variation around some central value. We use the fact that for any two p-dimensional binary vectors the squared Euclidean distance between them is given by the number of "pixel flips" between the two vectors. The exponential families of distributions we consider are symmetric in terms of the number of "pixel flips" away from a central binary vector and are characterized by a parameter that controls the amount of variability around the central value allowed by the distribution. We then consider mixture models with such components. Bayesian inference for data modeled by such mixture distributions is then performed via MCMC sampling.
The rest of the paper is arranged as follows. Section 2 describes an exponential family of symmetric distributions on multivariate binary vectors and likelihood-based inference for it in a one-sample model. In Section 3 we present the mixture model and our Bayes analysis for it. Results from simulation studies made to evaluate the performance of Bayes inference for the mixture model are presented in Section 4. We apply the Bayesian mixture model to two real datasets in Section 5. An extension of the Bayes modeling approach to handle incomplete datasets is described in Section 6 and applied to another real problem followed by conclusions in Section 7.

A SYMMETRIC EXPONENTIAL FAMILY OF DISTRIBUTIONS ON BINARY VECTORS
Suppose X ∈ {0, 1}^p is a random vector where p is a positive integer. In what follows, realizations of X will be denoted as x. The number of possible realizations of X is 2^p. We first consider a tractable symmetric distribution for the p-dimensional binary vector X.
Note that for any two p-dimensional binary vectors x and z the squared Euclidean distance between them is the number of "coordinate or pixel flips" between the two vectors, given by

d²(x, z) = ∑_{j=1}^{p} 1(x_j ≠ z_j),

where 1(x_j ≠ z_j) is the indicator variable that takes the value 1 if the jth coordinates of x and z do not agree and is 0 otherwise. In information theory, the squared Euclidean distance defined above is known as the Hamming distance. Hereafter, we use the term "pixel flips" and the variable m to denote the squared Euclidean (Hamming) distance between two p-dimensional binary vectors.
Let q(m) be a probability mass function (pmf) defined on the set of all possible pixel flip counts between two p-dimensional binary vectors, that is, the set {0, 1, … , p}. Then, a sensible symmetric distribution on {0, 1}^p is

f(x|μ) = q(m(x, μ)) / C(p, m(x, μ)),   (1)

where m(x, μ) is the pixel flip count between x and μ, and C(p, m) is the binomial coefficient "p choose m". The parameter μ ∈ {0, 1}^p is the central data pattern in form (1). We consider cases where the pmf q(m) is parameterized with a parameter that can be used to control the amount of variability around μ represented by f(x|μ). A particularly tractable version of the basic model (1) is

f(x|μ, θ) = c(θ) θ^{m(x, μ)} / C(p, m(x, μ)),   (2)

where μ ∈ {0, 1}^p, θ ∈ (0, 1), and c(θ) is the normalizing constant (required for f(⋅) to be a pmf). Here, the total probability for all vectors x ∈ {0, 1}^p that are m pixel flips away from μ decreases geometrically in the number of pixel flips. That is,

q(m|θ) = P[m(X, μ) = m] ∝ θ^m, m = 0, 1, … , p.   (3)

The parameter θ in form (2) is directly interpretable as a distribution spread parameter. The constant c(θ) in form (2) (the constant of proportionality in display (3)) is (1 − θ)∕(1 − θ^{p+1}). Smaller values of θ produce realizations x typically more similar to μ than do bigger values.
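As a quick sanity check on form (2) and the stated normalizing constant, the pmf can be evaluated by brute force over all 2^p outcomes. The sketch below is our own Python code (function names are not from the paper); it also confirms that the total mass at one pixel flip, relative to the mass at the center, is θ.

```python
from itertools import product
from math import comb

def c(theta, p):
    # normalizing constant of form (2): c(theta) = (1 - theta) / (1 - theta^(p+1))
    return (1.0 - theta) / (1.0 - theta ** (p + 1))

def flips(x, mu):
    # pixel flip (Hamming) count m(x, mu)
    return sum(a != b for a, b in zip(x, mu))

def f(x, mu, theta):
    # pmf of form (2): c(theta) * theta^m / C(p, m)
    p = len(mu)
    m = flips(x, mu)
    return c(theta, p) * theta ** m / comb(p, m)

p, mu, theta = 5, (1, 0, 1, 1, 0), 0.3
outcomes = list(product((0, 1), repeat=p))
total = sum(f(x, mu, theta) for x in outcomes)                      # should be 1
mass1 = sum(f(x, mu, theta) for x in outcomes if flips(x, mu) == 1) # q(1 | theta)
```

The ratio `mass1 / f(mu, mu, theta)` equals θ, which is the fact exploited later by the ratio estimator of the spread parameter.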
A tractable possible alternative to model (2) is

f(x|μ, θ) = θ^{m(x, μ)} (1 + θ)^{−p},   (4)

for μ ∈ {0, 1}^p and θ ∈ (0, 1). Here, the probabilities for vectors x ∈ {0, 1}^p decrease geometrically in the number of pixel flips that x is away from μ, that is, in terms of the pixel flip count m(x, μ), where the constant of proportionality is (1 + θ)^{−p}. Note that the total probability for those outcomes m pixel flips away from μ is C(p, m) θ^m (1 + θ)^{−p}. We have found form (2) to be more useful than this second exponential family form (4), so in the following sections we will work with form (2), and correspondingly display (3). Thus, henceforth c(θ) = (1 − θ)∕(1 − θ^{p+1}).
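The alternative form (4) can be checked the same way: under the constant (1 + θ)^(−p) the pmf normalizes, and the mass at each flip count m matches C(p, m) θ^m (1 + θ)^(−p). A sketch with our own helper names:

```python
from itertools import product
from math import comb

def flips(x, mu):
    # pixel flip (Hamming) count m(x, mu)
    return sum(a != b for a, b in zip(x, mu))

def f_alt(x, mu, theta):
    # pmf of form (4): theta^m * (1 + theta)^(-p)
    return theta ** flips(x, mu) * (1.0 + theta) ** (-len(mu))

p, mu, theta = 5, (0, 1, 1, 0, 0), 0.4
outcomes = list(product((0, 1), repeat=p))
total = sum(f_alt(x, mu, theta) for x in outcomes)
# total probability of outcomes exactly 2 flips away from mu,
# compared against the stated C(p, m) theta^m (1 + theta)^(-p)
mass2 = sum(f_alt(x, mu, theta) for x in outcomes if flips(x, mu) == 2)
claim2 = comb(p, 2) * theta ** 2 * (1 + theta) ** (-p)
```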

Properties of the model
We begin by considering a single p-dimensional binary random vector X whose pmf is given in display (2). For a fixed μ, the random pixel flip count, M, with possible values in the set {0, 1, … , p}, has mean

E[M] = θ(1 − (p + 1)θ^p + pθ^{p+1}) / ((1 − θ)(1 − θ^{p+1})).

Furthermore, the Fisher information about the parameter θ contained in a single observation X, or equivalently M (for a fixed μ), is

I(θ) = E[−(d²/dθ²) log q(M|θ)] = Var(M)/θ²,   (6)

where log(⋅) denotes the natural logarithm and q(M|θ) is as in display (3).
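The mean of M and the Fisher information can be checked numerically against display (3). In this sketch (our own function names) I(θ) is computed as Var(M)/θ², which follows from the exponential family structure of q(m|θ) with natural parameter log θ:

```python
def pixel_flip_pmf(theta, p):
    # q(m | theta) = c(theta) * theta^m for m = 0, 1, ..., p (display (3))
    c = (1 - theta) / (1 - theta ** (p + 1))
    return [c * theta ** m for m in range(p + 1)]

p, theta = 6, 0.35
probs = pixel_flip_pmf(theta, p)
mean_M = sum(m * pm for m, pm in enumerate(probs))
var_M = sum((m - mean_M) ** 2 * pm for m, pm in enumerate(probs))
# closed-form mean of M stated in the text
closed_mean = (theta * (1 - (p + 1) * theta ** p + p * theta ** (p + 1))
               / ((1 - theta) * (1 - theta ** (p + 1))))
info = var_M / theta ** 2  # Fisher information about theta in one observation
```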

Likelihood-based inference in a one-sample problem
Now consider an independent and identically distributed (iid) training dataset {X_1, X_2, … , X_N} where each X_i is a p-dimensional binary random vector. Following from display (2), the joint distribution of the observed vectors is specified by the joint pmf

f(x_1, … , x_N | μ, θ) = ∏_{i=1}^{N} c(θ) θ^{m(x_i, μ)} / C(p, m(x_i, μ)).   (7)

An expression equivalent to form (7) in terms of the counts of possible observation vectors {z_1, z_2, … , z_{2^p}} (where z_j denotes the jth such binary vector in some ordering of them) is

∏_{j=1}^{2^p} [c(θ) θ^{m(z_j, μ)} / C(p, m(z_j, μ))]^{n_j},   (8)

where n_j denotes the number of observations in the training dataset that are identical to z_j (that is, for j = 1, … , 2^p, n_j = |{x_i : x_i = z_j; i = 1, 2, … , N}|). From form (7), the one-sample model log-likelihood is

log L(μ, θ) = N log c(θ) + log(θ) ∑_{i=1}^{N} m(x_i, μ) − ∑_{i=1}^{N} log C(p, m(x_i, μ)).   (9)

We are interested in finding estimators for μ ∈ {0, 1}^p and θ ∈ (0, 1). First, a sensible estimator for μ is an observation vector in the training set with the highest frequency, that is, a (arbitrarily chosen in the case of ties) mode of the relative frequency distribution. That is, we consider

μ̂ = z_{j*} where j* = arg max{n_1, n_2, … , n_{2^p}}.   (10)

To develop an estimator for the variability parameter θ, temporarily fix the central pattern μ in form (9). Let m_i be the number of pixel flips that x_i is away from μ. The joint distribution of {M_1, M_2, … , M_N} characterized by the parameter θ has pmf

∏_{i=1}^{N} q(m_i | θ) = c(θ)^N θ^{∑_i m_i}.   (11)

The corresponding log-likelihood is

log L(θ) = N log c(θ) + log(θ) ∑_{i=1}^{N} m_i,   (12)

and the method of moments equation (for θ with μ fixed) sets E[M] equal to the sample mean of the m_i. Solution of this polynomial (in θ) equation for p of even moderate size is not obvious. But, a plausible estimator for θ can be based on the ratio of the sum of relative frequencies of observations one pixel flip away from μ to the relative frequency of those identical to μ, that is,

θ̂(μ) = [∑_{j : m(z_j, μ) = 1} n_j / N] / [n_{j(μ)} / N],   (13)

where z_{j(μ)} = μ. Note that in display (13) the denominator includes a single term and the numerator is a sum over at most p nonzero terms for a choice of μ. Then in light of forms (10) and (13), one (crude) estimator for θ is θ̂(μ̂).
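The estimators (10) and (13) are straightforward to compute from a sample. The following seeded simulation is our own sketch (not the paper's code): it draws observations from form (2), takes the modal observed vector as the estimate of the central pattern, and forms the ratio of counts one flip away to counts at the mode as the estimate of the spread parameter.

```python
import random
from itertools import product
from math import comb

random.seed(0)
p, mu, theta = 4, (1, 0, 1, 0), 0.25

def f(x, mu, theta):
    # pmf of form (2)
    m = sum(a != b for a, b in zip(x, mu))
    c = (1 - theta) / (1 - theta ** (len(mu) + 1))
    return c * theta ** m / comb(len(mu), m)

space = list(product((0, 1), repeat=p))
weights = [f(x, mu, theta) for x in space]
data = random.choices(space, weights=weights, k=5000)

counts = {}
for x in data:
    counts[x] = counts.get(x, 0) + 1
mu_hat = max(counts, key=counts.get)        # form (10): modal observation
n0 = counts[mu_hat]                         # observations identical to mu_hat
n1 = sum(n for x, n in counts.items()
         if sum(a != b for a, b in zip(x, mu_hat)) == 1)
theta_hat = n1 / n0                         # form (13): ratio estimator
```

With N = 5000 the mode recovers μ and θ̂(μ̂) lands close to the generating θ = 0.25.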
From the Law of Large Numbers applied to the relative frequencies of various x ∈ {0, 1}^p, it follows that the estimator for μ defined in form (10) is consistent in probability. That is, if {X_1, X_2, … , X_N} are iid observations from the distribution specified by form (2) and μ̂_N is as in display (10) (based on N observations), then

P[μ̂_N = μ] → 1 as N → ∞,   (14)

and in fact

P[μ̂_N = μ for all sufficiently large N] = 1.   (15)

It then follows that

P[θ̂_N(μ̂_N) = θ̂_N(μ)] → 1 as N → ∞,   (16)

where θ̂_N(μ̂_N) and θ̂_N(μ) are based on N observations and constructed according to form (13). It follows from the Law of Large Numbers applied to the observed relative frequencies of outcomes at μ and one pixel flip away from it, and the continuous mapping theorem, that the (unrealizable) estimator θ̂_N(μ) is consistent for θ in probability. So, combining forms (15) and (16), we ultimately have the consistency of θ̂_N(μ̂_N) for θ.
Possible improvement to θ̂_N(μ̂_N) can be obtained by adopting a one-step Newton correction. Writing L′_N(θ) and L″_N(θ) for the first two derivatives in θ of the log-likelihood (12), this is

θ̃_N(μ) = θ̂_N(μ) − L′_N(θ̂_N(μ)) / L″_N(θ̂_N(μ)).

From display (15), it follows that

P[θ̃_N(μ̂_N) = θ̃_N(μ)] → 1 as N → ∞.

Thus, the limiting behavior of θ̃_N(μ̂_N) is the same as that of θ̃_N(μ). Standard arguments for the asymptotic normality of one-step Newton corrections of consistent estimators applied here show that since θ̂_N(μ) is consistent for θ,

√N (θ̃_N(μ) − θ) →d N(0, 1/I(θ)),

where I(θ) is given in display (6). Hence,

√N (θ̃_N(μ̂_N) − θ) →d N(0, 1/I(θ)).

Furthermore,

I(θ̃_N(μ̂_N)) →p I(θ)

and

√(N I(θ̃_N(μ̂_N))) (θ̃_N(μ̂_N) − θ) →d N(0, 1),

and these can be applied to produce large N confidence limits and tests of H_0: θ = θ_0 in the one-sample model.
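A Fisher-scoring variant of the one-step correction (replacing −L″_N by its expectation N·I(θ)) is easy to sketch; iterating the same map also solves the method of moments equation numerically. This is our own illustrative code, assuming the Fisher information takes the form Var(M)/θ² implied by the exponential family structure of display (3):

```python
def moments(theta, p):
    # mean and variance of the pixel flip count M under q(m | theta)
    c = (1 - theta) / (1 - theta ** (p + 1))
    probs = [c * theta ** m for m in range(p + 1)]
    mean = sum(m * pm for m, pm in enumerate(probs))
    var = sum((m - mean) ** 2 * pm for m, pm in enumerate(probs))
    return mean, var

def one_step(theta, m_obs, p):
    # Fisher-scoring step: theta + score(theta) / (N * I(theta)), where
    # score(theta) = (sum(m_i) - N * E[M]) / theta and I(theta) = Var(M) / theta^2
    N = len(m_obs)
    mean, var = moments(theta, p)
    score = (sum(m_obs) - N * mean) / theta
    info = var / theta ** 2
    return theta + score / (N * info)

p = 6
m_obs = [0, 0, 0, 1, 0, 2, 0, 0, 1, 0]  # synthetic flip counts with mean 0.4
theta = 0.3                              # crude starting value
for _ in range(30):                      # iterating the one-step map finds the MLE
    theta = one_step(theta, m_obs, p)
```

At convergence E_θ[M] matches the sample mean of the flip counts, as the method of moments equation requires.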

MIXTURE MODELS WITH SYMMETRIC EXPONENTIAL FAMILY COMPONENTS
The symmetric single-component one-sample model is typically too simple for applications, and mixture distribution models are a way to obtain more realistic models. Mixture distributions are commonly used tools for modeling data thought to be generated by multiple underlying mechanisms, or sampled from multiple populations [5,19]. In this section we consider a mixture model with component distributions of the form (2). The finite mixture model with K components for a single observation X has pmf

h(x) = ∑_{j=1}^{K} π_j f(x | μ_j, θ_j),   (24)

where μ_j is the p-dimensional central binary vector and θ_j is the spread parameter for component j, the π_j are the mixing proportions (π_j > 0 and ∑_{j=1}^{K} π_j = 1), and f(x | μ_j, θ_j) is the pmf in display (2) evaluated at x with central pattern μ_j and variability parameter θ_j. Now consider N independent observations X_1, X_2, … , X_N that comprise a training set of size N and dimension p. The corresponding log-likelihood function is

∑_{i=1}^{N} log ( ∑_{j=1}^{K} π_j f(x_i | μ_j, θ_j) ).   (25)

We introduce a latent N-dimensional component assignment vector k with entries in {1, 2, … , K} to encode the mixture components from which each observation arises. Each element k(i) of k is a stochastic indicator variable corresponding to observation x_i, and k(i) can take values in {1, 2, … , K} for i = 1, 2, … , N. The numbering of components, and thus the exact mapping provided by k, is arbitrary as long as it faithfully represents which observations belong to the same classes.
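Since each component of form (2) is a proper pmf on {0, 1}^p, any convex combination of them is as well; a brute-force check confirms this. A minimal sketch with our own function names:

```python
from itertools import product
from math import comb

def f(x, mu, theta):
    # component pmf of form (2)
    p = len(mu)
    m = sum(a != b for a, b in zip(x, mu))
    c = (1 - theta) / (1 - theta ** (p + 1))
    return c * theta ** m / comb(p, m)

def mixture_pmf(x, pis, mus, thetas):
    # h(x) = sum_j pi_j * f(x | mu_j, theta_j)
    return sum(pi * f(x, mu, th) for pi, mu, th in zip(pis, mus, thetas))

pis = [0.5, 0.3, 0.2]
mus = [(1, 1, 0, 0), (0, 0, 1, 1), (1, 0, 1, 0)]
thetas = [0.2, 0.3, 0.1]
total = sum(mixture_pmf(x, pis, mus, thetas) for x in product((0, 1), repeat=4))
```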

Prior distributions for Bayes analysis
We give the mixing proportions π_1, … , π_K (which must be positive and sum to 1) a symmetric Dirichlet prior distribution with concentration parameter α/K, that is,

(π_1, … , π_K) ∼ Dirichlet(α/K, … , α/K).   (26)

Let N_j be the number of observations assigned to component j (the so-called occupation number). Given the mixing proportions π = (π_1, … , π_K), the indicator variables follow a distribution given by the pmf

P[k | π] = ∏_{i=1}^{N} π_{k(i)} = ∏_{j=1}^{K} π_j^{N_j}.   (27)

Then the occupation numbers have joint pmf

P[N_1, … , N_K | π] = (N! / ∏_{j=1}^{K} N_j!) ∏_{j=1}^{K} π_j^{N_j}.   (28)

The mixing proportions can be integrated out of form (27) [15] and the prior distribution for the indicator variables is

P[k] = (Γ(α) / Γ(N + α)) ∏_{j=1}^{K} Γ(N_j + α/K) / Γ(α/K).   (29)

The conditional prior distribution for a single indicator variable given all others is obtained from form (29) by holding all but the particular variable fixed and is specified by

P[k(i) = j | k_{−i}] = (N_{−i,j} + α/K) / (N − 1 + α),   (30)

where the subscript −i refers to all indices except i, and N_{−i,j} denotes the number of observations other than x_i that are associated with component j. In deriving form (30) we use the fact that the observations are a priori exchangeable. If we consider the limit of this structure as the number of components K → ∞, the conditional prior for k(i) has the limits [7,15] specified by

P[k(i) = j | k_{−i}] → N_{−i,j} / (N − 1 + α) for each component j with N_{−i,j} > 0,   (31)

P[k(i) ≠ k(i′) for all i′ ≠ i | k_{−i}] → α / (N − 1 + α).   (32)

These probabilities in displays (31) and (32) are the probabilities for seating a new customer in a Chinese Restaurant Process (CRP) [1,14] with infinitely many tables. The CRP is a discrete-time stochastic process and is used to specify a distribution over partitions of N observations. Hence, it can be used as a prior distribution for the latent variables k(i) which determine the component assignments of the x_i. The limit of the model described above is equivalent to a Dirichlet process mixture model [2,6] with a countably infinite number of components. It has been argued in [18] that with the particular parameterization of the Dirichlet prior over {π_1, … , π_K} used here, as N → ∞ the number of components typically required to model N observations becomes independent of K and is approximately of order α log(N).
Thus, the mixture model stays well defined as K → ∞, leading to what is known as an infinite mixture model [13,15]. Infinite mixtures are known to work well even when there are only a small finite number of components in the true mixture. As a result, infinite mixture models are commonly used whenever the true number of components is unknown. For our purposes we truncate the Dirichlet process mixture model to have K < ∞ components. K is chosen to be large, and the parameter α controls the number of nonempty components in a direct manner, larger α implying larger numbers of nonempty components a priori. The parameter α > 0 can be thought of as a prior sample size (in terms of the Pólya urn scheme [4]) or a total prior mass for the component assignment vector k.
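The role of α as a prior mass governing the number of nonempty components can be seen by simulating the sequential seating scheme implied by displays (31) and (32). A seeded sketch (our own code, not from the paper):

```python
import random

def crp_assignments(N, alpha, rng):
    # Seat N customers sequentially: customer i joins existing table j with
    # probability N_j / (i + alpha) and opens a new table with probability
    # alpha / (i + alpha), matching displays (31) and (32).
    assignments = []
    table_sizes = []
    for i in range(N):
        weights = table_sizes + [alpha]
        j = rng.choices(range(len(weights)), weights=weights)[0]
        if j == len(table_sizes):
            table_sizes.append(1)   # open a new table
        else:
            table_sizes[j] += 1     # join an occupied table
        assignments.append(j)
    return assignments, table_sizes

rng = random.Random(1)
assignments, table_sizes = crp_assignments(500, 2.0, rng)
```

With N = 500 and α = 2 the number of occupied tables concentrates around α log(N) ≈ 12, far fewer than N, illustrating the clustering behavior of the prior.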
Plausible prior distributions for the component parameters μ_j and θ_j of the mixture model are discrete uniform priors on the central patterns,

p_μ(μ_j) = 2^{−p}, j = 1, 2, … , K,   (33)

and, independently, the Jeffreys priors [8] for the variability parameters,

p_θ(θ_j) ∝ √I(θ_j),   (34)

for I(θ) given in display (6). One might also consider beta priors for the variability parameters θ_j; however, for our simulation studies and applications to real datasets discussed later we use the Jeffreys priors given in display (34). The following section details two MCMC sampling procedures we employ for generating samples from the posterior distributions of the mixture component parameters and the indicator variables.

MCMC samplers
For inference in the mixture model described in the previous section we employ two MCMC algorithms where the Markov chains use Gibbs and Metropolis updates. Gibbs sampling and the Metropolis algorithm are two well-known MCMC sampling techniques for complicated multivariate probability distributions when direct sampling is difficult. Based on whether the mixing proportions are included in the model (and hence, the sampling scheme) or not, we propose two variants of the MCMC sampler.
The mixing proportions are integrated out in the first version. In that case, the prior distribution for the N-dimensional component assignment vector k is given by Equation (29). So, under the present model assumptions, for a chosen K, the joint distribution of the X_i, the component parameters μ_j and θ_j, and the indicator vector k in the first version of the sampler has density proportional to

[∏_{i=1}^{N} f(x_i | μ_{k(i)}, θ_{k(i)})] P[k] ∏_{j=1}^{K} p_μ(μ_j) p_θ(θ_j),   (35)

where f(⋅) is as given in display (2), and p_μ(⋅) and p_θ(⋅) are given in Equations (33) and (34) respectively. At each iteration of the MCMC algorithm, one updates the component parameters and the indicator variables based on sampling schemes determined by the values of all other variables. For the central patterns μ_j, a Gibbs step is not obvious and so Metropolis updates are used. To make a Metropolis step for the variability parameters it is convenient to employ normal proposals, but that requires a parameterization with parameter space ℝ. Thus, we consider the logit transformation of the variability parameter, γ = log(θ/(1 − θ)) ∈ ℝ, which allows proposal of a value from a normal distribution centered at the value of the current iterate with fixed standard deviation σ. The assignment indicators are updated one at a time following relationship (30) using conventional Gibbs steps. The first version of the MCMC sampler, a Metropolis-within-Gibbs sampler, is then summarized below.

1. Choose appropriate values of K, α, σ, and set the number of iterations T.
2. Initialize μ_j and θ_j for j = 1, … , K, and k.
3. Repeat T times.

(a) Conditioned on the data {x_1, x_2, … , x_N}, k, and the corresponding θ_j, update the central pattern μ_j for each component j (j = 1, … , K) via a Metropolis step. A proposal μ*_j is obtained from a uniform distribution over vectors one pixel flip away from the current μ_j. The acceptance ratio then employed is

min(1, ∏_{i : k(i) = j} f(x_i | μ*_j, θ_j) / f(x_i | μ_j, θ_j)).

(b) Conditioned on the data {x_1, x_2, … , x_N}, k, and the corresponding μ_j, update the logit of the variability parameter, γ_j for component j (j = 1, … , K), using a density proportional to

p_θ(θ(γ_j)) |dθ/dγ_j| ∏_{i : k(i) = j} f(x_i | μ_j, θ(γ_j)),

via a Metropolis step by proposing a value γ*_j from a normal distribution centered at the current γ_j with fixed standard deviation σ.

(c) Conditioned on the data {x_1, x_2, … , x_N}, the μ_j, and the θ_j (or, equivalently, the γ_j), for each i = 1, 2, … , N, update the k(i)'s one at a time by sampling from discrete distributions on {1, 2, … , K} with probabilities proportional to

(N_{−i,j} + α/K) f(x_i | μ_j, θ_j).

A more time-efficient Metropolis-within-Gibbs algorithm can be formulated by including the mixing proportions π = (π_1, … , π_K) in the model rather than integrating them out. In that case, considering the symmetric Dirichlet prior for π given in display (26), the joint density for the X_i, the parameters μ_j and θ_j, the mixing proportions π, and the indicator vector k is proportional to

[∏_{i=1}^{N} π_{k(i)} f(x_i | μ_{k(i)}, θ_{k(i)})] p(π) ∏_{j=1}^{K} p_μ(μ_j) p_θ(θ_j),   (36)

where f(⋅) is as given in form (2), and p_μ(⋅) and p_θ(⋅) are given in Equations (33) and (34) respectively. At each iteration of this algorithm we sample the component parameters, the indicator variables, and (unlike the previous algorithm) the mixing proportions from their conditional posterior distributions, the mixing proportions via a standard Gibbs step. The second Metropolis-within-Gibbs sampler is summarized below.

1. Choose appropriate values of K, α, σ, and set the number of iterations T.
2. Initialize μ_j and θ_j (for j = 1, … , K), k, and the mixing proportions π.

Repeat T times.
(a) Conditioned on the data {x_1, x_2, … , x_N}, k, and the corresponding θ_j, update the central patterns μ_j via Metropolis steps as in the first algorithm.

(b) Conditioned on the data, k, and the corresponding μ_j, update the logits γ_j of the variability parameters via Metropolis steps as in the first algorithm.

(c) Conditioned on k, sample the mixing proportions π = (π_1, … , π_K) from the Dirichlet posterior with parameters (α/K + N_1, … , α/K + N_K).

(d) Conditioned on π, the μ_j, and the θ_j (or equivalently the γ_j), sample the k(i)'s independently, each from {1, 2, … , K}, with probabilities proportional to

π_j f(x_i | μ_j, θ_j).

Note that the k(i)'s can be sampled all at once, unlike the one-at-a-time update in the first algorithm, avoiding the recomputation of the N_{−i,j}'s for each i. Moreover, since for a given p the number of possible observation vectors is at most 2^p, the probabilities in (d) above need to be calculated for at most 2^p distinct vectors instead of N observations. This is helpful for datasets where N ≫ 2^p.
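The independent indicator update in (d), together with the caching over distinct observed vectors, can be sketched as follows (our own illustrative code; `f` implements form (2)):

```python
import random
from math import comb

def f(x, mu, theta):
    # component pmf of form (2)
    p = len(mu)
    m = sum(a != b for a, b in zip(x, mu))
    c = (1 - theta) / (1 - theta ** (p + 1))
    return c * theta ** m / comb(p, m)

def sample_indicators(data, pis, mus, thetas, rng):
    # Step (d): given pi, mu, theta, draw each k(i) independently with
    # P[k(i) = j] proportional to pi_j * f(x_i | mu_j, theta_j).
    K = len(pis)
    cache = {}   # component weights per distinct vector: at most 2**p entries
    out = []
    for x in data:
        if x not in cache:
            cache[x] = [pis[j] * f(x, mus[j], thetas[j]) for j in range(K)]
        out.append(rng.choices(range(K), weights=cache[x])[0])
    return out

rng = random.Random(3)
data = [(1, 1, 0), (1, 1, 0), (0, 0, 1)] * 50   # 150 observations, p = 3
ks = sample_indicators(data, [0.5, 0.5],
                       [(1, 1, 0), (0, 0, 1)], [0.1, 0.1], rng)
```

Because the weights depend on x_i only through its value, a dataset with N ≫ 2^p pays the pmf evaluation cost at most 2^p times per sweep.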
We use this second sampler for the Newspaper Reading Survey dataset discussed later where N = 10,858 and p = 7.
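Both samplers update the variability parameters on the logit scale. The seeded sketch below is our own minimal version of that Metropolis step; for simplicity it assumes a flat prior on the logit (the paper's Jeffreys prior of display (34) would add a prior ratio term to the acceptance probability):

```python
import math
import random

def inv_logit(g):
    return 1.0 / (1.0 + math.exp(-g))

def log_lik(theta, m_obs, p):
    # log of prod_i q(m_i | theta) = N log c(theta) + log(theta) * sum(m_i)
    c = (1 - theta) / (1 - theta ** (p + 1))
    return len(m_obs) * math.log(c) + math.log(theta) * sum(m_obs)

def metropolis_logit_step(gamma, m_obs, p, sigma, rng):
    # propose gamma* ~ N(gamma, sigma); accept with the usual Metropolis ratio
    # (flat prior on gamma assumed here, an illustrative simplification)
    prop = rng.gauss(gamma, sigma)
    log_ratio = (log_lik(inv_logit(prop), m_obs, p)
                 - log_lik(inv_logit(gamma), m_obs, p))
    return prop if math.log(rng.random()) < log_ratio else gamma

rng = random.Random(7)
p = 6
m_obs = [0, 1, 0, 0, 2, 0, 1, 0, 0, 0] * 20   # 200 synthetic flip counts
gamma, draws = 0.0, []
for t in range(3000):
    gamma = metropolis_logit_step(gamma, m_obs, p, 0.5, rng)
    if t >= 500:                               # discard a short burn-in
        draws.append(inv_logit(gamma))
post_mean = sum(draws) / len(draws)
```

The chain moves freely on ℝ while every retained draw maps back into (0, 1), which is the point of the logit parameterization.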

Method performance
We evaluated the performance of Bayes analysis of the mixture model under the following simulation settings. We considered different dataset sizes N, different dimensions p, and variants of the "true" number of components K_true, generating observations from an equal-weight (π_1 = π_2 = ⋯ = π_{K_true} = 1/K_true) mixture with: 1. N = 500, p = 4, and K_true = 2; 2. N = 1000, p = 6, and K_true = 7; and 3. N = 2000, p = 8, and K_true = 10.
For each of the above specifications we considered five distinct sets of central vectors 𝝁 and spread parameters 𝜽. Details regarding the sets of central vectors and variability parameters used to generate the datasets under each setting can be found in the Supporting information. We simulated 10 datasets from each set of model parameters 𝝁 and 𝜽. For each dataset, we compared the probabilities from the "true" mixture model with the probabilities estimated after T = 10,000 complete iterations of the MCMC sampler (the probabilities estimated considering all 10,000 iterations and those estimated after a burn-in of the first 1000 iterations are almost identical). Both versions of the MCMC sampler were implemented for each of the above specifications. The results were found to be similar. We report the results from the first version for specification (i) and from the second sampler for specifications (ii) and (iii).
The primary focus of this simulation study was to evaluate the proposed mixture model in terms of its performance in estimating the probabilities of observations x from the simulated datasets. Note that the number of distinct observations in a dataset of dimension p can be at most 2^p. The probability of an observation x from the "true" mixture model with equal mixture weights is

h(x) = (1/K_true) ∑_{j=1}^{K_true} f(x | μ_j, θ_j).   (37)

The estimated probability for the corresponding observation is computed by averaging across iterations the probabilities obtained from the mixture model identified at each iterate as

ĥ(x) = (1/T) ∑_{t=1}^{T} ∑_{j=1}^{K} π_j^{(t)} f(x | μ_j^{(t)}, θ_j^{(t)}),   (38)

where the superscript t denotes the tth iteration of the sampler.
We used the Kullback-Leibler divergence [11,12] (written as KL divergence) to quantify the difference between the two probability distributions given by h(x) and ĥ(x). The KL divergence originated in probability theory and information theory. It is a nonsymmetric, nonnegative measure of how well the "true" distribution of observations, or a precisely calculated theoretical distribution, is approximated by the model in question. A KL divergence of 0 indicates that the two distributions in question are identical. The KL divergence between h(x) and ĥ(x) is given by

KL(h ‖ ĥ) = ∑_x h(x) log ( h(x) / ĥ(x) ).   (39)
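The computation in Equation (39), including the 0·log 0 = 0 convention used later when h(x) is an empirical relative frequency, is direct. A short sketch with our own names:

```python
import math

def kl_divergence(h, h_hat):
    # KL(h || h_hat) = sum_x h(x) log(h(x) / h_hat(x)), Equation (39),
    # with terms where h(x) = 0 contributing zero (lim y->0+ y log y = 0)
    total = 0.0
    for hx, gx in zip(h, h_hat):
        if hx > 0:
            total += hx * math.log(hx / gx)
    return total

h = [0.5, 0.25, 0.25, 0.0]        # "true" probabilities over 4 outcomes
g = [0.25, 0.25, 0.25, 0.25]      # candidate (here uniform) probabilities
kl = kl_divergence(h, g)          # = 0.5 * log 2
```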
Note that the sum in Equation (39) is over all 2^p possible choices of observation vectors. For the Bayes analysis of the mixture model, the value of K was chosen to be at least twice the value of K_true for each specification (K is chosen to be 5, 15, and 20 for specifications (i), (ii), and (iii) respectively). The model parameter α and the MCMC algorithm parameter σ were chosen to be 20 and 0.5 respectively. The starting values for the logits of the variability parameters were chosen randomly from a normal distribution with mean zero and standard deviation σ. The p-dimensional central binary vectors μ_j were randomly initialized from a uniform distribution. The initial indicator variables were chosen as iid from a discrete uniform distribution on {1, 2, … , K}.
For each dataset, we compared the KL divergence computed between the "true" model probabilities and the probabilities estimated from the proposed mixture model with the KL divergence computed between the "true" model probabilities and the relative frequency distribution of the observed dataset. This was then repeated for every combination of model parameters 𝝁 and 𝜽 under each simulation setting. The comparisons show clearly the advantage of the Bayes analysis of the mixture model over completely nonparametric inference for distributions on {0, 1}^p. It can also be observed that the method's performance improves with increasing values of K_true.
Although details are not reported in the results here, the proposed Bayes analysis of the mixture model also accurately identified, under each setting, the sets of central vectors used to generate the respective datasets in terms of the frequencies of vectors observed across all the MCMC iterations.

Computational complexity
The

APPLICATION TO REAL DATASETS
In this section we demonstrate the performance of the proposed Bayesian mixture model analysis on two real datasets. These datasets were shared with us by Dr. A.C. Tamhane and his coauthors from their research study in [17]. We extend our sincere gratitude to the authors for their contribution. The first dataset is from a teaching style study conducted by [3]. The second dataset comes from a research project at the Media Management Center at Northwestern University.

Teaching style study
A survey of 468 fourth grade teachers from the counties of Lancashire and Cumbria was conducted by [3] in which each teacher was asked 38 questions, each of which was binary in nature (yes-no questions), about the way classes are handled. The dataset reports the answers to the following 6 questions from the 38. We ran five MCMC chains initialized at different sets of 𝝁 and 𝜽 chosen randomly. The initial set of indicator variables for each MCMC run was chosen randomly from a discrete uniform distribution. For each chain we set K = 30, α = 20, σ = 0.5, and T = 10,000 iterations. The results reported for this analysis are considered over all 10,000 complete iterations. A comparison of the relative frequencies for the 10 most frequent binary vectors observed in the dataset with the respective probabilities estimated from the mixture model for each MCMC run suggests that the Bayesian mixture model approach performs well in approximating the observed relative frequencies with the model probabilities. The estimated probability for a vector is computed according to formula (38). Similar performance results could also be inferred from the KL divergence measure. For the five MCMC runs, the KL divergence values were found to be 0.072, 0.071, 0.074, 0.074, and 0.072 respectively. For reference, the KL divergence measure for estimation with uniform probability is 0.664. When computing the KL divergence according to Equation (39), h(x) is taken to be the relative frequency of an outcome x observed in the dataset and ĥ(x) is the probability estimated from the mixture model for the corresponding x. (When h(x) is zero the contribution of the corresponding term is interpreted as zero because lim y→0+ y log(y) = 0.) The initialization of the parameters of the mixture model does not seem to have any effect on the method's performance.
The model tends to favor a high number of components, as can be observed from the distribution of the number of nonempty components chosen over T = 10,000 iterations at each run of the MCMC algorithm. This is also an effect of the parameter α, which is set at 20 for this study. The high value of α is close to setting a uniform prior on the component assignment vector k and thus results in a large number of nonempty components. K = 30 seems to be a good choice since the relative frequencies start dropping off after 27 or 28 components. The nine most common central vectors identified by the Bayesian mixture model are reported in Table 1. These nine vectors identified by the analysis are among the 11 most frequent outcomes observed in the dataset. It seems that there exist multiple groups of teachers whose answers to Q.1., Q.2., and Q.3. are "yes," suggesting a strict approach to teaching. There also exist two groups of teachers who are lenient in terms of their "no" answers to these three questions. As suggested by the central vectors mentioned above, teachers are almost equally split with their responses to Q.4., Q.5., and Q.6. irrespective of their strict or lenient approach. Note that the answers to Q.5. and Q.6. are supposed to be different, as indicated by the central vectors. The model also identifies a group of teachers who have responded "yes" to all six questions, indicating possible inattention to the survey.
The plots for iterates of the variability parameters that go with the above central vectors and their corresponding histograms suggest that the MCMC chains have mixed well. The most frequent vector observed in the dataset, that is (1,1,1,0,1,0), is associated with the smallest estimate (average of the iterates) of the spread parameter in each of the five MCMC runs. This indicates that the probability of occurrence of this particular outcome is mostly explained by the component with μ = (1,1,1,0,1,0), contributions to probabilities of other outcomes which are nonzero pixel flip counts away from (1,1,1,0,1,0) being small. For the rest of the central vectors identified and reported here, the estimates of the corresponding spread parameters are larger, thereby accounting for considerable contributions towards probabilities of other observed outcomes. The component with μ = (1,1,0,1,1,0) seems to give appreciable probability to the outcome (1,1,0,0,1,0) which is one pixel flip away, and vice versa. The components with centers (1,1,1,1,1,0) and (1,1,1,1,0,1) seem to assign substantial probability to outcomes at the other center (being two pixel flips away) as well as to (1,1,1,1,1,1) which is a single pixel flip away from both central vectors.

Newspaper Reading Survey
This dataset was generated from a mail survey conducted by the Newspaper Association of America on N = 10,858 readers. The subjects responded to seven questions (Q.i., i = 1, … , 7) on whether they read a particular newspaper on each day (i) of the week starting from Monday. There are 118 distinct multivariate binary vectors observed out of the possible 128 (2^p with p = 7). For this dataset, we again implemented both versions of the MCMC sampler.
Since N ≫ 2^p, the second Metropolis-within-Gibbs sampler (with the mixing proportions) was considerably faster per iteration than the first (where the proportions are integrated out). The results for the two versions were similar in terms of the most frequent central vectors identified, the mixing behavior of variability parameter iterates, the estimated probabilities of vectors, and the KL divergences. Reported below are the results for the second version of the MCMC sampler.
As for the previous analysis, we ran five MCMC chains with different starting values for the mixture component parameters, the indicator variables, and the mixing proportions. We ran each chain with K = 50 components, α = 20, σ = 0.5, and T = 10,000 iterations. The results reported are considered over all 10,000 complete iterations. The effective performance of the proposed modeling approach is evident from the KL divergence measure and the comparison of relative frequencies for the 10 most frequent vectors observed in the dataset with their respective probabilities estimated from the Bayesian mixture model for each MCMC run. The KL divergence values for the five MCMC runs were 0.005 each. For reference, the KL divergence for estimation with uniform probability is 2.407. The KL divergence for each run is computed according to Equation (39), where h(x) is the relative frequency of an outcome x observed in the dataset and ĥ(x) is the probability estimated from the mixture model for the corresponding x. (Again, when h(x) is zero the contribution of the corresponding term is interpreted as zero because lim y→0+ y log(y) = 0.) The initialization of the parameters does not seem to have any effect on the method's performance. As for the analysis with the Teaching Style Study dataset, similar observations can be inferred from the distribution of the number of components with nonempty sets of related observations identified over T = 10,000 iterations for each run of the algorithm. For clarity we only mention those numbers of nonempty components which have occurred for more than 100 iterations out of the 10,000. The value of α is set to be 20 in this study as well. Again, for such a high value of α the mixture model identifies a large number of nonempty components. K = 50 seems to be a good choice for the number of components since the relative frequencies start dropping off after 46 or 47 components.
Across all five MCMC runs combined, the nine most frequent central vectors identified by the analysis for the Newspaper Reading Survey dataset are reported in Table 2. Seven of these nine vectors comprise the seven most frequent outcomes observed in the dataset.
The identified central vectors indicate that there are groups of readers who usually read the particular newspaper more on weekends, especially on Sundays, as compared to the middle days of the week. There are readers who read the newspaper on all seven days of the week. This group of readers can be thought of as "subscribers" to the newspaper. The analysis also points out that there is a sizable group of readers who do not read the newspaper at all.

HANDLING MULTIVARIATE BINARY VECTORS WITH MISSING ENTRIES
The occurrence of missing values, and hence of incomplete datasets, is not uncommon in applications. In this section we extend our proposed Bayesian mixture model analysis to datasets with missing entries, under the implicit assumption that missingness is not informative, that is, that entries are missing at random.
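Because a component pmf that factorizes over coordinates lets each missing-at-random entry marginalize out in closed form, the component likelihood of an incomplete vector reduces to a product over the observed coordinates only. The sketch below assumes the illustrative form f(x | μ, θ) ∝ exp(−θ · d(x, μ)), with d counting coordinate mismatches; the paper's exact parameterization may differ.

```python
import math

def observed_likelihood(x, mu, theta):
    """Component likelihood of a partially observed binary vector.

    Assumes (illustratively) f(x | mu, theta) with per-coordinate
    factor exp(-theta * 1[x_j != mu_j]) / (1 + exp(-theta)).
    Missing entries are coded None; summing such a factor over
    x_j in {0, 1} gives exactly 1, so missing coordinates drop out.
    """
    lik = 1.0
    for xj, mj in zip(x, mu):
        if xj is None:
            continue  # missing at random: coordinate integrates to 1
        mismatch = 1.0 if xj != mj else 0.0
        lik *= math.exp(-theta * mismatch) / (1.0 + math.exp(-theta))
    return lik
```

This is the quantity a Gibbs step would use when assigning an incomplete observation to a mixture component.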
Two hundred and three of the 435 observations have missing entries. There are 160 distinct 16-dimensional binary vectors in the full dataset without any missing entries. We implemented five runs of the MCMC sampler, each for T = 10,000 iterations, with the full dataset as well as with the dataset split according to the "Class" variable. The five MCMC chains had different starting values for the mixture component parameters and the indicator variables. In each case K was specified to be 30, the concentration parameter to be 20, and the remaining tuning parameter to be 0.5. The results presented are based on all 10,000 iterations. The KL divergence measures computed (for the full dataset and for the datasets separated by political affiliation) between the relative frequencies of the complete observations and the respective probabilities estimated from the Bayes mixture model analysis suggest important differences.
The six most common central vectors identified by the model for the full dataset in terms of their frequencies across all iterations (all five MCMC runs combined) are reported in Table 3. Table 4 reports the six most frequent central vectors identified by the model for the datasets separated by the "Class" variable.
The central vectors identified by the model in the two datasets separated by the "Class" variable reveal the difference in opinions between the "democrat" representatives and their "republican" counterparts. The "republican" representatives tend to respond "n" to the third key vote, as opposed to the "democrat" congressmen, who tend to respond "y." On the other hand, the former seem to be in favor of the fourth, fifth, and sixth issues, unlike the latter; the same can be said of the 12th, 13th, and 14th key votes. Most democrats are in favor of the seventh, eighth, and ninth issues. Some republicans seem to agree with the democrats on the 16th key vote.
For most of the component central vectors reported, the chains of the corresponding variability parameters seem to be mixing well. It is possible for a single central vector to be associated with more than one mixture component, with different spread measures. The values of the variability parameter estimates corresponding to the central vectors reported in this analysis indicate that the components formed from these vectors make appreciable contributions to the probabilities of outcomes that are more than one coordinate flip away. For the full dataset, the number of nonempty components identified by the mixture model across the T = 10,000 iterations tends to range from 17 to 29; it is between 15 and 25 and between 9 and 20 for the datasets with Class = "democrat" and Class = "republican" respectively.

For the full dataset, we obtain the matrix of relative frequencies in the MCMC iterates (posterior probabilities) with which two congressmen are put into the same component. The heatmap of the resulting matrix is shown in Figure 1. With high posterior probability the 168 "republican" congressmen tend to belong to the same component, as indicated by the lower left block. Similar observations can be made for the remaining 267 "democrat" congressmen, as seen from the upper right block. The heatmap shown here corresponds to the fourth run of the MCMC sampler; heatmaps from the other four runs display a similar block structure.
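The co-clustering matrix underlying such a heatmap can be computed directly from the sampled indicator variables. The sketch below assumes a generic (T, n) array of post-burn-in component labels; the array layout is an illustrative convention, not the paper's implementation.

```python
import numpy as np

def coclustering_matrix(indicator_draws):
    """Posterior same-component probabilities from MCMC output.

    indicator_draws: array of shape (T, n) whose (t, i) entry is the
    component label of observation i at iteration t.
    Returns an (n, n) matrix of relative frequencies with which each
    pair of observations shares a component -- the quantity displayed
    as a heatmap in the text.
    """
    draws = np.asarray(indicator_draws)
    T, n = draws.shape
    M = np.zeros((n, n))
    for z in draws:
        M += (z[:, None] == z[None, :])  # 1 where the pair co-clusters
    return M / T
```

Reordering the rows and columns of the returned matrix by a grouping variable (here, the "Class" labels) produces the block structure visible in Figure 1.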

CONCLUSIONS
We have proposed a symmetric distribution on multivariate vectors of 0's and 1's, based on the Euclidean distance between two p-dimensional binary vectors. The distribution has a simple form and easily interpretable parameters. We then described the construction of a Dirichlet process mixture model within the Bayesian nonparametric framework in which the mixture components take this distributional form. Placing a Dirichlet prior on the component weights (or equivalently, the mixing proportions) has the attractive property of allowing the weights to be integrated out, so that the latent indicators can be sampled marginally using the CRP representation of the Dirichlet process. An alternative sampler is also described.

[TABLE 3: Six most frequent central vectors identified by the mixture model in the full HouseVotes84 dataset, reported in descending order of frequency; each row lists the 16 binary coordinates of a central vector.]

The Dirichlet process mixture model is a suitable modeling strategy when the "true" number of components is unknown, as it does not require the user to specify a "correct" number of components. Since the number of distinct binary vectors for any dimension p is limited to 2^p, we effectively use the mixture model with a finite number of components K. For practical purposes K is chosen to be large, and the number of nonempty components of the mixture is controlled through the Dirichlet process concentration parameter. Inference is made from the posterior samples generated using the proposed Metropolis-within-Gibbs sampling techniques. The implementation of the model is illustrated through simulation studies and applications to two real datasets. The method shows good performance in estimating the relative frequencies (or "true" probabilities from the generating mixture) for the vectors observed in the datasets.
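A component distribution of the kind described above can be sketched as follows. The explicit parameterization f(x | μ, θ) = exp(−θ d(x, μ)) / (1 + e^{−θ})^p, with d the number of coordinate mismatches (the squared Euclidean distance for binary vectors), is an illustrative assumption rather than the paper's exact form; its closed-form normalizer follows from Σ_k C(p, k) e^{−θk} = (1 + e^{−θ})^p.

```python
import math

def pmf(x, mu, theta):
    """Sketch of a symmetric pmf on {0,1}^p centered at mu.

    f(x | mu, theta) = exp(-theta * d(x, mu)) / (1 + exp(-theta))^p,
    where d(x, mu) counts coordinate mismatches. Larger theta
    concentrates mass near the central vector mu; theta = 0 gives
    the uniform distribution on {0,1}^p.
    """
    p = len(mu)
    d = sum(xj != mj for xj, mj in zip(x, mu))  # Hamming distance
    return math.exp(-theta * d) / (1.0 + math.exp(-theta)) ** p
```

The probability of an outcome depends on x only through its distance from μ, which is what makes the family symmetric about the central vector.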
Although the model does not produce a single "best" set of mixture component parameters, the central vectors generating the simulated datasets are accurately identified in terms of the frequencies of vectors observed across the MCMC iterations. We have also extended the mixture model to allow for missing entries in the data vectors. As seen from the application to a real dataset, the method seems to identify potentially interpretable central vectors and to allow assessment of similarities between incomplete observations based on posterior probabilities of arising from the same mixture component. The proposed modeling approach draws a clear parallel to (the absolutely standard) mixture modeling with multivariate normal components in the case of continuous distributions. The fact that multivariate normal mixture modeling is itself closely related to multivariate kernel density estimation makes this approach a sort of "binary vector 'kernel' pmf estimation" methodology. Besides being as flexible (in terms of modeling an arbitrary distribution for binary vectors) as other modeling approaches such as restricted Boltzmann machines (RBMs), our proposed methodology remains far more interpretable, is relatively parsimonious (in comparison to RBMs), and does not suffer from their serious limitations [9,10]. It also provides a rational basis for clustering training cases on the basis of posterior probabilities of coming from the same component, a real strength of the conceptualization (especially where data vectors can be incomplete).

ACKNOWLEDGMENT
Open access funding enabled and organized by Projekt DEAL.

CONFLICT OF INTEREST
The authors declare no potential conflicts of interest.

DATA AVAILABILITY STATEMENT
The datasets and codes that support the findings of this study are publicly available in the GitHub repository 'Bayes-Mixture-Analysis-for-Binary-Vectors' at https:// github.com/abhicc/Bayes-Mixture-Analysis-for-Binary-Vectors.git.