A Note on Ising Network Analysis with Missing Data

The Ising model has become a popular psychometric model for analyzing item response data. The statistical inference of the Ising model is typically carried out via a pseudo-likelihood, as the standard likelihood approach suffers from a high computational cost when there are many variables (i.e., items). Unfortunately, the presence of missing values can hinder the use of pseudo-likelihood, and a listwise deletion approach for missing data treatment may introduce a substantial bias into the estimation and sometimes yield misleading interpretations. This paper proposes a conditional Bayesian framework for Ising network analysis with missing data, which integrates a pseudo-likelihood approach with iterative data imputation. An asymptotic theory is established for the method. Furthermore, a computationally efficient Pólya-Gamma data augmentation procedure is proposed to streamline the sampling of model parameters. The method's performance is shown through simulations and a real-world application to data on major depressive and generalized anxiety disorders from the National Epidemiological Survey on Alcohol and Related Conditions (NESARC).

Analyzing cross-sectional binary item response data with the Ising model (Ising, 1925) is common in network psychometric analysis. This analysis is typically performed based on a conditional likelihood (Besag, 1974), because the standard likelihood function is computationally infeasible when many variables are involved. In this direction, Bayesian and frequentist methods have been developed in which sparsity-inducing priors or penalties are combined with the conditional likelihood for learning a sparse network structure (Yuan & Lin, 2007; Mazumder & Hastie, 2012; Van Borkulo et al., 2014; Epskamp & Fried, 2018; Li et al., 2019; Marsman et al., 2022). In addition, the Ising model is closely related to Item Response Theory (IRT) models (Holland, 1990; Anderson & Yu, 2007). The log-multiplicative association models (Anderson & Yu, 2007), which are special cases of the Ising model, can be used as item response theory models and yield results very similar to those of IRT models. Furthermore, the Ising model and the conditional likelihood have been used for modeling the local dependence structure in locally dependent IRT models (Ip, 2002; Chen et al., 2018).
Due to its construction, the conditional likelihood does not naturally handle data with missing values, despite the omnipresence of missing data in psychometric practice. To deal with missing values in an Ising network analysis, listwise deletion (Haslbeck & Fried, 2017; Fried et al., 2020) and single imputation (e.g., Huisman, 2009; Armour et al., 2017; Lin et al., 2020) are typically performed, which arguably may not be the best practice. In particular, it is well-known that listwise deletion is statistically inefficient and requires the Missing Completely At Random (MCAR) assumption (Little & Rubin, 2019) to ensure consistent estimation.
Moreover, a naïve imputation procedure, such as mode imputation, likely introduces bias into parameter estimation. A more sophisticated imputation procedure must be developed to ensure statistical validity and computational efficiency.
In this note, we propose an iterative procedure for learning an Ising network. The proposed procedure combines iterative imputation via Full Conditional Specification (FCS; Liu et al., 2014; van Buuren, 2018) with Bayesian estimation of the Ising network. We show that FCS leads to estimation consistency when the conditional models are chosen to take logistic forms. In terms of computation, we propose a joint Pólya-Gamma augmentation procedure by extending the Pólya-Gamma augmentation for logistic regression (Polson et al., 2013). It allows us to efficiently sample the parameters of the Ising model.
Simulations are conducted to compare the proposed procedure with estimation based on listwise deletion and on single imputation. Finally, the proposed procedure and a complete-case analysis are applied to study the network of Major Depressive Disorder (MDD) and Generalised Anxiety Disorder (GAD) based on data from the National Epidemiological Survey on Alcohol and Related Conditions (NESARC; Grant et al., 2003). In this analysis, data missingness is mainly due to two screening items for MDD. That is, a respondent's responses to the rest of the MDD items are missing if they answered "no" to both screening items. This missingness mechanism is Missing At Random (MAR; Little & Rubin, 2019). The complete-case analysis of missing data caused by screening items is known to be problematic in the network psychometrics literature (Borsboom et al., 2017; McBride et al., 2023).
Our Bayesian estimate of the edge coefficient between the two screening items is negative based on the complete cases, which can be seen as a result of Berkson's paradox (De Ron et al., 2021). In contrast, the proposed method makes use of all the observed data entries and obtains a positive estimate of this edge coefficient. An identifiability result for the Ising model under this special missing data setting (given in the Appendix), the item content, and a simulation study mimicking this setting all suggest that the result given by the proposed method is sensible.

Ising Model
Consider a respondent answering $J$ binary items. Let $Y = (Y_1, \ldots, Y_J)^\top \in \{0,1\}^J$ be a binary random vector representing the respondent's responses. We say $Y$ follows an Ising model if its probability mass function satisfies
$$P(Y = y) = \frac{\exp\left(y^\top S y / 2\right)}{c(S)}, \quad (1)$$
where $S = (s_{ij})_{J \times J}$ is a $J$ by $J$ symmetric matrix that contains the parameters of the Ising model and
$$c(S) = \sum_{y \in \{0,1\}^J} \exp\Big(\frac{1}{2}\sum_{j=1}^J \sum_{k=1}^J s_{jk} y_j y_k\Big)$$
is a normalizing constant. The parameter matrix $S$ encodes a network with the $J$ items being the nodes. More specifically, an edge is present between nodes $i$ and $j$ if and only if the corresponding entry $s_{ij}$ is nonzero. If an edge exists between nodes $i$ and $j$, then $Y_i$ and $Y_j$ are conditionally dependent given the rest of the variables. Otherwise, the two variables are conditionally independent.
In Ising network analysis, the goal is to estimate the parameter matrix $S$. The standard likelihood function is computationally intensive when $J$ is large, as it requires computing the normalizing constant $c(S)$, which involves a summation over all $2^J$ response patterns. To address this computational issue, Besag (1975) proposed a conditional likelihood, obtained by aggregating the conditional distributions of $Y_j$ given $Y_{-j} = (Y_1, \ldots, Y_{j-1}, Y_{j+1}, \ldots, Y_J)^\top$, for $j = 1, \ldots, J$, where each conditional distribution takes a logistic regression form; under (1), its logit is $s_{jj}/2 + \sum_{k \neq j} s_{jk} y_k$. More precisely, the conditional likelihood with one observation $y$ is defined as
$$p^*(y \mid S) = \prod_{j=1}^J P(Y_j = y_j \mid Y_{-j} = y_{-j}, S). \quad (2)$$
A disadvantage of the conditional likelihood is that it requires a fully observed dataset, because missing values cannot be straightforwardly marginalized out of (2). In what follows, we discuss how missing data can be treated under the conditional likelihood.
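As a concrete illustration, the conditional (pseudo-)log-likelihood can be evaluated in a few lines; this is a hedged sketch in Python (the function name and array conventions are ours, not the paper's), which avoids the $2^J$ summation entirely by working only with the logistic conditionals.

```python
import numpy as np

def pseudo_loglik(S, Y):
    """Log conditional (pseudo-)likelihood of an Ising model.

    S : (J, J) symmetric parameter matrix (diagonal entries act as intercepts).
    Y : (N, J) binary responses in {0, 1}.
    Under P(Y = y) = exp(y' S y / 2) / c(S), each conditional P(Y_j | Y_{-j})
    is logistic with linear predictor s_jj / 2 + sum_{k != j} s_jk y_k.
    """
    total = 0.0
    for j in range(S.shape[0]):
        # Y @ S[:, j] includes the k = j term y_j * s_jj; remove it, add s_jj / 2
        eta = 0.5 * S[j, j] + Y @ S[:, j] - Y[:, j] * S[j, j]
        # Bernoulli log-likelihood; log(1 + e^eta) computed stably
        total += np.sum(Y[:, j] * eta - np.logaddexp(0.0, eta))
    return total
```

The cost is $O(NJ^2)$ per evaluation, in contrast to the $O(2^J)$ cost of the normalizing constant $c(S)$.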

Proposed Method
Consider a dataset with $N$ observations. Let $\Omega_j \subset \{1, \ldots, N\}$ be the subset of observations whose data on item $j$ are missing. For each observation $i$ and item $j$, $y_{ij}$ denotes the observed response if $i \notin \Omega_j$; otherwise, $y_{ij}$ is missing. Thus, the observed data consist of $\Omega_j$ and $y_{ij}$, for $i \in \{1, \ldots, N\} \setminus \Omega_j$ and $j = 1, \ldots, J$.
The proposed procedure iterates between two steps: (1) imputing the missing values $y_{ij}$, $i \in \Omega_j$, $j = 1, \ldots, J$, via a full conditional specification, and (2) sampling from the posterior distribution of $S$ given the most recently imputed data. Let $t$ be the current iteration number. Further, let $y_i^{(t-1)} = (y_{i1}^{(t-1)}, \ldots, y_{iJ}^{(t-1)})^\top$, $i = 1, \ldots, N$, be the imputed data from the previous iteration, where $y_{ij}^{(t-1)}$ is a value imputed in the $(t-1)$th iteration if $i \in \Omega_j$. For the $t$th iteration, the imputation and sampling steps are described as follows.
Imputation. We initialize the imputation in the $t$th iteration with the previously imputed dataset $(y_{i1}^{(t-1)}, \ldots, y_{iJ}^{(t-1)})$, $i = 1, \ldots, N$. Then, we run a loop over all the items, $j = 1, \ldots, J$. In step $j$ of the loop, we impute $y_{ij}$ for all $i \in \Omega_j$ given the most recently imputed data, and then update the current dataset by incorporating the newly imputed values of $y_{ij}$.
The imputation of each variable $j$ is based on the conditional distribution of $Y_j$ given $Y_{-j}$. Under the Ising model, this conditional distribution takes a logistic regression form.
For computational reasons to be discussed in the sequel, we introduce an auxiliary parameter vector $\beta_j = (\beta_{j1}, \ldots, \beta_{jJ})^\top$ as the coefficients in this logistic regression, instead of directly using $S$ from the previous iteration to sample the missing $y_{ij}$s. Unlike the constraint $s_{ij} = s_{ji}$ in the symmetric matrix $S$, no constraints are imposed on $\beta_j$, $j = 1, \ldots, J$, which makes the sampling computationally efficient; see the discussion in Section 2.4. The imputation of variable $j$ consists of the following two steps:

1. Sample the auxiliary parameter vector $\beta_j^{(t)}$ from the posterior distribution $p^{(t,j)}(\beta_j) \propto \pi_j(\beta_j) \prod_{i=1}^N P(Y_j = y_{ij} \mid Y_{-j} = y_{i,-j}, \beta_j)$, (3), where $\pi_j(\beta_j)$ is the prior distribution for the auxiliary parameters $\beta_j$ and the responses are the most recently imputed ones.

2. Sample $y_{ij}^{(t)}$, $i \in \Omega_j$, from the logistic conditional distribution of $Y_j$ given $Y_{-j}$ under $\beta_j^{(t)}$, and update the current dataset by incorporating the newly imputed values of $y_{ij}$, $i \in \Omega_j$.

We emphasize that only the missing values are updated. For $i \notin \Omega_j$, $y_{ij}^{(t)}$ is always the observed value of $y_{ij}$. After the loop over all the items, we have the imputed dataset $(y_{i1}^{(t)}, \ldots, y_{iJ}^{(t)})$, $i = 1, \ldots, N$, as the output of this imputation step.
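One imputation sweep over the items can be sketched as follows; this is a minimal Python illustration (names and conventions are ours), which takes the auxiliary coefficient vectors as given, whereas the full algorithm redraws each $\beta_j$ from its posterior before imputing item $j$.

```python
import numpy as np

def impute_sweep(Y, missing, betas, rng):
    """One full-conditional-specification sweep over the J items.

    Y       : (N, J) current imputed dataset; observed entries never change.
    missing : (N, J) boolean mask, True where y_ij was originally missing.
    betas   : (J, J) array; betas[j, j] is the intercept of item j's logistic
              conditional and betas[j, k] (k != j) the slope on item k.
              In the full algorithm these are redrawn from their posterior
              before each sweep; here they are taken as given.
    """
    N, J = Y.shape
    for j in range(J):
        rows = np.where(missing[:, j])[0]
        if rows.size == 0:
            continue
        b = betas[j]
        # logistic conditional using the most recent imputations of the other
        # items: drop the y_j * b_jj term and add the intercept b_jj
        eta = b[j] + Y[rows] @ b - Y[rows, j] * b[j]
        p = 1.0 / (1.0 + np.exp(-eta))
        Y[rows, j] = (rng.random(rows.size) < p).astype(Y.dtype)
    return Y
```

Because the loop imputes item $j$ with the values of items $1, \ldots, j-1$ already updated in the same sweep, it matches the sequential (Gibbs-like) character of full conditional specification.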
Sampling $S$. Given the most recently imputed data $y_i^{(t)}$, $i = 1, \ldots, N$, update $S^{(t)}$ by sampling from the pseudo-posterior distribution $p^{(t)}(S) \propto \pi(S) \prod_{i=1}^N p^*(y_i^{(t)} \mid S)$, where $\pi(S)$ is the prior distribution for the Ising parameter matrix $S$; recall that $\prod_{i=1}^N p^*(y_i^{(t)} \mid S)$ is the conditional likelihood.
Figure 1 visualizes the steps performed in the proposed method. Note that it is unnecessary to sample the parameter matrix $S$ during the burn-in period or in every iteration after the burn-in period; thus, we employ a thinning step after the burn-in period, sampling $S$ only once every $t_0$ iterations. This is done both to decrease the computational cost and to reduce the auto-correlation in the imputed data. Moreover, we outline the proposed algorithm in Algorithm 1. The final estimate of $S$ is obtained by averaging all the $S^{(t)}$ obtained after the burn-in period, i.e.,
$$\hat{S} = \frac{1}{M - M_0} \sum_{t \in \{(M_0+1)t_0, \ldots, M t_0\}} S^{(t)},$$
where $M = \lfloor T/t_0 \rfloor$ and $M_0 = \lfloor T_0/t_0 \rfloor$, with $T$ the total number of iterations and $T_0$ the burn-in size. The computational details, including the sampling of the auxiliary parameters and the Ising parameter matrix and a discussion of the computational complexity, are given in Section 2.4.
We remark that our method imputes the missing variables one by one for each observation. This choice is made because simultaneously imputing all the missing variables is typically computationally infeasible, especially when some observation units have many missing values. Simultaneous imputation requires evaluating the joint distribution of the missing variables given the observed ones, whose computational complexity grows exponentially with the number of missing values. In contrast, the proposed method is based on univariate conditional distributions, which is computationally much more feasible. We also note that the proposed method has several variants that should also work well; these variants are discussed in Section 4.

Statistical Consistency
As our method is not a standard Bayesian inference procedure, we provide an asymptotic theory under the frequentist setting to justify its validity. In particular, we show that the parameter $S$ sampled from the pseudo-posterior distribution converges to the true parameter $S_0$, under the assumptions that the Ising model is correctly specified and the data are MAR.
Consider one observation with a complete data vector $Y = (Y_1, \ldots, Y_J)^\top$. Further, let $Z = (Z_1, \ldots, Z_J)^\top$ be a vector of missing indicators, where $Z_j = 1$ if $Y_j$ is observed and $Z_j = 0$ otherwise. We further let $Y_{obs} = \{Y_j : Z_j = 1, j = 1, \ldots, J\}$ and $Y_{mis} = \{Y_j : Z_j = 0, j = 1, \ldots, J\}$ be the observed and missing entries of $Y$, respectively. Consider the joint distribution of the observable data $(Y_{obs}, Z)$, taking the form
$$P(Y_{obs} = y_{obs}, Z = z \mid S, \phi) = \sum_{y_{mis}} q(z \mid y, \phi)\, \frac{\exp\left(y^\top S y / 2\right)}{c(S)}, \quad (6)$$
where $\exp\left(y^\top S y / 2\right)/c(S)$ is the probability of $Y = y$ under the Ising model, $q(z \mid y, \phi)$ denotes the conditional probability of $Z = z$ given $Y = y$, and $\phi$ denotes the unknown parameters of this distribution. The MAR assumption, also known as the ignorable missingness assumption, means that the conditional distribution $q(z \mid y, \phi)$ depends on $y$ only through the observed entries, i.e., $q(z \mid y, \phi) = q(z \mid y_{obs}, \phi)$. In that case, (6) can be factorized as
$$P(Y_{obs} = y_{obs}, Z = z \mid S, \phi) = q(z \mid y_{obs}, \phi) \Big( \sum_{y_{mis}} \frac{\exp\left(y^\top S y / 2\right)}{c(S)} \Big). \quad (7)$$
Consequently, the inference of $S$ does not depend on the unknown distribution $q(z \mid y, \phi)$.
As shown in Liu et al. (2014), the MAR assumption, together with additional regularity conditions, ensures that the iterative imputation of the missing responses converges to the imputation distribution under a standard Bayesian procedure as the number of iterations and the sample size $N$ go to infinity. A key to this convergence result is the compatibility of the conditional models used in the imputation step: the logistic regression models are compatible with the Ising model as a joint distribution. The validity of the imputed samples further ensures the consistency of the estimated Ising parameter matrix. We summarize this result in Theorem 1.
Theorem 1. Assume the following assumptions hold: 1) the Markov chain for the missing data, generated by the iterative imputation algorithm (Algorithm 1), is positive Harris recurrent and thus admits a unique stationary distribution; 2) the missing data process is ignorable; 3) a regularity condition holds for the prior distributions of the Ising model parameters and the auxiliary parameters, as detailed in the supplementary material. Let $\pi_N(S)$ be the posterior density of $S$ implied by the stationary distribution of the proposed method. Given the true parameter matrix $S_0$ of the Ising model and any $\varepsilon > 0$, the posterior mass $\int_{B_\varepsilon(S_0)} \pi_N(S)\,dS$ converges to one in probability as $N \to \infty$, where $B_\varepsilon(S_0) = \{S : \|S - S_0\| < \varepsilon\}$ is the open ball of radius $\varepsilon$ centered at $S_0$. That is, $\pi_N(S)$ concentrates at $S_0$.
We provide some intuition for this consistency result. Suppose that the data are generated by an Ising model. The iterative imputation method ensures that, in the stationary regime, the missing values are drawn from conditional distributions that are compatible with the Ising model, so that the imputed datasets behave asymptotically like valid completions of the observed data.

Computational Details
In what follows, we discuss the specification of the prior distributions and the sampling of the auxiliary parameters $\beta_j$ and the Ising model parameters $S$.
Sampling $\beta_j$. We set independent mean-zero normal priors for the entries of $\beta_j$. For the intercept parameter $\beta_{jj}$, we use a weakly informative prior by setting the variance to 100.
For the slope parameters $\beta_{jk}$, $k \neq j$, we set a more informative prior with variance 1, given that these parameters correspond to the off-diagonal entries of $S$, which are sparse and typically do not take extreme values. The sampling of the auxiliary parameters $\beta_j$, following (3), is essentially a standard Bayesian logistic regression problem. We achieve it with a Markov chain Monte Carlo (MCMC) sampler known as the Pólya-Gamma sampler (Polson et al., 2013), which involves the following two steps:

1. Given the current value of $\beta_j$, independently sample $N$ augmentation variables, one per observation, each from a Pólya-Gamma distribution.

2. Given the $N$ augmentation variables, sample $\beta_j^{(t)}$ from a $J$-variate normal distribution.
The details of these two steps are given in the supplementary material, including the forms of the Pólya-Gamma distributions and the mean and covariance matrix of the $J$-variate normal distribution. We choose the Pólya-Gamma sampler because it is easy to construct and computationally efficient. In particular, it is much easier to implement than Metropolis-Hastings samplers, which often need tuning to achieve good performance.
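The two steps above can be sketched for a generic Bayesian logistic regression. This is a hedged illustration (function names ours): the PG(1, c) draw uses the truncated infinite gamma-sum representation of the Pólya-Gamma distribution rather than the exact rejection sampler of Polson et al. (2013), and the prior is a simple $N(0, \sigma^2 I)$.

```python
import numpy as np

def sample_pg_approx(c, rng, K=200):
    """Approximate PG(1, c) draws via the truncated series
    omega = (1 / (2 pi^2)) * sum_k g_k / ((k - 1/2)^2 + c^2 / (4 pi^2)),
    with g_k ~ Gamma(1, 1).  An exact sampler exists; the truncation at
    K terms here is only for illustration."""
    c = np.atleast_1d(np.asarray(c, dtype=float))
    k = np.arange(1, K + 1)
    g = rng.gamma(1.0, 1.0, size=(c.size, K))
    denom = (k - 0.5) ** 2 + (c[:, None] / (2.0 * np.pi)) ** 2
    return (g / denom).sum(axis=1) / (2.0 * np.pi ** 2)

def pg_gibbs_step(beta, X, y, prior_var, rng):
    """One Polya-Gamma augmented Gibbs update for Bayesian logistic
    regression with a N(0, prior_var * I) prior on beta."""
    omega = sample_pg_approx(X @ beta, rng)      # step 1: N augmentation draws
    kappa = y - 0.5
    prec = X.T @ (omega[:, None] * X) + np.eye(X.shape[1]) / prior_var
    cov = np.linalg.inv(prec)
    mean = cov @ (X.T @ kappa)
    return rng.multivariate_normal(mean, cov)    # step 2: multivariate normal draw
```

The conditional normality of $\beta$ given the augmentation variables is what makes the update tuning-free, in contrast to a Metropolis-Hastings step.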
We comment on the computational complexity of sampling $\beta_j$. Sampling from the Pólya-Gamma distributions has complexity $O(NJ)$, and sampling from the $J$-variate normal distribution has complexity $O(NJ^2) + O(J^3)$. Consequently, one loop over all the $\beta_j$, $j = 1, \ldots, J$, has complexity $O((N + J)J^3)$.
Sampling $S$. To sample $S$, we work with its half-vectorization $\alpha = \mathrm{vech}(S)$, the vector in $\mathbb{R}^{J(J+1)/2}$ that stacks the lower-triangular entries of $S$ (including the diagonal). It is easy to see that $\mathrm{vech}(\cdot)$ is a one-to-one mapping between $\mathbb{R}^{J(J+1)/2}$ and the $J \times J$ symmetric matrices. Therefore, we impose a prior distribution on $\alpha$ and sample $\alpha^{(t)}$ in the $t$th iteration when $S$ is sampled. We then let $S^{(t)} = \mathrm{vech}^{-1}(\alpha^{(t)})$.
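The half-vectorization and its inverse are straightforward to implement; a minimal sketch (function names ours):

```python
import numpy as np

def vech(S):
    """Half-vectorization: stack the lower triangle of a symmetric J x J
    matrix (including the diagonal) into a vector of length J(J+1)/2."""
    return S[np.tril_indices(S.shape[0])]

def vech_inv(alpha, J):
    """Inverse map: rebuild the symmetric matrix from its half-vectorization."""
    S = np.zeros((J, J))
    S[np.tril_indices(J)] = alpha
    # mirror the strict lower triangle into the upper triangle
    return S + np.tril(S, -1).T
```

For example, `vech` of the symmetric matrix with rows (1, 2) and (2, 3) is (1, 2, 3), and `vech_inv` recovers the matrix exactly.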
Recall that a thinning step is performed and that $t_0$ is the gap between two samples of $S$. Let $t$ be a multiple of $t_0$ and let $\alpha^{(t - t_0)} = \mathrm{vech}(S^{(t - t_0)})$ be the previous value of $\alpha$. The sampling of $\alpha^{(t)}$ is also achieved by a Pólya-Gamma sampler, which involves the following two steps, similar to the sampling of $\beta_j$.
1. Given $\alpha^{(t - t_0)}$, independently sample $NJ$ augmentation variables, each from a Pólya-Gamma distribution.

2. Given the $NJ$ augmentation variables, sample $\alpha^{(t)}$ from a $J(J+1)/2$-variate normal distribution.
The details of these two steps are given in the supplementary material. We note that the computational complexity of sampling the $NJ$ augmentation variables is $O(NJ^2)$, and that of sampling $\alpha^{(t)}$ is $O(NJ^5) + O(J^6)$, resulting in an overall complexity of $O((N + J)J^5)$.
Comparing the complexities of the imputation step and the step for sampling $S$, we notice that the latter is computationally much more intensive. This is why we choose to impute data using the auxiliary parameters $\beta_j$ rather than the Ising network parameters $S$: the iterative imputation mixes much faster in terms of computation time.
In addition, we only sample $S$ every $t_0$ iterations, for a reasonably large $t_0$, to avoid a high computational cost and to reduce the auto-correlation between the imputed datasets.
We remark that Marsman et al. (2022) considered a similar Ising network analysis problem based on fully observed data, proposing a Bayesian inference approach with a spike-and-slab prior for learning $S$. Their Bayesian inference is also based on a Pólya-Gamma sampler. However, they combined Gibbs sampling with the Pólya-Gamma sampler, updating one parameter of $S$ at a time. This Gibbs scheme often mixes more slowly than the joint update of $S$ in the proposed method and is thus computationally less efficient.
The proposed Pólya-Gamma sampler may be integrated into the framework of Marsman et al. (2022) to improve their computational efficiency.

Numerical Experiments
We illustrate the proposed method and show its power via simulation studies and a real-world data application. In Section 3.1, we conduct two simulation studies, evaluating the proposed method under two MAR scenarios, one of which involves missingness due to screening items.
A further simulation study applies the method to a 15-node Ising model under an MCAR mechanism; details can be found in the supplementary material.

Simulation Studies

Study I. We generate data from an Ising model with $J = 6$ variables. Missing values are generated under an MAR setting that is not MCAR. The proposed method is then compared with Bayesian inference based on (1) listwise deletion and (2) single imputation, where the single imputation is based on the imputed data from the $T$th iteration of Algorithm 1, recalling that $T_0$ is the burn-in size.
We configure the true parameter matrix $S_0$ as follows. Since $S_0$ is symmetric, we only need to specify its upper-triangular and diagonal entries. For the upper-triangular entries (i.e., $s_{jl}$, $j < l$), we randomly set 50% of them to zero to introduce a moderately sparse setting. The nonzero parameters are then generated by sampling from the uniform distribution over the set $[-1, -0.4] \cup [0.4, 1]$. The intercept parameters $s_{jj}$, $j = 1, \ldots, J$, are set to zero. The true parameter values are given in the supplementary material. Missing data are simulated by masking particular entries under an MAR mechanism. In particular, we set $z_{i6} = 1$, so that the sixth variable is always observed. We further allow the missingness probabilities of the first five variables (i.e., $P(z_{ij} = 0)$, $j = 1, \ldots, 5$) to depend on the value of $y_{i6}$. The specific settings of $P(z_{ij} = 0 \mid y_{i6})$, $j = 1, \ldots, 5$, are detailed in the supplementary material. Data are generated under this Ising model and MAR mechanism for four sample sizes, $N = 1{,}000$, $2{,}000$, $4{,}000$, and $8{,}000$. For each sample size, 50 independent replications are created.
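A data-generating setup of this kind can be sketched as follows; this is an illustrative Python version (single-site Gibbs sampling for the Ising draws, and made-up missingness probabilities rather than those in the paper's supplement).

```python
import numpy as np

def gibbs_ising(S, N, burn=500, thin=5, rng=None):
    """Draw N approximate samples from an Ising model with parameter matrix S
    via single-site Gibbs sampling; thinning reduces (but does not remove)
    autocorrelation between kept states.  A rough sketch only."""
    if rng is None:
        rng = np.random.default_rng(1)
    J = S.shape[0]
    y = rng.integers(0, 2, size=J).astype(float)
    out = np.empty((N, J))
    kept, t = 0, 0
    while kept < N:
        for j in range(J):
            # conditional logit s_jj/2 + sum_{k != j} s_jk y_k under exp(y'Sy/2)/c(S)
            eta = 0.5 * S[j, j] + y @ S[:, j] - y[j] * S[j, j]
            y[j] = float(rng.random() < 1.0 / (1.0 + np.exp(-eta)))
        t += 1
        if t > burn and (t - burn) % thin == 0:
            out[kept] = y
            kept += 1
    return out

def mar_mask(Y, p_if_0=0.1, p_if_1=0.4, rng=None):
    """MAR masking: the last item is always observed; the missingness
    probability of each other item depends on the last item's value
    (illustrative probabilities, not those of the paper)."""
    if rng is None:
        rng = np.random.default_rng(2)
    p = np.where(Y[:, -1] == 1, p_if_1, p_if_0)[:, None]
    mask = rng.random(Y.shape) < p
    mask[:, -1] = False  # last item fully observed
    return mask
```

Because the masking probability depends only on the always-observed last item, the mechanism is MAR but not MCAR, matching the design of Study I.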
Three methods are compared: the proposed method, Bayesian inference with single imputation, and Bayesian inference based on the complete cases from listwise deletion. The Bayesian inference for complete data is performed by sampling parameters from the posterior implied by the pseudo-likelihood and a normal prior, which is a special case of the proposed method without the iterative imputation steps. All methods share the same initial values $s_{jl}^{(0)} \sim U(-0.1, 0.1)$, $1 \le j \le l \le J$. For the proposed method, we set the number of Markov chain Monte Carlo (MCMC) iterations to $T = 5{,}000$, the burn-in size to $T_0 = 1{,}000$, and the thinning parameter to $k_0 = 10$. This setup yields a total of 400 MCMC samples of the Ising parameter matrix $S$. The same MCMC length and burn-in configuration are used for parameter inference in the single imputation and complete-case analyses.
Figure 2 shows the mean squared errors (MSEs) of the estimated edge and intercept parameters under different sample sizes and methods. The MSE for each parameter $s_{jl}$ is defined as $\frac{1}{50}\sum_{k=1}^{50} (\hat{s}_{k,jl} - s_{0,jl})^2$, where $\hat{s}_{k,jl}$ denotes the estimate from the $k$th replication and $s_{0,jl}$ the true value. Each box in panel (a) corresponds to the 15 edge parameters, and each box in panel (b) corresponds to the 6 intercept parameters. We notice that the listwise deletion procedure introduces bias into the edge and intercept estimation, so the MSEs of certain parameters do not decay toward zero as the sample size grows. In contrast, both the proposed method and the single imputation method offer accurate parameter estimation, with MSEs decaying toward zero as the sample size increases. Notably, the proposed method is substantially more accurate than the single imputation method, suggesting that aggregating over multiple imputed datasets improves the estimation accuracy. Furthermore, for smaller sample sizes, the complete-case analysis yields slightly more accurate estimates of the edge parameters than the single imputation method. Across the four sample sizes, the median computational times of the proposed method were 33, 50, 88, and 185 seconds, respectively.

Study II: Missingness due to screening items. Missingness due to screening items is commonly encountered in practice and poses challenges to network analysis (Borsboom et al., 2017; McBride et al., 2023). It occurs, for example, in surveys where initial screening questions determine respondents' eligibility or relevance for subsequent questions. Suppose respondents indicate a lack of relevant experience (i.e., their answers to the screening items are all negative). In that case, they are not prompted to answer certain follow-up questions, making the missingness of these responses depend on their answers to the screening questions and, thus, MAR. Our real data
example in Section 3.2 involves two screening items, which results in a large proportion of missing data.
We consider a simulation setting involving two screening items to evaluate the proposed method's performance in this situation. Similar to Study I, we consider a setting with six items, the first two of which are the screening items. The full data are generated under an Ising model whose parameters are given in the supplementary material; the corresponding network has six positive edges, including one between the two screening items. The responses to the screening items are always observed. When an observation's responses to both screening items are zero, the responses to the remaining four items are regarded as missing.
We consider a single sample size $N = 8{,}000$ and generate 50 independent datasets. We apply the proposed method, the single imputation method, and the complete-case analysis. For each estimation procedure, we set the number of MCMC iterations to $T = 5{,}000$, the burn-in size to $T_0 = 1{,}000$, and the thinning parameter to $k_0 = 10$. The methods are compared in terms of the MSEs and biases of the parameter estimates.
Table 1 presents the results. For all the edge parameters except $s_{12}$, the three estimation methods work well, though the single imputation method is slightly less accurate, as indicated by its slightly larger MSEs. However, the complete-case estimate of $s_{12}$, the edge between the two screening items, is substantially negatively biased. At the same time, the imputation-based methods remain accurate, with the proposed method having a smaller MSE than the single imputation method. This result confirms that running a complete-case analysis on data involving screening items is problematic, while the imputation-based methods, especially the proposed method, yield valid results.
We now discuss this result. The negative bias of $\hat{s}_{12}$ in the complete-case analysis is due to selection bias, typically referred to as Berkson's paradox (De Ron et al., 2021). The complete-case analysis excludes all response vectors with negative responses to both screening items. Consequently, among the complete cases, a positive response on one screening item strongly suggests a negative response on the other, regardless of the responses to the rest of the items.
This results in a spuriously negative conditional association between the two screening items. In fact, one can show theoretically that the frequentist maximum pseudo-likelihood estimate of $s_{12}$ is negative infinity. The finite estimate of $s_{12}$ in Table 1 is due to the shrinkage effect of the prior distribution that we impose. The proposed method, on the other hand, makes use of the observed frequency of the $(0,0)$ response pattern on the two screening items, in addition to the frequencies of the fully observed response vectors. As shown by the identifiability result in the Appendix, these frequencies are sufficient for identifying all the parameters of the Ising model.
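The degeneracy can be seen already in a marginal 2 x 2 table of the two screening items. The toy counts below are made up for illustration: the full data show a positive association, but listwise deletion empties the $(0,0)$ cell, driving the sample log-odds ratio to negative infinity, which mirrors the degenerate pseudo-likelihood estimate of $s_{12}$.

```python
import math

# Made-up counts of the two screening items' response patterns (y1, y2):
# a positive association, with (0, 0) the most common pattern.
counts_full = {(0, 0): 500, (0, 1): 150, (1, 0): 150, (1, 1): 200}

# Listwise deletion removes every (0, 0) respondent, since those
# respondents skipped the remaining items.
counts_cc = {k: (0 if k == (0, 0) else v) for k, v in counts_full.items()}

def log_odds_ratio(c):
    """Sample log-odds ratio log(n11 * n00 / (n10 * n01)) of a 2 x 2 table."""
    num = c[(1, 1)] * c[(0, 0)]
    den = c[(1, 0)] * c[(0, 1)]
    return -math.inf if num == 0 else math.log(num / den)
```

Here `log_odds_ratio(counts_full)` is positive (about 1.49), while `log_odds_ratio(counts_cc)` is negative infinity: selection on the screening items alone reverses the sign of the association.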

A Real Data Application
We analyze the dataset from the 2001-2002 National Epidemiological Survey on Alcohol and Related Conditions (NESARC), which offers valuable insights into alcohol consumption and associated issues in the U.S. population (Grant et al., 2003). The dataset consists of 43,093 civilian non-institutionalized individuals aged 18 and older. In this analysis, we focus on two specific sections of the survey that concern two highly prevalent mental health disorders: Major Depressive Disorder (MDD) and Generalized Anxiety Disorder (GAD). Because MDD and GAD have high symptom overlap (Hettema, 2008) and often co-occur (Hasin et al., 2005), it is important to perform a joint analysis of the symptoms of the two disorders and study their separation. In particular, Blanco et al. (2014) performed a factor analysis based on the same data and found that the two disorders have distinct latent structures. We reanalyze the data, hoping to gain insights from the network perspective.
Following Blanco et al. (2014), we consider 15 items: nine items measuring MDD symptoms (D1-D9) and six items measuring GAD symptoms (A1-A6). These items ask the participants whether they have recently experienced certain symptoms; see Table 2 for short descriptions. After eliminating cases with entirely missing values across the 15 items, 42,230 cases remain in the dataset. Note that any "Unknown" responses in the original data are converted into missing values. The dataset exhibits a substantial degree of missingness, with only 2,412 complete cases on the 15 items, approximately 6% of the total. The missing rate for each item is given in Table 2. Importantly, items D1 and D2 function as screening items and thus have very low missing rates. Respondents did not need to answer items D3-D9 if their responses to both D1 and D2 were "No" or "Unknown", resulting in high missing rates for these items. This pattern implies that the missing data in this study are not MCAR. The GAD items A1-A6 also have a screening item, which results in the high missing rates of A1 through A6. Following the treatment in Blanco et al. (2014), these screening items are not included in the current analysis.
We apply the proposed method and the complete-case analysis to the data. For each method, 10 MCMC chains with random starting values are used, each with 10,000 MCMC iterations and a burn-in size of 5,000. The Gelman-Rubin statistics are always below 1.018, confirming satisfactory convergence of all 120 parameters for both methods.

The estimated network structures for the MDD and GAD items are presented in Figure 3, where an edge is shown between two variables when the absolute value of the estimated parameter is greater than 0.5. We emphasize that this threshold is applied only for visualization, rather than for edge selection. Consequently, the edges in Figure 3 should be interpreted as edges with large estimated parameters, rather than as truly nonzero edges. The nine MDD items are shown as blue nodes at the bottom, and the six GAD items as orange nodes at the top. Blue and orange edges represent positive and negative parameter estimates, respectively, and the line thickness of an edge indicates its magnitude.

A clear difference between the two methods is the edge between D1, "depressed mood most of the day, nearly every day", and D2, "markedly diminished interest or pleasure in all, or almost all, activities most of the day, nearly every day", the two screening questions in the survey that all participants responded to. The estimated parameter for this edge has a large absolute value under both methods, but it is negative in the complete-case analysis and positive under the proposed method. As revealed by Study II in Section 3.1, the negative estimate of the edge between the screening items given by the complete-case analysis is spurious. Considering the content of these items, we believe that the estimate from the proposed method is more sensible. Other than this edge, the remaining structures of the two networks are similar, with some differences. In particular, the complete-case analysis yields more edges than the proposed method; for example, the edges A4-A5, A1-D5, D1-D6, D1-D7, D1-D8, and D8-D9 appear in the estimated network from the complete-case analysis but not in that of the proposed method, while only two edges, A3-A5 and D3-D4, are present in the network estimated by the proposed method but absent from the complete-case network. We believe this is due to the higher estimation variance of the complete-case analysis caused by its relatively small sample size.
Finally, our analysis shows that the symptoms of each disorder tend to connect densely with each other in the Ising network, while symptoms are only loosely, but positively, connected between the two disorders. The edges between the two disorders identify the overlapping symptoms, including "D4: Insomnia or hypersomnia" and "A6: Sleep disturbance", "A2: Easily fatigued" and "D6: Fatigue/loss of energy", and "A3: Difficulty concentrating" and "D8: Diminished concentration". These results suggest that MDD and GAD are two well-separated mental health disorders, despite their high symptom overlap and frequent co-occurrence. This confirms the conclusion of Blanco et al. (2014) that GAD and MDD are closely related but distinct nosological entities.

Concluding Remarks
In this paper, we propose a new method for Ising network analysis in the presence of missing data. The proposed method integrates iterative imputation into a Bayesian inference procedure based on the conditional likelihood. An asymptotic theory is established that guarantees the consistency of the proposed estimator. Furthermore, a Pólya-Gamma machinery is proposed for sampling the Ising model parameters, which yields efficient computation. The power of the proposed method is shown via simulations and a real-data application.
An R package has been developed that will be made publicly available upon the acceptance of the paper.
The current work has several limitations that call for future theoretical and methodological developments. First, this manuscript concentrates on parameter estimation for the Ising model in the presence of missing data. The problem of edge selection (Ročková, 2018; Noghrehchi et al., 2021; Borsboom, 2022; Marsman et al., 2022), however, requires future investigation. There are several possible directions. One direction is to view it as a multiple-testing problem and develop procedures that control a familywise error rate or the false discovery rate for the selection of edges. To do so, one needs a way to quantify the uncertainty of the proposed estimator. This is nontrivial, as the proposed method is not a standard Bayesian procedure, and we still lack a theoretical understanding of its asymptotic distribution. In particular, it is unclear whether the Bernstein-von Mises theorem that connects Bayesian and frequentist estimation holds under the current setting. Another direction is to view it as a model selection problem.
In this direction, we can use sparsity-inducing priors to better explore the Ising network structure when it is sparse. We believe that the proposed method, including the iterative imputation and the Pólya-Gamma machinery, can be adapted when the normal prior is replaced with the spike-and-slab prior considered in Marsman et al. (2022). This adaptation can be achieved by adding a few Gibbs sampling steps. In addition, it is of interest to develop an information criterion that is computationally efficient yet statistically consistent. This may be achieved by computing an information criterion, such as the Bayesian information criterion, for each imputed dataset and then aggregating them across the multiple imputations.
Finally, the proposed method has several variants that may be useful for problems of different scales. For problems of relatively small scale (i.e., when $J$ is small), we may perform data imputation using the sampled $S$ instead of the auxiliary parameters $\beta_j$'s. This choice makes the algorithm computationally more intensive, as the sampling of $S$ has a high computational complexity. On the other hand, it may make the estimator statistically more efficient, as it avoids estimating the auxiliary parameters $\beta_j$'s, whose dimension is higher than that of $S$. For very large-scale problems, one may estimate the Ising model parameters based only on the auxiliary parameters $\beta_j$'s. For example, we may estimate $s_{ij}$ by averaging the value of $(\beta_{ij} + \beta_{ji})/2$ over the iterations. This estimator is computationally more efficient than the proposed one, as it avoids sampling $S$ given the imputed datasets. It should still be consistent but may be statistically slightly less efficient than the proposed estimator.
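The large-scale variant described above can be sketched in a few lines; the array layout for the posterior draws is our own assumption:

```python
import numpy as np

def edge_from_auxiliary(beta_draws):
    """Estimate S from MCMC draws of the auxiliary matrices B = (beta_jk).

    beta_draws: (T, J, J) array of T posterior draws. Each s_jk is
    approximated by averaging (beta_jk + beta_kj) / 2 over iterations,
    which is the same as symmetrizing the posterior mean of B.
    """
    B_bar = np.mean(beta_draws, axis=0)   # (J, J) posterior mean
    return (B_bar + B_bar.T) / 2.0
```

The returned matrix is symmetric by construction, mirroring the constraint $s_{jk} = s_{kj}$ of the Ising model.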

Appendix
A Identifiability of the Ising Model with Two Screening Items
We investigate the identifiability of the Ising model parameters when there are two screening items.
Proposition 1. Consider an Ising model with $J \geq 3$, true parameters $S_0 = (s^0_{ij})_{J \times J}$, and the first two items being the screening items. We define $p_0(y) := P(Y = y \mid S_0)$ for any $y \in A = \{(x_1, \ldots, x_J)^\top \in \{0,1\}^J : x_1 = 1 \text{ or } x_2 = 1\}$, and $p_0(0,0) := P(Y_1 = 0, Y_2 = 0 \mid S_0)$ under the Ising model. Then there does not exist an Ising parameter matrix $S \neq S_0$ such that $p_0(y) = P(Y = y \mid S)$ for all $y \in A$ and $p_0(0,0) = P(Y_1 = 0, Y_2 = 0 \mid S)$ under the Ising model.
Proof. We first prove the statement for $J \geq 4$. Suppose that $p_0(y) = P(Y = y \mid S)$ for all $y \in A$ and $p_0(0,0) = P(Y_1 = 0, Y_2 = 0 \mid S)$. We will prove that $S = S_0$.
We start by considering items 1, 2, 3, 4. Define the set $A_{3,4} = \{(x_1, \ldots, x_J)^\top \in A : x_5 = \cdots = x_J = 0\}$, and note that $A_{3,4}$ has 12 elements. Using $y_d = (1, 0, 0, 0, 0, \ldots, 0)^\top \in A_{3,4}$ as the baseline pattern, for any $y \in A_{3,4}$ with $y \neq y_d$ we have
$$\log\big(p_0(y)/p_0(y_d)\big) = \log\big(P(Y = y \mid S)/P(Y = y_d \mid S)\big).$$
This gives 11 linear equations for the 10 parameters $s_{ij}$, $i, j \leq 4$. Simplifying these equations yields 1) two linear equations for $(s_{11}, s_{12}, s_{22})$ and 2) $s_{1i} = s^0_{1i}$, $s_{2i} = s^0_{2i}$, $s_{ij} = s^0_{ij}$ for all $i, j \in \{3, 4\}$. We repeat the above calculation for any four-item set involving items 1 and 2. Choosing any item pair $i, j > 2$, $i \neq j$, and repeating the calculation with the patterns in $A_{i,j} = \{(x_1, \ldots, x_J)^\top \in A : x_l = 0, l \neq 1, 2, i, j\}$, we obtain $s_{1i} = s^0_{1i}$, $s_{2i} = s^0_{2i}$, $s_{ij} = s^0_{ij}$ for all $i, j > 2$, $i \neq j$. With the above argument and (S1), it only remains to show $s_{11} = s^0_{11}$ to prove $S = S_0$. To prove $s_{11} = s^0_{11}$, we use $p_0(0,0)/p_0(y_d) = P(Y_1 = 0, Y_2 = 0 \mid S)/P(Y = y_d \mid S)$ together with the already established equalities $s_{ij} = s^0_{ij}$ for all $i, j > 2$, $i \neq j$. This yields an equation whose right-hand side is a sum over the set $A_0 = \{(x_1, \ldots, x_J)^\top \in \{0,1\}^J : x_1 = x_2 = 0\}$. As the right-hand side of this equation is a strictly monotone decreasing function of $s_{11}$, we know that $s_{11} = s^0_{11}$ is the only solution. This proves the case $J \geq 4$.
We now move on to the case $J = 3$. We consider $A = \{(x_1, x_2, x_3)^\top \in \{0,1\}^3 : x_1 = 1 \text{ or } x_2 = 1\}$ and $y_d = (1, 0, 0)^\top$. Using $y_d$ as the baseline, for any $y \in A$, $y \neq y_d$, we construct five linear equations given by $\log\big(p_0(y)/p_0(y_d)\big) = \log\big(P(Y = y \mid S)/P(Y = y_d \mid S)\big)$.
As the right-hand side is a strictly monotone decreasing function of $s_{11}$, we know that $s_{11} = s^0_{11}$ is the only solution to the above equation. This proves the case $J = 3$, which completes the proof.
Following the same proof strategy, it can further be shown that $s_{11}$, $s_{12}$, and $s_{22}$ are not identified in the complete-case analysis, while the remaining parameters are. This is consistent with the result of Simulation Study II, where the other model parameters are accurately estimated while the estimates of $s_{11}$, $s_{12}$, and $s_{22}$ deviate substantially from the corresponding true parameters.

B Technical proofs

B.1 A lemma for imputation consistency
Following the derivation of Section 2.3 in the main text, under the ignorable missingness assumption, the posterior distribution of $S$ satisfies $\pi_N(S) \propto p(y_{obs} \mid S)\pi(S)$. Under the same Bayesian model, one can impute the missing values from the posterior predictive distribution.
That is, the posterior predictive distribution of $y_{i,mis}$, $i = 1, \ldots, N$, denoted $p_N(y_{1,mis}, \ldots, y_{N,mis})$, is obtained by integrating the conditional distribution of the missing entries given $S$ with respect to the posterior $\pi_N(S)$, where $p_i(y_{i,obs} \mid S) = \sum_{\{y_{ij} : z_{ij} = 0\}} \exp(y_i^\top S y_i/2)/c(S)$. Further, suppose that Algorithm 1 converges to a stationary distribution, and let $\tilde p_N(y_{1,mis}, \ldots, y_{N,mis})$ be the implied posterior predictive distribution given the observed data. Then Lemma S1, which is an adaptation of Theorem 1 of Liu et al. (2014), suggests that $\tilde p_N$ and $p_N$ converge to each other in the total variation sense.

Lemma S1. Assume the following: 1) the Markov chain for the missing data, generated by the iterative imputation algorithm (Algorithm 1), is positive Harris recurrent and thus admits a unique stationary distribution, denoted $\tilde p_N$; 2) the missing data process is ignorable; 3) a regularity condition holds for the prior distributions of the Ising model parameters and the auxiliary parameters, as detailed in Assumption 1. Then the implied posterior predictive distribution $\tilde p_N$ is consistent with the true posterior predictive distribution $p_N$, i.e.,
$$d_{TV}(\tilde p_N, p_N) = \max_{y_{1,mis}, \ldots, y_{N,mis}} \left| \tilde p_N(y_{1,mis}, \ldots, y_{N,mis}) - p_N(y_{1,mis}, \ldots, y_{N,mis}) \right| \to 0, \quad (S3)$$
in probability as $N \to \infty$.
To prove Lemma S1, we start by defining a Gibbs sampling process for the joint Ising model, as outlined in Algorithm 2. This algorithm is constructed for theoretical purposes only, since its step of sampling $S$ is intractable. The aim of our proof is to show that the posterior predictive distribution of the missing data given the observed data implied by Algorithm 1, $\tilde p_N$, converges in total variation to the true posterior predictive distribution $p_N$. We first establish that $\tilde p_N$ in fact converges to the stationary distribution of the Gibbs chain, denoted $p'_N$.

B.2 Proof of Theorem 1
We use $\alpha$, the half-vectorization of $S$, throughout the proof. Denote by $\pi_N(\alpha)$ the posterior density of $\alpha$ implied by the stationary distribution of the proposed method. Let $Y$ be the data matrix, with $Y_{mis}$ and $Y_{obs}$ being its missing and observed parts, respectively. Let $\Theta \subset \mathbb{R}^{J(J+1)/2}$ and let $E \subset \Theta$ be open and bounded. It can be verified that: 1) the $f_N$ have continuous third derivatives; 2) $f_N \to f$ pointwise for some $f$; 3) $f''(\alpha_0)$ is positive definite; 4) $f'''(\alpha_0)$ is uniformly bounded on $E$; 5) each $f_N$ is convex and $f'(\alpha_0) = 0$. Then, according to the generalized posterior concentration theorem (see Theorem 5, Miller, 2021), for any $\varepsilon > 0$,
$$\int_{\{\alpha : \|\alpha - \alpha_0\| \leq \varepsilon\}} \pi_N(\alpha) \, d\alpha \to 1$$
in probability as $N \to \infty$. This concludes the proof, given that the first term converges to 1 (i.e., (S16)) and the second term converges to 0 (i.e., (S17)) in probability as $N \to \infty$.
C Computation details for sampling $\beta_j$ and $S$

Observing the logistic form of the conditional distribution used when sampling the auxiliary parameters $\beta_j$, we employ Pólya-Gamma data augmentation for efficient sampling. A random variable $\omega$ follows the Pólya-Gamma distribution $PG(b, c)$, with parameters $b > 0$ and $c \in \mathbb{R}$, if it is equal in distribution to a weighted sum of independent Gamma random variables,
$$\omega \overset{d}{=} \frac{1}{2\pi^2} \sum_{k=1}^{\infty} \frac{g_k}{(k - 1/2)^2 + c^2/(4\pi^2)},$$
where the $g_k \sim \mathrm{Gamma}(b, 1)$ independently, the Gamma distribution with shape parameter $b$ and rate parameter 1.
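For illustration, the weighted Gamma-sum representation above suggests a simple approximate sampler obtained by truncating the infinite series (the function name and truncation level are our own choices; efficient exact samplers exist but are not sketched here):

```python
import numpy as np

def sample_pg(b, c, n_terms=200, rng=None):
    """Approximate draw from PG(b, c) by truncating the infinite weighted
    sum of independent Gamma(b, 1) variables g_k."""
    rng = np.random.default_rng() if rng is None else rng
    k = np.arange(1, n_terms + 1)
    g = rng.gamma(shape=b, scale=1.0, size=n_terms)  # g_k ~ Gamma(b, 1), rate 1
    return np.sum(g / ((k - 0.5) ** 2 + c ** 2 / (4 * np.pi ** 2))) / (2 * np.pi ** 2)
```

As a sanity check, the mean of $PG(b, 0)$ is $b/4$, so averaging many draws of `sample_pg(1.0, 0.0)` should give roughly 0.25.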

C.1 Derivation of the posterior distribution of $\beta_j$
By introducing Pólya-Gamma latent variables $\omega_{ij} \sim PG(1, 0)$, $i = 1, \ldots, N$, we establish a connection between the logistic form and the normal distribution, and rewrite the $j$th conditional distribution $p(y_{ij} \mid y_{i,-j}, \beta_j)$ accordingly. Denote $Y = (y_1, \ldots, y_N)^\top$, let $Y_j$ be the $j$th column of $Y$, and let $Y_{-j}$ collect the remaining $J - 1$ columns. Given $\beta_j$, we sample $N$ augmentation variables $\omega_{ij}$, $i = 1, \ldots, N$, each from a Pólya-Gamma distribution,
$$\omega_{ij} \mid \beta_j, y_i \sim PG\Big(1, \; \beta_{jj}/2 + \textstyle\sum_{k \neq j} \beta_{jk} y_{ik}\Big), \quad (S21)$$
based on eq. (S20). Moreover, for the $j$th variable, the augmented conditional likelihood is proportional to
$$\exp\Big[-\frac{1}{2}\Big(\beta_j^\top (Y - \kappa_j e_j^\top)^\top D_{\omega_j} (Y - \kappa_j e_j^\top)\beta_j - 2\kappa_j^\top (Y - \kappa_j e_j^\top)\beta_j\Big)\Big], \quad (S22)$$
where $\kappa_j = (\kappa_{1j}, \ldots, \kappa_{Nj})^\top$ with $\kappa_{ij} = y_{ij} - 1/2$, and $D_{\omega_j} = \mathrm{diag}(\omega_j)$. Combining (S22) with the prior, the conditional distribution of $\beta_j$ is normal, with
$$\Sigma_{\beta_j} = \big[(Y - \kappa_j e_j^\top)^\top D_{\omega_j} (Y - \kappa_j e_j^\top) + D_{\beta_j}\big]^{-1}, \qquad \mu_{\beta_j} = \Sigma_{\beta_j} (Y - \kappa_j e_j^\top)^\top \kappa_j. \quad (S24)$$
Here, $e_j$ is the $J$-dimensional vector whose $j$th element is one and all others are zero, and $D_{\beta_j} = \mathrm{diag}(\tau_j)$, where $\tau_{jl} = \sigma_1^{-2}$ for $l \neq j$ and $\tau_{jj} = \sigma_2^{-2}$. A weakly informative prior is placed on the intercept parameter by letting $\sigma_2^2 > \sigma_1^2$. To summarize, the introduced Pólya-Gamma latent variables $\omega_{ij}$ connect the logistic form with the normal distribution, leading to the normal posterior $\beta_j \mid Y, \omega_j \sim N(\mu_{\beta_j}, \Sigma_{\beta_j})$ with $\Sigma_{\beta_j}$ and $\mu_{\beta_j}$ as in (S24).
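To make the update concrete, here is a minimal sketch of one Gibbs step (S21)-(S24) for a single $\beta_j$ on complete data; the crude truncated-series Pólya-Gamma sampler, function names, and prior values are our own illustrative choices rather than the paper's implementation:

```python
import numpy as np

def pg_draw(b, c, n_terms=200, rng=None):
    # Approximate PG(b, c) draw via the truncated Gamma-sum representation.
    rng = np.random.default_rng() if rng is None else rng
    k = np.arange(1, n_terms + 1)
    g = rng.gamma(b, 1.0, size=n_terms)
    return np.sum(g / ((k - 0.5) ** 2 + c ** 2 / (4 * np.pi ** 2))) / (2 * np.pi ** 2)

def update_beta_j(Y, j, beta_j, sigma1_sq, sigma2_sq, rng):
    """One Gibbs update of beta_j given complete binary data Y (N x J)."""
    N, J = Y.shape
    kappa_j = Y[:, j] - 0.5                    # kappa_ij = y_ij - 1/2
    X = Y.astype(float).copy()
    X[:, j] = 0.5                              # X = Y - kappa_j e_j^T
    # (S21): omega_ij ~ PG(1, beta_jj/2 + sum_{k != j} beta_jk y_ik)
    psi = X @ beta_j
    omega = np.array([pg_draw(1.0, c, rng=rng) for c in psi])
    # (S24): normal conditional for beta_j
    tau = np.full(J, 1.0 / sigma1_sq)
    tau[j] = 1.0 / sigma2_sq                   # weaker prior on the intercept
    Sigma = np.linalg.inv(X.T @ (omega[:, None] * X) + np.diag(tau))
    Sigma = (Sigma + Sigma.T) / 2              # guard against round-off asymmetry
    mu = Sigma @ (X.T @ kappa_j)
    return rng.multivariate_normal(mu, Sigma)
```

Note that replacing the $j$th column of $Y$ by $1/2$ reproduces the design matrix $Y - \kappa_j e_j^\top$, whose $i$th row gives the linear predictor $\beta_{jj}/2 + \sum_{k \neq j} \beta_{jk} y_{ik}$.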

C.2 Derivation of the posterior distribution of $S$
Observe that there is an equality constraint among the edge parameters, $s_{jk} = s_{kj}$ for $j \neq k$, reflected by the symmetry of the matrix $S$. It is therefore sufficient to examine the lower-triangular elements of $S$ while adhering to this constraint. To impose the symmetry constraint on $S$, we reparameterize $S$ by $\alpha = \mathrm{vech}(S)$, the half-vectorization of $S$. Specifically,
$$\alpha = (s_{11}, \ldots, s_{J1}, s_{22}, \ldots, s_{J2}, \ldots, s_{J-1,J-1}, s_{J,J-1}, s_{J,J})^\top = (\alpha_1, \ldots, \alpha_{J(J+1)/2})^\top. \quad (S25)$$
To establish the relationship between $S$ and $\alpha$, we first define
$$s_j = E_j \, \mathrm{vec}(S), \quad (S26)$$
where $\mathrm{vec}(S) = (s_1^\top, \ldots, s_J^\top)^\top$ is the vectorization of the matrix $S$, and $E_j = (0_J, \ldots, I_J, \ldots, 0_J)$ is a $J \times J^2$ matrix whose $j$th block is the identity matrix $I_J$ and whose other blocks are zero matrices. Next, we can express $\mathrm{vec}(S)$ as
$$\mathrm{vec}(S) = D_J \alpha, \quad (S27)$$
where $D_J$ is the $J^2 \times J(J+1)/2$ duplication matrix, which can be explicitly written as
$$D_J = \sum_{i \geq j} \mathrm{vec}(T_{ij}) \, u_{ij}^\top.$$
Here, $u_{ij}$ is a unit vector of order $J(J+1)/2$ with a one in position $(j-1)J + i - j(j-1)/2$ and zeros elsewhere, and $T_{ij}$ is a $J \times J$ matrix with ones in positions $(i, j)$ and $(j, i)$ and zeros in all other positions. Combining equations (S26) and (S27), we obtain $s_j = T_j \alpha$, where $T_j = E_j D_J$ is a $J \times J(J+1)/2$ transformation matrix.
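The duplication matrix $D_J$ above can be constructed directly from the index formula; a small sketch (0-based indices, column-major vec/vech ordering) is:

```python
import numpy as np

def duplication_matrix(J):
    """Build the J^2 x J(J+1)/2 duplication matrix D_J with
    vec(S) = D_J @ vech(S) for any symmetric J x J matrix S
    (column-major vec and lower-triangular column-major vech)."""
    D = np.zeros((J * J, J * (J + 1) // 2))
    for j in range(J):
        for i in range(j, J):                    # lower triangle, i >= j
            col = j * J + i - j * (j + 1) // 2   # position of (i, j) in vech
            D[j * J + i, col] = 1.0              # entry (i, j) of S in vec
            D[i * J + j, col] = 1.0              # mirrored entry (j, i)
    return D
```

The identity $\mathrm{vec}(S) = D_J \, \mathrm{vech}(S)$ can then be checked numerically for any symmetric $S$, and the blocks $T_j = E_j D_J$ are simply consecutive row slices of $D_J$.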
Given $\alpha$, we first sample $NJ$ augmentation variables $\Omega = (\omega_{ij})_{N \times J}$ from
$$\omega_{ij} \sim PG(1, \sigma_{\omega,ij}),$$
where $\sigma_{\omega,ij}$ is the $(i,j)$th entry of $\Sigma_\omega = YS - (Y - \frac{1}{2} 1_N 1_J^\top) \circ 1_N \mathrm{diag}(S)^\top$, with $S = \mathrm{vech}^{-1}(\alpha)$, $\mathrm{diag}(S)$ the diagonal vector of the matrix $S$, and $\circ$ the Hadamard product. Furthermore, given the above transformation, the sampling of $S$ can be carried out by sampling $\alpha$ from its posterior with a similar Pólya-Gamma augmentation procedure. Specifically, the pseudo-likelihood with Pólya-Gamma augmentation is proportional to
$$\prod_{j=1}^{J} \exp\Big[-\frac{1}{2}\Big(s_j^\top (Y - \kappa_j e_j^\top)^\top D_{\omega_j} (Y - \kappa_j e_j^\top) s_j - 2\kappa_j^\top (Y - \kappa_j e_j^\top) s_j\Big)\Big].$$
Substituting $s_j = T_j \alpha$ and combining with the prior, the conditional posterior of $\alpha$ is normal:
$$\pi(\alpha \mid Y, \Omega) \propto \exp\Big[-\frac{1}{2}\Big(\alpha^\top \Big(\sum_{j=1}^{J} T_j^\top \big((Y - \kappa_j e_j^\top)^\top D_{\omega_j} (Y - \kappa_j e_j^\top) + D_{s_j}\big) T_j\Big)\alpha - 2\Big(\sum_{j=1}^{J} \big((Y - \kappa_j e_j^\top) T_j\big)^\top \kappa_j\Big)^\top \alpha\Big)\Big] \propto \exp\Big[-\frac{1}{2}(\alpha - \mu_\alpha)^\top \Sigma_\alpha^{-1} (\alpha - \mu_\alpha)\Big], \quad (S32)$$
where $\Sigma_\alpha$ and $\mu_\alpha$ are obtained by matching the quadratic and linear terms above.

The resulting MSEs for edge parameter estimation under various sample sizes are displayed in Figure 4(a), where each box corresponds to the MSEs for 105 edge parameters.
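Putting the pieces of C.2 together, one Gibbs update of $\alpha$ can be sketched as follows. This is a minimal illustration under our own simplifications: a crude truncated-series Pólya-Gamma sampler, and the prior precision folded into a single diagonal matrix on $\alpha$ instead of the per-$j$ terms $D_{s_j}$:

```python
import numpy as np

def pg_draw(b, c, n_terms=200, rng=None):
    # Approximate PG(b, c) draw via the truncated Gamma-sum representation.
    rng = np.random.default_rng() if rng is None else rng
    k = np.arange(1, n_terms + 1)
    g = rng.gamma(b, 1.0, size=n_terms)
    return np.sum(g / ((k - 0.5) ** 2 + c ** 2 / (4 * np.pi ** 2))) / (2 * np.pi ** 2)

def duplication_matrix(J):
    # vec(S) = D @ vech(S) for symmetric S (column-major ordering).
    D = np.zeros((J * J, J * (J + 1) // 2))
    for j in range(J):
        for i in range(j, J):
            col = j * J + i - j * (j + 1) // 2
            D[j * J + i, col] = 1.0
            D[i * J + j, col] = 1.0
    return D

def update_alpha(Y, alpha, prior_prec, rng):
    """One Gibbs update of alpha = vech(S) for the Ising pseudo-likelihood."""
    N, J = Y.shape
    d = J * (J + 1) // 2
    D_J = duplication_matrix(J)
    S = (D_J @ alpha).reshape(J, J, order="F")
    # Entries of Sigma_omega: linear predictor of y_ij given y_{i,-j}.
    Psi = Y @ S - (Y - 0.5) * np.diag(S)[None, :]
    Omega = np.array([[pg_draw(1.0, Psi[i, j], rng=rng) for j in range(J)]
                      for i in range(N)])
    A = np.diag(prior_prec).astype(float)      # simplified prior precision
    b_vec = np.zeros(d)
    for j in range(J):
        T_j = D_J[j * J:(j + 1) * J, :]        # s_j = T_j @ alpha
        X_j = Y.astype(float).copy()
        X_j[:, j] = 0.5                        # X_j = Y - kappa_j e_j^T
        kappa_j = Y[:, j] - 0.5
        A += T_j.T @ (X_j.T @ (Omega[:, j][:, None] * X_j)) @ T_j
        b_vec += T_j.T @ (X_j.T @ kappa_j)
    Sigma_alpha = np.linalg.inv(A)
    Sigma_alpha = (Sigma_alpha + Sigma_alpha.T) / 2
    mu_alpha = Sigma_alpha @ b_vec
    return rng.multivariate_normal(mu_alpha, Sigma_alpha)
```

Because the symmetry constraint is built into $T_j$, every draw of $\alpha$ corresponds to a valid symmetric $S = \mathrm{vech}^{-1}(\alpha)$.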
As we can see, the MSEs decrease toward zero as the sample size increases. Furthermore, by employing a hard-thresholding step after edge parameter estimation, a receiver operating characteristic (ROC) curve is created for edge selection under each setting, and the corresponding area under the curve (AUC) is calculated to evaluate the performance of the proposed method. That is, given a hard threshold $\tau$, the true positive rate (TPR) and false positive rate (FPR) are calculated as
$$\mathrm{TPR}(\tau) = \frac{\sum_{k=1}^{50} \sum_{j<l} 1_{\{|\hat s_{k,jl}| > \tau \text{ and } s_{0,jl} \neq 0\}}}{50 \times \sum_{j<l} 1_{\{s_{0,jl} \neq 0\}}}, \qquad \mathrm{FPR}(\tau) = \frac{\sum_{k=1}^{50} \sum_{j<l} 1_{\{|\hat s_{k,jl}| > \tau \text{ and } s_{0,jl} = 0\}}}{50 \times \sum_{j<l} 1_{\{s_{0,jl} = 0\}}}.$$
The Jaccard coefficient shows good consistency of the estimated edges with the true edges.
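The TPR/FPR formulas above pool indicator counts over the replications; a small sketch of the thresholding computation (the array shapes are our own assumption) is:

```python
import numpy as np

def tpr_fpr(s_hat, s_true, tau):
    """TPR/FPR for edge selection by hard-thresholding, pooled over replications.

    s_hat: (K, J, J) estimated matrices over K replications; s_true: (J, J).
    """
    iu = np.triu_indices(s_true.shape[0], k=1)       # edges j < l
    est = np.abs(s_hat[:, iu[0], iu[1]]) > tau       # (K, n_edges) selections
    nz = s_true[iu] != 0                             # true nonzero edges
    tpr = est[:, nz].sum() / (est.shape[0] * nz.sum())
    fpr = est[:, ~nz].sum() / (est.shape[0] * (~nz).sum())
    return tpr, fpr

def roc_auc(s_hat, s_true, taus):
    # Trapezoidal area under the (FPR, TPR) curve, with (0,0) and (1,1) added.
    pts = [tpr_fpr(s_hat, s_true, t) for t in taus]
    tpr = np.array([p[0] for p in pts] + [0.0, 1.0])
    fpr = np.array([p[1] for p in pts] + [0.0, 1.0])
    o = np.lexsort((tpr, fpr))
    return float(np.sum((fpr[o][1:] - fpr[o][:-1]) * (tpr[o][1:] + tpr[o][:-1]) / 2))
```

Sweeping `tau` over a grid of thresholds traces out the ROC curve described in the text.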
Finally, we show the true and estimated networks for a single replication in Figure 8, which further validates that the proposed method can accurately recover the true network structure.

Figure 1: Flow chart of the updating rule for the proposed method.
Sample $y^{(t)}_{ij}$ for each $i \in \Omega_j$ from a Bernoulli distribution with success probability given by the fitted logistic regression. The fitted logistic regressions are close to those implied by the true Ising model, and thus the conditional distributions used to impute the missing values are close to those under the true model. This further guarantees that the joint distribution of the imputed data given the observed ones is close to that under the true Ising model, and consequently, the Ising model parameters learned from the imputed data are close to those of the true model.
step. It constructs an MCMC transition kernel by a data augmentation trick. More precisely, the following two steps are performed.

Figure 2: Panel (a): Boxplots of MSEs for edge parameters $s_{jl}$. Panel (b): Boxplots of MSEs for intercept parameters $s_{jj}$.

Figure 3: Estimated network structure for MDD and GAD. Panel (a): Complete-case analysis. Panel (b): Proposed method.
in probability as $N \to \infty$. Finally, by employing the convergence of imputation from Lemma S1, specifically $d_{TV}\big(p^*(Y_{mis} \mid Y_{obs}), p(Y_{mis} \mid Y_{obs})\big) \to 0$ in probability as $N \to \infty$, we arrive at the posterior of $S$, $p(S \mid Y, \Omega) \propto p^*(Y \mid S, \ldots)$.

is obtained by varying the value of $\tau$. The ROC curves and the corresponding AUC values are given in Figure 4(b). The curves of TPR and FPR as functions of the threshold are displayed in Figures 5(a) and 5(b), respectively. Furthermore, we show scatter plots of the estimated edge parameters against the true values for a single replication in Figure 6, which indicate that the proposed method provides a good estimation of the edge parameters.

Figure 7 illustrates the similarity between the estimated network and the true network, as measured by the Jaccard coefficient. Given a threshold $\tau$, the Jaccard coefficient is defined as the number of edges present in both the thresholded estimated network and the true network, divided by the number of edges present in at least one of the two.
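A small sketch of this Jaccard computation between the thresholded estimated edge set and the true edge set (array conventions are our own):

```python
import numpy as np

def edge_jaccard(s_hat, s_true, tau):
    """Jaccard similarity between the estimated edge set, obtained by
    thresholding |s_hat| at tau, and the true edge set (edges j < l)."""
    iu = np.triu_indices(s_true.shape[0], k=1)
    est = np.abs(s_hat[iu]) > tau
    true = s_true[iu] != 0
    union = np.logical_or(est, true).sum()
    # Convention: two empty edge sets are treated as identical.
    return np.logical_and(est, true).sum() / union if union else 1.0
```

A coefficient of 1 means the two edge sets coincide exactly, and values near 1 across thresholds correspond to the good consistency reported in the text.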

Table 1: MSEs and biases for edge parameters.

Table 2: Descriptions of MDD and GAD items and their missing rates.
We consider data with nine items measuring MDD and six items measuring GAD. These items are designed according to the Diagnostic and Statistical Manual of Mental Disorders, Fourth Edition (DSM-IV) (American Psychiatric Association, 1994).