Estimating prevalence of post-war health disorders using multiple systems data

Effective surveillance of the long-term public health impact of war and terrorist attacks remains limited. Such health issues are commonly under-reported, particularly for large groups of individuals. Efficient estimation of the size or undercount of the population at risk of physical and mental health hazards is therefore essential. A novel trivariate Bernoulli model is developed that allows heterogeneity among individuals and dependence between the sources of information, and an estimation methodology using a Monte Carlo-based EM algorithm is proposed. Simulation results show that the proposed method outperforms existing competitors and is robust under model misspecification. The method is applied to analyse two real case studies on monitoring amyotrophic lateral sclerosis (ALS) cases among Gulf War veterans and survivors of the 9/11 terrorist attack at the World Trade Center, USA. After adjusting for the undercount, the average annual cumulative incidence rate of ALS increases by 33% and 16% for deployed and non-deployed military personnel, respectively.
The number of individuals exposed to the risk of physical and mental health effects due to the WTC terrorist attacks increases by 42%. These results provide insights that can assist effective decision-making and policy formulation for monitoring the health status of post-war survivors.


Supplementary Material for "Estimating Prevalence of
Post-war Health Disorders using Multiple Systems Data" by Bhuyan and Chatterjee

A TRS Data Structure
Any capture-recapture type data structure generated by a small number of capture attempts is conventionally written as an incomplete contingency table for a closed population. Capture-recapture data with three lists (i.e. a TRS) can therefore be written as the following incomplete $2^3$ contingency table, where the last cell $x_{000}$ remains structurally unobserved. Note that in the context of multiple systems estimation (MSE), the population size $N = \sum_{i,j,k=0,1} x_{ijk}$ is to be estimated, since the cell $x_{000}$ is always unobserved in Table 1. Therefore, in an MSE setup, estimation of $N$ is equivalent to estimation of $x_{000} = N - n$, where $n$ refers to the sum of the seven observed cells, i.e. $n = x_{111} + x_{110} + x_{101} + x_{011} + x_{100} + x_{010} + x_{001}$. In terms of the TRS data structure given in Table 1, we present our three TRS datasets (see Figure 1 in the main article) in the following table.
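As a concrete sketch of this data structure, the seven observable cells can be held in a small mapping keyed by capture pattern over the three lists. The counts below are invented for illustration and are not any of the paper's datasets.

```python
# Sketch: the seven observable cells of an incomplete 2^3 contingency
# table for a TRS. Keys are capture patterns (i, j, k) over lists
# L1, L2, L3; the cell (0, 0, 0) is structurally unobserved.
# Counts are illustrative, not the paper's data.
counts = {
    (1, 1, 1): 12, (1, 1, 0): 30, (1, 0, 1): 25, (0, 1, 1): 18,
    (1, 0, 0): 110, (0, 1, 0): 95, (0, 0, 1): 80,
}

# n is the sum of the seven observed cells; estimating N is then
# equivalent to estimating the unobserved count x_000 = N - n.
n = sum(counts.values())
print(n)  # 370
```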

B Test for Heterogeneity
In order to validate the assumption of individual heterogeneity in the datasets, we plot $\log\{f_i/\binom{3}{i}\}$ against $i$, for $i = 1, 2, 3$, using the plot.descriptive function in the Rcapture package in R. Here, $f_i$ denotes the number of individuals captured exactly $i$ times.
For further details, see Baillargeon and Rivest (2007). In terms of our notation, $f_1 = x_{100} + x_{010} + x_{001}$, $f_2 = x_{110} + x_{101} + x_{011}$, and $f_3 = x_{111}$. The graphs clearly indicate the presence of individual heterogeneity for the non-deployed veterans and WTC data, whereas only marginal individual heterogeneity appears to exist in the deployed veterans data (see Figure B.1).

C Estimation Methodology
Traditionally in the context of MSE, the population size $N$ is estimated based on likelihood theory, where the vector of observed cell counts $\mathbf{x} = \{x_{ijk} : i, j, k = 0, 1;\ (i,j,k) \neq (0,0,0)\}$, as presented in Table 1, follows a multinomial distribution with index parameter $N$ and associated cell probabilities $\mathbf{p} = \{p_{ijk} : i, j, k = 0, 1;\ (i,j,k) \neq (0,0,0)\}$ (Sanathanan, 1972), where $X_h$ and $Z_h$ are defined in model (1) in the main article. Therefore, the likelihood function is given by
$$L(N, \alpha, \mathbf{P} \mid \mathbf{x}) \propto \frac{N!}{x_{000}!\ \prod_{(i,j,k)\neq(0,0,0)} x_{ijk}!}\; p_{000}^{x_{000}} \prod_{(i,j,k)\neq(0,0,0)} p_{ijk}^{x_{ijk}}, \qquad \text{(C.0.1)}$$
where $x_{000} = N - x_0$, $p_{000} = 1 - \sum_{i,j,k=0,1;\,(i,j,k)\neq(0,0,0)} p_{ijk}$, $\alpha = (\alpha_1, \alpha_2, \alpha_3, \alpha_4)$, and $\mathbf{P} = (P_1, P_2, P_3)$. In addition to the first- and second-order list-dependence involved in the TBM, the proposed THBM accounts for individual heterogeneity in terms of variation in the capture probabilities, treating $P_s$ as a random effect for $s = 1, 2, 3$. Now, integrating the likelihood function in (C.0.1) with respect to the distribution of $\mathbf{P}$, we obtain the marginal likelihood function of $\theta$ as
$$L(\theta \mid \mathbf{x}) = \int L(N, \alpha, \mathbf{P} \mid \mathbf{x}) \prod_{s=1}^{3} g_{P_s}(p_s \mid \delta_s)\, dp_s, \qquad \text{(C.0.2)}$$
where $g_{P_s}(\cdot \mid \delta_s)$ is the density of $P_s$ with parameter $\delta_s$ for $s = 1, 2, 3$, and $\theta = (N, \alpha, \delta)$. The estimation of $N$ and the other associated parameters of the THBM is computationally challenging in a conventional frequentist setup due to the involvement of intractable numerical integrals (Coull and Agresti, 1999). One can consider approximation methods for numerical integration such as Monte Carlo or the Laplace method.
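Since (C.0.2) integrates the cell probabilities over the random effects, the Monte Carlo idea can be seen by approximating a marginal cell probability as an average over draws of $P_s$. The sketch below uses a plain independence model for the conditional cell probability as a stand-in for the THBM cell probabilities of model (1), which are not reproduced here; the Beta(2, 2) shape parameters are illustrative.

```python
import random

# Monte Carlo approximation of a marginal cell probability of the form
# p_ijk = E[ p_ijk(P1, P2, P3) ] under independent beta random effects.
# The conditional cell probability used here is a plain independence
# model, standing in for the THBM cell probabilities of model (1).

def cell_prob(pattern, P):
    # P(X_s = pattern[s]) is P_s if captured in list s, else 1 - P_s.
    prob = 1.0
    for x, p in zip(pattern, P):
        prob *= p if x == 1 else 1.0 - p
    return prob

def marginal_cell_prob(pattern, shapes, draws=20000, seed=1):
    rng = random.Random(seed)
    total = 0.0
    for _ in range(draws):
        P = [rng.betavariate(a, b) for a, b in shapes]
        total += cell_prob(pattern, P)
    return total / draws

# With P_s ~ Beta(2, 2), E[P_s] = 0.5 and the P_s are independent, so
# the exact marginal p_111 = 0.5 ** 3 = 0.125.
p111 = marginal_cell_prob((1, 1, 1), [(2, 2)] * 3)
```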

C.1 Identifiability Issues
In MSE, it is quite common to encounter identifiability issues with data available from a small number of dependent sources. To deal with such issues, restrictions on parameters are imposed in existing models such as $M_{tb}$ and log-linear models, as well as in the sample coverage approach proposed by Chao and Tsay (1998). See Section E for details. Similarly, for the TBM, Chatterjee and Bhuyan (2020) considered two submodels, TBM-1 and TBM-2, keeping $\alpha_3$ and $\alpha_4$ fixed at 0, respectively. However, this process of model contraction may lead to a model representing unrealistic scenarios (Gustafson, 2005). Unlike the situation modeled by TBM-1, available sources related to the surveillance of humanitarian crises are not time ordered (Chan et al., 2021), and the capture attempts are generally interdependent, including dependence between the first and third lists (Gold et al., 2015). Similarly, TBM-2 is not suitable for scenarios in which second-order interaction among the lists is present.
The proposed THBM is intended to model more complex scenarios than the existing models, and it suffers from similar identifiability issues with respect to TRS data. It is important to note that a non-identifiable model may still carry information about the parameters of interest, and identifiable models do not necessarily lead to better decision-making than non-identifiable ones (Gustafson, 2005; Wechsler et al., 2013). Nevertheless, it is in the best interest of practitioners that clear guidelines be provided that are applicable to the different scenarios under consideration. In the following subsection, we propose an estimation methodology for the parameter of interest $N$ and the other associated parameters of the THBM based on the MCEM algorithm, which provides a generic solution to the identifiability issue.

C.2 MCEM Algorithm
As discussed before, the likelihood function $L(\theta \mid \mathbf{x})$, given by (C.0.2), is not mathematically tractable due to the involvement of an integral and the additive structure of the cell probabilities $p_{ijk}$ as functions of the model parameters. To simplify the likelihood function, we adopt the data augmentation strategy proposed by Tanner and Wong (1987). First, we partition the observed cell counts $x_{ijk}$ according to the various types of list-dependence described in model (1) in the main text, and define a vector of latent cell counts $\mathbf{y}$, where $I[\cdot]$ is an indicator function. For example, $x_{111}$ can arise from all five types of dependence structures, whereas $x_{110}$ can be generated only from the first and last types of dependence structures, as presented in (1) in the main text. We also treat the capture probabilities $\mathbf{P} = (P_1, P_2, P_3)$ as unobserved data. Therefore, the likelihood function of $\theta$, based on the complete data $(\mathbf{x}, \mathbf{y}, \mathbf{P})$, is given by (C.2.1). Interestingly, the complete data likelihood (C.2.1) possesses a simple form as a product of power functions of the parameters associated with the THBM. However, this likelihood function cannot be used for estimation purposes without resolving the identifiability issues. For this purpose, we employ an approach similar to that suggested by Carle and Strub (1978) to obtain a likelihood function free of the nuisance parameters. We consider the $P_s$'s to follow independent beta distributions with shape parameters $n_s$ and $m_s$, for $s = 1, 2, 3$, and obtain their estimates by matching moments based on the observed and latent cell counts, e.g. $\hat{m}_1 = y_{111,1} + y_{111,3} + y_{111,4} + y_{110,1} + x_{011} + x_{010}$, $\hat{n}_2 = x_{100} + x_{101} + y_{001,1} + y_{000,1} + y_{000,3} + y_{000,4}$, $\hat{m}_3 = y_{111,1} + y_{111,2} + y_{011,1} + y_{101,1} + x_{001}$, and $\hat{n}_3 = x_{110} + y_{100,1} + y_{010,1} + y_{000,1} + y_{000,2}$. Then, plugging the corresponding density function of $P_s$ into (C.2.1), we obtain the likelihood (C.2.2), where $\theta = (N, \alpha)$. To implement the MCEM algorithm, one must be able to sample from the
conditional distributions of $\mathbf{y}$ given $(\mathbf{x}, N, \alpha, \mathbf{P})$, and of $\mathbf{P}$ given $(\mathbf{x}, \mathbf{y}, N, \alpha)$, denoted by $\pi(\mathbf{y} \mid \mathbf{x}, N, \alpha, \mathbf{P})$ and $\pi(\mathbf{P} \mid \mathbf{x}, \mathbf{y}, N, \alpha)$, respectively. Note that $\pi(\mathbf{P} \mid \mathbf{x}, \mathbf{y}, N, \alpha)$ is expressed as a product of beta distributions, and $\pi(\mathbf{y} \mid \mathbf{x}, N, \alpha, \mathbf{P})$ is expressed as a product of multinomial and binomial distributions (see Section D for details). This approach is close in spirit to the EM-within-Gibbs and stochastic EM-within-Gibbs methods considered by Chatterjee and Mukherjee (2016a). We now employ a simple iterative algorithm to estimate the model parameters using the following steps.
Step 3. E step: the Monte Carlo approximation of (C.2.2) is computed by averaging the complete-data log-likelihood over the sampled latent quantities. The problem of multimodality of the likelihood often arises when fitting models with latent variables (Bartolucci and Pennoni, 2007). We therefore generate different initial values randomly for the first step of the aforementioned algorithm, and take the parameter vector that at convergence gives the highest value of $L_c(\hat{\theta} \mid \mathbf{x}, \mathbf{y}, \mathbf{P})$, given in (C.2.2), as the estimate of $\theta$, denoted by $\hat{\theta}$.
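The simulate-then-maximize structure of the algorithm can be illustrated on a deliberately simple latent-data problem, since the full THBM machinery is beyond a short sketch. The code below runs MCEM for exponential lifetimes right-censored at a fixed point: the E step imputes each censored lifetime by Monte Carlo draws from the truncated exponential (by memorylessness, the censoring point plus a fresh exponential draw), and the M step applies the closed-form exponential MLE. All numbers are illustrative; this is not the THBM E or M step.

```python
import random

def mcem_exponential(observed, censored_at, iters=30, draws=100, seed=7):
    """MCEM for Exp(rate) lifetimes right-censored at the values in
    censored_at. The latent data are the unobserved remainders of the
    censored lifetimes."""
    rng = random.Random(seed)
    rate = 1.0  # arbitrary starting value
    n = len(observed) + len(censored_at)
    for _ in range(iters):
        total = sum(observed)
        # Monte Carlo E step: impute each censored lifetime by averaging
        # draws of c + Exp(rate), the truncated exponential mean.
        for c in censored_at:
            total += sum(c + rng.expovariate(rate) for _ in range(draws)) / draws
        # M step: closed-form MLE of the exponential rate.
        rate = n / total
    return rate

# Simulated data: Exp(2) lifetimes censored at 0.5.
random.seed(0)
samples = [random.expovariate(2.0) for _ in range(2000)]
c = 0.5
observed = [t for t in samples if t <= c]
censored = [c] * sum(1 for t in samples if t > c)
est = mcem_exponential(observed, censored)
```

At convergence this fixed point coincides with the usual censored-data MLE (events divided by total time at risk), up to Monte Carlo noise.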

E Existing Models and Estimates
In this section, we present frequently used models for TRS and the associated estimates of the population size. These models incorporate individual heterogeneity and/or list-dependence. The assumptions associated with these models and their limitations in real applications are also discussed briefly. We use these estimates to compare against the performance of the proposed method in the Simulation Study section of the main article.
E.1 Log-linear Model
Fienberg (1972b) and Bishop et al. (1975) discussed several log-linear models (LLMs) that account for list-dependence using interaction effects. The general LLM under TRS is given by
$$\log m_{ijk} = u + u_1 i + u_2 j + u_3 k + u_{12} ij + u_{13} ik + u_{23} jk + u_{123} ijk, \qquad \text{(E.1.1)}$$
where $m_{ijk} = E[x_{ijk}]$ (Bishop et al., 1975, p. 64). The parameters $u_l$, $u_{ll'}$, and $u_{123}$ denote the main effects, the pairwise interaction effects, and the second-order interaction effect for $l, l' = 1, 2, 3$. See Fienberg (1972a) and Fienberg (1972b) for more details. To ensure estimability of the model in (E.1.1), one needs to set $u_{123} = 0$, i.e. no second-order interaction between lists. Under this assumption, the estimate of $m_{000}$ is given by
$$\hat{m}_{000} = \frac{\hat{m}_{111}\, \hat{m}_{001}\, \hat{m}_{100}\, \hat{m}_{010}}{\hat{m}_{101}\, \hat{m}_{011}\, \hat{m}_{110}},$$
where $\hat{m}_{ijk}$ is the maximum likelihood estimate of $m_{ijk}$, assuming $x_{ijk}$ to be a realization of an independent Poisson random variate for all $(i,j,k)$th cells except the $(0,0,0)$th cell. Finally, the estimate of $N$ is obtained as $\hat{N}_{LLM} = n + \hat{m}_{000}$ (Fienberg, 1972b). This is equivalent to the estimator proposed by Zaslavsky and Wolfgang (1993) with $\alpha_{EPA} = 1$. Under the additional assumption that $u_{12} = u_{13} = u_{23} = 0$, the LLM in (E.1.1) reduces to the Independent Model (IM), which has no closed-form solution (Fienberg, 1972b).
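The closed-form estimate above can be computed directly from the seven observed cells: with $u_{123} = 0$, the model has seven parameters and fits the seven observed cells exactly, so the observed counts serve as the fitted Poisson means. The counts below are illustrative, not the paper's data.

```python
# Sketch of the no-second-order-interaction LLM estimate for three lists:
# m_000 = (m_111 m_001 m_100 m_010) / (m_101 m_011 m_110),
# with observed counts standing in for the fitted Poisson means (exact
# here, since the 7-parameter model saturates the 7 observed cells).
# Counts are illustrative, not the paper's data.
x = {
    (1, 1, 1): 12, (1, 1, 0): 30, (1, 0, 1): 25, (0, 1, 1): 18,
    (1, 0, 0): 110, (0, 1, 0): 95, (0, 0, 1): 80,
}
n = sum(x.values())
m000 = (x[(1, 1, 1)] * x[(0, 0, 1)] * x[(1, 0, 0)] * x[(0, 1, 0)]) / (
    x[(1, 0, 1)] * x[(0, 1, 1)] * x[(1, 1, 0)]
)
N_llm = n + m000  # the LLM population-size estimate
```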

E.2 Time Behavioural Response Variation Model: M tb
It is often observed in TRS that an individual's behavior changes in subsequent recapture attempts after the initial attempt.This change is known as behavioral response variation.
When this behavioral response variation is considered along with list variation in capture probabilities, one obtains the model known as the $M_{tb}$ model (Otis et al., 1978; Chatterjee and Mukherjee, 2016b). However, the $M_{tb}$ model does not account for individual heterogeneity in capture probabilities. Denote the first-time capture probability of any individual in the $s$-th list by $f_s$ for $s = 1, 2, 3$, and the recapture probability by $c_s$ for $s = 2, 3$. Further, let $u_s$ and $m_s$ denote, respectively, the numbers of first-time captured and recaptured individuals in the $s$-th list. The likelihood of the model $M_{tb}$ is then constructed from the sufficient statistics $(u_1, u_2, u_3, m_2, m_3)$ under the assumption of constant proportionality, i.e. $c_s/f_s = \phi$ for $s = 2, 3$, where $n = \sum_{i,j,k:\,(i,j,k)\neq(0,0,0)} x_{ijk}$ and $M_s = u_1 + u_2 + \dots + u_{s-1}$ denote the total number of distinct captured individuals and the number of individuals captured at least once prior to the $s$-th attempt, respectively (Chatterjee and Bhuyan, 2020). In this context, Chao et al. (2000) discussed various likelihood-based methods for the estimation of $N$. Note that the $M_{tb}$ model does not utilize the full information available from the seven known cells in a TRS, and hence it loses efficiency in estimating $N$ (Chatterjee and Bhuyan, 2020). Moreover, the assumption $c_2/f_2 = c_3/f_3 = \phi$ is not justifiable in various real-life applications.

E.3 Models with individual heterogeneity
In educational statistics and psychometrics, the Rasch model (Rasch, 1961) is widely used to account for individual heterogeneity and list variation. The capture probabilities are modeled using logistic regression:
$$\operatorname{logit} P(Z_{hs} = 1) = v_h + w_s, \qquad \text{(E.3.1)}$$
where $v_h$ is a random effect representing the catchability of the $h$-th individual and $w_s$ is the list-variation effect of the $s$-th list. Even in the absence of behavioural response variation, dependence among the lists is induced purely by individual heterogeneity (Chao et al., 2001). When we set $v_h = 0$ in (E.3.1), the model reduces to a multiple-recapture model with list independence. We now represent the Rasch model as a mixed-effects generalized linear model that allows for individual heterogeneity and list variation in the following subsections.
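A quick way to see how a shared random effect induces list dependence is to simulate capture histories from (E.3.1). The sketch below assumes standard normal catchability effects and illustrative list effects; these values are assumptions for the demonstration, not from the paper.

```python
import math
import random

# Simulate capture histories from a Rasch-type model (E.3.1):
# logit P(Z_hs = 1) = v_h + w_s, with v_h ~ N(0, 1) an individual
# catchability effect and w_s a fixed list effect. Captures are
# independent across lists *given* v_h, yet the shared v_h makes the
# lists marginally dependent. Parameters below are illustrative.

def simulate_rasch(N, w, seed=3):
    rng = random.Random(seed)
    histories = []
    for _ in range(N):
        v = rng.gauss(0.0, 1.0)  # individual heterogeneity
        hist = tuple(
            int(rng.random() < 1.0 / (1.0 + math.exp(-(v + ws)))) for ws in w
        )
        histories.append(hist)
    return histories

hist = simulate_rasch(50000, w=(-1.0, 0.0, 1.0))
# Marginal capture rates by list; by symmetry the middle list (w = 0)
# should capture about half the population.
rates = [sum(h[s] for h in hist) / len(hist) for s in range(3)]
```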

E.3.1 Quasi-symmetry Model
The marginal cell probabilities for all eight cells in a TRS can be represented as in (E.3.2), where $i_{+} = i + j + k$; see Darroch et al. (1993) and Fienberg et al. (1999) for a detailed derivation. Note that the term $\gamma(i_{+})$ is invariant with respect to permutations of $(i, j, k)$, and hence the model in (E.3.2) is known as the quasi-symmetry model (QSM) (Darroch, 1981). This is equivalent to the following assumption involving two constraints:
$$p_{011}\, p_{100} = p_{101}\, p_{010} = p_{110}\, p_{001}, \qquad \text{(E.3.3)}$$
which does not involve $p_{000}$. Therefore, an additional assumption is required, such as the absence of second-order interaction. Under this assumption, the estimated count of the $(0,0,0)$th cell is obtained in closed form from the fitted values $\hat{x}_{ijk}$, where $\hat{x}_{ijk}$ is the maximum likelihood estimate of $E[x_{ijk} \mid n]$ (Darroch et al., 1993).

E.3.2 Partial Quasi-symmetry Model
In many situations, the constraint in (E.3.3) associated with the quasi-symmetry model is not realistic. However, it is reasonable to assume $p_{011}\, p_{100} = p_{101}\, p_{010}$ if $x_{011}\, x_{100}$ and $x_{101}\, x_{010}$ are close enough (Darroch et al., 1993). This condition implies that the patterns of individual heterogeneity differ between two different groups of sources or lists, as considered in the generalized Rasch model proposed by Stegelman (1983). Under this assumption, the marginal cell probabilities can be represented with two non-identically distributed random effects $v_h$ and $z_h$ associated with the group of lists $(L_1, L_2)$ and the list $L_3$, respectively. This model is known as the partial quasi-symmetry model (PQSM). See Darroch et al. (1993) and Fienberg et al. (1999) for more details.

E.4 Sample Coverage Method
The sample coverage of a given sample is the probability-weighted fraction of the population captured in that sample. The main principle relies on the fact that, although the undercount is difficult to estimate directly, the sample coverage can be well estimated in any non-independent situation with heterogeneous capture probabilities (Chao and Tsay, 1998). In the context of TRS, the sample coverage is defined based on three hypothetical populations I, II, and III, with capture probabilities corresponding to the lists $L_1$, $L_2$, and $L_3$, respectively. In particular, the sample coverage of the first two lists, $C_{III}(L_1 \cup L_2)$, is defined with respect to population III. Similarly, one can obtain the sample coverages of the other two possible combinations of samples, $C_{II}(L_1 \cup L_3)$ and $C_{I}(L_2 \cup L_3)$, with respect to populations II and I, respectively. The sample coverage of the three lists is then defined as the average of these quantities, and is estimated using the singleton counts $x_{100}$, $x_{010}$, $x_{001}$ and the list totals $n_1$, $n_2$, $n_3$. Finally, the estimator $\hat{N}_{SC}$ of $N$ is obtained from the sample coverage approach; we refer to this estimate as SC. Empirical studies show that the performance of $\hat{N}_{SC}$ is satisfactory if the sample coverage is over 55%. The performance also depends on the remainder term involved in the derivation of $\hat{N}_{SC}$, and the bias increases with its magnitude (Chao and Tsay, 1998). Moreover, unlike the other methods, this method may produce infeasible estimates, i.e. $\hat{N}_{SC} < x_0$.

F Simulation Study
In this section, we describe in detail the simulation study and its findings presented in the Simulation Study section of the main text. We consider eight different compositions of populations, representing varying degrees of list-dependence and individual heterogeneity in capture probabilities, denoted by P1-P8 in Table F.1. The choices of $\alpha$ represent different degrees of list-dependence. The heterogeneity in capture probabilities is modeled as $b_s = \log\{P_s/(1-P_s)\}$, where the $b_s$'s are normal and generalized logistic type-I variates for $s = 1, 2, 3$. To compare the performance of the proposed estimator (THBM) with the existing competitors, we consider the following models: the log-linear model (LLM) (Fienberg, 1972b), the $M_{tb}$ model (Chao et al., 2000; Chatterjee and Bhuyan, 2020), the quasi-symmetry model (QSM) (Darroch et al., 1993), the partial quasi-symmetry model (PQSM) (Darroch et al., 1993), the non-parametric sample coverage method (SC) (Chao and Tsay, 1998), and the Independent Model (IM), i.e. the LLM without interaction effects (Fienberg, 1972b). The details of these existing models and the associated estimates are presented briefly in Section E. In the capture-recapture setting, point estimators of population size commonly possess positively skewed distributions, and we observed a similar pattern in our study. Therefore, we obtain a 95% confidence interval (CI)
for $N$ based on the log-transformation method proposed by Chao (1987). In this method, $\log(\hat{N} - n)$ is treated as approximately normal, which gives the 95% confidence interval for $N$ as
$$\left( n + \frac{\hat{N} - n}{C},\; n + (\hat{N} - n)\, C \right), \qquad C = \exp\left\{ 1.96 \sqrt{\log\left(1 + \frac{\hat{\sigma}^2_{\hat{N}}}{(\hat{N} - n)^2}\right)} \right\},$$
where $\hat{\sigma}^2_{\hat{N}}$ is the estimate of the variance of the estimator $\hat{N}$, computed from 1000 nonparametric bootstrap samples drawn from the implicitly assumed multinomial distribution for the mark-recapture model (Buckland and Garthwaite, 1991). Following Chatterjee and Mukherjee (2018), we compute the length of the 95% confidence interval (LCI) as well as its coverage probability (CP), and present both results in Figure F.2, where the height and width of each bar represent the CP and LCI of the corresponding method, respectively.
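The log-transformation interval can be sketched as a small function. The point estimate, observed count, and bootstrap variance below are placeholders for illustration, not values from the study.

```python
import math

# Chao (1987) log-transformation CI: treat log(N_hat - n) as
# approximately normal. With d = N_hat - n and a variance estimate
# var_hat (e.g. from bootstrap), the 95% interval is
# (n + d / C, n + d * C), C = exp(z * sqrt(log(1 + var_hat / d^2))).
# Inputs below are illustrative placeholders.

def chao_log_ci(N_hat, n, var_hat, z=1.96):
    d = N_hat - n
    C = math.exp(z * math.sqrt(math.log(1.0 + var_hat / d**2)))
    return (n + d / C, n + d * C)

lo, hi = chao_log_ci(N_hat=1113.1, n=370, var_hat=150.0**2)
```

A useful property of this interval is that its lower limit can never fall below the observed count $n$, unlike a naive Wald interval on $\hat{N}$.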
The proposed model performs best in terms of RMAE for all populations P1-P8, for both $N = 500$ and $N = 1000$. Moreover, the CPs of the proposed THBM estimator are higher than those of all other competitors. The CPs of $M_{tb}$ are similar to those of THBM, but its CIs are very wide due to high variability. The LCIs of the proposed THBM estimate are the smallest in all cases under consideration except P5 and P7. As expected, the RMAEs of the IM are the highest among all estimators for most populations. The performances of the LLM, QSM and PQSM estimators are comparable with respect to both RMAE and CP, but $M_{tb}$ performs better than LLM, QSM and PQSM. It is important to note that the CPs of SC and IM are very low, as the associated CIs are very tight. The performance of each estimator is similar over all populations with respect to CPs and LCIs except P5 and P7, where the LCIs are comparably wider. We also observe that the SC estimate suffers from boundary problems (see Section E.4 for details) in our simulation study. One can easily configure populations such that the percentage of infeasible estimates obtained from the SC method is as high as 60% to 80%. Also, the QSM, PQSM, and LLM may fail to converge in some situations and produce extremely large estimates. We have not considered those cases here, to ensure a valid and fair comparison. In some scenarios, the sample size (i.e.
the sum of all observed cell counts) may be small due to low capture probabilities across the sources. Unlike some previous work (Nour, 1982), our proposed method does not require any restriction on the domains of the capture probabilities of individuals. Hence, the proposed method can be applied even when the sample size is small. To investigate this, we conducted simulation studies in which the sample size is less than 50% of the true population size. For example, we generated the random effects $b_i$'s from the generalized logistic type-I distribution with parameter 0.5 and dependence parameters $(\alpha_1, \alpha_2, \alpha_3, \alpha_4) = (0.2, 0.15, 0.4, 0.1)$, and the resulting RMAE is approximately 0.055. As expected, the RMAE increases compared to cases where the sample size is not too small. However, the relative performance of the proposed method remains superior to that of the existing competitors.

F.1 Sensitivity of Model Misspecification
To study the effect of model misspecification when the data-generating mechanism deviates from the fitted model, we analyse the sensitivity of the estimates based on the proposed THBM in comparison with the existing estimators. For this purpose, we generate the latent capture statuses $X_{1h}, X_{2h}, X_{3h}$ of the individuals from Bernoulli distributions whose success probabilities for $j = 2, 3$ depend on the previous capture status $X_{j-1,h}$ and are capped at 0.99, for $h = 1, \dots, N$, and then generate $Z_h$ using the model in Equation (1) in the main text with $\alpha = (0.1, 0.1, 0.1, 0.1)$. This mechanism induces an additional dependence among the capture statuses, similar to a first-order autoregressive model as considered in Chao (1987). To incorporate heterogeneity in the capture probabilities, we generate $P^{(s)}_h$ from two different choices of probability distributions, S1: Beta(2, 2) and S2: Beta(2, 4), for $s = 1, 2, 3$. We also generate data from the Rasch model (Rasch, 1961) given by Equation (E.3.1) in Section E.3, where the capture status $Z_{hs}$ of the $h$-th individual in list $s$ is a Bernoulli variate, for $h = 1, \dots, N$ and $s = 1, 2, 3$, with the $v_h$'s standard normal variates and two different choices of the fixed effects, S3: $(w_1, w_2, w_3) = (-1, 0, 1)$ and S4: $(w_1, w_2, w_3) = (1, 0.5, 0.1)$. Note that all the aforementioned configurations (S1, S2, S3, S4) for data generation deviate completely from the assumed structure of the fitted models. The resulting RMAEs of the proposed THBM and its competitors are plotted in Figure F.3, and the corresponding CPs and LCIs in Figure F.4. Here also, the proposed estimator outperforms the existing competitors with respect to RMAE, as discussed in Section F. Overall, the CPs of the proposed estimator and the $M_{tb}$ model are close to 100% and much better than those of all other estimators. In particular, the performance of SC is very poor with respect to RMAEs and CPs. As expected, the performance of the IM is the worst in all respects.

Figure B.1: Graphs showing the trajectory of $\log\{f_i/\binom{3}{i}\}$ against $i$ for the deployed veterans, non-deployed veterans, and WTC data.
Figure F.3: Comparison of the RMAEs of the proposed estimator with the existing competitors under model misspecification, when data are generated from configurations S1-S4.

Table 1: Data structure associated with a triple record system (TRS).

Table 2: TRS datasets on ALS disease among the Gulf War veterans and the WTC Twin Towers population at the time of the 9/11 terrorist attack.