Article

Prediction of a Sensitive Feature under Indirect Questioning via Warner’s Randomized Response Technique and Latent Class Model

1 Department of Statistics, Feng Chia University, Taichung 40724, Taiwan
2 Department of Mathematics, College of Natural Science, Can Tho University, Can Tho, Vietnam
3 Faculty of Mathematics and Statistics, Ton Duc Thang University, Ho Chi Minh City, Vietnam
4 School of Nursing, The State University of New York, University at Buffalo, Buffalo, NY 14214, USA
* Author to whom correspondence should be addressed.
Mathematics 2023, 11(2), 345; https://doi.org/10.3390/math11020345
Submission received: 3 December 2022 / Revised: 30 December 2022 / Accepted: 5 January 2023 / Published: 9 January 2023
(This article belongs to the Special Issue Statistical Methods in Data Science and Applications)

Abstract

We investigate the association of a sensitive characteristic (a latent variable) with observed binary random variables by combining the randomized response (RR) technique of Warner (Warner, S.L. J. Am. Stat. Assoc. 1965, 60, 63–69) with a latent class model. First, an expectation-maximization (EM) algorithm is provided to estimate the parameters of the null and alternative/full models for the association between a sensitive characteristic and an observed categorical random variable under Warner’s RR design. The likelihood ratio test (LRT) is then used to identify observed categorical random variables that are significantly related to the sensitive trait. Another EM algorithm is presented to estimate the parameters of a latent class model built from the sensitive attribute and the observed binary random variables obtained by dichotomizing the categorical variables selected by the LRT. Finally, two classification criteria are used to predict whether an individual belongs to the sensitive or non-sensitive group. The practicality of the proposed methodology is illustrated with real data from a 2016 survey on the sexuality of first-year students, excluding international students, at Feng Chia University in Taiwan.

1. Introduction

Questionnaire surveys, administered via Google Forms, email, phone calls, or face-to-face interviews, have been widely used to gather data in various research fields, including behavioral science, education, sociology, economics, psychology, and biomedicine. The information gathered from these surveys is used to estimate, compare, and forecast unknown population proportions of sensitive characteristics of interest. However, when asked directly about a sensitive characteristic, such as political opinions, domestic violence, discrimination, abortion, drug use, gender identity, sexual behavior, anti-social behavior, exam fraud, plagiarism, illegal income, tax evasion, or gambling, respondents are likely to misreport their status or to refuse to answer. This can lead to errors in analysis results and mistakes in statistical inferences. For example, in a study conducted by [1] that involved direct interviews with unemployment benefit recipients, 75% of the survey participants who had engaged in welfare or unemployment benefits deception denied doing so. In addition, according to [2], if respondents are forced to answer sensitive questions directly, the percentage of non-heterosexual participants in a community is typically underestimated. Therefore, various indirect questioning techniques, such as the randomized response (RR), unmatched count, and crosswise model techniques, have been proposed to lessen potential bias due to non-response and socially desirable responding and, as a result, to improve the reliability of data on sensitive topics and better estimate the population proportion of people bearing a sensitive attribute. See, e.g., [3,4,5,6] for more details.
The RR technique (RRT), initially introduced by Warner [3], is one of the most famous indirect questioning techniques and has been widely used to collect sensitive data. The main idea of this technique is to safeguard respondents’ privacy by using a random device, such as a spinner, a deck of cards, dice, or a random number generator, so that their answers do not disclose their actual status. Specifically, Warner [3] suggested the related-question RRT, which uses a sensitive question of interest and its complement. The random device determines which of the two questions a respondent answers, and respondents can feel more comfortable answering “yes” or “no” honestly because the interviewer does not know which question was answered. However, Warner’s RRT still has limitations, such as not working when the probability that the sensitive question is selected is 0.5. Therefore, various extensions of Warner’s RRT have been suggested to overcome its limitations and enhance its efficiency. For example, Horvitz et al. [7] and Greenberg et al. [8] proposed an unrelated-question RR design in which the first question is sensitive, as in Warner’s design, and the second is innocuous and independent of the sensitive question. Mangat and Singh [9] proposed a two-stage RR design that uses two random devices. Christofides [10] suggested a generalized RRT. Huang [11] applied the two-stage RRT to enhance the performance of Warner’s RRT. Tian and Tang [12] provided another classification of the RRTs. The related literature can also be found in [2,5,13,14,15,16,17,18,19,20,21].
Data on auxiliary variables are also gathered through direct questions when collecting data on a sensitive trait or behavior. The effects of these auxiliary variables on the population proportion of a sensitive characteristic are important and have been examined in many studies. For instance, by using the RRT of [3], Maddala [22] estimated the effects of auxiliary variables, e.g., sex, age, and place of residence, on drug use by utilizing a logistic regression model. Scheers and Dayton [23] obtained sensitive information through the RR designs of [3,8], respectively. They proposed using a logistic regression model to describe the influence of the accompanying variables on the sensitive feature and provided a method for estimating the regression parameters. Hsieh et al. [14,24] proposed semiparametric methods for estimating the parameters of a logistic regression model when the values of some covariates are missing at random for some subjects, with information about sensitive characteristics gathered through the RR design of [8]. Chang et al. [17] developed a covariate extension of the two-stage RR design of [11] by employing logistic regression to investigate the effects of two covariates of interest on the sensitive attribute and on the truthful first-stage response of a respondent with the sensitive trait to the directly asked sensitive question. Based on the logistic regression parameter estimates, they estimated the probability of an honest first-stage response from a respondent with the sensitive feature and the probability that a respondent has the sensitive feature. Furthermore, in some practical applications, in addition to using an RR design to collect data on sensitive characteristics or behaviors, related scale data are also frequently gathered for use as auxiliary variables. For example, Lee et al. [19] studied students’ attitudes toward love, gender identity, and online dating experience through the 2016 survey on the sexuality of first-year students, excluding international students, at Feng Chia University in Taiwan. They used the sum of the scores of responses to six internet dating experience statements (statements 40–45) as the auxiliary variable for the sensitive issue, question 67: “Have you ever had a one-night stand through a dating site or mobile app?”, for which response data were gathered by the generalized RRT of Christofides [10]. They used their proposed Bayesian estimation methods to estimate the parameters of the normal independent and dependent models for the association of the response to question 67 with the sum of the scores of responses to the six internet dating experience statements. Two Bayesian model selection criteria were proposed to choose the more appropriate of the two models for describing this association.
In another approach to sample survey research, latent variable models, i.e., statistical models that relate a set of observed variables (so-called indicators or manifest variables) to a latent variable, have been widely used in various sciences, such as the social, economic, behavioral, and health sciences. The main types of latent variable models can be classified according to whether the manifest and latent variables are categorical or continuous, as in Table 7.1 of [25] (p. 178). See, e.g., [13,25,26] for more details. Some other extensions of latent variable models, where the manifest variables are treated as ordinal categorical variables, count variables, or other metrics, can be found in [27,28].
When both the latent and the indicator/manifest/observed variables are categorical, latent class (LC) analysis (LCA) is recommended. Lazarsfeld [29] first introduced the conceptual foundation of LCA, which was used as a tool for building typologies based on observed dichotomous variables, in research on the ethnocentrism of American soldiers during World War II [28,30]. Lazarsfeld and Henry [31] provided a thorough and in-depth conceptual and mathematical treatment of LCA, but reliable parameter estimation methods were lacking. Since then, the LC model (LCM) has been improved and applied in various scientific fields. For example, Goodman [32,33] provided a relatively simple method, which was later shown to be closely related to the expectation-maximization (EM) algorithm [34], for obtaining the maximum likelihood (ML) estimates of LCM parameters, together with goodness-of-fit methods for assessing how well the model fits the observed data. Haberman [35] demonstrated the connection between LCMs and log-linear models for frequency tables with missing (unknown) cell counts. Dayton and Macready [36] proposed a new development by incorporating covariates into an LCM. Muthén and Shedden [37] improved the models that identify latent growth trajectory class membership in longitudinal data based on individual growth trajectories and are estimated by the EM algorithm. Stern et al. [38] applied LCA in studying the two main temperamental types of children: inhibited and uninhibited. LCMs have been used to investigate the initiation of the use of substances, such as alcohol, caffeine, and tobacco, throughout adolescence [39]. Vermunt [40] provided an overview of applications of LCMs in social science research and an extension of the LCM to deal with nested data structures. Collins and Lanza [26] presented the methodology and applications of LCA for social, behavioral, and health sciences data. Nasiopoulou et al. [41] applied LCA to investigate the professional profiles of Swedish preschool teachers. LCA was used to determine the cause of occupational fatalities in the study of [42]. Wu et al. [43] used LCA to stratify the risk of incident diabetes in Chinese adults. For other studies about LCA and its applications, see, e.g., [26,44,45,46,47,48].
However, to the best of our knowledge, no research has combined the RRT and LCM to estimate the proportion of a sensitive or latent characteristic based on observed variables. This gap motivates the models, estimation methods, likelihood ratio test (LRT), and real-data application proposed here. This study first presents an EM algorithm for estimating the parameters of the null and alternative/full models for the association between the sensitive attribute and an observed categorical random variable when the RRT of Warner [3] is used. Note that our approach differs from that of Lee et al. [19], who used a Markov chain Monte Carlo estimation method and the generalized RRT of [10]. The LRT is then applied to assess whether the distribution of the observed categorical random variable (i.e., the auxiliary variable) differs between the sensitive and non-sensitive groups. Finally, a combination of the RRT of [3] and LCA is introduced in which the indicator/manifest/observed variables are binary. After the questions/statements significantly related to the sensitive characteristic have been collected from the results of the LRTs, an LCM is used to classify an individual into the sensitive or non-sensitive group.
Section 2.1 reviews the RR design of [3]. In Section 2.2, we introduce a model for the association between a sensitive variable and an observed categorical random variable, which is presented in [19], by using the RRT of [3]. An EM algorithm, called EM algorithm 1, is applied to estimate the parameters of this model, and the LRT is used to evaluate whether the observed categorical random variable is significantly associated with the sensitive characteristic. Section 2.3 provides an LCM that relates observed dichotomous random variables, obtained by dichotomizing the significant observed categorical random variables, to a latent variable, namely the sensitive attribute from the RR design of [3]. Another EM algorithm, called EM algorithm 2, is proposed to estimate the parameters of the LCM. A classification criterion for predicting whether an individual belongs to the sensitive or non-sensitive group is also given in this section. In Section 3, data from the 2016 survey on the sexuality of freshmen, excluding international students, at Feng Chia University in Taiwan are used to demonstrate practical applications of the proposed methodology. Another classification criterion is proposed to forecast whether students belong to the sensitive or non-sensitive group. Conclusions and some remarks are given in Section 4.

2. Models and Methods

2.1. Warner’s RR Design

The fundamental idea of the related-question RR design of [3] is to protect respondents’ privacy by concealing their responses via a random device, e.g., a spinner, dice, or a deck of cards. For example, assume that each respondent randomly receives a card, from a deck of cards, marked with question $A$ or $\bar{A}$ as follows:
  • $A$: Have you ever had a one-night stand through a dating site or mobile app?
  • $\bar{A}$: Have you never had a one-night stand through a dating site or mobile app?
Respondents report a truthful “yes” or “no” response to the question they receive without showing the interviewer which question has been selected. The actual status of the respondents regarding whether or not they have ever had a one-night stand through a dating site or mobile app remains undisclosed, and, therefore, their privacy is protected because neither the interviewer nor the researcher is even aware of the question to which the released answer refers.
Let $\theta$ be the population proportion of persons bearing the sensitive trait (having ever had a one-night stand through a dating site or mobile app), $p$ the probability that a card is marked with question $A$, and $1 - p$ the probability that a card is marked with question $\bar{A}$. Assume that $p$ is known. The probability of answering “yes” is then
$$P(\text{yes}) = P(A)\,P(\text{yes} \mid A) + P(\bar{A})\,P(\text{yes} \mid \bar{A}) = \theta p + (1 - \theta)(1 - p) = \theta(2p - 1) + 1 - p.$$
Let $\hat{\lambda}$ be the proportion of respondents who answer “yes”. Because $2p - 1 \neq 0$, the estimate of $\theta$ can be obtained as follows:
$$\hat{\theta}_w = \frac{\hat{\lambda} - (1 - p)}{2p - 1}.$$
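As a quick numerical illustration of the estimator above, the following minimal Python sketch computes $\hat{\theta}_w$ from a count of “yes” answers and checks it on simulated responses. The function names `warner_estimate` and `simulate_warner` are hypothetical and introduced only for illustration; the simulation assumes fully truthful respondents.

```python
import random


def warner_estimate(yes_count, n, p):
    """Warner's estimate (lambda_hat - (1 - p)) / (2p - 1) of theta."""
    if abs(2 * p - 1) < 1e-12:
        raise ValueError("p = 0.5 makes theta unidentifiable")
    lambda_hat = yes_count / n
    return (lambda_hat - (1 - p)) / (2 * p - 1)


def simulate_warner(n, theta, p, seed=0):
    """Simulate n truthful answers under Warner's design; return the number of 'yes'."""
    rng = random.Random(seed)
    yes = 0
    for _ in range(n):
        has_trait = rng.random() < theta
        gets_question_a = rng.random() < p
        # 'yes' to A if the trait is present, 'yes' to A-bar if it is absent.
        yes += int(has_trait == gets_question_a)
    return yes


if __name__ == "__main__":
    n, theta, p = 200000, 0.2, 0.7
    print(warner_estimate(simulate_warner(n, theta, p), n, p))
```

Note that in small samples this moment-type estimate can fall outside $[0, 1]$, since nothing in the formula constrains it.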
However, using a random device may lead a respondent to report different answers when the survey is repeated. Yu et al. [5] pointed out this repetition problem and, hence, provided another design in which a respondent chooses question $A$ or $\bar{A}$ according to a personal characteristic. For example, a respondent born between August 11 and December 31 answers question $\bar{A}$, and any other respondent answers question $A$. Therefore, if the respondent repeats the survey, the results are the same. In addition, collecting responses to sensitive questions through this birthday design can reduce refusals and false answers and, hence, improve estimation efficiency. In this study, for simplicity, we use the birthday design of [5] to implement the RR design of [3], with the data from the study of [19] serving as the real-data analysis. With this approach, it is only necessary to write specific explanations in the questionnaire, and students can then answer the questions by themselves without using a random device. The details of these explanations are given in Section 3.

2.2. Model for the Association between Sensitive Attribute and Categorical Variable

2.2.1. Model

Let $Y_i$, $i = 1, 2, \ldots, n$, denote whether the $i$th sample respondent has the sensitive attribute $A$, with $Y_i = 1$ if yes and $Y_i = 0$ otherwise. Assume that $Z_i$ denotes the $i$th respondent’s response to a direct inquiry, which is a categorical random variable with values $1, 2, \ldots, B$ or a quantitative variable.
The model for the association between $Y_i$ and $Z_i$ is considered as follows:
$$P(Y_i = y) = \theta^{y}(1 - \theta)^{1 - y}, \qquad P(Z_i = z \mid Y_i = y) = \pi_{yz}, \qquad z = 1, 2, \ldots, B,\; y = 0, 1. \tag{1}$$
From the above model, one can then obtain
$$P(Z_i = z, Y_i = y) = \theta^{y}(1 - \theta)^{1 - y}\pi_{yz}, \qquad z = 1, 2, \ldots, B,\; y = 0, 1.$$
Let $Y_i^0$ be the result of using this non-random answering design as given in Table 10 of [17]. Let $Y_i^0 = 1$ denote the answer of “A” to a sensitive question and $Y_i^0 = 0$ the answer of “B” to the sensitive question. Assume that $p$ is the probability that a respondent’s birthday falls between January 1 and August 10. $P(Z_i = z, Y_i^0 = y^0)$ can then be expressed as follows:
$$\begin{aligned} P(Z_i = z, Y_i^0 = 1) &= P(Z_i = z, Y_i^0 = 1, Y_i = 1) + P(Z_i = z, Y_i^0 = 1, Y_i = 0) \\ &= P(Z_i = z, Y_i^0 = 1 \mid Y_i = 1)P(Y_i = 1) + P(Z_i = z, Y_i^0 = 1 \mid Y_i = 0)P(Y_i = 0) \\ &= \pi_{1z}\,p\,\theta + \pi_{0z}(1 - p)(1 - \theta) \end{aligned}$$
and
$$\begin{aligned} P(Z_i = z, Y_i^0 = 0) &= P(Z_i = z, Y_i^0 = 0, Y_i = 1) + P(Z_i = z, Y_i^0 = 0, Y_i = 0) \\ &= P(Z_i = z, Y_i^0 = 0 \mid Y_i = 1)P(Y_i = 1) + P(Z_i = z, Y_i^0 = 0 \mid Y_i = 0)P(Y_i = 0) \\ &= \pi_{1z}(1 - p)\theta + \pi_{0z}\,p\,(1 - \theta). \end{aligned}$$
Let $y^0 = (y_1^0, y_2^0, \ldots, y_n^0)$, $z = (z_1, z_2, \ldots, z_n)$, $\pi_y = (\pi_{y1}, \pi_{y2}, \ldots, \pi_{yB})$, $y = 0, 1$, and $\Theta = (\theta, \pi_0, \pi_1)$. Because only $(Y_i^0, Z_i)$, $i = 1, 2, \ldots, n$, can be observed, the observed-data likelihood function of $\Theta$ can then be written as
$$\begin{aligned} L_{obs}(\Theta \mid y^0, z) &= \prod_{i=1}^{n} \left[p\theta\pi_{1z_i} + (1 - p)(1 - \theta)\pi_{0z_i}\right]^{I(y_i^0 = 1)} \left[p(1 - \theta)\pi_{0z_i} + (1 - p)\theta\pi_{1z_i}\right]^{I(y_i^0 = 0)} \\ &= \prod_{i=1}^{n}\prod_{j=1}^{B} \left\{\left[p\theta\pi_{1j} + (1 - p)(1 - \theta)\pi_{0j}\right]^{I(y_i^0 = 1)} \left[p(1 - \theta)\pi_{0j} + (1 - p)\theta\pi_{1j}\right]^{I(y_i^0 = 0)}\right\}^{I(z_i = j)} \\ &= \prod_{j=1}^{B} \left[p\theta\pi_{1j} + (1 - p)(1 - \theta)\pi_{0j}\right]^{\sum_{i=1}^{n} I(y_i^0 = 1,\, z_i = j)} \left[p(1 - \theta)\pi_{0j} + (1 - p)\theta\pi_{1j}\right]^{\sum_{i=1}^{n} I(y_i^0 = 0,\, z_i = j)}. \end{aligned}$$
Given an initial value for $\Theta$, the R function optim or nlminb can be used to obtain the ML estimate of $\Theta$. In practice, however, applying optim or nlminb to the RR design of [3] may fail to converge or may return an estimate outside the parameter space of $\Theta$. Therefore, an EM algorithm is proposed in the next section to overcome this problem.
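Because the observed-data likelihood depends on the data only through the cell counts $\sum_i I(y_i^0 = 1, z_i = j)$ and $\sum_i I(y_i^0 = 0, z_i = j)$, its logarithm is cheap to evaluate. The following is a minimal Python sketch of that evaluation (Python is used here for illustration instead of the R functions mentioned in the text). The names `cell_probabilities` and `observed_loglik` and the count vectors `n1`, `n0` are assumptions introduced for this example.

```python
import math


def cell_probabilities(theta, pi1, pi0, p):
    """Return two lists: P(Z = j, Y0 = 1) and P(Z = j, Y0 = 0) for j = 1..B."""
    prob_y0_1 = [p * theta * a + (1 - p) * (1 - theta) * b
                 for a, b in zip(pi1, pi0)]
    prob_y0_0 = [(1 - p) * theta * a + p * (1 - theta) * b
                 for a, b in zip(pi1, pi0)]
    return prob_y0_1, prob_y0_0


def observed_loglik(theta, pi1, pi0, n1, n0, p):
    """Observed-data log-likelihood from cell counts.

    n1[j-1] = #{i: y0_i = 1, z_i = j}; n0[j-1] = #{i: y0_i = 0, z_i = j}.
    Parameters are assumed to lie strictly inside the parameter space.
    """
    prob1, prob0 = cell_probabilities(theta, pi1, pi0, p)
    loglik = 0.0
    for c1, c0, q1, q0 in zip(n1, n0, prob1, prob0):
        if c1 > 0:
            loglik += c1 * math.log(q1)
        if c0 > 0:
            loglik += c0 * math.log(q0)
    return loglik
```

A general-purpose optimizer applied to this function has to respect the constraints $0 < \theta < 1$ and $\sum_j \pi_{yj} = 1$ explicitly, which is one way the out-of-range estimates mentioned above can arise.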

2.2.2. EM Algorithm 1

The EM algorithm developed by Dempster et al. [34] has been the most widely used iterative technique for computing ML estimates from incomplete data. The EM algorithm consists of two steps: the expectation (E)-step and maximization (M)-step. In the E-step, the latent or unobserved data are estimated by their expectation given the observed data and current parameter estimates. The M-step is used to maximize the expectation in the E-step to update estimates of unknown parameters.
Let $y = (y_1, y_2, \ldots, y_n)$. The complete-data likelihood function of $\Theta$ given $(y, y^0, z)$ is written as follows:
$$\begin{aligned} L_c(\Theta \mid y, y^0, z) &= \prod_{i=1}^{n} \left[p\theta\pi_{1z_i}\right]^{I(y_i^0 = 1,\, y_i = 1)} \left[(1 - p)\theta\pi_{1z_i}\right]^{I(y_i^0 = 0,\, y_i = 1)} \\ &\qquad \times \left[(1 - p)(1 - \theta)\pi_{0z_i}\right]^{I(y_i^0 = 1,\, y_i = 0)} \left[p(1 - \theta)\pi_{0z_i}\right]^{I(y_i^0 = 0,\, y_i = 0)} \\ &= \prod_{i=1}^{n}\prod_{j=1}^{B} \Big\{\left[p\theta\pi_{1j}\right]^{y_i^0 y_i} \left[(1 - p)\theta\pi_{1j}\right]^{(1 - y_i^0) y_i} \\ &\qquad \times \left[(1 - p)(1 - \theta)\pi_{0j}\right]^{y_i^0 (1 - y_i)} \left[p(1 - \theta)\pi_{0j}\right]^{(1 - y_i^0)(1 - y_i)}\Big\}^{I(z_i = j)}. \end{aligned}$$
The complete-data log-likelihood function of $\Theta$ given $(y, y^0, z)$ can then be expressed as follows:
$$\begin{aligned} \ell_c(\Theta \mid y, y^0, z) = \sum_{j=1}^{B}\sum_{i=1}^{n} \Big\{ & y_i^0 y_i I(z_i = j) \ln(p\theta\pi_{1j}) + (1 - y_i^0) y_i I(z_i = j) \ln[(1 - p)\theta\pi_{1j}] \\ & + y_i^0 (1 - y_i) I(z_i = j) \ln[(1 - p)(1 - \theta)\pi_{0j}] + (1 - y_i^0)(1 - y_i) I(z_i = j) \ln[p(1 - \theta)\pi_{0j}] \Big\}. \end{aligned} \tag{3}$$
Note that although $Y$ is a latent or unobserved variable, one can obtain $P(Y_i = y \mid Z_i = j, Y_i^0 = y^0)$, $y = 0, 1$, $y^0 = 0, 1$, $j = 1, 2, \ldots, B$, as follows:
$$\begin{aligned} P(Y_i = 1 \mid Z_i = j, Y_i^0 = 1) &= \frac{p\theta\pi_{1j}}{p\theta\pi_{1j} + (1 - p)(1 - \theta)\pi_{0j}}, & P(Y_i = 1 \mid Z_i = j, Y_i^0 = 0) &= \frac{(1 - p)\theta\pi_{1j}}{(1 - p)\theta\pi_{1j} + p(1 - \theta)\pi_{0j}}, \\ P(Y_i = 0 \mid Z_i = j, Y_i^0 = 1) &= \frac{(1 - p)(1 - \theta)\pi_{0j}}{p\theta\pi_{1j} + (1 - p)(1 - \theta)\pi_{0j}}, & P(Y_i = 0 \mid Z_i = j, Y_i^0 = 0) &= \frac{p(1 - \theta)\pi_{0j}}{(1 - p)\theta\pi_{1j} + p(1 - \theta)\pi_{0j}}. \end{aligned}$$
Therefore, it can yield the following results:
$$\begin{aligned} E\left[Y_i \mid Z_i = j, Y_i^0 = 1\right] &= P(Y_i = 1 \mid Z_i = j, Y_i^0 = 1) = \frac{p\theta\pi_{1j}}{p\theta\pi_{1j} + (1 - p)(1 - \theta)\pi_{0j}}, \\ E\left[Y_i \mid Z_i = j, Y_i^0 = 0\right] &= P(Y_i = 1 \mid Z_i = j, Y_i^0 = 0) = \frac{(1 - p)\theta\pi_{1j}}{(1 - p)\theta\pi_{1j} + p(1 - \theta)\pi_{0j}}, \\ E\left[(1 - Y_i) \mid Z_i = j, Y_i^0 = 1\right] &= P(Y_i = 0 \mid Z_i = j, Y_i^0 = 1) = \frac{(1 - p)(1 - \theta)\pi_{0j}}{p\theta\pi_{1j} + (1 - p)(1 - \theta)\pi_{0j}}, \\ E\left[(1 - Y_i) \mid Z_i = j, Y_i^0 = 0\right] &= P(Y_i = 0 \mid Z_i = j, Y_i^0 = 0) = \frac{p(1 - \theta)\pi_{0j}}{(1 - p)\theta\pi_{1j} + p(1 - \theta)\pi_{0j}}. \end{aligned}$$
E-step: The E-step takes the expectation of $\ell_c(\Theta \mid y, y^0, z)$ in (3) with respect to the conditional distributions of the unobserved variables $Y_i$ given the current estimate of $\Theta$ and the observed data $(y^0, z)$. Let $\hat{\Theta}^{(m)} = (\hat{\theta}^{(m)}, \hat{\pi}_0^{(m)}, \hat{\pi}_1^{(m)})$ be an estimate of $\Theta$ at the $m$th iteration, where $\hat{\pi}_y^{(m)} = (\hat{\pi}_{y1}^{(m)}, \hat{\pi}_{y2}^{(m)}, \ldots, \hat{\pi}_{yB}^{(m)})$, $y = 0, 1$. The initial values are $\hat{\theta}^{(0)} = \max\{0.001, [\bar{y}^0 - (1 - p)]/(2p - 1)\}$, where $\bar{y}^0 = n^{-1}\sum_{i=1}^{n} y_i^0$, and $\hat{\pi}_{1j}^{(0)} = \hat{\pi}_{0j}^{(0)} = n^{-1}\sum_{i=1}^{n} I(z_i = j)$. Given the observed data $(y^0, z)$ and $\hat{\Theta}^{(m)}$, by taking the expectation of $\ell_c(\Theta \mid Y, Y^0, Z)$ in (3), where $Y^0 = (Y_1^0, Y_2^0, \ldots, Y_n^0)$, $Y = (Y_1, Y_2, \ldots, Y_n)$, and $Z = (Z_1, Z_2, \ldots, Z_n)$, the Q-function can be given as follows:
$$\begin{aligned} Q\left(\Theta \mid \hat{\Theta}^{(m)}\right) &= E\left[\ell_c(\Theta \mid Y, Y^0, Z) \mid y^0, z, \hat{\Theta}^{(m)}\right] \\ &= \sum_{j=1}^{B}\sum_{i=1}^{n} y_i^0 I(z_i = j)\, E\left[Y_i \mid y^0, z, \hat{\Theta}^{(m)}\right] \ln(p\theta\pi_{1j}) \\ &\quad + \sum_{j=1}^{B}\sum_{i=1}^{n} (1 - y_i^0) I(z_i = j)\, E\left[Y_i \mid y^0, z, \hat{\Theta}^{(m)}\right] \ln[(1 - p)\theta\pi_{1j}] \\ &\quad + \sum_{j=1}^{B}\sum_{i=1}^{n} y_i^0 I(z_i = j)\, E\left[1 - Y_i \mid y^0, z, \hat{\Theta}^{(m)}\right] \ln[(1 - p)(1 - \theta)\pi_{0j}] \\ &\quad + \sum_{j=1}^{B}\sum_{i=1}^{n} (1 - y_i^0) I(z_i = j)\, E\left[1 - Y_i \mid y^0, z, \hat{\Theta}^{(m)}\right] \ln[p(1 - \theta)\pi_{0j}]. \end{aligned} \tag{4}$$
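Let $\hat{A}_y^{(m)} = \sum_{j=1}^{B} \hat{A}_{yj}^{(m)}$, $y = 0, 1$, where
$$\begin{aligned} \hat{A}_{1j}^{(m)} &= \sum_{i=1}^{n} E\left[Y_i Y_i^0 I(Z_i = j) \mid Z_i = j, Y_i^0 = 1, \hat{\Theta}^{(m)}\right] = \sum_{i=1}^{n} y_i^0 I(z_i = j)\, \frac{p\hat{\theta}^{(m)}\hat{\pi}_{1j}^{(m)}}{p\hat{\theta}^{(m)}\hat{\pi}_{1j}^{(m)} + (1 - p)(1 - \hat{\theta}^{(m)})\hat{\pi}_{0j}^{(m)}}, \\ \hat{A}_{0j}^{(m)} &= \sum_{i=1}^{n} E\left[Y_i (1 - Y_i^0) I(Z_i = j) \mid Z_i = j, Y_i^0 = 0, \hat{\Theta}^{(m)}\right] = \sum_{i=1}^{n} (1 - y_i^0) I(z_i = j)\, \frac{(1 - p)\hat{\theta}^{(m)}\hat{\pi}_{1j}^{(m)}}{(1 - p)\hat{\theta}^{(m)}\hat{\pi}_{1j}^{(m)} + p(1 - \hat{\theta}^{(m)})\hat{\pi}_{0j}^{(m)}}. \end{aligned}$$
Let
$$\begin{aligned} \hat{K}_{1j}^{(m)} &= \sum_{i=1}^{n} E\left[(1 - Y_i) Y_i^0 I(Z_i = j) \mid Z_i = j, Y_i^0 = 1, \hat{\Theta}^{(m)}\right] = \sum_{i=1}^{n} y_i^0 I(z_i = j)\, \frac{(1 - p)(1 - \hat{\theta}^{(m)})\hat{\pi}_{0j}^{(m)}}{p\hat{\theta}^{(m)}\hat{\pi}_{1j}^{(m)} + (1 - p)(1 - \hat{\theta}^{(m)})\hat{\pi}_{0j}^{(m)}}, \\ \hat{K}_{0j}^{(m)} &= \sum_{i=1}^{n} E\left[(1 - Y_i)(1 - Y_i^0) I(Z_i = j) \mid Z_i = j, Y_i^0 = 0, \hat{\Theta}^{(m)}\right] = \sum_{i=1}^{n} (1 - y_i^0) I(z_i = j)\, \frac{p(1 - \hat{\theta}^{(m)})\hat{\pi}_{0j}^{(m)}}{(1 - p)\hat{\theta}^{(m)}\hat{\pi}_{1j}^{(m)} + p(1 - \hat{\theta}^{(m)})\hat{\pi}_{0j}^{(m)}}. \end{aligned}$$
Note that $\hat{K}_{1j}^{(m)} + \hat{A}_{1j}^{(m)} = \sum_{i=1}^{n} y_i^0 I(z_i = j)$ and $\hat{K}_{0j}^{(m)} + \hat{A}_{0j}^{(m)} = \sum_{i=1}^{n} (1 - y_i^0) I(z_i = j)$. These imply $\sum_{j=1}^{B} (\hat{K}_{1j}^{(m)} + \hat{A}_{1j}^{(m)}) = \sum_{i=1}^{n} y_i^0$ and $\sum_{j=1}^{B} (\hat{K}_{0j}^{(m)} + \hat{A}_{0j}^{(m)}) = \sum_{i=1}^{n} (1 - y_i^0)$. Hence, $\sum_{j=1}^{B} \hat{K}_{1j}^{(m)} = \sum_{i=1}^{n} y_i^0 - \hat{A}_1^{(m)}$ and $\sum_{j=1}^{B} \hat{K}_{0j}^{(m)} = n - \sum_{i=1}^{n} y_i^0 - \hat{A}_0^{(m)}$, so that $\sum_{j=1}^{B} (\hat{K}_{1j}^{(m)} + \hat{K}_{0j}^{(m)}) = n - (\hat{A}_1^{(m)} + \hat{A}_0^{(m)})$. Based on these results, the Q-function in (4) can be re-written as follows:
$$\begin{aligned} Q\left(\Theta \mid \hat{\Theta}^{(m)}\right) &= \sum_{j=1}^{B} \Big\{\hat{A}_{1j}^{(m)} \ln(p\theta\pi_{1j}) + \hat{A}_{0j}^{(m)} \ln[(1 - p)\theta\pi_{1j}] + \hat{K}_{1j}^{(m)} \ln[(1 - p)(1 - \theta)\pi_{0j}] + \hat{K}_{0j}^{(m)} \ln[p(1 - \theta)\pi_{0j}]\Big\} \\ &= \sum_{j=1}^{B} \Big\{\hat{A}_{1j}^{(m)} \left[\ln p + \ln\theta + \ln\pi_{1j}\right] + \hat{A}_{0j}^{(m)} \left[\ln(1 - p) + \ln\theta + \ln\pi_{1j}\right] \\ &\qquad\quad + \hat{K}_{1j}^{(m)} \left[\ln(1 - p) + \ln(1 - \theta) + \ln\pi_{0j}\right] + \hat{K}_{0j}^{(m)} \left[\ln p + \ln(1 - \theta) + \ln\pi_{0j}\right]\Big\} \\ &= \sum_{j=1}^{B} \left[(\hat{A}_{1j}^{(m)} + \hat{K}_{0j}^{(m)}) \ln p + (\hat{A}_{0j}^{(m)} + \hat{K}_{1j}^{(m)}) \ln(1 - p)\right] + \sum_{j=1}^{B} \left[(\hat{A}_{1j}^{(m)} + \hat{A}_{0j}^{(m)}) \ln\theta + (\hat{K}_{1j}^{(m)} + \hat{K}_{0j}^{(m)}) \ln(1 - \theta)\right] \\ &\quad + \sum_{j=1}^{B} \left[(\hat{A}_{1j}^{(m)} + \hat{A}_{0j}^{(m)}) \ln\pi_{1j} + (\hat{K}_{1j}^{(m)} + \hat{K}_{0j}^{(m)}) \ln\pi_{0j}\right] \\ &= \sum_{j=1}^{B} \left[(\hat{A}_{1j}^{(m)} + \hat{K}_{0j}^{(m)}) \ln p + (\hat{A}_{0j}^{(m)} + \hat{K}_{1j}^{(m)}) \ln(1 - p)\right] + (\hat{A}_1^{(m)} + \hat{A}_0^{(m)}) \ln\theta + \left[n - (\hat{A}_1^{(m)} + \hat{A}_0^{(m)})\right] \ln(1 - \theta) \\ &\quad + \sum_{j=1}^{B-1} \left[(\hat{A}_{1j}^{(m)} + \hat{A}_{0j}^{(m)}) \ln\pi_{1j} + (\hat{K}_{1j}^{(m)} + \hat{K}_{0j}^{(m)}) \ln\pi_{0j}\right] \\ &\quad + (\hat{A}_{1B}^{(m)} + \hat{A}_{0B}^{(m)}) \ln\Big(1 - \sum_{j=1}^{B-1} \pi_{1j}\Big) + (\hat{K}_{1B}^{(m)} + \hat{K}_{0B}^{(m)}) \ln\Big(1 - \sum_{j=1}^{B-1} \pi_{0j}\Big). \end{aligned} \tag{5}$$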
M-step: The M-step maximizes $Q(\Theta \mid \hat{\Theta}^{(m)})$ in (5) with respect to $\Theta$ given $\hat{\Theta}^{(m)}$ to update the estimate of $\Theta$, denoted by $\hat{\Theta}^{(m+1)} = (\hat{\theta}^{(m+1)}, \hat{\pi}_0^{(m+1)}, \hat{\pi}_1^{(m+1)})$, by solving the equations $\partial Q(\Theta \mid \hat{\Theta}^{(m)})/\partial\Theta = 0$, whose solutions are expressed as follows:
$$\hat{\theta}^{(m+1)} = \frac{\hat{A}_1^{(m)} + \hat{A}_0^{(m)}}{n}, \qquad \hat{\pi}_{1j}^{(m+1)} = \frac{\hat{A}_{1j}^{(m)} + \hat{A}_{0j}^{(m)}}{\hat{A}_1^{(m)} + \hat{A}_0^{(m)}}, \qquad \hat{\pi}_{0j}^{(m+1)} = \frac{\hat{K}_{1j}^{(m)} + \hat{K}_{0j}^{(m)}}{n - (\hat{A}_1^{(m)} + \hat{A}_0^{(m)})}.$$
Iterate the E-step and M-step until $|\hat{\Theta}^{(m+1)} - \hat{\Theta}^{(m)}| < \varepsilon$, where $\varepsilon = 5 \times 10^{-5}$ in this study.
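The E-step and M-step above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `em_full_model` is hypothetical, the convergence check uses the maximum absolute change over all components (the norm is not specified in the text), and the starting value of $\theta$ is additionally clipped from above at 0.999 to keep it inside $(0, 1)$. Since the E-step quantities depend on the data only through cell counts, each iteration costs $O(B)$.

```python
def em_full_model(y0, z, B, p, tol=5e-5, max_iter=100000):
    """EM iterations for (theta, pi1, pi0) under the full model.

    y0: list of 0/1 randomized answers; z: list of categories in 1..B;
    p: known design probability with p != 0.5.
    """
    n = len(y0)
    n1 = [0] * B  # n1[j-1] = #{i: y0_i = 1, z_i = j}
    n0 = [0] * B  # n0[j-1] = #{i: y0_i = 0, z_i = j}
    for yi, zi in zip(y0, z):
        (n1 if yi == 1 else n0)[zi - 1] += 1

    ybar = sum(y0) / n
    # Starting values from the text; the upper clip at 0.999 is an added assumption.
    theta = min(0.999, max(0.001, (ybar - (1 - p)) / (2 * p - 1)))
    pi1 = [(n1[j] + n0[j]) / n for j in range(B)]
    pi0 = list(pi1)

    for _ in range(max_iter):
        # E-step: A[j] = A_1j + A_0j and K[j] = K_1j + K_0j.
        A = [0.0] * B
        K = [0.0] * B
        for j in range(B):
            d1 = p * theta * pi1[j] + (1 - p) * (1 - theta) * pi0[j]
            d0 = (1 - p) * theta * pi1[j] + p * (1 - theta) * pi0[j]
            a1 = n1[j] * p * theta * pi1[j] / d1 if n1[j] else 0.0
            a0 = n0[j] * (1 - p) * theta * pi1[j] / d0 if n0[j] else 0.0
            A[j] = a1 + a0
            K[j] = n1[j] + n0[j] - A[j]
        # M-step: closed-form updates.
        a_total = sum(A)
        new_theta = a_total / n
        new_pi1 = [a / a_total for a in A]
        new_pi0 = [k / (n - a_total) for k in K]
        change = max([abs(new_theta - theta)]
                     + [abs(u - v) for u, v in zip(new_pi1, pi1)]
                     + [abs(u - v) for u, v in zip(new_pi0, pi0)])
        theta, pi1, pi0 = new_theta, new_pi1, new_pi0
        if change < tol:
            break
    return theta, pi1, pi0
```

Because every update is a ratio of non-negative expected counts, the iterates stay inside the parameter space, which is the practical advantage over an unconstrained optimizer.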
Under the null hypothesis $H_0: \pi_{1z} = \pi_{0z} = \pi_{\cdot z}$, $z = 1, 2, \ldots, B$, the model in (1), called an alternative/full model, is reduced to the following null model:
$$P(Y_i = y) = \theta^{y}(1 - \theta)^{1 - y}, \qquad P(Z_i = z \mid Y_i = y) = \pi_{\cdot z}, \qquad z = 1, 2, \ldots, B,\; y = 0, 1. \tag{6}$$
Based on the above null model, $P(Z_i = z, Y_i = y)$ can then be expressed as
$$P(Z_i = z, Y_i = y) = \theta^{y}(1 - \theta)^{1 - y}\pi_{\cdot z}, \qquad z = 1, 2, \ldots, B,\; y = 0, 1. \tag{7}$$
$P(Y_i = y \mid Z_i = j, Y_i^0 = y^0)$ can be given by
$$\begin{aligned} P(Y_i = 1 \mid Z_i = j, Y_i^0 = 1) &= \frac{p\theta}{p\theta + (1 - p)(1 - \theta)}, & P(Y_i = 1 \mid Z_i = j, Y_i^0 = 0) &= \frac{(1 - p)\theta}{(1 - p)\theta + p(1 - \theta)}, \\ P(Y_i = 0 \mid Z_i = j, Y_i^0 = 1) &= \frac{(1 - p)(1 - \theta)}{p\theta + (1 - p)(1 - \theta)}, & P(Y_i = 0 \mid Z_i = j, Y_i^0 = 0) &= \frac{p(1 - \theta)}{(1 - p)\theta + p(1 - \theta)}. \end{aligned}$$
Based on arguments similar to those in EM algorithm 1, one can obtain an estimate of $\theta$ at the $(m+1)$th iteration of the EM algorithm as follows:
$$\tilde{\theta}^{(m+1)} = \frac{\tilde{A}_1^{(m)} + \tilde{A}_0^{(m)}}{n},$$
where
$$\tilde{A}_1^{(m)} = \sum_{i=1}^{n} I(y_i^0 = 1)\, \frac{p\tilde{\theta}^{(m)}}{p\tilde{\theta}^{(m)} + (1 - p)(1 - \tilde{\theta}^{(m)})}, \qquad \tilde{A}_0^{(m)} = \sum_{i=1}^{n} I(y_i^0 = 0)\, \frac{(1 - p)\tilde{\theta}^{(m)}}{(1 - p)\tilde{\theta}^{(m)} + p(1 - \tilde{\theta}^{(m)})}.$$
Note that the ML estimate of $\pi_{\cdot z}$ in (6) and (7) is $\sum_{i=1}^{n} I(z_i = z)/n$, $z = 1, 2, \ldots, B$.
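Under the null model only $\theta$ needs to be iterated, so the recursion is a one-dimensional fixed-point iteration. A minimal Python sketch follows; the names `em_null_theta` and `null_category_probs`, the starting value 0.5, and the stopping rule on the absolute change are assumptions made for illustration.

```python
def em_null_theta(n_y0_1, n, p, theta0=0.5, tol=5e-5, max_iter=100000):
    """Iterate theta <- (A1 + A0) / n under the null model.

    n_y0_1 is the number of respondents with y0 = 1; p is known, p != 0.5.
    """
    theta = theta0
    for _ in range(max_iter):
        a1 = n_y0_1 * p * theta / (p * theta + (1 - p) * (1 - theta))
        a0 = (n - n_y0_1) * (1 - p) * theta / ((1 - p) * theta + p * (1 - theta))
        new_theta = (a1 + a0) / n
        if abs(new_theta - theta) < tol:
            return new_theta
        theta = new_theta
    return theta


def null_category_probs(z, B):
    """ML estimates of the common category probabilities: sample proportions of z."""
    n = len(z)
    return [sum(1 for zi in z if zi == j) / n for j in range(1, B + 1)]
```

Setting $P(Y_i^0 = 1) = p\theta + (1 - p)(1 - \theta)$ equal to the observed proportion of $y_i^0 = 1$ satisfies the fixed-point equation, so when Warner's estimate $[\bar{y}^0 - (1 - p)]/(2p - 1)$ lies in $(0, 1)$ the iteration has it as an interior fixed point.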

2.2.3. Likelihood Ratio Test (LRT)

The LRT of Neyman and Pearson [49] is utilized to determine which model is appropriate for the association of the sensitive feature with an observed categorical random variable by testing the following hypotheses:
$$H_0: \pi_{1j} = \pi_{0j} = \pi_{\cdot j} \text{ for all } j = 1, 2, \ldots, B, \qquad H_1: \pi_{1j} \neq \pi_{0j} \text{ for some } j = 1, 2, \ldots, B. \tag{8}$$
Let $\pi_{\cdot} = (\pi_{\cdot 1}, \pi_{\cdot 2}, \ldots, \pi_{\cdot B})$. Under $H_0$, the observed-data likelihood function of $(\theta, \pi_{\cdot})$ is given by
$$\begin{aligned} L_{obs}(\theta, \pi_{\cdot} \mid y^0, z) &= \prod_{i=1}^{n} \left[p\theta\pi_{\cdot z_i} + (1 - p)(1 - \theta)\pi_{\cdot z_i}\right]^{I(y_i^0 = 1)} \left[p(1 - \theta)\pi_{\cdot z_i} + (1 - p)\theta\pi_{\cdot z_i}\right]^{I(y_i^0 = 0)} \\ &= \prod_{i=1}^{n}\prod_{j=1}^{B} \left\{\left[p\theta\pi_{\cdot j} + (1 - p)(1 - \theta)\pi_{\cdot j}\right]^{I(y_i^0 = 1)} \left[p(1 - \theta)\pi_{\cdot j} + (1 - p)\theta\pi_{\cdot j}\right]^{I(y_i^0 = 0)}\right\}^{I(z_i = j)} \\ &= \prod_{i=1}^{n} \left[p\theta + (1 - p)(1 - \theta)\right]^{I(y_i^0 = 1)} \left[p(1 - \theta) + (1 - p)\theta\right]^{I(y_i^0 = 0)} \prod_{i=1}^{n}\prod_{j=1}^{B} [\pi_{\cdot j}]^{I(z_i = j)} \\ &= \left[p\theta + (1 - p)(1 - \theta)\right]^{\sum_{i=1}^{n} I(y_i^0 = 1)} \left[p(1 - \theta) + (1 - p)\theta\right]^{\sum_{i=1}^{n} I(y_i^0 = 0)} \prod_{j=1}^{B} [\pi_{\cdot j}]^{\sum_{i=1}^{n} I(z_i = j)}. \end{aligned}$$
Therefore, one can use the following LRT statistic
$$\Lambda = -2 \log \frac{\sup_{\theta, \pi_{\cdot}} L_{obs}(\theta, \pi_{\cdot} \mid y^0, z)}{\sup_{\theta, \pi_0, \pi_1} L_{obs}(\theta, \pi_0, \pi_1 \mid y^0, z)} \xrightarrow{d} \chi^2_{B-1}$$
to determine whether to reject $H_0$ in (8).
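Given ML estimates under both models, $\Lambda$ and its asymptotic p-value can be computed as in the following minimal Python sketch. The function names are hypothetical; `chi2_sf` is a small stdlib-only helper that uses the standard closed forms of the chi-square survival function for integer degrees of freedom, so that no external statistics library is assumed. The null log-likelihood is obtained by passing the same vector for `pi1` and `pi0`.

```python
import math


def observed_loglik(theta, pi1, pi0, n1, n0, p):
    """Observed-data log-likelihood from cell counts n1[j-1], n0[j-1] (y0 = 1, 0)."""
    loglik = 0.0
    for a, b, c1, c0 in zip(pi1, pi0, n1, n0):
        if c1 > 0:
            loglik += c1 * math.log(p * theta * a + (1 - p) * (1 - theta) * b)
        if c0 > 0:
            loglik += c0 * math.log((1 - p) * theta * a + p * (1 - theta) * b)
    return loglik


def lrt_statistic(loglik_null, loglik_full):
    """Lambda = -2 * (maximized null log-likelihood - maximized full log-likelihood)."""
    return -2.0 * (loglik_null - loglik_full)


def chi2_sf(x, df):
    """P(chi-square with integer df > x), via closed forms for even and odd df."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        term, total = 1.0, 1.0
        for k in range(1, df // 2):
            term *= half / k
            total += term
        return math.exp(-half) * total
    total = math.erfc(math.sqrt(half))
    term = math.sqrt(2.0 * x / math.pi) * math.exp(-half)
    for k in range(1, (df - 1) // 2 + 1):
        total += term
        term *= x / (2 * k + 1)
    return total
```

With $B$ categories the reference distribution has $B - 1$ degrees of freedom, so $H_0$ would be rejected at level $\alpha$ when `chi2_sf(lam, B - 1) < alpha`.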

2.3. A Latent Class Model Incorporating Warner’s RRT

2.3.1. Model

Let $\{\tilde{Z}_{si}\}_{s=1}^{k}$ be $k$ observed ordinal categorical random variables, each with values $1, 2, \ldots, B$, for respondent $i$, where $\tilde{Z}_{si}$ denotes the respondent’s response to the $s$th question or statement selected by the LRT as significantly associated with the sensitive attribute at a significance level $\alpha$. Define $\tilde{K}_{si} = 1$ if $\tilde{Z}_{si} \in \{1, 2, \ldots, q^*\}$ and $\tilde{K}_{si} = 0$ if $\tilde{Z}_{si} \in \{q^* + 1, q^* + 2, \ldots, B\}$, where $q^*$ is smaller than $B$. That is, $\tilde{Z}_{si}$ is dichotomized based on the criterion $\tilde{Z}_{si} \le q^*$. Based on a latent variable model, it is assumed that, given the group to which an individual belongs, the corresponding observed/manifest variables are independent. Therefore, assume that $\tilde{K}_{1i}, \tilde{K}_{2i}, \ldots, \tilde{K}_{ki}$ given $Y_i$, $i = 1, 2, \ldots, n$, are independent. The aims are to estimate the parameters of the following LCM
$$P(Y_i = y \mid Y_i^0, \tilde{K}_{1i}, \tilde{K}_{2i}, \ldots, \tilde{K}_{ki}), \qquad y = 0, 1,$$
via an EM algorithm and to predict whether an individual is in the sensitive group.
Define $P(\tilde{K}_{si} = 1 \mid Y_i = y) = \alpha_{ys}$, $y = 0, 1$, $s = 1, 2, \ldots, k$. Under the assumption that $\tilde{K}_{1i}, \tilde{K}_{2i}, \ldots, \tilde{K}_{ki}$ given $Y_i$ are independent, one can obtain the following results:
$$\begin{aligned} P(Y_i^0 = 1, \tilde{K}_{1i} = \tilde{k}_{1i}, \ldots, \tilde{K}_{ki} = \tilde{k}_{ki}) &= P(Y_i^0 = 1, \tilde{K}_{1i} = \tilde{k}_{1i}, \ldots, \tilde{K}_{ki} = \tilde{k}_{ki} \mid Y_i = 1)P(Y_i = 1) \\ &\quad + P(Y_i^0 = 1, \tilde{K}_{1i} = \tilde{k}_{1i}, \ldots, \tilde{K}_{ki} = \tilde{k}_{ki} \mid Y_i = 0)P(Y_i = 0) \\ &= p\theta \prod_{s=1}^{k} \alpha_{1s}^{\tilde{k}_{si}}(1 - \alpha_{1s})^{1 - \tilde{k}_{si}} + (1 - p)(1 - \theta) \prod_{s=1}^{k} \alpha_{0s}^{\tilde{k}_{si}}(1 - \alpha_{0s})^{1 - \tilde{k}_{si}} \end{aligned}$$
and
$$\begin{aligned} P(Y_i^0 = 0, \tilde{K}_{1i} = \tilde{k}_{1i}, \ldots, \tilde{K}_{ki} = \tilde{k}_{ki}) &= P(Y_i^0 = 0, \tilde{K}_{1i} = \tilde{k}_{1i}, \ldots, \tilde{K}_{ki} = \tilde{k}_{ki} \mid Y_i = 1)P(Y_i = 1) \\ &\quad + P(Y_i^0 = 0, \tilde{K}_{1i} = \tilde{k}_{1i}, \ldots, \tilde{K}_{ki} = \tilde{k}_{ki} \mid Y_i = 0)P(Y_i = 0) \\ &= (1 - p)\theta \prod_{s=1}^{k} \alpha_{1s}^{\tilde{k}_{si}}(1 - \alpha_{1s})^{1 - \tilde{k}_{si}} + p(1 - \theta) \prod_{s=1}^{k} \alpha_{0s}^{\tilde{k}_{si}}(1 - \alpha_{0s})^{1 - \tilde{k}_{si}}. \end{aligned}$$
Let $\tilde{\Theta} = (\theta, \alpha_0, \alpha_1)$, where $\alpha_y = (\alpha_{y1}, \alpha_{y2}, \ldots, \alpha_{yk})$, $y = 0, 1$. The observed-data likelihood function of $\tilde{\Theta}$ given $(Y^0, \tilde{K}_1, \tilde{K}_2, \ldots, \tilde{K}_k)$ is
$$\begin{aligned} L_{obs}(\tilde{\Theta} \mid Y^0, \tilde{K}_1, \ldots, \tilde{K}_k) = \prod_{i=1}^{n} & \left[p\theta \prod_{s=1}^{k} \alpha_{1s}^{\tilde{K}_{si}}(1 - \alpha_{1s})^{1 - \tilde{K}_{si}} + (1 - p)(1 - \theta) \prod_{s=1}^{k} \alpha_{0s}^{\tilde{K}_{si}}(1 - \alpha_{0s})^{1 - \tilde{K}_{si}}\right]^{Y_i^0} \\ \times & \left[(1 - p)\theta \prod_{s=1}^{k} \alpha_{1s}^{\tilde{K}_{si}}(1 - \alpha_{1s})^{1 - \tilde{K}_{si}} + p(1 - \theta) \prod_{s=1}^{k} \alpha_{0s}^{\tilde{K}_{si}}(1 - \alpha_{0s})^{1 - \tilde{K}_{si}}\right]^{1 - Y_i^0}, \end{aligned} \tag{13}$$
where $Y^0 = (Y_1^0, Y_2^0, \ldots, Y_n^0)$ and $\tilde{K}_s = (\tilde{K}_{s1}, \tilde{K}_{s2}, \ldots, \tilde{K}_{sn})$, $s = 1, 2, \ldots, k$. The R function optim or nlminb can be used to obtain the ML estimate of $\tilde{\Theta}$ in (13); to avoid divergence problems, however, we provide an EM algorithm in the following section.
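To make the construction concrete, the following minimal Python sketch dichotomizes an ordinal response at $q^*$ and evaluates the observed-data log-likelihood of the LCM for given parameter values. All function names (`dichotomize`, `pattern_prob`, `lcm_cell_prob`, `lcm_observed_loglik`) are hypothetical and introduced only for this illustration.

```python
import math


def dichotomize(z_values, q_star):
    """Map ordinal responses to 1 if z <= q_star and 0 otherwise."""
    return [1 if z <= q_star else 0 for z in z_values]


def pattern_prob(k_pattern, alpha):
    """prod_s alpha_s^{k_s} (1 - alpha_s)^{1 - k_s} under conditional independence."""
    prob = 1.0
    for k, a in zip(k_pattern, alpha):
        prob *= a if k == 1 else (1 - a)
    return prob


def lcm_cell_prob(y0, k_pattern, theta, alpha1, alpha0, p):
    """P(Y0 = y0, K_1 = k_1, ..., K_k = k_k), marginalized over the latent Y."""
    f1 = pattern_prob(k_pattern, alpha1)
    f0 = pattern_prob(k_pattern, alpha0)
    if y0 == 1:
        return p * theta * f1 + (1 - p) * (1 - theta) * f0
    return (1 - p) * theta * f1 + p * (1 - theta) * f0


def lcm_observed_loglik(y0_list, k_rows, theta, alpha1, alpha0, p):
    """Log of the observed-data likelihood; k_rows[i] is respondent i's 0/1 pattern."""
    return sum(math.log(lcm_cell_prob(y0, row, theta, alpha1, alpha0, p))
               for y0, row in zip(y0_list, k_rows))
```

For example, `dichotomize([1, 2, 3, 4, 5], 2)` gives `[1, 1, 0, 0, 0]`, i.e., the first two categories of a five-point item are coded as 1.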

2.3.2. EM Algorithm 2

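Suppose that $(Y_i, Y_i^0, \tilde{K}_{1i}, \tilde{K}_{2i}, \ldots, \tilde{K}_{ki})$, $i = 1, 2, \ldots, n$, are observable. One can then express $P(Y_i = y, Y_i^0 = y^0, \tilde{K}_{1i} = \tilde{k}_{1i}, \tilde{K}_{2i} = \tilde{k}_{2i}, \ldots, \tilde{K}_{ki} = \tilde{k}_{ki})$, $y = 0, 1$, $y^0 = 0, 1$, as follows:
$$\begin{aligned} P(Y_i = 1, Y_i^0 = 1, \tilde{K}_{1i} = \tilde{k}_{1i}, \tilde{K}_{2i} = \tilde{k}_{2i}, \ldots, \tilde{K}_{ki} = \tilde{k}_{ki}) &= p\theta \prod_{s=1}^{k} \alpha_{1s}^{\tilde{k}_{si}} (1-\alpha_{1s})^{1-\tilde{k}_{si}}, \\ P(Y_i = 0, Y_i^0 = 1, \tilde{K}_{1i} = \tilde{k}_{1i}, \tilde{K}_{2i} = \tilde{k}_{2i}, \ldots, \tilde{K}_{ki} = \tilde{k}_{ki}) &= (1-p)(1-\theta) \prod_{s=1}^{k} \alpha_{0s}^{\tilde{k}_{si}} (1-\alpha_{0s})^{1-\tilde{k}_{si}}, \\ P(Y_i = 1, Y_i^0 = 0, \tilde{K}_{1i} = \tilde{k}_{1i}, \tilde{K}_{2i} = \tilde{k}_{2i}, \ldots, \tilde{K}_{ki} = \tilde{k}_{ki}) &= (1-p)\theta \prod_{s=1}^{k} \alpha_{1s}^{\tilde{k}_{si}} (1-\alpha_{1s})^{1-\tilde{k}_{si}}, \\ P(Y_i = 0, Y_i^0 = 0, \tilde{K}_{1i} = \tilde{k}_{1i}, \tilde{K}_{2i} = \tilde{k}_{2i}, \ldots, \tilde{K}_{ki} = \tilde{k}_{ki}) &= p(1-\theta) \prod_{s=1}^{k} \alpha_{0s}^{\tilde{k}_{si}} (1-\alpha_{0s})^{1-\tilde{k}_{si}}. \end{aligned}$$
The complete-data likelihood function of $\tilde{\Theta}$ given $(\mathbf{Y}, \mathbf{Y}^0, \tilde{\mathbf{K}}_1, \tilde{\mathbf{K}}_2, \ldots, \tilde{\mathbf{K}}_k)$ can be written as
$$\begin{aligned} L_c(\tilde{\Theta} \mid \mathbf{Y}, \mathbf{Y}^0, \tilde{\mathbf{K}}_1, \tilde{\mathbf{K}}_2, \ldots, \tilde{\mathbf{K}}_k) = \prod_{i=1}^{n} &\left[ p\theta \prod_{s=1}^{k} \alpha_{1s}^{\tilde{K}_{si}} (1-\alpha_{1s})^{1-\tilde{K}_{si}} \right]^{Y_i^0 Y_i} \left[ (1-p)(1-\theta) \prod_{s=1}^{k} \alpha_{0s}^{\tilde{K}_{si}} (1-\alpha_{0s})^{1-\tilde{K}_{si}} \right]^{Y_i^0 (1-Y_i)} \\ \times &\left[ (1-p)\theta \prod_{s=1}^{k} \alpha_{1s}^{\tilde{K}_{si}} (1-\alpha_{1s})^{1-\tilde{K}_{si}} \right]^{(1-Y_i^0) Y_i} \left[ p(1-\theta) \prod_{s=1}^{k} \alpha_{0s}^{\tilde{K}_{si}} (1-\alpha_{0s})^{1-\tilde{K}_{si}} \right]^{(1-Y_i^0)(1-Y_i)}. \end{aligned}$$
With a bit of algebra, $L_c(\tilde{\Theta} \mid \mathbf{Y}, \mathbf{Y}^0, \tilde{\mathbf{K}}_1, \tilde{\mathbf{K}}_2, \ldots, \tilde{\mathbf{K}}_k)$ can be re-expressed as
$$\begin{aligned} L_c(\tilde{\Theta} \mid \mathbf{Y}, \mathbf{Y}^0, \tilde{\mathbf{K}}_1, \tilde{\mathbf{K}}_2, \ldots, \tilde{\mathbf{K}}_k) = {} & p^{\sum_{i=1}^{n} [Y_i^0 Y_i + (1-Y_i^0)(1-Y_i)]} (1-p)^{\sum_{i=1}^{n} [Y_i^0 (1-Y_i) + (1-Y_i^0) Y_i]} \theta^{\sum_{i=1}^{n} Y_i} (1-\theta)^{\sum_{i=1}^{n} (1-Y_i)} \\ & \times \prod_{s=1}^{k} \alpha_{1s}^{\sum_{i=1}^{n} Y_i \tilde{K}_{si}} (1-\alpha_{1s})^{\sum_{i=1}^{n} Y_i (1-\tilde{K}_{si})} \prod_{s=1}^{k} \alpha_{0s}^{\sum_{i=1}^{n} (1-Y_i) \tilde{K}_{si}} (1-\alpha_{0s})^{\sum_{i=1}^{n} (1-Y_i)(1-\tilde{K}_{si})}. \end{aligned}$$
The complete-data log-likelihood of $\tilde{\Theta}$ given $(\mathbf{Y}, \mathbf{Y}^0, \tilde{\mathbf{K}}_1, \tilde{\mathbf{K}}_2, \ldots, \tilde{\mathbf{K}}_k)$ can then be expressed as
$$\begin{aligned} \ell_c(\tilde{\Theta} \mid \mathbf{Y}, \mathbf{Y}^0, \tilde{\mathbf{K}}_1, \tilde{\mathbf{K}}_2, \ldots, \tilde{\mathbf{K}}_k) = {} & \sum_{i=1}^{n} \left\{ [Y_i^0 Y_i + (1-Y_i^0)(1-Y_i)] \ln p + [Y_i^0 (1-Y_i) + (1-Y_i^0) Y_i] \ln (1-p) \right\} \\ & + \sum_{i=1}^{n} \left[ Y_i \ln \theta + (1-Y_i) \ln (1-\theta) \right] \\ & + \sum_{s=1}^{k} \sum_{i=1}^{n} \left[ Y_i \tilde{K}_{si} \ln \alpha_{1s} + Y_i (1-\tilde{K}_{si}) \ln (1-\alpha_{1s}) + (1-Y_i) \tilde{K}_{si} \ln \alpha_{0s} + (1-Y_i)(1-\tilde{K}_{si}) \ln (1-\alpha_{0s}) \right]. \end{aligned} \tag{14}$$
$Y_i$ is a latent or unobserved variable, but one can obtain $P(Y_i = y \mid Y_i^0 = y^0, \tilde{K}_{1i} = \tilde{k}_{1i}, \tilde{K}_{2i} = \tilde{k}_{2i}, \ldots, \tilde{K}_{ki} = \tilde{k}_{ki})$, $y = 0, 1$, $y^0 = 0, 1$, as follows:
$$P(Y_i = 1 \mid Y_i^0 = 1, \tilde{K}_{1i} = \tilde{k}_{1i}, \tilde{K}_{2i} = \tilde{k}_{2i}, \ldots, \tilde{K}_{ki} = \tilde{k}_{ki}) = \frac{p\theta \prod_{s=1}^{k} \alpha_{1s}^{\tilde{k}_{si}} (1-\alpha_{1s})^{1-\tilde{k}_{si}}}{p\theta \prod_{s=1}^{k} \alpha_{1s}^{\tilde{k}_{si}} (1-\alpha_{1s})^{1-\tilde{k}_{si}} + (1-p)(1-\theta) \prod_{s=1}^{k} \alpha_{0s}^{\tilde{k}_{si}} (1-\alpha_{0s})^{1-\tilde{k}_{si}}},$$
$$P(Y_i = 0 \mid Y_i^0 = 1, \tilde{K}_{1i} = \tilde{k}_{1i}, \tilde{K}_{2i} = \tilde{k}_{2i}, \ldots, \tilde{K}_{ki} = \tilde{k}_{ki}) = \frac{(1-p)(1-\theta) \prod_{s=1}^{k} \alpha_{0s}^{\tilde{k}_{si}} (1-\alpha_{0s})^{1-\tilde{k}_{si}}}{p\theta \prod_{s=1}^{k} \alpha_{1s}^{\tilde{k}_{si}} (1-\alpha_{1s})^{1-\tilde{k}_{si}} + (1-p)(1-\theta) \prod_{s=1}^{k} \alpha_{0s}^{\tilde{k}_{si}} (1-\alpha_{0s})^{1-\tilde{k}_{si}}},$$
$$P(Y_i = 1 \mid Y_i^0 = 0, \tilde{K}_{1i} = \tilde{k}_{1i}, \tilde{K}_{2i} = \tilde{k}_{2i}, \ldots, \tilde{K}_{ki} = \tilde{k}_{ki}) = \frac{(1-p)\theta \prod_{s=1}^{k} \alpha_{1s}^{\tilde{k}_{si}} (1-\alpha_{1s})^{1-\tilde{k}_{si}}}{(1-p)\theta \prod_{s=1}^{k} \alpha_{1s}^{\tilde{k}_{si}} (1-\alpha_{1s})^{1-\tilde{k}_{si}} + p(1-\theta) \prod_{s=1}^{k} \alpha_{0s}^{\tilde{k}_{si}} (1-\alpha_{0s})^{1-\tilde{k}_{si}}},$$
$$P(Y_i = 0 \mid Y_i^0 = 0, \tilde{K}_{1i} = \tilde{k}_{1i}, \tilde{K}_{2i} = \tilde{k}_{2i}, \ldots, \tilde{K}_{ki} = \tilde{k}_{ki}) = \frac{p(1-\theta) \prod_{s=1}^{k} \alpha_{0s}^{\tilde{k}_{si}} (1-\alpha_{0s})^{1-\tilde{k}_{si}}}{(1-p)\theta \prod_{s=1}^{k} \alpha_{1s}^{\tilde{k}_{si}} (1-\alpha_{1s})^{1-\tilde{k}_{si}} + p(1-\theta) \prod_{s=1}^{k} \alpha_{0s}^{\tilde{k}_{si}} (1-\alpha_{0s})^{1-\tilde{k}_{si}}}.$$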
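E-step: Let $\hat{\tilde{\Theta}}^{(m)} = (\hat{\theta}^{(m)}, \hat{\boldsymbol{\alpha}}_0^{(m)}, \hat{\boldsymbol{\alpha}}_1^{(m)})$ be an estimate of $\tilde{\Theta}$ at the $m$th iteration, where $\hat{\boldsymbol{\alpha}}_y^{(m)} = (\hat{\alpha}_{y1}^{(m)}, \hat{\alpha}_{y2}^{(m)}, \ldots, \hat{\alpha}_{yk}^{(m)})$, $y = 0, 1$. Let $\hat{\theta}^{(0)} = \max\{0.001, [\bar{y}^0 - (1-p)]/(2p-1)\}$, where $\bar{y}^0 = n^{-1} \sum_{i=1}^{n} y_i^0$, and $\hat{\alpha}_{0s}^{(0)} = \hat{\alpha}_{1s}^{(0)} = n^{-1} \sum_{i=1}^{n} \tilde{k}_{si}$ be initial values. Given observed data $(\mathbf{y}^0, \tilde{\mathbf{k}}_1, \tilde{\mathbf{k}}_2, \ldots, \tilde{\mathbf{k}}_k)$, by taking the expectation of $\ell_c(\tilde{\Theta} \mid \mathbf{Y}, \mathbf{Y}^0, \tilde{\mathbf{K}}_1, \tilde{\mathbf{K}}_2, \ldots, \tilde{\mathbf{K}}_k)$ in (14) given $(\mathbf{y}^0, \tilde{\mathbf{k}}_1, \tilde{\mathbf{k}}_2, \ldots, \tilde{\mathbf{k}}_k, \hat{\tilde{\Theta}}^{(m)})$, the Q-function can be given as follows:
$$Q\big(\tilde{\Theta} \mid \hat{\tilde{\Theta}}^{(m)}\big) = E\left[ \ell_c(\tilde{\Theta} \mid \mathbf{Y}, \mathbf{Y}^0, \tilde{\mathbf{K}}_1, \tilde{\mathbf{K}}_2, \ldots, \tilde{\mathbf{K}}_k) \mid \mathbf{y}^0, \tilde{\mathbf{k}}_1, \tilde{\mathbf{k}}_2, \ldots, \tilde{\mathbf{k}}_k, \hat{\tilde{\Theta}}^{(m)} \right].$$
Define $\tilde{\tilde{B}}_{1i}^{(m)}$ and $\tilde{\tilde{B}}_{0i}^{(m)}$ as follows:
$$\begin{aligned} \tilde{\tilde{B}}_{1i}^{(m)} &= E\left[ Y_i \mid Y_i^0 = 1, \tilde{K}_{1i} = \tilde{k}_{1i}, \tilde{K}_{2i} = \tilde{k}_{2i}, \ldots, \tilde{K}_{ki} = \tilde{k}_{ki}, \hat{\tilde{\Theta}}^{(m)} \right] \\ &= I(Y_i^0 = 1)\, \frac{p \hat{\theta}^{(m)} \prod_{r=1}^{k} [\hat{\alpha}_{1r}^{(m)}]^{\tilde{k}_{ri}} [1-\hat{\alpha}_{1r}^{(m)}]^{1-\tilde{k}_{ri}}}{p \hat{\theta}^{(m)} \prod_{r=1}^{k} [\hat{\alpha}_{1r}^{(m)}]^{\tilde{k}_{ri}} [1-\hat{\alpha}_{1r}^{(m)}]^{1-\tilde{k}_{ri}} + (1-p)(1-\hat{\theta}^{(m)}) \prod_{r=1}^{k} [\hat{\alpha}_{0r}^{(m)}]^{\tilde{k}_{ri}} [1-\hat{\alpha}_{0r}^{(m)}]^{1-\tilde{k}_{ri}}}, \end{aligned}$$
$$\begin{aligned} \tilde{\tilde{B}}_{0i}^{(m)} &= E\left[ Y_i \mid Y_i^0 = 0, \tilde{K}_{1i} = \tilde{k}_{1i}, \tilde{K}_{2i} = \tilde{k}_{2i}, \ldots, \tilde{K}_{ki} = \tilde{k}_{ki}, \hat{\tilde{\Theta}}^{(m)} \right] \\ &= I(Y_i^0 = 0)\, \frac{(1-p) \hat{\theta}^{(m)} \prod_{r=1}^{k} [\hat{\alpha}_{1r}^{(m)}]^{\tilde{k}_{ri}} [1-\hat{\alpha}_{1r}^{(m)}]^{1-\tilde{k}_{ri}}}{(1-p) \hat{\theta}^{(m)} \prod_{r=1}^{k} [\hat{\alpha}_{1r}^{(m)}]^{\tilde{k}_{ri}} [1-\hat{\alpha}_{1r}^{(m)}]^{1-\tilde{k}_{ri}} + p(1-\hat{\theta}^{(m)}) \prod_{r=1}^{k} [\hat{\alpha}_{0r}^{(m)}]^{\tilde{k}_{ri}} [1-\hat{\alpha}_{0r}^{(m)}]^{1-\tilde{k}_{ri}}}. \end{aligned}$$
Because of the indicator functions, $\tilde{\tilde{B}}_{1i}^{(m)} + \tilde{\tilde{B}}_{0i}^{(m)}$ is the conditional expectation of $Y_i$ given the $i$th respondent's observed data. Let $\tilde{\tilde{A}}_1^{(m)} = \sum_{i=1}^{n} \tilde{\tilde{B}}_{1i}^{(m)}$ and $\tilde{\tilde{A}}_0^{(m)} = \sum_{i=1}^{n} \tilde{\tilde{B}}_{0i}^{(m)}$. Hence, one can obtain
$$\begin{aligned} Q\big(\tilde{\Theta} \mid \hat{\tilde{\Theta}}^{(m)}\big) = {} & \sum_{i=1}^{n} \left\{ \left[ Y_i^0 \tilde{\tilde{B}}_{1i}^{(m)} + (1-Y_i^0)(1-\tilde{\tilde{B}}_{0i}^{(m)}) \right] \ln p + \left[ Y_i^0 (1-\tilde{\tilde{B}}_{1i}^{(m)}) + (1-Y_i^0) \tilde{\tilde{B}}_{0i}^{(m)} \right] \ln (1-p) \right\} \\ & + \sum_{i=1}^{n} \left\{ \left( \tilde{\tilde{B}}_{1i}^{(m)} + \tilde{\tilde{B}}_{0i}^{(m)} \right) \ln \theta + \left[ 1 - \left( \tilde{\tilde{B}}_{1i}^{(m)} + \tilde{\tilde{B}}_{0i}^{(m)} \right) \right] \ln (1-\theta) \right\} \\ & + \sum_{s=1}^{k} \sum_{i=1}^{n} \left\{ \left( \tilde{\tilde{B}}_{1i}^{(m)} + \tilde{\tilde{B}}_{0i}^{(m)} \right) \tilde{k}_{si} \ln \alpha_{1s} + \left( \tilde{\tilde{B}}_{1i}^{(m)} + \tilde{\tilde{B}}_{0i}^{(m)} \right) (1-\tilde{k}_{si}) \ln (1-\alpha_{1s}) \right\} \\ & + \sum_{s=1}^{k} \sum_{i=1}^{n} \left\{ \left[ 1 - \left( \tilde{\tilde{B}}_{1i}^{(m)} + \tilde{\tilde{B}}_{0i}^{(m)} \right) \right] \tilde{k}_{si} \ln \alpha_{0s} + \left[ 1 - \left( \tilde{\tilde{B}}_{1i}^{(m)} + \tilde{\tilde{B}}_{0i}^{(m)} \right) \right] (1-\tilde{k}_{si}) \ln (1-\alpha_{0s}) \right\} \\ = {} & \left[ \tilde{\tilde{A}}_1^{(m)} + \sum_{i=1}^{n} (1-Y_i^0) - \tilde{\tilde{A}}_0^{(m)} \right] \ln p + \left[ \sum_{i=1}^{n} Y_i^0 - \tilde{\tilde{A}}_1^{(m)} + \tilde{\tilde{A}}_0^{(m)} \right] \ln (1-p) \\ & + \left( \tilde{\tilde{A}}_1^{(m)} + \tilde{\tilde{A}}_0^{(m)} \right) \ln \theta + \left( n - \tilde{\tilde{A}}_1^{(m)} - \tilde{\tilde{A}}_0^{(m)} \right) \ln (1-\theta) \\ & + \sum_{s=1}^{k} \sum_{i=1}^{n} \left\{ \left( \tilde{\tilde{B}}_{1i}^{(m)} + \tilde{\tilde{B}}_{0i}^{(m)} \right) \tilde{k}_{si} \ln \alpha_{1s} + \left( \tilde{\tilde{B}}_{1i}^{(m)} + \tilde{\tilde{B}}_{0i}^{(m)} \right) (1-\tilde{k}_{si}) \ln (1-\alpha_{1s}) \right\} \\ & + \sum_{s=1}^{k} \sum_{i=1}^{n} \left\{ \left[ 1 - \left( \tilde{\tilde{B}}_{1i}^{(m)} + \tilde{\tilde{B}}_{0i}^{(m)} \right) \right] \tilde{k}_{si} \ln \alpha_{0s} + \left[ 1 - \left( \tilde{\tilde{B}}_{1i}^{(m)} + \tilde{\tilde{B}}_{0i}^{(m)} \right) \right] (1-\tilde{k}_{si}) \ln (1-\alpha_{0s}) \right\}. \end{aligned}$$
M-step: Update the estimate of $\tilde{\Theta}$, denoted by $\hat{\tilde{\Theta}}^{(m+1)}$, by maximizing (15), which gives
$$\hat{\theta}^{(m+1)} = \frac{\tilde{\tilde{A}}_1^{(m)} + \tilde{\tilde{A}}_0^{(m)}}{n}, \quad \hat{\alpha}_{1s}^{(m+1)} = \frac{\sum_{i=1}^{n} \left( \tilde{\tilde{B}}_{1i}^{(m)} + \tilde{\tilde{B}}_{0i}^{(m)} \right) \tilde{k}_{si}}{\sum_{i=1}^{n} \left( \tilde{\tilde{B}}_{1i}^{(m)} + \tilde{\tilde{B}}_{0i}^{(m)} \right)}, \quad \hat{\alpha}_{0s}^{(m+1)} = \frac{\sum_{i=1}^{n} \left[ 1 - \left( \tilde{\tilde{B}}_{1i}^{(m)} + \tilde{\tilde{B}}_{0i}^{(m)} \right) \right] \tilde{k}_{si}}{\sum_{i=1}^{n} \left[ 1 - \left( \tilde{\tilde{B}}_{1i}^{(m)} + \tilde{\tilde{B}}_{0i}^{(m)} \right) \right]}, \quad s = 1, 2, \ldots, k.$$
The E-step and M-step are iterated until $|\hat{\tilde{\Theta}}^{(m+1)} - \hat{\tilde{\Theta}}^{(m)}| < \varepsilon$, with $\varepsilon = 5 \times 10^{-5}$ in this study.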
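The E-step and M-step above can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than the authors' implementation: the function name `em_lcm_warner`, the use of the largest absolute parameter change as the convergence norm, the upper clip on the starting value of θ, and the simulated data set are all choices made here for the example.

```python
import numpy as np

def em_lcm_warner(y0, k, p, tol=5e-5, max_iter=10_000):
    """EM sketch for the latent class model under Warner's design.

    y0 : (n,) 0/1 randomized responses; k : (n, K) 0/1 dichotomized items;
    p  : known design probability. Returns (theta, alpha0, alpha1).
    """
    y0 = np.asarray(y0, dtype=float)
    k = np.asarray(k, dtype=float)
    # Starting values from the text; the upper clip at 0.999 is an added safeguard.
    theta = min(0.999, max(0.001, (y0.mean() - (1 - p)) / (2 * p - 1)))
    alpha0 = k.mean(axis=0)
    alpha1 = alpha0.copy()
    for _ in range(max_iter):
        # E-step: posterior P(Y_i = 1 | y_i^0, k_i), i.e. B_1i + B_0i in the text.
        like1 = np.prod(alpha1 ** k * (1 - alpha1) ** (1 - k), axis=1)
        like0 = np.prod(alpha0 ** k * (1 - alpha0) ** (1 - k), axis=1)
        w1 = np.where(y0 == 1, p, 1 - p) * theta * like1
        w0 = np.where(y0 == 1, 1 - p, p) * (1 - theta) * like0
        post = w1 / (w1 + w0)
        # M-step: closed-form updates.
        new_theta = post.mean()
        new_alpha1 = post @ k / post.sum()
        new_alpha0 = (1 - post) @ k / (1 - post).sum()
        # Assumed stopping rule: largest absolute parameter change below tol.
        change = max(abs(new_theta - theta),
                     np.abs(new_alpha1 - alpha1).max(),
                     np.abs(new_alpha0 - alpha0).max())
        theta, alpha0, alpha1 = new_theta, new_alpha0, new_alpha1
        if change < tol:
            break
    return theta, alpha0, alpha1

# Hypothetical simulated data (all true values below are made up for the demo).
rng = np.random.default_rng(1)
n, p = 20_000, 0.75
true_theta = 0.3
true_alpha1 = np.array([0.8, 0.7, 0.9, 0.75])
true_alpha0 = np.array([0.2, 0.3, 0.1, 0.25])
y = rng.random(n) < true_theta                       # latent sensitive status
k = (rng.random((n, 4)) < np.where(y[:, None], true_alpha1, true_alpha0)).astype(int)
keeps_answer = rng.random(n) < p                     # Warner: report Y with probability p
y0 = np.where(keeps_answer, y, ~y).astype(int)
theta_hat, alpha0_hat, alpha1_hat = em_lcm_warner(y0, k, p)
```

Because every M-step update is a weighted proportion, the iterates stay inside the parameter space, which is the practical advantage over unconstrained numerical optimization.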
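Based on the estimate $\hat{\tilde{\Theta}} = (\hat{\theta}, \hat{\boldsymbol{\alpha}}_0, \hat{\boldsymbol{\alpha}}_1)$ of $\tilde{\Theta} = (\theta, \boldsymbol{\alpha}_0, \boldsymbol{\alpha}_1)$, we propose the first criterion for classifying whether or not the $i$th individual belongs to the sensitive group, which is described below. If the $i$th individual has $(Y_i^0 = 1, \tilde{K}_{1i} = \tilde{k}_{1i}, \tilde{K}_{2i} = \tilde{k}_{2i}, \ldots, \tilde{K}_{ki} = \tilde{k}_{ki})$ and
$$\hat{P}(Y_i = 1 \mid Y_i^0 = 1, \tilde{K}_{1i} = \tilde{k}_{1i}, \tilde{K}_{2i} = \tilde{k}_{2i}, \ldots, \tilde{K}_{ki} = \tilde{k}_{ki}) = \frac{p \hat{\theta} \prod_{s=1}^{k} \hat{\alpha}_{1s}^{\tilde{k}_{si}} (1-\hat{\alpha}_{1s})^{1-\tilde{k}_{si}}}{p \hat{\theta} \prod_{s=1}^{k} \hat{\alpha}_{1s}^{\tilde{k}_{si}} (1-\hat{\alpha}_{1s})^{1-\tilde{k}_{si}} + (1-p)(1-\hat{\theta}) \prod_{s=1}^{k} \hat{\alpha}_{0s}^{\tilde{k}_{si}} (1-\hat{\alpha}_{0s})^{1-\tilde{k}_{si}}} \ge 0.5, \tag{17}$$
then she/he is predicted to belong to the sensitive group. If she/he has $(Y_i^0 = 0, \tilde{K}_{1i} = \tilde{k}_{1i}, \tilde{K}_{2i} = \tilde{k}_{2i}, \ldots, \tilde{K}_{ki} = \tilde{k}_{ki})$ and
$$\hat{P}(Y_i = 1 \mid Y_i^0 = 0, \tilde{K}_{1i} = \tilde{k}_{1i}, \tilde{K}_{2i} = \tilde{k}_{2i}, \ldots, \tilde{K}_{ki} = \tilde{k}_{ki}) = \frac{(1-p) \hat{\theta} \prod_{s=1}^{k} \hat{\alpha}_{1s}^{\tilde{k}_{si}} (1-\hat{\alpha}_{1s})^{1-\tilde{k}_{si}}}{(1-p) \hat{\theta} \prod_{s=1}^{k} \hat{\alpha}_{1s}^{\tilde{k}_{si}} (1-\hat{\alpha}_{1s})^{1-\tilde{k}_{si}} + p(1-\hat{\theta}) \prod_{s=1}^{k} \hat{\alpha}_{0s}^{\tilde{k}_{si}} (1-\hat{\alpha}_{0s})^{1-\tilde{k}_{si}}} \ge 0.5, \tag{18}$$
then she/he is classified to be in the sensitive group.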
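A minimal sketch of the first classification criterion follows. The function name and the parameter values are hypothetical and serve only to show the thresholding of the estimated posterior probability at 0.5.

```python
import numpy as np

def classify_criterion1(y0, k, theta, alpha0, alpha1, p):
    """Return (posterior, label), with label = 1 when P(Y = 1 | y0, k) >= 0.5."""
    y0 = np.asarray(y0)
    k = np.asarray(k)
    alpha0 = np.asarray(alpha0)
    alpha1 = np.asarray(alpha1)
    like1 = np.prod(np.where(k == 1, alpha1, 1 - alpha1), axis=1)
    like0 = np.prod(np.where(k == 1, alpha0, 1 - alpha0), axis=1)
    # Design weights: p and 1 - p swap roles depending on the randomized response.
    w1 = np.where(y0 == 1, p, 1 - p) * theta * like1
    w0 = np.where(y0 == 1, 1 - p, p) * (1 - theta) * like0
    post = w1 / (w1 + w0)
    return post, (post >= 0.5).astype(int)

# Hypothetical estimates and two respondents who both gave randomized response 1.
post, label = classify_criterion1(y0=[1, 1], k=[[1], [0]],
                                  theta=0.2, alpha0=[0.1], alpha1=[0.9], p=0.6)
```

With these made-up values the first respondent has posterior 0.108 / 0.140 ≈ 0.77 and is assigned to the sensitive group, while the second has posterior 0.012 / 0.300 = 0.04 and is not.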

3. Real Data Application

The practical use of the proposed methodology is demonstrated with data from a survey study of the sexuality of 3027 freshmen (1193 females, 1792 males, 29 non-binary, 13 no response), not including international students, recruited by convenience sampling at Feng Chia University in Taiwan in 2016. Because 262 (8.7%) respondents had missing data, the remaining 2765 respondents were used in the analysis. In this face-to-face survey, respondents answered a questionnaire of 67 statements or questions in three parts: demographic questions, attitude to love, and online dating experience. The questionnaire did not collect personal information such as respondents’ name and age. Responses to the sensitive question, question 37: “What is your sexual orientation?”, were collected by using the multichotomous RR design of Groenitz [50,51]. Responses to the two sensitive questions, question 38: “Have you ever had sex?” and question 39: “Have you ever had a one-night stand through a dating site or mobile app?”, were collected via the design of [3] with a birthday interval as a non-random device. The generalized RR design of [10] was employed to collect responses to the sensitive question, question 67: “Have you ever had a one-night stand through a dating site or mobile app?”, which is the same question as question 39. In this way, the privacy of respondents was protected.
There were 27 statements in the third part about online dating experience for respondents to answer. In addition, question 39: “Have you ever had a one-night stand through a dating site or mobile app?” was designed to collect data on sexual behavior. Because people in eastern cultures are often shy about referring to sexual behavior or reluctant to provide correct answers, the indirect question technique of [3] was used in this study. Specifically, two birthday intervals, January 1 to August 10 and August 11 to December 31, were set up as a non-random device so that interviewees could answer “yes” or “no” to question 39 based on their birthday. The design detail of question 39 is as given in Table 10 of [17]. More specifically, for a respondent whose birthday was between January 1 and August 10, answer “A” denoted “yes” to the question of whether she/he had ever had a one-night stand through a dating site or mobile app, and answer “B” denoted “no”. For a respondent whose birthday was between August 11 and December 31, answer “B” denoted “yes” to the question, and answer “A” denoted “no”. In addition, statements 40–66 are direct ones about interviewees’ thoughts and evaluation on making friends on the internet, finding a lover or sex partner, and attitude toward online dating. Each statement had five response options, i.e., 1 = very consistent, 2 = almost consistent, 3 = fairly consistent, 4 = a bit consistent, and 5 = very inconsistent.
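To make the birthday device concrete, the following Python sketch computes the design probability implied by the first interval (January 1 to August 10) and simulates responses under the design. The coding of answer “A” as 1, the non-leap year, the simulated prevalence of 0.2, and the variable names are assumptions of this illustration. The final line is the moment estimator that also serves as the starting value of θ in EM algorithm 2.

```python
import numpy as np

# Days from January 1 to August 10 in a non-leap year (an assumption of this sketch).
days_first_interval = 31 + 28 + 31 + 30 + 31 + 30 + 31 + 10  # = 222
p = days_first_interval / 365
print(round(p, 4))  # 0.6082

rng = np.random.default_rng(0)
n = 200_000
theta_true = 0.2                          # hypothetical prevalence of the sensitive trait
has_trait = rng.random(n) < theta_true
first_interval = rng.random(n) < p        # birthday falls in January 1 to August 10
# Answer "A" (coded 1): first interval with the trait, or second interval without it.
y0 = np.where(first_interval, has_trait, ~has_trait).astype(int)

# Moment estimator of theta: solve E[Y0] = p*theta + (1 - p)*(1 - theta) for theta.
theta_hat = (y0.mean() - (1 - p)) / (2 * p - 1)
```

An interviewer who does not know the birthday cannot tell from “A” or “B” alone whether the respondent has the trait, yet the known value of `p` still allows the prevalence to be recovered in aggregate.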
As in Section 2.2, the binary variable $Y_i$ denotes whether respondent $i$ has ever had a one-night stand through a dating site or mobile app, where $Y_i = 1$ indicates yes and $Y_i = 0$ otherwise. $P(Y_i = 1) = \theta$ is the population proportion of the sensitive attribute, having ever had a one-night stand through a dating site or mobile app. $p = 0.6082$ is the probability that a respondent’s birthday falls between January 1 and August 10, assuming that birthdays are uniformly distributed over the year. The LRT statistic in (9) is utilized to identify which of statements 40–66 are significantly related to question 39, based on the responses to these 27 statements and question 39, at $\alpha = 0.1$. Table 1 displays the analysis results, including estimates of the parameters of the alternative/full model in (1) and the null model in (6) for the association of this sensitive attribute with the response to each of these 27 statements under $H_0$ and $H_1$ in (8), and the corresponding p-values of the LRTs. As seen from these results, the p-values of the LRTs corresponding to statements 43, 45, 55, and 61 are 0.0187, 0.0836, 0.0259, and 0.0757, respectively, which imply that question 39 was significantly associated with the response to each of these four statements at $\alpha = 0.1$. The selected four statements are listed below.
  • Q43. Finding friends on the internet can improve your social circle.
  • Q45. You can find people with similar interests online.
  • Q55. I am a homebody, so I want to make friends online.
  • Q61. I want to find my partner through online dating.
Let $\{\tilde{Z}_{si}\}_{s=1}^{4}$ denote the $i$th respondent’s responses to the aforementioned four statements selected by the LRTs at $\alpha = 0.1$, with full probability model $P(\tilde{Z}_{si} = z \mid Y_i = y) = \pi_{yz}^{(s)}$, $s = 1, 2, 3, 4$, $z = 1, 2, 3, 4, 5$, $y = 0, 1$. Table 2 presents estimates of $\theta^{(s)}$, $\pi_{0z}^{(s)}$, and $\pi_{1z}^{(s)}$ with their estimated standard errors (SEs) for the alternative/full model for the association between the sensitive feature and each of $\{\tilde{Z}_{si}\}_{s=1}^{4}$, respectively. Note that the estimated SEs are obtained by the bootstrap method with 200 replications. It can be seen from Table 2 that the estimated SEs of $\hat{\theta}^{(s)}$ are around 0.04, and the estimated SEs of $\hat{\pi}_{0z}^{(s)}$ are, respectively, smaller than those of $\hat{\pi}_{1z}^{(s)}$, mainly because there is more sample information to estimate $\pi_{0z}^{(s)}$.
Based on $\hat{\theta}^{(s)}$, $\hat{\pi}_{0z}^{(s)}$, and $\hat{\pi}_{1z}^{(s)}$, one can estimate $P(Y_i = 1 \mid \tilde{Z}_{si} = z, Y_i^0 = y^0)$ as follows:
$$\hat{P}(Y_i = 1 \mid \tilde{Z}_{si} = z, Y_i^0 = y^0) = \begin{cases} \dfrac{p \hat{\theta}^{(s)} \hat{\pi}_{1z}^{(s)}}{p \hat{\theta}^{(s)} \hat{\pi}_{1z}^{(s)} + (1-p)(1-\hat{\theta}^{(s)}) \hat{\pi}_{0z}^{(s)}} & \text{as } y^0 = 1, \\[2ex] \dfrac{(1-p) \hat{\theta}^{(s)} \hat{\pi}_{1z}^{(s)}}{(1-p) \hat{\theta}^{(s)} \hat{\pi}_{1z}^{(s)} + p(1-\hat{\theta}^{(s)}) \hat{\pi}_{0z}^{(s)}} & \text{as } y^0 = 0. \end{cases}$$
According to the estimated posterior probability of $Y_i = 1$ given $\tilde{Z}_{si} = z$ and $Y_i^0 = y^0$, define $C_{si}$
$$C_{si} = \begin{cases} 1 & \text{when } \hat{P}(Y_i = 1 \mid \tilde{Z}_{si} = z, Y_i^0 = y^0) \ge 0.5, \\ 0 & \text{when } \hat{P}(Y_i = 1 \mid \tilde{Z}_{si} = z, Y_i^0 = y^0) < 0.5, \end{cases}$$
as a conditional classifier to identify whether the $i$th respondent has the sensitive attribute. The second proposed classification criterion predicts that the $i$th respondent belongs to the sensitive group, i.e., she/he has ever had a one-night stand through a dating website or mobile app, if any of the $C_{si}$, $s = 1, 2, 3, 4$, is 1, i.e., if
$$\sum_{s=1}^{4} C_{si} \ge 1. \tag{20}$$
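The second criterion can be sketched as follows. For brevity the example uses only two of the four selected statements (43 and 45) with their alternative-model estimates from Table 1, whereas the criterion in the text sums over all four. The function names and the two respondents are hypothetical.

```python
import numpy as np

def single_item_posterior(z, y0, theta, pi0, pi1, p):
    """Estimated P(Y = 1 | Z = z, Y0 = y0) for one statement; z takes values 1..B."""
    z = np.asarray(z)
    y0 = np.asarray(y0)
    w1 = np.where(y0 == 1, p, 1 - p) * theta * pi1[z - 1]
    w0 = np.where(y0 == 1, 1 - p, p) * (1 - theta) * pi0[z - 1]
    return w1 / (w1 + w0)

def classify_criterion2(z, y0, fits, p):
    """Label 1 if at least one per-statement classifier C_s equals 1."""
    z = np.asarray(z)
    flags = []
    for s, (theta, pi0, pi1) in enumerate(fits):
        flags.append(single_item_posterior(z[:, s], y0, theta, pi0, pi1, p) >= 0.5)
    return (np.sum(flags, axis=0) >= 1).astype(int)

p = 0.6082
fits = [
    # Statement 43: theta, pi_0z, pi_1z under H1 (Table 1).
    (0.2137, np.array([0.0347, 0.1494, 0.3448, 0.3156, 0.1555]),
             np.array([0.1822, 0.2913, 0.3848, 0.0590, 0.0827])),
    # Statement 45: theta, pi_0z, pi_1z under H1 (Table 1).
    (0.2103, np.array([0.0704, 0.2526, 0.3725, 0.2018, 0.1028]),
             np.array([0.1606, 0.1282, 0.3657, 0.2467, 0.0989])),
]
# Two hypothetical respondents: responses to statements 43 and 45, and their Y0.
z = np.array([[1, 3],
              [3, 3]])
y0 = np.array([1, 0])
labels = classify_criterion2(z, y0, fits, p)
```

The first respondent is flagged by statement 43 alone (estimated posterior about 0.69), which is enough under an "any classifier" rule. The second respondent has estimated posteriors below 0.2 for both statements and is not flagged.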
By applying the proposed methodology in Section 2.3 to the real data set, we define $\tilde{K}_{si} = 1$ if $\tilde{Z}_{si} \in \{1, 2\}$ and $\tilde{K}_{si} = 0$ if $\tilde{Z}_{si} \in \{3, 4, 5\}$. $\{\tilde{K}_{si}\}_{s=1}^{4}$ and $Y_i$ are the observed binary and latent variables in the LCM, respectively. Then, $P(\tilde{K}_{si} = 1 \mid Y_i = 0) = \alpha_{0s}$ and $P(\tilde{K}_{si} = 1 \mid Y_i = 1) = \alpha_{1s}$, $s = 1, 2, 3, 4$.
Table 3 displays the estimates of $\theta$, $\alpha_{0s}$, and $\alpha_{1s}$ in the proposed LCM incorporating the RRT of [3] with their estimated SEs. The SEs are estimated by using the bootstrap method with 200 replications. The estimate of $\theta$ is 0.2201, and its estimated SE is 0.0127. This estimate of $\theta$ is reasonable because it lies within each of the 95% confidence intervals (CIs) of $\theta^{(s)}$, $(\hat{\theta}^{(s)} - 1.96\, \widehat{\mathrm{SE}}_{\hat{\theta}^{(s)}}, \hat{\theta}^{(s)} + 1.96\, \widehat{\mathrm{SE}}_{\hat{\theta}^{(s)}})$, for the selected statements 43, 45, 55, and 61, which are $(0.1412, 0.2862)$, $(0.1343, 0.2863)$, $(0.1261, 0.2935)$, and $(0.1332, 0.2896)$, respectively, where $\hat{\theta}^{(s)}$ and $\widehat{\mathrm{SE}}_{\hat{\theta}^{(s)}}$, $s = 1, 2, 3, 4$, are given in Table 2. On the other hand, if the estimate of $\theta$, 0.2201, and its estimated SE, 0.0127, are used to construct a 95% CI, the resulting interval $(0.1952, 0.2450)$ also contains each of these four estimates of $\theta$, 0.2137, 0.2103, 0.2098, and 0.2114, in Table 2.
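As a quick arithmetic check of the intervals quoted above, using only values reported in the text and in Table 2, the normal-approximation CI for the LCM estimate and the CI for statement 43 work out as

$$0.2201 \pm 1.96 \times 0.0127 = 0.2201 \pm 0.0249 \;\Rightarrow\; (0.1952,\ 0.2450),$$

$$0.2137 \pm 1.96 \times 0.0370 = 0.2137 \pm 0.0725 \;\Rightarrow\; (0.1412,\ 0.2862).$$

The LCM interval is roughly a third as wide as the single-statement interval, consistent with the smaller estimated SE obtained when the four binary variables are used jointly.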
Table 4 shows the classifications of the 2765 respondents based on the two proposed classification criteria in (17), (18) and (20), respectively. Based on classification criterion 1, the estimated proportion of respondents who “have ever had a one-night stand through a dating site or mobile app” is $0.2$ $(553/2765)$, which is quite close to the estimate of $\theta$, 0.2201, given in Table 3 and obtained from fitting the LCM in (10). By applying the second classification criterion, 246 respondents (with estimated proportion $0.089 = 246/2765$) are predicted to have the sensitive attribute, “having ever had a one-night stand through a dating site or mobile app”. In contrast, the estimates of $\theta$ given in Table 2 are around 0.21; these are obtained from fitting the alternative/full model for the association of this sensitive attribute with each of the four observed ordinal categorical variables for online dating experience corresponding to statements 43, 45, 55, and 61, respectively, selected by the LRT at $\alpha = 0.1$.

4. Conclusions

A combination of the RRT of [3] and an LCM has been proposed to investigate the association between a sensitive characteristic or latent variable and the observed binary random variables, which were obtained from dichotomizing the observed ordinal categorical variables selected by the LRT. The concept of the relationship between a sensitive attribute variable and auxiliary random variables in [19] has been extended by applying the RR design of [3] to collect sensitive characteristic information. An EM algorithm, called EM algorithm 1, has been provided to easily estimate the parameters of the null and alternative models for the association of the sensitive attribute variable with each of the auxiliary ordinal categorical random variables. The LRT has been utilized to select ordinal categorical random variables that are significantly associated with the sensitive attribute variable.
An LCM has been proposed to relate observed binary random variables to a sensitive characteristic or latent variable under the RRT of [3]. A second EM algorithm, called EM algorithm 2, has been proposed to easily estimate its parameters and the population proportion of the sensitive characteristic. Two classification criteria have been proposed to predict the presence of the sensitive attribute in an individual. Practical applications of the proposed methodology have been demonstrated with the survey data from the study of the sexuality of freshmen, except international students, at Feng Chia University in Taiwan in 2016. Finally, the proposed methodology can be generalized to other RRTs to make inferences for a sensitive characteristic variable. The same inference problem can also be addressed by using the Bayesian approach.

Author Contributions

Conceptualization, S.-M.L. and C.-S.L.; methodology, S.-M.L. and C.-S.L.; investigation, S.-M.L.; writing—original draft preparation, P.-L.T. and T.-N.L.; writing—review and editing, S.-M.L. and C.-S.L.; visualization, S.-M.L. and C.-S.L.; software, S.-M.L.; supervision, S.-M.L. and C.-S.L. All authors have read and agreed to the published version of the manuscript.

Funding

Lee’s research was supported by Ministry of Science and Technology (MOST) Grant of Taiwan, ROC, MOST-109-2118-M-035-002-MY3.

Data Availability Statement

Data sharing is not applicable to this article.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. van der Heijden, P.G.M.; van Gils, G.; Bouts, J.A.N.; Hox, J.J. A comparison of randomized response, computer-assisted self-interview, and face-to-face direct questioning: Eliciting sensitive information in the context of welfare and unemployment benefit. Sociol. Methods Res. 2000, 28, 505–537.
  2. Hsieh, S.H.; Perri, P.F. Estimating the proportion of non-heterosexuals in Taiwan using Christofides’ randomized response model: A comparison of different estimation methods. Soc. Sci. Res. 2021, 93, 102475.
  3. Warner, S.L. Randomized response: A survey technique for eliminating evasive answer bias. J. Am. Stat. Assoc. 1965, 60, 63–69.
  4. Dalton, D.R.; James, C.W.; Catherine, M.D. Using the unmatched count technique (UCT) to estimate base rates for sensitive behavior. Pers. Psychol. 1994, 47, 817–827.
  5. Yu, J.W.; Tian, G.L.; Tang, M.L. Two new models for survey sampling with sensitive characteristic: Design and analysis. Metrika 2008, 67, 251–263.
  6. Groenitz, H. Logistic regression analyses for indirect data. Commun. Stat.-Theory Methods 2018, 47, 3838–3856.
  7. Horvitz, D.G.; Shah, B.V.; Simmons, W.R. The unrelated question randomized response model. Proc. Soc. Stat. Sect. Am. Stat. Assoc. 1967, 62, 65–72.
  8. Greenberg, B.G.; Abul-Ela, A.L.A.; Simmons, W.R.; Horvitz, D.G. The unrelated question randomized response model: Theoretical framework. J. Am. Stat. Assoc. 1969, 64, 520–539.
  9. Mangat, N.S.; Singh, R. An alternative randomized response procedure. Biometrika 1990, 77, 439–442.
  10. Christofides, T.C. A generalized randomized response technique. Metrika 2003, 57, 195–200.
  11. Huang, K.C. A Survey technique for estimating the proportion and sensitivity in a dichotomous finite population. Stat. Neerl. 2004, 58, 75–82.
  12. Tian, G.L.; Tang, M.L. Incomplete Categorical Data Design: Non-Randomized Response Techniques for Sensitive Questions in Surveys; Chapman & Hall/CRC: Boca Raton, FL, USA, 2013.
  13. Bhargava, M.; Singh, R. A modified randomization device for Warner’s model. Statistica 2000, 60, 315–322.
  14. Hsieh, S.H.; Lee, S.M.; Shen, P.S. Semiparametric analysis of randomized response data with missing covariates in logistic regression. Comput. Stat. Data Anal. 2009, 53, 2673–2692.
  15. Blair, G.; Imai, K.; Zhou, Y.Y. Design and analysis of the randomized response technique. J. Am. Stat. Assoc. 2015, 110, 1304–1319.
  16. Hsieh, S.H.; Lee, S.M.; Tu, S.H. Randomized response techniques for a multi-level attribute using a single sensitive question. Stat. Pap. 2018, 59, 291–306.
  17. Chang, P.C.; Pho, K.H.; Lee, S.M.; Li, C.S. Estimation of parameters of logistic regression for two-stage randomized response technique. Comput. Stat. 2021, 36, 2111–2133.
  18. Hsieh, S.H.; Lee, S.M.; Li, C.S. A two-stage multilevel randomized response technique with proportional odds models and missing covariates. Sociol. Methods Res. 2022, 51, 439–467.
  19. Lee, S.M.; Le, T.N.; Tran, P.L.; Li, C.S. Investigating the association of a sensitive attribute with a random variable using the Christofides generalised randomised response design and Bayesian methods. J. R. Stat. Soc. Ser. C 2022, 71, 1471–1502.
  20. Tang, M.L.; Tian, G.L.; Tang, N.S.; Liu, Z. A new non-randomized multi-category response model for surveys with a single sensitive question: Design and analysis. J. Korean Stat. Soc. 2009, 38, 339–349.
  21. Tang, M.L.; Wu, Q.; Tian, G.L.; Guo, J.H. Two-sample non randomized response techniques for sensitive questions. Commun. Stat.-Theory Methods 2014, 43, 408–425.
  22. Maddala, G.S. Limited-Dependent and Qualitative Variables in Econometrics; Cambridge University Press: Cambridge, MA, USA, 1983.
  23. Scheers, N.J.; Dayton, C.M. Covariate randomized response models. J. Am. Stat. Assoc. 1988, 83, 969–974.
  24. Hsieh, S.H.; Lee, S.M.; Shen, P.S. Logistic regression analysis of randomized response data with missing covariates. J. Stat. Plan. Inference 2010, 140, 927–940.
  25. Bartholomew, D.J.; Steele, F.; Moustaki, I.; Galbraith, J.I. Analysis of Multivariate Social Science Data, 2nd ed.; Chapman & Hall/CRC: Boca Raton, FL, USA, 2011.
  26. Collins, L.M.; Lanza, S.T. Latent Class and Latent Transition Analysis: With Applications in the Social, Behavioral, and Health Sciences; John Wiley & Sons: New York, NY, USA, 2009.
  27. Böckenholt, U. Mixed-effects analyses of rank-ordered data. Psychometrika 2001, 66, 45–62.
  28. Vermunt, J.K.; Magidson, J. Latent class cluster analysis. In Applied Latent Class Analysis; Hagenaars, J.A., McCutcheon, A.L., Eds.; Cambridge University Press: Cambridge, MA, USA, 2002; pp. 89–106.
  29. Lazarsfeld, P.F. The logical and mathematical foundation of latent structure analysis. In Studies in Social Psychology in World War II Vol. IV: Measurement and Prediction; Princeton University Press: Princeton, NJ, USA, 1950; pp. 362–412.
  30. Andersen, E.B. Latent structure analysis: A survey. Scand. J. Stat. 1982, 9, 1–12.
  31. Lazarsfeld, P.F.; Henry, N.W. Latent Structure Analysis; Houghton Mifflin: Boston, MA, USA, 1968.
  32. Goodman, L.A. The analysis of systems of qualitative variables when some of the variables are unobservable. Part I: A modified latent structure approach. Am. J. Sociol. 1974, 79, 1179–1259.
  33. Goodman, L.A. Exploratory latent structure analysis using both identifiable and unidentifiable models. Biometrika 1974, 61, 215–231.
  34. Dempster, A.P.; Laird, N.M.; Rubin, D.B. Maximum likelihood from incomplete data via the EM algorithm. J. R. Stat. Soc. Ser. B 1977, 39, 1–22.
  35. Haberman, S.J. Analysis of Qualitative Data, Vol 2: New Developments; Academic Press: New York, NY, USA, 1979.
  36. Dayton, C.M.; Macready, G.B. Concomitant-variable latent-class models. J. Am. Stat. Assoc. 1988, 83, 173–178.
  37. Muthén, B.; Shedden, K. Finite mixture modeling with mixture outcomes using the EM algorithm. Biometrics 1999, 55, 463–469.
  38. Stern, H.S.; Arcus, D.; Kagan, J.; Rubin, D.B.; Snidman, N. Using mixture models in temperament research. Int. J. Behav. Dev. 1995, 18, 407–423.
  39. Collins, L.M.; Graham, J.W.; Rousculp, S.S.; Hansen, W.B. Heavy caffeine use and the beginning of the substance use onset process: An illustration of latent transition analysis. In The Science of Prevention: Methodological Advances from Alcohol and Substance Abuse Research; Bryant, K.J., Windle, M., Eds.; American Psychological Association: Washington, DC, USA, 1997.
  40. Vermunt, J.K. Applications of latent class analysis in social science research. In European Conference on Symbolic and Quantitative Approaches to Reasoning and Uncertainty; Springer: Berlin/Heidelberg, Germany, 2003.
  41. Nasiopoulou, P.; Williams, P.; Sheridan, S.; Hansen, K.Y. Exploring preschool teachers’ professional profiles in Swedish preschool: A latent class analysis. Early Child Dev. Care 2019, 189, 1306–1324.
  42. Farina, E.; Bianco, S.; Bena, A.; Pasqualini, O. Finding causation in occupational fatalities: A latent class analysis. Am. J. Ind. Med. 2019, 62, 123–130.
  43. Wu, Y.; Hu, H.; Cai, J.; Chen, R.; Zuo, X.; Cheng, H.; Yan, D. Applying latent class analysis to risk stratification of incident diabetes among Chinese adults. Diabetes Res. Clin. Pract. 2021, 174, 108742.
  44. Hagenaars, J.A.; McCutcheon, A.L. (Eds.) Applied Latent Class Analysis; Cambridge University Press: Cambridge, MA, USA, 2002.
  45. Lanza, S.T.; Cooper, B.R. Latent class analysis for developmental research. Child Dev. Perspect. 2016, 10, 59–64.
  46. Nagin, D.S. Group-Based Modeling of Development; Harvard University Press: Cambridge, MA, USA, 2005.
  47. Petersen, K.J.; Qualter, P.; Humphrey, N. The application of latent class analysis for investigating population child mental health: A systematic review. Front. Psychol. 2019, 10, 1214.
  48. Aflaki, K.; Vigod, S.; Ray, J.G. Part II: A step-by-step guide to latent class analysis. J. Clin. Epidemiol. 2022, 148, 170–173.
  49. Neyman, J.; Pearson, E.S. On the use and interpretation of certain test criteria for purposes of statistical inference: Part I. Biometrika 1928, 20A, 175–240.
  50. Groenitz, H. A new privacy-protecting survey design for multichotomous sensitive variables. Metrika 2014, 77, 211–224.
  51. Groenitz, H. Using prior information in privacy-protecting survey designs for categorical sensitive variables. Stat. Pap. 2015, 56, 167–189.
Table 1. Results of the likelihood ratio test and the null and alternative/full models for the association between the sensitive characteristic variable from question 39: “Have you ever had a one-night stand through a dating site or mobile app?”, and each of the 27 observed ordinal categorical random variables for online dating experience corresponding to statements 40 to 66.
Parameter θ π 01 π 02 π 03 π 04 π 05 π 11 π 12 π 13 π 14 π 15 p-Value
Q 40   H 1 0.21060.10830.17750.27020.24450.19950.15380.18140.22360.12750.31360.6644
     H 0 0.20840.11800.17800.26000.22000.2240
Q 41   H 1 0.21440.04010.09890.28330.30890.26890.06910.14380.04330.38290.36090.1905
     H 0 0.20840.04600.10800.23200.32500.2890
Q 42   H 1 0.21050.02260.13260.24370.34390.25710.10070.06260.33160.21680.28830.3650
     H 0 0.20840.03900.11800.26200.31700.2640
Q 43   H 1 0.21370.03470.14940.34480.31560.15550.18220.29130.38480.05900.08270.0187
     H 0 0.20840.06600.18000.35300.26100.1400
Q 44   H 1 0.20980.03350.13440.25320.34330.23560.06360.07300.33400.32900.20040.8481
     H 0 0.20840.04000.12200.27000.34000.2280
Q45  H1  0.2103  0.0704  0.2526  0.3725  0.2018  0.1028  0.1606  0.1282  0.3657  0.2467  0.0989  0.0836
     H0  0.2084  0.0890  0.2260  0.3710  0.2110  0.1020
Q46  H1  0.2108  0.0753  0.2644  0.3688  0.1905  0.1009  0.1625  0.2111  0.3143  0.2114  0.1008  0.8095
     H0  0.2084  0.0940  0.2530  0.3570  0.1950  0.1010
Q47  H1  0.2095  0.0805  0.2249  0.3769  0.2054  0.1123  0.1002  0.2614  0.3092  0.2297  0.0995  1.0000
     H0  0.2084  0.0850  0.2330  0.3630  0.2100  0.1100
Q48  H1  0.2107  0.1024  0.2693  0.3614  0.1716  0.0954  0.2465  0.2786  0.2871  0.1073  0.0805  0.5409
     H0  0.2084  0.1330  0.2710  0.3460  0.1580  0.0920
Q49  H1  0.2101  0.0297  0.1155  0.2440  0.3280  0.2828  0.0502  0.0547  0.2857  0.3041  0.3054  1.0000
     H0  0.2084  0.0340  0.1030  0.2530  0.3230  0.2880
Q50  H1  0.2095  0.0368  0.1083  0.2961  0.3547  0.2040  0.0838  0.1558  0.2878  0.2394  0.2331  0.7806
     H0  0.2084  0.0470  0.1180  0.2940  0.3310  0.2100
Q51  H1  0.2097  0.0353  0.1485  0.3369  0.2945  0.1847  0.1015  0.1302  0.3531  0.2042  0.2111  0.7475
     H0  0.2084  0.0490  0.1450  0.3400  0.2760  0.1900
Q52  H1  0.2237  0.0230  0.1126  0.2543  0.3377  0.2724  0.0656  0.0038  0.3335  0.3315  0.2656  0.1262
     H0  0.2084  0.0330  0.0880  0.2720  0.3360  0.2710
Q53  H1  0.2104  0.0356  0.1396  0.2733  0.3128  0.2386  0.0555  0.0484  0.2411  0.3369  0.3181  0.7395
     H0  0.2084  0.0400  0.1200  0.2670  0.3180  0.2550
Q54  H1  0.2099  0.0269  0.0940  0.2819  0.3636  0.2336  0.0540  0.1373  0.3190  0.2182  0.2715  1.0000
     H0  0.2084  0.0330  0.1030  0.2900  0.3330  0.2420
Q55  H1  0.2098  0.0062  0.0234  0.1420  0.3001  0.5282  0.0782  0.0704  0.0960  0.1953  0.5600  0.0259
     H0  0.2084  0.0210  0.0330  0.1320  0.2780  0.5350
Q56  H1  0.2101  0.0104  0.0236  0.0960  0.2575  0.6124  0.0315  0.0333  0.0383  0.1681  0.7288  1.0000
     H0  0.2084  0.0150  0.0260  0.0840  0.2390  0.6370
Q57  H1  0.2113  0.0758  0.2109  0.3783  0.1956  0.1393  0.2031  0.2552  0.2960  0.1993  0.0464  0.3833
     H0  0.2084  0.1030  0.2200  0.3610  0.1960  0.1200
Q58  H1  0.2093  0.0328  0.0998  0.2735  0.3070  0.2870  0.0316  0.1121  0.2024  0.4333  0.2205  0.8481
     H0  0.2084  0.0330  0.1020  0.2590  0.3330  0.2730
Q59  H1  0.2111  0.0355  0.0856  0.2273  0.2875  0.3641  0.0044  0.0671  0.2078  0.4214  0.2993  0.6802
     H0  0.2084  0.0290  0.0820  0.2230  0.3160  0.3500
Q60  H1  0.2088  0.0278  0.0812  0.2478  0.3212  0.3220  0.0435  0.1029  0.1609  0.3263  0.3664  0.9284
     H0  0.2084  0.0310  0.0860  0.2300  0.3220  0.3310
Q61  H1  0.2114  0.0002  0.0260  0.1039  0.2794  0.5904  0.0624  0.0005  0.1599  0.2253  0.5519  0.0757
     H0  0.2084  0.0130  0.0210  0.1160  0.2680  0.5820
Q62  H1  0.2052  0.0000  0.0279  0.0977  0.2705  0.6039  0.0687  0.0170  0.1611  0.2390  0.5143  0.3277
     H0  0.2084  0.0140  0.0260  0.1110  0.2640  0.5860
Q63  H1  0.2078  0.0001  0.0239  0.0794  0.2823  0.6143  0.0745  0.0132  0.2108  0.1630  0.5384  0.9950
     H0  0.2084  0.0160  0.0220  0.1070  0.2580  0.5990
Q64  H1  0.2099  0.0030  0.0297  0.0819  0.1839  0.7015  0.0939  0.0241  0.0432  0.0953  0.7434  0.1355
     H0  0.2084  0.0220  0.0290  0.0740  0.1650  0.7100
Q65  H1  0.2108  0.0035  0.0139  0.0738  0.1903  0.7184  0.0674  0.0594  0.0926  0.0733  0.7074  0.9131
     H0  0.2084  0.0170  0.0240  0.0780  0.1660  0.7160
Q66  H1  0.2097  0.0048  0.0304  0.0624  0.1278  0.7747  0.0976  0.0287  0.0632  0.0859  0.7245  0.1985
     H0  0.2084  0.0240  0.0300  0.0630  0.1190  0.7640
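The selection step that leads to Table 2 can be sketched as a filter on the last value of each H1 row above. This is a minimal sketch that assumes the last value is the likelihood ratio test p-value (the column header lies outside this excerpt); the names `lrt_p_values` and `select_significant` are illustrative and do not come from the paper.

```python
# Last value of each H1 row of Table 1 for Q45-Q66, assumed to be the
# p-value of the likelihood ratio test (H0 vs. H1).
lrt_p_values = {
    45: 0.0836, 46: 0.8095, 47: 1.0000, 48: 0.5409, 49: 1.0000, 50: 0.7806,
    51: 0.7475, 52: 0.1262, 53: 0.7395, 54: 1.0000, 55: 0.0259, 56: 1.0000,
    57: 0.3833, 58: 0.8481, 59: 0.6802, 60: 0.9284, 61: 0.0757, 62: 0.3277,
    63: 0.9950, 64: 0.1355, 65: 0.9131, 66: 0.1985,
}

ALPHA = 0.1  # significance level stated in the captions of Tables 2 and 3


def select_significant(p_values, alpha=ALPHA):
    """Return the statement numbers whose p-value is below alpha."""
    return sorted(q for q, p in p_values.items() if p < alpha)


selected = select_significant(lrt_p_values)
print(selected)  # [45, 55, 61]
```

Statement 43, the fourth variable reported in Table 2, belongs to the earlier part of Table 1 and is therefore not among the rows filtered here.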
Table 2. (Results extracted from Table 1) Summary of significant results of the alternative/full model for the association between the sensitive characteristic variable from question 39: “Have you ever had a one-night stand through a dating site or mobile app?”, and each of the four observed ordinal categorical random variables for online dating experience corresponding to selected statements 43, 45, 55, and 61, respectively, by the likelihood ratio test at α = 0.1.
Parameter       θ       π01     π02     π03     π04     π05     π11     π12     π13     π14     π15
Q43  Estimate   0.2137  0.0347  0.1494  0.3448  0.3156  0.1555  0.1822  0.2913  0.3848  0.0590  0.0827
     SE         0.0370  0.0142  0.0217  0.0225  0.0197  0.0173  0.0604  0.0721  0.0737  0.0556  0.0464
Q45  Estimate   0.2103  0.0704  0.2526  0.3725  0.2018  0.1028  0.1606  0.1282  0.3657  0.2467  0.0989
     SE         0.0388  0.0176  0.0213  0.0320  0.0266  0.0170  0.0679  0.0642  0.1134  0.0992  0.0646
Q55  Estimate   0.2098  0.0062  0.0234  0.1420  0.3001  0.5282  0.0782  0.0704  0.0960  0.1953  0.5600
     SE         0.0427  0.0064  0.0076  0.0174  0.0260  0.0233  0.0314  0.0299  0.0570  0.0943  0.0778
Q61  Estimate   0.2114  0.0002  0.0260  0.1039  0.2794  0.5904  0.0624  0.0005  0.1599  0.2253  0.5519
     SE         0.0399  0.0019  0.0043  0.0156  0.0268  0.0269  0.0173  0.0058  0.0723  0.0841  0.0995
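One way to read these estimates is as a two-group mixture. The sketch below assumes that θ is the proportion of respondents in the sensitive group and that π0j and π1j are the category probabilities in the non-sensitive and sensitive groups, respectively; this interpretation is not restated in this excerpt, and the function name is illustrative. Under that reading, the marginal probability of category j is θ π1j + (1 − θ) π0j, which can be compared with the H0 row of Table 1 for the same statement.

```python
def marginal_distribution(theta, pi0, pi1):
    """Mix the two group-specific category distributions with weight theta."""
    return [theta * p1 + (1.0 - theta) * p0 for p0, p1 in zip(pi0, pi1)]


# Q45 estimates from Table 2.
theta = 0.2103
pi0 = [0.0704, 0.2526, 0.3725, 0.2018, 0.1028]  # assumed: non-sensitive group
pi1 = [0.1606, 0.1282, 0.3657, 0.2467, 0.0989]  # assumed: sensitive group

# H0 row for Q45 in Table 1 (one common distribution for both groups).
pi_null = [0.0890, 0.2260, 0.3710, 0.2110, 0.1020]

marginal = marginal_distribution(theta, pi0, pi1)
for j, (mix, null) in enumerate(zip(marginal, pi_null), start=1):
    print(f"category {j}: mixture {mix:.4f}, H0 estimate {null:.4f}")
```

For Q45 the mixture and the H0 estimates differ by less than 0.001 in every category, which is consistent with the assumed reading of the parameters.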
Table 3. Estimates of the parameters of the latent class model for the association between the sensitive characteristic variable from question 39: “Have you ever had a one-night stand through a dating site or mobile app?”, and the four observed dichotomized random variables for online dating experience corresponding to statements 43, 45, 55, and 61, which were selected by the likelihood ratio test at α = 0.1.
Parameter  θ       α01     α02     α03     α04     α11     α12     α13     α14
Estimate   0.2201  0.0594  0.1602  0.0145  0.0052  0.9068  0.8668  0.1966  0.1361
SE         0.0127  0.0097  0.0095  0.0033  0.0019  0.0262  0.0257  0.0166  0.0142
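To see how such estimates could feed a prediction, the following sketch computes the posterior probability of belonging to the sensitive group for each of the 16 response patterns. It assumes that α0k and α1k are the probabilities that the k-th dichotomized variable equals 1 in the non-sensitive and sensitive groups, and that the four variables are conditionally independent given the group, as is usual in a latent class model. Neither the coding of the dichotomized variables nor the classification criteria in (17), (18), and (20) appear in this excerpt, so this is an illustration and not the paper's procedure; all function names are hypothetical.

```python
from itertools import product

# Estimates from Table 3.
theta = 0.2201
alpha0 = [0.0594, 0.1602, 0.0145, 0.0052]  # assumed: P(Y_k = 1 | non-sensitive)
alpha1 = [0.9068, 0.8668, 0.1966, 0.1361]  # assumed: P(Y_k = 1 | sensitive)


def pattern_likelihood(y, alpha):
    """Probability of the binary pattern y under conditional independence."""
    lik = 1.0
    for y_k, a_k in zip(y, alpha):
        lik *= a_k if y_k == 1 else 1.0 - a_k
    return lik


def posterior_sensitive(y, theta, alpha0, alpha1):
    """Bayes' rule: P(sensitive group | observed pattern y)."""
    sensitive = theta * pattern_likelihood(y, alpha1)
    non_sensitive = (1.0 - theta) * pattern_likelihood(y, alpha0)
    return sensitive / (sensitive + non_sensitive)


# Posterior probability for every possible response pattern.
for y in product([0, 1], repeat=4):
    print(y, round(posterior_sensitive(y, theta, alpha0, alpha1), 4))
```

Under these assumptions, a respondent coded 1 on the first two variables has a high posterior probability of being in the sensitive group, while the all-zero pattern has a posterior probability close to zero.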
Table 4. Comparison of results of predicting respondents to have ever had a one-night stand through a dating site or mobile app based on classification criterion 1 in (17) and (18) and criterion 2 in (20).
Rows: classification criterion 2. Columns: classification criterion 1. “Ever” = having ever had a one-night stand through a dating site or mobile app; “Never” = having never had a one-night stand through a dating site or mobile app.

Criterion 2 \ Criterion 1    Ever    Never    Total
Ever                          212       34      246
Never                         341     2178     2519
Total                         553     2212     2765
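The cross-classification in Table 4 can be summarized with a few lines of arithmetic. The sketch below recomputes the margins, the share of respondents predicted as “ever” by each criterion, and the proportion of respondents on whom the two criteria agree. Cohen's kappa is included only as a common chance-corrected agreement measure; it is an added illustration and is not reported in this excerpt.

```python
# Counts from Table 4.
# Rows: criterion 2 (ever, never); columns: criterion 1 (ever, never).
table = [[212, 34],
         [341, 2178]]

n = sum(sum(row) for row in table)
row_totals = [sum(row) for row in table]        # criterion 2 margins
col_totals = [sum(col) for col in zip(*table)]  # criterion 1 margins

# Proportion of respondents classified identically by both criteria.
observed_agreement = (table[0][0] + table[1][1]) / n

# Agreement expected by chance if the two criteria were independent.
expected_agreement = sum(r * c for r, c in zip(row_totals, col_totals)) / n**2

# Cohen's kappa (illustrative; not part of the paper's reported results).
kappa = (observed_agreement - expected_agreement) / (1.0 - expected_agreement)

print(f"criterion 1 predicts 'ever' for {col_totals[0] / n:.3f} of respondents")
print(f"criterion 2 predicts 'ever' for {row_totals[0] / n:.3f} of respondents")
print(f"observed agreement: {observed_agreement:.3f}")
print(f"Cohen's kappa: {kappa:.3f}")
```

The recomputed margins match the Total row and column of Table 4, which also confirms how the fused counts in the extracted table split into cells.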
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Lee, S.-M.; Tran, P.-L.; Le, T.-N.; Li, C.-S. Prediction of a Sensitive Feature under Indirect Questioning via Warner’s Randomized Response Technique and Latent Class Model. Mathematics 2023, 11, 345. https://doi.org/10.3390/math11020345
