Systematic Measurement Error in Self-Reported Health: Is Anchoring Vignettes the Way Out?

This paper studies systematic reporting heterogeneity in self-assessed health in India using the World Health Survey (WHS)-SAGE survey, which contains subjective assessments of own health and of hypothetical vignettes as well as objective measures, such as measured anthropometrics and performance tests, across a range of health domains. The study implicitly tests and validates the assumption of response consistency in a developing country setting, thus lending support to the use of vignettes. Additionally, we are able to control for unobservable heterogeneity in reporting behavior at the individual level by employing individual fixed-effects estimation using multiple ratings on a set of vignettes by the same person. The study confirms the same pattern of systematic bias across socioeconomic subgroups as is indicated by the vignette technique. It further highlights that a substantial amount of reporting heterogeneity remains unexplained after controlling for the usual socioeconomic variables. The finding has potentially broader implications for research based on self-reported data in developing countries.

vignette description of a hypothetical situation that is fixed for all respondents (King et al. 2004; Bago d'Uva et al. 2011). The idea is based on the underlying assumption that any variation in the rating of a vignette (depicting a fixed level of latent health) would identify systematic reporting bias, which can then be adjusted for in the individual's subjective assessment of her own situation.
The validity of this approach, however, relies on two important assumptions: "vignette equivalence," which requires that all individuals perceive the vignette description as corresponding to a given state of the same underlying construct, and "response consistency," which implies that individuals use the same response categories for their subjective assessment as they use for evaluating the hypothetical scenarios presented to them in vignettes. The latter assumption will not hold if there are strategic influences on reporting of the individual's own situation that are absent from evaluation of the vignette (Bago d'Uva et al. 2011). This study is one of the first to test the assumption of response consistency in a developing country setting, where measurement error in survey data may be more of a problem.
The paper first presents a framework to formally test the existence of systematic measurement error across sociodemographic subgroups. We examine systematic reporting heterogeneity in two ways: first using responses from vignette ratings across different health domains, and second using a method that combines data on objective and self-reported health indicators. In this process, we implicitly test the validity of the "response consistency" assumption. The paper adds to the existing body of literature by checking the validity of this assumption in a developing country setting. Further, using repeated ratings from the same individual over multiple vignettes, we can control for idiosyncratic heterogeneity in an individual fixed-effects estimation. Specifically, we examine whether the pattern of reporting behavior matches that obtained from our first exercise and to what extent it can be explained by socioeconomic characteristics that are usually accounted for in a regression framework.
The study finds a strong presence of systematic reporting heterogeneity in self-assessed health across subgroups and validates the assumption of response consistency. The result is confirmed in our robustness check, where we even control for individual fixed effects. Further, it finds that the reporting heterogeneity in SAH can only be partially explained by observable characteristics of individuals and that a large part of it remains unexplained. Thus, the study has important implications for research that relies solely on subjective health data.

Theoretical framework
Van Doorslaer and Jones (2002) find that subgroups of the population systematically use different thresholds in classifying their health into a categorical measure. Individuals are likely to use different reference points and interpret the self-assessed health (SAH) question within their own specific context (Lindeboom and van Doorslaer 2004). Sen (1993, 2002) points out that a comparison of self-reported morbidities in a typical developing country setting may find that the children in the poorest households are the healthiest. While various techniques have been proposed for achieving comparable response scales across groups, Murray et al. (2002) indicate anchoring vignettes as "the most promising" of available strategies. Anchoring vignettes reveal how groups may differ in their use of response categories, i.e., where along the health spectrum individuals locate thresholds between the ordered categories. The idea is to vary the health status exogenously in each of the hypothetical cases, so that any difference in the rating of these fixed latent health situations would identify systematic differences in reporting behavior by socioeconomic subgroups. Bago d'Uva et al. (2008), using the vignette technique, rejected reporting homogeneity across educational groups using pilot data from Indonesia, India 1, and China.
Although the SAH measure is widely used in empirical research, measurement error in SAH is no longer random if, for a given true but unobserved health state, survey respondents report their health differently depending on characteristics such as conceptions of general health, utilization of health services, expectations for own health, financial incentives to report ill health, and comprehension of the survey questions. Bound (1991) highlights that if measurement error in a given variable is not "classical," it can introduce serious biases in estimates, ranging from simple attenuation to misattributed relationships. Economic circumstances and geographic location may alter health expectations through factors like peer effects, societal norms, and access to medical care. Reporting of health may vary with education through the awareness factor, i.e., conceptions of illness, understanding of disease, and knowledge of the availability, accessibility, and effectiveness of health care. Antman and McKenzie (2007) and Escobal and Laszlo (2008) point to a number of reasons why measurement error may be more of a problem in developing country settings, for which validation data are not readily available. It becomes particularly problematic because there can be a high degree of heterogeneity among respondents in this setting in terms of literacy level and health awareness. Noteworthy is the fact that the state of Kerala (with one of the lowest levels of mortality among Indian states) has consistently reported the highest morbidity rate (approximately three times the all-India average) in three successive rounds of the nationally representative NSS survey, whereas Bihar, with one of the highest mortality rates, reported the lowest morbidity. Banerjee et al. (2004) mention that sick individuals in a poor, disease-endemic area with limited health access or opportunities for medical treatment may report being in good health because some types of illness may be perceived as "normal" phenomena due to their prolonged, widespread occurrence in the area, where people may have adapted to the sickness that they experience. Schultz and Strauss (2008) mention that some illnesses, such as blindness, ringworm, or malaria, may be perceived as normal phenomena due to their prolonged, widespread occurrence in a disease-prone area without health access, where individuals may not see themselves as particularly unhealthy.
However, one of the key identifying assumptions of this methodology is response consistency, i.e., that respondents use the same thresholds when evaluating their own health as they use in evaluating the vignettes. Kapteyn et al. (2011), using an Internet-based panel in the USA, find mixed evidence on the response consistency assumption, which holds for certain health domains and not for others. Van Soest et al. (2011) develop an integrated framework in which objective measurements are used to validate vignette-based corrections of subjective assessments of drinking behavior by students in Ireland. Bago d'Uva et al. (2011) point out that the assumption of response consistency is testable given sufficiently comprehensive objective indicators that independently identify response scales. Their study finds mixed results for response consistency in a sample of older English individuals. Although the assumption of response consistency has been debated in the recent literature in the context of developed countries, there exists scant evidence from developing country settings. The current study addresses this gap by testing this assumption using nationally representative data from India. This paper provides a formal framework to test the existing pattern of reporting behavior in SAH and offers a simple methodological technique to check the assumption of response consistency used in the vignette approach 2 in a developing country setting, which has important implications for informing survey design using self-assessed responses.
Dasgupta IZA Journal of Development and Migration (2018) 8:12

Empirical strategy
We employ three distinct methods to test the reporting behavior in SAH responses in this study. First, we test the existence of systematic measurement error in SAH across population subgroups by estimating the ordered probit model of the vignette responses following King et al. (2004) to identify the reporting heterogeneity by covariates.
Let $H_i^V$ be the reported health status for the vignette question, and let $X_i$ be a vector of observed characteristics (sociodemographic covariates potentially susceptible to systematic reporting bias, such as age, gender, education, income, and location). Identification relies on the fact that a vignette represents a fixed level of latent health; hence, any difference in rating pattern by covariates can be attributed to systematic reporting heterogeneity associated with the $X_i$'s. We estimate Eq. (1), an ordered probit of $H_i^V$ on $X_i$. A positive (negative) and significant coefficient ($\beta$) would imply over-reporting (under-reporting) of worse health, since the degree of worse health or difficulty increases from 1 to 5 in the categorical response of the dependent variable.
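The display for Eq. (1) is not reproduced in this copy. A form consistent with the ordered probit described above is sketched below, only as a reconstruction under stated assumptions: the latent index $H_i^{V*}$, the cut points $\tau_k$, and the standard normal error are conventional ordered probit notation assumed here, not symbols taken from the text.

```latex
\begin{align*}
H_i^{V*} &= X_i'\beta + \varepsilon_i, \qquad \varepsilon_i \sim N(0,1), \tag{1} \\
H_i^{V}  &= k \quad \text{if } \tau_{k-1} < H_i^{V*} \le \tau_k, \qquad k = 1,\dots,5, \\
         &\phantom{= k \quad} \text{with } \tau_0 = -\infty,\ \tau_5 = +\infty .
\end{align*}
```

Under this form, a positive element of $\beta$ shifts probability mass toward the higher response categories, which matches the reading of a positive coefficient as over-reporting of worse health.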
As reporting of health status can potentially be influenced by expectations for own health, tolerance of illness, and health norms in society, we include the following controls in the X vector: education categories, gender, age groups, body mass index 3 (BMI categories), expenditure quintiles, religion, ethnic groups, sector (urban/rural), and a dummy for underdeveloped states capturing the level of development in the state (which implicitly controls for access to effective health care and can serve as a rough measure of tolerance of illness in the society). Banerjee et al. (2004) find that individuals in the upper third income group report the most symptoms over the last 30 days and attribute this to higher awareness of health status. Thus, in order to identify any nonlinear effect of income on reporting bias, we use the middle expenditure quintile as the reference category in our estimated equation.
In our second empirical approach, we identify systematic reporting heterogeneity using both the self-reported and the objective health indicators collected in the data. We regress self-reported health ($H_i^{rep}$) on the same set of covariates ($X_i$), controlling for a battery of "objective" health measures ($H_i^{obj}$). The underlying idea is that any systematic variation in subjective assessments that remains after conditioning on the objective indicators can be attributed to systematic biases in reporting behavior.
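The second equation is described in words only in this copy. One way to write it, as a sketch rather than the paper's own display, is the linear index below; the coefficient vector $\gamma$ and the error term $u_i$ are names introduced here for illustration, and the passage does not say whether the outcome enters through an ordered probit link as in Eq. (1).

```latex
\begin{equation*}
H_i^{rep} = X_i'\beta + H_i^{obj\,\prime}\gamma + u_i \tag{2}
\end{equation*}
```

In this form, $\gamma$ absorbs variation due to differences in measured health, so that any remaining association between $X_i$ and $H_i^{rep}$, captured by $\beta$, is read as reporting heterogeneity.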
This specification hinges on the fact that, after correcting for "true" health, the reporting heterogeneity (if any) would be reflected in the coefficients of the covariates in the second equation. Specifically, the assumption is that adding a precise set of objective indicators would soak up the variation coming from differences in true/latent health, leaving the reporting bias to be identified. We claim that the battery of "objective" health measures is sufficient for this second approach to hold, which lets us test whether the response consistency assumption holds in this data. As we have both the objective and subjective counterparts for the particular health domains of mobility, functioning, cognition, and memory, we are able to precisely control and condition for the "observable" health counterpart and run this specification.
So, one way of implicitly checking the response consistency assumption would be to see if the pattern of reporting heterogeneity by socioeconomic subgroups from Eq. (1) matches with Eq. (2). Precisely, we claim that "response consistency" assumption would hold in this data if we get the same signs of β's from both the estimations.
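The sign comparison described above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's procedure: the function name `sign_agreement`, the covariate labels, and every coefficient value below are made up for the example and are not estimates from the study.

```python
def sign(x: float) -> int:
    """Return -1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def sign_agreement(beta_eq1: dict, beta_eq2: dict) -> dict:
    """For covariates present in both estimations, report whether the
    two coefficients share the same non-zero sign."""
    shared = sorted(set(beta_eq1) & set(beta_eq2))
    return {
        name: sign(beta_eq1[name]) != 0
        and sign(beta_eq1[name]) == sign(beta_eq2[name])
        for name in shared
    }


# Hypothetical coefficients (illustrative numbers only, NOT results).
beta_vignette = {"cov_a": -0.12, "cov_b": 0.20, "cov_c": -0.05}
beta_objective = {"cov_a": -0.08, "cov_b": 0.15, "cov_c": 0.02}

agreement = sign_agreement(beta_vignette, beta_objective)
print(agreement)
```

A fuller check would also account for statistical significance, since a sign match between two imprecisely estimated coefficients carries little information.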
In our third empirical strategy, we exploit individual fixed-effects estimation to control for individual-specific unobservable factors in reporting heterogeneity. This is more of a robustness exercise in which we tackle even idiosyncratic reporting heterogeneity by employing individual fixed effects. There may be certain unobservable characteristics at the individual level (for example, the person not being serious while evaluating responses) that add to the reporting heterogeneity. With responses on multiple vignettes for the same individual, we are able to control for individual fixed effects in a two-stage regression estimation. In the first stage (Eq. 3), we regress the vignette responses (10 questions per vignette set for each individual) on individual dummies $ID_i$ to obtain their corresponding coefficients μ, which we use in the second stage (Eq. 4) as the dependent variable to be explained by the usual covariates. We claim that any variation between respondents in the assessment of $H_i^V$ (representing a fixed level of latent health) is reflected in the μ's, which capture the reporting heterogeneity at the individual level. This method lets us explore the variation in reporting behavior devoid of the noise that can arise from individual-specific unobservable factors. We estimate Eqs. (3) and (4) in sequence. Through this exercise, we examine the pattern of reporting heterogeneity that remains after we control for such individual-specific unobservable characteristics. Further, we examine how much of that reporting heterogeneity is explained by observable factors that are usually controlled for in a typical regression, by checking the R-square of the estimated Eq. (4). The motivation behind this exercise is the fact that vignette adjustments can only detect and correct for reporting heterogeneity by observable characteristics of the respondents.
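The displays for Eqs. (3) and (4) are not reproduced in this copy. A sketch consistent with the two-stage description is given below; the vignette index $j$, the indicator notation for the individual dummies, the second-stage coefficient $\delta$, and the error terms $e_{ij}$ and $v_i$ are assumptions introduced for illustration.

```latex
\begin{align*}
H_{ij}^{V} &= \sum_{m=1}^{N} \mu_m \,\mathbf{1}[m = i] + e_{ij}, \tag{3} \\
\hat{\mu}_i &= X_i'\delta + v_i . \tag{4}
\end{align*}
```

Here $\mathbf{1}[m = i]$ plays the role of the individual dummy $ID_i$, and $\hat{\mu}_i$ is the first-stage estimate carried into the second stage. With only individual dummies on the right-hand side of (3), $\hat{\mu}_i$ equals individual $i$'s mean rating across the vignettes.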
If much of the reporting heterogeneity arises due to unobservable characteristics of the respondents, then the scope of anchoring vignettes for greater inter-personal comparability is limited. The next section discusses the data followed by the results.

Data and summary statistics
Rajasthan, Uttar Pradesh, and Assam. The data collected included self-reported assessments of health linked to anchoring vignettes, which are hypothetical stories that describe the health problems of third parties in several health domains. The data have information on both "subjective" and "objective" measures of identical health questions in addition to the responses on vignettes.
The states were selected randomly such that one state was selected from each region (from six regions: north, central, east, north east, west, and south) as well as from each level of development category. The level of development was based on four indicators 6, namely infant mortality rate, female literacy rate, percentage of safe deliveries, and per capita income at the state level. We use the development classification 7 used in WHS to construct a dummy for underdevelopment (=1 for the two least developed states, viz. Rajasthan and Uttar Pradesh, and =0 for the other four states).
The data included the following sets of vignettes: mobility and affect; pain and personal relationships; vision, sleep, and energy; and cognition and self-care. The respondent was asked to rate how much of a problem or difficulty the person in the vignette has, on an ordered response scale of 1 to 5, the same scale as used to rate SAH.
The survey data include perceptions of well-being and more objective measures of health, including measured performance tests (rapid walk) and cognitive tests (verbal fluency, recall capacity). We construct four categories of individuals by body mass index using the measured height and weight variables: underweight (BMI < 18.5), normal (BMI 18.5-24.9; reference category in the regressions), overweight (BMI 25-29.9), and obese (BMI > 30). We include six education categories capturing the highest level of education completed: no formal education (reference category), less than primary education, primary, secondary, high school, and college and above. Age is categorized into four groups: 18 to 29.9 years (reference category), 30 to 44.9 years, 45 to 60 years, and greater than 60.
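The categorical covariates described above can be built from measured values roughly as follows. This is a minimal sketch: the function names and category labels are hypothetical, and the treatment of values that fall between or exactly on the printed bands (for example a BMI of 24.95 or exactly 30) uses half-open intervals, which is an assumption rather than something the text specifies.

```python
def bmi_category(weight_kg: float, height_m: float) -> str:
    """Classify measured BMI into the four groups used as covariates."""
    if weight_kg <= 0 or height_m <= 0:
        raise ValueError("weight and height must be positive")
    bmi = weight_kg / height_m ** 2
    if bmi < 18.5:
        return "underweight"
    if bmi < 25:          # assumed cut point between the printed bands
        return "normal"   # reference category
    if bmi < 30:          # assumed cut point between the printed bands
        return "overweight"
    return "obese"


def age_group(age_years: float) -> str:
    """Classify age into the four groups used as covariates."""
    if age_years < 18:
        raise ValueError("sample covers adults aged 18 and above")
    if age_years < 30:
        return "18-29.9"  # reference category
    if age_years < 45:
        return "30-44.9"
    if age_years <= 60:
        return "45-60"
    return ">60"


print(bmi_category(70, 1.75), age_group(60))
```

In a regression, each label would then be expanded into dummy variables with the reference category omitted.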
The total number of individuals with complete information 8 across measured health indicators is 10,873. Table 1 presents the summary statistics of the key variables of interest. Figure 1 presents the variation in SAH responses in the sample. We find that individual responses on SAH cluster around the middle value. We plot the distribution of measured and reported height across expenditure quintiles and education categories in Figs. 2 and 3, respectively. The plots reveal that, on average, individuals underreport their true height and that this difference becomes smaller for higher expenditure and education categories. Disaggregating by the development category of the states, we find that the difference between reported and measured height is most prominent for individuals from the poorest quintiles (Fig. 4). Interestingly, in states like Karnataka and Uttar Pradesh, self-reported height is always greater than measured height for all expenditure quintiles, unlike the other states, where it is the opposite. This suggests that cultural factors may be driving the reporting bias. Also, individuals from less developed states (correlated with less education and lower access to health facilities) have systematically different reporting behavior. We find that the gap between reported weight and measured weight is significant for all expenditure quintiles in the less developed states but not in the developed states (Fig. 5). Also, the gap is highest in the poorest quintile. This is in line with the finding of Strauss and Thomas (1996), who observe that the gap between maternal reports and measurements of child height is smaller among higher income and better educated mothers. In the next section, we explore this line of enquiry further with our regression framework, followed by robustness checks.

Results
Equation (1) is estimated separately for 10 health state vignettes from each health domain. We separately present the regression results for the domains "mobility and affect," "pain and personal relationships," "vision, sleep, and energy," and "cognition and self-care" in Tables 2, 3, 4, and 5, respectively. All specifications for this set of regressions include dummies for education categories, gender, age groups, marital status, body mass index categories, household expenditure quintiles, religion, caste, sector, and level of development in one's state. We then regress the dependent variable "how would you rate your health today" on the same set of covariates (Table 6) but include a set of performance tests and interviewer assessments. Further, we control for (i) performance test scores for mobility and cognitive ability and (ii) biomarkers including tests for lung function, blood pressure, pulse rate, and chronic illness diagnosed (arthritis, stroke, angina, diabetes, chronic lung disease, asthma, depression, hypertension, cataracts, oral health, injuries, cancer screening) in specification (3) in the same table. Specification (4) adds the respective interviewer assessment dummies. The idea is that we are able to precisely control and condition for the "observable" health counterpart and test it by the specific health domains of mobility, functioning, cognition, and memory (results presented in Tables 7, 8, 9, and 10).
Males, on average, show a systematic pattern of under-reporting of worse health that is consistent across all the health domains. 9 Interestingly, we find that individuals from both lower and higher quintiles have a higher probability of reporting better health than the middle income group. Individuals from urban areas are more likely to under-report worse health. The dummy for underdevelopment is negative and statistically significant across specifications. With regard to age group, individuals over 60 years tend to over-report illness. Interestingly, those who are underweight and obese, controlling for their objective health, tend to over-report worse health. Perhaps the most interesting result that stands out from this exercise is the systematic reporting bias across development categories of states in India. This is suggestive of the hypothesis that socially disadvantaged individuals fail to perceive and report the presence of illness because an individual's assessment of their health is directly contingent on their social experience. It can be attributed to lower expectations for own health or higher tolerance for disease, where an individual may not see herself as being unhealthy conditional on the health norm prevailing in her community.
We now discuss the findings from the cross-validation exercise estimating Eq. (2) and comment on the validity of the "response consistency" assumption across different health domains. Overall, we find that the patterns in the subjective evaluation of own health problems and of the vignette person are basically identical, which lends support to the response consistency assumption in the data. We find that individuals with secondary education and above are more likely to under-report illness, a result that is statistically significant at the 1% level across all specifications. It is possible that higher educated respondents feel greater confidence regarding their capacity to handle a given level of health impairment and therefore underrate it. Males, as before, show consistent patterns of underreporting illness compared to females, statistically significant across specifications. Compared to the young age group, individuals over 60 years significantly over-report illness, which is consistent with our earlier finding from the vignette approach.
Both the poor and the rich tend to understate illness compared to the middle expenditure group. The underdeveloped dummy is consistently negative and statistically significant across all specifications, implying an underreporting of worse health among the disadvantaged group. This coefficient increases after interviewer assessments of health states are controlled for, confirming that it is picking up the reporting bias. We further estimate a vector of self-reported functioning measures in the domains of mobility (Table 7), daily activities (Table 8), and cognitive outcomes (Table 9). While estimation of the self-reported measure for memory would suggest that males fare better, we find the contrary result when we estimate the objective memory test for words recalled (Table 10). As expected, individuals from underdeveloped states score lower on both cognitive tests. The findings reveal systematic underreporting of worse health among males, higher educated groups, and residents of urban areas and underdeveloped states, reconfirming our earlier findings. Interestingly, the coefficient on the underdeveloped dummy for interviewer-assessed health problems reveals that individuals from underdeveloped states were more likely to have health problems (results not included).
The distribution of μ for different health domains (Figs. 6, 7, 8, and 9) reveals substantial reporting heterogeneity between individuals. We examine how much of this variation in μ can be explained by the observable characteristics of the respondents in Table 11. The result matches what we found earlier. We find that males are more likely to rank their health state favorably and that individuals above 60 years are likely to overstate bad health. Interestingly, both the quintiles above and below the middle expenditure group are likely to understate ill health. Individuals from underdeveloped states are found to consistently underestimate health problems. This has important implications for inter-personal comparability of self-reported data even within a geographical region that may not be homogeneous in terms of development. Now, in order to see how much of the reporting heterogeneity can be attributed to the observable characteristics, we examine the R-square of the estimated Eq. (4) for different health domains. We find that the R-square for estimations (1) to (4) explains just 3% (mobility and affect) to 7% (cognition and self-care) of the variation in reporting behavior. 10 This is alarming given that we can only control and adjust for the observables in the regression, which leaves much of the reporting heterogeneity at the individual level typically unaccounted for. Also, this potentially limits the use of the anchoring vignette approach to make SAH responses more comparable if much of the reporting heterogeneity between individuals is due to unobservable factors that we do not control for in a regression. We discuss the implications of our results in the next section.
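The two-stage logic behind the R-square comparison can be illustrated with a toy computation. This is a sketch under simplifying assumptions, not the paper's estimation: the first stage uses individual dummies only (so each person's coefficient is that person's mean vignette rating), the second stage has a single hypothetical covariate, and all data values are synthetic.

```python
from collections import defaultdict
from statistics import mean


def individual_effects(ratings):
    """Stage 1: OLS on a full set of individual dummies (no intercept)
    returns each person's mean rating as that person's coefficient."""
    by_person = defaultdict(list)
    for person_id, rating in ratings:
        by_person[person_id].append(rating)
    return {pid: mean(vals) for pid, vals in by_person.items()}


def simple_ols(x, y):
    """Stage 2: OLS of y on a constant and one regressor x.
    Returns (slope, r_squared)."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((a - x_bar) ** 2 for a in x)
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    syy = sum((b - y_bar) ** 2 for b in y)
    slope = sxy / sxx
    ss_res = sum((b - (y_bar + slope * (a - x_bar))) ** 2
                 for a, b in zip(x, y))
    return slope, 1 - ss_res / syy


# Synthetic ratings (person_id, rating on the 1-5 scale); not survey data.
ratings = [("p1", 2), ("p1", 3), ("p1", 4),
           ("p2", 1), ("p2", 2), ("p2", 3),
           ("p3", 3), ("p3", 4), ("p3", 5),
           ("p4", 2), ("p4", 2), ("p4", 2)]
covariate = {"p1": 0, "p2": 1, "p3": 0, "p4": 1}  # hypothetical dummy

mu = individual_effects(ratings)
ids = sorted(mu)
slope, r_squared = simple_ols([covariate[i] for i in ids],
                              [mu[i] for i in ids])
print(mu, slope, round(r_squared, 3))
```

The toy R-square here is high by construction; the point of the sketch is only the mechanics, in which a low second-stage R-square would indicate that observables explain little of the between-person variation in ratings.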

Conclusions
One of the key challenges in the analysis and interpretation of health survey data is improving the interpersonal comparability of subjective indicators, which carry systematic measurement error as a consequence of differences in the ways that individuals understand and use the available responses for a given question. In this paper, we examine the pattern of reporting differences in SAH from a nationally representative survey in India and find evidence that measurement error in SAH varies systematically with demographic characteristics, such as age, gender, and education, and with community characteristics, such as sector and level of development in the state. This has several important implications. First, one should be careful in using self-reported health data for inter-personal comparison of health status. This becomes all the more relevant for policy formulation in a developing country setting like India, where objective data on health are scarce and one has to rely on self-reported health measures for assessing the health situation of the country. This has consequences for the evaluation of health policies that are based entirely on self-reported data on morbidity, utilization of and expenditure on health care, perceived well-being 11, and self-rated ranking of health service delivery used in citizen and community report cards. Hence, drawing causal inference about a program based on self-reported health measures needs to be re-examined in the light of this problem. Further, one has to reflect on the problem that this reporting heterogeneity cannot simply be dealt with by controlling for covariates in a typical regression framework. The findings on systematic reporting behavior by social disadvantage are mixed in our study. While there are non-linearities in systematic reporting bias by education and expenditure quintile of the respondents, we find that individuals from underdeveloped states underreport the presence of illness or health deficits across all specifications. We additionally verify that the "response consistency" assumption holds in this data. Further, controlling for individual fixed effects, we purge the idiosyncratic unobservable features of individual reporting behavior and confirm the earlier patterns of systematic bias by gender, age, and development level. Also, we point out that the observable characteristics of the respondents explain only a small portion of this remaining heterogeneity in SAH. Hence, we argue that given the dearth of objective health information, which is often costly to collect in a developing country setting, inclusion of vignette profiles in the questionnaire provides an arguably low-cost means of identifying systematic bias in responses, thus mitigating this problem.
Endnotes
1 For India, only pilot data from Andhra Pradesh were analyzed.
2 It has been argued that individuals may use different thresholds for rating vignette questions as opposed to rating self-reported health questions.
3 In order to see whether reporting bias varies by true health, we include the measured body mass index categories (viz. underweight, normal, overweight, and obese).
4 Implementation of SAGE Wave 1 was from 2007 to 2010 in six countries over different regions of the world (China, Ghana, India, Mexico, Russian Federation, and South Africa).
5 The sample was stratified by state and locality (urban/rural), resulting in 12 strata, and is nationally representative. Of the 28 states, 19 were included in the design, which covered 96% of the population. The survey implemented a multistage cluster sampling design resulting in nationally representative cohorts.
6 A composite index of the level of development was computed by giving equal weightage to the four indicators.
7 The states were ranked in this decreasing order of development (Maharashtra > Karnataka > West Bengal > Assam > Rajasthan > Uttar Pradesh) based on the composite index of infant mortality rate, female literacy rate, percentage of safe deliveries, and per capita income.
8 Around 500 observations do not have scores or were not measured on some performance tests, i.e., less than 5% of the sample had missing information on X's; however, they were not dropped from the analysis.
9 The only exception is the health domain of pain and discomfort, where the male dummy changes sign and is positive and significant in 3 estimations (Table 3).
10 The inclusion of interaction terms of the covariates also does not seem to improve the R-square.
11 For example, Gilligan et al. (2009) use self-perceived well-being as an outcome of interest in examining the causal impact of the PSNP food security program in Ethiopia.