The criterion validity of willingness to pay methods: a systematic review and meta-analysis of the evidence

Background: The contingent valuation (CV) method is used to estimate the willingness to pay (WTP) for services and products to inform cost benefit analyses (CBA). A long-standing criticism that stated WTP estimates may be poor indicators of actual WTP calls into question their validity and the use of such estimates for welfare evaluation, especially in the health sector. Available evidence on the validity of CV studies so far is inconclusive. We systematically reviewed the literature to (1) synthesize the evidence on the criterion validity of WTP/willingness to accept (WTA), (2) undertake a meta-analysis, pooling evidence on the extent of variation between stated and actual WTP values and (3) explore the reasons for the variation.
Methods: Eight electronic databases were searched, along with citations and reference reviews. Fifty papers detailing 159 comparisons were identified and reviewed using a standard proforma. Two reviewers were involved at each stage: paper selection, review and data extraction. Meta-analysis was conducted using random effects models for ratios of means and percentage differences separately. Meta-bias was investigated using funnel plots.
Results: For ratios of means, hypothetical WTP was on average 3.2 times actual WTP (range 0.7–11.8); for percentage differences, the average was 5.7 (range 0.0–13.6). However, key methodological differences between surveys of hypothetical and actual values were found. In the meta-analysis, high levels of heterogeneity existed. The overall effect size for mean summaries was 1.79 (1.56–2.04) and 2.37 (1.93–2.80) for percent summaries. Regression analyses identified mixed results on the influence of the different experimental protocols on the variation between stated and actual WTP values. Results indicating publication bias did not account for differences in study design.
Conclusions: The evidence on the criterion validity for CV studies is more mixed than authors are representing because substantial differences in study design between hypothetical and actual WTP/WTA surveys are not accounted for.


Introduction
Cost-benefit analysis (CBA) of public investments requires measurement of aggregate WTP (Slothuus, 2000). The CV method allows the assignment of a monetary value to the benefits attached to a public good or service for comparison with its costs (Mitchell and Carson, 1989). In this way, the method enables the estimation of economic value for a wide range of commodities not traded in markets (Slothuus et al., 2002). Surveys or interviews are used to elicit people's preferences and monetary valuation for goods or services by asking about their WTP or WTA (Mitchell and Carson, 1989). By assuming a utility-theoretic model of consumer preferences, utility is maximised through the consumption of quantities of a good (or service) regarded as a "good" (ceteris paribus) (Klose, 2003). On the other hand, when a good (or service) is regarded as "bad", utility is maximised by consuming (purchasing) less of it. The maximum WTP or minimum willingness to accept (WTA) values for provision (loss) of goods or improvements (reductions)  the situation with or without the good/service (Mitchell and Carson, 1989). The WTP (or WTA) values are elicited contingent on a market existing for the valuation goods.
Unlike other preference elicitation methods such as the travel cost method (TCM), the hedonic pricing method (HP), conjoint analysis and averting expenditures or averting behaviour, CV can be used to estimate both use and non-use values. CV therefore represents the most promising approach yet developed for determining the public's WTP, especially for public goods. CV is the most widely used, yet most controversial, of the methods for valuing non-marketed goods (Munro, 2009). While the CV method has been widely used in the environmental and transport sectors, it has been less frequently applied in the health sector. Significant concerns about the use of the method focus on the validity of estimates, with critics arguing that hypothetical WTP values do not accurately reflect actual values (Loomis et al., 1996, 1997, 2009; Blumenschein et al., 2001; Blumenschein et al., 1998). This difference between hypothetical and actual values has been defined as hypothetical bias.
There are also concerns about the potential for other biases relating to both the researcher (e.g. design bias) and the survey respondents (e.g. strategic bias) when using the CV method (Brown and Taylor, 2000; Champ et al., 1997; Cummings et al., 1995, 1997). The extent to which hypothetical estimates of mean WTP reflect true values can be assessed using: (i) content validity, which reflects the extent to which an empirical measurement adequately reflects a specific domain of content (Carmines and Zeller, 1979). In WTP this is reflected in whether the framing of the CV questions for the good being valued is appropriate; (ii) construct validity, which concerns the correspondence between a measure and other measures of the same construct, and the degree to which the findings of a study are consistent with theoretical expectations. For example, construct validity may be assessed by measuring the convergence between values generated using a CV study and other preference elicitation measures such as the TCM. Theoretical relationships may also be tested by comparing mean WTP values of different conditions for which theory suggests different values (Hanley and Splash, 1993; Mitchell and Carson, 1989); and (iii) criterion validity, which is defined as the correlation of a scale with another measure of the trait, ideally a gold standard which has been used and accepted in the field. Criterion validity is assessed through either concurrent validity, in which a new measure is correlated with an existing gold standard with data for both collected at the same time, or predictive validity, in which the criterion is not yet available at the time of data collection.
Criterion validity has the greatest potential for offering a definitive test of the validity of a WTP measure (Mitchell and Carson, 1989). Actual market prices have been taken as an important criterion in CV studies. However, market prices are rarely available for public and quasi-public goods that generate significant non-use values, and therefore an ideal criterion validity test is often unavailable. In the absence of such a test, experimental (simulated) markets have been used, in which the outcomes of hypothetical CV markets are compared with outcomes for identical markets in which the same goods are bought or sold. The actual (real) payment values generated from the simulated market experiments are compared against hypothetical values to evaluate criterion validity (Mitchell and Carson, 1989).
To date, six reviews of criterion validity (four of which include a meta-analysis of the values obtained) have been conducted across different sectors (Carson et al., 1996; Harrison and Rutström, 2008; Liljas and Blumenschein, 2000; List and Gallet, 2001; Little and Berrens, 2003; Murphy et al., 2004). The evidence from these reviews further confirms the presence of hypothetical bias in CV-WTP studies. The effects of different experimental protocols on hypothetical bias have been investigated with mixed results. For example, the variety of elicitation formats, subject pools, study designs (whether within-group or between-group), whether the welfare measure is WTP or WTA and the type of good (private or public) have been identified as potential drivers of hypothetical bias. However, the effect of these on hypothetical bias is mixed across the reviews. The last review was conducted more than a decade ago (Little and Berrens, 2003). The review by Harrison and Rutström (2008) was conducted in 1999 but was not published until 2008. Only two criterion validity assessments of health goods have been included in the evidence synthesised to date (Bhatia and Fox-Rushby, 2003; Blumenschein et al., 2001).
In a review of literature in 1998, Smith argued that data for criterion validity assessments, the 'gold standard', was not available (Smith et al., 1999). However, since this date, both data on this gold standard and its use in the health sector have developed substantially, with some authors arguing that "the potential for survey instruments to provide valid estimates of WTP has been proven" (Donaldson and Shackley, 2002). Nevertheless, there remains great concern about whether hypothetical values provide correct estimates of actual WTP, and the evidence appears to be mixed (Munro, 2009; Loomis et al., 1996, 1997; Blumenschein et al., 2001). With more recent studies comparing stated and actual values performed since the last review, meta-analyses of the summary values will hopefully show consistent results regarding the magnitude of hypothetical bias. This paper presents a narrative and quantitative systematic review and meta-analysis assessing the criterion validity of WTP methods. The review seeks to provide current evidence across the sectors on the criterion validity of WTP methods. This review differs from previous systematic reviews and meta-analyses of criterion validity assessments in two ways: we include only (1) criterion validity assessments that use direct WTP elicitation methods in both the hypothetical and actual surveys and (2) studies that report empirical WTP or WTA values. These criteria justify the broad search, which identified some of the studies included in previous reviews. An updated review will potentially highlight improvements in both the conduct and analysis of criterion validity assessments and may derive important methodological findings regarding WTP CV methods.

Methods
The review follows the PRISMA guidance on methods for conducting and reporting of systematic reviews (Moher et al., 2009).

Literature search strategy
Eight electronic databases (EconLit, TRID, MEDLINE, Embase, Web of Science, PsycINFO, CRD and CINAHL Plus) were searched from their inception to September 2016. The search terms were identified from previous systematic reviews (Carson et al., 1996; Harrison and Rutström, 2008; Liljas and Blumenschein, 2000; List and Gallet, 2001; Little and Berrens, 2003; Murphy et al., 2004). Valuation terms (WTP, WTA, CV, hypothetical value, hypothetical market, indirect, stated preference, stated value, actual market, revealed market and real market or payment) were crossed with validity terms (external validity, criterion validity or predictive validity). Appropriate MeSH terms were used and the search strategy was adapted for each of the databases (see Appendix 1 for a sample search strategy). In addition, reference lists of key papers were reviewed and citation searches were conducted to identify additional papers. Results were handled using Mendeley reference management software.

Study selection criteria
The database search was run by one reviewer (LK) with reference lists and citation searches conducted by two reviewers (LK & JFR). All titles and abstracts, and full papers when in doubt, were double-reviewed (LK & JFR) using the following inclusion criteria: (1) conducted and reported in English; (2) assessed criterion validity of WTP/WTA; (3) included direct WTP elicitation methods (CV) only in both hypothetical and actual surveys; (4) included both a hypothetical and actual survey (with accompanying transaction) and (5) reported empirical WTP or WTA values.

Data extraction
Data were extracted by one reviewer (LK) using a standard template in MS Excel (see Appendix 2), with a second reviewer (SS) double extracting data for a randomly selected 10% sample. Disagreements were resolved through discussion, with any implications followed through to all other papers. Extracted data included background characteristics (e.g. country, terminology used, good valued), survey design (e.g. welfare perspective, elicitation format and pre-specified values for both hypothetical and actual WTP surveys where appropriate, payment vehicle, mode of administration, survey setting), study design (e.g. sampling (unit, sample selection, type of sample, size, response), duration between hypothetical and actual surveys), analytic methods (e.g. WTP estimation methods, regression methods) and main findings (types of comparisons produced and values). Where multiple comparisons were reported in a study, these were extracted separately. This was done to allow for the use of all the estimates and hence a larger dataset for analysis.

Risk of bias
A quality rating was not employed for individual studies as no agreed criteria exist for criterion validity assessments. Risk of bias, which could potentially affect the pooled results, was considered. Metabias (publication and selective reporting bias) was investigated using funnel plots (Lipsey and Wilson, 2001). The metafunnel command in Stata was used to explore the relationship between the ratio (log ratio) and the standard error of the ratio (standard error of the log ratio). In the absence of publication bias, the funnel plot generated from the studies included in the analysis should resemble a symmetrical inverted funnel. Where this is the case, the largest samples would be at the top of this inverted funnel plot, and closer to the true effect size. The smaller studies, on the other hand, would be scattered more widely along the x-axis towards the bottom of the plot. An asymmetrical plot suggests that publication bias may be present.
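The funnel plot logic described above can be sketched in a few lines. This is a minimal illustration only, not the metafunnel implementation: the function names, the input values and the use of conventional pseudo 95% limits (pooled estimate plus or minus 1.96 standard errors) are assumptions made for the example.

```python
def funnel_points(effects, std_errors, pooled, z=1.96):
    """Return one record per comparison for a funnel plot.

    Each record holds the effect (x-axis), its standard error (y-axis,
    usually drawn inverted so that precise comparisons sit at the top),
    the pseudo confidence limits pooled +/- z * SE, and the side of the
    pooled estimate on which the comparison falls.
    """
    points = []
    for effect, se in zip(effects, std_errors):
        lower, upper = pooled - z * se, pooled + z * se
        points.append({
            "effect": effect,
            "se": se,
            "lower": lower,
            "upper": upper,
            "inside": lower <= effect <= upper,
            "side": "left" if effect < pooled else "right",
        })
    return points


def side_counts(points):
    """Count comparisons on each side of the pooled estimate.

    A crude symmetry check: a strongly lopsided split hints at asymmetry.
    """
    left = sum(1 for p in points if p["side"] == "left")
    return {"left": left, "right": len(points) - left}


# Made-up log ratios and standard errors, purely for illustration.
example = funnel_points([0.2, 0.5, 0.8, 1.5], [0.05, 0.1, 0.2, 0.3], pooled=0.5)
print(side_counts(example))
```

A lopsided count is only a prompt to inspect the plot itself; it is not a formal test of publication bias.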

Statistical analysis
For all comparisons, WTP estimates for hypothetical and actual data were matched as pairs, when provided, and compared as a ratio (for mean values) and as odds ratios (for percentage summaries). All quantitative analyses were conducted using Stata 14 (StataCorp, 2015). Three types of analysis were conducted.
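The two summary measures can be illustrated as follows. This is a sketch under stated assumptions: the function names are hypothetical, the percentage summaries are treated as the share of respondents agreeing to pay, and the input values are made up rather than taken from any reviewed study.

```python
def ratio_of_means(hypothetical_mean, actual_mean):
    """Ratio of hypothetical to actual mean WTP; values above 1 indicate overstatement."""
    if actual_mean <= 0:
        raise ValueError("actual mean must be positive")
    return hypothetical_mean / actual_mean


def odds_ratio(hypothetical_share, actual_share):
    """Odds ratio of paying in the hypothetical survey relative to the actual survey.

    Both shares are proportions strictly between 0 and 1 (assumed here to be
    the share of respondents who agree to pay).
    """
    for share in (hypothetical_share, actual_share):
        if not 0.0 < share < 1.0:
            raise ValueError("shares must lie strictly between 0 and 1")
    hypothetical_odds = hypothetical_share / (1.0 - hypothetical_share)
    actual_odds = actual_share / (1.0 - actual_share)
    return hypothetical_odds / actual_odds


# Illustrative inputs only.
print(ratio_of_means(20.0, 10.0))
print(odds_ratio(0.60, 0.30))
```

Under this reading, a ratio or odds ratio of 1 corresponds to no difference between hypothetical and actual responses.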

Narrative summary
Using the entire dataset, a narrative and quantitative summary of the methods used in the comparisons and findings is provided. The comparisons of hypothetical and actual values in terms of background characteristics, survey design, study methods and results were summarized using counts, descriptive statistics, 2 by 2 tables, and box and whisker plots.

Meta-analysis
A reduced dataset was used in the meta-analysis. For the mean summaries, only comparisons which reported standard errors of the mean, or those which provided sufficient statistics to enable the calculation of the standard error were included. Only comparisons which had a non-zero hypothetical and actual WTP value (and hence a nonzero odds ratio) were included in the meta-analysis for percentage summaries.
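The need for standard errors can be made concrete with a short sketch. The text does not state how the standard error of the log ratio was obtained, so the delta-method approximation below, which assumes independent hypothetical and actual samples, is an assumption for illustration; the function name is hypothetical.

```python
import math


def log_ratio_with_se(mean_h, se_h, mean_a, se_a):
    """Log ratio of hypothetical (h) to actual (a) means and its approximate SE.

    Delta-method approximation, assuming the two samples are independent
    and both means are positive.
    """
    if mean_h <= 0 or mean_a <= 0:
        raise ValueError("both means must be positive")
    log_ratio = math.log(mean_h / mean_a)
    se = math.sqrt((se_h / mean_h) ** 2 + (se_a / mean_a) ** 2)
    return log_ratio, se


# Illustrative inputs only.
print(log_ratio_with_se(20.0, 2.0, 10.0, 1.0))
```

For within-sample comparisons the independence assumption would not hold, and a covariance term would be needed.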
Given the variation in the methods used in the reviewed studies, a random-effects meta-analysis was conducted to calculate the weighted average of the log ratios and odds ratios (for mean and percentage summaries respectively). The weights were based on the inverse of the variance of the effect estimates. Forest plots are presented separately for these, and the I² statistic was used to determine the level of heterogeneity (Higgins et al., 2003). Sensitivity analyses and subgroup analyses were also conducted to explore the sources of the heterogeneity. In the sensitivity analysis, meta-analyses were re-run excluding comparisons with the smallest sample sizes. Subgroup analyses explored heterogeneity by sector, sample selection types, study administration modes and survey elicitation formats using the metan command in Stata 14.
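A minimal sketch of inverse-variance random-effects pooling with the I² statistic is given below. It uses the DerSimonian-Laird estimator of between-comparison variance, which is an assumption here since the text does not name the estimator; the function name and inputs are hypothetical.

```python
import math


def random_effects_pool(effects, std_errors):
    """DerSimonian-Laird random-effects pooling of effect sizes (e.g. log ratios)."""
    k = len(effects)
    if k < 2 or k != len(std_errors):
        raise ValueError("need at least two effects with matching standard errors")
    if any(se <= 0 for se in std_errors):
        raise ValueError("standard errors must be positive")

    variances = [se ** 2 for se in std_errors]
    weights = [1.0 / v for v in variances]          # fixed-effect inverse-variance weights
    sum_w = sum(weights)
    fixed = sum(w * y for w, y in zip(weights, effects)) / sum_w

    # Cochran's Q: weighted squared deviations from the fixed-effect estimate.
    q = sum(w * (y - fixed) ** 2 for w, y in zip(weights, effects))
    df = k - 1
    c = sum_w - sum(w ** 2 for w in weights) / sum_w
    tau2 = max(0.0, (q - df) / c)                   # between-comparison variance
    i2 = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0

    # Random-effects weights add tau2 to each within-comparison variance.
    re_weights = [1.0 / (v + tau2) for v in variances]
    pooled = sum(w * y for w, y in zip(re_weights, effects)) / sum(re_weights)
    se_pooled = math.sqrt(1.0 / sum(re_weights))
    return {"pooled": pooled, "se": se_pooled, "tau2": tau2, "q": q, "i2": i2}


# Illustrative log ratios only; exponentiate to return to the ratio scale.
result = random_effects_pool([0.0, 1.0], [0.1, 0.1])
print(math.exp(result["pooled"]), result["i2"])
```

When effects are pooled on the log scale, the pooled value and its confidence limits are exponentiated to report a ratio.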

Meta-regressions
Meta-regressions were conducted to explain the heterogeneity in the presented summaries and determine the drivers of hypothetical bias. These regressions were all clustered by study to control for the multiple comparisons from some of the studies. The dependent variables in these regressions were: (1) the ratio of hypothetical to actual values derived from comparisons presenting mean summaries, and (2) the log of the odds ratio of hypothetical to actual values, for comparisons presenting summaries as percentages. Previous meta-analyses investigated the effect of different study attributes on hypothetical bias (Carson et al., 1996; Harrison and Rutström, 2008; Liljas and Blumenschein, 2000; List and Gallet, 2001; Little and Berrens, 2003; Murphy et al., 2004). The results of these have been either mixed or inconclusive. In the absence of a theory explaining the divergence between hypothetical and actual WTP payments (hypothetical bias), the following variables were introduced into the models in an exploratory manner: (1) sector within which a valuation good or service falls; (2) class of good; (3) purpose of good; (4) study administration mode; (5) sample selection in both surveys; (6) type of sample (student or otherwise and users versus non-users of a service or good); (7) WTP elicitation format used in both surveys; (8) type of comparison (either between samples or within same sample); (9) study setting (laboratory or field); (10) duration between the hypothetical and actual surveys and (11) money effects (whether respondents were paid to participate in either survey or given money to purchase the good valued).

Univariate regressions.
The range of univariate regressions explored the relationship between the dependent and independent variables listed in the previous section separately for comparisons presenting mean and percentage summaries. Significant variables are presented in the results section and discussed thereafter.

Multiple regression.
Where the ratio is the dependent variable (comparisons presenting mean summaries), the GLM estimator was used. The GLM permits the use of the estimates in their natural form, with a straightforward interpretation. Where the odds ratio was presented, the natural log was used and a logit model estimated. Base and reduced models were determined separately for comparisons summarized as means and percentages. In the base models, all the independent variables listed in the meta-regressions section were included. To arrive at a reduced model, variables with the highest non-significant p-values were removed and the model re-estimated. To examine model fit, model diagnostics were run with every estimation. The linktest (Cameron and Trivedi, 2009) was used to examine specification errors in the models. Further, the Hosmer-Lemeshow test was used to check the goodness of fit of the models (Hosmer and Lemeshow, 2013). The final reduced models included the range of variables which were significant and for which the models were best specified. Finally, for each of the models, a predicted ratio or log odds ratio was determined for the mean and percent summaries respectively.

Background characteristics
Of the 480 papers identified, 50 were included (see Fig. 1) from 14 countries. Comparisons were typically carried out in the USA (n = 79 comparisons), followed by Norway (n = 35 comparisons), Nigeria (n = 16 comparisons) and Sweden (n = 9 comparisons). More than half the papers (n = 33) generated multiple comparisons (range: 2–30) of hypothetical and actual values. The results therefore, with the exception of country and year of publication, focus on 159 comparisons of hypothetical and actual WTP (WTA) values. Background characteristics of all the comparisons included in the review are provided in Appendix 3.
The majority of comparisons (n = 94) did not explicitly use any specific terms for validity assessment, preferring to present papers as testing comparisons between hypothetical and actual WTP values. Approximately one fifth (n = 32) referred to this as testing for hypothetical bias (Blumenschein et al., 2014; Botelho and Pinto, 2002; Bryan and Jowett, 2010; Camacho-Cuena et al., 2004; Getzner, 2000; Johannesson, 1997; Mozumder and Berrens, 2007; Murphy Stevens et al., 2002; Onwujekwe et al., 2005). Two comparisons from the same study used the term predictive validity (Onwujekwe, 2001), while one used external validity (Muller and Ruffieux, 2011). A further one-fifth (n = 30) of the comparisons referred to assessments of criterion validity (Bhatia and Fox-Rushby, 2003; Bratt, 2010; Carlson, 2000; Johnston, 2006; Loomis et al., 1996; Onwujekwe et al., 2001; Onwujekwe and Uzochukwu, 2004; Onwujekwe, 2004; Ramke et al., 2009; Vossler et al., 2003a,b; Vossler and Kerkvliet, 2003; Willis and Powe, 1998). Table 1 shows that most comparisons (38%) were in the environmental sector and 23% were in the health sector, with the remainder spread across 'other' sectors. Of the 36 health sector comparisons, 30 elicited values for prevention products such as treated mosquito nets (Bhatia and Fox-Rushby, 2003) and six elicited values for management or treatment of a disease condition (e.g. an asthma management program (Blumenschein et al., 2001) and spectacles (Ramke et al., 2009)). In the environmental sector, 55 comparisons provided values for conservation, 2 elicited values for prevention purposes, while 3 elicited values for use of or access to public goods or services, e.g. provision of public water to a remote village in Rhode Island (Johnston, 2006). Most comparisons (n = 54) in 'other' sectors elicited values for personal and household goods (e.g. art prints (Loomis et al., 1997; Loomis et al., 1996), sunglasses (Blumenschein et al., 2014)); one study elicited values for a personal good (chocolate bar) and a public good (prevention of additional damages to an aquatic system from acid rain) (Kealy et al., 1990).

Comparison of hypothetical and actual survey attributes
All comparisons adopted cross-sectional designs. Nearly all elicited WTP estimates (n = 154) while WTA values were derived in 5. One, in the environment sector (Heberlein and Bishop, 1986) sought WTA values in exchange for goose permits which hunters had earlier purchased in the hypothetical survey. In the actual survey, cash offers were made to the hunters to give up their permits. Four WTA comparisons were conducted in other sectors and these included eliciting expected compensation values from respondents in exchange for the holiday gifts followed by offers of actual payments for their holiday gifts (List and Shogren, 2002) and WTA in exchange for goose and deer permits (Heberlein and Bishop, 1986).
All comparisons used the same payment vehicle in actual and hypothetical surveys. Out of pocket payments were used in 154 comparisons across all sectors (exclusively so for the health and other sectors) and these included user fees and voluntary donations. Tax payments, primarily property taxes were used in 3 comparisons eliciting WTP values for public goods in the environmental sector (Vossler et al., 2003a,b;Vossler and Watson, 2013;Vossler and Kerkvliet, 2003). In the same sector, two comparisons were asked for voluntary donations towards a public good (Macmillan et al., 1999;Veisten and Navrud, 2006).
The majority of comparisons also used the same elicitation format (n = 111), administration mode (n = 143), sample selection technique (n = 135) and sample type (n = 158) in both the hypothetical and actual surveys. These are presented in Table 2 where, for every attribute, the diagonal in bold represents the similarities between hypothetical and actual surveys.
Different WTP elicitation formats were used across the hypothetical and actual surveys in nearly one-quarter of the comparisons (n = 39), where for example, the bidding game was used in the hypothetical survey but a dichotomous choice was used in the actual survey (Bhatia and Fox-Rushby, 2003;Vernazza et al., 2015). In one particularly unusual case, an open ended question is asked in the hypothetical survey, but an auction is used in the surveys of actual values (Fox et al., 1998). It is typically the environment (n = 54), and other (n = 43) sectors that have used the same elicitation formats for both the hypothetical and actual surveys. WTP for health goods/services has most commonly used different elicitation formats for the hypothetical and actual surveys (90%).
The same mode of administration, in-person interviews, was predominantly used in the "other" sectors, but different modes of administration were used in the health and environment sectors. For example, in the health sector, one study used mail surveys in the hypothetical survey but in-person interviews in the actual survey (Loomis et al., 2009). In the environment sector, four comparisons used mail surveys for hypothetical values and in-person interviews to elicit actual values (Vossler and Watson, 2013; Vossler and Kerkvliet, 2003; Johnston, 2006), with three comparisons (2 studies) using the opposite (Brown and Taylor, 2000; Seip and Strand, 1992).
Considering hypothetical and actual surveys separately, the general response rate was not indicated for nearly two-fifths of the comparisons in the hypothetical surveys (n = 63). However, in the actual survey, the general response rates were indicated in more than half the comparisons (n = 130). A comparison of the general response rates by study modes of administration shows that telephone interviews had the highest mean response rates, followed by mail surveys. Response rates from in-person interviews were scattered across the scale, suggesting missing response values or outliers (Fig. 2).
The response rates to the valuation question were reported in only one-third of the comparisons for the hypothetical survey (n = 53) and for only fifteen comparisons in the actual survey. For thirteen comparisons (3 in health; 6 in environment; 4 in other sectors), the response rate for the actual and hypothetical questions was the same. Overall, the presence and treatment of the different non-responses, where present, is not discussed. It is therefore not clear whether summary statistics provided exclude these missing values or not.
Fig. 3 compares the sample sizes used in the hypothetical and actual surveys, with five comparisons which were outliers dropped from the summary. Sample sizes ranged from 9 to 2890 in the hypothetical surveys and from 9 to 15,781 in the actual surveys. The sample sizes for the two surveys were similar in 88 comparisons. In most cases where the sample size differed, the hypothetical survey had a larger sample than the subsequent survey of actual values (n = 44). However, for three comparisons from one study valuing a public good (comprehensive restoration plan for a riverfront commemorative park), the hypothetical survey sample size (122) was less than 1% of the actual survey sample size (15,781).
Table footnote a: Other elicitation formats include all other elicitation formats with a count of less than 5, e.g. structured haggling, payment cards and mixed methods such as binary or bidding game with follow up.
For more than two-fifths of the comparisons (n = 67) authors stated that different respondents were approached to complete the hypothetical and actual surveys, particularly so in the other (n = 35) and health sectors (n = 13). In most environment sector comparisons (n = 41/60) the same respondents were approached. Unfortunately, where the respondents and the sample size differ, tests relating to the representativeness of the sample of the actual survey in relation to the hypothetical survey were not always reported.
Hypothetical and actual surveys were undertaken at the same time or within a period of 2 weeks in the majority of comparisons (n = 126), with 31 administering the two surveys more than 2 weeks apart. The duration between the two surveys was not clear in 2 comparisons. The hypothetical and actual surveys conducted more than one month apart (n = 3) were in the environment sector (Vossler et al., 2003a,b;Johnston, 2006;Vossler and Watson, 2013).

Justification for the values used in the surveys
When closed-ended elicitation formats are used to elicit WTP or WTA values, pre-specified value cues are presented to respondents. For instance, a payment card presents a range of money values from which respondents are asked to select the value that best reflects their maximum WTP, while bidding methods present single or multiple bids for valuation. As the values presented are significant cues, they should not bias the true population mean WTP and therefore require justification to allow judgement of likely bias. However, in 56 comparisons across both the hypothetical and actual surveys for the same good, justifications were not provided for the value cues used. In 7 comparisons from five studies, all in the environment sector, the values presented to the respondents in both the hypothetical and actual surveys were based on prior costings of the planned projects (Byrnes et al., 1999; Champ et al., 1997; Spencer et al., 1998; Champ and Bishop, 2001; Blumenschein Blomquist et al., 2008). In another study (Loomis et al., 1997), values obtained from a pre-test of the survey were presented to respondents in both surveys.
In four comparisons, values from hypothetical surveys were used to inform the actual survey (Bhatia and Fox-Rushby, 2003; Loomis et al., 1997; Willis and Powe, 1998; Onwujekwe, 2004). In two comparisons, one each in the health and environment sector, the stated hypothetical values were presented in the actual survey (Onwujekwe, 2004; Willis and Powe, 1998). One study in the other sector (Loomis et al., 1997) and one in the health sector (Bhatia and Fox-Rushby, 2003) used the mean value from the hypothetical survey as the value cue for the actual surveys. Market prices for the commodities were used in the actual surveys in one paper in the health sector. For 51 comparisons that used open ended questions, a justification was not relevant. Most comparisons in the environment sector presented the stated hypothetical values in the actual survey (n = 34). In 11 comparisons, the value presented in the actual survey was based on a costing of the proposed project. In the 'other sectors', the market price for the good was presented in two comparisons while nearly one-third of the comparisons (n = 14) did not provide a justification for the values used in the hypothetical surveys. In two comparisons, both from the same study (Loomis et al., 1997), the value presented in both hypothetical and actual surveys centred on a pre-test mean. Auctions and open ended elicitation formats were used in the actual surveys in 13 comparisons.
L. Kanya, et al. Social Science & Medicine 232 (2019) 238-261

WTP/WTA estimates and criterion validity assessment
The estimation methods for mean WTP/WTA summaries are varied and these would be expected to relate to question format. Some studies, primarily those employing open ended questions, derived the summary estimates by a computation of averages (e.g., Balistreri et al., 2001;Fox et al., 1998;Getzner, 2000;List, 2001). The spread of this data is not given in around half (n = 50) of the comparisons presenting mean summaries. Summary WTP/WTA estimates were also modelled using a range of statistical techniques. Roughly 70% (n = 111) of the comparisons specified the statistical tests used with the majority (n = 88) employing parametric methods (non-parametric methods n = 18, both parametric and non-parametric methods n = 20). While similar elicitation formats were used in the hypothetical and actual surveys for nine comparisons, different summaries were presented. These included mean summaries in the hypothetical survey with a percentage in the actual survey and vice versa. 84 comparisons presented summary means for both surveys and 60 provided summary percentages. Different summary estimates were provided for 15 comparison pairs. Study authors concluded that criterion validity was demonstrated where hypothetical and actual WTP estimates were relatively similar. However, the criteria for judging the ratios for a conclusion on criterion validity were often not provided. As a result, different conclusions were given even for similar ratios and odds ratios.
Criterion validity was not confirmed by study authors for more than three-quarters (n = 124) of comparisons. Of the 33 comparisons where study authors confirmed criterion validity of the WTP/WTA estimates, 17 were from the other sectors, 10 from the health sector and 6 from the environment sector. Vernazza et al. (2015) reported mixed results for two comparisons in the health sector. Criterion validity confirmations were similar across the WTP summary methods. Table 3 summarises authors' conclusions on criterion validity by sector and WTP/WTA summary measure.
Based on the summaries presented by the study authors, ratios (for mean comparisons) and odds ratios (for percentage summaries) were calculated. Of the comparisons that reported mean values in both the hypothetical and actual surveys (n = 84), the ratio of hypothetical to actual mean values averaged 3.2 (range 0.7-11.8). The highest ratios were for the environment sector (5.99), pure public goods (4.92), and conservation goods (5.96). For example, in one study which elicited WTP for the protection of sensitive rainforest land, the hypothetical mean WTP was $27.97 for female and $72.22 for male respondents, whereas the mean actual WTP was $3.23 among females and $6.14 among males (Brown and Taylor, 2000). Ratios were also highest when the hypothetical and actual surveys were administered concurrently (3.24), when a donation mechanism was used as the payment vehicle (4.53) and when a one-off payment was elicited (3.22).
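As a worked illustration of the ratio calculation, the short sketch below recomputes the hypothetical-to-actual ratios from the Brown and Taylor (2000) means quoted above. The function name `ratio_of_means` is ours and is not taken from the review.

```python
def ratio_of_means(hypothetical_mean, actual_mean):
    """Ratio of hypothetical to actual mean WTP; values above 1 indicate overstatement."""
    if actual_mean <= 0:
        raise ValueError("actual mean must be positive for the ratio to be defined")
    return hypothetical_mean / actual_mean

# Mean WTP (US$) quoted in the text for Brown and Taylor (2000): (hypothetical, actual)
brown_taylor = {"female": (27.97, 3.23), "male": (72.22, 6.14)}

ratios = {group: ratio_of_means(h, a) for group, (h, a) in brown_taylor.items()}
for group, r in ratios.items():
    print(f"{group}: {r:.2f}")
```

The two ratios come out at roughly 8.7 (female) and 11.8 (male), the latter sitting at the top of the reported range.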
For the comparisons which presented percentage summaries in both hypothetical and actual surveys, odds ratios were calculated only for the comparisons which had non-zero values in both surveys (n = 56). The average odds ratio was 5.7 (range 0-13.6). The highest odds ratios were observed in comparisons in the environment sector (0.88), quasi-private goods (−1.48), goods used for "other" purposes (1.96), within-sample comparisons (1.08), studies conducted within a laboratory setting (1.72), study periods of between 1 and 7 days between the hypothetical and actual surveys, when a property tax was used as the payment vehicle (−3.72) and when monthly payments were elicited (−5.58). The ratios and odds ratios for the included comparisons by different design attributes are presented in Table 4 (overall characteristics) and Table 5 (hypothetical and actual surveys).
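The odds ratio for a percentage summary can be sketched as follows. We assume here that the odds ratio is defined as the odds of a "yes" in the hypothetical survey divided by the odds of a "yes" in the actual survey; the percentages in the example are invented for illustration and do not come from any reviewed study.

```python
def odds(p):
    """Convert a proportion (0 < p < 1) into odds."""
    if not 0 < p < 1:
        raise ValueError("odds are only defined and finite for 0 < p < 1")
    return p / (1 - p)

def odds_ratio(p_hypothetical, p_actual):
    """Odds of 'yes' in the hypothetical survey relative to the actual survey."""
    return odds(p_hypothetical) / odds(p_actual)

# Invented example: 60% say "yes" hypothetically, 40% actually pay
example_or = odds_ratio(0.60, 0.40)
print(round(example_or, 2))
```

A proportion of exactly zero (or one) in either survey leaves the odds ratio, and its logarithm, undefined or degenerate, which is one reason such comparisons cannot be carried into a log odds ratio analysis without further adjustment.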
In the comparisons of hypothetical and actual surveys (Table 5), the highest ratios were observed when a purposive sample was used in the hypothetical survey (3.60); a random sample in the actual survey (4.22); with a mixed (student and non-student) sample in both the hypothetical and actual surveys (10.21 in both), followed by a non-student sample in both surveys; with the use of telephone surveys in the hypothetical survey (5.60) and open-ended surveys in the hypothetical (5.10) and actual (5.49) surveys. The sample type (whether users or non-users of the valuation good/service) generated similar ratios in both the hypothetical and actual surveys.
The highest odds ratios were observed for percentage summaries where; convenience samples were used in both the hypothetical and

Results of meta-analyses
Meta-analysis was conducted separately for comparisons presenting mean and percentage summaries. Standard errors were provided or calculated where possible for a total of 54 of the comparisons presenting mean summaries, and only these were included in the meta-analysis. For comparisons presenting percentage summaries, four reported a zero value in the actual survey results, generating an odds ratio of zero. These were excluded from the analysis, leaving a total of 56 comparisons. Fifteen comparisons which presented different summaries in the hypothetical and actual surveys were also excluded from the meta-analysis.

Comparisons presenting mean summaries
The ratio of the actual and hypothetical mean values was used in the random effects meta-analysis. The pooled ratio of hypothetical to actual WTP values for the 54 comparisons was 1.79 with a range of 1.56-2.04 (see Fig. 4). This implies that for these comparisons hypothetical WTP was higher than actual WTP by 79%. Some variation in the effect sizes was expected, given the differences in the characteristics of the comparisons pooled in this analysis. However, a very high level of heterogeneity was detected in pooling the 54 comparisons. This is indicated by the I² of 97.1%, which was significant (p < 0.001).
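For readers unfamiliar with random effects pooling and the I² statistic, a minimal sketch is given below. It assumes the DerSimonian-Laird estimator and pooling on the log-ratio scale; the review does not state which estimator or scale was used, so both are our assumptions, and the four input ratios and standard errors are invented.

```python
import math

def random_effects_pool(effects, std_errors):
    """DerSimonian-Laird random effects pooling (assumed estimator, see lead-in).

    Returns (pooled effect, its standard error, tau^2, I^2 in percent).
    """
    k = len(effects)
    if k < 2:
        raise ValueError("at least two comparisons are needed")
    w = [1.0 / se ** 2 for se in std_errors]                 # inverse-variance weights
    fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sum(w)
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))  # Cochran's Q
    df = k - 1
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - df) / c)                            # between-study variance
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0      # share of variation beyond chance
    w_re = [1.0 / (se ** 2 + tau2) for se in std_errors]     # random effects weights
    pooled = sum(wi * yi for wi, yi in zip(w_re, effects)) / sum(w_re)
    return pooled, math.sqrt(1.0 / sum(w_re)), tau2, i2

# Invented hypothetical-to-actual ratios and standard errors of their logs
log_ratios = [math.log(r) for r in (1.2, 1.8, 3.5, 2.0)]
std_errors = [0.10, 0.15, 0.25, 0.12]

pooled, se_pooled, tau2, i2 = random_effects_pool(log_ratios, std_errors)
print(f"pooled ratio = {math.exp(pooled):.2f}, I^2 = {i2:.1f}%")
```

An I² close to 100%, as reported above, means that almost all of the observed variation in effect sizes reflects differences between comparisons rather than sampling error.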
The pooled ratio (see Fig. 4) was highest in the environment sector (1.85) compared to 1.25 in the other and 1.49 in the health sectors (Appendix 4). In addition, studies in the health sector had the lowest heterogeneity level (56.5%, p = 0.0056). However, the number of comparisons in the health sector was small (4) and from the same study. This compares with heterogeneity levels in the environment (92.7%, p < 0.001) and other (97.7%, p < 0.001) sectors.
In the subgroup analysis by survey setting, while the overall level of heterogeneity remained high and significant regardless of study setting (97.1%, p < 0.001 overall), it was much lower in field studies (68.4%, p < 0.001) than in laboratory studies (97.7%, p < 0.001) (Appendix 5). In a sensitivity analysis, the effect on the pooled ratio of dropping comparisons which had the widest confidence intervals was explored. The pooled ratio was slightly smaller at 1.78 but the level of heterogeneity increased by 0.3 percentage points, remaining significant (97.4%, p < 0.001) (Appendix 6).

Comparisons presenting percent summaries
The log odds ratio of the actual and hypothetical percentages was used in the random effects meta-analysis. A forest plot of these comparisons is presented in Fig. 5. The forest plot shows that respondents were more likely to say "yes" in the hypothetical survey than they were in the actual survey. The pooled odds ratio from the studies presenting percent summaries was 2.37 (range 1.93-2.80) i.e. the odds of saying "yes" in the hypothetical survey were more than double the odds of saying "yes" in the actual survey. As the level of heterogeneity was high and significant (90.2%, p < 0.001), the variation could not be attributed to chance alone.
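To make the log odds ratio scale concrete, the sketch below computes a log odds ratio and its standard error from "yes" counts and back-transforms a 95% interval to the odds ratio scale. The standard error formula is the usual large-sample (Woolf) approximation, which we assume rather than take from the review, and the counts are invented.

```python
import math

def log_odds_ratio(yes_hyp, n_hyp, yes_act, n_act):
    """Log odds ratio (hypothetical vs actual 'yes') with its Woolf standard error."""
    a, b = yes_hyp, n_hyp - yes_hyp      # hypothetical: yes, no
    c, d = yes_act, n_act - yes_act      # actual: yes, no
    if min(a, b, c, d) <= 0:
        raise ValueError("all four cells must be positive; zero cells need separate handling")
    log_or = math.log((a * d) / (b * c))
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    return log_or, se

# Invented counts: 60 of 100 say "yes" hypothetically, 40 of 100 actually pay
log_or, se = log_odds_ratio(60, 100, 40, 100)
low, high = (math.exp(log_or + z * se) for z in (-1.96, 1.96))
print(f"OR = {math.exp(log_or):.2f}, 95% CI {low:.2f} to {high:.2f}")
```

Working on the log scale keeps the interval symmetric around the estimate before it is exponentiated back to an odds ratio.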
Sub-group analysis showed heterogeneity was high and significant for studies from the environment sector (93.25%, p < 0.001). Heterogeneity in the other and health sectors was considerably lower and not significant (35.2%, p = 0.117 and 21.9%, p = 0.211 respectively), suggesting that the variation in these sectors could be attributed to chance alone (Appendix 7). The differences in the levels of heterogeneity were not significant for the other study attributes. The differences by survey setting (Appendix 8) could be attributed to the few laboratory studies.
In the sensitivity analysis, three comparisons which had the widest confidence intervals were dropped from the analysis. In the meta-analysis of this reduced sample, the pooled odds ratio from the comparisons was slightly lower (2.36) and significant (p < 0.001) (Appendix 9).
The pooled estimates from both the mean and percentage summaries demonstrated that hypothetical WTP estimates overestimate actual values. For all the analyses presented, high levels of heterogeneity were noted. The explorations of the heterogeneity did not isolate any study characteristic as contributing to this. While this suggests that the variation in the estimates across the comparisons was not due to chance, this might simply be due to the differences in the pooled studies. Using meta-regressions, the drivers of this variation in the ratios and odds ratios were further explored. This analysis is presented in the next section.
L. Kanya, et al. Social Science & Medicine 232 (2019) 238-261

Meta-regression results
All the comparisons presenting mean summaries are included in the regression analysis (n = 84) while only 56 comparisons presenting percentage summaries are included. Univariate and multiple regression results are presented separately in the next section. For the presented multiple regression models, the linktest estimate was not significant, suggesting no evidence of model misspecification. In interpreting all the regression results, variables with positive coefficients are associated with higher ratios (odds ratios) and therefore higher hypothetical bias. Similarly, negative coefficients are associated with lower ratios (odds ratios) and therefore lower hypothetical bias.

Univariate regression: Mean and Percent summaries
The sector within which the valuation good falls and the type and purpose of the good were all significant, with the direction of influence similar for both the mean and percent summaries. The ratio and log odds ratio were reduced for goods in the health sector, prevention goods and goods classified as pure private goods. Both the ratio and log odds ratio for pure public goods were significantly higher.
In comparing similarities in design attributes for the hypothetical and actual surveys, both the ratio and log odds ratio were lower when the same WTP elicitation method was used in both surveys and with the use of the bidding technique. Random sampling techniques also contributed to lower ratios and log odds ratios. However, the use of open-ended WTP elicitation and one-off payments elicited significantly higher ratios and log odds ratios.
The direction of effect was reversed for mean and percent comparisons for some of the attributes. For instance, while a good in the other sectors elicited lower ratios, the log odds ratio was higher and both were significant. This could be explained by the number of comparisons in the mean and percent summaries. Similar effects and direction were seen for comparisons using the same sample type, administration mode and in-person interviews in both surveys. The univariate regression outputs for both mean and percent summaries are
Robust standard errors in parentheses ***p < 0.01, **p < 0.05, *p < 0.1.

Meta-regression
The meta-regression results are presented separately for comparisons presenting mean and percent summaries (Tables 7 and 8). For both the base and reduced models presented, the regressions weighted by the study fit the data, with R² values of 0.68 and 0.65 respectively. Interestingly, the direction of effect is maintained in the base and reduced models for the mean summaries regression (with the exception of one variable), whereas there were five changes in sign for the percent summaries regression. The discussion focusses on the reduced model results as they fit the data better.
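The idea of a weighted meta-regression can be sketched with a single moderator. The example below fits a weighted least squares line of the ratio on one binary design attribute and reports a weighted R². It is a deliberately simplified illustration: the review's models include many covariates, and the weights, data and single-covariate form used here are our assumptions.

```python
def weighted_least_squares(x, y, w):
    """Fit y = b0 + b1 * x by weighted least squares; returns (b0, b1, r_squared)."""
    sw = sum(w)
    x_bar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    y_bar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - x_bar) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - x_bar) * (yi - y_bar) for wi, xi, yi in zip(w, x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    ss_res = sum(wi * (yi - (b0 + b1 * xi)) ** 2 for wi, xi, yi in zip(w, x, y))
    ss_tot = sum(wi * (yi - y_bar) ** 2 for wi, yi in zip(w, y))
    return b0, b1, 1 - ss_res / ss_tot

# Invented data: x = 1 if the same elicitation format was used in both surveys
same_format = [0, 0, 1, 1]
ratio = [4.0, 6.0, 2.0, 2.0]
weights = [1.0, 1.0, 1.0, 1.0]   # equal weights, purely for illustration

b0, b1, r2 = weighted_least_squares(same_format, ratio, weights)
print(f"intercept = {b0:.2f}, coefficient = {b1:.2f}, R^2 = {r2:.2f}")
```

With a binary moderator the coefficient is simply the weighted difference in mean ratios between the two groups, so a negative coefficient corresponds to lower hypothetical bias when the attribute is present.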
3.6.2.1. Mean summaries. The ratio of hypothetical to actual WTP values significantly increased for: goods in the environment sector compared to the health sector (4.3), pure public goods (1.8) and quasi-private goods (2.3) when compared to pure private goods, and valuation with good or service users (2.1) compared to non-users. Focusing on the WTP elicitation format, while the use of the same format in both the hypothetical and actual surveys reduced the ratio by 3.8 (p < 0.1), the use of open-ended methods increased the ratio by 6.4 (p < 0.01), and auctions and bidding methods increased the ratio by 5.6 (p < 0.05) and 4.9 (p < 0.05) respectively.
Conversely, the use of random sampling techniques significantly reduced the ratio of hypothetical to actual WTP values by a factor of 3.4 (p < 0.001). The use of the same administration mode in both surveys, and in particular, the use of mail surveys, significantly reduced the ratio by 2.6 (p < 0.001) and 1.7 (p < 0.05) respectively. As would be expected, the ratio of hypothetical to actual values was significantly lower where authors had concluded that criterion validity had been established based on their study findings (2.1, p < 0.001). Finally, based on the model, the predicted ratio for comparisons presenting mean summaries was 3.1 (s.d. 2.64). The model estimation results are presented in Table 7.
3.6.2.2. Percent summaries. For comparisons presenting percent summaries, the log odds ratios were back transformed into odds ratios for ease of interpretation (see Table 8). The odds of a higher WTP value in the hypothetical survey (and hence higher odds ratio) were statistically significantly higher for valuation goods classified as pure public, where the same sample was approached in both the hypothetical and actual surveys and when participants were given money to participate in either the hypothetical or actual survey (money effects). However, the odds of a higher hypothetical WTP value (lower odds ratio) were 62% lower when a valuation good was from the environment sector (compared to the health sector); more than 99% lower when the same administration mode and elicitation format were used in the hypothetical and actual surveys of WTP, and when cash was asked, compared to donations. All these results were statistically significant at p < 0.001. The predicted odds ratio for these comparisons was 2.24.
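The back-transformation from a log odds ratio coefficient to a percentage change in odds is simple arithmetic, sketched below. The odds ratio of 0.38 is chosen here only because it matches the "62% lower" wording; it is not a value read from Table 8.

```python
import math

def percent_change_in_odds(log_odds_coef):
    """Back-transform a log odds ratio coefficient to a percent change in odds."""
    return (math.exp(log_odds_coef) - 1) * 100

# Illustrative coefficient: the log of an odds ratio of 0.38 (assumed value)
coef = math.log(0.38)
change = percent_change_in_odds(coef)
print(f"{change:.0f}%")
```

A coefficient of zero back-transforms to an odds ratio of 1, that is, no difference between hypothetical and actual responses.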

Summary of regression results: comparisons between mean and percent summaries
In comparing the characteristics of both surveys, using the same administration mode and WTP elicitation formats in the hypothetical and actual surveys led to lower ratios and odds ratios. The ratios and odds ratios for valuation goods classified as pure public and those in the other sectors were significantly higher. Differences were observed for all the other variables in the reduced models.

Risk of bias analysis
Meta-bias was investigated separately for the studies reporting mean and percentage summaries. As illustrated in Fig. 6 (for studies reporting mean summaries) and Fig. 7 (for studies reporting percentage summaries), the funnel plots would signify the presence of publication bias, provided there was no difference in methods between the large and smaller studies. However, these funnel plots do not account for the differences in the methods used in the different studies. If the large studies differ from the smaller ones in methods or other key characteristics, this relationship is not expected to hold. There are substantive differences in the methods used in the different studies, and these differences are shown in this paper to affect the gap between hypothetical and actual WTP. Therefore, these analyses of publication bias are very likely themselves biased.
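The logic of a funnel plot can be sketched numerically: each study's effect is compared with pseudo 95% limits around the pooled estimate, which widen as the standard error grows. The helper names, effects and standard errors below are invented for illustration and are not the data behind Figs. 6 and 7.

```python
def funnel_limits(pooled, std_error, z=1.96):
    """Pseudo 95% limits around the pooled estimate at a given standard error."""
    return pooled - z * std_error, pooled + z * std_error

def outside_funnel(effects, std_errors, pooled):
    """Indices of studies whose effect falls outside the pseudo 95% funnel."""
    flagged = []
    for i, (y, se) in enumerate(zip(effects, std_errors)):
        low, high = funnel_limits(pooled, se)
        if not low <= y <= high:
            flagged.append(i)
    return flagged

# Invented log effects and standard errors; the third study is small and extreme
effects = [0.60, 0.50, 1.40, 0.20]
std_errors = [0.05, 0.10, 0.30, 0.40]

flagged = outside_funnel(effects, std_errors, pooled=0.58)
print(flagged)
```

A cluster of small studies lying on one side of the funnel is what is usually read as publication bias, but, as argued above, the same pattern can arise when small and large studies simply use different methods.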

Discussion
This review shows that considerable research has focussed on the criterion validity of CV methods since the late 1990s, with most papers from the health sector appearing after 2000. It is the first review and meta-analysis on criterion validity for over a decade and presents the first meta-regression that explores potential reasons for differences between hypothetical and actual WTP across studies. With the increasing use of simulated market experiments, it is not surprising that the majority of the work has focussed on private goods, and this is particularly the case beyond the health and environment sectors. However, an important body of evidence now also exists for quasi-public and public goods/services. Applications in the environmental sector lead all assessments of criterion validity for public goods and the greater part of quasi-public goods. Almost two-thirds of investigations are for private goods in the US, with the remaining 35% spread across 9 countries. The question of whether results from simulated market experiments for a private good can transfer to evidence of the validity of CV methods in quasi- or pure-public goods has not yet been addressed.
The definition of external/criterion validity differs in the CV literature, but authors have equated this type of research with assessments of construct validity and reliability. This variety could explain why a large proportion of our evidence was accessed through reviews of references and citation searching. Previous reviews encountered the same difficulty and criterion validity assessments published since 2005 continue to use a variety of terms to describe similar types of research. Future reviews might therefore consider a wider variety of search terms but should expect this to be resource intensive, given the very large numbers of titles and abstracts returned for review.
This paper gives an indication of the degree of variation in hypothetical and actual WTP in the CV literature; hypothetical WTP (WTA) was on average 5.1 times greater (lower) than actual WTP (WTA), with a range of 1.5-11.99. The meta-analysed results place the degree of variation as 1.79 (range: 1.56-2.04) for mean summaries and 2.37 (range: 1.93-2.80) for percent summaries. Further, the predicted ratios and odds ratios of 3.18 and 2.24 respectively from the meta-regression confirm the variation in stated and actual WTP values.
The review also shows that current conclusions are heavily weighted (76% agreement) towards claims that criterion validity is not demonstrated, as only 24% of authors claim evidence of criterion validity. Analysis for publication bias can, assuming no difference in methods between large and small studies, be used to question pooled evidence on the presence of criterion validity, as it would indicate that studies which demonstrated a lack of criterion validity were published whereas those that showed validity were not. Whether this hegemony exists is tempered by our finding of great variety in methods, and it would be worth exploring whether a difference in methods between large and small studies would allay concerns over publication bias. However, alongside this evidence, we have found neither discussion of 'how close is close enough?' nor consideration of how valid the evidence itself was, and therefore we question whether the results are quite as robust as they appear to be. This review has highlighted a great deal of methodological variation between hypothetical and actual surveys, and potentially sufficient variation to question the validity of findings about criterion validity itself. For example, the elicitation format was different in over half the comparisons, and the same value cues were not necessarily used, as results from some hypothetical surveys influenced values presented in the actual survey. A series of other differences relate to variation in the survey comparisons used between hypothetical and actual surveys. For example, half the papers stated that different populations were used and 54% clearly used different sample sizes. As all these differences have been shown to influence mean WTP (Trapero-Bertran et al., 2013; Veronesi et al., 2011), there could arguably be a good reason to accept that WTP results should in fact be different.
Comparisons also involved a wide range of goods and differing conclusions on criterion validity were obtained from these. Evidently, criterion validity is good-specific. The meta-analysis further highlighted the high levels of heterogeneity in the surveys, further questioning pooling of the results. An exploration of the heterogeneity through sub-group analysis did not yield any meaningful explanation as the reduction in some sub-groups can be explained by the lower numbers, not the rigour of the studies. Further investigation of the heterogeneity through meta-regression generated mixed results on potential drivers of the variation between stated and actual WTP values. This further questions the validity of estimates pooled across the different study settings and valuation goods. It is not yet clear from the literature how or whether the results can be transferred across settings and types of goods.
To help in interpreting and lending credibility to the responses and possibly also in forming adjustments that can enhance reliability, attempts should be made to collect additional data for cross tabulations (Arrow et al., 1993). Surveys should collect information on the respondent's background characteristics and socio-economic data such as income, attitudes towards the good or service and prior exposure or experience with the good. Such questions help in the interpretation of the primary valuation question and could also be used as further tests of validity of the data. The majority of reviewed comparisons did not report on the collection and use of such data in the assessment of criterion validity.
The review found a marked difference in the duration of time between surveys for hypothetical and actual values, with 65% occurring concurrently and 25% with more than a 4-week gap between the surveys. A two-week interval is the generally recommended retest period to enhance reliability of the values obtained (Duane, 1992). However, while longer durations could potentially introduce recall bias, short intervals mean that respondents may remember what they said in the hypothetical survey and deliberately repeat the value to appear publicly consistent. While a longer duration between the two surveys might offer the respondent sufficient time to think and possibly forget or change their original values, it also increases the possibility of real change occurring, thus justifying a change in any value given. The duration between the two surveys is likely to contribute to conclusions on the criterion validity of contingent values. In the meta-regressions, the ratio of stated and actual WTP values was higher when the two surveys were conducted at the same time. The duration between the two surveys was not identified as influencing hypothetical bias for studies presenting percent summaries.
The review also highlights some potential queries about how valid the comparisons of mean values were, raising questions not only of study quality but also of how appropriate current conclusions might be. For example, 20% of comparisons did not include descriptions of how mean WTP/WTA was calculated, one-third of the comparisons had no information on tests used to determine differences in mean values between hypothetical and actual comparisons, and there was a general absence of information on the treatment of missing values. There were also very few explanations given for the selection of value cues behind bid offers regardless of design format. Until there is a set of reliable reference surveys, the burden of proof of reliability and validity (of a CV survey) rests on the survey designers and analysts (Neill et al., 1994; Onwujekwe et al., 2001). It is not clear what impact analytic methods have had on conclusions to date. In addition, poor reporting continues to limit the use of comparisons for systematic reviews and meta-analyses in CV research (Trapero-Bertran et al., 2013). Queries on the methodological quality of comparisons also raise the broader issue of the potential for developing either an evidence-based set of guidelines for high quality WTP comparisons or appropriate reporting guidelines for CV comparisons.
Whilst the assessments are carried out in different sectors, the methods used to evaluate validity could be comparable and lessons transferred. With only a few comparisons identified, the health focussed comparisons seemed to use more appropriate methods than those in other sectors. For example, higher proportions of comparisons used the same respondents, administration modes, elicitation formats and payment vehicles in hypothetical and actual surveys. Comparisons also reported on key explanatory variables, allowing for a comparison within the sector and potential transferability of the methods used to assess criterion validity across sectors. Having the same respondent for the hypothetical and actual valuation scenarios reduces bias when judging criterion validity and this too occurred more frequently in the health focussed comparisons. However, the assessment of criterion validity could be enhanced in all sectors if values were elicited from comparisons with the closest relation to the planned intervention.
Appropriate estimation methods should be used and summary statistics provided in comparable formats, such as ratios. Ensuring content validity might also improve the tests. This can be achieved by conducting focus group discussions with key stakeholders in the valuation context. This would help achieve credible scenarios, determine suitable values for use in the surveys, appropriate study administration modes and payment vehicles. The payment vehicle forms a substantive part of the overall package under evaluation and is generally believed to be a non-neutral element of the survey (Bateman et al., 2002), affecting both the response rate and the magnitude of the values. The majority of the payment vehicles used in the surveys were amenable to a criterion validity assessment except coercive measures such as tax. It is difficult to assess how this payment vehicle was used in actual surveys and the results used to determine criterion validity.

Conclusions
The evidence on the criterion validity for CV comparisons is more mixed than authors are representing because substantial differences in study design between hypothetical and actual WTP/WTA surveys are not accounted for. This concern is compounded by the presence of key gaps in the reporting of methods and data. In fact, there does not seem to be a sufficient pool of criterion validity studies in sectors such as health to permit a reasonable meta-synthesis and meta-analysis, and to draw robust results.
Across the sectors, there was a general dearth of studies employing similar methods (e.g. WTP elicitation formats) combined with other attributes (such as respondent characteristics discussed earlier), to allow the testing of the effect of these on criterion validity. As a result, the evidence does not adequately support current conclusions on the criterion validity of WTP. Sufficient breadth of empirical evidence on the criterion validity of WTP across sectors and goods is needed. This would enable a dataset to facilitate more rigorous testing of the different experimental protocols that might influence the differences between stated and actual WTP values.
The WTP method offers potential for a welfare based measure of value for non-marketed goods and should not be subjected to the blanket criticism that it has received over the years based on findings from poorly designed comparisons that incorrectly suggest a lack of criterion validity. Such criticism might be one of the reasons for the slow uptake of the method in evaluations, especially in the health sector. For evaluations of public health interventions where outcomes beyond quality-adjusted life years are often needed, this is particularly important. However, if the method is to be improved, more studies are required. The presented review and synthesis of the evidence further contributes to the ongoing work aimed at improving the method. The development of reporting guidelines for CV comparisons and the development of methodological guidelines for the conduct of criterion validity assessments would aid in the assessment of validity of the studies and transferability of findings.