Simple but not simpler: a systematic review of Markov models for economic evaluation of cervical cancer screening

The aim of this study was to critically evaluate the quality of the models used in economic evaluations of screening strategies for cervical cancer prevention. We systematically searched multiple databases, selecting model-based full economic evaluations (cost-effectiveness analyses, cost-utility analyses, and cost-benefit analyses) of cervical cancer screening strategies. Two independent reviewers screened articles for relevance and performed data extraction. The methodological quality of the models was assessed using formal checklists, and a qualitative narrative synthesis was performed. Thirty-eight articles were reviewed. The majority of the studies were conducted in high-income countries (82%, n=31). The Pap test was the most commonly investigated screening strategy, appearing in 86% (n=33) of the studies. Half of the studies (n=19) used a previously published Markov model. Deterministic sensitivity analysis was performed in 92% (n=35) of the studies. The mean number of properly reported checklist items was 9 out of a maximum possible 18. The best-reported items were the statement of the decision problem, the description of the strategies/comparators, the statement of the time horizon, and information regarding the disease states. Compliance with some items of the checklist was poor. The Markov models for economic evaluation of screening strategies for cervical cancer varied in quality. The following points require improvement: 1) assessment of methodological, structural, heterogeneity, and parameter uncertainties; 2) model type and cycle length justification; 3) methods to account for heterogeneity; and 4) reporting of consistency evaluation (through calibration and validation methods).


INTRODUCTION
Cervical cancer continues to be an important public health problem, with an estimated 266,000 deaths from cervical cancer worldwide in 2012 (approximately 87% of cervical cancer deaths occur in less developed regions) (1). Screening programs have reduced the incidence and mortality of cervical cancer. However, substantial costs are involved in providing the infrastructure, training the manpower, buying consumables, elaborating surveillance mechanisms, and treating and following up with patients (2). Therefore, successful programs will require using evidence-based, cost-effective approaches and strengthening national health systems (3).
Decision-analytic modeling (DAM) has increasingly been used to assess the cost-effectiveness of cancer prevention and control strategies and to inform public policies. DAM supports decision makers in choosing among the cervical cancer screening strategies under evaluation.
Cervical screening models vary considerably in their degree of complexity. The Markov model is the most common model used to simulate the natural history of progression to cervical pre-neoplastic and neoplastic disease. This popularity is likely due to the apparent simplicity of its implementation and use.
Previous reviews (4-8) have specifically discussed the use of DAM to evaluate the cost-effectiveness of cervical cancer screening, and others have discussed models that also evaluate the impact of human papillomavirus (HPV) vaccination on screening programs. However, none of these reviews used formal checklists to critically evaluate the quality of the Markov models used in economic evaluations of screening strategies for cervical cancer. These instruments may identify flaws that influence the cost-effectiveness results (9). Thus, critical evaluation can confirm the credibility and reliability of the results being used by decision makers (10).
The aim of this review, which was performed as part of a health technology assessment project funded by the Brazilian Public Health System, was to provide an overview of the quality of Markov models for economic evaluation of screening strategies for cervical cancer prevention. We identify some of the most important methodological issues, reflect on the reasons for poor reporting, and discuss implications for research standards.

Protocol and registration
This methodological systematic review was conducted based on the Centre for Reviews and Dissemination (CRD) guidelines (9) and reported according to the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) checklist (11). A protocol was developed prior to the initiation of this review but was not registered with the International Prospective Register of Systematic Reviews (PROSPERO) because this review does not contain direct patient or clinically relevant outcomes.

Eligibility criteria
Studies were included if they reported on the use of a Markov model to evaluate the costs and health outcomes of cervical cancer screening. Eligibility criteria were defined based on the components of the PICOS approach:
Participants: Markov model for economic evaluation of cervical cancer screening.
Intervention: Cervical cancer screening in settings with or without an HPV immunization program.
This review included only English, Spanish, and German language publications. Editorials, conference abstracts, review studies, studies that did not compare screening strategies in terms of costs and health consequences, and studies that exclusively analyzed vaccination strategies were excluded.

Searching other sources
Additional relevant studies were identified by assessing the reference lists of major publications on the subject and the references of studies identified by electronic databases.

Study selection
This review included only Markov model-based full economic evaluations of cervical cancer screening in settings with or without an HPV immunization program. Two independent reviewers (JYKV and CGF) screened the titles and abstracts of the identified studies and selected them using specific inclusion and exclusion criteria. Any disagreements during this process were resolved by discussion or by a third reviewer (PCS).

Data collection process
Two reviewers (JYKV and CGF) independently extracted the data into a Microsoft Excel 2016 spreadsheet form tailored to this project. The data collection form was based on a prior publication (4) and was piloted in five studies.
The following data were extracted from all studies:

Summary measures conversions
To enable comparisons across studies conducted in different countries and years and account for the effects of inflation over the designated period, the summary measures (ICERs) were updated to the year 2015. When the year of reported costs was not specified, the article publication year was used. Local currencies were initially inflated to 2015 values using specific consumer price indexes (24,25) and then converted into 2015 international dollars (I$) using purchasing power parity conversions provided by the World Bank (http://data.worldbank.org/indicator/PA.NUS.PPP) (26).
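The two-step conversion described above (inflate to 2015 prices, then convert with purchasing power parity) can be sketched as follows. This is a minimal illustration, not the authors' spreadsheet: the function name and all numeric inputs are hypothetical, and it assumes the PPP factor is expressed as local currency units per international dollar.

```python
def to_2015_international_dollars(cost_local, cpi_cost_year, cpi_2015, ppp_2015):
    """Inflate a local-currency amount to 2015 prices, then convert to I$.

    cost_local    -- amount in local currency at the original price year
    cpi_cost_year -- consumer price index in the original price year
    cpi_2015      -- consumer price index in 2015 (same base as cpi_cost_year)
    ppp_2015      -- assumed to be local currency units per international dollar
    """
    if cpi_cost_year <= 0 or ppp_2015 <= 0:
        raise ValueError("CPI and PPP factors must be positive")
    cost_2015_local = cost_local * (cpi_2015 / cpi_cost_year)  # step 1: inflation
    return cost_2015_local / ppp_2015                          # step 2: PPP conversion


# Hypothetical example: an ICER of 10,000 local units, CPI rising from 80 to 100,
# and a PPP factor of 2 local units per I$.
icer_2015 = to_2015_international_dollars(
    10_000.0, cpi_cost_year=80.0, cpi_2015=100.0, ppp_2015=2.0
)
```

Inflating first and converting second keeps both steps in the same price year, which is the order the text describes.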

Quality assessment
We evaluated the reporting quality of the structuring and development of Markov models using items of the framework for quality assessment of DAM (27) and a previously described instrument (28). The adapted checklist is an 18-item measure of the overall quality assessment of a DAM and contains three dimensions: 1) structure, 2) data, and 3) consistency (see Appendix 2). We chose these instruments as they are widely accepted as a scientific standard for the reporting of DAM studies, and they can be applied to quality assessment of DAMs for health technology assessment (HTA). The response options for each item include 'yes', 'no' and 'not applicable'. Each reviewed study was evaluated individually, and we counted each properly reported item (answer = 'yes') and summed responses based on a maximum possible count of 18.
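The counting rule described above can be written out as a short sketch. The function name and the example responses are hypothetical; only the three response options and the maximum of 18 items come from the text.

```python
ALLOWED_RESPONSES = {"yes", "no", "not applicable"}
N_ITEMS = 18  # number of items in the adapted checklist


def checklist_score(responses):
    """Return the number of properly reported items (answer == 'yes')."""
    if len(responses) != N_ITEMS:
        raise ValueError(f"expected {N_ITEMS} responses, got {len(responses)}")
    for response in responses:
        if response not in ALLOWED_RESPONSES:
            raise ValueError(f"unexpected response: {response!r}")
    return sum(1 for response in responses if response == "yes")


# Hypothetical study: 9 items reported, 7 not reported, 2 not applicable.
example_responses = ["yes"] * 9 + ["no"] * 7 + ["not applicable"] * 2
example_score = checklist_score(example_responses)
```

Note that under this rule 'not applicable' items still count toward the denominator of 18, as stated in the text.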

Synthesis of results
The more relevant results were summarized as a narrative synthesis. The study characteristics are presented in tables and figures. Due to study heterogeneity, meta-analysis or statistical pooling of the extracted summary measure (ICER) was not performed, given that this was neither feasible nor meaningful (29).

Search results
After the removal of duplicates, a total of 201 potentially relevant articles were identified. After assessment of the eligibility criteria, 38 studies (30-66) met the inclusion criteria of this review. Figure 1 presents the flowchart of the selection process. Table 1 presents the economic evaluation of included Markov model-based studies. The majority of the studies were conducted in high-income countries (82%, n=31). More than half of the studies were set in three high-income countries (USA=13, GBR=5, and CAN=3). Sixteen percent (n=6) of these studies were conducted in upper-middle income countries, and only one study included lower-middle and low-income countries (48).

Study characteristics
The Pap test was the most commonly investigated screening strategy and was employed in 86% (n=33) of the studies. The LBC, HC2 and HPV-DNA tests were employed in 34% (n=13), 29% (n=11) and 24% (n=9) of the studies, respectively. Combined tests, such as Pap + HC2, Pap + HPV-DNA, Pap + speculum and HC2 + cytology, were employed in 26% (n=10) of the studies. Other technologies, such as VIA, VILI and self-collection, were also investigated (16%, n=6). Thirteen studies (34%) considered the effect of an HPV immunization program on the analysis.
Half of the studies (n=19) used a previously published Markov model. In particular, five studies (36,43,49,51,66) used the model developed by Myers et al. (67). A graphical representation of the model was presented in 68% (n=26) of the studies. Among the studies that stated the number of health states (n=31, 82%), the number ranged from 4 to 23 (mean of 12). Among the studies that reported the duration of the Markov cycle (n=31, 82%), the majority (n=20, 65%) used annual cycles. Among the studies that reported the software used (n=19, 50%), most (n=11, 58%) used TreeAge (TreeAge Software Inc., Williamstown, MA), whereas Excel (Microsoft Corp., Redmond, WA) was used by 47% (n=9). One study used PopMod, software developed by the WHO (68), and one study implemented the model in the C++ programming language (35) (Figure 2).
Notes to Table 1: Self-collection = high-risk HPV DNA testing of self-collected vaginal samples; VIA = visual inspection with acetic acid; VIA/VILI = VIA in combination with Lugol's iodine. Target population: women within the age range indicated. Economic study type: CEA = cost-effectiveness analysis; CUA = cost-utility analysis. Currencies classified according to the International Organization for Standardization, ISO 4217:2015 (http://www.iso.org/iso/home/standards/currency_codes.htm).
Deterministic sensitivity analysis was performed in 92% (n=35) of the studies, of which 23% (n=8) also performed probabilistic analysis. Model validation was reported by 63% (n=24) of the studies, whereas 53% (n=20) mentioned that model parameters were calibrated (Figure 2). Figure 3 presents the proportion of economic evaluation studies (n=38) that properly complied with the 18 items of the checklist domains. The detailed assessment is reported in Appendix 3. The mean number of properly reported checklist items was 9 (SD 2.0) out of the maximum possible 18.
The best-reported items were the statement of the decision problem (item 1, 100%), the description of the strategies/comparators (item 5, 100%), the statement of the time horizon (item 7, 95%) and information regarding the disease states (item 8, 87%). Only one study simultaneously assessed the methodological, structural, heterogeneity, and parameter uncertainties (item 12) (61). Compliance was poor for the assessment of structural uncertainty (55%, n=21) and extremely poor for the justification of model type (5%, n=2), the justification of cycle length (5%, n=2), the assessment of heterogeneity (18%, n=4), the appropriateness of utilities (17%, n=4), and the assessment of external consistency (21%, n=8).

DISCUSSION
This systematic review was the first study to comprehensively assess the methodological quality of the models of previously published studies using items of formal checklists. We evaluated 38 decision-analytic cost-effectiveness models, and the results demonstrated poor compliance with these checklists.
As noted in a previous review (12), only one study has been conducted in lower-middle and low-income countries (48), which bear the greatest cervical cancer burden. Approximately 84% of cervical cancer cases occur in less developed countries, with the highest incidences of cervical cancer noted in Africa, Latin America and the Caribbean. This finding reflects a lack of technical expertise and a shortage of trained health economists in these regions. It also highlights the importance of local studies and reinforces the need to strengthen local modeling capacity.

Model structure
Half of the included studies (n=19) used a previously published Markov model. Only two studies justified the choice of model type (43,54), and the overwhelming majority did not provide reasons or explain why the use of a Markov model was appropriate. The choice of model type should be appropriate for the problem. In the case of cervical cancer, a Markov model may be suitable if the objective of the study is to assess alternative screening strategies in a setting in which disease prevalence is constant. The Markov model will simulate disease progression for a particular cohort of patients, assigning a probability of progression and regression between each of the classifications of dysplasia and invasive cancer (69). One limitation of the closed population model (such as a Markov cohort model) is that it may predict an increased cancer incidence compared with an open model. If the analysis incorporates the effect of an HPV immunization program, the ideal model would be a dynamic model that follows an entire population, allowing for evaluation of the impact of herd immunity (i.e., indirect protection of susceptible individuals by a significant proportion of immune individuals in the population) (69). Thirteen studies (36,38,45-48,50,55,57,59,62,63,66) reported that the effect of an HPV immunization program was considered in the analysis but did not explain how herd immunity was incorporated using a static cohort model.
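To make the cohort mechanics concrete, the following minimal sketch runs a closed Markov cohort through a few cycles. The four health states and every transition probability are hypothetical placeholders chosen for illustration; they are not taken from any of the reviewed models, which used between 4 and 23 states.

```python
STATES = ["Well", "CIN", "Cancer", "Dead"]  # hypothetical, simplified state set

# Hypothetical per-cycle transition probabilities; row i gives the probabilities
# of moving from state i to each state, so every row must sum to 1.
TRANSITIONS = [
    [0.90, 0.08, 0.00, 0.02],  # Well: stay, progress to CIN, -, die
    [0.30, 0.60, 0.08, 0.02],  # CIN: regress, stay, progress to cancer, die
    [0.00, 0.00, 0.85, 0.15],  # Cancer: stay or die
    [0.00, 0.00, 0.00, 1.00],  # Dead is absorbing
]


def run_cohort(initial, transitions, n_cycles):
    """Return the cohort trace: state occupancy at cycle 0, 1, ..., n_cycles."""
    for row in transitions:
        if abs(sum(row) - 1.0) > 1e-9:
            raise ValueError("each transition row must sum to 1")
    n = len(initial)
    trace = [list(initial)]
    for _ in range(n_cycles):
        current = trace[-1]
        # New occupancy of state j = sum over i of occupancy_i * P(i -> j).
        nxt = [sum(current[i] * transitions[i][j] for i in range(n)) for j in range(n)]
        trace.append(nxt)
    return trace


# The whole cohort starts in the Well state and is followed for two cycles.
trace = run_cohort([1.0, 0.0, 0.0, 0.0], TRANSITIONS, n_cycles=2)
```

Because no one enters the cohort after cycle 0, the occupancies always sum to 1. This is the closed-population property discussed above, and it is also why such a model cannot represent herd immunity without further assumptions.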
The Markov model can be more transparent and easier to understand, and it provides more conservative estimates than dynamic models. In contrast, because dynamic models allow for the inclusion of more detail, they can introduce additional uncertainties into the evaluation process and require more input data and computational resources, which may not be available in all settings. The direct and indirect effects of vaccination may not be observed in surveillance data for many years. Thus, although dynamic models are still developed by a small group of modelers (70), the development of these models will become increasingly important to explore the impact on screening as the first vaccinated cohorts approach the age of cervical cancer screening (12). Previous studies have reported an increased screening rate and a lower proportion of cervical abnormalities among vaccinated women compared with unvaccinated women (71,72). Future model-based economic evaluations will need to take into account the continuous interaction between screening and vaccination to predict the effects of vaccination on screening programs (6).
Only two studies justified the choice of cycle length (56,64). The cycle length should reflect the clinical problem and be the shortest interval at which the pathologies and/or diagnoses typically occur (73), and its justification should be based on the natural history of the disease (74). In the case of cervical cancer, often the only source of information regarding cases is the clinical examination results. However, this information may be under-reported given that HPV infections and precursor lesions may regress in less than a year (75) and screening is typically performed annually. Therefore, ideally, the definition of the cycle should not be based on the intervals between exams (74). However, occasionally, these are the only available data. The other option would be to use data from another setting, and both approaches would impact the analysis results.

Model data
Although half of the included studies presented the transition probabilities, none of them explained how the probabilities were calculated or whether the cycle correction was used. Concern has been raised in the DAM literature regarding confusion about the appropriate use of rates and probabilities. Depending on the model, this misconception may introduce important errors, impacting the validity of the model results (76,77). Various approaches can be used to estimate transition probabilities for the natural history of cervical cancer in Markov models, including a literature review of HPV and cervical intraepithelial neoplasia (CIN) progression and regression rates, data from observational studies, and fitting approaches (78). Although some relevant publications exist, no formal guidelines are available for the estimation of transition probabilities for use in Markov models (79). The understanding of the difference between rates and probabilities and how to transform them correctly is essential for those developing Markov models.
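The distinction between rates and probabilities can be illustrated with the standard conversion formulas, which assume a constant rate over the interval and a single transition out of the state (competing transitions need more care). This is a generic sketch with hypothetical function names and numbers, not a method reported by any of the reviewed studies.

```python
import math


def rate_to_probability(rate, cycle_length=1.0):
    """Probability of at least one event in a cycle, given a constant rate."""
    return 1.0 - math.exp(-rate * cycle_length)


def probability_to_rate(probability, period=1.0):
    """Constant rate implied by a probability observed over `period`."""
    return -math.log(1.0 - probability) / period


def rescale_probability(probability, from_length, to_length):
    """Convert a probability for one cycle length into another cycle length."""
    return 1.0 - (1.0 - probability) ** (to_length / from_length)


# Hypothetical example: an annual progression probability of 0.19 converted to
# a 6-month cycle. Simply halving it (0.095) would be wrong.
six_month_probability = rescale_probability(0.19, from_length=1.0, to_length=0.5)
naive_probability = 0.19 / 2
```

Here the rescaled value is 0.10, not 0.095, because probabilities compound over time while rates scale linearly. The gap widens as probabilities grow, which is one way the confusion noted above can bias model results.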
According to international guidelines, if health benefits are measured through utility measures, the methods used (e.g., time trade-off, standard gamble, specific questionnaires) and the subjects in whom the assessments were performed (e.g., patients, members of the general public, health professionals) need to be reported (80,81). Only 17% of the reviewed studies reported the applied instruments, methods of measurement and the sources of utilities employed. Inadequate reporting of utility measurement methods makes it difficult to compare different assessments, given that discrepancies between utility values obtained with different instruments and methods were previously observed in other studies (82-84). The lack of reporting of the sources of utility measures (the populations used to derive them) is also a concern: if the ultimate objective of the evaluation is to inform resource allocation decisions based on social interests, health state evaluations should be based on utility weights representative of the preferences of the general population (85).
Specifically in relation to economic evaluations of cervical cancer screening, differences in utility values for CIN lesions, presence of cervical cancer and genital warts may partially explain the differences in the analysis results. In addition, considering the limited data available on the utility values associated with these states (7), it is fundamental that sensitivity analyses performed in future studies consider a wide range of variation, including all plausible utility values.

Uncertainty
Uncertainty is present in all HTA models (74). DAM researchers distinguish among parameter, structural and methodological uncertainties, all of which require assessment (27). Parameter uncertainty can be addressed by deterministic or probabilistic sensitivity analysis. Structural uncertainty can be managed through alternative model structures, which involves re-running the model under alternative structural assumptions and presenting the results of each scenario. Methodological uncertainty can be addressed with a similar method. Only one study simultaneously assessed methodological, structural, heterogeneity, and parameter uncertainties (61). Approximately half of the included studies failed to account for structural uncertainty, reflecting the gap between guidelines and applied research. This finding was also highlighted in a previous review (28), where many published models failed to account correctly for the major sources of uncertainty, particularly structural uncertainty. Most studies (92%) addressed only parameter uncertainty through deterministic sensitivity analysis. In addition to the standard considerations of uncertainty about parameter estimates, it is important to assess the implications of model uncertainty on results (28).
Most models (89%, n=34) simulated aggregate groups of women at risk of cervical cancer over time without accounting for other aspects of population heterogeneity in screening behavior. Heterogeneity (i.e., the extent to which variability between patients can be explained as a function of their characteristics) (86) reflects differences in outcomes that may in principle be explained by variations among subgroups of patients, including characteristics such as age, sex, level of risk and severity of the disease, or the relative effects of treatment (87). Given the natural history of cervical cancer, women less than 30 years of age have more HPV infections than older women, while older women may experience the progression of this virus 116-fold more frequently than younger women. Therefore, HPV-DNA screening after the age of 30 years seems to be more effective than before the age of 30 (88). Thus, not considering heterogeneity during the analysis, which could be addressed by running the model for different subgroups of patients, may lead to errors in the results obtained (89). To capture heterogeneity in screening and vaccination behavior, it would be ideal to use individual-based models (microsimulation).
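The two approaches to parameter uncertainty named above can be sketched on a toy incremental cost-effectiveness ratio (ICER). All parameter names, base-case values, ranges and the gamma distribution for costs are assumptions made for this illustration; a real analysis would vary many parameters and usually report results on the cost-effectiveness plane or as acceptability curves.

```python
import random


def icer(cost_new, cost_old, effect_new, effect_old):
    """Incremental cost per unit of incremental effect."""
    return (cost_new - cost_old) / (effect_new - effect_old)


def one_way_sensitivity(base, name, low, high):
    """Deterministic analysis: vary one parameter, hold the others at base."""
    results = {}
    for label, value in (("low", low), ("base", base[name]), ("high", high)):
        params = dict(base, **{name: value})
        results[label] = icer(**params)
    return results


def probabilistic_sensitivity(base, samplers, n_draws, seed=0):
    """Probabilistic analysis: draw the listed parameters from distributions."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    draws = []
    for _ in range(n_draws):
        params = dict(base)
        for name, sampler in samplers.items():
            params[name] = sampler(rng)
        draws.append(icer(**params))
    return draws


# Hypothetical base case: new strategy costs 1500, old 1000, effects in life-years.
BASE = {"cost_new": 1500.0, "cost_old": 1000.0, "effect_new": 10.5, "effect_old": 10.0}

dsa = one_way_sensitivity(BASE, "cost_new", low=1250.0, high=2000.0)

# Assumed gamma distribution for the new strategy's cost (mean 100 * 15 = 1500).
psa = probabilistic_sensitivity(
    BASE, {"cost_new": lambda rng: rng.gammavariate(100.0, 15.0)}, n_draws=1000
)
```

Neither function addresses structural or methodological uncertainty. Those require re-running the analysis with a different model structure or different methodological choices, as the text describes.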

Model consistency
Model consistency refers to the quality of the model overall. Internal consistency tests the internal logic of the model by changing model inputs and examining the direction of the results. External consistency (also known as calibration) compares the model's results with the best available evidence or with the results of previously developed models. For instance, the cervical precancerous lesions predicted by the model can be compared with observed CIN-related outcomes. However, it is generally not clear whether these outcomes are based on cytological results or histologically confirmed lesions (8). Only 8 studies (21%) reported the use of some calibration method. This low value can be explained by the lack of standards in calibrating disease models in economic evaluation, especially cancer screening models (90,91). There is no consensus in the literature regarding an acceptable minimum specification for the fitting targets that should be reported (78). Another potential barrier to calibration is insufficient local data to estimate parameters associated with organized screening.
The Markov models for economic evaluation of screening strategies for cervical cancer varied in quality. Items that were generally well reported were the statement of the decision problem, the description of the strategies/comparators, the statement of the time horizon, and information regarding the disease states. A major shortcoming of the reviewed studies is that most models did not adequately assess methodological, structural, heterogeneity, and parameter uncertainties. Moreover, only a minority justified the model type and cycle length, assessed heterogeneity and the appropriateness of utilities, and evaluated external consistency. Future studies should evaluate the appropriateness of the different methods to account for uncertainty (through sensitivity analysis and alternative model structures), heterogeneity, and consistency (through calibration and validation techniques), as well as the relevance of reporting guidelines for Markov models to improve their transparency.
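As an illustration of what a calibration step can look like, the sketch below fits a single annual progression probability so that a deliberately simple two-state model reproduces an observed cumulative incidence. The target value, the grid and the squared-error criterion are all hypothetical choices; the reviewed studies did not report a common method, and real calibrations fit several parameters to several targets.

```python
def cumulative_incidence(annual_probability, years):
    """Model output: share of the cohort that has progressed after `years`."""
    return 1.0 - (1.0 - annual_probability) ** years


def calibrate(target, years, grid):
    """Grid search: return the candidate value with the smallest squared error."""
    best_value, best_error = None, float("inf")
    for candidate in grid:
        error = (cumulative_incidence(candidate, years) - target) ** 2
        if error < best_error:
            best_value, best_error = candidate, error
    return best_value


# Hypothetical calibration target: 5% cumulative incidence observed over 10 years.
TARGET = 0.05
YEARS = 10
GRID = [i / 10000 for i in range(1, 201)]  # candidates from 0.0001 to 0.0200

best_probability = calibrate(TARGET, YEARS, GRID)
fitted_incidence = cumulative_incidence(best_probability, YEARS)
```

Reporting the target, the search method and the goodness-of-fit criterion, as this sketch makes explicit, is the kind of information whose absence the text identifies as a barrier to assessing external consistency.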