Variation in the observed effect of Xpert MTB/RIF testing for tuberculosis on mortality: A systematic review and analysis of trial design considerations.

Background: Most studies evaluating the effect of Xpert MTB/RIF testing for tuberculosis (TB) concluded that it did not reduce overall mortality compared to usual care. We conducted a systematic review to assess whether key study design and execution features contributed to earlier identification of patients with TB and decreased pre-treatment loss to follow-up, thereby reducing the potential impact of Xpert MTB/RIF testing.
Results: We included seven randomised and one non-randomised study. All included studies demonstrated relative reductions in overall mortality in the Xpert MTB/RIF arm ranging from 6% to 40%. However, mortality reduction was reported to be statistically significant in only two studies. Study features that could explain the lack of observed effect on mortality included: the higher quality of care at study sites; inclusion of patients with a higher pre-test probability of TB leading to higher than expected empiric treatment rates; performance of additional diagnostic testing not done in usual care leading to increased TB diagnosis or empiric treatment initiation; the recruitment of participants likely to return for follow-up; and involvement of study staff in ensuring adherence with care and follow-up.
Conclusion: Most studies of Xpert MTB/RIF were designed and conducted in a manner that resulted in more patients being diagnosed and treated for TB, minimising the potential difference in mortality Xpert MTB/RIF testing could have achieved compared to usual care.


Introduction
Tuberculosis (TB) is the leading cause of mortality from an infectious disease globally. The 2018 World Health Organization (WHO) TB report estimates that there were 10 million incident TB cases and about 1.6 million TB-related deaths in 2017 1 . Early TB case detection and treatment initiation are critical for TB care and global TB elimination.
Sputum smear microscopy remains the primary method for diagnosing pulmonary TB in most countries with a high TB burden. Microscopy has suboptimal sensitivity and requires patients to submit multiple sputum samples often over several days, leading to loss to follow-up and missed opportunities for case detection and treatment. Nucleic acid amplification tests (NAAT) are known to increase sensitivity but until recently were not feasible in high-burden countries 2 . In 2010 3 , WHO first recommended Xpert MTB/RIF (Cepheid, Sunnyvale, CA, USA), a semi-automated, cartridge-based NAAT, as a first-line TB test for all patients suspected to have multi-drug resistant TB or HIV-associated TB and in 2013 4 , revised the recommendation to include Xpert MTB/RIF testing for all patients suspected to have TB where resources permit.
Since the initial WHO recommendations based on diagnostic accuracy estimates, several trials 5-12 have evaluated whether Xpert MTB/RIF testing reduced mortality among those undergoing TB evaluation in comparison to smear microscopy or pre-existing diagnostic algorithms. These trials have reported variable estimates of reduction in mortality, with only two 9,11 reporting a statistically significant decrease in mortality. A recently published individual patient data meta-analysis of five such trials 6-8,10,13 also did not show significantly reduced six-month all-cause mortality (OR 0.88, 95% CI 0.68 to 1.14) in adults ≥18 years with presumptive pulmonary TB 14 .
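To make the pooled estimate above concrete, the following minimal sketch recovers an approximate standard error, z statistic and two-sided p-value from the reported odds ratio and confidence interval. It assumes a Wald-type interval that is symmetric on the log scale; the meta-analysis may have used a different method, so the output is only a rough check that the interval spanning 1 corresponds to a non-significant result. The function name `wald_summary` is illustrative.

```python
import math

def wald_summary(or_point, ci_low, ci_high, z_crit=1.96):
    """Approximate SE of log(OR), z statistic and two-sided p-value
    from an odds ratio and its 95% CI (Wald approximation, assumed)."""
    # A symmetric log-scale interval has total width 2 * z_crit * SE.
    se = (math.log(ci_high) - math.log(ci_low)) / (2 * z_crit)
    z = math.log(or_point) / se
    # Two-sided p-value for a standard normal statistic.
    p = math.erfc(abs(z) / math.sqrt(2))
    return se, z, p

se, z, p = wald_summary(0.88, 0.68, 1.14)
print(f"SE(log OR) = {se:.3f}, z = {z:.2f}, two-sided p = {p:.2f}")
```

Under these assumptions the p-value is well above 0.05, consistent with the interval including 1.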
The literature offers several possible explanations, both for the methodological limitations of test-treatment trials in general and for Xpert MTB/RIF's apparent lack of a significant effect on mortality. A methodological review of test-treatment trials (n=103) published between 2004 and 2007 concluded that such trials were probably underpowered and had issues related to blinding, attrition, and inadequate primary analyses 15 . Other reviews of trials of Xpert MTB/RIF have raised issues related to the health systems in which the trials were conducted 16 , limited study power 14,16 , persistent use of empirical therapy 17 , limitations in interpreting trial results by focusing on statistical significance rather than clinically important differences 18 , enrolment of patients whose test results are not likely to influence treatment decisions, and limitations in evaluating a diagnostic test itself rather than a diagnostic test strategy in the intervention arm 19 . However, to date, less attention has been paid to the external validity of trials: the extent to which the design and conduct of the trials reflect what could be expected in usual care. In addition to earlier identification of drug resistance, Xpert MTB/RIF testing is expected to reduce mortality through earlier identification of patients with TB (increased sensitivity compared with smear microscopy) and decreased pre-treatment loss to follow-up (faster turn-around-time for results). We conducted a systematic review to assess whether the design and/or execution of studies also contributed to earlier identification of patients with TB and decreased pre-treatment loss to follow-up, thereby reducing the potential impact of Xpert MTB/RIF testing.
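The concern about limited study power can be illustrated with a standard two-proportion sample size calculation. The sketch below is not taken from any of the reviewed trials: the mortality proportions are hypothetical values chosen only to show how the required sample size grows when enhanced care in the control arm narrows the difference between arms. The function name `n_per_arm` and the normal-approximation formula are assumptions of this example.

```python
import math
from statistics import NormalDist

def n_per_arm(p_control, p_intervention, alpha=0.05, power=0.80):
    """Participants per arm needed to detect a difference between two
    proportions (normal approximation, individually randomised design)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = (p_control * (1 - p_control)
                + p_intervention * (1 - p_intervention))
    effect = p_control - p_intervention
    return math.ceil((z_alpha + z_beta) ** 2 * variance / effect ** 2)

# Hypothetical six-month mortality proportions, for illustration only.
routine = n_per_arm(0.10, 0.08)   # control arm under routine usual care
enhanced = n_per_arm(0.09, 0.08)  # control arm under enhanced usual care
print(f"per arm, routine control: {routine}; enhanced control: {enhanced}")
```

Because the required sample size scales with the inverse square of the difference, halving the expected difference roughly quadruples the number of participants needed. Cluster randomised designs would need further inflation for the design effect, which this sketch ignores.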

Study identification
We conducted a literature search to identify randomised and non-randomised studies assessing mortality following the introduction of Xpert MTB/RIF testing. We searched the Cochrane Central Register of Controlled Trials (CENTRAL), MEDLINE, and Scopus for studies in English published between 1 January 2009 and February 2019 with the terms 'Xpert MTB/RIF' or 'Xpert' or 'GeneXpert' and 'impact' or 'effect*' or 'implementation' or 'trial*'. We included studies that compared Xpert MTB/RIF to usual care as defined by the authors (for example, sputum microscopy or culture) and that intended to measure the effect of these tests on mortality among participants presumed to have active pulmonary TB. Hypothetical trials and modelling studies were excluded. The study protocol, details of which are available as Extended data 20 , followed PRISMA guidelines for performing systematic reviews, where applicable 21,22 ; however, since this was not a classical systematic review, not all items were appropriate. A completed checklist is available from Open Science Framework 20 .
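The search terms listed above can be assembled into a single Boolean string as sketched below. The grouping, with test-name terms combined by OR, outcome terms combined by OR, and the two groups joined by AND, is an assumed reading of the search description; the exact syntax submitted to each database is not given in the text and differs between CENTRAL, MEDLINE and Scopus. The helper `build_query` is hypothetical.

```python
def build_query(test_terms, outcome_terms):
    """Join each term group with OR, then combine the two groups with AND."""
    def group(terms):
        # Quote only multi-word phrases so that truncation symbols such as
        # 'effect*' are left unquoted.
        parts = [f'"{t}"' if " " in t else t for t in terms]
        return "(" + " OR ".join(parts) + ")"
    return group(test_terms) + " AND " + group(outcome_terms)

TEST_TERMS = ["Xpert MTB/RIF", "Xpert", "GeneXpert"]
OUTCOME_TERMS = ["impact", "effect*", "implementation", "trial*"]

query = build_query(TEST_TERMS, OUTCOME_TERMS)
print(query)
```

Keeping the term lists separate from the joining logic makes it straightforward to rerun an updated search with the same terms, as was done for the February 2019 update.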

Appraisal of studies
One reviewer (NK) searched, identified and appraised eligible articles up to December 2016. A second reviewer (EO) updated the search, identified and appraised eligible articles up to February 2019 in discussion with a senior reviewer (AC). The study data were extracted using Google forms and included the following elements: general study characteristics (geographical location, TB and HIV co-infection); description of study arms; sample size and power; description and results of the mortality outcome; and description of key study design features (study setting and context; study population; participant recruitment and enrolment; study procedures and participant follow-up). We used descriptive statistics to summarise quantitative data and provided a narrative summary of key design features concerning their potential impact on usual care. In appraising usual care, we considered how the study was executed, assessing whether usual care was enhanced beyond what is considered routine 23-27 .

Amendments from Version 1
We have made minor changes incorporating a suggestion from reviewer 1 and a suggestion from a reader. In the introduction (4th paragraph) we added an additional reason that could explain the lack of evidence of effect in the trials: "limitations in evaluating a diagnostic test itself rather than a diagnostic test strategy in the intervention arm". We have also corrected some information based on a reader's comment. The reader's feedback was "To clarify, chest radiography in the XTEND study (reference 6) was not a study-defined procedure. It was, however, part of the South African algorithm for investigation of TB using xpert/microscopy (please see the appendix to our article). We did not influence how that was implemented". Based on this feedback we have deleted the previous text (under study procedures and in Table 2) suggesting that chest radiography was a study-defined procedure in this trial.
Any further responses from the reviewers can be found at the end of the article.

Characteristics of included studies
Our search yielded 2147 records (Figure 1). From this, eight studies were included in this review (Table 1) 12 . These studies comprised three individually randomised trials 5,8,10 , two cluster randomised trials 6,9 , one secondary analysis of a stepped-wedge randomised trial 11,13 , one cross-over trial 7 , and one pre-post intervention study 12 . Further information about each trial is given as Extended data 20 .
Each study was described as pragmatic by the study authors and involved patients undergoing evaluation for pulmonary TB in routine care settings (primary health care clinics 6-11 and tertiary referral hospitals 5,12 ). All eight studies were conducted in high-TB-burden countries 28 , including seven in sub-Saharan Africa 5-10,12 , and one in Brazil 11 . Seven studies included adults ≥18 years 5-10,12 and one study 11 included adults and children of any age. The proportion of HIV-positive participants in the included studies ranged from 10% to 100%.
Usual care consisted of sputum smear microscopy in all but one study, where both culture and smear microscopy comprised standard of care 5 following a change in government policy recommending Xpert MTB/RIF as the initial diagnostic test.
Overall rates of participant loss to follow-up (LTFU) ranged from 1% to 22% in included studies. LTFU rates between trial arms were similar except for two studies in which LTFU was higher in the smear microscopy arm compared to the Xpert MTB/RIF arm (10% vs 2% 12 and 22% vs 18% 8 , respectively).
All-cause mortality was evaluated in seven studies 5-9,12,17 and TB-attributed mortality in one study 11 . Mortality was assessed as the primary outcome in three studies 6,9,12 , as a composite primary outcome in one study 8 and as a secondary outcome in the other four studies 5,7,11,17 . All included studies demonstrated relative reductions in overall mortality in the Xpert MTB/RIF arm ranging from 6% to 40%. However, mortality reduction was reported to be statistically significant in only two studies 9,11 .

Analysis of key study design features relative to usual care
We analysed study features across five domains: study setting and context, study population, participant recruitment and enrolment, study procedures, and study follow-up. A summary of study features can be found in Table 2.

Study setting and context
We focused on whether the quality of care in the usual care arm was higher at study sites than would be expected in usual care settings, either because of the sites chosen or the manner in which studies were executed. All eight studies used laboratories that observed high quality standards for TB testing, with one study 6 excluding laboratories that did not meet quality standards. In two studies, research staff were directly involved in the care of participants, including facilitating chest X-rays, delivering test results to participants, and referring participants for TB treatment 10,12 . In one study, research staff performed sputum induction and bronchoscopy, neither of which was routinely available at the study site 12 .

Study population
Empiric treatment is more common when pre-test probability of TB is high 17,19 or in study populations with very ill patients who have a high likelihood of dying, reducing the potential impact of Xpert MTB/RIF, a more sensitive test than sputum microscopy. Five studies reported rates of empiric treatment, and the rates ranged from 12% to 60% 5,8,10-12 . Five of eight studies enrolled participants with a higher pre-test probability of TB than the target population (i.e., all patients referred for sputum-based TB testing in usual care). Yoon and colleagues 12 and Calligaro and colleagues 5 conducted their studies in inpatient settings, where TB prevalence and empiric treatment rates are generally higher than in outpatient settings. Theron and colleagues 10 required HIV-negative participants to have at least two TB symptoms (cough for more than two weeks, fever lasting two weeks, weight loss, sweats, fatigue, chest pain or hemoptysis), rather than enrolling all patients referred for TB testing. Two studies 8,9 included only HIV-positive patients who had not started ART, a population in whom empiric treatment is more common. In addition to high rates of empiric treatment, Churchyard and colleagues 6 and Ngwira and colleagues 9 excluded patients who resided outside the clinic catchment area or in remote locations, reducing the potential for loss to follow-up.
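The reasoning that empiric treatment narrows the advantage of a more sensitive test can be expressed as a simple calculation. The sketch below is a deliberately simplified model, not an analysis from any included study: it assumes that a patient with TB starts treatment either after a positive test or, if the test is negative, through empiric treatment at the same rate in both arms. The sensitivity values are hypothetical placeholders, and the empiric rates of 12% and 60% are borrowed from the reported range only as illustrative inputs, since the studies may have used a different denominator.

```python
def fraction_treated(sensitivity, empiric_rate):
    """Share of true TB patients starting treatment: test-positive, or
    test-negative but treated empirically (simplified model)."""
    return sensitivity + (1 - sensitivity) * empiric_rate

def xpert_gain(sens_smear, sens_xpert, empiric_rate):
    """Absolute increase in the share of TB patients treated when the
    more sensitive test replaces smear microscopy."""
    return (fraction_treated(sens_xpert, empiric_rate)
            - fraction_treated(sens_smear, empiric_rate))

# Hypothetical sensitivities, for illustration only.
SENS_SMEAR, SENS_XPERT = 0.60, 0.85

for empiric_rate in (0.0, 0.12, 0.60):
    gain = xpert_gain(SENS_SMEAR, SENS_XPERT, empiric_rate)
    print(f"empiric rate {empiric_rate:.2f}: gain in treated share {gain:.3f}")
```

In this model the gain equals the sensitivity difference multiplied by (1 - empiric rate), so any study feature that raises empiric treatment shrinks the measurable benefit of Xpert MTB/RIF proportionally.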

Participant recruitment/enrolment
A high level of interaction between research staff and participants could lead to increased adherence to care and follow-up. Study staff requested consent from participants in all but two studies 7,11 and, as noted earlier, transported patients for chest radiographs in two studies 10,12 . Both activities provided an opportunity for research staff to build rapport and to counsel and educate participants on TB diagnosis and treatment. In addition, in one study 10 patients were asked to wait for their smear microscopy results or were offered voluntary counselling as they were being transported for chest radiographs, likely reducing pre-treatment loss to follow-up relative to routine care.

Study procedures
The use of testing and other procedures not typically available in many high-burden settings could lead to more patients being diagnosed with and treated for TB than would have occurred under usual care. For example, chest radiography was performed in all participants in two studies 10,12 , and at baseline at the discretion of clinicians in one study 8 . The availability of chest radiograph results compatible with active TB is likely to have made empiric TB treatment initiation more frequent, especially for HIV-positive participants 10 . Culture is not routinely available in most high TB burden settings. However, it was part of usual care in one study setting 5 , and was performed in two other studies as a reference standard for diagnostic test accuracy calculations 10,12 . In all three studies, a positive culture test result also informed treatment in the Xpert MTB/RIF arm.

Study follow-up
To maintain contact and study follow-up, Churchyard and colleagues 6 sent mobile phone call vouchers (worth US$2) as an incentive to encourage patients to remain contactable during the study and later organised home visits when contact by phone failed. Ngwira and colleagues 9 enhanced follow-up by scheduling extra visits, conducting home visits and using data registers to trace participants who missed clinic appointments. Yoon and colleagues provided transport vouchers and made home visits for patients who did not return for scheduled follow-up visits 12 . The enhanced follow-up procedures likely increased initiation of TB treatment for those with bacteriologically-confirmed disease and those without bacteriological confirmation but persistent symptoms.

Discussion
Our review has implications for the design of future trials aiming to assess the comparative effectiveness of novel TB diagnostics. We highlight features of trial design and execution that could have mitigated the key advantages of Xpert MTB/RIF relative to smear microscopy with respect to faster diagnosis and treatment of TB patients. Such features included: a higher quality of care in comparison to usual care at trial sites, inclusion of patients with higher pre-test probability of TB relative to all patients undergoing TB testing at the trial sites leading to higher than expected empiric treatment rates, selection criteria and increased contact with participants as a result of study procedures leading to reduced pre-treatment loss to follow-up, the performance of additional diagnostic testing not done in usual care leading to increased TB diagnosis or empiric treatment initiation, the recruitment of participants likely to return for follow-up, and involvement of study staff in ensuring adherence with care. Designing future comparative studies of novel TB diagnostics in real life settings where optimal conditions are not likely to be met could mitigate these issues and provide a better assessment of their likely impact.
Our findings complement those of Auld and colleagues 16 , who also published a literature review exploring Xpert MTB/RIF's lack of effect on morbidity and mortality. They appraised eight trials (six randomised 5-8,10,11 and two pre-post trials 12,29 ) and concluded that study characteristics that may explain this lack of effect on morbidity and mortality include underpowered trials, higher rates of empiric treatment in the control arms compared to the Xpert MTB/RIF arm, studies with populations not comprised exclusively of those likely to benefit from Xpert MTB/RIF, and health system limitations such as patient loss to follow-up. Our review extends upon and contextualizes these findings by focusing on how specific study design and execution features that improve upon usual care may mitigate the potential benefit of novel diagnostics.
Of the eight studies included in our review, Trajman and colleagues 11 minimally interrupted usual care for that setting. The study was conducted in public primary care settings, included all patients undergoing TB testing (no exclusion criteria) and utilized routinely collected data to assess outcomes. Electronic records of routinely collected diagnostic, treatment and outcome data were linked and analysed retrospectively with minimal influence by external research staff. Trajman and colleagues also did not use additional staff or diagnostics over and above what was available in usual care settings, and similar implementation protocols were uniformly applied at all sites. Informed consent was also not a requirement.
There is an inherent tension between a study's internal and external validity 30,31 , with the former favouring more rigorous control and the latter more pragmatism. Indeed, most research on which practice guidelines have been based has focused on internal validity rather than external validity 30 . For example, some selection and/or additional support for study sites is needed to ensure the availability of test kits and anti-TB drugs during the trial period, and some strengthening of routine data collection and recording is needed to minimize missing data. If a study is completely hands-off with regard to clinical practice, it may not demonstrate effects on mortality because the system in which the test is introduced is poorly functioning. This may be useful information in that specific context (there may be little point in implementing a new diagnostic in the context of a dysfunctional health system) but may mislead on the potential impact on mortality in a better functioning system. In practice, feasibility issues such as available study funding and available time to conduct the studies 27 mean that most studies fall along a continuum between pragmatic and explanatory approaches 32 . In this light, researchers are encouraged to use the Pragmatic Explanatory Continuum Indicator Summary (PRECIS) tool to inform design decisions on how explanatory (ideal context) or pragmatic (usual care context) the features of their trials should be along the pragmatic/explanatory continuum 33 . Trial findings also need to be interpreted in line with the trial's position on this continuum, particularly if the trials are labelled as pragmatic.
When a study aims to provide valid evidence for or against the introduction of a trial-validated intervention in real-world settings, a more pragmatic trial is needed to evaluate its performance in less controlled, heterogeneous settings and populations that are typical of the settings the intervention is intended for 34 . The study population should be all persons that would qualify for the intervention under usual care, including adults and children and participants that may be prone to loss to follow-up. Recruitment approaches should be built on existing ones 35 . Individual consent should be inferred from participants' presentation at the health facility and request for treatment, especially if the intervention under study is already approved. Study populations would therefore present themselves to the health facility staff and be evaluated with no more effort than that observed under usual care. Alternatively, other methods of obtaining consent can be sought, such as consent waivers, integrated clinical and research consent, and broadcast consent (notifications in health settings informing patients that trials with minimal risk are permitted) 36 . The intervention should be delivered through usual care providers and resources 34 . Data on patients and outcome measures should also be gathered from routine programmatic data sources wherever possible; efforts can be made to strengthen data collection and bring it to a higher standard without the potentially problematic effect of placing research staff at each study site.
The strengths of our review include a comprehensive search in multiple databases for studies assessing the effect of Xpert MTB/RIF testing on mortality. Two reviewers extracted data in discussion with a senior reviewer. Our review was limited by focusing on the effect of Xpert MTB/RIF on one health outcome. However, other health outcomes such as morbidity and quality of life are limited by lack of standardized scores and are rarely 18 measured in trials. For example, only one trial 10 in our review evaluated the effect on morbidity and none evaluated the effect on quality of life. An advantage of Xpert is its high sensitivity in detecting rifampicin-resistant TB 37 . It would be informative to evaluate the effect of Xpert MTB/RIF on health outcomes in patients with rifampicin-resistant TB. However, none of the included studies evaluated the effect of Xpert on rifampicin-resistant TB due to limited prevalence and follow-up. In addition, we did not review the effect of Xpert testing in children because TB diagnosis in children is still a challenge 38 . Indeed, only one study 11 included children. Lastly, our review was limited to studies written in English and to what was reported in the included studies.
In conclusion, although the trials were presented as pragmatic, specific study design and execution choices are likely to have reduced their ability to demonstrate an impact of Xpert MTB/RIF testing on mortality. Offering higher quality of care than what occurs in usual care may lead to differences in mortality between control and intervention arms that are smaller than would have been observed with usual care. Trialists face an inherent tension between internal and external validity. Nonetheless, our findings indicate that trials that are further along the explanatory-pragmatic continuum are needed to evaluate the impact of the next generation of TB diagnostics in real-world settings.

Data availability
Underlying data
All data underlying the results are available as part of the article and no additional source data are required.

Extended data
This is an interesting argument as in general, when trials are less pragmatic there tends to be a larger effect size than in real world settings. The authors are arguing that in this instance, the opposite is true, and that larger effect sizes would be seen in real world settings.
Their arguments are convincing but do not address the main reasons for the failure of these studies to show an effect on mortality. In this reviewer's opinion, the primary reason that many of these trials failed to show an effect of Xpert MTB/RIF was fundamental design flaws in terms of the choice of the intervention arm.
The fundamental approach to determining the impact of a novel diagnostic is to perform 'test research' (diagnostic accuracy studies) before moving to 'diagnostic research' (e.g. developing algorithms or prediction rules), before finally moving to 'diagnostic intervention research' (randomised trials). In the case of Xpert MTB/RIF no adequate 'diagnostic research' was performed and so appropriate diagnostic strategies were not developed prior to designing randomised trials. The trials therefore merely tested the standard of care, which was smear microscopy and frequent empiric therapy (due to the known lack of sensitivity of smear), vs Xpert MTB/RIF as a stand-alone test. An appropriate research strategy would have been to first develop a full diagnostic strategy based around Xpert MTB/RIF, which included, for example, therapy for patients with a high pre-test probability of disease regardless of Xpert result and empiric therapy for patients with high pre-test probability who were unable to produce sputum, possibly with the inclusion of a trial of antibiotics and referral for CXR when the diagnosis was in doubt.
Such an approach would have adequately compared the current standard of care with an evidence-based diagnostic strategy which included Xpert MTB/RIF. In the form in which these studies were done, it was in my opinion highly likely that no effect would be seen.
Therefore, while the authors make some convincing arguments as to how these trials might have been better designed, my suggestion would be that they acknowledge that these were minor flaws in comparison to those mentioned above. I would conclude that while there is room for improvement in the 5 areas they discuss, adequate randomised trials of Xpert MTB/RIF would compare the current standard of care diagnostic strategy based around smear microscopy with a novel diagnostic strategy which included Xpert MTB/RIF, rather than simply comparing a test with another test.

Are the rationale for, and objectives of, the Systematic Review clearly stated? Yes
Are sufficient details of the methods and analysis provided to allow replication by others? Yes
Is the statistical analysis and its interpretation appropriate? Yes
Are the conclusions drawn adequately supported by the results presented in the review? Partly
Competing Interests: No competing interests were disclosed.
Reviewer Expertise: Infectious diseases clinician with an interest in the appropriate evaluation of novel diagnostics, particularly for tuberculosis.
I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.