Improving the design of studies evaluating the impact of diagnostic tests for tuberculosis on health outcomes: a qualitative study of perspectives of diverse stakeholders

Background: Studies evaluating the impact of Xpert MTB/RIF testing for tuberculosis (TB) have demonstrated varied effects on health outcomes with many studies showing inconclusive results. We explored perceptions among diverse stakeholders about studies evaluating the impact of TB diagnostic tests, and identified suggestions for improving these studies.
Methods: We used purposive sampling with consideration for differing expertise and geographical balance and conducted in-depth semi-structured interviews. We interviewed English-speaking participants, including TB patients, and others involved in research, care or decision-making about TB diagnostics. We used the thematic approach to code and analyse the interview transcripts.
Results: We interviewed 31 participants. Our study showed that stakeholders had different expectations with regard to test impact and how it is measured. TB test impact studies were perceived to be important for supporting implementation of tests but there were concerns about the unrealistic expectations placed on tests to improve outcomes in health systems with many influencing factors. To improve TB test impact studies, respondents suggested conducting health system assessments prior to the study; developing clear guidance on the study methodology and interpretation; improving study design by describing questions and interventions that consider the influences of the health-care ecosystem on the diagnostic test; selecting the target population at the health-care level most likely to benefit from the test; setting realistic targets for effect sizes in the sample size calculations; and interpreting study results carefully and avoiding categorisation and interpretation of results based on statistical significance alone. Researchers should involve multiple stakeholders in the design of studies. Advocating for more funding to support robust studies is essential.
Conclusion: TB test impact studies were perceived to be important to support implementation of tests but there were concerns about their complexity. Process evaluations of their health system context and guidance for their design and interpretation are recommended.



Introduction
Tuberculosis (TB) continues to be a major public health burden. In 2018 it was estimated that about 10 million people developed TB disease, there were about half a million new cases of rifampicin-resistant TB, and 1.5 million deaths due to TB 1 . The End TB strategy strives to reduce TB incidence by 80%, and TB mortality by 90% compared to 2015 levels. To facilitate progress towards these targets, the World Health Organization (WHO) recommends that countries aim to have 90% or more of TB patients diagnosed with WHO recommended rapid tests, and 90% or more of eligible patients treated with new recommended drugs by the year 2025 2 .
In order to improve TB case detection and rapid initiation of treatment, new rapid molecular diagnostic tests with reported high sensitivity and specificity and/or short turnaround times, such as Xpert MTB/RIF and Xpert Ultra (the newest version), continue to be introduced to the market 1,3 . It is expected that accurate diagnosis and rapid initiation of treatment would improve downstream health outcomes such as morbidity and mortality.
However, there is uncertainty about the effects of Xpert MTB/RIF on people-important outcomes, which include outcomes that directly reflect how an individual feels, functions or survives (patient health outcomes) 4 , and outcomes that lie on the causal pathway through which a test can affect a patient's health, and thus predict patient health outcomes (surrogate or intermediate outcomes) 5,6 . Two recently published systematic reviews and meta-analyses of randomized trials suggest Xpert MTB/RIF likely reduces mortality 7 [odds ratio 0.88, 95% CI 0.68-1.14] and unfavorable treatment outcomes 8 [risk ratio 0.92, 95% CI 0.82-1.02] when compared to smear microscopy in adults with presumptive TB, but uncertainty in effect estimates was high. Pooled results in the meta-analyses suggested Xpert MTB/RIF did not affect time to diagnosis [hazard ratio 1.05, 95% CI 0.93-1.19] and time to treatment [hazard ratio 1.0, 95% CI 0.75-1.32]. Confidence intervals were wide, demonstrating large variation in estimates.
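The uncertainty in these pooled estimates can be made concrete with a short calculation. The sketch below is not part of the cited reviews: it takes the four estimates quoted above, checks whether each 95% CI includes the null value of 1, and back-calculates an approximate standard error on the log scale. It assumes the intervals were built as estimate ± 1.96 standard errors on the log scale, which is a common convention but an assumption here, and the function name `summarise_ratio` is hypothetical.

```python
import math

def summarise_ratio(lower, upper, null_value=1.0):
    """Summarise a 95% CI for a ratio measure (odds, risk or hazard ratio)."""
    # Assumption: the CI is symmetric on the log scale (estimate +/- 1.96 * SE).
    se_log = (math.log(upper) - math.log(lower)) / (2 * 1.96)
    return {
        "se_log": se_log,
        "includes_null": lower <= null_value <= upper,
    }

# Pooled estimates exactly as quoted in the text: (estimate, lower, upper).
reported = {
    "mortality (odds ratio)": (0.88, 0.68, 1.14),
    "unfavorable treatment outcome (risk ratio)": (0.92, 0.82, 1.02),
    "time to diagnosis (hazard ratio)": (1.05, 0.93, 1.19),
    "time to treatment (hazard ratio)": (1.00, 0.75, 1.32),
}

for outcome, (estimate, lower, upper) in reported.items():
    summary = summarise_ratio(lower, upper)
    print(f"{outcome}: {estimate} ({lower}-{upper}), "
          f"SE(log) ~ {summary['se_log']:.3f}, "
          f"includes 1: {summary['includes_null']}")
```

Under these assumptions every quoted interval spans 1, which is consistent with describing the effect estimates as uncertain rather than as evidence of no effect.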
Randomized trials of diagnostic tests are typically considered the best way 9 to evaluate the effects or impact of interventions but these studies are challenging and their interpretation may not be straightforward [10][11][12] . A diagnostic test is evaluated as an element in a complex intervention, comprising a sequence of interrelated events and decisions, all of which vary across different study contexts 12 . End users and other stakeholders may have different perspectives on the impact of diagnostic tests, outcome measures that matter, and how they should be evaluated. To our knowledge, no systematic attempts to gather and analyze these perspectives have been published.
Qualitative research can help in understanding the complex phenomena at play and the varied perceptions of participants who are part of TB diagnostic test studies, and help to shed light on why and how these tests work in different contexts, and on how best to implement them 13 .
We explored perceptions of diverse stakeholders about studies evaluating the impact of TB diagnostic tests, and identified suggestions for improving these studies.

Study design
We conducted a qualitative study, using a phenomenological approach, that aimed to develop a complete description and understanding of human experiences and meanings, allowing findings to emerge from the data 14 .
Sampling and recruitment
Participants were purposively sampled from institutions known to our network and from other diagnostic forums such as the Stop TB New Diagnostics Working Group, and the Global Health Diagnostics community online (GHDonline). To source participants from GHDonline, we sent a general email to members on the platform inviting them to participate in the study. Invitation letters can be found in Extended data: Annex 1.
We only included English-speaking participants who had been involved in research, care, or decision making about both drug susceptible and drug resistant TB diagnosis. Considering that diagnostic tests need to function in a complex ecosystem of various users at various levels of the health care systems 13 , we sampled diverse stakeholders. We considered maximum variation with regard to expertise (researchers, clinicians, laboratory workers, TB programme managers, guideline developers, policy makers, TB technical assistance and support agencies, funding agencies, patients, TB survivors and activists) and geographical location (from various low and high TB burden countries). We believed that a diverse group of stakeholders would give us broader insight into designing, executing, interpreting and using TB studies for decision-making.
We sent out invitations to 60 potential participants, and interviewed only those who responded to, and accepted our invitation. We aimed to have a purposive sample of 30 participants in the study, since we anticipated that data saturation would have been reached with this number.

Data collection
Data were collected through in-depth semi-structured interviews. We prepared an interview guide and tailored it to the different stakeholders we were interviewing (Extended data: Annex 2). The topic guide was piloted by conducting mock interviews with three colleagues (not part of this project) from the Centre for Evidence-based Health Care at Stellenbosch University and modified based on the results of the pilot exercise.
Interviews were conducted by two researchers (EO [female] and SN [male] 15 ). EO has a medical background with further training in international health and clinical epidemiology. SN is an epidemiologist. Both EO and SN underwent an additional three-month training course in qualitative research methods and interview techniques.
Interviews were conducted in English via a conference call platform or by telephone. Teleconference interviews were conducted by EO with SN listening in and taking notes. Face-to-face interviews were conducted with patients in Khayelitsha community health clinic (Cape Town) by SN with the help of a professional interpreter who translated questions from English to the local language, isiXhosa. Participant responses were then translated back to English.
There were no pre-established relationships between the interviewers and participants prior to the interviews. Participants were provided with information sheets and written consent forms prior to the interview, via Google Forms for teleconference interviews and as hard copies for face-to-face interviews. The content of consent forms was similar for non-patient participants and patients; however, consent forms for patients were translated into the local language, isiXhosa (Extended data: Annex 3).
Interviews lasted between 30 and 45 minutes. Interview data were captured using a digital voice recorder. Interviews were transcribed for analysis by a professional transcriber. All transcripts were audited for accuracy by the interviewer who conducted the interview. Names of participants did not appear on the transcripts. Transcripts were not returned to participants for corrections or clarification.
Data are stored electronically in password protected computers, and on secure online data storage platforms.

Data analysis
Analysis of the interviews was done after data collection using thematic analysis 16 . Two researchers, EO and SN, coded the interview transcripts together, discussing the codes and themes. EO and SN first familiarized themselves with the subject matter by listening to the audio tapes and reading the transcripts. The first transcript was coded independently and themes in the data were discussed. For feasibility reasons we decided to code subsequent transcripts together. Guided by the research questions, our analysis utilized deductive and inductive approaches grounded in the data. We did not apply line-by-line coding to every single line, but coded information that was relevant to our research question. We developed a broad set of codes, and modified or added to the codes as we read the transcripts. We coded the transcripts using the qualitative analysis software ATLAS.ti version 7.
We generated and merged similar codes to minimize duplication and improve readability and grouped the codes into sub-themes and themes in discussion with a senior author (MN) (see coding hierarchy in Extended data: Annex 4).

Ethics and reporting
This study received ethical approval from the health research ethics committees of Stellenbosch University (HREC Reference # N18/01/009) and University of Cape Town, and approval from the city of Cape Town to use health facilities in Khayelitsha. We referred to the consolidated criteria for reporting qualitative research (COREQ) to guide the reporting of this study 17 .

Results
We conducted 31 interviews between September and December 2018. A summary of the study participants' characteristics can be found in Table 1.
We explored four major themes: General perception of test impact studies, barriers facing test impact studies, selection of outcome measures, and suggestions for improving test impact studies. These themes and related subthemes have been summarized in Figure 1.

Importance of test impact studies depends on product cycle.
Some respondents felt that the need for impact studies depends on the product cycle. For example, an impact study may not be necessary at the beginning when a test has just been developed, due to concerns of delaying market access of the tests. However, it may be necessary after roll out of a test.
"It is very difficult to use impact information in the beginning to make a decision to invest or not invest and so consequently we are willing to take the risk. Usually in a practical sense to invest or not invest in a technological approach without really understanding the feasibility of ultimately that technology having impact at the other end of the journey". -(P20, Funder)

Theme 2: Barriers facing test impact studies
Barriers facing the design, conduct and interpretation of test impact studies are summarized in Figure 1 and are discussed below.

Design barriers
Underdeveloped methodology
Respondents felt that study designs and methods used in test impact studies are still not well developed, hence it is difficult to rely on them to guide decisions on test roll out.
"So, I think in general it is underdeveloped area….. the overall field of impact assessment is in its nascent state… impact assessments do not feature prominently in that, simply because they are not well articulated in a credible manner so as to provide reliable information to help us make a decision". -(P20, Funder)

Lack of clear guidance
The lack of guidance for TB test impact studies was discussed by respondents.
"There was no standard way of doing this, so that was the overall feeling that you just had to come up with whatever you thought was best for the patient so a very subjective view in a way". -(P1, Researcher)

Funding limitations
Funding was discussed as a major deciding factor of the size and duration of the test impact studies. It was noted that funding to support the studies was often limited.
"I think maybe it's many funding institutions do not offer the amount of money you need to recruit thousands of patients and follow then up for years….. If you go to test like every variation of that intervention, you know, you end up having many study arms and rapidly becomes impossible to do the study because it is expensive and will take forever".

Disconnect in multi-stakeholder needs
Respondents discussed the difficulties in multi-stakeholder collaboration in the planning and design of multifactorial impact studies.
"Sometimes you struggle with stakeholder support. I think there's a bit of a disconnect between levels of government".

Limitations of routine data
The challenges of routine data in pragmatic studies were also highlighted, including issues with requirements for many approvals to access data, and collecting accurate and complete data.
"I think the decision to roll out Genexpert was predominantly a political decision." -(P12, Researcher)

Theme 3: Considerations in outcome selection
Respondents proposed the outcomes they would prefer to be measured in TB test impact studies and commented on the limitations of the proposed outcomes (Table 2).
In selecting outcomes as a measure of test impact, the following considerations were put forward by the respondents.

Funding considerations.
Funding to measure outcomes that can be measured early or late in the cascade of care was considered when selecting outcomes in impact studies.

Reflection of functioning of the health system.
Intermediate (also known as surrogate) outcomes such as time to diagnosis were discussed as suitable for demonstrating the functioning and quality of the health system, and would thus inform roll out of tests.
"I'm a strong believer honestly, in surrogate outcomes simply because I think it holds the diagnostic tests to the bar of the whole health system". -(P27, Technical agency representative)

Ease of measurement.
The ability of an outcome measure to give unequivocal or unambiguous results such as mortality, and ease of measurement such as time to diagnosis, and time to treatment were considered in selecting outcomes. For example, mortality can easily be assessed because patients can be traced, and death can be recorded. Quality of life measurements were preferred by some, because standardized scores or widely accepted tools for measuring them exist. Morbidity was regarded as difficult to measure because of lack of standardized scores (see Table 2).

Availability and quality of data.
The availability and quality of data were important considerations when selecting the outcome to be measured. Respondents stated that analyses of outcomes such as ongoing transmission of infection were limited by availability of data. In routine settings especially, assessment of mortality would be limited by loss to follow-up and poor routine data sources (see Table 2).

Strengthening multi-stakeholder collaborations and support
Respondents suggested greater collaboration between producers and users of research to provide evidence that was truly useful to end users. They also stressed the need for collaboration at all levels of health systems governance from the beginning of the study, in order to account for all factors that could influence test impact studies.

Need for more highly pragmatic studies
To enable decision making, some respondents stressed the need for such studies to be designed and conducted in settings of intended use.

Considering the magnitude of absolute reduction in interpretation
Respondents discussed improving clarity about what statistical significance implies for decision making, by focusing not only on statistical significance but also on the magnitude of the reduction.
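A small worked example can illustrate why the magnitude of an absolute reduction matters alongside statistical significance. The sketch below is illustrative only: the baseline mortality risks (5% and 20%) and the risk ratio of 0.85 are hypothetical values chosen for the arithmetic, not findings of this study, and `absolute_effect` is a hypothetical helper.

```python
import math

def absolute_effect(baseline_risk, risk_ratio):
    """Return (absolute risk reduction, people per event averted)."""
    risk_with_test = baseline_risk * risk_ratio
    absolute_reduction = baseline_risk - risk_with_test
    # No reduction (or harm) means no finite number of people per event averted.
    if absolute_reduction <= 0:
        return absolute_reduction, math.inf
    return absolute_reduction, 1.0 / absolute_reduction

# Hypothetical inputs: the same relative effect applied to two baseline risks.
RISK_RATIO = 0.85
for baseline in (0.05, 0.20):
    reduction, per_event = absolute_effect(baseline, RISK_RATIO)
    print(f"baseline risk {baseline:.0%}: absolute reduction {reduction:.2%}, "
          f"about {per_event:.0f} people per death averted")
```

The same relative effect translates into very different absolute reductions depending on the baseline risk, which is one reason a result may matter for decision making regardless of which side of a significance threshold it falls on.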

Discussion
Our study explored the perceptions of different stakeholders about studies evaluating the effect of TB diagnostic tests on health outcomes, and identified suggestions for improving these studies. In summary, our study showed that stakeholders had different expectations with regard to test impact and how it is measured. TB test impact studies were perceived to be important for supporting implementation of tests but there were concerns about the unrealistic expectations placed on tests to improve outcomes in health systems with many influencing factors. To improve TB test impact studies, respondents suggested conducting health system assessments prior to the study; developing clear guidance on the study methodology and interpretation; improving study design by describing questions and interventions that consider the influences of the health-care ecosystem on the diagnostic test; selecting the target population at the health-care level most likely to benefit from the test; setting realistic targets for effect sizes in the sample size calculations; and interpreting study results carefully and avoiding categorisation and interpretation of results based on statistical significance alone. Engaging multiple stakeholders when designing these studies, advocating for more funding to support robust studies and conducting more highly pragmatic studies were also suggested.
Expertise and role in the health care system contribute to how test impact is perceived and measured 5 . To improve the usefulness of results to end-users, researchers designing the impact studies need to seek insights from various stakeholders involved in decision making about TB diagnostic tests. This will clarify which patient-important outcomes are considered important at the study design stage.
Qualitative research exploring the complex process involved in impact evaluations of TB tests is scarce 13 . Existing qualitative studies about TB diagnostic tests focus mainly on stigma and disease perceptions influencing diagnosis 13,18-22 , barriers facing TB evaluation services, or TB control efforts and factors influencing delays in TB diagnosis 15,23-26 . These studies nonetheless give insight into the health system barriers that may affect the implementation of TB diagnostic tests, and indirectly flag aspects that researchers ought to consider when designing and conducting TB implementation trials in routine settings. For example, Cattamanchi and colleagues 25 demonstrated that health system barriers (stock outs, limited infrastructure, poor staff motivation, high workload, poor coordination of health services) and setting barriers (stigma, patient time and costs) both impede TB diagnosis, and if not addressed could impede TB case detection. Indeed, one respondent in our study cited stock outs of Xpert MTB/RIF cartridges as a challenge that delayed their impact study. Unavailability of tests could contribute to high rates of empirical therapy in a study, attenuating the effect of Xpert on mortality.
Since the initial recommendations for the use of Xpert MTB/RIF in 2010 27 , we still lack strong evidence of the test's impact on people-important outcomes 28 . Calls have been made to better understand how to implement and evaluate this test (as well as the newest version, Xpert Ultra) in weak health systems 28,29 . The effective implementation of Xpert MTB/RIF has been limited by funding, lack of comprehensive diagnostic implementation plans, evaluations suggesting limited impact, and weak health systems 29-31 . The design and execution of implementation trials evaluating the effect of Xpert MTB/RIF (and Xpert Ultra) on health outcomes thus need to consider the health ecosystem in which the test is expected to perform 28,30,32 . This could be done by incorporating process evaluations 33,34 before or alongside the trials to understand the different diagnostic implementation processes, and how the diagnostic interventions and the health ecosystem interact with each other in the TB cascade of care.
Qualitative research methods 33 incorporated in these process evaluations can explain how interventions work, why interventions do not work, and explore factors influencing the delivery and implementation of an intervention. Process evaluations have been used to inform the design of trials evaluating the impact of malaria diagnostic tests 35,36 . For example, Ansah and colleagues 35 evaluated the impact of malaria rapid diagnostic tests on fever management in Ghana. To inform their study design they conducted a baseline study of available antimalarial drugs and also conducted focus group discussions to explore the acceptability of their intervention and how best to introduce it.
The updated recommendations on the use of Xpert MTB/RIF advised that impact evaluations be done, but did not provide detailed guidance on how to do so 30,37 . To design effective implementation trials and impact evaluations, guidance informed by programmatic data specific to real life settings is needed. The impact assessment framework for TB diagnostic tests proposed by Mann and colleagues 38 discussed areas and different types of analyses (effectiveness, equity, health systems, scale-up and policy analyses) to be considered in impact assessments in general. This framework was however not specific to trials or studies evaluating the impact of TB diagnostic tests on health outcomes. Schumacher and colleagues described the range of study designs that can be used to assess the impact of TB diagnostics but did not provide guidance on how to conduct such studies 6 . Guidance on designing effective impact trials of TB diagnostic tests could address areas highlighted in the findings of our study including how to use a priori process evaluations to guide the design of impact studies, and how to improve the study design by defining the diagnostic intervention, setting realistic targets in sample size calculations, selecting appropriate target populations, and guiding the selection of outcomes to be measured. Such guidance could also suggest how to incorporate the views of different stakeholders in the design and conduct of the impact studies and offer direction on how these studies can best be interpreted.
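The suggestion to set realistic effect-size targets in sample size calculations can be illustrated with a standard two-proportion formula. The sketch below is a minimal illustration, not guidance from this study: it assumes an individually randomized two-arm trial with a binary outcome, equal allocation, a two-sided alpha of 0.05 and 80% power, and it ignores clustering and loss to follow-up. The 10% control-arm mortality is a hypothetical value, and `n_per_arm` is a hypothetical helper.

```python
import math
from statistics import NormalDist

def n_per_arm(p_control, relative_reduction, alpha=0.05, power=0.80):
    """Approximate participants per arm to compare two proportions."""
    p_test = p_control * (1 - relative_reduction)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance level
    z_beta = NormalDist().inv_cdf(power)
    variance = p_control * (1 - p_control) + p_test * (1 - p_test)
    n = (z_alpha + z_beta) ** 2 * variance / (p_control - p_test) ** 2
    return math.ceil(n)

# Hypothetical control-arm mortality of 10%; compare an optimistic target
# (50% relative reduction) with more modest ones (20% and 15%).
for reduction in (0.50, 0.20, 0.15):
    print(f"relative reduction {reduction:.0%}: "
          f"{n_per_arm(0.10, reduction)} participants per arm")
```

Under these assumptions, a trial powered for a large effect would be far too small to detect a modest one, which is one way an optimistic target can contribute to an inconclusive result.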
Our study had a number of strengths. We incorporated views from various stakeholders, including patients, to obtain a holistic view of the multi factorial components of test impact studies and we followed the COREQ guidelines in reporting our study.
Our study was, however, limited by the fact that we interviewed only those who responded to our invitations. Participant bias, where respondents give expected and socially desirable answers, could also have occurred. We tried to mitigate this by asking open-ended questions. Most stakeholders interviewed were based in high-income countries, or from India and several countries in Africa including South Africa, Malawi, Zimbabwe and Uganda. This could limit the applicability of our findings. Most respondents gave their perceptions about Xpert MTB/RIF. We did not explore perceptions of studies evaluating the effect of a point-of-care urine-based lipoarabinomannan (LAM) assay on health outcomes. Trials evaluating the impact of this test have also shown variation in effects on health outcomes, with some demonstrating conclusive reduction in mortality 39 and others inconclusive effects when TB LAM 40 is compared to standard of care. Nonetheless, the effect size for LAM in those trials was about a 10-20% mortality reduction, similar to Xpert MTB/RIF 7,8,39,40 . Perceptions and insights explaining the significant effects of the TB LAM test would also be useful in guiding the design of impact evaluations of novel TB diagnostic tests.
In summary, TB test impact studies were perceived to be important to support implementation of tests but there were concerns about their complexity and how they are influenced by the health system context. Process evaluations of their health system context and guidance for their design and interpretation are recommended.

Data availability
Underlying data
Ethical approval from the ethics committees and informed consent by participants were granted to disseminate de-identified data. Relevant de-identified quotes to support the results provided have been included in the main manuscript. Despite de-identification, transcripts of the interviews have not been provided because information contained in the transcripts can betray the identity of participants. Any further requests for particular de-identified data or quotes can be granted by contacting the corresponding author directly.

The sampling of participants was, as the authors acknowledge, limited because only those who responded to the invitation were included in the study. Another potential bias, which they do not discuss, is the preponderance of researchers in their sample. One of their aims was to obtain maximum variation in categories of respondents, and although this was achieved to a certain extent, the higher number of researchers in their sample may have biased their results to the researchers' points of view. In addition, although they included patients in their sample, the views of patients were reported on much less than those of other respondent categories, leading to a narrowing of the focus of this paper. Whilst this in itself is not a bad thing, it does undermine to some extent the achievement of the holistic view that the authors were aiming for.

Extended data
A second area, which could be improved even though the study is complete, is the consideration of the more general issues of how research questions are generated, how research projects are formulated, and how stakeholders interact to ensure that research is useful to as wide a range of stakeholders as possible.
A key concern of this study was how research on TB diagnostic tests fails to meet the expectations of stakeholders. This is an important concern around research in general, and the authors could draw on this literature both in the contextualization of this study, as well as in the discussion of their results.

Are sufficient details of methods and analysis provided to allow replication by others? Yes
If applicable, is the statistical analysis and its interpretation appropriate?
designers. This claim is widely refuted in qualitative literature produced by the field of science studies, which suggests that implementation must be understood from both the user's and the designer's perspectives (see Madeleine Akrich among others). If this is the argument of the paper, it must refute an entire well-established discipline. That seems a tall order.
The paper quickly shifts to argue that studying test designers and diverse stakeholders will help address problems with those studies. Though this seems useful, it's not clear what makes this a publishable finding. Surely, scientists talk to each other and discuss how they evaluate studies all the time. Is this currently not happening for TB? If not, why not? If so, then why is this paper doing anything more than asking colleagues for their suggestions? It's unclear how interviewing the creators of studies evaluating impact will affect the impact of a diagnostic test. The authors seem to be watching the bee-watcher and suggesting doing so will tell us about the bee. If that's the case, they must make a stronger argument about why. How can improving evaluation lead to actual improvement in performance at the health system level and for our patients?

Methods
First, the study is not phenomenological. It is descriptive. It does not fit into the broader history and philosophical school of phenomenology and this term is inappropriately used here. Descriptive is best. One could also argue that the study falls somewhere in an interactionist paradigm but perhaps that's not of much use here. Finally, the paper, like all papers, cannot attain a complete description of anything. This modifier should be stricken during revision. The paper tells us many things and should be lauded, but it is important to be both realistic and modest at the same time.
The authors also tell readers that they considered maximum variation with regard to expertise, but do not tell us how they did so or how they define expertise. How were people put in categories? Don't many of these categories overlap? It seems that the sample is very heavily weighted to researchers and one wonders why. What effect might this have had on results? It could even be a positive effect.
It strikes me that the interviews were rather short. Why was this the case? How might their limited time span affect data quality? Might this be a limitation?
The authors should add a paragraph describing the non-coding analytic work they did. Coding is a process of preparing data for analysis, not of analyzing it. What kind of analytical work has been done here? This revision is essential. Often in qualitative research, analysis is looking for connections, considering outliers, or looking for variation of a transversal theme based on interviewee positionality. Did the authors do any of this work?

Results
The results say that the authors explored four major themes. Were these themes designed into the interview process, or did they emerge from the interviews? Both ways of accessing information are valuable, but it is important to clearly state how you came to these themes.
The final sentence in column 1 on page 5 refers to a TB community. It's not clear what 'the TB community' is. Some precision here would be useful. Certainly, patients (the majority of the TB community at a planetary scale) would not be particularly concerned about a test's ability to demonstrate improved outcomes. I'd suggest redefining the term here to represent the small network of actors who make decisions about tests and their expectations. In this same paragraph, the authors suggest that logistical issues thwart test effectiveness. This has little to do with the paper's stated purpose of better assessing test impact. How will this data point help to better assess a test's potential and actual effects?
It seems essential, but the paper is unable to effectively develop it as written.
Table 2 in the results section could be more clearly referenced in the text. It is not immediately clear what the differences between "health outcomes" and "intermediate outcomes" are. It's also not clear who is doing the preferring in "preferred outcome measures."

Discussion
The discussion's first line is symptomatic of a larger question that the publication of this study poses. It reads, "Our study explored the perceptions of different stakeholders about studies evaluating the effect of TB diagnostic tests on health outcomes…". The clear statement of purpose here gestures to a conflation made by the authors. They write assuming that better evaluation of tests would result in better outcomes. On what data do they base this first principle? They follow this sentence with a second, contradictory sentence a few lines later: "TB test impact studies were perceived to be important for supporting implementation of tests but there were concerns about the unrealistic expectations placed on tests to improve outcomes in health systems with many influencing factors." Indeed, this claim undercuts their methods and the purpose of the study as well as the first quote in the section. Nonetheless, the argument is strongly supported by the data they present in the results section. This, to my mind, indicates the importance of studies that move past questions of how to better assess tests toward those that ask how we know, predict, and anticipate health systems when designing a test and anticipating its effects. The paper is unable to access this information due to its sample. Still, these questions are crucial and might be gestured to in the conclusion.
The authors suggest that qualitative research on impact evaluation is scarce. Though this claim is true, there is a large body of anthropological and STS literature on randomized controlled trials, test development, and global health metrics. I encourage them to look to work conducted in Oslo, Maastricht, Edinburgh, and Berkeley, among others. It is also not totally clear to this qualitative scientist how epidemiologists with a three-month course in qualitative methods feel comfortable making such broad claims about qualitative research. Their paper is a good one and it has a valuable store of important information, but perhaps calls to qualitative research action and broad claims about the need to include qualitative methods in process evaluation might best be left to others. Even this paper does not incorporate qualitative methods in process evaluations, though I would encourage the authors to do so. This points to a need to better sum up what the paper does tell us rather than make larger claims about where and what research ought to do. The paper indeed shows that "TB test impact studies were perceived to be important to support the implementation of test but there were concerns about the complexity of how they were influence by the health system context." This is a central finding and I'd encourage reconfiguring the discussion around this claim. It may help the authors resolve the problems created by their tacit assumption that better evaluation will solve structural problems. To my mind this is not the case; but designing evaluation with structural limitations in mind and developing new ways to account for them when evaluating the possible effect of a diagnostic is essential. This is, after all, the task the authors set out for themselves. I encourage them to reconfigure the discussion to attend to such considerations and help evaluation designers think innovatively about how context does and will always matter.
Though this piece has limitations, it is a crucial step in recognizing that accounting for context is essential when conducting evaluations and predicting effect. It begins an important conversation about the centrality of considering the context in which tests are being used and how that context may affect the care- and system-related effects of a technology. It could be an important tool for improving the quality of TB care and reminding us that a test not connected to care, despite many favorable evaluations, has squarely achieved failure.