Clinical relevance assessment of animal preclinical research (RAA) tool: development and explanation.

Background
Only a small proportion of preclinical research (research performed in animal models prior to clinical trials in humans) translates into clinical benefit in humans. Possible reasons for the lack of translation of the results observed in preclinical research into human clinical benefit include the design, conduct, and reporting of preclinical studies. There is currently no formal domain-based assessment of the clinical relevance of preclinical research. To address this issue, we have developed a tool for the assessment of the clinical relevance of preclinical studies, with the intention of assessing the likelihood that therapeutic preclinical findings can be translated into improvement in the management of human diseases.


Methods
We searched the EQUATOR Network for guidelines that describe the design, conduct, and reporting of preclinical research. We searched the references of these guidelines to identify further relevant publications and developed a set of domains and signalling questions. We then conducted a modified Delphi consensus to refine and develop the tool. The Delphi panel members included specialists in evidence-based (preclinical) medicine, methodologists, preclinical animal researchers, a veterinarian, and clinical researchers. A total of 20 Delphi panel members completed the first round, and 17 members from five countries completed all three rounds.


Results
This tool has eight domains (construct validity, external validity, risk of bias, experimental design and data analysis plan, reproducibility and replicability of methods and results in the same model, research integrity, and research transparency) and a total of 28 signalling questions. It provides a framework for researchers, journal editors, grant funders, and regulatory authorities to assess the potential clinical relevance of preclinical animal research.


Conclusion
We have developed a tool to assess the clinical relevance of preclinical studies. This tool is currently being piloted.

Who is this intended for?
This tool is intended for all preclinical researchers and clinical researchers considering translation of preclinical findings to first-in-human clinical trials, the funders of such studies, and regulatory agencies that approve first-in-human studies.
Materials & Methods
We followed the Guidance for Developers of Health Research Reporting Guidelines (Moher, Schulz et al. 2010), as there is no specific guidance for developers of tools to assess the clinical relevance of preclinical research. The registered protocol is available at http://doi.org/10.5281/zenodo.1117636 (Zenodo registration: 1117636). The study did not start until the protocol for the current study was registered. The overall process is summarised in Figure 1.
Search methods
First, we established whether there is any domain-based assessment tool for preclinical research. We searched the EQUATOR Network's library of reporting guidelines using the terms 'animal' or 'preclinical' or 'pre-clinical'. We included any guidelines or tools that described the design, conduct, and reporting of preclinical research. We searched the references of these guidelines to identify further relevant publications. We searched only the EQUATOR Network's library as it is based on a comprehensive search of the existing reporting guidelines. A scoping search of PubMed using the terms 'animal[tiab] AND (design[tiab] OR conduct[tiab] OR report[tiab])' returned nearly 50,000 records, and initial screening of the first 1000 of them did not identify any relevant publications. Therefore, the more efficient strategy of searching the EQUATOR Network's library was used to find any publications of a domain-based tool related to design, conduct, or reporting guidelines of preclinical research.
Development of domains and signalling questions
We recorded the topics covered in the previous guidance on preclinical research to develop a list of domains and signalling questions to be included in the formal domain-based assessment of preclinical research. The first author identified and included all the topics covered in each of the publications and combined similar concepts. The initial signalling questions were developed after preliminary discussions with, and comments from, all the Delphi panel members (please see below). The full list of the topics covered in the publications and the initial signalling questions are available in the supplementary information, Appendix 1 (second column). The signalling questions are questions that help in the assessment of a domain. Additional details about how domains and signalling questions can be used are listed in Box 1.

We then refined the signalling questions and explanations by iterative electronic communications. Finally, we piloted the tool with biomedical researchers who perform preclinical animal research and those who perform first-in-human studies to clarify the signalling questions and explanations.

Deviations from protocol
There were two deviations from our protocol. Firstly, we did not exclude questions even when consensus was reached on the necessity of the questions: this was because the phrasing of the domain/signalling question, the explanation, the domain under which the signalling question is located, and the combining or splitting of domains were still being debated. Secondly, we did not conduct an online meeting of the panel members between the second and third rounds of the Delphi process because listing and summarising the comments from the different contributors achieved the aim of providing information to justify or revise the rankings.

Scope and applicability of the tool
The scope of the tool is only for assessment of the clinical relevance of a preclinical research study, in terms of the likelihood that therapeutic preclinical findings can be translated into improvement in the management of human diseases, and not for assessment of the quality of the study, i.e. how well the study was conducted, although we refer to tools that assess how well the study was conducted. It is important to make this distinction as even a very well-designed and conducted preclinical study may not translate into improvement in the management of human diseases, as is the case in clinical research.
As part of the Delphi process, the scope was narrowed to include only in vivo laboratory-based preclinical animal research evaluating interventions. Therefore, our tool is not intended for use on other forms of preclinical research such as in vitro work (e.g. cell cultures), in silico research, or veterinary research. This tool is not applicable in the initial exploratory phase of development of new animal models of disease, although it is applicable in interventional studies using such newly developed models.

A domain can be classified as 'low concern' if all the signalling questions under the domain were classified as 'yes' or 'probably yes', 'high concern' if any of the signalling questions under the domain were classified as 'no' or 'probably no', and 'moderate concern' for all other combinations.
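The classification rule above can be sketched as a short function. This is only an illustration of the logic described in the text, not part of the tool itself; the function name and answer labels are ours.

```python
def classify_domain(answers):
    """Classify a domain from its signalling-question answers.

    'low concern'      - all answers 'yes' or 'probably yes'
    'high concern'     - any answer 'no' or 'probably no'
    'moderate concern' - all other combinations (e.g. 'no information')
    """
    positive = {"yes", "probably yes"}
    negative = {"no", "probably no"}
    answers = [a.lower() for a in answers]
    if any(a in negative for a in answers):
        return "high concern"
    if all(a in positive for a in answers):
        return "low concern"
    return "moderate concern"

print(classify_domain(["yes", "probably yes"]))   # low concern
print(classify_domain(["yes", "no"]))             # high concern
print(classify_domain(["yes", "no information"])) # moderate concern
```

Note that the 'high concern' check is applied first: a mix of 'yes' and 'no' answers makes the domain high concern, per the rule above.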
Overall classification of the clinical relevance of the study
A study with 'low concern' for all domains will be considered a study with high clinical relevance in terms of translation of preclinical results with a similar magnitude and direction of effect to improve the management of human diseases. A study with unclear or high concerns for one or more domains will be considered a study with uncertain clinical relevance in the same terms.
However, depending upon the nature and purpose of the research, certain domains may be more important than others, and users can decide in advance whether a particular domain is important (or not). For example, if the purpose is to find out whether there is enough information to perform a first-in-human study, the clinical translatability and reproducibility domain is of greater importance than if the report was about the first interventional study on the model.
At the design and conduct stage, researchers, funders, and other stakeholders can specifically look at the domains that are assessed as unclear or high concern and improve the design and conduct to increase the clinical relevance. At the reporting stage, researchers, funders, and other stakeholders can use this tool to design, fund, or give approval for further research.
Practical use of the tool
The tool should be used with a clinical question in mind. This should include, as a minimum, the following aspects of the planned clinical study: the population in whom the intervention or diagnostic test is used, the intervention and control, and the outcomes (PICO). We recommend that the tool is used after successfully completing the training material, which includes examples of how the signalling questions can be answered and an assessment of understanding of the use of the tool (the training material is available at: https://doi.org/10.5281/zenodo.4159278), and that at least two assessors use the tool independently.
A schema for the practical use of the tool is described in Figure 3.
Scoring
The tool has not been developed to obtain an overall score for clinical relevance assessment. Therefore, modifying the tool by assigning scores to individual signalling questions or domains is likely to be misleading.

This question assesses whether, after choosing the appropriate model (species, sex, genetic composition, age), the authors have performed studies to characterise the model. For example, sepsis is often induced through caecal ligation and puncture; however, this procedure can produce variable sepsis severity. Another example is when genes that induce disease may not be inherited reliably: the resulting disease manifestation could be variable, and interventions may appear to be less effective or more effective than they actually are (Perrin 2014). Therefore, it is important to ensure that the genes that induce the disease are correctly identified and that such genes are inherited reliably. Another example is when the authors want to use a knockout model to understand the mechanism of how an intervention works, based on the assumption that the only difference between the knockout mice and the non-knockout mice is the knockout gene. However, the animals used may still contain the gene that was intended to be removed, or the animals may have other genes introduced during the process of creating the knockout mice (Eisener-Dorman, Lawrence et al. 2009). Therefore, it is important to understand and characterise the baseline model prior to testing an experimental intervention.

For pharmacological interventions, there may be a therapeutic dose and route which is likely to be safe and effective in humans. It is unlikely that the exact dose used in animals will be studied in humans, at least in the initial human safety studies. Therefore, dose conversion is used in first-in-human studies. Simple practice guides and general guidance for dose conversion between animals and humans are available (FDA 2005, Nair and Jacob 2016, EMA 2017). However, some researchers may use doses in animals at levels that would be toxic when extrapolated to humans and therefore unlikely to be used. Dose conversion guides (Nair and Jacob 2016) can help with the assessment of whether the dose used is likely to be toxic. The effectiveness of an intervention at such toxic doses is not relevant to humans. It is preferable to use the same route of administration in animal studies as planned in humans, since different routes may lead to different metabolic fate and toxicity of the drug.
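As an illustration of how such dose conversion works, the FDA (2005) guidance converts mg/kg doses between species using body-surface-area (Km) factors: the human equivalent dose (HED) is the animal dose multiplied by the ratio of the animal and human Km values. The sketch below assumes the commonly tabulated Km values and is not a substitute for the guidance itself.

```python
# Body-surface-area (Km) factors as commonly tabulated from the FDA (2005)
# guidance; values should be checked against the guidance before any real use.
KM = {"mouse": 3, "rat": 6, "rabbit": 12, "dog": 20, "human": 37}

def human_equivalent_dose(animal_dose_mg_per_kg, species):
    """HED (mg/kg) = animal dose (mg/kg) x (animal Km / human Km)."""
    return animal_dose_mg_per_kg * KM[species] / KM["human"]

# Example: 10 mg/kg in mice corresponds to roughly 0.81 mg/kg in humans.
print(round(human_equivalent_dose(10, "mouse"), 2))  # 0.81
```

A dose that converts to a level known to be toxic in humans would, as noted above, make the animal finding irrelevant to human use.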

For non-pharmacological interventions for which similar interventions have not been tested in humans, the feasibility of use in humans should be considered. For example, thermal ablation is one of the treatment options for brain tumours. Ablation can, for example, also be achieved by irreversible electroporation, which involves passing high-voltage electricity and has been attempted in the human liver and pancreas (Ansari, Kristoffersson et al. 2017, Lyu, Wang et al. 2017). However, the zone affected by irreversible electroporation has not been fully characterised: treatment of human brain tumours using this technique can only be attempted when human studies confirm that there are no residual effects of high-voltage electricity in the surrounding tissue (not requiring ablation). Until then, the testing of irreversible electroporation in animal models of brain tumours is unlikely to progress to human trials and will not be relevant to humans regardless of how effective it may be.

The intervention may also be effective only at a certain time point in the disease (i.e. the 'therapeutic window'). It may not be possible to recognise and initiate treatment during the therapeutic window because of delays in the appearance of symptoms and diagnosis. Therefore, there is no rationale for performing preclinical animal studies in which the intervention cannot be initiated during the likely therapeutic window. Finally, the treatment may be initiated prior to induction of disease in animal models: this may not reflect the actual use of the drug in the human clinical situation.

1.4 If the study used a surrogate outcome, was there a clear and reproducible relationship between an intervention effect on the surrogate outcome (measured at the time chosen in the preclinical research) and that on the clinical outcome?
A 'surrogate outcome' is an outcome that is used as a substitute for another (more direct) outcome along the disease pathway. For example, in the clinical scenario, an improvement in CD4 count (surrogate outcome) leads to a decrease in mortality (clinical outcome) in people with human immunodeficiency virus (HIV) infection (Bucher, Guyatt et al. 1999). For a surrogate outcome to be valid, the relationship between the effect of the intervention (a drug that improves the CD4 count) on the surrogate outcome (CD4 count) and a clinical outcome (mortality after HIV infection) should be high, should be shown in multiple studies, and should be independent of the type of intervention (Bucher, Guyatt et al. 1999). This probably applies to preclinical research as well. For example, the relationship between the effect of an intervention (a cancer drug) on the surrogate outcome (apoptosis) and a clinical outcome or its animal equivalent (for example, mortality in the animal model) should be high, shown in multiple studies, and independent of the type of intervention for the surrogate outcome to be valid in the preclinical model.
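One simple way to examine such a relationship, sketched below with hypothetical effect estimates, is to correlate the intervention effects on the surrogate outcome with those on the clinical outcome (or its animal equivalent) across studies; a high correlation observed across multiple studies and intervention types supports the validity of the surrogate.

```python
import numpy as np

# Hypothetical effect estimates from five preclinical studies:
# each pair is (effect on surrogate outcome, effect on clinical outcome).
surrogate_effects = np.array([0.10, 0.25, 0.40, 0.55, 0.70])
clinical_effects = np.array([0.05, 0.22, 0.30, 0.48, 0.61])

# Pearson correlation between the two sets of effect estimates.
r = np.corrcoef(surrogate_effects, clinical_effects)[0, 1]
print(f"correlation across studies: r = {r:.2f}")
```

A single high correlation is not sufficient on its own: as the text notes, the relationship should hold across multiple studies and intervention types.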

If the surrogate outcome is the only pathway or the main pathway between the disease, intervention, and the clinical outcome (or its animal equivalent) (Figure 4), the surrogate outcome is likely to be a valid indirect surrogate outcome (Fleming and DeMets 1996).

This signalling question assesses whether the authors have provided evidence for the relationship between the surrogate outcome and the clinical outcome (or its animal equivalent). There is currently no guidance as to what a high level of association is in terms of determining the relationship between surrogate outcomes and the clinical outcomes (or its animal equivalent).

This question aims to go further than the evaluation of association between the surrogate outcome and the clinical outcome (or its animal equivalent). A simple association between a surrogate outcome and a clinical outcome may arise because the surrogate outcome is merely a good predictor. For example, sodium fluoride caused more fractures despite increasing bone mineral density, even though low bone mineral density is associated with increased fractures (Bucher, Guyatt et al. 1999). If a change in the surrogate outcome produced by a treatment results in a comparable change in the clinical outcome (or its animal equivalent), the surrogate outcome is likely to be a valid surrogate outcome (Figure 4). This change has to be consistent, i.e. most studies should show that a treatment results in a comparable improvement in the clinical outcome (or its animal equivalent). Note that there may not be a fully comparable change; for example, a 50% improvement in the surrogate outcome may result in only a 25% improvement in the animal equivalent of the clinical outcome. In such situations, it is possible to use the 'proportion explained' approach proposed by Freedman et al. (Freedman, Graubard et al. 1992), a concept which was extended to randomised controlled trials and systematic reviews by Buyse et al. (Buyse, Molenberghs et al. 2000). This involves calculating the association between the effect estimates of the surrogate outcome and the clinical outcome (or its animal equivalent) from the different trials or centres within a trial.

Failure to conduct a systematic review of preclinical studies prior to the start of the clinical research and presenting selective results to grant funders or patients is scientifically questionable, likely to be unethical, and can lead to delays in finding suitable treatments for diseases by investing resources in treatments that could have been predicted to fail (Cohen 2018, Ritskes-Hoitinga and Wever 2018). Therefore, this signalling question assesses whether the authors provide evidence from systematic reviews of preclinical animal research studies and clinical studies that the intervention or a similar intervention showed treatment effects that were similar in preclinical research studies and clinical studies in humans.

Sample size calculations are performed to control for random errors (i.e. to ensure that a difference of interest can be observed) and should be used in preclinical studies that involve hypothesis testing (for example, a study conducted to find out whether a treatment is likely to result in benefit). This signalling question assesses whether the authors have described the sample size calculations that justify the number of animals used to reliably answer the research question.

2.2 Did the authors plan and perform statistical tests taking the type of data, the distribution of data, and the number of groups into account?
The statistical tests that are performed depend upon the type of data (for example, categorical nominal data, ordinal data, continuous data, discrete data), the distribution of the data (for example, normal distribution, binomial distribution, Poisson distribution), and the number of groups compared. The authors should justify the use of statistical tests based on the above factors. The hypothesis testing should be pre-planned. This signalling question assesses whether the authors planned and performed statistical tests taking the type of data, the distribution of the data, and the number of groups compared into account.
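As a minimal sketch of how the choice of test can follow from these factors, the example below (using SciPy on simulated data) checks the normality assumption before choosing between a two-sample t-test and a Mann-Whitney U test for two independent groups; as noted above, the test should be pre-specified in the analysis plan, not selected after seeing the results.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
group_a = rng.normal(10.0, 2.0, size=30)   # simulated continuous outcomes
group_b = rng.normal(11.5, 2.0, size=30)

# Check the normality assumption in each group (Shapiro-Wilk test).
normal = all(stats.shapiro(g).pvalue > 0.05 for g in (group_a, group_b))

if normal:
    # Two independent, approximately normal groups: two-sample t-test.
    stat, p = stats.ttest_ind(group_a, group_b)
    test = "t-test"
else:
    # Non-normal data: rank-based Mann-Whitney U test instead.
    stat, p = stats.mannwhitneyu(group_a, group_b)
    test = "Mann-Whitney U"

print(f"{test}: p = {p:.4f}")
```

The thresholds, tests, and simulated data here are illustrative; the point is that the test must match the data type, distribution, and number of groups.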
The authors may use multivariable analysis (analysis involving more than one predictor variable) or multivariate analysis (analysis involving more than one outcome variable), although these terms are often used interchangeably (Hidalgo and Goodman 2013). Some assumptions about the data are made when multivariable analysis and multivariate analysis are performed (Casson and Farmer 2014, Nørskov, Lange et al. 2020), and the results are reliable only when these assumptions are met. Therefore, assessment of whether the authors have reported on the assumptions should be considered as part of this signalling question.

The authors may have also performed unplanned hypothesis testing after the data became available, which is a form of 'data dredging' and can be assessed in the next signalling question. The authors may also have made other changes to the statistical plan. This aspect can be assessed as part of signalling question 8.2.

2.3 Did the authors make adjustment for multiple hypothesis testing?
This signalling question assesses whether study authors have made statistical plans to account for multiple testing.
When multiple hypotheses are tested in the same research, statistical adjustments are necessary to achieve the planned alpha and beta errors. Testing more than two groups is a form of multiple testing: the statistical output usually adjusts for more than two groups. However, testing many outcomes is not usually adjusted for in the statistical software output and has to be adjusted manually (or electronically) using some form of correction. This is not necessary when the study authors have a single primary outcome and base their conclusions on the observations on that single primary outcome. However, when multiple primary outcomes are used, adjustments for multiple hypothesis testing should be considered (Streiner 2015). For example, if the effectiveness of a drug against cancer is tested by apoptosis, cell proliferation, and metastatic potential, the authors should consider statistical adjustments for multiple testing.
Multiple analyses of the data with the aim of stopping the study once statistical significance is reached, and data dredging (multiple unplanned subgroup analyses to identify an analysis that is statistically significant; other names include 'P value fiddling' or 'P-hacking'), are other forms of multiple testing and should be avoided (Streiner 2015). Methods for interim analysis to guide the stopping of clinical trials, such as sequential and group sequential boundaries, have been developed (Grant, Altman et al. 2005). Implementation of group sequential designs may improve the efficiency of animal research (Neumann, Grittner et al. 2017).
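One common way to implement such corrections, shown here as a sketch with made-up p-values for three outcomes, is the `multipletests` function from the statsmodels package, which supports Bonferroni, Holm, and other methods:

```python
from statsmodels.stats.multitest import multipletests

# Hypothetical unadjusted p-values for three outcomes tested on the same
# animals (e.g. apoptosis, cell proliferation, metastatic potential).
p_values = [0.012, 0.030, 0.045]

# Holm step-down correction; method="bonferroni" is the simpler alternative.
reject, p_adjusted, _, _ = multipletests(p_values, alpha=0.05, method="holm")

for raw, adj, sig in zip(p_values, p_adjusted, reject):
    print(f"raw p = {raw:.3f} -> adjusted p = {adj:.3f}, reject H0: {sig}")
```

With these illustrative values, only the smallest p-value survives the Holm correction at alpha = 0.05, showing how a conclusion based on unadjusted p-values could overstate the evidence.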

2.4 If a dose-response analysis was conducted, did the authors describe the results?
In pharmacological testing in animals, it is usually possible to test multiple doses of a drug. This may also apply to some non-pharmacological interventions, where one can test the intervention at multiple frequencies or durations (for example, exercise for 20 minutes versus exercise for 10 minutes versus no exercise). A dose-response relationship indicates that the effect observed is greater with an increase in the dose. Animal studies incorporating dose-response gradients were more likely to be replicable in humans (Hackam and Redelmeier 2006). This signalling question assesses whether the authors have reported the dose-response analysis if it was conducted.

2.5 Did the authors assess and report accuracy?
Accuracy is the nearness of the observed value (using the method described) to the true value. Depending upon the type of outcome, accuracy can be assessed by Kappa statistics, the Bland-Altman method, the correlation coefficient, the concordance correlation coefficient, the standard deviation, or the relative standard deviation. This signalling question assesses whether the authors have provided a measure of accuracy by using equipment for which accuracy information is available, or used a reference material (material with known values measured by accurate equipment) to assess accuracy.

2.6 Did the authors assess and report precision?
Precision, in the context of measurement error, is the nearness of values when repeated measurements are made in the same sample (technical replicates). The same methods used for assessing accuracy can be used for assessing precision, except that instead of using a reference material, the comparison is between the measurements made in the same sample. The width of confidence intervals can also provide a measure of precision. This signalling question assesses whether the authors have measured and reported precision.
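As a sketch of one such measure, the relative standard deviation (RSD; the standard deviation expressed as a percentage of the mean) of a set of hypothetical technical replicates can be computed as follows. What counts as acceptable precision depends on the assay and context.

```python
import statistics

# Hypothetical technical replicates: repeated measurements of the same sample.
replicates = [4.9, 5.1, 5.0, 5.2, 4.8]

mean = statistics.mean(replicates)
sd = statistics.stdev(replicates)          # sample standard deviation
rsd_percent = 100 * sd / mean              # relative standard deviation (%)

print(f"mean = {mean:.2f}, SD = {sd:.3f}, RSD = {rsd_percent:.1f}%")
```

The same calculation applied to biological replicates (samples from different parts of the tissue, or taken at different times) gives a measure of the sampling error discussed in the next question.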
2.7 Did the authors assess and report sampling error?
In some situations, errors arise because of the non-homogeneous nature of the tissues or change of values over time, for example, diurnal variation. The same methods used to assess accuracy can be used for assessing sampling error, except that instead of using a reference material, the comparison is between measurements made in samples from different parts of cancer/diseased tissue (biological replicates) or samples taken at different times. This signalling question assesses whether the authors have measured and reported sampling error.

2.8 Was the measurement error low or was the measurement error adjusted in statistical analysis?
This signalling question assesses whether the measurement errors (errors in one or more of accuracy, precision, and sampling error) were low or were reported as adjusted in statistical analysis. There are currently no universally agreed values at which measurement errors can be considered low. This will depend upon the context and the measure used to assess measurement error. For example, if the differences between the groups are measured in centimetres and the measurement error is non-differential (i.e. the error does not depend upon the intervention) and is a fraction of a millimetre, then the measurement error is unlikely to cause a major difference in the conclusions. On the other hand, if the measurement error is differential (i.e. the measurement error depends upon the intervention) or large relative to the effect estimates, then it has to be estimated and adjusted for during the analysis.
Measurement error can be adjusted for using special methods such as repeated-measures ANOVA, repeated-measures general linear models, regression calibration, moment reconstruction, or simulation extrapolation (Vasey and Thayer 1987, Carroll 1989).

Even if an animal model with good construct validity is chosen, biases such as selection bias, confounding bias, performance bias, detection bias, and attrition bias can decrease the value of the study.

When research involves repeated measurements, the measurement error is generally not reported or not taken into account during the analysis. This Delphi panel arrived at a consensus that measurement errors should be taken into account during the analysis if necessary and should be reported to enable an assessment of whether the preclinical research is translatable to humans.

We are now piloting this tool to improve it. This involves providing learning material to people willing to pilot the tool and requesting them to assess the clinical relevance of preclinical animal studies. Financial incentives are being offered for piloting the tool. We intend to pilot the tool with 50 individuals, including researchers performing or planning to perform preclinical or clinical studies. If the percentage agreement for classification of a domain is less than 70%, we will consider refining the question, explanation, or training by an iterative process to improve the agreement. The learning material is available at: https://doi.org/10.5281/zenodo.4159278. The tool can be completed using an Excel file, which is available at the same link.
Conclusions
We have developed a tool to assess the clinical relevance of preclinical studies. This tool is currently being piloted.

Acknowledgements

Figure 1. Overall process. The outline of the process is shown in this figure. A total of three rounds were conducted.
Consensus agreement was reached when at least 70% of panel members strongly agreed (scores of 7 or more) to include the domain or signalling question.