Practical approaches to improving translatability and reproducibility in preclinical pain research

Pain research continues to face the challenge of poor translatability of pre-clinical studies. In this short primer, we summarize the possible causes, with an emphasis on practical and constructive solutions. In particular, we stress the importance of increased heterogeneity in animal studies; formal or informal pre-registration to combat publication bias; and increased statistical training to help pre-clinical scientists appreciate the usefulness of available experimental design and reporting guidelines.


Introduction
Chronic pain remains one of the most common and urgent health issues, with low back pain and headache disorders being the 4th and 5th leading causes of disability in 25-49 year-olds worldwide (Vos et al., 2020). Over the past 60 years, research has made great strides in improving our understanding of the underlying biology of chronic pain, with inflammation and dysfunctional neuro-immune interactions thought to be significant driving factors (Calvo et al., 2019; Fiore et al., 2023; Hore and Denk, 2019).
And yet, surprisingly little of this work has translated into novel drug treatments. In fact, the two most widely used classes of pain-killers, opioids and non-steroidal anti-inflammatory drugs (NSAIDs), have been available since the 19th century: morphine was first isolated by Friedrich Sertürner in 1804 (Trang et al., 2015) and aspirin discovered by Bayer in 1897, with "newer" NSAIDs, like ibuprofen, first marketed in 1969 (Brune and Hinz, 2004; Jones, 2001). There have been a few new additions, but they are usually drugs that were moved across indications, such as gabapentin, an anti-epileptic which was approved for the treatment of neuropathic pain in 2002 (Wiffen et al., 2017). Arguably the most successful drug discovery efforts have been made in the field of headache disorders, with triptans developed in the 1980s and treatments based on calcitonin gene-related peptide (CGRP) approved just a few years ago (Ogunlaja and Goadsby, 2022). It is interesting to note that these targets were not initially discovered through work with animal models (Lassen et al., 2002), and that their mechanisms of action, especially those of anti-CGRP, are still not fully understood (Goadsby et al., 2017). In contrast, many drug trials which started with mechanistic evidence derived from animal studies, typically in rodents, have failed. This includes efforts to interfere with the function of neurokinin-1, glycine receptors, nerve growth factor and the sodium channel subunit Nav1.7 (Mogil, 2009).
Some have argued that drug development efforts should de-prioritize target-focused strategies in favor of phenotype-based screening approaches (Swinney and Anthony, 2011). However, with a complex condition like pain, where biomarkers are lacking, this is easier said than done. Moreover, biologics, one of the most successful drug classes of the past decade, are entirely based on identifying a specific target. The question therefore remains: can we identify and address the root causes that account for the limited translational value of animal models in the pain field?
There are likely to be several reasons, as already eloquently discussed in previous articles (Klinck et al., 2017; Mogil, 2009; Sadler et al., 2022; Scannell and Bosley, 2016). Obviously, there are outright species differences, e.g. in sensory neuron gene expression (Jung et al., 2023). Assuming these differences have been taken into account, reasons for the failure in translation can be broadly divided into four categories. Firstly, it is difficult to make compounds that have good target engagement and favorable pharmacokinetics. This was, for example, what hampered the success of the Nav1.7 blocker trialed by Pfizer (McDonnell et al., 2018; Mulcahy et al., 2019). Secondly, there are significant complexities associated with clinical trial design. For instance, the choice of patient group can be critical, as has been demonstrated in the case of oxcarbazepine, which appears to work much better as an analgesic in a particular sub-type of neuropathic pain (Demant et al., 2014). Thirdly, even if a drug has good efficacy, side effects can often be dose-limiting to the point of making the approach unviable, as was the case with the anti-NGF antibody tanezumab (FDA Briefing Document, 2021). Finally, there are significant issues with the quality of the pre-clinical work that is feeding into our drug development pipelines. Animal models have proved indispensable for toxicity screening and for elucidating basic causal pain mechanisms, but they have been less successful at forward-translating novel drug targets. This might be due to three persistent challenges that we face in pre-clinical pain research: lack of face validity, significant publication bias, and poor reproducibility. In the following, we will discuss each of these problems in turn, and propose possible solutions that could help improve pre-clinical pain research (Fig. 1).

Face validity of animal models
Several reviews, including large systematic meta-analyses (Soliman et al., 2021; Zhang et al., 2022) as well as examinations of historical data (Sadler et al., 2022), have demonstrated that animals used in pain research, largely rodents like mice and rats, have traditionally been young, inbred and male. This is in contrast to human populations living with chronic pain, who tend to be older and majority female. Moreover, the paradigms by which we induce pain, most commonly via traumatic nerve injury or injection of complete Freund's adjuvant (Sadler et al., 2022), are only very poor mechanistic approximations of the conditions they intend to model, e.g. entrapment neuropathy or inflammatory arthritis. They are also usually only studied in the short term, while in most people chronic pain persists over months and years. Finally, the outcome measures we use in the pain field are limited, and heavily skewed towards assessing evoked mechanical and thermal sensitivity thresholds. These are of poor clinical relevance and have more utility as stratification tools (Rice et al., 2018).
The predominant use of homogenous rodent models is an extreme example of complexity reduction, designed to achieve the high control required to establish a causal link between a single variable and an outcome. However, it can have the unintended consequence of decreasing reproducibility and reducing the chance of finding robust and generalizable effects. Increasing heterogeneity is a potential solution to this problem (Voelkl et al., 2020). We can do so within a laboratory, by actively incorporating biological variation into study designs (Festing, 2014; von Kortzfleisch et al., 2020); or we can conduct multi-centre experiments which capitalize on natural differences between labs (Voelkl et al., 2018). For example, Wodarski and colleagues (Wodarski et al., 2016) used the multi-centre approach to show that suppressed burrowing is a robust and reproducible outcome, supporting its use to infer the global effect of pain on rodents.
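To make the within-laboratory option concrete, the minimal Python sketch below shows one way biological variation might be deliberately built into a design and then modelled explicitly rather than eliminated. It is an illustration only: the read-out, strains, group sizes and effect sizes are all hypothetical assumptions, not recommendations.

```python
# A minimal sketch of a heterogenized factorial design: a few animals per
# cell are spread across sex and strain, and these factors are modelled
# explicitly. All variable names and effect sizes are hypothetical.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n_per_cell = 4  # few animals per cell, but many cells
cells = [(inj, sex, strain)
         for inj in ("sham", "injury")
         for sex in ("F", "M")
         for strain in ("B6", "BALB", "CD1")]
rows = []
for inj, sex, strain in cells:
    base = 10.0 - (3.0 if inj == "injury" else 0.0)        # large injury effect
    base += {"B6": 0.0, "BALB": 0.8, "CD1": -0.5}[strain]  # strain variation
    base += 0.4 if sex == "F" else 0.0                     # small sex modulation
    for _ in range(n_per_cell):
        rows.append(dict(threshold=base + rng.normal(0, 1.0),
                         injury=inj, sex=sex, strain=strain))
df = pd.DataFrame(rows)

# Estimate the injury effect while letting sex and strain contribute.
fit = smf.ols("threshold ~ injury * sex + strain", data=df).fit()
print(fit.summary())
```

The appeal of such a design is that a treatment effect which survives deliberate variation in sex and strain is more likely to generalize than one demonstrated in a single homogeneous cohort.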
How can we improve on the homogeneous nature of past animal work? We should strive to increase heterogeneity on many different levels, as has been widely suggested (Currie et al., 2019; Sadler et al., 2022; Soliman et al., 2021). However, until we have done more of this work, it remains unclear which variations will have the best cost-benefit ratio, or indeed which will have the most impact on a particular phenotype; for example, there are reports that mouse inter-individual variability can have a much greater effect on exploratory behaviours than sex or estrous cycle (Levy et al., 2023). To identify the variables that matter, we will have to spend time working with more heterogeneous rodent cohorts, varying age, sex and strain; we should diversify the species we use for pain research (Klinck et al., 2017), especially for conditions in which non-rodent models offer better face validity, e.g. sheep for osteoarthritis; we should model longer time courses and/or use aging models where feasible; and finally, we should diversify our outcome measures to include assessments of complex behaviors that can serve as proxy measures of spontaneous, non-evoked pain (Eisenach and Rice, 2022).
The latter is not an easy task. There have been many suggestions of novel behavioral testing paradigms, such as conditioned place preference, machine-vision paradigms and species-typical behaviors like burrowing and cage-lid hanging. However, so far, none of these have clearly 'won out', in that they are not yet widely adopted as standard measures within pain research. Given that the most popular behavioral tests, like von Frey or Hargreaves, involve significant subjective judgement from an individual experimenter, more automated analysis methods that mitigate observer bias should prove very helpful. On the other hand, care must be taken that more data, and thus an increase in the number of analyzable variables, does not inadvertently cause a multiple comparison problem.
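The multiple comparison problem can be illustrated with a minimal Python sketch; the read-out names are hypothetical, and the simulated experiment contains no true group difference in any of them.

```python
# Simulate a null experiment with ten behavioral read-outs and show how
# often uncorrected testing produces spurious 'hits'. Read-out names are
# hypothetical placeholders for an automated behavioral pipeline.
import numpy as np
from scipy import stats
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(0)
readouts = ["distance", "rearing", "grooming", "burrowing", "hanging",
            "sleep", "gait", "licking", "posture", "nesting"]

# No true difference: both groups are drawn from the same distribution.
pvals = [stats.ttest_ind(rng.normal(size=8), rng.normal(size=8)).pvalue
         for _ in readouts]

print("uncorrected 'hits':", sum(p < 0.05 for p in pvals))
rejected, p_adj, _, _ = multipletests(pvals, alpha=0.05, method="holm")
print("Holm-corrected 'hits':", rejected.sum())
```

With ten uncorrected read-outs, the chance of at least one spurious 'hit' at an alpha of 0.05 is roughly 40 %; a standard correction such as Holm's keeps the family-wise error rate at the nominal level.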
There is also an additional challenge, specifically faced by those interested in neuro-immune interactions. Pain is a cardinal sign of inflammation. It is therefore crucial that we fully investigate and understand the local peripheral inflammatory environment in chronic pain conditions. In diseases like osteo- and rheumatoid arthritis, where human tissue is more readily available, we can then use this knowledge to inform our pre-clinical animal models. However, in many other chronic pain conditions, like back pain or diabetic neuropathy, there is very limited access to relevant human tissues. Animal models have therefore often been developed 'blind' to neuro-immune interactions and instead focused solely on features of neuronal hypersensitivity. This is a significant weakness that needs to be addressed through future interdisciplinary collaborations (Renthal et al., 2021). For instance, immunologically, a neuropathy arising from external surgical trauma (as induced in animal models) is not at all comparable to one arising from sterile nerve entrapment. Consequently, the local immune environment that peripheral nerves find themselves in is likely to be very different in these two examples.

Publication bias
Once an animal experiment has been conducted, it is unfortunately often destined to end up in a file drawer rather than in a scientific journal. This leads to significant publication bias in the pre-clinical sciences, as demonstrated in various meta-analyses which indicate that data confirming the null hypothesis are oddly absent from the literature (van der Worp et al., 2010). For instance, out of 525 preclinical papers on stroke, only 1.2 % reported no significant findings (Sena et al., 2010). Moreover, a systematic review of pre-clinical studies of pregabalin reported that the literature might overestimate its analgesic effects by 27 % due to publication bias (Federico et al., 2020). This figure is very similar to one identified by Currie and colleagues, whose work on chemotherapy-induced peripheral neuropathy indicated that missing experiments might have decreased the estimate of pre-clinical intervention effects by 28 % (Currie et al., 2019). However, not all pain preclinical systematic reviews have been able to identify and quantify potential publication bias (Soliman et al., 2021), likely due to methodological limitations (Zwetsloot et al., 2017).
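The mechanism behind such overestimation is easy to demonstrate. The following minimal Python simulation is illustrative only: the true effect, group size, and the assumption that only significant experiments reach print are arbitrary choices, not reconstructions of the analyses cited above.

```python
# Illustrative simulation: many small experiments on the same modest true
# effect, but only the 'significant' ones are published. The published
# literature then overstates the effect. All parameters are arbitrary.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
true_d, n = 0.5, 10          # modest true effect, small groups
published, all_effects = [], []
for _ in range(5000):
    a = rng.normal(0.0, 1.0, n)
    b = rng.normal(true_d, 1.0, n)
    d_obs = (b.mean() - a.mean()) / np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
    all_effects.append(d_obs)
    if stats.ttest_ind(b, a).pvalue < 0.05:   # only 'positives' get published
        published.append(d_obs)

print(f"true d: {true_d}")
print(f"mean d, all experiments:       {np.mean(all_effects):.2f}")
print(f"mean d, published experiments: {np.mean(published):.2f}")
```

Under these assumptions, the published experiments report a mean standardized effect roughly double the true one, purely because the non-significant replications never appear.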
While it is not always easy to measure, publication bias is likely to have a substantial adverse impact on our ability to interpret past preclinical results. This risk has been recognized since the 1970s, with psychologist Anthony Greenwald noting on the same subject: "…the research-publication system may be regarded as a device for systematically generating and propagating anecdotal information." (Greenwald, 1975). Unfortunately, modelling data confirm Greenwald's fears: Nissen and colleagues convincingly demonstrate that publication of data confirming the null hypothesis is essential for rejecting false facts and preventing them from being canonized as true (Nissen et al., 2016). There are powerful real-life illustrations of how long and how strongly false-positive results can prevail in a world biased towards significant effects. For example, it is estimated that there are at least 400+ studies on the association between the serotonin transporter gene SLC6A4 and depression, many of which reported significant results (Border et al., 2019). Critical voices appeared early on, questioning the methodology of the typical candidate gene analyses that led to these results (Colhoun et al., 2003); but it has taken nearly 20 years for studies to emerge that specifically refute the link between SLC6A4 and depression, using large, well-powered cohorts or meta-analysis methods. In the meantime, any casual reader of this literature would have assumed that there is a lot of evidence in favor of the connection.
How can we combat publication and outcome reporting bias? One obvious solution is pre-registration. It provides transparency and enables comparison of the completed study with what was initially planned. It also has the additional benefit of helping to avoid duplication. When pre-registration was made a legal requirement for clinical trials, it immediately affected the publication landscape: before 2000, when trials were not routinely logged on clinicaltrials.gov, 57 % of studies claimed that their primary outcome was significant; after 2000, this figure dropped to 8 % (Kaplan and Irvin, 2015). Similar benefits of pre-registration have been shown in psychology and neuroscience research (Soderberg et al., 2021). With pre-clinical work, it may seem unnecessarily laborious to pre-register, but we would argue that it is an important ethical matter: its benefits to the scientific community far outweigh the inconvenience to individuals. Systematic reviews and hypothesis-testing pre-clinical studies, e.g. those elucidating the effect of analgesic compound X on pain-like behaviors, should be registered. Indeed, there are now dedicated repositories for just this purpose, e.g. the Open Science Framework (OSF), PreclinicalTrials.eu, or PROSPERO. Many journals also offer the new format of "Registered Reports", where you submit your study design and analysis plan for review before any data are generated. Leading pain journals, as well as Brain Behavior and Immunity, have yet to introduce this category and are thus presented with an easy opportunity to support the fight against reporting bias.
For those who are conducting hypothesis-generating work, it is still very useful to set up a scientific and statistical analysis plan that can then be registered informally, either by uploading it to a private repository on OSF, or simply time-stamping it as a pdf. This practice will help scientists to more clearly identify the model they are investigating, think about the size of the effect they are likely to observe (see below) and identify whether there could already be a hypothesis-testing element within a larger hypothesis-generating project.
Beyond this, we all need to make a concerted effort to publish as many of our experimental results as possible, whether they support the null hypothesis or not (Andrews et al., 2016). This is a goal that has also received increasing traction from funding agencies over the past several years. For example, several UK funders now have their own repositories, like Wellcome Open Research or the NC3Rs Gateway, that encourage deposition of null data. Ultimately, however, in our capacity as peer-reviewers, we all have to work together to achieve a cultural shift: basing our recommendations for publication not on apparent novelty, but rather on the rigor and quality of experimental design, execution and reporting practices.

Reproducibility
A final significant issue that limits pre-clinical translatability is that many of the published 'positive' results are not fully reproducible. This problem has been discussed specifically in the context of pain research, but clearly spreads far beyond this narrow specialty to include general neuroscience (Button et al., 2013), cancer research and, of course, psychology (Open Science Collaboration, 2015), where some have even gone as far as arguing that it should no longer be viewed as a quantitative discipline (Yarkoni, 2022).
Reasons for poor reproducibility include a lack of pre-defined hypotheses, statistical shortcomings, poor experimental design (e.g. lack of blinding, inadequate controls, failure to verify the success of drug delivery in pharmacological studies) and poor reporting. In an interdisciplinary space, another common source of irreproducibility can be the imperfect transfer of techniques from one field to another. For example, knowledge of what a high-quality flow cytometry experiment looks like is still somewhat limited in the neuroscience community, while immunologists understandably may have a hard time assessing neuroscience results, e.g. immunostaining of cortical regions.
What are some possible solutions? In terms of improving the reproducibility of interdisciplinary science, cross-field reviewing and research collaborations across field-specific boundaries are absolutely essential and should be supported by funders. Indeed, agencies are starting to recognize this, with interdisciplinary calls becoming much more common.
In terms of more generic barriers to reproducibility, a lot of prior work has gone into making practical suggestions, including the generation of the ARRIVE 2.0 reporting guidelines (Percie du Sert et al., 2020), the PREPARE checklist for planning animal research (Smith et al., 2018) and a recently published formal framework for Enhancing Quality in Preclinical Data (EQIPD). EQIPD makes suggestions on broad experimental principles, e.g. the need to pre-define a hypothesis and statistical analysis plan, as well as the utmost importance of randomization and blinding.

Discussion and a call for better training
It seems, therefore, that we are not short of solutions. We are short of people who implement them. For example, the ARRIVE guidelines, originally developed in 2010, are the most comprehensive checklist for the conduct and reporting of animal experiments. However, a randomized controlled trial found that journal-requested completion of an ARRIVE checklist did not improve compliance with the guidelines, suggesting that editorial policy alone is not sufficient to improve comprehensive and transparent reporting (Hair et al., 2019). One barrier to uptake may be education: better training in, and greater availability and accessibility of, other tools designed to improve research quality are likely to facilitate implementation of the ARRIVE guidelines and ultimately improve the value of pre-clinical research over time (Vollert et al., 2020).
Pre-clinical scientists often assume that since they are conducting hypothesis-generating work, pre-planning of experimental design and statistics is not all that relevant for them. Many are simply unaware of how much a fluid and flexible design will affect the confidence level that one can have in any given result. In fact, a typical pre-clinical paper tends to be seen as reliable if it tests a model with a great variety of complex techniques, e.g. behavior, electrophysiology, immunostaining and Western blotting. If these varied sets of data are reasonably well executed and support a particular narrative, we are often quick to accept this: a new promising 'target X' is born, pursued in future studies and drug development pipelines, and highlighted in reviews and press releases.
However, this is a very risky approach. Given current norms around data reporting and statistics in hypothesis-generating studies, it is often unclear how often an experiment was repeated within a single modality. As in, how many times did the authors obtain a particular Western blot result? How often was a behavioral study conducted? This information is crucial. Without it, we are left with 1) significant selection bias, i.e. it is easy to forget or explain away when one blot in a series of three repeats does not quite support what we hope to see; and 2) a high risk of false positives in each modality, due to the small sample sizes we all use. We therefore risk ending up with a series of underpowered and biased experiments which, as we know from meta-analyses, do not suddenly add up to a large, confident whole.
There are many other examples of failures in statistical training, ranging from incorrect selection of parametric tests to the omission of multiple comparison corrections when they are required, e.g. when conducting 10 different behavioural tests or interrogating 7 different immune cell populations via flow cytometry. But there are even more basic failures, for instance when we consider the typical level of understanding of effect sizes in the pre-clinical space. The crucial distinction between 'observed' and 'true' effects is often entirely lost, with people basing their power calculations on a single observed effect derived from one small pilot experiment. This is a terrible practice, for reasons that are well explained elsewhere (Albers and Lakens, 2018), as it risks leading to an endless series of underpowered experiments.
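A small Python sketch can make the pilot-effect pitfall tangible. All numbers are hypothetical assumptions: ten pilot experiments are simulated from the same true effect, and each pilot's observed effect size is fed into a standard power calculation.

```python
# Illustrative only: a pilot's observed Cohen's d is a noisy estimate of
# the true d, so the 'required' sample size it implies varies wildly.
import numpy as np
from statsmodels.stats.power import TTestIndPower

rng = np.random.default_rng(7)
power = TTestIndPower()
true_d, pilot_n = 0.8, 6     # assumed true effect; tiny pilot groups

ns = []
for _ in range(10):
    a = rng.normal(0.0, 1.0, pilot_n)
    b = rng.normal(true_d, 1.0, pilot_n)
    d_obs = (b.mean() - a.mean()) / np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
    if d_obs <= 0:
        continue             # a tiny pilot can even point the 'wrong' way
    n_req = power.solve_power(effect_size=d_obs, power=0.8, alpha=0.05)
    ns.append(round(n_req))

n_true = power.solve_power(effect_size=true_d, power=0.8, alpha=0.05)
print("n/group implied by ten pilots:", ns)
print("n/group for the true effect:  ", round(n_true))
```

Because each pilot's observed d scatters widely around the true value, the sample sizes it suggests scatter just as widely, and any pilot that happened to overestimate the effect will lead straight to an underpowered main study.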
Worse still, pre-clinical scientists often fail to consider whether their experimental design is suitable for the kinds of effect sizes they are likely to observe. Sex differences are a great example of this. There is no doubt that there will be sex differences in neuro-immune interactions: after all, we know from decades of work in immunology that women have a stronger adaptive immune response (Klein and Flanagan, 2016). However, how big do we expect this effect to be? Presumably, in most cases sex is a smaller modulatory factor that influences a larger effect; e.g. an inflammogen within a joint will cause significant pain, but this large effect might increase or decrease slightly in size depending on sex. Continuing with this example, most pre-clinical pain studies are well-powered to observe only very large effects, such as a Cohen's d of 1.3 or above. Let's assume that the effect of a pain-inducing injury or of a putative analgesic drug is indeed truly this large. In most cases, we would then expect the modulatory effect of sex on this injury or analgesic drug to be smaller, i.e. to require a larger sample size to detect. And yet, the pre-clinical literature abounds with very small-scale studies (n = 6-12) on sex differences in rodents. Their results are frequently debated at length, especially when their conclusions are conflicting. In reality, it is quite likely that none of these studies are powered to reliably detect sex differences in neuro-immune interactions with our current pre-clinical experimental designs. If so, we may be spending time and money debating spurious statistical noise.
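A rough, purely illustrative calculation shows the scale of the problem, assuming two-sided two-sample t-tests at an alpha of 0.05 and 80 % power; the effect sizes are assumptions in line with the argument above.

```python
# Required group sizes for a large main effect versus a smaller
# modulatory (e.g. sex) effect. Effect sizes are assumed, not measured.
from statsmodels.stats.power import TTestIndPower

power = TTestIndPower()
for label, d in [("main effect", 1.3), ("sex modulation", 0.5)]:
    n = power.solve_power(effect_size=d, power=0.8, alpha=0.05)
    print(f"{label:15s} d = {d}: ~{round(n)} animals per group")
```

Under these assumptions, an n of 6-12 per group is adequate for an effect of d ≈ 1.3, but falls far short of the roughly 64 animals per group needed to detect a modulatory effect of d ≈ 0.5.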
How can we improve on current training? Statistical educational material has greatly improved over the past several years, with many authors making a very complex and difficult branch of mathematics conceptually accessible to biologists. There are online textbooks (Lakens, 2022), lecture series, and introductory books for the general public (Spiegelhalter, 2019). Getting pre-clinical scientists to engage with these materials may be one way to help motivate them to adopt the many wonderful guidelines for experimental design and reporting that have been developed over the past decade, such as EQIPD or ARRIVE 2.0.

Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.