Are changes in nociceptive withdrawal reflex magnitude a viable central sensitization proxy? Implications of a replication attempt

OBJECTIVE
The nociceptive withdrawal reflex (NWR) has been proposed as a read-out of central sensitization (CS). Replicating a published study, it was assessed whether the NWR magnitude reflects sensitization by painful heat. Additionally, NWR response rates were compared for two stimulation sites, the sural nerve at the lateral malleolus (SU) and the medial plantar nerve on the foot sole (MP), and three recording sites, the biceps femoris (BF), rectus femoris (RF), and tibialis anterior (TA) muscles.


METHODS
16 subjects underwent one experiment with six blocks of eight transcutaneous electrical stimulations to elicit the NWR while surface electromyography was collected. Tonic heat was concurrently applied in the same dermatome. Temperatures rose from 32 °C in the first to 46 °C in the last block following the previously published protocol.


RESULTS
Tonic heat did not influence NWR magnitude. The highest NWR response rate was obtained for the MP-TA combination (79%). Regarding elicitation in all three muscles, SU stimulation outperformed MP (59% vs 57%).


CONCLUSIONS
The replication failed. NWR magnitude as a CS proxy in healthy subjects needs continued investigation. With respect to response rates, MP-TA proved efficient, whereas SU stimulation seemed preferable for multiple muscle recordings.


SIGNIFICANCE
Unclear methodological descriptions in the original study affected CS and NWR replication. The NWR magnitude changes induced by CS may closely depend on the different stimulation methods used.


Introduction
The nociceptive withdrawal reflex (NWR) is a polysynaptic spinal response to noxious stimulation. Activation of nociceptive Aδ and C fibres, and of mechanosensitive Aβ fibres, initiates the reflex pathway, prompting an involuntary muscle response of flexors and extensors that withdraws the affected limb (Pierrot-Deseilligny and Burke, 2012). It is used to study nociceptive processing at the spinal level in healthy human participants and in chronic pain patients (Sandrini et al., 2005; Sklarevski and Ramadan, 2002; Manresa et al., 2011; Willer, 1977; Dowman, 1991; Wallwork et al., 2017).
As a spinal read-out, the NWR is an interesting tool to investigate central sensitization (CS) (Manresa et al., 2010; Manresa et al., 2013; Arendt-Nielsen et al., 2018; Sandrini et al., 2005). CS is defined by the International Association for the Study of Pain (IASP) as an increased responsiveness of nociceptive neurons in the central nervous system to their normal or subthreshold afferent input (Loeser and Treede, 2008). It is considered to be pathophysiologically important in chronic pain (Arendt-Nielsen et al., 2018; Woolf, 2011). Because direct neuronal recordings are not possible in humans (Arendt-Nielsen et al., 2018), proxies of CS need to be assessed. The NWR has been proposed as a read-out of ongoing CS processes (Manresa et al., 2010; Curatolo et al., 2015; Leone et al., 2021; Sandrini et al., 2005). Similar to established CS proxies, such as secondary hyperalgesia or dynamic mechanical allodynia (Quesada et al., 2021; Woolf, 2011), the NWR can be influenced by supraspinal facilitation or inhibition (Bjerre et al., 2011; Terry et al., 2016; Willer et al., 1979). Thus, it is crucial to choose a measure for the NWR with high test-retest reliability in spite of potential supraspinal influences.
Most studies using the NWR after experimental CS induction focus on effects on the NWR threshold I₀ (Manresa et al., 2014; Linde et al., 2021; Leone et al., 2021). It is one of the most studied characteristics of the NWR because changes in I₀ potentially reflect altered nociceptive processing, especially in clinical populations (Sandrini et al., 1993; Arendt-Nielsen et al., 1994; Edwards et al., 2007; Manresa et al., 2011; Linde et al., 2021). However, conflicting results regarding within-subject test-retest reliability of I₀ have been reported for studies with healthy subjects (Manresa et al., 2014; Linde et al., 2021; Neziri et al., 2010; French et al., 2005). Therefore, I₀ might not be the best choice to assess the effects of CS induction. Instead, the NWR magnitude, also called reflex size, might be a viable alternative because it has been observed to remain stable in test-retest studies (Jurth et al., 2014). In addition, it has been observed to reflect CS induction (Manresa et al., 2010; Ellrich and Treede, 1998). In particular, one study reported effects on NWR magnitude using noxious heat, which can be interpreted as reflecting successful CS induction (Ellrich and Treede, 1998). In contrast, a recently published study using capsaicin in combination with noxious heat for CS induction reported no effect on the NWR magnitude. In both studies, the NWR magnitude had been quantified by the area under the curve (AUC) of the sEMG recordings. To help clarify the controversial results, the primary objective of the present study was to replicate the previously published results (Ellrich and Treede, 1998), where a thermally induced sensitized state had changed the NWR magnitude.
In that study, the NWR had been elicited by transcutaneous electrical stimulations to the medial plantar nerve (MP) on the foot sole proximal to the big toe and recorded from the ipsilateral tibialis anterior muscle (TA) by surface electromyography (sEMG) (Ellrich and Treede, 1998). Stimulation to a distal nerve and sEMG recording from one of the muscles involved in the withdrawal is a practical approach to using the NWR in experimental studies. Because of its simplicity, it can be integrated into designs involving outcome measures from different modalities (Rhudy et al., 2020). More advanced approaches to studying the NWR are also used, such as muscle synergy analyses (Jure et al., 2019). However, their integration into a multimodal design is more challenging than just combining one stimulation and one recording site. Apart from MP-TA (Meinck et al., 1985), another typical combination uses the sural nerve (SU) dorsal to the lateral malleolus as the stimulation and the biceps femoris muscle (BF) as the recording site (Hugon, 1973; Willer, 1977), although other combinations are used as well, some of which alternatively stimulate free nerve endings instead of trunks (Sandrini et al., 2005; Jensen et al., 2015; Jure et al., 2020).
For MP-TA, I₀ exhibits higher between-session test-retest reliability than for SU-BF (Jensen et al., 2015). However, a high between-session reliability of I₀ does not guarantee successful repetitive elicitations within experimental sessions, when the NWR is to be elicited repeatedly at suprathreshold currents (Willer, 1977; Sklarevski and Ramadan, 2002; Sandrini et al., 2005). Conversely, a stimulation and recording site combination might be reliable for elicitations but not for I₀ determination. In contrast to I₀, the NWR response rate has not been extensively studied, except for the so-called "habituation" effect (Dimitrijević et al., 1972; Arendt-Nielsen et al., 1994; Sandrini et al., 2005). However, habituation observations focus only on the declining NWR amplitudes of successful elicitations over time and not on the number of reflexes actually elicited. Therefore, the secondary objective of the present study was to determine the stimulation and recording site combination yielding the highest NWR response rate, i.e. the highest percentage of NWR elicitations.
To address both research objectives, the experimental paradigm from the published study was replicated not only with the MP-TA combination of the original design, in which the NWR was repeatedly elicited at MP and noxious heat applied to the same dermatome resulted in an NWR magnitude increase at the TA (Ellrich and Treede, 1998), but also with the SU-BF combination. In order to compare more than just two combinations, sEMG recordings were in both cases collected simultaneously from three muscles, the TA, the BF, and the rectus femoris muscle (RF), leading to a total of six comparable combinations.
Based on the results of the original study (Ellrich and Treede, 1998), it was expected that the NWR magnitude would increase for MP-TA and SU-BF between the baseline temperature and the two highest temperatures. For the secondary objective, it was hypothesized, because of the reported high I₀ reliability for MP-TA (Jensen et al., 2015), that (i) MP-TA would show the highest NWR response rate and (ii) stimulations at MP would trigger a higher NWR response rate in all three muscles compared to SU.

General study design
The study used a within-subject experimental design closely based on a previously published study (Ellrich and Treede, 1998). The original design was reproduced as faithfully as possible using all the published information. Changes were only made if deemed necessary in order to avoid biasing the results. All changes are discussed in detail in the supplementary material A. Study approval was obtained from the ethics board of the Canton of Zurich. All participants underwent a single experimental session that consisted of two identical testing procedures, one for each stimulation site (SU and MP) in counterbalanced order. Participants were pseudorandomly tested either in the morning or in the afternoon to minimize a potential influence of time of day on the results (Labrecque and Vanier, 1995; Sandrini et al., 2005).

Participants
Participants were healthy, pain-free adults who gave written informed consent prior to inclusion according to the Declaration of Helsinki. Exclusion criteria were the presence of any low back pain on more than two consecutive days within the last year, a serious disease, e.g. heart failure or clinical depression, any pain condition, or inability to tolerate the applied electrical stimulations or heat stimuli.

Sample size
The previously reported results corresponded to effect sizes (calculated as Cohen's d (Cohen, 1988)) of 0.7, 1, and 2.4, respectively (Ellrich and Treede, 1998). Using an effect size of 0.85, a significance level of 0.05, and a power of 0.8, the necessary sample size was determined to be 13. It was decided to recruit 17 participants to account for potential dropouts or unusable data.
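The reported sample size can be checked with a short calculation. The following is a minimal sketch, assuming a two-sided paired t-test and a normal approximation with a small-sample correction; the text does not state which test family or software call was used, so the function name and the formula choice are illustrative assumptions only.

```python
from math import ceil
from statistics import NormalDist


def paired_t_sample_size(d, alpha=0.05, power=0.80):
    """Approximate number of participants for a two-sided paired t-test.

    Assumption: normal approximation plus the correction term z_alpha^2 / 2,
    which compensates for using z instead of t quantiles.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for two-sided alpha
    z_power = z.inv_cdf(power)          # quantile for the desired power
    n = ((z_alpha + z_power) / d) ** 2 + z_alpha ** 2 / 2
    return ceil(n)


# Effect size, significance level and power as stated in the text
print(paired_t_sample_size(0.85, alpha=0.05, power=0.80))  # 13
```

Under these assumptions the approximation agrees with the stated sample size of 13.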

Participant setup
For the duration of the experiment, participants were in a semisupine position on a test bed, with the back rest inclined at 60° and their legs comfortably positioned. To ensure muscle relaxation in the test leg, a rolled-up towel was placed underneath the popliteal fossa. All tests were performed on the right leg. During the tests, participants were asked to read a newspaper article in order to be distracted from the test stimulations, as a distraction-related facilitation of the NWR had been shown (Bjerre et al., 2011; Terry et al., 2016).

Electrical stimulation and sEMG recordings
Transcutaneous constant current stimulation and ipsilateral sEMG recordings were performed using a Dantec® Keypoint© G4 EMG/EP Workstation (Natus Medical Inc, San Carlos CA, US). The recording time window extended from 120 ms pre- to 380 ms post-stimulation. Fig. 1 shows a schematic of the stimulation and recording setup during the experiment. On each of the three recording sites, two Ambu® (Ambu A/S, Copenhagen, DK) Blue Sensor NF-50 1.5 mm surface electrodes were placed, one on the muscle belly and one on the nearest reference point, i.e. near the distal tendon insertion for BF and RF and on the shin for TA. Before attaching the electrodes, the skin was cleaned with ethanol and exfoliated with Nuprep® gel (Weaver and Company, Aurora CO, USA) to achieve impedances < 5 kΩ. For SU stimulation, two Ambu® Neuroline 700 2 mm surface electrodes were placed 2 cm apart, whereas for MP stimulation one Ambu® Neuroline 700 2 mm surface electrode was placed as the cathode on the foot sole proximal to the medial side of the ball of the foot and one ValuTrode® (Axelgaard Manufacturing Co., Lystrup, DK) VTX 5 cm surface electrode as the anode on the dorsum of the foot.
Every electrical stimulation consisted of a train of five rectangular stimuli of 1 ms duration delivered at 200 Hz. For each of the two stimulation sites, eight blocks of 90 s were performed with pauses of 300 s between blocks (Fig. 2). During the 90 s, electrical stimulations were applied. The first two blocks served to determine I₀ in either the TA (for MP stimulation) or the BF (for SU stimulation). The TA and BF muscles were chosen for I₀ determination because they are the most commonly used muscles to record the NWR after MP or SU stimulation, respectively (Sandrini et al., 2005; Jensen et al., 2015). I₀ was defined as the higher of the single stimulation and 3-stimulation thresholds because it has been observed that this definition yields high reliability of I₀ determination (Terry et al., 2001; Terry et al., 2016; Rhudy et al., 2020). The single stimulation threshold is the lowest current to reliably elicit a reflex with a single electrical stimulation, as defined above. The 3-stimulation threshold refers to a triplet of identical electrical stimulations delivered at 2 Hz: it is the lowest current to reliably elicit a reflex with the third stimulation of such a triplet (Terry et al., 2001; Terry et al., 2016; Rhudy et al., 2020). In both cases, a single ascending staircase was used with increments of 1 mA, starting at 1 mA until the first reflex was reached and continued for two more increments for validation. The order of these first two blocks, i.e. starting either with the determination of the single or the 3-stimulation threshold, was counterbalanced across subjects. The subsequent six test blocks employed eight single stimulations with a random interstimulation interval of 5 to 15 s to avoid predictability, which has been shown to modulate the NWR response (Jure et al., 2020). The eight stimulations were delivered at eight different currents, starting from I₀ − 4 mA to I₀ + 1 mA in steps of 1 mA, followed by one stimulation at 1.5·I₀ and one stimulation at 2·I₀, always in that order.
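The threshold definition and the current sequence of one test block can be written down compactly. The sketch below is illustrative only: the function names are hypothetical, and it does not address how currents were handled when I₀ was below 4 mA, which the text does not specify.

```python
def reflex_threshold(single_ma, triple_ma):
    """I0 as the higher of the single and 3-stimulation thresholds (mA)."""
    return max(single_ma, triple_ma)


def block_currents(i0_ma):
    """Eight currents (mA) of one test block, in the order of delivery."""
    stepped = [i0_ma + offset for offset in range(-4, 2)]  # I0-4 ... I0+1
    scaled = [1.5 * i0_ma, 2.0 * i0_ma]                     # 1.5*I0, 2*I0
    return stepped + scaled


def suprathreshold(currents, i0_ma):
    """Currents at or above I0, i.e. those entering the main analysis."""
    return [c for c in currents if c >= i0_ma]


i0 = reflex_threshold(single_ma=7.0, triple_ma=8.0)  # example values only
currents = block_currents(i0)
print(currents)                      # [4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 12.0, 16.0]
print(suprathreshold(currents, i0))  # [8.0, 9.0, 12.0, 16.0]
```

Each block thus contains four subthreshold and four suprathreshold stimulations, which is why half of all recorded traces enter the suprathreshold analysis.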
Participants were instructed to rate each stimulation on a numerical scale ranging from 0 (no sensation) to 200 (maximum tolerable pain), with 100 corresponding to the pain threshold.

Heat stimuli
The electrical stimulations were applied during the application of 90 s tonic heat stimuli using a 3 cm × 3 cm computer controlled Peltier thermode (TSA II, Medoc, St. Ramat Yishai, IL). Fig. 2 shows the heat stimuli procedure following the previously published design (Ellrich and Treede, 1998). The respective target temperatures of the six test blocks were 32, 36, 39, 42, 45, and 46 °C, always in that order. During the two I₀ determination blocks the temperature was kept at the 32 °C baseline temperature.
The thermode was attached proximal to the site of electrical stimulation, i.e. on the lateral lower leg, proximal to the ankle for SU and on the sole of the foot for MP stimulation, respectively (Fig. 1). These positions were chosen following the previously published design (Ellrich and Treede, 1998) and to ensure that the tonic heat stimulus would be applied in the same dermatome as the electrical stimulations, L5 and S1 respectively (Downs and Laporte, 2011). The thermode remained attached until all blocks for the first site had been performed and was then switched to the other site. For each of the six target temperatures, ascending and descending ramps of 8 °C/s were used. The temperature returned to baseline (32 °C) for the 300 s pause in between blocks. After every block, participants were asked whether they considered the applied temperature as painful or not painful.

Data analysis
The sEMG signals were sampled at 48 kHz and downsampled to 6 kHz, rectified, band-pass filtered from 10 Hz to 500 Hz and amplified up to 125 times. Traces from 120 ms pre- to 380 ms post-stimulation were automatically saved for all applied stimulations into separate text files (txt), with the stimulation number logged for every stimulation. All suprathreshold stimulations, i.e. those applying currents ≥ I₀, were further analysed, i.e. 2304 of 4608 traces (= 8 stimuli × 6 blocks × 3 muscles × 2 stimulation sites × 16 subjects). The time window of interest for NWR occurrence extended from 60 to 150 ms post-stimulation. The lower bound of 60 ms was chosen because it corresponds to the reported earliest onset for the NWR-defining activation of Aδ fibres in similar studies (Andersen, 2007). The upper bound was set at 150 ms to avoid contamination of the reflex response by voluntary muscle activation (Dowman, 1992).
A reflex was computationally judged to be present if the NWR interval mean z-score (Z) was ≥ 1.4. Z was defined as the difference between the mean of the 60 to 150 ms post-stimulation amplitude and the mean of the −110 to −20 ms pre-stimulation amplitude divided by the standard deviation of the −110 to −20 ms pre-stimulation amplitude. The Z criterion is a well validated approach to automatically detect the NWR (Rhudy and France, 2007). However, it is susceptible to contamination of the sEMG by spontaneous muscle activation. Visual inspection of the traces is necessary to eliminate false classifications (Rhudy and France, 2007; France et al., 2009), although alternative methods have been proposed (Jensen et al., 2013). Thus, all sEMG traces were also visually inspected by the two investigators performing the tests (AG and ACG) to eliminate any potential false negative or false positive automatic identifications. When either of the investigators disagreed with the Z criterion, the traces were given to two expert raters (MH and MS) who had not been involved in the data collection. The independent assessments of the two expert raters were compared. In case of disagreements, the respective traces were given to an external expert rater to act as an arbiter. In total, 113 (4.9%) out of the analysed 2304 traces were identified as either false negative (110, i.e. 4.8%) or false positive (3, i.e. 0.1%).
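The automatic Z criterion described above can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: the trace is assumed to be a list of rectified amplitudes sampled at 6 kHz that starts 120 ms before stimulation, the sample standard deviation is used, and all names are hypothetical.

```python
from statistics import mean, stdev

FS_HZ = 6000   # sampling rate after downsampling, as stated in the text
PRE_MS = 120   # assumed: each trace starts 120 ms before stimulation onset


def window(trace, start_ms, end_ms, fs_hz=FS_HZ, pre_ms=PRE_MS):
    """Samples between start_ms and end_ms relative to stimulation onset."""
    lo = round((start_ms + pre_ms) * fs_hz / 1000)
    hi = round((end_ms + pre_ms) * fs_hz / 1000)
    return trace[lo:hi]


def interval_z(trace):
    """NWR interval mean z-score: (response mean - baseline mean) / baseline SD."""
    baseline = window(trace, -110, -20)
    response = window(trace, 60, 150)
    return (mean(response) - mean(baseline)) / stdev(baseline)


def reflex_present(trace, threshold=1.4):
    """Automatic classification only; the study added visual inspection."""
    return interval_z(trace) >= threshold


# Synthetic example: 500 ms of alternating baseline noise, with and without a burst
n_samples = 500 * FS_HZ // 1000
quiet = [1.0 if i % 2 == 0 else 2.0 for i in range(n_samples)]
burst = [v + 5.0 if 1080 <= i < 1620 else v for i, v in enumerate(quiet)]
print(reflex_present(quiet), reflex_present(burst))
```

Because the criterion depends only on the pre-stimulation baseline, spontaneous muscle activity in either window can shift Z, which is the reason given in the text for the additional visual inspection.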
Statistical analyses were performed in R 3.4.4 (R Foundation for Statistical Computing, Vienna, Austria) using the tidyverse collection of packages as well as the packages pwr 1.3.0, fmsb 0.7.0, nlme 3.1.131, lme4 1.1.21, relaimpo 2.2.3, rjags 4.10 and runjags 2.0.4.6 relying on JAGS 4.3.0. Demographic information of the study participants as well as the data from the traces to be used for further analysis were stored in a comma separated values (csv) file.
As all subsequent analysis focussed on reflexes with suprathreshold current stimulations, it was investigated whether there was a potential association of I₀ with the demographic characteristics of the study participants, namely with age, height, sex, handedness, time of measurement, and stimulation order (SU or MP first). For continuous variables, Spearman's rank correlation tests, and for categorical variables, Kruskal-Wallis rank sum tests were performed. Because reflex responses between the two stimulation sites were compared, it was also tested whether there was an association between the two respective I₀ for the two stimulation sites using Spearman's rank correlation tests. P < 0.05 was considered significant for all analyses.
For the primary objective, determining and analysing a potential influence of temperature on the NWR, the magnitude of all obtained reflexes was quantified using the NWR Cohen's d (D). D was defined as the difference between the mean of the 60 to 150 ms post-stimulation amplitude and the mean of the −110 to −20 ms pre-stimulation amplitude divided by the pooled standard deviation of the −110 to −20 ms pre-stimulation and 60 to 150 ms post-stimulation amplitude. D is a robust way of quantifying the magnitude of the NWR with a very high test-retest reliability (Rhudy et al., 2020; Terry et al., 2016; France et al., 2009; Rhudy et al., 2009; Rhudy et al., 2008; Rhudy and France, 2007). For comparability with the results from the original study (Ellrich and Treede, 1998), the area under the sEMG curve (AUC) was also calculated as an alternative quantification metric. Bayesian normal-normal models were used to assess the effect size for the temperature influence on D for all muscle responses. Models without and with a simple hierarchical structure were applied to account for individual differences on the participant level.
Additionally, a classical linear model with temperature as the sole predictor was trained. In a separate mixed effects model subject ID was entered as a random effect.
Because the original design can be interpreted as a split-plot design, albeit not stated as such by Ellrich and Treede (1998), a separate mixed effects model was trained with temperature as whole-plot factor, suprathreshold current as split-plot factor, subject ID as block factor, and a random effect per combination of participant and temperature.
Following the analysis in the original study (Ellrich and Treede, 1998), paired samples t-tests for AUC at the two highest temperatures (i.e. 45 and 46 °C) compared to baseline were performed. For comparability with the results from the original study, this analysis was restricted to the MP-TA combination for electrical stimulations that were rated as not painful (i.e. < 100) and to subjects who had considered the heat stimuli of 45 and 46 °C as painful.
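The two magnitude metrics used in this analysis, D and AUC, can be illustrated as follows. This is a hedged sketch: the text does not give the exact pooling formula or integration rule, so the classical pooled standard deviation and a trapezoidal rule are assumed here, and the function names are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(baseline, response):
    """D: mean difference divided by the pooled SD of both windows.

    Assumption: classical pooled SD weighted by degrees of freedom.
    """
    n_b, n_r = len(baseline), len(response)
    pooled_var = ((n_b - 1) * variance(baseline)
                  + (n_r - 1) * variance(response)) / (n_b + n_r - 2)
    return (mean(response) - mean(baseline)) / sqrt(pooled_var)


def auc(samples, fs_hz=6000):
    """Trapezoidal area under a rectified sEMG segment (amplitude x seconds)."""
    dt = 1.0 / fs_hz
    return sum((a + b) / 2 * dt for a, b in zip(samples, samples[1:]))


# Toy amplitudes, not study data
baseline = [1.0, 2.0, 1.0, 2.0]
response = [4.0, 5.0, 4.0, 5.0]
print(cohens_d(baseline, response))
print(auc(response))
```

Unlike Z, D also scales with the variability of the response window, so a large but highly variable response yields a smaller value than under the Z criterion.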
The secondary outcome measure was the presence of an NWR (binary variable), which was analysed for the two stimulation (MP, SU) and the three recording sites (BF, RF, TA). Bayesian beta-binomial models were used to determine the highest density intervals (HDI) for elicitation of all six stimulation and recording site combinations separately as well as for the respective MP and SU elicited reflexes combined. These models made it possible to determine robust ranges of NWR response rates using a particular stimulation and recording site combination. The effectiveness of one combination compared to another was then calculated with the help of the so-called "rate ratio" (RR) (Fleiss et al., 2003), i.e. the ratio of the two calculated ranges of the combinations in question, with effectiveness defined as 1 − RR.
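A beta-binomial model of this kind can be approximated without JAGS by sampling directly from the conjugate Beta posterior. The sketch below is an illustration under assumptions: a uniform Beta(1, 1) prior, made-up counts, and an RR defined as the rate of the comparison combination divided by the rate of the reference combination. None of these details are confirmed by the text.

```python
import random


def posterior_samples(successes, trials, rng, n=10000, prior=(1.0, 1.0)):
    """Draws from the Beta posterior of a response rate (binomial likelihood)."""
    a = prior[0] + successes
    b = prior[1] + (trials - successes)
    return [rng.betavariate(a, b) for _ in range(n)]


def hdi(samples, mass=0.95):
    """Narrowest interval containing the requested share of the samples."""
    s = sorted(samples)
    k = int(mass * len(s))
    start = min(range(len(s) - k), key=lambda i: s[i + k] - s[i])
    return s[start], s[start + k]


def effectiveness(samples_a, samples_b):
    """1 - RR per posterior draw, with RR = rate_b / rate_a (assumed direction)."""
    return [1.0 - b / a for a, b in zip(samples_a, samples_b)]


rng = random.Random(0)                  # fixed seed for reproducibility
su = posterior_samples(590, 1000, rng)  # illustrative counts, not study data
mp = posterior_samples(570, 1000, rng)
print(hdi(su))
print(hdi(effectiveness(su, mp)))
```

If the interval for the effectiveness contains 0, the two combinations cannot be distinguished under this model, which is the decision rule applied in the results.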
In addition, the data were analysed using a classical (i.e. frequentist) statistical approach. For every stimulation and recording site combination, the NWR response rate, i.e. the percentage of elicited NWR at suprathreshold currents, was calculated. The six combinations (MP-BF, MP-RF, MP-TA, SU-BF, SU-RF, SU-TA) were compared using χ² tests of independence for between-subjects and McNemar's tests for within-subjects analysis. The analysis was conducted between the two stimulation sites for all three muscles together (1 group-wise comparison) as well as for every muscle separately (9 pair-wise comparisons). Additionally, comparisons of muscle responses for the same stimulation site were performed (2 × 3 pair-wise comparisons).
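For paired binary outcomes, such as reflex presence in two muscles after the same stimulation, McNemar's test uses only the discordant pairs. The following minimal sketch assumes the exact (binomial) variant of the test; the text does not state whether the exact or the χ² approximation was used, and the helper names are hypothetical.

```python
from math import comb


def discordant_counts(first, second):
    """Count pairs where only the first (b) or only the second (c) shows a reflex."""
    b = sum(1 for x, y in zip(first, second) if x and not y)
    c = sum(1 for x, y in zip(first, second) if y and not x)
    return b, c


def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value from the discordant pair counts."""
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, no evidence of a difference
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Toy example: reflex presence (True/False) for the same ten stimulations
muscle_1 = [True] * 10
muscle_2 = [False] * 10
b, c = discordant_counts(muscle_1, muscle_2)
print(b, c, mcnemar_exact(b, c))
```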
The four subthreshold stimulations served to indirectly test for a potential decrease in reflex threshold I₀ over the course of the experiment. To that end, the NWR response rates for all subthreshold stimulations were determined for the MP-TA and SU-BF combinations at each of the six applied temperatures to test for a potential increase of the response rates with increasing temperature. Because I₀ had been obtained for MP-TA and SU-BF respectively, increased response rates for these combinations would indicate a lowered reflex threshold.

Participant characteristics
In total, 17 participants were recruited. One had to be excluded because they did not tolerate the two highest temperatures used. This led to a study sample of 16 participants of equal sex distribution and a mean age of 26.8 ± 4.7 years.
No association between the stimulation-site-specific I₀ and the demographic characteristics of the study participants was found (Table 1). Therefore, they were not included as covariates in the further analysis. The I₀ at MP (8.6 ± 4.9 mA) and SU (11.4 ± 6.1 mA) were positively correlated (Spearman's ρ = 0.65, p = 0.006). Fig. 3 shows the sEMG recordings of two representative subjects for the MP-TA and SU-BF combinations following 1.5·I₀ stimulation for all six temperature blocks, with the NWR clearly visible within the 60 to 150 ms post-stimulation time window.

No observable influence of noxious heat stimuli on the NWR
The original study claimed a facilitation of the electrically elicited reflex response by painful heat. A significant increase of the NWR magnitude, quantified as AUC, was reported for stimulations in the last two experimental blocks, i.e. when a noxious tonic heat stimulus of 45 or 46 °C, respectively, was applied. At 46 °C, the increase of AUC had been significant in all subjects who had been tested with stimulation currents corresponding to about 1.2·I₀ and 3·I₀, respectively. However, the increase at 45 °C had been significant only for stimulations corresponding to about 3·I₀ (Ellrich and Treede, 1998).
In the present study, all participants reported that they considered the two highest temperatures (45 and 46 °C) as painful, thus fulfilling the prerequisite of the original study. It was assessed whether a tonic heat stimulus, applied in the same dermatome as the concurrent NWR elicitations, had an effect on the NWR magnitude, as reported in the original study (Ellrich and Treede, 1998), when comparing the two highest, i.e. painful, temperatures (45 and 46 °C) with the baseline (32 °C). To that end, a Bayesian normal-normal model without hierarchical structure was first used. The model was run for both stimulation sites, for all three recording sites and suprathreshold stimulation currents together and separately.
Second, hierarchical Bayesian normal-normal models were applied to account for individual differences between the participants. The resulting median effect sizes were greater but still did not come close to the reported ones. In the case of MP-TA, the present median effect sizes were 0.02 (HDI₉₅ = [−0.80, 0.90]) and −0.05 (HDI₉₅ = [−0.87, 0.79]), respectively. For SU-BF, the median effect sizes were −0.11 (HDI₉₅ = [−0.92, 0.71]) and 0.19 (HDI₉₅ = [−0.65, 1.06]). Again, separate assessments for the other stimulation and recording site combinations displayed the same behaviour (results not shown). Fig. 4 shows the variation of the reflex magnitude in the BF, RF, and TA muscles after suprathreshold stimulations as quantified by D in the presence of a tonic heat stimulus in the same dermatome. The temperature of the heat stimulus increased from 32 °C (baseline) in the first to 46 °C in the last block following the published design (Ellrich and Treede, 1998). As determined in the analysis, no clear pattern emerged with increasing temperature in any of the three recorded muscles, neither for MP nor for SU stimulation. This did not change when the four stimulation currents (I₀, I₀ + 1 mA, 1.5·I₀, 2·I₀) were considered separately (see supplementary material B).
Using a classical statistical approach, neither a linear regression model with temperature as the sole predictor nor a mixed effects model had any explanatory value (see supplementary material B).
Similarly, the split-plot model analysis did not reveal any significant influence of the whole-plot factor temperature on the reflex magnitude (see supplementary material B).
Following the original study (Ellrich and Treede, 1998), the NWR magnitude for the MP-TA combination was also quantified by the area under the curve (AUC) of the sEMG recordings for the 50 to 100 ms post-stimulation time window for all stimulations corresponding to about 1.2·I₀, i.e. for the stimulation at I₀ + 1 mA, for 10 out of 16 participants. In order to comply with the original design (Ellrich and Treede, 1998), two participants were excluded from this analysis because they had consistently rated the stimulations at I₀ + 1 mA as painful, i.e. ≥ 100 (see supplementary material D for details on reported pain ratings). Paired samples t-tests were performed for the AUC at either of the two highest temperatures (i.e. 45 and 46 °C) and at 36 °C. In contrast to the published results (Ellrich and Treede, 1998), the AUC did not continuously increase with temperature but rather fluctuated (Fig. 5) and did not differ significantly either between 46 and 36 °C or between 45 and 36 °C (p = 0.086 and p = 0.105, respectively). The analysis was restricted to AUC of sEMG curves from actual reflexes, i.e. those fulfilling the Z criterion (interval mean z-score ≥ 1.4; Rhudy and France, 2007). However, the original article did not state whether a scoring criterion was used or not. If the AUC of all sEMG curves were considered, the p-values greatly increased, both between 46 and 36 °C and between 45 and 36 °C (p = 0.289 and p = 0.273, respectively). If the same tests were performed for D, the differences between 46 and 36 °C as well as between 45 and 36 °C were not significant either (p = 0.075 and p = 0.067, respectively). Additionally, the values for D were lower at the higher temperatures, thus contradicting the results from the original study (Fig. 5). Again, the p-values increased further (p = 0.568 and p = 0.4485, respectively) if all values of D were considered.

NWR response rate
Fig. 6 depicts NWR response rates for all six stimulation and recording site combinations when suprathreshold stimulation currents were used. Overall, MP-TA, SU-BF, and SU-RF showed the highest NWR response rates of 79.4%, 66.9%, and 61.2% respectively. Fig. 7 shows the NWR response rate per recording and stim-    Table 2. These data were entered into Bayesian beta-binomial models.
When all three muscles and all four stimulation currents were considered together, the 95% HDI (HDI₉₅) for the NWR response rate ranged from 0.56 to 0.62 (median = 0.59) for SU stimulation and from 0.54 to 0.60 (median = 0.57) for MP stimulation. This means that reflexes were elicited 59% of the time with SU stimulation, and 57% of the time with MP stimulation. However, the HDI₉₅ for the calculated greater effectiveness extended from −0.03 to 0.10. Because 0 still lay within the HDI₉₅, SU stimulation did not yield statistically greater effectiveness in eliciting an NWR than MP stimulation. The HDI₉₅ for all six pairs, as well as the results of the classical statistical approach, are presented in the supplementary material C. By and large, the classical approach yielded the same results.
The NWR response rates for subthreshold stimulations did not increase with temperature, neither for MP-TA nor for SU-BF (see supplementary material C.2).

Discussion
First, this study aimed to replicate the results of Ellrich and Treede (1998). The NWR had been shown to reflect ongoing sensitization processes, which resembled thermally induced CS in the same dermatome where the reflex was elicited. In the present study, this finding could not be replicated. Although the documented design of Ellrich and Treede (1998) was reproduced as closely as possible, no systematic influence of the thermal CS induction paradigm on the NWR magnitude was observed. Because the potential CS induction was not verified by other proxies, such as secondary hyperalgesia, it cannot be said with absolute certainty whether CS was really induced. Yet, the absence of a modulation of the NWR (see 3.2) as well as of the pain ratings of the electrical stimulations (see supplementary material D) indicates that, in the present study, CS was probably not induced. Thus, it remains an open question whether NWR magnitude changes can serve as a reliable CS proxy in studies using painful heat to induce CS in healthy participants, as has been shown for CS induction using electrical stimulation (Manresa et al., 2010). Second, this study investigated the NWR response rate after repetitive electrical stimulation. Six pairs of stimulation and recording sites were tested. Three of these showed response rates of 61% or higher. MP-TA was the combination that yielded the highest NWR response rate (79.4%), followed by SU-BF (66.9%) and SU-RF (61.2%). The differences between MP-TA and SU-BF as well as between MP-TA and SU-RF were significant. It is therefore concluded that MP-TA is the best choice for obtaining robust NWR response rates in an experiment including repetitive elicitations. This information is crucial because the NWR response rate determines the minimum number of stimulations that are necessary for meaningful results.
Abbreviations: MP = medial plantar nerve electrical stimulation site; SU = sural nerve electrical stimulation site; BF = biceps femoris muscle; RF = rectus femoris muscle; TA = tibialis anterior muscle; I₀ = reflex threshold current; NWR = nociceptive withdrawal reflex; response rate = percentage of elicited NWR. Maximum possible elicited NWR per stimulation and recording site combination and current ≥ I₀: 96 (6 blocks × 16 participants).
A. Guekos, A.C. Grata, M. Hubli et al. Clinical Neurophysiology 145 (2023) 139-150

The reflex magnitude may depend on experimental and analytical parameters
In the present study, no significant increase of the NWR magnitude was observed when a noxious heat stimulus was applied in the same dermatome as the concurrent electrical stimulation to elicit the reflex. This is in contrast to the original study (Ellrich and Treede, 1998), which reported an increase in NWR magnitude that would indicate a successful induction of CS through the applied heat pain paradigm. There are several potential reasons for this discrepancy: (1) As indicated above, it is unclear, or even unlikely, that CS was induced in the present study. Although the exact same induction paradigm as in the original study was used, it is conceivable that, for some unknown reason, the participants of the present study sensitized less. (2) A Type I error in the original study or a Type II error in the present study. (3) Slight adaptations of the number and rate of the electrical stimulations. (4) Details of the statistical analysis.
Regarding (2), the results of the original study were obtained from a cohort of 11 participants, of which just 5 were used for all published analyses. These small sample sizes might have biased the results, hampering the present replication attempt that relied on 16 participants.
Regarding (3), the number and rate of the electrical stimulations to elicit the reflex can have a direct impact on the results. In the original study, 18 electrical stimulations were applied at 0.2 Hz (Ellrich and Treede, 1998). This frequency is close to the reported 0.3 Hz threshold for temporal summation of pain (TSP), which is often used to study CS with the help of the NWR (Price, 1972;Terry et al., 2001;Terry et al., 2016;Rhudy et al., 2020;Arendt-Nielsen et al., 1994;Serrao et al., 2004;Sandrini et al., 2005). TSP, a phenomenon similar to wind-up in animal studies, is an increase in perceived pain over time during repetitive noxious stimulations of the same intensity (Price, 1972;Price et al., 1977). TSP has been reported to cause earlier onsets of the NWR (Arendt-Nielsen et al., 1994;Terry et al., 2001), which is normally observed in the 60 to 150 ms post-stimulation interval. This would explain why in the original study reflex responses were reported for the 50 to 100 ms post-stimulation time window (Ellrich and Treede, 1998). However, a shortening of onset latencies for the activation of Aδ and C fibres is not considered characteristic of CS processes, as longer or unchanged latencies have been reported in CS studies (Baek et al., 2016;Liang et al., 2016;De Schoenmacker et al., 2021). It therefore seems unlikely that the original study observed CS and more likely that it observed TSP. In the present study, a potential influence of TSP was avoided by reducing the number of electrical stimulations to eight, delivered at random interstimulation intervals of 5 to 15 s (see supplementary material A for all changes to the original design).
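The two stimulation timings described above can be contrasted in a short sketch. It generates stimulus onset times for a fixed-rate train (18 stimulations at 0.2 Hz, as in the original study) and for a jittered train (eight stimulations with intervals between 5 and 15 s, as in the present study). The uniform distribution of the random intervals, the function names, and the seed are assumptions made for illustration; the source states only that the intervals were random within that range.

```python
import random


def fixed_rate_onsets(n_stimuli, rate_hz):
    """Onset times (s) for a regular train at `rate_hz`, starting at 0 s."""
    period = 1.0 / rate_hz
    return [i * period for i in range(n_stimuli)]


def jittered_onsets(n_stimuli, min_isi_s, max_isi_s, seed=None):
    """Onset times (s) with random inter-stimulation intervals.

    Intervals are drawn uniformly from [min_isi_s, max_isi_s]; the
    uniform distribution is an assumption of this sketch.
    """
    rng = random.Random(seed)
    onsets = [0.0]
    for _ in range(n_stimuli - 1):
        onsets.append(onsets[-1] + rng.uniform(min_isi_s, max_isi_s))
    return onsets


original_design = fixed_rate_onsets(18, 0.2)           # one stimulus every 5 s
present_design = jittered_onsets(8, 5.0, 15.0, seed=1)  # 5 to 15 s apart

intervals = [b - a for a, b in zip(present_design, present_design[1:])]
print(len(original_design), len(present_design), min(intervals) >= 5.0)
```

Every interval in the jittered train is at least as long as the fixed 5 s interval of the regular train, so the effective stimulation rate never exceeds 0.2 Hz and is unpredictable for the participant.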
Alternatively, it should be noted that, using stimulations on the foot sole, onset latencies as early as 60 ms post-stimulation are reported for the activation of nociceptive Aδ fibres that defines the NWR. In this context, the 60 ms post-stimulation moment has been characterized as a natural separation point between non-nociceptive and nociceptive components visible in the sEMG signal (Andersen, 2007). Therefore, a time window of 50 to 100 ms post-stimulation could theoretically contain the response of activated Aδ fibres, i.e. the NWR, even in the absence of TSP. Conversely, it seems unlikely that the observed reflex responses in the original study were predominantly non-nociceptive, although they were interpreted as such in the original article, because the published traces show no activity before 60 ms (Ellrich and Treede, 1998).
Regarding (4), the NWR magnitude was quantified in the original study using the average AUC of the obtained reflexes (Ellrich and Treede, 1998). However, the AUC might not be the best criterion to quantify and analyse the reflex magnitude. Because AUC has been reported to be significantly worse for reflex identification than Z or D (Rhudy and France, 2007;France et al., 2009), it is conceivable that quantification and analysis using AUC misrepresent the actual muscle response activity. This would explain the striking difference in the present replication study between the evolution of the AUC and that of D from the lowest to the highest temperature, with AUC increasing and D decreasing (Fig. 5). For future studies investigating the viability of the NWR magnitude as a CS proxy, it is therefore recommended not to use AUC for quantification because it could distort the results. Relying on a more robust metric, such as D, might improve the reproducibility of findings relating to subtle differences in NWR response.
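How AUC and a standardized metric such as D can move in opposite directions may be illustrated with a toy example. The sketch below assumes that D is a Cohen's-d-style interval score, i.e. the difference between the mean rectified sEMG in the reflex window and in a pre-stimulus baseline window, divided by the average standard deviation of the two windows. This definition, the uncorrected rectangle-rule AUC, the sampling step, and all sample values are assumptions for illustration; they are not the data or the exact formulas of the present or the original study.

```python
from statistics import mean, stdev


def auc(rectified, dt_s):
    """Rectangle-rule area under rectified sEMG samples (no baseline correction)."""
    return sum(rectified) * dt_s


def interval_d(reflex, baseline):
    """Cohen's-d-style score: mean difference over the average SD
    of the reflex and baseline windows (assumed definition)."""
    avg_sd = (stdev(reflex) + stdev(baseline)) / 2
    return (mean(reflex) - mean(baseline)) / avg_sd


DT_S = 0.001  # hypothetical 1 kHz sampling

# Toy trial A: small but clean response on a quiet baseline.
baseline_a, reflex_a = [1.0, 1.2, 0.8, 1.0], [3.0, 3.2, 2.8, 3.0]
# Toy trial B: larger raw amplitude, but noisy baseline and response.
baseline_b, reflex_b = [1.0, 3.0, 1.0, 3.0], [2.0, 6.0, 2.0, 6.0]

auc_a, auc_b = auc(reflex_a, DT_S), auc(reflex_b, DT_S)
d_a, d_b = interval_d(reflex_a, baseline_a), interval_d(reflex_b, baseline_b)

print(auc_b > auc_a, d_b < d_a)  # True True
```

In this toy case the raw area is larger in trial B, yet the response stands out far less from baseline variability, so the standardized score is lower. A general rise in background sEMG activity could therefore inflate AUC without reflecting a stronger reflex.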

One muscle drives the withdrawal action for each stimulation site
The withdrawal response to electrical stimulation was always most reliable in one of the three muscles from which responses were recorded. For MP stimulation it was the TA, and for SU stimulation the BF muscle. This pattern is in line with other published results showing that muscle activation depends on the stimulation site, indicative of withdrawal strategies (Massé-Alarie et al., 2019), and justifies the use of these two specific combinations. It can be excluded that the pattern was due to the specifics of the sEMG recordings because the muscle with the most reliable response changed between stimulation sites although the recording specifics remained identical. The importance of one particular muscle suggests that the withdrawal action is typically driven by this specific muscle, indicating that the NWR is organized in a hierarchical pattern on the spinal level. This interpretation is in line with proposed hypotheses regarding hierarchical muscle activation strategies (Tresch and Jarc, 2009;Pierrot-Deseilligny and Burke, 2012).
The purpose of such a pattern would be to prevent any potential harm to the limb by moving the innervation sites (i.e. the sites of nociceptive input in a naturalistic situation) of the electrically stimulated nerve away. The muscle that dominates the withdrawal action should be the one most effectively removing the limb. For a stimulus on the foot sole (as in the case of an electrical MP stimulation), a contraction of the TA muscle results in dorsiflexion and inversion of the foot (Juneja and Hubbard, 2021), thereby ensuring removal of the foot sole from the stimulus. Likewise, in the case of SU stimulation at the ankle, contracting the BF muscle achieves the same result; the knee is flexed with the ankle region and the lateral side of the foot being immediately removed from their current position.
The other muscles that are involved in the withdrawal would only assist the dominating muscle in moving the affected limb away from the nociceptive input. In the case of MP stimulation, the involvement of the two other muscles that were observed (BF and RF) was limited. Their contribution was only detected in approximately half of all elicited reflex responses, probably because the contraction of the dominating TA is sufficient to move the foot away. The muscles of the thigh, i.e. BF and RF, might in the instance of a nociceptive stimulus on the sole of the foot only subserve stabilization. Thus, their activation would be expected to be more variable. Interestingly, the contribution of BF and RF increased with increasing stimulation currents (see supplementary material C), presumably because higher currents signal greater potential harm. Greater harm in turn warrants greater supraspinal control, which has been suggested to be increased for proximal muscles such as BF and RF (Jure et al., 2019). Therefore, the whole limb is moved with the help of these two muscles.
In the case of SU stimulation, the dominating BF was most of the time assisted by the RF in the withdrawal. The NWR response rates of these two muscles were not statistically different for suprathreshold stimulations. This could be due to the complementary nature of their movements regarding knee and hip. Specifically, the BF acts as a hip extensor and a knee flexor while the RF is a hip flexor and a knee extensor (Darby and Cramer, 2014). Thus, in the withdrawal, the dominating action of the BF is probably knee flexion and that of the RF hip flexion to effectively withdraw the lower limb from nociceptive input. In contrast, the TA is significantly less activated (see supplementary material C), probably because dorsiflexion of the foot does not help to move the lateral side of the foot, i.e. the innervation territory of the sural nerve, away from the putative nociceptive source.
The above considerations are limited by focussing on individual muscle responses to electrical stimulation of nerve trunks. For an in-depth assessment of the underlying NWR pattern involving multiple muscle activations, other approaches, such as muscle synergy analyses (Jure et al., 2019) and the use of alternative stimulation and recording site combinations, would be more suitable. However, such approaches would have been beyond the scope of the present study, which was primarily a replication attempt.

The MP-TA combination is not always the best choice
The MP-TA stimulation and recording site combination showed significantly higher NWR response rates than the SU-BF and SU-RF combinations. In addition, participants typically consider suprathreshold stimulations at MP less painful than at SU (Jensen et al., 2015) as can be observed in the collected pain ratings (see supplementary material D). Still, MP-TA might not always be the combination of choice for NWR studies, for two main reasons.
First, stimulating the MP nerve on the foot sole is not straightforward. The ideal position for the stimulation electrode is proximal to the medial side of the ball of the foot, i.e. close to where the arch of the foot ends and the ball begins. This region is often covered by calloused skin, the extent of which may depend on foot anatomy, walking patterns, and skin properties. Because of its composition and thickness, calloused skin influences the electrical conductivity and thus the nociceptive input. The skin would have to be peeled off before an experiment in order to achieve comparability between subjects. This would result in a lengthy preparatory procedure and potentially cause sensitization of the testing area. In contrast, stimulating the SU nerve does not require any preparation except for identifying the SU by manual palpation before placing the stimulation electrodes over its retromalleolar pathway.
Second, the MP is a mixed sensory and motor nerve, whereas the SU is a purely sensory nerve. Electrical stimulation of a mixed nerve can cause noticeable motor responses, such as toe twitching in the case of MP stimulation. These motor responses may confuse the participant and prompt voluntary muscle responses, which in turn may contaminate the EMG recordings.
Additionally, the difference in NWR response rates (79.4% for MP-TA compared to 66.9% for SU-BF), albeit significant, is not large enough to be the sole decisive factor when choosing the stimulation and recording site combination in a clinical study. Depending on the research question, it might be advisable to choose SU-BF instead of MP-TA, e.g., if a mixed nerve stimulation is to be avoided. If multiple muscle responses are to be recorded simultaneously, the choice of stimulation site should guarantee high NWR response rates in all recorded muscles. In the present study, SU stimulation yielded better results than MP when all three muscles, BF, RF, and TA, were considered.
Using the SU-RF combination might be a viable alternative to both MP-TA and SU-BF. If only the RF reflex responses are recorded, the NWR threshold I0 should be determined at this muscle. That might yield even higher NWR response rates than in the present study, where for all SU stimulations the I0 of the BF was used.

Clear methods sections support replicability
The present study attempted to replicate results that had been published over 20 years ago (Ellrich and Treede, 1998), based on the published, i.e. publicly available information. However, in certain instances, such as threshold determination, exact currents used for electrical stimulation, or number of experimental runs, this information proved to be unclear or incomplete. Hence, some potential changes to the original design were inevitable (see Methods and supplementary material A). All changes were made after carefully considering how they might affect the results of the replication attempt. Overall, the changes have presumably not affected the outcome of the replication as a whole although a small influence cannot be fully excluded.
For replication studies, it is of paramount importance that the methods published in the original study are precise and unequivocal. Otherwise, results of previous studies cannot be completely understood and evaluated by different researchers at a later date. It would also become virtually impossible to scrutinize historic results, which may form part of the canonical literature in a certain field, against the backdrop of new knowledge that might have emerged over time. Thus, a fair comparison could not be made. In the present case, the understanding of CS and of the NWR has greatly advanced over the past two decades (Quesada et al., 2021;Arendt-Nielsen et al., 2018;Sandrini et al., 2005;Andersen, 2007;Jure et al., 2020). Because of these advances, it was worthwhile to re-examine the original results (Ellrich and Treede, 1998), not just to confirm or dismiss them but also to put them into context with the new knowledge in the field of CS and NWR studies. Unfortunately, such an assessment was not fully possible due to the aforementioned incompleteness of the published methodology.
The results from the present replication attempt would discourage the use of the published paradigm for CS studies, although a final verdict cannot be given because small changes to the original design had to be made. Small changes can still impact the results of potential CS proxies when the expected modulation is very subtle, such as in the case of a modulation of the NWR magnitude. Therefore, methods in studies on CS involving the NWR magnitude as an outcome measure need to be clearly and precisely reported.

Conclusion
Tonic heat stimuli that were concurrently applied in the same dermatome as the electrical stimulations to elicit the NWR did not significantly alter the NWR magnitude in the present study, despite closely reproducing a previous design in which the NWR magnitude was reported to reflect CS (Ellrich and Treede, 1998). The viability of NWR magnitude changes as a CS proxy after induction with painful heat should only be assessed if the induction can be independently confirmed, e.g., by the presence of secondary hyperalgesia. For future replicability, methods of CS studies involving the NWR need to be reported in a detailed and precise fashion.
The NWR can be reliably elicited in a repeated measures experiment. Presently, the MP-TA stimulation and recording site combination yielded the highest response rates for the NWR. If simultaneous sEMG read-outs in more than one muscle are used, the SU stimulation site appears to be the superior option.