Evaluating the Measurement Properties of the Self-Assessment of Treatment Version II, Follow-Up Version, in Patients with Painful Diabetic Peripheral Neuropathy

Background. The Self-Assessment of Treatment version II (SAT II) measures treatment-related improvements in pain and impacts and impressions of treatment in neuropathic pain patients. The measure has baseline and follow-up versions. This study assesses the measurement properties of the SAT II. Methods. Data from 369 painful diabetic peripheral neuropathy (PDPN) patients from a phase III trial assessing capsaicin 8% patch (Qutenza®) efficacy and safety were used in these analyses. Reliability, convergent validity, known-groups validity, and responsiveness (using the Brief Pain Inventory-Diabetic Neuropathy [BPI-DN] and Patient Global Impression of Change [PGIC]) analyses were conducted, and minimally important differences (MID) were estimated. Results. Exploratory factor analysis supported a one-factor solution for the six impact items. The SAT II has good internal consistency (Cronbach's alpha: 0.96) and test-retest reliability (intraclass correlation coefficients: 0.62–0.88). Assessment of convergent validity showed moderate to strong correlations with change in other study endpoints. Scores varied significantly by level of pain intensity and sleep interference (p < 0.05) defined by the BPI-DN. Responsiveness was shown based on the PGIC. MID estimates ranged from 1.2 to 2.4 (pain improvement) and 1.0 to 2.0 (impact scores). Conclusions. The SAT II is a reliable and valid measure for assessing treatment improvement in PDPN patients.


Introduction
Neuropathic pain (NP) is a disorder of the central and peripheral nervous system resulting from a lesion or disease [1-3]. NP is one of the most prevalent pain aetiologies [3], with reported rates ranging from 0.9% to 8% of the general population [4,5]. In diabetic patients, NP (referred to as diabetic polyneuropathy [DPN]) is one of the most common complications [3]. Painful diabetic peripheral neuropathy (PDPN) is a common form of DPN, with a prevalence of 5.8-34% in type 1, type 2, or overall diabetes mellitus patients and an incidence of approximately 0.7 per 1000 persons per year [6]. In these patients, the system that signals pain is damaged or dysfunctional, resulting in symptoms such as aching, burning, shooting, and/or stabbing pain, often manifesting at night [3,7]. Limbs and extremities are often affected, which subsequently impacts activities of daily living, sleep, work, and overall quality of life (QoL) [8].
The Initiative on Methods, Measurement, and Pain Assessment in Clinical Trials (IMMPACT) identified six core outcome domains as key for the assessment of efficacy and effectiveness of pain treatments: (1) pain; (2) physical functioning; (3) emotional wellbeing; (4) participant ratings of improvement and satisfaction with treatment; (5) symptoms and adverse events; and (6) participant disposition [9]. Patient-reported outcome (PRO) measures of pain, often used as primary endpoints, capture changes in pain intensity or frequency resulting from treatment but typically do not assess patient ratings of improvement and satisfaction. Furthermore, in a sample of patients with postoperative pain given the American Pain Society Satisfaction Survey, satisfaction was influenced by effectiveness of the medication independent of the level of pain intensity [10]. The five-item Self-Assessment of Treatment (SAT) questionnaire was developed based on the IMMPACT recommendations for assessing patient ratings of improvement and satisfaction [11,12]. Items assess patient ratings of treatment benefit relating to pain, activity level, and QoL. Additionally, the SAT includes an item assessing whether patients would be willing to receive the treatment again and an item comparing treatments.
Despite strong evidence of the measurement properties of the SAT items [12], concerns were expressed about the lack of a recall period and about the activity and QoL items covering too broad a construct for single items. Qualitative interviews were conducted with clinical experts and with patients diagnosed with NP [11]. Three clinicians provided their perspective on the most relevant symptoms and impacts of NP, as well as on the key benefits and harms associated with treatment. Additionally, the SAT was administered to 44 patients with NP, including PDPN (n = 20), human immunodeficiency virus-associated neuropathy (n = 16), and postherpetic neuropathy (n = 8), who provided feedback on both the measure and their experience with treatment for pain. The interviews confirmed the previous concerns with the activity and QoL items, with both being reported as too broad to capture key impacts. The activity item was subsequently split into three items, measuring improvements in self-care, daily activities, and physical activities. The QoL item was also split into three items, measuring improvements in sleep, emotional wellbeing, and social functioning. A recall period of 7 days was also added to the measure, and response options were adjusted for consistency between items. The modified measure, the SAT version II (SAT II), includes both baseline (measuring pain and impacts) and follow-up (measuring treatment-related improvements in pain and impacts and impressions of treatment) versions.
The SAT II was included in a phase III, double-blind, randomized, placebo-controlled clinical trial evaluating the efficacy and safety of capsaicin 8% patch (Qutenza) in subjects with PDPN [13]. The aim of this study was to develop the scoring algorithms for the SAT II follow-up version, including detecting and evaluating potential subscales, and to assess the measurement properties of these scores through psychometric evaluation.

Patient Sample.
Data were collected from 369 patients with PDPN (score ≥3 on the Michigan Neuropathy Screening Instrument) from the phase III trial assessing the efficacy and safety of capsaicin 8% patch [13]. Patients included in the clinical trial had a score of ≥4 on the Brief Pain Inventory-Diabetic Neuropathy (BPI-DN) item 5 (a patient-reported measure of pain intensity evaluated on a 0-10 scale) at the screening visit and stable glycemic control when entering the study. Patients had been diagnosed with painful, distal, symmetrical, sensorimotor polyneuropathy due to diabetes for at least 1 year prior to screening. They had at least one medical record of glycosylated hemoglobin (HbA1c) of <11.0% at 3-6 months before the screening visit and at screening, with variations of <1.0% between the 3- and 6-month prescreening value and screening value.

Study Design.
The phase III trial was a double-blind, randomized, placebo-controlled, efficacy and safety study [13]. Subjects were randomly assigned to receive either a single application of capsaicin 8% patch or a placebo patch for 30 minutes at the baseline visit (day 1). This was followed by an observation period of 12 weeks involving four visits at weeks 2, 4, 8, and 12. The primary efficacy endpoint was the percent change in the BPI-DN item 5 score from baseline (average of daily scores during the week ending on day 1) to weeks 2-8 (average of daily scores during this period) in the active arm compared to the placebo arm. Data used to characterize the sample, including sociodemographic and clinical data, were collected at baseline or during screening.

PRO Measures.
The SAT II follow-up version contains nine items in total. It measures the extent to which the study treatment has improved pain (question 1) and has six impact items assessing key impacts on QoL (self-care activities, daily activities, and physical activities [questions 2a-c]; emotional wellbeing, sleep, and social functioning [questions 3a-c]), all assessed on a five-point Likert scale using a 7-day recall period. Additionally, the extent to which a patient would be willing to receive the study treatment again (question 4) and impressions on how it compares to other treatments (question 5) are both captured using five-point Likert scales with no recall period stated. The SAT II follow-up version was administered at weeks 8 and 12.
The BPI-DN was developed to assess pain resulting from diabetes [14]. Item 5 of this measure assesses pain due to diabetes during the past 24 hours using an 11-point numerical rating scale (NRS; where 0 represents "no pain" and 10 represents "worst possible pain"). Item 9F assesses how pain interferes with sleep during the past 24 hours using an 11-point NRS (where 0 represents "does not interfere" and 10 represents "completely interferes"). A 30% reduction in pain severity using an 11-point NRS has previously been identified as a clinically important difference [15]. The BPI-DN items 5 and 9F were completed daily from the first screening visit to week 12/end of study. For the 7-day average of daily BPI-DN item 5 and item 9F scores, data were considered nonmissing if scores were available from at least 4 days in the week. A 7-day average was calculated using the 7 consecutive days ending on the day of the baseline, week 8, and week 12 visits.
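The 7-day averaging rule described above can be sketched as follows. This is an illustrative sketch only, not the trial's actual code: the function name `seven_day_average`, the use of `None` to mark a missing diary entry, and the convention that the list is ordered by day with the visit day last are all assumptions made for the example.

```python
from statistics import mean
from typing import Optional, Sequence


def seven_day_average(
    daily_scores: Sequence[Optional[float]], min_days: int = 4
) -> Optional[float]:
    """Average the daily scores in the 7-day window ending on the visit day.

    Assumptions (not from the source): `daily_scores` is ordered by day,
    its last element is the visit day, and a missing entry is `None`.
    Returns `None` (missing) when fewer than `min_days` scores are available.
    """
    window = list(daily_scores)[-7:]  # the 7 consecutive days ending on the visit day
    observed = [score for score in window if score is not None]
    if len(observed) < min_days:
        return None  # treated as missing; nothing is imputed
    return mean(observed)


# Hypothetical diary data: 4 of 7 days observed, so the average is reported.
week_with_enough_days = [5, None, 6, 4, None, None, 5]
# Only 2 of 7 days observed, so the weekly score is treated as missing.
week_with_too_few_days = [5, None, None, None, None, None, 5]

average_ok = seven_day_average(week_with_enough_days)
average_missing = seven_day_average(week_with_too_few_days)
```

Returning `None` rather than a partial average keeps the sketch in line with the statement that no missing data were imputed.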
The Patient Global Impression of Change (PGIC) measures the change in patient-reported overall health status on a seven-point scale ranging from 1 (very much improved) to 7 (very much worse). In the phase III clinical trial, the PGIC was administered at weeks 2, 8, and 12.
The EuroQol-5 dimensions (EQ-5D) is a PRO measure developed to derive health utilities and is typically used in cost-utility analyses. The EQ-5D contains five items (pain/discomfort, self-care, mobility, anxiety/depression, and usual activities) that are scored using three-point Likert-type response scales. The responses are converted into a single index score using valuations of health states, based on the EQ-5D response options using the time trade-off method, in a representative sample of the general population [16]. The EQ-5D was completed at baseline and weeks 2, 8, and 12.
The Hospital Anxiety and Depression Scale (HADS) is a PRO measure developed to assess levels of anxiety and depression for use in clinical practice [17] but has also been used in numerous clinical trials. The HADS contains 14 items, seven assessing anxiety and seven assessing depression. Each item is scored on a four-point (0-3) Likert-type response scale. Summary scores for the anxiety and depression domains each range from 0 to 21, with higher scores indicating greater anxiety/depression. The HADS was completed at baseline and weeks 2, 8, and 12.
The Neuropathic Pain Symptom Inventory (NPSI) is a PRO measure developed to evaluate different symptoms of neuropathic pain [18]. The NPSI contains 12 items, from which five summary pain scores can be calculated: burning, evoked, pressive, paroxysmal, and abnormal sensations. The 10 items used to derive the domain summary scores are each scored using a 0-10 NRS ranging from no pain/sensation to worst pain/sensation imaginable. The remaining two items report how consistently pain has been present and the number of pain episodes. The NPSI was completed at baseline and week 12.

Statistical Analyses.
All subjects in the intent-to-treat population with available PRO data, as required for each analysis, were included in the analysis sample, and no missing data were imputed. All statistical tests conducted were two-tailed with p < 0.05 used to determine significance. Due to similarities in the results for the follow-up version time points (weeks 8 and 12), unless specified otherwise, only week 12 data are reported.

Patient Demographics and Clinical Characteristics.
Demographic (gender, age, and race) and clinical variables (weight and concomitant medication use) collected at the baseline visit or during screening were used to characterize the patient sample.

SAT II Descriptive Statistics.
The distributional characteristics of the individual SAT II items were examined at week 8. Frequencies and percentages at each response level are reported to provide information on the range of response options used. In addition, the mean, standard deviation (SD), and median are reported for all items.

Item-to-Item and Item-to-Scale Correlations.
Spearman correlations were calculated to assess the relationship between the items (item-to-item correlations) and to provide information about the functioning of the instrument in the population. In addition, summary scores (combining items expected to be related) were correlated with the individual items (item-to-scale correlations). The analyses conducted at week 12 included correlations between all SAT II items; between summary scores (the sum of the three activity items, the sum of the three QoL items, and the sum of all six impact items) and all items; and between the weekly average BPI-DN item 5 score and the pain item, the three summary scores, the treatment continuation item, and the treatment comparison item.
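For readers unfamiliar with the statistic, a Spearman correlation is a Pearson correlation computed on ranks, with tied values sharing their average rank. The sketch below illustrates this with made-up item responses; the helper names `average_ranks` and `spearman` are assumptions for the example, and the study's own software is not described in the text.

```python
from math import sqrt
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    covariance = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    spread_x = sqrt(sum((a - mx) ** 2 for a in rx))
    spread_y = sqrt(sum((b - my) ** 2 for b in ry))
    return covariance / (spread_x * spread_y)


# Hypothetical responses to two items on a 0-4 scale (not study data).
item_2a = [0, 1, 1, 2, 3, 4, 4]
item_2b = [0, 1, 2, 2, 3, 3, 4]
rho = spearman(item_2a, item_2b)
```

Five-point Likert items produce many ties, which is why the average-rank handling matters here; the function is undefined when either variable is constant.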

Factor Analysis.
To determine the number of domains and thus inform the scoring, exploratory factor analysis (EFA) was conducted at week 8 and at week 12. We included only the six impact items in the EFAs. The pain, treatment continuation, and treatment comparison items were not included, as they measure distinct and different concepts. Eigenvalues and the Root Mean Square Error of Approximation (RMSEA) were used to evaluate the number of factors. Factor loadings >0.4 were considered acceptable (provided the item loaded on one factor only).

Scoring.
The scoring approaches were based on the findings from the correlations and EFAs and derived after discussion among all authors. The pain, treatment continuation, and treatment comparison items were scored as individual items. The proposed scores were then assessed for reliability, validity, and ability to detect change.

Reliability.
The internal consistency reliability for the impact domain was assessed using Cronbach's formula for coefficient alpha at weeks 8 and 12. The target Cronbach's alpha is at least 0.70, though patterns of item-to-item correlations and item-to-total correlations are also important, as is the number of items in the subscale.
To measure test-retest reliability, stable patients were defined as those with a <20% change in BPI-DN item 5 (pain) score from week 8 to week 12 [15]. Stable patients were alternatively defined as those with a <20% change in EQ-5D Visual Analogue Scale (VAS) score. Intraclass correlation coefficients (ICCs) were calculated between week 8 and week 12 using SAT II follow-up scores. An ICC of >0.60 among stable subjects is considered acceptable to demonstrate test-retest reliability [19].
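As a reminder of what coefficient alpha measures, the standard formula is alpha = k/(k - 1) × (1 - sum of item variances / variance of the total score), where k is the number of items. The sketch below applies it to made-up responses; the function name `cronbach_alpha` and the complete-case input layout (one list of item scores per patient) are assumptions for illustration, not the study's implementation.

```python
from statistics import variance


def cronbach_alpha(rows):
    """Cronbach's coefficient alpha for complete-case item responses.

    `rows` holds one list of item scores per patient (assumed layout).
    Uses sample variances (n - 1 denominator) for items and total scores.
    """
    n_items = len(rows[0])
    item_variances = [variance([row[i] for row in rows]) for i in range(n_items)]
    total_variance = variance([sum(row) for row in rows])
    return n_items / (n_items - 1) * (1 - sum(item_variances) / total_variance)


# Hypothetical responses to the six impact items on a 0-4 scale (not study data).
responses = [
    [0, 0, 1, 0, 1, 0],
    [1, 1, 1, 2, 1, 1],
    [2, 2, 3, 2, 2, 3],
    [3, 3, 3, 4, 3, 3],
    [4, 4, 4, 4, 3, 4],
]
alpha = cronbach_alpha(responses)
```

Because alpha rises with the number of items as well as with their intercorrelation, a high value for a six-item scale is best read alongside the item-to-item correlations, as the text notes.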

Validity.
Convergent validity was assessed via Spearman's rank-order correlation coefficient at weeks 8 and 12, between the SAT II scores and BPI-DN item 5 change from baseline score and change from previous week score (i.e., week 8 minus week 7; week 12 minus week 11); BPI-DN item 9F score; BPI-DN item 9F change from baseline score and change from previous week score; HADS subscale scores; HADS subscale change from baseline scores; PGIC; EQ-5D index and VAS scores; and EQ-5D index and VAS change from baseline scores.
Known-groups validity for the follow-up version was examined at weeks 8 and 12 by analysis of variance (ANOVA) assessments comparing SAT II scores based on the following groups: BPI-DN item 5 score: 0-3, 4-6, and 7-10; and BPI-DN item 9F score: 0-3, 4-6, and 7-10. Pairwise comparisons between group means were assessed via t-tests. To account for multiple comparisons, Scheffé's method was applied.
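The known-groups comparison above amounts to grouping patients by BPI-DN category and computing a one-way ANOVA F statistic on their SAT II scores. The sketch below shows that calculation on made-up data. The function names, the 0-4 coding of SAT II scores, and the cut points applied to non-integer 7-day averages (below 4, below 7) are assumptions; the p value and Scheffé adjustment are omitted because they need an F distribution, which the Python standard library does not provide.

```python
from statistics import mean


def bpi_group(score):
    """Assign a BPI-DN score to the 0-3, 4-6, or 7-10 category.

    Assumption: non-integer averages below 4 fall in "0-3" and
    those below 7 fall in "4-6"; the source does not state this.
    """
    if score < 4:
        return "0-3"
    if score < 7:
        return "4-6"
    return "7-10"


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a dict of group -> scores."""
    all_values = [value for values in groups.values() for value in values]
    grand_mean = mean(all_values)
    ss_between = sum(
        len(values) * (mean(values) - grand_mean) ** 2 for values in groups.values()
    )
    ss_within = sum(
        (value - mean(values)) ** 2 for values in groups.values() for value in values
    )
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical (BPI-DN item 5 score, SAT II pain improvement score) pairs.
records = [(1, 4), (2, 3), (3, 4), (4, 2), (5, 3), (6, 2), (8, 1), (9, 0), (10, 1)]
groups = {}
for bpi_score, sat_score in records:
    groups.setdefault(bpi_group(bpi_score), []).append(sat_score)

f_stat, df_between, df_within = one_way_anova_f(groups)
```

A large F relative to its degrees of freedom indicates that mean SAT II scores differ across the pain categories, which is the pattern known-groups validity looks for.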
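
Responsiveness.
Responsiveness was assessed by comparing SAT II scores between responders and nonresponders. Responders were defined as patients with a ≥30% or a ≥50% reduction in BPI-DN item 5 score, or with a PGIC rating of "minimally improved" or better or of "much improved" or better. Nonresponders were defined as all patients not meeting the respective responder definition. Comparisons between responders and nonresponders were conducted for each of the SAT II scores.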

Minimally Important Scores.
In the context of clinical trial use, while a measure may detect a difference between treatment arms, such an assessment does not consider whether or not the actual change experienced by patients is meaningful. A variety of methods have been developed to determine the minimum change in score that can be considered important, including both distribution- and anchor-based methods. Minimally important scores were estimated for the follow-up version SAT II scores.
One distribution-based approach which has been used for estimating minimally important scores is the standard error of measurement (SEM) [20,21]. The SEM describes the error associated with the measure, in this case the SAT II scores, and is estimated by the SD of the measure multiplied by the square root of one minus its reliability coefficient (ICC from the test-retest assessment or Cronbach's alpha from the internal consistency assessment). Shikiar et al. [22] found a general correspondence between the minimally important difference (MID) and SEM; however, this is somewhat dependent upon the magnitude of the reliability coefficient. SEM was calculated at week 8 and week 12. A second distribution-based approach conducted was an assessment of half of a SD of the SAT II scores at weeks 8 and 12. Norman et al. [23] suggest that one-half of a SD of a measure represents a clinically meaningful change, but not necessarily a MID. The half SD estimate provides an upper boundary for the MID. These analyses represent a statistical approach to defining minimally important scores and are considered supportive of anchor-based methods [24].
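The two distribution-based estimates described above reduce to simple arithmetic: SEM = SD × sqrt(1 - reliability), and half SD = 0.5 × SD. A minimal sketch follows; the function names and the numeric inputs are illustrative assumptions and are not values from the study.

```python
from math import sqrt


def standard_error_of_measurement(sd, reliability):
    """SEM = SD * sqrt(1 - reliability), where reliability is an ICC or Cronbach's alpha."""
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must lie between 0 and 1")
    return sd * sqrt(1.0 - reliability)


def half_standard_deviation(sd):
    """Half of one SD, used in the text as an upper boundary for the MID."""
    return 0.5 * sd


# Illustrative inputs only (not taken from the study results).
example_sd = 1.2
example_reliability = 0.75

sem_estimate = standard_error_of_measurement(example_sd, example_reliability)
half_sd_estimate = half_standard_deviation(example_sd)
```

Note that the SEM shrinks toward zero as the reliability coefficient approaches 1, so the SEM and half SD coincide only when the reliability is 0.75; this is the dependence on reliability that the text attributes to Shikiar et al.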
Anchor-based assessments select patients that achieve the MID for a measure that assesses a related construct (the anchor). The mean SAT II scores for this patient group represent minimally important scores, as it is assumed that patients that achieve a minimal response on the conceptually related anchor will also achieve a minimally important score on the SAT II. Minimally important scores were calculated at weeks 8 and 12, using the following anchors [15]: BPI-DN item 5 score change: 30%-40% and 25%-35%; BPI-DN item 9F score change: 30%-40% and 25%-35%; PGIC: minimally improved as well as minimally improved and much improved.
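The anchor-based logic can be sketched as: compute each patient's percent reduction on the anchor, keep the patients whose reduction falls within the chosen band, and take the mean SAT II score of that subgroup. The sketch below uses made-up data; the function names, the dictionary keys, and the inclusive treatment of the band boundaries are assumptions, since the text does not state how boundary values were handled.

```python
from statistics import mean


def percent_reduction(baseline, follow_up):
    """Percent reduction from baseline (positive values mean improvement)."""
    return 100.0 * (baseline - follow_up) / baseline


def anchor_based_score(patients, low, high):
    """Mean SAT II score among patients whose anchor reduction lies in [low, high].

    Assumptions: band boundaries are inclusive, and patients with a
    baseline anchor score of 0 are skipped to avoid dividing by zero.
    Returns None when no patient falls inside the band.
    """
    selected = [
        patient["sat_score"]
        for patient in patients
        if patient["baseline"] > 0
        and low <= percent_reduction(patient["baseline"], patient["follow_up"]) <= high
    ]
    return mean(selected) if selected else None


# Hypothetical patients: anchor scores at baseline and follow-up, plus SAT II score.
patients = [
    {"baseline": 10, "follow_up": 7, "sat_score": 2},  # 30% reduction
    {"baseline": 8, "follow_up": 5, "sat_score": 1},   # 37.5% reduction
    {"baseline": 10, "follow_up": 9, "sat_score": 0},  # 10% reduction
    {"baseline": 10, "follow_up": 4, "sat_score": 4},  # 60% reduction
]

score_30_to_40 = anchor_based_score(patients, 30, 40)
```

Restricting the subgroup to a narrow band around the anchor's clinically important difference, rather than to everyone at or above it, is what makes the resulting mean an estimate of a minimal rather than a typical improvement.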

Patient Demographics and Clinical Characteristics.
The mean age at study baseline was 63.0 years, with a range of 33-89 years (Table 1). Patients were predominantly white (71.3%) and 58.3% were male. The mean weight was 93.4 kg.
This is to be expected, given that not all patients are expected to improve and that the scale for these items does not include options that account for increased levels of pain (and subsequent impacts). Thirty-six percent (36%) of patients responded "yes, definitely" when asked whether they would like to receive the treatment again, with a further 26% responding "yes, probably." Fifty percent (50%) of patients reported the treatment to be "somewhat better" or "very much better" than the other treatments they had received for their condition. A very similar pattern of results was observed at week 12.

Item-to-Item and Item-to-Scale Correlations.
Item-to-item correlations between the pain and impact items at week 12 ranged from 0.70 to 0.90. The correlation between the self-care item (2a) and daily activities item (2b) was particularly high (r = 0.90), indicating potential redundancy. However, given that the importance of both the self-care and daily activity items was established by patient interviews during the revision of the SAT and given the daily nature of self-care activities, both items seem to measure separate and important constructs. Correlations between the treatment comparison/treatment continuation items and the other items were typically lower, ranging from r = 0.50 to 0.69. Correlations between the pain and impact items with the activity summary, QoL summary, and impact summary scores were high (r = 0.78 to 0.95), indicating a strong relationship between these items.

Exploratory Factor Analysis.
RMSEA was lower for a two-factor solution than a one-factor solution at week 8 (0.11 versus 0.19); however, correlations between the factors were relatively high (r = 0.68). Additionally, eigenvalues were dominated by a large first eigenvalue (4.8) with a value below 1.0 (0.62) for the second eigenvalue. Factor loadings were greater than 0.5 for all items in the one-factor solution at week 8 (ranging from 0.71 to 0.98). Combined, these findings support a one-factor solution for the six impact items. The findings were very similar between week 8 and week 12 factor analyses.

Scoring Approach.
Factor analysis and item-to-scale correlations support a single factor for items 2a-c and 3a-c. These items should be scored as a single summary score using the mean of the constituent item scores. Using the mean allows for a more intuitive interpretation of the score back on the original five-point scale of the constituent items (i.e., ranging from "not at all" to "very much better"). Items 1, 4, and 5 should all be scored separately.
The following options are recommended for comparing treatment arms using the SAT II follow-up version: (1) compare mean item and summary scores by treatment arm, and/or (2) compare the proportions of patients by item response category or the proportion scoring above a specified threshold for each item.
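The scoring approach described above can be sketched as a small function: items 1, 4, and 5 are carried through as individual scores, and the six impact items are averaged into one summary score. The item keys, the 0-4 numeric coding, the output names, and the rule that the summary is left missing when any impact item is missing are assumptions for illustration; the text does not specify how partially completed questionnaires were handled beyond stating that no data were imputed.

```python
from statistics import mean

# Assumed keys for the six impact items (questions 2a-c and 3a-c).
IMPACT_ITEMS = ("2a", "2b", "2c", "3a", "3b", "3c")


def score_sat2_follow_up(responses):
    """Score one SAT II follow-up questionnaire.

    `responses` maps item labels to numeric responses (assumed 0-4 coding),
    with None marking a missing answer. Assumption: the impact summary is
    None if any impact item is missing, since nothing is imputed.
    """
    impact_values = [responses.get(item) for item in IMPACT_ITEMS]
    if any(value is None for value in impact_values):
        impact_summary = None
    else:
        impact_summary = mean(impact_values)
    return {
        "pain_improvement": responses.get("1"),
        "impact_summary": impact_summary,
        "treatment_continuation": responses.get("4"),
        "treatment_comparison": responses.get("5"),
    }


# Hypothetical completed questionnaire (not study data).
example_responses = {
    "1": 3, "2a": 2, "2b": 2, "2c": 3, "3a": 1, "3b": 2, "3c": 2, "4": 4, "5": 3,
}
scores = score_sat2_follow_up(example_responses)
```

Because the summary is a mean rather than a sum, its value stays on the same scale as the individual items, which is the interpretability argument made in the text.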
Reliability.
Internal consistency reliability, as assessed using Cronbach's alpha, was 0.86 for the impact domain at baseline, 0.96 at week 8, and 0.96 at week 12, indicating good internal consistency. Test-retest reliability was assessed among stable patients (with <20% change in BPI-DN item 5) at week 8 and week 12 (Table 2). Acceptable test-retest reliability was demonstrated for all of the follow-up scores (ICC range: 0.62-0.78). Among stable patients defined as those with a <20% change in EQ-5D VAS score, acceptable test-retest reliability was demonstrated for all of the follow-up scores (ICC range: 0.68-0.79; Table 2).

Validity.
Convergent validity was assessed at week 8 and week 12, between the SAT II scores and the BPI-DN item 9F (sleep interference) score, the HADS subscale scores (anxiety and depression subscales), the EQ-5D index and VAS scores, and the NPSI domain scores (burning, evoked, pressive, paroxysmal, and abnormal sensations). Correlations ranged from 0.01 to −0.79 at week 8 and from −0.02 to −0.77 at week 12 and are presented in Tables 3(a) (week 8) and 3(b) (week 12).
At week 8, moderate to strong correlations were demonstrated on the BPI-DN item 5 and 9F overall and change from baseline scores for pain improvement, impact summary, treatment continuation, and treatment comparison (−0.30 to −0.60) (Table 3(a)). Strong correlations were demonstrated for all the items tested compared to the PGIC (−0.57 to −0.79), and moderate correlations were shown for the EQ-5D index overall and change from baseline (0.32 to 0.38). Weak correlations were shown between all SAT II items tested and HADS subscales and the EQ-5D VAS.
At week 12, moderate to strong correlations were also demonstrated on the BPI-DN item 5 and 9F overall and change from baseline scores for pain improvement, impact summary, treatment continuation, and treatment comparison (−0.29 to −0.58) (Table 3(b)). Strong correlations were demonstrated for all the items tested compared to the PGIC (−0.57 to −0.77). Moderate correlations were shown for the EQ-5D index overall and change from baseline (0.32 to 0.38) and for the EQ-5D VAS (0.20 to 0.25). Weak correlations were shown between all SAT II items tested and HADS subscales.
For the known-groups validity analyses, the ANOVA tests were significant, suggesting that the SAT II pain improvement, impact summary, treatment continuation, and treatment comparison scores discriminate between groups as defined by the BPI-DN item 5 (Table 4(a)) and item 9F (Table 4(b)) at weeks 8 and 12. Scheffé's post hoc tests demonstrated that all scores were significantly different between the 0-3 versus 4-6 and 0-3 versus 7-10 categories on the BPI-DN items 5 and 9F for weeks 8 and 12. Scores were also significantly different between the 4-6 and 7-10 categories for the pain improvement and treatment comparison scores at week 8 and for the pain improvement score at week 12 on the BPI-DN item 5, as well as for the pain improvement score at week 8 on the BPI-DN item 9F.

Responsiveness.
At both weeks 8 and 12, significant differences (p < 0.0001) were demonstrated for all items between responders and nonresponders, based on a responder definition of a ≥30% and a ≥50% reduction in BPI-DN item 5 score. The general trend for all items is that there is a greater proportion of responders than nonresponders in categories indicating superior benefit (e.g., "quite a bit better" and "very much better").
When using the PGIC to define responders, significant differences were observed between responders and nonresponders (p < 0.0001) for all items at both weeks 8 and 12, using both definitions (i.e., "minimally improved" or better or "much improved" or better, to define responders). The general trend for all items is that there is a greater proportion of responders than nonresponders in categories indicating superior benefit (e.g., "quite a bit better" and "very much better"). These results demonstrate that the SAT II items and summary scores were able to detect a clinically meaningful change in health status or level of pain.
[Table 5. Table notes: SAT II mean scores and SD for patients with a BPI-DN item 5 score within the specified range; SAT II mean scores and SD for patients with a BPI-DN item 9F score within the specified range.]
Based on the total summary of evidence on estimates of minimally important scores, with greater focus on the anchor-based estimates, attainment of a pain improvement score of 1.2 to 2.4 may represent a meaningful threshold for determining clinically meaningful improvement. For impact summary scores, a score of 1.0 to 2.0 may represent a meaningful threshold for determining clinically meaningful improvement. Based on all the anchor-based estimates, a score of ≥1.5 may be considered clinically meaningful for both the pain improvement and impact summary scores; a more conservative estimate would be a score of ≥2.0. Treatment continuation and treatment comparison scores are directly translatable.

Discussion
The SAT II questionnaire is based on the original SAT questionnaire developed based on the IMMPACT recommendations. The original measure lacked a recall period and the activity and QoL items were considered too broad [12]. The need for a new version of the questionnaire was identified through qualitative research, which showed that the original questionnaire was lacking in content validity for the activity and QoL items. The modified SAT II measure was developed to address these concerns [11]. There are two versions of the SAT II, one to be administered at baseline and the other at follow-up visits. The SAT II baseline version contains seven items evaluating current status (pain level, impact on self-care activities, daily activities, physical activities, emotional wellbeing, sleep, and social functioning). The SAT II follow-up version contains nine items: seven items similar to the baseline version but asking about the improvement in the level or impact of pain due to treatment and two additional items on (1) whether the patient wants to receive the treatment again and (2) how the treatment compares with other pain treatments. The SAT II baseline version is recommended for use in characterizing the patient sample, while the SAT II follow-up version is recommended for use to compare patient-reported improvements by treatment arm. This study focused on the development of scoring algorithms and the assessment of the measurement properties of the SAT II follow-up version for use in clinical trials.
The analyses reported here support combining items 2a-c (self-care, daily, and physical activities) and 3a-c (emotional wellbeing, sleep, and social functioning) as a single summary score. Item-to-item correlations were generally moderate to strong. This relationship was further explored by factor analysis, which supported the use of a single summary score comprising items 2a-c and 3a-c. There are two acceptable approaches to scoring. The first is simply to compare mean scores by treatment arm for questions 1, 4, and 5 and the summary score (questions 2-3). The second method compares the proportions of patients reporting different SAT II item response levels by treatment arm. For this analysis, rather than comparing across all response categories, patients can be grouped as those at or above versus below a response category (e.g., "moderately better") and compared by treatment arm (i.e., a 2 × 2 contingency table, for which odds ratios and chi-square p values can be reported).
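The 2 × 2 comparison mentioned above can be illustrated with a short calculation of the odds ratio and the Pearson chi-square test. The counts below are invented for the example, the function name `two_by_two_summary` is an assumption, and no continuity correction is applied; for one degree of freedom the chi-square p value equals erfc(sqrt(chi2 / 2)), which avoids the need for a statistics package.

```python
from math import erfc, sqrt


def two_by_two_summary(a, b, c, d):
    """Odds ratio, Pearson chi-square (no continuity correction), and p value.

    Table layout (assumed for this sketch):
                    at/above threshold   below threshold
        active              a                   b
        placebo             c                   d
    Requires b * c > 0 and non-zero row and column totals.
    """
    n = a + b + c + d
    odds_ratio = (a * d) / (b * c)
    chi_square = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # With 1 degree of freedom, chi-square is a squared standard normal,
    # so the upper-tail probability is erfc(sqrt(chi_square / 2)).
    p_value = erfc(sqrt(chi_square / 2))
    return odds_ratio, chi_square, p_value


# Hypothetical counts of patients at or above "moderately better" by arm.
odds_ratio, chi_square, p_value = two_by_two_summary(30, 20, 15, 35)
```

With small cell counts an exact test would usually be preferred over the chi-square approximation; that choice is outside what the text specifies.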
The SAT II demonstrated good internal consistency and good test-retest reliability (both the BPI-DN item 5 and the EQ-5D VAS were used to define stable patients). Tests of convergent validity showed that the BPI-DN items 5 and 9F and the PGIC were most strongly correlated with SAT II score at both weeks 8 and 12, while the EQ-5D showed moderate correlations. The weakest correlations were seen in relation to the EQ-5D VAS and the HADS. Tests for known-groups validity showed that the SAT II scores varied significantly by level of pain intensity and sleep interference. SAT II scores clearly delineated between pain severity and sleep interference groups, with better SAT II scores in the groups reporting lower pain severity or sleep interference.
The distribution-based minimally important scores were consistent across time points when using the ICC based on either the BPI-DN item 5 or the EQ-5D VAS. Based on the overall anchor-based estimates, a score of ≥1.5 at follow-up may be considered clinically meaningful for both the pain improvement and impact summary scores; a more conservative estimate would be a score of ≥2.0. Note that an achieved score of 2.0 is equivalent to "moderately better" or greater, and an achieved score of 1.5 falls between "slightly better" and "moderately better." Based on the results, a threshold of 2.0 may provide the best estimate for clinical significance. The treatment continuation and treatment comparison scores are directly translatable: a treatment continuation score of 3 or greater represents "yes, probably" or "yes, definitely," and a treatment comparison score of 3 or greater represents "somewhat better" or "very much better."
To be used as an endpoint in clinical trials, a measure must be not only reliable and valid but also sensitive to changes in a patient's condition. The SAT II was shown to be responsive to change, based on both a measure of patient-reported global change and improvement in pain severity scores. Responsiveness was demonstrated for all item scores across all responder definitions. In addition, the SAT II impact summary score (items 2-3) demonstrated the ability to detect change based on the patient global ratings and improvements in pain severity.

Conclusion
The SAT II follow-up version measures patient-reported improvement in pain, the impact of treatment on daily activities and functioning, and treatment satisfaction. The SAT II follow-up version demonstrated good internal consistency and test-retest reliability and good evidence supporting convergent and known-groups validity. More importantly, the SAT II was responsive to changes in pain severity and global ratings of change in health status. The SAT II may be an acceptable endpoint for pain treatment studies. These findings suggest that the SAT II may be an acceptable primary or secondary endpoint in PDPN clinical trials. Future research is needed to confirm the measurement properties of the SAT II.

Competing Interests
Floortje van Nooten was an employee at Astellas Pharma at the time of the study. This study was conducted by Dennis A. Revicki, Dorota Staniewska, Dylan Trundell, and Evan W. Davies who were employed by Evidera, a consultancy company funded by Astellas Pharma to conduct this study. Dorota Staniewska, Evan W. Davies, and Dylan Trundell are no longer employed by Evidera.

Authors' Contributions
Dorota Staniewska and Jun Chen conducted the statistical analyses. All authors were involved in the design and interpretation of the analyses, contributed to the drafting of the manuscript, and approved the final manuscript.