Validation of the Swedish version of PROMIS-29v2 and FACIT-Dyspnea Index in patients with systemic sclerosis

Abstract Purpose To evaluate the reliability, internal consistency, and construct validity of the Swedish versions of the PROMIS-29 and Functional Assessment of Chronic Illness Therapy-Dyspnea (FACIT-Dyspnea) instruments in patients with systemic sclerosis (SSc). Methods In a cross-sectional study, consecutive SSc patients completed a paper-based survey. Internal consistency was assessed using Cronbach’s alpha. Test–retest reliability was assessed using weighted kappa (Kw) and the intra-class correlation coefficient (ICC). Construct validity was evaluated by hypotheses testing using RAND-36, the MRC Dyspnea score, the Scleroderma Health Assessment Questionnaire (SHAQ) and clinical measurements. Results Forty-nine patients (86% female; 73% limited cutaneous SSc) completed the survey. The mean disease duration was 11 years and the mean SHAQ score was 0.5. Internal consistency and test–retest reliability were good with the exception of PROMIS-29 anxiety. PROMIS-29, FACIT-Dyspnea, and Functional limitation showed strong correlations with corresponding RAND-36 domains (|rs| = 0.67 to 0.85). Relevant PROMIS-29 domains, FACIT-Dyspnea and Functional limitation correlated strongly with SHAQ and VAS overall disease severity (|rs| = 0.60 to 0.75). Ceiling effects (>15%) were found in six PROMIS-29 domains and in both FACIT-Dyspnea and Functional limitation. Four of the five hypotheses were confirmed. Conclusions PROMIS-29 and FACIT-Dyspnea meet the requirements for reliability and have adequate construct validity in Swedish patients with SSc. Implications for rehabilitation PROMIS-29v2 and the Functional Assessment of Chronic Illness Therapy-Dyspnea (FACIT-Dyspnea) Index are patient-reported outcome measures that are gaining increasing interest for the evaluation of patients with rheumatologic diseases. PROMIS-29v2 and the FACIT-Dyspnea Index meet the requirements for reliability and have adequate construct validity compared to legacy measures in Swedish patients with systemic sclerosis. 
Translation and validation of PROMs is important for studies of rare diseases in multi-center collaborations.


Introduction
Systemic sclerosis (SSc) is a chronic connective tissue disease characterized by immune dysfunction, vascular injury, and abnormal fibrotic processes [1]. This multisystem disorder can affect the skin, lungs, gastrointestinal tract and cardiovascular system, and can restrict performance of activities of daily living and health-related quality-of-life [2,3]. Pulmonary involvement is a common manifestation. In clinical practice, the severity of this involvement is frequently quantified using pulmonary function tests. However, clinical outcome measures, such as laboratory and objective functional tests, rarely match the patient's experience of day-to-day functioning [4]. Patient-reported outcome measures (PROMs), which document the patient's perceived impact on functioning in daily life and well-being, are therefore important tools for evaluating treatment and rehabilitation from a patient perspective [5]. For studies of rare diseases, such as SSc, international collaborations between sites are important. It is therefore necessary that standardized PROMs are validated and psychometrically tested for the participating countries and languages.
The National Institutes of Health (NIH) Patient-Reported Outcomes Measurement Information System (PROMIS; online at https://www.nihpromis.org and https://www.promishealth.org/) seeks to use modern psychometric methods to standardize the measurement of patient-reported outcomes for all medical conditions [6]. This involves building item banks of questions that can be used in computerized adaptive testing (CAT), and short forms and profile measures that can be used in paper form. PROMIS-29 is a multidimensional profile scale with 29 items that has been validated and referenced in the general US population [7]. The Functional Assessment of Chronic Illness Therapy-Dyspnea (FACIT-Dyspnea) is a symptom-specific instrument that is validated in patients with self-reported chronic obstructive pulmonary disease [8]. Hinchcliff et al. [9], Kwakkenbos et al. [10], Morrisroe et al. [11], and Fisher et al. [12] found that these new measures had a strong correlation with legacy PROMs in patients with SSc and that they could therefore be valid measures of health status in SSc. Both PROMIS-29 and FACIT-Dyspnea have been translated and psychometrically tested in several languages, but not in Swedish. This study therefore aimed to translate these two instruments into Swedish and examine their psychometric properties, including their reliability and construct validity, within a Swedish SSc population.

Translation procedure
Translation of the PROMIS-29 followed the PROMIS guideline document for translation, and cultural adaptation from 4 November 2012 and the FACIT translation methodology [13]. This methodology consists of 11 steps. All steps were reported and approved by the PROMIS organization. The translation procedure was initially performed by the Department of Rheumatology, Lund University. Translational support was also purchased from the Department Translation and Language Services, Lund University. The later steps of the translation procedure were completed in cooperation with members of the PROMIS group at the Quality Register Centrum (QRC) in Stockholm, Sweden. FACIT-Dyspnea was translated from English to Swedish applying the FACIT Measurement procedure (www.facit.org). Cognitive debriefing interviews were performed with 10 SSc patients in a structured interview format, and the FACIT organization surveyed the interview forms and approved the Swedish translation as conceptually equivalent to the original instrument.

Study design and patient cohort
To psychometrically test the new PROMs, patients were consecutively enrolled during their regular scheduled in-patient follow-up at the Department of Rheumatology, Lund, Sweden, between 1 September 2017 and 31 May 2018. Patients were eligible for the study if they fulfilled the 2013 American College of Rheumatology/European League Against Rheumatism (ACR/EULAR) criteria for SSc [14], were 18 years of age or older and were fluent in Swedish. The patients completed paper-based PROMIS-29 and FACIT-Dyspnea questionnaires, and legacy PROMs (RAND-36 [15], MRC Dyspnea score [16], and Scleroderma Health Assessment Questionnaire (SHAQ) [17]).

Ethics and consent
The study was conducted in accordance with the Declaration of Helsinki, and was approved by the Regional Ethics Committee in Lund (Dnr 2016/342). The patients were given verbal information on the aim of the study, and written consent was obtained.

Demographic and disease parameters
Age, gender, and employment status were retrieved from the medical records. Disease onset was defined as the first non-Raynaud's manifestation. Patients were classified as limited cutaneous SSc (lcSSc) or diffuse cutaneous SSc (dcSSc) [18]. Analysis of SSc-specific antibodies and organ workup including pulmonary function tests was performed as previously described [19]. Routine laboratory and diagnostic values were systematically retrieved from the medical record for each patient. Organ system involvement was also characterized according to the Medsger Severity Scale (MSS) [20] with adaptation to the local patient workup: (1) the general condition was estimated from the body mass index (BMI) since weight loss was not recorded, and from PCV that was estimated indirectly from hemoglobin (hemoglobin = hematocrit (PCV)/3) since the PCV was not measured routinely; (2) peripheral vascular involvement was defined based on absence or presence of Raynaud's, digital pitting scars, and/or digital tip ulcerations; (3) skin involvement was quantified using the modified Rodnan skin score (mRss) [21]; (4) joint symptoms were measured with the finger flexion item in the hand mobility in scleroderma test [22]; (5) muscle involvement was determined according to creatine kinase elevation since data on proximal weakness were not available; (6) gastrointestinal symptoms were defined as mild, moderate, and severe esophageal involvement analyzed by cine-radiography of the esophagus [23]; (7) lung, (8) heart, and (9) kidney involvement were evaluated in accordance with Medsger et al. [20]. All organ systems were scored 0 (normal) to 4 (end-stage) and were summed to obtain the severity score.
For comparison, patients were also characterized by a modified MSS according to Hinchcliff et al. [9] which included the following variables: mRss, diffusion capacity for carbon monoxide (DLCO, % of predicted), estimated right ventricular systolic pressure, N-terminal pro-brain natriuretic peptide (NT-pro-BNP), hemoglobin values and creatine levels. Variable severity was scored 0, 1, and 2 (0 and 1 for estimated right ventricular systolic pressure) and summed to obtain the severity score.

PROMIS-29 version 2
PROMIS-29v2 is a PROMIS profile instrument containing four questions from each of seven PROMIS domains and a single pain intensity 0-10 numeric rating scale [24]. Depression, anxiety, pain interference, fatigue, and sleep disturbance are completed with reference to the last seven days; physical function and ability to participate in social roles and activities are completed without reference to a time period. Each item is scored from 1 to 5, and the item scores are summed to create a raw score for each domain. Raw scores are converted into T-scores standardized for the general population in the USA (mean ± SD, 50 ± 10) [7]. Higher scores represent worse symptoms within the domains anxiety, depression, fatigue, pain interference, and sleep disturbance, while higher scores within physical functioning and social roles represent better functioning.

FACIT-Dyspnea 10-item short form
FACIT-Dyspnea consists of two subscales: Dyspnea score and Functional limitation score. The patients rate the severity of dyspnea and the difficulty of performing 10 common tasks of daily life over the past seven days. The scales are scored from 0 (no shortness of breath/no difficulty performing the activity) to 3 (severely short of breath/much difficulty performing the activity). To obtain a raw score for each of the two subscales, the individual items are summed, multiplied by the number of items in the scale and then divided by the number of items answered. The raw scores are converted into T-scores standardized for a reference population of persons with chronic obstructive pulmonary disease (mean ± SD, 50 ± 10) [8].
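The prorating rule for the raw subscale score can be illustrated with a minimal Python sketch. The function name and the coding of unanswered items as None are assumptions made for this illustration only; the conversion from raw scores to T-scores relies on the published scoring tables [8] and is not reproduced here.

```python
from typing import Optional, Sequence


def prorated_raw_score(responses: Sequence[Optional[int]]) -> float:
    """Prorated raw score for one 10-item FACIT-Dyspnea subscale.

    Each item is scored 0-3. Unanswered items are passed as None
    (an assumption of this sketch, not part of the instrument).
    """
    answered = [r for r in responses if r is not None]
    if not answered:
        raise ValueError("no items answered")
    if any(r not in (0, 1, 2, 3) for r in answered):
        raise ValueError("item scores must be 0, 1, 2 or 3")
    # Sum of answered items x items in the scale / items answered.
    return sum(answered) * len(responses) / len(answered)


# Hypothetical patient who skipped two of the ten tasks.
example = [0, 1, 1, None, 2, 0, 1, None, 3, 0]
print(prorated_raw_score(example))
```

With complete responses the prorated score equals the plain item sum, so the rule only changes the result when items are missing.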

Legacy PRO instruments
RAND-36 is a generic measure of health-related quality of life [25]. The questionnaire consists of one question that measures change in perceived health in the past 12 months, and 35 items divided into eight dimensions of health: physical functioning, social functioning, role limitations (physical problem), role limitations (emotional problem), mental health, vitality, pain, and general health perception. The questions relate to current health, health in the past 4 weeks, and change in health over the past year. Scores are calculated via a scoring key that represents the percentage of the total possible score achieved in each domain. The scores therefore range from 0 to 100. Higher scores represent better health status. The wording of the items and domains in RAND-36 is the same as in SF-36, but the two instruments differ regarding their item summation [26].
The Medical Research Council (MRC) Dyspnea Scale consists of five statements related to daily life activities. These statements measure current disability experienced due to perceived breathlessness from 0 to 4 [27,28]. The grading applied in Sweden is 0 = breathlessness on strenuous exercise; 1 = shortness of breath when walking fast on level ground or walking up a slight hill; 2 = out of breath when walking at the same rate as other same age people; 3 = stops for breath after about 100 yds when walking at my own pace on level ground; 4 = too breathless to leave the house, or breathless when undressing. In this study, we used the recommended self-administered version retrieved from the PROM-guide (https://lvr.registercentrum.se/). Disability was reported by the Swedish version of SHAQ [29]. The SHAQ consists of the HAQ-DI (Health Assessment Questionnaire-Disability Index) [30] (range from 0 (no impairment) to 3 (not able to perform the task)) and five SSc-specific VAS scales to evaluate Raynaud's phenomenon, digital tip ulcers, gastrointestinal involvement, lung involvement, and overall disease severity from the patient's perspective [17]. The recall period is seven days for all items. The Swedish version of SHAQ has been found to have acceptable reproducibility and concurrent validity [29].

Statistical analysis
Descriptive data are presented as means, standard deviations (SD), and range, or as numbers and percentages (%). p values < 0.05 were considered significant. All statistical analyses were performed in SPSS v.24 (IBM, Armonk, NY) or STATISTICA v.12 (StatSoft, Tulsa, OK). Quadratic kappa values were analyzed using an online calculator (http://vassarstats.net/index.html).

Internal consistency
Internal consistency was examined on the first test occasion and analyzed with Cronbach's alpha. Alpha values of 0.70-0.95 were considered as good internal consistency [31].
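As an illustration of the internal consistency analysis, Cronbach's alpha can be computed from item-level responses as sketched below. This is a minimal sketch using the standard textbook formula, not the SPSS procedure used in the study; the function names and the example data are hypothetical, and complete responses are assumed.

```python
from statistics import variance
from typing import Sequence


def cronbach_alpha(rows: Sequence[Sequence[float]]) -> float:
    """Cronbach's alpha for complete responses.

    rows: one sequence of item scores per respondent (no missing values).
    """
    k = len(rows[0])
    if k < 2 or len(rows) < 2:
        raise ValueError("need at least two items and two respondents")
    items = list(zip(*rows))  # transpose: one tuple per item
    item_var_sum = sum(variance(col) for col in items)
    total_var = variance([sum(r) for r in rows])
    if total_var == 0:
        raise ValueError("total scores have zero variance")
    return k / (k - 1) * (1 - item_var_sum / total_var)


def is_good_consistency(alpha: float) -> bool:
    """Apply the 0.70-0.95 range considered good in this study [31]."""
    return 0.70 <= alpha <= 0.95


# Hypothetical responses of five patients to a four-item domain (items scored 1-5).
responses = [
    [1, 2, 1, 1],
    [3, 3, 2, 3],
    [2, 2, 2, 1],
    [5, 4, 5, 4],
    [4, 4, 3, 4],
]
alpha = cronbach_alpha(responses)
print(round(alpha, 2), is_good_consistency(alpha))
```

The upper bound of 0.95 is included because very high alpha values may indicate redundant items rather than better measurement.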

External reliability
Test-retest reliability was analyzed with intra-class correlation coefficients (ICCs) of the summary scores of the PROMs. ICC values of >0.70 represented good reliability in samples of 50 individuals [31]. Due to ordinal scaling, the linear weighted kappa (Kw) coefficient was used to measure stability within test scores for the individual items in the PROMs [32]. Linear weights were applied assuming equal distances between the scoring steps of the items [33]. In addition, quadratic Kw-values were shown, as recommended by Vanbelle [34]. Kw-values were interpreted as <0.2 = slight, 0.21-0.40 = fair, 0.41-0.60 = moderate, 0.61-0.80 = substantial and >0.8 = almost perfect agreement [35]. The first completion of the patients' questionnaire was during a routine clinical visit. The retest was completed within two weeks using questionnaires mailed to the patients and completed at home. We expected good test-retest reliability according to ICC for PROMIS-29 in patients with SSc based on the previous study by Fisher et al. [12] (Hypothesis 1).
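The item-level agreement analysis can be sketched as follows. This is a minimal illustration of Cohen's weighted kappa with linear or quadratic disagreement weights; it is not the software used in the study. The function names and the example ratings are hypothetical, and the handling of values that fall exactly on the interval boundaries is an assumption.

```python
from typing import Sequence


def weighted_kappa(test: Sequence[int], retest: Sequence[int],
                   categories: Sequence[int], quadratic: bool = False) -> float:
    """Cohen's weighted kappa for paired ordinal ratings."""
    if len(test) != len(retest) or not test:
        raise ValueError("need equally long, non-empty rating lists")
    k, n = len(categories), len(test)
    if k < 2:
        raise ValueError("need at least two categories")
    index = {c: i for i, c in enumerate(categories)}
    observed = [[0.0] * k for _ in range(k)]
    for a, b in zip(test, retest):
        observed[index[a]][index[b]] += 1 / n
    row = [sum(observed[i]) for i in range(k)]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]
    power = 2 if quadratic else 1
    disagree_obs = disagree_exp = 0.0
    for i in range(k):
        for j in range(k):
            w = (abs(i - j) / (k - 1)) ** power  # disagreement weight
            disagree_obs += w * observed[i][j]
            disagree_exp += w * row[i] * col[j]
    if disagree_exp == 0:
        raise ValueError("kappa is undefined when all ratings are identical")
    return 1 - disagree_obs / disagree_exp


def interpret_kappa(kw: float) -> str:
    """Agreement labels following the intervals cited in the text [35]."""
    if kw <= 0.20:
        return "slight"
    if kw <= 0.40:
        return "fair"
    if kw <= 0.60:
        return "moderate"
    if kw <= 0.80:
        return "substantial"
    return "almost perfect"


# Hypothetical test and retest answers to one item scored 1-5.
first = [1, 2, 2, 3, 4, 5, 1, 3]
second = [1, 2, 3, 3, 3, 5, 2, 3]
kw_linear = weighted_kappa(first, second, categories=[1, 2, 3, 4, 5])
kw_quadratic = weighted_kappa(first, second, categories=[1, 2, 3, 4, 5], quadratic=True)
print(round(kw_linear, 2), interpret_kappa(kw_linear), round(kw_quadratic, 2))
```

Quadratic weights penalize large disagreements more heavily than small ones, whereas linear weights treat each scoring step as equally distant.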

Floor and ceiling effects
Frequency distributions of the questionnaire scores were calculated, together with the percentages of patients scoring the lowest possible health (floor effect) and the best possible health (ceiling effect). To facilitate interpretation, these terms are used irrespective of the direction of the scale. Floor or ceiling effects were noted if 15% or more of the patients gave the lowest or best possible health scores [36].
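The floor and ceiling definition used here can be sketched in a few lines of Python. The function name and the example scores are hypothetical; the caller states which score represents the worst and the best possible health, so the same function applies to scales of either direction.

```python
from typing import Sequence, Tuple


def floor_ceiling(scores: Sequence[float], worst: float, best: float,
                  threshold_pct: float = 15.0) -> Tuple[float, float, bool, bool]:
    """Return (floor %, ceiling %, floor effect present, ceiling effect present).

    worst / best: the scores representing the lowest and the best possible
    health, irrespective of the direction of the scale.
    """
    if not scores:
        raise ValueError("no scores given")
    n = len(scores)
    floor_pct = 100 * sum(s == worst for s in scores) / n
    ceiling_pct = 100 * sum(s == best for s in scores) / n
    return floor_pct, ceiling_pct, floor_pct >= threshold_pct, ceiling_pct >= threshold_pct


# Hypothetical raw scores for a four-item symptom domain (range 4-20,
# higher = worse symptoms), so best health = 4 and worst health = 20.
anxiety_raw = [4, 4, 5, 10, 20, 8, 9, 12, 7, 6]
print(floor_ceiling(anxiety_raw, worst=20, best=4))
```

For a function domain such as physical functioning, where higher scores mean better health, the same call would use worst=4 and best=20.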

Construct validity - hypotheses testing
Hypotheses testing was undertaken via Spearman's correlation coefficient. The rs-values were interpreted as follows: rs < 0.30 as low correlation, rs = 0.30 to 0.50 as moderate, and rs > 0.50 as strong correlation [37]. Based on references measuring PROMs in patients with SSc [9-12], we hypothesized that PROMIS-29 and FACIT-Dyspnea would have strong correlations with corresponding legacy PROMs and their corresponding subscales [38] (Hypothesis 2), summarized together with the results. We also hypothesized that PROM subscales would have weak correlations with clinical outcome measures (Hypothesis 3), since clinical outcome measures rarely match the patient's experience of day-to-day functioning [4]. We expected to detect moderate correlations between the PROMIS-29 scales of physical functioning, pain interference, and satisfaction with social roles, as well as the two FACIT-Dyspnea scales, and the MSS modified according to Hinchcliff et al. [9] (Hypothesis 4). Since worsening clinical and psychosocial factors correlate with increased work disability [10], we hypothesized that patients who were classified as able to work would have better perceived functioning (physical functioning and ability to participate in social roles and activities) and fewer symptoms (anxiety, depression, fatigue, pain interference, and sleep disturbance) than patients receiving a sick-pension or on sick-leave (Hypothesis 5).
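The correlation analysis and its interpretation thresholds can be illustrated with the following minimal sketch. It computes Spearman's coefficient as the Pearson correlation of average ranks and applies the cut-offs to the absolute value of the coefficient, which is an assumption consistent with how negative correlations are described in the results. The function names and the example data are hypothetical.

```python
from math import sqrt
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Ranks starting at 1; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1
        for i in order[pos:end + 1]:
            ranks[i] = mean_rank
        pos = end + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rank correlation (Pearson correlation of the ranks)."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equally long samples of at least two values")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sqrt(sum((a - mx) ** 2 for a in rx))
    sy = sqrt(sum((b - my) ** 2 for b in ry))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sx * sy)


def strength(rs: float) -> str:
    """Label |rs| with the cut-offs cited in the text [37]."""
    size = abs(rs)
    if size < 0.30:
        return "low"
    if size <= 0.50:
        return "moderate"
    return "strong"


# Hypothetical paired scores: a function T-score and a disability index.
physical_function = [38.0, 42.5, 45.0, 50.1, 56.9, 44.0]
disability_index = [1.6, 1.0, 0.9, 0.3, 0.0, 0.5]
rs = spearman(physical_function, disability_index)
print(round(rs, 2), strength(rs))
```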

Demographic characteristics
The demographic data are depicted in Table 1. We aimed to include 50 SSc patients, a number that has been recommended as adequate for cross-cultural validation [39]. Unfortunately, one patient was included twice, measured one year apart, and was therefore excluded. Thus, 49 consecutively enrolled patients (42 female and seven male) aged between 24 and 76 years were included in the study. Mean disease duration from non-Raynaud symptom onset was 11 (±9.3) years. Thirty-six (73%) patients had lcSSc and 13 (27%) patients had dcSSc. Mean mRss was 4.4 (±7.7) points and mean VC was 96.3 (±16.4) population %. Less than one third of the patients had arthritis. Pitting scars were present in 19 patients (39%) (Table 1). Twenty-seven (55%) patients had mild disease (<5 points) according to MSS. Peripheral vascular and upper gastrointestinal complications showed the highest scores (1.5 ± 0.8 and 1.1 ± 0.6, respectively) of the MSS items. The PROMIS-29 scores of the patients were lower than the general US population reference [7], with poorer physical functioning (44.5), more anxiety (52.3), and more pain interference (53.1) (Table 2). According to the FACIT instrument, our study group reported less dyspnea (41.7) and fewer functional limitations (42.9) than the US reference population [8] (Table 2). The descriptive data of the legacy PROMs are summarized in Supplementary Table 1. On average, the study group had mild disability according to HAQ-DI, 0.5 (±0.6). Raynaud's phenomenon and fatigue VAS scores were higher than the other VAS measures (1.0 ± 1.0 and 1.0 ± 0.9) (Supplementary Table 1). RAND-36 scores (0-100) ranged from 43.5 (±22.1) for general health to 78.9 (±22.3) for social function (Supplementary Table 1).

Internal consistency
Internal consistency was good for both instruments (

External reliability
Thirty-nine patients completed the retest questionnaires. External reliability was good for the new instruments. ICC ranged from 0.78 to 0.94 for summary scores of PROMIS-29, with the exception of the domain anxiety (ICC 0.67, CI = 0.37-0.83) (Table 3). Two PROMIS-29 anxiety questions had the lowest agreement, with linear Kw-values of 0.34 and 0.37 (Supplemental Table 2). These anxiety questions indicated less anxiety when completed at retest compared to the scores from the first, hospital-completed, test. Supplementary Table 3 Table 4).

Construct validity - hypotheses testing
PROMIS-29 domains showed strong positive correlations to corresponding RAND-36 domains (Table 4 and Supplemental Table 6). The PROMIS-29 domains physical functioning, pain interference, and ability to participate in social roles and activities were strongly correlated to SHAQ score (rs = −0.75, 0.75, and −0.71). In addition, PROMIS-29 showed that patients with a lower degree of sick-leave had better physical function (rs = −0.46) and better ability to participate in social roles and activities (rs = −0.39). It was also found that these patients had less pain (rs = 0.42) and less depression (rs = 0.36).
FACIT-Dyspnea and FACIT Functional limitation had strong correlations to the MRC Dyspnea score, the SHAQ score and to the RAND-36 subscales of physical functioning and general health (|rs| = 0.73 to 0.85) (Table 4 and Supplemental Table 7). Both FACIT-Dyspnea and FACIT Functional limitation correlated moderately with work ability (rs = −0.50 and −0.51).
PROMIS-29 physical functioning, pain interference and ability to participate in social roles had no correlations with the sum of MSS (rs = −0.04, 0.03, and 0.21, respectively) nor the modified MSS according to Hinchcliff (rs = −0.28, 0.07, and 0.00, respectively). Neither FACIT-Dyspnea nor FACIT Functional limitation correlated to the sum of MSS (rs = 0.13 for both) but they were correlated moderately to the sum of modified MSS according to Hinchcliff (rs = 0.38 for both). PROMIS-29 physical functioning, FACIT-Dyspnea and FACIT Functional limitation had moderate correlations to MSS lung subscale (rs = −0.30, 0.37, and 0.38, respectively). These correlations were mainly caused by correlations with the DLCO (rs = 0.33, −0.43, and −0.43, respectively) which is part of the MSS lung subscale. PROMIS-29 physical functioning, FACIT-Dyspnea, and FACIT Functional limitation also had moderate correlation to VC/DLCO ratio (rs = −0.32, −0.31, and −0.31, respectively) but not to VC (rs = −0.06, −0.10, and −0.10, respectively). Similarly, RAND-36 physical functioning and MRC scale correlated moderately to DLCO (rs = −0.43 and −0.42, respectively) and VC/DLCO ratio (rs = −0.39 and −0.30, respectively).
Table notes. Floor effect: score indicating lowest possible health; ceiling effect: score indicating best possible health irrespective of the direction of the scale. T score 50 ± 10 represents for PROMIS-29 the mean ± SD of the US general population, and for FACIT-Dyspnea it is based on a population with chronic obstructive pulmonary disease. N = 49. a: One missing. b: Higher scores represent worse symptoms and worse health related quality of life.

Discussion
Increasing interest in applying PROMs, such as PROMIS-29 and FACIT-Dyspnea, to studies of the rare and multifaceted disease SSc calls for cross-cultural psychometric testing. Our study shows that the Swedish versions of PROMIS-29v2 and FACIT-Dyspnea meet the requirements for internal reliability, reproducibility, and construct validity compared with the total scores of the legacy measures. Both instruments showed some weaknesses concerning ceiling effects in our study group, as did the legacy instruments.

Study population
The PROMIS-29 scores of our study group were lower than the general US population reference [7], indicating poorer physical functioning and more anxiety and pain interference. However, the degree of impairment was less than the minimum clinically important difference (0.5 SD or 5 points [40]) for all domains except physical functioning. Our study group had less fatigue according to the PROMIS-29 instrument compared to the cohorts by Morrisroe et al., Kwakkenbos et al., and Fisher et al. [10-12].
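The comparison against the minimum clinically important difference can be illustrated with the three domain means reported in the results. This is a minimal sketch: the function name is hypothetical, only the three T-scores quoted in the text are used (the remaining domains are in Table 2 and are not reproduced here), and treating a difference equal to the threshold as exceeding it is an assumption.

```python
REFERENCE_MEAN = 50.0  # T-score mean of the US general population [7]
MCID_POINTS = 5.0      # 0.5 SD on the T-score metric [40]

# Cohort means quoted in the results section.
cohort_means = {
    "physical functioning": 44.5,
    "anxiety": 52.3,
    "pain interference": 53.1,
}


def exceeds_mcid(t_score: float, reference: float = REFERENCE_MEAN,
                 mcid: float = MCID_POINTS) -> bool:
    """True if the distance from the reference mean reaches the MCID."""
    return abs(t_score - reference) >= mcid


for domain, mean_t in cohort_means.items():
    print(domain, round(mean_t - REFERENCE_MEAN, 1), exceeds_mcid(mean_t))
```

Only physical functioning, at 5.5 points below the reference mean, reaches the 5-point threshold among these three domains.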
According to the FACIT instrument, our study group reported less dyspnea and functional limitations than the reference population [8]. These findings are in accordance with the results from the previous two studies by Hinchcliff et al. [9,41]. It could also be seen that the Hinchcliff SSc cohort had better health according to the FACIT instruments than the reference population [41], despite higher disease burden compared to our study population.

Internal and external reliability
Internal consistency reliabilities of both instruments were satisfactory for the studied Swedish SSc patients and comparable with previous studies in this patient group [9,12]. Test-retest reliability was acceptable for FACIT-Dyspnea according to the ICC and weighted kappa analysis.
The PROMIS-29 also had good test-retest reliability (confirming Hypothesis 1), as previously shown for patients with SSc [12], idiopathic pulmonary fibrosis [42], and systemic lupus erythematosus [43]. However, the PROMIS-29 anxiety domain had a moderate test-retest reliability with an ICC of 0.67 in our study group. The linear weighted kappa analysis showed fair values for the following items: "I found it hard to focus on anything other than my anxiety" and "I felt fearful". Fisher et al. [12] described a slightly higher ICC of 0.7 for the PROMIS-29 anxiety domain in their study in SSc patients with a recall period of 30 days. Tang et al. described a moderate test-retest reliability in the anxiety domain for kidney transplant recipients [44]. Rawang et al. [45] also found a moderate test-retest reliability for the PROMIS-29 anxiety domain in individuals with chronic low back pain. The average time interval between test and retest was 14 days in our study compared to 27 days in the study of Tang et al. [44] and seven days in the study of Rawang et al. [45]. No dramatic changes in disease activity would be expected in SSc patients during a retest period of 14 days. In addition, there appeared to be no relationship between the reliability score and the recall period for PROMIS-29 subscales. The recall period of seven days does not overlap the test-retest period in our study.
Anxiety may comprise a complex multifactorial item in contrast to, e.g., functional limitation. Levels of anxiety may therefore be transient and change during short time intervals. Our patients showed higher anxiety levels as assessed by these two questions during their in-patient work-up compared with the retest at home. Thus, the change in test location could have had some influence on the anxiety sphere that is addressed by these two questions of the PROMIS-29 anxiety domain. Our study does not allow us to draw any conclusions about the stability of the PROMIS-29 anxiety domain. However, taking the four studies together, moderate test-retest results predominate for the PROMIS-29 anxiety domain. Our findings support the value of test-retest studies under various test circumstances to ensure the instruments' credibility as outcome measures.

Ceiling effects
Ceiling effects were present in six of the seven domains of the PROMIS-29 questionnaire, indicating that the range of the instrument did not correspond to the range of outcomes in our SSc patient population. PROMIS-29 was less able to differentiate between patients with higher well-being in the domains of anxiety, depression, fatigue, pain interference, physical functioning, and ability to participate in social roles and activities. These results are in line with the findings in SSc patients with similar disease duration by Morrisroe et al. [11] and by Kwakkenbos et al. [10]. Ceiling effects were seen in similar domains of the RAND-36, with the exception of the vitality domain, which showed neither ceiling nor floor effects, in contrast to PROMIS-29 fatigue. The broader wording of the questions in this RAND-36 domain ("full of pep"/"lot of energy"/"worn out"/"tired") has probably reduced ceiling and floor effects but may on the other hand introduce a more complex question of well-being and general health compared to the more specific questions of the PROMIS-29 fatigue inquiry ("feeling fatigue"/"trouble starting things because of tiredness"/"run-down on average"/"fatigue on average"). In addition, PROMIS-29 captures physical functioning in four questions with five grades on a Likert scale, whereas RAND-36 captures physical functioning in 10 questions with three grades on Likert scales and physical role functioning in three questions with yes/no responses. PROMIS-29 may therefore be more sensitive to subtle changes in these domains than RAND-36, which shows both floor and ceiling effects in the physical role domain.
The PROMIS profile-29 contains a collection of four-item short forms addressing seven separate domains selected to assess the impact of a medical condition on health-related quality-of-life among a clinical or non-clinical population referenced to a general population. There is strong evidence of the efficiency of PROMIS short forms in many domains [46], but it is also noted that longer short forms have improved reliability [47,48]. Item banks are designed to cover the whole spectrum of a domain and therefore floor and ceiling effects should not occur. The relatively large number of ceiling effects may be due to the long disease duration and relatively mild disease phenotype of our study population. Fewer ceiling effects would probably have been present in a study population of patients with more severe SSc. Therefore, the discriminative ability and sensitivity to change still need to be determined.

Construct validity - hypotheses testing
PROMIS-29 domains showed strong correlations with the corresponding domains in RAND-36, confirming Hypothesis 2. The findings are in line with previous studies concerning associations between PROMIS-29 and SF-36 [9,11,12,41]. The strong correlations that we observe between PROMIS-29 and the legacy instrument may imply some redundancy and similarities in items between these PROMs. PROMIS items were in part originally drawn from existing legacy questionnaires, so overlap can be expected. It may also be advantageous that PROMIS-29 items have been selected from item banks covering a wide range of conditions. Thus, PROMIS-29 should be an acceptable alternative to both RAND-36 and SF-36 in patients with SSc.
FACIT-Dyspnea and FACIT Functional limitation had positive correlations with the legacy PROM instruments SHAQ and MRC (Hypothesis 2). This is in agreement with the findings by Hinchcliff et al. [9,41]. The result is not surprising since all instruments capture the patient's experience of shortness of breath in connection with daily life activities, even if they have different items. It is also not surprising that impairment in daily life activities is related to impaired quality-of-life. However, the FACIT-Dyspnea Index correlated unexpectedly strongly with RAND-36 physical functioning and, to a somewhat lesser extent, with role limitations due to physical health. This is in line with the longitudinal study by Hinchcliff et al. [41], which found a moderate correlation between FACIT dyspnea and the SF-36 physical component summary and a strong correlation between FACIT functional limitation and the SF-36 physical component summary for the change scores after one year. In contrast to the Hinchcliff study group, the frequency of patients with lcSSc is higher in our patient cohort. The disease duration was also longer in our study than in the Hinchcliff study, i.e., 11 years compared to 4.5 years. Importantly, pulmonary arterial hypertension, a late complication of SSc, occurs more frequently later than 10 years after disease onset.
Although physician and patient-reported assessments of disease often differ (Hypothesis 3), previous studies of PROMIS-29 [9-11,41] have found stronger associations between patient outcome and disease severity than was found in the present study. In contrast to Hinchcliff et al. [9], we could not show that the PROMIS-29 domains of physical functioning, pain interference, and participation with social roles correlated with the composite MSSs (confirming Hypothesis 3 but not Hypothesis 4).
The two FACIT-Dyspnea instruments showed moderate correlations when tested against the modified MSS that was used by Hinchcliff et al. (confirming Hypothesis 4 but not Hypothesis 3). Several differences exist between the modified MSS used in this study and the one used by Hinchcliff et al. [9]. The utilization of NT-pro-BNP, a marker of heart function, and differences in the weighting of skin score points in Hinchcliff's MSS may have impacted the results and may explain some of the differences. Further, differences in patient characteristics may contribute to the diverging results. Our patients had less severe disease according to the MSS summary score, skin score, and lung function evaluations. Finally, cultural differences may account for variations between the two study populations.
Even if Hypothesis 3 is rejected, it is noteworthy that the PROMIS-29 domain physical functioning, FACIT-Dyspnea and FACIT Functional limitation in our cohort showed moderate correlations to the validated MSS lung subscale [20]. These scales also had moderate correlations to DLCO levels and the VC/DLCO ratio, which is used as an early marker to detect pulmonary arterial hypertension in SSc. Of the previous studies in SSc, Kwakkenbos et al. [10] detected a difference in the PROMIS-29 domain role functioning in patients with SSc and pulmonary arterial hypertension, whereas Fisher et al. [12] could not identify any associations between PROMIS-29 domains and DLCO. Disease duration is longer in our cohort compared to Fisher et al. [12] but shorter compared to Kwakkenbos et al. [10]. Our study included more lcSSc patients compared to both. It is intriguing to speculate whether our patients have developed some cardiac or pulmonary vascular changes that are not as severe as overt cardiac failure or pulmonary arterial hypertension [49] but still significant enough to impact quality-of-life and to be captured by the PROMs. These findings call for further evaluation to test whether these PROM domains could be used for early detection of cardiac and/or pulmonary complications and thereby improve survival.
Finally, Hypothesis 5 was confirmed. Better health, according to the PROMIS-29 subscales of physical function and ability to participate in social roles and activities, was related to a lower degree of sick-leave. Better health according to both FACIT instruments was also related to a lower degree of sick-leave. These findings are in line with our previous data showing that ability to work was associated with less breathlessness and better physical functioning [50]. Presence of pulmonary arterial hypertension is associated with work disability [51]. Pulmonary arterial hypertension was present in only one patient in our study group. Nevertheless, our findings suggest that work disability may exist in SSc patients due to pulmonary vascular involvement, as reflected by an increased VC/DLCO ratio (r = 0.51, correlation with work disability) and detected by worse health according to the PROMs.

Additional considerations
The choice of which instrument to use will ultimately be defined by the question to be answered and the intended study population [52]. The PROMIS-29 and FACIT-Dyspnea instruments are both scored on a T-score metric and are therefore easy to compare between different responder groups. PROMIS-29 is validated in the general US population and FACIT-Dyspnea in a population with chronic obstructive pulmonary disease. Scoring is straightforward. In comparison, the RAND-36 has a more complex scoring system. Further, recall bias might be lower for the PROMIS-29 questionnaire, whose recall period covers the past seven days, than for the RAND-36 questionnaire, whose recall period is four weeks. Several questions in the RAND-36 address the same activity but with a different grade of difficulty. Thus, PROMIS-29 may apply the questions in a more effective way. Physical and mental health summary scores can be correlated with the PROMIS-29 instrument [24]. PROMIS-29 can also be used to predict EQ-5D scores [53]. Besides the use of the fixed-length forms, consideration should also be given to the clinical use of CAT with PROMIS item-banks [54]. A CAT application of PROMIS has previously been tested on a small scale at a scleroderma outpatient clinic [55]. In the future, PROMIS CATs may improve the capture of the full range of health-related quality of life across the multifaceted phenotypes of SSc patients at different stages of their disease.
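To illustrate why a common T-score metric eases comparison between responder groups, the general definition of a T-score can be written as below. This is a sketch of the metric only: the symbols (a latent or raw score \theta, and the reference-population mean \mu_{ref} and standard deviation \sigma_{ref}) are notation introduced here for illustration, and in practice T-scores for these instruments are obtained from the published scoring tables or scoring services rather than computed by hand.

```latex
% T-score metric: the reference population is rescaled to mean 50 and SD 10.
% \theta, \mu_{ref}, and \sigma_{ref} are illustrative symbols, not taken from the scoring manuals.
\[
  T = 50 + 10 \cdot \frac{\theta - \mu_{ref}}{\sigma_{ref}}
\]
% Worked example: a respondent one SD above the reference mean
% has (\theta - \mu_{ref}) / \sigma_{ref} = 1, so
\[
  T = 50 + 10 \cdot 1 = 60 .
\]
```

Under this definition, a score of 50 corresponds to the mean of the reference population for that instrument, and each 10 points corresponds to one standard deviation, so scores from different domains are read on the same scale even though their reference populations differ.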

Limitations
Our study has some limitations. The patient population was recruited at a single center in a cross-sectional study and may thus be subject to the biases inherent in this study design. Also, our study group is relatively small and did not allow us to assess structural validity by confirmatory factor analysis. Nevertheless, despite the small study group, PROMIS-29, FACIT-Dyspnea, and FACIT Functional limitations demonstrated psychometric evidence that supports the validity of the instruments in a Swedish context. The discriminative ability and sensitivity to change still need further evaluation in patients with SSc.

Conclusions
The summed total scores of the Swedish versions of PROMIS-29v2 and the FACIT-Dyspnea Index largely meet the requirements for reliability and have adequate construct validity compared to legacy measures, and are therefore applicable for use in multicenter studies of patients with SSc.