Spanish Cross-Cultural Adaptation and Rasch Analysis of the Convergence Insufficiency Symptom Survey (CISS)

Purpose: To culturally and linguistically adapt the Convergence Insufficiency Symptom Survey (CISS) to Spanish and assess the psychometric performance of the new version through Rasch analysis and classical test theory methods.
Methods: The Spanish version of the CISS (CISSVE) was completed by 449 subjects (9–30 years old) from the general population. The validity and reliability of CISSVE were assessed through Rasch statistics (precision, targeting, item fit, unidimensionality, and differential item functioning). To test construct validity, we calculated the coefficients of correlation between the CISSVE and the Computer-Vision Symptom Scale (CVSS17) or Warwick–Edinburgh Mental Well-Being Scale (WEMWBS). We determined test–retest reliability in a subset of 229 subjects. We used differential item functioning (DIF) to compare the CISSVE and the CISS after administering the CISS to 216 English children.
Results: After applying exclusion criteria, the responses of 420 participants (mean age, 18.62 years; female, 54.95%) revealed good Rasch model fit, good precision (person separation = 2.33), and suboptimal targeting (–1.37). There was some evidence of multidimensionality, but disattenuated correlations between the Rasch dimension and a possible secondary dimension were high, suggesting they were measuring similar constructs. No item bias according to gender or age was detected. Spearman's correlation was 0.34 (P < 0.001) for CISSVE–CVSS17 and non-significant for CISSVE–WEMWBS. The limits of agreement for test–retest reliability were 9.67 and –8.71. Rasch analysis results indicated no difference between CISS and CISSVE.
Conclusions: According to our results, CISSVE is a valid and reliable tool for measuring the symptoms assessed by CISS in Spanish people 9 to 30 years of age.
Translational Relevance: CISSVE can measure convergence insufficiency symptoms in Spanish-speaking subjects.


Introduction
Convergence insufficiency is one of the most common abnormalities of binocular vision. It is usually associated with symptoms such as visual fatigue, headaches, blur, and double vision. 1,2 To measure these symptoms, the Convergence Insufficiency Symptom Survey (CISS) was developed in 1999 by the Convergence Insufficiency and Reading Study Group (CIRS). 3 The first version of this questionnaire consisted of 13 items and assessed the frequency of each symptom using a four-option response scale. 3 In 2003, a revised version was introduced 1 that included two more items and a new response scale with five choices: never, infrequently, sometimes, fairly often, and always. This new version made the tracking of changes during therapeutic interventions more sensitive. 1 The 15-item version of the CISS (hereafter CISS) is a frequently used outcome measure in binocular vision research and has been used to assess convergence insufficiency (CI) symptoms in various clinical groups from the ages of 8 to 30 years, 1,2,4 where subjects with symptomatic CI had a significantly higher CISS score than others with normal binocular vision. However, to our knowledge, no data have been reported regarding its psychometric properties apart from its repeatability 2 and known-group validity. 3,4 This last variable reflects the ability of a questionnaire to discriminate between two groups known to differ a priori.
The CISS is not a condition-specific instrument for convergence insufficiency; rather, it is useful for measuring the symptoms associated with visual discomfort caused by different factors. Accordingly, it considers the most common symptoms regarding near-vision problems 5 and provides similar scores in children with accommodative insufficiency and convergence insufficiency. 4 In addition, as described by Horan et al., 6 some patients with normal sensorimotor exam results were found to score high (i.e., showed a higher level of the assessed symptoms) on the CISS, while others with convergence insufficiency had relatively low scores.
The CISS was developed for English speakers. As there are 442 million native Spanish speakers worldwide, there is currently a need for a Spanish version. We generated a Spanish version of the CISS (CISS VE ) following well-known guidelines 7–11 used for other recent cross-cultural adaptations 12,13 to ensure content and operational equivalence between the original CISS and the CISS VE . Most cross-cultural adaptation studies are based on modern psychometric models such as the Rasch item response theory (IRT) model. This model is recommended for the quality assessment of health questionnaires because (1) it generates a more precise measure, overcoming the limitations of traditional summary scoring through the transformation of ordinal raw scores into interval linear scales 14–17; and (2) it provides insight into the psychometric properties of the scale and is able to match item difficulty to user skills. 15 The Rasch approach also provides data, such as person and item reliability, reflecting the overall performance of the instrument. 16 The objective of this study was to culturally and linguistically adapt the CISS to Spanish and to assess the psychometric performance of the Spanish version using Rasch analysis and classical test theory methods.

Methods
Before the study outset, the authors of the CISS gave us their consent to develop a Spanish version of their instrument. The CISS questionnaire consists of 15 items. In reply to each question, the subject indicates the frequency of each symptom on a Likert scale, with scores ranging from 0 to 4: never (0), infrequently (1), sometimes (2), often (3), or always (4). The scores of every item are added to determine the final score, which ranges from 0 (least symptomatic) to 60 (most symptomatic). The recommended cut-off is ≥21 for adults 2 and ≥16 for children 9 to 18 years of age. 1 The study was conducted in two stages. The questionnaire was first translated and adapted to Spanish (May 2016 to April 2017), and then the validity and repeatability of the Spanish version were assessed (May 2017 to January 2018).
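The scoring rule described above can be sketched in a few lines of Python. This is only an illustrative sketch: the function names are hypothetical, and the choice to apply the children's cut-off up to and including age 18 is our assumption, because the text gives the cut-offs but not the exact boundary handling.

```python
from typing import Sequence

CISS_ITEMS = 15      # number of items in the questionnaire
ADULT_CUTOFF = 21    # recommended cut-off for adults
CHILD_CUTOFF = 16    # recommended cut-off for children 9 to 18 years of age


def ciss_total(responses: Sequence[int]) -> int:
    """Sum the 15 item scores (each 0-4) into a total between 0 and 60."""
    if len(responses) != CISS_ITEMS:
        raise ValueError("the CISS has exactly 15 items")
    if any(r not in (0, 1, 2, 3, 4) for r in responses):
        raise ValueError("each item is scored from 0 (never) to 4 (always)")
    return sum(responses)


def is_symptomatic(total: int, age_years: int) -> bool:
    """Compare a total score with the age-appropriate cut-off.

    Assumption: ages up to and including 18 use the children's cut-off.
    """
    cutoff = CHILD_CUTOFF if age_years <= 18 else ADULT_CUTOFF
    return total >= cutoff
```

For instance, a respondent answering "infrequently" to every item obtains a total of 15, which is below both cut-offs.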
Because the CISS assesses near-vision-related symptoms and not only convergence insufficiency symptoms, we enrolled subjects from the general population from four different institutions in Madrid, Spain: a primary school (CEIP Vargas Llosa, Madrid), a secondary school (IES Juan Rodriguez Villanueva), a university faculty (Optics and Optometry Faculty of the Universidad Complutense de Madrid), and a technology company (DXC). In addition, over the period from January 2019 to March 2019, we performed a psychometric analysis of the English CISS on subjects from the personal network of one of our researchers (CP-G) in Swindon, Wiltshire, UK. Subjects 9 to 30 years of age were recruited by convenience. Participants received no compensation for their cooperation. Exclusion criteria were mother tongue different from the questionnaire language, prior visual surgery (not refractive), active visual or neurologic disease, any medication that could affect vision, or any kind of disability preventing the subject from reading or understanding the instrument's questions. Out of 665 subjects enrolled, 449 (mean age, 18.62 years; range, 9-30 years; female, 54.95%) completed the Spanish version of the CISS (CISS VE ), and 216 (mean age, 15.81 years; range, 12-20 years; female, 37.61%) completed the original CISS.
The study was approved by the Research Ethics Committee of the Hospital Clínico San Carlos (Madrid, Spain), and its protocol adhered to the tenets of the Declaration of Helsinki. All participants gave their written informed consent prior to participation. For participants younger than 18 years, consent was obtained from a parent or guardian. All children older than 12 years also provided their consent before any testing was done. Other than responses to the questionnaires, no other clinical data were collected in any subject.  Table S1.

Translation and Transcultural Adaptation of CISS
As an example, after discussing whether the response option "fairly often" should be translated as "casi siempre" or "bastante a menudo," the latter translation was finally adopted by consensus.
5. Pre-testing of the consensus version: Cognitive interviews were conducted using a verbal probing technique with 48 native speakers between the ages of 9 and 30 years to ensure patient comprehension of the CISS VE; no new issues emerged in this pre-test.

Analysis Strategy
For descriptive data generation and repeatability assessment, we used SPSS Statistics 22.0 (IBM Corp., Armonk, NY, USA).

Rasch Analysis
The package Winsteps 4.0.1 (Winsteps.com, Beaverton, OR, USA) 7 was used for Rasch analysis. The Rasch model is an IRT model that transforms raw scores to express the person ability and the difficulty of items on the same scale, so that the difference between the ability of two people does not depend on the specific items with which their ability is estimated. The main IRT concept is that a mathematical model is used to predict the probability of a person successfully replying to an item according to person ability and item difficulty. 20 For the analysis, we chose the Andrich rating scale model (RSM), which assumes equal category thresholds across items, as all items share the same response option structure. 21 Respondents with a greater level of symptoms and items of greater difficulty were located on the negative side of the continuum scale and vice versa. The results of the Rasch method were then used to determine the following.
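To make the rating scale model concrete, the sketch below computes the probability of each response category for one person and one item using the standard Andrich formulation. It is not the Winsteps implementation: the function name and the threshold values in the usage note are hypothetical, and the sketch uses the conventional orientation in which a higher person measure makes higher categories more likely.

```python
import math
from typing import List, Sequence


def rsm_category_probabilities(
    theta: float, delta: float, thresholds: Sequence[float]
) -> List[float]:
    """Category probabilities under the Andrich rating scale model.

    theta      -- person measure (logits)
    delta      -- item difficulty (logits)
    thresholds -- category thresholds shared by all items (logits);
                  four thresholds give five categories (0 to 4)
    """
    # Cumulative sum of (theta - delta - tau_j); category 0 has an empty sum.
    cumulative = [0.0]
    running = 0.0
    for tau in thresholds:
        running += theta - delta - tau
        cumulative.append(running)

    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(cumulative)
    weights = [math.exp(value - peak) for value in cumulative]
    total = sum(weights)
    return [w / total for w in weights]
```

With illustrative thresholds of -1.5, -0.5, 0.5, and 1.5 logits and a person located exactly at the item difficulty, the middle category is the most probable and the two extreme categories are equally likely.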

Rating Scale Structure
The performance of the rating scale structure was assessed by examining the category threshold order. Disordering of categories occurs when the response options do not follow expected hierarchical ordering. 13,22

Item Fit Statistics
Both infit and outfit mean square fit statistics show the extent to which the items in the domain comply with Rasch model expectations. 14

Dimensionality
The scale is considered unidimensional when there is one latent variable of interest, and the level of this latent variable is the focus of measurement. 20 To assess multidimensionality, we used the results of the Rasch principal component analysis (PCA) of standardized residuals, which looks for patterns in the part of the data that does not agree with the Rasch measures (unexpected data). When groups of items share the same patterns of unexpected data, those items probably also share a substantive attribute in common, which we refer to as a "secondary dimension" or "contrast." 23 When variance explained by the Rasch measures is ≤50% and/or the eigenvalue in the first contrast is ≥2.0, this is an indication of subsets of items that suggest multidimensionality. 14 Next, we looked at the disattenuated coefficient of correlation between the first and second contrasts obtained in the PCA analysis. Disattenuated correlation approximates the correlation between the two contrasts without measurement error. According to Linacre, 24 0.82 may be used as the cut-off to consider that the two contrasts measure the same variable.
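The disattenuation step can be illustrated with the classical correction for attenuation, in which the observed correlation is divided by the square root of the product of the two reliabilities. The sketch below is only illustrative: the function names are hypothetical, and how the software obtains the observed correlation and the reliabilities is not described in the text.

```python
import math


def disattenuated_correlation(
    observed_r: float, reliability_a: float, reliability_b: float
) -> float:
    """Classical correction for attenuation: r / sqrt(rel_a * rel_b)."""
    if not (0.0 < reliability_a <= 1.0 and 0.0 < reliability_b <= 1.0):
        raise ValueError("reliabilities must lie in the interval (0, 1]")
    return observed_r / math.sqrt(reliability_a * reliability_b)


def measures_same_variable(disattenuated_r: float, cutoff: float = 0.82) -> bool:
    """Apply the 0.82 cut-off quoted in the text."""
    return disattenuated_r >= cutoff
```

As a purely numerical example, an observed correlation of 0.60 between two sets of measures that each have a reliability of 0.80 corresponds to a disattenuated correlation of 0.75, which would fall below the 0.82 cut-off.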

Person Separation Index and Levels of Performance
The Rasch-based Person Separation Index (PSI) is a reliability indicator, analogous to Cronbach's α of traditional test theory in both its values and construction. 25 This index was obtained using Winsteps. The number of different levels of performance was computed according to the method described by Wright. 21,26

Targeting
The extent to which item difficulty, defined as the point on the latent variable at which the highest and lowest category of an item have equal probability of being observed, matches the level of a participant's visual abilities was defined as the difference between the average difficulty of the items and the subject's mean level of symptoms. 14

Differential Item Functioning by Gender, Age Group, and CISS Version
We examined each item to determine whether subgroups (male vs. female; children under 18 years of age vs. young adults older than 18) answered it differently, that is, whether the item showed differential item functioning (DIF). In addition, because testing for DIF is a useful way to validate questionnaire translations, 27 we performed this analysis to test whether the CISS VE items were equivalent to those of the original survey. Accordingly, the DIF for an item was considered a cross-cultural or translational issue for that particular translation. 28 The DIF analysis implemented in Winsteps is based on two methods:
1. Mantel-Haenszel method to estimate the log odds of DIF size and significance from cross-tabs of observations in the two groups
2. Logit-difference (logistic regression) method to estimate the difference between Rasch item difficulties for the two groups, maintaining everything else constant 24
DIF contrast (i.e., difference in difficulty of the item between the two groups) was defined as no-DIF for <0.50 logits, minimal for 0.50 to 1.0 logits, and notable for >1.0 logits. 13 The overall quality of the psychometric data obtained in this stage (except levels of performance) was assessed according to the criteria of the guidelines proposed by Khadka et al. 14 for quality assessment of ophthalmologic questionnaires.
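The three DIF categories defined above amount to a simple threshold rule, sketched below. The function name is hypothetical, and taking the absolute value of the contrast (so that the direction of the difference between groups is ignored) is our assumption.

```python
def classify_dif(contrast_logits: float) -> str:
    """Classify a DIF contrast using the thresholds given in the text.

    Assumption: the size of the contrast matters, not its sign.
    """
    size = abs(contrast_logits)
    if size < 0.50:
        return "none"
    if size <= 1.0:
        return "minimal"
    return "notable"
```

Under this rule, a contrast of 0.78 logits is classified as minimal DIF.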

Validity and Repeatability
Paper versions of the CISS VE , the Computer-Vision Symptom Scale (CVSS17), 26 and the Warwick-Edinburgh Mental Well-Being Scale (WEMWBS) 29 were administered to all participants except the primary-school children. Seven days later or longer, subjects again completed the CISS VE in a second session. Convergent validity was assessed by estimating the coefficient of correlation between the subjects' CISS VE and CVSS scores and divergent validity through the coefficient of correlation between CISS VE and WEMWBS scores. According to Khadka et al., 14 a coefficient of correlation between CISS VE and CVSS greater than 0.3 would be considered as proof of convergent validity. The Kolmogorov-Smirnov test used on the CISS VE , CVSS, and WEMWBS scores indicated a non-normal distribution of all measures, so we calculated the Spearman's rho coefficient of correlation. The repeatability of the CISS VE was examined via the intraclass correlation coefficient (ICC) with the confidence interval set at 95%. In addition, Bland-Altman limits of agreement were determined to calculate the coefficient of repeatability (CoR) by subtracting the mean difference in scores between the two CISS VE sessions from the upper 95% limit. 30

Results
Table 1 shows the CISS VE items and response descriptors emerging from the pre-test administered to 48 subjects (33.33% female) and the corresponding items and descriptors taken from the original CISS.

Rasch Analysis
Of the questionnaires completed by 449 participants, responses to 429 questionnaires (mean age, 15.92 ± 5.59 years; 55.2% female) were used in the Andrich's rating scale model (RSM) analysis implemented in Winsteps. Twenty completed questionnaires were excluded from analysis: one because the participant left more than 33% of the items unanswered, and 19 because the responses showed outfit > 2.5 26,31 (outfit is sensitive to unexpected observations by persons on items 24 ). The mean CISS VE score obtained was 15.10 ± 10.13, and the range was 1 to 50.

Rating Scale Structure
There was no disordering of response categories (Fig. 1).

Item Fit Statistics
Item fit statistics and item measure (difficulty, in logits) for the CISS VE are provided in Table 2. All items showed values inside the interval considered productive for measurement. 13,17 Only the infit and outfit of item 1 were outside the more stringent criterion (0.7-1.3) proposed by Pesudovs et al. 16 and Khadka et al. 14

Dimensionality
Our PCA analysis of the CISS VE revealed that 46.3% of the raw variance was explained by the CISS VE measures, and the eigenvalue of the first contrast was 2.19. All other contrasts had eigenvalues below 2.00. Thus, in our analysis, the secondary dimension was noticeable because its eigenvalue exceeded 2.0, which suggests that the CISS VE may measure two different latent traits. Table 3 shows the items covering the secondary dimension. The disattenuated coefficient of correlation between the first and second contrasts was 0.84, indicating that both dimensions share about 67% of their variance. Table 4 summarizes the results of the tests used to assess unidimensionality and how we used these data to decide whether the CISS VE could be considered unidimensional. According to these results, the CISS VE can be considered a unidimensional instrument.

Figure 1. The curve at the extreme left represents "never," and the curve at the extreme right represents "always."

Person Separation Index and Performance Levels
The PSI for CISS VE was 2.33, indicating a reliability of 0.85 and meaning that the CISS VE was able to distinguish 3.44 strata of scores. Using the Wright method (a sample-independent method suitable for clinical samples) to determine the number of performance levels across the CISS VE score range, we found that the CISS VE could distinguish 6.3 levels of symptoms. Figure 2 shows the estimated measure for any CISS VE raw score and the correspondence between the raw score and level of performance. Cronbach's α was 0.90. Table 5 shows the distribution of the CISS VE and CISS scores obtained by the subjects included in this study according to performance level.
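The relationship between the separation index, reliability, and the number of strata follows from standard Rasch formulas: reliability is G²/(1 + G²) and the number of strata is (4G + 1)/3, where G is the separation index. The sketch below applies them to the reported value; the function names are hypothetical, and the Wright method for performance levels is not reproduced here.

```python
def separation_to_reliability(separation: float) -> float:
    """Convert a separation index G into reliability: G^2 / (1 + G^2)."""
    return separation ** 2 / (1.0 + separation ** 2)


def separation_to_strata(separation: float) -> float:
    """Number of statistically distinct strata: (4G + 1) / 3."""
    return (4.0 * separation + 1.0) / 3.0


reported_separation = 2.33  # value reported in the text
reliability = separation_to_reliability(reported_separation)
strata = separation_to_strata(reported_separation)
```

With G = 2.33, the strata formula gives 3.44, matching the reported value, and the reliability formula gives approximately 0.84, close to the reported 0.85 (the small gap presumably reflects rounding of the separation index).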

Targeting
The targeting value was -1.37 logits. The item-person map (Figure 3) shows that the items were too difficult for the ability level in this sample, because we assessed a population-based sample in which most were not expected to have near-vision symptoms.

Differential Item Functioning by Gender and Age
The results of DIF by gender revealed neither notable DIF nor minimal DIF for any of the CISS VE items. Just one item (item 14) showed minimal DIF (0.78) according to age group, as this item was more difficult for young adults than for children. To assess the psychometric properties of CISS VE , we compared our Rasch analysis results against the Rasch model expectation 22 using the quality criteria proposed by Khadka et al. 14 (Table 6).

English Version Versus Spanish Version
For the English version analysis, after applying the exclusion criteria, we used the responses for 216 questionnaires but excluded those of four with outfit > 2.5. Because, in Rasch theory, an extreme score (0 or 60 for the CISS) on a questionnaire corresponds to an infinite ability measure (ability measure = symptoms level when using the CISS), which is impractical and also misleading in most situations, 24 Winsteps excluded from the analysis four more completed questionnaires with scores of 0. Finally, the results of 208 questionnaires (38.0% female; mean age, 15.86 ± 1.61 years) were used in the Andrich's RSM analysis; this sample size is larger than the minimum recommended for DIF assessment. The mean CISS score was 16.10 ± 9.50, and the range was 1 to 49. Table 7 compares the main psychometric properties of the two CISS versions.

The DIF contrast was below 0.50 for every item when comparing CISS and CISS VE.
Because the Kolmogorov-Smirnov test indicated a non-normal distribution of the CISS VE scores, we ran a Kruskal-Wallis test followed by Dunn's multiple comparisons on the 429 completed questionnaires. This was designed to examine differences in CISS VE performance according to gender and age: males from 9 to 17 years (boys), females from 9 to 17 years (girls), males from 18 to 30 years (young men), and females from 18 to 30 years (young women). The Kruskal-Wallis H test detected a significant difference among the CISS VE groups examined (H, 46.14; P < 0.001). Descriptive statistics are provided in Supplementary Table S2, and the significant differences detected by the Dunn's test are shown in Supplementary Table S3.

Convergent Validity, Divergent Validity, and Repeatability
The Spearman rho correlation coefficient between the CISS VE and the CVSS was 0.34 (P < 0.001). No significant association was detected between the CISS VE and the seven items covered in the shortened version of WEMWBS (Fig. 4), which indicates evidence of divergent validity. Correlation between the CISS VE and the CVSS was weak (0.34) 32 yet may be considered proof of convergent validity. For the subjects who completed the CISS VE twice (test-retest time interval: 10.23 ± 3.40 days), the two-way single-measure ICC for test-retest repeatability was 0.878 (95% confidence interval, 0.845-0.905). Figure 5 provides the Bland-Altman plot for the CISS VE . The mean difference between sessions was 0.48, and the limits of agreement including 95% of the differences were 9.67 and -8.71, so the CoR was 9.22. For four subjects, the difference in CISS score between sessions was over 10, which was considered by Rouse et al. 2 as "significant and outside the range of normal variability." By considering these four subjects as outliers and excluding them from the analysis, the CoR would improve to 8.51.
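The Bland-Altman quantities reported above can be obtained from paired test-retest scores as sketched below. This is a minimal sketch rather than the SPSS procedure used in the study: the function name is hypothetical, the limits are taken as the mean difference ± 1.96 standard deviations of the differences, and the CoR is computed as described in the Methods (upper limit minus mean difference).

```python
import statistics
from typing import Dict, Sequence


def bland_altman(first: Sequence[float], second: Sequence[float]) -> Dict[str, float]:
    """Mean difference, 95% limits of agreement, and CoR for paired scores."""
    if len(first) != len(second) or len(first) < 2:
        raise ValueError("need two sessions with the same number (>= 2) of scores")

    # Difference between sessions for each subject.
    differences = [a - b for a, b in zip(first, second)]
    mean_difference = statistics.mean(differences)

    # 1.96 sample standard deviations cover about 95% of the differences.
    half_width = 1.96 * statistics.stdev(differences)
    upper_limit = mean_difference + half_width
    lower_limit = mean_difference - half_width

    return {
        "mean_difference": mean_difference,
        "upper_limit": upper_limit,
        "lower_limit": lower_limit,
        # CoR as defined in the Methods: upper limit minus mean difference.
        "cor": upper_limit - mean_difference,
    }
```

Using the rounded values in the text, 9.67 - 0.48 = 9.19, which is close to the reported CoR of 9.22; the small gap presumably reflects rounding of the published figures.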

Discussion
In this study, we present the Spanish version of the CISS, which shows psychometric properties similar to those for the English version. Our Rasch analysis also confirmed that the overall performance of the instrument is acceptable. A good-quality translation is an essential part of cross-cultural adaptation, but this does not mean that the translated version retains the psychometric properties of the original tool. Other authors propose a three-step process in which translation is followed by formal assessment of psychometric properties and validity and reliability testing. 8,11 As recommended by Bradley and Massof, 33 we directly compared item psychometric properties between the CISS VE and the CISS to determine whether both tests worked in a similar way. According to their almost identical reliability and residual PCA results (Table 7), the psychometric performance of the CISS VE proved similar to that of the CISS. A small difference was noted in targeting (-1.37 CISS VE vs. -1.16 CISS), as the mean symptoms score in the English sample was one point higher than in the Spanish subjects. When comparing both versions, just one item (item 12: Do you feel a "pulling" feeling around your eyes when reading or doing close work?) showed an outfit value (1.86) far from Rasch model expectation (1.50). This was attributed to six respondents who scored 0 on every item except item 12, which they awarded a score of 1. Exclusion of these six subjects yielded an outfit value of 0.90. This indicates that the poor performance of item 12 may be attributed to the composition of our English sample. Our DIF analysis supported the equivalence of the two versions, indicating that the CISS items had been adequately translated into European Spanish.
We also compared the psychometric properties of the CISS VE arising from Rasch analysis through Rasch model expectation 34 (Table 6). Measurement precision was high, and more than six levels of convergence insufficiency symptoms could be distinguished in the study population. Further, although the CIRS group did not use Rasch analysis to develop the original questionnaire, just one item of our Spanish version (item 1: Do your eyes feel tired when reading or doing close work?) showed Rasch infit and outfit values under the minimum suggested by Khadka et al. 14 (0.7). This indicates that this item's responses are too predictable despite being within the interval of 0.5 to 1.5, considered productive for measurement. 35 Because the presence of one or two items with infit or outfit between 0.5 and 0.7 is deemed acceptable, 14,16 item 1 was retained in the questionnaire without any modification.
Our results revealed that the CISS VE is an instrument without DIF for gender and that minimal DIF 13 according to age exists for only one item (item 14: Do you lose your place while reading or doing close work?). According to Khadka et al., 14 a notable DIF is >1.0 logits, so we could directly compare CISS VE scores across these subgroups. Our analysis revealed higher CISS VE scores in the young adults than in children (Supplementary Table S3). These differences may be due to a greater cognitive load associated with near vision in the young adults group exacerbating the symptoms normally induced by visual stressors. 36 We also examined convergent validity by comparing the CISS VE and the CVSS17. As predicted, a significant association emerged between them with a coefficient of correlation higher than 0.3 (Fig. 4). This value is the minimum recommended by Khadka et al. 14 when assessing convergent validity. Further, it is the minimum suggested by the COSMIN guideline for systematic reviews of patient-reported outcome measures 37 when evaluating construct validity by studying correlations with instruments measuring related but dissimilar constructs (as we did here). In addition, there was no correlation between the CISS VE and the Spanish version of the WEMWBS, 29 so our study provides some evidence of CISS VE divergent validity, as it works in the expected manner.
Our CISS VE showed a mean score (15.10 ± 10.13) that was comparable to those reported in studies conducted in similar populations, such as the adolescents assessed by Horan et al. 6 (16.3 ± 11.4) and the university students used to develop the Portuguese version (CISS VP ; 15.56 ± 8.86). 5 These two studies provided mean scores for the entire group of participants instead of separating scores according to subjects' visual problems. Furthermore, the mean score obtained in our sample (from a general population) is higher than those reported by Rouse et al. 2,38 in adults and children with normal binocular vision recruited from a clinical population (11.3 ± 8.1 for adults and 10.4 ± 8.1 for children). As expected, our sample's mean score was lower than the values obtained in children 2 and adults 38 with symptomatic CI (37.3 ± 9.3 for adults and 29.8 ± 8.1 for children).
A main strength of our study was that we used Rasch analysis to analyze the psychometric properties of both the CISS VE and the CISS and used data from the DIF analysis to compare the items of the two versions. However, our study also has the limitation of suboptimal targeting due to the use of the general population instead of a purposive sample.
Good targeting determines higher person reliability, so tests with poor targeting are worse at distinguishing between high and low performers. Thus, this could be a limitation of the CISS VE because the targeting value (-1.37 logits) was more negative than the -1.0 logit limit recommended by Khadka et al. 14 and Pesudovs et al. 16 This suboptimal targeting is a typical issue of scales designed to measure symptoms 26 when administered to the general population, as we did, and could indicate less measurement precision in subjects scoring far from the items' distribution mean (i.e., subjects with fewer symptoms). However, the number of levels of performance (5.8) determined by the Wright method, which is a sample-independent technique derived from Rasch analysis, 39 suggests the high reliability of the CISS VE . Given this sample-independent reliability along with the fact that clinicians and/or researchers usually focus on persons with scores closer to the items' mean, we consider this CISS VE mistargeting acceptable for its purpose.
Rasch analysis of the CISS VE and the CISS suggested multidimensionality, as the variance explained by its measures was under 50% and the eigenvalue of the first contrast was above 2.0. When we examined the six items included in the putative dimension arising from the PCA analysis (Table 3), we noted that they were those items exploring complaints other than visual and ocular symptoms. Accordingly, subjects provided answers about reading consciousness or reading performance differently than they did about symptoms. However, because this secondary dimension is closely related to the Rasch dimension, as shown by their disattenuated correlation coefficient (0.84), we can consider that the two dimensions are two different categories of the same trait (e.g., calculus and trigonometry items on a math test), so we may consider the CISS VE a unidimensional tool for statistical purposes. 24
In addition, to compare 95% limits of agreement, we selected the study by Rouse et al., 2 as it has a mean test-retest interval similar to ours (10.50 ± 7.50 days vs. 10.23 ± 3.40 days). As expected, limits of agreement were similar in our study (9.67 and -8.71) and in the study by Rouse et al. 2 (9.0 and -7.6). According to Khadka et al. 14 and Pesudovs et al., 16 the ICC reported in our sample (0.878) indicates high test-retest reliability. The CoR was 9.22 (8.51 without the four outliers), showing that, in the test-retest data, the probability of detecting a test-retest change in CISS score greater than 9.22 in the test population is 2.5%. These results imply that a clinician can be sure that a treatment has a significant impact on a patient's symptoms when finding a change of 10 points, the same value provided by Rouse et al. 2 for the original CISS. Test-retest reliability is optimal when the limits of agreement are lower than the minimal clinically important difference (MCID) values, although lower values of a similar magnitude are considered positive. 16 To sum up, the CoR of the CISS VE would be good if we consider valid the value given for the CISS (10 points), but further studies are needed to define precise MCID values for this questionnaire.
Rasch analysis could be used to reengineer the CISS to enhance areas in which it here showed lower performance such as dimensionality or targeting. There are several options for this, such as collapsing some categories and/or deleting the second dimension found in the residual PCA. As several options are available, the clinical relevance of any change should be considered to assess the effectiveness of these improvements. For example, we could consider deleting the second dimension if the instrument becomes more sensitive to clinically meaningful changes.

Conclusions
We developed a Spanish version of the CISS showing performance similar to that of the original version in English. We also identified some psychometric properties of the CISS that should be addressed in future studies to improve this instrument as a measure of near-vision-related symptoms.