Validation and psychometric properties of the Somatic and Psychological HEalth REport (SPHERE) in a young Australian-based population sample using non-parametric item response theory

Background The Somatic and Psychological HEalth REport (SPHERE) is a 34-item self-report questionnaire that assesses symptoms of mental distress and persistent fatigue. As it was developed as a screening instrument for use mainly in primary care-based clinical settings, its validity and psychometric properties have not been studied extensively in population-based samples. Methods We used non-parametric Item Response Theory to assess scale validity and item properties of the SPHERE-34 scales, collected through four waves of the Brisbane Longitudinal Twin Study (N = 1707, mean age = 12, 51% females; N = 1273, mean age = 14, 50% females; N = 1513, mean age = 16, 54% females, N = 1263, mean age = 18, 56% females). We estimated the heritability of the new scores, their genetic correlation, and their predictive ability in a sub-sample (N = 1993) who completed the Composite International Diagnostic Interview. Results After excluding items most responsible for noise, sex or wave bias, the SPHERE-34 questionnaire was reduced to 21 items (SPHERE-21), comprising a 14-item scale for anxiety-depression and a 10-item scale for chronic fatigue (3 items overlapping). These new scores showed high internal consistency (alpha > 0.78), moderate three months reliability (ICC = 0.47–0.58) and item scalability (Hi > 0.23), and were positively correlated (phenotypic correlations r = 0.57–0.70; rG = 0.77–1.00). Heritability estimates ranged from 0.27 to 0.51. In addition, both scores were associated with later DSM-IV diagnoses of MDD, social anxiety and alcohol dependence (OR in 1.23–1.47). Finally, a post-hoc comparison showed that several psychometric properties of the SPHERE-21 were similar to those of the Beck Depression Inventory. Conclusions The scales of SPHERE-21 measure valid and comparable constructs across sex and age groups (from 9 to 28 years). SPHERE-21 scores are heritable, genetically correlated and show good predictive ability of mental health in an Australian-based population sample of young people. Electronic supplementary material The online version of this article (doi:10.1186/s12888-017-1420-1) contains supplementary material, which is available to authorized users.


Background
The Somatic and Psychological HEalth REport (SPHERE) provides an assessment of common symptoms of mental distress and persistent fatigue by self-report [1]. The 34 items of the SPHERE (SPHERE-34) were selected from four widely used clinical assessments of mental health, based on their predictive ability [1]. Anxiety and depression items were selected from the General Health Questionnaire [2], chronic fatigue from the Schedule of Fatigue and Anergia [3], neurasthenia from the Illness, Fatigue and Irritability Questionnaire [4], and somatisation items from the Diagnostic Interview Schedule (DSM)-III-R. Participants respond to each of the 34 items, choosing from one of three fixed options ("sometimes/ never", "often", "most of the time" coded 0, 1 and 2 when calculating sum score) to describe the frequency of their symptoms over the "past few weeks". While three subscales can be extracted: anxiety-depression, somatic distress and persistent fatigue ( Fig. 1), these are assumed to represent overlapping constructs that underpin common mental disorders. Neurasthenia also used to be measured from the SPHERE-34 questionnaire but we do not consider it here, due to the progressive abandonment of the concept in psychiatry [5].
Except for an earlier paper from our group, where we showed that the anxiety-depression and somatic-distress subscales of the SPHERE-34 are moderately heritable (~40%) and correlated (phenotypic correlation of 0.42, genetic correlation of 0.87) [6], there has been no detailed assessment of the psychometric properties of this questionnaire outside clinical settings. This is important, as these properties may not generalise to population samples [7]. Here, we used Item Response Theory (IRT) to assess the validity and the psychometric properties of SPHERE-34 data collected in a large Australian-based population sample of young people [8,9].
IRT origins trace back to the 1940s [10][11][12] and is a very popular framework for the validation of questionnaires, Fig. 1 Items and scales of the SPHERE-34. Items' short names are used through this manuscript. Some items may be included in several scales as indicated by multiple "x" in some rows. Items from the shorter SPHERE-12 appear in blue. Each scale of the SPHERE-12 comprises six items, which were created to provide a screening tool for common psychological and somatic distress in general practice [1,[85][86][87]. The two dimensional picture of the Australian population for the SPHERE-12 showed good psychometric properties and very high sensitivity for current and life-time major depression, anxiety and neurasthenia as assessed by DSM-III and DSM-IV [1,86]. In addition, it was a good predictor of disability (as measured using the Brief Disability Questionnaire [88]), psychiatric morbidity [89] and doctor's rating of psychological risk [1], which has led to its use in research and medical practice in Australia [97] given the simplicity of model formulation and the numerous theoretical developments (chronologically: normal ogive, Rasch model, two and three-parameter logistic models, extensions for polytomous items, non-parametric IRT) [13][14][15]. There are two main advantages of IRT over classical test theoryit explicitly models the items' properties and uses them to perform maximum likelihood (ML) estimation of the latent trait based on the individuals' responses to the questionnaire. This provides an IRT score that takes into account the difficulty and discrimination of each item, often resulting in a more accurate estimation of ability, compared to using a sum of the items (sum score) (see [16,17] or [18] for examples and simulations). That said, the sum score is also a (consistent asymptotically normal) estimate of the latent trait [19], and its use may be preferred to communicate test performances, and for use outside of a research context for obvious reasons of simplicity in calculation and interpretation [20].
In IRT, a scale of items (or questions) requires three hypotheses to be met: Unidimensionality: there is a single latent dimension (or trait) θ underlying a set of items; Conditional Independency: items are conditionally independent given the latent trait θ; Monotonicity of the Item Response Step Function: the probability of having a symptom, knowing the latent trait θ, is a growing function in θ. Conceptually, unidimensionality of a set of items is never verified, as several abilities are required to answer even the most simple question (e.g. reading ability, memory). Several tests have been proposed to assess unidimensionality [21], all testing H0 "Unidimensionality" vs. H1 "multidimensionality". Thus, none of them can conclude that unidimensionality is verified; at best they conclude that it cannot not be invalidated. Furthermore, when large samples are considered, one would expect such tests to always reject the null hypothesis of unidimensionality. Similar criticisms can be formulated about testing for conditional independence (see [22] about necessary but not sufficient conditions for conditional independence). Consequently, we excluded items that did not satisfy the hypothesis of monotonicity or might be the most influenced by secondary abilities (see Methods below). However, we assumed that unique psychological dimensions could explain most of the responses to each subscale. Finally, we also assumed conditional independence: the answer to one item is not dependent on any other answer.
Nowadays, more than a dozen different IRT models for polytomous items have been proposed [23][24][25] that differ in hypotheses (definition of the IRSF) and properties of the final score [23][24][25][26]. Models can be classified into parametric (PIRT: IRSF are assumed to be logistic) and non-parametric IRT models (NIRT: no constraint on the shape of the IRSF). Here we chose to use NIRT models for several reasons. Firstly, in the absence of prior information about item properties, NIRT models do not assume the IRSF to be logistic. Items with nonlogistic IRSF have been reported for depression scales, with NIRT leading to improved model fit and fewer items excluded [27]. Secondly, they allow a better diagnosis of the item properties by detecting local violations of monotonicity or local variations in item discrimination and bias [28,29]. Thirdly, they offer user-friendly and straightforward diagnostic tools by means of visual inspection [28], and are less computationally demanding than PIRT by combining kernel regression and fast Fourier transform [16,29]. Lastly, despite being more general than PIRT models, NIRT models have similar properties of stochastic ordering by the sum score [25,26].
For the present study we merged the somatic-distress and fatigue scales into a "chronic fatigue" subscale, as they appeared to be driven by the same genetic factors (rG > 0.97, phenotypic correlation above 0.9 using the SPHERE-34 definition, see Additional file 1). This choice is consistent with the definition of the short version of the questionnaire (12 items, SPHERE-12, created for screening in general practice [1]), composed of two scales: psychological distress and somatic distress. Measuring both fatigue and depression could prove of great interest in psychiatric research, where it is known that they are highly comorbid [30] and often indicate a greater functional impairment when they co-occur [31]. Twin research further showed that depression and fatigue were strongly genetically correlated [32,33], with however genetic and environmental factors specific to chronic fatigue [32,33]. Causal relationships between depression and fatigue remain equivocal [31] with two studies reporting non-causal genetic relationships [33,34]. Further research requires validated questionnaires that measure both dimensions and are suitable for longitudinal studies in the general population. Here, we use IRT to develop such scales from the SPHERE-34 questionnaire and we further report the psychometric properties, heritability, 3 months test-retest and association of the scores with DSM-IV diagnoses.
More precisely, we started from the depression-anxiety and fatigue scales previously defined [1] and excluded items responsible for bias in the score distribution and participant ordering (i.e. non-monotonic), as well as poorly contributing items with low discrimination. We also tried to improve the scale(s) stability and precision by including unused items from the former neurasthenia scale that showed good discrimination. Next, we investigated whether the new SPHERE scores (SPHERE-21) measured similar constructs across both age and sex, to ensure that any later differences observed across groups represent true differences in liability. Then we investigated the impact of the new scales definition on the scores reliability (3-months test-retest), internal consistency [35] and scalability (Loevinger's Coefficients [36]). In addition, we estimated the heritability of the new depression and fatigue scales for each age group together with their genetic, environmental and phenotypic correlations. Finally, in a reasonably large subsample we assessed the predictive ability of the new SPHERE scores by examining the association of age specific SPHERE-21 scores with mental health lifetime diagnoses collected in early adulthood. We concluded with a post hoc comparison of the SPHERE-21 and Beck Depression Inventory (BDI) properties.

SPHERE questionnaires in the Brisbane longitudinal twin study
SPHERE-34 was administered as part of three main projects that make up the Brisbane Longitudinal Twin Study (BLTS; also known as the Brisbane Adolescent Twin Study (BATS)) [8]. The first two waves of data were collected in the clinic, following an assessment of Melanocyte Naevi (moles) around the twelfth (TWin Mole study visit 1: TW1) and fourteenth birthday of the twins (TWin Mole study visit 2: TW2) [6,[37][38][39] with a third wave of data, also collected in the clinic, as part of the twin cognition project (Twin Memory, attention and problem solving, TM), mostly at age 16 years [6,[40][41][42][43]. The final wave of SPHERE-34 data was collected as part of a mailout, which included assessments for laterality, personality and reading, as well as smell and taste tests; the study is known as the Twin Adolescent (TA) study. Participants who were administered the SPHERE-34, as part of the TA study, were on average 18 years old [6,44,45]. In total, 3312 twins or siblings (individuals) were included in at least one of the four waves in which the SPHERE-34 was administered, and at each wave responses were available for >1200 individuals (TW1: 1707; TW2: 1273; TM: 1513 and TA: 1263). Almost half of the participants (44%) answered the questionnaire more than once (19% three times), with 134 individuals (4%) being assessed at all four waves (Fig. 2). Missingness was overall limited (maximal percentage missingness per item ranged from 0% to 0.6%, number of participants with missing items ranged from 0% to 4.0%; Additional file 2) and at each wave can be assumed to be at random, with exclusions having little impact on results and power (Additional file 2).

Selection of unrelated individuals for IRT analysis
As the sample included twin pairs and siblings, and IRT still lacks methods able to model relatedness in samples, we selected unrelated individuals for the IRT analysis. Despite a significant reduction of the sample size, the familial pruning ensures unbiased confidence intervals of IRSF and facilitates the comparison across sexes and studies, in which the relatedness might confound the results. In order to maximise the number of observations included, we randomly selected one individual per family in each of the waves. To ensure sampling homogeneity of our pruned sample with the full sample, we iterated the random selection process 1000 times, keeping the sample with the most similar age mean, variance and sex frequency as the full sample.
For the across wave comparison (study Differential Item Functioning (DIF), see below), we chose to successively compare TM, TW1, and TW2 to TA, which we used as a benchmark. This approach maximised the number of observations used in NIRT model estimation, hence reducing confidence intervals. For each dataset, we only allowed unrelated individuals within and across waves. When multiple observations were available for a participant we preferentially selected the observation from the wave that had a smaller number of participants in order to obtain a comparable sample size across waves. We iterated the familial pruning and observation selection 100 times each, keeping the sample that included the most similar number of participants across the four waves.
In most of the resultant (pruned) samples, there were slightly more females (2-9%; Table 1). Mean age in TW1, TW2, TW and TA was 12, 14, 16 and 18 years respectively (Table 1). Age had a pseudo-normal distribution in TA but exhibited large peaks in the other three waves due to the smaller age dispersion. Pruned samples (to investigate study DIF) showed comparable age and sex distributions as the full samples (Table 1).

Protocol of non-parametric IRT analysis
Redefining the SPHERE-34 subscales is an attempt to improve their properties by ensuring that the IRT hypothesis of monotonicity is met in practice but also by including, when possible, items frequently endorsed that inform on the individuals with low proficiency. We first examined its subscales in TA, the oldest cohort (mean age = 18 years, SD = 3.10), where we can assume questions were fully understood by most participants. Starting from subscales defined from clinical samples, we estimated the IRSF, excluding items not showing monotonic IRSF or specific to a subset of individuals, and included additional items (e.g. from the neurasthenia scale) that add information to the subscale (Fig. 3). An item's relative difficulty and discrimination can be calculated using principal component analysis using the evaluation points of the expected item scores [16,17]. The items are projected on the first two principal components. The first principal component often corresponds to the difficulty of the items, while the second principal component measures the items' discrimination. Axes are detailed in each figure legend. Plots were created using the FactoMineR package [46,47].
Then, we studied sex DIF in all waves and excluded items responsible for large item bias (Fig. 4). Finally we evaluated the wave DIF to identify items behaving differently across studies or age groups (Fig. 4).

Non-parametric IRT models and concepts
We used a non-parametric Graded Response Model [48,49] that is the most general NIRT model for polytomic items [24,26], while having the simplest and arguably the most plausible IRSF definition [50,51]. Thus, hypothesis three of monotonicity becomes for each item j and each response x ∈ (0, 1, 2): is a monotonous nondecreasing function in θ With X j the random variable of the score on item j. P(X j ≥ 1|θ) and P(X j ≥ 2|θ) are respectively the probabilities of reporting symptoms more than often ("often" or "most of the time") or most of the time.
In NIRT, the absence of interpretable parameters (that define the logistic IRSF) forces one to rely on visual inspection of the IRSF or to rely on additional metrics [17,52,53] in order to describe or compare the functions. We used the "kernSmoothIRT" package [16] for NIRT modelling, which is an R equivalent of TestGraf [17]. It allows plotting the IRSF from which the hypothesis of monotonicity and the item properties could be visually appreciated. In addition, we calculated relative difficulty and discrimination of the items [17]. Using visual inspection and item bias summary statistics [17], we studied differential item functioning (DIF or item bias), present when individuals from different groups (e.g. sex, ethnicity, wave) with the same proficiency have different probabilities to endorse one item or one item category. DIF can cause an artificial score difference between groups and threatens the internal validity of the scale by causing incorrect ordering of the participants on the latent trait [53][54][55]. If DIF is strongly undesirable in the final score, it can also inform on the dimensionality of the scale. Indeed, items presenting DIF can be seen as measuring additional dimension(s) for which the groups have different abilities [55]. As a conclusion, study of DIF offers a partial check (limited to the groups considered) of the unidimensionality hypothesis in IRT. As DIF statistic we used the one implemented in Testgraf [17], which corresponds to the root mean square of the differences of IRSF over the latent trait, or more simply the mean absolute difference in probabilities of answering each item. We considered that DIF > 0.25 suggests a substantive difference of abilities, with one group on average 25% more likely to report one symptom. Thus, we chose to exclude such items, provided the difference between IRSF was significant as indicated by the 95% confidence intervals (see Fig. 4).
In addition, some NIRT models have fewer measurement properties than some of their parametric counterparts [24][25][26]. A central property that allows inferences to be made on the latent trait is stochastic ordering of the latent trait (SOL) by the sum score. It states that the order of participants, as given by the item sum score, gives a stochastically correct ordering on the latent variable [25]. In theory, this property is only verified [24,25] for very simple polytomous parametric models [56,57] that force the slope of the IRSF to be equal across items and categories. However, there is practical evidence that SOL by the sum score is often verified [26,51] when IRT hypotheses are met, enough items are present (>5) with a limited number of categories (<5) and similarly shaped IRSFs [26]. Thus, our use of NIRT models maximises goodness of fit to the data while allowing us to make inferences on the individual's proficiency based on their questionnaire score. Finally, we preferred the kernel estimation of IRSF, or "TestGraf approach" [16,29], over Mokken Scale Analysis (MSA) [15,36,58], another NIRT approach that relies on Loevinger's Coefficient [11,36] to assess properties of items. Despite a simpler framework, through the use of predefined criteria and rules (see Methods below), MSA often results in more item exclusions and suffers from the lack of interpretability of Loevinger's Coefficient that are reduced by low correlation to latent trait, redundant items, intersecting IRSF, low discrimination or nonmonotonicity [15]. We report Loevinger's Coefficients of the final scales in the psychometric section as a measure of scalability.

New scores description, three months test-retest, internal consistency and Mokken scale analysis
Using the final scales definitions, we estimated the ML estimate of the latent trait, which is an efficient estimate of the individuals' proficiency [17] and calculated the sum score (items scored 0, 1 or 2 for "sometimes/never", "often", "most of the time") as a benchmark of score performance. We report the mean IRT and sum scores by sex for the four waves. To accommodate related individuals in the sample, the sex difference was tested Fig. 4 Protocol to study and limit DIF across sex groups and waves using the "hglm" R package (fixed effect, Student's t-test) [59]. A matrix of genetic relatedness was used to model the variance covariance structure of the sample. Such matrix was created using the "kinship2" package [60].
For test-retest evaluation, we included all unrelated participants with a test-retest period shorter than four months. This resulted in 52 participants with a median test-retest interval of 1.9 months (range 1 day-3.8 months), 27 (50%) of the participants were females. Median age was 14 years (range 12-18). Test-retest was calculated using intra-class correlations (ICC) from the R package "irr" (two-way consistency ICC) [61].
Two widely used metrics in questionnaire validation include Cronbach's alpha [35] and Loevinger's Coefficient [11]. Cronbach's alpha, often known as internal consistency, measures the proportion of the variance in the scale attributable to a common factor [62]. Despite being reported in almost every scale description, many parameters (e.g. number of items, items inter-correlation, dimensionality) have been shown to influence the coefficient [62], making its interpretation difficult [62,63]. However, it is commonly considered that alpha >0.7 suggests an acceptable consistency while alpha >0.9 may indicate presence of redundant items [63]. The use of Loevinger's Coefficient (H), or "scalability" coefficient, was popularised in Mokken Scale Analysis [15,36,58], a NIRT approach, which relies on a set of metrics to investigate the items or scale properties. Leovinger's Coefficient can be calculated between two items (Hij), between an item i and a scale (Hi), of for a whole scale (H). Under the assumption of monotonicity of the IRSF, it has been shown that Hij > 0 for all (i,j), and Hi > 0 [64], however the reciprocal does not hold, thus Loevinger's Coefficient cannot be used to confirm monotonicity of the IRSF. In addition, Loevinger's Coefficient is sensitive to population variance, item difficulty, discrimination and presence of redundant items in the scale making their interpretation also difficult [65,66]. However, it is commonly accepted that items satisfying Hij > 0 for all (i,j), i ≠ j, and 0.3 < Hi < 0.4 form a "weak Mokken scale". When 0.4 < Hi < 0.5 the scale is defined as "medium", and when 0.5 < Hi the items form a "strong Mokken scale" [36,64]. Internal consistency was calculated in R using the "psy" package [67], Loevinger's Coefficients were calculated using the package "Mokken" [58]. For all scores in all studies, we report Cronbach's alpha, number of Hij < 0, min(Hi).
Composition of the twin sample for heritability, genetic and environmental correlations To facilitate interpretation of age specific heritability and correlations across ages, we binned the observations by age, creating four age bins (9 to <13 years, 13 to <15 years, 15 to <17 years and 17 to <28 years), which were centred around the mean age for each wave. For those individuals where two SPHERE-34 assessments occurred close together, which resulted in two assessments for an individual in an age bin, we randomly selected one SPHERE-34 assessment (Table 2). Next, we restricted the family size to a maximum of three siblings (one twin pair and one sibling or non-identical trio), which led to the exclusion of 161 participants (additional siblings or identical trio). Thus the final sample for genetic analyses comprised 1382 individuals with a mean age of 12 years, 1371 individuals with a mean age of 14 years, 1508 with a mean age of 16 years and 887 with a mean age of 19 years. See Table 2 for number of complete trios, monozygotic (MZ) and dizygotic (DZ) twin pairs, twin-sibling pairs and singletons. The two younger age bins had an equivalent proportion of males and females (50%), while there were slightly more females in the two older age bins (15 to 16 years (54%); 17 to 28 years (58%)) ( Table 3).
Heritability, genetic and environmental correlation between the scores We used a twin and sibling design to partition the variance into additive genetic "A", unique environment components "E" and either familial (common) environment "C" or dominant genetic "D" [68][69][70][71]. Heritability is defined as the proportion of trait variance explained by the additive genetic factor. The twin design relies on the fact that, for a heritable trait, the twin-pair correlation increases with the degree of genetic relatedness, resulting in higher twin correlations in the MZ group compared to the DZ group. Here, we included an additional sibling when available, which provides an increase in power for detecting A and C/D [72]. Finally, we included singletons (in studies TA and TM) that do not contribute to power for detecting A or C/D, but improve the stability of the estimates of means and variance. Analyses were performed in OpenMx 2.2.6 [71,73] using full information maximum likelihood (FIML) to accommodate singletons and incomplete trios. We compared the fit of ACE vs. ADE models using the Akaike Information Criterion [74], and tested the significance of A, C/D and E fraction of variance using log-likelihood ratio test on nested models. Prior variance component modelling, we tested the comparability of means, and variances across zygosity groups and siblings, to identify sampling issues and outliers that may bias the results [75]. In order to limit the number of tests and improve readability, we performed an omnibus test (likelihood ratio test, 20 degrees of freedom) that tests whether equating all means and variance results in a significant reduction of the model fit. In addition, we tested the effect of sex, age and study on the score means, and also whether the twin covariances suggested sex-specific heritability [76]. All significant covariates were included in subsequent analysis. For each age group, we reported the heritability of the scores (IRT and sum scores). Then, we fitted a bivariate model to estimate the genetic and environmental correlations between anxiety-depression and chronic fatigue.

Collection of the DSM-IV clinical assessment
A later wave of the BLTS ("19up: the study mapping neurobiological changes across mental health stages") [8] collected Composite International Diagnostic Interviews (CIDI) [77] that we used to compute DSM-IV diagnoses of major depressive disorder (MDD), social anxiety, alcohol and marijuana dependence (i.e. substance dependence), and panic disorder [8]. As of June 2016, a total of 2773 twins and siblings had completed the study, of which 2041 had previously answered at least one SPHERE-34 questionnaire. 709 participants had a SPHERE-34 score collected between 8 and 12 years (mean age at CIDI = 21.9, SD = 1.7, 59% females), 907 with SPHERE-34 between 13 and 14 years (mean age at CIDI = 22.9, SD = 2.4, 58% females), 1055 with SPHERE-34 between 15 and 16 years (mean age at CIDI = 23.2, SD = 2.5, 61% females) and 739 who answered the questionnaire between 17 and 28 years (mean age at CIDI = 28.9, SD = 3.1, 61% females). Despite a later  (Table 3). This should prevent the association between SPHERE-34 and the DSM-IV diagnoses being confounded by censoring. Thus, different predictive abilities of age-specific SPHERE-34 scores can be attributed mostly to differences in age at questionnaire, rather than to age at CIDI. Age of onset is not available for substance dependence and only the age at initiation was collected.

Association of new SPHERE scores with DSM-IV diagnoses
We estimated the increased risk of DSM-IV diagnoses (MDD, social anxiety, alcohol and marijuana dependence) associated with an increased SPHERE-21 score. We do not report results of association with panic disorder as low numbers made estimation of parameters impossible. Results are presented in the form of odds ratio, which are equivalent to relative risk estimates as disease prevalences in our sample match those of the general population. To accommodate related individuals in the sample, the model parameters were estimated using quasi-likelihood implemented in the "hglm" R package (fixed effects, Student's t-tests) [59]. A matrix of genetic relatedness, created using the "kinship2" package [60], was used to model the variance-covariance structure of the sample. This approach provides unbiased estimates of the variance of the estimates, which prevents underestimating p-values. Sex, ages at SPHERE, age at CIDI and dummy variables for the SPHERE study waves were included as covariates in the model. Finally, we estimated the number of independent SPHERE-21 scores across all age bins (np) using the eigenvalues of the correlation matrix [78,79]. We then used a Bonferonni significance threshold of 0.05/(np*4), four being the number of diagnoses tested, to avoid enforcing a too stringent significance threshold.

Results and discussion
Redefinition of the SPHERE scales in the sample of young adults (TA study)

Anxiety-depression scale
All items of the original anxiety-depression scale showed monotonic IRSF in the normal range of the latent trait distribution. Item 5 ("Nervous/ tense") presented the most obvious decrease of IRSF (Fig. 5), but this was limited to the top 2.5% of the distribution, which did not justify its exclusion. Additional items showed monotonous IRSF in the presence of the other 14 items and could be considered pertinent for the assessment of anxiety-depression: item 1 ("Headaches"), 7 ("Waking up tired"), 16 ("Longer sleep"), 17 ("Tired after activity"), 29 ("Tired after rest"), and item 31 ("Tired after activity").
However, these were all items from the somatic-distress or fatigue scales and we did not include them in the anxiety-depression scale to avoid artificially inflating the correlation between the 2 scores. Item 3 ("Poor memory") was the least discriminant ( Fig. 6) having the flattest IRSF, while item 2 ("Irritable/cranky") was the most discriminant (steepest IRSF). Items 26 ("Annoyed easily") and 23 ("Frustrated") were the least difficult (Fig. 6) being endorsed by individuals with low proficiency (early elevation of IRSF, see Fig. 5), while item 28 ("Dizziness") was the most difficult.

Chronic fatigue scale
We started with the 15 items present in either the somatic-distress or fatigue scales (items 1, 3, 6, 7, 10, 13, 14, 16, 17, 24, 25, 29, 30, 31 and 32) and estimated the IRSF. Several items exhibited small local decreases in their IRSF (Additional file 3). However, this could be a consequence of the presence of items poorly correlated to the scale, leading to biased estimation of the latent trait. Indeed, items 14 ("Fevers") and 24 ("Diarrhoea/ constipation") were not often endorsed, even for individuals with very high latency (Additional file 3). For example, the estimated probability of reporting fevers "more than often" was below 0.4, and no participants reported fevers "most of the time" (Fig. 8). Thus, we excluded items 14 and 24 as they corresponded to symptoms rarely reported or only present in a subgroup, as suggested by 95% confidence intervals of the IRSF not reaching 1 (Additional file 3). We further excluded items 6, 10 and 16 for non-monotonicity and item 13 for its low endorsement. These exclusions resulted in smoother and monotonous IRSF for the nine remaining items (Additional file 4). Then, we considered relevant items not included in the anxiety-depression scale: items 15 ("Back pain") and 22 ("Weak muscles"). After inclusion of these additional items, the IRSF remained monotonous (Fig. 7). Overall, item 1 ("Headaches") was the least discriminant (Fig. 8) having the flattest IRSF, while items 17 ("Tired after activity") and 29 ("Tired after rest") exhibited the steepest IRSF (Figs. 7 and 8). Newly included items 15 and 22 were moderately discriminant, with item 22 being the most difficult in the scale, hence adding information on the individuals with extreme somatic-distress. These two items were included in the chronic fatigue scale.

Differential item functioning across sex and study wave Anxiety-depression scale
Sex DIF was moderate to low in all study waves and items, even if item bias might be slightly more pronounced in the TM study that shows a median DIF statistic of 0.13 (vs. 0.11 in TA, 0.065 in TW2 and 0.080 in TW1). Maximum sex DIF was found for item 33 ("Losing confidence"; DIF = 0.23) and item 27 ("Everything on top of you"; DIF = 0.22) in study TM. However, we kept these items in the scale as they did not show consistent DIF across studies, and their impact on the TM score would remain small (DIF < 0.25 and mostly overlapping 95% CIs, Additional file 5, Additional file 6, Additional file 7, and Additional file 8).
We observed a very limited DIF between waves (Additional file 9, Additional file 10, and Additional file 11) suggesting that the anxiety-depression scale measures the same latent construct across waves, hence age groups. Median item DIF was 0.087 for TM vs. TA, 0.10 for TW2 vs. TA and 0.088 in TW1 vs. TA. Items 12 ("Unhappy/ depressed"), 20 ("Under strain") and 23 ("Frustrated") consistently showed item bias above the median, TA participants being more likely to report the symptoms "more than often", knowing the latent trait. However, these levels Step Functions of the 14 items proposed to measure anxiety-depression. For each item, two IRSF are calculated that correspond to the probability of having the symptom more than often (dark blue line) and the probability of having the symptom most of the time (light blue line). Dotted lines correspond to the 95% confidence interval of the IRSF. Using NIRT estimation, we do not constrain the IRSF to be logistic of DIF (0.13-0.24), which would have moderate impact on the scores, did not justify exclusion of these items.

Chronic fatigue scale
Monotonicity of the IRSF was observed for all items and waves, in the normal range of the chronic fatigue continuum (Additional file 12, Additional file 13, Additional file 14, and Additional file 15). The median sex DIF of the scale was 0.11 in the TA study, 0.12 in TM, 0.10 in TW2 and 0.092 in TW1, suggesting overall minor artificial sex differences in chronic fatigue scores (Additional file 12, Additional file 13, Additional file 14, and Additional file 15). However, we excluded item 1 ("Headaches") which showed a high item bias (DIF = 0.26 in TA and 0.24 in TM, non-overlapping CIs: Fig. 8), with females more likely to report headaches when compared with males with the same latent score.
We observed very limited DIF between TM, TW2, TW1 and TA (Additional file 16, Additional file 17, and Additional file 18). Median DIF across items was 0.084 for TM vs. TA comparison, 0.090 for TW2 vs. TA and 0.069 in TW1 vs. TA. Item 15 ("Back pain") was more frequently reported by participants of the TA study and showed the highest DIF in TW1 vs. TA and TW2 vs. TA (DIF = 0.21) but not in TM vs. TA (DIF = 0.084). However, the item did not meet the DIF exclusion criteria and we maintained it in the scale.

Summary of NIRT analysis: The SPHERE-21 questionnaire
NIRT analysis showed that IRSF of the SPHERE-34 items were roughly logistic, varying in difficulty and discrimination (Figs. 6 and 8), sometimes exhibiting right asymptotes below 1 and local plateaus (Figs. 5, 7; Additional file 3 and Additional file 4). The latter would cause even the most complex PIRT model (four parameters logistic, with parameters measuring difficulty, discrimination, left and right asymptotes) to fit the data poorly. Using common PIRT model (e.g. two parameters logisticmodelling difficulty and discrimination only) would have resulted in poorer fit to the data, likely resulting in exclusion of more items. Overall, such exclusions would have led to smaller scales that tend to be less reliable and precise [66]. Finally, the IRSF left asymptotes were all 0, which suggests absence of guessing (i.e. no participants answering the questions at random). Thus we can infer that participants in all waves understood the questions (or answered by the negative when they did not understand).
The anxiety-depression scale was left unchanged after NIRT analysis. Across all four waves the anxietydepression items met the IRT hypothesis of monotonicity. In addition, no item showed substantial DIF by sex or study wave suggesting the scale measures a consistent construct across groups and that sex or study wave differences observed arise mostly from true differences in latent trait.
On the other hand, we excluded six items from the chronic fatigue scale that were only endorsed by a fraction of the participants or did not satisfy the requirement of monotonicity of the IRSF. Two additional items, not present in the anxiety-depression scale, were added to improve the stability of the scale and/or the score distribution, as these provide information about individuals with low levels of chronic fatigue. Finally, we excluded item 1 ("Headaches") that showed large sex DIF (in studies TA and TM), being more frequently reported by females, compared to males with the same proficiency. All other items of the chronic fatigue scale met DIF inclusion criteria. Overall low DIF suggests that the scales measure comparable constructs across sex and waves, hence age groups.
The final version of the SPHERE-21, which measures anxiety-depression (14 items) and chronic fatigue (10 items) is available in Additional file 19 (questionnaire) and Fig. 9 (scale definition). Three items are present in both scales (items 3 "Poor memory", 30 "Poor concentration" and 32 "Feeling lost for words").
We computed the IRT and sum scores of the two scales of SPHERE-21. As they both satisfy IRT hypotheses, contain enough items (>5), with a limited number of categories (<5) and similarly shaped Item Response Step Functions, stochastic ordering by the sum score can be assumed [26]. This allows inferring the ordering of the participants' true abilities from the ordering of the SPHERE-21 sum score.

SPHERE-21 mean scores, reliability, internal consistency and Mokken scale analysis
Using all observations available we tested for sex differences, after correcting for familial relatedness. Females had significantly higher anxiety-depression scores compared with males in study TM and TA (+0.4 and +0.5 for sum scores, p-values < 8.8E-4), but the difference was not significant at younger ages in TW2 and TW1 (after correction for multiple testing, Bonferroni correction). In addition, female's reported lower chronic fatigue in study TW1 (−0.2 pt. in sum score p-value < 9.7E-5) but no significant differences survived multiple testing correction in the older waves (Table 4). Up to a quarter of the participants answered "never or sometimes" to all questions (22% in TA, 23% in TM, 25% in TW2 and 21% in TW1) yielding a sum score of 0 and an IRT score of −3. This proportion was lower for chronic fatigue (15% in TA, 14% in TM, 17% in TW2 and 19% in TW1).
The  (Table 5). In addition, the Hi were also positive, which is expected when the hypothesis of monotonicity of the IRSF is met [64]. However, we note that the minimal Hi were all below 0.3 (items with Hi < 0.3, Table 5) and that items with low discrimination (e.g. items 3, 15, 31 in chronic fatigue, item 3 in anxietydepression, see Figs. 6 and 8) would be excluded in Mokken Scale Analysis (MSA) [15]. Items 32 in chronic fatigue and 28 in anxiety-depression would also be excluded in MSA despite their good discrimination. Thus, one may prefer to use MSA for its simplicity, or when trying to reduce the length of a questionnaire. The counterpart being that MSA relies on rather arbitrary criteria (see [65] for further discussion on the interpretation of Loevinger's coefficients) and, like PIRT, may reduce reliability and precision of the instrument [66].

Heritability, genetic and environmental correlations between the SPHERE-21 scores
Covariate effect, twin-pair correlations and homogeneity of sampling across twin zygosity groups and siblings were investigated for IRT and sum scores in each age bin. Detailed results are available in Additional file 20. In summary, sex was nominally significant (p-value < 0.05) for most bins and scores, except for the anxiety-depression scores of the 13 and 14-year age group (p-values = 0.78 and 0.91). Females had lower anxiety-depression (−0.85 sum score, p-value = 4.8E-4) and chronic fatigue scores (−0.79 sum score, p-value = 4.5E-6) at age 8 to 12 years. At older ages, females had higher anxiety-depression (+1.52 sum score at 15 to 16 years, p-value = 3.9E-10; +1.37 sum score at 17 to 28 years, p-value = 2.0E-5) and chronic fatigue scores (+0.38 sum score at 15 to 16 years, pvalue = 0.036; +0.54 sum score at 17 to 28 years, pvalue = 0.021). Age at assessment was significant for chronic fatigue sum score (−0.55 in sum score per year of age, p-value = 0.033) in age group 15 to 16 years, and both the anxiety-depression (−0.17 in sum score per year of age, p-value = 0.041) and chronic fatigue sum scores (−0.20 in sum score per year, p-value = 0.021) for those aged 17 years and older.
For all the IRT scores, the omnibus test did not reject the null hypothesis of equality of means and variance across groups (Additional file 20). On the other hand, the test returned significant p-values (between 0.015 and 3.2E-7) for all but one sum score (chronic fatigue within 15 to 17 age range, pvalue = 0.59). We winsorised the sum scores to three standard deviations from the mean, in order to limit the influence of extreme values. However, most sum score means and variance were still significantly different across groups (p-values in 0.72-4.7E-5, See Additional file 20). In addition, two tests suggested presence of sex limitation, however only on sum scores, and we also attributed these rejections to the skewed distribution. Tests of familial aggregation and presence of genetic effect were significant for all the IRT scores. Non-significant results observed for sum scores in the 17 years and older age group could be attributed to lower power (smallest sample size). Finally, the MZ twin pair correlations were always greater than the DZ correlations suggesting presence of heritability (Additional file 20, Table 6). These results highlight that sum scores are not normally distributed, with overly frequent scores of 0 and a heavy We fitted ACE and ADE models for IRT and (Winsorised) sum scores in all age groups (see Additional file 20 for summary of model fit). We did not have the power to detect A and C/D simultaneously; due to the modest number of twin-sibling pairs and considering the magnitude of the effects (Additional file 20). In Fig. 10 and Table 6 we report the heritability estimates from an AE model, however, we cannot exclude that a shared environment source of variance may be present for some age groups (Additional file 20). Heritability estimates for    (Fig. 10, Table 6). Differences between IRT and sum scores could be partially explained by outliers present in the sum score distribution.
The anxiety-depression and chronic fatigue IRT scores were positively correlated (Additional file 20), consistently across age groups (r 9-12years = 0. Previous results on the total sample (1168 complete pairs aged 12 to 25 years, [6]) reported similar heritabilities around 0.40 as well as correlations (r = 0.60, rG = 0.87 and rE = 0.41) between anxiety-depression and somatisation sum scores. Here, we expanded these results by showing consistent heritability and correlation between SPHERE-21 scores in different age groups. Results can be compared across publications as we used Fig. 10 Heritability of anxiety-depression and chronic fatigue scores across age. Bars indicate 95% confidence intervals. Estimates and confidence intervals correspond to the ones from AE models the same definition for the anxiety-depression scale, and combined the somatisation and fatigue scales that showed almost perfect genetic correlations (Additional file 1).

SPHERE-21 association with some DSM-IV psychiatric diagnoses
We tested the association of SPHERE-21 scores from earlier ages with DSM-IV diagnoses (MDD, social anxiety, alcohol and marijuana dependence) assessed with the CIDI after age 19 (mean age 22). We estimated the number of independent SPHERE-21 scores to be six [78], yielding a significance threshold of 2.1E-3 corresponding to an estimated 24 independent tests. The anxiety-depression IRT scores were associated with increased MDD risk (OR 13 Fig. 11 Risks of MDD, social anxiety and substance dependence increases with anxiety-depression IRT scores. p-values are indicated above 95% confidence intervals. The stars correspond to significance after correcting for multiple testing (Bonferonni correction). *corresponds to p corrected < 0.05, **p corrected < 0.01 and ***p corrected < 0.001 marijuana dependence (OR 15-16 = 1.47 [1.18,1.82], p = 5.3E-4). All other odds ratios were greater than 1 but did not reach significance (Fig. 11).
Chronic fatigue IRT scores were also associated with increased risk of MDD (OR 15 (Fig. 12). Such odds ratios (1.09 to 1.82 from the confidence intervals) translate to a 0.6 to 6 fold increased risk between individuals with minimal (−3) and maximal (+3) IRT score.
Sum scores showed the same pattern of association, except for anxiety-depression in those aged 17 to 28 years, which did not reach significance (p = 7.3E-3, Additional file 21 and Additional file 22). Effect sizes were comparable, taking into account the difference in range between IRT and sum scores (Additional file 21 and Additional file 22). Fig. 12 Risks of MDD, social anxiety and substance dependence increases with chronic fatigue IRT scores. p-values are indicated above 95% confidence intervals. The stars correspond to significance after correcting for multiple testing (Bonferonni correction). *corresponds to p corrected < 0.05, **p corrected < 0.01 and ***p corrected < 0.001

Comparison of SPHERE-21 and Beck's depression inventory
Compared with the psychometric properties of the "gold standard" Beck Depression Inventory (BDI-II) [80,81], the SPHERE-21 is considerably shorter for measuring anxietydepression (14 vs. 21 items) and provides an additional measurement of chronic fatigue. While studies on the latent structure of the BDI consistently identify two dimensions: cognitive-affective and somatic-vegetative [81], more sophisticated modelling showed that much of the variance of the BDI could be explained by a general construct, and BDI subscales are rarely used in practice [81]. Furthermore, combining cognitive-affective and somaticvegetative symptoms may be appealing as it matches the DSM-IV (and DSM-5) definition of MDD. Based on our results, the high genetic correlation between anxietydepression and chronic fatigue could justify combining the two scales, as the same genetic factors would contribute to both traits. However, anxiety-depression and chronic fatigue shared less than half of their environmental sources of variance and separating them in analyses could help identify specific environmental contributors [82].
Psychometric properties of the BDI have been studied for more than 50 years [80,81]. However, most of the early studies suffered from lack of powerful statistical methods (such as IRT). Based on omnibus measures of test-retest and internal consistency, the BDI shows very good psychometric properties, comparable to SPHERE-21 (Table 7). However, more in depth assessments [53] revealed that two items of the BDI (9 "Suicidal wishes" and 10 "Crying") failed to meet IRT requirements of monotonicity of IRSF in depressed outpatients and nonpatient college students [53], potentially leading to bias in score and misordering of the participants on the sum score. In addition, item 19 ("Weight loss") correlated poorly with the latent trait, thus not contributing to the scale and potentially breaching unidimensionality [53]. Finally, item 14 ("Distortion of image body") showed large sex DIF (DIF = 0.32), being endorsed more often by women [53]. These do not invalidate the BDI, as it has also been shown to effectively measure depression in both clinical and non-clinical settings, and across different languages and populations [81]. However, one can question what impact score bias, sex differential functioning, and participants' misordering have on a study's power and predictive ability.
Heritability of the BDI score has been reported from large family data (N = 200 from 12 families) [83] or broken down into subscales (343 twin pairs) [84]. The first study reported heritabilities between 0.45 and 0.87, while the second could not conclude regarding the Association with DSM-IV diagnoses Significant from age 15 years with alcohol and Marijuana dependence; and from age 13 years for MDD and social anxiety (anxiety-depression subscale).

Not evaluated in general population
Price Free Around 2 USD per questionnaire [99] a Questionnaires in non-English languages available on demand. Please contact Pr. Ian Hickie (ian.hickie@sydney.edu.au) Here, we used the BDI questionnaire as gold standard as it is one of the oldest, most used and most tested depression questionnaire. For other widely used questionnaires such as the Achenbach or Hamilton rating scales, some the methods used here (e.g. NIRT modelling, twin models) have never been applied, which makes the comparison less meaningful presence of heritability or common environment factors, explaining 2 to 30% of the score variance. Larger twin studies are required to provide more accurate heritability estimates of the BDI across ages. Finally, the BDI has been evaluated many times as a prediction tool for MDD in clinical settings [81]. A few studies have focused on non-clinical samples but suffered several limitations: a) small samples; b) samples with greater prevalence than in general population; c) non-DSM-based diagnoses; and mostly, d) use of score cut-off criteria which defeats the purpose of using a continuous score but also makes comparison of specificity and sensitivity impossible across studies when different cut-offs are used (see [81] for a review of these studies). We could not find a publication reporting the association between the BDI score and disease risk in the general population, and much testing remains to be done on the BDI to validate its use in population samples and non-clinical research. Comparison of SPHERE-21 and BDI qualities is summarised in Table 7.
There are several limitations to the SPHERE-21 that are worth mentioningit has only been tested in an Australian-based population sample of young people, and previously on clinical participants [1,[85][86][87][88][89]. Thus, more testing and DIF investigation is required on older participants, patients with specific pathologies or different cultures and ethnic groups. Use of the SPHERE-21 in other English-speaking countries may require some items to be reworded. For example, for item 2, the word "cranky", not frequently used in the United States, could be replaced by "easily irritated". The scalability of the BDI across countries and languages led to its world-wide popularity [81], though only recently was IRT used [90][91][92][93][94], and little has been done to assess cross-cultural comparability of the BDI scale [81] (e.g. DIF by culture or ethnicity). In addition, unlike the BDI [81], correlations between the SPHERE-21 scores with other measures of anxiety, depression or fatigue remains to be investigated. The only published research showed a positive correlation (and significant genetic relationship) of the anxiety-depression sum scores with neuroticism [6]. SPHERE-34 was also shown to have some value in screening for psychiatric morbidity [89]. Finally, the SPHERE-21 lacks positive item results in a skewed distribution, but this limitation also applies to the BDI [81]. A simple way to improve the score distribution may be to separate options "never" and "sometimes" during SPHERE-21 questionnaire collection, as it may provide more information about individuals with low anxiety-depression and fatigue (Additional file 23).