Temporal stability and change in manifest intelligence scores: Four complementary analytic approaches

The temporal stability of psychological test scores is one prerequisite for their practical usability. This is especially true for intelligence test scores. In educational contexts, high-stakes decisions with long-term consequences, such as placement in special education programs, are often based on intelligence test results. There are four different types of temporal stability: mean-level change, individual-level change, differential continuity, and ipsative continuity. We present statistical methods for investigating each type of stability. Where necessary, the methods were adapted to the specific challenges posed by intelligence research (e.g., controlling for general intelligence in lower order test scores). We provide step-by-step guidance for the application of the statistical methods and apply them to a real data set of 114 gifted students tested twice with a test-retest interval of 6 months.
• Four different types of stability need to be investigated for a full picture of temporal stability in psychological research
• Selection and adaptation of the methods for use in intelligence research
• Complete protocol of the implementation


Introduction
Intervention decisions based on intelligence testing often have long-term consequences for individuals, such as admission to gifted classes or special education placement. Therefore, it is important to examine whether intelligence test scores exhibit sufficient temporal stability. Whereas general intelligence has been consistently found to be a highly stable trait [ 13 , 23 ], the temporal stabilities of lower-order ability scores and ability profiles are more controversial (e.g., [ 6 , 37 ]) and require further research. In the present article, we describe four types of temporal stability, provide step-by-step guidance on how to assess these types of temporal stability, and apply the described methods to a sample of 114 students assessed twice over a 6-month interval. The methods can also be applied to test the temporal stability of other psychological constructs such as personality traits or motivational variables.
Importantly, we focus on the stability of manifest test scores, as these are frequently used in intelligence testing practice. The methods described in this paper are therefore ideally suited to evaluating the stability of the scores of a specific test instrument to decide on its appropriateness for diagnostic decisions with long-term consequences, or to replicating studies on the stability of lower-order intelligence test scores (e.g., [ 22 , 37 ]). To evaluate the temporal stability of underlying cognitive abilities, latent variable approaches are more appropriate than the methods presented in this article.
In previous studies investigating the temporal stability of intelligence, individual aspects of stability were investigated in isolation (e.g., [ 22 , 37 ]). We present a protocol for the combined investigation of all aspects to obtain a full picture of temporal stability in hierarchically organized, multidimensional psychological variables and in intelligence in particular. Specific adaptations of the methods or the protocol for use with intelligence test scores are marked with an asterisk ( * ).

Types of temporal stability: description and step-by-step guidance
Four different types of temporal stability can be investigated in psychological research [ 12 , 29 ]. We present different statistical methods for the investigation of each type of stability. These methods are complementary and it often makes sense to test all four types to obtain a full picture of temporal stability. Nevertheless, some guidance regarding which method is especially appropriate for which application is presented at the end of this section. Many of these methods can easily be implemented using standard statistical software such as SPSS or R. When calculations have to be performed manually, we provide an illustrative example. R-code to replicate all analyses can be found in the Appendix.
1. Mean-level change. The mean-level change represents the change in the mean value of a variable across time within a sample. Investigations of mean-level change answer the question of whether the score increases, decreases, or remains stable over time. As IQ scores are standardized within age groups with a constant mean of 100, no major mean-level change would usually be expected beyond common retest effects, which have been quantified in a meta-analysis [ 30 ]. If the observed mean-level change far exceeds the expected retest effect (i.e., a gain of 4.5 IQ points within a year), this may indicate that the test allows for easier item memorization or benefits more from general test familiarity than other cognitive tests. It may also indicate that some event or intervention between tests improved cognitive ability.
In our real data example, we tested whether our sample of students on average exhibited relevant increases in general intelligence and in more specific intelligence scores after six months. The statistical significance of the mean-level change can be tested with a paired-samples t-test:

t = X̄_D / (s_D / √n)

In this formula, X̄_D is the average difference between T1 and T2, s_D is the standard deviation of this difference, and n is the sample size. A p value < .05 indicates that the likelihood that the observed average difference between the means at T1 and T2 occurred by chance is smaller than 5%. However, a significant paired-samples t-test does not quantify the size of the effect. Cohen's d is an effect size that can be used to quantify the magnitude of the observed change and to indicate whether the mean-level change was non-relevant (< .2), small (.2-.49), medium-sized (.5-.79), or large (> .8). It standardizes the difference between T1 and T2 by the T1 standard deviation.
Cohen's d is calculated by subtracting the T1 sample mean ( M_t1 ) from the T2 sample mean ( M_t2 ) and dividing the resulting difference by the standard deviation at T1:

d = (M_t2 − M_t1) / SD_t1

In the literature, this effect size is also labeled the pretest-posttest raw score effect size or Glass's Δ [ 24 , 31 ]. Note that in the original Cohen's d formula, the mean difference is divided by the pooled standard deviation of T1 and T2 [9]. However, Becker [1] argued that the posttest standard deviation could be influenced by the previous testing, whereas the standard deviation at T1 is free of this influence. According to Cohen [9], d = .20 represents a small difference, d = .50 a medium-sized difference, and d = .80 a large difference between mean values. That is, if IQ scores were assessed twice and the T1 sample mean was 100 ( SD_t1 = 15) and the T2 sample mean was 110 ( SD_t2 = 13), a medium-sized increase ( d = .67) occurred over time.
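The two mean-level change statistics can be sketched in a few lines. The following Python sketch is illustrative (the article's own analysis code is provided in R in the Appendix; the function names here are our own):

```python
import math

def paired_t(diffs):
    """Paired-samples t: mean T2-T1 difference divided by its standard error."""
    n = len(diffs)
    mean_d = sum(diffs) / n
    var_d = sum((d - mean_d) ** 2 for d in diffs) / (n - 1)
    return mean_d / math.sqrt(var_d / n)

def cohens_d_pre(m_t1, m_t2, sd_t1):
    """Pretest-posttest effect size: mean change standardized by the T1 SD."""
    return (m_t2 - m_t1) / sd_t1

# Worked example from the text: T1 mean 100 (SD 15), T2 mean 110
print(round(cohens_d_pre(100, 110, 15), 2))  # 0.67, a medium-sized increase
```

Dividing by the T1 standard deviation (rather than the pooled SD) follows the Becker variant described above.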
2. Individual-level change. The individual-level change represents the reliability of change in individuals. The reliable change index (RCI) [16] reveals the extent to which observed individual changes in a score occurred due to measurement error or due to meaningful change. Thus, it can be investigated whether individual participants showed statistically significant increases, decreases, or no change in the investigated variable over time. In intelligence testing, one would usually not expect a large number of participants with significant individual-level change in their test scores, especially within short test-retest intervals. A large proportion of significant individual-level changes in the same direction may therefore point towards differences in the test situation between the first and second testing or to item memorization and test familiarity effects. Alternatively, individual-level change may be used to investigate interindividual differences in the effect of a specific intervention on intelligence test scores. Here, one would investigate which participants showed significant improvements and which did not.
To compute the RCI, we calculated 95% confidence intervals for change scores in the different test scores. We subsequently determined for each individual participant whether their individual change in a test score exceeded the 95% confidence interval or not. Individual-level changes beyond the 95% confidence interval indicate that the observed individual-level change can be regarded as a reliable individual-level change, as the probability that the change occurred by chance is below 5% (i.e., when using one-tailed testing).
Different methods to calculate the confidence interval have been proposed. First, the reliable change index can be calculated based on the standard error of prediction ( SE_pred ) [11]:

SE_pred = SD_t2 √(1 − r²_t1t2)

SE_pred is calculated by multiplying the SD_t2 of the variable by the root of one minus the squared T1-T2 correlation of the variable. The 95% confidence interval for this index can be calculated by multiplying SE_pred by +/− 1.96. Second, Iverson [15] recommended using an updated version of the original formula based on the standard error of the difference ( SE_diff ):

SE_diff = √((SD²_t1 + SD²_t2)(1 − r_t1t2))

The SE_diff is calculated based on the SD_t1, the SD_t2, and the T1-T2 correlation. A 95% confidence interval can again be calculated by multiplying the SE_diff by +/− 1.96.
After computing the confidence interval, the individual-level change for each participant is evaluated by subtracting their T1 test score from their T2 test score and comparing the resulting difference score to the confidence interval. When computing the individual change score, it is recommended to take potential practice effects into account [ 11 , 41 ]. This can be achieved by using a true score estimate for the second test score [7]:

Y_true = M_t1 + r_tt (Y_OBS − M_t1)

M_t1 represents the T1 mean of the test, Y_OBS is the individual T2 score observed in the retest, and r_tt is the T1-T2 correlation. The formula adjusts the observed change from T1 to T2 for the retest reliability. If the T2 test score of a participant was 120, the T1 sample mean was 100, and the T1-T2 correlation was r_tt = .80, this results in Y_true = 100 + .80 × (120 − 100) = 116. This value can then be used to determine the "true" difference between the T1 and T2 scores.
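The reliable-change steps above can be combined into a short Python sketch (illustrative names; the article's own code is in R, and the SE_diff variant is reconstructed from the worked numbers reported in the results section):

```python
import math

def se_pred(sd_t2, r12):
    """Standard error of prediction: SD_t2 * sqrt(1 - r12^2)."""
    return sd_t2 * math.sqrt(1 - r12 ** 2)

def se_diff(sd_t1, sd_t2, r12):
    """Standard error of the difference (Iverson's variant)."""
    return math.sqrt((sd_t1 ** 2 + sd_t2 ** 2) * (1 - r12))

def true_score_t2(y_obs, m_t1, r12):
    """Practice-adjusted true score estimate for the observed T2 score."""
    return m_t1 + r12 * (y_obs - m_t1)

def reliable_change(diff, se):
    """True if an individual change exceeds the 95% confidence limit (1.96 * SE)."""
    return abs(diff) > 1.96 * se

# Worked example from the text: T2 score 120, T1 mean 100, r = .80 -> 116
y_true = true_score_t2(120, 100, 0.80)
```

A change score is then classified as reliable if `reliable_change` returns True for the chosen standard error.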
3. Differential continuity. Differential continuity represents the rank-order consistency of a test score. It indicates to what degree participants retain their rank-order placement relative to the other participants from the first to the second testing. That is, it answers the question to what extent participants who scored highest at the first testing still score highest at the second testing. Differential continuity is usually used to quantify the test-retest reliability of test scores. Test-retest reliability is of great relevance when deciding whether a test score should be used for long-term individual-level diagnostic decisions such as educational placement decisions. There are no established standards for differential continuity, but Watkins and Smith [37] recommended correlations of at least .80, or even .90, for individual-level diagnostic decisions. With an SD of 15, a continuity value of .80 corresponds to a margin of error (MoE) of 13.15 IQ points; a value of .90 is associated with a MoE of 9.30 IQ points. Differential continuity is evaluated with correlation coefficients. We used the Pearson correlation in our analyses, dividing the covariance of the T1 and T2 scores by the product of their standard deviations.
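The quoted margin-of-error figures can be reproduced from the classical standard error of measurement; a minimal Python sketch (the formula MoE = 1.96 · SD · √(1 − r) is assumed here because it reproduces both quoted values):

```python
import math

def margin_of_error(sd, r):
    """95% margin of error of a score with reliability r (classical SEM)."""
    return 1.96 * sd * math.sqrt(1 - r)

print(round(margin_of_error(15, 0.80), 2))  # 13.15 IQ points
print(round(margin_of_error(15, 0.90), 2))  # 9.3 IQ points
```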
* Adaptation for intelligence research: When investigating the differential continuity of lower order intelligence test scores, one has to consider that these scores share a substantial amount of variance with the general intelligence score. Thus, the differential continuity of the lower order scores may be partially explained by the stability of general intelligence [6]. To assess the stability of the variance unique to a specific lower order score, we computed correlation coefficients controlling for general intelligence at T1 (g1) using partial correlation.
The size of a correlation is limited by the variability of the measured score. In our sample, we only assessed students attending gifted classes, restricting the range of the intelligence scores and thereby limiting the size of the correlations [2] and potentially underestimating the differential continuity. When information on the variability of the scores of interest is available for the full, unrestricted population (for example from test manuals), the correlations can be corrected for range restriction [ 35 , 39 ].
In this formula, r_t1t2 is the observed correlation between the T1 and T2 scores, s_t1 is the estimated standard deviation at T1 in the restricted sample, and S_X is the standard deviation in the unrestricted population.
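A sketch of the correction, assuming the standard Thorndike Case 2 formula for direct range restriction (the article does not reproduce its exact variant here, so treat this formula as an assumption):

```python
import math

def correct_range_restriction(r, s_restricted, s_population):
    """Thorndike Case 2 correction of a correlation for direct range restriction."""
    u = s_population / s_restricted  # ratio of unrestricted to restricted SD
    return (u * r) / math.sqrt(1 + r ** 2 * (u ** 2 - 1))
```

With the Processing Speed values reported later (r = .84, s_t1 = 11.00, S_X = 15.15), this variant yields a corrected coefficient of about .91.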
Lastly, the individual estimates of differential continuity of the different test scores can be used to estimate the reliability of the resulting ability profile. In contrast to the previously presented measures of differential continuity, the profile reliability considers all test scores simultaneously (i.e., all correlations between the multiple test scores). Lienert and Raatz [21] provided the following formula:

prof_r_tt = (r_tt − r_tT) / (1 − r_tT)

r_tt is the mean differential continuity of all scores included in the profile and r_tT is the average test score intercorrelation. The profile reliability increases with increasing differential continuity of the scores and with decreasing average intercorrelation. Profile reliabilities of .5 or larger are considered sufficient for profile interpretation [21].
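The profile reliability can be checked numerically; a minimal Python sketch using the averages reported in the results section:

```python
def profile_reliability(mean_stability, mean_intercorr):
    """Lienert-Raatz profile reliability from the mean differential continuity
    of the profile scores and their average intercorrelation."""
    return (mean_stability - mean_intercorr) / (1 - mean_intercorr)

print(round(profile_reliability(0.79, 0.55), 2))  # 0.53, as reported in the results
```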
4. Ipsative continuity. Ipsative continuity represents the stability of the configuration of different scores of the individual test taker over time. It therefore quantifies the stability of ability profiles. The analyses presented here answer the question to what extent individual strengths and weaknesses remain the same across the different times of measurement. Ipsative continuity analyses inform the interpretation and use of individual cognitive strengths and weaknesses for individual-level diagnostic decisions. If the identified strengths and weaknesses do not replicate significantly above chance level, one should not base interventions or placement decisions on this information.
In a first step, individual strengths and weaknesses have to be identified for each individual participant. To this end, we calculated the critical difference between the general intelligence score and the respective lower order scores for both test and retest. The critical difference indicates the limit that the deviation of a lower order score from the profile mean (general intelligence) has to surpass to be less than 5% likely to occur by chance. A formula to calculate the critical difference was provided by [20]:

D_crit = 1.96 · SD_Gx · √(2 − (r_g + r_Gx))
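The critical-difference computation can be sketched in Python (the formula is reconstructed from the worked example given in the text; function and parameter names are illustrative):

```python
import math

def critical_difference(sd, rel_g, rel_x):
    """Critical difference between general intelligence and a lower order score,
    given the score SD and the reliabilities of both scores."""
    return 1.96 * sd * math.sqrt(2 - (rel_g + rel_x))

print(round(critical_difference(15, 0.95, 0.85), 2))  # 13.15, the example in the text
```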
SD_Gx represents the population standard deviation of the respective lower order score, r_g represents the reliability of general intelligence (Cronbach's α), and r_Gx represents the reliability of the respective lower order score. If the difference between a lower order score and general intelligence was positive and larger than D_crit, it was classified as an individual strength. Similarly, if the difference was negative and larger in magnitude than D_crit, the score was classified as an individual weakness. For example, if the standard deviation is 15 and the reliabilities of the general intelligence score and the lower order score are .95 and .85, respectively, the difference between the two scores has to be greater than 1.96 × 15 × √(2 − (0.95 + 0.85)) = 13.15 to be considered statistically significant.
Once the individual strengths and weaknesses have been identified, the stability of these categorizations can be quantified. To this end, we used two different methods. Cohen's kappa [8] is a chance-corrected metric for the estimation of agreement on nominal scale data. It is often used to assess the degree of agreement between different raters, but it can also be used to assess the degree of agreement between categorizations at different times of measurement:

κ = (P_o − P_e) / (1 − P_e)
In the formula, P_o is the observed agreement among raters or times of measurement, and P_e is the probability of chance agreement. The resulting values can range from −1 to 1, with 0 indicating no systematic agreement, 1 indicating perfect agreement (i.e., a cognitive strength at T1 is still classified as a strength at T2), and −1 indicating perfect disagreement. Landis and Koch [19] provided guidance for the interpretation of kappa values.
* Adaptation for intelligence research: The interpretation of kappa is straightforward. However, the coefficient is affected by an uneven distribution of categories. If one category is much more prevalent than the others (in the case of intelligence measurement, most likely "no significant strength or weakness"), kappa becomes unreasonably low [5]. We therefore used a second metric for the estimation of agreement on nominal scale data, Cramér's V, a transformation of χ²:

V = √(χ² / (n (s − 1)))

n denotes the sample size and s the number of categories. The interpretation of this parameter is similar to that of other correlation coefficients [40], with 0 indicating no systematic agreement and 1 indicating perfect agreement; unlike kappa, Cramér's V cannot become negative.
* Adaptation for intelligence research: It should be noted that it is also possible to investigate ipsative continuity continuously (e.g., [10]). However, in intelligence testing practice, individual strengths and weaknesses are usually assessed categorically. Thus, in most cases the stability of categorical judgements is tested when evaluating the long-term viability of profile interpretation in practice (e.g., [36]).
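Both agreement coefficients can be computed directly from the two classification vectors. A self-contained Python sketch (illustrative; the original analyses used SPSS, and for simplicity this version assumes at least two categories occur at each time point):

```python
import math
from collections import Counter

def cohens_kappa(t1, t2):
    """Chance-corrected agreement between two categorical classifications."""
    n = len(t1)
    p_o = sum(a == b for a, b in zip(t1, t2)) / n   # observed agreement
    c1, c2 = Counter(t1), Counter(t2)
    p_e = sum(c1[k] * c2[k] for k in c1) / n ** 2   # chance agreement
    return (p_o - p_e) / (1 - p_e)

def cramers_v(t1, t2):
    """Cramér's V: sqrt(chi2 / (n * (s - 1))) from the T1 x T2 contingency table."""
    n = len(t1)
    rows, cols = sorted(set(t1)), sorted(set(t2))
    r_tot, c_tot = Counter(t1), Counter(t2)
    chi2 = 0.0
    for r in rows:
        for c in cols:
            observed = sum(a == r and b == c for a, b in zip(t1, t2))
            expected = r_tot[r] * c_tot[c] / n
            chi2 += (observed - expected) ** 2 / expected
    s = min(len(rows), len(cols))                   # number of categories
    return math.sqrt(chi2 / (n * (s - 1)))
```

For identical classifications both coefficients equal 1; with an overrepresented "no strength or weakness" category, kappa is pulled down more strongly than V, as discussed above.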

When to use which temporal stability analysis method?
Temporal stability methods are complementary and not mutually exclusive [ 12 , 29 ]. Yet, for some research questions, it may be adequate to conduct one or two of the discussed methods in particular. Fig. 1 offers decision guidance for determining which method is appropriate for which research question. Two questions guide the decision process. The first question refers to whether one is interested in testing individual level stability or sample level stability. Depending on the answer, the second question either refers to (2a) whether one is interested in testing change in single scales or in the configurations of multiple scales within individuals or to (2b) whether one is interested in the absolute mean level change or the relative rank change of a scale score in a sample.
For example, if one's hypothesis is that a sample of older adults on average shows decreasing test scores over time, "sample level stability" and "mean change" are the answers to questions 1 and 2b respectively. These answers would guide one to the mean-level change method. If one's hypothesis is that some participants show decreasing scores on one scale over time while other individuals show no significant change, "individual level" and "single scale change" are the answers to questions 1 and 2a respectively. These answers would guide one to the individual-level change method.

Sample
We assessed general intelligence and specific ability scores of 114 adolescents (mean age at T1 = 14.11; range = 12.67 to 15.67) from a gifted track of a German grammar school at two measurement points with a test-retest interval of six months. Testing took place in 2002 and 2003. Most students were male (71.1%). At both measurement points, 47 students were in 7th grade (41.2%), 42 students were in 8th grade (36.8%), and 25 students were in 9th grade (21.9%). At T1, the average IQ of the students was 116.7 ( SD = 9.97). We obtained written parental consent for all students. The sample was originally investigated in Breit et al. [4].

Instrument
The BIS-HB [18] is a paper-and-pencil intelligence test designed to assess the intelligence structure of gifted students in particular. It can be administered both individually and in group settings. The test is based on the Berlin Intelligence Structure model (BIS; [ 17 , 34 ]).

Results
The results are presented and interpreted in detail in Breit et al. [4] . The present result section illustrates how the different types of temporal stability can be presented. Further, it points out some differences and commonalities in the results attained from the different statistical methods used within the different types of temporal stability.
1. Mean-level change. In our sample, we found statistically significant increases of all scores from test to retest ( Table 1 ). The mean increase across scores was 7.92 IQ points ( M d = .61). The results show how t-tests and Cohen's d complement each other, indicating statistical significance and effect size, respectively.
T-tests were calculated using SPSS (IBM [14]), indicating statistical significance for all score increases (all differences p < .001; Table 1). We provide a real data illustration for the computation of Cohen's d. For Processing Speed, M_t1 = 112.15, M_t2 = 122.42, SD_t1 = 11.00, and SD_t2 = 12.27 (Table 1), resulting in d = (122.42 − 112.15) / 11.00 = .93.
2. Individual-level change. Table 2 shows the percentage of participants who showed significant increases or decreases for the different test scores. The table is divided into classifications based on SE_pred and SE_diff. We provide a real data illustration of the computation of the critical difference based on the Processing Speed data. First, we calculated the RCI based on SE_pred (Eq. (3)), with SD_t1 = 11.00 (Table 1), SD_t2 = 12.27 (Table 1), and r_tt = .84 (Table 3), resulting in SE_pred = 6.66. Multiplying SE_pred = 6.66 by 1.96 yields the critical difference for reliable change of 13.05. Second, we calculated the RCI based on SE_diff (Eq. (4)). Applying the relevant Processing Speed values from Tables 1 and 3 results in SE_diff = 6.59. Multiplying SE_diff by 1.96 yields the critical difference for reliable change of 12.92. A participant of the present study had the following test values: Processing Speed at T1 = 108, Processing Speed at T2 (true score estimate) = 117.45, resulting in an increase of 9.45 points. This observed increase is smaller than both 13.05 and 12.92, indicating that there was no reliable change for this participant based on either SE_pred or SE_diff.
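The Processing Speed figures in this illustration can be re-derived in a few lines (an illustrative Python re-computation; the original analyses used SPSS):

```python
import math

# Processing Speed descriptives from Tables 1 and 3
m_t1, m_t2, sd_t1, sd_t2, r12 = 112.15, 122.42, 11.00, 12.27, 0.84

d = (m_t2 - m_t1) / sd_t1                                   # Cohen's d (T1-standardized)
se_pred = sd_t2 * math.sqrt(1 - r12 ** 2)                   # standard error of prediction
se_diff = math.sqrt((sd_t1 ** 2 + sd_t2 ** 2) * (1 - r12))  # standard error of the difference

print(round(d, 2), round(1.96 * se_pred, 2), round(1.96 * se_diff, 2))
# -> 0.93 13.05 12.92
```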
3. Differential continuity. Table 3 presents the differential continuity coefficients for all BIS-HB scores. The uncorrected coefficients ranged from .72 to .84, and the coefficients corrected for range restriction ranged from .78 to .93. When controlling for general intelligence, the stability of the lower order scores ranged from .49 to .69. Lastly, the profile reliability was prof_r_tt = .53 when using the uncorrected stability coefficients and prof_r_tt = .71 when using the stability coefficients corrected for range restriction.
Note. All coefficients p < .01. Correction for range restriction was based on the variability in the normative sample. r_12 = uncorrected autocorrelation. r_12.g = partial autocorrelation controlled for general intelligence. r_12c = autocorrelation corrected for range restriction.
Our sample showed range restriction in all scores compared to the BIS-HB standardization sample, resulting in substantially higher coefficients when correcting for this restriction. The results controlling for general intelligence imply that general intelligence accounts for a considerable portion of the stability of the lower order scores, but that the respective unique variances also show substantial stability. We provide a real data illustration of the calculation of the correction for range restriction and of the profile reliability for Processing Speed. The uncorrected autocorrelations and partial correlations controlling for general intelligence were computed using SPSS.
The uncorrected correlation between Processing Speed at T1 and T2 was r_t1t2 = .84 (Table 3), the estimated standard deviation at T1 was s_t1 = 11.00, and the standard deviation in the unrestricted population was S_X = 15.15; the correction for range restriction was calculated from these values. To calculate the profile reliability, we used Eq. (9), inserting the average intercorrelation of .55 and the average differential continuity of .79, resulting in prof_r_tt = (.79 − .55) / (1 − .55) = .53.
Differential continuity values corrected for range restriction can be also used in this formula, yielding the profile reliability corrected for range restriction.
4. Ipsative continuity. Table 4 shows the agreement on strengths and weaknesses quantified by Cohen's kappa and Cramér's V. The median kappa value was Mdn κ = .34 (range .23-.58), indicating fair continuity. Median V was Mdn V = .44 (range .22-.65), indicating moderate continuity. The higher continuity values indicated by Cramér's V support the notion that Cohen's kappa may underestimate stability when one category is overrepresented.
We demonstrate the calculation of the critical difference and apply the results to the data of a participant from our dataset. Eq. (10) was used to calculate D_crit for Processing Speed, based on the population SD of Processing Speed (15) and the reliabilities of Processing Speed (.88) and general intelligence (.95) reported in the BIS-HB manual, resulting in D_crit = 1.96 × 15 × √(2 − (0.95 + 0.88)) = 12.12.
Note. * p < .05, * * p < .01, * * * p < .001. κ = Cohen's kappa.
V = Cramér's V. N.A. = not available because of 0 identified strengths and weaknesses at T1.
A participant of the present study had the following test values: general intelligence at T1 = 117, Processing Speed at T1 = 108 (difference at T1 = −9 IQ points), general intelligence at T2 = 132, Processing Speed at T2 = 117 (difference at T2 = −15 IQ points). At T1, the absolute difference is smaller than D_crit, whereas at T2, the absolute difference is greater than D_crit, classifying Processing Speed as a cognitive weakness of the participant only at T2 and indicating disagreement between the two times of measurement. The degree of agreement between the T1 and T2 classifications across all participants based on Cohen's kappa and Cramér's V was computed in SPSS.

Discussion
We presented statistical methods for the investigation of four different types of temporal stability of manifest intelligence test scores and illustrated their application with sample data; we further provide decision guidance for choosing the most appropriate type of temporal stability analysis for a research question as well as the R code for the analysis protocol. We focused on the investigation of manifest test scores, which are frequently interpreted in intelligence testing practice. The methods presented allow evaluation of the usefulness of an intelligence test with regard to diagnostic decisions with long-term consequences. There are further research questions concerning the stability of intelligence, such as the temporal stability of the latent cognitive ability constructs (i.e., the g-factor). For these questions, modern statistical methods based on structural equation modeling or item response theory allow investigations of stability adjusted for measurement error. For example, latent change models and growth curve modeling can be used to investigate latent mean-level change (e.g., [32,33] ). Auto-correlative or auto-regressive structural equation modeling can be used to investigate latent differential continuity (e.g., [ 3 , 28 ]). Ipsative continuity can be tested by growth mixture modeling (e.g., [25][26][27] ). Individual-level change can be investigated by observing the individual latent slope of a person from a growth curve model (e.g., [32] ).