Item response theory and validity of the NEO-FFI in adolescents

Highlights ► It is important to maximise the precision of personality measurement in adolescents. ► We apply item response theory (IRT) to the NEO-FFI in an adolescent sample. ► IRT was used to assess item validity and highlight poorly performing indicators. ► Removing poor items reduced measurement error without compromising validity. ► IRT analysis can be used to develop personality measures ensuring item validity.


Introduction
The five factor model is one of the most extensively applied models of personality currently in use. The personality traits of Neuroticism, Extraversion, Openness, Agreeableness and Conscientiousness have been repeatedly found across ages, cultures and within the same individual over time (e.g. De Fruyt, De Bolle, McCrae, Terracciano, & Costa, 2009; Hřebíčková et al., 2002; McCrae, Costa, & Martin, 2005). This has led to them being empirically related to a cornucopia of concepts as well as used in mediation and moderation models of current behaviours, helping to define relationships and explain outcomes. In adolescence, personality may even be a key mediator of individual differences in the course and treatment responses of youth with mental disorders that emerge at this period in development (Costello, Copeland, & Angold, 2011).
However, on closer inspection, problems remain with personality measurement in adolescents. In comparison to adult research, studies with adolescents have found more cross loadings, and more items that do not load sufficiently on any factor. Additionally, the studies demonstrate that items from the Neuroticism and Conscientiousness scales perform better, whereas Extraversion, Agreeableness and Openness items are less reliable (e.g. Parker & Stumpf, 1998; Sneed, Gullone, & Moore, 2002). The problems with factor replicability may be due to developmental changes that take place during this time; personality traits are still in flux throughout adolescence (McCrae et al., 2002) and the structure and coherence of the five factors vary at different ages (Soto, John, Gosling, & Potter, 2008). Therefore it is important to determine if the precision of personality measurement can be maximised for use in behavioural and clinical studies in this age range.
Item response theory (IRT) can be used to improve the measurement of adolescent personality. The application of IRT allows scale psychometric properties to be revealed with greater precision than other multivariate methodologies; analysing item level information can provide insights into measurement reliability and enables a thorough evaluation of the internal construct validity.
IRT provides information by checking the validity of the items and delineating poorly performing indicators. It does this by estimating each individual item's discrimination on the latent trait (the a parameter) and difficulty within a population (the b parameter) (Embretson & Reise, 2000). An item's discrimination reflects how the probability of endorsing an item changes as the level of the underlying trait increases. Thus, highly discriminating items more strongly represent the latent trait. The item's difficulty corresponds to the likelihood of an individual endorsing it given their level of the latent trait. An item is considered easy if most people endorse it, and the difficulty rises as the likelihood of endorsing it decreases. Therefore some items may be easy to endorse even at relatively low levels of the latent trait. IRT also provides estimates of each scale's and item's information function through total and item information curves (TICs and IICs). These depict the amount of information provided across levels of the latent trait. This means IRT can be used to reveal how informative a measure is at all levels of the latent trait (Baker, 2001).
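The roles of the a and b parameters and of the item information curve can be illustrated with a minimal Python sketch of a single-factor graded response model. The parameter values are hypothetical and chosen only for illustration; they are not estimates from this study, and the function names are assumptions rather than part of any IRT package.

```python
import numpy as np


def _cumulative(theta, a, b):
    """P(X >= k | theta) for k = 0..K, padded with 1 (k = 0) and 0 (k = K)."""
    theta = np.atleast_1d(np.asarray(theta, dtype=float))
    b = np.asarray(b, dtype=float)
    # One logistic curve per threshold; a is the common slope (discrimination).
    p_star = 1.0 / (1.0 + np.exp(-a * (theta[:, None] - b[None, :])))
    ones = np.ones((theta.size, 1))
    zeros = np.zeros((theta.size, 1))
    return np.hstack([ones, p_star, zeros])


def grm_category_probs(theta, a, b):
    """Probability of each response category, shape (len(theta), K)."""
    cum = _cumulative(theta, a, b)
    return cum[:, :-1] - cum[:, 1:]


def grm_item_information(theta, a, b):
    """Item information: sum over categories of (dP_k/dtheta)^2 / P_k."""
    cum = _cumulative(theta, a, b)
    probs = cum[:, :-1] - cum[:, 1:]
    d_cum = a * cum * (1.0 - cum)          # derivative of each cumulative curve
    d_probs = d_cum[:, :-1] - d_cum[:, 1:]
    return np.sum(d_probs ** 2 / probs, axis=1)


# Hypothetical five-category item: four ordered thresholds (b parameters).
theta = np.linspace(-3, 3, 121)
thresholds = [-2.0, -1.0, 0.5, 1.5]
probs = grm_category_probs(theta, a=1.8, b=thresholds)
info_low = grm_item_information(theta, a=0.6, b=thresholds)   # weakly discriminating
info_high = grm_item_information(theta, a=1.8, b=thresholds)  # strongly discriminating
```

Plotting `info_low` and `info_high` against `theta` gives two IICs with the same thresholds; under these assumed values the more discriminating item yields the taller curve.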
Although IRT can be used to assess the internal validity of a measure, correlates are needed to examine the impact on external validity. IRT analysis has been conducted with adult samples and shows that when the best performing items are chosen, shortened versions of personality inventories often have similar predictive capabilities (Thalmayer, Saucier, & Eigenhuis, 2011). Indeed, Reise and Henson (2000) found that after IRT the NEO-PI-R could be greatly reduced and for many scales only the best four items were needed to produce comparable facet results. Such psychometric research has however not been carried out with younger populations.
As well as delineating internal construct validity this study uses several measures to examine external criterion validity including educational performance, current friendships and general well-being. These measures cover the domains of adolescent competence which are important for the successful negotiation of developmental tasks (Masten et al., 1995). Each personality trait is hypothesised to correlate to varying degrees with the different facets of adolescent competence and therefore go some way towards highlighting a personality pattern associated with individual differences in competent adolescent functioning.
It is hypothesised that Extraversion and Conscientiousness will be positively and Neuroticism negatively associated with well-being (Siegler & Brummett, 2000). Likewise, elevated levels of Conscientiousness and Openness will be associated with school performance (Chamorro-Premuzic & Furnham, 2003; De Fruyt, van Leeuwen, de Bolle, & de Clercq, 2008). Finally, we will examine whether Extraversion and Agreeableness are associated with the quality of current friendship (Scholte, van Aken, & van Lieshout, 1997; Selfhout et al., 2010).
This study applies IRT methodology to the NEO-FFI in order to investigate how it can be utilised to improve the validity of personality measurement in a late adolescent population. Furthermore, an examination of external validity will explore which personality traits are associated with adolescent competence as indexed by measures of current well-being, friendships and school examination performance.

Participants
Participants were 470 English adolescents (295 females, 175 males) who completed the NEO-FFI; mean age 18.7 years (age range: 17.7-20.2 years, SD = 0.55). The participants are part of the ongoing ROOTS study, a longitudinal study of 1204 participants aged 14 years at first recruitment and reassessed at 15.5 and 17.5 years (Goodyer, Croudace, Dunn, Herbert, & Jones, 2010). At 17.5 years data were gathered about academic achievement; additionally, participants completed a friendship satisfaction questionnaire (Goodyer, Wright, & Altham, 1989) and the Warwick-Edinburgh Mental Well-being Scale (WEMWBS; Tennant, Fishwick, Platt, Joseph, & Stewart-Brown, 2006). The self-report version of the NEO-FFI was sent via post an average of 14.9 months after the other measures were completed (range: 5.2-28.6 months, SD = 6.1). Questionnaires were returned by 470 (43.8%) of the remaining sample and were complete for 438 (36.4% of the cohort and 93.2% of the questionnaire responders) of this sub-sample.

Measures
The NEO-FFI was developed from the NEO-PI-R (Costa & McCrae, 1992). The NEO-PI-R contains 240 items measuring five domains (Neuroticism, Extraversion, Openness, Agreeableness and Conscientiousness) represented by specific facets (e.g. Neuroticism is measured by items covering hostility, depression, self-consciousness, impulsiveness, vulnerability to stress and anxiety). The NEO-FFI contains 60 items which are summed to measure personality at the domain level only. Each item consists of a statement rated on a Likert scale ranging from strongly disagree to strongly agree. Scale alpha reliabilities for this sample were .88 (Neuroticism), .81 (Extraversion), .74 (Openness), .77 (Agreeableness) and .87 (Conscientiousness).
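For readers unfamiliar with the scale alpha reliabilities quoted above, the following is a minimal sketch of Cronbach's alpha for one scale. The data are simulated (12 items, matching 60 items spread over five domains, and 438 respondents); nothing here reproduces the study data, and any reverse-keyed items are assumed to have been rescored before the function is called.

```python
import numpy as np


def cronbach_alpha(scores):
    """Cronbach's alpha for a (respondents x items) score matrix."""
    scores = np.asarray(scores, dtype=float)
    n_items = scores.shape[1]
    sum_item_var = scores.var(axis=0, ddof=1).sum()   # sum of item variances
    total_var = scores.sum(axis=1).var(ddof=1)        # variance of the sum score
    return n_items / (n_items - 1) * (1.0 - sum_item_var / total_var)


# Simulated scale: each item = shared trait + independent noise (assumption).
rng = np.random.default_rng(0)
trait = rng.normal(size=(438, 1))
simulated = trait + rng.normal(size=(438, 12))
alpha = cronbach_alpha(simulated)
```

Alpha rises with both the number of items and their average intercorrelation, which is why comparisons between full and shortened scales need to allow for scale length.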
The WEMWBS (Tennant et al., 2006) is a self-report measure of well-being covering two distinct perspectives: the hedonic perspective, which focuses on the subjective experience of happiness and life satisfaction, and the eudaimonic perspective, which focuses on psychological functioning and self-realisation. The measure consists of 14 positively worded items asking about thoughts and feelings over the previous 2 week period, each scored from 1 'none of the time' to 5 'all of the time'. Scale alpha reliability was 0.89.
The friendship satisfaction questions (Goodyer et al., 1989) were taken from a semi-structured interview schedule enquiring about components of peer relationships over the last 12 months. There are eight questions, incorporating three features of the relationships (availability, adequacy and intimacy) to provide a global rating of friendship. Items asking about frequency of occurrences (e.g. do your friends tease you?) are rated from 0 'never' to 5 'almost every day', whereas questions about satisfaction with friendships (e.g. can you confide in your friends?) are rated from 0 'not at all' to 3 'most of the time'. Scale alpha reliability was 0.71.
Data were collected regarding the General Certificate of Secondary Education (GCSE). This is an academic qualification awarded in a specific subject, such as English or Maths, usually taken by students aged between 14 and 16 years. Generally each student is entered for examination in between 8 and 10 subjects, although this varies. The highest pass grade awarded is an A*, continuing down to grade G. The number of GCSE entries, plus the number of GCSE qualifications each participant achieved at grades A*-C and D-G, were used as indices of school performance.

Analysis
The IRT analysis used a graded response model (Samejima, 1969), which is appropriate for ordered categorical responses such as the Likert scales used by the NEO-FFI. This model also allows the individual items to have a different number of response categories. IRT assumes local independence of the items and unidimensionality of each of the factors. Unidimensionality was assessed using confirmatory factor analysis (CFA) where the items were specified to load on one factor. Currently, there is no standard procedure for establishing adequate unidimensionality; generally, evidence of a dominant factor explaining a large proportion of the variance and goodness of fit indices (GFIs) are assessed (Embretson & Reise, 2000).
Analysis was conducted in the Mplus Programme (Version 6, Muthén & Muthén, 1998). IRT was performed using an MLR estimator and a logit link, which sets the scale to use the logistic metric. Baker (2001) produced guidelines for judging item discrimination levels: moderate discrimination is achieved if the a-parameter is between .65 and 1.34 and high discrimination if the a-parameter is 1.35-1.69. A value halfway between these two ranges was chosen to signify items having moderate to high discrimination, thus a cut-off of a > 1.17 was used.
Table 1. Variance and goodness of fit indices for unidimensionality assessment (a) before modifications and (b) after modifications (columns: Neuroticism, Extraversion, Openness, Agreeableness, Conscientiousness).
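The 1.17 cut-off can be reproduced as the midpoint of the span running from the bottom of Baker's moderate range to the top of his high range. This is one reading of "halfway between these two ranges", offered as an assumption rather than a confirmed description of the authors' calculation; the item names and a-values below are hypothetical.

```python
# Baker (2001) ranges as quoted in the text.
BAKER_MODERATE = (0.65, 1.34)
BAKER_HIGH = (1.35, 1.69)


def discrimination_cutoff(moderate=BAKER_MODERATE, high=BAKER_HIGH):
    """Midpoint between the lowest 'moderate' and highest 'high' value."""
    return round((moderate[0] + high[1]) / 2, 2)


def flag_low_discrimination(a_params, cutoff):
    """Return the names of items whose a-parameter falls below the cut-off."""
    return [name for name, a in a_params.items() if a < cutoff]


cutoff = discrimination_cutoff()
# Hypothetical general-factor discriminations for three items.
example_items = {"item_1": 0.82, "item_2": 1.45, "item_3": 1.17}
flagged = flag_low_discrimination(example_items, cutoff)
```

Items returned by `flag_low_discrimination` correspond to those that would be treated as failing to reach moderate to high discrimination.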
The factor scores for each personality scale were correlated using Pearson correlations with the well-being and friendship measures and regressed onto the academic achievement variables before and after IRT. The non-IRT and IRT correlations and regressions were compared using Steiger's z-test. This assesses whether relationships found from the same population are significantly different.
Note: A = general factor, a = group factor discrimination parameters; representing the slope of the curve at the inflection point, b = threshold parameters for the general factor; the point where the response curves for each response category intersect. Items in bold fail to achieve at least moderate to high discrimination (a < 1.17).
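As an illustration of the comparison described above, the sketch below implements one commonly used form of Steiger's (1980) z for two dependent correlations that share a variable (here, the external measure). The text does not state which variant was used, so this choice is an assumption, and all input values are hypothetical.

```python
import math


def steiger_z(r_xy, r_zy, r_xz, n):
    """Compare r_xy with r_zy when x and z are measured in the same sample.

    x = full-scale score, z = reduced-scale score, y = external measure,
    r_xz = correlation between the two scale scores, n = sample size.
    Returns (z statistic, two-sided p value).
    """
    z1, z2 = math.atanh(r_xy), math.atanh(r_zy)       # Fisher transforms
    r_bar_sq = ((r_xy + r_zy) / 2.0) ** 2
    # Covariance term for the two correlations, using the pooled correlation.
    psi = r_xz * (1 - 2 * r_bar_sq) - 0.5 * r_bar_sq * (1 - 2 * r_bar_sq - r_xz ** 2)
    c_bar = psi / (1 - r_bar_sq) ** 2
    z = (z1 - z2) * math.sqrt(n - 3) / math.sqrt(2 - 2 * c_bar)
    p = math.erfc(abs(z) / math.sqrt(2))              # two-sided normal p value
    return z, p


# Hypothetical inputs: correlation with well-being before (0.40) and after (0.38)
# item removal, with the two scale versions correlating 0.95 in 438 respondents.
z_stat, p_value = steiger_z(0.40, 0.38, 0.95, 438)
```

Because the full and reduced scales share most of their items, `r_xz` is typically high, which makes the test sensitive to even small differences between the two correlations.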

IRT analysis
The unidimensionality assessment revealed the GFIs for one factor models were not good. Additionally, each scale had moderately correlated residuals between the items; the NEO-FFI scales contain items that represent the different NEO-PI-R facets to varying degrees, likely causing this inter-item covariation. Therefore modification indices were used to include item correlations improving model fit (see Table 1).
Bi-factor models were used to model the multidimensionality within the data. Bi-factor models allow the scale items to load on the dominant latent trait underlying all the items; in addition, items can load on one or more narrower 'group' factors, providing a way to fit multi-dimensional IRT models (Reise, Morizot, & Hays, 2007).
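One common way to write a bi-factor graded response model in slope-threshold form is given below as a sketch. The notation is assumed for illustration and is not taken from the original analysis: $\theta_G$ is the general factor, $\theta_s$ a group factor on which item $j$ loads, $a_{jG}$ and $a_{js}$ the corresponding discriminations, and $\tau_{jk}$ the ordered thresholds.

```latex
\begin{align}
P(X_j \ge k \mid \theta_G, \theta_s)
  &= \frac{1}{1 + \exp\!\left[-\left(a_{jG}\,\theta_G + a_{js}\,\theta_s - \tau_{jk}\right)\right]}, \\
P(X_j = k \mid \theta_G, \theta_s)
  &= P(X_j \ge k \mid \theta_G, \theta_s) - P(X_j \ge k+1 \mid \theta_G, \theta_s).
\end{align}
```

In this form the general and group factors are usually taken to be uncorrelated, so that $a_{jG}$ reflects how strongly item $j$ measures the trait common to the whole scale.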
IRT revealed each scale had items that did not achieve moderate to high discrimination on the general factor (see Table 2). The scales achieved their greatest precision within ± one standard deviation from the mean level of the trait. This is to be expected given the instrument was designed to measure normative trait levels. Specifically, the TICs peaked around 0.4 for Neuroticism, peaked once around −0.8 and again around 0.8 for Extraversion, around 0.0 for Openness, around 0.4 for Agreeableness and peaked twice for Conscientiousness, once around −0.8 and once around 1.0 (see Fig. 1).
The threshold data revealed that to endorse the response category of 'strongly disagree' an individual had to lie beyond three standard deviations from the mean for 51 (85%) of the items, with a further 7 (11.7%) items having no-one endorse this option. Furthermore, individuals had to score above three standard deviations from the mean for 26 (43.3%) items to reply 'strongly agree'.
The information function analysis was run with the less discriminatory items removed. Information curves are sensitive to scale length, therefore following the method of Samuel, Simms, Clark, Livesley, and Widiger (2010) the IICs were averaged to control for different scale lengths. These 'mean information curves' demonstrated that the scales provided more information when the poorly performing items were removed but without changing where along the latent trait continuum most information was provided (see Fig. 2).
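The difference between a total information curve and a length-adjusted mean information curve can be sketched as follows. The item curves here use a simple two-parameter logistic form as a stand-in for the graded response IICs, and the (a, b) values are hypothetical; this is not the procedure or data of the study.

```python
import numpy as np


def stand_in_iic(theta, a, b):
    """Illustrative item information curve (2PL form: a^2 * P * (1 - P))."""
    p = 1.0 / (1.0 + np.exp(-a * (theta - b)))
    return a ** 2 * p * (1.0 - p)


def total_information(iics):
    """TIC: sum of the item curves, so it grows with scale length."""
    return np.asarray(iics).sum(axis=0)


def mean_information(iics):
    """Mean information curve: average of the item curves."""
    return np.asarray(iics).mean(axis=0)


theta = np.linspace(-3, 3, 121)
item_params = [(1.8, 0.0), (1.5, 0.5), (0.5, -0.5)]   # hypothetical (a, b) pairs
iics = np.array([stand_in_iic(theta, a, b) for a, b in item_params])

# Keep only items at or above the 1.17 discrimination cut-off used in the text.
keep = np.array([a >= 1.17 for a, _ in item_params])
mean_full = mean_information(iics)
mean_reduced = mean_information(iics[keep])
```

Averaging rather than summing removes the automatic advantage of longer scales, so a higher peak in `mean_reduced` than in `mean_full` reflects more information per item rather than more items.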

External validity
To ascertain whether the non-discriminatory items could be removed from the NEO-FFI without meaningfully reducing external validity, the factors were individually correlated or regressed onto the external measures. Correlations and regressions before and after IRT were compared. Results are reported for the general factors (see Table 3).
The associations demonstrated that for the majority of the scales removing items was not detrimental to external validity. As hypothesised, more neurotic individuals had lower levels of well-being, whilst more extraverted and conscientious people had greater well-being. Additionally, more agreeable and extraverted participants rated their friendships as more satisfying. However, although Openness was somewhat related to academic achievement, Conscientiousness was not. Interestingly, Neuroticism and Conscientiousness were significantly related to friendships, whilst Openness was positively associated with well-being; neither finding had been hypothesised.
In general, the differences between the correlations before and after IRT were small and for all of the five scales the differences were not significant (see Table 4). However the results of the Openness scale validation were mixed. Before IRT, Openness was significantly correlated with some aspects of school performance whereas it was not afterwards; nevertheless the difference in magnitude of the associations was small.

Discussion
The analysis demonstrated that many items (n = 19) failed to discriminate to an acceptable level in this adolescent population, the majority (n = 16) being from the Extraversion, Agreeableness and Openness scales. The removed items did not appear to greatly contribute to the measurement of personality; correlating the external criteria with the traits demonstrated that removing non-discriminatory items did not, for the majority of the scales, affect external validity. One caveat was the Openness scale, whose performance differed before and after IRT. Additionally, the external correlations illustrated that scoring low on Neuroticism and higher on the other four traits may help adolescents achieve greater levels of competence across different domains of functioning. Such a personality profile may be of value in studies of adolescent development and contribute to understanding individual differences in treatment response for common mental illnesses in the adolescent years.
IRT identified a large minority of items that did not discriminate well. Studies of the NEO-FFI in adolescents have found many items do not load sufficiently on any factor or cross load onto unintended factors. This is particularly the case for Extraversion, Agreeableness and Openness, whilst Neuroticism and Conscientiousness tend to perform better (Parker & Stumpf, 1998; Sneed et al., 2002). The results of the current study suggest these findings are likely due to a lack of discriminatory power of many items, suggesting they are not measuring the underlying latent traits strongly.
Previous studies report few difficulties with item comprehension (De Fruyt, Mervielde, Hoekstra, & Rolland, 2000; McCrae et al., 2005), therefore it is unlikely the lack of discrimination reflects a limited understanding of the questions. Perhaps many of the trait indicators fail to discriminate appropriately on the latent traits because the items are not referencing ideas or behaviours that are relevant to the cultural milieu of adolescents (Sneed et al., 2002).
Additionally, the threshold data demonstrated that for the majority of items only people over three standard deviations away from the population mean responded to the categories 'strongly agree' and/or 'strongly disagree'. Compared to published norms this sample had lower Neuroticism and higher Agreeableness, which may somewhat explain these results. However, it did not differ on Openness, Extraversion or Conscientiousness, suggesting that for the current trait indicators these response categories have only limited utility for most adolescents in the general population.
Thalmayer et al. (2011) found brief personality questionnaires had similar levels of predictive ability and argued that scales comprised of a few high-validity items may obtain equal predictive validity to those of their longer counterparts. The results from the present study support these assertions, as the more discriminating items allowed a reduction in scale length that was just as externally valid.
Nonetheless, the Agreeableness and Openness items discriminated poorly, with IRT affecting the Openness scale's performance. Thus these shortened scales must be used with caution. As half of the indicators were not strongly measuring the latent traits, questions arise as to what constructs these scales may be evaluating. Indeed, convergence between the NEO-FFI Agreeableness scale and social desirability measures has been reported in adults (Stöber, 2001) and there is evidence suggesting Openness measures a trait related to intellectual ability (Ferguson & Patterson, 1998), indicating there may be some confusion of measurement.

Limitations
A limitation of the present study is whether the sample is representative of British adolescents. The return rate of 43.8% means the majority of adolescents from the ROOTS cohort did not participate. A comparison to norms published by Costa and McCrae (1992) shows this sample to be more agreeable and less neurotic, suggesting they are more emotionally stable, altruistic and willing to help others. More research would help to elucidate whether these norms are appropriate for British adolescents or if this is a reflection of idiosyncratic properties of this sub-sample. Further replication would also clarify the generalisability of the IRT analysis and discern the reliability of the a and b parameters in UK adolescents.
The measures used for the external validation of the NEO-FFI were collected earlier than the personality information, rather than concurrently. The well-being scale covers a 2 week period and personality is apt to change somewhat over adolescence (McCrae et al., 2002); however, the friendship scale considers a 12 month period and the GCSE results would not change. Nonetheless, this time difference could feasibly influence the results of the external validity analysis. Even so, the personality traits correlated with the measures as hypothesised, therefore it is unlikely this time difference unduly affected the results.

Conclusions
Personality is consistently used as an important explanatory factor in a large number of studies. The present study provided an item-level analysis allowing for a thorough examination of the assumed personality factors, highlighting scale strengths but also weaknesses. This was particularly the case for the Openness scale, which performed poorly and was influenced to the greatest degree by item removal. The results suggest that for adolescents many items considered as measuring components of personality are not discriminating along the latent traits to a high degree. These cannot therefore be used as reliable indicators, hindering internal validity. The results suggest future directions for testing and refinement, especially with the Agreeableness and Openness scales, which may need more development and testing before they can be used reliably in adolescent populations. Overall, the present study suggests the use of briefer, more efficient personality measures with highly discriminating items may be more internally valid and achieve equal external validity.