Exploring the persome: The power of the item in understanding personality structure

We discuss methods of data collection and analysis that emphasize the power of individual personality items for predicting real world criteria (e.g., smoking, exercise, self rated health). These methods are borrowed by analogy from radio astronomy and human genomics. Synthetic Aperture Personality Assessment (SAPA) applies a matrix sampling procedure that synthesizes very large covariance matrices through the application of massively missing at random data collection. These large covariance matrices can be applied, in turn, in Persome Wide Association Studies (PWAS) to form personality prediction scores for particular criteria. We use two open source data sets (N=4,000 and 126,884 with 135 and 696 items respectively) for demonstrations of both of these procedures. We compare these procedures to the more traditional use of “Big 5” or a larger set of narrower factors (the “little 27”). We argue that there is more information at the item level than is used when aggregating items to form factorially derived scales.

The founding of the International Society for the Study of Individual Differences and its journal, Personality and Individual Differences, were inspired by the eclectic interests of Hans Eysenck. Eysenck acted on the belief that the field of individual differences could benefit from many different approaches with sometimes contradictory results. As the editor of PAID he was happy to publish theories and results that were controversial, some in strong criticism of his own work, some that others would strongly criticize. His basic principle seemed to be that truth will out.
In this tradition, we introduce a procedure that is in direct contradiction to the tendency of most researchers to ignore the information that is available at the level of the single personality item and to form higher level composites that are thought to reflect broad latent variables. We reverse this approach and emphasize the importance of item information. We are not alone in this endeavor, for a few others have made similar suggestions (McCrae, 2015, Mõttus, Kandler, Bleidorn, Riemann and McCrae, 2017, Mõttus, Sinick, Terracciano, Hřebíčková, Kandler and Jang, 2019). Our contribution to this debate is to offer some new data collection and analytical procedures borrowed from other fields (specifically radio astronomy and genomics). We will also emphasize that our approach is based upon the open science framework that can be associated with the founding of the Royal Society, which has as its motto "Nullius in verba", as a way of sharing ideas and findings in an open manner while relying on facts determined by experiment.
For those trained psychometrically, it is well known that forming item composites leads to increases in reliability (Revelle and Condon, 2019). The use of factor analysis is supposed to identify the latent variable accounting for the shared and reliable variance of the items. The remaining variance in the item is assumed to reflect noise and, in fact, once the trait is removed, the concept of local independence suggests that there is no meaningful variance left. This approach of discovering latent variables has a long and fruitful tradition going back to Spearman (1904) and Thurstone (1933) and was used by Eysenck in his first explorations of the dimensions of personality (Eysenck, 1944, Eysenck and Himmelweit, 1947) as well as by his contemporary Raymond Cattell (Cattell, 1943, 1945, 1946). However, we believe that by forming higher level constructs and ignoring the meaningful signals available at the item level, our field has been led astray. Although early work on scale construction (e.g., Strong, 1927, Hathaway and McKinley, 1943, Gough and Bradley, 1996) emphasized the selection of items that predicted specific criteria, much of the past 50-80 years of personality scale construction has been concerned with the identification of latent variables thought to measure the common variance of items. The use of factor analysis to identify these latent variables, and the subsequent use of Structural Equation Models (SEM), which embed homogeneous scales in structural models, has dominated the theoretical approaches to personality measurement.
In spite of the tendency to emphasize latent variables, there were some strong advocates of external validity criteria for scale construction (Jackson, 1970, 1971) who continued to argue for an empirical and theory based approach to item writing. Indeed, an influential paper suggested theoretical, external/empirical, and latent trait approaches did not differ in their average validities (Hase and Goldberg, 1967). However, a follow-up to that article compared empirically constructed to latent variable methods in scale construction (Goldberg, 1972) and found that for hard to predict criteria, empirical methods were probably slightly superior and for easy to predict criteria, factorially homogeneous techniques were superior. Although factor analysis is clearly an empirical procedure, Goldberg made the distinction between empirical methods based upon the external validity of items and those based upon the internal structure of scales using factor analytic approaches. He labeled these two techniques as empirical versus factorial. This article can be seen as a follow-up to that work.
It has been known since 1910 (Brown, 1910, Spearman, 1910) that combining items into scales leads to an increase in reliability. Given that most data are befuddled with error (McNemar, 1946) and psychological data are even more befuddled than most, forming item composites will partially compensate for this befuddlement. The concept of correcting for reliability of the measures when examining associations between measures (Spearman, 1904) requires stable estimates of reliability. With the sample sizes typically found in many studies for the past century, such stability could only be achieved at high levels of reliability. But, another way to achieve stability of estimates is to increase the sample size. When this is done, the benefits of item level analysis become more apparent.
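The gain in reliability from aggregation can be sketched numerically with the Spearman-Brown prophecy formula. The snippet below is an illustration in Python (the analyses in this paper themselves use R and the psych package); the starting correlation of .20 is an arbitrary example value, not a figure from our data.

```python
# Illustrative sketch (not the authors' code): the Spearman-Brown prophecy
# formula shows why aggregating items raises reliability.
def spearman_brown(r_single, k):
    """Predicted reliability of a composite of k parallel items,
    given the reliability (or average inter-item correlation) r_single."""
    return k * r_single / (1 + (k - 1) * r_single)

# A single item correlating .20 with its parallels:
print(round(spearman_brown(0.20, 1), 3))   # 0.2
print(round(spearman_brown(0.20, 10), 3))  # 0.714 for a 10-item composite
print(round(spearman_brown(0.20, 20), 3))  # 0.833 for a 20-item composite
```

Even weakly intercorrelated items yield a respectably reliable composite once enough of them are summed, which is precisely why scale-level aggregation became the default.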
But how can we increase the sample size and at the same time have a large item pool to take advantage of the power of the item? The solution was originally discussed by Lord (1955) in terms of sampling items as well as people; an analogous procedure was developed in radio astronomy.

Synthetic Aperture Personality Assessment
The development of the telescope by Galileo Galilei revolutionized the way humans see the world. Galileo's original telescope was just eight power; after successive improvements, with his 30 power telescope he was able to detect what we now call the Galilean moons of Jupiter but which he politically called the Medicean moons after his sponsor (giving credit to sponsoring organizations such as the NSF, the MRC, or the NIH is not a new idea). Telescopes then and now are limited by the amount of light that they capture. This is proportional to the cross section of the telescope, and the objective of Galileo's first telescope was just 37 mm in diameter. His later telescopes were slightly larger, but were limited in their resolution by being refractor telescopes. Subsequently, Isaac Newton developed the reflector telescope, which had the advantage that it could have a greater aperture and did not suffer from chromatic aberration. Optical telescopes record visible light in a small part of the electromagnetic spectrum. Radio telescopes record a different part of the spectrum but, more importantly for our purposes, have taken a different approach to increasing the aperture. Rather than a single telescope of great diameter (think of the Arecibo radio telescope in Puerto Rico with a diameter of 305 meters), it is possible to synthetically integrate the signals from multiple telescopes spread out over several kilometers at one site or even spread around the world, in which case the resolving power is effectively limited by the diameter of the earth.
Why have we taken this diversion to discuss radio astronomy? Because of the analogy of the synthetic aperture radio telescope to Synthetic Aperture Personality Assessment (SAPA). Just as the resolving power of a single telescope is limited by its diameter, so is the resolving power of a single personality questionnaire limited by the number of items in the questionnaire and the number of people taking the items. But if we could give many different forms of the questionnaire, with different items to different people, and then combine these signals, we have the resolving power of a much larger questionnaire. This is the basic idea behind SAPA.
SAPA is not a particularly new idea, for the combining of scores from different sets of items has been advocated for quite a while (Lord, 1955). In order to increase the number of items given in national and international surveys such as the National Assessment of Educational Progress (NAEP) and the Program for International Student Assessment (PISA), balanced incomplete block designs are common, in which students are systematically given different forms (Anderson, Lin, Treagust, Ross and Yore, 2007). Extensions of balanced incomplete blocks are known as matrix sampling with planned missingness designs (Graham, Taylor, Olchowski and Cumsille, 2006, Little, Gorrall, Panko and Curtis, 2017, Rhemtulla, Savalei and Little, 2016), which share the concept of a design with blocks of items missing at random (MAR); if the blocks are missing completely at random, this is known as MCAR. The number of such blocks is typically not very large in order to allow for full information maximum likelihood (FIML) estimates of the data. However, this procedure to increase the number of items given in a questionnaire is not particularly common in personality research, and the usual tendency is to give the same short questionnaire to all participants (Bleidorn, Schönbrodt, Gebauer, Rentfrow, Potter and Gosling, 2016, Rentfrow, Gosling, Jokela, Kosinski and Potter, 2013, Soto and John, 2017). We have reported before (Revelle, Wilt and Rosenthal, 2010) using MCAR techniques to estimate temperament and ability scores from hundreds of items and then expanded the method to what we call Massively Missing Completely at Random (MMCAR) to estimate covariance structures of thousands of items with a level of missingness of up to 99% (Revelle, Condon, Wilt, French, Brown and Elleman, 2016).
The procedure is very simple and takes advantage of the web as well as people's interest in knowing themselves. At the web site (https://sapa-project.org) we give 50-250 items to each participant, with the items randomly sampled from about 6,000 items. (We started the original SAPA study by giving 50 items sampled from a 120-140 item pool, but have gradually increased the number from which we sample to the current level of about 6,000.) Every subject gets feedback on their personality using a "Big 5" and a "little 27" (Condon, 2017) framework. This feedback seems to be compelling enough for users to tell others about the web site and, with the power of various search engines, attracts 50,000-100,000 people per year. Some other sites attract far more visitors, but give the same set of items to all participants.
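The core claim of MMCAR sampling, that a full item covariance matrix can be synthesized from the pairwise overlaps of massively incomplete data, can be sketched with synthetic data. The Python fragment below is an illustration only (the SAPA pipeline itself is implemented in R); all sizes and the 70% missingness rate are arbitrary demonstration values.

```python
import numpy as np

# Hedged sketch of the MMCAR idea: each simulated "participant" answers
# only a random subset of items, yet the item-by-item correlation matrix
# can be synthesized from the pairwise overlapping responses.
rng = np.random.default_rng(42)

n_people, n_items = 5000, 20
latent = rng.normal(size=(n_people, 1))
data = 0.5 * latent + rng.normal(size=(n_people, n_items))  # one common factor

# Massively missing completely at random: keep only ~30% of responses.
mask = rng.random(data.shape) < 0.30
observed = np.where(mask, data, np.nan)

def pairwise_corr(x):
    """Correlation matrix from pairwise-complete observations,
    the equivalent of R's cor(x, use = 'pairwise')."""
    p = x.shape[1]
    r = np.eye(p)
    for i in range(p):
        for j in range(i + 1, p):
            keep = ~np.isnan(x[:, i]) & ~np.isnan(x[:, j])
            if keep.sum() > 2:
                r[i, j] = r[j, i] = np.corrcoef(x[keep, i], x[keep, j])[0, 1]
    return r

r_sparse = pairwise_corr(observed)
r_full = np.corrcoef(data, rowvar=False)
# With enough people, the sparse estimate tracks the complete-data matrix.
print(round(np.abs(r_sparse - r_full).max(), 2))
```

The price of the missingness is a larger standard error per correlation (each pair is estimated from only the people who saw both items), which is why the technique demands the large samples the web provides.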
But why care about the number of items if we already know there is a Giant 3 (Eysenck, 1994) or Big 5 (Digman, 1990, Goldberg, 1990)? Isn't it enough to just give a short Big 5 inventory (e.g., Gosling, Rentfrow and Swann, 2003, Konstabel, Lönnqvist, Leikas, Velázquez, Verkasalo et al., 2017, Rammstedt and John, 2007) to see how personality correlates with criteria? No. It is better to consider large sets of items. To show the power of items we will report four studies using 5-27 scales and 135-696 items from two open source data sets (N = 4,000 and 126,884) that we have made public. In the spirit of open science, we include the R (R Core Team, 2019) code that we use for the analyses, all of which are done using the open source psych package (Revelle, 2020a) for R.
We report our results in four different studies. The first is a direct comparison of using conventional regression techniques to predict 10 different criteria for 4,000 participants from the spi dataset. These data are included in the psychTools package which accompanies the psych package (Revelle, 2020a) for the open source language and environment for statistical computing, R (R Core Team, 2019). The spi data set includes 135 items chosen from the 696 items shown by Condon (2014) to be common to more than 200 public domain scales. The second study applies a profile analysis to these same items and criteria. The third and fourth studies generalize the first two studies to far more items (696) and far more subjects (126,884). Given our belief in the importance of replicable work and in the spirit of open science, the analyses in all four of these studies use open source computer code (included in the appendix) as well as data that have been previously released for open analysis. Although we include 10 criteria in the first two studies and 19 in the second two, for all four studies we emphasize those that have direct implications for health and are particularly relevant to the readers of this journal, e.g., the personality correlates of smoking.
Because of the large to very large sample sizes involved, we do not report the conventional "statistical significance" nor even the confidence intervals of our effects. We do report the largest standard error for the correlations in each study. We prefer to use cross validation of our effects to show their stability. Suffice it to say that any effect we report differs from a null effect. Following the advice of Funder and Ozer (2019) and others, we report effect sizes in terms of correlations rather than squared correlations.

Study 1: Regressions using spi data set
The first study uses data from the spi data set which is included in the psychTools package and includes 135 items (the SAPA Personality Inventory, aka the SPI-135, Condon, 2017) as well as ten criteria. The items were first scored to form five ("Big 5") scales, then 27 ("little 27") scales. These 5, 27, and then all 135 items were then entered into regression models and cross validated. The 135 spi items were chosen from the 696 items shown by Condon (2014) to be common to more than 200 public domain scales. The 27 scales were derived by factoring the entire 696 item pool and are variously described as "facets", "factors" and "scales". The diminutive "little 27" is to distinguish them from claims about "The Big 5", the "Giant 3" or even the "General 1".

Data: Variables and Subjects
The data for the first study come from 10 different criteria for 4,000 participants from the spi dataset which is included in the psychTools package (Revelle, 2020b). The psychTools package accompanies the psych package (Revelle, 2020a) for the open source language and environment for statistical computing, R (R Core Team, 2019). The data were collected as part of the larger SAPA project.
Using SAPA data contributed by about 126,000 visitors to our website (https://SAPA-project.org), David Condon (2014) developed a heterarchical framework for assessing personality at three levels: The highest level has the familiar five factors that have been studied extensively in personality research since the 1980s: Conscientiousness, Agreeableness, Neuroticism, Openness, and Extraversion. The middle level has 27 factors that are considerably more narrow. These were derived based on administrations of 696 public-domain IPIP (Goldberg, 1999) items to about 126,000 participants. The lowest level consists of the 135 items thought to best reflect the five and 27 factors. Condon describes these scales as being "empirically-derived" because relatively little theory was used to select the number of factors in the heterarchy and the items in the scale for each factor (to be clear, he means relatively little personality theory, though he relied on quite a lot of sampling and statistical theory). The procedures for developing these scales are discussed in the manual for the SAPA Personality Inventory (Condon, 2017), which includes the R code and instructions for downloading the original data from Dataverse. The 10 criteria are: Age (in years, from 11-90); Sex (self reported biological sex, coded by the number of X chromosomes as 1 or 2); Health (self rated health on a 1-5 scale from poor to excellent); P1Edu and P2Edu (the reported level of education of the participant's parents); Education (the respondent's education: less than 12 years, High School graduate, currently in university, some university, associates degree, college degree, in graduate or professional school, or with a graduate or professional degree); Wellness (rated as 1-2); Exercise (frequency of exercise from very rarely to more than 5 times/week);
Smoking (never, not last year, less than once a month, less than once a week, 1-3 days per week, most days, up to 5 times a day, up to 20 times a day, and more than 20 times a day; coded as 1 to 9); and Emergency Room visits (none, 1, 2, 3, or more than 3 times; coded as 1 to 4). The basic descriptive statistics are shown in Table 1.

Method
The 135 items in the spi data may be scored for 5 broad personality traits (the "Big 5") using 70 of the items, as well as for 27 narrower factors with five items per scale (the "little 27"). The ω h and ω total reliabilities, as well as α and the scale intercorrelations for the Big 5, are shown in Table 2. (See Revelle and Condon, 2019, for a discussion of these and other coefficients.) Because ω statistics are not meaningful for the short five-item scales of the little 27, we just summarize the α values, which ranged from .67 (Easy goingness) to .90 (Well being) with a median and mean of .82.
Using the Big 5 and little 27 scales as well as all 135 items, multiple regressions were conducted to predict each criterion. To eliminate the effect of capitalizing on chance, which is always a problem in multiple regression and particularly problematic as the number of predictors increases, we randomly split the sample into two equal parts, developed the regression models on the first half, and then cross validated on the second half. Such cross validation is important, for although most multiple regression procedures also report the shrunken values of the regression, these seem to overestimate the observed cross validated values. We compare cross validated regressions using the Big 5, the little 27, and all 135 items (Figure 1).
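The split-half cross-validation logic can be sketched as follows. This Python fragment uses synthetic data (not the spi data set, whose analyses are done in R); the sample size, number of predictors, and effect sizes are arbitrary illustration values.

```python
import numpy as np

# Minimal sketch of split-half cross validation: fit OLS on one half of
# the sample, then correlate predicted with observed criterion scores in
# the other half.
rng = np.random.default_rng(1)
n, p = 4000, 27
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:5] = 0.2                      # only 5 of 27 predictors carry signal
y = X @ beta + rng.normal(size=n)

half = n // 2
X1, y1, X2, y2 = X[:half], y[:half], X[half:], y[half:]

# Derivation sample: ordinary least squares with an intercept.
A = np.column_stack([np.ones(half), X1])
w = np.linalg.lstsq(A, y1, rcond=None)[0]

# Cross-validation sample: correlation of predicted with observed values.
pred = np.column_stack([np.ones(n - half), X2]) @ w
cv_r = np.corrcoef(pred, y2)[0, 1]
print(round(cv_r, 2))
```

The cross-validated correlation, not the in-sample multiple R, is the quantity plotted in Figure 1, since the in-sample value capitalizes on chance as the number of predictors grows.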
In addition to standard regression, we also applied a newly developed algorithm, bestScales (Elleman, McDougald, Revelle and Condon, 2020), which is included in the psych package. bestScales is an empirical scale construction procedure which identifies those items most correlated with a specific criterion. This process is repeated k times (so called k-fold cross validation) and the final scale is chosen from those items that are identified in all the folds. Although not quite as powerful as standard machine learning algorithms such as Lasso regression (Tibshirani, 2011) or the elastic net (Zou and Hastie, 2005), it is specifically designed to work with high levels of missing data (e.g., SAPA-like data). One of the advantages of bestScales is that it allows identification of the best unit weighted items predicting any particular criterion. We include these results as an additional line in Figure 1.
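The flavor of this procedure can be sketched as follows. The snippet is a simplified Python stand-in for the psych package's bestScales function, run on synthetic complete data (the real function also handles SAPA-style missingness); all sizes are illustrative.

```python
import numpy as np

# Hedged sketch of a bestScales-like procedure: within each fold, pick
# the items most correlated with the criterion, then unit weight (sum
# with sign) the items that survive every fold.
rng = np.random.default_rng(7)
n, n_items, n_keep, k_folds = 3000, 100, 10, 5
items = rng.normal(size=(n, n_items))
crit = items[:, :8] @ np.full(8, 0.3) + rng.normal(size=n)  # 8 signal items

folds = np.array_split(rng.permutation(n), k_folds)
chosen = []
for f in range(k_folds):
    train = np.concatenate([folds[g] for g in range(k_folds) if g != f])
    r = np.array([np.corrcoef(items[train, j], crit[train])[0, 1]
                  for j in range(n_items)])
    chosen.append(set(np.argsort(-np.abs(r))[:n_keep]))

# Final scale: items identified in all folds, unit weighted by the sign
# of their zero-order correlation with the criterion.
final = sorted(set.intersection(*chosen))
signs = np.sign([np.corrcoef(items[:, j], crit)[0, 1] for j in final])
score = items[:, final] @ signs
print(len(final), round(np.corrcoef(score, crit)[0, 1], 2))
```

Because selection is by zero-order correlation and weighting is by unit weights, the resulting scales are far more stable under sparse data than regression weights, at some cost in in-sample fit.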
It is tempting to try to interpret the items that have significant regression weights in these models. When doing so, we must remember that regression weights reflect the independent contributions of each variable, with the effect of the other variables partialled out. We prefer to show the items chosen by the bestScales algorithm as these are chosen for their zero order correlations, rather than their regression weights (Table 3).

GWAS like technique
To continue our science by analogy approach, we consider the similarity of single items in personality research to the Single Nucleotide Polymorphisms (SNPs) of genomic research. One of the exciting advances in modern genetics is the move from emphasizing single genes to studying the entire genome. Genome Wide Association Studies (GWAS) use large samples to detect the genetic effect of very small signals of individual SNPs on complex phenotypes. Because the individual signal is so small (with correlations of a SNP with a phenotype of < .01), very large samples are necessary to have enough power to detect the reliable effects above the background noise. Because thousands to hundreds of thousands of SNPs are being examined simultaneously, corrections for multiple comparisons need to be made, and the associated probability values typically used are 10^-4 or even 10^-5, expressed in log units.
A graphic showing GWAS effects plots the effect size (and the -log of the probability) of SNPs within and between the chromosomes. Because their shape resembles that of a skyline, these are known as "Manhattan Plots". We can do the same for the correlations of the individual items with various behavioral phenotypes grouped by personality scale (Figure 2). For the sake of simplicity, we show the correlations of the 135 SPI items with three criteria (self rated health, exercise, and smoking) for the 4,000 participants in the spi dataset. The top row of the plot shows the raw correlations, the bottom row the -log p of the correlations. The probability values are corrected for multiple comparisons using the Holm correction (Holm, 1979), which is a more powerful test than the more typical Bonferroni.
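The computation behind such a plot can be sketched as follows. This Python fragment uses synthetic item data (the figure itself is produced in R); the p values use a normal approximation to the t distribution, which is adequate at n = 4,000, and the Holm routine is a simple step-down implementation written for illustration.

```python
import math
import numpy as np

# Sketch of the Manhattan-plot computation: correlate every item with a
# criterion, convert r to an approximate two-sided p value, apply the
# Holm step-down correction, and report -log10(p).
rng = np.random.default_rng(3)
n, n_items = 4000, 135
items = rng.normal(size=(n, n_items))
crit = 0.15 * items[:, 0] + 0.10 * items[:, 1] + rng.normal(size=n)

r = np.array([np.corrcoef(items[:, j], crit)[0, 1] for j in range(n_items)])
t = r * math.sqrt(n - 2) / np.sqrt(1 - r ** 2)
p = np.array([math.erfc(abs(tj) / math.sqrt(2)) for tj in t])  # normal approx.

def holm(pvals):
    """Holm step-down adjusted p values; uniformly more powerful than a
    plain Bonferroni correction at the same familywise error rate."""
    m = len(pvals)
    order = np.argsort(pvals)
    adj = np.empty(m)
    running = 0.0
    for rank, idx in enumerate(order):
        running = max(running, (m - rank) * pvals[idx])
        adj[idx] = min(1.0, running)
    return adj

p_holm = holm(p)
neglog = -np.log10(np.clip(p_holm, 1e-300, None))
print(int((p_holm < .05).sum()), round(neglog.max(), 1))
```

Plotting `neglog` item by item, grouped by the scale each item belongs to, gives the skyline of the bottom row of Figure 2.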
What is obvious from this figure is that different items and different scales have different patterns of association with the different criteria. It is also apparent that the best items from some of the little 27 scales are better than the items that form the Big 5 scales.

Cross validation of multiple regression on spi data
Figure 1: Cross validated correlations predicting 10 different criteria (parental education, emergency room visits, wellness, smoking, education, exercise, age, sex, and health). Four methods of prediction are used. Big 5 refers to regression based models using the five Big 5 scales. Similarly, the little 27 and the 135 items represent cross validated multiple regression models. The "best scales" were found by adding up unit weighted scores based upon those items identified using the bestScales function.

Benefits of the item level approach
One of the interesting techniques of GWAS is the use of genetic correlations (Nagel, Watanabe, Stringer, Posthuma and Van Der Sluis, 2018). These are the correlations found by examining the similarity of profiles across the genome for two phenotypic items. Although the phenotypic correlations between neuroticism items ranged from .17 to .54, when the correlations are taken between the profiles they ranged from .28 to .91 (Nagel et al., 2018). These correlations may be used to identify clusters of items showing similar genetic effects.
We apply this technique to find persome correlations, which are just the correlations of the profiles of correlations across items. Doing this leads to two sets of correlations: phenotypic correlations (the normal correlation of two criteria) and persome correlations (the correlation of the profiles across items). We show this in Figure 3: the lower off diagonal shows the phenotypic correlations, the upper off diagonal the persome correlations. The difference is clear: the persome correlations are much larger and show more clustering in their structure. For example, although the phenotypic correlation of smoking with emergency room visits is just .08, the correlation of what predicts smoking with what predicts emergency visits is .49. Similarly, while the correlation between exercise and self reported wellness is .15, the profile correlation of the two patterns is .67.
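The distinction between the two kinds of correlation can be sketched with synthetic data. In this illustrative Python fragment (not the spi analysis), two criteria share the same 20 predictive items but carry a great deal of unique noise, so their phenotypic correlation is modest while their persome correlation is large.

```python
import numpy as np

# Sketch of the persome-correlation idea: two criteria can correlate
# weakly person by person yet share a very similar profile of item
# correlations across the whole item pool.
rng = np.random.default_rng(11)
n, n_items = 4000, 135
items = rng.normal(size=(n, n_items))

# Both criteria load on the same 20 items, plus lots of unique noise.
signal = items[:, :20].sum(axis=1)
crit_a = 0.1 * signal + rng.normal(size=n)
crit_b = 0.1 * signal + rng.normal(size=n)

def item_profile(crit):
    """Profile of zero-order correlations of every item with a criterion."""
    return np.array([np.corrcoef(items[:, j], crit)[0, 1]
                     for j in range(n_items)])

phenotypic = np.corrcoef(crit_a, crit_b)[0, 1]        # person-level
persome = np.corrcoef(item_profile(crit_a),
                      item_profile(crit_b))[0, 1]     # profile-level
print(round(phenotypic, 2), round(persome, 2))
```

As in Figure 3, the profile-level correlation far exceeds the person-level one whenever two criteria draw on overlapping sets of items.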
It is important to note that although based upon 4,000 participants, the profiles are taken across the 135 items, and thus the standard error for the profile correlations is σ_r = (1 - r²)/√133 < .087. The standard errors of the phenotypic correlations are based upon the observed correlations and the sample size and are < .016.

Study 3: Regressions with more items and more subjects
Using a second set of open source data, we analyze results for 126,884 subjects on 696 items as well as 19 criteria. The data were collected between April 2014 and February 2017 as part of the SAPA project and may be downloaded from Dataverse (Condon and Revelle, 2015, Condon, Roney and Revelle, 2017a,b). As is our normal procedure, these data were collected using the MMCAR technique. Thus, for the 241,860 correlations (696*695/2), the median number of observations per pair was 2,704, with a range from 1,596 to 7,452. Although based upon far more subjects than the spi data reported in Studies 1 and 2, the SAPA data set has fewer observations per pairwise correlation (2,704) than the complete data used in the spi (4,000).

Criteria used in the SAPA data set
The criteria include those used in the spi data set as well as relationship and marital status, height, weight, Body Mass Index (BMI), job status, occupational prestige (for those with jobs), estimated income (based upon occupation), and the occupational prestige and estimated income of both parents. Of these 19 criteria, seven overlap with those of the spi data set.

Regressions and bestScales for 5, 27, 135 and 696 items
Multiple regressions for each criterion were done using half the sample for the Big 5 scores, the little 27 scores, and the 135 spi items. In addition, the bestScales solutions for the 19 criteria were also found. All four of these solutions were then cross validated using the second half of the sample. We organize the results in terms of the Big 5 regressions (Figure 4).
Comparing the cross validated solutions for the complete data from the spi reported in Study 1 with the much sparser SAPA data analyzed here reveals several interesting findings (Figures 1, 4). With the complete data of the spi data set, regressions using all 135 items had the largest cross validated values for 8 of the ten criteria, and were functionally tied with the little 27 regressions for the other 2 (Figure 1). The bestScales approach did not do as well as the little 27 regressions, but did do better than the Big 5 regressions. However, with a larger item pool and more missingness, the consistently best cross validated predictors were found by using the bestScales approach, and the regressions for the 135 spi items did not do noticeably better (and sometimes did noticeably worse) than the Big 5 regressions (Figure 4).
A strength of the empirical approach to scale construction (e.g., bestScales) is that it can identify the best items to predict particular criteria. A weakness is that it is completely mindless empiricism. For instance, the best scale to predict height has a cross validated value of .35. But the items that are most related to height are "panic easily", "get overwhelmed by emotions" and "Am a worrier". These nonsensical items are also the most correlated with gender. What the empirical scale construction technique is identifying is that women are shorter than men (with a correlation of -.65). But this is not just a fault of choosing the best items, for the regression weights for the Big 5 or the little 27 that best predict gender are almost exactly the opposite of those that predict height. The largest β weight for gender is Neuroticism (.18), as it is for height (-.14).

Study 4: Profile correlations using 696 items
Just as we could compare the phenotypic and persome profile correlations for the 10 criteria and 4,000 subjects of the spi data set, so we can compare the phenotypic and persome profile correlations for 19 criteria and 126,884 subjects. The profiles are based upon the correlations across 696 items, and thus the standard error for the profile correlations is σ_r = (1 - r²)/√694 < .037. The standard error of the phenotypic correlation is based upon the sample size and the correlation and is < .003.
Comparing Figures 3 and 5, we see that the pattern of larger correlations and clearer cluster structure that we saw in Study 2 (Figure 3) is repeated with the larger sample and the increased number of criteria (Figure 5). The clearest cluster reflects parental education and occupation, and a second clear cluster is the participant's education and job status. (Education is highly correlated in this analysis with age, for younger participants have not had the opportunity to achieve higher levels of education. Although we control for age when predicting level of education from personality and ability measures in other studies, for the purposes of this demonstration, we have not done so.)

Study 5: Validating profiles across samples
One of the powerful applications of GWAS is the development of genetic propensity scores. These can be derived from large samples and then applied to much smaller samples. We do this here using the 135 items from our larger sample and then applying these profiles to the seven identical criteria in the spi dataset. Of the 19 criteria in the SAPA data set, seven appear in the spi data set. We identify the correlations in the SAPA data set of the overlapping criteria with the 135 items common to both data sets. These correlations may be used, in turn, to predict the criteria in the spi data. When we do this we find that the profile scores (basically weighting individual items by their zero-order correlations) are just slightly better than using the bestScales approach (which unit weights the best 20 items) and superior to regressions using the Big 5 factors. Profile scores are slightly worse than the regressions using the little 27 (Figure 6). Just as we saw for the full SAPA data, regressions using all 135 items were terrible, reflecting the instability of beta weights when using too many predictors.
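The cross-sample scoring step can be sketched as follows. This Python fragment uses synthetic stand-ins for the SAPA (derivation) and spi (target) samples; the item weights are simply the zero-order correlations estimated in the large sample, as described above, and all sizes and effect values are illustrative.

```python
import numpy as np

# Hedged sketch of a "personality propensity score": item-criterion
# correlations estimated in a large derivation sample become the item
# weights applied to item responses in a new, smaller sample.
rng = np.random.default_rng(5)
n_items = 135
true_w = np.zeros(n_items)
true_w[:15] = 0.1                   # 15 items truly predict the criterion

def make_sample(n):
    items = rng.normal(size=(n, n_items))
    crit = items @ true_w + rng.normal(size=n)
    return items, crit

big_items, big_crit = make_sample(100_000)    # derivation sample
small_items, small_crit = make_sample(4_000)  # target sample

# Profile of zero-order correlations estimated in the large sample...
profile = np.array([np.corrcoef(big_items[:, j], big_crit)[0, 1]
                    for j in range(n_items)])
# ...applied as item weights in the small sample.
propensity = small_items @ profile
validity = np.corrcoef(propensity, small_crit)[0, 1]
print(round(validity, 2))
```

Because the weights are zero-order correlations rather than regression weights, they remain stable even when the derivation data are sparse, which is what makes the transfer to a new sample work.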
The profile scores are the logical equivalent of genetic propensity scores and could be labeled as personality propensity scores. For the important criterion of smoking behavior, the multiple R from the Big 5 was .18, while the optimal linear combination (regression weights) of 27 predictors had a multiple R of .29 and the personality propensity score and the bestScales approach were both .27 (Table 4).

Table 4: Predicting the spi criteria from the SAPA data set. Big 5 regressions are uniformly less accurate than using the item information, either by profiles or best scales. Prediction models based upon the regression of all 135 items were, with one exception, inferior to all other models. Regression weights for the little 27 had slightly higher validities.


Discussion and future directions
In this paper we have argued that there is more information at the item level than is used when aggregating items to form factorially derived scales. This is an old argument (Goldberg, 1972) that needs to be reconsidered. Factorially based scales are useful when the number of available items is limited, or when sample sizes are smaller. But with the power associated with the large sample sizes now available, it is time to revisit the use of items. Taking advantage of techniques analogous to those of radio astronomy and genomics allows us to improve our prediction of real world criteria.
In this paper we addressed the use of items that are more conventional in personality measurement. But following the tradition of the broader field of individual differences as seen in the pages of Personality and Individual Differences or at the meetings of the International Society for the Study of Individual Differences we are in the process of collecting a broader set of items that includes interests as well as cognitive abilities. We believe that by routinely measuring temperament, abilities, and interests and using the profile techniques discussed in this paper, we will have a better understanding of individual differences in personality.