Phone-based metric as a predictor for basic personality traits

Basic personality traits are believed to be expressed in, and predictable from, smartphone data. We investigate the extent of this predictability using data (n = 636) from the Copenhagen Network Study, which to our knowledge is the most extensive study concerning smartphone usage and personality traits. Based on phone usage patterns, earlier studies have reported surprisingly high predictability of all Big Five personality traits. We predict personality trait tertiles (low, medium, high) from a set of behavioral variables extracted from the data, and find that only extraversion can be predicted significantly better (35.6%) than by a null model. Finally, we show that the higher predictabilities in the literature are likely due to overfitting on small datasets.

Based on data from the Copenhagen Network Study (CNS), Stopczynski et al. (2014), we use smartphone data to quantify the predictability of the Big Five personality traits Digman (1990): openness (O), conscientiousness (C), extraversion (E), agreeableness (A), and neuroticism (N), commonly called the five-factor model and abbreviated OCEAN. The CNS data is to the best of our knowledge the largest and most detailed study of its kind. Specifically, we use the Big Five Inventory John, Naumann, and Soto (2008), which consists of 44 items. For each item, participants in the CNS study have expressed, on a discrete scale from 1 to 5, how much they agree with a given statement. The personality traits are then computed from a pre-determined linear combination of the 44 answers. Previous research has suggested that smartphone data can be used to predict the Big Five with surprisingly high accuracy de Montjoye et al. (2013). In contrast, we show, using a broad range of features extracted from the CNS data, that only extraversion can be predicted with some certainty. In the Methods section below and in the appendices, we provide a description of the features (predictor variables) we extract from the smartphone data and further consider their cross-correlations. In the Results section, we use a support vector machine model for the prediction and quantify its relative improvement over a null model where personality scores are randomly assigned. Finally, we briefly compare the scoring system behind the Big Five Inventory against alternative dimensionality reduction techniques in terms of predictability.

Methods
We use questionnaire-based data on the personality traits together with phone-based data from 730 freshman students starting in the year 2013 at the Technical University of Denmark. The phone-based data has been collected over a period of 24 months by custom software installed on smartphones given to the participants of the study Stopczynski et al. (2014). The research reported in this study has not been preregistered in an independent, institutional registry. The data comprise telecommunication records (calls and text messages), online social networks (Facebook connections and interactions), and networks based on physical proximity. The physical proximity is measured through the Bluetooth signal strength, and can be used to monitor face-to-face contacts Sekara and Lehmann (2014). From the GPS data, we obtain information on the geo-spatial mobility Mollgaard, Lehmann, and Mathiesen (2016a). Out of the 730 participants, we only include data from individuals who, we believe, have used the phone as their primary device. This implies discarding data from users who have written fewer than 10 text messages, made fewer than 5 phone calls, or logged fewer than 100 GPS data points, as well as users with no Facebook friends. These criteria were chosen as a simple heuristic for removing participants who very quickly stopped using the phone, as the subjects remaining after this removal had vastly larger amounts of data. These requirements reduce the number of participants in our study to 636.
For comparison purposes, we consider a list of features similar to those in de Montjoye et al. (2013). Furthermore, we repeat our analysis on the part of the Friends and Family (FF) dataset, Aharony, Pan, Ip, Khayal, and Pentland (2011), which is publicly available. The FF dataset consists of data from 52 participants, 38 of whom have sufficient call and location data for our analysis according to the selection criteria described above. We finally compare our analysis on both datasets with the results in de Montjoye et al. (2013). Table 1 presents a list of all the features we consider. The feature extraction process is described in detail in the following.
Feature Extraction. The first category of features that we extract consists of basic statistics of calls and texting. For each user, we compute the median and standard deviation of the inter-event time between phone calls, text messages, and combinations thereof. For each of the three interaction forms, we also compute the entropy S_u, defined by

S_u = −Σ_c (n_c / n_t) log(n_c / n_t),   (1)

where the index c runs over each unique phone number that the user has contacted, n_c denotes the number of interactions with contact c, and n_t = Σ_c n_c is the total number of interactions. The entropy is a general measure of the spread of the interactions. Users with low entropy tend to mainly contact a few individuals while largely ignoring the rest, whereas users with high entropy tend to contact people more equally. We further determine the percentage of a user's calls that were outgoing, the total number of contacts, the ratio of contacts to the number of interactions, the ratio of calls and texts that a user has responded to within an hour of receiving them, and finally the fraction of calls made during the night.
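As a concrete illustration, the entropy of Eq. (1) can be computed directly from a user's list of contacted numbers. This is a minimal sketch, not the study's code; the helper name `interaction_entropy` is ours.

```python
import math
from collections import Counter

def interaction_entropy(contacted_numbers):
    """Shannon entropy of a user's interactions, Eq. (1).

    `contacted_numbers` holds one entry per call or text, namely the
    contact's phone number. Low entropy means the user concentrates on
    a few contacts; high entropy means interactions are spread evenly.
    """
    counts = Counter(contacted_numbers)          # n_c per unique contact
    n_total = sum(counts.values())               # n_t = sum of n_c
    return -sum((n_c / n_total) * math.log(n_c / n_total)
                for n_c in counts.values())
```

For example, a user who splits interactions evenly between two contacts has entropy log 2 ≈ 0.69, while a user contacting a single number has entropy 0.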
A number of quantities based on location data are also computed. We extract the median and standard deviation of the users' daily distance travelled, their daily radius of gyration (here simplified to be the radius of the smallest circle enclosing all coordinates visited by the user on each day) and the entropy of the time spent in various locations by the user. We identify the locations visited by clustering the GPS points sampled when a user is not moving. A user is defined as not moving if the user's mean speed does not exceed 0.5 m/s in the period between two consecutive GPS points. As the uncertainty on civilian GPS locations can be up to 100 m Zandbergen and Barbeau (2011), a user moving at a speed of 0.5 m/s would need at least 400 s to move a distance larger than two times the uncertainty. For that reason, we consider only GPS points taken even further apart, i.e. 500 s apart. The GPS data points are filtered according to the following procedure. For each user, we include the first recorded GPS data point, we then exclude data points in the subsequent time window of 500 s and then again include the first data point sampled outside this window. From this new data point we repeat the procedure of excluding points in a subsequent window of 500 s, and so forth. We identify clusters (locations) in the GPS points by use of the DBSCAN algorithm Ester, Kriegel, Sander, and Xu (1996) and we compute the entropy of visits to those clusters by again applying Eq. (1). Finally, we estimate the fraction of time a user spends at home, where home is assumed to be the place where a user spends most of their weeknights.
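The 500 s subsampling of GPS points described above can be sketched as follows; `subsample_gps` is a hypothetical helper name, not from the study.

```python
def subsample_gps(timestamps, window=500.0):
    """Keep the first GPS point, then repeatedly the first point sampled
    outside the 500 s exclusion window following each kept point.

    `timestamps` are sorted sample times in seconds; the function
    returns the indices of the retained points.
    """
    keep, next_allowed = [], float("-inf")
    for i, t in enumerate(timestamps):
        if t >= next_allowed:
            keep.append(i)
            next_allowed = t + window   # open a new exclusion window
    return keep
```

The retained stationary points would then be clustered into locations, e.g. with scikit-learn's DBSCAN implementation, and Eq. (1) applied to the visit counts per cluster.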
Another category of features aims to quantify the degree to which a user's behavior follows a temporal pattern. For the call/text data, we count the number of call/text events for a given user in time bins of 6 h. We then fit an autoregressive series which best predicts the activity X_t in a time bin from previous activities, on the form

X_t = μ + Σ_{i=1}^{p} φ_i (X_{t−i} − μ) + ε_t,   (2)

where μ is the mean activity and ε_t is a noise term. The coefficients φ_i are used as features with names like 'AR series coefficient i'.

We finally extract a range of features concerning a user's social contacts. This includes their number of Facebook friends and the fraction of the time users spend in the proximity of other participants in the study. The latter is estimated from repeated automatic scans by the Bluetooth ports. The entropy of the proximity is also calculated similarly to Eq. (1), as well as the time-series parameters as described in Eq. (2).
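A minimal way to obtain such autoregressive coefficients is a least-squares fit on mean-centred, lagged activity counts. This is a sketch under our own assumptions: the function name is ours and the study does not restate the model order it used, so `order` below is illustrative.

```python
import numpy as np

def ar_coefficients(activity, order=3):
    """Least-squares fit of an AR(order) model of the form in Eq. (2):
    X_t = mu + sum_i phi_i * (X_{t-i} - mu) + noise.

    `activity` is the sequence of event counts per 6 h bin; the
    returned phi_i serve as 'AR series coefficient i' features.
    """
    x = np.asarray(activity, float)
    mu = x.mean()
    y = x[order:] - mu
    # Design matrix: column i holds the mean-centred activity at lag i.
    A = np.column_stack([x[order - i:-i] - mu for i in range(1, order + 1)])
    phi, *_ = np.linalg.lstsq(A, y, rcond=None)
    return phi
```

For a strictly alternating series such as 1, −1, 1, −1, …, the AR(1) fit recovers φ₁ = −1.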

Classification
We divide the scores on each of the five personality traits into tertiles, i.e. we assign a label of 0, 1 or 2 specifying whether they score low, medium, or high on that trait, corresponding to them lying in the bottom, middle, or upper third, respectively, of all the user scores for that trait. We do this for two reasons. First, this has been done in existing research Chittaranjan et al. (2011a), de Montjoye et al. (2013) and hence allows comparison between our results and those in the literature. Second, although regression approaches have convenient accuracy metrics such as the mean squared error (MSE), which quantifies how far the regressor's predictions typically are from the true values, this measure is not particularly meaningful on ordinal values like personality traits, where e.g. higher extraversion scores mean a person is more extroverted, but there is no precise interpretation for a difference in extraversion score of, say, 0.2.
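The tertile labelling can be sketched with percentile cuts; `tertile_labels` is an illustrative helper name, and ties at the cut points are resolved by the ≤ comparisons below.

```python
import numpy as np

def tertile_labels(scores):
    """Map raw trait scores to 0/1/2 for the bottom/middle/top third."""
    scores = np.asarray(scores, float)
    low, high = np.percentile(scores, [100 / 3, 200 / 3])
    return np.where(scores <= low, 0, np.where(scores <= high, 1, 2))
```

Applied to nine evenly spread scores, this yields three users per class.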
Our model of choice for predicting the classification labels Y from the feature vectors X is a support vector machine (SVM) using a radial basis function (RBF) kernel Hearst, Dumais, Osuna, Platt, and Schölkopf (1998). This model requires that two hyperparameters be fixed: a misclassification cost C and a width parameter γ of the Gaussian basis functions. We take two approaches for feature selection and model fitting, and subsequently compare the results. In both approaches, we use the correlation between phone metrics and personality traits as a heuristic for feature selection and include the number of features n as a hyperparameter of the model.
In the first approach, we perform a number of cross-validation runs. For each training set introduced during the cross-validation, we first choose the n features with the strongest correlations with the personality traits and perform an extensive grid search in the hyperparameter space. As a consequence, both the hyperparameter values and the features included in the classifier will vary between cross-validation runs, potentially making the results more difficult to interpret. At the same time, however, this ensures that training and test sets are completely separated, and thus that we do not observe overly optimistic results caused by overfitting.
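This first, nested procedure can be sketched with scikit-learn. The sketch makes two assumptions of ours: univariate ANOVA scoring (`f_classif`) stands in for the paper's correlation-based feature ranking, and the grid values shown are illustrative rather than those actually searched.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV, cross_val_score

def safe_cv_accuracy(X, y):
    """Nested cross-validation: feature ranking and the (C, gamma, n)
    grid search are redone inside every training fold, so the outer
    test folds remain completely unseen during model selection."""
    pipe = Pipeline([("select", SelectKBest(f_classif)),
                     ("svm", SVC(kernel="rbf"))])
    grid = {"select__k": [5, 10],          # n, the number of features
            "svm__C": [1, 10],             # misclassification cost
            "svm__gamma": ["scale", 0.1]}  # RBF width parameter
    model = GridSearchCV(pipe, grid, cv=3)
    return cross_val_score(model, X, y, cv=3).mean()
```

On synthetic data where one feature fully determines the label, the nested estimate is high because the signal is real; on pure noise it stays near the baseline, which is the point of keeping selection inside the folds.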
In a second and less safe approach, following de Montjoye et al. (2013), we use a feature's correlation with a given trait as a heuristic for estimating the importance of the feature. We thus rank the features by their correlations with a given trait, and define another parameter n, denoting the number of features to include, starting with the one most strongly correlated with the personality trait in question.
The hyperparameter values and the feature selection are first fixed by performing a grid-search procedure on the full dataset, including into the final classifier the n features with the strongest correlations to the personality trait in the full dataset. This has the disadvantage of being vulnerable to overfitting, especially on smaller datasets, as it allows the classifier to exploit coincidental correlations between phone metrics and personality traits for prediction. On the other hand, this approach has the advantage that the hyperparameter values and the feature selection are only determined once, which may aid in interpreting the results. The values of the hyperparameters C and γ are shown in Table 2, and the features included in the classifiers for the five traits are shown in Table 1.

Results
The quality of our classification is measured in terms of the relative improvement over a baseline classifier (our null model),

(f − f_baseline) / f_baseline,

where f denotes the fraction of correct classifications. The score f_baseline is obtained using a null classifier which always predicts the label most frequently occurring in a test set. Using the first approach outlined above, where hyperparameters are fitted separately on each training set, we obtain the results shown in Table 3. In general, our relative improvements over the baseline are much lower than those reported in the literature. The only exception is the extraversion trait in the CNS dataset, which at the same time is the only trait that can be predicted significantly better than baseline.
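The baseline comparison can be sketched as follows; the function name is ours, not from the study.

```python
def relative_improvement(y_true, y_pred):
    """Relative improvement over the null classifier that always
    predicts the most frequent label in the test set:
    (f - f_baseline) / f_baseline."""
    n = len(y_true)
    f = sum(t == p for t, p in zip(y_true, y_pred)) / n
    f_baseline = max(y_true.count(c) for c in set(y_true)) / n
    return (f - f_baseline) / f_baseline
```

For instance, with true labels [0, 0, 1, 2] the null classifier scores 0.5, so a classifier that gets three of four right shows a relative improvement of 0.5.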
We now compare these results with the less safe approach, where hyperparameters are determined and features selected on the full dataset. For the FF data, we obtain relative improvements of the trait prediction in the range 0.176-0.493 (with a mean improvement of 0.31) based on 10^4 bootstrap samples. For the CNS dataset, we obtain relative improvements over the null model in the range −0.024 to 0.367 (with a mean improvement of 0.11). The results for each trait in each dataset are shown in Table 4.
In de Montjoye et al. (2013) a mean relative improvement of 0.42 is reported, which is significantly above what is reported in another study Chittaranjan, Blom, and Gatica-Perez (2013). We note that significant improvements over a baseline classifier for traits other than extraversion appear contingent on (a) having few data points, and (b) using correlations on the full dataset for feature selection, thus allowing the model to be fit to noise. Hence, it seems likely that earlier reports of high predictability of human personality traits from phone metrics have been greatly overestimated due to overfitting enabled by a combination of small sample sizes and a large number of variables. We note that only the extraversion trait appears to be truly predictable from phone-based data. This is in good agreement with common sense, as phones by their nature are devices for inter-human communication. Further, some of the features used in the classifier are expected to be related to extraversion, such as the users' number of Facebook friends and the number of new contacts made during the first months of the study.

Based on the Big Five Inventory, the personality traits are computed by reducing the 44 answers to five scores. Any dimensionality reduction of this kind will inevitably lose information available from the full set of answers. We have therefore performed a series of alternative reduction methods on the 44 items to see if we could improve our predictions of the personality traits (see the appendices). Both supervised and unsupervised dimensionality reductions have been used. Among the unsupervised methods, we have tried principal component analysis, independent component analysis, and factor analysis. We have applied these methods directly to the answers to the 44 items in order to extract five-dimensional objects keeping the most relevant information about the original 44 items. In the unsupervised reduction, no information about the features is used.
For the supervised reduction method, we try to reduce the target variables (the list of items) by finding those items that can be best predicted from the predictor variables (the features). While both the supervised and unsupervised methods significantly improve the quality of our predictions, the overall picture remains the same: predominantly items related to extraversion can be predicted with some certainty.

Discussion
Using data from the Copenhagen Network Study, which, to our knowledge, is the largest dataset simultaneously containing information about the Big Five personality traits and extensive information about smartphone usage patterns, we have shown that the extraversion trait can be predicted significantly better than by a null model based on random classification. In contrast, the other personality traits are poorly predicted by our data. Our findings contrast with previous studies, which report significant predictabilities across all traits. Given that we have carried out the analysis on datasets of two sizes using two feature selection procedures, and since we obtained high predictabilities only when (a) using full-dataset correlations for variable selection and (b) analyzing a small dataset, the combination of the two appears a likely explanation for the results previously reported in the literature. Regarding the generalizability of our findings, we note that all participants in the study were students at the Technical University of Denmark, and that the findings are not necessarily generalizable to the population at large.

Availability of data and materials
Data are part of the larger study "Social Fabric" involving researchers at the Technical University of Denmark and the University of Copenhagen. Due to privacy considerations regarding the subjects in our dataset, including European Union regulations and Danish Data Protection Agency rules, we cannot make all data used here publicly available. The data contain detailed information on mobility and daily habits at a high spatio-temporal resolution. We understand and appreciate the need for transparency in research and are ready to make the data available to researchers who meet the criteria for access to confidential data, sign a confidentiality agreement, and agree to work under our supervision in Copenhagen. The "Social Fabric" study was reviewed and approved by the appropriate Danish authority, the Danish Data Protection Agency (reference number: 2012-41-0664). The Data Protection Agency guarantees that the project abides by Danish law and also considers potential ethical implications. All subjects in the study gave written informed consent.

Competing interests
The authors declare that they have no competing interests.

Appendix A. Descriptive statistics of features and target values for the CNS dataset
This section contains descriptive statistics for the applied features and the personality traits. Table A.5 contains key descriptive figures for the features used in the predictions, as listed in Table 1. These include the mean values and standard deviations, as well as the minimum and maximum values for each feature. The table also contains a brief description of each feature, as well as an index. These indices can be used to locate a visualization of the distribution of the feature, and information on inter-feature correlations, in Fig. A.1. The distribution plots were generated by using a Gaussian kernel density estimation (KDE) procedure to smooth histograms obtained from the observed features. Similar details are provided for the Big Five Inventory scores in Table A.6, and the corresponding distributions and correlations are shown in Fig. A.2.

Table A.5: Descriptive statistics for the features used for classification. This table summarizes key statistical figures (mean, standard deviation, and min/max values) for the features used. An index is also given to uniquely denote each feature. The indices refer to further graphical information on the features in Fig. A.1. We use the abbreviations iet for inter-event time and cir for contact-interaction ratio.

Fig. A.1: Distributions of the features listed in Table A.5. In the lower right corners is a heatmap of the Pearson correlation coefficients for each pair of features.

Appendix B. Alternative dimensionality reductions

We investigate the loss of predictability associated with the dimensionality reduction used to compute the Big Five traits from the original 44 questions in the questionnaire, by considering alternative dimensionality reduction techniques. Specifically, we use principal component analysis (PCA), independent component analysis (ICA), factor analysis (FA), and supervised dimensionality reduction (SDR), keeping only the five leading components of each technique.
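A sketch of the three unsupervised reductions using scikit-learn, keeping five components of each; the function name and the random seeds are ours, not settings from the study.

```python
import numpy as np
from sklearn.decomposition import PCA, FastICA, FactorAnalysis

def five_component_reductions(answers):
    """Reduce the 44 questionnaire items to five components with each
    unsupervised technique. `answers` has one row per participant and
    one column per item; a dict of (participants x 5) arrays is returned."""
    reducers = {"PCA": PCA(n_components=5),
                "ICA": FastICA(n_components=5, random_state=0),
                "FA": FactorAnalysis(n_components=5, random_state=0)}
    return {name: r.fit_transform(answers) for name, r in reducers.items()}
```

Each five-dimensional representation can then be fed to the same SVM classification pipeline as the Big Five scores themselves.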
The supervised dimensionality reduction technique applied here finds the one-dimensional projection of the data that has the highest R^2 value when training a linear model. The procedure is continued with the additional constraint that each new projection must be orthogonal to all previous projections, such that the result is a low-dimensional space specified by an orthogonal basis. The constrained optimization is performed numerically on the training set and then applied to the test set in order to avoid overfitting. See Appendix C for details on SDR. Fig. B.3 shows the performance of our classifier in predicting different dimensionality reductions of the 44 questions in the Big Five Inventory. As the figure shows, other dimensionality reduction techniques result in greater personality predictability, indicating that some information related to how people use their phones is contained in their responses to the Big Five questionnaire, but is lost when the Big Five traits are computed from said responses.
To investigate this further, we examined the components of the projection vectors used in each dimensionality reduction technique. In all cases, the projection retaining the greatest predictability was strongly associated with extraversion, and in many cases also with neuroticism. For example, Fig. B.4 shows the entries of the ICA vector whose projection had the greatest predictability. Note that the most predictable direction of projection points in a direction corresponding to opposite scores of extraversion and neuroticism, consistent with the anticorrelation between the two traits found in the literature Hamburger and Ben-Artzi (2000).

Appendix C. Supervised dimensionality reduction

For a given projection vector, each person p is assigned a projected score y^(p), which a linear model in the phone features predicts as ŷ^(p). The quality of the projection is measured by the coefficient of determination,

R^2 = 1 − Σ_p (y^(p) − ŷ^(p))^2 / Σ_p (y^(p) − ȳ)^2,

with ȳ the average projection over the persons. The training is performed iteratively in two steps. First, we fix the projection vector and optimize the parameters of the linear model. Then we fix the parameters and optimize the projection vector. The optimization step is performed using Sequential Least Squares Programming (SLSQP) with the projection vector constrained to unit length. The training converges consistently, irrespective of the initialization of the projection vector.
We may then look for the best projection in the 43-dimensional space orthogonal to our first projection. This can be done either by mapping onto these 43 dimensions or simply by adding an orthogonality constraint to the optimization. This procedure may be repeated until a satisfying number of projections is obtained.
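A compact sketch of this constrained search using SciPy's SLSQP. One simplification of ours: instead of alternating between model and projection updates as described above, the inner linear fit is solved exactly for each candidate projection; all names are illustrative.

```python
import numpy as np
from scipy.optimize import minimize

def best_projection(X, Y, previous=()):
    """Find a unit-norm projection w of the item answers Y whose
    projected score Y @ w is best fit (highest R^2) by a linear model
    in the phone features X, orthogonal to any previously found w's."""
    Xd = np.column_stack([X, np.ones(len(X))])  # linear model with offset

    def neg_r2(w):
        y = Y @ w                                        # projected scores
        beta = np.linalg.lstsq(Xd, y, rcond=None)[0]     # exact inner fit
        resid = y - Xd @ beta
        return np.sum(resid**2) / np.sum((y - y.mean())**2) - 1.0  # -R^2

    cons = [{"type": "eq", "fun": lambda w: w @ w - 1.0}]   # unit length
    for p in previous:                                      # orthogonality
        cons.append({"type": "eq", "fun": lambda w, p=p: w @ p})
    w0 = np.ones(Y.shape[1]) / np.sqrt(Y.shape[1])
    return minimize(neg_r2, w0, method="SLSQP", constraints=cons).x
```

Calling it again with `previous=[w1]` yields the best projection in the orthogonal complement, mirroring the repeated procedure described above.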
We have a final note regarding overfitting. Let us start by counting the number of free parameters in the training step. If the dimension of y is N and the dimension of x is M, then the number of free parameters is M + N, since the linear model has an extra parameter for the offset, which is canceled by the unit-length constraint on the projection vector. For a dataset of size S, we need S ≫ M + N for proper training. In other words, if too many features of x are included in the SDR scheme, fitting to noise will take place, thereby resulting in worse performance when applying the classifier to a test set. To avoid this overfitting effect, we implement the following procedure to determine the optimal features of x to include. First, we partition the data into five training and test sets, consisting of 80% and 20% of the data, respectively. Within each training set, we find the correlation between the features of x and each of the 44 features of y. For each feature, we compute the product of the p-values corresponding to those correlations, obtaining a value between 0 and 1, where a value of 1 is interpreted as the feature being unrelated to y and lower values indicate stronger associations. We then rank the features according to these values and keep the n best features for the classification task. We find that n = 8 performs best, since overfitting takes over for larger n, and we therefore use these 8 features for the supervised dimensionality reduction.
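The p-value-product ranking can be sketched as follows; the default `n_keep=8` mirrors the value found above, but the function name is ours, and in practice the ranking is computed within each training set only.

```python
import numpy as np
from scipy.stats import pearsonr

def rank_features_by_pvalue_product(X, Y, n_keep=8):
    """Rank phone features by the product, over all items, of the
    p-values of their Pearson correlations with the item answers.

    Small products indicate features related to many items; the indices
    of the n_keep best-ranked features are returned.
    """
    scores = []
    for j in range(X.shape[1]):
        pvals = [pearsonr(X[:, j], Y[:, k])[1] for k in range(Y.shape[1])]
        scores.append(np.prod(pvals))
    return np.argsort(scores)[:n_keep]
```

A feature strongly correlated with every item accumulates many tiny p-values and therefore ranks first, while unrelated features keep products of order one.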