
1 Introduction

“You don’t get a second chance to make a first impression”, as the saying famously goes. First impressions are rapid judgments of personality traits and complex social characteristics like dominance, hierarchy, warmth, and threat [13]. Accurate first impressions of personality traits have been shown to be possible when observers were exposed to relatively short intervals (4 to 10 min) of ongoing streams of individuals’ behavior [1, 4], and even to static photographs presented for 10 s [2]. Most remarkably, trait assignment by human observers has been shown to occur in as little as 100 ms [5].

Personality is a strong predictor of important life outcomes such as happiness and longevity; quality of relationships with peers and family; occupational choice, satisfaction, and performance; community involvement; criminal activity; and political ideology [6, 7]. Personality plays an important role in the way people manage the images they convey in self-presentations and employment interviews, as they try to shape the audience’s first impressions and increase their effectiveness. Among the many factors influencing employment interview outcomes, such as social factors, interviewer-applicant similarity, application fit, information exchange, preinterview impressions, applicant characteristics (appearance, age, gender), disabilities, and training [8], personality traits are among the most influential [9].

The key assumption of personality psychology is that stable individual characteristics result in stable behavioral patterns that people tend to display independently of the situation [10]. The Five Factor Model (or Big Five) is currently the dominant paradigm in personality research. It models human personality along five dimensions: Extraversion, Agreeableness, Conscientiousness, Neuroticism, and Openness. Many studies have confirmed the consistency and universality of this model.

In the field of Computer Science, Personality Computing studies how machines can automatically recognize or synthesize human personality [10]. The literature in Personality Computing is considerable. Methods have been proposed for recognizing personality from nonverbal aspects of verbal communication [11, 12], multimodal combinations of speaking style (prosody, intonation, etc.) and body movements [13–18], facial expressions [19, 20], and combinations of acoustic with visual cues or physiological with visual cues [19, 21–23]. Visual cues can refer to eye gaze [14], frowning, head orientation [22, 23], mouth fidgeting [14], primary facial expressions [19, 20], or characteristics of primary facial expressions such as presence, frequency, or duration [19].

As far as we know, there is no common data corpus in personality computing, and no benchmarking effort has yet been organized. This is a major impediment to the further advancement of this line of research and the main motivation for this challenge. The challenge is part of a larger project which studies the outcomes of job interviews. We have designed a dataset collected from publicly available YouTube videos in which people talk to the camera in a self-presentation context; the setting is similar to video-conference interviews. Consistent with research in psychology and the related literature on automatic personality computing, we labeled the data based on the Big Five model using Amazon Mechanical Turk (see Sect. 3). We are running a second round for the ICPR 2016 conference, which will take the form of a coopetition in which participants both compete and collaborate by sharing their code.

This challenge belongs to a series of events organized by ChaLearn since 2011: the 2011–2012 user-dependent One-shot-learning Gesture Recognition challenge [24, 25], the 2013–2014 user-independent Multi-modal Gesture Recognition challenge, the 2014–2015 human pose recovery and action recognition challenges [26, 27], and the 2015–2016 cultural event recognition [28] and apparent age estimation [29, 30] challenges. The 2016 edition is the first in which we organize a First Impressions challenge on automatic personality recognition.

The rest of this paper is organized as follows: Sect. 2 presents the schedule of the competition and the evaluation procedures; Sect. 3 describes the data we collected; Sect. 4 presents, compares, and discusses the methods submitted to the competition; and Sect. 5 concludes the paper with an extended discussion and suggestions for future work.

2 Challenge Protocol, Evaluation Procedure, and Schedule

The ECCV ChaLearn LAP 2016 challenge consisted of a single-track competition to quantitatively evaluate the recognition of the apparent Big Five personality traits from multi-modal audio+RGB data of YouTube videos. The challenge was managed using Microsoft’s open-source CodaLab platform. The participants had to submit prediction results during the challenge, and the winners had to publicly release their source code.

The competition had two phases:

  • A development phase during which the participants had access to 6,000 manually labeled continuous video sequences of 15 s each. These training videos represent 60 % of the total set and are randomly grouped into 75 training batches. Participants could get immediate feedback on their prediction performance by submitting results on an unlabeled validation set of 2,000 videos (20 % of the total set), also randomly grouped into 25 validation batches.

  • A final phase during which the competitors could submit their predictions on 2,000 new test videos (the remaining 20 % of the total set, also grouped into 25 test batches). The prediction scores on test data were not revealed until the end of the challenge.

2.1 Evaluation Metrics

The participants of the different teams trained their models to imitate human judgments, which consist of continuous target values in the range [0, 1] for each trait. Thus, their goal was to produce, for each video in the validation or test set, five continuous prediction values in the range [0, 1], one for each trait.

For this task (similar in spirit to regression), the evaluation consisted of computing the mean accuracy over all traits and videos. The accuracy for each trait is defined as:

$$\begin{aligned} A = 1 - \sum_{i=1}^{N_t} |t_i - p_i| \Big/ \sum_{i=1}^{N_t} |t_i - \overline{t}| \end{aligned}$$
(1)

where \(p_i\) are the predicted scores, \(t_i\) the ground truth scores, the sums run over the \(N_t\) test videos, and \(\overline{t}\) is the average ground truth score over all videos. Additionally, we also computed (but did not use to rank the participants) the coefficient of determination:

$$\begin{aligned} R^2 = 1 - \sum_{i=1}^{N_t} (t_i - p_i)^2 \Big/ \sum_{i=1}^{N_t} (t_i - \overline{t})^2. \end{aligned}$$
(2)

We also turned the problem into a classification problem by thresholding the target values at 0.5. This yields five binary classification problems (one per trait). We used the Area Under the ROC Curve (AUC) to estimate classification accuracy.
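For concreteness, the three evaluation measures can be computed in a few lines. The sketch below is our own illustration rather than the official evaluation script; it assumes `t` and `p` are NumPy arrays of shape (number of videos, 5) holding ground-truth and predicted trait scores, and writes the accuracy in its normalized form (Eq. 1).

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def challenge_metrics(t, p):
    """t, p: arrays of shape (n_videos, 5) with values in [0, 1]."""
    abs_dev = np.abs(t - t.mean(axis=0)).sum(axis=0)         # per-trait denominator of Eq. 1
    accuracy = 1 - np.abs(t - p).sum(axis=0) / abs_dev       # normalized accuracy (Eq. 1)
    r2 = 1 - ((t - p) ** 2).sum(axis=0) / ((t - t.mean(axis=0)) ** 2).sum(axis=0)  # Eq. 2
    # Classification view: threshold the ground truth at 0.5, keep continuous predictions
    auc = np.array([roc_auc_score(t[:, k] >= 0.5, p[:, k]) for k in range(t.shape[1])])
    return accuracy, r2, auc  # one value per trait; teams are ranked by accuracy.mean()
```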

Fig. 1.

Progress of the highest validation-set leaderboard scores of all teams for each trait, and of the highest overall ranking score (mean accuracy over all traits). The score used is the accuracy, normalized as in Eq. 1.

2.2 Schedule

The competition lasted two months and attracted 84 participants, who were grouped into several teams. The schedule was the following:

May 15, 2016: Beginning of the quantitative competition, release of the development data (with labels) and validation data (without labels).

June 30, 2016: Release of encrypted final evaluation data (without labels). Participants can start training their methods with the whole data set.

July 2, 2016: Deadline for code submission.

July 3, 2016: Release of final evaluation data decryption key. Participants start predicting the results on the final evaluation data.

July 13, 2016: End of the quantitative competition. Deadline for submitting the predictions over the final evaluation data. The organizers started the code verification by running it on the final evaluation data.

July 15, 2016: Deadline for submitting the fact sheets. Release of the verification results to the participants for review. Participants of the top ranked teams are invited to follow the workshop submission guide for inclusion in the ChaLearn LAP 2016 Workshop on Apparent Personality Analysis at ECCV 2016.

As can be seen in Fig. 1, progress was made throughout the challenge, with improvements until the very end. When the challenge ended, there was still a noticeable difference between the average of the best per-trait accuracies and the best overall team accuracy, because some teams’ methods performed better on some traits than on others. This shows that there is still room for improvement and that the teams’ methods are complementary. We expect further improvements from the ongoing coopetition (second round of the challenge).

3 Competition Data

The data set consists of 10,000 clips extracted from more than 3,000 different YouTube high-definition (HD) videos of people facing and speaking in English to a camera. The people appearing are of different genders, ages, nationalities, and ethnicities, which makes the task of inferring apparent personality traits more challenging. In this section, we provide the details of the data collection, the data preparation, and the final data set.

Fig. 2.

Data collection web page. Comparing pairs of videos, the AMT workers had to indicate their preference for five attributes representing the “Big Five” personality traits, following these instructions: “You have been hired as a Human Resource (HR) specialist in a company, which is rapidly growing. Your job is to help screening potential candidates for interviews. The company is using two criteria: (A) competence, and (B) personality traits. The candidates have already been pre-selected for their competence for diverse positions in the company. Now you need to evaluate their personality traits from video clips found on the Internet and decide to invite them or not for an interview. Your tasks are the following. (1) First, you will compare pairs of people with respect to five traits: Extraversion = Friendly (vs. reserved); Agreeableness = Authentic (vs. self-interested); Conscientiousness = Organized (vs. sloppy); Neuroticism = Comfortable (vs. uneasy); Openness = Imaginative (vs. practical). (2) Then, you will decide who of the 2 people you would rather interview for the job posted.” In this challenge we did not use the answers to the last question.

3.1 Video Data

We collected a large pool of HD (720p) videos from YouTube. After viewing a large number of videos, we found Q&A videos to be a particularly suitable and abundant source of talking-to-the-camera footage: they generally feature few people, little background motion, and clear voice. Since YouTube videos are organized in channels, which can contain a variable number of videos, we limited the number of videos per YouTube channel (author) to 3 in order to keep a balance of unique subjects.

After downloading an initial pool of 13,951 YouTube videos using the pytube Python API, we manually filtered out unsuitable footage (sequences that were too short or non-English speakers). From the remaining 8,581 videos, we automatically generated a set of 32,139 clips of 15 s each. The clip generation was done automatically by searching for continuous 15-second video segments in which one and only one face appeared. Faces were detected using the Viola-Jones detector from OpenCV [31]. We retained only faces with at least one visible eye, with eyes also detected using Viola-Jones. To increase robustness, we kept only those clips meeting both criteria (“one and only one face bounding box containing at least one eye”) in 75 % of the frames. Since videos were of various durations, we limited the number of clips per video to at most 6.
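The per-frame check behind this filter can be sketched as follows using OpenCV’s stock Haar cascades. This is a minimal illustration of the criterion rather than the actual pipeline used to build the dataset; the helper functions and cascade choices are ours.

```python
import cv2

face_cascade = cv2.CascadeClassifier(cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
eye_cascade = cv2.CascadeClassifier(cv2.data.haarcascades + "haarcascade_eye.xml")

def frame_is_valid(frame):
    """True if the frame contains exactly one face with at least one visible eye."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) != 1:
        return False
    x, y, w, h = faces[0]
    eyes = eye_cascade.detectMultiScale(gray[y:y + h, x:x + w])
    return len(eyes) >= 1

def clip_is_valid(frames, min_ratio=0.75):
    """Keep a 15-s clip only if at least 75% of its frames pass the face/eye check."""
    valid = sum(frame_is_valid(f) for f in frames)
    return valid >= min_ratio * len(frames)
```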

We then performed a second fine-grained manual filtering – this time considering clips, instead of whole videos – using a custom web interface to filter out those clips not meeting the following criteria:

  • One unique person in the foreground, at a safe distance from the camera.

  • Good quality of audio and images.

  • Only English speaking.

  • Only people above roughly 13–15 years old; unidentified babies appearing with their parents might be allowed.

  • Not too much camera movement (a changing background is allowed, but avoid a constantly blurred foreground).

  • No adult or violent content (except people casually talking about sex or answering Q&A in an acceptable manner). Discard any libelous, doubtful, or problematic content.

  • No nudity (except if only the parts above the shoulders and neck are visible).

  • People in the background are allowed (crowd, audience), provided they are not talking and their faces are at low resolution, to avoid any confusion with the speaker.

  • No advertisement (visual or audio information about products or company names).

  • Avoid visual or audio cuts (abrupt changes).

From this second manual filter, we obtained the final set of 10,000 clips, corresponding to 3,060 unique originating videos, i.e. a mean of 3.27 clips per video. In terms of duration, the clips amount to 41.6 h of footage pooled from 608.7 h of originating videos.

Table 1. Video data preparation and final data set statistics.

The originating videos were provided by 2,764 unique YouTube channels. Note, however, that the number of channels does not correspond to the number of people (a YouTuber can have several channels or appear in other YouTubers’ channels), but it provides an estimate of the diversity of people appearing in the data set. The originating videos are also quite diverse in their number of views and their 5-star ratings, which also helped to alleviate bias towards any particular kind of video. This information is summarized in Table 1 together with other statistics computed from the videos’ metadata. The table also lists the 20 most common keywords (or tags) associated with the originating videos. As stated before, we focused on Q&A videos, often related to other video content such as vlogging, HOW TOs, and beauty tips (mostly makeup).

3.2 Ground-Truth Estimation

Obtaining ground truth for personality traits can be challenging. Before deciding to use human labeling of videos, we considered administering self-report personality tests to subjects we interviewed ourselves. We concluded that such test results are biased and variable, and that conducting our own interviews would not allow us to collect massive amounts of data. Therefore, for this dataset, we resorted to using the perceptions of human subjects viewing the videos. This is a different task from evaluating real personality traits, but it is equally useful in the context of human interaction (e.g. job interviews, dating, etc.).

To rapidly obtain a large number of labels, we used Amazon Mechanical Turk (AMT), as is now common in computer vision [32]. Our budget allowed us to collect multiple votes per video, in an effort to reduce variance. However, because each worker (a.k.a. voter) contributes only a few labels in a large dataset, this raises the problem of bias and the need to calibrate the labels. Biases, which can stem for example from harshness or from prejudices related to race, age, gender, or culture, are very hard to measure.

Fig. 3.

Screenshot of sample videos voted to clearly perceive the traits, on either end of the spectrum.

We addressed this problem by using pairwise comparisons. We designed a custom interface (see Fig. 2).

Each AMT worker labeled small batches of pairs of videos. To ensure good coverage and some overlap in the labeling of pairs of videos across workers, we generated pairs with a small-world algorithm [33]. Small-world graphs provide high connectivity, avoid disconnected regions in the graph, have well-distributed edges, and keep distances between nodes short [34] (Fig. 3).
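As an illustration of pair generation (not the organizers’ exact procedure), a Watts-Strogatz small-world graph over the videos can be used: each node is a video and each edge becomes one pairwise comparison assigned to an AMT worker. The parameters k and p below are purely illustrative; the total number of comparisons, n·k/2, is what the annotation budget constrains.

```python
import random
import networkx as nx

def generate_comparison_pairs(n_videos, k=8, p=0.1, seed=0):
    """Each video is a node; each edge of the small-world graph is one pairwise comparison.

    k: each node is joined to its k nearest ring neighbours before rewiring.
    p: probability of rewiring each edge (adds long-range "shortcut" comparisons).
    """
    g = nx.watts_strogatz_graph(n_videos, k, p, seed=seed)
    pairs = list(g.edges())
    random.Random(seed).shuffle(pairs)  # mix local and long-range pairs across HITs
    return pairs

pairs = generate_comparison_pairs(10000, k=8, p=0.1)  # n*k/2 = 40,000 comparisons in this toy setting
```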

Cardinal scores were obtained by fitting a Bradley-Terry-Luce (BTL) model [35]. This is a probabilistic model in which the probability that an object j is judged to have more of an attribute than an object i is a sigmoid function of the difference between their cardinal scores. Maximum likelihood estimation was used to fit the model. Further details and explanations of the procedure used to convert pairwise comparisons into cardinal scores are provided in a companion paper [36], where a study is conducted to evaluate how many videos we could label within the constraints of our financial budget. We ended up affording 321,684 pairs to label the 10,000 videos.
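A minimal sketch of such a fit (our own simplified implementation, not the code of [36]): each video i receives a latent score s_i, the probability that i is preferred over j is the sigmoid of s_i - s_j, and the log-likelihood of the observed comparisons is maximized by gradient ascent.

```python
import numpy as np

def fit_btl(wins, n_items, lr=0.05, epochs=200):
    """wins: list of (winner, loser) index pairs from the AMT comparisons.
    Returns latent scores; higher = more of the attribute."""
    s = np.zeros(n_items)
    w = np.array(wins)
    winners, losers = w[:, 0], w[:, 1]
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(s[winners] - s[losers])))  # P(winner beats loser)
        grad = np.zeros(n_items)
        np.add.at(grad, winners, 1.0 - p)    # d log-likelihood / d s_winner
        np.add.at(grad, losers, -(1.0 - p))  # d log-likelihood / d s_loser
        s += lr * grad
        s -= s.mean()                        # scores are only defined up to a constant
    return s

# Cardinal labels in [0, 1] can then be obtained by rescaling, e.g. min-max:
# labels = (s - s.min()) / (s.max() - s.min())
```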

4 Challenge Results and Methods

In this section we summarize the methods proposed by the teams and provide a detailed description of the winning methods. The teams submitted their code and predictions for the test sets; the source code is available from the challenge website. We then provide a statistical analysis of the results and highlight overall aspects of the competition.

4.1 Summary of Methods Used

In Table 2 we summarize the approaches of the teams who participated in the final phase, uploaded their models, and returned the survey about their methods that we asked them to complete (the so-called “fact sheets”).

The vast majority of approaches, including the best performing methods, used both the audio and the video modalities. Most of the teams represented the audio with handcrafted spectral features, a notable exception being the method proposed by team DCC, where a residual network [37] was used instead. For the video modality, the dominant approach was to learn the representations through convolutional neural networks [38]. In most methods, the modalities were late-fused before being fed to different regressors such as fully connected neural networks or Support Vector Regressors. A notable exception is the method proposed by team evolgen, which includes temporal structure by partitioning the video sequences and sequentially feeding the learned audio-video representation to a recurrent Long Short-Term Memory (LSTM) layer [39].

Most teams made semantic assumptions about the data by separating the face from the background, usually via preprocessing such as face frontalisation. However, it is important to note that the winning method of team NJU-LAMDA does not make any kind of semantic separation of the content.

Finally, a common approach was to use pre-trained deep models fine-tuned on the dataset provided for this challenge. The readers are referred to Table 2 for a synthesis of the main characteristics of the methods that have been submitted to this challenge and to Table 3 for the achieved results. Next, we provide a more detailed description of the three winning methods.

Table 2. Overview of the team methods comparing pretraining (topology and data), preprocessing if performed, representation, learning strategy per modality and fusion.

First place: The NJU-LAMDA team proposed two separate models for still images and audio, processing multiple frames from the video and employing a two-step late fusion of the frame and audio predictions [40]. For the video modality, the team proposed DAN+, an extension of Descriptor Aggregation Networks [43] which applies max and average pooling at two different layers of the CNN, normalizing and concatenating the outputs before feeding them to fully connected layers. A pretrained VGG-face model [44] is used, with its fully-connected layers replaced and the network fine-tuned on the First Impressions dataset. For the audio modality, the team employs log filter bank (logfbank) features and a single fully-connected layer with sigmoid activations. At test time, a predefined number of frames is fed to the visual network and the predictions are averaged. The final visual prediction is averaged again with the output of the audio predictor.
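Schematically, the two-step late fusion can be written as below. This is a sketch of the published description rather than the team’s released code; the number of sampled frames and the model callables are placeholders.

```python
import numpy as np

def fuse_predictions(visual_model, audio_model, frames, logfbank_feats, n_frames=100):
    """frames: list of video frames; logfbank_feats: log filter bank features of the clip.
    visual_model / audio_model: callables returning a 5-dim trait vector in [0, 1]."""
    idx = np.linspace(0, len(frames) - 1, n_frames).astype(int)  # sample a fixed number of frames
    frame_preds = np.stack([visual_model(frames[i]) for i in idx])
    visual_pred = frame_preds.mean(axis=0)       # step 1: average over frame-level predictions
    audio_pred = audio_model(logfbank_feats)
    return (visual_pred + audio_pred) / 2.0      # step 2: average visual and audio predictions
```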

Second place: The evolgen team proposed a multimodal LSTM architecture for predicting the personality traits [41]. In order to maintain the temporal structure, the input video sequences are split into six non-overlapping partitions. From each partition, the audio representation is extracted using classical spectral features and statistical measurements, forming a 68-dimensional feature vector. The video representation is extracted by randomly selecting a frame from the partition, extracting the face, and centering it through face alignment. The preprocessed data are passed to a recurrent CNN, trained end-to-end, which uses separate pipelines for audio and video. Each partition’s frame is processed with convolutional layers, after which a linear transform reduces the dimensionality. The audio features of a given partition go through a linear transform and are concatenated with the frame features. The recurrent layer is sequentially fed with the features extracted from each partition. In this way, the recurrent network captures variations in audio and facial expressions for personality trait prediction.
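A compact PyTorch sketch of this type of architecture is given below. It is our reconstruction from the description above, with made-up layer sizes and without the training loop; see [41] for the actual model.

```python
import torch
import torch.nn as nn

class AudioVisualLSTM(nn.Module):
    def __init__(self, audio_dim=68, frame_dim=128, hidden_dim=128, n_traits=5):
        super().__init__()
        # Small CNN turning one face crop per partition into a frame_dim vector
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 16, 5, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, frame_dim),
        )
        self.audio_fc = nn.Linear(audio_dim, audio_dim)  # linear transform of audio features
        self.lstm = nn.LSTM(frame_dim + audio_dim, hidden_dim, batch_first=True)
        self.head = nn.Sequential(nn.Linear(hidden_dim, n_traits), nn.Sigmoid())

    def forward(self, frames, audio):
        # frames: (batch, 6, 3, H, W) - one random face crop per partition
        # audio:  (batch, 6, 68)      - spectral/statistical features per partition
        b, t = frames.shape[:2]
        f = self.cnn(frames.flatten(0, 1)).view(b, t, -1)
        x = torch.cat([f, self.audio_fc(audio)], dim=-1)
        out, _ = self.lstm(x)
        return self.head(out[:, -1])  # predict the 5 traits from the last time step
```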

Third place: The DCC team proposed a multimodal personality trait recognition model comprising two separate auditory and visual streams (deep residual networks, 17 layers each), followed by an audiovisual stream (one fully-connected layer with hyperbolic tangent activation) that is trained end-to-end to predict the Big Five personality traits [42]. There is no pretraining, but a simple preprocessing step is performed in which a random frame and a random crop of the audio are selected as inputs. At test time, the whole audio and video sequences are fed into the auditory and visual streams, with average pooling applied before the fully-connected layer.

All three winning methods use separate streams for audio and video, applying neural networks in both streams. The first- and second-place teams both use some form of data preprocessing, with NJU-LAMDA relying on logfbank features for the audio and evolgen on face cropping and spectral audio features. The second- and third-place methods both use end-to-end training, fusing the audio and video streams with fully-connected layers.

4.2 Statistical Analysis of the Results

Table 3 lists the results on test data using the different metrics. One can observe very close and competitive results among the top five teams; the results of the top ranking teams are within the error bars.

For comparison, we also report the results obtained by using the median predictions of all ranked teams; no improvement is gained from this voting scheme. We also show a “random guess” baseline, which corresponds to randomly permuting these predictions.

Table 3. Results of the first round of the Personality Trait challenge. Top: the Accuracy score used to rank the teams (Eq. 1). Middle: \(R^2\) score (Eq. 2). Bottom: Area under the ROC Curve (AUC) evaluating predictions by turning the problem into a classification problem. The error bars are the standard deviations computed with the bootstrap method. The best results are indicated in bold.

We treated the problem either as a regression problem or as a classification problem:

  • As a regression problem. The metric used in the challenge to rank teams is the mean (normalized) accuracy (Eq. 1). We normalized it in such a way that a constant prediction equal to the average target value yields a score of 0; the best score is 1. During the challenge we did not normalize the accuracy, but this normalization does not affect the ranking. Normalizing makes the accuracy more comparable to the \(R^2\) and the results easier to interpret. The results obtained with the \(R^2\) metric (Eq. 2) are indeed similar, except that the third and fourth ranking teams are swapped. The advantage of using the accuracy over the \(R^2\) is that it is less sensitive to outliers.

  • As a classification problem. The AUC metric (for which random guesses yield a score of 0.5, and exact predictions a score of 1) yields slightly different results: the fourth ranking team performs best according to that metric. Classification is generally an easier problem than regression, and indeed the classification results are quite good compared to the regression results.

Fig. 4.

Distribution of final scores for each trait and performance of the individual teams. We see that “Agreeableness” is consistently harder for the top ranking teams to predict.

Fig. 5.

Receiver operating characteristic curves of the median prediction for each trait, with the median taken over all ranked teams’ predictions (left), and averaged over all traits for each team (right).

Fig. 6.

Correlation matrices. Correlation for all videos between ground truth labels (left), and between the median predictions of the teams (right).

For the regression analysis, we graphically represented the Accuracy results (the official ranking score) as a box plot (Fig. 4) showing the distribution of scores for each trait and the overall accuracy. For the classification analysis, we show ROC curves in Fig. 5. In both cases Agreeableness seems significantly harder to predict than other traits, while Conscientiousness is the easiest (albeit with a large variance). We also see that all top ranking teams have similar ROC curves.

An analysis of the correlation between the five personality traits for both the ground truth and the median predictions (Fig. 6) shows some correlation between labels, particularly within the group Extraversion, Neuroticism, and Openness. This remains true for the teams’ predictions, where Agreeableness is also significantly correlated with that group. The correlation between any given pair of traits is 25–35 % higher for the teams’ predictions than for the ground truth. Nothing in the challenge setting encourages methods to “orthogonalize” decisions about traits: the predictors devised by the teams make joint predictions of all five personality traits and may easily learn correlations between traits.
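For reference, matrices of this kind can be reproduced from any (number of videos × 5) score matrix with a one-line Pearson correlation; the arrays named in the comments are hypothetical placeholders, not files distributed with the challenge.

```python
import numpy as np

# Trait order: Extraversion, Agreeableness, Conscientiousness, Neuroticism, Openness
def trait_correlations(scores):
    """scores: array of shape (n_videos, 5); returns the 5x5 Pearson correlation matrix."""
    return np.corrcoef(scores, rowvar=False)

# corr_gt   = trait_correlations(ground_truth_scores)   # left panel of Fig. 6
# corr_pred = trait_correlations(median_predictions)    # right panel of Fig. 6
```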

Fig. 7.

Ground truth vs. average prediction for extraversion. Each dot represents a video. The average is taken over all final submissions.

In Fig. 7, we also investigated the quality of the predictions with scatter plots of the predictions vs. the ground truth; we show an example for the trait Extraversion. The x-axis shows the ground truth and the y-axis the median prediction of all the teams. We linearly regressed the predictions against the ground truth; the first diagonal corresponds to ideal predictions. Similar plots are obtained for all traits and all teams. As can be seen, the points do not gather around the first diagonal and the two lines have different slopes. We interpret this as follows: there are two sources of error, a systematic error corresponding to a bias of the predictions towards the average ground truth value, and a random error. Essentially, the models are under-fitting (they are biased towards the constant prediction).
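The under-fitting diagnosis can be quantified with a least-squares fit of the predictions against the ground truth (a sketch on hypothetical per-trait arrays, not the code used to produce Fig. 7): a slope well below 1 indicates predictions shrunk towards the mean, and the residual spread measures the random error.

```python
import numpy as np

def prediction_bias(t, p):
    """t, p: 1-D arrays of ground-truth and predicted scores for one trait.
    Returns the slope and intercept of the least-squares fit p ~ a*t + b, and the
    residual standard deviation; a << 1 means predictions are biased towards the
    constant (average) prediction."""
    a, b = np.polyfit(t, p, deg=1)
    residual = p - (a * t + b)   # random error around the fitted line
    return a, b, residual.std()
```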

5 Discussion and Future Work

This paper has described the main characteristics of the ChaLearn Looking at People 2016 Challenge, which included the first round of the competition on First Impressions. A large dataset was designed with manual selection of videos, AMT pairwise video annotation to alleviate labeling bias, and reconstruction of cardinal ratings by fitting a BTL model. The data were made publicly available to the participants for a fair and reproducible comparison of performance. Analyzing the methods used by the 9 teams that participated in the final evaluation and uploaded their models (out of a total of 84 participants), several conclusions can be drawn:

  • There was a strong competitive spirit during the challenge, and the final results are close to one another even though the methods are quite diverse.

  • Feature learning (via deep learning methods) dominates the analysis, but pretrained models are widely used (perhaps due to the limited amount of available training data).

  • Late fusion is generally applied, though additional layers fusing higher level representations from separate video and audio streams are often used.

  • Video is usually analyzed on a per-frame basis, pooling the video features or fusing the predictions. The second-place winner is an exception, using an LSTM to integrate the temporal information.

  • Many teams used contextual cues and extracted faces, but some top ranking teams did not.

Even though performance is already quite good, it is still difficult to tell from the above analysis how close the methods are to human-level performance. Since there is a wide variety of complementary approaches, and to push participants to improve their performance by joining forces, we are organizing a first coopetition (a combination of competition and collaboration) for ICPR 2016. In this first edition of the coopetition, we reward participants for sharing their code by combining the traditional accuracy score with the number of downloads of their code. With this setting, the methods are evaluated not only by the organizers but also by the other participants.

We are preparing a more sophisticated coopetition that will include more interactive features, such as the possibility for teams to share modules of their overall system. To that end, we will exploit CodaLab worksheets (http://worksheets.codalab.org), a new feature resembling IPython notebooks, which allows users to share code (not limited to Python) intermixed with text, data, and results. We are working on integrating into CodaLab worksheets a system of reward mechanisms suitable for keeping challenge participants engaged.

As mentioned in the introduction, the First Impressions challenge is part of a larger project on Speed Interviews for job hiring purposes. Some of our next steps will consist of including more modalities that can be used together with audio-RGB data as part of a multimedia CV. Examples of such modalities include handwritten letters and/or traditional CVs.