
1 Introduction

“You don’t get a second chance to make a first impression”, as the saying famously goes. First impressions are rapid judgments of personality traits and complex social characteristics like dominance, hierarchy, warmth, and threat [13]. Accurate first impressions of personality traits have been shown to be possible when observers were exposed to relatively short intervals (4 to 10 min) of ongoing streams of individuals’ behavior [1, 4], and even to static photographs presented for 10 s [2]. Most remarkably, trait assignment by human observers has been shown to occur in as little as 100 ms [5].

Personality is a strong predictor of important life outcomes such as happiness and longevity; quality of relationships with peers and family; occupational choice, satisfaction, and performance; community involvement; criminal activity; and political ideology [6, 7]. Personality plays an important role in the way people manage the images they convey in self-presentations and employment interviews, as they try to shape the audience’s first impressions and increase their effectiveness. Among the many factors influencing employment interview outcomes, such as social factors, interviewer-applicant similarity, application fit, information exchange, preinterview impressions, applicant characteristics (appearance, age, gender), disabilities, and training [8], personality traits are among the most influential [9].

The key assumption of personality psychology is that stable individual characteristics result in stable behavioral patterns that people tend to display independently of the situation [10]. The Five Factor Model (or Big Five) is currently the dominant paradigm in personality research. It models human personality along five dimensions: Extraversion, Agreeableness, Conscientiousness, Neuroticism, and Openness. Many studies have confirmed the consistency and universality of this model.

In the field of Computer Science, Personality Computing studies how machines can automatically recognize or synthesize human personality [10]. The literature in Personality Computing is considerable. Methods have been proposed for recognizing personality from nonverbal aspects of verbal communication [11, 12], multimodal combinations of speaking style (prosody, intonation, etc.) and body movements [13–18], facial expressions [19, 20], and combinations of acoustic with visual cues or physiological with visual cues [19, 21–23]. Visual cues can refer to eye gaze [14], frowning, head orientation [22, 23], mouth fidgeting [14], primary facial expressions [19, 20], or characteristics of primary facial expressions such as presence, frequency, or duration [19].

As far as we know, there is no common data corpus in personality computing, and no benchmarking effort has yet been organized. This is a major impediment to the further advancement of this line of research and the main motivation for this challenge. The challenge is part of a larger project which studies the outcomes of job interviews. We have designed a dataset collected from publicly available YouTube videos in which people talk to the camera in a self-presentation context; the setting is similar to video-conference interviews. Consistent with research in psychology and the related literature on automatic personality computing, we labeled the data based on the Big Five model using Amazon Mechanical Turk (see Sect. 3). We are running a second round for the ICPR 2016 conference, which will take the form of a coopetition in which participants both compete and collaborate by sharing their code.

This challenge belongs to a series of events organized by ChaLearn since 2011: the 2011–2012 user-dependent One-shot-learning Gesture Recognition challenge [24, 25], the 2013–2014 user-independent Multi-modal Gesture Recognition challenge, the 2014–2015 human pose recovery and action recognition challenges [26, 27], and the 2015–2016 cultural event recognition [28] and apparent age estimation [29, 30] challenges. The 2016 edition is the first in which we organize a First Impressions challenge on automatic personality recognition.

The rest of this paper is organized as follows: Sect. 2 presents the schedule of the competition and the evaluation procedures; Sect. 3 describes the data we collected; Sect. 4 presents, compares, and discusses the methods submitted to the competition; and Sect. 5 concludes the paper with an extended discussion and suggestions for future work.

2 Challenge Protocol, Evaluation Procedure, and Schedule

The ECCV ChaLearn LAP 2016 challenge consisted of a single-track competition to quantitatively evaluate the recognition of the apparent Big Five personality traits from multi-modal audio+RGB data of YouTube videos. The challenge was managed using Microsoft’s open-source CodaLab platform. The participants had to submit prediction results during the challenge, and the winners had to publicly release their source code.

The competition had two phases:

  • A development phase during which the participants had access to 6,000 manually labeled continuous video sequences of 15 s each. These training videos represent 60 % of the total set and are randomly grouped into 75 training batches. Participants could get immediate feedback on their prediction performance by submitting results on an unlabeled validation set of 2,000 videos (20 % of the total set), also randomly grouped into 25 validation batches.

  • A final phase during which the competitors could submit their predictions on 2,000 new test videos (the remaining 20 % of the total set, also grouped into 25 test batches). The prediction scores on test data were not revealed until the end of the challenge.

2.1 Evaluation Metrics

The participants of the different teams trained their models to imitate human judgments, which consist of continuous target values in the range [0, 1] for each trait. Thus, their goal was to produce, for each video in the validation or test set, five continuous prediction values in the range [0, 1], one for each trait.

For this task (similar in spirit to regression), the evaluation consisted of computing the mean accuracy over all traits and videos. The accuracy for each trait is defined as:

$$\begin{aligned} A = 1 - \sum_{i=1}^{N_t} |t_i - p_i| \Big/ \sum_{i=1}^{N_t} |t_i - \overline{t}| \end{aligned}$$
(1)

where \(p_i\) are the predicted scores, \(t_i\) the ground truth scores, the sums run over the \(N_t\) test videos, and \(\overline{t}\) is the average ground truth score over all videos. Additionally, we also computed (but did not use to rank the participants) the coefficient of determination:

$$\begin{aligned} R^2 = 1 - \sum_{i=1}^{N_t} (t_i - p_i)^2 \Big/ \sum_{i=1}^{N_t} (t_i - \overline{t})^2. \end{aligned}$$
(2)

We also turned the problem into a classification problem by thresholding the target values at 0.5. This yields five binary classification problems (one per trait). We used the Area Under the ROC Curve (AUC) to estimate classification accuracy.
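For concreteness, the three evaluation measures can be computed in a few lines. The sketch below is our own illustration rather than the official evaluation script; it assumes `t` and `p` are NumPy arrays of shape (number of videos, 5) holding ground-truth and predicted trait scores, and writes the accuracy in its normalized form (Eq. 1).

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def challenge_metrics(t, p):
    """t, p: arrays of shape (n_videos, 5) with values in [0, 1]."""
    abs_dev = np.abs(t - t.mean(axis=0)).sum(axis=0)         # per-trait denominator of Eq. 1
    accuracy = 1 - np.abs(t - p).sum(axis=0) / abs_dev       # normalized accuracy (Eq. 1)
    r2 = 1 - ((t - p) ** 2).sum(axis=0) / ((t - t.mean(axis=0)) ** 2).sum(axis=0)  # Eq. 2
    # Classification view: threshold the ground truth at 0.5, keep continuous predictions
    auc = np.array([roc_auc_score(t[:, k] >= 0.5, p[:, k]) for k in range(t.shape[1])])
    return accuracy, r2, auc  # one value per trait; teams are ranked by accuracy.mean()
```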

Fig. 1.

Progress of the highest validation-set leaderboard scores of all teams for each trait, and of the highest overall ranking score (mean accuracy over all traits). The score used is the accuracy, normalized as in Eq. 1.

2.2 Schedule

The competition lasted two months and attracted 84 participants, who were grouped into several teams. The schedule was the following:

May 15, 2016: Beginning of the quantitative competition, release of the development data (with labels) and validation data (without labels).

June 30, 2016: Release of encrypted final evaluation data (without labels). Participants can start training their methods with the whole data set.

July 2, 2016: Deadline for code submission.

July 3, 2016: Release of final evaluation data decryption key. Participants start predicting the results on the final evaluation data.

July 13, 2016: End of the quantitative competition. Deadline for submitting the predictions over the final evaluation data. The organizers started the code verification by running it on the final evaluation data.

July 15, 2016: Deadline for submitting the fact sheets. Release of the verification results to the participants for review. Participants of the top ranked teams are invited to follow the workshop submission guide for inclusion in the ChaLearn LAP 2016 Workshop on Apparent Personality Analysis at ECCV 2016.

As can be seen in Fig. 1, progress was made throughout the challenge, with improvements until the very end. When the challenge ended, there was still a noticeable difference between the average of the best per-trait accuracies and the best overall team accuracy, because some teams’ methods performed better on some traits than on others. This shows that there is still room for improvement and that the teams’ methods are complementary. We expect further improvements from the ongoing coopetition (second round of the challenge).

3 Competition Data

The data set consists of 10,000 clips extracted from more than 3,000 different YouTube high-definition (HD) videos of people facing and speaking in English to a camera. The people appearing are of different genders, ages, nationalities, and ethnicities, which makes the task of inferring apparent personality traits more challenging. In this section, we provide the details of the data collection, the data preparation, and the final data set.

Fig. 2.

Data collection web page. Comparing pairs of videos, the AMT workers had to indicate their preference for five attributes representing the “Big Five” personality traits, following these instructions: “You have been hired as a Human Resource (HR) specialist in a company, which is rapidly growing. Your job is to help screening potential candidates for interviews. The company is using two criteria: (A) competence, and (B) personality traits. The candidates have already been pre-selected for their competence for diverse positions in the company. Now you need to evaluate their personality traits from video clips found on the Internet and decide to invite them or not for an interview. Your tasks are the following. (1) First, you will compare pairs of people with respect to five traits: Extraversion = Friendly (vs. reserved); Agreeableness = Authentic (vs. self-interested); Conscientiousness = Organized (vs. sloppy); Neuroticism = Comfortable (vs. uneasy); Openness = Imaginative (vs. practical). (2) Then, you will decide who of the 2 people you would rather interview for the job posted.” In this challenge we did not use the answers to the last question.

3.1 Video Data

We collected a large pool of HD (720p) videos from YouTube. After viewing a large number of videos, we found Q&A videos to be a particularly suitable and abundant source of talking-to-the-camera footage: they generally feature few people, little background motion, and clear voice. Since YouTube videos are organized in channels, which can contain a variable number of videos, we limited the number of videos per YouTube channel (author) to 3 in order to keep a balance of unique subjects.

After downloading an initial pool of 13,951 YouTube videos using the pytube Python API, we manually filtered out unsuitable footage (sequences that were too short or non-English speakers). From the remaining 8,581 videos, we automatically generated a set of 32,139 clips of 15 s each. The clip generation was done automatically by searching for continuous 15-second video segments in which one and only one face appeared. Faces were detected using the Viola-Jones detector from OpenCV [31]. We retained only faces with at least one visible eye, with eyes also detected using Viola-Jones. To increase robustness, we kept only those clips meeting both criteria (“one and only one face bounding box containing at least one eye”) in 75 % of the frames. Since videos were of various durations, we limited the number of clips per video to at most 6.
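The per-frame check behind this filter can be sketched as follows using OpenCV’s stock Haar cascades. This is a minimal illustration of the criterion rather than the actual pipeline used to build the dataset; the helper functions and cascade choices are ours.

```python
import cv2

face_cascade = cv2.CascadeClassifier(cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
eye_cascade = cv2.CascadeClassifier(cv2.data.haarcascades + "haarcascade_eye.xml")

def frame_is_valid(frame):
    """True if the frame contains exactly one face with at least one visible eye."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) != 1:
        return False
    x, y, w, h = faces[0]
    eyes = eye_cascade.detectMultiScale(gray[y:y + h, x:x + w])
    return len(eyes) >= 1

def clip_is_valid(frames, min_ratio=0.75):
    """Keep a 15-s clip only if at least 75% of its frames pass the face/eye check."""
    valid = sum(frame_is_valid(f) for f in frames)
    return valid >= min_ratio * len(frames)
```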

We then performed a second fine-grained manual filtering – this time considering clips, instead of whole videos – using a custom web interface to filter out those clips not meeting the following criteria:

  • One unique person in the foreground, at a safe distance from the camera.

  • Good quality of audio and images.

  • Only English speaking.

  • Only people above roughly 13–15 years old; unidentified babies appearing with their parents might be allowed.

  • Not too much camera movement (a changing background is allowed, but avoid a constantly blurred foreground).

  • No adult or violent content (except people casually talking about sex or answering Q&A in an acceptable manner). Discard any libelous, doubtful, or problematic content.

  • No nudity (except if only the parts above the shoulders and neck are visible).

  • People in the background are allowed (crowd, audience), provided they are not talking and their faces are at low resolution, to avoid any confusion with the speaker.

  • No advertisement (visual or audio information about products or company names).

  • Avoid visual or audio cuts (abrupt changes).

From this second manual filter, we obtained the final set of 10,000 clips, corresponding to 3,060 unique originating videos, i.e. a mean of 3.27 clips per video. In terms of duration, the clips amount to 41.6 h of footage pooled from 608.7 h of originating videos.

Table 1. Video data preparation and final data set statistics.

The originating videos were provided by 2,764 unique YouTube channels. Note, however, that the number of channels does not correspond to the number of people (a YouTuber can have several channels or appear in other YouTubers’ channels), but it provides an estimate of the diversity of people appearing in the data set. The originating videos are also quite diverse in their number of views and their 5-star ratings, which also helped to alleviate bias towards any particular kind of video. This information is summarized in Table 1 together with other statistics computed from the videos’ metadata. The table also lists the 20 most common keywords (or tags) associated with the originating videos. As stated before, we focused on Q&A videos, often related to other video content such as vlogging, HOW TOs, and beauty tips (mostly makeup).

3.2 Ground-Truth Estimation

Obtaining ground truth for personality traits can be challenging. Before deciding to use human labeling of videos, we considered administering self-report personality tests to subjects we interviewed ourselves. We concluded that such test results are biased and variable, and that conducting our own interviews would not allow us to collect massive amounts of data. Therefore, for this dataset, we resorted to using the perceptions of human subjects viewing the videos. This is a different task from evaluating real personality traits, but it is equally useful in the context of human interaction (e.g. job interviews, dating, etc.).

To rapidly obtain a large number of labels, we used Amazon Mechanical Turk (AMT), as is now common in computer vision [32]. Our budget allowed us to collect multiple votes per video, in an effort to reduce variance. However, because each worker (a.k.a. voter) contributes only a few labels in a large dataset, this raises the problem of bias and the need to calibrate the labels. Biases, which can stem for example from harshness or from prejudices related to race, age, gender, or culture, are very hard to measure.

Fig. 3.

Screenshot of sample videos voted to clearly perceive the traits, on either end of the spectrum.

We addressed this problem by using pairwise comparisons. We designed a custom interface (see Fig. 2).

Each AMT worker labeled small batches of pairs of videos. To ensure good coverage and some overlap in the labeling of pairs of videos across workers, we generated pairs with a small-world algorithm [33]. Small-world graphs provide high connectivity, avoid disconnected regions in the graph, have well-distributed edges, and keep distances between nodes short [34] (Fig. 3).
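As an illustration of pair generation (not the organizers’ exact procedure), a Watts-Strogatz small-world graph over the videos can be used: each node is a video and each edge becomes one pairwise comparison assigned to an AMT worker. The parameters k and p below are purely illustrative; the total number of comparisons, n·k/2, is what the annotation budget constrains.

```python
import random
import networkx as nx

def generate_comparison_pairs(n_videos, k=8, p=0.1, seed=0):
    """Each video is a node; each edge of the small-world graph is one pairwise comparison.

    k: each node is joined to its k nearest ring neighbours before rewiring.
    p: probability of rewiring each edge (adds long-range "shortcut" comparisons).
    """
    g = nx.watts_strogatz_graph(n_videos, k, p, seed=seed)
    pairs = list(g.edges())
    random.Random(seed).shuffle(pairs)  # mix local and long-range pairs across HITs
    return pairs

pairs = generate_comparison_pairs(10000, k=8, p=0.1)  # n*k/2 = 40,000 comparisons in this toy setting
```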

Cardinal scores were obtained by fitting a Bradley-Terry-Luce (BTL) model [35]. This is a probabilistic model in which the probability that an object j is judged to have more of an attribute than an object i is a sigmoid function of the difference between their cardinal scores. Maximum likelihood estimation was used to fit the model. Further details and explanations of the procedure used to convert pairwise comparisons into cardinal scores are provided in a companion paper [36], where a study is conducted to evaluate how many videos we could label within the constraints of our financial budget. We ended up affording 321,684 pairs to label the 10,000 videos.
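A minimal sketch of such a fit (our own simplified implementation, not the code of [36]): each video i receives a latent score s_i, the probability that i is preferred over j is the sigmoid of s_i - s_j, and the log-likelihood of the observed comparisons is maximized by gradient ascent.

```python
import numpy as np

def fit_btl(wins, n_items, lr=0.05, epochs=200):
    """wins: list of (winner, loser) index pairs from the AMT comparisons.
    Returns latent scores; higher = more of the attribute."""
    s = np.zeros(n_items)
    w = np.array(wins)
    winners, losers = w[:, 0], w[:, 1]
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(s[winners] - s[losers])))  # P(winner beats loser)
        grad = np.zeros(n_items)
        np.add.at(grad, winners, 1.0 - p)    # d log-likelihood / d s_winner
        np.add.at(grad, losers, -(1.0 - p))  # d log-likelihood / d s_loser
        s += lr * grad
        s -= s.mean()                        # scores are only defined up to a constant
    return s

# Cardinal labels in [0, 1] can then be obtained by rescaling, e.g. min-max:
# labels = (s - s.min()) / (s.max() - s.min())
```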

4 Challenge Results and Methods

In this section we summarize the methods proposed by the teams and provide a detailed description of the winning methods. The teams submitted their code and predictions for the test sets; the source code is available from the challenge website. We then provide a statistical analysis of the results and highlight overall aspects of the competition.

4.1 Summary of Methods Used

In Table 2 we summarize the approaches of the teams who participated in the final phase, uploaded their models, and returned the survey about their methods that we asked them to complete (the so-called “fact sheets”).

The vast majority of approaches, including the best performing methods, used both the audio and the video modalities. Most of the teams represented the audio with handcrafted spectral features, a notable exception being the method proposed by team DCC, where a residual network [37] was used instead. For the video modality, the dominant approach was to learn the representations through convolutional neural networks [38]. In most methods, the modalities were late-fused before being fed to different regressors such as fully connected neural networks or Support Vector Regressors. A notable exception is the method proposed by team evolgen, which includes temporal structure by partitioning the video sequences and sequentially feeding the learned audio-video representation to a recurrent Long Short-Term Memory (LSTM) layer [39].

Most teams made semantic assumptions about the data by separating the face from the background, usually via preprocessing such as face frontalisation. However, it is important to note that the winning method of team NJU-LAMDA does not make any kind of semantic separation of the content.

Finally, a common approach was to use pre-trained deep models fine-tuned on the dataset provided for this challenge. The readers are referred to Table 2 for a synthesis of the main characteristics of the methods that have been submitted to this challenge and to Table 3 for the achieved results. Next, we provide a more detailed description of the three winning methods.

Table 2. Overview of the team methods comparing pretraining (topology and data), preprocessing if performed, representation, learning strategy per modality and fusion.

First place: The NJU-LAMDA team proposed two separate models for still images and audio, processing multiple frames from the video and employing a two-step late fusion of the frame and audio predictions [40]. For the video modality, the team proposed DAN+, an extension of Descriptor Aggregation Networks [43] which applies max and average pooling at two different layers of the CNN, normalizing and concatenating the outputs before feeding them to fully connected layers. A pretrained VGG-face model [44] is used, with its fully-connected layers replaced and the network fine-tuned on the First Impressions dataset. For the audio modality, the team employs log filter bank (logfbank) features and a single fully-connected layer with sigmoid activations. At test time, a predefined number of frames is fed to the visual network and the predictions are averaged. The final visual prediction is averaged again with the output of the audio predictor.
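Schematically, the two-step late fusion can be written as below. This is a sketch of the published description rather than the team’s released code; the number of sampled frames and the model callables are placeholders.

```python
import numpy as np

def fuse_predictions(visual_model, audio_model, frames, logfbank_feats, n_frames=100):
    """frames: list of video frames; logfbank_feats: log filter bank features of the clip.
    visual_model / audio_model: callables returning a 5-dim trait vector in [0, 1]."""
    idx = np.linspace(0, len(frames) - 1, n_frames).astype(int)  # sample a fixed number of frames
    frame_preds = np.stack([visual_model(frames[i]) for i in idx])
    visual_pred = frame_preds.mean(axis=0)       # step 1: average over frame-level predictions
    audio_pred = audio_model(logfbank_feats)
    return (visual_pred + audio_pred) / 2.0      # step 2: average visual and audio predictions
```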

Second place: The evolgen team proposed a multimodal LSTM architecture for predicting the personality traits [41]. In order to maintain the temporal structure, the input video sequences are split into six non-overlapping partitions. From each partition, the audio representation is extracted using classical spectral features and statistical measurements, forming a 68-dimensional feature vector. The video representation is extracted by randomly selecting a frame from the partition, extracting the face, and centering it through face alignment. The preprocessed data are passed to a recurrent CNN, trained end-to-end, which uses separate pipelines for audio and video. Each partition’s frame is processed with convolutional layers, after which a linear transform reduces the dimensionality. The audio features of a given partition go through a linear transform and are concatenated with the frame features. The recurrent layer is sequentially fed with the features extracted from each partition. In this way, the recurrent network captures variations in audio and facial expressions for personality trait prediction.
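A compact PyTorch sketch of this type of architecture is given below. It is our reconstruction from the description above, with made-up layer sizes and without the training loop; see [41] for the actual model.

```python
import torch
import torch.nn as nn

class AudioVisualLSTM(nn.Module):
    def __init__(self, audio_dim=68, frame_dim=128, hidden_dim=128, n_traits=5):
        super().__init__()
        # Small CNN turning one face crop per partition into a frame_dim vector
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 16, 5, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, frame_dim),
        )
        self.audio_fc = nn.Linear(audio_dim, audio_dim)  # linear transform of audio features
        self.lstm = nn.LSTM(frame_dim + audio_dim, hidden_dim, batch_first=True)
        self.head = nn.Sequential(nn.Linear(hidden_dim, n_traits), nn.Sigmoid())

    def forward(self, frames, audio):
        # frames: (batch, 6, 3, H, W) - one random face crop per partition
        # audio:  (batch, 6, 68)      - spectral/statistical features per partition
        b, t = frames.shape[:2]
        f = self.cnn(frames.flatten(0, 1)).view(b, t, -1)
        x = torch.cat([f, self.audio_fc(audio)], dim=-1)
        out, _ = self.lstm(x)
        return self.head(out[:, -1])  # predict the 5 traits from the last time step
```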

Third place: The DCC team proposed a multimodal personality trait recognition model comprising two separate auditory and visual streams (deep residual networks, 17 layers each), followed by an audiovisual stream (one fully-connected layer with hyperbolic tangent activation) that is trained end-to-end to predict the Big Five personality traits [42]. There is no pretraining, but a simple preprocessing step is performed in which a random frame and a random crop of the audio are selected as inputs. At test time, the whole audio and video sequences are fed into the auditory and visual streams, with average pooling applied before the fully-connected layer.

All three winning methods use separate streams for audio and video, applying neural networks in both streams. The first- and second-place teams both use some form of data preprocessing, with NJU-LAMDA relying on logfbank features for the audio and evolgen on face cropping and spectral audio features. The second- and third-place methods both use end-to-end training, fusing the audio and video streams with fully-connected layers.

4.2 Statistical Analysis of the Results

Table 3 lists the results on test data using the different metrics. One can observe very close and competitive results among the top five teams; the results of the top ranking teams are within the error bars.

For comparison, we also report the results obtained by using the median predictions of all ranked teams; no improvement is gained from this voting scheme. We also show a “random guess” baseline, which corresponds to randomly permuting these predictions.

Table 3. Results of the first round of the Personality Trait challenge. Top: the Accuracy score used to rank the teams (Eq. 1). Middle: \(R^2\) score (Eq. 2). Bottom: Area under the ROC Curve (AUC) evaluating predictions by turning the problem into a classification problem. The error bars are the standard deviations computed with the bootstrap method. The best results are indicated in bold.

We treated the problem either as a regression problem or as a classification problem:

  • As a regression problem. The metric used in the challenge to rank teams is the mean (normalized) accuracy (Eq. 1). We normalized it in such a way that a constant prediction equal to the average target value yields a score of 0; the best score is 1. During the challenge we did not normalize the accuracy, but this normalization does not affect the ranking. Normalizing makes the accuracy more comparable to the \(R^2\) and the results easier to interpret. The results obtained with the \(R^2\) metric (Eq. 2) are indeed similar, except that the third and fourth ranking teams are swapped. The advantage of using the accuracy over the \(R^2\) is that it is less sensitive to outliers.

  • As a classification problem. The AUC metric (for which random guesses yield a score of 0.5, and exact predictions a score of 1) yields slightly different results: the fourth ranking team performs best according to that metric. Classification is generally an easier problem than regression, and indeed the classification results are quite good compared to the regression results.

Fig. 4.

Distribution of final scores for each trait and performance of the individual teams. We see that “Agreeableness” is consistently harder for the top ranking teams to predict.

Fig. 5.

Receiver operating characteristic curves of the median prediction for each trait, with the median taken over all ranked teams’ predictions (left), and averaged over all traits for each team (right).

Fig. 6.

Correlation matrices. Correlation for all videos between ground truth labels (left), and between the median predictions of the teams (right).

For the regression analysis, we graphically represented the Accuracy results (the official ranking score) as a box plot (Fig. 4) showing the distribution of scores for each trait and the overall accuracy. For the classification analysis, we show ROC curves in Fig. 5. In both cases Agreeableness seems significantly harder to predict than other traits, while Conscientiousness is the easiest (albeit with a large variance). We also see that all top ranking teams have similar ROC curves.

An analysis of the correlation between the five personality traits for both the ground truth and the median predictions (Fig. 6) shows some correlation between labels, particularly within the group Extraversion, Neuroticism, and Openness. This remains true for the teams’ predictions, where Agreeableness is also significantly correlated with that group. The correlation between any given pair of traits is 25–35 % higher for the teams’ predictions than for the ground truth. Nothing in the challenge setting encourages methods to “orthogonalize” decisions about traits: the predictors devised by the teams make joint predictions of all five personality traits and may easily learn correlations between traits.
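For reference, matrices of this kind can be reproduced from any (number of videos × 5) score matrix with a one-line Pearson correlation; the arrays named in the comments are hypothetical placeholders, not files distributed with the challenge.

```python
import numpy as np

# Trait order: Extraversion, Agreeableness, Conscientiousness, Neuroticism, Openness
def trait_correlations(scores):
    """scores: array of shape (n_videos, 5); returns the 5x5 Pearson correlation matrix."""
    return np.corrcoef(scores, rowvar=False)

# corr_gt   = trait_correlations(ground_truth_scores)   # left panel of Fig. 6
# corr_pred = trait_correlations(median_predictions)    # right panel of Fig. 6
```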

Fig. 7.

Ground truth vs. average prediction for extraversion. Each dot represents a video. The average is taken over all final submissions.

In Fig. 7, we also investigated the quality of the predictions with scatter plots of the predictions vs. the ground truth; we show an example for the trait Extraversion. The x-axis shows the ground truth and the y-axis the median prediction of all the teams. We linearly regressed the predictions against the ground truth; the first diagonal corresponds to ideal predictions. Similar plots are obtained for all traits and all teams. As can be seen, the points do not gather around the first diagonal and the two lines have different slopes. We interpret this as follows: there are two sources of error, a systematic error corresponding to a bias of the predictions towards the average ground truth value, and a random error. Essentially, the models are under-fitting (they are biased towards the constant prediction).
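The under-fitting diagnosis can be quantified with a least-squares fit of the predictions against the ground truth (a sketch on hypothetical per-trait arrays, not the code used to produce Fig. 7): a slope well below 1 indicates predictions shrunk towards the mean, and the residual spread measures the random error.

```python
import numpy as np

def prediction_bias(t, p):
    """t, p: 1-D arrays of ground-truth and predicted scores for one trait.
    Returns the slope and intercept of the least-squares fit p ~ a*t + b, and the
    residual standard deviation; a << 1 means predictions are biased towards the
    constant (average) prediction."""
    a, b = np.polyfit(t, p, deg=1)
    residual = p - (a * t + b)   # random error around the fitted line
    return a, b, residual.std()
```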

5 Discussion and Future Work

This paper has described the main characteristics of the ChaLearn Looking at People 2016 Challenge, which included the first round of the competition on First Impressions. A large dataset was designed with manual selection of videos, AMT pairwise video annotation to alleviate labeling bias, and reconstruction of cardinal ratings by fitting a BTL model. The data were made publicly available to the participants for a fair and reproducible comparison of performance. Analyzing the methods used by the 9 teams that participated in the final evaluation and uploaded their models (out of a total of 84 participants), several conclusions can be drawn:

  • There was a strong competitive spirit during the challenge, and the final results are close to one another even though the methods are quite diverse.

  • Feature learning (via deep learning methods) dominates the analysis, but pretrained models are widely used (perhaps due to the limited amount of available training data).

  • Late fusion is generally applied, though additional layers fusing higher level representations from separate video and audio streams are often used.

  • Video is usually analyzed on a per-frame basis, pooling the video features or fusing the predictions. The second-place winner is an exception, using an LSTM to integrate the temporal information.

  • Many teams used contextual cues and extracted faces, but some top ranking teams did not.

Even though performance is already quite good, it is still difficult to tell from the above analysis how close the methods are to human-level performance. Since there is a wide variety of complementary approaches, and to push participants to improve their performance by joining forces, we are organizing a first coopetition (a combination of competition and collaboration) for ICPR 2016. In this first edition of the coopetition, we reward participants for sharing their code by combining the traditional accuracy score with the number of downloads of their code. With this setting, the methods are evaluated not only by the organizers but also by the other participants.

We are preparing a more sophisticated coopetition that will include more interactive features, such as the possibility for teams to share modules of their overall system. To that end, we will exploit CodaLab worksheets (http://worksheets.codalab.org), a new feature resembling IPython notebooks, which allows users to share code (not limited to Python) intermixed with text, data, and results. We are working on integrating into CodaLab worksheets a system of reward mechanisms suitable for keeping challenge participants engaged.

As mentioned in the introduction, the First Impressions challenge is part of a larger project on Speed Interviews for job hiring purposes. Some of our next steps will consist of including more modalities that can be used together with audio-RGB data as part of a multimedia CV. Examples of such modalities include handwritten letters and/or traditional CVs.