Genetic algorithms reveal profound individual differences in emotion recognition

Significance
We developed a genetic algorithm tool that allows users to refine depictions of facial expressions until they match what they think the expression of a particular emotion should look like. The tool provides an efficient sampling of expression space, ideally suited to capturing individual differences in emotion recognition. We found that individual differences in the expressions subjects generated via our procedure account for differences in emotion recognition performance. Our findings advance research on emotion processing by demonstrating that the same stimulus can elicit different responses across people, which may reflect individual differences in the extent to which it is recognized as an instance of a visual category, rather than differences in brain mechanisms specialized for processing affective stimuli.


GA Framework
Evolution of preferred expressions with the GA. The GA allows users to evolve photorealistic 3D meshes of facial expressions through a combination of gradual refinement and random processes across generations. These random processes can produce anatomically implausible facial configurations, a risk that is automatically mitigated by corrective mechanisms (see Fig. 2 in (1)).
Facial stimuli are uniquely defined by vectors of blendshape weights, which play the role of chromosomes (with individual weights as genes) in the context of a genetic algorithm. During the genetic evolution of facial stimuli, the participant refines this latent quantitative representation through repeated assessment of visual samples. Specifically, evolution by a genetic algorithm consists of repeated selection of favourable samples from iteratively refined populations, where the refinement is driven by prior selections. At the same time, inherent randomness in the GA (through, for example, its mutation and population-boosting operators) enables exploration of the facial dynamics space and prevents premature convergence.
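To make this representation concrete, the sketch below (ours, not the authors' code) encodes a chromosome as a NumPy vector of blendshape weights; the dimensionality of 125 is borrowed from the largest target complexity used in the convergence simulations and is otherwise an assumption.

```python
import numpy as np

# Illustrative sketch: a facial expression encoded as a chromosome,
# i.e. a vector of blendshape weights (genes), each in [0, 1].
# N_BLENDSHAPES = 125 is an assumption borrowed from the simulation
# targets; the actual face rig defines the true dimensionality.
N_BLENDSHAPES = 125

def random_chromosome(rng: np.random.Generator) -> np.ndarray:
    """One candidate expression: each gene is one blendshape weight."""
    return rng.uniform(0.0, 1.0, size=N_BLENDSHAPES)

rng = np.random.default_rng(0)
face = random_chromosome(rng)  # this vector would drive the 3D face mesh
```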
Each time the tool is initialized, a protocol-generated set of 10 expressions is displayed: two randomly generated expressions per emotion type (happy, sad, angry and fearful), one arbitrary expression, and the neutral expression. This initialization casts a net wide enough to allow proper exploration of the space and avoid premature convergence. Specifically, we initially generate faces from gene pools (sets of blendshapes) characteristic of each emotion type, as well as random faces, seeding enough diversity into the population to enable free exploration of the space for different emotion types. On each iteration of the GA, the user selects from the population the expressions most similar to some internalized target. Among an unconstrained number of selections, one (elite) face is flagged by the participant as the best; the remaining selected samples receive no further relative fitness ranking. The elite is guaranteed to propagate unchanged to the updated population, exerting sufficient selection pressure in the GA. The manner and extent of gene propagation of the non-elite selections are stochastically governed. Specifically, the two mechanisms for gene propagation are averaging, and the tandem of cross-breeding and mutation (formal definitions of these operators are given in (1)). In simple terms, averaging propagates the mean of two or more blendshape vectors to the next population. Cross-breeding and mutation, on the other hand, involve the substitution of randomly selected weights of one chromosome by those of another ("cross-breeding") and the subsequent assignment of new random values to a fixed number of arbitrary genes in the chromosome ("mutation"). Finally, to maintain diversity and avoid premature convergence, the population at each iteration is boosted by inserting 40% (4 out of 10 samples) novel samples completely uncorrelated with prior user selections. After determining when the process plateaus, we chose to terminate the iterative process after 10 iterations, with the final (preferred) face being the evolved expression approximating the emotion being created. These measures (stimulus positioning, an unrestrained number of selections, and population diversity boosting) are designed to ensure unbiased exploration of expression space, avoid premature convergence, and mitigate risks of serial dependency, whereby participants' selections might be biased by prior decisions.
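A minimal sketch of one GA iteration, under stated assumptions: the 50/50 choice between averaging and cross-breeding/mutation, and the mutation count of 3 genes, are illustrative placeholders (the formal operators are defined in (1)).

```python
import numpy as np

POP_SIZE, N_NOVEL = 10, 4  # population of 10; 40% diversity boost

def average(parents):
    """Averaging: propagate the mean of two or more selected chromosomes."""
    return np.mean(parents, axis=0)

def crossbreed_mutate(a, b, rng, n_mut=3):
    """Cross-breeding: substitute randomly chosen weights of one chromosome
    with the other's; mutation: re-randomize a fixed number of genes
    (n_mut = 3 is an illustrative placeholder)."""
    child = np.where(rng.random(a.size) < 0.5, a, b)
    idx = rng.choice(a.size, size=n_mut, replace=False)
    child[idx] = rng.uniform(0.0, 1.0, size=n_mut)
    return child

def next_generation(elite, selected, rng):
    """Elite propagates unchanged; non-elite selections propagate genes
    stochastically; novel uncorrelated samples maintain diversity."""
    pool = [elite] + list(selected)
    pop = [elite]
    while len(pop) < POP_SIZE - N_NOVEL:
        i, j = rng.choice(len(pool), size=2, replace=(len(pool) < 2))
        if rng.random() < 0.5:  # operator choice is stochastic (assumed 50/50)
            pop.append(average([pool[i], pool[j]]))
        else:
            pop.append(crossbreed_mutate(pool[i], pool[j], rng))
    pop += [rng.uniform(0.0, 1.0, size=elite.size) for _ in range(N_NOVEL)]
    return pop
```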
GA stochastic noise thresholds. Both the protocol-based initialisation of the genetic algorithm and its key population-refining processes of mutation, cross-breeding and averaging involve sampling from the uniform random distribution. Due to this stochastic element in the evolutionary process, the final evolved face will vary even given absolute consistency in the person's targeted expression. We call this variance, inherent to the generation process itself, genetic algorithm noise. Since the stimuli have a quantitative representation as blendshape vectors, we can simulate genetic evolution to quantify this noise, which provides a threshold for user data analysis. Any difference in excess of the threshold in user-generated distributions can be deemed significant, i.e. unlikely to have arisen from the stochastic nature of the generation mechanism itself. The simulation replaces user assessment with a metric comparing population samples to a target representing the average happy, sad, fearful and angry stimuli derived from participant testing. Cosine distance was used to quantify the difference between expressions, given that it provides a reliable metric for high-dimensional sparse vectors such as the blendshape representation. The cosine distance (CD) is defined as:

CD(a, b) = 1 - (a · b) / (||a|| ||b||),

where a and b are the two blendshape vectors being compared. Through 500 simulated iterations, the mean and variance of the inter-sample cosine distance over all combinations of independent final elites in the simulated distribution quantify genetic algorithm noise as the only source of variation.
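A sketch of the noise-threshold computation implied by this definition (our paraphrase; variable names are placeholders):

```python
import numpy as np
from itertools import combinations

def cosine_distance(u, v):
    """CD(u, v) = 1 - (u . v) / (||u|| ||v||)."""
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def ga_noise_threshold(final_elites):
    """Pairwise CD over independently evolved final elites that all chased
    the same target, so GA stochasticity is the only source of variation."""
    dists = [cosine_distance(a, b) for a, b in combinations(final_elites, 2)]
    return np.mean(dists), np.var(dists)
```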
GA convergence - simulations. We performed simulations to characterize the convergence of the GA. The simulations were aimed at evolving expressions that best matched targets of variable complexity (1, 3, 12 or 125 active blendshapes), using cosine distance as the relative fitness function. Across 11 iterations, each simulation selected expressions "compatible" with the target expression (flagging the "best" example amongst the selection), with the number of selections mirroring the average numbers operated by participants within iterations (see Fig. S3, A). Within each iteration, we obtained a distribution of cosine distance errors between the "best" example and the target expression (Fig. S1). Simulations showed that shifts in the mean of these error distributions become progressively smaller across iterations, converging by the 11th iteration (approximately at iterations 7-8), which mimics the participant convergence data shown in Fig. S2. We also show the target next to the expression flagged as the "best" example on the final iteration, providing visual evidence of the framework's convergence.

GA convergence - evidence from participant data. Participant data also empirically suggest GA convergence on the participants' selections. Firstly, in a previous study, participants were asked to evolve the same expression on three separate occasions (2). We observed that participants were systematic in evolving expressions that depicted their preferred facial expressions of emotions, as evidenced by lower within-subject than between-subject variability in the expressions created. Secondly, in our current study, participants rated how closely evolved expressions captured the depiction they had in mind. These ratings showed a high level of satisfaction, suggesting that evolved expressions provided good approximations of the participants' intended facial depictions. Finally, and most importantly, these evolved expressions explain participants' identification of emotion categories (as shown in Fig. 4C), which provides evidence that they capture processes that drive expression recognition behavior.

Within each GA iteration, participants must indicate one expression (the "preferred expression") amongst all the expressions they selected that best captures the target expression. Across successive iterations, we can calculate how much the preferred expressions have changed based on their distance in expression space, using cosine distance (CD), which provides a reliable metric for comparison of sparse high-dimensional vectors such as the blendshape weight representation (1). The top panel depicts the distance of preferred expressions on a given iteration relative to the previous iteration (sum of squares of CDs between expressions, averaged across subjects), showing progressively smaller differences in expressions across iterations, plateauing approximately around iteration 8 (generation 7). For a more in-depth analysis of convergence with the GA system, see (1). The bottom panel shows the cumulative sum of differences (CD) in preferred expressions across neighbouring iterations, expressed as a % of the sum total of differences across all iterations. Consistent with the top panel, this suggests a plateauing of differences near the 8th iteration.
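The per-iteration convergence metric described above can be sketched as follows (a simplified, single-subject version; the published plot additionally averages sums of squared CDs across subjects):

```python
import numpy as np

def cd(u, v):
    """Cosine distance between two blendshape weight vectors."""
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def convergence_profile(preferred):
    """preferred: one subject's preferred (elite) expression per iteration.
    Returns the CD of each iteration's preference to the previous one, and
    the cumulative change as a % of total change; a plateau in the latter
    (here, near iteration 8) marks convergence."""
    steps = np.array([cd(a, b) for a, b in zip(preferred[:-1], preferred[1:])])
    cumulative_pct = 100.0 * np.cumsum(steps) / steps.sum()
    return steps, cumulative_pct
```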

GA - effect of initialization expressions:
By relying on procedurally generated sets of starting expressions on the 1st trial, we potentially introduce an additional source of noise, since the choice of initial seed is known to impact such search algorithms. However, random starting positions are beneficial in that they provide greater flexibility for the GA to explore different areas of expression space. Given that we wanted to capture nuanced differences between participants' depictions of expressions, we opted for procedurally generated starting positions so as not to constrain the algorithm's exploration of expression space. We also wanted to avoid systematically biasing participants in expression space, which is a possibility when using a fixed starting configuration. By using procedurally generated starting configurations, we essentially treat the starting configuration as an additional source of noise in the GA. Importantly, the GA noise threshold shown in Fig. 2, which was compared against individual differences in evolved expressions, is produced from simulated data using random starting configurations, thus accounting for the noise introduced by procedurally generated initialization.
The effect of starting configuration (fixed vs. procedurally generated) was assessed through simulated data aimed at evolving expressions across 10 generations that best matched a fixed target expression, using cosine distance as the relative fitness function (as described in SI "GA convergence - simulations"). We compared the effect of fixed (the same set of 10 faces on the 1st trial across all simulations) and procedurally generated (a variable set of 10 expressions on the 1st trial across simulations) starting configurations through 500 simulations.
Comparing final-generation distributions, the convergence statistics (cosine distance error mean and standard deviation, µ ± σ) were similar for both initializations: 0.47 ± 0.162 and 0.43 ± 0.164 for fixed and protocol-generated initializations, respectively (full details can be found in (1)). While this difference was statistically significant (t(998) = 3.8, p < .001), partly reflecting the large number of samples, the effect size was small (Cohen's d = .24). Starting configuration therefore has a negligible impact on final evolved expressions. Importantly, the GA noise threshold shown in Fig. 2 demonstrates that the individual differences between participants are not the result of noise introduced by the GA procedure: these noise thresholds were obtained from simulations using non-fixed starting expression configurations, which account for the noise resulting from procedurally generated initialization.
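For transparency, this comparison reduces to a standard two-sample t-test and a pooled-SD Cohen's d. The sketch below reproduces the arithmetic on illustrative stand-in samples drawn from the reported means and SDs (not the actual simulation outputs):

```python
import numpy as np
from scipy import stats

def cohens_d(x, y):
    """Cohen's d with pooled standard deviation."""
    nx, ny = len(x), len(y)
    sp = np.sqrt(((nx - 1) * np.var(x, ddof=1) + (ny - 1) * np.var(y, ddof=1))
                 / (nx + ny - 2))
    return (np.mean(x) - np.mean(y)) / sp

# Stand-in samples parameterized by the reported mu +/- sigma (illustrative
# only; the real test used the 500 + 500 simulated error distributions).
rng = np.random.default_rng(1)
fixed_err = rng.normal(0.47, 0.162, 500)
proc_err = rng.normal(0.43, 0.164, 500)

t, p = stats.ttest_ind(fixed_err, proc_err)  # df = 998, as reported
d = cohens_d(fixed_err, proc_err)            # ~0.24, a small effect
```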

GA - Pros and Cons:
The GA is an efficient search mechanism; we outline below some of its pros and cons.
Pros:
- We contend that some bias is advantageous, as we 'want' people to move in a certain direction, rather than completely randomly on each trial, to increase efficiency. However, we also avoid forcing people in a particular direction, since on each iteration we introduce 4 novel samples uncorrelated with previous selections.
- GAs aim to mimic natural selection, so they are biased in a way that mimics that process.
- The starting point (first iteration) can introduce bias, but this is also randomly determined. We also show that even if we obtain local solutions, people end up roughly in the same spot, which is a good thing (people cluster).
- Randomness is also introduced via the mutations, which give people the chance to branch away from initial choices and reduce bias.

Predicting emotion category of new evolved expressions.
We used machine learning (Support Vector Machines) to test whether GA-evolved blendshape weight vectors could be used to reliably predict the emotion category created by participants, and whether predictions generalized to GA stimuli evolved by different groups of participants. A Support Vector Machine (SVM) classifier was trained to discriminate emotion category based on the blendshape weights of faces in a randomly sampled subset of 219 participants. The SVM model was trained on each participant's 5 final preferred expressions (i.e. the participant's preferred expressions selected in GA iterations 7 through 11). SVM parameters were optimized in the training set through 5-fold cross-validation, converging on a non-linear radial basis function (RBF) kernel, C = 30 (penalty parameter of the error term), and gamma = .01 (inverse of the standard deviation of the RBF). The SVM model was subsequently tested by labelling expressions using weight vectors evolved by a separate group of participants (N = 74), and correctly identified the emotion type in 86% of cases (binomial test p = 1.4e-37). The classification report below summarizes the performance of the classifier and normalized rates of classification.

Comparison of GA expressions evolved through online platforms (Online) and expressions evolved in a controlled laboratory environment (Lab). In order to control for stimulus presentation conditions, we collected additional GA data in a controlled laboratory setting. Participants (N = 43) evolved the happy, sad, fear and angry expressions using the same laptop in the Lab. We compared expressions between the Online and Lab groups by means of two machine learning approaches. We first tested whether an SVM classifier trained with data collected online could recognize expressions evolved by participants in the lab. The classifier showed comparable overall performance in the classification of emotions evolved by the two groups: 86% classification accuracy for the Online group (as shown in "Predicting emotion category of new evolved expressions") vs. 87% correct classification for the Lab group (see classification report below). The second approach consisted of testing whether an SVM classifier could determine whether evolved expressions came from the Lab or Online group. We supplied the classifier with equal numbers of expressions from the two groups (randomly sampling 43 participants from the Online group). The classifier exhibited chance-level performance (53% classification accuracy, binomial test p = .62, see classification report below), suggesting that expressions do not significantly differ between the two groups.

Fig. 2 (continued): Area of the blue curve above these noise thresholds identifies the % of participant expression pairings whose differences exceed variability explained by GA stochastic noise.
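A minimal sketch of the SVM pipeline described in "Predicting emotion category of new evolved expressions", assuming a scikit-learn implementation (the library choice, grid values and placeholder data are ours; the reported optimum was an RBF kernel with C = 30 and gamma = .01):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Placeholder data so the sketch runs: 219 training participants x 5 final
# preferred expressions each, plus a held-out group of 74 participants.
rng = np.random.default_rng(0)
N_BS = 125  # assumed blendshape dimensionality
X_train, y_train = rng.random((219 * 5, N_BS)), rng.integers(0, 4, 219 * 5)
X_test, y_test = rng.random((74 * 5, N_BS)), rng.integers(0, 4, 74 * 5)

# 5-fold cross-validated parameter search over an assumed grid.
param_grid = {"C": [1, 10, 30, 100], "gamma": [0.001, 0.01, 0.1]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X_train, y_train)

# Generalization test on expressions evolved by the held-out group
# (reported accuracy on the real data: 86%).
accuracy = search.score(X_test, y_test)
```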

Comparison of emotion identification performance between GA and KDEF stimuli
Participants (N = 60) who had not previously evolved GA expressions labelled (happy, fear, angry or sad) expressions belonging to either the GA stimulus set or the Karolinska Directed Emotional Faces (KDEF) database (3) (Fig. S6). Rates of correct identification were submitted to a 2x4 repeated measures ANOVA, with factors Stimulus (GA / KDEF) and Emotion (happy / fear / angry / sad). Mauchly's test indicated that the assumption of sphericity was violated for Emotion (χ2 = 13.13, p = .02) and for the Stimulus x Emotion interaction (χ2 = 36.13, p < .001); degrees of freedom were therefore corrected using Huynh-Feldt estimates of sphericity (ε = .92 and ε = .8, respectively). We found a main effect of Stimulus (F(1,59) = …).

Fig. S12. Expression features that mostly contribute to Happy / Angry expression categories (i.e. are most activated in expressions belonging to these categories), ranked by blendshape weight. Blendshape weight can be thought of as the contraction of a muscle group, ranging from 0 (fully relaxed) to 1 (fully contracted). Ranking blendshapes based on activation permits us to determine which set of action units mostly contributes to a specific expression. However, the rank of these activations should not be strictly interpreted as an "order of importance": two action units might be systematically present in a given expression, and one might be more pronounced than the other, but both could still contribute significantly to the expression. Each plot depicts the blendshape name, FACS code and Action Unit (AU) of the first 10 blendshapes (x-axis, stacked), ranked based on average blendshape weight value (y-axis).
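The ranking used in these plots reduces to sorting blendshapes by their mean activation within an emotion category; a sketch (names and array shapes are placeholders):

```python
import numpy as np

def top_blendshapes(weights, names, k=10):
    """Rank blendshapes by average activation across all evolved
    expressions of one emotion category.
    weights: (n_expressions, n_blendshapes) array of weights in [0, 1];
    names: blendshape names (e.g. with FACS code / AU annotations)."""
    mean_w = weights.mean(axis=0)
    order = np.argsort(mean_w)[::-1][:k]
    return [(names[i], float(mean_w[i])) for i in order]
```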
Fig. S13. Expression features that mostly contribute to Fear / Sad expression categories (i.e. are most activated in expressions belonging to these categories), ranked by blendshape weight. Each plot depicts the blendshape name, FACS code and Action Unit (AU) of the first 10 blendshapes (x-axis, stacked), ranked based on average blendshape weight value (y-axis).

[…] blendshapes are shown at the bottom of each emotion section. Although inspired by the FACS system, blendshapes do not fully map onto AUs. Some AUs are not associated with a dedicated controller, but are expressed through a combination of blendshapes. This includes Lips Part (25) and Eyes Closed (43), which can be observed above for Happy and Sad expressions, respectively. While we lack dedicated controllers for these AUs, the 3D renders show that participants indeed evolved Happy expressions with lips apart (Fig. S8) and Sad expressions with eyes closed (Fig. S11). Despite the overlap in AUs across studies, comparisons also reveal variability in expressions based on stimuli, methods and tested populations. For instance, we found that Mouth Stretcher (27) (15), which is typically reported for Sadness, was also highly activated in participants' Fear expressions (and not observed in other reviewed studies), highlighting the overlap between these two categories. This, together with the variability that can also be observed amongst the reviewed studies, highlights the variability of expression beyond the core emotion descriptors identified in the Ekman & Friesen classification.
Fig. S19. Responses pooled across all participants and all emotion combinations, fit with a cumulative Gaussian function. Participants were randomly paired, with each pair member contributing one continuum of stimuli connecting his/her GA-evolved expressions of two negative emotions (e.g. stimuli drawn from Fear-Angry continua; note, however, that the data depict responses pooled across Fear-Angry, Fear-Sad and Angry-Sad emotion combinations). Each pair member then performed an emotion classification task on stimuli drawn from either the continuum connecting his/her own GA-evolved expressions ("Participant") or the continuum connecting the other member's GA-evolved expressions ("Control"). Separate groups of participants were tested on different emotion combinations. Plots show the change in participant responses (proportion of one emotion-type response, e.g. "Angry") as a function of stimulus level along the "Participant" and "Control" continua. Plots show a steeper change in classification responses when participants are presented with stimuli drawn from the continuum connecting the two preferred facial expressions they had previously evolved.
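A sketch of the cumulative Gaussian fit described in the caption, assuming a SciPy least-squares implementation (stimulus levels and response proportions are illustrative placeholders):

```python
import numpy as np
from scipy.optimize import curve_fit
from scipy.stats import norm

def cum_gauss(x, mu, sigma):
    """Cumulative Gaussian psychometric function."""
    return norm.cdf(x, loc=mu, scale=sigma)

# Illustrative data: stimulus level along a morph continuum (0 = one
# emotion endpoint, 1 = the other) and the pooled proportion of one
# emotion-type response (e.g. "Angry") at each level.
x = np.linspace(0.0, 1.0, 9)
rng = np.random.default_rng(2)
p = np.clip(cum_gauss(x, 0.5, 0.15) + rng.normal(0.0, 0.03, x.size), 0.0, 1.0)

(mu, sigma), _ = curve_fit(cum_gauss, x, p, p0=[0.5, 0.2])
# A smaller fitted sigma corresponds to a steeper curve, i.e. a sharper
# category boundary, as reported for the "Participant" continua.
```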